
Self-Publishing Tools for converting an Ebook to a Paperback

I self-published The Non-Technical Guide to Web Technologies as an ebook via the Amazon Kindle store about half a year ago. After many requests for paperback copies, I decided, why not try self-publishing a paperback as well? I looked around at various self-publishing tools but settled on Amazon's createspace.com.

I already had a cover from the ebook, but I still needed a spine and back cover in the right format. I used fiverr to get a few versions. Starting prices at fiverr are $5, but you'll generally spend more for "extra" services like getting the cover in PSD format.

Createspace was very easy to use: just type in your information and upload the files. They were able to create and ship the proofs quickly.

After I approved the proofs, the book was in the Amazon store within a few days. It is also automatically linked with the Kindle edition.

Book cover for The Non-Technical Guide to Web Technologies

I'm very impressed by today's self-publishing tools. I highly recommend the fiverr.com/createspace.com combination for converting your ebook to a paperback.

Data Mining Wikipedia Notes

I spent quite a bit of time mining Wikipedia while at Qwiki. Our original product was a visual search engine where we used Wikipedia as a main data source to generate short videos from each Wikipedia article. We had to extensively parse the wikitext and associated media from Wikipedia to generate these videos. Here’s an example of our technology being used by Bing:

Qwiki being used in Bing search results.

A Qwiki video of New York City

As a first step to mining Wikipedia for your project, I recommend having a goal in mind. Don't go down the rabbit hole if you only need to take a peek. Wikipedia is almost 100% open and you can see into its inner workings very easily, but it is easy to get lost in everything, especially the datasets. Most of the time, you will only need a subset of the data for your goal.

For example, if you just want to grab the internal article links, you can download the pagelinks MySQL table dump and avoid parsing every article.
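If you want to read that dump directly instead of loading it into MySQL, a small streaming pass is enough. Below is a minimal Scala sketch; the regex is a simplification of the real INSERT syntax and the column layout can vary between dump versions, so treat it as a starting point.

import java.io.FileInputStream
import java.util.zip.GZIPInputStream
import scala.io.Source

// Rough sketch: pull (pl_from, pl_namespace, pl_title) tuples out of the pagelinks
// SQL dump without importing it into MySQL. The regex is a simplification of the
// real INSERT syntax and skips some escaping corner cases.
object PagelinksSketch {
  // one value tuple inside an INSERT INTO `pagelinks` VALUES (...),(...); statement
  private val Row = raw"\((\d+),(\d+),'((?:[^'\\]|\\.)*)'\)".r

  def main(args: Array[String]): Unit = {
    val path = args.headOption.getOrElse("enwiki-latest-pagelinks.sql.gz")
    val source = Source.fromInputStream(
      new GZIPInputStream(new FileInputStream(path)), "UTF-8")
    try {
      for {
        line <- source.getLines() if line.startsWith("INSERT INTO")
        m <- Row.findAllMatchIn(line)
      } {
        val (fromPageId, namespace, title) = (m.group(1), m.group(2), m.group(3))
        if (namespace == "0") println(s"$fromPageId\t$title") // links into article space only
      }
    } finally source.close()
  }
}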

There are a few categories I’ll cover:

  • Data Collection
  • Data Filtering
  • Data Extraction

Data Collection

Data Dumps

Wikipedia data comes in two sets: XML dumps and MySQL dumps. The article and revision text are in the XML dumps, and the remaining supplementary data (image metadata, imagelinks, etc.) comes in the MySQL dumps.

Both can be found at http://dumps.wikimedia.org/enwiki/latest/

You can download the entire article set as enwiki-latest-pages-articles.xml or in partitioned chunks, enwiki-latest-pages-articlesX.xml.bz2, where X is 1 to 32.

There is also a corresponding RSS feed for each dump file. If you need to be notified when a new Wikipedia dump is available, you can write a script to monitor this RSS URL. I've noticed dumps are commonly released around the first week of each month.
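For example, here is a minimal Scala sketch of such a monitor. The per-file RSS URL pattern below is an assumption based on the latest/ directory listing, so verify it before wiring this into cron.

import scala.io.Source

// Fetch the per-file RSS feed and print its <pubDate> so a cron job can diff it
// against the last value it saw. The feed URL pattern is an assumption.
object DumpWatcher {
  val FeedUrl =
    "http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2-rss.xml"

  def main(args: Array[String]): Unit = {
    val rss = Source.fromURL(FeedUrl, "UTF-8").mkString
    // crude extraction; a real monitor would use an RSS/XML parser
    val pubDate = "<pubDate>(.+?)</pubDate>".r
      .findFirstMatchIn(rss)
      .map(_.group(1))
      .getOrElse("unknown")
    println(s"latest dump published: $pubDate")
  }
}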

Besides Wikipedia itself, there are related datasets derived from Wikipedia. The two most popular ones are Google's Freebase and the open-source DBpedia. Both projects have the primary goal of structuring the data, by which I mean associating a fact with its label, e.g. "Bill Gates' date of birth: October 28, 1955".

Freebase comes in an easier-to-parse format and has an API. They also release WEX-formatted article text every two weeks; WEX is an XML-formatted version of wikitext. Unfortunately, the WEX parser has not seen many updates lately and reportedly doesn't always parse correctly.

Beware of relying on the structured data from Freebase. Some of it is good, but some of it is very out of date or missing compared to the corresponding Wikipedia article. In particular, I noticed their population statistics are out of date, and there is less coverage of location data than in Wikipedia.

DBpedia is an open source project written primarily in Scala. Its main goal is to extract structured data from the infoboxes. It has a DBpedia Live system where Wikipedia articles are automatically propagated into DBpedia, so DBpedia has near-real-time updates from Wikipedia. You can query for structured data using SPARQL, which is a little more difficult than Freebase's JSON-based query language, MQL. There are also supporting projects like DBpedia Spotlight, which is designed to extract Wikipedia-derived entities from a body of text.

I would recommend seeing if either of these projects can solve your problem before trying to mine Wikipedia yourself.

Even if you do decide to mine Wikipedia yourself, be sure to use or read the DBpedia parser. It contains a lot of information about infobox normalization, which can help in this area. Consider this page, which shows a mapping of the settlement infobox; it can help you think through the heuristics you need for your own parser.

Data Filtering

Wikipedia has 11 namespaces. If you just want the articles, you can filter for pages with namespace key='0'. This can greatly reduce the number of pages you need to process.
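Here is a sketch of that filter in Scala using the JDK's streaming XML reader, so the full dump never has to fit in memory. The element names match the current pages-articles schema, but treat this as a starting point rather than a complete reader.

import java.io.FileInputStream
import javax.xml.stream.{XMLInputFactory, XMLStreamConstants}

// Stream the decompressed pages-articles dump and print the titles of namespace-0
// pages (real articles). Assumes each <page> element has <title> and <ns> children.
object ArticleFilter {
  def main(args: Array[String]): Unit = {
    val path = args.headOption.getOrElse("enwiki-latest-pages-articles.xml")
    val reader = XMLInputFactory.newInstance()
      .createXMLStreamReader(new FileInputStream(path))

    var title: String = null
    var ns: String = null
    while (reader.hasNext) {
      val event = reader.next()
      if (event == XMLStreamConstants.START_ELEMENT) {
        reader.getLocalName match {
          case "page"  => title = null; ns = null
          case "title" => if (title == null) title = reader.getElementText
          case "ns"    => ns = reader.getElementText
          case _       =>
        }
      } else if (event == XMLStreamConstants.END_ELEMENT && reader.getLocalName == "page") {
        if (ns == "0") println(title) // keep only main-namespace articles
      }
    }
    reader.close()
  }
}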

Wikipedia also releases just the abstract text in a separate data dump. This is considerably easier to work with if you just need the first few paragraphs of each article.

Data Extraction

The raw Wikipedia data dump comes as XML with the main body text in wikitext. It would have been great if Wikipedia had continued to release their article dataset in HTML format, but this no longer happens.

Wikitext syntax is very hard to parse, and no third-party software has been able to match the PHP MediaWiki output 100%. Don't be too concerned, though, because in most cases you won't need to be 100% perfect.

You will also need to deal with the surrounding metadata stored in the MySQL dumps. Take a look at this guide to the MediaWiki architecture; it will help you decide which data you need.

Parsers

If you are using a JVM language like Java or Scala, I highly recommend Sweble. Beyond doing a good job of parsing wikitext, it is a well-designed package that is easy to customize and build upon.

Wikimedia is working on a new parser called Parsoid, written in Node.js and C++. It is planned to be nearly compatible with the PHP wikitext parser. It was not as complete when I started mining wikitext, so I don't have experience with it.

What is so problematic about the wikitext format? There are many edge cases, and only the original PHP parser has been able to render wikitext to HTML correctly. The other big problem is template expansion, which we'll cover in the next section.

While you can spider the Wikipedia website itself for text, Wikimedia recommends you use their data dumps to avoid overloading their servers unnecessarily. I would go as far as cloning Wikipedia and running a mirror to get the evaluated HTML. The biggest problem I've encountered with cloning Wikipedia is that each article can take a long time to render; Wikipedia's production site is heavily cached, so reading a page from wikipedia.org will be even faster than rendering a local copy.

Templates/Macros

Beyond the syntactic parsing, the biggest challenge will be handling template expansion. Wikitext has macros called templates, which are essentially a language in themselves. Templates in wikitext are embedded in {{template_code}}. These can be simple replacement templates or more complex ones with conditionals, loops, and references to other remote data sources.

Sweble has a template expansion system, but I found it didn't work on some edge cases. I resorted to modifying Sweble to call out to the Wikipedia API to expand certain templates.
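If you take that route, the API call itself is simple. Below is a rough Scala sketch against the public api.php endpoint; the prop and format parameters are what the current expandtemplates documentation describes, so double-check them against the API docs, and cache results aggressively so you don't hammer Wikipedia's servers.

import java.net.URLEncoder
import scala.io.Source

// Expand a piece of wikitext through the MediaWiki API (action=expandtemplates).
// The JSON response is returned as a raw string here; a real pipeline would parse
// it and cache the expansions locally.
object TemplateExpander {
  def expand(wikitext: String): String = {
    val encoded = URLEncoder.encode(wikitext, "UTF-8")
    val url = "https://en.wikipedia.org/w/api.php" +
      s"?action=expandtemplates&format=json&prop=wikitext&text=$encoded"
    Source.fromURL(url, "UTF-8").mkString
  }

  def main(args: Array[String]): Unit = {
    println(expand("{{convert|1|mi|km}}")) // a common template that needs expansion
  }
}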

Database

Beyond just the article text, you may want information about the images and link stats. You'll need to import the MySQL tables to read this information. Here's a diagram of the MediaWiki database schema.

I highly recommend turning off indexes while importing the tables and turning them back on once everything is imported.
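As a sketch of what that can look like over JDBC (the connection settings and table list below are placeholders, and ALTER TABLE ... DISABLE KEYS only has an effect on MyISAM tables, so check the CREATE TABLE statements in the dump you are importing):

import java.sql.DriverManager

// Drop the secondary indexes before a bulk import and rebuild them afterwards.
// Assumes the MySQL JDBC driver is on the classpath; the URL, credentials, and
// table names are placeholders.
object ImportIndexToggle {
  val Tables = Seq("pagelinks", "imagelinks", "image")

  def main(args: Array[String]): Unit = {
    val conn = DriverManager.getConnection("jdbc:mysql://localhost/enwiki", "wiki", "secret")
    val stmt = conn.createStatement()
    try {
      Tables.foreach(t => stmt.execute(s"ALTER TABLE $t DISABLE KEYS"))
      // ... run the mysql import of the table dumps here ...
      Tables.foreach(t => stmt.execute(s"ALTER TABLE $t ENABLE KEYS"))
    } finally {
      stmt.close()
      conn.close()
    }
  }
}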

Large Scale Processing

You need to determine how much of the wiki dataset you are processing. If you can get away with iterating through the XML dump on a single machine, I highly recommend that approach.

For my purposes, I had to run through a few iterations over the XML dump. For some iterations I was able to get away with running on one machine, but for others I had to parallelize across multiple machines.

I used Hadoop to perform the parallel processing. Hadoop does have a built-in XML splitter, and Mahout also comes with a Wikipedia iterator, but I found both of these non-intuitive and incorrect in some cases. I resorted to a system where we collapsed each wiki article XML entry onto a single line. Hadoop makes it very easy to process datasets with one entry per line in parallel.
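The collapsing pass itself is a simple preprocessing step. Here is a sketch in Scala; it relies on the <page> and </page> tags sitting on their own lines, which holds for the official dumps, and escapes embedded newlines so the mapper can restore the original XML.

import java.io.PrintWriter
import scala.io.Source

// Collapse each <page>...</page> block of the XML dump onto a single line so
// Hadoop's default TextInputFormat can split the file by record. Backslashes and
// newlines are escaped so the mapper can reverse the transformation.
object OnePagePerLine {
  def main(args: Array[String]): Unit = {
    val in = Source.fromFile(args(0), "UTF-8")
    val out = new PrintWriter(args(1), "UTF-8")
    val buffer = new StringBuilder
    var inPage = false

    try {
      for (line <- in.getLines()) {
        val trimmed = line.trim
        if (trimmed.startsWith("<page")) { inPage = true; buffer.clear() }
        if (inPage) buffer.append(line.replace("\\", "\\\\")).append("\\n")
        if (trimmed.startsWith("</page")) {
          inPage = false
          out.println(buffer.toString)
        }
      }
    } finally {
      in.close()
      out.close()
    }
  }
}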

Image Processing

Images in Wikipedia articles come from two sources: Wikipedia itself and Wikimedia Commons (commonswiki).

If you are processing images from Wikipedia or Wikimedia Commons, be aware that some images are really large, and this can kill your image processing application.

Look at this image's metadata: a 26280×19877, 98 MB image.


{"height": 19877, "width": 26280,
"source": "http://en.wikipedia.org/wiki/File:El_sueño_de_Jacob,_by_José_de_Ribera,_from_Prado_in_Google_Earth.jpg",
"url": "http://upload.wikimedia.org/wikipedia/commons/8/85/El_sue%C3%B1o_de_Jacob%2C_by_Jos%C3%A9_de_Ribera%2C_from_Prado_in_Google_Earth.jpg" }

The metadata for each image can be found in the MySQL database dump table named image. Unfortunately, the description field is truncated, and you'll need to join it with the page and revision tables to get the whole description. The description is also in wikitext format, so you'll need to run it through a wikitext parser.
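Before fetching or decoding anything, it is cheap to screen out such oversized files using the width and height already present in this metadata. A minimal sketch, with an arbitrary example threshold:

// Skip images whose pixel count would blow up an image-processing pipeline.
// The threshold is an arbitrary example; tune it to your workers' memory budget.
case class ImageMeta(name: String, width: Int, height: Int)

object ImageSizeGuard {
  val MaxPixels = 40L * 1000 * 1000 // ~40 megapixels

  def isProcessable(meta: ImageMeta): Boolean =
    meta.width.toLong * meta.height.toLong <= MaxPixels

  def main(args: Array[String]): Unit = {
    val huge = ImageMeta("El_sueno_de_Jacob.jpg", 26280, 19877) // the example above
    println(isProcessable(huge)) // false: over 500 megapixels
  }
}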

Other Resources

Below are some resources I tried briefly:

  • Bliki engine: a Java Wikipedia parser.
  • DBpedia's extraction_framework: can be used to extract infoboxes.
  • GWT Wiki Dump support.
  • Mahout's Wikipedia Dump Splitter.

Happy Wiki Mining!

Hope this article helps you get started with Wikipedia mining!

Rolling Restarts with Capistrano and Haproxy for Java Web Service Apps

Java web apps can be efficient because they are multithreaded, so you only need to run one copy of the process to serve multiple concurrent requests.
This is in contrast to Ruby apps, where you often need multiple processes to serve multiple requests. Using one process instead of ~8 will save you a lot of memory on the system.

The downside of a single process is dealing with rolling restarts. Ruby app servers like Unicorn run multiple processes and thus can be set up to provide rolling restarts.

If you are using a web container such as Tomcat 7, it can support hot reload in place.

But let's assume your JVM web app is run with a single command (e.g. java -jar backend-1.0.jar &). The idea of this setup is that it generalizes to any single-process web service.

To get rolling restarts out of this setup, we can use capistrano with haproxy.

We want to:

* start two different servers with one process each (or use two processes on one server, though this won't provide failover)
* use haproxy as a load balancer in front of these servers

In your Java web service app, add a health check endpoint (/haproxy-bdh3t.txt) and have it serve an empty text file.

(It's important to use a random string as your endpoint if you are running in the public cloud, since the load balancer could be referencing an old server address and haproxy could think a server is up when it isn't.)

In your haproxy.cfg, add

option httpchk HEAD /haproxy-bdh3t.txt HTTP/1.0

as your check condition to the backend services.
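For context, the relevant backend section of haproxy.cfg might look roughly like this; the backend name, port, and check timings below are assumptions, and only the httpchk line comes from this setup:

backend java_backend
    option httpchk HEAD /haproxy-bdh3t.txt HTTP/1.0
    server app1 XXX1:8080 check inter 2000 rise 2 fall 3
    server app2 XXX2:8080 check inter 2000 rise 2 fall 3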

In your capistrano script, let's add two servers with the app role:

server "XXX1", :app
server "XXX2", :app

and alter the restart task to:

* remove the check file for one server. This will remove the server from the load balancer
* restart server.
* ping the server.
* add the check file back to the started server which haproxy will add back into the load balancer.
* repeat as a loop for each server.


desc "Restart"
task :restart, :roles => :app do
  haproxy_health_file = "#{current_path}/path-static-files-dir/haproxy-bdh3t.txt"

  # Restart each host serially
  self.roles[:app].each do |host|
    # take out the app from the load balancer
    run "rm #{haproxy_health_file}", :hosts => host
    # let existing connections finish
    sleep(5)

    # restart the app using upstart
    run "sudo restart backend_server", :hosts => host

    # give it some time to start up
    sleep(5)

    # add the check file back to readd the server to the load balancer
    run "touch #{haproxy_health_file}", :hosts => host
  end
end

Installing LAME on Amazon Elastic Map Reduce (EMR)

Amazon Elastic MapReduce instances do not have the debian-multimedia sources by default. You can add the snippet below to a bootstrap script to install LAME:


sudo sh -c "cat >> /etc/apt/sources.list << EOF
deb http://www.debian-multimedia.org squeeze main non-free
deb http://www.debian-multimedia.org testing main non-free
EOF"

gpg --keyserver hkp://pgpkeys.mit.edu --recv-keys 07DC563D1F41B907
gpg --armor --export 07DC563D1F41B907 | sudo apt-key add -

sudo apt-get update
sudo apt-get -y --force-yes install lame libmp3lame-dev faad

Debugging ActiveMQ Authentication Configuration

I recently had to add basic authentication to an ActiveMQ broker and ran into some unexpected issues. I followed the example in ActiveMQ in Action and used the simpleAuthenticationPlugin, which is configured with a snippet in activemq.xml.
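A minimal version of that block looks roughly like the following; the usernames, passwords, and groups here are placeholders, and the complete file I used is in the gist linked below:

<plugins>
  <simpleAuthenticationPlugin>
    <users>
      <authenticationUser username="admin" password="adminPassword" groups="admins,everyone"/>
      <authenticationUser username="consumer" password="consumerPassword" groups="consumers,everyone"/>
    </users>
  </simpleAuthenticationPlugin>
</plugins>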

You can see the complete file at https://gist.github.com/881965

When I tried to start ActiveMQ with
ACTIVEMQ_HOME/bin/activemq start xbean:file:conf/activemq.xml
ActiveMQ wouldn’t start!

I checked the activemq.log and only saw one line:
2011-03-22 13:21:02,799 | INFO | Refreshing org.apache.activemq.xbean.XBeanBrokerFactory$1@23abcc03: startup date [Tue Mar 22 13:21:02 PDT 2011]; root of context hierarchy | org.apache.activemq.xbean.XBeanBrokerFactory$1 | main

Not very helpful…
If I comment out the simpleAuthenticationPlugin tags, ActiveMQ starts up correctly.

Hmm, what’s wrong?

Diving into the ActiveMQ in Action book further, I noticed I can start up ActiveMQ via
ACTIVEMQ_HOME/bin/activemq console xbean:file:conf/activemq.xml

This gives us a much more verbose log output.
I now see the problem is due to:
Caused by: org.xml.sax.SAXParseException: cvc-complex-type.2.4.a: Invalid content was found starting with element 'plugins'.

Googling the error brings me to this page: http://activemq.apache.org/xml-reference.html

The problem?

In ActiveMQ 5.4 and later, the XML elements inside the broker tag have to be ordered alphabetically!

I moved the plugins tag before the systemUsage tag, and ActiveMQ was able to start up correctly with authentication.

Database Migrations

I'm working on a project where I need to create and manage database tables. I find the ActiveRecord migration system from Ruby on Rails to be the best system for creating and versioning database changes. The project itself is in Scala, so I took a look at scala migrations and c5-db-migration, but I found ActiveRecord migrations to be better documented, better supported, and more concise.

I recently created a template Rails 3 project where I kept only the files necessary for generating migrations. Check it out at https://github.com/tc/database_migrations

To get started, set up your database connection information in config/database.yml.

Create a database:

rake db:create

Create a new migration:

rails g migration create_users

This will create a file in db/migrate.
Add columns:

class CreateUsers < ActiveRecord::Migration
  def self.up
    create_table :users do |t|
      t.string :name
    end
  end

  def self.down
    drop_table :users
  end
end

Perform the migration:

rake db:migrate

Now you have a working database which can be managed by this application.

You can rollback to a previous version:

rake db:migrate VERSION=XXXX

The versioning of the database is managed in the database’s schema_migrations table.

How to develop with Factorie, a probabilistic modeling toolkit written in Scala

Factorie is a toolkit for probabilistic modeling. It is scalable and flexible, and it allows you to create factor graphs and perform inference. It is written by Andrew McCallum and his research group at UMass; they previously wrote Mallet, the Java package for text mining. Being written in Scala makes the code very succinct and clear. You can learn more from its Google Code project page. Prompted by setup questions on the mailing list, I decided to write a quick guide to using Factorie on Mac OS X.

Start with cloning the source code:
hg clone https://factorie.googlecode.com/hg/ factorie

Factorie uses Maven to manage its build and dependencies. Maven is an open source tool from Apache, so download it and learn a little about how it works.
Compile the code into a jar:
cd factorie
mvn install

This will create a factorie jar file at target/factorie*.jar and install it into your local Maven repo (most likely ~/.m2).

Now that you have the factorie jar in your Maven repo, clone a sample factorie project:
git clone git@github.com:tc/factorie-example.git

In the factorie-example directory, you’ll notice a pom.xml. Inside the file, you’ll see:

<dependency>
  <groupId>cc.factorie</groupId>
  <artifactId>factorie</artifactId>
  <version>0.9.1-SNAPSHOT</version>
</dependency>

You may have to change the version to match the updated factorie jar version.

This sample project has two files:
src/main/scala/factorie/LDAExample.scala
src/main/scala/factorie/LDAExampleTest.scala

It’s good practice to have a unit test for your code. In this case, the unit test is trivial as it just runs the scala class, but ideally you have some type of assert you want to perform.

Compile and run it using:
mvn test

You should see an output of listed topics:
.....Iteration 20
alpha = 0.7897046248435047 1.083615490115922 1.3493645028398928 0.8200016581775706 1.1115652808961307 1.7646165019929618 2.0397615066201604 1.8106654639049065 1.75947360834168 1.2876416992683335
Topic 0 china science achieved contacts powerful power urbana find faster projects
Topic 1 computing performance list chinese gropp problems smith years today tennessee
Topic 2 service stephen full messages members data twitter contacts services plenty
Topic 3 world erica announcement computer floating components center states couldn fastest
Topic 4 jaguar point ogg year high benchmark speed supercomputers community time
Topic 5 itunes features print mac kessler suggests back challenge tomorrow tuesday
Topic 6 mail facebook google shankland people address big reach ability aol
Topic 7 system news university tianhe top topher national called supercomputing font
Topic 8 apple share supercomputer systems operations machine company based released place
Topic 9 gmail cnet software digg supercomputing social comments technology expected linpack

Be sure to look into the src/main/scala/cc/factorie/example directory in the factorie source code for more examples.

If you use an IDE like Eclipse or IntelliJ IDEA, type:
mvn eclipse:eclipse
or
mvn idea:idea
to generate the IDE project files.

Alternatively, you can just edit using Vim or Emacs and compile with Maven.

RMongo: Accessing MongoDB in R

I recently created RMongo, a database access layer to MongoDB in R, released as an R package.

To install RMongo:

install.packages("RMongo")

If that does not work, try downloading it from http://cran.r-project.org/web/packages/RMongo/index.html and run:

install.packages("~/Downloads/RMongo_XX.XX.XX.tar.gz", repos=NULL, type="source")

I tried to mimic the RMySQL commands in RMongo. Below are some example commands.

library(RMongo)

#ask for help
?RMongo

#connect to a database (the database and collection names below are examples)
mongo <- mongoDbConnect("test")

#query a collection with a JSON query
results <- dbGetQuery(mongo, "nutrients", "{}")
> names(results)
[1] "X_id" "name" "nutrient_definition_id" "description"
> results
X_id name nutrient_definition_id
1 4cd0f8e31e627d4e6600000e Adjusted Protein 257
2 4cd0f9061e627d4e6600001a Sodium 307

#query with a filter
> results <- dbGetQuery(mongo, "nutrients", '{"name": "Sodium"}')
> results
X_id name nutrient_definition_id
1 4cd0f9061e627d4e6600001a Sodium 307

> dbDisconnect(mongo)

RMongo is very alpha at this point. I built it as a quick way to prototype algorithms in R with data from MongoDB. RMongo mostly uses the mongo-java-driver to perform JSON-formatted queries, and the R code in the package uses rJava to communicate with the mongo-java-driver.

Please report any bugs or necessary improvements. Or better yet, send in pull requests via the RMongo github project page!

How to run Background Processes using Resque/Redis in a Ruby on Rails App

When you have a long-running block of code, you don't want to run it inside a web application request cycle. A background processing queuing system is a good solution, and there are a number of open source queuing systems available (delayed_job, beanstalkd, etc.) so you don't need to write your own! This article will go over how to set up the Resque queuing system in a Ruby on Rails application.

Resque setup:

Install redis. On a Mac:
brew install redis
or download it from http://code.google.com/p/redis/

Add resque to your gemfile:
gem "resque"

Install the new gem:
bundle install

Create a redis config file called redis.yml in config:
defaults: &defaults
  host: localhost
  port: 6379

development:
  <<: *defaults

test:
  <<: *defaults

staging:
  <<: *defaults

production:
  <<: *defaults

Add an initializer file called resque.rb in config/initializers:
Dir[File.join(Rails.root, 'app', 'jobs', '*.rb')].each { |file| require file }

config = YAML::load(File.open("#{Rails.root}/config/redis.yml"))[Rails.env]
Resque.redis = Redis.new(:host => config['host'], :port => config['port'])

Add resque.rake to lib/tasks
require 'resque/tasks'
task "resque:setup" => :environment

Running Resque:

start redis:
redis-server

start resque
COUNT=5 QUEUE=* rake resque:workers

see web UI:
resque-web

How to add resque jobs:

Create a job class
class NewsCollectionJob
  @queue = :news_collection_job

  def self.perform(start_date, end_date)
    puts "from #{start_date} to #{end_date}"
    # TODO: your long running process here
  end
end

Run it using:
Resque.enqueue(NewsCollectionJob, start_date, end_date)

This call does not block, so you can embed it in a model. There you go! A few simple steps to a faster-performing Ruby application using background processing with Resque/Redis.