Data Mining Wikipedia Notes

I spent quite a bit of time mining Wikipedia while at Qwiki. Our original product was a visual search engine where we used Wikipedia as a main data source to generate short videos from each Wikipedia article. We had to extensively parse the wikitext and associated media from Wikipedia to generate these videos. Here’s an example of our technology being used by Bing:

Qwiki being used in Bing search results.

A Qwiki video of New York City

As a first step to mining Wikipedia for your project, I recommend having a goal in mind. Don’t go down the rabbit hole if you only need to take a peek. Wikipedia is almost 100% open and you can see into its inner workings very easily, but it is easy to get lost in everything, especially the datasets. Most of the time, you will only need a subset of the data for your goal.

For example, if you just want the internal article links, you can download the pagelinks MySql table dump and avoid parsing every article.
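
For instance, here is a rough Scala sketch of pulling link tuples out of the pagelinks dump. The file name and the exact column layout of the INSERT statements are assumptions; check the table schema that ships with the dump before relying on it:

import java.io.{BufferedReader, FileInputStream, InputStreamReader}
import java.util.zip.GZIPInputStream

object PagelinksSketch {
  // Matches the leading columns of one value tuple, e.g. (12,0,'Anarchism', ...
  // The column order (pl_from, pl_namespace, pl_title) is an assumption about the dump schema.
  private val LinkTuple = """\((\d+),(-?\d+),'((?:[^'\\]|\\.)*)'""".r

  def main(args: Array[String]): Unit = {
    val reader = new BufferedReader(new InputStreamReader(
      new GZIPInputStream(new FileInputStream("enwiki-latest-pagelinks.sql.gz")), "UTF-8"))
    Iterator.continually(reader.readLine()).takeWhile(_ != null)
      .filter(_.startsWith("INSERT INTO"))
      .foreach { line =>
        for (m <- LinkTuple.findAllMatchIn(line) if m.group(2) == "0")
          println(s"${m.group(1)} -> ${m.group(3)}") // page id -> linked article title
      }
    reader.close()
  }
}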

There are a few categories I’ll cover:

  • Data Collection
  • Data Filtering
  • Data Extraction

Data Collection

Data Dumps

Wikipedia data comes in two sets: XML Dumps and MySql dumps. The article and revision text are in the XML format and the remaining supplementary data comes in MySql dumps. This includes image metadata, imagelinks, etc.

Both can be found on http://dumps.wikimedia.org/enwiki/latest/

You can download the entire article set as enwiki-latest-pages-articles.xml or in partitioned chunks: enwiki-latest-pages-articlesX.xml.bz2, where X is 1 to 32.

There is also a corresponding RSS feed. If you need to be notified when a new Wikipedia dump is available, you can write a script to monitor this RSS URL. I’ve noticed the dumps are commonly released around the first week of each month.
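
A polling script can be tiny. Here is a hedged Scala sketch using scala.xml; the feed URL is my guess at the naming pattern, so verify it against the latest/ directory listing:

import java.net.URL
import scala.xml.XML

object DumpWatcher {
  // Assumed feed location -- check http://dumps.wikimedia.org/enwiki/latest/ for the real *-rss.xml name.
  val feedUrl = "http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2-rss.xml"

  def main(args: Array[String]): Unit = {
    val feed = XML.load(new URL(feedUrl))
    // The item's pubDate changes when a new dump of this file is published.
    val pubDate = (feed \\ "item" \ "pubDate").map(_.text.trim).headOption
    println(pubDate.getOrElse("no <pubDate> found"))
    // Persist this value somewhere and kick off a download when it differs from the last run.
  }
}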

Besides Wikipedia, there are related datasets which are derived from Wikipedia. The two most popular ones are Google’s Freebase and the open-source DBpedia. Both projects have the primary goal of structuring the data. By structuring the data, I’m referring to associating a fact with its label, e.g. “Bill Gates’ Date of Birth: October 28, 1955”.

Freebase comes in an easier-to-parse format and offers an API. They also release WEX-formatted article text every two weeks. WEX is an XML-formatted version of wikitext. Unfortunately, the WEX parser has not seen many updates lately and allegedly doesn’t parse correctly.

BEWARE of relying on the structured data from Freebase. Some of their data is good, but sometimes the data is badly out of date or missing compared to the corresponding Wikipedia article. In particular, I noticed their population statistics are out of date. There is also less coverage of location data than Wikipedia.

DBpedia is an open source project written primarily in Scala. Their main goal is to extract structured data from the infoboxes. They have a DBpedia Live system where Wikipedia article updates are automatically propagated into DBpedia, so DBpedia has near real-time updates from Wikipedia. You can query for structured data using SPARQL. Using SPARQL is a little more difficult than Freebase’s JSON-based query language, MQL. They also have supporting projects like DBpedia Spotlight, which is designed to extract Wikipedia-derived entities from a body of text.
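
For example, here is a minimal Scala sketch that hits DBpedia’s public SPARQL endpoint over HTTP and asks for a single fact. The resource and property URIs are assumptions to check against their ontology:

import java.net.URLEncoder
import scala.io.Source

object DbpediaSparqlSketch {
  def main(args: Array[String]): Unit = {
    // Ask for the populationTotal of New York City (URIs assumed from the DBpedia ontology).
    val query =
      """SELECT ?population WHERE {
        |  <http://dbpedia.org/resource/New_York_City>
        |    <http://dbpedia.org/ontology/populationTotal> ?population .
        |}""".stripMargin
    val url = "http://dbpedia.org/sparql?query=" + URLEncoder.encode(query, "UTF-8") +
      "&format=" + URLEncoder.encode("application/sparql-results+json", "UTF-8")
    println(Source.fromURL(url).mkString) // raw SPARQL JSON results
  }
}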

I would recommend seeing if either of these projects can solve your problem before trying to mine Wikipedia yourself.

Even if you do decide to mine Wikipedia yourself, be sure to use or at least read the dbpedia parser. They have a lot of information regarding infobox normalization which can help in this area. Consider this page, which shows a mapping of the settlement infobox. It can help you think through the heuristics you need for your parser.
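
To give a feel for what those heuristics look like, here is a deliberately naive Scala sketch that pulls key/value pairs out of an infobox. It splits on pipes, so nested templates and piped links will break it; a real parser (dbpedia’s or Sweble) handles far more cases, and the sample values below are made up:

object InfoboxSketch {
  // Naive field extraction: splits on '|' and '=', which breaks on nested templates
  // and piped wiki links. Good enough to see the shape of the problem, no more.
  def fields(infobox: String): Map[String, String] =
    infobox.stripPrefix("{{").stripSuffix("}}").split("\\|").toList.flatMap { part =>
      part.split("=", 2) match {
        case Array(key, value) => Some(key.trim -> value.trim)
        case _                 => None
      }
    }.toMap

  def main(args: Array[String]): Unit = {
    val sample = """{{Infobox settlement
      |name = Springfield
      |population_total = 30720
      }}"""
    println(fields(sample)) // Map(name -> Springfield, population_total -> 30720)
  }
}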

Data Filtering

Wikipedia has 11 namespaces. If you just want the articles, you can filter against namespace key='0'. This can greatly reduce the number of pages you need to process.
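
A streaming pass with the JDK’s StAX reader is enough for this kind of filtering. Recent dump schemas give each <page> its namespace in an <ns> element, which is what this sketch keys on (it assumes an already decompressed dump):

import java.io.FileInputStream
import javax.xml.stream.{XMLInputFactory, XMLStreamConstants}

object ArticleCounter {
  def main(args: Array[String]): Unit = {
    // Stream the dump and count pages in the main namespace (<ns>0</ns>).
    val reader = XMLInputFactory.newInstance()
      .createXMLStreamReader(new FileInputStream("enwiki-latest-pages-articles.xml"))
    var articles = 0
    while (reader.hasNext) {
      if (reader.next() == XMLStreamConstants.START_ELEMENT && reader.getLocalName == "ns")
        if (reader.getElementText.trim == "0") articles += 1
    }
    println(s"main-namespace pages: $articles")
  }
}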

Wikipedia also releases just the abstract text in a separate data dump. This is considerably easier to work with if you just need the first few paragraphs of each article.

Data Extraction

The raw Wikipedia data dump comes in XML with the main body text in wikitext. It would have been great if Wikipedia had continued to release their article dataset in HTML format, but this no longer happens.

Wikitext syntax is very hard to parse and no software has been able to 100% match the PHP mediawiki output. Don’t be too concerned though because in most cases, you won’t need to be 100% perfect.

You will also need to deal with the surrounding metadata stored in the MySql dumps. Take a look at this guide to the MediaWiki architecture. It’ll help you decide which data you need.

Parsers

If you are using a JVM language like Java or Scala, I highly recommend Sweble. Beyond just doing a good job with parsing wikitext, it is a well designed package and it is easy to customize and build upon.

Wikimedia is working on a new parser called Parsoid, written in node.js and C++. It is planned to be nearly compatible with the PHP wikitext parser. It was not as complete when I started mining wikitext, so I don’t have experience with it.

What is so problematic about the wikitext format? There are many edge cases and only the original PHP parser has been able to reproduce wikitext to HTML correctly. The other big problem is template expansion which we’ll cover in the next section.

While you can spider the Wikipedia website itself for text, Wikimedia recommends you use their data dumps to avoid overloading their servers unnecessarily. I would go as far as cloning Wikipedia and running a mirror to get the evaluated HTML. The biggest problem I’ve encountered with cloning Wikipedia is that each article can take a long time to render. Wikipedia’s production site is heavily cached, so reading a page from wikipedia.org will be even faster than rendering a local copy.

Templates/Macros

Beyond the syntactical parsing, the biggest challenge will be how to handle template expansion. Wikitext has macros called templates, which are essentially a language in themselves. Templates in wikitext are embedded in {{template_code}}. These can be simple replacement templates or more complex ones with conditionals, loops and references to other remote data sources.

Sweble had a template expansion system but I found it didn’t work on edge cases. I resorted to modifying Sweble to call out to the Wikipedia API to expand certain templates.
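
The API call itself is simple; the awkward part is deciding which templates to ship off. Here is a rough Scala sketch of the call. It just prints the raw JSON response, since the exact response shape varies between MediaWiki versions, and you should be gentle with request rates:

import java.net.URLEncoder
import scala.io.Source

object TemplateExpander {
  // Ask Wikipedia's own parser to expand a template via the expandtemplates API.
  def expand(wikitext: String): String = {
    val url = "https://en.wikipedia.org/w/api.php?action=expandtemplates&format=json" +
      "&text=" + URLEncoder.encode(wikitext, "UTF-8")
    Source.fromURL(url).mkString // JSON containing the expanded wikitext
  }

  def main(args: Array[String]): Unit =
    println(expand("{{convert|100|km|mi}}"))
}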

Database

Beyond just the article text, you may want information about the images and link stats. You’ll need to import the MySql tables to read this information. Here’s a diagram of the MediaWiki database schema.

I highly recommend turning off indexes while importing the tables and turning them back on once everything is imported.

Large Scale Processing

You need to determine how much of the wiki dataset you are processing. If you can get away with iterating through the XML dump on a single machine, I highly recommend that approach.

For my purposes, I had to run through a few iterations over the XML dump. For some of the iterations I was able to get away with running on one machine, but other times I had to parallelize the work across multiple machines.

I used Hadoop to perform the parallel processing. Hadoop does have a built-in XML splitter, and Mahout also comes with a Wikipedia iterator, but I found both of them non-intuitive and incorrect in some cases. I resorted to a system where we collapsed each wiki article XML entry into a single line. Hadoop makes it very easy to process one-entry-per-line datasets in parallel.
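
For reference, here is roughly what that preprocessing step looks like in Scala. It leans on the fact that <page> and </page> sit on their own lines in the dump, and it escapes embedded newlines so the page can be reassembled on the Hadoop side. A sketch, not the exact code we ran:

import java.io.PrintWriter
import scala.io.Source

object OnePagePerLine {
  def main(args: Array[String]): Unit = {
    val out = new PrintWriter("pages-one-per-line.txt", "UTF-8")
    val buffer = new StringBuilder
    var inPage = false
    for (line <- Source.fromFile("enwiki-latest-pages-articles.xml", "UTF-8").getLines()) {
      if (line.trim.startsWith("<page>")) { inPage = true; buffer.clear() }
      // Newlines inside a page are escaped as the two characters "\n"
      // so each <page> element ends up on exactly one output line.
      if (inPage) buffer.append(line).append("\\n")
      if (line.trim.startsWith("</page>")) {
        inPage = false
        out.println(buffer.toString)
      }
    }
    out.close()
  }
}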

Image Processing

Images from Wikipedia articles have two sources: wikipedia itself and commonswiki.

If you are processing images from Wikipedia or Wikimedia Commons, be aware that some images are really large, and this can kill your image processing application.

Look at this image metadata, a 26280×19877 98M image…


{"height": 19877, "width": 26280,
"source": "http://en.wikipedia.org/wiki/File:El_sue?o_de_Jacob,_by_Jos?_de_Ribera,_from_Prado_in_Google_Earth.jpg",
"url": "http://upload.wikimedia.org/wikipedia/commons/8/85/El_sue%C3%B1o_de_Jacob%2C_by_Jos%C3%A9_de_Ribera%2C_from_Prado_in_Google_Earth.jpg" }

The metadata for each image can be found in the MySql database dump table named image. Unfortunately the description field is truncated and you’ll need to join it with the page and revision tables to get the whole description. This description is also in wikitext format so you’ll need to run it through a wikitext parser.

Other Resources

Below are some resources I tried briefly.

  • Bliki engine – a Java Wikipedia parser.
  • dbpedia’s extraction_framework – can be used to extract infoboxes.
  • GWT Wiki Dump support.
  • Mahout’s Wikipedia Dump Splitter.

Happy Wiki Mining!

Hope this article helps you get started with wikipedia mining!

Deploying a Rails app on Nginx/Puma with Capistrano

Puma is a fast, multi-threaded Ruby app server designed to host Rack-based Ruby web apps, including Sinatra and Ruby on Rails. Like Unicorn, it supports rolling restarts, but because it uses a multi-threaded model rather than Unicorn’s multi-process model, it takes far less memory while delivering comparable performance. Puma can run on Ruby 1.9.X, but its multi-threaded nature is better suited to a runtime with real parallel threads like Rubinius or JRuby.

This article will guide you through setting up a hello world Rails app with Puma/Nginx and deploying it with Capistrano onto a Linux system. This guide was tested on Puma 1.6.3 and Puma 2.0.1.

Create a base Rails app

rails new appname

Adding Puma to Rails app

We’ll start by adding Puma to your Rails app.

In your Gemfile, add:

gem "puma"

then run bundle install

Now we need a puma config file: config/puma.rb

rails_env = ENV['RAILS_ENV'] || 'development'

# A fixed pool of 4 threads (min, max)
threads 4,4

# Bind to a unix socket in the shared dir so nginx can proxy to it
bind  "unix:///data/apps/appname/shared/tmp/puma/appname-puma.sock"
# pid and state files live under tmp/puma, which Capistrano symlinks to shared/tmp below
pidfile "/data/apps/appname/current/tmp/puma/pid"
state_path "/data/apps/appname/current/tmp/puma/state"

# Start the control app so pumactl and the init script can manage the server
activate_control_app

Setup Nginx with Puma

Follow the instructions to install Nginx from source. It will install nginx to /usr/local/nginx

Edit your /usr/local/nginx/conf/nginx.conf file to match the following:

user deploy;
worker_processes  1;

error_log  /var/log/nginx/error.log;
pid        /var/run/nginx.pid;

events {
    worker_connections  1024;
}

http {
    include       /usr/local/nginx/conf/mime.types;
    default_type  application/octet-stream;

    access_log  /var/log/nginx/access.log;

    sendfile        on;
    #tcp_nopush     on;

    #keepalive_timeout  0;
    keepalive_timeout  65;
    tcp_nodelay        on;

    gzip  on;

    server_names_hash_bucket_size 128;
    
    client_max_body_size 4M; 
    client_body_buffer_size 128k;
    
    include /usr/local/nginx/conf/conf.d/*.conf;
    include /usr/local/nginx/conf/sites-enabled/*;
}

Create a file named "puma_app" in the sites-enabled directory:

upstream appname {
  server unix:///data/apps/appname/shared/tmp/puma/appname-puma.sock;
}

server {
  listen 80;
  server_name www.appname.com appname.com;

  keepalive_timeout 5;

  root /data/apps/appname/public;

  access_log /data/log/nginx/nginx.access.log;
  error_log /data/log/nginx/nginx.error.log info;

  if (-f $document_root/maintenance.html) {
    rewrite  ^(.*)$  /maintenance.html last;
    break;
  }

  location ~ ^/(assets)/  {
    root /data/apps/appname/current/public;
    expires max;
    add_header  Cache-Control public;
  }

  location / {
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header Host $http_host;

    if (-f $request_filename) {
      break;
    }

    if (-f $request_filename/index.html) {
      rewrite (.*) $1/index.html break;
    }

    if (-f $request_filename.html) {
      rewrite (.*) $1.html break;
    }

    if (!-f $request_filename) {
      proxy_pass http://appname;
      break;
    }
  }

  # Now this supposedly should work as it gets the filenames 
  # with querystrings that Rails provides.
  # BUT there's a chance it could break the ajax calls.
  location ~* \.(ico|css|gif|jpe?g|png)(\?[0-9]+)?$ {
     expires max;
     break;
  }

  location ~ ^/javascripts/.*\.js(\?[0-9]+)?$ {
     expires max;
     break;
  }

  # Error pages
  # error_page 500 502 503 504 /500.html;
  location = /500.html {
    root /data/apps/appname/current/public;
  }
}

Using init scripts to start/stop/restart puma

We want to be able to start/restart puma using linux init scripts.

The init scripts for nginx should already be installed as well. You can start nginx using `sudo /etc/init.d/nginx start`.

Install Jungle from Puma’s source repo. Jungle is a set of scripts to manage multiple apps running on Puma. Take the puma and run-puma files and place them into /etc/init.d/puma and /usr/local/bin/run-puma respectively.

Then, add your app config using: sudo /etc/init.d/puma add /data/apps/appname/current deploy

IMPORTANT: The init script assumes that your puma state directories live in /path/to/app/tmp/puma.

Using Capistrano to deploy

In your Gemfile, add:

gem "capistrano"

then run bundle install

In your deploy.rb, change it to the following. Note the shared tmp dir modification.

#========================
#CONFIG
#========================
set :application, "APP_NAME"
set :scm, :git
set :repository, "GIT_URL"
set :branch, "master"
set :ssh_options, { :forward_agent => true }
set :stage, :production
set :user, "deploy"
set :use_sudo, false
set :runner, "deploy"
set :deploy_to, "/data/apps/#{application}"
set :app_server, :puma
set :domain, "DOMAIN_URL"
#========================
#ROLES
#========================
role :app, domain
role :web, domain
role :db, domain, :primary => true
#========================
#CUSTOM
#========================
namespace :puma do
  desc "Start Puma"
  task :start, :except => { :no_release => true } do
    run "sudo /etc/init.d/puma start #{application}"
  end
  after "deploy:start", "puma:start"

  desc "Stop Puma"
  task :stop, :except => { :no_release => true } do
    run "sudo /etc/init.d/puma stop #{application}"
  end
  after "deploy:stop", "puma:stop"

  desc "Restart Puma"
  task :restart, roles: :app do
    run "sudo /etc/init.d/puma restart #{application}"
  end
  after "deploy:restart", "puma:restart"

  desc "create a shared tmp dir for puma state files"
  task :after_symlink, roles: :app do
    run "sudo rm -rf #{release_path}/tmp"
    run "ln -s #{shared_path}/tmp #{release_path}/tmp"
  end
  after "deploy:create_symlink", "puma:after_symlink"
end

You’ll need to set up the directories one time using: cap deploy:setup

Now you can deploy your app using cap deploy and restart with cap deploy:restart.

EDIT: xijo has formalized the cap tasks as a gem, capistrano-puma, to make it easier to use on multiple projects.

Basic Learning Scala Resources

Scala is one of my primary languages, and I have found it useful for building web services, large-scale data modeling and data processing. Below are some of my notes from my experience learning Scala.

Installation

Scala runs on the Java Virtual Machine. You’ll need to install Java first. Install Oracle Java 7, not OpenJDK or Java 6 unless you have to. I encountered a few bugs on OpenJDK but not on Oracle Java.

You can install Scala system-wide, but your build system will download the specific version of Scala for your project anyway.

Programming Language

The official tutorial lists a lot of topics (Implicit Parameters, Variances, Upper Type Bounds, etc.), but I recommend not diving in head first.

I recommend starting with just these four topics. They showcase the benefits of Scala while being approachable for most programmers.

  • Functions as first-class citizens – You can use closures, unlike in Java.

  • Collections and collection operations – The ease of performing operations on collections makes Scala a very productive language. Learn the class hierarchy and methods.

  • Class system – classes, objects (singletons), traits and case classes.

  • Pattern matching – Another great language feature for writing Scala-idiomatic code.

With just these four topics, you can get started writing effective programs in Scala; a small sketch follows below. Once you master these, then go on to the advanced material (type system, actors, etc.).
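
A minimal, self-contained example touching all four (the class and value names are made up):

object FourTopics {
  // Class system: a sealed trait with case classes
  sealed trait Shape
  case class Circle(radius: Double) extends Shape
  case class Rect(width: Double, height: Double) extends Shape

  // Pattern matching over the case classes
  def area(shape: Shape): Double = shape match {
    case Circle(r)  => math.Pi * r * r
    case Rect(w, h) => w * h
  }

  def main(args: Array[String]): Unit = {
    val shapes = List(Circle(1.0), Rect(2.0, 3.0))
    // Functions as first-class citizens: `area` is passed straight to `map`
    val areas = shapes.map(area)
    // Collection operations chain naturally
    println(areas.filter(_ > 2.0).sum)
  }
}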

These free books from Typesafe are a good start for the language.

Coding style

Ecosystem

The ecosystem of Scala is equally as important to learn as the language itself.

Sites

  • Reddit Scala – Keep up to date quickly.
  • Scala Package Release Notes – Typically a lot of recently released packages will be noted here.
  • Maven Central – All major Scala (and Java) packages are published here. You can access these with any Scala build system.

Involved Companies

  • Typesafe – Founded by the Scala team, Typesafe is the company behind commercial Scala support. They help companies build robust Scala environments.
  • Companies using Scala – See what others are doing with Scala.
  • twitter github – Twitter has released a lot of useful Scala libs, including Finagle.
  • foursquare github – Foursquare’s open-source projects, many of them in Scala.

Build System

You can decide if you want to use an IDE or just plain text editors but I highly recommend using a standard build system.

  • Maven – Here’s an example pom.xml from a sample project. The main commands are compile continuously (mvn scala:cc), run tests (mvn test) and package (mvn package).

  • SBT – The official build system, supported by Typesafe. I had bad experiences with this tool after trying it from version 0.7 to 0.10. It has arcane syntax and breaking changes between versions, so I just stuck with Maven. Luckily, the fast compiler inside SBT is available for Maven now too.

Editors

I typically use a plain text editor (VIM) with mvn scala:cc running in the background. Sometimes I use an IDE (Intellij) if I need a debugger or am working with a larger project. You’ll find both build systems (Maven and SBT) work well with Eclipse and Intellij.

  • VIM – I use Janus to bootstrap a VIM setup with basic Scala syntax highlighting. vim-scala is another good VIM plugin.

  • Sublime Text – A great general text editor with scala plugins.

  • Eclipse – Officially supported IDE.

  • Intellij – I prefer Intellij’s interface.

Basic Libs

Web Frameworks

  • Play – A full featured web framework similar to Ruby on Rails. Good for any general purpose web application.

  • Scalatra – Modeled on Ruby’s Sinatra. A very easy micro-web framework to start with. Mostly good for web services type of applications.

  • Lift – Used by Foursquare. I have never tried it but it is stable and well supported by a community.

Database

Other

  • Json4S – This is Scala’s primary JSON parsing package.

  • logback – logback is a widely used logging package.

  • junit – JUnit is the most popular Java testing package. I prefer its simple syntax over testing packages that feature a DSL, such as Specs2.

Help

Misc

Don’t worry about the complexity of Scala early on.

Specifically, if you try to read the Scala source code, you will be perplexed initially. I would start by writing Scala in a traditional Ruby/Python or Java programming style first.

Learn how the JVM works.

If you wrote production Java code in the past, this will come in handy as the tuning/debugging/profiling process will be nearly the same.

All your favorite JVM tools like jstack and jvisualvm will still work.

Learn the configuration flags of the java command line. The garbage collector, heap memory and logging are among the configurable options. This blog post and the Resin JVM Tuning page are helpful.

Be forewarned of binary compatibility.

I find this to be the biggest problem with the Scala environment. Packages built with an older major version (2.8.X) probably won’t work with newer versions (2.10.X).

If you find a relatively new Scala package and a well-supported Java package that do nearly the same thing, I recommend picking the Java one for your project, because the Scala package might be outdated in a few months and you won’t be able to upgrade.

Startup School 2012 Summary: My Takeaways and Interpretations

I attended the 2012 Startup School at Stanford this past weekend. The event is geared towards encouraging, preparing and teaching engineers/entrepreneurs/designers/anyone about startups. I highly recommend that anyone interested in the startup world attend if they have the chance next year.

Below are some of my notes, takeaways and interpretations.

Mark Zuckerberg of Facebook

Paul Graham interviewed Mark Zuckerberg, CEO of Facebook.

You can get away with the 80/20 rule for a lot of things, but there should be a few things where you need to be the best and invest quality time. It’s your job to find what you can 80/20 and what is your primary feature.

Around the time Facebook was starting, the idea of identity on the Internet was in its infancy. Zuck wanted to create a site where people are real. He used university emails as a means of controlling registration and identity.

It’s a clever hack: you don’t have to do everything yourself. You can use existing systems to accomplish your goals.

He also said he cared more about the product than his competitors did. Instead of going after schools with no competitors, he chose to go up against competition right away and launched at Stanford and Columbia because those colleges had active competitors. If students were willing to pick his product over another, he knew he had the best product.

Zuck mentioned that every year, people are sharing more and more. Facebook was a manifestation of this principle.

See how your customers are using your product.

Zuck and his team noticed early on that students would keep changing their profile photos because they could only have one profile photo at a time. From this, Facebook later built out a photo albums feature.

He started Facebook as a hobby. He explored his options before fully committing. It was not until Facebook got over 1 million users that he left Harvard.

People see faces. People dream of social interactions. Facebook expands the human capacity for sharing. Before Facebook, people could only share private things to a very limited amount of friends and family.

Facebook was never a place to meet new people.

Zuck solved something that was fundamental to humans (sharing) for a small market (universities) and expanded the market (the world).

Zuck built lots of little ‘hack’ projects. Instead of studying for an Art History class, he built an online crowdsourced study tool and forwarded it to his class.

Little projects are a great way to keep your mind sharp in solving problems. Keep building them even if they seem trivial and are throwaways.

Travis of Uber

Travis likes numbers. His presentation showed that Uber tracks and analyzes its data very well. It could be their upper hand against competitors. They have a strong operations and optimization team.

If you want to make something simple for the user, it requires a non-trivial implementation.

He solved his own problem: he wanted to be a baller in San Francisco by getting rides in town cars.

They have a heatmap of rider density. If they just showed this heatmap to the drivers, all the drivers would stay in the dense areas and the other areas would become underserved. Instead, they give drivers a predicted density of underserved riders.

Don’t just show the data. Show a version of the data to encourage the right behaviors.

Uber was founded to solve a simple problem, but they encountered larger real-world problems: corrupt cities and legacy laws.

Jessica Livingston

Only a few startups actually make it. What goes wrong?

Startups are a very slow process.

There is no playbook. You have to improvise. Stripe had to make deals with a bank so they made phone calls instead of meeting in person because they looked young.

Cofounder breakups are a big factor in startups failing. Choose very wisely. Work with the person beforehand.

“fundraising is a bitch”

Investors only like to invest if others have invested so this makes the first investment very hard.

There are a lot of distractions. Try to focus on:

  • code
  • talk to users
  • exercise

One distraction is talking to the corporate development team of a large company. Large companies will typically just want to acquire you as a resource, not the business. Often the bonus is not that much. Talking to them will be demoralizing and put you off working on the product.

Make something people want by talking to users!

Patrick Collison of Stripe

Stripe was originally named /dev/payments

Startups are unpredictable.

During that year, there was a lot of debate and doubt about the product due to slow growth.

Technology can have a huge impact on the world. Patrick noted that before shipping containers, shipping accounted for up to 20% of costs. Now, goods can move around almost for “free”.

Ben Silbermann of Pinterest

Being in a startup is like a road trip.

You don’t know which route you are going to take to get to your destination. Are you going to run out of gas? Would you need to get gas from somewhere else?

Commitment matters.

They originally started with Tote, a shopping app for the iPhone. They didn’t like depending on the slow Apple approval process.

Don’t take things on faith: investor opinions, anything. Analyze it.

After a year in product development, they released their product. In 4 months, they only had 3,000 users.

But they did notice the people who were using it had high retention. This was a sign that the product was good and useful for some people.

They decided to improve marketing as opposed to adding features. They held Pinterest community meetups. He noticed that Pinterest helped people with their interests. During the meetups, people would ask about posts on users’ profile pages. They decided to find the right audience to jumpstart their growth instead of adding more product features.

They had a useful product, but it was against the “fads” of the time. Pinterest was a low-tech product in comparison: it was not real-time and it was not mobile. Ben ignored adding these features and focused on the core pinboard idea.

There are different ways of succeeding. Trust the data you see. When you see high retention in your product, it is useful.

Ben Horowitz

He quoted Michael Jackson saying that “it’s harder than it looks” as an analogy for startups.

If you are building something that already exists, you have to build a product that improves it by 10x.

Anything else will not convince people to switch.

Google Search was 10x better than AltaVista.

Dropbox was 10x better than any other file sharing software.

Tom Preston-Werner of Github

Everything you add to a product removes something else.

Github was started as a way to make git hosting not suck. Then it grew to helping people build software together. Your mission will change but start small.

Tom said with the $100 million in investment, github is planning to expand beyond programmers and fix collaboration.

During his talk, he often asked rhetorically what is the single thing that matters?

He went on to say that people were the only thing, then that product was the only thing, and then philosophy.

Ask yourself if you are asking the right question.

Ron Conway

He invests in people not companies.

Web products are still in infancy. Lots of new products can still be made.

Joel Spolsky of Stackexchange

Decide if your business is big land grab or slow organic growth.

Big land grab companies include Facebook, Airbnb, Stackexchange. Slow organic growth include Fog Creek.

“failure to decide is what kills you”

David Rusenko of Weebly

In 8 months, they only had 30 signups daily.

In 11 months, they only had 100 signups daily.

Finally after 20 months, they had consistent growth.

Know what you are building is useful by measuring your retention. If it’s useful, stay in the game.

The Non-Technical Guide to Web Technologies

I’m in the process of writing an ebook titled “The Non-Technical Guide to Web Technologies”. The goal of the book is to be an essential guide that gives folks a fundamental understanding and up-to-date knowledge of web technologies.

Non-Technical Guide to Web Technologies

As suggested by the name, it is meant for a non-technical audience who work in web businesses. This can include recruiters, business development, marketing, public relations and even non-technical startup founders.

Not everyone should be expected to fully understand computer science and programming, and that is the motivation behind this book. I want to offer a resource that provides a basic overview of the technology powering their business.

If you are interested in updates on this book, sign up for updates on the book’s landing page.

Also, feel free to let me know what topics you want covered in the book.

Rolling Restarts with Capistrano and Haproxy for Java Web Service Apps

Java web apps can be efficient because they are multithreaded and you only need to run one copy of the process to serve multiple concurrent requests. This is in contrast to Ruby apps, where you often need multiple processes to serve multiple requests. Using one process instead of ~8 will save you a lot of memory on the system.

The downside of one process is dealing with rolling restarts. In the case of Ruby app servers like Unicorn, multiple processes are run and thus can be set up to provide rolling restarts.

If you are using a web container such as Tomcat 7, it can support hot reload in place.

But let’s assume your Java JVM web app is run with a single command (e.g. java -jar backend-1.0.jar &). The idea of this setup is that it generalizes to any single-process web service.

To get rolling restarts out of this setup, we can use capistrano with haproxy.

We want to:

* start two different servers with one process each (or use two processes on one server, though this won’t provide failover)
* use haproxy as a load balancer in front of these servers

In your Java web service app, add a health check endpoint (/haproxy-bdh3t.txt) and have it serve an empty text file.

[It’s important to use a random string as your endpoint if you are running in the public cloud, since the load balancer could be referencing an old server address and haproxy could think a server is up when it isn’t.]
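
If your framework makes serving a static file awkward, the route itself can check for the file on disk, which keeps the Capistrano trick below (removing and re-creating the file) working unchanged. Here is a minimal sketch in Scala using the JDK’s built-in HttpServer; the port and file path are assumptions to adapt to your app:

import com.sun.net.httpserver.{HttpExchange, HttpHandler, HttpServer}
import java.io.File
import java.net.InetSocketAddress

object HealthCheckSketch {
  // Assumed path -- point it at the same file your deploy script removes and touches.
  val healthFile = new File("/data/apps/appname/current/public/haproxy-bdh3t.txt")

  def main(args: Array[String]): Unit = {
    val server = HttpServer.create(new InetSocketAddress(8080), 0)
    server.createContext("/haproxy-bdh3t.txt", new HttpHandler {
      def handle(exchange: HttpExchange): Unit = {
        // 200 while the file exists, 404 once the deploy script removes it,
        // which is what tells haproxy to drain this backend.
        exchange.sendResponseHeaders(if (healthFile.exists) 200 else 404, -1)
        exchange.close()
      }
    })
    server.start()
  }
}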

In your haproxy.cfg, add

option httpchk HEAD /haproxy-bdh3t.txt HTTP/1.0

as your check condition to the backend services.

In your Capistrano script, let’s add two servers set as the app role:

server "XXX1", :app
server "XXX2", :app

and alter the restart task to:

* remove the check file from one server. This removes the server from the load balancer.
* restart that server.
* ping the server.
* add the check file back on the restarted server, which haproxy will then add back into the load balancer.
* repeat as a loop for each server.


desc "Restart"
task :restart, :roles => :web do 
  haproxy_health_file = "#{current_path}/path-static-files-dir/haproxy-bdh3t.txt"

  # Restart each host serially
  self.roles[:app].each do |host|
    # take out the app from the load balancer
    run "rm #{haproxy_health_file}", :hosts => host
    # let existing connections finish
    sleep(5)

    # restart the app using upstart
    run "sudo restart backend_server", :hosts => host

    # give it sometime to startup
    sleep(5)

    # add the check file back to readd the server to the load balancer
    run "touch #{haproxy_health_file}", :hosts => host
  end
end