Author Archives: tommychheng

About tommychheng

I write a tech blog at http://tommy.chheng.com

Self-Publishing Tools for converting an Ebook to a Paperback

I self-published The Non-Technical Guide to Web Technologies as an ebook via the Amazon Kindle store about half a year ago. After many requests for paperback copies, I decided, “why not also try self-publishing the paperback?” I looked at various self-publishing tools and settled on Amazon’s createspace.com.

I already had a cover from the ebook, but I still needed a spine and back cover in the right format. I used Fiverr to get a few versions. Prices on Fiverr start at $5, but you’ll generally spend more for the “extra” services like getting the cover in PSD format.

CreateSpace was very easy to use: just type in your information and upload the files. They were able to create and ship the proofs quickly.

After I approved the proofs, the book was in the Amazon store within a few days. It also automatically links up with the Kindle edition.

book cover for non-technical guide to web technologies

I’m very impressed by today’s self-publishing tools. I highly recommend the fiverr.com/createspace.com combination for converting your ebook to a paperback.

Encode OpenGL to Video with OpenCV

I needed to create a working command-line demo application that pipes an OpenGL rendering into a video file. This experiment is intended as a proof of concept and is a very naive implementation.

The main steps are:

  • OpenGL rendering
  • Reading the rendering’s output
  • Writing each output frame into the video

This guide covers an OpenGL >4.0 application tested on Mac OS X. It should be portable to Linux/Windows as well.

If you only need OpenGL-to-video output on iOS, the AVFoundation classes will make this process much easier. Similarly, if you only need to target Android (API 18), you can use MediaMuxer to output an OpenGL rendering to a video file. See EncodeAndMuxTest.java for details.

OpenGL Rendering

The focus of this article isn’t the OpenGL rendering logic, so let’s just go with a basic OpenGL application that renders a sliding window across an image texture.

The basic structure of our app will be:
initShaders();                 // compile and link the shaders
loadTexture();                 // load the image texture we slide across
setupBuffers();                // create the vertex buffers
glutDisplayFunc(&drawScene);   // register the per-frame display callback
glutIdleFunc(&replay);         // register the idle callback that drives the animation
glutMainLoop();

(See the complete source code on GitHub.)

Reading the Rendered Output Frames

After drawing to the window, we use glReadPixels to read the pixel data from the OpenGL framebuffer into a block of memory. (Alternatively, you can use an OpenGL Pixel Buffer Object for more efficiency.)

unsigned char *raw_image = (unsigned char*) calloc(width * height * 3, sizeof(char));
glReadPixels(0, 0, width, height, GL_RGB, GL_UNSIGNED_BYTE, raw_image);
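
If you do want the more efficient route, here is a minimal, untested sketch of the PBO approach (the buffer names and the setup/per-frame split are my own, not from the sample project):

// One-time setup: create a PBO big enough to hold one RGB frame
GLuint pbo;
glGenBuffers(1, &pbo);
glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
glBufferData(GL_PIXEL_PACK_BUFFER, width * height * 3, NULL, GL_STREAM_READ);

// Per frame: with a PBO bound, glReadPixels returns immediately and the
// pixels are copied into the buffer asynchronously
glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
glReadPixels(0, 0, width, height, GL_RGB, GL_UNSIGNED_BYTE, 0); // 0 = offset into the PBO

// Map the buffer when you actually need the pixels (ideally one frame later,
// with two PBOs in rotation, so the GPU copy overlaps with encoding)
unsigned char *pixels = (unsigned char *) glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY);
if (pixels) {
    // hand `pixels` to the video writer here
    glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
}
glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);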

Writing the output frames into a video format

Let’s start by setting up the OpenCV Video Writer:

CvVideoWriter *writer = 0;
int isColor = 1;
int fps = 30;
int width = image_width;
int height = image_height;

writer = cvCreateVideoWriter("out.avi", CV_FOURCC('D', 'I', 'V', 'X'), fps, cvSize(width, height), isColor);

You can change CV_FOURCC('D', 'I', 'V', 'X') to CV_FOURCC('X', '2', '6', '4') for x264 output. Make sure you compiled ffmpeg with x264 support and OpenCV with ffmpeg support.

We then create an OpenCV IplImage from this block of memory. Finally, we use OpenCV’s cvWriteFrame to append the frame to the video output.

IplImage* img = cvCreateImage(cvSize(width, height), IPL_DEPTH_8U, 3);
img->imageData = (char *)raw_image;
cvWriteFrame(writer, img); // add the frame to the file
cvReleaseImage(&img);

You will need to read the framebuffer and write to the video using OpenCV on every frame update.
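
Putting these pieces together, the GLUT display callback can look roughly like the sketch below. It reuses one pixel buffer and one IplImage header (via cvCreateImageHeader/cvSetData rather than the cvCreateImage call above) so nothing is allocated per frame; renderFrame() is a hypothetical stand-in for your actual drawing code, and writer, width, and height are the variables set up earlier.

void drawScene() {
    // one RGB buffer and one IplImage header, reused every frame
    static unsigned char *raw_image = NULL;
    static IplImage *img = NULL;
    if (raw_image == NULL) {
        raw_image = (unsigned char *) calloc(width * height * 3, sizeof(char));
        img = cvCreateImageHeader(cvSize(width, height), IPL_DEPTH_8U, 3);
        cvSetData(img, raw_image, width * 3);
    }

    renderFrame();   // hypothetical: your actual OpenGL drawing calls

    // read back the frame we just rendered and append it to the video
    // (reading GL_BGR instead of GL_RGB avoids swapped red/blue, since OpenCV assumes BGR)
    glReadPixels(0, 0, width, height, GL_RGB, GL_UNSIGNED_BYTE, raw_image);
    cvFlip(img, NULL, 0);        // OpenGL rows are bottom-up; flip so the video is not upside down
    cvWriteFrame(writer, img);

    glutSwapBuffers();
}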

After you are done writing the frames to video, be sure to release the OpenCV video writer.

cvReleaseVideoWriter(&writer);

The alternatives to using OpenCV for video writing are the C libraries of ffmpeg or gstreamer, or libx264 (mp4)/libvpx (webm) directly. These require the RGB image data to be converted to the YV12 (or YUV420) color space first.

This Stack Overflow post goes over the details of using the x264 C API: http://stackoverflow.com/questions/2940671/how-does-one-encode-a-series-of-images-into-h264-using-the-x264-c-api
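
If you go that route, here is a minimal, untested sketch of the color conversion as a hypothetical helper using ffmpeg’s libswscale (AV_PIX_FMT_* are the current constant names; older ffmpeg releases spell them PIX_FMT_*; the caller allocates the Y/U/V planes and passes the RGB buffer read by glReadPixels):

#include <stdint.h>
#include <libswscale/swscale.h>
#include <libavutil/pixfmt.h>

// Convert one RGB24 frame into planar YUV420P, which libx264/libvpx expect.
// In a real app, create the SwsContext once and reuse it for every frame.
static void rgb_to_yuv420p(const uint8_t *raw_image, int width, int height,
                           uint8_t *y_plane, uint8_t *u_plane, uint8_t *v_plane) {
    struct SwsContext *sws = sws_getContext(width, height, AV_PIX_FMT_RGB24,
                                            width, height, AV_PIX_FMT_YUV420P,
                                            SWS_BILINEAR, NULL, NULL, NULL);
    const uint8_t *src[1] = { raw_image };
    int src_stride[1]     = { 3 * width };                     // 3 bytes per RGB pixel
    uint8_t *dst[3]       = { y_plane, u_plane, v_plane };
    int dst_stride[3]     = { width, width / 2, width / 2 };   // chroma planes are subsampled 2x2
    sws_scale(sws, src, src_stride, 0, height, dst, dst_stride);
    sws_freeContext(sws);
}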

Sample Project

I posted an example project on GitHub which demonstrates how this works: https://github.com/tc/opengl-to-video-sample

Don’t Imagine Success, Make an Actionable and Trackable Plan

I’m in the process of reading 59 Seconds, which relies on scientific research to debunk common self-help advice.

One common piece of self-help advice you may have heard is:

“imagine success and you will be successful”

The author, Richard Wiseman, suggests that the better alternative is to create an actionable and trackable how-to plan. A how-to plan is simply your goal written out along with sub-goals as a step-by-step process. Each sub-goal must be trackable and time-based.

For example, if you want to “Find a new job”, a possible plan can be:

  • Week 1: Write or update resume
  • Every 2 weeks for the next 6 months: Apply to a job.
  • Keep a journal of each job application and its status.

This is a simple idea, but it’s a much better alternative to just wishing and imagining success.

The Non-Technical Guide to Web Technologies Ebook Published!

After a few months of writing, editing, and testing with lots of readers, I’m releasing The Non-Technical Guide to Web Technologies ebook. It’ll be available as a PDF and in Amazon Kindle format.

It’s roughly 60+ pages of concise text, diagrams, and images to help teach the basics of web technologies. I spent a lot of time going back and forth to find the right balance of how much detail to include.

In the book, I cover things everyone in a technical company should be aware of:

  • How a web page gets to your web browser.
  • Common software development questions.
  • HTML5, JSON, AJAX explained.
  • The programming languages and databases used to create web applications.
  • The different types of servers used.
  • The different software engineering job titles at internet companies.
  • Common security attacks.
  • The web technologies used by a few internet companies: Etsy, Pinterest, Square, Instagram, Tumblr.

This book caters to non-programmers (startup founders, recruiters, sales, business development, marketing) interested in learning more about the field without actually learning to code.

The Non-Technical Guide To Web Technologies
Read a copy now!

Thanks to Jared Cohen @jaredcohe, Joe Mahavuthivanij @epicsaurus, and Tony Tran @quicksorter for the EXTENSIVE proofreading and feedback!

UPDATE:
Thanks for the support, everybody! The book has been on Amazon’s Best Sellers list for “Computer and Technology” over the past week. It sits below Walter Isaacson’s biography of Steve Jobs and Steven Levy’s In The Plex (the Google book), but it’s still in the top 10. :)

Amazon's Best Sellers in Computer and Technology

Hosting a Static Website on Amazon S3

So you need to set up a static website for a friend or customer with a cost-effective, low-maintenance solution?

The best solution I found is Amazon S3. Amazon S3 charges $0.095 per GB of storage and $0.01 per 10,000 GET requests. For most sites, this is practically free. There is also no server you need to log in to, so it’s a very low security risk.
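
As a rough illustration using those rates: a 100 MB site that serves 200,000 GET requests a month works out to about 0.1 GB × $0.095 + 20 × $0.01 ≈ $0.21 per month, plus a small data transfer charge.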

To host a static website on Amazon S3, follow these steps:

Get a Domain Name

I prefer to use namecheap.com.

Namecheap domains

Create an aws.amazon.com account

Sign up for an AWS account. You will get an access key and a secret key.

Set up your S3 bucket

Click on S3, then click on Create Bucket. Name the bucket after your domain name with a “www” prefix:

www.YOUR-DOMAIN.com

Enable the “Website hosting” option. You can select which region: us-west-1, us-east-1, etc.

Enabling the “Website hosting” option in the S3 console

Set your DNS records

On your domain’s DNS page, add a CNAME record for “www” and point it to www.YOUR-DOMAIN.com.s3-website-us-west-1.amazonaws.com (use the website endpoint S3 shows for your bucket’s region).

Add a CNAME in Namecheap

Upload your files

Upload your files from the S3 bucket UI.

Upload in S3 Console

Visit the site

That’s it!

If you get a permission denied error, make sure your files’ permissions are set to public.
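
One common way to do that for the whole bucket (an example policy; substitute your own bucket name) is to attach a public-read bucket policy in the bucket’s Permissions section:

{
  "Version": "2012-10-17",
  "Statement": [{
    "Sid": "PublicReadGetObject",
    "Effect": "Allow",
    "Principal": "*",
    "Action": "s3:GetObject",
    "Resource": "arn:aws:s3:::www.YOUR-DOMAIN.com/*"
  }]
}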

Alternative upload methods

Alternatively, you can use a native application like Cyberduck to upload your files.

Cyberduck

Data Mining Wikipedia Notes

I spent quite a bit of time mining Wikipedia while at Qwiki. Our original product was a visual search engine where we used Wikipedia as a main data source to generate short videos from each Wikipedia article. We had to extensively parse the wikitext and associated media from Wikipedia to generate these videos. Here’s an example of our technology being used by Bing:

Qwiki being used in Bing search results.

A Qwiki video of New York City

As a first step to mining Wikipedia for your project, I recommend having a goal in mind. Don’t go down the rabbit hole if you only need to take a peek. Wikipedia is almost 100% open and you can see into its inner workings very easily, but it’s easy to get lost in everything, especially the datasets. Most of the time, you will only need a subset of the data for your goal.

For example, if you just want to grab the articles’ internal links, you can download the pagelinks MySQL table dump and avoid parsing every article.

There are a few categories I’ll cover:

  • Data Collection
  • Data Filtering
  • Data Extraction

Data Collection

Data Dumps

Wikipedia data comes in two sets: XML dumps and MySQL dumps. The article and revision text are in the XML dumps, and the remaining supplementary data (image metadata, imagelinks, etc.) comes in the MySQL dumps.

Both can be found at http://dumps.wikimedia.org/enwiki/latest/

You can download the entire article set (enwiki-latest-pages-articles.xml) or partitioned chunks: enwiki-latest-pages-articlesX.xml.bz2, where X is 1 to 32.

There is also a corresponding RSS feed. If you need to be notified when a new Wikipedia dump is available, you can write a script to monitor the RSS URL. I’ve noticed the dumps are commonly released around the first week of each month.

Besides Wikipedia itself, there are related datasets derived from Wikipedia. The two most popular are Google’s Freebase and the open-source DBpedia. Both projects have the primary goal of structuring the data. By structuring the data, I’m referring to associating a fact with its label, e.g. “Bill Gates’ Date of Birth: October 28, 1955”.

Freebase comes in an easier-to-parse format and has an API. They also release WEX-formatted article text every two weeks; WEX is an XML-formatted version of wikitext. Unfortunately, the WEX parser has not seen many updates lately and allegedly doesn’t parse correctly.

BEWARE of relying on the structured data from Freebase. Some of their data is good, but sometimes the data is very out of date or missing compared to the corresponding Wikipedia article. In particular, I noticed their population statistics are out of date. There is also less coverage of location data than in Wikipedia.

DBpedia is an open-source project written primarily in Scala. Its main goal is to extract structured data from the infoboxes. It has a DBpedia Live system where Wikipedia articles are automatically updated in DBpedia, so DBpedia has near-real-time updates from Wikipedia. You can query for structured data using SPARQL, though using SPARQL is a little more difficult than Freebase’s JSON-based query language, MQL. There are also supporting projects like DBpedia Spotlight, which is designed to extract Wikipedia-derived entities from a body of text.

I would recommend seeing if either of these projects can solve your problem before trying to mine Wikipedia yourself.

Even if you do decide to mine Wikipedia yourself, be sure to use or read the DBpedia parser. It contains a lot of information regarding infobox normalization, which can help in this area. Consider this page, which shows a mapping of the settlement infobox; it can help you think of the heuristics you need for your parser.
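
For reference, the raw wikitext you would be normalizing looks something like this (a trimmed, illustrative snippet of an Infobox settlement; the field values are just examples):

{{Infobox settlement
| name              = San Francisco
| settlement_type   = [[Consolidated city-county]]
| population_total  = 805235
| population_as_of  = 2010
| elevation_m       = 16
}}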

Data Filtering

Wikipedia has 11 namespaces. If you just want the articles, you can filter against namespace key=’0’. This can greatly reduce the number of pages you need to process.
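
As a rough sketch of what such a filter can look like (my own example, not from the original pipeline; it assumes the pretty-printed dump keeps each tag on its own line and that each page carries a <ns> element, as current dumps do):

// filter_ns0.cpp: keep only <page> blocks in namespace 0 (articles)
// usage (hypothetical): bzcat enwiki-latest-pages-articles.xml.bz2 | ./filter_ns0 > articles.xml
#include <iostream>
#include <string>
#include <vector>

int main() {
    std::string line;
    std::vector<std::string> page;        // buffered lines of the current <page> block
    bool in_page = false, is_article = false;

    while (std::getline(std::cin, line)) {
        if (line.find("<page>") != std::string::npos) {
            in_page = true;
            is_article = false;
            page.clear();
        }
        if (!in_page) continue;           // skip the siteinfo header and anything between pages
        page.push_back(line);
        if (line.find("<ns>0</ns>") != std::string::npos)
            is_article = true;            // namespace 0 is the main/article namespace
        if (line.find("</page>") != std::string::npos) {
            if (is_article)
                for (const std::string &l : page) std::cout << l << '\n';
            in_page = false;
        }
    }
    return 0;
}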

Wikipedia also releases just the abstract text in a separate data dump. This is considerably easier to work with if you just need the first few paragraphs of each article.

Data Extraction

The raw Wikipedia data dump comes as XML with the main body text in wikitext. It would have been great if Wikipedia had continued to release the article dataset in HTML format, but this no longer happens.

Wikitext syntax is very hard to parse, and no software has been able to match the PHP MediaWiki output 100%. Don’t be too concerned, though, because in most cases you won’t need to be 100% perfect.

You will also need to deal with the surrounding metadata stored in the MySQL dumps. Take a look at this guide to the MediaWiki architecture; it’ll help you decide which data you need.

Parsers

If you are using a JVM language like Java or Scala, I highly recommend Sweble. Beyond doing a good job of parsing wikitext, it is a well-designed package that is easy to customize and build upon.

Wikimedia is working on a new parser called Parsoid, written in node.js and C++. It is planned to be nearly compatible with the PHP wikitext parser. It was not as complete when I started mining wikitext, so I don’t have experience with it.

What is so problematic about the wikitext format? There are many edge cases, and only the original PHP parser has been able to render wikitext to HTML correctly. The other big problem is template expansion, which we’ll cover in the next section.

While you can spider the Wikipedia website itself for text, Wikimedia recommends you use their data dumps to avoid overloading their servers unnecessarily. I would even go as far as cloning Wikipedia and running a mirror to get the evaluated HTML. The biggest problem I’ve encountered with cloning Wikipedia is that each article can take a long time to render; Wikipedia’s production site is heavily cached, so reading a page from wikipedia.org will be even faster than rendering a local copy.

Templates/Macros

Beyond the syntactical parsing, the biggest challenge will be handling template expansion. Wikitext has macros called templates, which are essentially a language in themselves. Templates in wikitext are embedded in {{template_code}}. These can be simple replacement templates or more complex ones with conditionals, loops, and references to other remote data sources.
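
For a sense of the range, compare a simple template call with the parser-function logic templates use internally (illustrative examples, with roughly what they expand to):

{{convert|100|km|mi}}
    → 100 kilometres (62 mi)

{{#if: {{{population|}}} | Population: {{{population}}} | Population unknown }}
    → expands to one branch or the other depending on whether a population parameter was passed in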

Sweble had a template expansion system, but I found it didn’t work on some edge cases. I resorted to modifying Sweble to call out to the Wikipedia API to expand certain templates.

Database

Beyond just the article text, you may want information such as image metadata and link stats. You’ll need to import the MySQL tables to read this information. Here’s a diagram of the MediaWiki database schema.

I highly recommend turning off indexes while importing the tables and turning them back on once everything is imported.

Large Scale Processing

You need to determine how much of the wiki dataset you are processing. If you can get away with iterating through the XML dump on a single machine, I highly recommend that approach.

For my purposes, I had to run through a few iterations over the XML dump. For some of the iterations I was able to get away with running on one machine, but other times I had to parallelize across multiple machines.

I used Hadoop to perform the parallel processing. Hadoop has a built-in XML splitter, and Mahout also comes with a Wikipedia iterator, but I found both of these unintuitive and incorrect in some cases. I resorted to a system where we collapsed each wiki article’s XML entry onto a single line, as sketched below. Hadoop makes it very easy to parallel-process one-entry-per-line datasets.
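
Here is a rough sketch of that collapsing step (same line-oriented assumptions as the namespace filter above; escaping the line breaks as a literal \n is just one possible scheme, and the mapper would need to undo it):

// collapse_pages.cpp: emit each <page>...</page> block as a single line
// so Hadoop's default line-oriented input format splits the dump per article
#include <iostream>
#include <string>

int main() {
    std::string line, page;
    bool in_page = false;

    while (std::getline(std::cin, line)) {
        if (line.find("<page>") != std::string::npos) {
            in_page = true;
            page.clear();
        }
        if (!in_page) continue;
        page += line;
        page += "\\n";                    // escape the original line break so it can be restored later
        if (line.find("</page>") != std::string::npos) {
            std::cout << page << '\n';    // one article per physical line
            in_page = false;
        }
    }
    return 0;
}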

Image Processing

Images in Wikipedia articles come from two sources: Wikipedia itself and Wikimedia Commons (commonswiki).

If you are processing images from Wikipedia or Wikimedia Commons, be aware that some images are really large, and this can kill your image-processing application.

Look at this image’s metadata: a 26280×19877, 98 MB image…


{"height": 19877, "width": 26280,
"source": "http://en.wikipedia.org/wiki/File:El_sue?o_de_Jacob,_by_Jos?_de_Ribera,_from_Prado_in_Google_Earth.jpg",
"url": "http://upload.wikimedia.org/wikipedia/commons/8/85/El_sue%C3%B1o_de_Jacob%2C_by_Jos%C3%A9_de_Ribera%2C_from_Prado_in_Google_Earth.jpg" }

The metadata for each image can be found in the MySQL database dump table named image. Unfortunately, the description field is truncated, and you’ll need to join it with the page and revision tables to get the whole description. The description is also in wikitext format, so you’ll need to run it through a wikitext parser.

Other Resources

Below are some resources I tried briefly.

  • Bliki engine: a Java Wikipedia parser.
  • dbpedia’s extraction_framework: can be used to extract infoboxes.
  • GWT Wiki Dump support.
  • Mahout’s Wikipedia Dump Splitter.

Happy Wiki Mining!

Hope this article helps you get started with Wikipedia mining!

Deploying a Rails app on Nginx/Puma with Capistrano

Puma is a fast, multi-threaded Ruby app server designed to host Rack-based Ruby web apps, including Sinatra and Ruby on Rails. Like Unicorn, it supports rolling restarts, but since it is multi-threaded rather than multi-process like Unicorn, it uses far less memory while offering comparable performance. Puma can run on Ruby 1.9.x, but its multi-threaded nature is better suited to a truly multi-threaded runtime like Rubinius or JRuby.

This article will guide you through setting up a hello world Rails app with Puma/Nginx and deploying it with Capistrano onto a Linux system. This guide was tested on Puma 1.6.3 and Puma 2.0.1.

Create a base Rails app

rails new appname

Adding Puma to Rails app

We’ll start by adding Puma to your Rails app.

In your Gemfile, add:

gem "puma"

then run bundle install

Now we need a puma config file: config/puma.rb

rails_env = ENV['RAILS_ENV'] || 'development'

# serve requests with a fixed pool of 4 threads (min 4, max 4)
threads 4,4

# bind to the unix socket that the nginx upstream block below proxies to
bind  "unix:///data/apps/appname/shared/tmp/puma/appname-puma.sock"
pidfile "/data/apps/appname/current/tmp/puma/pid"
state_path "/data/apps/appname/current/tmp/puma/state"

# start puma's control/status app so pumactl can talk to this server
activate_control_app

Setup Nginx with Puma

Follow the instructions to install Nginx from source. It will install nginx to /usr/local/nginx.

Edit your /usr/local/nginx/conf/nginx.conf file to match the config below:

user deploy;
worker_processes  1;

error_log  /var/log/nginx/error.log;
pid        /var/run/nginx.pid;

events {
    worker_connections  1024;
}

http {
    include       /usr/local/nginx/conf/mime.types;
    default_type  application/octet-stream;

    access_log  /var/log/nginx/access.log;

    sendfile        on;
    #tcp_nopush     on;

    #keepalive_timeout  0;
    keepalive_timeout  65;
    tcp_nodelay        on;

    gzip  on;

    server_names_hash_bucket_size 128;
    
    client_max_body_size 4M; 
    client_body_buffer_size 128k;
    
    include /usr/local/nginx/conf/conf.d/*.conf;
    include /usr/local/nginx/conf/sites-enabled/*;
}

Create a file named "puma_app" in the sites-enabled directory:

upstream appname {
  server unix:///data/apps/appname/shared/tmp/puma/appname-puma.sock;
}

server {
  listen 80;
  server_name www.appname.com appname.com;

  keepalive_timeout 5;

  root /data/apps/appname/public;

  access_log /data/log/nginx/nginx.access.log;
  error_log /data/log/nginx/nginx.error.log info;

  if (-f $document_root/maintenance.html) {
    rewrite  ^(.*)$  /maintenance.html last;
    break;
  }

  location ~ ^/(assets)/  {
    root /data/apps/appname/current/public;
    expires max;
    add_header  Cache-Control public;
  }

  location / {
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header Host $http_host;

    if (-f $request_filename) {
      break;
    }

    if (-f $request_filename/index.html) {
      rewrite (.*) $1/index.html break;
    }

    if (-f $request_filename.html) {
      rewrite (.*) $1.html break;
    }

    if (!-f $request_filename) {
      proxy_pass http://appname;
      break;
    }
  }

  # Now this supposedly should work as it gets the filenames 
  # with querystrings that Rails provides.
  # BUT there's a chance it could break the ajax calls.
  location ~* \.(ico|css|gif|jpe?g|png)(\?[0-9]+)?$ {
     expires max;
     break;
  }

  location ~ ^/javascripts/.*\.js(\?[0-9]+)?$ {
     expires max;
     break;
  }

  # Error pages
  # error_page 500 502 503 504 /500.html;
  location = /500.html {
    root /data/apps/appname/current/public;
  }
}

Using init scripts to start/stop/restart puma

We want to be able to start/restart puma using linux init scripts.

The init scripts for nginx should already have been installed as well. You can start nginx using sudo /etc/init.d/nginx start.

Install Jungle from Puma’s source repo. Jungle is a set of scripts for managing multiple apps running on Puma. You need the puma and run-puma files; place them at /etc/init.d/puma and /usr/local/bin/run-puma respectively.

Then, add your app config using: sudo /etc/init.d/puma add /data/apps/appname/current deploy

IMPORTANT: The init script assumes that your Puma state directories live in /path/to/app/tmp/puma

Using Capistrano to deploy

In your Gemfile, add:

gem "capistrano"

then run bundle install

In your deploy.rb, use the configuration below. Note the shared tmp dir modification.

#========================
#CONFIG
#========================
set :application, "APP_NAME"
set :scm, :git
set :repository, "GIT_URL"
set :branch, "master"
set :ssh_options, { :forward_agent => true }
set :stage, :production
set :user, "deploy"
set :use_sudo, false
set :runner, "deploy"
set :deploy_to, "/data/apps/#{application}"
set :app_server, :puma
set :domain, "DOMAIN_URL"
#========================
#ROLES
#========================
role :app, domain
role :web, domain
role :db, domain, :primary => true
#========================
#CUSTOM
#========================
namespace :puma do
  desc "Start Puma"
  task :start, :except => { :no_release => true } do
    run "sudo /etc/init.d/puma start #{application}"
  end
  after "deploy:start", "puma:start"

  desc "Stop Puma"
  task :stop, :except => { :no_release => true } do
    run "sudo /etc/init.d/puma stop #{application}"
  end
  after "deploy:stop", "puma:stop"

  desc "Restart Puma"
  task :restart, roles: :app do
    run "sudo /etc/init.d/puma restart #{application}"
  end
  after "deploy:restart", "puma:restart"

  desc "create a shared tmp dir for puma state files"
  task :after_symlink, roles: :app do
    run "sudo rm -rf #{release_path}/tmp"
    run "ln -s #{shared_path}/tmp #{release_path}/tmp"
  end
  after "deploy:create_symlink", "puma:after_symlink"
end

You’ll need to set up the directories one time using: cap deploy:setup

Now you can deploy your app using cap deploy and restart with cap deploy:restart.

EDIT: xijo has formalized the cap tasks as a gem, capistrano-puma, to make it easier to use across multiple projects.