Installing LAME on Amazon Elastic MapReduce (EMR)

Amazon Elastic MapReduce instances do not have the debian-multimedia sources by default. You can add the commands below to a bootstrap script to install LAME:


sudo sh -c "cat >> /etc/apt/sources.list << EOF
deb http://www.debian-multimedia.org squeeze main non-free
deb http://www.debian-multimedia.org testing main non-free
EOF"

gpg --keyserver hkp://pgpkeys.mit.edu --recv-keys 07DC563D1F41B907
gpg --armor --export 07DC563D1F41B907 | sudo apt-key add -

sudo apt-get update
sudo apt-get -y --force-yes install lame libmp3lame-dev faad

Debugging ActionView::MissingTemplate exception in Rails 3.1

We got an ActionView::MissingTemplate exception from a remote site using our embed code.

The exception was:

ActionView::MissingTemplate: Missing template /embed, application/embed with {:handlers=>[:erb, :builder, :haml], :formats=>["*/*;q=0.01"], :locale=>[:en, :en]}.

with these HTTP headers:

HTTP_ACCEPT "*/*;q=0.01"
HTTP_ACCEPT_LANGUAGE "en"
HTTP_USER_AGENT "Mozilla/4.0 (PSP (PlayStation Portable); 2.00)"

The strange thing is that the PSP is sending us this Accept header:
HTTP_ACCEPT "*/*;q=0.01"

HTTP_ACCEPT is the HTTP request header a client uses to tell the server which response formats it can handle. Typically, browsers send a list of acceptable formats. Google Chrome sends Accept:text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8 which means that the server should prefer html or xhtml (no q value is given, so the default of q=1.0 applies), then xml with a preference value of q=0.9 and, if none of those are available, anything else (*/*) with a preference value of q=0.8.
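To make the q-value rules concrete, here is a minimal sketch in plain Ruby that parses an Accept header into media types and preference values. It is not how Rails parses the header internally; parse_accept is a hypothetical helper written only for illustration, and it relies on the rule that a media type without an explicit q parameter defaults to q=1.0.

```ruby
# Parse an Accept header into [media_type, q] pairs, most preferred first.
# parse_accept is a hypothetical helper, not part of Rails or Rack.
def parse_accept(header)
  entries = header.split(",").each_with_index.map do |part, index|
    media_type, *params = part.strip.split(";").map(&:strip)
    q_param = params.find { |param| param.start_with?("q=") }
    # A media type without an explicit q parameter defaults to 1.0.
    q = q_param ? q_param.sub("q=", "").to_f : 1.0
    [media_type, q, index]
  end

  # Sort by descending q; the original position breaks ties.
  entries.sort_by { |_, q, index| [-q, index] }.map { |type, q, _| [type, q] }
end

chrome = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
psp    = "*/*;q=0.01"

p parse_accept(chrome)
p parse_accept(psp)
```

For the PSP header, the only entry is the wildcard with a very low preference, so there is no concrete format such as text/html for the server to match against.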

Unfortunately, our Rails controller code explicitly accepted only html or json through respond_to, and Rails didn’t interpret “*/*;q=0.01” as html.

respond_to :html, :json
render :layout => false

We fixed this by explicitly rendering the html template:

render "embed.html", :layout => false

Solrsan: Lightweight Solr Gem for Ruby on Rails 3 Applications

I decided to create Solrsan to use the Apache Solr search server in my various Rails 3 applications. Currently, there are two main ruby gems for using Apache Solr in a ruby project:

  • rsolr: RSolr is a low level layer to Apache Solr. Because it’s meant to be just an access layer, rsolr does not ship the configuration setup (schema.xml, solrconfig.xml, etc.), which has to be customized for each Ruby/Rails app.
  • sunspot: Sunspot is an all in one solution for using solr with a ruby project. It even uses rsolr under the hood.

Generally, I like API access layers to be as similar to the raw api as possible. Sunspot’s api works using a search block:

Post.search do
  fulltext 'best pizza'
  with :blog_id, 1
  with(:published_at).less_than Time.now
  order_by :published_at, :desc
  paginate :page => 2, :per_page => 15
  facet :category_ids, :author_id
end

The actual query becomes an HTTP GET request. Solr itself is just a Java servlet which reads HTTP requests and responds with json, xml or other formats. I prefer rsolr’s style of access because it’s the most similar to plain HTTP requests:

  response = solr.get 'select', :params => {
    :q=>'washington',
    :start=>0,
    :rows=>10
  }

Solrsan also uses rsolr under the hood and adds some extra functionality. For example, when you want to add solr functionality to your ruby/rails app, you need its own set of config files, a way to start/stop the solr server, and a way to deploy using capistrano. Solrsan comes with these basic setup files to help you get started.

Indexing

To index objects, edit config/solr/conf/schema.xml to state the types of fields you want to index. Or you can use dynamic fields to avoid specifying new fields each time.

Then, include Solrsan::Search in your model (ActiveRecord, Mongoid, etc.) and define a method called as_solr_document which returns a hash of the key-value pairs to index. See the README for more examples.
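As a rough sketch of what such a model could look like, the class below defines as_solr_document on a plain Ruby object. The Document class and the field names (id, title, content, tags) are assumptions for illustration and would have to match the fields declared in your schema.xml. The include line is commented out so the snippet runs without the gem installed.

```ruby
# Hypothetical model; in a real app this would be an ActiveRecord or
# Mongoid class with the solrsan gem loaded.
class Document
  # include Solrsan::Search   # enable inside an app that bundles solrsan

  attr_reader :id, :title, :body, :tags

  def initialize(id:, title:, body:, tags: [])
    @id = id
    @title = title
    @body = body
    @tags = tags
  end

  # Returns the key-value pairs to index. The keys are assumed field
  # names and must exist in schema.xml (or match a dynamic field).
  def as_solr_document
    {
      :id      => id,
      :title   => title,
      :content => body,
      :tags    => tags
    }
  end
end

doc = Document.new(id: 1, title: "Hello", body: "hello world", tags: ["greeting"])
p doc.as_solr_document
```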

You can add an after_save callback to call the index method as well. I did not automatically call the index method on every object save since some systems may need to index in a different way, such as through a queuing system.

Searching

Search is as easy as:
response = Document.search(:q => "hello world")
This returns a hash composed of docs and metadata. response[:docs] is a will_paginate-compatible collection and response[:metadata] contains supporting items such as error messages and facets.

Summary

So I decided not to use sunspot because I want transparent API access rather than a DSL implementation, and I needed something more than the basic rsolr gem.

If you are interested in using solrsan, the important links are the readme and unit tests. I’m already using solrsan on a few projects but it is still relatively new. Feel free to email me or send a pull request about any problems or bugs!

Debugging ActiveMQ Authentication Configuration

I recently had to add basic authentication to an ActiveMQ broker and ran into some unexpected issues. I followed the example in ActiveMQ in Action and added the simpleAuthenticationPlugin configuration to activemq.xml.

You can see the complete file at https://gist.github.com/881965

When I tried to start ActiveMQ with
ACTIVEMQ_HOME/bin/activemq start xbean:file:conf/activemq.xml
ActiveMQ wouldn’t start!

I checked the activemq.log and saw only one line:
2011-03-22 13:21:02,799 | INFO | Refreshing org.apache.activemq.xbean.XBeanBrokerFactory$1@23abcc03: startup date [Tue Mar 22 13:21:02 PDT 2011]; root of context hierarchy | org.apache.activemq.xbean.XBeanBrokerFactory$1 | main

Not very helpful…
If I commented out the simpleAuthenticationPlugin tags, ActiveMQ started up correctly.

Hmm, what’s wrong?

Diving into the ActiveMQ In Action book further, I noticed I can start up activemq via
ACTIVEMQ_HOME/bin/activemq console xbean:file:conf/activemq.xml

This gives us a much more verbose log output.
I now see the problem is due to:
Caused by: org.xml.sax.SAXParseException: cvc-complex-type.2.4.a: Invalid content was found starting with element 'plugins'.

Googling the error brought me to this page: http://activemq.apache.org/xml-reference.html

The problem?

In ActiveMQ 5.4 and later, the XML elements inside the broker tag have to be ordered alphabetically!

I moved the plugins tag before systemUsage tag and ActiveMQ was able to start up correctly with authentication.
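The ordering rule is easy to check by hand once you know about it. As a small sketch, the helper below takes the child element names of the broker tag in document order and reports the first one that breaks alphabetical order. The helper name and the element list are assumptions for illustration (the list is modeled on a typical activemq.xml, not copied from my file), and this only mirrors the rule described above rather than running the real schema validation.

```ruby
# Returns the first element name that sorts before its predecessor,
# or nil when the names are already in alphabetical order.
# first_out_of_order is a hypothetical helper, not part of ActiveMQ.
def first_out_of_order(element_names)
  element_names.each_cons(2) do |previous, current|
    return current if current.downcase < previous.downcase
  end
  nil
end

# Assumed child elements of <broker>, in the order they appear in the file.
broken = %w[destinationPolicy managementContext persistenceAdapter
            systemUsage plugins transportConnectors]
fixed  = %w[destinationPolicy managementContext persistenceAdapter
            plugins systemUsage transportConnectors]

p first_out_of_order(broken)
p first_out_of_order(fixed)
```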

Database Migrations

I’m working on a project where I needed to create and manage database tables. I find the active_record migration system from Ruby on Rails to be the best system for creating and versioning database changes. The project itself is in Scala, so I took a look at scala migrations and c5-db-migration, but I found active_record migrations to be better documented, better supported and more concise in syntax.

I recently created a template rails 3 project where I kept only the files necessary for generating migrations. Check it out at https://github.com/tc/database_migrations

To get started, set up your database connection information in config/database.yml.

Create a database:

rake db:create

Create a new migration:

rails g migration create_users

This will create a file in db/migrate.
Add columns:

class CreateUsers < ActiveRecord::Migration
  def self.up
    create_table :users do |t|
      t.string :name
    end
  end

  def self.down
    drop_table :users
  end
end

Perform the migration:

rake db:migrate

Now you have a working database which can be managed by this application.

You can roll back to a previous version:

rake db:migrate VERSION=XXXX

The versioning of the database is managed in the database’s schema_migrations table.
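To illustrate how the versions recorded in schema_migrations drive a rollback, here is a minimal sketch in plain Ruby. It does not use ActiveRecord; versions_to_revert is a hypothetical helper and the timestamps are made-up examples. The idea it models is that migrating to an older VERSION runs the down step of every applied migration newer than the target, newest first.

```ruby
# Given the versions recorded as applied and a target VERSION, list the
# migrations whose down step would run, newest first.
# versions_to_revert is a hypothetical helper, not an ActiveRecord API.
def versions_to_revert(applied_versions, target_version)
  target = target_version.to_i
  applied_versions.map(&:to_i).select { |version| version > target }.sort.reverse
end

# Made-up timestamps standing in for rows of schema_migrations.
applied = %w[20110301120000 20110315093000 20110322132100]

p versions_to_revert(applied, "20110301120000")
```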

How to develop with Factorie, a probabilistic modeling toolkit written in Scala

Factorie is a toolkit for probabilistic modeling. It is scalable and flexible, and lets you create factor graphs and perform inference. It is written by Andrew McCallum and his research group at UMass, who previously wrote MALLET, the Java package for text mining. I found that being written in Scala makes the code very succinct and clear. You can learn more from its Google project page. Prompted by setup questions on the mailing list, I decided to write a quick guide to using Factorie on Mac OS X.

Start with cloning the source code:
hg clone https://factorie.googlecode.com/hg/ factorie

Factorie uses Maven to manage its build and dependencies. Maven is an open source tool from Apache, so download it and learn a little about how it works.
Compile the code into a jar:
cd factorie
mvn install

This will create a factorie jar file in target/factorie*.jar and install it into your local maven repo. (most likely ~/.m2)

Now that you have factorie jar in your maven repo, clone a sample factorie project:
git clone git@github.com:tc/factorie-example.git

In the factorie-example directory, you’ll notice a pom.xml. Inside the file, you’ll see:

<dependency>
<groupId>cc.factorie</groupId>
<artifactId>factorie</artifactId>
<version>0.9.1-SNAPSHOT</version>
</dependency>

You may have to change the version to match the updated factorie jar version.

This sample project has two files:
src/main/scala/factorie/LDAExample.scala
src/main/scala/factorie/LDAExampleTest.scala

It’s good practice to have a unit test for your code. In this case, the unit test is trivial as it just runs the scala class, but ideally you have some type of assert you want to perform.

Compile and run it using:
mvn test

You should see an output of listed topics:
.....Iteration 20
alpha = 0.7897046248435047 1.083615490115922 1.3493645028398928 0.8200016581775706 1.1115652808961307 1.7646165019929618 2.0397615066201604 1.8106654639049065 1.75947360834168 1.2876416992683335
Topic 0 china science achieved contacts powerful power urbana find faster projects
Topic 1 computing performance list chinese gropp problems smith years today tennessee
Topic 2 service stephen full messages members data twitter contacts services plenty
Topic 3 world erica announcement computer floating components center states couldn fastest
Topic 4 jaguar point ogg year high benchmark speed supercomputers community time
Topic 5 itunes features print mac kessler suggests back challenge tomorrow tuesday
Topic 6 mail facebook google shankland people address big reach ability aol
Topic 7 system news university tianhe top topher national called supercomputing font
Topic 8 apple share supercomputer systems operations machine company based released place
Topic 9 gmail cnet software digg supercomputing social comments technology expected linpack

Be sure to look into src/main/scala/cc/factorie/example directory in the factorie source code for more examples.

If you use an IDE like Eclipse or IntelliJ IDEA, type:
mvn eclipse:eclipse
or
mvn idea:idea
to generate the IDE project files.

Alternatively, you can just edit using Vim or Emacs and compile using Maven.

RMongo: Accessing MongoDB in R

I recently created RMongo, a database access layer to MongoDB in R as an R package.

To install RMongo:

install.packages("RMongo")

If that does not work, try downloading it from https://github.com/tc/RMongo/downloads and run:

install.packages("~/Downloads/RMongo_0.0.21.tar.gz", repos=NULL, type="source")

I tried to mimic the RMySQL commands in RMongo. Below are some example commands.

library(RMongo)

#ask for help
?RMongo

#connect to a database
mongo <- ...
> results <- ...
> names(results)
[1] "X_id" "name" "nutrient_definition_id" "description"
> results
                      X_id             name nutrient_definition_id
1 4cd0f8e31e627d4e6600000e Adjusted Protein                    257
2 4cd0f9061e627d4e6600001a           Sodium                    307

> results <- ...
> results
                      X_id   name nutrient_definition_id
1 4cd0f9061e627d4e6600001a Sodium                    307

> dbDisconnect(mongo)

RMongo is very alpha at this point. I built it as a quick way to prototype algorithms with data from MongoDB in R. Most of RMongo uses the mongo-java-driver to perform json-formatted queries. The R code in the package uses rJava to communicate with the mongo-java-driver.

Please report any bugs or necessary improvements. Or better yet, send in pull requests via the RMongo github project page!

How to run Background Processes using Resque/Redis in a Ruby on Rails App

When you have a long running block of code, you don’t want to run it inside a web application request cycle. A background processing queue is a good solution. There are a number of open source queuing systems available (delayed_job, beanstalk, etc.) so you don’t need to write your own! This article goes over how to set up the resque queuing system in a Ruby on Rails application.

Resque setup:

Install redis. On a Mac:
brew install redis
or download it from:

http://code.google.com/p/redis/

Add resque to your Gemfile:
gem "resque"

Install the new gem:
bundle install

Create a redis config file called redis.yml in config:
defaults: &defaults
  host: localhost
  port: 6379

development:
  <<: *defaults

test:
  <<: *defaults

staging:
  <<: *defaults

production:
  <<: *defaults

Add an initializer file called resque.rb in config/initializers:
Dir[File.join(Rails.root, 'app', 'jobs', '*.rb')].each { |file| require file }

config = YAML::load(File.open("#{Rails.root}/config/redis.yml"))[Rails.env]
Resque.redis = Redis.new(:host => config['host'], :port => config['port'])
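The redis.yml file relies on a YAML anchor (&defaults) and merge keys (<<) so that every environment inherits the same host and port. The sketch below shows how those settings resolve per environment, using an inline copy of the config so it runs outside Rails. It assumes Ruby 2.6 or newer, where YAML.safe_load accepts the aliases option; the production host override is a made-up value added only to show that explicit keys win over merged defaults.

```ruby
require "yaml"

# Inline stand-in for config/redis.yml. The production host is a
# hypothetical override, not part of the original config.
REDIS_CONFIG = <<~CONFIG
  defaults: &defaults
    host: localhost
    port: 6379

  development:
    <<: *defaults

  production:
    <<: *defaults
    host: redis.example.internal
CONFIG

# Returns the settings hash for one environment.
# aliases: true lets the parser resolve *defaults references.
def redis_settings(yaml_text, env)
  YAML.safe_load(yaml_text, aliases: true).fetch(env)
end

p redis_settings(REDIS_CONFIG, "development")
p redis_settings(REDIS_CONFIG, "production")
```

Using fetch instead of [] makes a missing environment fail loudly with a KeyError rather than passing nil settings on to Redis.new.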

Add resque.rake to lib/tasks:
require 'resque/tasks'
task "resque:setup" => :environment

Running Resque:

start redis:
redis-server

start resque
COUNT=5 QUEUE=* rake resque:workers

see web UI:
resque-web

How to add resque jobs:

Create a job class:
class NewsCollectionJob
  @queue = :news_collection_job

  def self.perform(start_date, end_date)
    puts "from #{start_date} to #{end_date}"
    #TODO your long running process here
  end
end

Enqueue it using:
Resque.enqueue(NewsCollectionJob, start_date, end_date)

This command will not block, so you can call it from a model. There you go! A few simple steps to a more responsive Ruby application using background processing on resque/redis.
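One detail worth keeping in mind: Resque stores job arguments as JSON in Redis, so values such as dates reach perform as plain strings. The sketch below simulates that round trip without Redis or the resque gem. simulate_enqueue is a hypothetical stand-in for what a worker does, not Resque's actual API, and this variant of perform returns the number of days in the range instead of printing.

```ruby
require "json"
require "date"

class NewsCollectionJob
  @queue = :news_collection_job

  # Arguments arrive as strings after the JSON round trip, so parse them.
  def self.perform(start_date, end_date)
    from = Date.parse(start_date)
    to = Date.parse(end_date)
    (to - from).to_i # number of days covered by the job
  end
end

# Hypothetical helper: encode the job as JSON, decode it again and call
# perform, roughly what happens between enqueueing and a worker run.
def simulate_enqueue(job_class, *args)
  payload = JSON.generate("class" => job_class.name, "args" => args)
  decoded = JSON.parse(payload)
  Object.const_get(decoded["class"]).perform(*decoded["args"])
end

days = simulate_enqueue(NewsCollectionJob,
                        Date.new(2011, 3, 1).iso8601,
                        Date.new(2011, 3, 22).iso8601)
puts days
```

Passing ISO 8601 strings explicitly keeps the enqueueing side and the worker side in agreement about the format.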

Performing MongoDB JSON Queries using the mongo-java-driver

I’m writing a query front-end for MongoDB which will eventually be used from within R. If you need to make JSON queries to MongoDB in a JVM language, here’s some example code which you may find useful. It’s in Scala but can easily be adapted to any JVM language.

Cascading Example Project using Scala and Maven

I recently pushed an example project for using Cascading with Scala and Maven to http://github.com/tc/cascading-scala-maven-example.

I created it because the main Cascading examples are built using Ant and in Java. I find I am more productive with an integrated dependency management system like Maven and a higher level language like Scala.

With this example project, you can create a JAR file that you can use with any Hadoop cluster by just using mvn package. You can also run mvn assembly:assembly to include all dependencies.

Check it out and hope it helps you get started in big data processing with Hadoop.
