Installing LAME on Amazon Elastic MapReduce (EMR)

Amazon Elastic MapReduce instances do not have the debian-multimedia sources by default. You can add the commands below to a bootstrap script to install LAME:


sudo sh -c "cat >> /etc/apt/sources.list << EOF
deb http://www.debian-multimedia.org squeeze main non-free
deb http://www.debian-multimedia.org testing main non-free
EOF"

gpg --keyserver hkp://pgpkeys.mit.edu --recv-keys 07DC563D1F41B907
gpg --armor --export 07DC563D1F41B907 | sudo apt-key add -

sudo apt-get update
sudo apt-get -y --force-yes install lame libmp3lame-dev faad

Debugging ActiveMQ Authentication Configuration

I recently had to add basic authentication to an ActiveMQ broker and ran into some unexpected issues. I followed the example in ActiveMQ in Action to use the simpleAuthenticationPlugin by adding the below snippet to activemq.xml:

You can see the complete file at https://gist.github.com/881965

When I tried to start ActiveMQ with
ACTIVEMQ_HOME/bin/activemq start xbean:file:conf/activemq.xml
ActiveMQ wouldn’t start!

I checked activemq.log and saw only one line:
2011-03-22 13:21:02,799 | INFO | Refreshing org.apache.activemq.xbean.XBeanBrokerFactory$1@23abcc03: startup date [Tue Mar 22 13:21:02 PDT 2011]; root of context hierarchy | org.apache.activemq.xbean.XBeanBrokerFactory$1 | main

Not very helpful…
If I comment out the simpleAuthenticationPlugin tags, ActiveMQ starts up correctly.

Hmm, what’s wrong?

Diving further into the ActiveMQ in Action book, I noticed I could start ActiveMQ via
ACTIVEMQ_HOME/bin/activemq console xbean:file:conf/activemq.xml

This gives much more verbose log output.
I could now see that the problem was due to:
Caused by: org.xml.sax.SAXParseException: cvc-complex-type.2.4.a: Invalid content was found starting with element 'plugins'.

Googling the error brought me to this page: http://activemq.apache.org/xml-reference.html

The problem?

** In ActiveMQ 5.4 and later, the XML elements inside the broker tag have to be ordered alphabetically!

I moved the plugins tag before the systemUsage tag and ActiveMQ was able to start up correctly with authentication.
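
The ordering rule can be checked before starting the broker. Here is a minimal Ruby sketch, assuming that a plain case-insensitive alphabetical comparison matches the order the schema expects. The method name misordered_pairs and the sample element list are hypothetical and are not taken from the gist.

```ruby
# Returns every adjacent pair of element names that breaks alphabetical order.
# Assumption: case-insensitive comparison is what the schema ordering amounts to.
def misordered_pairs(names)
  names.each_cons(2).select { |prev, curr| prev.casecmp(curr) > 0 }
end

# Hypothetical child elements of <broker>, in the order they appear in the file.
before_fix = %w[destinationPolicy managementContext persistenceAdapter
                systemUsage plugins transportConnectors]
after_fix  = %w[destinationPolicy managementContext persistenceAdapter
                plugins systemUsage transportConnectors]

p misordered_pairs(before_fix) # => [["systemUsage", "plugins"]]
p misordered_pairs(after_fix)  # => []
```

An empty result means the elements are already in an order the parser should accept under that assumption.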

Database Migrations

I’m working on a project where I needed to create and manage database tables. I find the active_record migration system from Ruby on Rails to be the best system for creating and versioning database changes. The project itself is in Scala, so I took a look at scala migrations and c5-db-migration, but I found active_record migrations to be better documented and supported, with a more concise syntax.

I recently created a template Rails 3 project where I kept only the files necessary for generating migrations. Check it out at https://github.com/tc/database_migrations

To get started, set up your database connection information in config/database.yml.

Create a database:

rake db:create

Create a new migration:

rails g migration create_users

This will create a file in db/migrate.
Add columns:

class CreateUsers < ActiveRecord::Migration
  def self.up
    create_table :users do |t|
      t.string :name
    end
  end

  def self.down
    drop_table :users
  end
end

Perform the migration:

rake db:migrate

Now you have a working database which can be managed by this application.

You can roll back to a previous version:

rake db:migrate VERSION=XXXX

The versioning of the database is managed in the database’s schema_migrations table.
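
To make the role of schema_migrations concrete, here is a small plain-Ruby sketch of the idea: each migration file name starts with a version number, and a migration is pending when its version is not yet recorded in the table. This is a simplified illustration, not the actual ActiveRecord implementation. The file names, the applied-versions array, and both method names are hypothetical.

```ruby
# Extracts the leading version number from a migration file name.
def migration_version(filename)
  File.basename(filename)[/\A(\d+)_/, 1]
end

# Returns the migration files whose version is not in the applied list,
# oldest first. `applied_versions` stands in for the rows of schema_migrations.
def pending_migrations(files, applied_versions)
  files.reject { |f| applied_versions.include?(migration_version(f)) }
       .sort_by { |f| migration_version(f).to_i }
end

files = [
  "db/migrate/20110305093000_add_email_to_users.rb",
  "db/migrate/20110301120000_create_users.rb"
]
applied = ["20110301120000"]

puts pending_migrations(files, applied)
# prints db/migrate/20110305093000_add_email_to_users.rb
```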

How to develop with Factorie, a probabilistic modeling toolkit written in Scala

Factorie is a toolkit for probabilistic modeling. It is scalable and flexible, and it lets you create factor graphs and perform inference. It is written by Andrew McCallum and his research group at UMass, who previously wrote MALLET, the Java package for text mining. I found that being written in Scala made the code very succinct and clear. You can learn more from its Google project page. Prompted by setup questions on the mailing list, I decided to write a quick guide to using Factorie on Mac OS X.

Start with cloning the source code:
hg clone https://factorie.googlecode.com/hg/ factorie

Factorie uses Maven to manage its build and dependencies. Maven is an open source package from Apache, so download it and learn a little about how it works.
Compile the code into a jar:
cd factorie
mvn install

This will create a factorie jar file in target/factorie*.jar and install it into your local maven repo. (most likely ~/.m2)

Now that you have factorie jar in your maven repo, clone a sample factorie project:
git clone git@github.com:tc/factorie-example.git

In the factorie-example directory, you’ll notice a pom.xml. Inside the file, you’ll see:

<dependency>
  <groupId>cc.factorie</groupId>
  <artifactId>factorie</artifactId>
  <version>0.9.1-SNAPSHOT</version>
</dependency>

You may have to change the version to match the updated factorie jar version.

This sample project has two files:
src/main/scala/factorie/LDAExample.scala
src/main/scala/factorie/LDAExampleTest.scala

It’s good practice to have a unit test for your code. In this case, the unit test is trivial as it just runs the scala class, but ideally you have some type of assert you want to perform.

Compile and run it using:
mvn test

You should see an output of listed topics:
.....Iteration 20
alpha = 0.7897046248435047 1.083615490115922 1.3493645028398928 0.8200016581775706 1.1115652808961307 1.7646165019929618 2.0397615066201604 1.8106654639049065 1.75947360834168 1.2876416992683335
Topic 0 china science achieved contacts powerful power urbana find faster projects
Topic 1 computing performance list chinese gropp problems smith years today tennessee
Topic 2 service stephen full messages members data twitter contacts services plenty
Topic 3 world erica announcement computer floating components center states couldn fastest
Topic 4 jaguar point ogg year high benchmark speed supercomputers community time
Topic 5 itunes features print mac kessler suggests back challenge tomorrow tuesday
Topic 6 mail facebook google shankland people address big reach ability aol
Topic 7 system news university tianhe top topher national called supercomputing font
Topic 8 apple share supercomputer systems operations machine company based released place
Topic 9 gmail cnet software digg supercomputing social comments technology expected linpack

Be sure to look into the src/main/scala/cc/factorie/example directory in the factorie source code for more examples.

If you use an IDE like Eclipse or IntelliJ IDEA, type:
mvn eclipse:eclipse
or
mvn idea:idea
to generate the IDE project files.

Alternatively, you can just edit using Vim or Emacs and compile using Maven.

RMongo: Accessing MongoDB in R

I recently created RMongo, a database access layer to MongoDB in R as an R package.

To install RMongo:

install.packages("RMongo")

If that does not work, try downloading it from https://github.com/tc/RMongo/downloads and running:

install.packages("~/Downloads/RMongo_0.0.21.tar.gz", repos=NULL, type="source")

I tried to mimic the RMySQL commands in RMongo. Below are some example commands.

library(RMongo)

#ask for help
?RMongo

#connect to a database
mongo results names(results)
[1] "X_id" "name" "nutrient_definition_id" "description"
> results
X_id name nutrient_definition_id
1 4cd0f8e31e627d4e6600000e Adjusted Protein 257
2 4cd0f9061e627d4e6600001a Sodium 307

> results results
X_id name nutrient_definition_id
1 4cd0f9061e627d4e6600001a Sodium 307

> dbDisconnect(mongo)

 

RMongo is very alpha at this point. I built it as a quick way to prototype algorithms in R with data from MongoDB. Most of RMongo uses the mongo-java-driver to perform JSON-formatted queries. The R code in the package uses rJava to communicate with the mongo-java-driver.

Please report any bugs or necessary improvements. Or better yet, send in pull requests via the RMongo github project page!

How to run Background Processes using Resque/Redis in a Ruby on Rails App

When you have a long-running block of code, you don’t want to run it inside a web application request cycle. A background processing queuing system is a good solution. There are a number of open source queuing systems available (delayed_job, beanstalk, etc.), so you don’t need to write your own! This article goes over how to set up the resque queuing system in a Ruby on Rails application.

Resque setup:

Install redis. On a Mac:
brew install redis
Otherwise, get it from:
http://code.google.com/p/redis/

Add resque to your gemfile:
gem "resque"

Install the new gem:
bundle install

Create a redis config file called redis.yml in config:
defaults: &defaults
  host: localhost
  port: 6379

development:
  <<: *defaults

test:
  <<: *defaults

staging:
  <<: *defaults

production:
  <<: *defaults

Add an initializer file called resque.rb in config/initializers:
Dir[File.join(Rails.root, 'app', 'jobs', '*.rb')].each { |file| require file }

config = YAML::load(File.open("#{Rails.root}/config/redis.yml"))[Rails.env]
Resque.redis = Redis.new(:host => config['host'], :port => config['port'])

Add resque.rake to lib/tasks:
require 'resque/tasks'
task "resque:setup" => :environment
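
The redis.yml file relies on a YAML anchor (&defaults) and merge keys (<<: *defaults) so that every environment inherits the same host and port. Here is a minimal standalone sketch of how that resolves, using an inline copy of part of the file instead of reading config/redis.yml. The helper names are hypothetical. One assumption worth flagging: newer Ruby versions ship a YAML library whose plain load rejects aliases by default, so the sketch uses unsafe_load when it is available.

```ruby
require 'yaml'

# Inline copy of part of the config, so the example runs outside Rails.
REDIS_YML = <<~YAML_TEXT
  defaults: &defaults
    host: localhost
    port: 6379

  development:
    <<: *defaults

  production:
    <<: *defaults
YAML_TEXT

# Loads YAML that contains aliases on both older and newer Ruby versions.
def load_yaml_with_aliases(text)
  if YAML.respond_to?(:unsafe_load)
    YAML.unsafe_load(text)
  else
    YAML.load(text)
  end
end

# Returns the settings hash for one environment name.
def redis_settings(env)
  load_yaml_with_aliases(REDIS_YML).fetch(env)
end

settings = redis_settings('production')
puts "#{settings['host']}:#{settings['port']}" # prints localhost:6379
```

Using fetch instead of [] raises an error for a misspelled environment name, rather than passing nil settings on to Redis.new.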

Running Resque:

start redis:
redis-server

start resque
COUNT=5 QUEUE=* rake resque:workers

see web UI:
resque-web

How to add resque jobs:

Create a job class
class NewsCollectionJob
  @queue = :news_collection_job

  def self.perform(start_date, end_date)
    puts "from #{start_date} to #{end_date}"
    #TODO your long running process here
  end
end

Run it using:
Resque.enqueue(NewsCollectionJob, start_date, end_date)

This command will not block, so you can call it from a model. There you go! A few simple steps to a more responsive Ruby application using background processing on resque/redis.
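
To show why enqueue returns immediately, here is a simplified in-memory simulation of the flow. It is not Resque's actual code. Resque stores each job as a JSON payload holding the class name and arguments, and a separate worker process later pops it and calls perform. In this sketch a plain Array stands in for the Redis list, and fake_enqueue and fake_work_one are hypothetical names. The job returns a string instead of printing so the result is easy to inspect.

```ruby
require 'json'

class NewsCollectionJob
  @queue = :news_collection_job

  def self.perform(start_date, end_date)
    "from #{start_date} to #{end_date}"
  end
end

# In-memory stand-in for the Redis list that would hold pending jobs.
PENDING_JOBS = []

# Serializes the job to JSON and returns right away; no work happens here.
def fake_enqueue(klass, *args)
  PENDING_JOBS.push(JSON.generate('class' => klass.name, 'args' => args))
end

# What a worker does later: pop one payload, look up the class, call perform.
def fake_work_one
  payload = JSON.parse(PENDING_JOBS.shift)
  Object.const_get(payload['class']).perform(*payload['args'])
end

fake_enqueue(NewsCollectionJob, '2011-03-01', '2011-03-22')
puts fake_work_one # prints from 2011-03-01 to 2011-03-22
```

Because the arguments make a round trip through JSON, it is safest to pass simple values such as strings, numbers, or record ids, not full objects.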

Performing MongoDB JSON Queries using the mongo-java-driver

I’m writing a query front end for MongoDB which will eventually be used from within R. If you need to make JSON queries to MongoDB in a JVM language, here’s some example code which you may find useful. It’s in Scala but can easily be adapted to any JVM language.

Cascading Example Project using Scala and Maven

I recently pushed an example project for using Cascading with Scala and Maven to http://github.com/tc/cascading-scala-maven-example.

I created it because the main Cascading examples are built using Ant and in Java. I find I am more productive with an integrated dependency management system like Maven and a higher level language like Scala.

With this example project, you can create a JAR file that you can use with any hadoop cluster by just using mvn package. You can also do mvn assembly:assembly to include all dependencies.

Check it out. I hope it helps you get started in big data processing with Hadoop.

Setting up a Hudson Continuous Integration Testing Server for Ruby (and Rails)

If you work on a project with multiple developers, a continuous integration testing setup is a must. A CI package will run automated tests on a server either on a set schedule (e.g. daily) or on events (whenever commits are made).
There are a number of continuous integration packages for Ruby software, including CI Joe and CruiseControl.rb. I chose Hudson because it is extendable via plugins and has great support for running automated testing for non-Ruby software as well.

Install Hudson and Jetty:

Hudson itself is a self-contained Java servlet. You can either run it in a servlet container like Jetty by putting hudson.war into the webapps directory, or just run it using java -jar hudson.war

On your testing server:

Set a working dir:

export HUDSON_HOME=/data/apps/test/hudson

Install ci_reporter:

sudo gem install ci_reporter

Install plugins:

cd /data/apps/test/hudson/plugins
wget http://hudson-ci.org/latest/ruby.hpi
wget http://hudson-ci.org/latest/git.hpi
wget http://hudson-ci.org/latest/rake.hpi
wget http://hudson-ci.org/latest/rubyMetrics.hpi

Setting up a Project in Hudson

We can follow these tasks to create a CI job for a Ruby on Rails 3.0 project.
Hudson will be accessible from http://localhost:8080 or http://localhost:8080/hudson if you used a servlet container.

Click ‘New Job’
Click ‘Build a free-style software project’
This lets you use a custom ruby app.

On the project configuration page:
Set the git repo and the option to poll SCM.
Add an Execute Shell option with:

bundle install

This will install all the necessary gems from the Gemfile.

Finally, add the rake task for the actual testing itself.
Invoke Rake

ci:setup:testunit
test
CI_REPORTS=results
RAILS_ENV=staging

You can test the setup by clicking the “Build Now” link. From then on, the “Poll SCM” option will check your git repo on a set interval and run your tests whenever new code has been pushed.

Command line script to Pretty Print a JSON URL

Don’t you hate it when you curl an api for testing and get something ugly like:

curl http://search.twitter.com/search.json?q=ruby
q=ruby","next_page":"?page=2&max_id=20407667370&q=ruby","results_per_page":15,"page":1,"completed_in":0.018639,"query":"ruby"}

Here’s a quick one-line bash script to pretty print a JSON URL using curl and ruby:

#!/bin/bash
# "$@" passes the arguments through intact; STDIN.read handles multi-line responses
curl "$@" | ruby -e "require 'rubygems'; require 'json'; jj JSON.parse(STDIN.read)"

Save it as ppcurl, set the permissions (chmod a+x ppcurl) and run it:
ppcurl http://search.twitter.com/search.json?q=ruby

    {
      "created_at": "Thu, 05 Aug 2010 18:25:41 +0000",
      "profile_image_url": "http://a3.twimg.com/profile_images/326153315/twitterProfilePhoto_normal.jpg",
      "from_user": "der_kronn",
      "text": "interesting gem: "Zucker" http://bit.ly/9kiJ3I - cool to see, how flexible ruby is /cc @rbJL",
      "to_user_id": null,
      "metadata": {
        "result_type": "recent"
      },
      "id": 20407489891,
      "geo": null,
      "from_user_id": 20887268,
      "iso_language_code": "en",
      "source": "<a href="http://termtter.org/" rel="nofollow">Termtter</a>"
    }
  ],
  "since_id": 0,
  "refresh_url": "?since_id=20407667370&q=ruby",
  "next_page": "?page=2&max_id=20407667370&q=ruby",
  "page": 1,
  "results_per_page": 15,
  "completed_in": 0.0169860000000001,
  "query": "ruby"
}

Much nicer!
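
For reference, the jj call in the script is shorthand for printing the result of JSON.pretty_generate. Here is a minimal standalone Ruby sketch of that step, with a short hard-coded JSON string standing in for the curl output. The method name pretty_json is hypothetical.

```ruby
require 'json'

# Parses a compact JSON string and returns an indented, multi-line version.
def pretty_json(text)
  JSON.pretty_generate(JSON.parse(text))
end

puts pretty_json('{"query":"ruby","page":1}')
# prints:
# {
#   "query": "ruby",
#   "page": 1
# }
```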
