Tommy Chheng

All Things Programming!

Indexing XML files using DIH in Solr 1.4

I have a large set of XML files (110K documents, ~900 MB) that I wanted to import into Solr as fast as possible.
Previously I tried using JRuby/Nokogiri with an embedded Solr connection, but that took about 15 minutes. So I tried Solr’s built-in DataImportHandler (DIH) to speed up imports (and re-imports). Here’s how to do it:

Assume you have a large list of xml records like:

<awardlist>
  <award>
    <awardnumber>0706313</awardnumber>
    <title>Coherent Phonon Dynamics in Semiconductors and Nanotubes</title>
    <expirationdate>November 30, 2009</expirationdate>
  </award>
  <award>
    <awardnumber>9909156</awardnumber>
    <title>Sustainability of Arctic Communities: Advancing the Science of Integrated Assessment</title>
    <expirationdate>November 30, 2009</expirationdate>
  </award>
</awardlist>

Add a request handler to your solrconfig.xml file:

<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">xml-data-config.xml</str>
  </lst>
</requestHandler>

Create the xml-data-config.xml file:

<dataConfig>
  <dataSource type="FileDataSource" />
  <document>
    <!-- Outer entity: lists the XML files; it produces no Solr documents itself. -->
    <entity name="nsfgrantsdir" rootEntity="false" dataSource="null"
            processor="FileListEntityProcessor"
            fileName="^.*\.xml$" recursive="false"
            baseDir="/data/rw/data/nsf_grants_xml">
      <!-- Inner entity: parses each file; one Solr document per <award>. -->
      <entity name="nsf-grants"
              pk="id"
              url="${nsfgrantsdir.fileAbsolutePath}"
              processor="XPathEntityProcessor"
              forEach="/awardlist/award"
              transformer="DateFormatTransformer, RegexTransformer">
        <field column="id" xpath="/awardlist/award/awardnumber" />
        <field column="title_s" xpath="/awardlist/award/title" />
        <field column="expirationdate_dt" xpath="/awardlist/award/expirationdate" dateTimeFormat="MMMMM dd, yyyy" />
      </entity>
    </entity>
  </document>
</dataConfig>

The outer entity block reads all XML files in /data/rw/data/nsf_grants_xml and feeds each one to the inner entity block for handling.

Once the config files are done, start up Solr, visit the dataimport admin page at http://localhost:8983/solr/admin/dataimport.jsp?handler=/dataimport and click “Full-import”.
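If you would rather script the import than click through the admin page, the handler also accepts its commands as plain HTTP requests. Here is a minimal sketch that only builds the URL; the helper name dih_url is my own, and the base URL assumes the default single-core setup used above.

```ruby
require "uri"

# Hypothetical helper: builds the URL for a DataImportHandler command
# such as "full-import" or "status". Host, port and handler path are
# assumptions taken from the setup described in this post.
def dih_url(command, base: "http://localhost:8983/solr", handler: "/dataimport")
  uri = URI.parse("#{base}#{handler}")
  uri.query = URI.encode_www_form(command: command)
  uri.to_s
end

puts dih_url("full-import")
puts dih_url("status")

# To actually send the request (needs a running Solr):
#   require "net/http"
#   puts Net::HTTP.get(URI(dih_url("full-import")))
```

Requesting the status URL afterwards is a handy way to see how far the import has got.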

This is where the import should just magically work for you. Unfortunately, I ran into many errors on the import. The first problem was that the DateFormatTransformer couldn’t parse the dates correctly because it assumes the documents use the same locale as your machine. Feeling like a good open source citizen, I submitted a patch!
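If you can’t patch Solr, one possible workaround (not what I did, just a sketch) is to normalize the dates before the import so no locale-dependent parsing is needed. Ruby’s Date.strptime always uses English month names, so it can turn the “November 30, 2009” style into the ISO form that Solr date fields expect. The helper name solr_date is made up for this example.

```ruby
require "date"

# Hypothetical helper: converts a date like "November 30, 2009" into the
# ISO 8601 UTC form used by Solr date fields. Midnight UTC is an assumption,
# since the source data carries no time of day.
def solr_date(text)
  Date.strptime(text.strip, "%B %d, %Y").strftime("%Y-%m-%dT00:00:00Z")
end

puts solr_date("November 30, 2009")
```

With pre-normalized dates the dateTimeFormat attribute would of course have to change to match the new format.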

Then I found that the XML data files contained characters that are not allowed in XML:

Caused by: java.lang.RuntimeException: com.ctc.wstx.exc.WstxUnexpectedCharException: Illegal character ((CTRL-CHAR, code 11))

Now I had to strip all the illegal characters. I found this helpful Java code snippet from Lucas Song on the Solr mailing list. Just add a FileUtils.readFileToString and a FileUtils.writeStringToFile call to quickly rewrite the XML files with only valid characters.

public class XmlCharFilter {
    /** Returns the input with every character that is illegal in XML 1.0 removed. */
    public static String doFilter(String in) {
        if (in == null || ("".equals(in)))
            return ""; // Nothing to filter.
        StringBuilder out = new StringBuilder(in.length()); // Holds the output.
        // Walk code points rather than chars: a char can never be >= 0x10000, so
        // supplementary characters (stored as surrogate pairs) must be read whole
        // or they would be dropped.
        for (int i = 0; i < in.length(); ) {
            int current = in.codePointAt(i);
            i += Character.charCount(current);
            if ((current == 0x9) || (current == 0xA) || (current == 0xD)
                    || ((current >= 0x20) && (current <= 0xD7FF))
                    || ((current >= 0xE000) && (current <= 0xFFFD))
                    || ((current >= 0x10000) && (current <= 0x10FFFF)))
                out.appendCodePoint(current);
        }
        return out.toString();
    }
}
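Since I started this project in JRuby, here is a rough Ruby equivalent of the same filter, including the read/rewrite step. It is only a sketch: the method names are my own, and it assumes the files are UTF-8 (invalid byte sequences are simply dropped with scrub).

```ruby
# Matches anything outside the character ranges XML 1.0 allows
# (the same ranges the Java filter above tests for).
INVALID_XML_CHARS = /[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\u{10000}-\u{10FFFF}]/

# Returns the text with every character that is illegal in XML removed.
def strip_invalid_xml_chars(text)
  text.gsub(INVALID_XML_CHARS, "")
end

# Hypothetical helper: reads one XML file and writes a cleaned copy.
def clean_xml_file(src_path, dest_path)
  raw = File.read(src_path, encoding: "UTF-8").scrub("")
  File.write(dest_path, strip_invalid_xml_chars(raw))
end

# Code 11 (vertical tab) is the character from the error message above.
puts strip_invalid_xml_chars("Coherent\u000BPhonon").inspect
```

Looping clean_xml_file over the data directory before running the import should be enough to get past the WstxUnexpectedCharException.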

After filtering the XML for valid characters, the DIH import ran pretty fast: around 5 minutes for 120K documents (400 MB) on my MacBook Pro laptop.

A Tale of Two Mice Packaging

[Photo: Logitech VX Nano and RX1500 packaging]

I recently purchased two Logitech mice: a VX Nano for my mobile use and a RX1500 for desktop use. The VX Nano came in retail packaging and the RX1500 came in OEM packaging. This photo illustrates the problem and solution to sustainable computer packaging. The OEM came with just what I need: the mouse. The retail packaging has everything and the kitchen sink. Who needs a driver CD? A manual for a mouse? All the extra cardboard?

Paypal Adaptive Ruby Gem Released

I have been tinkering with the new Paypal Adaptive Payments API and created a simple Ruby gem to interface with it. It’s still pretty new, but I’ve been using it with few problems so far. Submit bug reports if you find any. See the code on GitHub.

Paypal Adaptive Payments API
The Adaptive Payments API lets you make preapproved payments, chained payments and parallel payments. The chained/parallel payments are great for commission-based apps or if you are trying to connect a buyer to multiple sellers with a single interface.
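To make the commission case concrete, here is a sketch of how a receiverList for a chained payment could be built, in the same Hash shape as the pay request shown further down. The helper name and the email addresses are made up. It assumes the usual chained-payment setup where one receiver is marked as primary, is listed with the full amount, and passes the secondary receiver’s share on; check the Paypal Adaptive manual for the exact field rules.

```ruby
# Formats an integer amount of cents as a decimal string, e.g. 1000 -> "10.00".
def cents_to_amount(cents)
  format("%d.%02d", *cents.divmod(100))
end

# Hypothetical helper: the marketplace is the primary receiver and keeps
# commission_cents; the seller receives the rest as the secondary receiver.
def chained_receiver_list(marketplace_email, seller_email, total_cents, commission_cents)
  raise ArgumentError, "commission exceeds total" if commission_cents > total_cents
  { "receiver" => [
    { "email" => marketplace_email, "amount" => cents_to_amount(total_cents), "primary" => "true" },
    { "email" => seller_email, "amount" => cents_to_amount(total_cents - commission_cents), "primary" => "false" }
  ] }
end

p chained_receiver_list("marketplace@example.com", "seller@example.com", 1000, 100)
```

The result would go under the "receiverList" key of the data Hash passed to pay.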

How to use with Rails:
Install:

sudo gem install paypal_adaptive

Set up your API info by adding a paypal_adaptive.yml to your config folder:


development:
  environment: "sandbox"
  username: "sandbox_username"
  password: "sandbox_password"
  signature: "sandbox_signature"
  application_id: "sandbox_app_id"

test:
  environment: "sandbox"
  username: "sandbox_username"
  password: "sandbox_password"
  signature: "sandbox_signature"
  application_id: "sandbox_app_id"

production:
  environment: "production"
  username: "my_production_username"
  password: "my_production_password"
  signature: "my_production_signature"
  application_id: "my_production_app_id"

Make the payment request:


pay_request = PaypalAdaptive::Request.new

data = {
"returnUrl" => "http://testserver.com/payments/completed_payment_request",
"requestEnvelope" => {"errorLanguage" => "en_US"},
"currencyCode"=>"USD",
"receiverList"=>{"receiver"=>
     [{"email"=>"testpp_1261697850_per@nextsprocket.com", "amount"=>"10.00"}]},
"cancelUrl"=>"http://testserver.com/payments/canceled_payment_request",
"actionType"=>"PAY",
"ipnNotificationUrl"=>"http://testserver.com/payments/ipn_notification"
}

pay_response = pay_request.pay(data)

if pay_response.success?
  redirect_to pay_response.approve_paypal_payment_url
else
  puts pay_response.errors.first['message']
  redirect_to failed_payment_url
end

Once the user goes to pay_response.approve_paypal_payment_url, they will be prompted to log in to Paypal for payment.

Upon completing the payment, they will be redirected to http://testserver.com/payments/completed_payment_request.

They can also click cancel to go to http://testserver.com/payments/canceled_payment_request.

The actual payment details will be sent to your server via “ipnNotificationUrl”. You have to create a listener to receive POST messages from Paypal. I added a Rails metal template in the templates folder which handles the callback.
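The core of such a listener is small. This sketch is not the gem’s template, just an illustration. It parses the form-encoded POST body and builds the message you send back to Paypal to confirm the notification is genuine (the standard IPN handshake, which prefixes the original body with cmd=_notify-validate). The field names in the sample body are illustrative.

```ruby
require "uri"

# Parses the raw form-encoded IPN body into a Hash of field => value.
def parse_ipn(raw_body)
  URI.decode_www_form(raw_body).to_h
end

# Builds the body to POST back to Paypal for verification. The original
# message must be echoed back unchanged after the cmd parameter.
def ipn_verification_body(raw_body)
  "cmd=_notify-validate&#{raw_body}"
end

raw = "status=COMPLETED&pay_key=AP-123" # sample body, field names are assumptions
p parse_ipn(raw)
puts ipn_verification_body(raw)
```

Only treat the payment as confirmed once Paypal answers the verification request with VERIFIED.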

Additionally, you can make calls to Paypal Adaptive’s other APIs:


payment_details, preapproval, preapproval_details,
cancel_preapproval, convert_currency, refund

The input is a Hash, just like for the pay method. Refer to the Paypal Adaptive manual for more details.

Converting from United States State Plane Coordinate System to Lat/Long WGS84

The State Plane Coordinate System is a set of Cartesian coordinate systems widely used by local practitioners in the United States. Each zone is only useful for its own local region, because its projection is fitted to that region and the distortion grows quickly outside it.

UTM?
When googling solutions for the conversion, I came across the UTM system. It looked related as it referenced the data in the same easting and northing points. UTM is a cartesian coordinate system but generalized for the entire world. It is not as specific as the State Plane Coordinate System.

WGS84?
WGS84 is the latest, widely used revision of the World Geodetic System. Google Earth references lat/lng using this system, and this is our target.

Convert from the California Zone VI State Plane Coordinate System to WGS84.
Now, back to the problem, our input is 6080411.905 ft easting 2169099.127 ft northing. We want its corresponding lat/lng.

Use cs2cs for the conversion
Download cs2cs from PROJ.4. cs2cs converts coordinates from one system to another, given the right projection parameters.

Find the corresponding projection for California Zone VI State Plane Coordinate System:

spatialreference.org has an extensive collection of coordinate system definitions.

Here are the correct parameters for California VI as found on http://spatialreference.org/ref/esri/102646/

PROJCS["NAD_1983_StatePlane_California_VI_FIPS_0406_Feet",
    GEOGCS["GCS_North_American_1983",
        DATUM["North_American_Datum_1983",
            SPHEROID["GRS_1980",6378137,298.257222101]],
        PRIMEM["Greenwich",0],
        UNIT["Degree",0.017453292519943295]],
    PROJECTION["Lambert_Conformal_Conic_2SP"],
    PARAMETER["False_Easting",6561666.666666666],
    PARAMETER["False_Northing",1640416.666666667],
    PARAMETER["Central_Meridian",-116.25],
    PARAMETER["Standard_Parallel_1",32.78333333333333],
    PARAMETER["Standard_Parallel_2",33.88333333333333],
    PARAMETER["Latitude_Of_Origin",32.16666666666666],
    UNIT["Foot_US",0.30480060960121924],
    AUTHORITY["EPSG","102646"]]

Run cs2cs
Pipe the input coordinates to cs2cs with the correct parameters given above.
echo '6080411.905 2169099.127' | cs2cs -f %.16f +proj=lcc +lat_1=32.78333333333333 +lat_2=33.88333333333333 +lat_0=32.16666666666666 +lon_0=-116.25 +x_0=2000000 +y_0=500000.0000000002 +ellps=GRS80 +datum=NAD83 +to_meter=0.3048006096012192 +no_defs +to +proj=latlon

Output (longitude, latitude, height):
-117.8701429123491806 33.6091316264922995 0.0000000000000000
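Two details are easy to trip over here. The definition above gives the false easting and northing in US survey feet, while the +x_0 and +y_0 values in the cs2cs command are in meters. And cs2cs prints longitude first, then latitude. This small sketch checks the unit conversion and parses an output line; the helper names are my own.

```ruby
# One US survey foot is exactly 1200/3937 meters (the +to_meter value above).
US_SURVEY_FOOT_IN_METERS = Rational(1200, 3937).to_f

def survey_feet_to_meters(feet)
  feet * US_SURVEY_FOOT_IN_METERS
end

# Hypothetical helper: splits a cs2cs output line into named values.
# cs2cs prints x (longitude) before y (latitude).
def parse_cs2cs_line(line)
  lon, lat, height = line.split.map(&:to_f)
  { lon: lon, lat: lat, height: height }
end

puts survey_feet_to_meters(6561666.666666666) # False_Easting  -> about 2000000 m
puts survey_feet_to_meters(1640416.666666667) # False_Northing -> about 500000 m
p parse_cs2cs_line("-117.8701429123491806 33.6091316264922995 0.0000000000000000")
```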

Alternatively, if you can’t use cs2cs or would rather implement the conversion directly with the projection parameters, the transformation equations are listed at http://www.remotesensing.org/geotiff/proj_list/lambert_conic_conformal_2sp.html

ANTLR “Tokens has non-LL(*) decision due to recursive rule invocations”

I recently became interested in domain-specific languages (DSLs) and just started reading “Language Implementation Patterns”. The author, Terence Parr, wrote a great tool called ANTLR which helps you with the scanning/parsing stage of constructing a compiler.

Sometimes you’ll see this error:

antlr warning " Tokens has non-LL(*) decision due to recursive rule invocations "

when your grammar is not valid LL(*), for example because of a problematic recursive rule (see more on the ANTLR mailing list), but you may be getting this error due to a simple syntax mistake.

Unlike JFlex (a scanner) and CUP (a parser), ANTLR generates both the lexer (scanner) and the parser. Only use all caps for the lexing (scanning) rules! Parser rules use lower case or camel case.

grammar test;
NUMBER	:	'0'..'9'+;
TERM	:	NUMBER | '(' EXPRESSION ')';
EXPRESSION	:	TERM (('+'|'-') TERM)*;

should be:

grammar test;
NUMBER	:	'0'..'9'+;
term	:	NUMBER | '(' expr ')';
expr	:	term (('+'|'-') term)*;
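If the lexer/parser split feels abstract, it may help to see the same tiny grammar written by hand. The sketch below is plain Ruby, not ANTLR output: tokenize plays the role of the all-caps lexer rule (NUMBER), and the lower-case methods expr and term mirror the parser rules, calling each other recursively. It also evaluates the expression as it parses, which the grammar above does not do.

```ruby
# Lexer: turns the character stream into tokens (numbers, parens, operators).
def tokenize(source)
  tokens = source.scan(/\d+|[()+\-]/)
  unless tokens.join == source.gsub(/\s/, "")
    raise ArgumentError, "unexpected character in #{source.inspect}"
  end
  tokens
end

# Parser: one method per parser rule, working on tokens, never on characters.
class ExprParser
  def initialize(tokens)
    @tokens = tokens
    @pos = 0
  end

  def parse
    value = expr
    raise ArgumentError, "unexpected token #{peek.inspect}" unless peek.nil?
    value
  end

  private

  def peek
    @tokens[@pos]
  end

  def advance
    token = @tokens[@pos]
    @pos += 1
    token
  end

  # expr : term (('+'|'-') term)* ;
  def expr
    value = term
    while ["+", "-"].include?(peek)
      operator = advance
      right = term
      value = operator == "+" ? value + right : value - right
    end
    value
  end

  # term : NUMBER | '(' expr ')' ;
  def term
    token = advance
    raise ArgumentError, "unexpected end of input" if token.nil?
    return Integer(token, 10) if token =~ /\A\d+\z/
    raise ArgumentError, "unexpected token #{token.inspect}" unless token == "("
    value = expr
    raise ArgumentError, "expected ')'" unless advance == ")"
    value
  end
end

def evaluate(source)
  ExprParser.new(tokenize(source)).parse
end

puts evaluate("1+(2-3)+10")
```

Note that term can call expr here only because both are parser-level methods; a lexer rule has no way to refer back to a parser rule, which is the mistake in the first grammar.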

Clojure!

I decided it’s about time to learn a functional programming language. Clojure is an excellent choice to get started. I looked at Scala and Erlang as other possible functional programming languages to learn but I settled on Clojure because:

    a) It’s a LISP dialect
    b) Runs on JVM
    c) Community support.

I was amazed at the amount of talk going on in the Clojure community. The creator of Clojure, Rich Hickey, regularly answers questions on the Clojure Google group.

To get started on the mac, Citizen428 has created an easy to use package. It has all the necessary jars and support for TextMate and Emacs. Download it from the github project

Take some time to read more from the Clojure website and google groups. They have plenty of answers.

Here’s a sample app:

(import '(javax.swing JFrame JLabel JTextField JButton)
        '(java.awt.event ActionListener)
        '(java.awt GridLayout))
(defn celsius []
  (let [frame (JFrame. "Celsius Converter")
        temp-text (JTextField.)
        celsius-label (JLabel. "Celsius")
        convert-button (JButton. "Convert")
        fahrenheit-label (JLabel. "Fahrenheit")]
    (.addActionListener convert-button
      (proxy [ActionListener] []
        (actionPerformed [evt]
          (let [c (Double/parseDouble (.getText temp-text))]
            (.setText fahrenheit-label
               (str (+ 32 (* 1.8 c)) " Fahrenheit"))))))
    (doto frame
      (.setLayout (GridLayout. 2 2 3 3))
      (.add temp-text)
      (.add celsius-label)
      (.add convert-button)
      (.add fahrenheit-label)
      (.setSize 300 80)
      (.setVisible true))))
(celsius)

Pig Textmate Bundle

I have been working with Pig as a more productive layer on top of Hadoop. Definitely check it out if you need to use MapReduce. TextMate is my primary editor, so I made a little bundle for syntax highlighting and script running. I’ll add snippets in the future.

Check it out at the github project page

Pig is a language designed to process large datasets. Its execution layer uses Hadoop’s MapReduce framework.

This textmate bundle gives you syntax highlighting and the ability to run Pig scripts from Textmate.

Install:

mkdir -p ~/Library/Application\ Support/TextMate/Bundles
cd ~/Library/Application\ Support/TextMate/Bundles
git clone git://github.com/tc/pig-latin-tmbundle.git "PigLatin.tmbundle"
osascript -e 'tell app "TextMate" to reload bundles'

To run Pig scripts from Textmate:
Add the pig bin directory to your path env variable.
ex.

export PATH="/Users/tc/bin/pig-0.3.0/bin:$PATH"

Then, you can just press “Apple-R” to execute the active pig script in Textmate.

Must have Apps for Mac Developers

I recently got the new 15 inch Macbook Pro with Snow Leopard and decided to clean out my set of mac apps to the bare necessities. Here’s what I got it down to:

general purpose software:
Firefox: I use this along with two must-have plugins: ubiquity, firebug.
iwork: Need Pages/Numbers/Keynote.
adium: much more customizable than ichat.
VLC: plays every type of video.
colloquy: irc chat client.
transmission: torrents!
photoshop cs4: Need to edit images from time to time.
skype: Making the calls.
flickr uploadr: Easier to upload photos than their web interface.
unrarx: the builtin uncompress software doesn’t work with rar.

dev editors:
smultron: a great low footprint text editor.
textmate: Best editor for any type of programming language.
eclipse: The Java IDE. I use Hadoop and am learning Clojure, so Java is back in. Eclipse has great debugging and JUnit integration.

utilities:
quicksilver: quick app launcher.
sequel pro: Browsing MySQL DBs.
gitx: I mostly use the command line for git, but I like the visual diffs in gitx.
iterm: I like this better than terminal, mostly because of the full screen ability and preferred shortcut keys.
geektool: I keep a cat of the syslog on my desktop to make sure no background process is going haywire.
virtualbox: Free opensource VM client. For checking website compatibility in IE7.

tech(ok, you don’t need all of these, but they are great tech to try out):
java: 1.6 is already builtin.
git: distributed source control, must have for any coding project or documents you write. build from source.
mysql: need to run a local mysql server. get the 64 bit version!
ruby on rails: My choice of web development framework.
couchdbx: my choice of the nosql db. this is a self-contained package.
clojure: functional programming language and a modern lisp dialect built on the jvm.
hadoop: run a local copy of jobs for testing before deploying to ec2.
pig: the pig command line is vital for testing pig scripts locally.

CouchDB Install Problems on Snow Leopard: Trace/BPT trap

Just got the new Mac Snow Leopard this weekend! With the OS switching from 32-bit to 64-bit, I had to recompile much of my development software. CouchDB was particularly troublesome.

I deleted all my macports, installed new macports 1.8, built 64bit versions of spidermonkey, icu and erlang using macports. Erlang looked like it was working:

→$erl
Erlang R13B01 (erts-5.7.2) [source] [64-bit] [smp:2:2] [rq:2]
[async-threads:0] [kernel-poll:false]
Eshell V5.7.2  (abort with ^G)
1> 1+1.
2
2> q().

I rebuilt couchdb from the latest sources using

./bootstrap; ./configure; make && sudo make install

When I tried to start CouchDB:

→$sudo couchdb
Apache CouchDB 0.10.0a802973 (LogLevel=info) is starting.
Trace/BPT trap

Ouch… I had no idea what this “Trace/BPT trap” error meant, but luckily Benoit Chesneau on the CouchDB user mailing list found out it was a MacPorts build issue and not the code itself. Installing CouchDB and its dependencies directly from source solved the issue. The exact steps to rebuild from the sources rather than MacPorts are shown on the CouchDB wiki install page.

OS Process Timed out CouchDB error and fix

If you come across an “OS process timed out” error when using a view in CouchDB, you can raise the os_process_timeout setting in CouchDB’s Futon utility.

os_process_timeout

[Tue, 21 Jul 2009 23:57:45 GMT] [error] [<0.2804.0>] Uncaught error in HTTP request: {exit,
                                {{bad_return_value,
                                  {os_process_error,"OS process timed out."}},
                                 {gen_server,call,
                                  [<0.2808.0>,
                                   {prompt,
                                    [<<"reduce">>,
                                     [<<"function(keys, values, rereduce) {\n     return sum(values);\n   }">>],
                                     [[[[<<"0832603">>,<<"and">>],
                                        <<"0832603">>],
                                       1],
                                      [[[<<"0832603">>,<<"and">>],
                                        <<"0832603">>],
                                       1],
                                      [[[<<"0832603">>,<<"and">>],
                                        <<"0832603">>],
                                       1],
                                      [[[<<"0832603">>,<<"and">>],
                                        <<"0832603">>],
                                       1],
                                      [[[<<"0832603">>,<<"and">>],
                                        <<"0832603">>],
                                       1],
                                      [[[<<"0832603">>,<<"and">>],
                                        <<"0832603">>],
                                       1],
                                      [[[<<"0832603">>,<<"and">>],
                                        <<"0832603">>],
                                       1]]]},
                                   infinity]}}}
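If you prefer not to click through Futon, the same setting can be changed over HTTP through CouchDB’s _config resource, which is what Futon’s configuration page talks to. The sketch below only builds the request. It assumes the setting lives in the couchdb section, that the value is in milliseconds, and that 10000 is a value you actually want; adjust to taste.

```ruby
require "net/http"
require "json"

# Hypothetical helper: builds a PUT request that sets os_process_timeout.
# Config values are sent as JSON strings, hence the to_s before to_json.
def os_process_timeout_request(milliseconds)
  request = Net::HTTP::Put.new("/_config/couchdb/os_process_timeout")
  request.body = milliseconds.to_s.to_json
  request
end

request = os_process_timeout_request(10_000)
puts "#{request.method} #{request.path} #{request.body}"

# To actually send it (needs a running CouchDB on the default port):
#   Net::HTTP.start("localhost", 5984) { |http| puts http.request(request).body }
```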

Thanks to Dustin on the couchdb user mailing list for pointing this out.