Tommy Chheng

All Things Programming!

Indexing XML files using DIH in Solr 1.4

I have a large set of XML files (110K documents at ~900MB) that I wanted to import into Solr as fast as possible.
Previously I tried using JRuby/Nokogiri with an embedded Solr connection, but that took about 15 minutes. So I tried Solr’s built-in DataImportHandler (DIH) to speed up imports (and re-imports). Here’s how to do it:

Assume you have a large list of XML records like:

<awardlist>
<award>
<awardnumber>0706313</awardnumber>
<title>Coherent Phonon Dynamics  in Semiconductors and Nanotubes</title>
<expirationdate>November 30, 2009</expirationdate>
</award>
<award>
<awardnumber>9909156</awardnumber>
<title>Sustainability of Arctic Communities: Advancing the Science of Integrated Assessment </title>
<expirationdate>November 30, 2009</expirationdate>
</award>
</awardlist>

Add a request handler to your solrconfig.xml file:

<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
<lst name="defaults">
<str name="config">xml-data-config.xml</str>
</lst>
</requestHandler>

Create the xml-data-config.xml file:

<dataConfig>
<dataSource type="FileDataSource" />
<document>
<entity name="nsfgrantsdir" rootEntity="false" dataSource="null"
processor="FileListEntityProcessor"
fileName="^.*\.xml$" recursive="false"
baseDir="/data/rw/data/nsf_grants_xml"
>
<entity name="nsf-grants"
pk="id"
datasource="nsfgrantsdir"
url="${nsfgrantsdir.fileAbsolutePath}"
processor="XPathEntityProcessor"
forEach="/awardlist/award"
transformer="DateFormatTransformer, RegexTransformer">
<field column="id" xpath="/awardlist/award/awardnumber" />
<field column="title_s" xpath="/awardlist/award/title"/>
<field column="expirationdate_dt" xpath="/awardlist/award/expirationdate" dateTimeFormat="MMMMM dd, yyyy" />
</entity>
</entity>
</document>
</dataConfig>

The first entity block lists every XML file in /data/rw/data/nsf_grants_xml and feeds each one into the second entity block, which extracts the fields.
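Before running a full import, it can help to check the XPath expressions against a small sample, because an expression with a misspelled root element simply matches nothing. Here is a rough sketch using the JDK's own XPath API. The class and method names are made up for illustration, and this is not how DIH evaluates XPath internally (as far as I know, XPathEntityProcessor uses a streaming parser that supports only a subset of XPath), so treat it as a sanity check only.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: evaluates an XPath expression against an XML string
// and returns the text content of every matching node.
public class XPathCheck {

    public static List<String> extract(String xml, String expression)
            throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(
                expression,
                new InputSource(new StringReader(xml)),
                XPathConstants.NODESET);
        List<String> values = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            values.add(nodes.item(i).getTextContent());
        }
        return values;
    }

    public static void main(String[] args) throws XPathExpressionException {
        // Shortened version of the sample records from this post.
        String sample =
                "<awardlist>"
              + "<award><awardnumber>0706313</awardnumber></award>"
              + "<award><awardnumber>9909156</awardnumber></award>"
              + "</awardlist>";
        // An expression whose root element matches the sample finds both records.
        System.out.println(extract(sample, "/awardlist/award/awardnumber"));
        // An expression with a different root element name returns an empty list.
        System.out.println(extract(sample, "/awardslist/award/awardnumber"));
    }
}
```

If the second call returns values and the first does not, the data uses the other spelling and the config should follow the data.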

Once the config files are in place, start Solr, visit the dataimport admin page at http://localhost:8983/solr/admin/dataimport.jsp?handler=/dataimport and click “Full-import”.
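The import can also be started without the admin page by sending the handler a command over HTTP, which is handy for scripted re-imports. A minimal sketch, assuming a single-core Solr at http://localhost:8983/solr with the handler registered at /dataimport as above; the DihTrigger class is a made-up name, and readAllBytes needs Java 9 or later.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for sending a command to the DIH request handler.
public class DihTrigger {

    // Builds e.g. http://localhost:8983/solr/dataimport?command=full-import
    public static String commandUrl(String solrBase, String handler, String command) {
        return solrBase + handler + "?command=" + command;
    }

    // Sends a GET request and returns the raw response body as a string.
    public static String send(String url) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setRequestMethod("GET");
        try (InputStream body = conn.getInputStream()) {
            return new String(body.readAllBytes(), StandardCharsets.UTF_8);
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumed base URL and handler path; adjust to your own setup.
        String base = "http://localhost:8983/solr";
        System.out.println(send(commandUrl(base, "/dataimport", "full-import")));
        // The import runs in the background; poll "status" to follow progress.
        System.out.println(send(commandUrl(base, "/dataimport", "status")));
    }
}
```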

This is where the import should just magically work for you. Unfortunately, I ran into many errors on the import. The first problem was that DateFormatTransformer couldn’t parse the dates because it assumes the documents use the same locale as your machine. Feeling like a good open source citizen, I submitted a patch!
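To see the locale problem outside of Solr, here is a small Java sketch. It is neither the patch nor DIH's own code, just plain SimpleDateFormat with the same pattern as the config; the class and method names are made up. The point is that an English month name like "November" only parses when the formatter is given an English locale, so relying on the machine's default locale is fragile.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

// Hypothetical demo of locale-sensitive date parsing.
public class LocaleDateDemo {

    // Same pattern as the dateTimeFormat attribute in xml-data-config.xml.
    private static final String PATTERN = "MMMMM dd, yyyy";

    // Parses the text using month names from the given locale.
    public static Date parse(String text, Locale locale) throws ParseException {
        return new SimpleDateFormat(PATTERN, locale).parse(text);
    }

    public static void main(String[] args) {
        String text = "November 30, 2009";
        for (Locale locale : new Locale[] {Locale.ENGLISH, Locale.FRENCH}) {
            try {
                System.out.println(locale + " parsed: " + parse(text, locale));
            } catch (ParseException e) {
                // French month names ("novembre") do not match "November".
                System.out.println(locale + " failed: " + e.getMessage());
            }
        }
    }
}
```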

Then I found that the XML data files contained characters that are not allowed in XML:

Caused by: java.lang.RuntimeException: com.ctc.wstx.exc.WstxUnexpectedCharException: Illegal character ((CTRL-CHAR, code 11))

Now I had to strip all the illegal characters. I found this helpful Java code snippet from lucas song on the Solr mailing list. Just add a FileUtils.readFileToString and FileUtils.writeStringToFile to quickly rewrite the XML files with only valid characters.

public class XmlCharFilter {
    /** Returns the input with every character outside the XML 1.0 Char range removed. */
    public static String doFilter(String in) {
        if (in == null || in.isEmpty()) {
            return ""; // Nothing to filter.
        }
        StringBuilder out = new StringBuilder(in.length());
        // Iterate by code point rather than by char, so that supplementary
        // characters (0x10000-0x10FFFF, stored as surrogate pairs) are tested
        // as a whole instead of being dropped one surrogate at a time.
        for (int i = 0; i < in.length(); ) {
            int current = in.codePointAt(i);
            if (current == 0x9 || current == 0xA || current == 0xD
                    || (current >= 0x20 && current <= 0xD7FF)
                    || (current >= 0xE000 && current <= 0xFFFD)
                    || (current >= 0x10000 && current <= 0x10FFFF)) {
                out.appendCodePoint(current);
            }
            i += Character.charCount(current);
        }
        return out.toString();
    }
}
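For the file rewriting step, here is a rough sketch of the read, filter, write loop. To keep it dependency-free I swapped the commons-io FileUtils calls for the JDK's java.nio.file.Files, and the block carries its own copy of the character test so it runs on its own. The XmlFileCleaner class and its method names are made up for illustration, and UTF-8 is an assumption about the files' encoding.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: rewrites an XML file without characters that XML 1.0 forbids.
public class XmlFileCleaner {

    // True if the code point is allowed by the XML 1.0 Char production.
    public static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Keeps only the valid code points of the input string.
    public static String stripInvalidXmlChars(String in) {
        StringBuilder out = new StringBuilder(in.length());
        in.codePoints()
          .filter(XmlFileCleaner::isValidXmlChar)
          .forEach(out::appendCodePoint);
        return out.toString();
    }

    // Reads source as UTF-8 (an assumption), filters it, and writes target.
    public static void cleanFile(Path source, Path target) throws IOException {
        String raw = new String(Files.readAllBytes(source), StandardCharsets.UTF_8);
        String cleaned = stripInvalidXmlChars(raw);
        Files.write(target, cleaned.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        // Usage: java XmlFileCleaner input.xml output.xml
        cleanFile(Paths.get(args[0]), Paths.get(args[1]));
    }
}
```

This reads each file fully into memory, which should be fine for files of a few megabytes; for very large single files a streaming Reader/Writer pair would be the safer choice.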

After filtering out the invalid characters, the DIH import ran pretty fast: around 5 minutes for 120K documents of 400 MB on my MacBook Pro laptop.

Category: Programming

  • Hey what is supposed to be the application_id? I mean, how/where will I get it?

    Cheers.
  • tommychheng
    application_id? there is no application_id reference in the Solr's DIH?
  • Sagar Dubey
    Hi, I came across your bike traffic light problem blog, and then to this site and its been very helpful. Thanks for posting it up. I`m finishing up an engg degree from mumbai, india. I wonder if you could help me out with my final year project which is related to your bike problem. Sagar
  • tommychheng
    hey sagar, feel free to post any questions related to the traffic light detection on this post http://tommy.chheng.com/index.php/2009/05/visio...

    i'll try to answer or maybe someone else will come across it and offer an answer
  • jzhang
    you might want to look at vtd-xml for the best possible processing option, it has Java port so you can use it natively in jRuby

    http://vtd-xml.sf.net