Feb 8, 2010
Indexing XML files using DIH in Solr 1.4
I have a large set of XML files (110K documents, ~900 MB) that I wanted to import into Solr as fast as possible.
Previously I tried using JRuby/Nokogiri with an embedded Solr connection, but that took about 15 minutes. So I tried Solr’s built-in DataImportHandler (DIH) to speed up imports (and re-imports). Here’s how to do it:
Assume you have a large list of XML records like:
<awardlist>
<award>
<awardnumber>0706313</awardnumber>
<title>Coherent Phonon Dynamics in Semiconductors and Nanotubes</title>
<expirationdate>November 30, 2009</expirationdate>
</award>
<award>
<awardnumber>9909156</awardnumber>
<title>Sustainability of Arctic Communities: Advancing the Science of Integrated Assessment </title>
<expirationdate>November 30, 2009</expirationdate>
</award>
</awardlist>
Add a request handler to your solrconfig.xml file:
<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
<lst name="defaults">
<str name="config">xml-data-config.xml</str>
</lst>
</requestHandler>
Create the xml-data-config.xml file:
<dataConfig>
<dataSource type="FileDataSource" />
<document>
<entity name="nsfgrantsdir" rootEntity="false" dataSource="null"
processor="FileListEntityProcessor"
fileName="^.*\.xml$" recursive="false"
baseDir="/data/rw/data/nsf_grants_xml"
>
<entity name="nsf-grants"
pk="id"
url="${nsfgrantsdir.fileAbsolutePath}"
processor="XPathEntityProcessor"
forEach="/awardlist/award"
transformer="DateFormatTransformer, RegexTransformer">
<field column="id" xpath="/awardlist/award/awardnumber" />
<field column="title_s" xpath="/awardlist/award/title"/>
<field column="expirationdate_dt" xpath="/awardlist/award/expirationdate" dateTimeFormat="MMMMM dd, yyyy" />
</entity>
</entity>
</document>
</dataConfig>
The first entity block lists every XML file in /data/rw/data/nsf_grants_xml and feeds each file’s absolute path to the second entity block, which parses the file and emits one Solr document per award element.
Once the config files are in place, start Solr, visit the DataImport admin page at http://localhost:8983/solr/admin/dataimport.jsp?handler=/dataimport and click “Full-import”.
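To make the forEach and field mapping concrete, here is a minimal sketch of the same extraction done with the JDK's javax.xml.xpath API. It is not how DIH works internally: XPathEntityProcessor uses its own streaming reader and supports only a subset of XPath. The class name AwardXPathSketch is made up for illustration, and the root element name awardlist is taken from the sample records above.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, for illustration only: mimics "one document per <award>".
final class AwardXPathSketch {

    // Returns one field map per <award> element, keyed by the Solr column names
    // used in xml-data-config.xml.
    static List<Map<String, String>> extract(String xml) throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();

        // Equivalent of forEach: every node matched here becomes one document.
        NodeList awards = (NodeList) xpath.evaluate(
                "/awardlist/award",
                new InputSource(new StringReader(xml)),
                XPathConstants.NODESET);

        List<Map<String, String>> docs = new ArrayList<>();
        for (int i = 0; i < awards.getLength(); i++) {
            Node award = awards.item(i);
            Map<String, String> doc = new LinkedHashMap<>();
            // Paths are relative to the current <award> node in this sketch.
            doc.put("id", xpath.evaluate("awardnumber", award));
            doc.put("title_s", xpath.evaluate("title", award));
            doc.put("expirationdate_dt", xpath.evaluate("expirationdate", award));
            docs.add(doc);
        }
        return docs;
    }
}
```

Running extract on the two sample records should give two maps, one per award. The date is still a plain string at this point; in DIH, converting it is the job of DateFormatTransformer.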
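If you would rather script the import than click through the admin page, the handler also accepts a command=full-import request parameter. The sketch below assumes the single-core setup used in this post (Solr at http://localhost:8983/solr, handler registered as /dataimport); the class and method names are made up for illustration.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical helper, for illustration only.
final class DataImportTrigger {

    // Builds the full-import URL from the Solr base URL and the handler name
    // configured in solrconfig.xml.
    static String fullImportUrl(String solrBase, String handler) {
        return solrBase + handler + "?command=full-import";
    }

    // Sends the request and returns the HTTP status code. The import itself
    // runs in the background on the Solr side after the request returns.
    static int trigger(String url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("GET");
        try {
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }
}
```

For example, trigger(fullImportUrl("http://localhost:8983/solr", "/dataimport")) starts the import; adjust the base URL if your Solr runs elsewhere or uses multiple cores.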
This is where the import should just magically work for you. Unfortunately, I ran into many errors during the import. The first problem was that DateFormatTransformer couldn’t parse the dates, because it assumes the documents use the same locale as your machine. Feeling like a good open source citizen, I submitted a patch!
Then I found that the XML data files contained characters that are not allowed in XML:
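The locale problem is easy to reproduce outside Solr. The sketch below uses java.text.SimpleDateFormat directly with the same pattern as the config; it is not DIH code, and the class name LocaleDateSketch is made up for illustration. The point is only that the month name "November" parses under an English locale and not under, say, a French one.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

// Hypothetical helper, for illustration only.
final class LocaleDateSketch {

    // Same pattern as the dateTimeFormat attribute in xml-data-config.xml.
    static final String PATTERN = "MMMMM dd, yyyy";

    // Parses the text with an explicit locale instead of the machine default.
    static Date parse(String text, Locale locale) throws ParseException {
        return new SimpleDateFormat(PATTERN, locale).parse(text);
    }

    // Convenience wrapper: true if the text parses under the given locale.
    static boolean parses(String text, Locale locale) {
        try {
            parse(text, locale);
            return true;
        } catch (ParseException e) {
            return false;
        }
    }
}
```

With this, parses("November 30, 2009", Locale.ENGLISH) succeeds, while the same call with Locale.FRENCH fails because the French month name is "novembre". A formatter built without an explicit locale picks up whatever the JVM default happens to be, which is why the import behaved differently depending on the machine.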
Caused by: java.lang.RuntimeException: com.ctc.wstx.exc.WstxUnexpectedCharException: Illegal character ((CTRL-CHAR, code 11))
Now I had to strip all the illegal characters. I found this helpful Java code snippet from lucas song on the Solr mailing list. Just add a FileUtils.readFileToString call and a FileUtils.writeStringToFile call to quickly rewrite the XML files with only valid characters.
public class XmlCharFilter {
    // Returns a copy of the input with every character that is not legal in XML 1.0 removed.
    public static String doFilter(String in) {
        if (in == null || in.isEmpty())
            return ""; // Nothing to filter.
        StringBuilder out = new StringBuilder(in.length()); // Holds the output.
        // Walk the string by code point, not by char: a char can never exceed 0xFFFF,
        // so characters above U+FFFF (stored as surrogate pairs) must be tested as a whole.
        for (int i = 0; i < in.length(); ) {
            int current = in.codePointAt(i);
            i += Character.charCount(current);
            if ((current == 0x9) || (current == 0xA) || (current == 0xD)
                    || ((current >= 0x20) && (current <= 0xD7FF))
                    || ((current >= 0xE000) && (current <= 0xFFFD))
                    || ((current >= 0x10000) && (current <= 0x10FFFF)))
                out.appendCodePoint(current);
        }
        return out.toString();
    }
}
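The read, filter and write-back step can be sketched as follows. I used Commons IO's FileUtils; this version swaps in java.nio.file from the JDK so it has no external dependency. It assumes the files are UTF-8 encoded and small enough to read into memory one at a time. The class XmlFileCleaner and its method names are made up for illustration, and the filter is repeated here only to keep the sketch self-contained.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, for illustration only.
final class XmlFileCleaner {

    // True if the code point is allowed by the XML 1.0 Char production.
    static boolean isLegalXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Removes every code point that is not legal in XML 1.0.
    static String strip(String in) {
        StringBuilder out = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); ) {
            int cp = in.codePointAt(i);
            i += Character.charCount(cp);
            if (isLegalXmlChar(cp)) {
                out.appendCodePoint(cp);
            }
        }
        return out.toString();
    }

    // Rewrites each *.xml file directly inside dir (not recursive) and
    // returns how many files actually changed. Assumes UTF-8 content.
    static int cleanDirectory(Path dir) throws IOException {
        int rewritten = 0;
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir, "*.xml")) {
            for (Path file : files) {
                String original = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
                String cleaned = strip(original);
                if (!cleaned.equals(original)) {
                    Files.write(file, cleaned.getBytes(StandardCharsets.UTF_8));
                    rewritten++;
                }
            }
        }
        return rewritten;
    }
}
```

Since this overwrites the files in place, it is worth running it against a copy of the data directory first.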
After filtering the XML down to valid characters, the DIH import ran pretty fast: around 5 minutes for 120K documents (400 MB) on my MacBook Pro laptop.
