Indexing XML files using DIH in Solr 1.4
Posted in Programming on February 8th, 2010 by tommy
I have a large set of XML files (110K documents, ~900MB) that I wanted to import into Solr as fast as possible.
Previously I tried using JRuby/Nokogiri with an embedded Solr connection, but that took about 15 minutes. So I tried Solr’s built-in DataImportHandler (DIH) to speed up imports (and re-imports) of the XML files. Here’s how to do it:
Assume you have a large list of XML records like:
<awardlist>
<award>
<awardnumber>0706313</awardnumber>
<title>Coherent Phonon Dynamics in Semiconductors and Nanotubes</title>
<expirationdate>November 30, 2009</expirationdate>
</award>
<award>
<awardnumber>9909156</awardnumber>
<title>Sustainability of Arctic Communities: Advancing the Science of Integrated Assessment </title>
<expirationdate>November 30, 2009</expirationdate>
</award>
</awardlist>
Add a request handler to your solrconfig.xml file:
<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
<lst name="defaults">
<str name="config">xml-data-config.xml</str>
</lst>
</requestHandler>
Create the xml-data-config.xml file:
<dataConfig>
<dataSource type="FileDataSource" />
<document>
<entity name="nsfgrantsdir" rootEntity="false" dataSource="null"
processor="FileListEntityProcessor"
fileName="^.*\.xml$" recursive="false"
baseDir="/data/rw/data/nsf_grants_xml"
>
<entity name="nsf-grants"
pk="id"
datasource="nsfgrantsdir"
url="${nsfgrantsdir.fileAbsolutePath}"
processor="XPathEntityProcessor"
forEach="/awardlist/award"
transformer="DateFormatTransformer, RegexTransformer">
<field column="id" xpath="/awardlist/award/awardnumber" />
<field column="title_s" xpath="/awardlist/award/title"/>
<field column="expirationdate_dt" xpath="/awardlist/award/expirationdate" dateTimeFormat="MMMMM dd, yyyy" />
</entity>
</entity>
</document>
</dataConfig>
The first entity block lists every XML file in /data/rw/data/nsf_grants_xml and feeds each one to the second entity block, which extracts the fields.
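Before running a full import, it can be worth checking that the forEach expression actually matches the records in one of your files, since a root element name that is off by a single letter matches nothing. Here is a minimal sketch using the JDK's own XPath support. The class and method names (ForEachCheck, countMatches, SAMPLE) are made up for illustration, and DIH's XPathEntityProcessor uses its own streaming XPath subset rather than javax.xml.xpath, so treat this as a sanity check on the path only, not as a reproduction of what DIH does.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Solr or DIH.
class ForEachCheck {
    // A cut-down copy of the sample records shown earlier in the post.
    static final String SAMPLE =
        "<awardlist>"
        + "<award><awardnumber>0706313</awardnumber>"
        + "<expirationdate>November 30, 2009</expirationdate></award>"
        + "<award><awardnumber>9909156</awardnumber>"
        + "<expirationdate>November 30, 2009</expirationdate></award>"
        + "</awardlist>";

    // Returns how many nodes the XPath expression selects in the given XML.
    static int countMatches(String xml, String xpath) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance()
                .newXPath()
                .evaluate(xpath, doc, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            // Wrap checked parser/XPath exceptions to keep the helper easy to call.
            throw new IllegalStateException("Could not evaluate " + xpath, e);
        }
    }

    public static void main(String[] args) {
        // The root element in the sample is <awardlist>, so this selects both records.
        System.out.println(countMatches(SAMPLE, "/awardlist/award"));
        // A misspelled root element selects nothing.
        System.out.println(countMatches(SAMPLE, "/awardslist/award"));
    }
}
```

If the count comes back as zero for a path you expected to match, compare the element names in the expression against the file character by character.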
Once the config files are in place, start Solr, visit the dataimport admin page at http://localhost:8983/solr/admin/dataimport.jsp?handler=/dataimport and click the “Full-import” button.
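For re-imports you may not want to click through the admin page every time. The handler also accepts commands over plain HTTP, so something like the sketch below should work. It assumes Solr is running at http://localhost:8983/solr with the handler registered as /dataimport, as in the config above. The class and method names (DihTrigger, commandUrl, send) are made up for illustration, and readAllBytes needs Java 9 or later.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Solr.
class DihTrigger {
    // Builds the URL for a DIH command such as "full-import" or "status".
    static String commandUrl(String solrBase, String handler, String command) {
        String base = solrBase.endsWith("/")
            ? solrBase.substring(0, solrBase.length() - 1)
            : solrBase;
        return base + handler + "?command=" + command;
    }

    // Sends a GET request and returns the response body as text.
    static String send(String url) throws IOException {
        HttpURLConnection conn =
            (HttpURLConnection) URI.create(url).toURL().openConnection();
        try (InputStream in = conn.getInputStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        String url = commandUrl("http://localhost:8983/solr", "/dataimport", "full-import");
        System.out.println(url);
        // Only contact Solr when explicitly asked to, e.g. "java DihTrigger --send".
        if (args.length > 0 && args[0].equals("--send")) {
            System.out.println(send(url));
        }
    }
}
```

Swapping "full-import" for "status" gives a URL you can poll to see how far the import has progressed.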
This is where the import should just magically work for you. Unfortunately, I ran into many errors during the import. The first problem was that DateFormatTransformer couldn’t parse the dates, because it assumes the documents use the same locale as your machine. Feeling like a good open source citizen, I submitted a patch!
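To see why the locale matters, here is a small sketch with java.text.SimpleDateFormat, which as far as I can tell is what DateFormatTransformer uses under the hood. It parses the same date string with the same "MMMMM dd, yyyy" pattern under two locales. The class and method names (LocaleDateCheck, parses) are made up for illustration, and Locale.FRANCE is just a stand-in for a machine whose default locale does not use English month names.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Locale;

// Hypothetical demo class, not part of Solr.
class LocaleDateCheck {
    // Returns true if the text parses with the given pattern under the given locale.
    static boolean parses(String text, String pattern, Locale locale) {
        SimpleDateFormat format = new SimpleDateFormat(pattern, locale);
        format.setLenient(false);
        try {
            format.parse(text);
            return true;
        } catch (ParseException e) {
            // The month name was not recognised in this locale.
            return false;
        }
    }

    public static void main(String[] args) {
        String date = "November 30, 2009";
        String pattern = "MMMMM dd, yyyy";
        // English month names are recognised under an English locale...
        System.out.println("US: " + parses(date, pattern, Locale.US));
        // ...but not under a French one, where the month is "novembre".
        System.out.println("FRANCE: " + parses(date, pattern, Locale.FRANCE));
    }
}
```

A SimpleDateFormat created without an explicit locale uses the JVM default, which is how the same config can work on one machine and fail on another.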
Then I found that the XML data files contained characters that are not allowed in XML:
Caused by: java.lang.RuntimeException: com.ctc.wstx.exc.WstxUnexpectedCharException: Illegal character ((CTRL-CHAR, code 11))
At this point, I realized that 15 minutes of running the JRuby/Nokogiri XML parser was better than creating another transformer to strip all the illegal characters. Hopefully, DIH will work better on your dataset.
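If you do want to clean the data, one option is to strip the illegal characters from the files before DIH ever reads them. I did not go down this road, so take the following as a rough sketch only. It keeps the characters allowed by the XML 1.0 Char production (tab, line feed, carriage return, and the ranges #x20-#xD7FF, #xE000-#xFFFD, #x10000-#x10FFFF) and drops everything else, including the code 11 control character from the error above. The class and method names (XmlCharCleaner, isLegalXmlChar, strip) are made up for illustration.

```java
// Hypothetical pre-processing helper, not part of Solr or DIH.
class XmlCharCleaner {
    // True if the code point is allowed by the XML 1.0 Char production.
    static boolean isLegalXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Returns a copy of the input with every illegal code point removed.
    static String strip(String input) {
        StringBuilder out = new StringBuilder(input.length());
        input.codePoints()
            .filter(XmlCharCleaner::isLegalXmlChar)
            .forEach(out::appendCodePoint);
        return out.toString();
    }

    public static void main(String[] args) {
        // Code 11 (vertical tab) is the character from the Woodstox error message.
        String dirty = "<title>Coherent" + (char) 11 + " Phonon Dynamics</title>";
        System.out.println(strip(dirty));
    }
}
```

You would apply strip to the contents of each file (read as text in the right encoding) and write the result to a cleaned copy of the directory, then point baseDir at that copy.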
