Indexing XML files using DIH in Solr 1.4
February 8, 2010
I have a large set of XML files (110K documents, ~900 MB) that I wanted to import into Solr as fast as possible.
Previously I tried JRuby/Nokogiri with an embedded Solr connection, but that took about 15 minutes. So I tried Solr's built-in DataImportHandler (DIH) to speed up imports (and re-imports). Here's how to do it:
Assume you have a large list of XML records like:
<awardlist>
<award>
<awardnumber>0706313</awardnumber>
<title>Coherent Phonon Dynamics in Semiconductors and Nanotubes</title>
<expirationdate>November 30, 2009</expirationdate>
</award>
<award>
<awardnumber>9909156</awardnumber>
<title>Sustainability of Arctic Communities: Advancing the Science of Integrated Assessment </title>
<expirationdate>November 30, 2009</expirationdate>
</award>
</awardlist>
Add a request handler to your solrconfig.xml file:
<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
<lst name="defaults">
<str name="config">xml-data-config.xml</str>
</lst>
</requestHandler>
Create the xml-data-config.xml file:
<dataConfig>
<dataSource type="FileDataSource" />
<document>
<entity name="nsfgrantsdir" rootEntity="false" dataSource="null"
processor="FileListEntityProcessor"
fileName="^.*\.xml$" recursive="false"
baseDir="/data/rw/data/nsf_grants_xml"
>
<entity name="nsf-grants"
pk="id"
datasource="nsfgrantsdir"
url="${nsfgrantsdir.fileAbsolutePath}"
processor="XPathEntityProcessor"
forEach="/awardlist/award"
transformer="DateFormatTransformer, RegexTransformer">
<field column="id" xpath="/awardlist/award/awardnumber" />
<field column="title_s" xpath="/awardlist/award/title"/>
<field column="expirationdate_dt" xpath="/awardlist/award/expirationdate" dateTimeFormat="MMMMM dd, yyyy" />
</entity>
</entity>
</document>
</dataConfig>
The first entity block lists every XML file in /data/rw/data/nsf_grants_xml and feeds each file's absolute path to the second entity block, which parses the file and emits one Solr document per award element.
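To make the two-stage flow concrete, here is a rough plain-Java sketch of what the nested entities do conceptually. It is not how DIH is implemented (XPathEntityProcessor streams the file, while this sketch loads a DOM), and the class and method names are hypothetical. The element names come from the sample records above.

```java
import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical illustration class; not part of Solr.
final class NestedEntitySketch {

    // Outer entity: list the files in baseDir whose names match the fileName regex.
    static List<File> listXmlFiles(File baseDir) {
        Pattern fileName = Pattern.compile("^.*\\.xml$");
        List<File> matches = new ArrayList<>();
        File[] children = baseDir.listFiles();
        if (children == null) {
            return matches; // not a directory, or not readable
        }
        for (File f : children) {
            if (f.isFile() && fileName.matcher(f.getName()).matches()) {
                matches.add(f);
            }
        }
        return matches;
    }

    // Inner entity: one row per forEach node, one column per field xpath.
    static List<String[]> readAwards(InputStream xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(xml);
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList awards =
                (NodeList) xpath.evaluate("/awardlist/award", doc, XPathConstants.NODESET);
        List<String[]> rows = new ArrayList<>();
        for (int i = 0; i < awards.getLength(); i++) {
            Node award = awards.item(i);
            rows.add(new String[] {
                xpath.evaluate("awardnumber", award),    // -> id
                xpath.evaluate("title", award).trim(),   // -> title_s
                xpath.evaluate("expirationdate", award)  // -> expirationdate_dt (still a string here)
            });
        }
        return rows;
    }

    public static void main(String[] args) throws Exception {
        String sample = "<awardlist><award><awardnumber>0706313</awardnumber>"
                + "<title>Coherent Phonon Dynamics in Semiconductors and Nanotubes</title>"
                + "<expirationdate>November 30, 2009</expirationdate></award></awardlist>";
        InputStream in = new ByteArrayInputStream(sample.getBytes(StandardCharsets.UTF_8));
        for (String[] row : readAwards(in)) {
            System.out.println(String.join(" | ", row));
        }
    }
}
```

In a real run you would call listXmlFiles on the baseDir and pass each file's stream to readAwards; DIH does the equivalent wiring through ${nsfgrantsdir.fileAbsolutePath}.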
Once the config files are in place, start Solr, visit the dataimport admin page at http://localhost:8983/solr/admin/dataimport.jsp?handler=/dataimport and click “Full-import”.
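If you would rather script the import than click through the admin page, the handler also accepts a command parameter over HTTP. The sketch below assumes a single-core Solr at http://localhost:8983/solr with the handler registered at /dataimport as above; the class and method names are made up for illustration.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; assumes the /dataimport handler from solrconfig.xml above.
final class DihCommand {

    // Builds e.g. http://localhost:8983/solr/dataimport?command=full-import
    static String commandUrl(String solrBase, String handler, String command) {
        return solrBase + handler + "?command=" + command;
    }

    // Sends a GET request and returns the response body (Solr's XML status report).
    static String send(String url) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("GET");
        StringBuilder body = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                body.append(line).append('\n');
            }
        } finally {
            conn.disconnect();
        }
        return body.toString();
    }

    public static void main(String[] args) throws Exception {
        String base = "http://localhost:8983/solr"; // assumption: default Jetty port, single core
        System.out.println(send(commandUrl(base, "/dataimport", "full-import")));
        // The import runs in the background; poll "status" to see progress.
        System.out.println(send(commandUrl(base, "/dataimport", "status")));
    }
}
```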
This is where the import should just magically work for you. Unfortunately, I ran into several errors during the import. The first problem was that the DateFormatTransformer couldn't parse the dates correctly, because it assumes the documents use the same locale as your machine. Feeling like a good open source citizen, I submitted a patch.
Then I found that the XML data files contained characters that are not allowed in XML:
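The locale issue is easy to reproduce outside Solr. This is a minimal sketch using plain java.text.SimpleDateFormat with the same pattern as the config. It is not the DateFormatTransformer code itself, and the class name is hypothetical. Month names such as "November" only parse when the formatter's locale knows them, so a formatter built with the machine's default locale can fail on English dates.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

// Hypothetical demo class; illustrates locale-sensitive month-name parsing.
final class LocaleDateSketch {

    // Parses text with the pattern from the data config, using an explicit locale.
    static Date parse(String text, Locale locale) throws ParseException {
        SimpleDateFormat format = new SimpleDateFormat("MMMMM dd, yyyy", locale);
        format.setLenient(false);
        return format.parse(text);
    }

    public static void main(String[] args) {
        String sample = "November 30, 2009";
        // Try the same string under two locales; a locale without the English
        // month name is expected to reject it.
        for (Locale locale : new Locale[] { Locale.ENGLISH, Locale.FRENCH }) {
            try {
                System.out.println(locale + " -> " + parse(sample, locale));
            } catch (ParseException e) {
                System.out.println(locale + " -> could not parse: " + e.getMessage());
            }
        }
    }
}
```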
Caused by: java.lang.RuntimeException: com.ctc.wstx.exc.WstxUnexpectedCharException: Illegal character ((CTRL-CHAR, code 11))
Now I had to strip all the illegal characters. I found this helpful Java code snippet from lucas song on the Solr mailing list. I created a separate Java application that reads each XML file as a string, processes it with XmlCharFilter.doFilter, and writes the output to a file. The Apache commons-io project provides FileUtils.readFileToString and FileUtils.writeStringToFile to handle the file reading and writing.
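public class XmlCharFilter {
    // Returns a copy of the input with characters that are illegal in XML 1.0 removed.
    public static String doFilter(String in) {
        StringBuffer out = new StringBuffer(); // Used to hold the output.
        char current; // Used to reference the current character.
        if (in == null || ("".equals(in)))
            return ""; // vacancy test.
        for (int i = 0; i < in.length(); i++) {
            current = in.charAt(i);
            // Keep tab, line feed, carriage return and the legal XML 1.0 ranges.
            if ((current == 0x9) || (current == 0xA) || (current == 0xD)
                    || ((current >= 0x20) && (current <= 0xD7FF))
                    || ((current >= 0xE000) && (current <= 0xFFFD))
                    || ((current >= 0x10000) && (current <= 0x10FFFF)))
                out.append(current);
        }
        return out.toString();
    }
}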
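The separate command-line application could look roughly like the sketch below. It is an approximation, not the original program: it uses java.nio.file instead of commons-io so that it has no dependencies, it assumes the files are UTF-8, and it walks the string by code point (so valid supplementary characters survive and only lone surrogates are dropped), which differs slightly from the char-by-char filter. All names are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical command-line tool: copies *.xml from one directory to another,
// dropping characters that XML 1.0 does not allow.
final class XmlCleanerApp {

    // Keeps tab, LF, CR and the legal XML 1.0 ranges; works on code points.
    static String stripIllegalXmlChars(String in) {
        StringBuilder out = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); ) {
            int cp = in.codePointAt(i);
            i += Character.charCount(cp);
            boolean legal = cp == 0x9 || cp == 0xA || cp == 0xD
                    || (cp >= 0x20 && cp <= 0xD7FF)
                    || (cp >= 0xE000 && cp <= 0xFFFD)
                    || (cp >= 0x10000 && cp <= 0x10FFFF);
            if (legal) {
                out.appendCodePoint(cp);
            }
        }
        return out.toString();
    }

    // Reads each .xml file in inDir (assumed UTF-8) and writes the cleaned copy to outDir.
    static void cleanDirectory(Path inDir, Path outDir) throws IOException {
        Files.createDirectories(outDir);
        try (DirectoryStream<Path> files = Files.newDirectoryStream(inDir, "*.xml")) {
            for (Path file : files) {
                String raw = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
                byte[] cleaned = stripIllegalXmlChars(raw).getBytes(StandardCharsets.UTF_8);
                Files.write(outDir.resolve(file.getFileName()), cleaned);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java XmlCleanerApp <input-dir> <output-dir>");
            return;
        }
        cleanDirectory(Paths.get(args[0]), Paths.get(args[1]));
    }
}
```

Writing to a separate output directory keeps the original files untouched, so you can point the DIH baseDir at the cleaned copies and rerun the import.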
After filtering out the illegal characters, the DIH import ran pretty fast: around 5 minutes for 120K documents (400 MB) on my MacBook Pro.
You might want to look at vtd-xml for the best possible processing option. It has a Java port, so you can use it natively in JRuby: http://vtd-xml.sf.net
Hi, I came across your bike traffic light problem blog, and then this site, and it's been very helpful. Thanks for posting it. I'm finishing an engineering degree in Mumbai, India. I wonder if you could help me out with my final-year project, which is related to your bike problem. Sagar
Hey Sagar, feel free to post any questions related to the traffic light detection on this post: http://tommy.chheng.com/index.php/2009/05/visio… I'll try to answer, or maybe someone else will come across it and offer an answer.
Hey, what is the application_id supposed to be? I mean, how/where will I get it? Cheers.
application_id? There is no application_id reference in Solr's DIH.
Thanks for this post! Does the XPath parser in the config file require full paths to the data? I am on Windows and can't seem to make this work correctly. Adam
I would recommend the full path, as I'm not sure what the working path is (solr.home?).
Hi, I get an error about requestHandler: missing mandatory attribute 'class'. I think a requestHandler in the solrconfig file needs a class attribute. How can I handle this problem? Mehmet
OK, I solved the problem, thanks. There is a mistake in your post: in the “xml-data-config.xml” file you wrote forEach=”/awardslist/award”, but in the XML it is <awardlist> (not awardslist; without the 's').
SEVERE: Exception while processing: nsfgrantsdir document : null:org.apache.solr.handler.dataimport.DataImportHandlerException: Parsing failed for xml, url:D:\Sprint3\temp\search.xml rows processed:0 Processing Document # 1
Why am I getting this error? I followed the blog.
public class XmlCharFilter: where is this file available in Solr?
Dear Tommy,
Much thanks for an excellent posting. I am also having the same issues with illegal characters in my XML but I can’t quite follow the changes I need to make to the Java code.
Would it be possible to provide a little bit more description of the steps required to get these illegal characters filtered within Solr?
Thanks again for some very helpful information in your blog,
Paul
Hi Paul, can you list the steps you tried so I can see what's going wrong?
Hi Tommy, I am kind of stuck with DIH. The Solr app is not starting in Tomcat, and the Tomcat log shows java.lang.RuntimeException: [solrconfig.xml] requestHandler: missing mandatory attribute ‘class’
at org.apache.solr.common.util.DOMUtil.getAttr(DOMUtil.java:72)
at org.apache.solr.common.util.DOMUtil.getAttr(DOMUtil.java:79)
at org.apache.solr.core.PluginInfo.<init>(PluginInfo.java:53)
at org.apache.solr.core.SolrConfig.readPluginInfos(SolrConfig.java:220)
at org.apache.solr.core.SolrConfig.loadPluginInfo(SolrConfig.java:212)
at org.apache.solr.core.SolrConfig.<init>(SolrConfig.java:184)
at org.apache.solr.core.CoreContainer$Initializer.initialize(CoreContainer.java:134)
at org.apache.solr.servlet.SolrDispatchFilter.init(SolrDispatchFilter.java:83)
———————-MY dataconfig.xml————————————————-
______________________________________________________________
Hi Tommy, much thanks again for the excellent online help and advice.
Your posting suggested “adding a FileUtils.readFileToString and FileUtils.writeStringToFile to quickly rewrite the xml files with validated characters”.
Apologies, but I am not an expert Java programmer so was not sure what code to add or modify in my Solr 1.4.1 release?
Sorry if I missed something. I followed all the other steps in your posting and everything is working well apart from the occasional XML file that contains illegal characters.
Thanks again,
Paul
Hey Paul,
I created a separate command-line Java application that iterates over the directory of XML files, reads each file as a string (using FileUtils.readFileToString from the Apache commons-io package), passes it through the doFilter function listed above, and saves the result to a new file.
Hi Tommy, now I understand. I thought you might be updating the Solr source with the Java snippet, but a command-line app looks like a great way to go.
Much thanks again for the helpful and clear instructions on how to use Solr DIH.
Paul
I have this config file at solr/config/myconfigfile.xml
How would I write the baseDir so it finds the files I want to index at solr/data/Afolder/files here.xml
My attempt was: