<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom"><title>The GSoC Diaries</title><link href="http://navinpai.github.io/gsoc-blog/" rel="alternate"></link><link href="http://navinpai.github.io/gsoc-blog/feeds/gsoc.atom.xml" rel="self"></link><id>http://navinpai.github.io/gsoc-blog/</id><updated>2015-05-27T11:30:00+02:00</updated><entry><title>Elasticsearch Basics</title><link href="http://navinpai.github.io/gsoc-blog/elasticsearch-basics.html" rel="alternate"></link><updated>2015-05-27T11:30:00+02:00</updated><author><name>Navin Pai</name></author><id>tag:navinpai.github.io,2015-05-27:gsoc-blog/elasticsearch-basics.html</id><summary type="html">&lt;p&gt;Elasticsearch is an amazing fulltext search engine which is used for searching JSONpedia. It is written in Java and &lt;a href="https://github.com/elastic/elasticsearch"&gt;is opensource&lt;/a&gt; and also provides a pretty neat REST interface as well. Also, it supports faceting out of the box, which makes it an ideal search engine candidate for JSONpedia. It has been in development for about half a decade, and it uses the amazing &lt;a href="http://lucene.apache.org/"&gt;Lucene&lt;/a&gt; library behind the scenes.&lt;/p&gt;
&lt;h4&gt;Installation&lt;/h4&gt;
&lt;p&gt;Installing Elasticsearch is as easy as pie. You can either download the zip file from the site and use it like &lt;a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/setup-service.html"&gt;a normal executable&lt;/a&gt; or, as I prefer, download the precompiled package and use it &lt;a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/setup.html"&gt;as a service&lt;/a&gt;. Using it as a service is as simple as running &lt;code&gt;sudo service elasticsearch start&lt;/code&gt;.&lt;/p&gt;
&lt;h4&gt;Ports&lt;/h4&gt;
&lt;p&gt;The 2 main ports elasticsearch uses are 9200 and 9300. The 9300 port is used by the Java API to communicate with with cluster whereas the 9200 port is used to communicate with the cluster using a REST API.&lt;/p&gt;
&lt;h4&gt;Basic Querying&lt;/h4&gt;
&lt;p&gt;Since the REST API is accessed over port 9200, we can query the service running using CURL on the endpoint. Queries using CURL to Elasticsearch cluster are of the form:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;curl -X&amp;lt;VERB&amp;gt; &lt;span class="s1"&gt;&amp;#39;&amp;lt;PROTOCOL&amp;gt;://&amp;lt;HOST&amp;gt;/&amp;lt;PATH&amp;gt;?&amp;lt;QUERY_STRING&amp;gt;&amp;#39;&lt;/span&gt; -d &lt;span class="s1"&gt;&amp;#39;&amp;lt;BODY&amp;gt;&amp;#39;&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;The first query we almost always run is simply to check the health of the cluster. You can do this by querying:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;curl -X GET &lt;span class="s1"&gt;&amp;#39;http://localhost:9200&amp;#39;&lt;/span&gt;

&lt;span class="c"&gt;### sample output&lt;/span&gt;

&lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="s2"&gt;&amp;quot;status&amp;quot;&lt;/span&gt; : 200,
  &lt;span class="s2"&gt;&amp;quot;name&amp;quot;&lt;/span&gt; : &lt;span class="s2"&gt;&amp;quot;Jean DeWolff&amp;quot;&lt;/span&gt;,
  &lt;span class="s2"&gt;&amp;quot;version&amp;quot;&lt;/span&gt; : &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="s2"&gt;&amp;quot;number&amp;quot;&lt;/span&gt; : &lt;span class="s2"&gt;&amp;quot;1.0.1&amp;quot;&lt;/span&gt;,
    &lt;span class="s2"&gt;&amp;quot;build_hash&amp;quot;&lt;/span&gt; : &lt;span class="s2"&gt;&amp;quot;5c03844e1978e5cc924dab2a423dc63ce881c42b&amp;quot;&lt;/span&gt;,
    &lt;span class="s2"&gt;&amp;quot;build_timestamp&amp;quot;&lt;/span&gt; : &lt;span class="s2"&gt;&amp;quot;2014-02-25T15:52:53Z&amp;quot;&lt;/span&gt;,
    &lt;span class="s2"&gt;&amp;quot;build_snapshot&amp;quot;&lt;/span&gt; : &lt;span class="nb"&gt;false&lt;/span&gt;,
    &lt;span class="s2"&gt;&amp;quot;lucene_version&amp;quot;&lt;/span&gt; : &lt;span class="s2"&gt;&amp;quot;4.6&amp;quot;&lt;/span&gt;
  &lt;span class="o"&gt;}&lt;/span&gt;,
  &lt;span class="s2"&gt;&amp;quot;tagline&amp;quot;&lt;/span&gt; : &lt;span class="s2"&gt;&amp;quot;You Know, for Search&amp;quot;&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;Next up, we want to count the number of documents in the cluster. For this we use:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;curl -XGET &lt;span class="s1"&gt;&amp;#39;http://localhost:9200/_count?pretty&amp;#39;&lt;/span&gt; -d &lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;
&lt;span class="s1"&gt;{&lt;/span&gt;
&lt;span class="s1"&gt;    &amp;quot;query&amp;quot;: {&lt;/span&gt;
&lt;span class="s1"&gt;        &amp;quot;match_all&amp;quot;: {}&lt;/span&gt;
&lt;span class="s1"&gt;    }&lt;/span&gt;
&lt;span class="s1"&gt;}&lt;/span&gt;
&lt;span class="s1"&gt;&amp;#39;&lt;/span&gt;

&lt;span class="c"&gt;### sample output&lt;/span&gt;


&lt;span class="o"&gt;{&lt;/span&gt;
  &lt;span class="s2"&gt;&amp;quot;count&amp;quot;&lt;/span&gt; : 12485,
  &lt;span class="s2"&gt;&amp;quot;_shards&amp;quot;&lt;/span&gt; : &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="s2"&gt;&amp;quot;total&amp;quot;&lt;/span&gt; : 5,
    &lt;span class="s2"&gt;&amp;quot;successful&amp;quot;&lt;/span&gt; : 5,
    &lt;span class="s2"&gt;&amp;quot;failed&amp;quot;&lt;/span&gt; : 0
  &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;Similarly, we can run different CRUD queries as well &lt;a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/search-search.html"&gt;as shown here&lt;/a&gt;. &lt;/p&gt;
&lt;h3&gt;Troubleshooting&lt;/h3&gt;
&lt;p&gt;Usually, errors with elasticsearch are either port related or process related. Remember your ports correctly. If CURL returns &lt;code&gt;empty reply from server&lt;/code&gt;, then the issue is most likely this. I broke my head on this for setup. Besides this, you should remember that data is stored in the &lt;code&gt;{path.home}/data&lt;/code&gt; location. So if you are adding bulk data to elasticsearch, you can also observe the size of the data files to know whether or not you are adding data correctly.&lt;/p&gt;</summary><category term="gsoc"></category><category term="dbpedia"></category></entry><entry><title>REST API (Part-I)</title><link href="http://navinpai.github.io/gsoc-blog/rest-api-part-i.html" rel="alternate"></link><updated>2015-05-17T19:27:00+02:00</updated><author><name>Navin Pai</name></author><id>tag:navinpai.github.io,2015-05-17:gsoc-blog/rest-api-part-i.html</id><summary type="html">&lt;p&gt;JSONpedia exposes a set of REST APIs which makes it very convenient to access the data that is stored. An easy place to see the APIs available &lt;em&gt;(along with their respective options)&lt;/em&gt; is on the frontend web interface &lt;em&gt;(a live version of this is available &lt;a href="https://jsonpedia.org"&gt;here&lt;/a&gt;)&lt;/em&gt;. For this post, I will focus on the &lt;code&gt;/annotate/resource&lt;/code&gt; API. This API is used to convert MediaWiki documents to JSON format.&lt;/p&gt;
&lt;h4&gt;How to run&lt;/h4&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span class="c"&gt;# GET/POST /annotate/&amp;lt;format&amp;gt;/&amp;lt;uri&amp;gt; &lt;/span&gt;

GET      /annotate/resource/json/en:Albert_Einstein
POST     /annotate/resource/json/en:Albert_Einstein &lt;span class="c"&gt;# (with WikiText content to be converted as POST param)&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;The supported formats which this API can return are vanilla JSON and rendered HTML.&lt;/p&gt;
&lt;p&gt;We can also provide &lt;code&gt;filters&lt;/code&gt; and &lt;code&gt;processors&lt;/code&gt; if needed with our API call using the &lt;code&gt;filter&lt;/code&gt; and &lt;code&gt;procs&lt;/code&gt; parameters respectively. Multiple processors can be used by passing a comma seperated set of the processors we want to use. Processors can be Extractors, Online Extractors &lt;em&gt;(which rely on external services DBpedia and Freebase)&lt;/em&gt; or Splitters.&lt;/p&gt;
&lt;h4&gt;Behind the Scenes&lt;/h4&gt;
&lt;p&gt;The Annotation service is a JAX-RS service, defined in &lt;code&gt;DefaultAnnotationService&lt;/code&gt;. The &lt;code&gt;annotateDocumentSource&lt;/code&gt; function uses the enrichEntity function of the &lt;code&gt;WikiPipeline&lt;/code&gt; object to generate a JSON serialized object. The powerhorse of this process is the &lt;code&gt;writeDocumentSerialization&lt;/code&gt; function of &lt;code&gt;WikiPipeline&lt;/code&gt; which does a lot of the processing of the document.&lt;/p&gt;
&lt;p&gt;The output of &lt;code&gt;enrichEntity&lt;/code&gt; function writes the document serialization to the serializer, followed by the extractors serialization and the splitters' serialization. This is the JSON object we want. This buffer is then passed to the &lt;code&gt;toOutputFormat&lt;/code&gt; function where depending on the format &lt;em&gt;(JSON or HTML)&lt;/em&gt;, the respective response is generated and sent back to the caller. The &lt;code&gt;createResultFilteredObject&lt;/code&gt; function of &lt;code&gt;JSONUtils&lt;/code&gt; is used to filter the nodes &lt;em&gt;(using the
&lt;code&gt;DefaultJSONFilterEngine.applyFilter&lt;/code&gt; function internally)&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;The result, as mentioned at the start of the post,is either JSON or directly renderable HTML.&lt;/p&gt;</summary><category term="gsoc"></category><category term="dbpedia"></category></entry><entry><title>CSV exporter</title><link href="http://navinpai.github.io/gsoc-blog/csv-exporter.html" rel="alternate"></link><updated>2015-05-13T10:13:00+02:00</updated><author><name>Navin Pai</name></author><id>tag:navinpai.github.io,2015-05-13:gsoc-blog/csv-exporter.html</id><summary type="html">&lt;p&gt;The CSV exporter is a command line tool which allows you  to convert Wikipedia dumps to tabular data generated from page parsing.&lt;/p&gt;
&lt;h4&gt;How to run&lt;/h4&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span class="c"&gt;# java -cp build/libs/jsonpedia-{VERSION}.jar com.machinelinking.cli.exporter  --prefix page-prefix --in input-dump-file  --out output-csv-file --threads number-of-threads&lt;/span&gt;
java -cp build/libs/jsonpedia-&lt;span class="o"&gt;{&lt;/span&gt;VERSION&lt;span class="o"&gt;}&lt;/span&gt;.jar com.machinelinking.cli.exporter  --prefix http://en.wikipedia.org --in src/test/resources/dumps/enwiki-latest-pages-articles-p1.xml.gz  --out out.csv --threads 1
&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;What goes on behind the scenes when you run this command is:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;The exporter class parses all the command line options passed to it and then creates an exporter of the class &lt;code&gt;CSVExporter&lt;/code&gt; which is a child class of &lt;code&gt;WikiDumpMultiThreadProcessor&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;The export function of &lt;code&gt;CSVExporter&lt;/code&gt; is called, which creates a &lt;code&gt;BufferedInputStream&lt;/code&gt; object of the input stream &lt;em&gt;(assuming it isn't already of the type)&lt;/em&gt; and then calls the &lt;code&gt;process&lt;/code&gt; method of &lt;code&gt;WikiDumpMultiThreadProcessor&lt;/code&gt;. Here we do:&lt;ol&gt;
&lt;li&gt;Create &lt;code&gt;n&lt;/code&gt; processor objects, where processor is of the required class (eg.TemplatePropertyProcessor()) and &lt;code&gt;n&lt;/code&gt; is number of threads&lt;/li&gt;
&lt;li&gt;In each processor, get individual pages and run &lt;code&gt;processPage&lt;/code&gt; which works on the &lt;code&gt;WikiPage&lt;/code&gt; to extract data. If also uses the &lt;code&gt;WikiTextParser&lt;/code&gt; to parse the text. &lt;/li&gt;
&lt;li&gt;Finally, a report is created (eg. &lt;code&gt;CSVExporterReport&lt;/code&gt;) and returned&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h4&gt;So this means...&lt;/h4&gt;
&lt;p&gt;This now gives us a CSV output that we need from the Wikipedia dump we provided. &lt;/p&gt;
&lt;h4&gt;Random Code-fu&lt;/h4&gt;
&lt;p&gt;Came across an interesting way to get the best number of threads to use for threading. We make use of the &lt;code&gt;Runtime&lt;/code&gt; to get the number of available processors. Didn't know you could do this! In the codebase, you will find the following code: &lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span class="kd"&gt;protected&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="nf"&gt;getBestNumberOfThreads&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;candidate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Runtime&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getRuntime&lt;/span&gt;&lt;span class="o"&gt;().&lt;/span&gt;&lt;span class="na"&gt;availableProcessors&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;candidate&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;MIN_NUM_OF_THREADS&lt;/span&gt; &lt;span class="o"&gt;?&lt;/span&gt; &lt;span class="n"&gt;MIN_NUM_OF_THREADS&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;candidate&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
  &lt;span class="o"&gt;}&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;</summary><category term="gsoc"></category><category term="dbpedia"></category></entry><entry><title>facet_loader.py</title><link href="http://navinpai.github.io/gsoc-blog/facet-loader.html" rel="alternate"></link><updated>2015-05-08T10:19:00+02:00</updated><author><name>Navin Pai</name></author><id>tag:navinpai.github.io,2015-05-08:gsoc-blog/facet-loader.html</id><summary type="html">&lt;p&gt;Next up, I'm looking at &lt;strong&gt;facet_loader.py&lt;/strong&gt;, which runs the Elasticsearch facet manager&lt;/p&gt;
&lt;h4&gt;How to run&lt;/h4&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span class="c"&gt;# bin/facet_loader.py -s &amp;lt;source-URI&amp;gt; -d &amp;lt;destination-URI&amp;gt; -l &amp;lt;limit-num&amp;gt; -c &amp;lt;config-file&amp;gt; &lt;/span&gt;
&lt;span class="nb"&gt;bin&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;facet_loader&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;py&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="n"&gt;localhost&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;9300&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;jsonpedia_test_load&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;en&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="n"&gt;localhost&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;9300&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;jsonpedia_test_facet&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;en&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;l&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="n"&gt;conf&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;faceting&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;properties&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;facet_loader.py&lt;/strong&gt; is a strightforward script which calls:&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span class="nv"&gt;MAVEN_OPTS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;-Xms8g -Xmx8g -Dlog4j.configuration=file:conf/log4j.properties&amp;#39;&lt;/span&gt; mvn &lt;span class="nb"&gt;exec&lt;/span&gt;:java -Dexec.mainClass&lt;span class="o"&gt;=&lt;/span&gt;com.machinelinking.cli.facetloader -Dexec.args&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;&amp;#39;-s localhost:9300:jsonpedia_test_load:en -d localhost:9300:jsonpedia_test_facet:en -l 100 -c conf/faceting.properties&amp;#39;&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;The facetloader class does the following:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Create &lt;code&gt;fromStorage&lt;/code&gt; and &lt;code&gt;facetStorage&lt;/code&gt; instance of &lt;code&gt;ElasticJSONStorage&lt;/code&gt; using the &lt;code&gt;ElasticJSONStorageFactory&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Create an instance of &lt;code&gt;DefaultElasticFacetConfiguration&lt;/code&gt; and &lt;code&gt;DefaultElasticFacetManager&lt;/code&gt; using this configuration.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;loadFacets&lt;/code&gt; method of the &lt;code&gt;ElasticFacetManager&lt;/code&gt; is called, which converts each document from the &lt;code&gt;fromStorage&lt;/code&gt; using the provided &lt;code&gt;EnrichedEntityFacetConverter&lt;/code&gt; and puts it into the &lt;code&gt;destinationStorage&lt;/code&gt;. The converter is simply going through each document, and creating documents out of each section of the original document&lt;/li&gt;
&lt;/ol&gt;
&lt;h4&gt;So this means...&lt;/h4&gt;
&lt;p&gt;Now, we have elasticsearch documents for each section available with details such as &lt;code&gt;page&lt;/code&gt;,&lt;code&gt;section&lt;/code&gt;,&lt;code&gt;links&lt;/code&gt;, &lt;code&gt;content_stem&lt;/code&gt; etc.&lt;/p&gt;
&lt;p&gt;Next up, I'll be looking at the  CSV Export workflow and deep-diving into the code.&lt;/p&gt;
&lt;p&gt;Also, I need to start work on a couple of issues in the issue tracker &lt;em&gt;(which has been long delayed at this point)&lt;/em&gt;&lt;/p&gt;</summary><category term="gsoc"></category><category term="dbpedia"></category></entry><entry><title>loader.py</title><link href="http://navinpai.github.io/gsoc-blog/loader-py.html" rel="alternate"></link><updated>2015-05-06T11:30:00+02:00</updated><author><name>Navin Pai</name></author><id>tag:navinpai.github.io,2015-05-06:gsoc-blog/loader-py.html</id><summary type="html">&lt;p&gt;I've started going through &lt;a href="https://bitbucket.org/hardest/jsonpedia"&gt;the codebase&lt;/a&gt; in a more structured manner, and thought that following workflows would be a good way to start. I plan to understand all the flows over the next few days. For a start, here's what happens when you run loader.py&lt;/p&gt;
&lt;h4&gt;How to run&lt;/h4&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span class="c"&gt;# bin/loader.py config-file [start-index:]end-index&lt;/span&gt;
&lt;span class="nb"&gt;bin&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;loader&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;py&lt;/span&gt; &lt;span class="n"&gt;conf&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;default&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;properties&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;loader.py&lt;/strong&gt; is a Python script which basically does the following:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Download the URLs for Wikipedia Dumps from the &lt;a href="http://dumps.wikimedia.org/enwiki/latest/"&gt;wikimedia dumps page&lt;/a&gt; using &lt;code&gt;get_latest_articles_list()&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Download the required dumps using the end-index &lt;em&gt;(and start-index, if provided)&lt;/em&gt; into the work directory using &lt;code&gt;download_file(url, directory, filename)&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Ingest the file using &lt;code&gt;ingest_file(config, filename)&lt;/code&gt; which basically spawns a subprocess that runs &lt;code&gt;gradle runLoader -Pconfig=config -Pdump=filename 2&amp;gt;&amp;amp;1 &amp;gt; filename.log&lt;/code&gt;&lt;ol&gt;
&lt;li&gt;runLoader is a gradle task which calls &lt;code&gt;com.machinelinking.cli.loader&lt;/code&gt;&lt;ol&gt;
&lt;li&gt;&lt;code&gt;flags&lt;/code&gt; is a list of &lt;code&gt;Flag&lt;/code&gt;, each of which enables or disables Extractors, Linkers, Splitters, Validators etc. Default config file has &lt;code&gt;Extractors, Structure&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;jsonStorageFactory&lt;/code&gt; is an instance of the JSONStorageFactory. we use to store. Default config file has &lt;code&gt;com.machinelinking.storage.MultiJSONStorageFactory&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;jsonStorageConfig&lt;/code&gt; is of form &lt;code&gt;&amp;lt;store-factory 1&amp;gt;|&amp;lt;loader.storage.config 1&amp;gt;;&amp;lt;store-factory 2&amp;gt;|&amp;lt;loader.storage.config 2&amp;gt;&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;prefixURL&lt;/code&gt; is simply a prefix URL.&lt;/li&gt;
&lt;li&gt;Finally, we call &lt;code&gt;loader[0].load(prefixURL, inputstream)&lt;/code&gt; which internally calls &lt;code&gt;process(prefixURL, inputstream)&lt;/code&gt; of    &lt;code&gt;WikiDumpMultiThreadProcessor&lt;/code&gt; which uses a SAX parser (in &lt;code&gt;WikiDumpParser&lt;/code&gt;) to parse the data.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;WikiDumpRunnable&lt;/code&gt; calls the &lt;code&gt;processPage(pagePrefix, threadid,page)&lt;/code&gt; function which uses the over-riden &lt;code&gt;processPage()&lt;/code&gt; method of the nested &lt;code&gt;EnrichmentProcessor&lt;/code&gt;, which adds the document into the Mongostorage using the &lt;code&gt;MongoDBDataEncoder dataEncoder&lt;/code&gt; and &lt;code&gt;JSONStorageConnection connection&lt;/code&gt; after it is enriched using &lt;code&gt;enrichEntity(DocumentSource source, Serializer serializer)&lt;/code&gt; method of &lt;code&gt;WikiPipeline&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;h4&gt;So this means...&lt;/h4&gt;
&lt;p&gt;That's a of stuff happening under the hood i.e at step 3. :-)&lt;/p&gt;
&lt;p&gt;However, at the end of this simple command, we have achieved quite a bit.&lt;/p&gt;
&lt;p&gt;That's all for now... Next up, I'll be looking at &lt;code&gt;facet_loader.py&lt;/code&gt; and the CSV Export workflows.&lt;/p&gt;
&lt;h4&gt;Random Code-fu&lt;/h4&gt;
&lt;p&gt;I learnt about the &lt;a href="http://java.dzone.com/articles/java-7-new-try-resources"&gt;try-with-resources&lt;/a&gt; statement while going through the codebase today. With this type of &lt;em&gt;try&lt;/em&gt;, we can actually provide closeable resources to the block, which are automatically closed after the try. Pretty nifty indeed. An example in &lt;code&gt;loader.java&lt;/code&gt;&lt;/p&gt;
&lt;div class="highlight"&gt;&lt;pre&gt;&lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="n"&gt;JSONStorage&lt;/span&gt; &lt;span class="n"&gt;storage&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;jsonStorageFactory&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;createStorage&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;storageConfig&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;loader&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="n"&gt;DefaultJSONStorageLoader&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;WikiPipelineFactory&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getInstance&lt;/span&gt;&lt;span class="o"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;flags&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;storage&lt;/span&gt;
                &lt;span class="o"&gt;);&lt;/span&gt;

    &lt;span class="kd"&gt;final&lt;/span&gt; &lt;span class="n"&gt;StorageLoaderReport&lt;/span&gt; &lt;span class="n"&gt;report&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;loader&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;].&lt;/span&gt;&lt;span class="na"&gt;load&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prefixURL&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;FileUtil&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;openDecompressedInputStream&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dumpFile&lt;/span&gt;&lt;span class="o"&gt;));&lt;/span&gt;

    &lt;span class="n"&gt;System&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;err&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;println&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Loading report: &amp;quot;&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;report&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

    &lt;span class="n"&gt;finalReportProduced&lt;/span&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/pre&gt;&lt;/div&gt;</summary><category term="gsoc"></category><category term="dbpedia"></category></entry><entry><title>Hello World</title><link href="http://navinpai.github.io/gsoc-blog/hello-world.html" rel="alternate"></link><updated>2015-05-03T10:23:00+02:00</updated><author><name>Navin Pai</name></author><id>tag:navinpai.github.io,2015-05-03:gsoc-blog/hello-world.html</id><summary type="html">&lt;p&gt;Hey there!&lt;/p&gt;
&lt;p&gt;I'm Navin, and this is where I will be posting updates about the stuff I learn/explore as part of GSoC 2015. &lt;/p&gt;
&lt;p&gt;I'm going to be working with the fine people at the DBpedia project &lt;em&gt;(to be more specific, the JSONpedia project)&lt;/em&gt; and will be working on enhancing the extractors used for extracting information from Wikipedia.&lt;/p&gt;
&lt;p&gt;For a quick overview of my project, visit &lt;a href="https://www.google-melange.com/gsoc/project/details/google/gsoc2015/lifeofnavin/5676830073815040"&gt;my project proposal page&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Other quick links are in the left menu.&lt;/p&gt;</summary><category term="gsoc"></category><category term="dbpedia"></category></entry></feed>