The GSoC Diaries

Elasticsearch Basics

2015-05-27T11:30:00+02:00

Elasticsearch is an amazing full-text search engine, and it is what JSONpedia uses for search. It is written in Java, is open source, and provides a pretty neat REST interface. It also supports faceting out of the box, which makes it an ideal search engine candidate for JSONpedia. It has been in development for about half a decade, and it uses the amazing Lucene library behind the scenes.

Installation

Installing Elasticsearch is as easy as pie. You can either download the zip file from the site and use it like a normal executable or, as I prefer, download the precompiled package and use it as a service. Using it as a service is as simple as running sudo service elasticsearch start.

Ports

The two main ports Elasticsearch uses are 9200 and 9300. Port 9300 is used by the Java API to communicate with the cluster, whereas port 9200 is used to communicate with the cluster over the REST API.

Basic Querying

Since the REST API is accessed over port 9200, we can query the running service with curl on that endpoint. curl queries to an Elasticsearch cluster take the form:

curl -X<VERB> '<PROTOCOL>://<HOST>/<PATH>?<QUERY_STRING>' -d '<BODY>'

The first query we almost always run simply checks that the cluster is up and responding. You can do this by querying:

curl -X GET 'http://localhost:9200'

### sample output

{
  "status" : 200,
  "name" : "Jean DeWolff",
  "version" : {
    "number" : "1.0.1",
    "build_hash" : "5c03844e1978e5cc924dab2a423dc63ce881c42b",
    "build_timestamp" : "2014-02-25T15:52:53Z",
    "build_snapshot" : false,
    "lucene_version" : "4.6"
  },
  "tagline" : "You Know, for Search"
}

Next up, we want to count the number of documents in the cluster. For this we use:

curl -XGET 'http://localhost:9200/_count?pretty' -d '
{
    "query": {
        "match_all": {}
    }
}
'

### sample output


{
  "count" : 12485,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  }
}
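The same count query can be sketched in Python with only the standard library. This is a rough sketch, not part of JSONpedia: the helper names build_request and send are made up for illustration, and it assumes a node listening on localhost:9200. Building the request is kept separate from sending it, so the request can be inspected without a running cluster.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9200"  # assumption: local node on the default REST port


def build_request(verb, path, body=None, base_url=BASE_URL):
    """Build (but do not send) a request shaped like the curl form above."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"Content-Type": "application/json"} if data is not None else {}
    return urllib.request.Request(
        base_url + path, data=data, headers=headers, method=verb
    )


def send(request, timeout=5):
    """Send the request and decode the JSON response (needs a live cluster)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Same query as the curl example: count every document in the cluster.
count_request = build_request("GET", "/_count", {"query": {"match_all": {}}})
```

Against a live cluster, send(count_request)["count"] should give the same number that the curl call reports.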

Similarly, we can run other CRUD queries, as shown here.

Troubleshooting

Usually, errors with Elasticsearch are either port related or process related. Double-check which port you are talking to: if curl returns "Empty reply from server", a port mix-up is the most likely cause. I banged my head against this during setup. Besides this, remember that data is stored in the {path.home}/data location, so if you are bulk-loading data into Elasticsearch, you can watch the size of the data files to check whether the data is actually being added.
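Both checks can be scripted. The sketch below is only an illustration: port_is_open and directory_size are made-up helper names, and the data directory has to be supplied by you, since {path.home} depends on how Elasticsearch was installed.

```python
import os
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def directory_size(path):
    """Total size in bytes of all regular files under path (0 if it is missing)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            file_path = os.path.join(root, name)
            if os.path.isfile(file_path):
                total += os.path.getsize(file_path)
    return total
```

Calling port_is_open("localhost", 9200) and port_is_open("localhost", 9300) shows which of the two ports is reachable, and calling directory_size on your data directory before and after a bulk load shows whether the files are growing.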

REST API (Part-I)

2015-05-17T19:27:00+02:00

JSONpedia exposes a set of REST APIs which make it very convenient to access the stored data. An easy place to see the available APIs (along with their respective options) is the frontend web interface (a live version of this is available here). For this post, I will focus on the /annotate/resource API. This API is used to convert MediaWiki documents to JSON format.

How to run

# GET/POST /annotate/resource/<format>/<uri>

GET      /annotate/resource/json/en:Albert_Einstein
POST     /annotate/resource/json/en:Albert_Einstein # (with WikiText content to be converted as POST param)

The supported formats which this API can return are vanilla JSON and rendered HTML.

We can also provide filters and processors with our API call, using the filter and procs parameters respectively. Multiple processors can be used by passing a comma-separated list of the processors we want. Processors can be Extractors, Online Extractors (which rely on the external services DBpedia and Freebase) or Splitters.
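To make the URL shape concrete, here is a small sketch that assembles such a call. The host and port are hypothetical, annotate_url is a made-up helper, and the processor names passed to procs are placeholders taken from the categories above, so check the frontend for the exact values and for the filter syntax.

```python
from urllib.parse import quote, urlencode

BASE_URL = "http://localhost:8080"  # hypothetical host; point this at your JSONpedia instance


def annotate_url(uri, fmt="json", procs=None, filter_expr=None, base_url=BASE_URL):
    """Build /annotate/resource/<format>/<uri> with optional procs and filter."""
    path = "/annotate/resource/{}/{}".format(quote(fmt), quote(uri, safe=":"))
    params = {}
    if procs:
        # Multiple processors are passed as one comma-separated value.
        params["procs"] = ",".join(procs)
    if filter_expr:
        # The filter expression is passed through as an opaque string.
        params["filter"] = filter_expr
    query = "?" + urlencode(params, safe=",") if params else ""
    return base_url + path + query
```

For example, annotate_url("en:Albert_Einstein", procs=["Extractors", "Splitters"]) produces a URL ending in /annotate/resource/json/en:Albert_Einstein?procs=Extractors,Splitters.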

Behind the Scenes

The Annotation service is a JAX-RS service, defined in DefaultAnnotationService. The annotateDocumentSource function uses the enrichEntity function of the WikiPipeline object to generate a JSON serialized object. The workhorse of this process is the writeDocumentSerialization function of WikiPipeline, which does a lot of the processing of the document.

The enrichEntity function writes the document serialization to the serializer, followed by the extractors' serialization and the splitters' serialization. This is the JSON object we want. The buffer is then passed to the toOutputFormat function where, depending on the format (JSON or HTML), the respective response is generated and sent back to the caller. The createResultFilteredObject function of JSONUtils is used to filter the nodes (using the DefaultJSONFilterEngine.applyFilter function internally).

The result, as mentioned at the start of the post, is either JSON or directly renderable HTML.

CSV exporter

2015-05-13T10:13:00+02:00

The CSV exporter is a command line tool which allows you to convert Wikipedia dumps to tabular data generated from page parsing.

How to run

# java -cp build/libs/jsonpedia-{VERSION}.jar com.machinelinking.cli.exporter  --prefix page-prefix --in input-dump-file  --out output-csv-file --threads number-of-threads
java -cp build/libs/jsonpedia-{VERSION}.jar com.machinelinking.cli.exporter  --prefix http://en.wikipedia.org --in src/test/resources/dumps/enwiki-latest-pages-articles-p1.xml.gz  --out out.csv --threads 1

What goes on behind the scenes when you run this command is:

The exporter class parses all the command line options passed to it and then creates an exporter of the class CSVExporter, which is a child class of WikiDumpMultiThreadProcessor.
The export function of CSVExporter is called, which wraps the input stream in a BufferedInputStream (unless it already is one) and then calls the process method of WikiDumpMultiThreadProcessor. Here we do:
1. Create n processor objects, where processor is of the required class (e.g. TemplatePropertyProcessor()) and n is the number of threads.
2. In each processor, get individual pages and run processPage, which works on the WikiPage to extract data. It also uses the WikiTextParser to parse the text.
3. Finally, a report is created (e.g. CSVExporterReport) and returned.
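The three steps above can be sketched in a few lines of Python. This is only an analogy to show the shape of the workflow, not how JSONpedia implements it: the page dictionaries, process_page and export_csv are all invented for the example.

```python
import csv
import io
import queue
import threading


def process_page(page):
    """Hypothetical stand-in for processPage: turn one page into CSV rows."""
    return [(page["title"], key, value) for key, value in sorted(page["properties"].items())]


def export_csv(pages, out, num_threads=2):
    """Run num_threads workers over a list of pages and write their rows to out."""
    work = queue.Queue()
    for page in pages:
        work.put(page)
    rows = []
    lock = threading.Lock()

    def worker():
        # Each worker keeps pulling pages until the queue is empty.
        while True:
            try:
                page = work.get_nowait()
            except queue.Empty:
                return
            page_rows = process_page(page)
            with lock:
                rows.extend(page_rows)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()

    # Sort so the output does not depend on thread scheduling.
    csv.writer(out).writerows(sorted(rows))
    return {"pages": len(pages), "rows": len(rows)}  # a tiny "report"
```

Passing an io.StringIO() as out is a convenient way to try this without touching the filesystem.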

So this means...

This gives us the CSV output we need from the Wikipedia dump we provided.

Random Code-fu

I came across an interesting way to pick the number of threads to use. We make use of the Runtime to get the number of available processors. Didn't know you could do this! In the codebase, you will find the following code:

protected int getBestNumberOfThreads() {
    final int candidate = Runtime.getRuntime().availableProcessors();
    return candidate < MIN_NUM_OF_THREADS ? MIN_NUM_OF_THREADS : candidate;
}
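For comparison, the same trick in Python might look like the sketch below. The value of MIN_NUM_OF_THREADS is an assumption here, since the post does not show the real constant.

```python
import os

MIN_NUM_OF_THREADS = 2  # assumption: the value of the real constant is not shown above


def best_number_of_threads(minimum=MIN_NUM_OF_THREADS):
    """Use one thread per available processor, but never fewer than minimum."""
    candidate = os.cpu_count() or 1  # os.cpu_count() may return None
    return max(candidate, minimum)
```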

facet_loader.py

2015-05-08T10:19:00+02:00

Next up, I'm looking at facet_loader.py, which runs the Elasticsearch facet manager.

How to run

# bin/facet_loader.py -s <source-URI> -d <destination-URI> -l <limit-num> -c <config-file> 
bin/facet_loader.py -s localhost:9300:jsonpedia_test_load:en -d localhost:9300:jsonpedia_test_facet:en -l 100 -c conf/faceting.properties

facet_loader.py is a straightforward script which calls:

MAVEN_OPTS='-Xms8g -Xmx8g -Dlog4j.configuration=file:conf/log4j.properties' mvn exec:java -Dexec.mainClass=com.machinelinking.cli.facetloader -Dexec.args='-s localhost:9300:jsonpedia_test_load:en -d localhost:9300:jsonpedia_test_facet:en -l 100 -c conf/faceting.properties'

The facetloader class does the following:

Create fromStorage and facetStorage instances of ElasticJSONStorage using the ElasticJSONStorageFactory.
Create an instance of DefaultElasticFacetConfiguration, and a DefaultElasticFacetManager using this configuration.
The loadFacets method of the ElasticFacetManager is called, which converts each document from the fromStorage using the provided EnrichedEntityFacetConverter and puts it into the destination storage. The converter simply goes through each document and creates a new document out of each section of the original document.

So this means...

Now we have Elasticsearch documents for each section, with details such as page, section, links, content_stem etc.
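As a rough picture of what "one document per section" means, here is a toy version of that conversion. The input layout (a page name plus a list of sections with title, links and content) is invented for the example and is not the real enriched-entity schema. The stemming behind content_stem is left out.

```python
def section_documents(page_doc):
    """Create one flat document per section of a (hypothetical) page document."""
    return [
        {
            "page": page_doc["page"],
            "section": section["title"],
            "links": list(section.get("links", [])),
            "content": section.get("content", ""),
        }
        for section in page_doc.get("sections", [])
    ]
```

A page with two sections yields two small documents, each carrying the page name so that facets can still be grouped by page.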

Next up, I'll be looking at the CSV Export workflow and deep-diving into the code.

Also, I need to start work on a couple of issues in the issue tracker (which has been long delayed at this point).

loader.py

2015-05-06T11:30:00+02:00

I've started going through the codebase in a more structured manner, and thought that following workflows would be a good way to start. I plan to understand all the flows over the next few days. For a start, here's what happens when you run loader.py

How to run

# bin/loader.py config-file [start-index:]end-index
bin/loader.py conf/default.properties 1

loader.py is a Python script which basically does the following:

Download the URLs for the Wikipedia dumps from the Wikimedia dumps page using get_latest_articles_list().
Download the required dumps, selected by end-index (and start-index, if provided), into the work directory using download_file(url, directory, filename).
Ingest the file using ingest_file(config, filename), which spawns a subprocess that runs gradle runLoader -Pconfig=config -Pdump=filename 2>&1 > filename.log
1. runLoader is a Gradle task which calls com.machinelinking.cli.loader
  1. flags is a list of Flag, each of which enables or disables Extractors, Linkers, Splitters, Validators etc. The default config file has Extractors, Structure.
  2. jsonStorageFactory is the JSONStorageFactory instance we use for storage. The default config file has com.machinelinking.storage.MultiJSONStorageFactory.
  3. jsonStorageConfig is of the form <store-factory 1>|<loader.storage.config 1>;<store-factory 2>|<loader.storage.config 2>.
  4. prefixURL is simply a prefix URL.
  5. Finally, we call loader[0].load(prefixURL, inputstream), which internally calls process(prefixURL, inputstream) of WikiDumpMultiThreadProcessor, which uses a SAX parser (in WikiDumpParser) to parse the data.
  6. WikiDumpRunnable calls the processPage(pagePrefix, threadid, page) function, which uses the overridden processPage() method of the nested EnrichmentProcessor. This enriches the document using the enrichEntity(DocumentSource source, Serializer serializer) method of WikiPipeline and then adds it to the Mongo storage using the MongoDBDataEncoder dataEncoder and the JSONStorageConnection connection.
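The jsonStorageConfig format from the list above is easy to misread, so here is a small sketch of how such a string splits apart. It only illustrates the factory|config;factory|config shape described in the post. parse_storage_config is a made-up helper, and the real parsing in JSONpedia may differ.

```python
def parse_storage_config(value):
    """Split 'factory|config;factory|config' into (factory, config) pairs."""
    pairs = []
    for entry in value.split(";"):
        entry = entry.strip()
        if not entry:
            continue  # tolerate a trailing ';'
        factory, separator, config = entry.partition("|")
        if not separator:
            raise ValueError("expected '<factory>|<config>', got: " + entry)
        pairs.append((factory.strip(), config.strip()))
    return pairs
```

Each pair names one storage factory together with the configuration string handed to it, which is how a multi-storage setup can describe several backends in a single property.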

So this means...

That's a lot of stuff happening under the hood, i.e. at step 3. :-)

However, at the end of this simple command, we have achieved quite a bit.

That's all for now... Next up, I'll be looking at facet_loader.py and the CSV Export workflows.

Random Code-fu

I learnt about the try-with-resources statement while going through the codebase today. With this type of try, we can declare closeable resources in the try header, and they are automatically closed when the block exits. Pretty nifty indeed. An example in loader.java:

try (final JSONStorage storage = jsonStorageFactory.createStorage(storageConfig)) {
    loader[0] = new DefaultJSONStorageLoader(
            WikiPipelineFactory.getInstance(), flags, storage
    );

    final StorageLoaderReport report = loader[0].load(prefixURL, FileUtil.openDecompressedInputStream(dumpFile));

    System.err.println("Loading report: " + report);

    finalReportProduced[0] = true;
}
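For anyone coming from Python, the closest equivalent is the with statement. The sketch below uses a made-up DemoStorage class (not the real JSONStorage) to show that the resource is closed whether or not the body raises.

```python
class DemoStorage:
    """Hypothetical resource that records whether it has been closed."""

    def __init__(self):
        self.closed = False

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, traceback):
        self.closed = True  # runs on normal exit and on exceptions alike
        return False  # do not swallow exceptions


def load(storage, fail=False):
    """Stand-in for the load call; optionally fails to simulate an error."""
    if fail:
        raise RuntimeError("load failed")
    return "report"


def run(fail=False):
    """Use the storage inside a with block and report whether it got closed."""
    storage = DemoStorage()
    try:
        with storage:
            load(storage, fail)
    except RuntimeError:
        pass
    return storage.closed
```

Both run() and run(fail=True) return True, which is the same guarantee try-with-resources gives for the storage in the Java snippet.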

Hello World

2015-05-03T10:23:00+02:00

Hey there!

I'm Navin, and this is where I will be posting updates about the stuff I learn/explore as part of GSoC 2015.

I'm going to be working with the fine people at the DBpedia project (to be more specific, the JSONpedia project) and will be working on enhancing the extractors used for extracting information from Wikipedia.

For a quick overview of my project, visit my project proposal page.