I've started going through the codebase in a more structured manner, and thought that following the workflows would be a good way to start. I plan to understand all the flows over the next few days. For a start, here's what happens when you run loader.py.
How to run
# bin/loader.py config-file [start-index:]end-index
bin/loader.py conf/default.properties 1
loader.py is a Python script which basically does the following:
- Download the URLs for the Wikipedia dumps from the Wikimedia dumps page using `get_latest_articles_list()`
- Download the required dumps, selected by the end-index (and start-index, if provided), into the work directory using `download_file(url, directory, filename)`
- Ingest each file using `ingest_file(config, filename)`, which basically spawns a subprocess that runs `gradle runLoader -Pconfig=config -Pdump=filename 2>&1 > filename.log`
- `runLoader` is a gradle task which calls `com.machinelinking.cli.loader`. `flags` is a list of `Flag`s, each of which enables or disables Extractors, Linkers, Splitters, Validators etc.; the default config file has `Extractors,Structure`. `jsonStorageFactory` is an instance of the JSONStorageFactory we use to store the output; the default config file has `com.machinelinking.storage.MultiJSONStorageFactory`. `jsonStorageConfig` is of the form `<store-factory 1>|<loader.storage.config 1>;<store-factory 2>|<loader.storage.config 2>`. `prefixURL` is simply a prefix URL.
- Finally, we call `loader[0].load(prefixURL, inputstream)`, which internally calls `process(prefixURL, inputstream)` of `WikiDumpMultiThreadProcessor`, which uses a SAX parser (in `WikiDumpParser`) to parse the data. `WikiDumpRunnable` calls the `processPage(pagePrefix, threadId, page)` function, which uses the overridden `processPage()` method of the nested `EnrichmentProcessor`; this adds the document into the Mongo storage using the `MongoDBDataEncoder dataEncoder` and `JSONStorageConnection connection`, after it is enriched via the `enrichEntity(DocumentSource source, Serializer serializer)` method of `WikiPipeline`.
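To make the `jsonStorageConfig` format concrete, here's a small Python sketch of how the `<store-factory>|<config>` pairs separated by `;` could be split apart. The actual parsing happens inside the Java loader, so this helper and its name are purely hypothetical:

```python
def parse_storage_config(json_storage_config):
    """Hypothetical sketch: split 'factory|config;factory|config'
    into (store-factory, loader.storage.config) pairs."""
    pairs = []
    for entry in json_storage_config.split(";"):
        factory, _, config = entry.partition("|")
        pairs.append((factory.strip(), config.strip()))
    return pairs

# Example with the default factory from the config file
# (the "mongo-conf" value is a made-up placeholder):
pairs = parse_storage_config(
    "com.machinelinking.storage.MultiJSONStorageFactory|mongo-conf"
)
```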
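Put together, the steps above can be sketched roughly as follows. Only the gradle command line comes from the script as quoted; the function bodies and names (`build_ingest_command`, `select_dumps`, `run_loader`) are my own simplification:

```python
import subprocess

def build_ingest_command(config, filename):
    # Assumption: mirrors the quoted `gradle runLoader -Pconfig=... -Pdump=...` call
    return ["gradle", "runLoader", f"-Pconfig={config}", f"-Pdump={filename}"]

def select_dumps(dump_urls, end_index, start_index=0):
    # Hypothetical helper: applies the [start-index:]end-index slice from the CLI
    return dump_urls[start_index:end_index]

def run_loader(config, dump_urls, end_index, start_index=0, workdir="work"):
    # Sketch of the overall flow, not the actual implementation
    for url in select_dumps(dump_urls, end_index, start_index):
        filename = url.rsplit("/", 1)[-1]
        # download_file(url, workdir, filename)  # step 2 in the real script
        with open(f"{workdir}/{filename}.log", "w") as log:
            # step 3: spawn the gradle subprocess, redirecting output to a log
            subprocess.run(build_ingest_command(config, filename),
                           stdout=log, stderr=subprocess.STDOUT)
```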
So this means...
That's a lot of stuff happening under the hood, i.e. at step 3. :-)
However, at the end of this simple command, we have achieved quite a bit.
That's all for now... Next up, I'll be looking at facet_loader.py and the CSV Export workflows.
Random Code-fu
I learnt about the try-with-resources statement while going through the codebase today. With this type of try, we can provide closeable resources to the block, and they are automatically closed after the try. Pretty nifty indeed. An example from loader.java:
try (final JSONStorage storage = jsonStorageFactory.createStorage(storageConfig)) {
    loader[0] = new DefaultJSONStorageLoader(
            WikiPipelineFactory.getInstance(), flags, storage
    );
    final StorageLoaderReport report =
            loader[0].load(prefixURL, FileUtil.openDecompressedInputStream(dumpFile));
    System.err.println("Loading report: " + report);
    finalReportProduced[0] = true;
}
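Incidentally, Python (the language of loader.py itself) has an analogous construct: the `with` statement. Any object with a `close()` method can be wrapped in `contextlib.closing`, and it gets closed automatically when the block exits. The `FakeStorage` class below is just a stand-in for demonstration, not anything from the codebase:

```python
from contextlib import closing

class FakeStorage:
    """Stand-in for a closeable resource like JSONStorage."""
    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True

storage = FakeStorage()
with closing(storage):
    pass  # use the storage here
assert storage.closed  # closed automatically, like try-with-resources
```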