The CSV exporter is a command-line tool that converts Wikipedia dumps into tabular data generated from page parsing.
How to run

```
# java -cp build/libs/jsonpedia-{VERSION}.jar com.machinelinking.cli.exporter --prefix page-prefix --in input-dump-file --out output-csv-file --threads number-of-threads

java -cp build/libs/jsonpedia-{VERSION}.jar com.machinelinking.cli.exporter --prefix http://en.wikipedia.org --in src/test/resources/dumps/enwiki-latest-pages-articles-p1.xml.gz --out out.csv --threads 1
```
What goes on behind the scenes when you run this command is:

- The exporter class parses all the command-line options passed to it and then creates an exporter of the class `CSVExporter`, which is a child class of `WikiDumpMultiThreadProcessor`.
- The export function of `CSVExporter` is called, which wraps the input stream in a `BufferedInputStream` object (assuming it isn't already of that type) and then calls the `process` method of `WikiDumpMultiThreadProcessor`. Here we:
  - Create `n` processor objects, where each processor is of the required class (e.g. `TemplatePropertyProcessor()`) and `n` is the number of threads.
  - In each processor, get individual pages and run `processPage`, which works on the `WikiPage` to extract data. It also uses the `WikiTextParser` to parse the text.
  - Finally, a report is created (e.g. `CSVExporterReport`) and returned.
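The multi-threaded flow above can be sketched in a deliberately simplified, hypothetical form. The real `WikiDumpMultiThreadProcessor` is more involved, and the class, method, and field names below (`DumpProcessorSketch`, `Report`, `process`) are made up for illustration: `n` workers drain a shared queue of pages, each one standing in for `processPage`, and the per-thread results are folded into a single report at the end.

```java
import java.util.*;
import java.util.concurrent.*;

// Hypothetical sketch of the fan-out/fan-in pattern described above.
public class DumpProcessorSketch {

    // Stand-in for a report like CSVExporterReport: just counts pages.
    static class Report {
        final int processedPages;
        Report(int processedPages) { this.processedPages = processedPages; }
    }

    static Report process(List<String> pages, int threads) throws Exception {
        // All workers pull pages from one shared queue.
        BlockingQueue<String> queue = new LinkedBlockingQueue<>(pages);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<Integer>> results = new ArrayList<>();
        for (int i = 0; i < threads; i++) {
            results.add(pool.submit(() -> {
                int count = 0;
                String page;
                while ((page = queue.poll()) != null) {
                    // Here the real code would call processPage(page),
                    // using WikiTextParser to extract data from the WikiPage.
                    count++;
                }
                return count;
            }));
        }
        // Fan-in: merge each worker's partial result into one report.
        int total = 0;
        for (Future<Integer> f : results) total += f.get();
        pool.shutdown();
        return new Report(total);
    }

    public static void main(String[] args) throws Exception {
        List<String> pages = Arrays.asList("Page_A", "Page_B", "Page_C", "Page_D");
        Report report = process(pages, 2);
        System.out.println("processed=" + report.processedPages);
    }
}
```

Splitting work this way means the (slow) page parsing happens in parallel while the report aggregation stays simple and sequential.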
So this means we now have the CSV output we need, generated from the Wikipedia dump we provided.
Random Code-fu
Came across an interesting way to pick the best number of threads to use. We make use of `Runtime` to get the number of available processors. Didn't know you could do this! In the codebase, you will find the following code:
```java
protected int getBestNumberOfThreads() {
    // Ask the JVM how many processors are available to it.
    final int candidate = Runtime.getRuntime().availableProcessors();
    // Never drop below the configured minimum.
    return candidate < MIN_NUM_OF_THREADS ? MIN_NUM_OF_THREADS : candidate;
}
```
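To try this trick outside the codebase, here is a minimal standalone version of the same idea. `MIN_NUM_OF_THREADS` is assumed to be 2 here purely for illustration; the actual constant in jsonpedia may differ.

```java
public class ThreadCount {
    // Assumed value for this sketch, not necessarily jsonpedia's constant.
    static final int MIN_NUM_OF_THREADS = 2;

    static int bestNumberOfThreads() {
        // availableProcessors() reports logical cores visible to the JVM.
        final int candidate = Runtime.getRuntime().availableProcessors();
        // Math.max is equivalent to the ternary clamp in the original code.
        return Math.max(candidate, MIN_NUM_OF_THREADS);
    }

    public static void main(String[] args) {
        System.out.println("threads=" + bestNumberOfThreads());
    }
}
```

Note that `availableProcessors()` can change over the life of the JVM (e.g. in containers with CPU limits), so callers that care should not cache it forever.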