Package it.unimi.dsi.archive4j.tool

Command-line tools for archive construction.


Interface Summary
MergePreprocessedData.TermFilter Interface used to specify whether a term must be filtered (that is, eliminated from the archive) or not.

Class Summary
ArchiveBuilder Builds an archive from a document collection.
ComputeArchiveData Analyses an Archive and generates frequency and global count data.
MergePreprocessedData Filters and merges term data (term lists, frequencies, global counts) originated by one or more preprocessing phases and generates the corresponding StringMap.
MergePreprocessedData.FrequencyFilter A filter that eliminates terms that are too frequent or too rare.
MergePreprocessedData.LengthFilter A filter that eliminates too long or too short terms.
MergePreprocessedData.MixedFilter A filter that eliminates mixed digit-nondigit terms.
MergePreprocessedData.StopwordFilter Filter that eliminates terms in a given set.
MergeSortedArchives Merges or concatenates sorted archives.
Preprocess Analyses a DocumentSequence and generates data used by MergePreprocessedData and Scan.
Scan Scans a DocumentSequence to build a SequentialBitstreamArchive or a RandomAccessBitstreamArchive, using preprocessed data generated by Preprocess and MergePreprocessedData.
SortBitstreamArchive Sorts bitstream archives.

Enum Summary
MergePreprocessedData.PropertyKeys Configuration keys that are used also by Scan.

Package it.unimi.dsi.archive4j.tool Description

Command-line tools for archive construction.

The classes in this package contain a main method, and can be used to build archives starting from a document sequence.

Building an archive

The first thing to do is to expose your data as an MG4J DocumentSequence; in general, many details of the archive construction process are similar to those of the MG4J index-construction process, so we suggest becoming familiar with the documentation in it.unimi.dsi.mg4j.document. As an example, we will use a simple FileSetDocumentCollection, which treats each file of a set as a document (by default, words are maximal subsequences of alphanumeric characters). We assume that your Javadoc files are in /usr/share/javadoc, and create a serialised collection:

        find /usr/share/javadoc/ -iname \*.html -type f | \
    egrep -v "(package-|-tree|class-use|index-.*.html|allclasses)" | \
    java it.unimi.dsi.mg4j.document.FileSetDocumentCollection \
        -f HtmlDocumentFactory -p encoding=UTF-8 javadoc.collection

The -p encoding=UTF-8 option passes an encoding to the HtmlDocumentFactory (the properties you can set depend on the chosen factory).
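The egrep pattern above is what keeps Javadoc index and navigation pages out of the collection. A quick sanity check, feeding it some typical Javadoc file names (the names below are made up for illustration), shows which ones survive the filter:

```shell
# Feed a few typical Javadoc file names through the exclusion
# pattern used in the command above and see which pass through.
printf '%s\n' index.html overview-summary.html package-summary.html \
    index-all.html allclasses-frame.html String.html \
  | egrep -v "(package-|-tree|class-use|index-.*.html|allclasses)"
# → index.html, overview-summary.html, String.html
```

Note that index.html itself passes: the pattern only excludes names matching index-.*.html, that is, index pages with a hyphenated suffix.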

At this point, all you have to do is invoke ArchiveBuilder:

    java -Xmx256M -server it.unimi.dsi.archive4j.tool.ArchiveBuilder \
        -S javadoc.collection -Itext basename

The -Itext option specifies that the text field of the HtmlDocumentFactory processing your collection should be indexed (there is also a title field). There are many more options, which you can examine using the online help.

Distributed archive construction

Archives can be built in a distributed fashion and then combined. To do so, however, you must run the three archive-construction phases manually. As an example, suppose you have two input files containing one document per line. In this case, we can use the built-in InputStreamDocumentSequence, which reads documents from standard input. Properties are specified directly to the tools running the archive-construction phases.
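As a concrete illustration of the one-document-per-line format, the following sketch creates two such input files (the file names match the placeholders used in the commands below; the contents are made up):

```shell
# Each line of the input is one document, as expected by
# InputStreamDocumentSequence when reading from standard input.
printf 'the first document\nthe second document\n' > your-input-file0
printf 'the third document\n' > your-input-file1
wc -l < your-input-file0   # two lines, i.e., two documents
```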

First, you must Preprocess your files (for the sake of simplicity, we assume they are both in the same directory, but of course you can run the two following commands in a distributed way):

    java -Xmx256M -server it.unimi.dsi.archive4j.tool.Preprocess \
        -Itext -p encoding=UTF-8 basename0 <your-input-file0
    java -Xmx256M -server it.unimi.dsi.archive4j.tool.Preprocess \
        -Itext -p encoding=UTF-8 basename1 <your-input-file1

Now you have to move all the generated files to a single location and merge them:

    java -Xmx256M -server it.unimi.dsi.archive4j.tool.MergePreprocessedData \
        basename basename0 basename1

In this phase you can also reduce the term set using various options. The resulting global data files have names stemmed from basename, and must be used to perform the actual Scan:

    java -Xmx256M -server it.unimi.dsi.archive4j.tool.Scan \
         -Itext -r -p encoding=UTF-8 basename0 basename <your-input-file0
    java -Xmx256M -server it.unimi.dsi.archive4j.tool.Scan \
        -Itext -r -p encoding=UTF-8 basename1 basename <your-input-file1

It is your responsibility to provide the same document-sequence options in the preprocessing and scanning phases. Note that we are passing the optional parameter basename, which causes global data to be used for archive construction. The -r option requests a random-access archive.
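One simple way to guarantee that the two phases receive identical document-sequence options is to factor them into a shell variable. This is just a sketch using the option names from the examples above (the commands are echoed for illustration; drop the echo to actually run the tools):

```shell
# Shared document-sequence options, defined once so the
# preprocessing and scanning phases cannot drift apart.
SEQOPTS="-Itext -p encoding=UTF-8"

# Both phases then expand the same $SEQOPTS:
echo java it.unimi.dsi.archive4j.tool.Preprocess $SEQOPTS basename0
echo java it.unimi.dsi.archive4j.tool.Scan -r $SEQOPTS basename0 basename
```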

At the end, you have two partial archives that must be merged:

    java -Xmx256M -server it.unimi.dsi.archive4j.tool.MergeSortedArchives \
        -C basename basename0 basename1

The -C option requests concatenation of the archives: document identifiers will be renumbered sequentially. It is also possible to specify explicit document identifiers using a map from URIs to integers; in that case, the resulting archives must simply be merged (i.e., without the -C option).

Warning: we are using the same basename for both merging phases. This is essential, as MergePreprocessedData computes some global statistics, such as term frequencies, which must be associated with the archive basename.