|MergePreprocessedData.TermFilter||Interface used to specify whether a term must be filtered (that is, eliminated from the archive) or not.|
|ArchiveBuilder||Builds an archive from a document collection.|
|MergePreprocessedData||Filters and merges term data (term lists, frequencies, global counts) originated by one or more preprocessing phases
and generates the corresponding
|MergePreprocessedData.FrequencyFilter||Filter that eliminates terms that are too frequent or not frequent enough.|
|MergePreprocessedData.LengthFilter||A filter that eliminates terms that are too long or too short.|
|MergePreprocessedData.MixedFilter||A filter that eliminates mixed digit-nondigit terms.|
|MergePreprocessedData.StopwordFilter||Filter that eliminates terms in a given set.|
|MergeSortedArchives||Merges or concatenates sorted archives.|
|Preprocess||Analyses a DocumentSequence and generates data used by
|SortBitstreamArchive||Sorts bitstream archives.|
|MergePreprocessedData.PropertyKeys||Configuration keys that are used also by
Command-line tools for archive construction.
The classes in this package contain a main method, and can be used to build archives starting from a document sequence.
The first thing to do is to expose your data as an MG4J
in general, many details of the archive construction process are similar to those of the MG4J index construction
process, so we suggest that you familiarise yourself with the documentation contained in
As an example, we will use a simple
that treats each file of a set as a document (words by default are maximal subsequences of alphanumeric characters). We
assume that your Javadoc files are in /usr/share/javadoc, and create a serialised collection:
find /usr/share/javadoc/ -iname \*.html -type f | \
  egrep -v "(package-|-tree|class-use|index-.*.html|allclasses)" | \
  java it.unimi.dsi.mg4j.document.FileSetDocumentCollection \
    -f HtmlDocumentFactory -p encoding=UTF-8 javadoc.collection
The -p encoding=UTF-8 option passes an encoding to the
HtmlDocumentFactory (the properties you can set depend on the chosen factory).
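To see which files the exclusion pattern in the find pipeline above lets through, it can be tried on a few sample paths. This is only a sketch: the function name filter_javadoc_files and the sample paths are made up for illustration, and grep -E is used as the POSIX spelling of egrep.

```shell
# Same exclusion pattern as in the find pipeline above.
# filter_javadoc_files is a hypothetical helper name, not part of any tool.
filter_javadoc_files() {
    grep -E -v "(package-|-tree|class-use|index-.*.html|allclasses)"
}

# Hypothetical sample paths, one per line; only the class page survives.
printf '%s\n' \
    /usr/share/javadoc/api/Foo.html \
    /usr/share/javadoc/api/package-summary.html \
    /usr/share/javadoc/api/overview-tree.html \
    /usr/share/javadoc/api/class-use/Foo.html \
    /usr/share/javadoc/api/index-all.html \
    /usr/share/javadoc/api/allclasses-frame.html |
    filter_javadoc_files
```

Only /usr/share/javadoc/api/Foo.html is printed; the navigation and summary pages generated by Javadoc are dropped before the collection is built.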
At this point, all you have to do is invoke
java -Xmx256M -server it.unimi.dsi.archive4j.tool.ArchiveBuilder \
  -S javadoc.collection -Itext basename
The -Itext option tells the builder to index the text field of the
HtmlDocumentFactory that is processing your collection (there is also a title field). There are many more options, which
you can examine using the online help.
Archives can be built in a distributed fashion, and then combined. To do so, however, you must manually run
the three archive construction phases. As an example, suppose you have two input files containing one
document per line. In this case, we can use the built-in
which reads documents from standard input. Properties are specified directly to the tools running the archive construction phases.
Preprocess your files (for the sake of simplicity, we assume they
are both in the same directory, but of course you can run the following two commands in a distributed way):
java -Xmx256M -server it.unimi.dsi.archive4j.tool.Preprocess \
  -Itext -p encoding=UTF-8 basename0 <your-input-file0
java -Xmx256M -server it.unimi.dsi.archive4j.tool.Preprocess \
  -Itext -p encoding=UTF-8 basename1 <your-input-file1
Now you have to move all files generated to a single location, and merge them:
java -Xmx256M -server it.unimi.dsi.archive4j.tool.MergePreprocessedData \
  basename basename0 basename1
In this phase you can also reduce the term set using various options. The resulting global data files
have names derived from basename, and must be used to perform the actual
java -Xmx256M -server it.unimi.dsi.archive4j.tool.Scan \
  -Itext -r -p encoding=UTF-8 basename0 basename <your-input-file0
java -Xmx256M -server it.unimi.dsi.archive4j.tool.Scan \
  -Itext -r -p encoding=UTF-8 basename1 basename <your-input-file1
It is your responsibility to provide the same document-sequence options in the preprocessing and scanning phases. Note that we are passing the optional parameter basename, which causes the global data to be used for archive construction. The -r option requests a random-access archive.
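Going back to the merge phase: the term-set reduction it offers corresponds to the filters listed in the package summary (frequency, length, mixed digit-nondigit, stopwords). The sketch below only illustrates what such filters do to a term list; it is not the archive4j implementation, and the function name filter_terms, its argument order, and the "term frequency" line format are all assumptions made for the example.

```shell
# filter_terms MINLEN MAXLEN MINFREQ MAXFREQ STOPFILE
# Hypothetical helper: reads "term frequency" lines on standard input and
# prints the lines that survive four illustrative filters.
filter_terms() {
    awk -v minlen="$1" -v maxlen="$2" -v minfreq="$3" -v maxfreq="$4" \
        -v stopfile="$5" '
        # Load the stopword set (one word per line) before reading terms.
        BEGIN { while ((getline w < stopfile) > 0) stop[w] = 1 }
        length($1) < minlen || length($1) > maxlen { next }  # length filter
        $2 < minfreq || $2 > maxfreq { next }                # frequency filter
        $1 ~ /[0-9]/ && $1 ~ /[^0-9]/ { next }               # mixed digit-nondigit
        $1 in stop { next }                                  # stopword filter
        { print }
    '
}

# Made-up sample data: a one-word stopword list and six terms.
stopwords=$(mktemp)
printf 'the\n' > "$stopwords"
printf '%s\n' 'archive 12' 'the 900' 'a 850' 'x86 3' '2004 5' 'hapax 1' |
    filter_terms 2 20 2 1000 "$stopwords"
rm -f "$stopwords"
```

With these sample values the surviving lines are "archive 12" and "2004 5": "the" is a stopword, "a" is too short, "x86" mixes digits and non-digits, and "hapax" is too rare.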
At the end, you have two partial archives that must be merged:
java -Xmx256M -server it.unimi.dsi.archive4j.tool.MergeSortedArchives \
  -C basename basename0 basename1
The -C option requests that the archives be concatenated; document identifiers will
be renumbered sequentially. It is also possible to specify explicit document identifiers using a map from
URIs to integers, and in that case the resulting archives must simply be merged (i.e., no
Warning: we are using the same basename for both merging phases. This is essential, as
MergePreprocessedData computes some global statistics, such as term frequencies,
which must be associated with the archive basename.
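The distributed steps above (preprocess, merge preprocessed data, scan, merge archives) can be collected in a small script. What follows is a minimal sketch, not a tool shipped with archive4j: the helper names run and build_archive and the DRY_RUN variable are hypothetical, all parts are assumed to live in one directory, and the Java commands and options are exactly those shown above. By default it only prints the commands it would run.

```shell
# With DRY_RUN=1 (the default here) commands are printed, not executed.
DRY_RUN="${DRY_RUN:-1}"
JAVA="java -Xmx256M -server"
PKG="it.unimi.dsi.archive4j.tool"

# run INPUT COMMAND...: run COMMAND with standard input taken from INPUT;
# an INPUT of "-" leaves standard input alone.
run() {
    input=$1; shift
    if [ "$DRY_RUN" = 1 ]; then
        if [ "$input" = - ]; then echo "$*"; else echo "$* <$input"; fi
    elif [ "$input" = - ]; then
        "$@"
    else
        "$@" <"$input"
    fi
}

# build_archive BASENAME INPUT...: one partial archive per input file,
# named BASENAME0, BASENAME1, ..., then combined under BASENAME.
build_archive() {
    base=$1; shift
    parts=""
    i=0
    # Phase 1: preprocess each input file under its own partial basename.
    for f in "$@"; do
        run "$f" $JAVA $PKG.Preprocess -Itext -p encoding=UTF-8 "$base$i"
        parts="$parts $base$i"
        i=$((i + 1))
    done
    # Phase 2: merge the preprocessed data into global data named BASENAME.
    run - $JAVA $PKG.MergePreprocessedData "$base" $parts
    # Phase 3: scan each input again, passing the global-data basename.
    i=0
    for f in "$@"; do
        run "$f" $JAVA $PKG.Scan -Itext -r -p encoding=UTF-8 "$base$i" "$base"
        i=$((i + 1))
    done
    # Phase 4: concatenate the partial archives under the same BASENAME.
    run - $JAVA $PKG.MergeSortedArchives -C "$base" $parts
}

build_archive basename your-input-file0 your-input-file1
```

Setting DRY_RUN=0 before calling build_archive executes the commands instead of printing them. Note that the same basename is passed to both merging phases, as the warning above requires.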