it.unimi.dsi.archive4j.tool
Class Preprocess

java.lang.Object
  extended by it.unimi.dsi.archive4j.tool.Preprocess

public class Preprocess
extends Object

Analyses a DocumentSequence and generates data used by MergePreprocessedData and Scan. Each document is analysed, and its terms are passed through a TermProcessor.

The documents are processed in batches. Batch n consists of the following files:

basename@n.terms
The lexicographically sorted list of terms in the batch.
basename@n.lengths
The lengths of the documents of the batch.
basename@n.frequencies
The frequency of each term in the batch.
basename@n.counts
The global count of each term in the batch.

A final overall property file basename.properties records a few properties.
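As a rough illustration of what the per-batch files hold, the sketch below computes the same kinds of data for a small in-memory batch. It is not the code used by Preprocess. The class and method names are hypothetical, documents are modelled as plain lists of already processed terms, and it assumes that "frequency" means the number of documents containing a term and "count" the total number of occurrences, which this page does not state explicitly.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical sketch: none of these names belong to archive4j.
final class BatchStatisticsSketch {

    /**
     * Returns, for each term in lexicographic order, a two-element array:
     * [0] = assumed "frequency" (number of documents containing the term),
     * [1] = assumed "count" (total number of occurrences in the batch).
     */
    static TreeMap<String, long[]> termStatistics(List<List<String>> documents) {
        // A TreeMap keeps its keys sorted, mirroring the sorted term list.
        TreeMap<String, long[]> stats = new TreeMap<>();
        for (List<String> document : documents) {
            Set<String> seenInThisDocument = new HashSet<>();
            for (String term : document) {
                long[] s = stats.computeIfAbsent(term, t -> new long[2]);
                // Bump the document frequency only on the first occurrence.
                if (seenInThisDocument.add(term)) s[0]++;
                s[1]++;
            }
        }
        return stats;
    }

    /** Returns the length, in terms, of each document of the batch. */
    static int[] lengths(List<List<String>> documents) {
        int[] result = new int[documents.size()];
        for (int i = 0; i < result.length; i++) result[i] = documents.get(i).size();
        return result;
    }
}
```

Calling termStatistics on the two documents (b, a, b) and (a, c) yields the terms a, b, c in that order. Under the assumptions above, b has frequency 1 and count 2.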

Author:
Alessio Orlandi, Sebastiano Vigna

Nested Class Summary
static class Preprocess.PropertyKeys
           
 
Field Summary
static String COUNTS_EXTENSION
          The extension for the file of global counts.
static String FREQUENCIES_EXTENSION
          The extension for the file of term frequencies.
static String LENGTHS_EXTENSION
          The extension for the file of document lengths.
static String PROPERTIES_EXTENSION
          The extension for the property file.
static String TERMS_EXTENSION
          The extension for the gzip'd term file.
 
Constructor Summary
Preprocess()
           
 
Method Summary
static String batchName(String basename, int batch)
          Returns the basename of a batch.
protected static void emit(String batchName, TermProcessor processor, Object2ReferenceOpenHashMap<MutableString,it.unimi.dsi.archive4j.tool.Preprocess.TermStatistics> globalTerms)
          Emits the current batch, sorting and processing its terms.
static void main(String[] args)
           
static void run(String basename, DocumentSequence sequence, TermProcessor processor, String indexedField)
          Preprocesses the given document sequence.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

FREQUENCIES_EXTENSION

public static final String FREQUENCIES_EXTENSION
The extension for the file of term frequencies.

See Also:
Constant Field Values

LENGTHS_EXTENSION

public static final String LENGTHS_EXTENSION
The extension for the file of document lengths.

See Also:
Constant Field Values

TERMS_EXTENSION

public static final String TERMS_EXTENSION
The extension for the gzip'd term file.

See Also:
Constant Field Values

COUNTS_EXTENSION

public static final String COUNTS_EXTENSION
The extension for the file of global counts.

See Also:
Constant Field Values

PROPERTIES_EXTENSION

public static final String PROPERTIES_EXTENSION
The extension for the property file.

See Also:
Constant Field Values

Constructor Detail

Preprocess

public Preprocess()

Method Detail

batchName

public static String batchName(String basename,
                               int batch)
Returns the basename of a batch.

Parameters:
basename - the basename of the archive.
batch - the batch number.
Returns:
the basename for the given batch.
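
The following sketch shows how a batch basename might combine with the extension constants to give the per-batch file names. It is a self-contained stand-in, not the library code. It assumes the basename@n pattern from the class description, and the extension values are inferred from the file names listed there rather than taken from the Constant Field Values page. The class BatchNamingSketch and its method batchFiles are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: mirrors the documented naming scheme, not the real class.
final class BatchNamingSketch {
    // Assumed values, inferred from the file names in the class description.
    static final String TERMS_EXTENSION = ".terms";
    static final String LENGTHS_EXTENSION = ".lengths";
    static final String FREQUENCIES_EXTENSION = ".frequencies";
    static final String COUNTS_EXTENSION = ".counts";

    /** Assumed behaviour of batchName: basename, an at sign, then the batch number. */
    static String batchName(String basename, int batch) {
        return basename + "@" + batch;
    }

    /** Lists the four per-batch file names for the given batch. */
    static List<String> batchFiles(String basename, int batch) {
        String name = batchName(basename, batch);
        List<String> files = new ArrayList<>();
        files.add(name + TERMS_EXTENSION);
        files.add(name + LENGTHS_EXTENSION);
        files.add(name + FREQUENCIES_EXTENSION);
        files.add(name + COUNTS_EXTENSION);
        return files;
    }
}
```

Under these assumptions, batchFiles("corpus", 3) returns corpus@3.terms, corpus@3.lengths, corpus@3.frequencies and corpus@3.counts.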

emit

protected static void emit(String batchName,
                           TermProcessor processor,
                           Object2ReferenceOpenHashMap<MutableString,it.unimi.dsi.archive4j.tool.Preprocess.TermStatistics> globalTerms)
                    throws IOException
Emits the current batch, sorting and processing its terms.

Throws:
IOException

run

public static void run(String basename,
                       DocumentSequence sequence,
                       TermProcessor processor,
                       String indexedField)
                throws IOException,
                       ConfigurationException
Preprocesses the given document sequence.

Parameters:
basename - the output basename.
sequence - the document sequence to collect terms from.
processor - the term processor to apply.
indexedField - the field to be indexed.
Throws:
IOException
ConfigurationException

main

public static void main(String[] args)
                 throws Exception
Throws:
Exception