it.unimi.dsi.archive4j.tool
Class MergePreprocessedData

java.lang.Object
  extended by it.unimi.dsi.archive4j.tool.MergePreprocessedData

public class MergePreprocessedData
extends Object

Filters and merges term data (term lists, frequencies, global counts) produced by one or more preprocessing phases and generates the corresponding StringMap.

After Preprocess has performed the first pass over a collection, this class filters the collected terms and merges the resulting data. Additionally, it generates the following files:

basename.termmap
For each filtered term, its position in the term list.
basename.embed
The embedding list for this archive. It maps every term to its position in the non-filtered term list. It is useful for mapping terms of the archive to terms of an index over the same archive.
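
The relationship between the two files can be illustrated with a minimal in-memory sketch. It does not use the archive4j classes or file formats: the class EmbedSketch, its methods, and the use of a Predicate as the filter are all hypothetical names chosen for illustration, and the reading of "position" as a zero-based index is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical illustration only; not part of archive4j.
final class EmbedSketch {
    /** For each surviving term, in order, its position in the non-filtered term list. */
    static int[] embedding(List<String> allTerms, Predicate<String> isFiltered) {
        List<Integer> kept = new ArrayList<>();
        for (int i = 0; i < allTerms.size(); i++) {
            if (!isFiltered.test(allTerms.get(i))) kept.add(i);
        }
        return kept.stream().mapToInt(Integer::intValue).toArray();
    }

    /** Maps each surviving term to its position in the filtered term list. */
    static Map<String, Integer> termMap(List<String> allTerms, Predicate<String> isFiltered) {
        Map<String, Integer> map = new LinkedHashMap<>();
        for (String term : allTerms) {
            if (!isFiltered.test(term)) map.put(term, map.size());
        }
        return map;
    }

    public static void main(String[] args) {
        List<String> terms = List.of("a", "archive", "index", "of", "term");
        Predicate<String> tooShort = t -> t.length() < 3; // assumed filter
        System.out.println(termMap(terms, tooShort));                     // {archive=0, index=1, term=2}
        System.out.println(Arrays.toString(embedding(terms, tooShort))); // [1, 2, 4]
    }
}
```

In this sketch the embedding array plays the role of basename.embed and the map plays the role of basename.termmap; the actual on-disk representations are not described on this page.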

Strategies to remove terms are provided by MergePreprocessedData.TermFilter implementations.
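
The idea of pluggable removal strategies can be sketched as follows. The method declared by MergePreprocessedData.TermFilter is not shown on this page, so the sketch uses a hypothetical stand-in interface (Filter, with a drops method) rather than the real one. It also assumes that a term is kept only if no filter drops it, and the thresholds are arbitrary.

```java
import java.util.List;

// Hypothetical illustration only; not the archive4j TermFilter interface.
final class FilterSketch {
    /** Stand-in for a term filter: returns true if the term must be eliminated. */
    interface Filter {
        boolean drops(CharSequence term);
    }

    /** Drops terms shorter than min or longer than max characters. */
    static Filter length(int min, int max) {
        return t -> t.length() < min || t.length() > max;
    }

    /** Drops terms that contain both digits and non-digits. */
    static Filter mixed() {
        return t -> {
            boolean digit = false, nonDigit = false;
            for (int i = 0; i < t.length(); i++) {
                if (Character.isDigit(t.charAt(i))) digit = true;
                else nonDigit = true;
            }
            return digit && nonDigit;
        };
    }

    /** Assumption: a term survives only if none of the filters drops it. */
    static boolean keep(CharSequence term, List<Filter> filters) {
        for (Filter f : filters) {
            if (f.drops(term)) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        List<Filter> filters = List.of(length(2, 20), mixed());
        for (String t : List.of("a", "2008", "mp3", "archive")) {
            System.out.println(t + " kept: " + keep(t, filters));
        }
    }
}
```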

Author:
Alessio Orlandi, Sebastiano Vigna

Nested Class Summary
static class MergePreprocessedData.FrequencyFilter
          A filter that eliminates terms that are too frequent or not frequent enough.
static class MergePreprocessedData.LengthFilter
          A filter that eliminates terms that are too long or too short.
static class MergePreprocessedData.MixedFilter
          A filter that eliminates mixed digit-nondigit terms.
static class MergePreprocessedData.PropertyKeys
          Configuration keys that are also used by Scan.
static class MergePreprocessedData.StopwordFilter
          A filter that eliminates terms in a given set.
static interface MergePreprocessedData.TermFilter
          Interface used to specify whether a term must be filtered (that is, eliminated from the archive) or not.
 
Field Summary
static String EMBED_EXTENSION
          The extension of the embedding list.
static String TERMMAP_EXTENSION
          The extension of the map for (filtered) terms.
 
Constructor Summary
MergePreprocessedData()
           
 
Method Summary
static void main(String[] args)
           
static void run(CharSequence[] inputNames, String outputBasename, MergePreprocessedData.TermFilter[] filters, Properties properties)
          Runs the merge process.
static void run(String[] inputBasename, String outputBasename, MergePreprocessedData.TermFilter[] filters)
          Runs the merge process.
static void run(String inputBasename, String outputBasename, MergePreprocessedData.TermFilter[] filters)
          Runs the merge process.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

TERMMAP_EXTENSION

public static final String TERMMAP_EXTENSION
The extension of the map for (filtered) terms.

See Also:
Constant Field Values

EMBED_EXTENSION

public static final String EMBED_EXTENSION
The extension of the embedding list.

See Also:
Constant Field Values
Constructor Detail

MergePreprocessedData

public MergePreprocessedData()
Method Detail

run

public static void run(String inputBasename,
                       String outputBasename,
                       MergePreprocessedData.TermFilter[] filters)
                throws IOException,
                       ConfigurationException,
                       IllegalArgumentException,
                       ClassNotFoundException,
                       IllegalAccessException,
                       InvocationTargetException,
                       InstantiationException,
                       NoSuchMethodException
Runs the merge process.

Parameters:
inputBasename - the basename of a previous Preprocess run.
outputBasename - the output basename.
filters - term filters that will be used to choose which terms to include in the merged data.
Throws:
IOException
ConfigurationException
IllegalArgumentException
ClassNotFoundException
IllegalAccessException
InvocationTargetException
InstantiationException
NoSuchMethodException

run

public static void run(String[] inputBasename,
                       String outputBasename,
                       MergePreprocessedData.TermFilter[] filters)
                throws IOException,
                       ConfigurationException,
                       IllegalArgumentException,
                       ClassNotFoundException,
                       IllegalAccessException,
                       InvocationTargetException,
                       InstantiationException,
                       NoSuchMethodException
Runs the merge process.

Parameters:
inputBasename - the basenames of one or more previous Preprocess runs.
outputBasename - the output basename.
filters - term filters that will be used to choose which terms to include in the merged data.
Throws:
IOException
ConfigurationException
IllegalArgumentException
ClassNotFoundException
IllegalAccessException
InvocationTargetException
InstantiationException
NoSuchMethodException
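
What merging the output of several runs amounts to can be sketched in memory. The sketch makes two assumptions that this page does not confirm: that a term appearing in more than one input has its frequencies summed, and that the merged term list is kept in lexicographic order. MergeSketch and its merge method are hypothetical names, and no archive4j file format is read or written.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; not the archive4j merge implementation.
final class MergeSketch {
    /** Merges per-run term frequencies into one sorted map, summing shared terms. */
    static TreeMap<String, Long> merge(List<Map<String, Long>> runs) {
        TreeMap<String, Long> merged = new TreeMap<>();
        for (Map<String, Long> run : runs) {
            for (Map.Entry<String, Long> e : run.entrySet()) {
                merged.merge(e.getKey(), e.getValue(), Long::sum);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Long> first = Map.of("archive", 3L, "term", 5L);
        Map<String, Long> second = Map.of("index", 2L, "term", 1L);
        System.out.println(merge(List.of(first, second))); // {archive=3, index=2, term=6}
    }
}
```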

run

public static void run(CharSequence[] inputNames,
                       String outputBasename,
                       MergePreprocessedData.TermFilter[] filters,
                       Properties properties)
                throws IOException,
                       ConfigurationException
Runs the merge process.

Parameters:
inputNames - the basenames for all sets of term lists and frequency files to merge.
outputBasename - the output basename.
filters - term filters that will be used to choose which terms to include in the merged data.
properties - an initialised property object containing additional properties to be saved (usually, at least Preprocess.PropertyKeys.TERMPROCESSOR and Preprocess.PropertyKeys.FIELD).
Throws:
IOException
ConfigurationException

main

public static void main(String[] args)
                 throws Exception
Throws:
Exception