it.unimi.dsi.archive4j
Class BitstreamArchiveWriter<T extends DocumentSummary>

java.lang.Object
  extended by it.unimi.dsi.archive4j.ArchiveWriter<T>
      extended by it.unimi.dsi.archive4j.BitstreamArchiveWriter<T>
All Implemented Interfaces:
Closeable

public class BitstreamArchiveWriter<T extends DocumentSummary>
extends ArchiveWriter<T>

A writer for SequentialBitstreamArchive or RandomAccessBitstreamArchive archives.

Summaries are stored in a bitstream. Each summary is preceded by its id (for a SequentialBitstreamArchive) or by the gap (reduced by one) with the previous id (for a RandomAccessBitstreamArchive). Then, we write the size and the length of the document.
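
The "gap reduced by one" scheme for document ids can be sketched as follows. This is a minimal illustration, not the library's code: the class IdGapSketch and its methods are hypothetical, and the sketch assumes that the id preceding the first document is -1. It also shows why ids must be strictly increasing, since any other order would produce a negative value.

```java
import java.util.Arrays;

public class IdGapSketch {
    /** Encodes strictly increasing ids as gaps reduced by one (assumed initial previous id: -1). */
    public static int[] encode(int[] ids) {
        int[] gaps = new int[ids.length];
        int prev = -1;
        for (int i = 0; i < ids.length; i++) {
            if (ids[i] <= prev) throw new IllegalArgumentException("ids must be strictly increasing");
            gaps[i] = ids[i] - prev - 1;
            prev = ids[i];
        }
        return gaps;
    }

    /** Rebuilds the ids from the reduced gaps produced by encode(). */
    public static int[] decode(int[] gaps) {
        int[] ids = new int[gaps.length];
        int prev = -1;
        for (int i = 0; i < gaps.length; i++) {
            prev += gaps[i] + 1;
            ids[i] = prev;
        }
        return ids;
    }

    public static void main(String[] args) {
        int[] gaps = encode(new int[] { 3, 4, 9 });
        System.out.println(Arrays.toString(gaps));         // [3, 0, 4]
        System.out.println(Arrays.toString(decode(gaps))); // [3, 4, 9]
    }
}
```

Consecutive ids yield a gap of zero, which the usual instantaneous codes represent in very few bits.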

Terms are renumbered by descending frequency rank and sorted, so that they can be stored as gaps. Counts follow, stored in reverse order and again as gaps; these gaps can be negative (although, given the correlation between global and local counts, we expect them to be small), so we pass them through Fast.int2nat(int). The codes used to write the above data can be selected by passing a set of flags to the constructors.
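
As an illustration of how signed count gaps can be turned into natural numbers, here is a minimal sketch. The class CountGapSketch is hypothetical and does not reproduce the archive's actual bit layout; the local int2nat helper assumes the usual bijection (a nonnegative x becomes 2x, a negative x becomes -2x - 1), and the first gap is assumed to be taken with respect to zero.

```java
import java.util.Arrays;

public class CountGapSketch {
    /** Maps integers bijectively onto naturals: x >= 0 to 2x, x < 0 to -2x - 1. */
    public static int int2nat(int x) {
        return x >= 0 ? x << 1 : -((x << 1) + 1);
    }

    /** Inverse of int2nat(). */
    public static int nat2int(int n) {
        return (n & 1) == 0 ? n >>> 1 : -((n + 1) >>> 1);
    }

    /** Scans the counts from last to first and emits each signed gap as a natural number. */
    public static int[] encode(int[] counts) {
        int[] out = new int[counts.length];
        int prev = 0; // assumption: the first gap is relative to zero
        for (int i = counts.length - 1, j = 0; i >= 0; i--, j++) {
            out[j] = int2nat(counts[i] - prev);
            prev = counts[i];
        }
        return out;
    }

    /** Rebuilds the counts, in their original order, from the output of encode(). */
    public static int[] decode(int[] encoded) {
        int[] counts = new int[encoded.length];
        int prev = 0;
        for (int j = 0; j < encoded.length; j++) {
            prev += nat2int(encoded[j]);
            counts[encoded.length - 1 - j] = prev;
        }
        return counts;
    }

    public static void main(String[] args) {
        int[] encoded = encode(new int[] { 5, 6, 2 });
        System.out.println(Arrays.toString(encoded));         // [4, 8, 1]
        System.out.println(Arrays.toString(decode(encoded))); // [5, 6, 2]
    }
}
```

When local counts follow the global frequency order, the reversed sequence is mostly nondecreasing, so negative gaps are the exception and the mapped values stay small.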

It is possible to override permutation behaviour by explicitly passing a term-to-rank permutation and its inverse.

All in all, we generate the following files:

basename.archive
The archive bitstream.
basename.permutation
The rank-to-term permutation (translates data in the archive to term numbers).
basename.offsets
For random-access archives, a serialised EliasFanoMonotoneLongBigList storing the start of each summary.
basename.missing
For random-access archives, a serialised SparseRank storing the ids of the documents missing from the archive.

Finally, a property file basename.properties records a few properties of the archive.

Author:
Alessio Orlandi, Sebastiano Vigna
See Also:
RandomAccessBitstreamArchive, SequentialBitstreamArchive

Field Summary
static int CURRENT_VERSION
           
 
Constructor Summary
  BitstreamArchiveWriter(CharSequence basename, int[][] permutations, boolean sorted, Map<SequentialBitstreamArchive.CompressionFlags.Component,SequentialBitstreamArchive.CompressionFlags.Coding> codings)
          Creates a new bitstream archive writer with given basename and sorting permutations.
  BitstreamArchiveWriter(CharSequence basename, int[] frequency, boolean sorted, Map<SequentialBitstreamArchive.CompressionFlags.Component,SequentialBitstreamArchive.CompressionFlags.Coding> codings)
          Creates a new bitstream archive writer with given basename and frequency array.
protected BitstreamArchiveWriter(CharSequence basename, int[] term2Rank, int[] rank2Term, boolean randomAccess, Map<SequentialBitstreamArchive.CompressionFlags.Component,SequentialBitstreamArchive.CompressionFlags.Coding> codings)
          Creates a new bitstream archive writer with given basename and permutations.
  BitstreamArchiveWriter(String basename, SequentialBitstreamArchive prototype, boolean randomAccess)
           
  BitstreamArchiveWriter(String basename, SequentialBitstreamArchive prototype, boolean randomAccess, Map<SequentialBitstreamArchive.CompressionFlags.Component,SequentialBitstreamArchive.CompressionFlags.Coding> codings)
           
 
Method Summary
 void append(T summary)
          Appends a new document summary to the archive.
 void close()
           
protected static int[] invertPermutation(int[] source)
          Returns an array containing the inverse permutation of the source one.
static int[][] makeSortingPermutations(int[] frequency)
          Builds the frequency-rank-to-term and term-to-frequency-rank permutations, mapping each frequency rank to the respective term and vice versa.
 
Methods inherited from class it.unimi.dsi.archive4j.ArchiveWriter
appendAll
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

CURRENT_VERSION

public static int CURRENT_VERSION
Constructor Detail

BitstreamArchiveWriter

public BitstreamArchiveWriter(CharSequence basename,
                              int[] frequency,
                              boolean sorted,
                              Map<SequentialBitstreamArchive.CompressionFlags.Component,SequentialBitstreamArchive.CompressionFlags.Coding> codings)
                       throws IOException
Creates a new bitstream archive writer with given basename and frequency array.

You have the option to create an archive marked as sorted. Such an archive will be loaded as a RandomAccessBitstreamArchive and allow random access. You must, however, guarantee that calls to ArchiveWriter.append(DocumentSummary) happen with increasing document ids.

Parameters:
basename - the basename of the archive.
frequency - the array of frequencies.
sorted - if true, the resulting archive will be sorted.
Throws:
IOException

BitstreamArchiveWriter

public BitstreamArchiveWriter(CharSequence basename,
                              int[][] permutations,
                              boolean sorted,
                              Map<SequentialBitstreamArchive.CompressionFlags.Component,SequentialBitstreamArchive.CompressionFlags.Coding> codings)
                       throws IOException
Creates a new bitstream archive writer with given basename and sorting permutations.

This constructor can be useful if you want to call makeSortingPermutations(int[]) on your own and then free the memory occupied by the frequency array.

You have the option to create an archive marked as sorted. Such an archive will be loaded as a RandomAccessBitstreamArchive and allow random access. You must, however, guarantee that calls to ArchiveWriter.append(DocumentSummary) happen with increasing document ids.

Parameters:
basename - the basename of the archive.
permutations - two arrays containing the sorting-by-rank and the inverse term permutations (i.e., the output of makeSortingPermutations(int[])).
sorted - if true, the resulting archive will be sorted.
Throws:
IOException
See Also:
BitstreamArchiveWriter(CharSequence, int[], boolean, Map)

BitstreamArchiveWriter

protected BitstreamArchiveWriter(CharSequence basename,
                                 int[] term2Rank,
                                 int[] rank2Term,
                                 boolean randomAccess,
                                 Map<SequentialBitstreamArchive.CompressionFlags.Component,SequentialBitstreamArchive.CompressionFlags.Coding> codings)
                          throws IOException
Creates a new bitstream archive writer with given basename and permutations.

Parameters:
basename - the basename of the archive.
term2Rank - the permutation from term numbers to frequency ranks.
rank2Term - the inverse of term2Rank.
randomAccess - if true, the resulting archive will be accessible randomly; documents must be provided in increasing order.
Throws:
IOException
See Also:
BitstreamArchiveWriter(CharSequence, int[], boolean, Map)

BitstreamArchiveWriter

public BitstreamArchiveWriter(String basename,
                              SequentialBitstreamArchive prototype,
                              boolean randomAccess)
                       throws IOException
Throws:
IOException

BitstreamArchiveWriter

public BitstreamArchiveWriter(String basename,
                              SequentialBitstreamArchive prototype,
                              boolean randomAccess,
                              Map<SequentialBitstreamArchive.CompressionFlags.Component,SequentialBitstreamArchive.CompressionFlags.Coding> codings)
                       throws IOException
Throws:
IOException
Method Detail

makeSortingPermutations

public static int[][] makeSortingPermutations(int[] frequency)
Builds the frequency-rank-to-term and term-to-frequency-rank permutations, mapping each frequency rank to the respective term and vice versa.

Returns:
a two-element array containing the term-to-rank and the rank-to-term permutations, respectively.
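
One way such permutations could be derived from a frequency array is sketched below. This is an assumption-laden illustration, not the method's actual implementation: the class SortingPermutationsSketch and its two methods are hypothetical, ties between equal frequencies are broken by increasing term number, and the two permutations are returned by separate methods rather than as a two-element array.

```java
import java.util.Arrays;
import java.util.Comparator;

public class SortingPermutationsSketch {
    /** Returns the terms listed by descending frequency: position r holds the term of rank r. */
    public static int[] rankToTerm(int[] frequency) {
        Integer[] terms = new Integer[frequency.length];
        for (int t = 0; t < terms.length; t++) terms[t] = t;
        // Arrays.sort on objects is stable, so terms with equal frequency keep increasing order.
        Arrays.sort(terms, Comparator.comparingInt((Integer t) -> frequency[t]).reversed());
        int[] rank2Term = new int[terms.length];
        for (int r = 0; r < terms.length; r++) rank2Term[r] = terms[r];
        return rank2Term;
    }

    /** Returns the inverse mapping: position t holds the rank of term t. */
    public static int[] termToRank(int[] rank2Term) {
        int[] term2Rank = new int[rank2Term.length];
        for (int r = 0; r < rank2Term.length; r++) term2Rank[rank2Term[r]] = r;
        return term2Rank;
    }

    public static void main(String[] args) {
        int[] frequency = { 10, 50, 20, 50 };
        int[] rank2Term = rankToTerm(frequency);
        System.out.println(Arrays.toString(rank2Term));             // [1, 3, 2, 0]
        System.out.println(Arrays.toString(termToRank(rank2Term))); // [3, 0, 2, 1]
    }
}
```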

invertPermutation

protected static int[] invertPermutation(int[] source)
Returns an array containing the inverse permutation of the source one.
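
For reference, inverting a permutation amounts to a single pass over the source array. The sketch below is a hypothetical stand-alone version (the class InvertPermutationSketch is not part of the library) and assumes the source is a valid permutation of 0..n-1.

```java
import java.util.Arrays;

public class InvertPermutationSketch {
    /** Returns the array inv such that inv[source[i]] == i for every index i. */
    public static int[] invert(int[] source) {
        int[] inverse = new int[source.length];
        for (int i = 0; i < source.length; i++) inverse[source[i]] = i;
        return inverse;
    }

    public static void main(String[] args) {
        int[] source = { 2, 0, 3, 1 };
        int[] inverse = invert(source);
        System.out.println(Arrays.toString(inverse));         // [1, 3, 0, 2]
        System.out.println(Arrays.toString(invert(inverse))); // [2, 0, 3, 1]
    }
}
```

Inverting twice returns the original permutation, which is a convenient sanity check when a term-to-rank permutation and its inverse are passed explicitly.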


append

public void append(T summary)
            throws IOException
Description copied from class: ArchiveWriter
Appends a new document summary to the archive.

Specified by:
append in class ArchiveWriter<T extends DocumentSummary>
Parameters:
summary - a document summary.
Throws:
IOException

close

public void close()
           throws IOException
Throws:
IOException