it.unimi.dsi.archive4j
Class SequentialBitstreamArchive

java.lang.Object
  extended by it.unimi.dsi.archive4j.SequentialBitstreamArchive
All Implemented Interfaces:
Archive<ArrayDocumentSummary>, FlyweightPrototype<Archive<ArrayDocumentSummary>>, Closeable, Iterable<ArrayDocumentSummary>
Direct Known Subclasses:
RandomAccessBitstreamArchive

public class SequentialBitstreamArchive
extends Object
implements Archive<ArrayDocumentSummary>

An Archive implementation providing sequential access only.

Author:
Alessio Orlandi, Sebastiano Vigna
See Also:
RandomAccessBitstreamArchive, BitstreamArchiveWriter

Nested Class Summary
static class SequentialBitstreamArchive.CompressionFlags
          Class representing compression flags for much of the data in this archive.
static class SequentialBitstreamArchive.PropertyKeys
          Additional properties (w.r.t.
 
Field Summary
static String ARCHIVE_EXTENSION
          The standard archive extension.
protected  CharSequence basename
          The basename of this archive.
protected  Map<SequentialBitstreamArchive.CompressionFlags.Component,SequentialBitstreamArchive.CompressionFlags.Coding> codings
          The codings of this archive.
protected  InputBitStream data
          The input bit stream for the data file.
protected  FastMultiByteArrayInputStream fmbais
          If not null, the in-memory stream upon which data is based.
protected  int[] frequency
          The frequency of each term.
protected  int numberOfDocuments
          The number of document summaries in this archive.
protected  int numberOfTerms
          The number of terms in this archive.
protected  long numberOfWords
          The number of words in the documents summarized by this archive.
static String PERM_EXTENSION
          The standard permutation extension.
protected  int[] rank2Term
          The map from frequency rank to terms.
protected  List<? extends CharSequence> uriList
          An optional list of URIs that will be used to create the URI associated with each summary.
 
Constructor Summary
protected SequentialBitstreamArchive(CharSequence basename, int[] rank2Term, Properties properties, List<? extends CharSequence> uriList, int[] frequency)
          Creates a new sequential bitstream archive.
protected SequentialBitstreamArchive(SequentialBitstreamArchive archive)
           
 
Method Summary
 void close()
           
 SequentialBitstreamArchive copy()
           
protected  void ensureOpen()
           
 int frequency(int term)
          Returns the frequency of a given term.
 Map<SequentialBitstreamArchive.CompressionFlags.Component,SequentialBitstreamArchive.CompressionFlags.Coding> getCodings()
          Returns an unmodifiable copy of the codings used by this archive.
 ArrayDocumentSummary getDocumentById(int id)
          Returns a document given its id (optional operation).
 ArrayDocumentSummary getDocumentByIndex(int index)
          Returns a document by index (position in the archive) (optional operation).
static SequentialBitstreamArchive getInstance(CharSequence basename, Properties properties, CharSequence uriFilename)
          Returns a SequentialBitstreamArchive loaded from the given basename and an optional URI list.
 int[] getPermutation()
          Returns the rank-to-term permutation.
 boolean hasRandomAccess()
          Returns whether the archive supports random access, that is, Archive.getDocumentById(int) and Archive.getDocumentByIndex(int).
 Iterator<ArrayDocumentSummary> iterator()
           
protected static int[] loadFrequencies(CharSequence basename, int numTerms)
          Loads γ-coded frequencies, if they exist.
 int numberOfDocuments()
          Returns the number of documents in the archive.
 int numberOfTerms()
          Returns the number of terms in the archive.
 long numberOfWords()
          Returns the number of words in the collection (i.e., the sum of the lengths of all documents).
protected  ArrayDocumentSummary readCurrentDocument(int id)
          Reads the document record beginning at the current file position and builds an ArrayDocumentSummary object representing it, if necessary.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

ARCHIVE_EXTENSION

public static final String ARCHIVE_EXTENSION
The standard archive extension.

See Also:
Constant Field Values

PERM_EXTENSION

public static final String PERM_EXTENSION
The standard permutation extension.

See Also:
Constant Field Values

data

protected InputBitStream data
The input bit stream for the data file. If fmbais is not null, this stream wraps it; otherwise, it reads directly from a file named basename + ARCHIVE_EXTENSION. When the archive is closed, this field is set to null.


fmbais

protected final FastMultiByteArrayInputStream fmbais
If not null, the in-memory stream upon which data is based.


uriList

protected final List<? extends CharSequence> uriList
An optional list of URIs that will be used to create the URI associated with each summary.


numberOfDocuments

protected final int numberOfDocuments
The number of document summaries in this archive.


numberOfTerms

protected final int numberOfTerms
The number of terms in this archive.


numberOfWords

protected final long numberOfWords
The number of words in the documents summarized by this archive.


rank2Term

protected final int[] rank2Term
The map from frequency rank to terms.


frequency

protected final int[] frequency
The frequency of each term.


codings

protected final Map<SequentialBitstreamArchive.CompressionFlags.Component,SequentialBitstreamArchive.CompressionFlags.Coding> codings
The codings of this archive.


basename

protected CharSequence basename
The basename of this archive.

Constructor Detail

SequentialBitstreamArchive

protected SequentialBitstreamArchive(CharSequence basename,
                                     int[] rank2Term,
                                     Properties properties,
                                     List<? extends CharSequence> uriList,
                                     int[] frequency)
                              throws IOException
Creates a new sequential bitstream archive.

Parameters:
basename - the basename of the archive.
rank2Term - the permutation from rank to terms.
properties - the properties of the archive.
uriList - an optional list of URIs that will be used to associate a URI with each summary, or null.
frequency - the term frequencies.
Throws:
IOException

SequentialBitstreamArchive

protected SequentialBitstreamArchive(SequentialBitstreamArchive archive)
                              throws IOException
Throws:
IOException
Method Detail

ensureOpen

protected void ensureOpen()
                   throws IllegalStateException
Throws:
IllegalStateException
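
The description of the data field says it is set to null when the archive is closed, which suggests how ensureOpen() might be used. The following is a minimal sketch of that pattern, not the actual implementation: ClosableReaderSketch and readByte() are hypothetical names, and a ByteArrayInputStream stands in for the InputBitStream.

```java
import java.io.ByteArrayInputStream;

// Hypothetical stand-in, not the archive4j class: it only illustrates the
// "null the stream on close, check it before every read" pattern.
class ClosableReaderSketch implements AutoCloseable {
    // Plays the role of the data field; null once the reader is closed.
    private ByteArrayInputStream data;

    ClosableReaderSketch(byte[] bytes) {
        this.data = new ByteArrayInputStream(bytes);
    }

    // Fails fast instead of letting a later read hit a null stream.
    protected void ensureOpen() {
        if (data == null) throw new IllegalStateException("This reader has been closed");
    }

    // Hypothetical read method: every access goes through ensureOpen().
    int readByte() {
        ensureOpen();
        return data.read();
    }

    @Override
    public void close() {
        data = null; // mark as closed; closing twice is harmless
    }
}
```

With this arrangement, any read attempted after close() raises IllegalStateException rather than a NullPointerException.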

close

public void close()
           throws IOException
Specified by:
close in interface Closeable
Throws:
IOException

numberOfWords

public long numberOfWords()
Description copied from interface: Archive
Returns the number of words in the collection (i.e., the sum of the lengths of all documents).

Specified by:
numberOfWords in interface Archive<ArrayDocumentSummary>
Returns:
the number of words in the collection.

readCurrentDocument

protected ArrayDocumentSummary readCurrentDocument(int id)
                                            throws IOException
Reads the document record beginning at the current file position and builds an ArrayDocumentSummary object representing it, if necessary.

Throws:
IOException

getCodings

public Map<SequentialBitstreamArchive.CompressionFlags.Component,SequentialBitstreamArchive.CompressionFlags.Coding> getCodings()
Returns an unmodifiable copy of the codings used by this archive.

Returns:
an unmodifiable copy of the codings used by this archive.
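
One common way to return an unmodifiable copy of an enum-keyed map is to copy it and then wrap the copy. The sketch below assumes that approach; the enum constants are hypothetical placeholders, since the real ones are defined in SequentialBitstreamArchive.CompressionFlags and are not listed on this page.

```java
import java.util.Collections;
import java.util.EnumMap;
import java.util.Map;

class CodingsSketch {
    // Hypothetical constants, for illustration only.
    enum Component { COUNTS, TERMS }
    enum Coding { GAMMA, DELTA }

    private final EnumMap<Component, Coding> codings = new EnumMap<Component, Coding>(Component.class);

    CodingsSketch() {
        codings.put(Component.COUNTS, Coding.GAMMA);
        codings.put(Component.TERMS, Coding.DELTA);
    }

    // Copy first, then wrap: the caller can neither modify the result nor
    // reach the internal map through it.
    Map<Component, Coding> getCodings() {
        return Collections.unmodifiableMap(new EnumMap<Component, Coding>(codings));
    }
}
```

Calling put() on the returned map throws UnsupportedOperationException, and the internal map is left untouched.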

getPermutation

public int[] getPermutation()
Returns the rank-to-term permutation.

Returns:
the rank-to-term permutation.
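
To illustrate what a rank-to-term permutation looks like, here is a small sketch that derives one from an array of term frequencies. It assumes that rank 0 denotes the most frequent term and that ties are broken by term number; this page does not state either convention, and RankPermutationSketch and its methods are hypothetical names.

```java
import java.util.Arrays;

class RankPermutationSketch {
    // Returns an array p such that p[r] is the term with the r-th highest
    // frequency (assumption: rank 0 is the most frequent term).
    static int[] rank2Term(final int[] frequency) {
        Integer[] terms = new Integer[frequency.length];
        for (int t = 0; t < terms.length; t++) terms[t] = t;
        // Sorting an object array is stable, so equally frequent terms keep
        // their original (term number) order.
        Arrays.sort(terms, (a, b) -> Integer.compare(frequency[b], frequency[a]));
        int[] result = new int[terms.length];
        for (int r = 0; r < result.length; r++) result[r] = terms[r];
        return result;
    }

    // Inverts the permutation: the result maps each term to its rank.
    static int[] invert(int[] rank2Term) {
        int[] term2Rank = new int[rank2Term.length];
        for (int r = 0; r < rank2Term.length; r++) term2Rank[rank2Term[r]] = r;
        return term2Rank;
    }
}
```

For frequencies {5, 20, 1, 20}, rank2Term yields {1, 3, 0, 2}, and inverting it gives {2, 0, 3, 1}.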

iterator

public Iterator<ArrayDocumentSummary> iterator()
Specified by:
iterator in interface Iterable<ArrayDocumentSummary>

numberOfDocuments

public int numberOfDocuments()
Description copied from interface: Archive
Returns the number of documents in the archive.

Specified by:
numberOfDocuments in interface Archive<ArrayDocumentSummary>
Returns:
the number of documents in the archive.

numberOfTerms

public int numberOfTerms()
Description copied from interface: Archive
Returns the number of terms in the archive.

Specified by:
numberOfTerms in interface Archive<ArrayDocumentSummary>
Returns:
the number of terms in the archive.

frequency

public int frequency(int term)
Description copied from interface: Archive
Returns the frequency of a given term.

Specified by:
frequency in interface Archive<ArrayDocumentSummary>
Parameters:
term - a term number.
Returns:
the frequency of the given term.

hasRandomAccess

public boolean hasRandomAccess()
Description copied from interface: Archive
Returns whether the archive supports random access, that is, Archive.getDocumentById(int) and Archive.getDocumentByIndex(int).

Specified by:
hasRandomAccess in interface Archive<ArrayDocumentSummary>
Returns:
whether the archive supports random access.
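
A caller that must work with both sequential and random-access archives can test hasRandomAccess() before using the optional operations. The sketch below uses MiniArchive, a hypothetical and much simplified stand-in for the Archive interface, and assumes that an unsupported optional operation throws UnsupportedOperationException, as is conventional in Java; this page does not say what SequentialBitstreamArchive actually does.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical, simplified stand-in for the Archive interface.
interface MiniArchive extends Iterable<String> {
    boolean hasRandomAccess();
    String getDocumentByIndex(int index); // optional operation
}

// A sequential-only implementation backed by an in-memory list.
class SequentialOnlySketch implements MiniArchive {
    private final List<String> docs;

    SequentialOnlySketch(String... docs) { this.docs = Arrays.asList(docs); }

    public boolean hasRandomAccess() { return false; }

    public String getDocumentByIndex(int index) {
        throw new UnsupportedOperationException("sequential access only");
    }

    public Iterator<String> iterator() { return docs.iterator(); }
}

class AccessSketch {
    // Uses random access when available; otherwise scans from the start.
    static String fetch(MiniArchive archive, int index) {
        if (archive.hasRandomAccess()) return archive.getDocumentByIndex(index);
        int i = 0;
        for (String doc : archive) {
            if (i++ == index) return doc;
        }
        return null; // index past the end of the archive
    }
}
```

The sequential fallback costs a scan proportional to the requested index, which is the price of an archive that supports iteration only.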

loadFrequencies

protected static int[] loadFrequencies(CharSequence basename,
                                       int numTerms)
                                throws IOException
Loads γ-coded frequencies, if they exist.

Throws:
IOException
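
For readers unfamiliar with γ coding, the following sketch implements the textbook Elias γ code on strings of '0' and '1' characters. It is illustrative only: the real method reads from a bit stream, and this page does not specify the file layout or whether each frequency is stored as is or shifted by one so that zero can be represented. GammaSketch and its methods are hypothetical names.

```java
class GammaSketch {
    // Encodes n >= 1 as floor(log2 n) zeros followed by the binary digits of n.
    static String encode(int n) {
        if (n < 1) throw new IllegalArgumentException("n must be positive");
        String binary = Integer.toBinaryString(n);
        StringBuilder sb = new StringBuilder();
        for (int i = 1; i < binary.length(); i++) sb.append('0');
        return sb.append(binary).toString();
    }

    // Decodes count consecutive gamma codes from a string of '0'/'1' characters.
    static int[] decode(String bits, int count) {
        int[] result = new int[count];
        int pos = 0;
        for (int i = 0; i < count; i++) {
            // The run of leading zeros gives the number of bits after the first 1.
            int zeros = 0;
            while (bits.charAt(pos) == '0') { zeros++; pos++; }
            // Read zeros + 1 binary digits, most significant first.
            int value = 0;
            for (int j = 0; j <= zeros; j++) value = (value << 1) | (bits.charAt(pos++) - '0');
            result[i] = value;
        }
        return result;
    }
}
```

For example, 1, 2 and 5 encode to "1", "010" and "00101"; decoding the concatenation "101000101" with a count of 3 recovers {1, 2, 5}.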

getInstance

public static SequentialBitstreamArchive getInstance(CharSequence basename,
                                                     Properties properties,
                                                     CharSequence uriFilename)
                                              throws IOException,
                                                     ClassNotFoundException
Returns a SequentialBitstreamArchive loaded from the given basename and an optional URI list.

Parameters:
basename - the archive basename.
properties - the archive properties.
uriFilename - the filename of a URI list, or null; the file must contain either a StringMap supporting StringMap.list(), or a List of CharSequences.
Returns:
the SequentialBitstreamArchive with the given basename and URI list.
Throws:
IOException
ClassNotFoundException

getDocumentById

public ArrayDocumentSummary getDocumentById(int id)
                                     throws IOException
Description copied from interface: Archive
Returns a document given its id (optional operation).

Specified by:
getDocumentById in interface Archive<ArrayDocumentSummary>
Parameters:
id - a document id.
Returns:
the document with the given id, or null if no such document exists.
Throws:
IOException

getDocumentByIndex

public ArrayDocumentSummary getDocumentByIndex(int index)
                                        throws IOException
Description copied from interface: Archive
Returns a document by index (position in the archive) (optional operation).

Specified by:
getDocumentByIndex in interface Archive<ArrayDocumentSummary>
Parameters:
index - the document index.
Throws:
IOException

copy

public SequentialBitstreamArchive copy()
Specified by:
copy in interface FlyweightPrototype<Archive<ArrayDocumentSummary>>
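
The flyweight-copy idea can be sketched as follows: a copy shares the large read-only state with the original but has its own mutable read position, so each copy can be read independently. This is an illustration of the general pattern under that assumption. FlyweightCopySketch is a hypothetical class, and this page does not state which fields the actual copy() shares.

```java
class FlyweightCopySketch {
    private final int[] shared; // large read-only state, shared by all copies
    private int position;       // per-instance mutable state

    FlyweightCopySketch(int[] shared) {
        this.shared = shared;
    }

    // Copy constructor, in the style of a protected "copy from archive" constructor.
    private FlyweightCopySketch(FlyweightCopySketch other) {
        this.shared = other.shared; // same array, no duplication
        this.position = 0;          // fresh cursor
    }

    FlyweightCopySketch copy() {
        return new FlyweightCopySketch(this);
    }

    boolean hasNext() { return position < shared.length; }

    int next() { return shared[position++]; }

    // True if both instances are backed by the very same array.
    boolean sharesStateWith(FlyweightCopySketch other) { return shared == other.shared; }
}
```

A copy made this way starts reading from the beginning regardless of how far the original has advanced, and advancing one does not affect the other.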