it.unimi.dsi.archive4j
Interface DocumentSummary

All Known Implementing Classes:
ArrayDocumentSummary

public interface DocumentSummary

A summary of the term information of a document.

Summaries contain a document id that is unique within the archive and, optionally, a URI identifying the original document. Besides that, they provide the document length (e.g., before term pruning), the terms appearing in the document, and their counts (a.k.a. within-document frequencies).

Additionally, if sorted() returns true then terms appear in increasing order. That is, term(i) < term(j) for i < j. You can force this property by calling sort().

The equality contract for document summaries is that summaries containing the same information should be equal, regardless of the order in which the terms are provided. Note that you should not call sort() when implementing Object.equals(Object), as the user does not expect an object to change when it is compared.
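
The contract above can be met without mutating either operand. The following is a minimal sketch, not the library's implementation: ToySummary is a hypothetical class that keeps only an id, the terms, and their counts (URI and length are omitted for brevity), and it assumes that each term appears at most once in a summary.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical class for illustration only; it is not part of archive4j.
final class ToySummary {
    private final int id;
    private final int[] terms;
    private final int[] counts;

    ToySummary(int id, int[] terms, int[] counts) {
        if (terms.length != counts.length) throw new IllegalArgumentException("length mismatch");
        this.id = id;
        this.terms = terms.clone();
        this.counts = counts.clone();
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof ToySummary)) return false;
        ToySummary other = (ToySummary) o;
        if (id != other.id || terms.length != other.terms.length) return false;
        // Index this summary by term so that the comparison ignores ordering.
        // Neither object is sorted or otherwise modified.
        Map<Integer, Integer> byTerm = new HashMap<>();
        for (int i = 0; i < terms.length; i++) byTerm.put(terms[i], counts[i]);
        for (int i = 0; i < other.terms.length; i++) {
            Integer c = byTerm.get(other.terms[i]);
            if (c == null || c != other.counts[i]) return false;
        }
        return true;
    }

    @Override
    public int hashCode() {
        // Addition is commutative, so the hash does not depend on term order.
        int h = id;
        for (int i = 0; i < terms.length; i++) h += 31 * terms[i] + counts[i];
        return h;
    }
}
```

Two ToySummary instances holding the same (term, count) pairs in different orders compare equal and share a hash code, which keeps equals and hashCode consistent with each other.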

Author:
Alessio Orlandi, Sebastiano Vigna

Method Summary
 int count(int index)
          Returns the count (a.k.a. within-document frequency) of the term at the given index.
 int id()
          Returns the id of the document this summary represents.
 int indexOf(int term)
          Returns the index of the given term.
 int length()
          Returns the length in words of the document this summary represents.
 int size()
          Returns the number of terms in this summary.
 DocumentSummary sort()
          Sorts this summary.
 boolean sorted()
          Returns whether this summary is sorted.
 int term(int index)
          Returns the term at the given index.
 URI uri()
          A URI representing the source of this summary.
 

Method Detail

id

int id()
Returns the id of the document this summary represents.

Returns:
the id of the document this summary represents.

size

int size()
Returns the number of terms in this summary.

Note that due to pruning (e.g., of hapax legomena) the number of terms might be different from the number of terms of the document this summary represents.

Returns:
the number of terms in this summary.

term

int term(int index)
Returns the term at the given index.

Returns:
the term at the given index.

count

int count(int index)
Returns the count (a.k.a. within-document frequency) of the term at the given index.

Returns:
the count of the term at the given index.
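
The index-based accessors size(), term(int) and count(int) are naturally used together in a loop. The sketch below assumes a hypothetical TermCounts interface that mirrors only those three methods, plus a hypothetical array-backed implementation; neither is the archive4j type. Because of pruning, the total computed here may differ from the value returned by length().

```java
// Hypothetical stand-in for the three accessors documented above.
interface TermCounts {
    int size();
    int term(int index);
    int count(int index);
}

// Hypothetical array-backed implementation, for illustration only.
final class ArrayTermCounts implements TermCounts {
    private final int[] terms;
    private final int[] counts;

    ArrayTermCounts(int[] terms, int[] counts) {
        if (terms.length != counts.length) throw new IllegalArgumentException("length mismatch");
        this.terms = terms.clone();
        this.counts = counts.clone();
    }

    public int size() { return terms.length; }
    public int term(int index) { return terms[index]; }
    public int count(int index) { return counts[index]; }
}

final class TermCountStats {
    // Sums the within-document frequencies of all terms kept in the summary.
    static long totalOccurrences(TermCounts s) {
        long total = 0;
        for (int i = 0; i < s.size(); i++) total += s.count(i);
        return total;
    }

    // Returns the term number with the highest count, or -1 for an empty summary.
    static int mostFrequentTerm(TermCounts s) {
        int best = -1;
        int bestCount = -1;
        for (int i = 0; i < s.size(); i++) {
            if (s.count(i) > bestCount) {
                bestCount = s.count(i);
                best = s.term(i);
            }
        }
        return best;
    }
}
```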

indexOf

int indexOf(int term)
Returns the index of the given term.

Parameters:
term - a term number.
Returns:
the index of the given term in this summary, or -1 if the term does not appear in this summary.
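
One plausible way to implement this lookup is to use binary search when the summary is sorted and a linear scan otherwise. This is a sketch under stated assumptions, not the library's code: TermLookup is a hypothetical helper, and the terms are assumed to be stored in a plain int array with a flag recording whether they are in increasing order.

```java
import java.util.Arrays;

// Hypothetical helper for illustration only; it is not part of archive4j.
final class TermLookup {
    // Returns the position of term in terms, or -1 if it does not appear.
    static int indexOf(int[] terms, boolean sorted, int term) {
        if (sorted) {
            // Arrays.binarySearch returns a negative value when the key is absent.
            int pos = Arrays.binarySearch(terms, term);
            return pos >= 0 ? pos : -1;
        }
        // Without ordering guarantees, fall back to a linear scan.
        for (int i = 0; i < terms.length; i++) {
            if (terms[i] == term) return i;
        }
        return -1;
    }
}
```

The sorted path takes logarithmic time in the number of terms, which is one reason a caller might choose to call sort() before performing many lookups.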

uri

URI uri()
A URI representing the source of this summary.

Returns:
a URI representing the source of this summary, or null if no such URI is available.

length

int length()
Returns the length in words of the document this summary represents.

Returns:
the length in words of the document this summary represents.

sorted

boolean sorted()
Returns whether this summary is sorted.

Returns:
whether this summary is sorted.

sort

DocumentSummary sort()
Sorts this summary.

After calling this method, sorted() will return true.

Returns:
this document summary.
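
Since sort() returns the summary itself, calls can be chained, and terms and counts must be reordered together so that each count stays attached to its term. The following sketch illustrates that behavior with a hypothetical SortableTermCounts class; the parallel-array layout and the insertion sort are assumptions made for brevity, not a description of ArrayDocumentSummary.

```java
// Hypothetical class for illustration only; it is not part of archive4j.
final class SortableTermCounts {
    private final int[] terms;
    private final int[] counts;
    private boolean sorted;

    SortableTermCounts(int[] terms, int[] counts) {
        if (terms.length != counts.length) throw new IllegalArgumentException("length mismatch");
        this.terms = terms.clone();
        this.counts = counts.clone();
        // Sorted means term(i) < term(j) for every i < j.
        boolean increasing = true;
        for (int i = 1; i < this.terms.length; i++) {
            if (this.terms[i - 1] >= this.terms[i]) increasing = false;
        }
        this.sorted = increasing;
    }

    int size() { return terms.length; }
    int term(int index) { return terms[index]; }
    int count(int index) { return counts[index]; }
    boolean sorted() { return sorted; }

    // Sorts terms in increasing order, moving each count along with its term.
    SortableTermCounts sort() {
        if (!sorted) {
            for (int i = 1; i < terms.length; i++) {
                int t = terms[i];
                int c = counts[i];
                int j = i - 1;
                while (j >= 0 && terms[j] > t) {
                    terms[j + 1] = terms[j];
                    counts[j + 1] = counts[j];
                    j--;
                }
                terms[j + 1] = t;
                counts[j + 1] = c;
            }
            sorted = true;
        }
        return this; // same instance, so calls such as s.sort().term(0) can be chained
    }
}
```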