Introduction

Archive4J is a free archive engine, written in Java, for large document collections. By “archive engine” we mean a set of algorithmic tools and implementations that make it possible to build a direct index of a document collection. In particular, for each document we want to be able to recover some basic data: the length of the document in words, the list of distinct terms appearing in the document, and the number of occurrences of each term in the document (its count). We strive for a very high compression rate and for very fast random access. To achieve both, Archive4J combines techniques typical of search engines with succinct data structures.
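
To make the per-document data described above concrete, here is a minimal sketch of what a direct index records for a single document. It is an illustration only and does not use the Archive4J API: the names ToyDirectIndex, DocumentSummary and summarize are hypothetical, the tokenization (lowercasing and splitting on non-word characters) is an assumption, and nothing is compressed or stored succinctly.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Toy in-memory direct index entry. Hypothetical names, NOT the Archive4J API. */
final class ToyDirectIndex {

    /** What we want to recover for each document. */
    static final class DocumentSummary {
        /** Length of the document in words. */
        final int length;
        /** Distinct terms (the keys) and their counts (the values), sorted by term. */
        final Map<String, Integer> counts;

        DocumentSummary(int length, Map<String, Integer> counts) {
            this.length = length;
            this.counts = counts;
        }
    }

    /** Builds the summary of one document; the tokenization is an assumed, simplistic one. */
    static DocumentSummary summarize(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        int length = 0;
        for (String word : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (word.isEmpty()) continue; // split() may yield a leading empty string
            counts.merge(word, 1, Integer::sum); // one more occurrence of this term
            length++;
        }
        return new DocumentSummary(length, counts);
    }

    public static void main(String[] args) {
        DocumentSummary s = summarize("To be, or not to be");
        System.out.println(s.length);           // 6
        System.out.println(s.counts.keySet());  // [be, not, or, to]
        System.out.println(s.counts.get("to")); // 2
    }
}
```

A real archive would keep one such summary per document and would have to answer the same three questions (length, distinct terms, counts) by random access into a compressed representation, rather than from a plain in-memory map as in this sketch.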

Archive4J is free software distributed under the GNU Lesser General Public License. Datasets containing summaries of web snapshots are available at the LAW web site.

Installation

For a quick start, you just have to install the .jar file that comes with the distribution, together with its dependencies, which are gathered for your convenience in a tarball.

A more detailed list of the dependencies can be found in the overview of the Javadoc documentation. There is also a JPackage-like RPM.