LLM data

It is well understood that LLMs thrive on high-quality data. We have the largest collection of books, papers, magazines, etc. in the world, which are some of the highest-quality text sources.

Unique scale and range

Our collection contains over a hundred million files, including academic journals, textbooks, and magazines. We achieve this scale by combining large existing repositories.

Some of our source collections are already available in bulk (Sci-Hub and parts of Libgen). Other sources we liberated ourselves. The Datasets page shows a full overview.

Our collection includes millions of books, papers, and magazines from before the e-book era. Large parts of this collection have already been OCR’ed and have little internal overlap.

How we can help

We’re able to provide high-speed access to our full collections, as well as to unreleased collections.

This is enterprise-level access that we can provide for donations in the range of tens of thousands of USD. We’re also willing to trade this for high-quality collections that we don’t have yet.

We can refund you if you’re able to provide us with enrichment of our data, such as:

OCR
Removing overlap (deduplication)
Text and metadata extraction
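
To give a rough idea of what the deduplication item above could involve, here is a minimal sketch of exact-duplicate removal over extracted text. It is only an illustration under stated assumptions: the function names and the in-memory dict of documents are hypothetical, not part of our pipeline, and it catches only texts that are identical after simple normalization. Near-duplicates, such as different scans of the same book, would need a fuzzier technique.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return re.sub(r"\s+", " ", text).strip().lower()


def deduplicate(docs: dict[str, str]) -> dict[str, str]:
    """Keep the first document seen for each normalized-content hash.

    `docs` maps a hypothetical document ID to its extracted text.
    """
    seen: set[str] = set()
    unique: dict[str, str] = {}
    for doc_id, text in docs.items():
        digest = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique[doc_id] = text
    return unique
```

For a real collection the texts would be streamed from disk rather than held in one dict, but the idea of hashing normalized content stays the same.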

Support long-term archival of human knowledge, while getting better data for your model!