# Atlas: Orienting the Pre-Training data of an LLM

* * *

Authors: **Nathaniel Monson**, Founding Research Scientist; **Saqib Azim**, Research Engineer; **Julius Adebayo**, Co-Founder & CEO

Published: **December 02, 2025**

* * *

### Introduction

We built Atlas, an automated system for annotating language modeling corpora with human-understandable concepts at a sub-document level. Using this system, we annotated a 1.5 trillion-token corpus spanning webtext, scientific writing, code, and synthetic data with over 33,000 concepts across science, technology, philosophy, medicine, and law. These annotations allow us to train interpretable language models whose representations are aligned with human-meaningful abstractions. Beyond model training, the annotations enable transparent model auditing, contamination detection, and fine-grained model control. We have replicated the system on FineWeb and will be releasing `concept-fineweb-10b`, a 10-billion-token corpus annotated with its own data-derived concept library.

The Concept Atlas below is an interactive visualization (a UMAP projection) of a representative 10% subset (3,372 concepts) of the full set of 33,732 concepts.


Science (mathematics, astronomy, physics, chemistry, life and computer sciences)  
Philosophy, psychology, religion  
Technology (engineering, manufacturing, patents, home economics)  
Geography, anthropology, recreation (travel, maps, folklore, sports)  
Agriculture (crops, livestock, forestry, fisheries, agribusiness)  
Auxiliary sciences of history (archaeology, genealogy, numismatics, chronology)  
Political science (theory, international, public administration, lawmaking)  
World history; history of Europe, Asia, Africa, Australia, New Zealand, etc.  
National History of the USA (general and period histories)  
Fine arts (architecture, sculpture, painting, decorative arts, photography)  
Medicine (clinical, public health, nursing, pharmacy, veterinary)  
Education (theory, practice, administration)  
Language and literature (linguistics, philology, all world literatures)  
Military science (organization, strategy, tactics, weapons)  
Bibliography, library science, information resources  
History of the Americas: local U.S., Canada, Latin America, Caribbean  
Naval science (naval organization, warfare, navigation, merchant marine)  
General works (encyclopedias, periodicals, museums, reference tools)  
Law (all jurisdictions, comparative and international law)  
Social sciences (economics, sociology, business, statistics, transportation)  
+14 more

### Tags → Concepts

| Tag | Concept Name | Concept Description |
| --- | --- | --- |
| mythology | Mythological narratives | Narratives, characters, and themes from traditional stories |
| mesopotamian-mythology | Ancient Mesopotamian studies | History, literature, mythology of ancient Mesopotamia |
| god-relationships | Divine being classifications | Characteristics, roles, interactions of divine beings |
| goddess-suitors | Divine Feminine Figures | Divine feminine attributes and symbolic representations |
| character-conflicts | Interpersonal disputes | Tensions between individuals or groups |
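To make the tag-to-concept relationship concrete, here is a minimal sketch of how chunk-level tags could be resolved to canonical concepts with a plain lookup. The rows are copied from the table above; the `Concept` class, the `TAG_TO_CONCEPT` dictionary, and the `annotate_chunk` function are hypothetical names introduced for illustration and are not part of Atlas, which may well resolve tags in a different way.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Concept:
    """A canonical concept: a short name plus a human-readable description."""
    name: str
    description: str


# Rows copied from the table above. The dictionary layout is an assumption
# made for this sketch, not the storage format Atlas uses.
TAG_TO_CONCEPT = {
    "mythology": Concept(
        "Mythological narratives",
        "Narratives, characters, and themes from traditional stories",
    ),
    "mesopotamian-mythology": Concept(
        "Ancient Mesopotamian studies",
        "History, literature, mythology of ancient Mesopotamia",
    ),
    "god-relationships": Concept(
        "Divine being classifications",
        "Characteristics, roles, interactions of divine beings",
    ),
    "goddess-suitors": Concept(
        "Divine Feminine Figures",
        "Divine feminine attributes and symbolic representations",
    ),
    "character-conflicts": Concept(
        "Interpersonal disputes",
        "Tensions between individuals or groups",
    ),
}


def annotate_chunk(tags):
    """Map a chunk's raw tags to concepts, preserving first-seen order.

    Unknown tags are skipped, and a concept is emitted only once even if
    several tags resolve to it.
    """
    seen = set()
    concepts = []
    for tag in tags:
        concept = TAG_TO_CONCEPT.get(tag)
        if concept is not None and concept.name not in seen:
            seen.add(concept.name)
            concepts.append(concept)
    return concepts


# Hypothetical tags for a single chunk, including a duplicate and an unknown tag.
chunk_tags = ["mythology", "goddess-suitors", "unknown-tag", "mythology"]
names = [c.name for c in annotate_chunk(chunk_tags)]
print(names)
```

A static dictionary is only the simplest possible stand-in: it shows the shape of the mapping (many fine-grained tags, fewer human-readable concepts) without claiming anything about how the concept library itself is derived.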

### Conclusion

We have presented Atlas, a three-stage pipeline for concept annotation of large-scale LLM pre-training corpora. By moving from raw documents to high-recall chunk tags, then to coherent canonical concepts, and finally to a unified multi-domain annotator, we create a foundation for interpretable model training. While there is still room for refinement, the combination of large-scale automation and targeted human validation shows that high-quality concept structure can be embedded directly into modern LLM workflows.
