# PRISM: Training Data Prototypes for Language Models

* * *

Authors: **Dan Ley**, Research Scientist Intern; **Julius Adebayo**, Co-Founder & CEO

Published: **December 08, 2025**

* * *

Contents

- [Introduction](https://www.guidelabs.ai/post/prism/#introduction)
- [PRISM Architecture & Loss Functions](https://www.guidelabs.ai/post/prism/#prism-architecture--loss-functions)
  - [Architecture](https://www.guidelabs.ai/post/prism/#architecture)
  - [Loss functions](https://www.guidelabs.ai/post/prism/#loss-functions)
- [Training Data Attribution in a Single Forward Pass](https://www.guidelabs.ai/post/prism/#training-data-attribution-in-a-single-forward-pass)
- [Automated Interpretability Pipeline](https://www.guidelabs.ai/post/prism/#automated-interpretability-pipeline)
  - [Nearest Neighbor Search](https://www.guidelabs.ai/post/prism/#nearest-neighbor-search)
  - [Automatic Labeling](https://www.guidelabs.ai/post/prism/#automatic-labeling)
- [Scaling from 124M to 1.6B](https://www.guidelabs.ai/post/prism/#scaling-from-124m-to-16b)
- [Conclusion](https://www.guidelabs.ai/post/prism/#conclusion)

We have trained PRISM, a family of interpretable language models, to answer the question: _when an LLM predicts the next token, which training samples is it relying on?_

PRISM traces each prediction to the training data in a single forward pass, at the same cost as generating a single token.
Across parameter sizes from **124M to 1.6B**, PRISM models stay within 5% of their unconstrained counterparts on validation loss and downstream benchmarks, with negligible impact on training time.

**Tracing the language model’s outputs to training data:** In the following demo, PRISM-1.6B decomposes each token it generates into contributions across a handful of prototypes.
A prototype is a learned pattern that represents a cluster of similar examples in the training data.

- Each colored slice on the right shows a prototype’s contribution to the logit for the sampled token.
- Together, the slices add up exactly to the final logit.
- Hover over a slice to see the prototype’s broad category (e.g., _Medical & Bio_), its more specific role (e.g., _“physiology”_), and the representative training data snippet that most strongly activates it.

Prompt

Digestive disorders, diabetes, cancer, obesity, asthma, allergies, and even depression are just a few complications that can arise by not

having enough fiber in your diet.↵

A recent study conducted at the Yale School of Public Health found that those receiving a high fiber diet were less likely to have diabetes. The study showed that those taking one or more daily servings of fruits and vegetables had a 32% lower risk of diabetes, while those receiving less than one daily serve of whole grains showed a 23% lower risk of diabetes.↵

A study conducted on adults with Type 1 Diabetes, found that those suffering from diarrhea and/or constipation experienced a 35% lower risk of developing diabetes. Researchers also noted that fiber may be effective in controlling blood sugar levels, and that it may aid in the prevention of diabetes.↵

A recent study conducted at the National Cancer Institute found that high fiber intake was associated with a lower risk of colorectal cancer, according to lead researcher Dr. Robert Lustig.↵

Dr. Lustig stated that while he was not a doctor and didn't have any medical training, his study showed that those consuming more fiber had a 35% lower risk of colorectal cancer. He stated that the fiber in food has been shown to lower the risk of colon cancer, while increasing your intake of fiber-rich food.↵

The fiber that we consume also

_Interactive figure: per-prototype logit contributions for the selected token (x-axis: Logit Contribution, 0 to 16). Legend categories: Boilerplate / Artifact, Finance / Econ, Grammar / Scaffold, Institutions / Civic / Academic, Medical / Biology, Misc / Other, Morphology, Named Entities, Numbers / Time, Science / Tech._

Prototypes are colored by category and shaded by contribution to the predicted token.

`We show an interactive projection (UMAP) visualization of all 16,384 prototypes that PRISM-1.6B learned.`
Overall, we observe a prototype dictionary where a bit more than half of the units specialize in low-level morphology and grammatical scaffolding (~35% and ~20%, respectively), while a large minority capture domain-heavy patterns such as medical and biological language (~7%), science and technology (~5%), institutional and civic text (~6%), social, demographic, or family descriptions (~3%), named entities (~5%), environment and climate content (~2%), and finance and economics (~1%).
The remaining prototypes concentrate on structured artifacts like numbers and time expressions (~6%), boilerplate fragments (~3%), URLs and identifiers (~2%), and miscellaneous patterns (~3%).

_Interactive figure: UMAP projection of the prototypes, colored by category. Legend: Morphology, Grammar / Scaffold, Numbers / Time, Boilerplate / Artifact, Medical / Biology, Science / Tech, Named Entities, Institutions / Civic / Academic, Social / Demographic / Family, Environment / Climate, URLs / IDs, Finance / Econ, Misc / Other, UX / Meta._

`We now pick a few prototypes and show the training data snippets they map to.`

In the interactive below, each card shows a single learned prototype. The header gives its automatically inferred category and name.
For each prototype, we present the top tokens that are most strongly associated with the prototype, and the training data snippets it maps to.
We can directly trace any generated token to prototypes, and from there to the training data.

Prototype categories shown in the interactive: Medical / Biology, Science / Tech, Numbers / Time, Social / Demographic / Family, Finance / Econ, UX / Meta, URLs / IDs, Environment / Climate, Institutions / Civic / Academic, Named Entities.

**Prototype 6009** (1 of 10), medical\_bio › medical imaging

- Name: imaging\_terms (noun)
- Description: Medical imaging and scanning terminology
- Top logits: imaging (+6.77), scanning (+5.80), scans (+5.66), scan (+5.34), microsc (+5.18)
- Nearest neighbor contexts:
  1. ….D., and colleagues, reported on their functional magnetic resonance
  2. … and their presence can be confirmed only with the help of ultrasound
  3. … patients underwent emergency computed tomography (CT) or magnetic resonance
  4. … of Down syndrome. Your health care provider will use ultrasound
  5. … following data is from a study of 18 pregnant women using ultrasound

`Intervening on prototypes during text generation`

In this demo, we pick one prototype and, at every token, clamp its activation so that its contribution to the sampled token’s logit is forced to be a fixed fraction of PRISM’s original top-1 logit.
This lets us see directly how amplifying or muting a single training pattern (for example, _“clinical trial boilerplate”_ or _“fraction arithmetic”_) changes the model’s behavior.
Hover over the text to visualize how the sampled token’s probability shifts as a result of boosting the prototype.
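The clamp described above can be sketched in a few lines of NumPy. This is a minimal illustration and not PRISM's actual implementation: it assumes each prototype `k` has a fixed logit signature (column `k` of a hypothetical `signatures` matrix), that its contribution to a token's logit is `alpha[k] * signatures[token, k]`, and that everything else in the logit is held fixed during the intervention. All names and toy sizes are made up for the example.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D logit vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def clamp_prototype(logits, alpha, signatures, k, token, fraction):
    """Force prototype k's contribution to `token`'s logit to equal
    `fraction` times the original top-1 logit (assumed reading of the demo)."""
    target = fraction * logits.max()      # fixed fraction of the original top-1 logit
    s = signatures[token, k]              # prototype k's score for this token
    if abs(s) < 1e-8:                     # prototype cannot move this token
        return logits.copy(), alpha[k]
    # Note: this can go negative if target and s have opposite signs; a real
    # implementation would have to decide how to treat that case.
    new_alpha_k = target / s
    # Only prototype k's term changes; all other terms are held fixed.
    new_logits = logits + (new_alpha_k - alpha[k]) * signatures[:, k]
    return new_logits, new_alpha_k

# Toy data (hypothetical sizes, residual path omitted for brevity).
rng = np.random.default_rng(0)
V, K = 50, 8
signatures = rng.normal(size=(V, K))      # column k = logit signature of prototype k
alpha = np.abs(rng.normal(size=K))        # non-negative activations
logits = signatures @ alpha

k = 3
token = int(np.argmax(signatures[:, k]))  # the prototype's own top token
new_logits, new_alpha_k = clamp_prototype(logits, alpha, signatures, k, token, fraction=0.5)

p_before = softmax(logits)[token]
p_after = softmax(new_logits)[token]
print(f"P(token): {p_before:.1%} -> {p_after:.1%}")
```

The before and after probabilities printed at the end correspond to the per-token shifts displayed in the demo, under the assumptions stated above.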

Prototypes available in the demo: Google, Court System, Cancer, Newline, Million, Said, Climate, Physics / Chemistry, Web URL, U.S. The Google prototype is shown here.

- Prototype ID: 10764
- Category: Science / Tech
- Top prototype logits: Google (+2.32), Android (+2.22), Microsoft (+2.20), Windows (+2.20), Twitter (+2.18), macOS (+2.11), Chrome (+2.09), iPod (+2.08), Linux (+2.01), other (+1.99), Mac (+1.98), Vine (+1.93), Yahoo (+1.91), Chromebook (+1.90), Wikipedia (+1.87), iPad (+1.87), i (+1.87), your (+1.87), Apple (+1.87), Pinterest (+1.86), BlackBerry (+1.86), MAC (+1.85), Kindle (+1.85), Instagram (+1.83), Opera (+1.83), Kik (+1.82), windows (+1.81), tv (+1.81), Firefox (+1.81), Samsung (+1.79)

Generation with intervention

The best way to learn more about the Internet is via the Internet. The best way to learn more is via books, movies and other media that you can access online.↵

In addition to learning the Web and using it to your advantage, it also opens up the doors for you and your students to share the Internet with other people. Google has many Web sites that you can use to share your own work and other students and your students can use the Google Sites to share work. Google also has other

In the interactive, each token is annotated with its probability before and after the intervention. For example, the first _Internet_ rises from 0.0% to 7.1%, _your_ in “to your advantage” rises from 10.1% to 81.3%, and the final _Google_ rises from 2.7% to 70.5%, while _about_ drops from 90.6% to 34.7%.

`Group prototype intervention, during generation, for science & tech (sky blue) prototypes.`

In the demo below, we instead act on an entire labeled category. At each token we inspect the top-16 active prototypes, aggregate the logit signatures of those tagged _Science & Tech_, and add or subtract a fixed fraction of that aggregate, amplifying or reducing the influence of science and tech patterns wherever they appear in the mixture.
Suppressing this category removes the sky blue highlights and shifts the text toward other patterns (such as institutional or civic language), while boosting it produces more science and tech content, like references to web browsers and email infrastructure.
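As a rough sketch of the group intervention just described, the NumPy snippet below selects the top-16 active prototypes, keeps those with a given category label, and adds a signed fraction of their aggregate to the logits. Two points are assumptions rather than details from this post: the aggregate is taken to be the activation-weighted sum of the selected logit signatures, and `strength` is taken to map the slider range of -100% to +100% onto [-1, 1]. The function name, labels, and toy sizes are hypothetical.

```python
import numpy as np

def group_intervention(logits, alpha, signatures, labels, category, strength, top_n=16):
    """Add `strength` times the aggregate logit signature of the active
    prototypes tagged `category`. Returns (new_logits, selected_indices)."""
    active = np.argsort(alpha)[-top_n:]              # indices of the top-n activations
    active = active[alpha[active] > 0]               # ignore inactive prototypes
    group = active[labels[active] == category]       # keep only the target category
    # Assumed aggregate: activation-weighted sum of the group's signatures.
    aggregate = signatures[:, group] @ alpha[group]  # shape (V,)
    return logits + strength * aggregate, group

# Toy data (hypothetical sizes and labels, residual path omitted).
rng = np.random.default_rng(0)
V, K = 50, 64
signatures = rng.normal(size=(V, K))                 # column k = logit signature of prototype k
alpha = np.maximum(rng.normal(size=K), 0.0)          # sparse-ish, non-negative activations
labels = rng.choice(
    np.array(["Science / Tech", "Grammar / Scaffold", "Morphology"]), size=K
)
logits = signatures @ alpha

boosted, group = group_intervention(logits, alpha, signatures, labels, "Science / Tech", 0.5)
muted, _ = group_intervention(logits, alpha, signatures, labels, "Science / Tech", -1.0)
unchanged, _ = group_intervention(logits, alpha, signatures, labels, "Science / Tech", 0.0)
```

Under these assumptions, a strength of 0 leaves the logits untouched, and a strength of -1 removes the selected prototypes' contribution entirely.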

Prompt

Across the web, phishing attacks are prompting unsuspecting victims to hand over

Generation at 0% intervention strength (the slider ranges from -100% to +100%; at 0% every token probability and category is unchanged):

sensitive information such as online banking credentials or passwords, so we thought it was important to share some insights about them in this article (see the attached PDF):↵

When phishing is carried out by hackers, the victim sends an email asking for a username and password, then pops a link sent via Facebook. As soon as the hacker logs in, he or she'll be asked a follow-up email to confirm the details and the victim will be redirected.↵

There is another form of phishing attack that's more subtle. The attacker sends a request to users within an organization and then asks them to confirm the request, either by sending the code of a website or email, or by typing in some text which they must enter in order to complete the request.↵

The first type of message is usually sent to a small group of people within an organization or by an employee, but there's another form of phishing that is more subtle.↵

Here, the attacker creates a fake website by asking its victim to submit a single website. The website usually includes a link or email that contains some information such as a link to an official document or another website. The attacker then sends the link or email to the site which receives

In the interactive, each token is annotated with its probability and category before and after the intervention. Most tokens are tagged _Grammar / Scaffold_; tokens tagged _Science / Tech_ include _online_, _credentials_, _passwords_, _PDF_, _username_, _password_, _link_, _Facebook_, _email_, _victim_, _request_, _code_, _website_, _message_, and _site_.

## Introduction

Generative AI has a data provenance challenge.
AI labs have paid [record-breaking settlements](https://www.npr.org/2025/09/05/nx-s1-5529404/anthropic-settlement-authors-copyright-ai) over training data.
[Others face ongoing litigation](https://www.npr.org/2025/03/26/nx-s1-5288157/new-york-times-openai-copyright-case-goes-forward) from publishers.
When statutory damages reach exorbitant amounts, the question at the center of these cases becomes urgent: _when a model generates an output, what training data is it relying on?_

This problem, training data attribution (TDA), matters beyond the courtroom.
Reliable attribution lets us value data appropriately, understand how LLMs solve hard problems, and verify their outputs.
We would prefer a model answering a medical question to rely on journal articles rather than personal blog posts.

Existing approaches based on [influence functions](https://arxiv.org/abs/1703.04730) and [training data attribution](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5451054) address this question.
However, these methods often require [careful approximations](https://arxiv.org/abs/2506.12965) to [scale](https://arxiv.org/abs/2405.13954) to billion-parameter models, and even then [struggle](https://arxiv.org/abs/2305.16971) to provide [reliable](https://arxiv.org/abs/2006.14651) [insights](https://arxiv.org/abs/2506.12965).

**PRISM** takes a different approach: it ties training data attribution directly to the model architecture.
Every prediction decomposes into a sparse combination of learned prototypes: patterns corresponding to clusters of training examples.
The architecture is explicitly constrained so that every output logit can be faithfully traced back to these clusters.
Consequently, attributing the model’s output back to the training data takes a **single forward pass**.
A medical answer might draw 60% from a prototype grounded in peer-reviewed abstracts; a code completion might trace back to documentation rather than Stack Overflow.

In the sections that follow, we cover PRISM’s architecture and training losses, our automated pipeline for labeling prototypes and retrieving training neighbors, and scaling results from 124M to 1.6B parameters, where PRISM stays within 5% of baseline with under 2% overhead.

## PRISM Architecture & Loss Functions

`We now discuss the key technical underpinnings of our approach, introducing the prototype matrix, routing rule, residual path, and training losses.`

Standard LM heads collapse all learned patterns into a single dense weight matrix; no row or column corresponds to a reusable training pattern.
PRISM asks: what if we made the logit layer an interpretable map of the training data?

We leave the transformer backbone unchanged and only modify the output layer.
Instead of sending the final hidden state $z_t$ directly through a dense matrix $W$, we first express $z_t$ as a sparse, non-negative mixture of prototypes plus a residual, then map to logits.

> Two components replace the dense LM head:
>
> 1. a **bank of prototypes**, each intended to automatically learn a recurring pattern in the training data while being strongly tied to specific training instances; and
> 2. a **sparse mixing mechanism** that, given the current hidden state $z_t$, selects a small set of relevant prototypes and combines their contributions to produce the next-token logits, plus a residual term for whatever is not captured by the prototypes.

Informally: the model asks which few prototypes does this context resemble, and how do they score possible next tokens?

**Notation**

Embedding dimension $d$, no. of prototypes $K$, vocabulary size $V$, training dataset size $N$.

At step $t$, the decoder hidden state is $z_t\in\mathbb{R}^d$ and the vocab logits are $\ell_t\in\mathbb{R}^V$.

Let $P=[p_1,\dots,p_K]\in\mathbb{R}^{d\times K}$ denote the prototype codebook and $\alpha_t\in\mathbb{R}_{\ge 0}^K$ the (sparse) prototype activations. Prototypes live within the model’s final layer embedding.

We write $W\in\mathbb{R}^{V\times d}$ for the LLM’s output projection (optionally tied to embedding).

### Architecture

To trace the prediction of an LLM back to recurring patterns in the training dataset, we draw inspiration from Prototype Networks \[1, 2, 3, 4, 5\], a family of interpretable models with a long history in deep image classification that make predictions by comparing the current input to some aspect of the training dataset that was previously seen, yielding explanations in the form _this-looks-like-that_. Recent work has proposed bringing these ideas to NLP, but progress remains limited to narrow text classification tasks \[6, 7, 8, 9\]. Large vocabularies and free-form text generation have proven a major barrier in this respect.

- [\[1\] Snell et al., 2017. Prototypical Networks for Few-shot Learning](https://arxiv.org/abs/1703.05175)
- [\[2\] Li et al., 2017. Deep Learning for Case-Based Reasoning through Prototypes](https://arxiv.org/abs/1710.04806)
- [\[3\] Chen et al., 2018. This Looks Like That: Deep Learning for Interpretable Image Recognition](https://arxiv.org/abs/1806.10574)
- [\[4\] Ming et al., 2019. ProSeNet: Interpretable and Steerable Sequence Learning via Prototypes](https://arxiv.org/abs/1907.09728)
- [\[5\] Willard et al., 2024. This Looks Better than That: Better Interpretable Models with ProtoPNeXt](https://arxiv.org/abs/2406.14675)
- [\[6\] Arik & Pfister, 2020. ProtoAttend: Attention-Based Prototypical Learning](https://arxiv.org/abs/1902.06292)
- [\[7\] Das et al., 2022. ProtoTEx: Explaining Model Decisions with Prototype Tensors](https://aclanthology.org/2022.acl-long.213/)
- [\[8\] Hong et al., 2023. ProtoryNet: Interpretable Text Classification via Prototype Trajectories](https://arxiv.org/abs/2007.01777)
- [\[9\] Xie et al., 2023. Proto-lm: A Prototypical Network-Based Framework for Built-in Interpretability in Large Language Models](https://aclanthology.org/2023.findings-emnlp.261/)

The most direct way to bring _this-looks-like-that_ into next-token prediction is to treat it as a $V$-way classification problem: compute $K$ prototype activations and mix them into $V$ vocabulary scores with a dense matrix $M\in\mathbb{R}^{V\times K}$, as in ProtoPNet-style classifiers. This adds $KV$ parameters and $O(KV)$ FLOPs per token on top of the $O(Kd)$ prototype similarity cost; with vocabularies $V\approx 50{,}000$, even moderate $K$ already implies tens to hundreds of millions of new weights (e.g., $K=2{,}000\Rightarrow KV=100\,\text{M}$), making this approach prohibitively expensive at language-model scale.

PRISM’s head instead keeps computation in the model’s embedding space, forming a reconstruction

$$\hat{z}_t=\sum_k \alpha_{t,k}\, p_k$$

and applying $W\in\mathbb{R}^{V\times d}$ to obtain logits: $W\hat{z}_t=(WP)\alpha_t$. This is functionally equivalent to a ProtoPNet-style mixer with $M=WP$ while yielding significant parameter reduction (e.g., $d\approx 500\Rightarrow Kd+dV=1\text{M}+25\text{M}$, or $1\text{M}$ with weight-tying), and preserving metric continuity by avoiding a coarse $d\to K\to V$ switch. Empirically, we find that this reparameterization trains faster and more smoothly. On toy experiments with the TinyStories dataset, the ProtoPNet-style head required up to $3\times$ longer wall-clock time to reach the same perplexity.
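
To make the arithmetic concrete, here is a small sketch that compares the parameter counts quoted above and numerically checks the identity $W\hat{z}_t=(WP)\alpha_t$. The toy matrix sizes, random weights, and variable names are assumptions chosen for illustration, not PRISM's actual configuration.

```python
import numpy as np

# Illustrative sizes taken from the text: V ~ 50,000, K = 2,000, d ~ 500.
V, K, d = 50_000, 2_000, 500

dense_mixer_params = K * V        # ProtoPNet-style mixer M in R^{V x K}
factored_params = K * d + d * V   # prototype codebook P plus output projection W
tied_params = K * d               # only P is new when W is tied to the embedding

print(dense_mixer_params, factored_params, tied_params)

# Toy check (hypothetical sizes) that W (P alpha) equals (W P) alpha.
rng = np.random.default_rng(0)
W = rng.normal(size=(50, 8))      # toy V = 50, d = 8
P = rng.normal(size=(8, 5))       # toy K = 5
alpha = np.array([0.0, 0.7, 0.0, 0.2, 0.0])
assert np.allclose(W @ (P @ alpha), (W @ P) @ alpha)
```

The factored form never materializes the $V\times K$ matrix during training; the product $WP$ only needs to be formed when inspecting prototype signatures.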

Following the literature, we adopt an autoregressive backbone as input to the prototype layer. Our modifications are restricted to the final layer, so PRISM can be implemented in a way that is compatible with standard transformer training recipes and, in principle, could also be adapted to other sequence models such as diffusion-based language models. We train the entire model end-to-end, allowing PRISM to learn its own prototypical representation of inputs.

**Positive similarity scoring and top-$k$ routing**

Once the backbone GPT model has processed the current input, the prototype layer computes the similarity of the input $z_t$ to every prototype in the bank $[p_1,\dots,p_K]$. For each prototype, we compute its cosine similarity to the current state $z_t$:

$$c_{t,i}\;=\;\frac{z_t^\top p_i}{\|z_t\|_2\,\|p_i\|_2}\quad\text{for } i\in[K]$$

We apply an optional learned scalar $\tau\in\mathbb{R}_{>0}$ to expand the effective dynamic range of cosine scores. Intuitively, we want the model to expose _non-negative reasoning_: predictions are explained as _this-looks-like-that_ (positive evidence from similar prototypes) rather than _this-does-not-look-like-that_ (subtractive evidence). Thus, we enforce non-negativity via a rectifier:

$$\tilde\alpha_{t,i}\;=\;\mathrm{ReLU}(\tau\,c_{t,i})$$

We select the index set $\mathcal{K}_t=\operatorname{TopK}(\{\tilde\alpha_{t,i}\}_{i=1}^K,k)$ and define the final, few-hot similarities

$$\alpha_{t,i}\;=\;\tilde\alpha_{t,i}\,\mathbf{1}\{i\in\mathcal{K}_t\}$$

This top-$k$ routing ensures that each token prediction is explained in terms of a small, human-readable set of prototypes rather than a dense mixture over all $K$.
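
The routing rule above can be sketched in a few lines of NumPy. This is a minimal illustration for a single hidden state; the helper name `route`, the toy sizes, and the small epsilon in the denominator are assumptions, and it should not be read as the authors' implementation.

```python
import numpy as np

def route(z, P, k, tau=1.0, eps=1e-12):
    """Return few-hot activations alpha (shape K) for hidden state z (shape d)."""
    # Cosine similarity between z and every prototype column of P (d x K).
    c = (P.T @ z) / (np.linalg.norm(P, axis=0) * np.linalg.norm(z) + eps)
    a_tilde = np.maximum(tau * c, 0.0)   # ReLU keeps only positive evidence
    top = np.argsort(-a_tilde)[:k]       # indices of the k largest scores
    alpha = np.zeros_like(a_tilde)
    alpha[top] = a_tilde[top]            # everything outside the top-k stays zero
    return alpha

rng = np.random.default_rng(0)
P = rng.normal(size=(16, 64))            # hypothetical d = 16, K = 64
z = rng.normal(size=16)
alpha = route(z, P, k=4)
print(np.flatnonzero(alpha))             # ids of the prototypes that were selected
```

A full sort is used here for readability; a partial selection would be the natural choice when $K$ is in the tens of thousands.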

**Sparse reconstruction**

We would like to reason about a prediction using as few prototypical contexts as possible, to enable crisp interpretability. Sparse activations encourage each prototype to specialize and represent tighter clusters of the training data, which makes it easier to summarize what the model is “thinking” in terms of a handful of distinct patterns; in other words, the prototype logit signatures become more fine-grained. Given the $k$ most similar prototypes, we form a $k$-sparse reconstruction

$$\hat{z}_t\;=\;P\,\alpha_t\;=\;\sum_{i\in\mathcal{K}_t}\alpha_{t,i}\,p_i\;\in\;\mathbb{R}^d$$

This follows the existing sparse autoencoder (SAE) literature, which learns sparse dictionaries for hidden states at intermediate layers. In contrast, PRISM learns a sparse dictionary of training-grounded prototypes that directly explain the model’s output logits without a separate decoder. The features learned are also directly tied to groups of training examples (see next section).

**Merge and logits**

We use a residual merge with the original state to account for parts of the input not reconstructed by prototypes. The residual $r_t=z_t-\hat{z}_t$ is computed as the difference between the original $z_t$ and the reconstruction $\hat{z}_t$ (thus, $z'_t=z_t$). The vocabulary projection is standard:

$$z'_t \;=\; \hat{z}_t+r_t\qquad\rightarrow\qquad \ell_t \;=\; W\,z'_t \qquad\rightarrow\qquad p(x_{t+1}\mid x_{\le t})=\mathrm{softmax}(\ell_t).$$

Keeping an explicit residual path preserves the expressivity of the original backbone. Rare or input-dependent tokens need not be forced through the prototype dictionary. Measuring how much of each prediction is accounted for by prototypes versus the residual is straightforward.
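
As a rough illustration of that last point, the sketch below reconstructs a hidden state from a few prototypes, forms the residual, and splits one token's logit into a prototype part and a residual part. The toy sizes, random weights, and hand-picked activations are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, V = 16, 32, 100                 # hypothetical toy sizes
P = rng.normal(size=(d, K))           # prototype codebook
W = rng.normal(size=(V, d))           # output projection
z = rng.normal(size=d)                # final hidden state z_t

# Few-hot activations; in the model these would come from top-k routing.
alpha = np.zeros(K)
alpha[[3, 11, 27]] = [0.6, 0.3, 0.1]

z_hat = P @ alpha                     # sparse reconstruction
r = z - z_hat                         # residual
z_merged = z_hat + r                  # equals z by construction

logits = W @ z_merged
token = int(np.argmax(logits))        # the token we want to explain
proto_part = float(W[token] @ z_hat)  # logit mass routed through prototypes
resid_part = float(W[token] @ r)      # logit mass carried by the residual
print(f"prototype part = {proto_part:.3f}, residual part = {resid_part:.3f}")
```

Because the projection is linear, the two parts always sum to the token's logit, so their ratio is one simple way to report how much of a prediction the prototypes account for.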

**Faithful logit decomposition**

The PRISM head builds an interpretable logit map at the model’s final layer, ensuring that we can directly quantify the effect and importance of any prototype to any output token by design. By linearity of $W$, the next-token logits decompose into per-prototype contributions:

$$\ell_t = Wr_t + \sum_{i\in\mathcal{K}_t}\alpha_{t,i}\,(Wp_i)$$

Each prototype $p_i$ thus induces a fixed token–logit signature $Wp_i\in\mathbb{R}^V$, and the model’s prediction is an explicit, sparse, non-negative mixture over at most $k$ such signatures. This yields additive, causally faithful units that can be ablated or amplified directly at the logit level. When a model predicts a given token, we can recover a given prototype’s exact contribution simply by multiplying its input activation $\alpha_{t,i}$ by its fixed logit signature $Wp_i$ (indexed at the predicted token). As a matter of preference, we combine the scalar $\tau$ into $Wp_i$ when interpreting the prototype signature. This restricts our interpretation of the final logits to a weighted superposition of the top-$k$ prototype signatures, with weights in the range $[0,1]$.
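
The decomposition can be checked numerically. The following sketch computes one additive slice per active prototype, confirms that the slices plus the residual slice recover the logit, and performs a logit-level ablation. Sizes, weights, and the choice of active prototypes are hypothetical, and holding the residual fixed during ablation is an assumption of this example.

```python
import numpy as np

rng = np.random.default_rng(1)
d, K, V = 16, 32, 100                      # hypothetical toy sizes
P = rng.normal(size=(d, K))                # prototype codebook
W = rng.normal(size=(V, d))                # output projection
z = rng.normal(size=d)                     # final hidden state z_t

active = [3, 11, 27]                       # stand-in for the top-k index set
alpha = np.zeros(K)
alpha[active] = [0.6, 0.3, 0.1]

signatures = W @ P                         # column i is the signature W p_i
r = z - P @ alpha                          # residual
logits = W @ r + signatures @ alpha        # identical to W @ z
token = int(np.argmax(logits))

# One additive slice per active prototype, plus the residual slice.
slices = {i: float(alpha[i] * signatures[token, i]) for i in active}
residual_slice = float(W[token] @ r)
total = sum(slices.values()) + residual_slice

# Logit-level ablation: zero one activation while holding the residual fixed.
alpha_ablated = alpha.copy()
alpha_ablated[11] = 0.0
logits_ablated = W @ r + signatures @ alpha_ablated
```

Here `logits_ablated[token]` differs from `logits[token]` by exactly `slices[11]`, which is the sense in which each prototype is an additive unit.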

## Loss functions

Here we detail the loss functions used to train PRISM. Let $\mathcal{I}(\mathcal{B})$ denote the index set of token positions across the current macro-batch. Additionally, let $d_i(j)=-c(p_i, z_j)$ be the negative cosine similarity between prototype $i$ and the token representation at position $j\in\mathcal{I}(\mathcal{B})$, so that minimizing $d_i(j)$ pulls the two together.

$$\mathcal{L}_\text{PRISM} = \mathcal{L}_\text{CE} + \underbrace{\mathcal{L}_{R_1} + \mathcal{L}_{R_2}}_{\text{Clustering Losses}} + \mathcal{L}_\text{RES}$$

**Cross-Entropy ($\mathcal{L}_{\mathrm{CE}}$).**

We use the standard objective

$$\mathcal{L}_{\mathrm{CE}} = -\frac{1}{|\mathcal{B}|}\sum_{(x_{1:T})\in\mathcal{B}}\sum_{t=1}^{T-1}\log p_\theta\!\left(x_{t+1}\mid x_{\le t}\right)$$

where $p_\theta(x_{t+1}\mid x_{\le t})=\mathrm{softmax}(\ell_t)_{x_{t+1}}$ and $\ell_t$ are the logits computed from the merged state $z'_t$.

**Prototype Pull ($\mathcal{L}_{R_1}$).**

We encourage each prototype to anchor to some token in the batch with

$$\mathcal{L}_{R_1}=\frac{1}{K}\sum_{i=1}^{K}\min_{j\in\mathcal{I}(\mathcal{B})} d_i(j).$$

**Training-Point Pull ($\mathcal{L}_{R_2}$).**

Symmetrically, every token position should be close to at least one prototype via

$$\mathcal{L}_{R_2}=\frac{1}{|\mathcal{I}(\mathcal{B})|}\sum_{j\in\mathcal{I}(\mathcal{B})}\min_{i\in[K]} d_i(j).$$

Combined, the $R_1$ and $R_2$ terms can be viewed as clustering losses in the backbone LM’s final layer embedding.
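
A minimal NumPy sketch of the two clustering terms, assuming the batch's token states are stacked into an `(n, d)` array and prototypes are stored as columns. The function name and array layout are illustrative choices, not details taken from the PRISM codebase.

```python
import numpy as np

def clustering_losses(Z, P):
    """Z: (n, d) token states in the batch; P: (d, K) prototypes as columns."""
    Zn = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    Pn = P / np.linalg.norm(P, axis=0, keepdims=True)
    D = -(Zn @ Pn)                    # D[j, i] = d_i(j) = -cos(p_i, z_j)
    loss_r1 = D.min(axis=0).mean()    # each prototype to its closest token
    loss_r2 = D.min(axis=1).mean()    # each token to its closest prototype
    return float(loss_r1), float(loss_r2)

rng = np.random.default_rng(0)
Z = rng.normal(size=(128, 16))        # hypothetical batch of 128 positions, d = 16
P = rng.normal(size=(16, 32))         # hypothetical K = 32
print(clustering_losses(Z, P))
```

Both terms are bounded below by $-1$, which is reached when every prototype coincides in direction with some token state and vice versa.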

**Residual ($\mathcal{L}_{\mathrm{RES}}$).**

We set $z'_t=\hat{z}_t+r_t$ with $r_t:=z_t-\hat{z}_t$. We simply minimize the mean-squared residual

$$\mathcal{L}_{\mathrm{RES}} \;=\; \|r_t\|_2^2 \;=\; \Big\|z_t-\sum_{i\in\mathcal{K}_t}\alpha_{t,i}\,p_i\Big\|_2^2$$

i.e., the MSE of the mismatch between the prototype reconstruction and the original state.
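
For a batch, the residual term can be sketched as below. Averaging the per-position squared norms over the batch is an assumption of this example (the formula above is written for a single position), as are the function name and array layout.

```python
import numpy as np

def residual_loss(Z, P, A):
    """Z: (n, d) states, P: (d, K) prototypes, A: (n, K) few-hot activations."""
    R = Z - A @ P.T                            # residual r_t for every position
    return float((R ** 2).sum(axis=1).mean())  # squared norm, averaged over positions

# Tiny hand-checkable case: the first state is reconstructed exactly,
# the second is off by one unit along the second axis.
P = np.eye(2)
Z = np.array([[1.0, 0.0], [0.0, 2.0]])
A = np.array([[1.0, 0.0], [0.0, 1.0]])
loss = residual_loss(Z, P, A)
```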

**(Optional) Prototype Diversity ($\mathcal{L}_{\mathrm{DIV}}$).**

We optionally encourage prototypes to cover diverse representations within the final layer’s embedding, to reduce prototype overlap and encourage specialization. For this setting, we penalize off-diagonal coherence of the $\ell_2$-normalized prototypes. With $\tilde{p}_i=p_i/\|p_i\|_2$ and $G=\tilde{P}^{\top}\tilde{P}$,

$$\mathcal{L}_{\mathrm{DIV}}=\frac{1}{K(K-1)}\sum_{i\neq j} G_{ij}^{2}.$$

For $K>d$, the average squared coherence is lower bounded by the Welch bound \[10\]. Driving $\mathcal{L}_{\mathrm{DIV}}$ toward this limit spreads prototypes nearly optimally on $\mathbb{S}^{d-1}$ and empirically yields crisper, more distinct roles without harming validation cross-entropy.

- [\[10\] Welch, 1974. Lower bounds on the maximum cross correlation of signals](https://doi.org/10.1109/TIT.1974.1055219)
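
The sketch below computes the diversity term and compares it with the Welch floor on mean squared coherence, which for $K$ unit vectors in embedding dimension $d$ (with $K>d$) takes the standard form $(K-d)/(d(K-1))$. Function names and the example frame are illustrative assumptions.

```python
import numpy as np

def diversity_loss(P):
    """Mean squared off-diagonal cosine between prototype columns of P (d x K)."""
    Pn = P / np.linalg.norm(P, axis=0, keepdims=True)
    G = Pn.T @ Pn
    K = G.shape[0]
    off_diag = G[~np.eye(K, dtype=bool)]       # drop the diagonal of ones
    return float((off_diag ** 2).sum() / (K * (K - 1)))

def welch_floor(d, K):
    """Lower bound on mean squared coherence of K unit vectors in R^d, K > d."""
    return (K - d) / (d * (K - 1))

# Three unit vectors at 120 degrees in the plane meet the bound with equality.
angles = np.deg2rad([90.0, 210.0, 330.0])
P_frame = np.stack([np.cos(angles), np.sin(angles)])   # shape (2, 3)

rng = np.random.default_rng(0)
P_random = rng.normal(size=(8, 20))                    # hypothetical d = 8, K = 20
print(diversity_loss(P_random), welch_floor(8, 20))
```

Randomly initialized prototypes sit above the floor, so the gap between the two printed numbers gives a rough sense of how much spreading the penalty can still buy.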

## Training Data Attribution in a Single Forward Pass

PRISM exposes all quantities needed for attribution during inference. Given hidden state $z_t$:

$$c_{t,i}=\frac{z_t^\top p_i}{\|z_t\|_2\,\|p_i\|_2},\qquad \tilde\alpha_{t,i}=\mathrm{ReLU}(\tau c_{t,i}),\qquad \mathcal{K}_t=\operatorname{TopK}(\{\tilde\alpha_{t,i}\}_{i=1}^K,k)$$

The attribution measure over training data is:

$$\mathcal{A}_t = \sum_{i\in \mathcal{K}_t}\alpha_{t,i}\,\mu_{S_i}$$

where $S_i$ is the precomputed set of training tokens nearest to prototype $i$, and $\mu_{S_i}$ is a weighting over that set (commonly uniform). $\mathcal{A}_t$ is fully determined by forward-pass values and static mappings. No gradients, no Hessians, no dataset search.
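
With a uniform weighting, the attribution measure reduces to spreading each active prototype's activation evenly over its stored neighbors. The sketch below assumes plain dictionaries for the activations and the prototype-to-neighbor mapping; the prototype ids and document names are made up for the example.

```python
from collections import defaultdict

def attribute(alpha, neighbors):
    """alpha: {prototype_id: activation}; neighbors: {prototype_id: [training ids]}."""
    scores = defaultdict(float)
    for proto, weight in alpha.items():
        members = neighbors[proto]
        for example in members:
            # Uniform weighting: each stored neighbor gets an equal share.
            scores[example] += weight / len(members)
    return dict(scores)

alpha_t = {7: 0.6, 42: 0.3}                   # hypothetical active prototypes
neighbors = {7: ["doc_a", "doc_b"], 42: ["doc_b", "doc_c", "doc_d"]}
attribution = attribute(alpha_t, neighbors)
print(sorted(attribution.items(), key=lambda kv: -kv[1]))
```

Everything here is a lookup over values already produced by the forward pass; `doc_b` ranks first because it is a neighbor of both active prototypes.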

## Automated Interpretability Pipeline

PRISM gives us two handles for automation: each prototype is tied to training contexts via its activations, and each has a fixed logit signature $Wp_i$ over the vocabulary. We use these to (i) find training snippets each prototype represents, and (ii) assign human-readable labels.

### Nearest Neighbor Search

For each prototype, we recover concrete training examples with a single streaming pass over the dataset. We retain the top-$L$ positions with highest activations:

- For every token position $j$, compute $\alpha_{j,i}$ for all prototypes
- Maintain a bounded heap of size $L$ per prototype storing the best matches (a min-heap keyed on activation, so the weakest stored match is the one evicted)
- Enforce distinct-position constraints to avoid redundant sliding-window variants

This is a one-pass $O(NK)$ procedure with $O(KL)$ memory. Because the $\mathcal{L}_{R_1}$ loss pulls each prototype toward training tokens, high-activation neighbors exist by construction. In practice, similarity converges after scanning roughly 1% of training data.
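
The streaming pass can be sketched with Python's `heapq`, which is a min-heap: the weakest stored match sits at the root and is the one replaced when a stronger activation arrives. The function name and the input format are assumptions, and the distinct-position constraint for sliding windows is omitted for brevity.

```python
import heapq

def stream_top_l(activations, num_prototypes, L):
    """activations yields (position, [alpha_{j,i} for each prototype i])."""
    heaps = [[] for _ in range(num_prototypes)]
    for position, alphas in activations:
        for i, a in enumerate(alphas):
            if a <= 0.0:
                continue                          # prototype did not fire here
            if len(heaps[i]) < L:
                heapq.heappush(heaps[i], (a, position))
            elif a > heaps[i][0][0]:
                # Evict the weakest stored match in O(log L).
                heapq.heapreplace(heaps[i], (a, position))
    # Strongest matches first for each prototype.
    return [sorted(h, reverse=True) for h in heaps]

# Hypothetical stream: four positions, two prototypes.
stream = [(0, [0.1, 0.0]), (1, [0.9, 0.2]), (2, [0.5, 0.8]), (3, [0.7, 0.0])]
top = stream_top_l(stream, num_prototypes=2, L=2)
```

Memory stays at $L$ entries per prototype regardless of how many positions are scanned, which is what makes a single pass over the corpus feasible.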

### Automatic Labeling

For each prototype, we build a compact “card” containing (i) top tokens from its logit signature and (ii) local contexts where it fires. A small labeling model converts this into human-readable metadata: a short name, a one-line description (e.g., “clinical trial boilerplate”, “Unix timestamps”), and example contexts.
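
One possible shape for such a card is sketched below. The field names, the `build_card` helper, and the toy vocabulary are hypothetical; the sketch only shows how the top tokens of a logit signature and a few stored contexts could be bundled before being handed to a labeling model.

```python
import numpy as np

def build_card(proto_id, W, P, vocab, contexts, top_n=5):
    """Bundle top signature tokens and example contexts for one prototype."""
    signature = W @ P[:, proto_id]             # logit signature W p_i
    top_ids = np.argsort(-signature)[:top_n]   # highest-scoring vocabulary ids
    return {
        "prototype": proto_id,
        "top_tokens": [vocab[t] for t in top_ids],
        "contexts": list(contexts),
    }

# Toy example with an identity projection so the signature equals the prototype.
vocab = ["a", "b", "c", "d"]
W = np.eye(4)
P = np.array([[0.1], [0.9], [0.5], [0.2]])     # a single prototype, d = 4
card = build_card(0, W, P, vocab, ["... example context ..."], top_n=2)
```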

A second pass assigns coarse tags used in visualizations: broad category (Science & Tech, Numbers & Time, URLs & IDs), syntactic role (noun-like, function word, scaffold phrase), and optional domain tags (medical, US universities). This runs offline on learned prototypes and their neighbors.

# Performance & Scaling

_We now discuss the training procedure and performance details from training PRISM end-to-end across various model sizes._

## Scaling from 124M to 1.6B

**Overview of PRISM performance compared to an unconstrained GPT model**

We train GPT backbones from 124M to 1.6B parameters end-to-end with the prototype layer for one epoch on FineWeb-Edu-10B. PRISM stays within 5% of unconstrained baselines on validation loss and downstream benchmarks across all scales. The prototype layer adds $d \times K$ parameters: at GPT-XL scale with $K=16384$ prototypes, this is 26M parameters (1.7% overhead).
Training time increases by less than 2%. The overhead shrinks as a fraction of total parameters as backbones scale up.
Faithful attribution to training data does not require sacrificing model quality.

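_Figure: Task-wise LM evaluation accuracy for GPT and PRISM on ARC-Challenge, ARC-Easy, BoolQ, HellaSwag, MMLU, OpenBookQA, PIQA, and Winogrande (y-axis: accuracy, 0.0 to 0.6)._

Tables: backbone GPT configurations, followed by PRISM layer parameter counts (with overhead relative to the backbone) for each embedding dimension and number of prototypes $K$.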

| Parameter | Small | Medium | Large | XL |
| --- | --- | --- | --- | --- |
| Block Size | 1024 | 1024 | 1024 | 1024 |
| Embed. Dim. | 768 | 1024 | 1280 | 1600 |
| No. Heads | 12 | 16 | 20 | 25 |
| No. Layers | 12 | 24 | 36 | 48 |
| Total Parameters | 124M | 355M | 774M | 1.558B |

| Embed. Dim. \ $K$ | 4096 | 8192 | 16384 | 32768 |
| --- | --- | --- | --- | --- |
| **768 (S)** | 3.15M (+2.5%) | 6.29M (+5.1%) | 12.58M (+10.1%) | 25.17M (+20.2%) |
| **1024 (M)** | 4.19M (+1.2%) | 8.39M (+2.4%) | 16.78M (+4.7%) | 33.55M (+9.5%) |
| **1280 (L)** | 5.24M (+0.7%) | 10.49M (+1.4%) | 20.97M (+2.7%) | 41.94M (+5.4%) |
| **1600 (XL)** | 6.55M (+0.4%) | 13.11M (+0.8%) | 26.21M (+1.7%) | 52.43M (+3.4%) |
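
The entries in the table above follow from the $d \times K$ parameter count. The sketch below recomputes them from the rounded backbone sizes listed earlier, so percentages may differ from the table in the last digit; the dictionary and function names are illustrative.

```python
# Rounded backbone sizes from the configuration table, keyed by embedding dim.
backbone_params = {768: 124e6, 1024: 355e6, 1280: 774e6, 1600: 1558e6}

def prism_overhead(d, K):
    """Return (added parameters, overhead as a percentage of the backbone)."""
    added = d * K
    return added, 100.0 * added / backbone_params[d]

for d in backbone_params:
    row = [prism_overhead(d, K) for K in (4096, 8192, 16384, 32768)]
    print(d, [f"{a / 1e6:.2f}M (+{p:.1f}%)" for a, p in row])
```

Since the added count grows linearly in $d$ while the backbone grows much faster with depth and width, the relative overhead falls from left to right in each column as models get larger.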

_Figure: LM Eval vs. model size (higher is better) and validation CE vs. model size (lower is better) for GPT, PRISM, and GPT with a 5% margin; x-axis: model size in millions of parameters. The plotted values are:_

| Size | GPT params | PRISM params | LM Eval: GPT | LM Eval: PRISM | LM Eval: GPT (-5%) | Val. CE: GPT | Val. CE: PRISM | Val. CE: GPT (+5%) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Small | 124M | 130.29M | 0.415675 | 0.404110 | 0.394891 | 3.0659 | 3.1595 | 3.2192 |
| Medium | 355M | 363.39M | 0.434400 | 0.421375 | 0.412680 | 2.8811 | 3.0066 | 3.0252 |
| Large | 774M | 784.49M | 0.443812 | 0.426225 | 0.421621 | 2.7835 | 2.9110 | 2.9227 |
| XL | 1558M | 1571.11M | 0.458041 | 0.447210 | 0.435139 | 2.7242 | 2.8368 | 2.8604 |

## Conclusion

PRISM demonstrates that training data attribution doesn’t have to be a post-hoc approximation bolted onto an opaque model.
By building interpretability into the architecture, we get faithful explanations at the cost of a single forward pass.
At 1.6B parameters, PRISM stays within 5% of baseline performance with under 2% overhead.
The prototype dictionary is inspectable, editable, and directly tied to training data.
This work lays a foundation for language models that can be more easily audited and steered, and whose predictions can be faithfully traced to the training data.

Our results indicate that, at GPT XL scale, there are solutions comfortably within 5% of the original backbone’s performance that satisfy PRISM’s interpretability constraints.
In this view, PRISM does not enforce an accuracy–interpretability tradeoff so much as bias optimization toward a part of the Rashomon set where the logit layer admits a structured, training-data–grounded decomposition into prototypes.

