AI in archaeology: reading what fire and time erased

A team of researchers recently read words on a Roman scroll that no human eye has seen since the eruption of Mount Vesuvius in AD 79. The scroll was not unrolled. It was not chemically treated. It was scanned, and a machine learning model trained on the faint patterns of carbonised ink picked out letters from what looked, to every previous generation of papyrologists, like a lump of charcoal. The Guardian covered the moment as a breakthrough, and it is — but it is also the clearest recent example of a quieter shift in how archaeology actually works.

The shift is this: more and more of what archaeologists call “reading evidence” now means reading it through a model. That changes what is recoverable, what is fragile, and what counts as a finding.

What was actually done with the Vesuvius scrolls

The Herculaneum scrolls were carbonised, not burned to ash, by the pyroclastic surge that destroyed Pompeii’s quieter neighbour. They survived precisely because they were buried fast and starved of oxygen. The cost was that the papyrus and the ink fused into the same brittle black mass. Unrolling them physically destroys them, which is why most attempts since the 18th century have failed.

The current approach combines two things. The first is high-resolution X-ray microtomography, which produces a volumetric scan dense enough to resolve individual layers of papyrus inside the rolled scroll. The second is a convolutional neural network trained to detect the subtle textural difference between inked papyrus and bare papyrus — a difference of microns, invisible to the naked eye even in the scan itself. The model does not “read Latin” or “read Greek.” It flags ink. Humans, working with the model’s output, read the text.
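As a rough illustration of what “flagging ink” means in practice, the sketch below slides a window across a small stack of scan layers and asks a scoring function for one number per window. Everything here is an assumption made for illustration: the names `ink_map` and `mean_intensity` are hypothetical, and the scoring function is a trivial placeholder for a trained network, since the real ink signal is a textural difference rather than simple brightness.

```python
from typing import Callable, List

Volume = List[List[List[float]]]  # indexed [layer][row][col]


def mean_intensity(patch: Volume) -> float:
    """Hypothetical stand-in for a trained classifier: mean voxel value in the patch."""
    values = [v for layer in patch for row in layer for v in row]
    return sum(values) / len(values)


def ink_map(volume: Volume, size: int, score: Callable[[Volume], float]) -> List[List[float]]:
    """Score every non-overlapping size x size window, taken through all layers."""
    rows, cols = len(volume[0]), len(volume[0][0])
    out = []
    for r in range(0, rows - size + 1, size):
        out_row = []
        for c in range(0, cols - size + 1, size):
            # Cut the same window out of every layer to form one small sub-volume.
            patch = [[line[c:c + size] for line in layer[r:r + size]] for layer in volume]
            out_row.append(score(patch))
        out.append(out_row)
    return out


# Toy 2-layer, 4x4 volume: the top-left 2x2 block has higher values ("inked").
volume = [
    [[0.9, 0.9, 0.1, 0.1],
     [0.9, 0.9, 0.1, 0.1],
     [0.1, 0.1, 0.1, 0.1],
     [0.1, 0.1, 0.1, 0.1]],
] * 2
scores = ink_map(volume, size=2, score=mean_intensity)
```

The output is a grid of scores, not text. Turning that grid into letters, and letters into words, is still the papyrologist’s job.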

That distinction matters. The headline framing — “AI deciphers ancient scroll” — collapses a careful division of labour between scanning physics, a narrow-domain visual classifier, and a domain expert who still does the linguistic work.

Where AI is genuinely useful in archaeology

Strip away the more excitable coverage, and what remains is a small number of techniques that have become reliably useful across the field:

Task	What the model does	What it does not do
Ink detection on damaged manuscripts	Flags pixel-level evidence of pigment	Translate or transcribe text
Site discovery from satellite or LiDAR	Surfaces geometric anomalies in terrain	Confirm a site is anthropogenic
Sherd and artefact classification	Groups fragments by typology	Date an object on its own
Reconstruction of broken inscriptions	Suggests probable missing characters	Settle which suggestion is correct
Image enhancement of degraded media	Recovers contrast and edges	Add information that was never recorded

The pattern across all of these is the same. The model expands what a human expert can attend to — by surfacing candidates, ranking possibilities, or filtering noise — without replacing the judgement step. We see this same division of labour across most applied computer vision work outside the headlines: the model narrows the search space, and a domain expert decides what the narrowed result means.
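To make “surfacing candidates” concrete for the terrain row of the table, here is a minimal sketch of one simple idea: subtract each cell’s neighbourhood average from its elevation and flag the cells that stand out. It is a simplified form of a local relief calculation, not the method of any particular project, and the function names, the grid and the threshold are all hypothetical.

```python
from typing import List, Tuple

Grid = List[List[float]]


def local_relief(dem: Grid, radius: int = 1) -> Grid:
    """Subtract each cell's neighbourhood mean (including itself) from its elevation."""
    rows, cols = len(dem), len(dem[0])
    relief = []
    for r in range(rows):
        relief_row = []
        for c in range(cols):
            # The window is clipped at the grid edges.
            window = [
                dem[i][j]
                for i in range(max(0, r - radius), min(rows, r + radius + 1))
                for j in range(max(0, c - radius), min(cols, c + radius + 1))
            ]
            relief_row.append(dem[r][c] - sum(window) / len(window))
        relief.append(relief_row)
    return relief


def candidates(dem: Grid, threshold: float) -> List[Tuple[int, int]]:
    """Return cells whose relief exceeds the threshold (in metres) in either direction."""
    relief = local_relief(dem)
    return [(r, c) for r, row in enumerate(relief) for c, v in enumerate(row) if abs(v) > threshold]


# Toy elevation grid in metres: flat ground with one raised cell at (2, 2).
dem = [[10.0] * 5 for _ in range(5)]
dem[2][2] = 11.5
flagged = candidates(dem, threshold=1.0)
```

The list in `flagged` is only a set of places worth a second look. Whether a bump is a burial mound, a field boundary or a fallen tree is the part the table’s right-hand column leaves to a person.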

Why the technique is harder than it looks

There are three honest constraints that the press tends to skip past.

Training data is scarce by definition. A model that reads carbonised Herculaneum ink cannot be trained on millions of examples, because there are not millions of carbonised Herculaneum scrolls. Teams work around this with synthetic data, transfer learning from related domains, and aggressive augmentation. Each of those introduces its own risk of hallucination — the model “sees” patterns that are artefacts of its training pipeline, not the underlying material.
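The augmentation step can be sketched in a few lines. The example below, with hypothetical function names, generates the rotations and mirror images of a single square patch, which multiplies the number of training examples without adding any new information about the scroll.

```python
from typing import List

Patch = List[List[float]]


def flip_horizontal(patch: Patch) -> Patch:
    """Mirror the patch left to right."""
    return [row[::-1] for row in patch]


def rotate_90(patch: Patch) -> Patch:
    """Rotate the patch clockwise by 90 degrees."""
    return [list(col) for col in zip(*patch[::-1])]


def augment(patch: Patch) -> List[Patch]:
    """Return the distinct rotations and mirror images of a square patch."""
    variants: List[Patch] = []
    current = patch
    for _ in range(4):
        for candidate in (current, flip_horizontal(current)):
            if candidate not in variants:
                variants.append(candidate)
        current = rotate_90(current)
    return variants


patch = [[1.0, 2.0],
         [3.0, 4.0]]
variants = augment(patch)
```

One labelled patch becomes up to eight, all derived from the same original. That is the trade the paragraph describes: more examples for the model, and more opportunity for it to learn regularities that come from the pipeline rather than from the papyrus.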

Validation is partial. When a vision model proposes a letter, there is often no ground truth to check it against. The scroll has never been read. Confidence comes from internal consistency across passages, cross-checking with known authors, and held-out tests on detached fragments whose surface ink can still be photographed and compared with the model’s output. None of those is the closed loop a benchmark normally enjoys.
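Where some labelled material does exist, the comparison itself is simple. The sketch below assumes a hand-labelled ink mask and a predicted one of the same shape, and reports how much of the predicted ink is real (precision) and how much of the real ink was found (recall). The function name and the toy masks are hypothetical.

```python
from typing import Dict, List

Mask = List[List[int]]  # 1 = ink, 0 = bare papyrus


def score_against_labels(predicted: Mask, labelled: Mask) -> Dict[str, float]:
    """Compare a predicted ink mask with a hand-labelled one, cell by cell."""
    tp = fp = fn = 0
    for pred_row, true_row in zip(predicted, labelled):
        for pred, true in zip(pred_row, true_row):
            if pred and true:
                tp += 1  # ink predicted, ink present
            elif pred and not true:
                fp += 1  # ink predicted, none present
            elif true and not pred:
                fn += 1  # ink present, missed by the model
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"precision": precision, "recall": recall}


labelled = [[1, 1, 0],
            [0, 1, 0]]
predicted = [[1, 0, 0],
             [0, 1, 1]]
metrics = score_against_labels(predicted, labelled)
```

The hard part is not the arithmetic but the coverage. Scores like these only say how the model behaves on material that could be labelled, which is a small and possibly unrepresentative slice of what it will be asked to read.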

The cost of error is asymmetric. A misread medical image will often be caught downstream, by a second reader or a follow-up test. A misread ancient scroll can quietly enter the literature and stay there. This is why the more careful projects publish not just the proposed transcription but also the model’s confidence maps and the raw scan data, so other teams can disagree.
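One way to picture what a published confidence map preserves is to keep an explicit “uncertain” band instead of forcing every location into ink or no ink. The sketch below does this with two thresholds. The function name and the cut-off values are hypothetical, chosen only for illustration.

```python
from typing import List


def label_with_uncertainty(
    prob_map: List[List[float]], low: float = 0.3, high: float = 0.7
) -> List[List[str]]:
    """Map ink probabilities to 'ink', 'bare' or 'uncertain' instead of a hard yes/no."""
    def label(p: float) -> str:
        if p >= high:
            return "ink"
        if p <= low:
            return "bare"
        return "uncertain"

    return [[label(p) for p in row] for row in prob_map]


# Toy probability map, as a model might emit for a 2x2 region.
prob_map = [[0.95, 0.55],
            [0.10, 0.72]]
labels = label_with_uncertainty(prob_map)
```

A transcription built only from the hard labels would hide the 0.55 cell. Publishing the map alongside the reading lets another team see exactly where the evidence was thin.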

Why does AI work better here than in some other domains?

The honest answer is that archaeology has a structural advantage: the questions are usually well-bounded, even when the material is badly damaged. “Is there ink at this voxel?” is a much narrower question than “what does this CT scan mean for this patient?” Narrow questions, applied to information-rich physical evidence, tend to be where deep learning earns its keep. The same principle drives the rise of AI in archaeological discoveries more broadly: none of site detection on LiDAR, sherd classification or inscription completion is open-ended in the way clinical or legal AI work tends to be.

That bounded quality is also why teams collaborate so freely across institutions. The Vesuvius Challenge made its scan data and its baseline models openly available, and most of the early progress came from outside the original research team. The field can do this because nobody is competing for a market. They are competing over who reads a particular passage first.

What the next few years probably look like

Expect more of the same pattern, slightly faster. The scanning hardware is improving on its own schedule, driven mostly by medical imaging and materials science rather than by archaeology. The models are getting cheaper to train and easier to fine-tune. Increasingly, the bottleneck is not the AI side but the supply of scanned, labelled, expert-annotated source material to train against.

Two things to watch. First, multimodal models that combine visual evidence with linguistic priors are starting to suggest plausible completions for damaged inscriptions in ways that single-modality systems could not. The risk of plausible-but-wrong output rises sharply here, and the field is still working out how to publish such suggestions responsibly. Second, the techniques developed for reading Herculaneum are being ported to medieval palimpsests, Egyptian mummy cartonnage, and Mesoamerican bark codices — each with its own physical constraints and its own community of experts. The portability is real but never automatic.
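The risk of plausible-but-wrong output is easiest to see with numbers. The sketch below treats the combination as a simple application of Bayes’ rule: a visual score for each candidate character is multiplied by a language prior and the result is normalised. This factorisation is an assumption made for illustration, not a description of how any particular multimodal model works, and the function name and all the numbers are hypothetical.

```python
from typing import Dict


def combine(visual: Dict[str, float], prior: Dict[str, float]) -> Dict[str, float]:
    """Posterior over candidate characters: visual likelihood times language prior, normalised."""
    unnormalised = {ch: visual[ch] * prior.get(ch, 0.0) for ch in visual}
    total = sum(unnormalised.values())
    if total == 0:
        raise ValueError("no candidate has non-zero support under both sources")
    return {ch: score / total for ch, score in unnormalised.items()}


# Hypothetical numbers: the image weakly favours "o", the language prior strongly expects "a".
visual = {"a": 0.4, "o": 0.6}
prior = {"a": 0.9, "o": 0.1}
posterior = combine(visual, prior)
best = max(posterior, key=posterior.get)
```

Here the combined answer is “a”, even though the image alone leaned towards “o”. When the prior is right, that is a useful correction. When the scribe wrote something unexpected, it is a confident error, which is why publishing the two sources of evidence separately matters.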

What stays the same is the underlying claim. AI in archaeology is not a discovery machine. It is a way of seeing more of evidence that already exists, and the test of any new system is not how impressive its output looks, but whether the experts in the room are willing to defend the result in print.

For a closer look at how these techniques are being applied across specific sites and projects, see AI in Archaeology: Advancements and Applications.

Frequently asked questions

Did AI actually read the Vesuvius scrolls on its own? No. A machine learning model detected the location of carbonised ink in CT scans of the scrolls. Human papyrologists then read the text the model had made visible. The press shorthand “AI read the scroll” compresses several distinct steps performed by different tools and people.

Can AI date archaeological objects? Not reliably on its own. Models can group artefacts by typology — suggesting that a fragment looks like material from a particular period — but dating depends on stratigraphic context, radiocarbon analysis, or other physical methods. The model narrows the question; it does not answer it.

Is AI replacing archaeologists? No, and the framing misunderstands the work. The labour that AI is taking off archaeologists’ plates — pixel-level scanning of damaged surfaces, large-scale terrain analysis, fragment sorting — is exactly the labour that used to limit how much actual interpretive work a team could do. More AI in the pipeline tends to mean more interpretation per archaeologist, not fewer archaeologists.

Why does AI work well for reading damaged manuscripts but struggle elsewhere? Because the question is narrow (“is there ink here?”) and the physical evidence is rich (high-resolution volumetric scans). Deep learning tends to perform best on bounded perception tasks over information-dense inputs. Open-ended reasoning tasks, where the question itself is fuzzy, remain much harder.

Credits: original framing inspired by The Guardian’s coverage of the Vesuvius Challenge.