Document intelligence at scale: lessons from 37,000 lines of code

A few years ago I started building what became a document intelligence platform for the insurance and benefits industry. The codebase grew to about 37,000 lines. The platform shipped. It was licensed by a Fortune 1000 broker. It is in production.

The technical content of the work has been written about elsewhere. What I want to write about here are the architectural and operational lessons that I would carry forward into a second build, because the second build would look meaningfully different from the first.

Schemas first, prompts later

The single biggest investment that paid back was building a strict, versioned schema for every output the system produces. The LLM is constrained to produce JSON that conforms to the schema. Validation is non-negotiable. If the model fails to produce valid JSON, the system retries with a corrective prompt, and if that fails, it routes to a human reviewer.
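To make the validate, retry, escalate loop concrete, here is a minimal sketch using only the Python standard library. It is an illustration, not the platform's code: the schema fields, the `call_model` callable, and the status strings are all hypothetical names chosen for the example, and a real system would more likely use a full JSON Schema validator than this hand-rolled type check.

```python
import json

# Hypothetical versioned schema: required field name -> expected Python type.
# Note: the check is strict, so a JSON integer will not satisfy a float field.
SCHEMA_V1 = {"policy_number": str, "annual_premium": float, "effective_date": str}


def validate(raw, schema):
    """Parse raw model output; return (data, errors). data is None on failure."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, ["invalid JSON: " + exc.msg]
    if not isinstance(data, dict):
        return None, ["top-level value must be a JSON object"]
    errors = []
    for name, expected in schema.items():
        if name not in data:
            errors.append("missing field: " + name)
        elif not isinstance(data[name], expected):
            errors.append("wrong type for " + name + ": expected " + expected.__name__)
    for name in sorted(set(data) - set(schema)):
        errors.append("unexpected field: " + name)
    return (data if not errors else None), errors


def extract(document_text, call_model, schema=SCHEMA_V1, max_retries=1):
    """Call the model, validate, retry with a corrective prompt, then escalate."""
    prompt = document_text
    errors = []
    for _ in range(max_retries + 1):
        data, errors = validate(call_model(prompt), schema)
        if data is not None:
            return {"status": "ok", "data": data}
        # Corrective prompt: tell the model exactly why its output was rejected.
        reasons = "; ".join(errors)
        prompt = (document_text + "\n\nYour previous output was rejected: "
                  + reasons + ". Return only JSON matching the schema.")
    # Retries exhausted: route to a human reviewer instead of guessing.
    return {"status": "needs_human_review", "errors": errors}
```

The design point is that invalid output never reaches downstream code: the only two exits are validated data or an explicit hand-off to a person.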

This sounds obvious now. It was not obvious when I started. The temptation in early prototyping is to let the model produce free text and to extract structure post hoc with regular expressions or follow-up prompts. That works for demos. It does not survive contact with real document variance.

The lesson generalises. For any document intelligence system you are building, the schema is the architecture. Spend more time on the schema than on the prompts. The schema constrains the failure modes of the system in ways the prompts cannot.

Treat OCR as a first-class subsystem

For documents that are scanned or image-based, the OCR layer is the foundation of everything that comes after. A weak OCR layer produces garbage that no LLM can recover. I underweighted this in the first build and spent months debugging downstream hallucinations that turned out to be OCR failures.

What I would do differently: build the OCR pipeline as a distinct service with its own quality metrics, its own regression suite, and its own escalation path. Test it on the specific document types and quality levels you will see in production. Measure character error rate, word error rate, and structural recovery on tables. The downstream LLM is only as good as the text it receives.
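For reference, character error rate and word error rate are both edit distance divided by the length of the reference, computed over characters and over whitespace-separated words respectively. A minimal sketch follows; the function names are my own for illustration, and the definition assumes a non-empty reference transcription.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def char_error_rate(reference, hypothesis):
    """Character-level edits needed, per reference character."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


def word_error_rate(reference, hypothesis):
    """Word-level edits needed, per reference word."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)
```

A single misread digit shows why both numbers are worth tracking: for the reference "Premium 500" and the OCR output "Premium S00", the character error rate is 1/11 while the word error rate is 1/2.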

Layout recovery is harder than text recovery

Most documents in this domain have structure that text-only extraction loses. Tables. Multi-column layouts. Form fields. Headers, footers, and footnotes that are spatially separated from the content they relate to.

I underestimated how much of the work would be layout-aware extraction, not text extraction. Document layout models, table extraction models, and form-aware parsers were essential. The LLM was good at interpreting the content once it was correctly extracted, and weak at recovering structure from a flat text dump.

The architecture I would build next time has layout extraction as a separate stage from content interpretation, with explicit intermediate representations that the LLM operates on.
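One way to picture such an intermediate representation is a small set of typed records that the layout stage emits and the interpretation stage consumes. The sketch below is an assumption about shape, not a description of the platform: the class names, the block kinds, and the bracketed rendering format are all invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

BBox = Tuple[float, float, float, float]  # (left, top, right, bottom) in page units


@dataclass
class Block:
    kind: str   # e.g. "header", "paragraph", "footnote", "form_field"
    page: int
    bbox: BBox
    text: str


@dataclass
class Table:
    page: int
    bbox: BBox
    header: List[str]
    rows: List[List[str]]


@dataclass
class PageLayout:
    page: int
    blocks: List[Block] = field(default_factory=list)
    tables: List[Table] = field(default_factory=list)


def render_for_llm(layout):
    """Serialise one page so structure survives into the prompt.

    Blocks are ordered top-to-bottom, then left-to-right. That is a
    simplification: multi-column pages need a real reading-order step.
    """
    lines = ["[page %d]" % layout.page]
    for block in sorted(layout.blocks, key=lambda b: (b.bbox[1], b.bbox[0])):
        lines.append("[%s] %s" % (block.kind, block.text))
    for table in layout.tables:
        lines.append("[table]")
        lines.append(" | ".join(table.header))
        lines.extend(" | ".join(row) for row in table.rows)
    return "\n".join(lines)
```

Because the representation is explicit, the layout stage can be tested and versioned without involving the LLM at all.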

Evaluation is the product

The evaluation harness ended up being half the value of the platform. The customer cared about accuracy on their specific document mix. The only way to know whether a model change or a prompt change improved or regressed accuracy was to run a comprehensive eval on a held-out set of real documents, scored against ground truth.

Building that ground truth was expensive. Maintaining it was expensive. It was also the single most defensible part of the business. A competitor can copy the architecture. A competitor cannot quickly build a 5,000-document ground truth set labelled by trained domain experts.

If I were starting again, I would budget the ground truth set explicitly from week one, and I would not let any model change ship without an eval run.
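The core of such a gate can be small. The sketch below assumes exact-match scoring per field and documents keyed by an identifier; the function names, the field names in the usage note, and the tolerance parameter are hypothetical, and a production harness would likely need field-specific comparison rules for dates, amounts, and free text.

```python
def field_accuracy(predictions, ground_truth):
    """Per-field exact-match accuracy. Both arguments map doc id -> {field: value}."""
    correct, total = {}, {}
    for doc_id, truth in ground_truth.items():
        predicted = predictions.get(doc_id, {})  # a missing document scores zero
        for name, expected in truth.items():
            total[name] = total.get(name, 0) + 1
            if predicted.get(name) == expected:
                correct[name] = correct.get(name, 0) + 1
    return {name: correct.get(name, 0) / total[name] for name in total}


def regressions(baseline, candidate, tolerance=0.0):
    """Fields whose accuracy dropped by more than `tolerance` versus the baseline."""
    return sorted(name for name, base in baseline.items()
                  if candidate.get(name, 0.0) < base - tolerance)


def may_ship(baseline, candidate, tolerance=0.0):
    """Release gate: block the change if any field regressed."""
    return not regressions(baseline, candidate, tolerance)
```

In use, the baseline scores come from the currently deployed model or prompt and the candidate scores from the proposed change, both run over the same held-out set. Reporting per field rather than as one aggregate number matters because an overall improvement can hide a regression on a single field such as a premium amount.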

Failure modes you will not predict

A short list of failure modes I did not predict, each of which cost real time.

Documents with watermarks that the OCR partially recovered, producing garbled text in spatially specific regions. The LLM hallucinated to fill in the gaps. The fix was OCR-side, not LLM-side.

Documents in image formats with embedded fonts the OCR did not recognise, where every "5" was extracted as "S". Numerical fields were quietly wrong in a way that downstream calculations exposed.

Documents that were scanned upside down or at an angle. The fix was a preprocessing rotation correction stage. This was not in the original architecture.

Documents with multiple pages that were stapled together but were logically separate documents. The system processed them as one. The fix was a document boundary detection stage.
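As an illustration of what a boundary detection stage can look like in its simplest form, the sketch below splits a scanned bundle wherever a page carries a "Page 1 of N" marker. This is a heuristic I am assuming for the example, not the detector the platform used; documents without printed page markers would need other signals such as letterhead changes or title-page classification.

```python
import re

# Matches printed markers such as "Page 1 of 4" (case-insensitive).
PAGE_MARKER = re.compile(r"\bpage\s+(\d+)\s+of\s+(\d+)\b", re.IGNORECASE)


def split_documents(pages):
    """Group page indices into logical documents.

    `pages` is a list of per-page text. A new document starts at every page
    whose marker says it is page 1. Pages without a marker stay attached to
    the document that precedes them.
    """
    documents = []
    for index, text in enumerate(pages):
        match = PAGE_MARKER.search(text)
        starts_new = match is not None and int(match.group(1)) == 1
        if starts_new or not documents:
            documents.append([index])
        else:
            documents[-1].append(index)
    return documents
```

For a four-page scan where the first and third pages are each marked as page 1, this returns two groups of page indices, which can then be processed as two separate documents.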

None of these are interesting. All of them mattered.
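The "5" read as "S" failure is the kind that a cheap validation step can surface before downstream calculations do. The sketch below flags numeric fields that only parse as numbers once common OCR look-alike characters are substituted. It is a detection aid under my own assumptions (the confusable table, the accepted number format, and the status strings are all illustrative), and it deliberately flags rather than auto-corrects, since the underlying fix belongs on the OCR side.

```python
import re

# Letters that OCR commonly produces in place of digits (illustrative, not exhaustive).
CONFUSABLES = str.maketrans({"S": "5", "O": "0", "l": "1", "I": "1", "B": "8"})

# Optional dollar sign, digits with or without thousands separators, optional decimals.
NUMERIC = re.compile(r"^\$?(\d{1,3}(,\d{3})+|\d+)(\.\d+)?$")


def check_numeric_field(raw):
    """Classify a field that the schema says should be numeric."""
    value = raw.strip()
    if NUMERIC.match(value):
        return "ok"
    # Only becomes numeric after swapping look-alikes: probably an OCR misread.
    if NUMERIC.match(value.translate(CONFUSABLES)):
        return "suspect_ocr_confusion"
    return "invalid"
```

A value such as "1,2S0.00" is classified as a suspected OCR confusion and can be routed to review, while "1,250.00" passes and "N/A" is rejected as invalid.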

What I would tell someone starting today

Define the schema first. Treat OCR and layout as serious subsystems, not as preprocessing. Build the eval harness in week one. Plan for a long tail of preprocessing edge cases that have nothing to do with the LLM. Budget more time for the boring infrastructure than for the model work.

The model work is the part that gets the conference talks. The boring infrastructure is the part that gets the licensing deal.