Side note: the diagrams are kind of AI slop in this, but I wrote the rest. I will get around to fixing them later.
One of my favorite bits of Matt coding lore (a recent great addition to the canon of his many heroic battles against tests, interviews, and other mythical beasts) is the story of how I made Everstar's MVP for document drafts. Thinking back on it now, after being infantilized and subsequently lobotomized by coding agents, I am particularly shocked.
Essentially, on a random Friday afternoon, on a flight from JFK to BZN (Bozeman, MT, where my Dad lives), I decided to refactor and overhaul how draft generation was done in our monorepo. When I joined Everstar, draft generation was still in its infancy, so it was more or less a chaotic amalgamation of prompts and sequential control flow that took ages and degraded in quality over time. So I decided on this flight, with no internet, that I would reinvent it from the ground up. That's when I came up with the Document as a DAG model and implemented the multi-agent orchestration system for parallel section generation, which led to 10x faster document generation and fewer revisions overall. And the best part is that it worked the first time I tried it after landing in BZN. I still find it insane that I used to write all code from scratch, and even more insane that it would sometimes work.
Nuclear energy compliance is a much harder domain than the ones most B2B SaaS companies take on. Whereas most domains can get away with basic AI slop, nuclear compliance is vastly complicated, highly gatekept in the minds of domain experts, and has very few examples. Since there are only so many commercial reactors running in America, there are only so many examples of each document type being approved by the NRC, and each of those has multiple levels of revision. Even information that was not included in the revisions can be factored into approvals. Furthermore, these drafts are impossibly long, contain numerous diagrams that won't be picked up by normal OCR, and draw on countless data sources from numerous sensors and LERs (Licensee Event Reports). Gordian had to operate over licensing, aging management, quality assurance, safety basis, and security material.
We had a partnership with Excel Services (a nuclear compliance consultancy that specialized in a few types of reports, such as SLRs and LARs), so we started with Subsequent License Renewals and License Amendment Requests. We started by collecting all approved requests to the NRC for these document types, then separating them into training, validation, and test groups. The training examples were used to create templates, extract document structure and tone, and provide direct reference when generating new documents. The validation set was used for tuning the procedure to ensure that it generated valid drafts. Finally, the test set was used to benchmark the various draft generation approaches.
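A minimal sketch of that kind of split, assuming each approved filing has a string ID. The 70/15/15 ratios, the seed, the made-up `LAR-000` style IDs, and the `split_corpus` name are all illustrative assumptions, not the values we actually used.

```python
import random

def split_corpus(doc_ids, train_pct=70, val_pct=15, seed=0):
    """Shuffle document IDs deterministically and split them into three groups."""
    ids = sorted(doc_ids)              # sort first so the result ignores input order
    random.Random(seed).shuffle(ids)   # seeded shuffle keeps the split reproducible
    n_train = len(ids) * train_pct // 100
    n_val = len(ids) * val_pct // 100
    return {
        "train": ids[:n_train],
        "validation": ids[n_train:n_train + n_val],
        "test": ids[n_train + n_val:],  # whatever is left over
    }

# Hypothetical IDs, purely for illustration.
corpus = [f"LAR-{i:03d}" for i in range(20)]
splits = split_corpus(corpus)
```

The important property is that the split is fixed once and reused, so the test set never leaks into templates or tuning.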
The key insight behind my architecture was that documents have two primary sources of structure. First, they have order in the document. This usually shows up as a nested tree of section > heading > subheading > text (this is oversimplified). This is how we originally generated drafts: in the order the sections appear in the document. This is problematic for two main reasons: 1) different sections cannot be generated in parallel if we generate in this autoregressive-style approach, and 2) many sections either don't depend on earlier sections or do depend on later sections (like an analysis of an aging report table in the appendix). This is why there is a second structure that must be extracted from the document: the dependency graph, which is a DAG.
The dependency graph encodes what must exist before each section of a document can be generated. For example, a section might discuss a previous section as well as a chart in the appendix, so it would mark both of those as dependencies. Data sources can also be marked as dependencies, or we can just treat them as entities that must be curated but won't be included in the structure of the final document. The template extraction step produces both structures: the nested outline and the dependency graph.
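To make the two structures concrete, here is a rough sketch, not the actual Everstar code. The `Section` class, its field names, and the example section IDs are hypothetical. The only library piece is `graphlib.TopologicalSorter` from the Python standard library (3.9+).

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter

@dataclass
class Section:
    id: str
    title: str
    children: list["Section"] = field(default_factory=list)  # structure 1: document order
    depends_on: list[str] = field(default_factory=list)      # structure 2: dependency edges

def flatten(section):
    """Yield sections in reading order (depth-first, parent before children)."""
    yield section
    for child in section.children:
        yield from flatten(child)

def generation_order(root):
    """Return IDs in an order where every dependency comes before its dependents."""
    graph = {s.id: set(s.depends_on) for s in flatten(root)}
    # static_order() raises graphlib.CycleError if the graph is not a DAG.
    return list(TopologicalSorter(graph).static_order())

# Hypothetical document: the analysis reads a table that appears later, in the appendix.
root = Section("doc", "License Amendment Request", children=[
    Section("intro", "Introduction"),
    Section("analysis", "Aging Analysis", depends_on=["intro", "appendix-a"]),
    Section("appendix-a", "Appendix A: Aging Table", depends_on=["sensor-data"]),
])

reading_order = [s.id for s in flatten(root)]
order = generation_order(root)
```

Note that `sensor-data` is not a section in the tree at all. It only exists as a node in the dependency graph, which matches the idea of data sources that must be curated but never show up in the final outline.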
During inference, the orchestrator oversees the whole draft generation and spawns subagents to tackle each individual section. Each subagent has a variety of retrieval steps (see my retrieval post), computation steps (if needed), generation steps, and validation/critique steps. Because context is minimized to only the dependencies and multiple sections can generate in parallel, long compliance drafts can be generated relatively quickly and with much higher accuracy.
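The scheduling idea can be sketched with `asyncio`. This is a simplified illustration under assumptions: `generate_section` is a stand-in for a real subagent, the `deps` mapping and section IDs are hypothetical, and the graph is assumed to be acyclic (a cycle would make this version wait forever).

```python
import asyncio

async def generate_section(section_id, context):
    """Stand-in for a subagent: retrieve, compute, generate, critique."""
    await asyncio.sleep(0)  # placeholder for model and retrieval calls
    return f"[{section_id}] written from {sorted(context)}"

async def orchestrate(deps):
    """deps maps each section ID to the set of section IDs it depends on."""
    tasks = {}

    async def run(section_id):
        needed = sorted(deps[section_id])
        # Wait only for this section's own dependencies, nothing else.
        results = await asyncio.gather(*(tasks[d] for d in needed))
        context = dict(zip(needed, results))  # minimal context: dependencies only
        return await generate_section(section_id, context)

    # Create every task up front; none of them starts running until we await below,
    # so the tasks dict is complete before any run() body looks things up in it.
    for section_id in deps:
        tasks[section_id] = asyncio.create_task(run(section_id))

    texts = await asyncio.gather(*tasks.values())
    return dict(zip(tasks, texts))

deps = {
    "intro": set(),
    "appendix-a": set(),
    "analysis": {"intro", "appendix-a"},
}
drafts = asyncio.run(orchestrate(deps))
```

Here `intro` and `appendix-a` run concurrently, and `analysis` starts as soon as both finish, seeing only their text as context.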
Once the draft was generated, the composer was the interactive editing layer on top of the document graph: select a span, ask for a change, stream a patch into the editor, preserve IDs, and keep the trace graph aligned with user edits and regenerated sections.
All sources were traced so that each section had a clear history of what was used to write it and who did it (LLM or human alike).
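The ID-preserving patch and the per-section history can be sketched together. This is a minimal illustration with hypothetical names (`TraceEvent`, `SectionNode`, `apply_patch`) and made-up example text and source IDs. The real composer streamed patches into an editor, which this sketch leaves out.

```python
from dataclasses import dataclass, field

@dataclass
class TraceEvent:
    author: str          # "human" or "llm"
    action: str          # e.g. "generate" or "patch"
    sources: list[str]   # IDs of the documents or sections that were consulted

@dataclass
class SectionNode:
    id: str
    text: str
    history: list[TraceEvent] = field(default_factory=list)

def apply_patch(node, start, end, replacement, author, sources=()):
    """Replace node.text[start:end], keep the node's ID, and log who made the edit."""
    node.text = node.text[:start] + replacement + node.text[end:]
    node.history.append(TraceEvent(author, "patch", list(sources)))
    return node

# Made-up section content and source ID, for illustration only.
node = SectionNode(
    id="sec-3.1",
    text="The pump was inspected in 2019.",
    history=[TraceEvent("llm", "generate", ["example-source-1"])],
)
start = node.text.index("2019")
apply_patch(node, start, start + 4, "2021", author="human")
```

Because the patch mutates the existing node instead of creating a new one, anything keyed on `sec-3.1` (dependency edges, traces, evaluations) still points at the right section after the edit.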
The number one thing that any AI company can do, especially one in a critical industry like nuclear, is implement solid evaluations, so eval loops were a crucial part of everything I made at Everstar. Another tab in the document draft editor is the evaluation tab, where users can run various evaluator agents against the draft to analyze and critique things like tone, accuracy, hallucinations, or redundancy. From there, users can ask the composer to fix issues or fix them manually.
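As a rough sketch of the evaluation tab's shape: a registry of named evaluators that each take the draft and return findings. The real evaluators were agents; the two deterministic checks below are simple stand-ins I am using for illustration, and all of the names (`redundancy`, `uncited_numbers`, `EVALUATORS`, `run_evaluators`) are hypothetical.

```python
import re
from collections import Counter

def split_sentences(text):
    """Naive sentence splitter: break on whitespace that follows ., !, or ?."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def redundancy(text):
    """Flag sentences that appear more than once (case-insensitive)."""
    counts = Counter(s.lower() for s in split_sentences(text))
    return [s for s, n in counts.items() if n > 1]

def uncited_numbers(text):
    """Flag sentences that contain a digit but no bracketed citation like [1]."""
    return [
        s for s in split_sentences(text)
        if re.search(r"\d", s) and not re.search(r"\[\d+\]", s)
    ]

EVALUATORS = {"redundancy": redundancy, "uncited_numbers": uncited_numbers}

def run_evaluators(text, names):
    """Run the selected evaluators and collect their findings by name."""
    return {name: EVALUATORS[name](text) for name in names}

# Made-up draft text, for illustration only.
draft = (
    "Pump A passed inspection [1]. "
    "The margin is 12 percent. "
    "Pump A passed inspection [1]."
)
report = run_evaluators(draft, ["redundancy", "uncited_numbers"])
```

Each finding can then be handed to the composer as a fix request or left for a human to resolve.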