Most AI-Powered Engineering Tools Share One Hidden Dependency

Foundation model capability is a fast-moving market, which is good for buyers except for the part of a product built directly on top of it. For tools that analyze engineering specifications, that part matters more than it looks.

By Peter Virk, Patrick Bartsch • June 13, 2026

[Image: detailed grayscale view of a jet engine turbine, showing the layered precision of its internal blades. Photo by Lee Mills on Pexels]

Ask most vendors selling an "AI-powered" specification review tool what happens inside their product, and the honest answer is some version of: we send your document to a large language model and return what comes back. For a lot of software categories, that's a perfectly workable architecture. The value sits in the integration, the workflow, the data connectors built around the model call.

For a tool whose output ends up in a quality gate, a supplier dispute, or a certification record, the same architecture carries a different kind of exposure. Foundation model capability is a fast-moving market. New versions ship every few months, pricing shifts, older versions get retired, and what a given model is willing or able to do can change between releases. None of that is a problem with the market. A market that improves quickly is good for buyers. The problem is when the differentiated part of a product sits directly on top of that volatility, with nothing absorbing the shock when it moves.

Foundation model capability is a market, and markets move

Foundation model providers compete on general capability, and that capability changes on a timeline measured in months. A new model version can be faster and cheaper to run. It can also get measurably better at some tasks while behaving differently on others, all within the same release. Two companies sending the same requirements document to the same model with similarly worded prompts get similarly shaped output today. Whether they get similarly shaped output in a year depends on a roadmap neither of them controls.

If a model provider changes pricing, retires a version, or narrows what a particular model variant can access, every company built on that dependency feels it at the same time. Not because anyone did anything wrong. The layer they built on is, by design, the part of the stack that moves fastest. Betting the differentiated part of a product on the fastest-moving layer underneath it is a bet on that layer staying still long enough to matter. It usually doesn't.

This isn't hypothetical. In mid-2026, Anthropic restricted access to two of its model variants, Fable and Mythos, after a government directive required it. Anthropic said it disagreed with the decision but complied anyway, which is the point: the restriction wasn't a product or business choice either Anthropic or its customers could have anticipated or negotiated around. For companies that had built products on those two variants, a capability their differentiation depended on became unavailable on a timeline neither they nor Anthropic had set.

Why a changed model is a bigger problem for engineering than for chat

A chatbot that responds slightly differently after a model upgrade is, at worst, an inconsistency users adjust to. A specification analysis tool that responds differently creates a harder problem, because its output isn't conversation. It's a finding that other people act on.

Picture a programme that builds a release gate around "fewer than five unresolved high-severity contradictions in the requirements set." If that count comes from sending the document to a model and parsing its judgment, the count can shift for reasons that have nothing to do with the document. A model update that makes the underlying system slightly more, or less, willing to flag borderline cases will move the number. Nobody touched the specification. The gate's reading changed anyway, and the team now has to work out whether a count of four reflects the specification or a model that became more conservative in an update nobody on the team knew had happened.
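A toy version of that gate makes the mechanism concrete. Everything in this sketch is an assumption for illustration: the candidate pairs, the confidence scores, and the idea that a model update behaves like a shift in a single flagging threshold. Real model behaviour is not this tidy, but the effect on the gate is the same.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """A possible contradiction with a model-assigned confidence (hypothetical)."""
    pair: tuple
    confidence: float


def high_severity_count(candidates, flag_threshold):
    """Count the candidates a model would report at a given willingness to flag."""
    return sum(1 for c in candidates if c.confidence >= flag_threshold)


def gate_passes(count, limit=5):
    """Release gate from the text: fewer than `limit` unresolved contradictions."""
    return count < limit


# Same specification, same candidates. Only the flagging threshold differs,
# standing in for a model update nobody on the team tracked.
CANDIDATES = [
    Candidate(("REQ-1", "REQ-7"), 0.91),
    Candidate(("REQ-2", "REQ-9"), 0.84),
    Candidate(("REQ-3", "REQ-4"), 0.77),
    Candidate(("REQ-5", "REQ-8"), 0.66),
    Candidate(("REQ-6", "REQ-10"), 0.55),  # the borderline case
]

before = high_severity_count(CANDIDATES, flag_threshold=0.50)
after = high_severity_count(CANDIDATES, flag_threshold=0.60)

print(before, gate_passes(before))  # 5 False
print(after, gate_passes(after))    # 4 True
```

The gate flips from fail to pass on a single borderline pair, and nothing in the output says whether the document or the model changed.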

That's the gap between a number that sounds authoritative and a number that's reproducible. The first is fine for a sanity check. The second is what a quality gate, an audit trail, or a supplier dispute actually requires.

Two tools can look identical in a demo and diverge within a year

From the outside, a wrapper and a purpose-built engine can produce nearly identical-looking output. Both return a list of findings with severity ratings and plain-language explanations, and both can be pointed at a ReqIF export and return results within minutes. Most procurement evaluations compare two tools on a single sample document, because nobody runs a year-long trial before signing a contract. That kind of comparison will often turn up two outputs that look comparably thorough.

What that evaluation won't show is what happens when the same document runs again next quarter, after the vendor's model provider has shipped a couple of point releases neither company tracked closely. A pipeline built on fixed detection logic returns the same structural findings, because the logic didn't change. A wrapper returns whatever the new model version produces for that prompt. The overlap with the old output might be high, or it might not be. There's no way to tell from outside which one you bought.
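One way a buyer could measure that overlap is to re-run the same document and compare findings structurally, ignoring the explanatory wording. The sketch below assumes findings can be exported as records with a kind, a set of requirement IDs, and a severity. That export shape and the field names are hypothetical, not any particular tool's format.

```python
def finding_key(finding):
    """Identity of a finding, ignoring the free-text explanation."""
    return (
        finding["kind"],
        tuple(sorted(finding["requirements"])),
        finding["severity"],
    )


def overlap(run_a, run_b):
    """Jaccard overlap: shared findings divided by all distinct findings."""
    a = {finding_key(f) for f in run_a}
    b = {finding_key(f) for f in run_b}
    if not a and not b:
        return 1.0  # two empty runs agree completely
    return len(a & b) / len(a | b)


# Hypothetical exports of the same document, one quarter apart.
last_quarter = [
    {"kind": "contradiction", "requirements": ["REQ-12", "REQ-40"], "severity": "high",
     "explanation": "REQ-12 forbids what REQ-40 requires."},
    {"kind": "duplicate", "requirements": ["REQ-3", "REQ-18"], "severity": "low",
     "explanation": "Both describe the same logging behaviour."},
    {"kind": "contradiction", "requirements": ["REQ-7", "REQ-21"], "severity": "high",
     "explanation": "Timing limits cannot both hold."},
]
this_quarter = [
    {"kind": "contradiction", "requirements": ["REQ-40", "REQ-12"], "severity": "high",
     "explanation": "These two requirements conflict."},  # same finding, reworded
    {"kind": "duplicate", "requirements": ["REQ-3", "REQ-18"], "severity": "low",
     "explanation": "Near-identical wording."},
    {"kind": "contradiction", "requirements": ["REQ-7", "REQ-21"], "severity": "medium",
     "explanation": "Timing limits may conflict."},  # severity changed
]

print(overlap(last_quarter, this_quarter))  # 0.5
```

Rewording alone does not reduce the score; a changed severity does, because a downstream gate would treat it as a different finding.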

What a purpose-built analysis engine looks like instead

WYZER Detective's analysis engine, Sherlock, is built around a different premise: large language models are useful for specific, bounded jobs within an analysis, not as the entire analysis.

Contradiction detection is split into three classes (logical negation, numeric constraint, and semantic contradiction), each with its own detection logic. Duplicate detection runs on semantic similarity across the full requirement set, not on a single model's holistic read of the document. Quality scoring breaks each requirement down across defined dimensions (ambiguity, measurability, conciseness, and completeness), with scoring logic that stays the same regardless of which model generates the explanatory text a reviewer reads afterward. This layered approach is part of why software-defined vehicle (SDV) complexity demands a different kind of detective than a single model call can provide. The complexity has structure, and the analysis has to mirror it.
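To show what fixed detection logic can mean for just one of those classes, here is a minimal sketch of a numeric-constraint check. It is not Sherlock's implementation. The sentence template ("shall be at least / at most N unit"), the regular expression, and the function names are all assumptions made for this example, and a real specification would need far more robust parsing.

```python
import re
from itertools import combinations

# Assumed requirement template: "<subject> shall be at least|at most <number> <unit>."
PATTERN = re.compile(
    r"^(?P<subject>.+?) shall be (?P<op>at least|at most) "
    r"(?P<value>\d+(?:\.\d+)?) (?P<unit>\w+)\.?$",
    re.IGNORECASE,
)


def parse(req_id, text):
    """Extract a numeric bound from one requirement, or None if it doesn't fit."""
    m = PATTERN.match(text.strip())
    if m is None:
        return None
    return {
        "id": req_id,
        "subject": m["subject"].lower(),
        "op": m["op"].lower(),
        "value": float(m["value"]),
        "unit": m["unit"].lower(),
    }


def numeric_contradictions(requirements):
    """Return ID pairs whose lower bound exceeds the upper bound on the same quantity."""
    parsed = [p for p in (parse(i, t) for i, t in requirements.items()) if p]
    conflicts = []
    for a, b in combinations(parsed, 2):
        if (a["subject"], a["unit"]) != (b["subject"], b["unit"]):
            continue  # different quantity or unit: not comparable here
        if a["op"] == b["op"]:
            continue  # two bounds on the same side never contradict
        lower, upper = (a, b) if a["op"] == "at least" else (b, a)
        if lower["value"] > upper["value"]:
            conflicts.append((a["id"], b["id"]))
    return conflicts


SPEC = {
    "REQ-1": "The brake response time shall be at most 50 ms.",
    "REQ-2": "The brake response time shall be at least 80 ms.",
    "REQ-3": "The cabin temperature shall be at least 18 C.",
    "REQ-4": "The cabin temperature shall be at most 26 C.",
}

print(numeric_contradictions(SPEC))  # [('REQ-1', 'REQ-2')]
```

The same input produces the same pair on every run, because no model is consulted to decide whether the bounds conflict.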

Run against a specification with a few hundred requirements, this pipeline flagged duplicate candidates covering 12.8% of the set, came back with zero false conflicts on requirements that were already consistent, and scored the document at 6.7 out of 10 with a breakdown for every requirement. On a different specification, one that did contain genuine contradictions, the same detection logic isolated the directly conflicting, high-risk pairs, roughly 2% of the requirement set, from the surrounding low-risk overlap. Neither result comes from asking a model whether the document has problems. Both come from a methodology that runs the same way whether the specification holds hundreds of requirements or tens of thousands, and whether the input is plain text, tables, or embedded diagrams. It would produce the same structural findings even if the language model generating explanations were swapped out tomorrow.

An engine and a wrapper differ in one practical way. A wrapper's output is a function of the model. An engine's output is a function of the methodology, and the model is one component inside it: the part that explains a finding in plain language, not the part that decided there was a finding.
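That separation can be sketched in a few lines. The detection rule here is deliberately trivial (exact duplicates after normalising case and whitespace), and the two explainer functions are plain-Python stand-ins for two different language models. None of the names come from a real product; the point is only where the decision is made.

```python
def detect(requirements):
    """Fixed logic (toy rule): flag requirements identical after normalisation."""
    seen = {}
    findings = []
    for req_id, text in requirements.items():
        key = " ".join(text.lower().split())
        if key in seen:
            findings.append({"kind": "duplicate", "requirements": (seen[key], req_id)})
        else:
            seen[key] = req_id
    return findings


def analyse(requirements, explain):
    """Findings come from detect(); the explainer only adds wording."""
    return [{**f, "explanation": explain(f)} for f in detect(requirements)]


def explainer_v1(finding):
    """Stand-in for one model's phrasing."""
    first, second = finding["requirements"]
    return f"{first} and {second} say the same thing."


def explainer_v2(finding):
    """Stand-in for a newer model's phrasing."""
    return "Possible duplicate: " + ", ".join(finding["requirements"]) + "."


def structural(results):
    """Strip explanations so only the decided findings remain."""
    return [(r["kind"], r["requirements"]) for r in results]


SPEC = {
    "REQ-1": "The ECU shall log faults.",
    "REQ-2": "the ECU  shall log faults.",
    "REQ-3": "The ECU shall reboot.",
}

old = analyse(SPEC, explainer_v1)
new = analyse(SPEC, explainer_v2)
print(structural(old) == structural(new))             # True
print(old[0]["explanation"] == new[0]["explanation"])  # False
```

Swapping the explainer changes the wording a reviewer reads and nothing else. In a wrapper, the equivalent of detect() would itself be a model call, so both lines could change.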

The honest tradeoff: pipelines take longer to build than prompts

Building separate detection logic for three contradiction classes, a dedicated duplicate-detection layer, and per-dimension quality scoring takes longer than writing a prompt that asks a capable model to review a specification and list its concerns. The prompt-based approach can look impressive in a demo: point it at a messy document, get back a plausible list of problems within minutes.

The tradeoff is between time-to-demo and time-to-trust. A well-crafted prompt against a strong model often does surface real issues, and that's a legitimate starting point, not a gimmick. What it won't reliably do is surface the same issues, at the same severity, against the same document, after several model versions have shipped underneath it. For a one-off internal review, that might not matter much. For a finding that needs to hold up when a supplier disputes it, or when an auditor asks how a quality score was calculated a year later, it does.

We've made similar tradeoffs inside our own infrastructure. Evaluating data formats for the messages our analysis agents exchange took longer than defaulting to JSON and moving on, but at the scale of a 1,000-requirement document, that kind of detail compounds across every analysis run.

What to ask before an AI tool touches your specifications

Which model sits behind a tool is the least useful question to ask during an evaluation. Every vendor will name a capable one. The more useful questions are about what's fixed and what's exposed to the next model release:

If you ran the same document through this tool twice, would you get the same findings?
If the vendor's underlying model is upgraded next quarter, would the same document still produce the same findings, and would the vendor know if it didn't?
Which parts of the output come from fixed logic, and which parts are a model's judgment on the day?
How much of the product's differentiated value sits above the model, versus inside it?

None of this is an argument against AI in specification review. The case for it gets stronger as specifications grow past what manual review can realistically cover: a 1,000-requirement document creates roughly a million comparison pairs, and a 10,000-requirement document creates a hundred million. The argument is narrower: the parts of a product that explain a finding in plain language should track the model closely and improve as it does. The parts that decide whether there's a finding at all need to return the same answer no matter what's running underneath, or which version it's on. Knowing which is which, before you build or buy, is the part that's easy to skip.
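The quadratic growth behind those pair counts is easy to check. This short sketch counts pairs two ways, since the text does not say which convention it uses: ordered pairs (A compared with B and B with A) land close to the quoted million and hundred million, while comparing each pair once gives about half that.

```python
from math import comb


def comparison_pairs(n):
    """Return (unordered, ordered) pair counts for n requirements."""
    return comb(n, 2), n * (n - 1)


for n in (1_000, 10_000):
    unordered, ordered = comparison_pairs(n)
    print(f"{n:>6} requirements: {unordered:,} unordered, {ordered:,} ordered")

# Output:
#   1000 requirements: 499,500 unordered, 999,000 ordered
#  10000 requirements: 49,995,000 unordered, 99,990,000 ordered
```

Either way, a tenfold increase in requirements means roughly a hundredfold increase in comparisons.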

About the Authors

Peter Virk

Co-founder at Wyzer — building Sherlock to find what specifications hide

30+ years in automotive technology and digital innovation, including senior roles at Jaguar Land Rover, FORSEVEN, BlackBerry QNX, and Lotus Cars.

Patrick Bartsch

Co-founder at Wyzer — turning requirements intelligence into engineering confidence

20+ years in automotive software, cloud, and AI, including roles at Volkswagen Group, Audi, Jaguar Land Rover, and AWS, with a PhD in Electronics and Computer Science.