Why a Trust Diagnostic Needs More Than Evals

Three layers of AI-native QA: how the AVS Rubric engineers for evidence integrity.

TL;DR

Most AI-native tools run evals on the model and traditional QA on the system. For diagnostics that read the external world, a third layer — input verification — has to sit upstream of both. Without it, the model produces confident, well-structured analysis of whatever evidence it was given. The AVS Rubric is built around six pipeline disciplines that make sure the evidence reaching the model deserves to be reasoned over.

If you're building an AI-native product, your public surface is already being evaluated by AI. Before a buyer books a demo, they're scanning your pricing page, your docs, and your trust surfaces, all increasingly discovered through ChatGPT, Claude, Gemini, and Perplexity. By the time they reach your site, a machine has often already shaped their impression of you.

When those signals are incomplete or contradictory, buyers don't always churn in obvious ways. They hesitate. They might still try your product, only to churn after a 7-day trial. They quietly route the budget to a competitor whose trust posture is easier to verify. You see the gap late: in the retention curve, in a lost deal, in pricing confusion a CFO flags on a call.

Which means the question isn't whether to use AI diagnostics on your own trust infrastructure. The question is whether the diagnostic's output deserves the trust it asks for.

Most don't. An AI system will confidently analyze whatever evidence it's given. If the evidence reaching the model is incomplete, contaminated, or wrong, the output will still look structured, still sound authoritative, and still be wrong. That failure mode is silent. It doesn't raise an exception.

The AVS Rubric is an evidence-based trust infrastructure diagnostic, which means its output is only as good as the evidence the pipeline delivers to the model. This post is about the engineering discipline that sits behind that output and the single QA layer most AI-native tools skip.


Three layers of QA, not two

Most software quality comes in two layers: evals (does the model reason correctly?) and traditional QA (does the feature work?). For AI systems that read external data, a third layer has to sit upstream of both.

Layer | What it tests | When it runs | What it assumes
Input Verification | Are the inputs actually correct? | Before the model call | Nothing. This is the check.
Evals | Does the model reason correctly? | Against model outputs | Inputs are correct
Traditional QA | Does the feature work? | Against your system | Model and inputs are correct

Evals assume the inputs are correct. QA assumes the model and inputs are both correct. For AI systems that read the external world (web scraping, RAG, agent tool calls), the inputs are dynamic and constantly changing. The assumption breaks. And when it breaks, the model doesn't error out. It produces confident, well-structured analysis of whatever it was given.
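
To make the layering concrete, here is a minimal sketch in Python. The names (verify_inputs, call_model) and the evidence shape are assumptions for illustration, not the rubric's actual code; the point is only that the verification step runs, and can fail loudly, before any model call.

```python
# Minimal sketch of the three-layer split. Names and the evidence shape are
# hypothetical; only the ordering matters: verification before the model call.

def verify_inputs(evidence: list[dict]) -> list[str]:
    """Input verification: checks the evidence itself, before any model call."""
    problems = []
    if not evidence:
        problems.append("no evidence collected")
    for page in evidence:
        if not page.get("text", "").strip():
            problems.append(f"empty extraction: {page.get('url')}")
    return problems


def call_model(evidence: list[dict]) -> dict:
    """Placeholder for the LLM scoring call; evals target this output."""
    return {"score": 0, "evidence_count": len(evidence)}


def run_diagnostic(evidence: list[dict]) -> dict:
    problems = verify_inputs(evidence)
    if problems:
        # Fail loudly here. A model call on bad evidence would still return
        # confident, well-structured output.
        raise ValueError(f"evidence failed verification: {problems}")
    return call_model(evidence)
```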

Most AI-native tools run evals on their model. Fewer run a dedicated verification layer on their inputs. The ones that don't can still produce trustworthy-sounding output until the evidence shifts underneath them.


What this looks like in practice

In a recent stress test, the AVS Rubric scored an AI-native company on its credit-based pricing model. The early runs surfaced all the wrong evidence.

The scraper walks a company's sitemap and ranks pages by URL pattern: /pricing high, /faq and /help next, generic paths lower. That logic is precise right up until the sitemap itself is noisy. In this case, the sitemap included developer API documentation, terms of service, Zendesk category navigation pages (titles, no content), and user-generated documents in Korean and Indonesian that the scraper couldn't distinguish from the company's own product pages.
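
A rough sketch of that ranking logic, with hypothetical patterns and weights (the real scraper's rules are more detailed):

```python
import re

# Hypothetical URL-pattern priorities; the real scraper's rules are more detailed.
URL_PRIORITIES = [
    (re.compile(r"/pricing"), 100),
    (re.compile(r"/(faq|help)"), 80),
    (re.compile(r"/(terms|legal|privacy)"), 10),
]
DEFAULT_PRIORITY = 30  # generic paths

def rank_sitemap(urls: list[str]) -> list[str]:
    """Order sitemap URLs so commercially meaningful pages are scraped first."""
    def priority(url: str) -> int:
        for pattern, weight in URL_PRIORITIES:
            if pattern.search(url):
                return weight
        return DEFAULT_PRIORITY
    return sorted(urls, key=priority, reverse=True)
```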

The pricing page itself was reachable. The detailed explanation of how credits actually work, however, was never captured or fed to the model. That article existed, but only inside the signed-in product experience, behind an FAQ link in an in-product modal. The scraper had no path to it from the marketing site, help center, or sitemap.

Missing that article, the rubric scored the company's Safety Rails dimension 0 out of 2 points. Its top recommendation was to publish a detailed explanation of how credits work. The company already had one.

After the pipeline was fixed and a path to that article was added, the score moved from 8/16 (50%) to 12/16 (75%). Same company. Same public information. Four points of score movement driven entirely by what the model was allowed to see.

That is exactly the failure mode a trust diagnostic cannot afford to produce silently.


What the rubric does to make its output defensible

Six pipeline disciplines work together to keep the evidence that reaches the model worth reasoning over.

URL-pattern exclusion rules. Certain page types are noise for a trust infrastructure scan: template pages, legal boilerplate, sitemap XML, developer subdomains, category navigation. These are blocked before entering the evidence pool. Every exclusion lives in a documented file with a "do not revert" rationale so we don't accidentally undo the product logic.
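
A sketch of what such a rules file might contain; the patterns and rationales below are illustrative, not the rubric's actual list:

```python
import re

# Illustrative exclusion rules; the rubric's actual list and rationales differ.
EXCLUDED_URL_PATTERNS = [
    # (pattern, "do not revert" rationale)
    (r"/sitemap.*\.xml$", "sitemap XML is structure, not evidence"),
    (r"/(terms|privacy|legal)", "legal boilerplate reads as noise on every dimension"),
    (r"^https?://developers\.", "developer subdomain documents the API, not the buyer-facing trust surface"),
    (r"/hc/[a-z-]+/categories/", "Zendesk category navigation pages have titles but no content"),
]

def is_excluded(url: str) -> str | None:
    """Return the rationale if a URL is blocked from the evidence pool, else None."""
    for pattern, rationale in EXCLUDED_URL_PATTERNS:
        if re.search(pattern, url):
            return rationale
    return None
```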

Intent-weighted page priority. Pricing pages rank highest, with FAQ and billing close behind. Comparison and solution pages get reserved slots. When crawl capacity forces tradeoffs, the pages that carry commercial signals are selected first.
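
One way to express the reserved-slot idea, with a made-up crawl budget and hypothetical page categories:

```python
# Hypothetical selection under a crawl budget: highest-priority commercial pages
# first, with slots held back for comparison and solution pages.
def select_pages(ranked: list[tuple[str, str]], budget: int = 20, reserved: int = 3) -> list[str]:
    """`ranked` is (url, category) pairs, already ordered by priority."""
    reserved_picks = [url for url, cat in ranked if cat in ("comparison", "solution")][:reserved]
    general_picks = [url for url, _ in ranked if url not in reserved_picks][: budget - len(reserved_picks)]
    return general_picks + reserved_picks
```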

Manual overrides for undiscoverable content. Some of the most important pages — including in-product credit explanations, trust center pages, and compliance documentation — are linked from JavaScript tooltips or in-product modals that standard crawlers cannot follow. A community_evidence table includes them explicitly on every run.
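
Conceptually, that override layer is just a union of crawled and manually listed URLs. A sketch, with the community_evidence table stood in for by a plain list and example.com URLs as placeholders:

```python
# Sketch: merge manually curated URLs into the evidence pool on every run.
# The real community_evidence table is a database table; a list stands in here,
# and the example.com URLs are placeholders.
COMMUNITY_EVIDENCE = [
    "https://example.com/help/how-credits-work",  # in-product credit explanation
    "https://example.com/trust",                  # trust center
]

def build_evidence_pool(discovered_urls: list[str]) -> list[str]:
    """Crawled pages plus the overrides the crawler can never reach on its own."""
    return list(dict.fromkeys(discovered_urls + COMMUNITY_EVIDENCE))  # dedupe, keep order
```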

Every wrong output logged. A running log captures every scan that produces a surprising result: the company, the affected dimension, the root cause, the fix, and whether it's resolved. Over time, the log becomes a prioritization tool. Pipeline misses get fixed first because they outnumber model errors by a significant margin.
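
The log itself can be as simple as one structured record per surprising result; the field names below are hypothetical:

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical shape of one wrong-output log entry.
@dataclass
class WrongOutputEntry:
    logged_on: date
    company: str
    dimension: str   # e.g. "Safety Rails"
    root_cause: str  # "pipeline miss" vs "model error" drives prioritization
    fix: str
    resolved: bool

log: list[WrongOutputEntry] = []
log.append(WrongOutputEntry(
    logged_on=date.today(),
    company="(redacted)",
    dimension="Safety Rails",
    root_cause="pipeline miss: credit explanation never reached the model",
    fix="add in-product article to community_evidence",
    resolved=True,
))
```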

Versioned cache. Every pipeline change bumps an analysis version, so scans never return stale results from before a fix.
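
A minimal version of that idea: the cache key embeds the analysis version, so bumping it after any pipeline change invalidates everything scanned before the fix (names here are assumptions):

```python
# Sketch: cache keyed by (domain, analysis version). Bumping the version on any
# pipeline change means a scan from before a fix can never be served again.
ANALYSIS_VERSION = "2025-06-01.3"  # hypothetical version string

_cache: dict[tuple[str, str], dict] = {}

def cached_scan(domain: str, run_scan) -> dict:
    key = (domain, ANALYSIS_VERSION)
    if key not in _cache:
        _cache[key] = run_scan(domain)
    return _cache[key]
```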

Three-pass median voting. Every scan runs three independent LLM passes at temperature 0.1. The median score per dimension is reported. Disagreement between passes becomes a diagnostic about evidence quality. When passes land on different scores, the variance usually traces back to thin or contradictory evidence in the input layer, not model randomness.
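
A sketch of the voting step, assuming each pass returns a per-dimension score dict; score_with_llm stands in for the real scoring call:

```python
import statistics

# Sketch of three-pass median voting. `score_with_llm` stands in for the real
# scoring call made at temperature 0.1 and returns {dimension: score}.
def vote(evidence: list[dict], score_with_llm, passes: int = 3) -> tuple[dict, dict]:
    runs = [score_with_llm(evidence) for _ in range(passes)]
    dimensions = runs[0].keys()
    scores = {d: statistics.median(run[d] for run in runs) for d in dimensions}
    # Spread between passes is a diagnostic about evidence quality, not noise.
    spread = {d: max(run[d] for run in runs) - min(run[d] for run in runs) for d in dimensions}
    return scores, spread
```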

None of these is visible in the final report. All of them are why the final report is defensible.


What a report includes

Every scan and analysis returns an evidence-backed diagnostic, not a verdict. Specifically:

  • A Trust Stack score across eight dimensions on a 0–16 scale, with a maturity band from Nascent to Advanced.
  • Dimension-level breakdowns showing pass/fail on each underlying subtest so you can see exactly which elements of your trust posture are landing and which aren't.
  • An evidence ledger. Every score cites the specific URL it was drawn from, with the extracted evidence visible. If you disagree with a score, you can trace the reasoning back to the source.
  • Prioritized recommendations organized by Trust Stack layer, so a product or GTM lead can sequence fixes from foundational (Product-ICP clarity, Pricing Architecture) up through enterprise readiness.

The evidence ledger is the part that tends to surprise people. Most AI tools produce rationale. The rubric produces a record you can audit.
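
For a sense of what a record you can audit means in practice, a single ledger entry might look something like this; the fields, URL, and evidence text are illustrative, not the report's exact schema:

```python
# Hypothetical shape of one evidence ledger entry; fields and values are illustrative.
ledger_entry = {
    "dimension": "Pricing Architecture",
    "subtest": "unit of value is defined on the pricing page",
    "score": 1,                                   # points awarded for this subtest
    "source_url": "https://example.com/pricing",
    "extracted_evidence": "1 credit = 1 generated report; overage billed monthly",
}
```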


What the rubric isn't

A common question: why not just use an answer engine?

A sophisticated user can replicate roughly 70% of a single rubric run with ChatGPT or Perplexity and good prompting. What answer engines can't reproduce is what accumulates with repeated use: the same score on the same evidence tomorrow, a comparable score across companies against identical subtests, and a dataset large enough to tell you where you sit in your category.

A related question: is this an AEO audit?

No. AEO tools evaluate individual pages for discoverability by answer engines, answering questions like: can ChatGPT find and cite this page? The AVS Rubric evaluates the public-facing trust layer as a system across eight dimensions, assessing whether pricing logic, cost drivers, safety rails, enterprise controls, and support content cohere into a trust posture a buyer can predict, verify, and defend. A company can pass an AEO audit (every page findable) and still fail the rubric (the pieces don't add up to a coherent story a buyer can act on). They answer different questions, and a company serious about growth should probably run both.


See what your surface looks like

The AVS Rubric is live at app.valuetempo.com. A single scan runs the full trust infrastructure analysis across eight dimensions and returns an evidence-backed score, gap breakdown, and prioritized recommendations. It is grounded in the engineering described above.

If you spot something the rubric should do differently on your own report, we want to know. Rate and comment on your results page. That feedback is how the loop keeps tightening.