A Stable Score Can Still Hide Unstable Evidence
What hardening AVS Rubric across three companies taught me about building an AI-native trust infrastructure diagnostic
TL;DR
Hardening the AVS Rubric across Beautiful.ai, Hex.tech, and ZoomInfo revealed that reliability is the visible outcome but evidence integrity is the underlying condition. Stable scores can hide unstable evidence. The biggest gains came from the evidence pipeline — not the model. Structured extraction multiplies both signal and noise. And the bar for an AI-native trust diagnostic is not interesting output — it is trustworthy evidence.
Trust in AI-native products is one of the harder problems to diagnose, and one of the most expensive to ignore. A trust gap can open before buyers ever try the product, and widen further before churn shows up in the performance data. That's because, from the company's public surface alone, buyers often cannot predict how the product will behave, what they will pay, or whether value will match their spend.
The Adaptive Value System (AVS) Rubric exists to diagnose whether that trust infrastructure is visible before the gap slows growth.
Last week I wrote about vibecoding the first version of AVS Rubric using Lovable and Claude Code. The core lesson was simple: shipping fast was not the hard part. Making the output reliable enough to trust was.
This week pushed that lesson further.
As I hardened the rubric across Beautiful.ai, Hex.tech, and ZoomInfo, the pattern became clear: reliability was only the visible problem. Evidence integrity was the deeper one.
What changed this week was not just the output. It was the discipline around how I checked it. Each iteration followed the same loop: inspect the evidence trail, classify the failure, fix the earliest point in the pipeline, then rerun the same company set to check for regressions. Tools like Lovable and Claude Code sped up parts of that work, but they did not define correctness. I still had to decide what counted as valid commercial evidence, what should be excluded, and which contradictions mattered enough to change the score.
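To make the rerun step concrete, here is a rough sketch of what comparing a fresh run against the previous baseline can look like. The ScanResult shape and the diffRuns helper are illustrative assumptions, not the actual AVS internals.

```typescript
// Hypothetical ScanResult shape for illustration; not the real AVS data model.
type ScanResult = {
  company: string;
  score: number;          // e.g. 11 out of 16
  evidenceUrls: string[]; // every source cited in the evidence trail
};

// Compare a fresh run against the previous baseline for the same company set,
// flagging score changes and evidence that appeared or disappeared.
function diffRuns(baseline: ScanResult[], current: ScanResult[]): string[] {
  const findings: string[] = [];
  for (const prev of baseline) {
    const next = current.find((r) => r.company === prev.company);
    if (!next) continue;
    if (next.score !== prev.score) {
      findings.push(`${prev.company}: score changed ${prev.score} -> ${next.score}`);
    }
    const dropped = prev.evidenceUrls.filter((u) => !next.evidenceUrls.includes(u));
    const added = next.evidenceUrls.filter((u) => !prev.evidenceUrls.includes(u));
    if (dropped.length) findings.push(`${prev.company}: dropped sources: ${dropped.join(", ")}`);
    if (added.length) findings.push(`${prev.company}: new sources: ${added.join(", ")}`);
  }
  return findings;
}
```

A diff like that catches regressions the headline score hides; deciding which diffs actually mattered stayed manual.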
That is where AVS Rubric started to feel like more than a scoring tool. It started to feel like the category ValueTempo is building toward: an AI-native trust infrastructure diagnostic.
The real problem was not scoring accuracy; it was evidence integrity
Here is what I learned this week:
- stable scores can hide unstable evidence quality
- the biggest failures happened before scoring, not during scoring
- structured extraction is a force multiplier: it multiplies both signal and noise
- some of the worst errors came from pages that looked legitimate, not obviously broken
- the most important fixes happened in the evidence pipeline, not in the model
Reliability is the visible outcome. Evidence integrity is the underlying condition.
The pattern showed up before the sample got big
I tested successive rubric versions across Beautiful.ai, Hex.tech, and ZoomInfo because I wanted different kinds of public surfaces.
The pattern showed up faster than I expected.
Across all three, the same failure classes kept recurring: evidence contamination from the wrong source pages, citation and export artifacts that weakened trust in the output, and wasted crawl attention on low-signal pages that crowded out the pages that mattered.
The examples were different. The failure class was the same.
In ZoomInfo's first version, developer API documentation was included as pricing evidence — technically from ZoomInfo's domain, clearly product-related, but describing integration patterns for developers rather than commercial terms for buyers.
In Hex's earliest version, several citations in the evidence set didn't resolve to real pages. They passed surface-level inspection, but the confidence scores they supported couldn't be independently verified.
In Beautiful.ai, a single missing source — the pricing page dropped from the evidence set in one version — simultaneously collapsed three dimension scores and generated a recommendation to publish a pricing page that already existed.
Different companies. Different surfaces. Same upstream problem: the evidence entering the system did not deserve to be reasoned over.
Beautiful.ai became the clearest lens for this post — its score progression made the deeper problem easiest to see.
Beautiful.ai made the hidden problem easiest to see
Across eight successive scans, Beautiful.ai became the clearest case study for one uncomfortable truth: the headline score can stay stable while the trustworthiness of the evidence changes materially.
In seven of eight runs, Beautiful.ai held at 11/16 (68%).
That sounds stable. It was not.
Those eight runs were not identical reruns on a frozen system. Each one reflected a successive fix to the rubric, while the same evolving diagnostic was being tested across multiple companies. And under that apparently stable score, the evidence changed materially:
- early runs still carried weak or misleading commercial logic
- one run dropped a key pricing source and destabilized multiple dimensions
- a later run pulled in a legitimate-looking page that contradicted the actual refund policy
- only the final run resolved all tracked issues at the same time
That was the real lesson. Not that the number changed dramatically; most of the time, it did not. A stable score can still hide unstable evidence.
The worst errors came from pages that looked legitimate
The most damaging failures were not obvious hallucinations.
They were pages that looked structured, clean, and trustworthy, but were not the right source of truth.
ZoomInfo: Legal page misread as product evidence
ZoomInfo's biometric data privacy notice — a compliance document — was pulled into the evidence set and interpreted as a customer-facing safety rail. It was structured, official-looking, and clearly from ZoomInfo's domain. But it described regulatory obligations, not product capabilities. The rubric scored a safety rail that didn't exist from a buyer's perspective.
Hex: A citation with no URL
A "Structured Pricing Data" source appeared in Hex's Cost Driver Mapping rationale — formatted like a real citation, labeled as structured data, specific enough to seem authoritative. It had no traceable URL. The evidence looked more rigorous than it was because structured formatting signals reliability.
That kind of error is worse than obvious noise. Obvious noise gets ignored. Clean but wrong evidence gets trusted. That asymmetry is what makes evidence contamination the hardest failure class to catch, and the most important one to fix.
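A simple guard would have caught the missing-URL case. Here is a minimal sketch of the kind of citation check I mean, assuming a hypothetical Citation shape; anything it flags should not be allowed to support a confidence score.

```typescript
// Hypothetical Citation shape; not the rubric's actual data model.
type Citation = { label: string; url?: string };

// Return the citations that cannot be independently verified:
// either no traceable URL at all, or a URL that does not resolve.
async function findUnverifiableCitations(citations: Citation[]): Promise<Citation[]> {
  const unverifiable: Citation[] = [];
  for (const c of citations) {
    if (!c.url) {
      unverifiable.push(c); // the "Structured Pricing Data" case above
      continue;
    }
    try {
      const res = await fetch(c.url, { method: "HEAD" });
      if (!res.ok) unverifiable.push(c);
    } catch {
      unverifiable.push(c); // dead link or network failure: treat as unverified
    }
  }
  return unverifiable;
}
```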
Structured extraction made the right pages more useful, and the wrong pages more dangerous
This was the most important technical lesson:
Structured extraction is a force multiplier. It multiplies both signal and noise.
When it points at the right page, it helps. Plans, limits, refund rules, overage behavior, and packaging logic become easier to compare and reason over.
When it points at the wrong page, it makes the error worse. The bad source becomes more legible, more confident, and more persuasive than it deserves to be.
That changed how I think about the system.
The hard problem is not just better scoring logic. It is upstream:
Does the evidence entering the system deserve to be reasoned over at all?
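In practice, that question becomes a gate that runs before extraction. The sketch below uses a crude URL-and-title heuristic; the keyword lists are illustrative assumptions, not the rubric's actual rules.

```typescript
// Illustrative keyword lists; the real rules would be richer than this.
const COMMERCIAL_SIGNALS = ["pricing", "plans", "billing", "refund", "enterprise", "faq"];
const EXCLUDED_SIGNALS = ["api-docs", "developer", "privacy", "legal", "careers", "blog"];

// Only pages that pass this gate get handed to structured extraction,
// so a wrong-but-legitimate-looking page never gets the chance to become
// more legible and persuasive than it deserves to be.
function deservesExtraction(url: string, title: string): boolean {
  const haystack = `${url} ${title}`.toLowerCase();
  if (EXCLUDED_SIGNALS.some((s) => haystack.includes(s))) return false;
  return COMMERCIAL_SIGNALS.some((s) => haystack.includes(s));
}
```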
Small evidence artifacts weakened trust in the whole diagnostic
A second class of failures came from citation hygiene and rendering artifacts.
These were not glamorous bugs. A malformed citation fragment. A note that exported badly. A formatting issue that made the evidence trail look less reliable than it was.
But that is why they mattered.
If the diagnostic claims to be evidence-based, the evidence presentation layer is not cosmetic. It is part of the product.
A founder or GTM leader should not have to wonder whether a strange citation artifact means the reasoning is sloppy too. Clean evidence presentation is part of what makes a diagnostic trustworthy enough to act on.
Low-signal pages quietly crowded out the pages that mattered
The third recurring failure was quieter, but just as important.
Low-signal pages were taking attention away from higher-signal surfaces like pricing, FAQ, billing, solutions, enterprise pages, and comparison pages.
This was not just an efficiency problem.
It changed what the rubric was able to see, and therefore what it was able to score.
That was a reminder that evidence collection is not neutral. It shapes the diagnostic long before the model starts reasoning.
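One way to keep crawl attention where it belongs is to rank candidate pages before fetching them. This is a rough sketch with made-up weights and a made-up page budget, not the crawler I actually run.

```typescript
// Illustrative weights for high-signal surfaces; the values are assumptions.
const SURFACE_WEIGHTS: Record<string, number> = {
  pricing: 10, billing: 9, faq: 8, enterprise: 8, solutions: 7, compare: 7,
};

// Rank candidate URLs by surface weight and keep only the top of the queue,
// so low-signal pages cannot crowd out the surfaces that decide the score.
function prioritizeCrawl(urls: string[], pageBudget: number = 25): string[] {
  return urls
    .map((url) => {
      const path = url.toLowerCase();
      const weight = Object.entries(SURFACE_WEIGHTS)
        .filter(([key]) => path.includes(key))
        .reduce((max, [, w]) => Math.max(max, w), 0);
      return { url, weight };
    })
    .sort((a, b) => b.weight - a.weight)
    .slice(0, pageBudget)
    .map((entry) => entry.url);
}
```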
The biggest gains came from the evidence pipeline, not the model
The most important improvements this week did not come from making the model smarter.
They came from making the evidence pipeline more disciplined.
The fixes fell into four buckets:
- excluding low-signal and misleading pages earlier
- validating source quality before structured extraction gets trusted
- cleaning evidence artifacts before they leak into the final diagnostic
- manually auditing the evidence trail instead of trusting the final score alone
That last point mattered more than I expected.
Some of the worst failures would have passed if I had only looked at the final score and rationale. They became obvious only when I reviewed the evidence trail line by line.
That is the kind of work that does not look dramatic from the outside, but it is where reliability gets built.
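The line-by-line review does not have to stay entirely manual, either. A small report like the sketch below, built around a hypothetical EvidenceItem shape, surfaces the problems a headline score will never show.

```typescript
// Hypothetical EvidenceItem shape for illustration only.
type EvidenceItem = {
  dimension: string;  // e.g. "Cost Driver Mapping"
  claim: string;      // what the rubric concluded
  sourceUrl?: string; // a missing URL means the support is unverifiable
};

// Source categories that should never back a commercial claim; illustrative list.
const SUSPECT_SOURCES = ["privacy", "legal", "developer", "api-docs"];

// Flatten the evidence trail into reviewable flags instead of trusting the score.
function auditEvidenceTrail(items: EvidenceItem[]): string[] {
  const flags: string[] = [];
  for (const item of items) {
    if (!item.sourceUrl) {
      flags.push(`[${item.dimension}] claim with no traceable source: "${item.claim}"`);
    } else if (SUSPECT_SOURCES.some((s) => item.sourceUrl.toLowerCase().includes(s))) {
      flags.push(`[${item.dimension}] suspect source for a commercial claim: ${item.sourceUrl}`);
    }
  }
  return flags; // an empty list is closer to what a stable score should mean
}
```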
Restraint improved the diagnostic more than another rushed fix would have
One thing I do not want this story to become is performance theater, where every unresolved issue gets framed as progress.
That is not discipline. That is thrashing.
Some issues were deliberately deferred — not because they do not matter, but because fixing them without a broader baseline in place would have introduced new variance rather than reducing it.
Not every bug deserves an immediate fix. Some deserve to be isolated, documented, and left untouched until lower-risk improvements establish a cleaner baseline.
That restraint made the diagnostic better.
Trust infrastructure diagnostics need auditable evidence, not just plausible output
This is the broader point behind the week's work.
ValueTempo is not trying to produce another generic AI output.
The ambition is narrower and more useful: to build an AI-native trust infrastructure diagnostic that helps founders and GTM leaders see whether their publicly observable signals make their product legible enough to support growth.
That is a different question from:
- is our pricing page good
- how do we compare to competitors
- what does an answer engine say about us
The real question is:
Can a buyer predict how this product behaves, what it may cost, and where the risk sits, before they commit more attention, budget, or trust?
That is what trust infrastructure is about.
And this week made one thing clear to me:
If the evidence base is weak, the diagnostic may still sound smart. It just does not deserve to be trusted.
The bar is no longer interesting output. It is trustworthy evidence.
This week's lesson was simple: stable numbers are not enough. The real bar is evidence integrity.
That is the difference between an AI output that sounds smart and a diagnostic a founder or GTM leader can use.
If your pricing, limits, usage model, or enterprise trust surfaces are hard for a buyer to piece together from your public surface, that is not just a messaging problem.
It is a trust infrastructure problem.
That is the problem AVS Rubric is designed to make legible.
If you want to see how the diagnostic evaluates trust infrastructure across the Trust Stack, start with the methodology page.
If you want to pressure test your own public surface, run the rubric on your own product at app.valuetempo.com.
