What I Learned Vibecoding an AI Startup Tool using Lovable + Claude Code
A build-in-public note on what broke, what worked, and what vibecoding an AI product taught me about reliability, production readiness, and trust infrastructure.
TL;DR
I vibecoded an AI scoring tool using Lovable and Claude Code. It shipped fast but failed on reliability — the same company got different scores on repeat runs. Fixing it required separating evidence collection from scoring, treating prompts as business logic, and building trust infrastructure (confidence labels, uncertainty flags, evidence citations). Five key lessons: (1) prompts are business logic, not glue code; (2) data quality matters more than model choice; (3) surface uncertainty instead of hiding it; (4) prototype-to-production is a full rebuild; (5) reliability is the product, not the feature. The AVS Rubric now scores trust infrastructure across 8 dimensions for AI-native SaaS companies.
I vibecoded an AI scoring tool to score the trust infrastructure of AI startups.
It looked great.
Then I ran the same company through it twice and got two different scores.
That's when I realized the hard part of building AI products today isn't shipping them. It's making the results reliable enough to trust.
Over the past few weeks I ran the rubric across 50+ AI-native SaaS companies, and along the way I uncovered several unexpected failure modes.
The Rubric Failed Its First Test
The AVS Rubric exists to answer one question:
Does your AI product expose enough trust infrastructure for growth to compound?
Before it could answer that for anyone else, it had to answer it honestly about itself.
Early on, it couldn't.
Three failures showed up immediately:
- the rubric wasn't scanning wide enough for evidence
- identical URLs produced different scores
- results differed across devices and sessions
The system looked deterministic on the surface. Underneath, it wasn't.
The root cause turned out to be architectural. Evidence collection and scoring were too tightly coupled. Any variance in what the scraper captured flowed directly into the score.
Separating those layers stabilized the results.
Lesson one: AI evaluation systems are harder to make reliable than they appear.
The Trust Infrastructure the Rubric Measures
For context, the rubric evaluates eight dimensions organized into four layers.
The idea is simple: gaps at the foundation cascade upward.
[Diagram: Observable Signals (public evidence) → AVS Rubric Engine (Evidence Processing, Reliability Guardrails) → Trust Infrastructure Layers (Product Clarity, Pricing Architecture, Operational Controls, Enterprise Readiness) → Buyer Trust Outcomes]
The AVS Rubric Engine evaluates observable signals across four layers of trust infrastructure to predict buyer trust outcomes.
Layer 1: Product Clarity
Who is this product for, and what outcome are they paying for?
- Product North Star
- Ideal Customers & Job Clarity
Layer 2: Pricing Architecture
Can customers understand what they pay for and why?
- Value Unit
- Cost Driver Mapping
Layer 3: Operational Controls
Does the product protect users from runaway usage and mistakes?
- Pools & Packaging
- Overages & Risk Allocation
- Safety Rails & Trust Surfaces
Layer 4: Enterprise Readiness
Can finance, IT, and legal easily find what they need through official channels to close the deal?
- Buyer & Budget Alignment
If customers cannot identify your value unit, your cost drivers, and your controls for monthly spend and token allocation, no amount of security documentation will close enterprise deals.
Five Things I Learned Vibecoding an AI Product
I built the first version using Lovable. The speed-to-done was impressive. After I uploaded a product requirements document, a working product with a nicely designed UI appeared in less than an hour.
But sooner or later we all realize that shipping a prototype and building something reliable are two very different problems.
Here are the lessons that mattered most to me.
1. Your Prompt Becomes Your Business Logic
The scoring rubric lives inside the edge function prompt.
Every nuance about how trust infrastructure is evaluated exists as natural-language instructions to the model.
That includes rules like:
- when missing rollover policy is a real weakness
- when it's a false negative for flat-rate pricing
- when safety rails may exist behind login walls
In AI systems, prompts are not just instructions.
They effectively become application logic written in natural language.
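To make that concrete, here is a minimal sketch of what "prompt as business logic" looks like in practice. The rule text and function names (`SCORING_RULES`, `buildScoringPrompt`) are hypothetical illustrations, not the rubric's actual prompt:

```typescript
// Hypothetical sketch: the scoring rules live inside the prompt string,
// so editing this template IS editing the application's business logic.
const SCORING_RULES: string[] = [
  "A missing rollover policy is a weakness ONLY for usage-based pricing.",
  "For flat-rate pricing, a missing rollover policy is NOT a negative signal.",
  "If safety rails are not visible, assume they may sit behind a login wall.",
];

function buildScoringPrompt(evidence: string): string {
  return [
    "You are scoring a company's trust infrastructure (0-2 per dimension).",
    "Apply these rules:",
    ...SCORING_RULES.map((rule, i) => `${i + 1}. ${rule}`),
    "Evidence:",
    evidence,
  ].join("\n");
}
```

Because the rules are plain strings, a one-word edit changes scoring behavior the same way a code change would, which is exactly why prompts deserve the same review discipline as code.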
2. Data Collection Quality Matters More Than the Model
Early versions simply scraped the first N links from a website.
That meant help center articles crowded out the pages that actually mattered.
The fix was a link prioritization layer that weighted paths like:
/plans
/security
/enterprise
Once the scraper started feeding the model better evidence, accuracy improved more than any prompt tweak.
Garbage in still equals garbage out.
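A link prioritization layer of this kind can be sketched in a few lines. The weight table and function names below are illustrative assumptions, not the production implementation:

```typescript
// Hypothetical link prioritizer: high-signal paths outrank generic help pages.
const PATH_WEIGHTS: Record<string, number> = {
  "/plans": 10,
  "/pricing": 10,
  "/security": 8,
  "/enterprise": 8,
  "/docs": 3,
  "/help": 1,
};

function linkWeight(rawUrl: string): number {
  const path = new URL(rawUrl).pathname;
  let best = 0;
  for (const [prefix, weight] of Object.entries(PATH_WEIGHTS)) {
    if (path.startsWith(prefix)) best = Math.max(best, weight);
  }
  return best;
}

// Return the top `limit` links by weight instead of the first N found.
function prioritizeLinks(urls: string[], limit: number): string[] {
  return [...urls].sort((a, b) => linkWeight(b) - linkWeight(a)).slice(0, limit);
}
```

The key design change is ranking before truncating: the scraper still caps how many pages it fetches, but the cap now falls on the lowest-signal pages instead of whichever links happened to appear first.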
3. AI Assessments Must Surface Uncertainty
One of the most common questions about the rubric is: "Isn't this subjective?"
The real issue is that a score without confidence is misleading.
Early versions produced a simple 0–2 score per dimension. That looked clean, but it hid a critical problem: the system sometimes had very little evidence to justify that score.
So instead of pretending the system knew more than it did, I redesigned the output to surface uncertainty explicitly.
Each dimension now includes:
- a score
- a confidence band (Strong / Partial / Sparse)
- an uncertainty explanation
For example, the system might say: "Budget caps not found."
That could mean the feature does not exist. Or it could mean the feature exists but lives behind a login wall the scraper cannot access.
To avoid hallucinating certainty, the rubric now includes:
- "not observable" flags
- prompts that flag missing insider information
This allows the system to say: "I cannot verify this from public evidence yet."
That small design decision dramatically increased the credibility of the output.
Ironically, the system became more trustworthy once it started admitting what it didn't know.
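The output shape that emerged can be sketched as a small type. The field names and the example values here are hypothetical, chosen to illustrate the structure rather than reproduce the rubric's schema:

```typescript
// Hypothetical result shape: every score carries its own uncertainty.
type ConfidenceBand = "Strong" | "Partial" | "Sparse";

interface DimensionResult {
  dimension: string;
  score: 0 | 1 | 2;
  confidence: ConfidenceBand;
  uncertainty: string;    // why the evidence may be incomplete
  notObservable: boolean; // true when public evidence cannot settle the question
}

// Example: a zero score that admits it may just be an access problem.
const example: DimensionResult = {
  dimension: "Overages & Risk Allocation",
  score: 0,
  confidence: "Sparse",
  uncertainty: "Budget caps not found; they may exist behind a login wall.",
  notObservable: true,
};
```

Attaching the `notObservable` flag to the score itself means downstream consumers (the UI, the PDF export) cannot accidentally present a sparse-evidence zero as a confident verdict.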
4. Vibecoding Makes Prototypes Fast. Production Requires Discipline.
Lovable made it possible to ship a working pipeline incredibly quickly.
Scraping → analysis → UI rendering → PDF export all appeared within a few sessions.
But moving from prototype to production exposed a completely different set of problems.
Interestingly, I found myself trusting Claude Code more to assess whether the system was production-grade.
Part of that is psychological. It simply felt more trustworthy to use a separate third-party system to judge the code quality rather than relying on the same tool that generated it.
Another reason was transparency. Claude Code gave much clearer visibility into:
- structural weaknesses
- security issues
- type safety gaps
- missing tests
That external assessment made it easier to decide what to fix first, second, and third.
Vibecoding makes it easy to build quickly. Production readiness requires a different mindset.
5. Reliability Matters: AI Is Probabilistic
Another subtle issue appeared early. Running the same company through the rubric twice could produce slightly different scores.
That is not surprising once you remember that LLMs are probabilistic systems.
The solution was to stop treating model output as ground truth and instead treat it as a noisy signal that needs stabilization.
The scoring engine now uses:
- three-pass voting
- median aggregation
- low-temperature inference
Each assessment runs multiple model passes and aggregates the result.
We stress-tested the setup with temperature variance tests across all eight dimensions. Variance dropped to 0 percent during calibration runs.
We also added a deterministic safety net. If the evidence clearly shows high-signal trust cues but the model still outputs a zero, the system floors the dimension score to 1 and raises the confidence level.
This prevents obvious false negatives where the model detects evidence but scores too conservatively.
LLM output must be surrounded by deterministic guardrails to behave like production-grade software.
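The stabilization logic above can be sketched as follows; the function names and the exact floor rule are illustrative assumptions about one way to implement it:

```typescript
// Hypothetical stabilization layer: aggregate three low-temperature passes
// by median, then apply a deterministic floor against false negatives.
function median(values: number[]): number {
  const sorted = [...values].sort((a, b) => a - b);
  return sorted[Math.floor(sorted.length / 2)];
}

function stabilizeScore(
  passes: number[],              // scores from multiple model passes
  hasHighSignalEvidence: boolean // set deterministically from the facts ledger
): number {
  let score = median(passes);
  // Guardrail: evidence clearly present but the model scored zero → floor to 1.
  if (hasHighSignalEvidence && score === 0) score = 1;
  return score;
}
```

The median (rather than the mean) means a single outlier pass cannot drag the score, and the floor is deliberately deterministic: it fires on observed evidence, not on another model call.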
What It Took to Make the Rubric Production-Grade
Shipping fast with Lovable created technical debt that needed to be addressed.
But interestingly, the hardest production problems were not model problems. They were infrastructure problems.
Five things mattered most.
1. Security and Bounded Inputs
The backend endpoints were hardened with:
- authenticated access
- strict input validation
- SSRF protections
- generic error masking
Without those protections, simple scraping requests could expose internal infrastructure or create unpredictable failures.
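As one concrete example of those protections, an SSRF guard can be sketched like this. The host patterns and function name are illustrative, and a production guard would also need to check where the hostname actually resolves via DNS:

```typescript
// Hypothetical SSRF guard: reject non-HTTP schemes and obviously private
// hosts before the scraper ever issues a fetch. NOTE: a real guard must
// also validate the resolved IP, since DNS can point a public name inward.
const PRIVATE_HOST = /^(localhost|127\.|10\.|192\.168\.|169\.254\.|\[?::1)/i;

function isAllowedTarget(raw: string): boolean {
  let url: URL;
  try {
    url = new URL(raw);
  } catch {
    return false; // unparseable input is rejected, not guessed at
  }
  if (url.protocol !== "https:" && url.protocol !== "http:") return false;
  if (PRIVATE_HOST.test(url.hostname)) return false;
  return true;
}
```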
2. Resilient Execution
Some rubric runs involve dozens of page scans and LLM calls. Early versions frequently hit serverless timeouts.
The fix was moving long-running workflows into background execution using async jobs. This allowed the system to:
- avoid request timeouts
- track job states (pending / completed / failed)
- complete large scans reliably
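The job-state tracking can be sketched as a small state machine. The type and transition rules below are an assumed shape, not the actual job table:

```typescript
// Hypothetical async-job record: long scans run in the background and the
// client polls job state instead of holding an HTTP request open.
type JobState = "pending" | "completed" | "failed";

interface ScanJob {
  id: string;
  state: JobState;
  startedAt: number;
  error?: string;
}

// Terminal states accept no further transitions, so a finished job
// can never be silently re-marked by a late-arriving worker.
const ALLOWED: Record<JobState, JobState[]> = {
  pending: ["completed", "failed"],
  completed: [],
  failed: [],
};

function transition(job: ScanJob, next: JobState): ScanJob {
  if (!ALLOWED[job.state].includes(next)) {
    throw new Error(`illegal transition ${job.state} -> ${next}`);
  }
  return { ...job, state: next };
}
```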
3. Versioned Caching
Another subtle problem was result drift across releases.
To stabilize outputs we added versioned caching with a 7-day TTL. This improved three things simultaneously:
- cost control
- latency
- consistency across runs
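The versioning piece is the important part of that cache. A minimal sketch, with hypothetical names and the 7-day TTL from above:

```typescript
// Hypothetical versioned cache key: bumping RUBRIC_VERSION invalidates all
// prior results at once, so a release can never mix old and new scoring logic.
const RUBRIC_VERSION = "v3"; // illustrative version string
const TTL_MS = 7 * 24 * 60 * 60 * 1000; // 7-day TTL

function cacheKey(targetUrl: string): string {
  return `${RUBRIC_VERSION}:${targetUrl}`;
}

function isFresh(storedAtMs: number, nowMs: number): boolean {
  return nowMs - storedAtMs < TTL_MS;
}
```

Embedding the version in the key trades some cache hit rate on release day for a hard guarantee of consistency across runs, which is the property the rubric actually needs.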
4. The Facts Ledger Pattern
One design pattern that emerged was what I now think of as a facts ledger.
Every dimension score is backed by structured evidence. That includes:
- binary subtests
- observed signals
- source evidence (URL + snippet pairs)
This prevents the model from generating convincing explanations without supporting data. The score must always tie back to observable facts.
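A facts ledger entry can be sketched as a structure plus one invariant check. The field names are assumed for illustration:

```typescript
// Hypothetical facts-ledger entry: a score is only valid if it cites evidence.
interface SourceEvidence {
  url: string;
  snippet: string;
}

interface FactsLedgerEntry {
  dimension: string;
  subtests: Record<string, boolean>; // binary subtests
  signals: string[];                 // observed signals
  evidence: SourceEvidence[];        // URL + snippet pairs
  score: number;
}

// Invariant: any nonzero score must tie back to at least one observed fact.
function isGrounded(entry: FactsLedgerEntry): boolean {
  return entry.score === 0 || entry.evidence.length > 0;
}
```

Checking the invariant outside the model is the point: the explanation can be as fluent as the LLM likes, but the score only stands if the ledger holds supporting snippets.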
5. What Actually Failed
One of the most annoying failures was not even model-related. It was schema drift.
Some results stored the dimension label as name, others as dimension.
That small inconsistency broke downstream queries.
The fix was simple but humbling. The query layer now uses COALESCE to normalize the schema.
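The same normalization can be expressed in application code, mirroring what COALESCE does in SQL. The row shape here is a hypothetical reconstruction of the drifted schema:

```typescript
// Hypothetical row normalizer, mirroring the SQL fix: some stored rows used
// `name`, others `dimension`. Accept either and emit one canonical field,
// like COALESCE(name, dimension) in the query layer.
interface RawRow {
  name?: string;
  dimension?: string;
  score: number;
}

function normalizeRow(row: RawRow): { dimension: string; score: number } {
  const label = row.name ?? row.dimension;
  if (label === undefined) throw new Error("row has no dimension label");
  return { dimension: label, score: row.score };
}
```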
A reminder that sometimes the biggest production issues come from the least interesting parts of the stack.
The Bigger Idea Behind the Rubric
The rubric is based on the Adaptive Value System (AVS).
AVS is a framework for aligning four forces inside AI-native products:
- value
- usage
- cost
- trust
Most companies measure the first three. Trust is often not designed intentionally.
The rubric evaluates whether enough of that infrastructure is visible to buyers before they commit.
What the Rubric Currently Measures
This version focuses on economic trust infrastructure:
- pricing legibility
- cost predictability
- spend controls
- operational safeguards
These are the first trust problems most AI products encounter.
The next evolution of the rubric will examine action accountability — what happens when AI systems begin executing tasks autonomously and something goes wrong.
But diagnosing that through public evidence is significantly harder. That capability is still being developed.
Try the Rubric
The AVS Rubric is free and live at valuetempo.com.
It evaluates only what buyers can see publicly.
Running your own product through it can be surprisingly revealing.
And if you're building an AI product yourself, I'd be curious what it finds.
