What I Learned Vibecoding an AI Startup Tool using Lovable + Claude Code
A build-in-public note on what broke, what worked, and what vibecoding an AI product taught me about reliability, production readiness, and trust infrastructure.
TL;DR
I vibecoded an AI scoring tool using Lovable and Claude Code. It shipped fast but failed on reliability — the same company got different scores on repeat runs. Fixing it required separating evidence collection from scoring, treating prompts as business logic, and building trust infrastructure (confidence labels, uncertainty flags, evidence citations). Five key lessons: (1) prompts are business logic, not glue code; (2) data quality matters more than model choice; (3) surface uncertainty instead of hiding it; (4) prototype-to-production is a full rebuild; (5) reliability is the product, not the feature. The AVS Rubric now scores trust infrastructure across 8 dimensions for AI-native SaaS companies.
I vibecoded an AI scoring tool to score the trust infrastructure of AI startups.
It looked great.
Then I ran the same company through it twice and got two different scores.
That's when I realized the hard part of building AI products today isn't shipping them. It's making the results reliable enough to trust.
Over the past few weeks I ran the rubric across 50+ AI-native SaaS companies, and along the way I uncovered several unexpected failure modes.
The Rubric Failed Its First Test
The AVS Rubric exists to answer one question:
Does your AI product expose enough trust infrastructure for growth to compound?
Before it could answer that for anyone else, it had to answer it honestly about itself.
Early on, it couldn't.
Three failures showed up immediately:
- the rubric wasn't scanning wide enough for evidence
- identical URLs produced different scores
- results differed across devices and sessions
The system looked deterministic on the surface. Underneath, it wasn't.
The root cause turned out to be architectural. Evidence collection and scoring were too tightly coupled. Any variance in what the scraper captured flowed directly into the score.
Separating those layers stabilized the results.
Lesson one: AI evaluation systems are harder to make reliable than they appear.
The Trust Infrastructure the Rubric Measures
For context, the rubric evaluates eight dimensions organized into four layers.
The idea is simple: gaps at the foundation cascade upward.
[Diagram: Observable Signals (public evidence) → AVS Rubric Engine (Evidence Processing, Reliability Guardrails) → Trust Infrastructure Layers (Product Clarity, Pricing Architecture, Operational Controls, Enterprise Readiness) → Buyer Trust Outcomes]
The AVS Rubric Engine evaluates observable signals across four layers of trust infrastructure to predict buyer trust outcomes.
Layer 1: Product Clarity
Who is this product for, and what outcome are they paying for?
- Product North Star
- Ideal Customers & Job Clarity
Layer 2: Pricing Architecture
Can customers understand what they pay for and why?
- Value Unit
- Cost Driver Mapping
Layer 3: Operational Controls
Does the product protect users from runaway usage and mistakes?
- Pools & Packaging
- Overages & Risk Allocation
- Safety Rails & Trust Surfaces
Layer 4: Enterprise Readiness
Can finance, IT, and legal easily find what they need through official channels to close the deal?
- Buyer & Budget Alignment
If customers cannot identify your value unit, your cost drivers, and your controls for monthly spend and token allocation, no amount of security documentation will close enterprise deals.
Five Things I Learned Vibecoding an AI Product
I built the first version using Lovable. The speed-to-done was impressive. After I uploaded a product requirements document, a working product with a nicely designed UI appeared in less than an hour.
But sooner or later we all realize that shipping a prototype and building something reliable are two very different problems.
Here are the lessons that mattered most to me.
1. Your Prompt Becomes Your Business Logic
The scoring rubric lives inside the edge function prompt.
Every nuance about how trust infrastructure is evaluated exists as natural-language instructions to the model.
That includes rules like:
- when missing rollover policy is a real weakness
- when it's a false negative for flat-rate pricing
- when safety rails may exist behind login walls
In AI systems, prompts are not just instructions.
They effectively become application logic written in natural language.
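To make that concrete, here is a minimal sketch of what "prompt as business logic" looks like in practice. The rule text and function names (`SCORING_RULES`, `buildScoringPrompt`) are hypothetical illustrations, not the rubric's actual prompt:

```typescript
// Hypothetical sketch: the scoring rules live inside the prompt string,
// so editing this template IS editing the application's business logic.
const SCORING_RULES: string[] = [
  "A missing rollover policy is a weakness ONLY for usage-based pricing.",
  "For flat-rate pricing, a missing rollover policy is NOT a negative signal.",
  "If safety rails are not visible, assume they may sit behind a login wall.",
];

function buildScoringPrompt(evidence: string): string {
  return [
    "You are scoring a company's trust infrastructure (0-2 per dimension).",
    "Apply these rules:",
    ...SCORING_RULES.map((rule, i) => `${i + 1}. ${rule}`),
    "Evidence:",
    evidence,
  ].join("\n");
}
```

Because the rules are plain strings, a one-word edit changes scoring behavior the same way a code change would, which is exactly why prompts deserve the same review discipline as code.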
2. Data Collection Quality Matters More Than the Model
Early versions simply scraped the first N links from a website.
That meant help center articles crowded out the pages that actually mattered.
The fix was a link prioritization layer that weighted paths like:
/plans
/security
/enterprise
Once the scraper started feeding the model better evidence, accuracy improved more than any prompt tweak.
Garbage in still equals garbage out.
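A link prioritization layer of this kind can be sketched in a few lines. The weight table and function names below are illustrative assumptions, not the production implementation:

```typescript
// Hypothetical link prioritizer: high-signal paths outrank generic help pages.
const PATH_WEIGHTS: Record<string, number> = {
  "/plans": 10,
  "/pricing": 10,
  "/security": 8,
  "/enterprise": 8,
  "/docs": 3,
  "/help": 1,
};

function linkWeight(rawUrl: string): number {
  const path = new URL(rawUrl).pathname;
  let best = 0;
  for (const [prefix, weight] of Object.entries(PATH_WEIGHTS)) {
    if (path.startsWith(prefix)) best = Math.max(best, weight);
  }
  return best;
}

// Return the top `limit` links by weight instead of the first N found.
function prioritizeLinks(urls: string[], limit: number): string[] {
  return [...urls].sort((a, b) => linkWeight(b) - linkWeight(a)).slice(0, limit);
}
```

The key design change is ranking before truncating: the scraper still caps how many pages it fetches, but the cap now falls on the lowest-signal pages instead of whichever links happened to appear first.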
3. AI Assessments Must Surface Uncertainty
One of the most common questions about the rubric is: "Isn't this subjective?"
The real issue is that a score without confidence is misleading.
Early versions produced a simple 0–2 score per dimension. That looked clean, but it hid a critical problem: the system sometimes had very little evidence to justify that score.
So instead of pretending the system knew more than it did, I redesigned the output to surface uncertainty explicitly.
Each dimension now includes:
- a score
- a confidence band (Strong / Partial / Sparse)
- an uncertainty explanation
For example, the system might say: "Budget caps not found."
That could mean the feature does not exist. Or it could mean the feature exists but lives behind a login wall the scraper cannot access.
To avoid hallucinating certainty, the rubric now includes:
- "not observable" flags
- prompts that flag missing insider information
This allows the system to say: "I cannot verify this from public evidence yet."
That small design decision dramatically increased the credibility of the output.
Ironically, the system became more trustworthy once it started admitting what it didn't know.
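The output shape that emerged can be sketched as a small type. The field names and the example values here are hypothetical, chosen to illustrate the structure rather than reproduce the rubric's schema:

```typescript
// Hypothetical result shape: every score carries its own uncertainty.
type ConfidenceBand = "Strong" | "Partial" | "Sparse";

interface DimensionResult {
  dimension: string;
  score: 0 | 1 | 2;
  confidence: ConfidenceBand;
  uncertainty: string;    // why the evidence may be incomplete
  notObservable: boolean; // true when public evidence cannot settle the question
}

// Example: a zero score that admits it may just be an access problem.
const example: DimensionResult = {
  dimension: "Overages & Risk Allocation",
  score: 0,
  confidence: "Sparse",
  uncertainty: "Budget caps not found; they may exist behind a login wall.",
  notObservable: true,
};
```

Attaching the `notObservable` flag to the score itself means downstream consumers (the UI, the PDF export) cannot accidentally present a sparse-evidence zero as a confident verdict.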
4. Vibecoding Makes Prototypes Fast. Production Requires Discipline.
Lovable made it possible to ship a working pipeline incredibly quickly.
Scraping → analysis → UI rendering → PDF export all appeared within a few sessions.
But moving from prototype to production exposed a completely different set of problems.
Interestingly, I found myself trusting Claude Code more to assess whether the system was production-grade.
Part of that is psychological. It simply felt more trustworthy to use a separate third-party system to judge the code quality rather than relying on the same tool that generated it.
Another reason was transparency. Claude Code gave much clearer visibility into:
- structural weaknesses
- security issues
- type safety gaps
- missing tests
That external assessment made it easier to decide what to fix first, second, and third.
Vibecoding makes it easy to build quickly. Production readiness requires a different mindset.
5. Reliability Matters: AI Is Probabilistic
Another subtle issue appeared early. Running the same company through the rubric twice could produce slightly different scores.
That is not surprising once you remember that LLMs are probabilistic systems.
The solution was to stop treating model output as ground truth and instead treat it as a noisy signal that needs stabilization.
The scoring engine now uses:
- three-pass voting
- median aggregation
- low-temperature inference
Each assessment runs multiple model passes and aggregates the result.
We stress-tested the setup with temperature variance tests across all eight dimensions. Variance dropped to 0 percent during calibration runs.
We also added a deterministic safety net. If the evidence clearly shows high-signal trust cues but the model still outputs a zero, the system floors the dimension score to 1 and raises the confidence level.
This prevents obvious false negatives where the model detects evidence but scores too conservatively.
LLM output must be surrounded by deterministic guardrails to behave like production-grade software.
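The stabilization logic above can be sketched as follows; the function names and the exact floor rule are illustrative assumptions about one way to implement it:

```typescript
// Hypothetical stabilization layer: aggregate three low-temperature passes
// by median, then apply a deterministic floor against false negatives.
function median(values: number[]): number {
  const sorted = [...values].sort((a, b) => a - b);
  return sorted[Math.floor(sorted.length / 2)];
}

function stabilizeScore(
  passes: number[],              // scores from multiple model passes
  hasHighSignalEvidence: boolean // set deterministically from the facts ledger
): number {
  let score = median(passes);
  // Guardrail: evidence clearly present but the model scored zero → floor to 1.
  if (hasHighSignalEvidence && score === 0) score = 1;
  return score;
}
```

The median (rather than the mean) means a single outlier pass cannot drag the score, and the floor is deliberately deterministic: it fires on observed evidence, not on another model call.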
What It Took to Make the Rubric Production-Grade
Shipping fast with Lovable created technical debt that needed to be addressed.
But interestingly, the hardest production problems were not model problems. They were infrastructure problems.
Five things mattered most.
1. Security and Bounded Inputs
The backend endpoints were hardened with:
- authenticated access
- strict input validation
- SSRF protections
- generic error masking
Without those protections, simple scraping requests could expose internal infrastructure or create unpredictable failures.
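As one concrete example of those protections, an SSRF guard can be sketched like this. The host patterns and function name are illustrative, and a production guard would also need to check where the hostname actually resolves via DNS:

```typescript
// Hypothetical SSRF guard: reject non-HTTP schemes and obviously private
// hosts before the scraper ever issues a fetch. NOTE: a real guard must
// also validate the resolved IP, since DNS can point a public name inward.
const PRIVATE_HOST = /^(localhost|127\.|10\.|192\.168\.|169\.254\.|\[?::1)/i;

function isAllowedTarget(raw: string): boolean {
  let url: URL;
  try {
    url = new URL(raw);
  } catch {
    return false; // unparseable input is rejected, not guessed at
  }
  if (url.protocol !== "https:" && url.protocol !== "http:") return false;
  if (PRIVATE_HOST.test(url.hostname)) return false;
  return true;
}
```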
2. Resilient Execution
Some rubric runs involve dozens of page scans and LLM calls. Early versions frequently hit serverless timeouts.
The fix was moving long-running workflows into background execution using async jobs. This allowed the system to:
- avoid request timeouts
- track job states (pending / completed / failed)
- complete large scans reliably
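The job-state tracking can be sketched as a small state machine. The type and transition rules below are an assumed shape, not the actual job table:

```typescript
// Hypothetical async-job record: long scans run in the background and the
// client polls job state instead of holding an HTTP request open.
type JobState = "pending" | "completed" | "failed";

interface ScanJob {
  id: string;
  state: JobState;
  startedAt: number;
  error?: string;
}

// Terminal states accept no further transitions, so a finished job
// can never be silently re-marked by a late-arriving worker.
const ALLOWED: Record<JobState, JobState[]> = {
  pending: ["completed", "failed"],
  completed: [],
  failed: [],
};

function transition(job: ScanJob, next: JobState): ScanJob {
  if (!ALLOWED[job.state].includes(next)) {
    throw new Error(`illegal transition ${job.state} -> ${next}`);
  }
  return { ...job, state: next };
}
```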
3. Versioned Caching
Another subtle problem was result drift across releases.
To stabilize outputs we added versioned caching with a 7-day TTL. This improved three things simultaneously:
- cost control
- latency
- consistency across runs
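The versioning piece is the important part of that cache. A minimal sketch, with hypothetical names and the 7-day TTL from above:

```typescript
// Hypothetical versioned cache key: bumping RUBRIC_VERSION invalidates all
// prior results at once, so a release can never mix old and new scoring logic.
const RUBRIC_VERSION = "v3"; // illustrative version string
const TTL_MS = 7 * 24 * 60 * 60 * 1000; // 7-day TTL

function cacheKey(targetUrl: string): string {
  return `${RUBRIC_VERSION}:${targetUrl}`;
}

function isFresh(storedAtMs: number, nowMs: number): boolean {
  return nowMs - storedAtMs < TTL_MS;
}
```

Embedding the version in the key trades some cache hit rate on release day for a hard guarantee of consistency across runs, which is the property the rubric actually needs.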
4. The Facts Ledger Pattern
One design pattern that emerged was what I now think of as a facts ledger.
Every dimension score is backed by structured evidence. That includes:
- binary subtests
- observed signals
- source evidence (URL + snippet pairs)
This prevents the model from generating convincing explanations without supporting data. The score must always tie back to observable facts.
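A facts ledger entry can be sketched as a structure plus one invariant check. The field names are assumed for illustration:

```typescript
// Hypothetical facts-ledger entry: a score is only valid if it cites evidence.
interface SourceEvidence {
  url: string;
  snippet: string;
}

interface FactsLedgerEntry {
  dimension: string;
  subtests: Record<string, boolean>; // binary subtests
  signals: string[];                 // observed signals
  evidence: SourceEvidence[];        // URL + snippet pairs
  score: number;
}

// Invariant: any nonzero score must tie back to at least one observed fact.
function isGrounded(entry: FactsLedgerEntry): boolean {
  return entry.score === 0 || entry.evidence.length > 0;
}
```

Checking the invariant outside the model is the point: the explanation can be as fluent as the LLM likes, but the score only stands if the ledger holds supporting snippets.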
5. What Actually Failed
One of the most annoying failures was not even model-related. It was schema drift.
Some results stored the dimension label as name, others as dimension.
That small inconsistency broke downstream queries.
The fix was simple but humbling. The query layer now uses COALESCE to normalize the schema.
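The same normalization can be expressed in application code, mirroring what COALESCE does in SQL. The row shape here is a hypothetical reconstruction of the drifted schema:

```typescript
// Hypothetical row normalizer, mirroring the SQL fix: some stored rows used
// `name`, others `dimension`. Accept either and emit one canonical field,
// like COALESCE(name, dimension) in the query layer.
interface RawRow {
  name?: string;
  dimension?: string;
  score: number;
}

function normalizeRow(row: RawRow): { dimension: string; score: number } {
  const label = row.name ?? row.dimension;
  if (label === undefined) throw new Error("row has no dimension label");
  return { dimension: label, score: row.score };
}
```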
A reminder that sometimes the biggest production issues come from the least interesting parts of the stack.
The Bigger Idea Behind the Rubric
The rubric is based on the Adaptive Value System (AVS).
AVS is a framework for aligning four forces inside AI-native products:
- value
- usage
- cost
- trust
Most companies measure the first three. Trust is often not designed intentionally.
The rubric evaluates whether enough of that infrastructure is visible to buyers before they commit.
What the Rubric Currently Measures
This version focuses on economic trust infrastructure:
- pricing legibility
- cost predictability
- spend controls
- operational safeguards
These are the first trust problems most AI products encounter.
The next evolution of the rubric will examine action accountability — what happens when AI systems begin executing tasks autonomously and something goes wrong.
But diagnosing that through public evidence is significantly harder. That capability is still being developed.
Try the Rubric
The AVS Rubric is free and live at valuetempo.com.
It evaluates only what buyers can see publicly.
Running your own product through it can be surprisingly revealing.
And if you're building an AI product yourself, I'd be curious what it finds.
