AI Automation · 19 min read

How to Apply Karpathy's AutoResearch to Marketing

AutoResearch lets AI agents test and improve your marketing outputs autonomously. Apply it to ads, emails, landing pages, and more.


yfxmarketer

April 5, 2026

Andrej Karpathy released a 630-line Python repo called autoresearch on March 7, 2026. The repo lets an AI agent edit one variable, test, score, keep or revert, then repeat. It hit 65,900 GitHub stars in three weeks. The original target was ML model training. But the underlying pattern works on anything you produce repeatedly and score with a clear metric, including marketing assets of every kind.

Most marketing teams run about 30 experiments per year. AutoResearch runs about 100 overnight. The cost: roughly $15 in API fees. The constraint: AI judging AI does not predict real-world conversion. But as a pre-filter before you spend budget, it eliminates your worst-performing variants while you sleep.

TL;DR

AutoResearch is Karpathy’s autonomous experiment loop adapted for marketing: an AI agent modifies one variable in your skill, scores it against a binary checklist, keeps improvements, reverts failures, and repeats. It runs ~12 experiments per hour for ~$15 overnight. The method works best for short-form, repeatable marketing tasks like ad creation, email sequences, landing pages, and outreach. Treat results as a pre-filter before real A/B tests, not as a replacement for them.

Key Takeaways

  • AutoResearch runs ~12 experiments per hour and ~100 overnight for roughly $15 in API costs
  • The scoring mechanism uses 3-6 binary yes/no questions, not subjective 1-10 scales
  • Short-form, repeatable marketing tasks (ads, emails, landing pages, outreach, social) are the best fit
  • AI quality scores do not reliably predict real conversion, so treat output as a pre-filter
  • The pattern works on any marketing asset you produce repeatedly and score consistently
  • Claude Code, Claude skills, and Cowork provide the infrastructure to run this today
  • Closing the feedback loop with production conversion data is the step most teams skip

What Is Karpathy’s AutoResearch and How Does It Apply to Marketing?

AutoResearch is an open-source autonomous experiment loop built by Andrej Karpathy, former Tesla AI director and OpenAI co-founder. The repo lives at github.com/karpathy/autoresearch. Karpathy built it for ML model training, but the pattern transfers to any repeatable task with a measurable output.

The architecture uses three files. prepare.py defines the scoring metric and never changes. train.py is the single file the AI agent edits. program.md contains instructions telling the agent what to explore and what to avoid. For marketing, swap train.py for your marketing skill file and prepare.py for your quality checklist. Same loop, different inputs.

How Does the Core Loop Work?

The autoresearch loop runs nine steps: read instructions, review past results, propose a hypothesis, modify the file, commit the change, run a test, handle errors, evaluate the score, and keep or revert. If the score improved, the commit stays. If it dropped, git reset --hard HEAD~1 removes it. The codebase ratchets forward. It never gets worse.

Each experiment takes about 5 minutes. Roughly 12 experiments per hour. About 100 overnight on a single run. Karpathy’s first overnight session tried 83 experiments, kept 15, and found a bug he had personally missed for months.
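
To make the ratchet concrete, here is a minimal Python sketch of the keep-or-revert cycle. This is not Karpathy's code: propose_change and run_and_score are hypothetical stand-ins for the agent's edit step and your eval harness.

import subprocess

def git(*args):
    # Run a git command in the current repo, failing loudly on errors
    subprocess.run(["git", *args], check=True)

def ratchet(propose_change, run_and_score, rounds=100, target=0.95, streak_needed=3):
    # Keep-or-revert loop. propose_change() is the agent's edit step,
    # run_and_score() runs the tests and returns a score in [0, 1].
    best = run_and_score()  # baseline before any experiments
    streak = 0
    for _ in range(rounds):
        propose_change()  # agent modifies the single target file
        git("commit", "-am", "autoresearch experiment")
        score = run_and_score()
        if score > best:
            best = score  # improvement: the commit stays
        else:
            git("reset", "--hard", "HEAD~1")  # regression: drop the commit
        streak = streak + 1 if best >= target else 0
        if streak >= streak_needed:
            break  # target score hit three runs in a row, stop
    return best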

Fortune magazine named the underlying pattern “The Karpathy Loop” and identified three primitives: an editable asset (one file the agent modifies), a scalar metric (one number measuring improvement), and a time-boxed cycle (fixed duration making experiments comparable). Those three primitives map directly to marketing optimization.

Action item: Read the autoresearch README at github.com/karpathy/autoresearch. Identify which of your recurring marketing assets (ad copy, email subjects, landing pages) fits the single-file, single-metric pattern.

How Do You Translate the AutoResearch Loop to Marketing?

The translation is straightforward. Replace the ML training file with a Claude skill file (SKILL.md) and replace validation loss with a binary yes/no checklist. A Claude skill is a markdown instruction file telling Claude how to produce a specific type of output, like landing page sections, email sequences, or campaign briefs.

When you run autoresearch on a marketing skill, the agent follows this cycle: generate output from your skill, score it against your checklist, identify the weakest eval question, modify one instruction in the skill to address the weakness, regenerate output, score again, keep or revert. It repeats until you stop it or it hits a target score three times in a row.

What Makes the Scoring Mechanism Work?

The scoring uses 3-6 binary yes/no evaluation questions. Not subjective 1-to-10 scales. Not open-ended quality ratings. Hard pass/fail on specific, observable criteria. Three to six questions is the sweet spot. More than six and the skill starts gaming the checklist, producing outputs that technically pass every question but read like garbage.

The scoring formula is simple. Max score equals the number of eval questions multiplied by runs per experiment. Four evals across five runs gives a max of 20. The agent optimizes toward the ceiling with surgical, one-change-at-a-time modifications.
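
The whole scoring model fits in a few lines of Python. A sketch, assuming each run's checklist answers arrive as booleans:

def max_score(num_evals: int, runs: int) -> int:
    # Ceiling the agent optimizes toward: evals multiplied by runs
    return num_evals * runs

def experiment_score(answers: list[list[bool]]) -> int:
    # answers[run][question] is True for YES; score = total YES count
    return sum(sum(run) for run in answers)

assert max_score(4, 5) == 20  # four evals across five runs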

What Kind of Changes Does the Agent Make?

The mutations are small and targeted. The agent does not rewrite your entire skill. It adds a single specific instruction addressing the most common failure, removes an ambiguous sentence, adds a banned-pattern list, or inserts one worked example showing correct output. One change per round. Test. Score. Keep or revert.

A typical successful change looks like adding: “Your headline MUST include a specific number or result. NEVER use vague promises like ‘Transform Your Business.’” A typical reverted change: tightening word count limits, which made outputs too thin and lost key details. The system catches changes that look good in isolation but hurt overall quality.

Action item: Open your most-used Claude prompt or skill for producing marketing output. Write 4 yes/no questions defining “good” output. Score your last 10 outputs manually against those questions to find your baseline.

Which Marketing Tasks Benefit Most from AutoResearch?

AutoResearch works best on repeatable, high-frequency marketing tasks where quality is measurable at the individual output level. The tighter the metric, the better the results. Here are five marketing functions where the loop delivers immediate value.

1. Ad Copy Variations

Ad copy is the ideal autoresearch target. Short form. Clear metrics. High production volume. Most teams write 5-10 ad variants per campaign. AutoResearch generates 50+ overnight, each scored against your quality criteria before you spend a single ad dollar.

Build a Claude skill for your ad copy. Define your eval checklist:

  • “Does the headline address one specific pain point?”
  • “Is the primary text under 90 characters?”
  • “Does the CTA specify what happens after clicking?”
  • “Is the copy free of generic phrases like ‘Learn More’ or ‘Get Started’?”
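
Two of those four questions are mechanically checkable before any LLM judging happens. A minimal pre-check sketch; the function and field names are illustrative, not part of autoresearch:

GENERIC_PHRASES = ("learn more", "get started")

def precheck_ad(headline: str, primary_text: str, cta: str) -> dict[str, bool]:
    # Deterministic half of the checklist; the pain-point and
    # post-click-clarity questions still need an LLM or human judge
    full_copy = " ".join([headline, primary_text, cta]).lower()
    return {
        "primary_text_under_90_chars": len(primary_text) < 90,
        "free_of_generic_phrases": not any(p in full_copy for p in GENERIC_PHRASES),
    }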

Run autoresearch on the skill. Wake up to a tighter skill and 50 pre-filtered variants. Feed the top performers into Meta or Google Ads as your A/B test candidates.

Time saved per campaign: 3-5 hours of manual copywriting and variant review.

2. Email Subject Lines

Email subject lines follow the same pattern. Short, scorable, produced in volume. Open rate is the downstream metric, but you do not need to wait for open data to filter out weak subject lines before hitting send.

Build an eval checklist for subject lines:

  • “Is it under 50 characters?”
  • “Does it include a specific number or timeframe?”
  • “Is it free of spam trigger words (free, guarantee, act now)?”
  • “Does it create a concrete information gap?”

Run autoresearch on your email subject line skill. Generate 30 subject lines. The loop eliminates the generic ones before you A/B test the survivors in Klaviyo, Mailchimp, or Customer.io.

Time saved per email send: 1-2 hours of brainstorming and manual filtering.

3. Landing Page Headlines and Hero Copy

Landing page hero sections are high-stakes, low-volume assets. Most teams agonize over one headline for days. AutoResearch flips this. You write one skill defining your landing page copy standards, then let the agent refine it overnight. Every landing page you produce afterward starts from a stronger baseline.

An eval checklist for landing page hero copy:

  • “Does the headline include a specific number or result?”
  • “Does the first line call out a specific pain point?”
  • “Is the CTA using a specific verb phrase (not ‘Learn More’)?”
  • “Is the hero copy under 150 words?”
  • “Is the copy free of buzzwords like ‘revolutionary’ or ‘next-level’?”

Run the loop. The agent tightens your skill’s instructions until 90%+ of outputs pass every check. Your next 10 landing pages all start from the improved baseline.

Time saved per landing page: 2-4 hours of headline iteration and copy review.

4. Cold Outreach Emails

Cold outreach has clear, binary quality signals. Personalization present or absent. Length within range or not. Specific ask or vague request. These translate perfectly to autoresearch eval questions.

Build your cold outreach eval checklist:

  • “Does it mention the prospect’s company by name?”
  • “Is the email under 75 words?”
  • “Does it end with a specific, answerable question?”
  • “Is the first sentence free of generic openers like ‘I hope this finds you well’?”

Run autoresearch on your outreach skill. The agent adds rules like “ALWAYS reference one specific detail from the prospect’s website” and “NEVER open with a self-introduction.” Each change makes the next batch of outreach emails tighter.

Time saved per outreach campaign: 2-3 hours of template iteration.

5. Social Media Post Copy

Social posts are short, formulaic, and produced at high volume. Perfect autoresearch territory. The eval criteria change by platform, but the pattern stays the same.

For LinkedIn posts, an eval checklist:

  • “Does the first line contain a specific, surprising claim?”
  • “Is the post under 200 words?”
  • “Does it end with a specific question or call to engage?”
  • “Is it free of hashtag spam (more than 3 hashtags)?”

For X/Twitter posts:

  • “Is it under 240 characters?”
  • “Does it contain one specific data point or example?”
  • “Does it avoid generic motivational language?”

Run the loop on each platform-specific skill. Generate a week of posts overnight. Review the top performers in the morning.

Time saved per week: 3-5 hours of social copy writing.

Action item: Pick the marketing function where you spend the most time iterating on output quality. Build an eval checklist using the examples above. Set up your first autoresearch run tonight.

How Do You Set Up AutoResearch for Marketing in Claude Code?

The infrastructure runs on Anthropic’s Claude skills ecosystem. Claude skills are markdown instruction files (SKILL.md) that persist across conversations and load automatically when relevant. They are the “editable asset” in the Karpathy Loop.

Claude Code is the agentic coding tool running the autoresearch loop. It reads your project, modifies files, runs tests, handles failures, and commits results. Cowork, launched January 2026, extends the same execution model to non-coding knowledge workers through the Claude Desktop app.

Step 1: Set Up Your Folder Structure

Your project folder needs three things: the autoresearch skill, the marketing skill you want to improve, and an eval guide. Here is the exact folder layout:

my-marketing-skills/
  .claude/
    skills/
      autoresearch/
        SKILL.md          (the autoresearch runner)
        eval-guide.md     (your scoring criteria)
  skills/
    landing-page-copy/
      SKILL.md            (the skill being optimized)
  test-inputs/
    input-1.md            (sample briefs for testing)
    input-2.md
    input-3.md

The autoresearch/SKILL.md is the orchestrator. Download it from GitHub and drop it in. The landing-page-copy/SKILL.md is your marketing skill, the file the agent will modify. The test-inputs/ folder holds 3-5 sample briefs the agent uses to test each iteration.

Step 2: Write Your Marketing Skill File

Here is a complete SKILL.md for landing page hero copy. This is the file the autoresearch agent will edit one instruction at a time:

SYSTEM: You are a conversion copywriter for B2B SaaS landing pages.

TASK: Write a landing page hero section given the product brief below.

RULES:
1. Headline MUST include a specific number or measurable result
2. Headline MUST stay under 12 words
3. First body sentence MUST name a specific pain point the reader faces
4. CTA button text MUST use a verb phrase describing what happens next
5. Total hero section MUST stay under 150 words
6. NEVER use buzzwords: revolutionary, next-level, synergy, game-changing, seamless, empower
7. NEVER use generic CTAs: Learn More, Get Started, Click Here, Submit
8. Subhead MUST explain HOW the product solves the pain point in one sentence

<brief>
Product: {{PRODUCT_NAME}}
Target audience: {{TARGET_AUDIENCE}}
Primary pain point: {{PAIN_POINT}}
Key differentiator: {{DIFFERENTIATOR}}
Proof point: {{PROOF_POINT}}
</brief>

OUTPUT FORMAT:
Headline: [headline text]
Subhead: [subhead text]
Body: [1-2 sentences]
CTA: [button text]

The agent modifies the RULES section. It never touches the OUTPUT FORMAT or the brief variables. Each round, it adds one new rule, sharpens an existing rule, or adds a worked example.

Step 3: Write Your Eval Guide

The eval guide defines how the agent scores each output. This is your eval-guide.md file. Use binary yes/no questions only:

# Eval Guide: Landing Page Hero Copy

Score each output against these questions. Answer YES or NO for each.

1. Does the headline include a specific number, percentage, or timeframe?
   - YES: "Cut onboarding time by 60% in 14 days"
   - NO: "Transform your onboarding experience"

2. Is the hero copy completely free of banned buzzwords?
   - Check against: revolutionary, next-level, synergy, game-changing, seamless, empower, innovative, cutting-edge
   - YES: Zero banned words found
   - NO: One or more banned words present

3. Does the CTA use a specific action verb describing the next step?
   - YES: "Start your 14-day pilot" or "See the demo dashboard"
   - NO: "Learn More" or "Get Started" or "Sign Up"

4. Does the first body sentence name a concrete pain point?
   - YES: "Your sales team spends 6 hours per week on manual data entry"
   - NO: "Businesses today face many challenges"

5. Is the total word count under 150?
   - Count all words in headline + subhead + body + CTA
   - YES: 150 or fewer
   - NO: 151 or more

SCORING: Each YES = 1 point. Max score per output = 5.
Run the skill 5 times per experiment. Max total = 25.

Including YES and NO examples for each question gives the agent clear signal on what passes and what fails. This is how you avoid ambiguous scoring.
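
If you want to run the same judging step outside Claude Code, it can be scripted against the API directly. A rough sketch using the Anthropic Python SDK; the model name is a placeholder and the YES-counting parse is deliberately naive:

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def judge(output: str, eval_guide: str) -> int:
    # Ask the model to answer every eval question YES or NO, then
    # count YES lines. A production harness would demand structured
    # output instead of counting substrings.
    message = client.messages.create(
        model="claude-sonnet-4-5",  # placeholder; use your preferred model
        max_tokens=300,
        messages=[{
            "role": "user",
            "content": (
                eval_guide
                + "\n\nScore the output below. Answer each question "
                + "with only YES or NO, one per line.\n\n"
                + output
            ),
        }],
    )
    return message.content[0].text.upper().count("YES")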

Step 4: Create Test Inputs

Write 3-5 realistic product briefs the agent will use to test each skill iteration. Here is one sample test-inputs/input-1.md:

Product: DataPipe
Target audience: RevOps managers at B2B SaaS companies (50-500 employees)
Primary pain point: CRM data is stale because reps do not log activities
Key differentiator: Automatic activity capture from email, calendar, and Slack
Proof point: Customers see 3x more logged activities within the first week

Each test input should represent a different product type or audience segment. The more varied your inputs, the harder it is for the skill to overfit to one scenario.

Step 5: Start the AutoResearch Run

Open Claude Code in your project folder and run:

claude "Run autoresearch on my landing-page-copy skill. Use the eval guide in autoresearch/eval-guide.md and the test inputs in test-inputs/. Start with a baseline, then enter the optimization loop."

The agent does the following:

  1. Reads your SKILL.md, eval-guide.md, and test inputs
  2. Runs the skill 5 times against your test inputs
  3. Scores each output using your eval guide
  4. Reports baseline score (example: 12/25, or 48%)
  5. Analyzes which eval questions fail most often
  6. Makes one targeted change to SKILL.md
  7. Re-runs and re-scores
  8. Keeps the change if score improved, reverts if it dropped
  9. Opens a live HTML dashboard tracking progress
  10. Repeats until you stop it or it hits 95%+ three consecutive times

Walk away. Come back to an improved skill, a detailed changelog, and a backup of your original.

What Does the Agent Change? A Real Before/After

Here is what a typical autoresearch run looks like for the landing page skill above.

Round 1: Baseline score 12/25 (48%)

The agent identifies eval question #1 (headline includes a specific number) as the most frequent failure. 4 of 5 outputs had vague headlines.

Agent adds this rule to SKILL.md:

9. If the brief includes a proof point with a number, the headline MUST reference it directly. Example: If proof point says "3x more logged activities," headline should include "3x" or an equivalent specific metric.

Round 1 result: 17/25 (68%). Change kept.

Round 2: Score 17/25

Eval question #3 (CTA specificity) fails 3 of 5 times. The agent sees CTAs like “Try DataPipe” still getting through.

Agent adds this example to SKILL.md:

GOOD CTA EXAMPLES:
- "Start your 14-day pilot"
- "See 3x more activities in your CRM"
- "Watch the 2-minute demo"

BAD CTA EXAMPLES:
- "Try DataPipe"
- "Request a Demo"
- "Sign Up Free"

Round 2 result: 21/25 (84%). Change kept.

Round 3: Score 21/25

Eval question #4 (specific pain point in first sentence) fails twice. The agent sees generic openers still slipping through.

Agent adds this anti-pattern:

10. First body sentence MUST follow this pattern: "Your [specific role] spends [specific time/effort] on [specific task]." NEVER open with "Businesses today..." or "Teams struggle with..." or any sentence not naming the reader's job title.

Round 3 result: 24/25 (96%). Change kept. Target hit.

Round 4: Score 24/25

The agent tries tightening the headline word limit from 12 to 8 words.

Round 4 result: 20/25 (80%). Change reverted. Shorter headlines dropped proof points.

Final result: 24/25 (96%). Three files delivered:

  1. landing-page-copy/SKILL.md (improved version)
  2. landing-page-copy/SKILL.md.backup (original)
  3. autoresearch-changelog.md (every change, score, and reasoning)

The changelog is the most valuable artifact. It tells you exactly which instructions improved output and which made it worse, specific to your product and audience.

Action item: Copy the folder structure, skill file, and eval guide from this section. Swap in your product details and test inputs. Run your first autoresearch loop tonight in Claude Code.

What Does a Full Setup Look Like for Email Subject Lines?

The same pattern works for every marketing asset. Here is the complete setup for email subject line optimization, so you see how the files change by asset type.

The Skill File (email-subject-lines/SKILL.md)

SYSTEM: You are an email marketer specializing in B2B SaaS.

TASK: Write 5 email subject line variations for the campaign brief below.

RULES:
1. Each subject line MUST be under 50 characters (including spaces)
2. At least 3 of 5 MUST include a specific number or timeframe
3. NEVER use spam triggers: free, guarantee, act now, limited time, urgent
4. NEVER use all caps or excessive punctuation (!! or ??)
5. At least 2 of 5 MUST create a specific information gap
6. NEVER start with the company or product name
7. At least 1 MUST use a "you/your" framing

<brief>
Campaign type: {{CAMPAIGN_TYPE}}
Audience segment: {{SEGMENT}}
Key message: {{KEY_MESSAGE}}
Send context: {{SEND_CONTEXT}}
</brief>

OUTPUT: Numbered list of 5 subject lines, each on its own line.

The Eval Guide (autoresearch/eval-guide-email.md)

# Eval Guide: Email Subject Lines

Score each batch of 5 subject lines against these questions.

1. Are all 5 subject lines under 50 characters?
   - Count characters including spaces for each line
   - YES: All 5 are under 50
   - NO: One or more exceeds 50

2. Do at least 3 of 5 include a specific number or timeframe?
   - YES: "47% faster pipeline" or "in 14 days" counts
   - NO: "Faster pipeline" or "quickly" does not count

3. Are all 5 free of spam trigger words?
   - Check against: free, guarantee, act now, limited time, urgent, exclusive offer
   - YES: Zero spam triggers found
   - NO: One or more found

4. Do at least 2 of 5 create a concrete information gap?
   - YES: "The metric your board asks about first" (reader wants to know which metric)
   - NO: "Important metrics for your board" (no gap, no curiosity)

5. Is every subject line free of the company/product name as the first word?
   - YES: None start with the product name
   - NO: One or more leads with the product name

SCORING: Each YES = 1 point. Max per batch = 5. Run 5 batches. Max total = 25.

A Test Input (test-inputs/email-input-1.md)

Campaign type: Feature launch announcement
Audience segment: Active users who logged in within the last 30 days
Key message: New dashboard shows pipeline velocity by rep, updated hourly
Send context: Tuesday morning, follows a product webinar last week

Run autoresearch on this skill the same way. The agent will tighten subject line rules across rounds, adding specifics like “information gap MUST reference a concrete noun, not an abstract concept” or inserting worked examples of strong vs. weak curiosity hooks.

Action item: Duplicate the landing page setup above and swap in the email subject line files shown here. Run both overnight. Compare changelogs in the morning to see which types of instructions the agent adds for each asset type.

Why AI Quality Scores Do Not Equal Real Conversion

This is the caveat every marketer needs to internalize. LLM quality scores do not reliably predict real-world conversion. Peer-reviewed research confirms this.

A March 2026 paper by Chen et al. tested a 7-dimension quality rubric scored by Claude against verified business conversion on a major e-commerce platform. Only 2 of 7 quality dimensions showed statistically significant association with conversion. Some dimensions rated as “high quality” had zero correlation with whether people bought.

What Biases Affect LLM Scoring?

A NeurIPS 2024 paper documented self-preference bias: GPT-4 rates its own outputs higher than human evaluators do. The CALM framework identified 12 distinct biases in LLM-as-judge systems, including verbosity bias (preferring longer answers), authority bias (favoring text with citations, even fabricated ones), and position bias (reversing preferences 25% of the time when option order swaps).

Real conversion data tells a different story. Human-written Google Ads achieve roughly 60% more clicks than AI-written ads in aggregate studies. And 52% of consumers disengage from content they suspect is AI-generated. But AI does excel in narrow contexts: AI-optimized email subject lines lifted open rates by about 31.6% across 2.1 billion Mailchimp sends.

How Do You Prevent Goodhart Drift?

Goodhart’s Law applies directly: when your eval checklist becomes the optimization target, the agent sacrifices everything not measured. The Langfuse team experienced this firsthand. Their agent optimized for exactly what was in the eval suite and removed useful features not covered by scoring.

Three practices reduce this risk:

  • Rotate your eval questions every 2-3 runs to prevent the skill from overfitting to a fixed checklist
  • Inspect the changelog after each run and flag changes that push every output toward the same templated phrasing
  • Pull 5 outputs from the improved skill and manually score them against criteria NOT in your eval checklist

If unmeasured quality dropped, broaden your eval questions before running again.
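
The rotation practice is straightforward to script. A sketch, assuming you maintain a question bank larger than any single run uses:

import random

QUESTION_BANK = [
    "Does the headline include a specific number or result?",
    "Does the first line call out a specific pain point?",
    "Is the CTA a specific verb phrase (not 'Learn More')?",
    "Is the copy free of banned buzzwords?",
    "Is the hero copy under 150 words?",
    "Does the subhead explain HOW the product solves the pain point?",
]

def checklist_for_run(k: int = 4, seed: int | None = None) -> list[str]:
    # Draw k questions per run so the skill can't overfit a fixed list
    return random.Random(seed).sample(QUESTION_BANK, k)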

Action item: After your first autoresearch run, pull 5 outputs from the improved skill. Score them against 3 criteria not in your eval checklist. If quality dropped on those dimensions, add one new eval question and re-run.

How Should Marketers Combine AutoResearch with Real A/B Testing?

The winning strategy for 2026 is hybrid. Use autoresearch as a cheap, fast pre-filter. Then validate winners with real traffic and real budget.

A McKinsey benchmark spanning 19 industries found hybrid AI-plus-human teams deliver 42% higher ROI than either approach alone. The pattern: AI generates volume and variety at low cost, humans apply judgment, and real-world testing confirms what converts.

What Does a Hybrid Workflow Look Like?

A practical hybrid workflow for a marketing team runs in six steps:

  1. Build or refine your Claude skill for the target asset type (ad copy, email subject, landing page)
  2. Run autoresearch overnight to bring the skill’s pass rate above 90%
  3. Generate 15-20 variants using the improved skill
  4. Have a human review and shortlist the top 5
  5. Run a real A/B test with live traffic and ad budget
  6. Feed conversion data back into your eval criteria for the next autoresearch cycle

The cost comparison makes this decision easy. An overnight autoresearch run of roughly 100 experiments costs about $15. A traditional A/B testing program covering the same number of variants requires weeks of calendar time and thousands of dollars in ad spend. Even imperfect automated pre-filtering at $15 beats the status quo.

How Do You Close the Feedback Loop with Real Data?

The biggest gap in the current autoresearch pattern: real conversion data never flows back into the scoring mechanism. Teams with access to HubSpot, Klaviyo, or Google Ads via MCP connectors already have the infrastructure to start closing this gap.

The sequence: pull performance data from your ad or email platform after a live test. Identify which variants converted best. Look for what the winners had in common that the losers lacked. Update one eval question in your checklist to reflect the pattern. Re-run autoresearch with the updated checklist.

Each cycle gets your eval criteria closer to predicting what converts in production. After 3-4 cycles of autoresearch plus live validation, your checklist starts encoding real buyer behavior instead of LLM preferences.
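
The “identify which variants converted best” step is a few lines of Python against a platform export. A sketch, assuming a CSV with hypothetical variant, conversions, and impressions columns:

import csv

def rank_by_conversion(path: str) -> list[dict]:
    # Rank live-test variants by conversion rate, best first
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f))
    for row in rows:
        row["rate"] = int(row["conversions"]) / max(int(row["impressions"]), 1)
    return sorted(rows, key=lambda r: r["rate"], reverse=True)

ranked = rank_by_conversion("ab-test-export.csv")  # hypothetical export file
top, bottom = ranked[:3], ranked[-3:]
# Compare top vs. bottom for shared patterns, then encode one
# pattern as a new eval question for the next autoresearch run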

Action item: After your first autoresearch + A/B test cycle, compare the top AI-scored variants against the top-converting variants. Identify one pattern the AI missed. Add it as an eval question for your next run.

What Does AutoResearch Look Like Applied to a Full Campaign Launch?

Here is a concrete example of how a growth marketer would use autoresearch for a product launch campaign across multiple channels.

The Setup

You are launching a new feature for a B2B SaaS product. You need ad copy for Meta and Google, email subject lines for a 3-email launch sequence, a landing page hero section, and LinkedIn announcement posts. Normally this takes 2-3 days of copywriting, review, and iteration.

Night 1: Run AutoResearch on All Four Skills

Start four autoresearch runs on four separate skills: ad copy, email subject line, landing page hero, and LinkedIn post. Each skill has its own eval checklist tuned to the asset type. Total cost: about $60 in API fees for all four runs.

Morning 1: Review and Generate

Check the dashboards. Each skill now passes 85-95% of its quality checks. Generate 20 variants from each improved skill. Time spent: 30 minutes reviewing changelogs and generating output.

Afternoon 1: Human Shortlisting

A human marketer reviews 80 total variants across four asset types. Selects 5 ad variants, 5 subject lines, 3 landing page options, and 5 LinkedIn posts. Time spent: 1-2 hours.

Days 2-5: Live Testing

Run the shortlisted variants as live A/B tests. Track CTR for ads, open rate for emails, conversion rate for landing pages, engagement rate for LinkedIn. Cost: standard ad budget, no incremental spend.

Day 6: Feed Back

Pull conversion data. Identify which variants won. Update your eval checklists. Re-run autoresearch for the next campaign. Each cycle compounds the quality of your skills.

Total time for campaign assets: ~4 hours of human work versus 2-3 days without autoresearch. Total autoresearch cost: ~$60.

Action item: Map your next campaign launch to this template. Identify the 3-4 asset types you need. Build or refine a Claude skill and eval checklist for each. Run all of them overnight before your creative kickoff meeting.

Final Takeaways

AutoResearch is Karpathy’s autonomous experiment loop applied to marketing: one file, one metric, one change at a time, iterated overnight for about $15 per skill.

The method works best for repeatable, high-frequency marketing tasks: ad creation, email subject lines, landing page sections, cold outreach templates, social posts, and any workflow where you produce similar outputs on a regular cycle.

AI quality scores do not reliably predict real conversion. Only 2 of 7 quality dimensions correlated with business outcomes in a March 2026 study. Treat autoresearch output as a pre-filter, not a final verdict.

The hybrid approach wins: autoresearch eliminates bad variants cheaply, then real A/B tests with live traffic confirm what converts. Teams closing the feedback loop between AI scoring and production data compound their advantage with every cycle.

The cost of marketing experimentation dropped to near zero. The question is whether your eval checklists encode what your buyers respond to. Start with 4 binary questions. Run the loop. Test the winners. Update the checklist. Repeat.


yfxmarketer

AI Growth Operator

Writing about AI marketing, growth, and the systems behind successful campaigns.

read_next(related)