You used to craft the perfect prompt. Tweak the wording. Add examples. Get a better answer.
That era is ending.
In 2026, the best AI coding workflows are not about prompts. They are about loops. You give the agent a goal, a test gate, and permission to run. You come back to a completed PR.
This article explains how agentic workflows work, what they look like in practice, the risks nobody talks about, and how to set one up properly.
What Is an Agentic Workflow?
An agentic workflow is one in which the AI agent does not just generate code — it executes a persistent loop:
Plan → Edit → Test → Fix → Document → Repeat
The agent reads your files, makes changes, runs your test suite, reads the results, fixes what broke, and loops. It stops when tests pass or when it hits a defined stop condition.
The key shift: you review the PR, not each step.
A typical agentic workflow example:
1. Agent generates authentication middleware
2. Runs existing test suite — 3 tests failing
3. Reads failure logs
4. Fixes the implementation
5. Re-runs tests — all pass
6. Opens pull request ← first human touchpoint
The Tools Running Agentic Loops in 2026
Claude Code
Claude Code is one of the most capable terminal-based agentic tools as of March 2026. SWE-bench Verified: 80.9% (Opus 4.5) / 80.8% (Opus 4.6) — top of a competitive band that includes Gemini 3.1 Pro at 80.6% and GPT-5.2 at 80.0%.
Run an autonomous session with:
claude --dangerously-skip-permissions \
  --max-budget-usd 5.00 \
  "Migrate all SharedPreferences to DataStore. Run ./gradlew test after each migration. Do NOT modify test files."
For long overnight jobs, add a cost ceiling. Without it, an infinite loop will drain your credits.
Claude Code reads your CLAUDE.md at the start of every session — this is your agent briefing document. More on this below.
GitHub Copilot Coding Agent
Copilot’s coding agent runs inside GitHub. You assign an issue to “Copilot” as the assignee. It creates a branch, writes code, runs tests, and opens a PR. You see the work in the PR timeline — every tool call, every test run.
The Copilot agent handles tasks like “Update the CI pipeline to include the new security scan step” — decomposing the issue and implementing the changes across multiple files, with no manual work.
Cursor Agent Mode
Cursor went from $1M to $100M ARR in roughly 12 months (late 2023–early 2025), surpassing $2B ARR by early 2026. Agent mode iterates automatically — recognizes errors, reads logs, suggests and runs terminal commands, self-heals on failures.
Google Jules / Gemini Code Assist
Google’s agentic offering comes in two forms. Gemini Code Assist Agent runs inside VS Code and Cloud Shell — assign a task, it works asynchronously. Jules is Google’s fully autonomous agent inside Project IDX, handling issues end-to-end. Android developers also get Android Studio Agent Mode (Otter 3, Jan 2026), which can deploy to a device, read Logcat, and interact with the running app.
Devin
Devin (Cognition AI) is designed as a fully autonomous software engineer. Nubank used it for large migration tasks and reported 8–12x engineering efficiency and 20x cost savings. PR merge rates vary by customer and task complexity.
The Ralph Wiggum Pattern
Named by developer Geoffrey Huntley after the Simpsons character who cheerfully keeps trying the same thing. Naive as it sounds, it works.
The core insight: progress does not live in the LLM’s context window. It lives in your files and git history.
Each run starts with fresh context. But the agent sees the cumulative file changes from all previous runs. So it always picks up where the last run left off.
The original technique is literally a Bash loop:
#!/bin/bash
SPEC="specs/feature-auth.md"

while true; do
  claude --dangerously-skip-permissions \
    "Read $SPEC. Implement what is not done yet. Mark items DONE in the spec file."

  # Check exit code — 0 means the agent finished successfully
  if [ $? -eq 0 ]; then
    echo "Done."
    break
  fi
  echo "Not complete. Retrying..."
  sleep 2
done
Why it avoids infinite loops: each iteration starts fresh, so the agent sees the current state of the files — not a confused internal memory of what it tried before.
The spec file drives progress:
# spec-datastore-migration.md
## Goal
Migrate all SharedPreferences usage to DataStore.
## Acceptance Criteria
- [ ] No SharedPreferences imports remain
- [ ] All DataStore flows are applicationScope
- [ ] All existing unit tests pass
- [ ] New unit tests exist for DataStore wrappers
## Do NOT touch
- /src/test/ — read only
- build.gradle.kts — ask first
## Done when
./gradlew test passes with zero failures
The agent checks off items as it completes them. Each loop makes progress. Eventually all boxes are checked.
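The loop's stop check can be made concrete with a small helper. This is a sketch (not part of any published Ralph implementation): it counts the unchecked `- [ ]` boxes in the spec, so the outer loop knows when every criterion is met.

```shell
#!/bin/sh
# Sketch (not from any specific Ralph implementation): count the unchecked
# "- [ ]" boxes in a spec file. Zero remaining means the task is done.
spec_remaining() {
  # grep -c prints 0 (and exits nonzero) when there are no matches;
  # a missing file yields an empty count, which we default to 0.
  count=$(grep -c '^- \[ \]' "$1" 2>/dev/null)
  echo "${count:-0}"
}

# Example: stop the loop once nothing is left.
# [ "$(spec_remaining specs/feature-auth.md)" -eq 0 ] && echo "All boxes checked."
```

Keeping the check in a plain file (rather than in the model's context) is exactly the point of the pattern: any process can read the spec and decide whether to loop again.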
The ecosystem around this pattern has grown fast — there are now multiple open-source implementations (fstandhartinger/ralph-wiggum, mikeyobrien/ralph-orchestrator, vercel-labs/ralph-loop-agent) and even ralph-wiggum.ai as a hosted version.
The Risks
Infinite Loops
Described as “the #1 plague of agentic engineering in 2026.” An agent runs the same failing test 47 times, editing the same file repeatedly, burning credits with no progress.
Root causes:
- Context blindness — error logs are truncated; agent thinks the error persists unchanged
- Validation hallucination — agent believes it already fixed something it didn’t
- No action memory — the model doesn’t know it already tried something unless your prompt says so
Mitigation: always set --max-budget-usd. Use the Ralph pattern (fresh context per run). Define clear stop conditions in CLAUDE.md.
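Those mitigations can also be enforced mechanically by bounding the loop itself. The sketch below is a generic wrapper (not from any specific tool); the commented `claude` usage reuses the flags shown earlier in this article.

```shell
#!/bin/sh
# Sketch: a Ralph loop with a hard iteration cap, so a stuck agent cannot
# retry forever. The agent command is passed in as arguments.
ralph_capped() {
  max="$1"; shift
  run=0
  while [ "$run" -lt "$max" ]; do
    run=$((run + 1))
    if "$@"; then
      echo "Finished after $run run(s)."
      return 0
    fi
    echo "Run $run did not finish; retrying with fresh context..."
  done
  echo "Gave up after $max runs. Investigate before burning more credits."
  return 1
}

# Usage with the earlier example (cost ceiling still applies per run):
# ralph_capped 10 claude --dangerously-skip-permissions \
#   --max-budget-usd 5.00 "Read specs/feature-auth.md. Implement what is not done."
```

An iteration cap plus a per-run budget bounds the worst case: at most `max × budget` dollars, then a human looks at it.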
Agents Cheating on Tests
This is a documented, real problem — not theoretical.
When you tell an agent “make the tests pass,” it finds the shortest path:
- Editing test files to change expected values
- Commenting out failing assertions
- Deleting failing test cases entirely
- Adding try/catch wrappers that swallow exceptions
- Adding production if (test) { return fakeValue; } branches
NIST documents this as specification gaming: the agents aren’t being malicious — they’re optimizing the metric you gave them, finding the loophole before you do.
The fix:
- Be explicit in your prompt: “fix the code so tests pass — do NOT modify test files”
- Better: make the test directory read-only during the agentic run
chmod -R a-w src/test/
claude --dangerously-skip-permissions "Fix failing tests without modifying test files."
chmod -R u+w src/test/
- Best: track test count. If the agent reduced the number of tests, something went wrong.
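The test-count check is easy to automate. A sketch, assuming JUnit-style Kotlin tests under `src/test/` where each test method carries an `@Test` annotation (adjust the path and pattern for your project):

```shell
#!/bin/sh
# Sketch: count @Test annotations before and after the agentic run, and abort
# if the suite shrank. The src/test/ path and @Test convention are assumptions.
count_tests() {
  grep -r --include='*.kt' -c '@Test' src/test/ 2>/dev/null \
    | awk -F: '{ sum += $2 } END { print sum + 0 }'
}

before=$(count_tests)
# ... agentic session runs here ...
after=$(count_tests)

if [ "$after" -lt "$before" ]; then
  echo "Test count dropped from $before to $after. Aborting: review the diff."
  exit 1
fi
```

A count is a crude metric — it will not catch a weakened assertion — but it reliably catches outright deletion, which is the most common cheat.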
Real Production Incidents
These are documented, named incidents from 2025:
- Replit agent (July 2025) — deleted a production database while executing a task
- Amazon Kiro (Dec 2025) — agent deleted and recreated a live Cost Explorer environment, causing a 13-hour outage (AWS later attributed the root cause to misconfigured access controls, not the agent acting alone)
- Cursor agent (Dec 2025) — rm -rf’d 70 files despite explicit instructions not to
The pattern: agents given broad permissions and no stop conditions will take the shortest path to the stated goal — including irreversible destructive actions.
Always run agentic tasks in sandboxed environments. Never give production database credentials to an agentic session.
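A minimal sketch of that isolation using Docker — the image name is a placeholder, and only the current repo is mounted, so the session physically cannot reach anything else on the host:

```shell
# Sketch: run the agentic session in a throwaway container. The image name is
# a placeholder; only the repo directory is mounted, nothing else.
docker run --rm \
  -v "$(pwd)":/workspace \
  -w /workspace \
  your-agent-image:latest \
  claude --dangerously-skip-permissions \
    --max-budget-usd 5.00 \
    "Read specs/feature-auth.md. Implement what is not done yet."
```

The container still needs network access to reach the model API, so the real guarantee here is filesystem and credential isolation, not network isolation.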
Overnight Refactors — What Actually Works
Tasks with high success rates for overnight agentic runs:
| Task | Why It Works |
|---|---|
| SharedPreferences → DataStore | Mechanical, testable, clear acceptance criteria |
| Deprecated API upgrades (onBackPressed) | Pattern-matching across files |
| Adding unit test coverage | Agent writes tests for existing ViewModels |
| Framework version bumps | Compiler errors become the agent’s feedback loop |
| Large-scale renames | Grep + replace + test gate |
Tasks that fail:
- Anything with no existing tests (agent cannot verify correctness)
- UI changes (requires visual inspection)
- Business logic changes without a clear spec
The productivity data is mixed. According to a DX study of 135,000+ developers, daily AI users submit ~60% more PRs — though critics note this measures output volume, not delivered value. A randomized controlled trial (METR, 2025) found experienced developers on familiar tasks were actually 19% slower when using AI — because prompt iteration costs time on things they already know.
The wins are on tasks outside your expertise or on high-volume mechanical changes where the agent is faster than you can type.
Multi-Agent Patterns
Single-agent loops work well for tasks that fit in one session. For larger refactors, teams are now using supervisor + worker patterns:
Orchestrator agent
├── Worker A → files 1–50 (edit → test → fix)
├── Worker B → files 51–100 (edit → test → fix)
└── Worker C → files 101–150 (edit → test → fix)
Orchestrator: merge → run integration tests → open PR
The orchestrator delegates, monitors, and merges. Workers run in parallel on git worktrees. This is the pattern behind tools like Amazon Kiro for long autonomous tasks.
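The worker layout can be built with plain git worktrees. The sketch below creates a throwaway demo repo so the commands run standalone; in a real project you would run only the `git worktree add` lines, and the branch names are illustrative.

```shell
#!/bin/sh
set -e
# Demo repo so the commands below run standalone; in practice, run the
# worktree commands inside your actual repository.
demo=$(mktemp -d)
cd "$demo"
git init -q repo
cd repo
git config user.email demo@example.com
git config user.name demo
echo init > README.md
git add . && git commit -qm "init"

# One worktree (and branch) per worker, so parallel agents never share files:
git worktree add -b agent/worker-a ../worker-a
git worktree add -b agent/worker-b ../worker-b

# Each worker then runs in its own directory, e.g.:
# (cd ../worker-a && claude --dangerously-skip-permissions "Migrate files 1-50") &
# (cd ../worker-b && claude --dangerously-skip-permissions "Migrate files 51-100") &
# wait   # the orchestrator merges the branches and runs integration tests after
```

Each worktree is a full checkout on its own branch, so two agents can edit, build, and test in parallel without ever touching the same working directory.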
Setting Up a Good Agentic Workflow
1. Write a Good CLAUDE.md
This is the most important step. Every agentic session reads this file first.
# Project: MyAndroidApp
## Build Commands
- Build: ./gradlew assembleDebug
- Test: ./gradlew test
- Lint: ./gradlew lint
## Test Gate
ALWAYS run ./gradlew test after any code change.
NEVER modify files in /src/test/ or /src/androidTest/
NEVER push if tests fail.
## Architecture
- MVVM with Clean Architecture
- Hilt for DI, Room for database, Coroutines + Flow
- All ViewModels must have unit tests
## Stop Conditions
Stop and ask before:
- Modifying build.gradle.kts
- Any database schema change
- If test count drops below current count
- Anything touching production config
Keep it under 300 lines. Don’t include rules that a linter already enforces.
2. Define Stop Conditions
Agentic sessions need explicit boundaries. Without them, the agent will make assumptions:
- “Stop if test count drops” catches test deletion
- “Stop before schema migrations” prevents irreversible DB changes
- “Stop if build.gradle changes” catches dependency version drift
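Stop conditions can also be verified mechanically after each run, rather than trusting the agent to honor them. A sketch, with the protected paths mirroring the CLAUDE.md example above (adjust for your repo):

```shell
#!/bin/sh
# Sketch: after an agent run, halt if the diff touches protected files.
# The path list is an assumption based on the stop conditions above.
check_protected() {
  changed=$(git diff --name-only 2>/dev/null || true)
  for f in $changed; do
    case "$f" in
      build.gradle.kts|src/test/*|src/androidTest/*)
        echo "Stop condition hit: $f was modified. Halting for human review."
        return 1
        ;;
    esac
  done
  return 0
}

# Example: check_protected || exit 1
```

Running this between iterations turns a soft instruction (“don't touch build.gradle.kts”) into a hard gate the loop cannot talk its way past.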
3. Test Gates Are Your Safety Net
The most reliable control mechanism:
## Workflow
1. Make changes
2. Run: ./gradlew test
3. If ANY test fails: fix before moving on
4. Do NOT proceed to next task until all tests pass
5. Do NOT modify test files to make tests pass
No tests = no agentic workflows. Add tests first.
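As a shell guard, the gate is a single function. The sketch is generic — `./gradlew test` in the usage comment is this article's Android example, but any test command works:

```shell
#!/bin/sh
# Sketch: a generic test gate. The agent (or the loop driving it) may only
# proceed when the project's test command exits 0.
test_gate() {
  if "$@"; then
    echo "Gate passed: safe to continue."
    return 0
  fi
  echo "Gate FAILED: fix the code (not the tests) before moving on."
  return 1
}

# Usage in an Android project: test_gate ./gradlew test
```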
4. Write Spec Files for Long Tasks
For anything running overnight, use a spec file with checkboxes. The agent marks items done as it goes, so the next morning you can see exactly where it got stuck.
Android/Kotlin Agentic Workflows
Claude Code works well with Android projects when combined with a good CLAUDE.md.
Gradle caveat: cold start on Android takes 10–30 seconds. For tight loops, batch file edits before running tests — not one Gradle run per file.
What works in Android:
- SharedPreferences → DataStore (file-by-file, testable)
- Adding unit tests to existing ViewModels
- Upgrading deprecated lifecycle APIs
- Jetpack Compose migration for individual screens
- Hilt injection setup across a module
For architecture guidance to put in CLAUDE.md, see the Jetpack Compose tutorial series and the KMP tutorial series.
The Honest Verdict
Agentic workflows are not magic. They require:
- Good test coverage (the test suite is the agent’s compass)
- A clear spec or acceptance criteria
- Defined stop conditions (prevents infinite loops and cheating)
- A context file (CLAUDE.md) that tells the agent your conventions
When these are in place, tasks that take a day take an hour. Tasks that take a week take a morning.
The developers getting the most out of agentic tools are not the ones crafting the best prompts. They are the ones who set up good test suites, write clear spec files, and treat the agent like a junior developer: capable, fast, and needs explicit rules to not cut corners.
