
The State of Agent Architecture: Summary Insights for Builders

Based on interviews conducted from April 25 to May 10, 2026 with seed-to-Series C startups building agents for enterprise customers. This is an excerpt — for the full paper see The State of Agent Architecture.

By Industry
Healthcare 5, Financial Services 2, Security 1, Industrial Services 1, Creative 1.
By Funding
100k–1M: 3, 1–10M: 4, 10–20M: 1, 20M+: 2.

I spent the last eight months building an agentic platform for wealth management back-office tasks. Every few months a model release would push me to rearchitect, and I wanted to understand how other teams were handling that. So I interviewed 10 seed-to-Series C startups building agents for enterprise customers. Here’s what I found.

Model Provider Selection

By Model Provider
Anthropic+OpenAI+Gemini: 5, OpenAI only: 3, Anthropic+OpenAI: 1, Anthropic+OpenAI+Open Source: 1.

Selection criteria across the ten teams are roughly reliability > capability > price.

Two companies switched off Anthropic-only setups during my interview window because of production reliability issues and now route to both Anthropic and OpenAI via OpenRouter. Price is not a major factor: one company told me a single run costs $50 in tokens and generates $10k in revenue. Niches are forming: Gemini dominates OCR and image tasks even for teams whose primary workhorse is OpenAI or Anthropic, and GPT5.4-mini owns small atomic tasks because it’s the cheapest. Open source is tested but rarely shipped: quality lags the frontier by about eight months, and frontier models themselves only cleared the production bar within that window.
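To make the multi-provider point concrete, here is a minimal sketch of priority-ordered fallback between providers. It is a hand-rolled illustration, not how OpenRouter or any interviewed team implements routing; the `Provider` signature, `route_with_fallback`, and the two stand-in providers are all hypothetical.

```python
from typing import Callable, Sequence

# Hypothetical provider signature: takes a prompt, returns text, raises on failure.
Provider = Callable[[str], str]


class AllProvidersFailed(RuntimeError):
    """Raised when every provider in the chain has errored."""


def route_with_fallback(
    prompt: str, providers: Sequence[tuple[str, Provider]]
) -> tuple[str, str]:
    """Try each provider in priority order; return (provider_name, output)."""
    errors: list[str] = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # a real router would catch narrower error types
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


# Stand-ins for real SDK calls, used only to make the sketch runnable.
def flaky_primary(prompt: str) -> str:
    raise TimeoutError("upstream timeout")


def stable_secondary(prompt: str) -> str:
    return f"echo: {prompt}"


name, output = route_with_fallback(
    "reconcile account 42",
    [("primary", flaky_primary), ("secondary", stable_secondary)],
)
print(name, output)
```

On the price point, the quoted numbers work out to token cost being 0.5% of revenue per run ($50 / $10,000), which is why reliability outranks price in this sketch's ordering as well.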

The Agentic Continuum

Purely Deterministic → Purely Agentic

Stage 1, Pure Determinism: no models, code controls everything. 0 / 10
Stage 2, Task Agents: code controls flow, models handle narrow tasks. 7 / 10
Stage 3, Workflow Agents: agents control flow, code provides broad scaffolding. 3 / 10
Stage 4, Self-Directed Agents: agents control everything (flow, context, escalation). 0 / 10

Companies’ systems sit on a continuum from deterministic to fully agentic. The transition from single-task, narrowly scoped model calls (Stage 2) to multi-task, broad agentic workflows with skills and tool calling (Stage 3) is the most significant change happening right now.
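The difference between the two stages is mostly about who owns control flow. The sketch below contrasts them under heavy simplification: `TOOLS`, `stage2_pipeline`, `stage3_loop`, and `scripted_policy` are hypothetical names, the tools are trivial string functions standing in for model or system calls, and the "model" is a scripted policy so the example runs without any API.

```python
from typing import Callable

# Hypothetical steps; real ones would be model calls or internal systems.
TOOLS: dict[str, Callable[[str], str]] = {
    "extract": lambda doc: doc.upper(),
    "summarize": lambda text: text[:10],
}


def stage2_pipeline(doc: str) -> str:
    """Task agents: code fixes the order; each step is one narrow call."""
    extracted = TOOLS["extract"](doc)
    return TOOLS["summarize"](extracted)


def stage3_loop(
    doc: str,
    choose_next: Callable[[str, list[str]], str],
    max_steps: int = 5,
) -> str:
    """Workflow agents: the model picks the next step; code only bounds the loop."""
    state: str = doc
    history: list[str] = []
    for _ in range(max_steps):
        action = choose_next(state, history)
        if action == "done":
            break
        state = TOOLS[action](state)
        history.append(action)
    return state


def scripted_policy(state: str, history: list[str]) -> str:
    """Stand-in for a model deciding what to do next."""
    plan = ["extract", "summarize"]
    return plan[len(history)] if len(history) < len(plan) else "done"


doc = "hello world, this is a doc"
print(stage2_pipeline(doc))
print(stage3_loop(doc, scripted_policy))
```

Here both paths produce the same result because the scripted policy mirrors the fixed pipeline. The structural point is that in `stage3_loop` a different policy could reorder or skip steps without any code change, while the scaffolding still caps the number of steps.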

Because Stage 3 teams lean into model capabilities, they ship faster, handle more customer variation, and are best positioned to benefit from future model improvements.

Key frictions I heard for teams transitioning from Task Agents to Workflow Agents

Teams have access to the same models, but only some are getting the most out of them. The difference lies in agent engineering: prompt/skill/tool linting that enables better instruction organization, debug primitives that surface silent failures, and eval harnesses that improve setups over time.

1) How to divide logic across prompts, skills, and tools isn't widely understood yet.

At Stage 2, prompts and tool calls get over-stuffed. Stage 3 teams settle on a three-layer split: task-specific procedures in skills, the logic for when to invoke them in short structured base prompts, and deterministic operations or credential boundaries in tools.
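One way to picture the three-layer split is as a small data structure. This is a sketch under my own assumptions, not the layout used by any interviewed team or by the reference mentioned below; `Skill`, `Tool`, `Agent`, and their fields are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Skill:
    name: str
    when_to_use: str  # one line surfaced in the base prompt
    procedure: str    # full task-specific steps, loaded only when invoked


@dataclass
class Tool:
    name: str
    run: Callable[..., object]  # deterministic code; credentials stay in here


@dataclass
class Agent:
    role: str
    skills: list[Skill] = field(default_factory=list)
    tools: list[Tool] = field(default_factory=list)

    def base_prompt(self) -> str:
        """Short and structured: role plus routing hints, no procedures."""
        lines = [f"Role: {self.role}", "Skills:"]
        lines += [f"- {s.name}: {s.when_to_use}" for s in self.skills]
        lines.append("Tools: " + ", ".join(t.name for t in self.tools))
        return "\n".join(lines)

    def load_skill(self, name: str) -> str:
        """Return the full procedure for a skill the model chose to invoke."""
        return next(s.procedure for s in self.skills if s.name == name)


agent = Agent(
    role="back-office assistant",
    skills=[
        Skill(
            name="transfer_review",
            when_to_use="Use when a transfer form arrives.",
            procedure="1. Check signatures.\n2. Match account IDs.",
        )
    ],
    tools=[Tool("lookup_account", lambda account_id: {"id": account_id})],
)
print(agent.base_prompt())
```

The base prompt stays short because it only carries the `when_to_use` line for each skill; the procedure text enters context only through `load_skill`.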

Anthropic Financial Services is a useful reference for proper prompt/skill/tool architecture. I also built a static linter that pulls best practices and flags formatting and architectural issues: plint.

2) Agentic failures are silent and traditional logging misses them.

A slightly mutated tool call or a poorly followed instruction cascades into subtle issues that traditional logging tends not to catch. At Stage 2, failures in production are common and often surfaced by customers. Stage 3 teams build agent-native telemetry, collate it with business logic, and act on it continuously.

I spun up an SDK wrapper that detects silent failures using call output to estimate model confidence: plint runtime. Feel free to try it out, fork it, or use it as a reference.
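As a much simpler illustration of catching a "slightly mutated tool call", the sketch below checks a call's arguments against a declared spec and returns issues that could be logged as telemetry. It is not how plint runtime works (that uses call output to estimate confidence); `TOOL_SPECS`, `check_tool_call`, and the `create_transfer` tool are hypothetical.

```python
from typing import Any

# Hypothetical tool spec: argument name -> expected Python type.
TOOL_SPECS: dict[str, dict[str, type]] = {
    "create_transfer": {"account_id": str, "amount": float},
}


def check_tool_call(name: str, args: dict[str, Any]) -> list[str]:
    """Return a list of issues; an empty list means the call matches its spec."""
    spec = TOOL_SPECS.get(name)
    if spec is None:
        return [f"unknown tool: {name}"]
    issues: list[str] = []
    for key, expected in spec.items():
        if key not in args:
            issues.append(f"missing argument: {key}")
        elif not isinstance(args[key], expected):
            issues.append(
                f"{key}: expected {expected.__name__}, got {type(args[key]).__name__}"
            )
    # Arguments the spec does not know about often signal a renamed or invented field.
    issues += [f"unexpected argument: {k}" for k in args if k not in spec]
    return issues


# A call with a camelCased key and a stringified number: runs without raising
# in many setups, which is what makes this kind of failure silent.
for issue in check_tool_call("create_transfer", {"accountId": "A1", "amount": "10"}):
    print(issue)
```

Emitting these issues alongside the business record they touched is one way to get the agent-native telemetry described above, instead of relying on exceptions that never fire.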

3) Without evals, every prompt, skill, and tool change is difficult to validate.

At Stage 2, evals are either absent or so expensive they discourage experimentation. Stage 3 teams have automated loops that prevent regression and improve prompts, skills, and tools over time.

I built an eval harness prototype modeled on Andrej Karpathy’s Autoresearch. It hill-climbs a collection of prompts, skills, and tools against an eval suite: pts-autoresearch. Best used as a reference.
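For readers unfamiliar with hill climbing in this setting, here is a minimal greedy version over prompt variants. It is a toy and not code from pts-autoresearch: `hill_climb`, `toy_score`, `toy_propose`, and the `REQUIRED` phrases are hypothetical, and the scorer is a string check standing in for a real eval suite that would run the agent.

```python
from typing import Callable, Iterable


def hill_climb(
    baseline: str,
    propose: Callable[[str], Iterable[str]],
    score: Callable[[str], float],
    rounds: int = 3,
) -> tuple[str, float]:
    """Greedy hill climbing: keep a candidate only if it beats the current best."""
    best, best_score = baseline, score(baseline)
    for _ in range(rounds):
        improved = False
        for candidate in propose(best):
            candidate_score = score(candidate)
            if candidate_score > best_score:
                best, best_score, improved = candidate, candidate_score, True
        if not improved:
            break  # no proposal helped; stop at this local optimum
    return best, best_score


# Toy eval suite: fraction of required instructions present in the prompt.
REQUIRED = ["cite sources", "ask before deleting", "use ISO dates"]


def toy_score(prompt: str) -> float:
    return sum(phrase in prompt for phrase in REQUIRED) / len(REQUIRED)


def toy_propose(prompt: str) -> list[str]:
    """Propose one variant per missing instruction."""
    return [f"{prompt} {phrase}." for phrase in REQUIRED if phrase not in prompt]


best_prompt, best_score = hill_climb("Be helpful.", toy_propose, toy_score)
print(best_score, best_prompt)
```

Because a candidate is accepted only when it scores strictly higher, the same loop doubles as a regression guard: a change that lowers the eval score never replaces the current best.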

Thank you for reading this summary. Full paper here: The State of Agent Architecture.