The State of Agent Architecture: Summary Insights for Model Platforms

Based on interviews conducted from April 25 to May 10, 2026 with seed-to-Series C startups building agents for enterprise customers. This is an excerpt; for the full paper, see The State of Agent Architecture.

By Industry

Healthcare: 5; Financial Services: 2; Security: 1; Industrial Services: 1; Creative: 1

By Funding

100k–1M: 3; 1–10M: 4; 10–20M: 1; 20M+: 2

I spent the last eight months building an agentic platform for wealth management back-office tasks. To understand how others are designing similar systems, I interviewed ten seed-to-Series C startups using model provider APIs and SDKs to build agents for enterprise customers.

Model Provider Selection

By Model Provider

Anthropic, OpenAI, Gemini: 5; OpenAI only: 3; Anthropic, OpenAI: 1; Anthropic, OpenAI, Open Source: 1

Selection criteria across the ten teams are roughly reliability > capability > price.

Two companies switched off Anthropic-only setups during my interview window because of production reliability issues and now route to both Anthropic and OpenAI via OpenRouter. Price is not a huge factor: for one company, a single run costs $50 in tokens and generates $10k in revenue, so token cost is about 0.5% of revenue. Niches are forming: Gemini dominates OCR and image tasks even for teams whose primary workhorse is OpenAI or Anthropic, and GPT5.4-mini owns small atomic tasks because it’s the cheapest. Open source is tested but rarely shipped: quality lags the frontier by about eight months, and the production bar for many tasks has only been met by frontier models within that window.
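The multi-provider routing described above can be sketched as an ordered fallback. This is a minimal illustration, not how OpenRouter or any interviewed team implements it: the `Provider` pair, `complete_with_fallback`, and the two stand-in provider functions are all hypothetical names, and a real setup would call the providers' SDKs and catch their specific error types.

```python
from typing import Callable, Sequence

# Hypothetical provider interface: a (name, callable) pair where the callable
# takes a prompt and returns a completion. Real SDK clients would be wrapped
# to fit this shape.
Provider = tuple[str, Callable[[str], str]]


class AllProvidersFailed(RuntimeError):
    """Raised when every provider in the list raised an error."""


def complete_with_fallback(prompt: str, providers: Sequence[Provider]) -> tuple[str, str]:
    """Try providers in order and return (provider_name, completion)."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # assumption: narrow this to SDK error types in practice
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


# Stand-ins for real provider calls, used only to exercise the fallback path.
def flaky_primary(prompt: str) -> str:
    raise TimeoutError("upstream timeout")


def stable_secondary(prompt: str) -> str:
    return f"echo: {prompt}"


used, text = complete_with_fallback(
    "reconcile account 17",
    [("primary", flaky_primary), ("secondary", stable_secondary)],
)
```

Here the primary raises, so the call is served by the secondary. Recording which provider answered (the first element of the returned pair) is what lets a team measure reliability per provider over time.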

The Agentic Continuum

Scale: Purely Deterministic to Purely Agentic (number of the ten companies at each stage)

Stage 1, Pure Determinism: No models, code controls everything. 0 / 10

Stage 2, Task Agents: Code controls flow, models handle narrow tasks. 7 / 10

Stage 3, Workflow Agents: Agents control flow, code provides broad scaffolding. 3 / 10

Stage 4, Self-Directed Agents: Agents control everything: flow, context, escalation. 0 / 10

Companies’ systems sit on a continuum from deterministic to fully agentic. The transition from single-task, narrowly scoped model calls (Stage 2) to multi-task, broad agentic workflows with skills and tool calling (Stage 3) is the most significant change happening right now.

Because Stage 3 teams lean into model capabilities, they ship faster, handle more customer variation, and are best positioned to benefit from future model improvements.

Key frictions I observed for teams transitioning from Task Agents to Workflow Agents

Teams have access to the same models, but only some are getting the most out of them. Platforms should focus on tools that guide better agent engineering: prompt/skill/tool linting that enables better instruction organization, debug primitives that surface silent failures, and eval harnesses that improve setups over time.

1) How to divide logic across prompts, skills, and tools isn't widely understood yet.

At Stage 2, prompts and tool calls get over-stuffed. Stage 3 teams settle on a clear three-layer split: task-specific procedures in skills, the logic for when to invoke them in short structured base prompts, and deterministic operations or credential boundaries in tools.
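One way to picture the three-layer split is as three separate data structures. The sketch below is an assumption-laden illustration, not a structure reported by any interviewed team: `Skill`, `build_base_prompt`, `lookup_account_balance`, and the sample skill text are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Skill:
    """Skill layer: a task-specific procedure, loaded only when invoked."""
    name: str
    when_to_use: str  # one line, surfaced in the base prompt
    procedure: str    # full steps, kept out of the base prompt


def build_base_prompt(role: str, skills: list[Skill]) -> str:
    """Prompt layer: short and structured, says only when to invoke each skill."""
    lines = [role, "", "Available skills:"]
    lines += [f"- {s.name}: {s.when_to_use}" for s in skills]
    return "\n".join(lines)


def lookup_account_balance(account_id: str) -> float:
    """Tool layer: a deterministic operation behind a credential boundary.

    The in-memory dict stands in for an authenticated system of record.
    """
    balances = {"A-1": 1250.0}
    return balances[account_id]


TOOLS: dict[str, Callable[..., object]] = {
    "lookup_account_balance": lookup_account_balance,
}

skills = [
    Skill(
        name="fee_reconciliation",
        when_to_use="Use when a statement's fees do not match the fee schedule.",
        procedure="1. Pull the fee schedule.\n2. Compare each line item.\n3. Flag differences.",
    )
]
base_prompt = build_base_prompt("You are a back-office operations agent.", skills)
```

The base prompt carries only the skill name and its trigger condition; the numbered procedure stays in the skill and the balance lookup stays in code, so each layer can change without touching the others.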

What platforms can build: More reference implementations of proper prompt/skill/tool architecture (like Anthropic Financial Services) and a linter that flags formatting and architectural issues (either third party or integrated through a ‘debug’ API call). I built a prototype: plint.
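To make the linter idea concrete, here is a small rule-based sketch. It is not plint's implementation; the rule names, the word threshold, and the regular expressions are illustrative assumptions about what such a linter might flag.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    rule: str
    message: str


MAX_BASE_PROMPT_WORDS = 300  # arbitrary illustrative threshold


def lint_base_prompt(prompt: str) -> list[Finding]:
    """Flag architectural smells in a base prompt (hypothetical rules)."""
    findings = []
    if len(prompt.split()) > MAX_BASE_PROMPT_WORDS:
        findings.append(Finding("prompt-too-long", "Base prompt exceeds the word budget."))
    # Three or more numbered steps suggest a procedure that belongs in a skill.
    numbered_steps = re.findall(r"^\s*\d+[.)]\s", prompt, flags=re.MULTILINE)
    if len(numbered_steps) >= 3:
        findings.append(Finding("procedure-in-prompt", "Move step-by-step procedures into a skill."))
    # Anything that looks like a credential assignment belongs behind a tool.
    if re.search(r"(api[_-]?key|password|secret)\s*[:=]", prompt, flags=re.IGNORECASE):
        findings.append(Finding("credential-in-prompt", "Keep credentials behind a tool boundary."))
    return findings


stuffed_prompt = (
    "You are an agent.\n"
    "1. Pull the schedule.\n"
    "2. Compare items.\n"
    "3. Flag differences.\n"
    "api_key=abc123"
)
rules = [f.rule for f in lint_base_prompt(stuffed_prompt)]
```

For `stuffed_prompt`, the linter reports `procedure-in-prompt` and `credential-in-prompt`. Static rules like these only catch structural problems; they say nothing about whether the instructions are actually followed at runtime.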

2) Agentic failures are silent and traditional logging misses them.

A slightly mutated tool call or a poorly followed instruction cascades into subtle issues that traditional logging tends not to catch. At Stage 2, failures in production are common and often surfaced by customers. Stage 3 teams build agent-native telemetry, collate it with business logic, and act on it continuously.

What platforms can build: An API primitive that returns confidence scores for tool and skill invocations to help surface silent failures. I built an SDK wrapper that tries to detect this from outside the model call: plint runtime. A native primitive from the provider would do it better.
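Detecting a mutated tool call from outside the model call can start with checking each call against its declared specification. The sketch below is one assumed approach, not plint runtime's implementation and not a provider API: `ToolSpec`, `check_tool_call`, and the `transfer_funds` tool are hypothetical. It cannot produce confidence scores; it only flags calls that deviate structurally.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolSpec:
    name: str
    required: dict[str, type]  # parameter name -> expected Python type


def check_tool_call(name: str, args: dict, specs: dict[str, ToolSpec]) -> list[str]:
    """Return a list of problems with a model-emitted tool call (empty if clean)."""
    spec = specs.get(name)
    if spec is None:
        return [f"unknown tool: {name}"]
    problems = []
    for param, expected in spec.required.items():
        if param not in args:
            problems.append(f"missing argument: {param}")
        elif not isinstance(args[param], expected):
            problems.append(
                f"wrong type for {param}: expected {expected.__name__}, "
                f"got {type(args[param]).__name__}"
            )
    for param in args:
        if param not in spec.required:
            problems.append(f"unexpected argument: {param}")
    return problems


specs = {"transfer_funds": ToolSpec("transfer_funds", {"account_id": str, "amount": float})}

clean = check_tool_call("transfer_funds", {"account_id": "A-1", "amount": 10.0}, specs)
mutated = check_tool_call("transfer_funds", {"accountId": "A-1", "amount": "10"}, specs)
```

The mutated call renames `account_id` to `accountId` and passes the amount as a string, which yields three problems. Emitting these as telemetry events, rather than letting a permissive tool silently coerce them, is what makes this class of failure visible.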

3) Without evals, every prompt, skill, and tool change is difficult to validate.

At Stage 2, evals are either absent or so expensive they discourage experimentation. Stage 3 teams have automated loops that prevent regression and improve prompts, skills, and tools over time.

What platforms can build: Evals are company-specific, but an eval harness can be generalized. Using Andrej Karpathy’s Autoresearch as a reference, I built a prototype that uses an agent to hill-climb a collection of prompts, skills, and tools against an eval suite: pts-autoresearch. This could be part of the agent tooling ecosystem or part of a managed agent platform from model providers.
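The hill-climbing loop itself is simple enough to sketch generically. This is not the pts-autoresearch or Autoresearch implementation: `hill_climb`, `Config`, and the toy proposer and evaluator are hypothetical stand-ins, with a deterministic function playing the role of the agent that proposes edits and a phrase check playing the role of the eval suite.

```python
from typing import Callable, Iterable

# Hypothetical config shape: named prompt, skill, and tool texts.
Config = dict[str, str]


def hill_climb(
    initial: Config,
    propose: Callable[[Config], Iterable[Config]],
    evaluate: Callable[[Config], float],
    max_rounds: int = 10,
) -> tuple[Config, float]:
    """Keep the best-scoring candidate each round; stop when nothing improves."""
    best, best_score = initial, evaluate(initial)
    for _ in range(max_rounds):
        improved = False
        for candidate in propose(best):
            score = evaluate(candidate)
            if score > best_score:  # strict improvement guards against regressions
                best, best_score, improved = candidate, score, True
        if not improved:
            break
    return best, best_score


# Toy eval suite: fraction of required behaviors mentioned in the base prompt.
REQUIRED = ["cite the source document", "escalate when unsure"]


def toy_evaluate(config: Config) -> float:
    return sum(p in config["base_prompt"] for p in REQUIRED) / len(REQUIRED)


# Toy proposer standing in for an agent that suggests one edit at a time.
def toy_propose(config: Config) -> Iterable[Config]:
    for phrase in REQUIRED + ["be concise"]:
        if phrase not in config["base_prompt"]:
            yield {**config, "base_prompt": config["base_prompt"] + f" Always {phrase}."}


best, score = hill_climb({"base_prompt": "You are a back-office agent."}, toy_propose, toy_evaluate)
```

Because only strict improvements are accepted, the edit that adds "be concise" is never kept: it does not raise the toy score. In a real harness the evaluator would run the company's own eval suite, which is the part that cannot be generalized.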

Thank you for reading this summary. Full paper here: The State of Agent Architecture.