The State of Agent Architecture: Summary Insights for Investors

Based on interviews conducted from April 25 to May 10, 2026 with seed-to-Series C startups building agents for enterprise customers. This is an excerpt — for the full paper see The State of Agent Architecture.

By Industry

Healthcare: 5, Financial Services: 2, Security: 1, Industrial Services: 1, Creative: 1

By Funding

100k–1M: 3, 1–10M: 4, 10–20M: 1, 20M+: 2

I spent the last eight months building an agentic platform for wealth management back office tasks. To understand how others are designing similar systems, I interviewed 10 seed to Series C startups building agents for enterprise customers.

I intentionally excluded coding agent startups from the sample. My sense is that coding will be dominated by the labs themselves, given TAM is large and the work is homogeneous enough for a general solution.

Headline Findings

By Model Provider

Anthropic, OpenAI, Gemini: 5; OpenAI only: 3; Anthropic, OpenAI: 1; Anthropic, OpenAI, Open Source: 1

1) Model selection criteria are roughly reliability > capability > price.

Two companies switched off Anthropic-only setups during my interview window because of production reliability issues and now route to both Anthropic and OpenAI via OpenRouter. Price is not a huge factor: for one company, a single run costs $50 in tokens and generates $10k in revenue. Niches are forming: Gemini dominates OCR and image tasks even for teams whose primary workhorse is OpenAI or Anthropic, and GPT5.4-mini owns small atomic tasks because it’s the cheapest. Open source is tested but rarely shipped: quality lags the frontier by about eight months, and the production bar for many tasks has only been met within that same window.
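To make the reliability-first ordering concrete, here is a minimal sketch of priority-ordered fallback across providers. It is not how any interviewed team or OpenRouter implements routing: `Provider`, `ProviderError`, `complete_with_fallback`, and the two stub callables are hypothetical names invented for illustration, and real provider calls would replace the stubs.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


class ProviderError(Exception):
    """Raised by a provider callable when a request fails (hypothetical)."""


@dataclass
class Provider:
    name: str
    call: Callable[[str], str]  # assumption: prompt in, completion text out


def complete_with_fallback(prompt: str, providers: Sequence[Provider]) -> tuple[str, str]:
    """Try providers in priority order; return (provider_name, completion)."""
    errors = []
    for provider in providers:
        try:
            return provider.name, provider.call(prompt)
        except ProviderError as exc:
            # Remember the failure and move on to the next provider.
            errors.append(f"{provider.name}: {exc}")
    raise ProviderError("all providers failed: " + "; ".join(errors))


# Stubs standing in for real provider SDK calls.
def flaky_primary(prompt: str) -> str:
    raise ProviderError("503 overloaded")


def steady_secondary(prompt: str) -> str:
    return f"echo: {prompt}"


used, text = complete_with_fallback(
    "reconcile account 42",
    [Provider("primary", flaky_primary), Provider("secondary", steady_secondary)],
)

# Token cost as a share of revenue for the $50 / $10k run mentioned above.
token_cost_share = 50 / 10_000
```

In this sketch the request lands on the secondary provider once the primary fails, and the token cost share for the example run works out to 0.5% of revenue, which is why price ranks last.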

2) Companies’ systems sit on a continuum from deterministic to fully agentic. The transition from single-task, narrowly scoped model calls (Stage 2) to multi-task, broad agentic workflows with skills and tool calling (Stage 3) is the most significant change happening right now.

The continuum, from purely deterministic to purely agentic:

Stage 1: Pure Determinism. No models, code controls everything. 0 / 10
Stage 2: Task Agents. Code controls flow, models handle narrow tasks. 7 / 10
Stage 3: Workflow Agents. Agents control flow, code provides broad scaffolding. 3 / 10
Stage 4: Self-Directed Agents. Agents control everything: flow, context, escalation. 0 / 10

Of the ten teams, I classified seven as Stage 2 and three as Stage 3. None have reached Stage 4.
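The difference between Stage 2 and Stage 3 is mostly about who owns the control flow. The sketch below is illustrative only: the three tools, `stub_planner`, and the state dictionary are hypothetical, and `stub_planner` is a deterministic stand-in for what would be a model call in a real Stage 3 system.

```python
from typing import Callable, Dict, List

Tool = Callable[[dict], dict]


# Hypothetical tools; real ones would call document stores, ledgers, and so on.
def extract(state: dict) -> dict:
    return {**state, "amount": 120}


def validate(state: dict) -> dict:
    return {**state, "valid": state.get("amount", 0) > 0}


def file_record(state: dict) -> dict:
    return {**state, "filed": True}


TOOLS: Dict[str, Tool] = {"extract": extract, "validate": validate, "file_record": file_record}


def stage2_pipeline(state: dict) -> dict:
    """Stage 2: code fixes the order; a model would sit inside each narrow step."""
    for step in (extract, validate, file_record):
        state = step(state)
    return state


def stub_planner(state: dict) -> str:
    """Stand-in for a model call that picks the next tool from the current state."""
    if "amount" not in state:
        return "extract"
    if "valid" not in state:
        return "validate"
    if not state.get("filed"):
        return "file_record"
    return "done"


def stage3_agent(state: dict, choose: Callable[[dict], str], max_steps: int = 10) -> dict:
    """Stage 3: the planner picks each step; code only bounds the loop and the tool set."""
    trail: List[str] = []
    for _ in range(max_steps):
        action = choose(state)
        if action == "done":
            break
        state = TOOLS[action](state)
        trail.append(action)
    return {**state, "trail": trail}


s2 = stage2_pipeline({"doc": "invoice.pdf"})
s3 = stage3_agent({"doc": "invoice.pdf"}, stub_planner)
```

Both paths reach the same end state here. The practical difference is that the Stage 3 loop can reorder or skip steps when a customer's input varies, while the Stage 2 pipeline needs a code change for every new variation.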

Because Stage 3 teams lean into model capabilities, they ship faster, handle more customer variation, and are best positioned to benefit from future model improvements.

Stage 2 is separated from Stage 3 by engineering, not model capability. At Stage 2, prompts are bloated, traditional logging misses silent failures, and evals are either absent or expensive enough to discourage iteration. Stage 3 teams have cleaner architectural separation between prompts, skills, and tools; in-house observability stacks that collate model traces with business logic; and automated eval loops that improve setups over time.
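As one way to picture "collating model traces with business logic", the sketch below stores each model output next to the result of a business-level check, so a call that returns without error but produces a wrong answer still surfaces. The `Tracer` and `TraceEvent` classes and the `total_is_non_negative` check are assumptions made up for this example, not any team's actual stack.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class TraceEvent:
    run_id: str
    step: str
    model_output: str
    check_name: str
    passed: bool


@dataclass
class Tracer:
    events: List[TraceEvent] = field(default_factory=list)

    def record(self, run_id: str, step: str, model_output: str,
               check_name: str, check: Callable[[str], bool]) -> bool:
        """Store the model output together with the outcome of a business check."""
        passed = check(model_output)
        self.events.append(TraceEvent(run_id, step, model_output, check_name, passed))
        return passed

    def silent_failures(self) -> List[TraceEvent]:
        """Events where the call itself succeeded but the business check did not."""
        return [event for event in self.events if not event.passed]


def total_is_non_negative(output: str) -> bool:
    # Hypothetical business rule: an extracted invoice total cannot be negative.
    return float(output.split(":")[1]) >= 0


tracer = Tracer()
tracer.record("run-1", "extract_total", "total: 120.50", "total_non_negative", total_is_non_negative)
tracer.record("run-2", "extract_total", "total: -35.00", "total_non_negative", total_is_non_negative)

flagged = [event.run_id for event in tracer.silent_failures()]
```

Neither call raises an exception, so a traditional error log would show two successes. Only the paired business check marks `run-2` as a failure.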

Three friction points consistently blocked teams from reaching Stage 3:

How to divide logic across prompts, skills, and tools isn’t widely understood yet
Agentic failures are silent and traditional logging misses them
Without evals, every prompt, skill, and tool change is difficult to validate
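The third friction can be illustrated with a very small regression gate: a fixed set of cases, a pass rate, and a rule that a changed prompt, skill, or tool ships only if it does not score worse than the current one. Everything named here (`CASES`, `pass_rate`, `should_ship`, and the two stub agents) is hypothetical; the stubs stand in for real harness configurations.

```python
from typing import Callable, List, Tuple

Case = Tuple[str, str]  # (input text, expected label)

# Hypothetical eval set for an intent-classification step.
CASES: List[Case] = [
    ("wire $500 to acct 1", "transfer"),
    ("what is my balance?", "inquiry"),
    ("close my account", "closure"),
]


def pass_rate(agent: Callable[[str], str], cases: List[Case]) -> float:
    """Fraction of cases where the agent's output matches the expected label."""
    hits = sum(1 for text, expected in cases if agent(text) == expected)
    return hits / len(cases)


def baseline_agent(text: str) -> str:
    # Stand-in for the current harness; it has no notion of account closure.
    return "transfer" if "wire" in text else "inquiry"


def candidate_agent(text: str) -> str:
    # Stand-in for the harness after a prompt or skill change.
    if "wire" in text:
        return "transfer"
    if "close" in text:
        return "closure"
    return "inquiry"


def should_ship(baseline: Callable[[str], str], candidate: Callable[[str], str],
                cases: List[Case]) -> bool:
    """Gate a change: ship only if the candidate does not regress on the eval set."""
    return pass_rate(candidate, cases) >= pass_rate(baseline, cases)


baseline_score = pass_rate(baseline_agent, CASES)
candidate_score = pass_rate(candidate_agent, CASES)
ship = should_ship(baseline_agent, candidate_agent, CASES)
```

A gate this cheap can run on every change, which is the point: the Stage 2 problem described above is that evals are either missing or too expensive to run often.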

3) Most companies are building their own agent tooling for observability and evals, but use Temporal for durable execution.

Every Stage 3 team I talked to had built their own observability stack, and several viewed their proprietary eval harness as a competitive advantage. The same reasons came up across interviews: customization matters, the barrier to writing new code is low, and in some cases teams see doing it well as a moat. That makes me cautious about the broader eval and observability tooling category. For platforms targeting this area, I’d be looking for solutions that address the frictions above. In the full version of this paper, I go deeper into each friction and provide prototypes for what the right tools might look like. Interestingly, despite building their own tooling for observability and evals, all but one team uses Temporal for durable execution. When I asked one team whether they’d replace it, the answer was: “I wouldn’t even consider it.”

4) Agent harness (prompt/skill/tool) engineering is not leading to model lock-in.

I initially expected that all of the engineering around prompts, skills, and tools would make teams less likely to switch model providers, but that is not what I found. Two companies I spoke to recently switched providers and found the migration straightforward. Others run near-identical harnesses across multiple models without much friction. A well-structured, observable, self-improving harness absorbs the small performance differences between models. It’s also likely that harnesses will get thinner over time as models continue to improve. This dynamic is likely to keep provider advantages temporary and competition intense.

Thank you for reading this summary. Full paper here: The State of Agent Architecture.