The State of Agent Architecture: Summary Insights for Builders
Based on interviews conducted from April 25 to May 10, 2026 with seed-to-Series C startups building agents for enterprise customers. This is an excerpt — for the full paper see The State of Agent Architecture.
I spent the last eight months building an agentic platform for wealth-management back-office tasks. Every few months a model release would push me to rearchitect, and I wanted to understand how other teams were handling that. So I interviewed ten seed-to-Series C startups building agents for enterprise customers. Here’s what I found.
Model Provider Selection
Selection criteria across the ten teams are roughly reliability > capability > price.
Two companies switched off Anthropic-only setups during my interview window because of production reliability issues; they now route to both Anthropic and OpenAI via OpenRouter. Price is not a huge factor: one company told me a single run costs $50 in tokens and generates $10k in revenue, so token spend is about 0.5% of revenue. Niches are forming: Gemini dominates OCR and image tasks even for teams whose primary workhorse is OpenAI or Anthropic, and GPT5.4-mini owns small atomic tasks because it’s the cheapest. Open source is tested but rarely shipped: quality lags the frontier by about eight months, and frontier models themselves only cleared the production bar within that window.
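The multi-provider routing described above comes down to an ordered fallback chain. The sketch below is a minimal illustration under stated assumptions: `complete_with_fallback`, `flaky_primary`, and `steady_secondary` are hypothetical names, and the providers are plain callables standing in for real vendor SDK or OpenRouter calls, which are not shown here.

```python
from typing import Callable, Sequence

# Assumption: a provider is any callable that maps a prompt to a completion.
# In a real system this would wrap a vendor SDK call.
Provider = Callable[[str], str]


class AllProvidersFailed(RuntimeError):
    """Raised when every provider in the chain has errored."""


def complete_with_fallback(
    prompt: str, providers: Sequence[tuple[str, Provider]]
) -> tuple[str, str]:
    """Try providers in priority order and return (provider_name, completion)."""
    errors: list[str] = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # broad on purpose: any failure triggers fallback
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


def flaky_primary(prompt: str) -> str:
    """Hypothetical stand-in for a provider that is having an outage."""
    raise TimeoutError("upstream timed out")


def steady_secondary(prompt: str) -> str:
    """Hypothetical stand-in for a healthy provider."""
    return f"echo: {prompt}"


chain = [("primary", flaky_primary), ("secondary", steady_secondary)]
print(complete_with_fallback("hi", chain))  # ('secondary', 'echo: hi')
```

Returning the provider name alongside the completion lets the caller log which provider actually served each request, which is the data needed to judge reliability per provider.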
The Agentic Continuum
Companies’ systems sit on a continuum from deterministic to fully agentic. The transition from single-task, narrowly scoped model calls (Stage 2) to multi-task, broad agentic workflows with skills and tool calling (Stage 3) is the most significant change happening right now.
Because Stage 3 teams lean into model capabilities, they ship faster, handle more customer variation, and are best positioned to benefit from future model improvements.
Key frictions I heard for teams transitioning from Task Agents (Stage 2) to Workflow Agents (Stage 3)
Teams have access to the same models, but only some are getting the most out of them. The difference lies in agent engineering: prompt/skill/tool linting that enables better instruction organization, debug primitives that surface silent failures, and eval harnesses that improve setups over time.
1) How to divide logic across prompts, skills, and tools isn't widely understood yet.
At Stage 2, prompts and tool calls get over-stuffed. Stage 3 teams settle on a three-layer split: task-specific procedures in skills, the logic for when to invoke them in short structured base prompts, and deterministic operations or credential boundaries in tools.
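To make the three-layer split concrete, here is a minimal sketch. Everything in it is an illustrative assumption rather than any team's actual setup: the `Skill` dataclass, the `reconcile_account` skill, and the `reconcile_totals` tool are hypothetical. The point it shows is that the base prompt carries only the routing logic (one line per skill), the full procedure lives in the skill and is loaded on demand, and the arithmetic is a deterministic tool.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Skill:
    """Task-specific procedure, loaded into context only when selected."""
    name: str
    when_to_use: str  # one-line trigger, surfaced in the base prompt
    procedure: str    # full step-by-step instructions


def reconcile_totals(ledger_cents: int, statement_cents: int) -> int:
    """Tool: deterministic arithmetic the model should not do itself."""
    return statement_cents - ledger_cents


# Hypothetical registries for the skill and tool layers.
SKILLS = {
    "reconcile_account": Skill(
        name="reconcile_account",
        when_to_use="The user asks to reconcile a ledger against a statement.",
        procedure=(
            "1. Fetch both balances.\n"
            "2. Call reconcile_totals.\n"
            "3. Report any nonzero difference."
        ),
    ),
}
TOOLS: dict[str, Callable[..., int]] = {"reconcile_totals": reconcile_totals}


def build_base_prompt(skills: dict[str, Skill]) -> str:
    """Base prompt: short and structured, holding only when-to-invoke logic."""
    lines = [
        "You are a back-office agent. Pick one skill, then follow its procedure.",
        "Skills:",
    ]
    lines += [f"- {s.name}: {s.when_to_use}" for s in skills.values()]
    return "\n".join(lines)


def load_skill(name: str) -> str:
    """Return the full procedure once the model has selected a skill."""
    return SKILLS[name].procedure


print(build_base_prompt(SKILLS))
```

Because the procedure text never appears in the base prompt, adding a skill grows the prompt by one line instead of a whole procedure.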
Anthropic Financial Services is a useful reference for proper prompt/skill/tool architecture. I also built a static linter that pulls best practices and flags formatting and architectural issues: plint.
2) Agentic failures are silent and traditional logging misses them.
A slightly mutated tool call or a poorly followed instruction cascades into subtle issues that traditional logging tends not to catch. At Stage 2, failures in production are common and often surfaced by customers. Stage 3 teams build agent-native telemetry, collate it with business logic, and act on it continuously.
I spun up an SDK wrapper that detects silent failures using call output to estimate model confidence: plint runtime. Feel free to try it out, fork it, or use it as a reference.
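One simple form of agent-native telemetry is to check every tool call against its declared spec and record the anomalies instead of letting them pass silently. The sketch below is an assumption-laden illustration, not how plint runtime works: `ToolSpec`, `check_tool_call`, and the `transfer` tool are hypothetical, and a tool call is assumed to be a dict with `name` and `arguments` keys.

```python
from dataclasses import dataclass


@dataclass
class ToolSpec:
    """Declared shape of a tool: argument name -> expected Python type."""
    name: str
    required: dict[str, type]


def check_tool_call(call: dict, specs: dict[str, ToolSpec]) -> list[str]:
    """Return anomalies found in one tool call; an empty list means it looks clean."""
    spec = specs.get(call.get("name", ""))
    if spec is None:
        return [f"unknown tool: {call.get('name')!r}"]
    args = call.get("arguments", {})
    issues = []
    # Missing or mistyped arguments are the classic "slightly mutated" call.
    for arg, expected in spec.required.items():
        if arg not in args:
            issues.append(f"missing argument: {arg}")
        elif not isinstance(args[arg], expected):
            issues.append(f"wrong type for {arg}: got {type(args[arg]).__name__}")
    # Extra arguments often mean the model renamed or invented a field.
    for arg in args:
        if arg not in spec.required:
            issues.append(f"unexpected argument: {arg}")
    return issues


specs = {"transfer": ToolSpec("transfer", {"account_id": str, "amount_cents": int})}
mutated = {"name": "transfer", "arguments": {"account": "A1", "amount_cents": "500"}}
for issue in check_tool_call(mutated, specs):
    print(issue)
```

Returning findings rather than raising keeps the agent running while still producing a record that can be joined with business outcomes later.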
3) Without evals, every prompt, skill, and tool change is difficult to validate.
At Stage 2, evals are either absent or so expensive they discourage experimentation. Stage 3 teams have automated loops that prevent regression and improve prompts, skills, and tools over time.
I built an eval harness prototype modeled on Andrej Karpathy’s Autoresearch. It hill-climbs a collection of prompts, skills, and tools against an eval suite: pts-autoresearch. Best used as a reference.
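For readers unfamiliar with the idea, hill climbing against an eval suite can be sketched in a few lines. This is a generic greedy loop under stated assumptions, not the pts-autoresearch implementation: `hill_climb`, `toy_score`, and `toy_mutate` are hypothetical, and the score function is a toy stand-in for running a real eval suite against a model.

```python
import random
from typing import Callable


def hill_climb(
    initial: str,
    mutate: Callable[[str, random.Random], str],
    score: Callable[[str], float],
    steps: int = 50,
    seed: int = 0,
) -> tuple[str, float]:
    """Greedy hill climbing: keep a mutation only if it strictly improves the score."""
    rng = random.Random(seed)
    best, best_score = initial, score(initial)
    for _ in range(steps):
        candidate = mutate(best, rng)
        candidate_score = score(candidate)
        if candidate_score > best_score:  # regressions and ties are discarded
            best, best_score = candidate, candidate_score
    return best, best_score


# Toy stand-ins: a real harness would score by running eval cases through a model.
REQUIRED = ("cite sources", "use the ledger tool")
POOL = ("cite sources", "use the ledger tool", "be brief")


def toy_score(prompt: str) -> float:
    """Fraction of required instructions present in the prompt."""
    return sum(phrase in prompt for phrase in REQUIRED) / len(REQUIRED)


def toy_mutate(prompt: str, rng: random.Random) -> str:
    """Append one randomly chosen instruction from the pool."""
    return prompt + "\n- " + rng.choice(POOL)


best_prompt, best_score = hill_climb("You are a back-office agent.", toy_mutate, toy_score)
print(best_score)
```

The strict-improvement check is what prevents regression: a change that does not raise the eval score never replaces the current best.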
Thank you for reading this summary. Full paper here: The State of Agent Architecture.