
The State of Agent Architecture

Figure: The agentic continuum, from purely deterministic to purely agentic. Stage 1, Pure Determinism: 0 / 10. Stage 2, Task Agents: 7 / 10. Stage 3, Workflow Agents: 3 / 10. Stage 4, Self-Directed Agents: 0 / 10.

Based on interviews conducted from April 25 to May 10, 2026 with seed-to-Series C startups building agents for enterprise customers.

1. Introduction

I spent the last eight months building an agentic platform to automate back-office tasks for wealth management firms (RIAs). Every few months there would be a new model release, and each time I’d spend the next week tweaking what I’d built to take advantage of it. Over time, these changes added up, so much so that the platform as it exists today is architecturally unrecognizable from the one I started with eight months ago.

My own experience made me curious about how other companies are evolving their agent systems. I decided to talk with ten seed-to-Series C startups and explore how they are designing their platforms. I was surprised by the level of variation.

I’m sharing here what I learned in those conversations, namely:

  1. How companies choose, and switch between, model providers
  2. The current friction points in building agents
  3. How model provider and agent tooling platforms can help

2. Data Sample: Mostly vertical AI startups, seed to Series C

First, the data sample. For this research, I interviewed ten different early-to-mid-stage application layer startups building agents for enterprise customers.

By Industry
Healthcare 5, Financial Services 2, Security 1, Industrial Services 1, Creative 1.
By Funding
100k–1M: 3, 1–10M: 4, 10–20M: 1, 20M+: 2.

An important caveat: I did not interview any coding agent startups. My personal thesis is that agentic coding will largely be dominated by the labs (OpenAI, Anthropic, DeepMind) and I’m not sure how much room there will be for smaller players. More broadly, I think you can roughly predict which markets the labs may enter by ranking industries on two axes: TAM and heterogeneity of the tasks. Where TAM is large and the work is homogeneous enough that a general solution can capture most of the value, I would expect the labs to go directly after those spaces. Coding is the clearest example: huge market, and the core workflow looks similar enough across companies that a general-purpose tool like Codex or Claude Code can serve most of it.

The long-tail of industry verticals looks different. TAMs are smaller, and the work requires last-mile customization that a general model provider has less incentive to build and less industry-specific expertise in. I expect we’ll continue to have a robust ecosystem of vertical applications, or at least industry-specific integrators, to take the labs’ general-purpose tools and build solid businesses by delivering the last ~10% of customization.

3. Model Provider Selection: Anthropic reliability challenges are causing switching, multi-model is common, and niches are forming

By Model Provider
Anthropic+OpenAI+Gemini: 5, OpenAI only: 3, Anthropic+OpenAI: 1, Anthropic+OpenAI+Open Source: 1.

Across the ten companies, teams selecting model providers roughly prioritize reliability > capability > price.

Two companies switched from Anthropic only to Anthropic + others because of reliability problems in prod. Both had initially tested with all three providers and selected Anthropic because it performed the best on their internal tests. It then had outage issues in prod and both companies switched to OpenRouter and now route to both OpenAI and Anthropic. “We feel confident in both.”
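The failover behavior these teams now rely on can be sketched roughly as follows. This is a minimal illustration, not OpenRouter's API: `call_with_fallback`, `ProviderError`, and the two stub providers are hypothetical names, and real provider clients would replace the stubs.

```python
from typing import Callable, Sequence


class ProviderError(Exception):
    """Hypothetical error raised when a provider call fails (outage, timeout)."""


def call_with_fallback(
    prompt: str,
    providers: Sequence[tuple[str, Callable[[str], str]]],
) -> tuple[str, str]:
    """Try each provider in priority order; return (provider_name, response)."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except ProviderError as exc:
            # Record the failure and fall through to the next provider.
            errors.append(f"{name}: {exc}")
    raise ProviderError("all providers failed: " + "; ".join(errors))


# Stub providers standing in for real API clients (assumptions for illustration).
def primary(prompt: str) -> str:
    raise ProviderError("503 service unavailable")


def secondary(prompt: str) -> str:
    return f"echo: {prompt}"


used, answer = call_with_fallback(
    "classify this page", [("primary", primary), ("secondary", secondary)]
)
```

Here the simulated outage on `primary` sends the request to `secondary`. A production router would likely also add timeouts and retries before failing over.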

Price alone doesn’t appear to drive switching. One Series C founder noted “I could switch to optimize margin, but my margins are so good already.” Another company explained “a single run costs me about $50 in tokens and generates about $10k in revenue. I don’t really care about the price that much.” For the right use cases, the value of the business outcomes far outweighs the cost of token input.

Teams don’t use open source much in prod, but they actively test it. Open source quality lags the frontier by about eight months, and frontier models have only met the bar for many production tasks within the last eight months, so open source still isn’t good enough for most teams. Those using OpenRouter test it most, since trying a new model doesn’t require self-hosting. These teams are often motivated by potential cost savings, but capability and reliability still win out as deciding factors.

Notable Gemini niche in OCR and other image tasks. Companies using OpenAI or Anthropic as their primary workhorse model still often use Gemini for image generation and image analysis. They see better performance for their use cases with Gemini.

For very small tasks, teams use OpenAI GPT5.4-mini. It’s the cheapest frontier model available. The three “OpenAI only” teams in my sample fell into this camp.

Most teams are building on the API directly and not using the SDK. My intuition is that agentic coding has compressed a lot of the value that a nice SDK interface previously provided on top of APIs. One SDK user switched to OpenRouter without much hassle.

4. An Agent Architecture Evolution: Less code, more model

Over the last eight months, the most dramatic shift I’ve observed is how much previously hard-coded logic I could remove and delegate to models in a real production system. That was not possible to do reliably a year ago. This has been driven in part by model improvements, but also by the release of agent skills and more reliable tool calling.

Skills in particular have allowed developers to remove multi-step task logic from bloated prompts and enable agents to better construct their own context at runtime. This keeps context cleaner and enables longer-running tasks. As tool calling has matured, developers are reducing their use of deterministic break points between tasks and instead calling tools that hard code common failure points, enable mid-run validations, and shield credentials.

Both agent skills and improved tool calling inject better scaffolding into long-running model execution, helping preserve coherence and enabling developers to delegate more and more to a single model call.

While these capabilities now exist, adoption is not binary. Across the ten companies I talked to, I found a wide variance in the quantity and complexity of tasks different teams delegated to models.

4a. The Agentic Continuum

Figure: The agentic continuum, from purely deterministic to purely agentic, with the number of interviewed companies at each stage.

Stage 1, Pure Determinism: no models, code controls everything (0 / 10)
Stage 2, Task Agents: code controls flow, models handle narrow tasks (7 / 10)
Stage 3, Workflow Agents: agents control flow, code provides broad scaffolding (3 / 10)
Stage 4, Self-Directed Agents: agents control everything, including flow, context, and escalation (0 / 10)

For this analysis, I plotted the ten companies I talked to on a continuum of agentic implementations, split into four primary stages. I believe the companies, as plotted on this diagram, represent a snapshot in time and a map of where development is headed.

Stage 1: Pure Determinism | No models, code controls everything

Traditional software without LLMs: deterministic code that executes the same way every time, my old memory-leaking C++ kernels notwithstanding. None of the companies I talked to fall into this camp, and at this stage of model capability very few systems do. Even low-level, performance-optimized code for inference (my background) uses small models for speculative decoding.

Stage 2: Task Agents | Code controls flow, models handle narrow tasks

Each model call contains a narrow, scoped single-step or few-step task, while everything else lives in hard-coded logic.

Examples:

“We use the model to parse the HTML file of the webpage we’re accessing. It returns to us the username and password fields as well as the login button. Then we have code to inject our username and password and click the button.”

“We provide the model one page of a document at a time, and it returns to us a classification of that page. We then perform a hard-coded calculation on it and save it into a database.”
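The second example has roughly the shape below: code owns the loop, the calculation, and the database write, and the model only ever sees one page. This is a sketch under stated assumptions. `classify_page` is a hypothetical stand-in for the model call (stubbed with keyword rules so the example runs), and the `REVIEW_MINUTES` table and SQLite schema are invented for illustration.

```python
import sqlite3

# Hypothetical per-class values used by the hard-coded calculation.
REVIEW_MINUTES = {"invoice": 5, "statement": 2, "other": 0}


def classify_page(page_text: str) -> str:
    """Stand-in for a narrow model call: one page in, one label out."""
    text = page_text.lower()
    if "invoice" in text:
        return "invoice"
    if "statement" in text:
        return "statement"
    return "other"


def process_document(pages: list[str], db: sqlite3.Connection) -> None:
    """Deterministic control flow: loop, calculate, persist."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS pages (page_no INTEGER, label TEXT, minutes INTEGER)"
    )
    for page_no, page in enumerate(pages, start=1):
        label = classify_page(page)       # the only model-shaped step
        minutes = REVIEW_MINUTES[label]   # hard-coded calculation
        db.execute("INSERT INTO pages VALUES (?, ?, ?)", (page_no, label, minutes))
    db.commit()


db = sqlite3.connect(":memory:")
process_document(["Invoice #12", "Account statement", "Cover letter"], db)
rows = db.execute("SELECT page_no, label, minutes FROM pages ORDER BY page_no").fetchall()
```

Everything except `classify_page` would behave identically with no model at all, which is what makes this a Stage 2 design.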

Stage 3: Workflow Agents | Agents control flow, code provides broad scaffolding

Single model calls are structured as multi-step, multi-task agentic workflows with skills and tool calling. Deterministic code is limited to breaks for human approval, evals, or context management (multi-agent/sub-agent orchestration).

Examples:

“When a new email inquiry comes in, the agent selects the appropriate skill based on the type of request. That skill contains several steps and also allows the agent to spin up a sub-agent to retrieve more context if needed. When complete, it writes the result to a database for the next agent to pick up. The next agent actively monitors the database waiting for new work.”

“We have a primary agent who triages the request when it comes in and then delegates to a collection of agents each with their own skill-based workflow. Once work is returned from those intermediate agents, a final rendering agent renders the output. Traces are logged but the entire process is all scoped within a single agent call.”
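For contrast with Stage 2, a Stage 3 loop can be sketched as below: the scaffolding only executes whatever action the model chooses next, including which skill to load. All names are hypothetical, and `stub_model` is a scripted stand-in for a real model call so the example runs without an API key.

```python
from typing import Callable

# Tools: deterministic operations the agent may call (hypothetical examples).
TOOLS: dict[str, Callable[[str], str]] = {
    "lookup_account": lambda arg: f"account record for {arg}",
    "write_result": lambda arg: f"saved: {arg}",
}

# Skills: procedures written as text, loaded into context only when selected.
SKILLS = {
    "address_change": "1. lookup_account 2. write_result",
    "fee_question": "1. lookup_account",
}


def stub_model(context: list[str]) -> tuple[str, str]:
    """Stand-in for a model call: returns (action, argument)."""
    skill_text = next((c for c in context if c.startswith("SKILL:")), None)
    if skill_text is None:
        # First decision: which skill fits the request?
        name = "address_change" if "moved" in context[0] else "fee_question"
        return "load_skill", name
    planned = [word for word in skill_text.split() if word in TOOLS]
    done = sum(1 for c in context if c.startswith("TOOL_RESULT:"))
    if done < len(planned):
        return planned[done], context[0]
    return "finish", context[-1]


def run_agent(request: str, max_steps: int = 10) -> list[str]:
    """Generic scaffolding: the model picks each action; code only executes it."""
    context = [request]
    for _ in range(max_steps):  # the step budget is the main deterministic guard
        action, arg = stub_model(context)
        if action == "finish":
            break
        if action == "load_skill":
            context.append("SKILL: " + SKILLS[arg])
        else:
            context.append("TOOL_RESULT: " + TOOLS[action](arg))
    return context


trace = run_agent("client 42 moved to Austin")
```

The returned `trace` doubles as a simple log of what the agent decided at each step. In a real system, human approval gates and sub-agent calls would be additional actions in the same loop.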

Stage 4: Self-Directed Agents | Agents control everything, including flow, context, and escalation

A single agent that completes any collection of multi-task processes, manages its own context manually or by spinning up subagents with skills/tools/prompt personas built on the fly, and defers to human and eval approval of its own accord. Tools become less important aside from auth injection, whereas skills remain important as recipes that provide expert human knowledge as context to an agent.

4b. Key frictions blocking teams from building multi-task Workflow Agents, and where platforms can step in

i. How to divide logic across prompts, skills, and tools isn't widely understood yet

At Stage 2, prompts and tool calls get over-stuffed because the delineation between what belongs in a prompt, skill, or tool isn’t always obvious. Stage 3 companies have settled on a three-layer split: task-specific procedures in skills, the logic for when to invoke them in short structured base prompts, and deterministic operations and credential boundaries in tools.

What Stage 2 looks like:

Skills weren’t a core part of any Stage 2 workflow I saw. One team tried to make the leap and found that without understanding the role of skills, the effort didn’t pay off:

“We thought tool calling was the key, so we had several very large tool calls. It didn’t work, and we realized we hard-coded so much logic into the tool calls that we basically just rewrote our workflow in tool calls and didn’t really get much of the development velocity benefit anyway.”

Others are trying more creative approaches:

“We have the same instruction in there several times. We also have some curse words to try to get it to follow instructions. In total, it’s about 500 lines long.”

What Stage 3 looks like:

One company moved from one agent per workflow to a single agent that reads skill files and constructs its own context based on the task:

“Our base prompt is almost empty. It just says: ‘you are [job function], do not forget to read skills.’”

Another company took a similar approach:

“We went from having effectively five different products with hard coded changes for each Fortune 500 customer to now a single multi-agent platform. We add new features by writing new skills, not by writing code.”
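One way to picture "new skills, not new code" is a near-empty base prompt plus a directory of skill files that is indexed at runtime. The sketch below is an assumption about how such a setup might look, not either company's implementation. The file layout (one `.md` file per skill, first line used as its description) and all function names are hypothetical.

```python
import tempfile
from pathlib import Path

# Placeholder kept as quoted above; the job function is not specified in the source.
BASE_PROMPT = "You are [job function]. Do not forget to read skills."


def load_skill_index(skill_dir: Path) -> dict[str, str]:
    """Map skill name to its first line, used as a short description."""
    index = {}
    for path in sorted(skill_dir.glob("*.md")):
        index[path.stem] = path.read_text(encoding="utf-8").splitlines()[0]
    return index


def build_context(skill_dir: Path) -> str:
    """Base prompt plus one line per skill; full skill text is read on demand."""
    lines = [BASE_PROMPT, "", "Available skills:"]
    lines += [f"- {name}: {desc}" for name, desc in load_skill_index(skill_dir).items()]
    return "\n".join(lines)


with tempfile.TemporaryDirectory() as tmp:
    skill_dir = Path(tmp)
    (skill_dir / "address_change.md").write_text(
        "Update a client's mailing address.\n1. ...", encoding="utf-8"
    )
    context = build_context(skill_dir)
    # Adding a feature means adding a file; build_context itself is unchanged.
    (skill_dir / "fee_question.md").write_text(
        "Answer questions about advisory fees.\n1. ...", encoding="utf-8"
    )
    context_after = build_context(skill_dir)
```

The second call picks up `fee_question` without any code change, which is the property both quotes describe.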

What’s missing from platforms?

Platforms could address two key gaps. First, create and share more reference implementations of proper prompt/skill/tool architecture, like Anthropic did for Financial Services. Writing skills well is still more art than science, and the best way to diffuse the art is to demonstrate it.

Second, build a linter that flags prompt/tool/skill formatting and architectural issues, based on the best practices in these reference implementations. I envision being able to toggle a ‘debug’ API call to surface formatting issues like overstuffed prompts, logic that belongs in a skill but lives in a tool, or tool definitions that have collapsed into mini-workflows. This could be a part of the agent tooling ecosystem, but it could also be a feature directly in the model platforms.

As a starting point, I built a prototype: plint.
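To make the linter idea concrete, here is a toy version of two such checks. It is not plint, whose internals this article does not describe. The heuristics, thresholds, and function names are all assumptions chosen for illustration.

```python
import re
from collections import Counter

MAX_PROMPT_LINES = 100  # assumed threshold for an "overstuffed" prompt
MAX_TOOL_STEPS = 3      # assumed threshold for a tool that reads like a workflow


def lint_prompt(prompt: str) -> list[str]:
    """Flag overlong prompts and instructions repeated verbatim."""
    lines = [ln.strip() for ln in prompt.splitlines() if ln.strip()]
    issues = []
    if len(lines) > MAX_PROMPT_LINES:
        issues.append(f"prompt has {len(lines)} lines; consider moving procedures into skills")
    for line, count in Counter(lines).items():
        if count > 1:
            issues.append(f"instruction repeated {count}x: {line!r}")
    return issues


def lint_tool(name: str, description: str) -> list[str]:
    """Flag tool descriptions that enumerate many sequential steps."""
    steps = re.findall(r"(?m)^\s*\d+[.)]\s", description)
    if len(steps) > MAX_TOOL_STEPS:
        return [f"tool {name!r} lists {len(steps)} steps; it may be a workflow, not a tool"]
    return []


prompt_issues = lint_prompt("Always cite sources.\nBe concise.\nAlways cite sources.")
tool_issues = lint_tool(
    "onboard_client", "1. fetch form\n2. parse\n3. validate\n4. file\n5. email"
)
```

The repeated instruction and the five-step tool description are each flagged once. Real checks would need to be calibrated against reference implementations rather than fixed thresholds.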

ii. Agentic failures are silent and traditional logging misses them

Agentic systems fail silently by default. A slightly mutated tool call or a poorly followed instruction cascades into subtle issues that traditional logging is unlikely to catch. At Stage 2, observability tends to look like traditional software logging: capture what happened, and read the logs when something breaks. That leaves most failures invisible. Stage 3 teams build in-depth telemetry, collate it with business logic, monitor it habitually or automatically, and act on what they find.

What Stage 2 looks like:

At one company, failures surface through customer reports:

“Our customers would let us know when the output was wrong and we’d have a fire drill to go fix it.”

Another records model reasoning, but the output is recorded in metadata where it may be harder to act on:

“We have our model call write ‘breadcrumbs’. It outputs its reasoning into the metadata of the document we’re analyzing: two to three sentences about what it saw.”

What Stage 3 looks like:

One company built a custom telemetry stack on top of their existing infrastructure:

“We’re able to kind of construct our own curated telemetry set that gives us a much richer data set than what Temporal alone gives us. We then layer on top of that a visualizer to see what’s happening within agents, but then also what happens between agents. I start every morning looking at it.”

Another invested in tooling that greatly improves the loop between observability and iteration:

“Once I root cause an issue, I can rebuild the state of not just the agent, but of the agent’s environment (our internal task boards and databases). I can then make a change, and re-run it to see if that change worked with one click.”

A third boiled it down to:

“Observability gives us confidence. It gives us confidence as developers to plan more ambitious features, and it gives confidence to our users.”

What’s missing from platforms?

I see two ways platforms could be helpful. First, observability platforms like Braintrust and LangSmith should lean into collation with business outcomes, not just LLM traces. The Stage 3 teams I talked to all built their own stacks because off-the-shelf tools showed them what the model did, but not what it meant for the work in their environment. Visualization across entire workflows matters too, since a single piece of work often moves between several agents.
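Collating traces with business outcomes can be as simple as joining the two on a shared work identifier. The sketch below assumes hypothetical span and outcome records (the field names and the "rejected" label are invented) and surfaces work that the customer rejected even though no step logged an error, which is the silent-failure case.

```python
from collections import defaultdict

# Hypothetical trace spans: one row per agent step on a piece of work.
spans = [
    {"work_id": "w1", "agent": "triage", "tool_errors": 0},
    {"work_id": "w1", "agent": "render", "tool_errors": 0},
    {"work_id": "w2", "agent": "triage", "tool_errors": 0},
    {"work_id": "w2", "agent": "render", "tool_errors": 0},
    {"work_id": "w3", "agent": "triage", "tool_errors": 1},
]

# Hypothetical business outcomes recorded outside the agent system.
outcomes = {"w1": "accepted", "w2": "rejected", "w3": "rejected"}


def silent_failures(spans, outcomes, bad_outcome="rejected"):
    """Work with a bad business outcome but zero logged errors across all agents."""
    errors = defaultdict(int)
    for span in spans:
        errors[span["work_id"]] += span["tool_errors"]
    return sorted(
        work_id
        for work_id, count in errors.items()
        if count == 0 and outcomes.get(work_id) == bad_outcome
    )


def agents_on(work_id, spans):
    """Agents that touched a work item, in order, for cross-agent review."""
    return [span["agent"] for span in spans if span["work_id"] == work_id]


silent = silent_failures(spans, outcomes)
touched = agents_on("w2", spans)
```

Traces alone would show `w2` as clean. Only the join with the outcome reveals it as the case worth investigating.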

Second, model platforms could create an API primitive for surfacing silent failures. A ‘debug’ flag that forces the model to return confidence scores for tool and skill invocations would give teams something to alert on. Evals catch some of this, but cheap runtime signals would help too.

I built an SDK wrapper prototype that tries to detect this from outside the model call: plint runtime. However, a native primitive from the provider would do it better.
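One signal that can be checked from outside the model call is whether each tool call still matches its declared schema. The sketch below illustrates that single check and is not a description of how plint runtime works. The schema format, tool name, and function name are hypothetical.

```python
# Hypothetical tool schema: required argument names mapped to expected Python types.
TOOL_SCHEMAS = {
    "create_transfer": {"account_id": str, "amount_cents": int},
}


def check_tool_call(name: str, args: dict) -> list[str]:
    """Return warnings for a tool call that drifts from its declared schema."""
    schema = TOOL_SCHEMAS.get(name)
    if schema is None:
        return [f"unknown tool {name!r}"]
    warnings = []
    for key, expected in schema.items():
        if key not in args:
            warnings.append(f"missing argument {key!r}")
        elif not isinstance(args[key], expected):
            warnings.append(
                f"{key!r} should be {expected.__name__}, got {type(args[key]).__name__}"
            )
    for key in args:
        if key not in schema:
            warnings.append(f"unexpected argument {key!r}")
    return warnings


ok = check_tool_call("create_transfer", {"account_id": "A1", "amount_cents": 1500})
drifted = check_tool_call("create_transfer", {"account": "A1", "amount_cents": "1500"})
```

The second call is the "slightly mutated tool call" case: a renamed key and a stringified number that might otherwise pass through unnoticed. Warnings like these give teams something to alert on, though they cannot capture poorly followed instructions the way a model-reported confidence score might.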

iii. Without evals, every prompt, skill, and tool change is difficult to validate

Without evals, changes are like input to a black box. There are two important parts of evals: a good eval set generated from telemetry or human review, and an automated system that flags regressions and updates prompts, skills, and tools programmatically. At Stage 2, evals are often absent, or present but not well-integrated into the development flow. Stage 3 teams have automated loops that prevent regression and improve prompts, skills, and tools over time.

What Stage 2 looks like:

One team described their approach as “scrappy”: manual run review, manual prompt updates, with the hope that changes hold. Another had high-quality eval data, thanks to summer interns who labeled hundreds of documents over a few months, but the feedback loop was tedious enough to discourage iteration: every prompt change required running the full set multiple times for statistical significance.

What Stage 3 looks like:

One company developed an automated eval loop for prompt and skill changes. It generates an A/B test on different versions and runs the loop against a growing eval set. Over time they’ve achieved 99% accuracy on their document classification step.
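The core of such a loop is small: score two versions against the same eval set and promote the candidate only if it does better. The sketch below is an assumption about the general shape, not this company's system. `version_a` and `version_b` are deterministic stubs standing in for the same model call under two prompt or skill versions, and the eval set is invented.

```python
from typing import Callable

# Hypothetical labeled eval set: (document text, expected label).
EVAL_SET = [
    ("Invoice 881 due May 1", "invoice"),
    ("Quarterly account statement", "statement"),
    ("INVOICE: consulting fees", "invoice"),
    ("Statement of holdings", "statement"),
]


def accuracy(classify: Callable[[str], str], eval_set) -> float:
    """Fraction of eval cases where the classifier returns the expected label."""
    hits = sum(1 for text, expected in eval_set if classify(text) == expected)
    return hits / len(eval_set)


def version_a(text: str) -> str:
    # Current version: misses upper-case "INVOICE".
    return "invoice" if "Invoice" in text else "statement"


def version_b(text: str) -> str:
    # Candidate version: case-insensitive match.
    return "invoice" if "invoice" in text.lower() else "statement"


def pick_winner(a, b, eval_set) -> str:
    """Promote B only if it scores strictly higher than A; otherwise keep A."""
    return "B" if accuracy(b, eval_set) > accuracy(a, eval_set) else "A"


score_a = accuracy(version_a, EVAL_SET)
score_b = accuracy(version_b, EVAL_SET)
winner = pick_winner(version_a, version_b, EVAL_SET)
```

With real model calls the scores vary between runs, so each version would need repeated runs and a significance check before promotion, which is the cost the Stage 2 team above found tedious to pay by hand.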

Another was more secretive, but described their eval harness as their “secret sauce.”

What’s missing from platforms?

Generating the right evals is company-specific. Open-source datasets exist, but the gold standard comes from working with customers to determine what needs to be measured to drive a business outcome. The more interesting opportunity for the tooling ecosystem and model platforms is a generalizable structure for the harness around the evals, something that could hold across companies.

This led me back to Andrej Karpathy’s Autoresearch. Instead of hill climbing a model against val_bpb, why not hill climb a collection of prompts, skills, and tools against an eval suite? train.py holds the code to modify the prompts, skills, and tools. prepare.py holds the fixed test cases and scoring rubrics. program.md holds the meta-instructions to the agent on how to improve.

I built a prototype of this here: pts-autoresearch. I could see this being part of the agent tooling ecosystem or part of a managed agent platform from model providers.
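The loop shape behind this idea can be sketched in a few lines. This is not pts-autoresearch or Autoresearch itself. Here `propose` plays the role of the agent editing prompts, skills, and tools, and `score` plays the role of the fixed eval suite. Both are toy stand-ins, and the skill names are invented.

```python
import random
from typing import Callable


def hill_climb(
    config: dict,
    propose: Callable[[dict, random.Random], dict],
    score: Callable[[dict], float],
    iterations: int = 50,
    seed: int = 0,
) -> tuple[dict, float]:
    """Keep a candidate only when it scores strictly higher on the fixed suite."""
    rng = random.Random(seed)
    best, best_score = config, score(config)
    for _ in range(iterations):
        candidate = propose(best, rng)
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best, best_score


# Toy stand-ins (assumptions): a config is a set of enabled skills, and the
# suite rewards two useful skills while charging a small cost per enabled skill.
ALL_SKILLS = ["kyc_check", "fee_lookup", "small_talk", "legacy_export"]
USEFUL = {"kyc_check", "fee_lookup"}


def propose(config: dict, rng: random.Random) -> dict:
    enabled = set(config["skills"])
    enabled ^= {rng.choice(ALL_SKILLS)}  # toggle one skill on or off
    return {"skills": sorted(enabled)}


def score(config: dict) -> float:
    enabled = set(config["skills"])
    return len(enabled & USEFUL) - 0.1 * len(enabled)


best, best_score = hill_climb({"skills": []}, propose, score, iterations=200)
```

Because only strict improvements are accepted, a change that helps one case but regresses the suite overall is discarded. In a real harness `propose` would be an agent call and `score` would run the eval set.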

5. Conclusion

From my observation, companies that made the jump to Stage 3 did so through better agent engineering: cleaner separation between prompts, skills, and tools; observability that is interpretable and actionable; and automated eval loops that improve setups over time.

Every team has access to the same models, but the ones getting the most out of them are doing so through better agent engineering. Platforms should focus on tools that guide all teams there: prompt/skill/tool linting that enables better instruction organization, debug primitives that surface silent failures, and eval harnesses that improve setups over time.

Ultimately, models will get better and inference will get cheaper. When I started building eight months ago, I grounded too much of my work in what the models could do at the time, and it led to a lot of rewrites. The teams that benefit most from the next eight months of progress will be the ones whose architecture is model-first, ready to absorb what models can do next, rather than hard-coded around what they can do today.

Thank you for reading this far. If this matches what you’re seeing, or if you disagree, I’d love to chat. Send me a message on LinkedIn!

If somehow you still haven't had enough, a few final thoughts…

6. Appendix

6a. Other Observations

Desire for better browser use models. Several teams brought up web automation as the one capability where model improvements could unlock new use cases. 2FA, anti-bot measures, and rate limiting all came up as specific pain points. One team called it the “single most underestimated challenge in building agents.”

Context pollution. This came up in nearly every conversation. Stage 3 teams handle it by manually segmenting into subagents, loading skills selectively at the persona level, and staying disciplined about prompt length. Model platforms could help with conditional skill loading: if skill A is invoked, drop skills B and C from skill search and add D and E. The gains are marginal at low skill counts, but as teams trust agents with more responsibility, skill libraries grow and hierarchy starts to matter. One team has 150 skill files on a single agent. At roughly 200 tokens of metadata each, that’s 30,000 tokens of context burned before the task even starts, even with progressive disclosure.
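The arithmetic and the conditional-loading rule above can be written down directly. The rule format and function names below are hypothetical. The 200-token figure is the rough estimate from the paragraph above.

```python
TOKENS_PER_SKILL_METADATA = 200  # rough per-skill estimate from the text

# Hypothetical rules: invoking a skill hides some skills and reveals others.
RULES = {
    "A": {"drop": {"B", "C"}, "add": {"D", "E"}},
}


def metadata_tokens(visible: set[str]) -> int:
    """Context cost of listing skill metadata before any task work starts."""
    return len(visible) * TOKENS_PER_SKILL_METADATA


def after_invoking(skill: str, visible: set[str]) -> set[str]:
    """Apply the conditional-loading rule for the invoked skill, if any."""
    rule = RULES.get(skill, {"drop": set(), "add": set()})
    return (visible - rule["drop"]) | rule["add"]


flat_cost = metadata_tokens({f"skill_{i}" for i in range(150)})  # 150 * 200
visible = after_invoking("A", {"A", "B", "C"})
```

With a flat list, all 150 skills cost 30,000 tokens up front. With rules like these, the visible set (and so the cost) tracks the task the agent is actually on.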

Pressure to ship drives companies toward Stage 3. Workflow-based systems grow more complex with each new customer: different steps for each medical ERP, different processes for each underwriter, different renderings for each creative asset. With deterministic code, that means more hard-coded branches until updates become prohibitively tedious. This is the problem that RPA faced, and that Stage 2 teams are still facing. The promise of Stage 3 is generalizability: new customers absorbed through new skills rather than new code.

Nearly everyone is using Temporal. It’s a key part of systems for both Stage 2 and Stage 3 teams. When asked if they would move off it, one team said: “I wouldn’t even consider it. You get a lot of benefits: replayability, failure recovery, observability.” Of the ten companies I spoke to, only one wasn’t using Temporal, and they had built their own durable execution engine internally.

More people are using OpenRouter. I used to think they’d get disintermediated, but as reliability has become a constraint, dynamic routing has become more valuable. They also make open source models much easier to use. Their trajectory probably depends on whether model platform reliability stabilizes and on how demand for open source evolves.

6b. Lingering Questions

Does agent harness (prompt/skill/tool) engineering lead to model lock-in? Surprisingly, no. I initially expected that all of the engineering around prompts, skills, and tools would make teams less likely to switch model providers, but that is not what I found. Two companies I spoke to recently switched providers and found the migration straightforward. Others run near-identical harnesses across multiple models without much friction. A well-structured, observable, self-improving harness absorbs the small performance differences between models. If this dynamic holds, it is likely to keep provider advantages temporary and competition intense.

Will observability always get built in-house? Even with the gaps above, agentic coding makes it cheap to build your own observability layer. The Stage 3 teams I spoke with all built their own interpretation layer on top of OTel and Temporal and found the customization was worth it. There is a broader question about whether a platform-level observability product can ever be specific enough to compete, or whether this is a layer that will increasingly move in-house because every team’s debugging needs are too distinct and code is so easy to write. A related question: with the barrier to writing new code this low, is there still value to an SDK over a raw API?

Does the eval harness plus prompts, skills, and tools replace fine-tuning? Multiple companies described skills as “the new fine-tuning.” Under what conditions is actual fine-tuning still the best solution? My guess is that fine-tuning matters most for narrow, high-volume tasks where you want to bake the optimal behaviors into the weights of a small, open-source model. Outside of those use cases, the skills-plus-evals loop might just be faster and cheaper to iterate on.

What does Stage 4 actually look like in production? None of the teams I talked to are there yet. The Stage 3 architecture of skills, observability, and evals feels like it scales toward Stage 4, but it’s not clear whether the jump is continuous or whether something else has to change first.