What an Agentic System Actually Is
An agentic system is not just a model with tool calling. It is a product where an AI can inspect the current state of a shared world, decide what to do next, take an action through validated primitives, and then verify the result before continuing.
The important change is not reasoning quality. It is state mutation. Once the model can move real objects, edit production data, kick off jobs, or coordinate with a human in the same workspace, you are no longer building a chatbot. You are building a distributed system with an AI participant inside it.
The loop every serious agent eventually becomes
+--------+     +------+     +-----+     +---------+
| Gather | --> | Plan | --> | Act | --> | Observe |
+--------+     +------+     +-----+     +---------+
     ^                                       |
     +---------------------------------------+
- Shared state means the human and the agent can both be right at different times unless the system defines a single source of truth.
- Undo matters because the safest recovery path is usually reversing a valid action, not inventing a bespoke rollback path under pressure.
- Trust matters because a user will forgive a slow agent before they forgive one that silently mutates the wrong thing.
- Observability matters because most failures look like model failures from the outside and orchestration failures from the inside.
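The loop above can be sketched in a few lines. This is a minimal illustration, not a real framework API: `plan` stands in for the model call, `act` for a validated primitive, and the names are assumptions for the sake of the example.

```python
# A minimal sketch of the gather -> plan -> act -> observe loop.
# `plan` stands in for a model call; all names here are illustrative.
def run_agent(world: dict, plan, act, max_turns: int = 8):
    """Loop until the planner signals completion or the turn budget runs out."""
    for turn in range(max_turns):
        observation = dict(world)      # gather: a consistent read of shared state
        step = plan(observation)       # plan: decide the next action from that view
        if step is None:               # planner signals it is done
            return world
        act(world, step)               # act: mutate only through a validated primitive
        # observe: the next iteration re-reads the world before planning again
    raise RuntimeError("turn budget exhausted")
```

The turn budget is not decoration: it is the first guardrail that keeps a confused planner from becoming an operational incident.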
Commands Before Direct Mutation
Once both humans and agents can change the same world, direct mutation is too weak an abstraction. Writes need to become commands: typed, validated, logged, undoable operations that make the system answerable after the fact.
The result matters as much as the input. A good command result tells you whether it succeeded, what changed, and what version the world is now on. Without that, sync becomes guesswork and recovery becomes folklore.
The public contract is usually small
Command -> validate -> apply -> version++ -> emit result
                         |
                         +--> push undo unit
- Default to atomic command batches unless partial success is genuinely safe.
- Treat undo groups as user-facing actions, not low-level implementation trivia.
- Make version increments monotonic so every client can answer whether it is stale.
- Do not bolt undo on later. If the agent can write today, it can make mistakes today.
Tools Are a Permission Boundary
Teams often treat tools as a model integration detail. In practice they behave more like a security and product boundary. The tool surface defines what the model can inspect, how it can express an action, and how much damage a confused turn can do.
That is why good tool systems are narrow, explicit, and phase-aware. They are not generated from every useful helper function in the codebase. The agent does better when the surface is reviewed like an API, not discovered like a filesystem.
| Tool class | Why it exists | Public rule of thumb |
|---|---|---|
| Read | Load context without mutating state | Keep outputs focused and cheap to repeat |
| Write | Apply changes to shared state | Route through commands, not ad hoc mutation |
| Gate | Ask for input or approval | Suspend cleanly and resume in the same run |
| Job | Kick off longer async work | Return status handles, not giant payloads |
- Use stable names and explicit schemas so the model knows what a tool is for.
- Split read and write tools for telemetry, policy, and loop-control reasons.
- Narrow the surface further by phase when the workflow already knows what stage it is in.
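A phase-aware tool surface can be as simple as a registry with explicit classes and phase tags. The tool names and phase labels below are hypothetical; the point is that the surface is declared and reviewed like an API, and shrinks as the workflow advances.

```python
# A sketch of a phase-aware tool registry. Tool names, classes, and phases
# are illustrative; the surface is declared up front, not discovered.
from dataclasses import dataclass

@dataclass(frozen=True)
class Tool:
    name: str
    kind: str            # "read" | "write" | "gate" | "job"
    phases: frozenset    # workflow phases in which this tool is exposed

TOOLS = [
    Tool("get_task", "read", frozenset({"plan", "execute"})),
    Tool("apply_edit", "write", frozenset({"execute"})),
    Tool("ask_user", "gate", frozenset({"plan", "execute"})),
    Tool("start_export", "job", frozenset({"execute"})),
]

def surface_for(phase: str) -> list[str]:
    """Only expose tools that make sense in the current workflow phase."""
    return [t.name for t in TOOLS if phase in t.phases]
```

During planning the model never even sees `apply_edit`, which removes a whole class of confused-turn damage without any prompting effort.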
Context Is a Budget, Not a Dump
Most agent systems fail long before the model reaches its intellectual limit. They fail because the model sees too little of the right state, too much of the wrong state, or stale state presented as current truth. Context strategy is usually a higher leverage problem than prompt wording.
A context pipeline that scales better than full reloads
[ world header ] -> [ scoped read ] -> [ diff since last version ] -> [ compaction ]
      small            targeted             incremental                  bounded
A good world header gives orientation, not exhaustiveness. It tells the model what exists, what is active, what changed recently, and which capabilities are actually available. After that, the system should load focused state only when the current task needs it.
- Tell the agent what a resource can do now, not what it might do after more background work finishes.
- Prefer diffs after the first load so multi-turn runs are not re-paying for old context.
- Compact older conversation items aggressively enough to preserve coherence without burying the active task.
- Treat raw tool outputs as logs first and model context second.
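The diff-after-first-load rule can be sketched directly. This assumes a hypothetical server-side change log keyed by version; the field names are illustrative.

```python
# A sketch of diff-since-version context loading, assuming the server keeps
# a change log keyed by monotonic version. Field names are hypothetical.
def context_since(change_log: list, client_version: int, current_version: int) -> dict:
    """First load gets the full world; later loads get only missed changes."""
    if client_version == 0:                      # first load: full snapshot + header
        return {"mode": "full", "upto": current_version}
    missed = [c for c in change_log if c["version"] > client_version]
    return {"mode": "diff", "changes": missed, "upto": current_version}
```

Because the client reports the version it last saw, the same primitive answers staleness questions for the human UI and the agent alike.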
Runtime Quality Is the Product
A production agent is a bounded loop wrapped around a model, not a single model invocation. The visible quality of the product comes from how that loop handles tool execution, malformed turns, recovery, checkpoints, and progress signaling under stress.
When teams say an agent feels flaky, they usually mean the runtime feels flaky. The model might be fine. What failed was the surrounding system's ability to constrain the phase, keep the conversation valid, stop a loop at the right time, and explain to the user what happened.
| Runtime concern | Why it matters |
|---|---|
| Turn budgets | Prevent expensive loops from becoming operational incidents |
| Protocol repair | Keeps malformed tool conversations from poisoning the next turn |
| Checkpointing | Lets long-running work resume without replaying everything |
| Phase handoffs | Reduces the chance of wandering back into the wrong workflow step |
| Progress emission | Creates trust before the full job is finished |
A useful rule: treat repeated phase errors as an orchestration problem, not a prompting problem.
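That rule can be made mechanical. The sketch below is illustrative (the event shape and thresholds are assumptions): bound total turns, count errors per phase, and escalate to the orchestrator when the same phase keeps failing instead of reprompting forever.

```python
# A sketch of turn budgets plus phase-error escalation. The event shape
# ("ok"/"phase_error", phase) and the thresholds are illustrative.
def run_with_budget(turns, max_turns: int = 10, max_phase_errors: int = 3) -> str:
    """`turns` yields ("ok", phase) or ("phase_error", phase) events."""
    errors_by_phase: dict = {}
    for i, (status, phase) in enumerate(turns):
        if i >= max_turns:
            return "budget_exhausted"            # expensive loop, stop cleanly
        if status == "phase_error":
            errors_by_phase[phase] = errors_by_phase.get(phase, 0) + 1
            if errors_by_phase[phase] >= max_phase_errors:
                # orchestration problem, not a prompting problem
                return f"escalate:{phase}"
    return "done"
```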
Long-Running Work Needs Structure
The bigger the task, the less you want a single undifferentiated loop. Long-running agents benefit from sections: chunks of work that can be checkpointed, verified, and resumed without re-planning the entire world every time something changes.
A more reliable run shape
Run
 |- Section 1: resolve target
 |- Section 2: mutate shared state
 |- Section 3: verify outcome
 `- Section 4: summarize or hand back to user
This structure matters even more in local-first products. If the human is editing a browser-owned draft while the agent wants to reason over shared durable state, you need a flush barrier or an explicit block. Otherwise the agent plans against a world the human is no longer looking at.
- Checkpoint between sections, not after every tiny action.
- Re-read relevant context before writing if humans can edit concurrently.
- Measure first visible change, not just total completion time.
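Sectioned execution with checkpoints is a small amount of code. The sketch below assumes a hypothetical `checkpoint_store` dict; the shape is what matters: skip completed sections on resume, and checkpoint at section boundaries rather than after every action.

```python
# A sketch of sectioned runs: checkpoint after each section so a crash
# resumes at a section boundary. `checkpoint_store` is a stand-in for
# any durable key-value store; all names are illustrative.
def run_sections(sections, checkpoint_store: dict) -> None:
    """`sections` is an ordered list of (name, fn); skip anything already done."""
    done = set(checkpoint_store.get("done", []))
    for name, fn in sections:
        if name in done:
            continue                              # resume: skip completed sections
        fn()                                      # section re-reads context if needed
        done.add(name)
        checkpoint_store["done"] = sorted(done)   # checkpoint between sections
```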
Human Gates Are Runtime Features
Human-in-the-loop behavior should not be treated as a special-case UI overlay. It is a core runtime feature. If the agent needs clarification or approval, the system should pause cleanly, hold context, keep the connection alive, and resume from the same run state after the user answers.
This also means your UI needs two concepts of state: transport state and gate state. If the system is waiting for a human answer, that must override any stale 'running' indicator. Otherwise you get the classic failure mode where the user sees a spinner while the agent is actually blocked on them.
| Gate type | Typical trigger | What the runtime must guarantee |
|---|---|---|
| Question | Missing information changes the next action | Pause, preserve context, inject the answer back into the same run |
| Confirmation | Risky or destructive action | Do not proceed until approval semantics are explicit |
| Mode change | User escalates autonomy | Apply the new permission mode immediately, not on a later run |
Streaming and Replay Are Core, Not Polish
Agent products feel alive when the runtime can stream progress and recover from network reality. Connections drop, tabs sleep, corporate proxies misbehave, and long reasoning windows create silence that looks like failure unless the transport is designed for it.
| Concern | SSE default | Why it usually wins |
|---|---|---|
| Direction | Server to client | Most agent traffic is progress, text, and tool status flowing outward |
| Operations | Simple HTTP semantics | Less fragile through proxies and standard infra |
| Recovery | Sequence-based replay | Clients can catch up after disconnects instead of starting blind |
Replay is about position, not ceremony
seq: 41 -> 42 -> 43 [disconnect] 44 -> 45 -> 46
client reconnects with cursor 43 and asks for >43
The important primitive is not the transport brand. It is monotonic event sequencing plus replay. If the client cannot reconnect and reconstruct what it missed, the system will eventually lie about run state even when the backend stayed healthy the entire time.
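The primitive is small enough to sketch: a bounded, monotonically numbered event buffer plus a replay query. This is transport-agnostic and the names are illustrative; with SSE, the client's cursor is typically carried in the `Last-Event-ID` header on reconnect.

```python
# A sketch of monotonic sequencing plus replay: a bounded buffer of numbered
# events, and a query for everything after a client's cursor. Names are
# illustrative; the same shape works over SSE, WebSocket, or polling.
class EventLog:
    def __init__(self, capacity: int = 1000):
        self.seq = 0
        self.events: list = []        # list of (seq, payload)
        self.capacity = capacity

    def emit(self, payload) -> int:
        self.seq += 1                              # monotonic, never reused
        self.events.append((self.seq, payload))
        self.events = self.events[-self.capacity:] # bound memory, drop oldest
        return self.seq

    def replay_after(self, cursor: int) -> list:
        """Everything the client missed after `cursor`."""
        return [(s, p) for s, p in self.events if s > cursor]
```

If the cursor has fallen off the bounded buffer, the honest answer is a full resync, not a silent gap.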
Reliability Comes From Recovery
Most runtime failures are ordinary: rate limits, timeouts, malformed tool calls, stale state, broken handoffs, and repeated reads that never become action. Reliable systems classify these failures, bound retries, and choose recovery paths that preserve both state integrity and user trust.
- Detect stuck loops before they turn into credit-card bugs.
- Retry by failure class, not with a single global panic budget.
- Prefer undo over bespoke rollback when the system already has a valid reversible primitive.
- Keep the latest local draft visible in interactive products even when durability is lagging behind it.
- Do not let mutation queues become the source of visual truth.
One subtle failure pattern is self-inflicted: feeding giant raw tool outputs back into the model. The system then slows itself down, blows context, and creates new reliability incidents from traces that should have stayed in logs. Compact outputs are not just an optimization. They are a stability tool.
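Compaction of tool output can be as blunt as this sketch: keep the head, state what was dropped, and point at the log. The threshold and the log-reference format are assumptions.

```python
# A sketch of compacting a raw tool output before it becomes model context:
# keep the head, report the omission, point at the full log. The threshold
# and reference format are illustrative.
def compact_tool_output(raw: str, log_ref: str, max_chars: int = 500) -> str:
    if len(raw) <= max_chars:
        return raw
    kept = raw[:max_chars]
    omitted = len(raw) - max_chars
    return f"{kept}\n... [{omitted} chars omitted; full output at {log_ref}]"
```

The full trace still exists, but it lives in the log where operators need it, not in the context window where it destabilizes the next turn.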
Observability and Optimization
Users need a readable narrative of what the system is doing. Operators need a faithful trace of what actually happened. Those are related problems, but they are not the same problem. A good agent product humanizes tool activity for the user while preserving enough raw trace to reconstruct the run after the fact.
Optimization comes after that foundation. Stable prompt prefixes, diff-based sync, compacted event streams, and adaptive model routing all matter, but only after the system is already correct. Saving tokens on a broken runtime just lets you fail more cheaply.
| Optimization | What it helps | What not to sacrifice |
|---|---|---|
| Prefix stability | Latency and cost | Do not rewrite old prompt history casually |
| Diff-based sync | Bandwidth and context size | Do not hide capability state behind vague summaries |
| Model routing | Cost per turn | Do not route away from the model that owns the hard decision |
| Event compaction | Memory and replay cost | Do not compact away the trace needed for diagnosis |
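The last row of the table is the easiest to get wrong, so here is one way to keep the invariant explicit. The event shapes are illustrative: progress deltas collapse to the latest value, while errors and results are never compacted away.

```python
# A sketch of event compaction that preserves the diagnostic trace:
# progress events collapse into the most recent one, but errors and
# results always survive. Event shapes are illustrative.
def compact_events(events: list) -> list:
    latest_progress = None
    kept = []
    for e in events:
        if e["type"] == "progress":
            latest_progress = e        # collapse deltas to the latest value
        else:
            kept.append(e)             # errors and results survive intact
    if latest_progress is not None:
        kept.append(latest_progress)
    return kept
```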
Anti-Patterns and the Practical Goal
The recurring mistakes are now familiar. Teams build a separate agent API, expose too many tools, reload full state every turn, treat SSE as reliable, skip undo-first recovery, and blur the line between local drafts and durable truth. The result is usually the same: impressive demos, fragile products, and hard-to-explain failures.
- Do not confuse autonomy with absence of guardrails.
- Do not confuse internal convenience with a safe public tool surface.
- Do not confuse durable metadata with what the user should see on screen.
- Do not confuse more context with better context.
The practical goal is not to make the model feel magical. It is to make autonomy bounded, observable, reversible, and trustworthy enough that humans can work alongside it in the same product. When those properties are strong, agentic systems stop feeling like prototypes and start feeling like infrastructure.
That is usually the real milestone: not 'the agent can do everything', but 'the agent can do enough, safely, repeatedly, and without surprising the user'.