Atlas Labs
Lab Notes
Agents · Orchestration · Observability · LLMs

Agent workflows that don't break: orchestration, guardrails, and observability

7 min read

The demo works perfectly. The agent calls the right tools, produces the right output, finishes in seconds. You show it to the stakeholders and everyone is excited.

Then you run it on 500 real cases. Forty fail silently. Eight enter infinite loops that burn through tokens. Three call the wrong API with malformed parameters and corrupt some data. Two produce plausible-sounding output that is completely wrong.

Multi-agent systems are in a peculiar position: they're powerful enough to do genuinely useful work, and fragile enough to fail in genuinely unexpected ways. The engineering discipline required to make them production-reliable is underappreciated.

This post covers the three areas that matter most: orchestration design, guardrails, and observability.

Orchestration: explicit state machines over implicit flows

The most common mistake in agent orchestration is letting the LLM manage its own control flow without constraints. The LLM decides what to do next, calls tools, interprets results, and decides again. This is flexible, but it's also an untestable black box.

Use explicit state machines. Define the states your workflow can be in, the valid transitions between them, and what triggers each transition. The LLM operates within each state; the orchestrator manages state transitions.

from enum import Enum
from dataclasses import dataclass

class WorkflowState(Enum):
    PLANNING = "planning"
    DATA_RETRIEVAL = "data_retrieval"
    ANALYSIS = "analysis"
    VALIDATION = "validation"
    COMPLETE = "complete"
    FAILED = "failed"

@dataclass
class WorkflowContext:
    state: WorkflowState
    inputs: dict
    retrieved_data: list
    analysis_result: dict | None
    validation_errors: list[str]
    attempt_count: int = 0
    max_attempts: int = 3

def run_workflow(context: WorkflowContext, max_steps: int = 50) -> WorkflowContext:
    # Bound the loop so a bad transition can never spin forever.
    for _ in range(max_steps):
        if context.state in (WorkflowState.COMPLETE, WorkflowState.FAILED):
            return context
        context = step(context)
    raise RuntimeError(f"Workflow exceeded {max_steps} steps")

def step(context: WorkflowContext) -> WorkflowContext:
    if context.state == WorkflowState.PLANNING:
        return plan(context)
    elif context.state == WorkflowState.DATA_RETRIEVAL:
        return retrieve(context)
    elif context.state == WorkflowState.ANALYSIS:
        return analyze(context)
    elif context.state == WorkflowState.VALIDATION:
        return validate(context)
    raise ValueError(f"Unknown state: {context.state}")

This pattern gives you:

  • Testability: Each state function is independently testable
  • Debuggability: You always know what state the workflow is in
  • Retry logic: You can retry individual states without restarting the whole workflow
  • Timeouts: You can bound each state independently
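The retry bullet above can be sketched concretely: a state function re-enters its own state on failure until the attempt budget is exhausted, then transitions to FAILED. This is a trimmed-down re-declaration of the state machine so the sketch stands alone, with a hypothetical `fetch` callable standing in for a real tool call.

```python
import dataclasses
from dataclasses import dataclass, field
from enum import Enum

# Trimmed-down re-declaration of the state machine above.
class WorkflowState(Enum):
    DATA_RETRIEVAL = "data_retrieval"
    ANALYSIS = "analysis"
    FAILED = "failed"

@dataclass
class WorkflowContext:
    state: WorkflowState
    retrieved_data: list = field(default_factory=list)
    attempt_count: int = 0
    max_attempts: int = 3

def retrieve(context: WorkflowContext, fetch) -> WorkflowContext:
    """Retry within the state: stay in DATA_RETRIEVAL until fetch succeeds
    or the attempt budget is exhausted, then transition to FAILED."""
    try:
        data = fetch()
    except Exception:
        if context.attempt_count + 1 >= context.max_attempts:
            return dataclasses.replace(context, state=WorkflowState.FAILED)
        return dataclasses.replace(context, attempt_count=context.attempt_count + 1)
    return dataclasses.replace(
        context,
        state=WorkflowState.ANALYSIS,
        retrieved_data=data,
        attempt_count=0,  # reset the budget for the next retryable state
    )
```

Because the retry is a state transition back into the same state, it shows up in traces and checkpoints like any other transition, rather than being hidden inside a tool wrapper.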

Checkpoint and resume. For long-running workflows, persist state at each transition. If the workflow fails mid-execution (network error, rate limit, crash), you can resume from the last checkpoint rather than starting over.

import json
from pathlib import Path

def checkpoint(workflow_id: str, context: WorkflowContext):
    state_file = Path(f"checkpoints/{workflow_id}.json")
    state_file.parent.mkdir(parents=True, exist_ok=True)
    state_file.write_text(json.dumps({
        "state": context.state.value,
        "attempt_count": context.attempt_count,
        "retrieved_data": context.retrieved_data,
        "analysis_result": context.analysis_result,
    }))

def load_checkpoint(workflow_id: str) -> dict | None:
    state_file = Path(f"checkpoints/{workflow_id}.json")
    if state_file.exists():
        return json.loads(state_file.read_text())
    return None

Guardrails: defense in depth

Guardrails are the safety nets that prevent bad outputs from reaching your users or downstream systems. The key insight is that you need guardrails at multiple layers — not just at the final output.

Input validation. Validate and sanitize inputs before they reach the LLM. This is especially important for user-supplied content that goes into prompts (prompt injection) and for tool arguments that will hit external APIs or databases.

def validate_tool_args(tool_name: str, args: dict) -> dict:
    schemas = {
        "search_database": {
            "query": (str, None),
            "limit": (int, lambda x: 1 <= x <= 100),
            "filters": (dict, None),
        },
        "send_email": {
            "to": (str, lambda x: "@" in x),  # basic email validation
            "subject": (str, lambda x: len(x) <= 200),
            "body": (str, lambda x: len(x) <= 10000),
        }
    }

    if tool_name not in schemas:
        raise ValueError(f"Unknown tool: {tool_name}")

    schema = schemas[tool_name]
    validated = {}
    for field, (expected_type, validator) in schema.items():
        if field not in args:
            raise ValueError(f"Missing required field: {field}")
        if not isinstance(args[field], expected_type):
            raise TypeError(f"Field {field} must be {expected_type}")
        if validator and not validator(args[field]):
            raise ValueError(f"Field {field} failed validation")
        validated[field] = args[field]

    return validated

Output validation. Check the LLM's output before acting on it. For structured outputs, validate against a schema. For natural language outputs, run secondary checks (another LLM call, regex patterns, business logic).
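For structured outputs, the schema check can be as simple as a parse followed by a field-and-type walk. A minimal sketch, assuming a hypothetical analysis payload with `summary`, `confidence`, and `citations` fields (the names and the confidence range are illustrative):

```python
import json

# Hypothetical schema for an analysis result; field names are illustrative.
REQUIRED_FIELDS = {"summary": str, "confidence": float, "citations": list}

def validate_analysis_output(raw: str) -> dict:
    """Parse and validate LLM output before any downstream action."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError as e:
        raise ValueError(f"Output is not valid JSON: {e}")
    for field_name, expected_type in REQUIRED_FIELDS.items():
        if field_name not in parsed:
            raise ValueError(f"Missing field: {field_name}")
        if not isinstance(parsed[field_name], expected_type):
            raise TypeError(f"Field {field_name} must be {expected_type.__name__}")
    # Business-logic check beyond types: confidence must be a valid probability.
    if not 0.0 <= parsed["confidence"] <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    return parsed
```

In practice a schema library (Pydantic, jsonschema) does this with less code; the point is that nothing downstream should ever see an unvalidated dict.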

Circuit breakers. If a tool is failing repeatedly, stop calling it. This prevents runaway costs and protects downstream systems from being hammered by a malfunctioning agent.

import time

class CircuitOpenError(Exception):
    """Raised when a call is blocked while the breaker is open."""

class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=60):
        self.failures = 0
        self.failure_threshold = failure_threshold
        self.last_failure_time = None
        self.recovery_timeout = recovery_timeout
        self.state = "closed"  # closed = normal, open = blocking

    def call(self, fn, *args, **kwargs):
        if self.state == "open":
            if time.time() - self.last_failure_time > self.recovery_timeout:
                self.state = "half-open"
            else:
                raise CircuitOpenError("Circuit breaker is open")

        try:
            result = fn(*args, **kwargs)
            if self.state == "half-open":
                self.state = "closed"
                self.failures = 0
            return result
        except Exception as e:
            self.failures += 1
            self.last_failure_time = time.time()
            if self.failures >= self.failure_threshold:
                self.state = "open"
            raise

Human escalation paths. Not every edge case can be handled automatically. Design explicit escalation paths for situations the agent can't resolve confidently. This is better than a hallucinated response.
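A sketch of one escalation shape, assuming results carry a `confidence` field and using an in-memory list as a stand-in for a real review queue or ticketing system; the threshold is an illustrative value:

```python
from dataclasses import dataclass

CONFIDENCE_THRESHOLD = 0.7  # illustrative; tune per workflow

@dataclass
class Escalation:
    workflow_id: str
    reason: str
    payload: dict

# Stand-in for a real review queue or ticketing integration.
review_queue: list[Escalation] = []

def act_or_escalate(workflow_id: str, result: dict) -> str:
    """Route low-confidence results to a human instead of acting on them."""
    if result.get("confidence", 0.0) < CONFIDENCE_THRESHOLD:
        review_queue.append(Escalation(workflow_id, "low confidence", result))
        return "escalated"
    # ... perform the downstream action here ...
    return "acted"
```

The important property is that escalation is a first-class outcome the orchestrator and metrics can see, not an exception swallowed somewhere in a tool wrapper.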

Observability: you can't fix what you can't see

The operational reality of multi-agent systems is that failures are often subtle, distributed, and difficult to reproduce. Good observability is what makes them debuggable.

Trace every step. Use OpenTelemetry or a dedicated LLM observability platform (LangSmith, Phoenix, Helicone) to capture every LLM call, tool call, and state transition. Each trace should include:

  • Input and output (full content, not just summaries)
  • Latency
  • Token usage and estimated cost
  • Tool name and arguments
  • Any errors or retries

import dataclasses
import opentelemetry.trace as trace

tracer = trace.get_tracer("agent-workflow")

def analyze(context: WorkflowContext) -> WorkflowContext:
    with tracer.start_as_current_span("analysis") as span:
        span.set_attribute("workflow.state", "analysis")
        span.set_attribute("input.length", len(str(context.retrieved_data)))

        try:
            result = llm_analyze(context.retrieved_data)
            span.set_attribute("output.tokens", result.usage.total_tokens)
            span.set_attribute("output.length", len(result.content))
            return dataclasses.replace(
                context,
                state=WorkflowState.VALIDATION,
                analysis_result=result.content
            )
        except Exception as e:
            span.record_exception(e)
            span.set_status(trace.StatusCode.ERROR)
            raise

Metrics to track in production:

  • Workflow completion rate (by workflow type)
  • Step success/failure rates
  • P50/P95/P99 latency per step
  • Token consumption per workflow run
  • Tool call failure rates
  • Guardrail trigger rates (how often are inputs/outputs being caught?)
  • Human escalation rate

Structured logging. Every log line should include the workflow ID, step name, and enough context to reconstruct what happened. This makes debugging a specific failed run much easier.
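One way to get that with only the standard library is a JSON formatter that pulls workflow fields off the log record; the field names here are illustrative:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line with workflow context attached."""
    def format(self, record):
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
            # Fields passed via logging's `extra` land as record attributes.
            "workflow_id": getattr(record, "workflow_id", None),
            "step": getattr(record, "step", None),
        }
        return json.dumps(payload)

logger = logging.getLogger("agent")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("tool call failed, retrying",
            extra={"workflow_id": "wf-123", "step": "data_retrieval"})
```

With every line carrying `workflow_id`, a single grep (or log-store query) reconstructs one run end to end.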

Sampling strategy for production. Logging every token of every LLM call is expensive. Use a tiered approach:

  • Log all metadata (latency, token counts, tool calls) for 100% of runs
  • Log full content for a random sample (5–10%)
  • Log full content for 100% of failed or escalated runs
  • Log full content for 100% of runs flagged by guardrails
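The tiers above can be sketched as a single routing function. Hashing the workflow ID (rather than calling `random`) makes the sampling decision deterministic, so every log line for a given run agrees on whether to keep full content; the rate is an illustrative value:

```python
import hashlib

FULL_CONTENT_SAMPLE_RATE = 0.10  # illustrative: keep 10% of healthy runs

def should_log_full_content(workflow_id: str, failed: bool = False,
                            escalated: bool = False,
                            guardrail_flagged: bool = False) -> bool:
    """Tiered sampling: always keep failures, escalations, and guardrail hits;
    deterministically sample the rest by hashing the workflow ID."""
    if failed or escalated or guardrail_flagged:
        return True
    digest = hashlib.sha256(workflow_id.encode()).digest()
    # Map the first 8 bytes to [0, 1) for a stable per-run decision.
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < FULL_CONTENT_SAMPLE_RATE
```

Metadata (latency, token counts, tool calls) still gets logged unconditionally; this function only gates the expensive full-content tier.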

The production readiness checklist

Before deploying an agent workflow to production:

Orchestration:

  • [ ] Explicit state machine with defined transitions
  • [ ] Checkpoint and resume capability for long workflows
  • [ ] Maximum iteration/step limits to prevent infinite loops
  • [ ] Timeout per step and per workflow
  • [ ] Graceful degradation path when the agent can't complete

Guardrails:

  • [ ] Input validation for all user-supplied content
  • [ ] Tool argument schemas with validation
  • [ ] Output validation before downstream action
  • [ ] Circuit breakers on external tool calls
  • [ ] Human escalation path for low-confidence outputs
  • [ ] Dry-run mode for testing without side effects

Observability:

  • [ ] Distributed tracing on all LLM and tool calls
  • [ ] Cost tracking per workflow run
  • [ ] Metrics on completion rate, latency, and tool failures
  • [ ] Structured logging with workflow context
  • [ ] Alerting on anomalous failure rates or costs
  • [ ] Dashboard for operational monitoring

Agent systems that are well-orchestrated, well-guarded, and well-observed behave differently from the ones that aren't: they fail loudly when something goes wrong, they recover gracefully when possible, and they produce debugging artifacts that make fixing things tractable. That's the gap between a demo and a production system.