Caimeo Tyche

Rehearse your agent before you let it act

Run deterministic scenario sweeps, compare strategies under fixed conditions, and export replay bundles you can actually trust.

Lucky demos don’t prove production readiness

Without a rehearsal layer, agent systems jump from prompt experiments straight to production. Tyche creates the missing middle: a repeatable, measurable environment where decisions, memory, and evaluator outcomes can be inspected and rerun.

Two transcript cards comparing a lucky Tyche demo run with a rerun where the invoice amount changes from 4,200 to 42,000.

The left transcript is titled Lucky demo, March 12. It uses seed default, searches vendor X invoice, summarizes total 4,200 dollars due March 30, schedules payment, and reports verdict PASS. The right transcript is titled Same code, rerun March 14. It has the same seed and steps, but the summary line changes to total 42,000 dollars due March 30 while the verdict still says PASS. The message below says Tyche fixes this with seeds, replay manifests, and reproducible verdicts.

Same agent. Same code. Two runs. One bug nobody noticed.
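One way to catch the kind of silent drift shown in the two transcripts is to fingerprint each transcript and compare reruns that share a seed. The sketch below is illustrative only: the `fingerprint` and `diverging_steps` helpers and the list-of-strings transcript layout are assumptions made for this example, not Tyche's actual API.

```python
import hashlib
import json


def fingerprint(transcript: list[str]) -> str:
    """Return a stable SHA-256 digest of a transcript (a list of step strings)."""
    payload = json.dumps(transcript, ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def diverging_steps(a: list[str], b: list[str]) -> list[int]:
    """Return the indices at which two transcripts differ."""
    longest = max(len(a), len(b))
    return [
        i for i in range(longest)
        if i >= len(a) or i >= len(b) or a[i] != b[i]
    ]


# Steps paraphrased from the two transcript cards described above.
lucky_demo = [
    "search: vendor X invoice",
    "summarize: total $4,200 due March 30",
    "schedule payment",
    "verdict: PASS",
]
rerun = [
    "search: vendor X invoice",
    "summarize: total $42,000 due March 30",
    "schedule payment",
    "verdict: PASS",
]

reproduced = fingerprint(lucky_demo) == fingerprint(rerun)
print("reproduced:", reproduced)
print("diverging step indices:", diverging_steps(lucky_demo, rerun))
```

Both runs report PASS, so a verdict-only check would miss the change. Comparing whole-transcript fingerprints flags the rerun, and the step diff points at the summary line.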

From scenario to evidence in three steps

1. Scenario Pack

Define the environment, starting state, tools, memory settings, and scoring rules for the run.

2. Sweeps + Comparison

Run the same scenario across prompts, models, policies, or tool chains under controlled conditions.

3. Replay Bundle

Export deterministic run evidence with state snapshots, decisions, and outcomes for review or postmortem.

Tyche replay bundle image showing a file tree and compare-runs table with strategy, accuracy, cost, steps, and failures.

Replay bundle file tree: manifest.json with seed 0xA3F1 and scenario version 1.4, transcripts run_01 through run_03, state snapshots turn_00 and turn_01 through turn_12, scorecard.csv, comparison.html, and README.md. Compare-runs table columns are strategy, accuracy, cost, steps, and failures. gpt-4o plus aggressive scores 0.87, costs 0.43 dollars, 8 steps, 1 failure. claude plus conservative scores 0.92, costs 1.20 dollars, 12 steps, 0 failures and is marked as winner. llama-local plus default scores 0.71, costs 0.08 dollars, 6 steps, 3 failures.

The replay bundle — file tree + compare-runs grid, exactly as delivered
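To make the compare-runs grid concrete, here is a minimal sketch that ranks the three strategies using the figures from the table described above. The `RunResult` type and the ranking rule (fewest failures, then highest accuracy, then lowest cost) are assumptions for illustration; the source does not state how Tyche picks a winner.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunResult:
    strategy: str
    accuracy: float
    cost_usd: float
    steps: int
    failures: int


# Values copied from the compare-runs table described above.
results = [
    RunResult("gpt-4o + aggressive", 0.87, 0.43, 8, 1),
    RunResult("claude + conservative", 0.92, 1.20, 12, 0),
    RunResult("llama-local + default", 0.71, 0.08, 6, 3),
]


def rank(runs):
    # Assumed rule: fewest failures first, then highest accuracy, then lowest cost.
    return sorted(runs, key=lambda r: (r.failures, -r.accuracy, r.cost_usd))


winner = rank(results)[0]
print(winner.strategy)
```

Under this assumed rule the result agrees with the table, which marks claude plus conservative as the winner. A team that weights cost more heavily would encode a different sort key.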

What Tyche gives your team

Deterministic seeds and loop controls

Runs carry seeds, scenario versions, adapter versions, and replay manifests so results can be reproduced — not just described.
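The idea can be sketched in a few lines: route every random choice through one seeded generator, record the seed and scenario version in a manifest, and rerun from that manifest. The `run_scenario` function and the action names are hypothetical; the seed and scenario version are the ones shown in the replay bundle manifest above.

```python
import random


def run_scenario(seed: int, scenario_version: str, turns: int = 5) -> dict:
    """Toy run in which every random choice comes from one seeded generator."""
    rng = random.Random(seed)  # isolated generator, no shared global state
    actions = ["search", "summarize", "ask_human", "schedule_payment"]
    decisions = [rng.choice(actions) for _ in range(turns)]
    return {
        "manifest": {"seed": seed, "scenario_version": scenario_version},
        "decisions": decisions,
    }


first = run_scenario(seed=0xA3F1, scenario_version="1.4")
# Rerun using only what the manifest recorded.
replay = run_scenario(**first["manifest"])
print("identical replay:", first == replay)
```

A real agent run has more sources of nondeterminism than a local random generator (model sampling, tool latency, external data), which is why the manifest also needs adapter versions and fixtures.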

Scenario packs and fixtures

Versioned definitions of actors, tools, environment rules, start states, stop conditions, and evaluator criteria. Sharable, reviewable, diffable.
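As a rough sketch of what "diffable" could look like, the example below models a scenario pack as a frozen dataclass, serializes it with sorted keys, and diffs two versions. The `ScenarioPack` fields mirror the list above, but the class, the tool names, and the stop conditions are invented for illustration and are not Tyche's schema.

```python
import difflib
import json
from dataclasses import asdict, dataclass, field, replace


@dataclass(frozen=True)
class ScenarioPack:
    name: str
    version: str
    actors: tuple[str, ...]
    tools: tuple[str, ...]
    start_state: dict = field(default_factory=dict)
    stop_conditions: tuple[str, ...] = ()
    evaluator_criteria: tuple[str, ...] = ()

    def to_json(self) -> str:
        # Sorted keys and fixed indentation keep diffs stable across exports.
        return json.dumps(asdict(self), indent=2, sort_keys=True)


v1 = ScenarioPack(
    name="invoice-review",
    version="0.1",
    actors=("agent", "approver"),
    tools=("search_invoices", "schedule_payment"),  # hypothetical tool names
    start_state={"vendor": "X"},
    stop_conditions=("payment_scheduled",),
    evaluator_criteria=("amount_matches_invoice",),
)
# A reviewed change: bump the version and add a turn limit.
v2 = replace(v1, version="0.2",
             stop_conditions=("payment_scheduled", "max_turns_12"))

diff = list(difflib.unified_diff(
    v1.to_json().splitlines(), v2.to_json().splitlines(),
    fromfile="v0.1", tofile="v0.2", lineterm="",
))
print("\n".join(diff))
```

Because the serialized form is deterministic, the diff shows only the version bump and the added stop condition, which is the property a reviewer needs.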

Replay bundles with evidence

Run metadata, scoring, state snapshots, and enough context to explain the result and justify the decision to widen autonomy.

Token and context accounting

Memory budgets, context windows, and cost are visible per-run, not mystical. Know what each strategy costs before production does.
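Per-run accounting can be as simple as a ledger that every model call reports into. The sketch below is a minimal illustration: the `RunLedger` class, the context window size, and the per-token prices are placeholder assumptions, not real provider pricing or Tyche internals.

```python
from dataclasses import dataclass


@dataclass
class RunLedger:
    context_window: int        # max prompt tokens per call (assumed limit)
    usd_per_1k_input: float    # placeholder price
    usd_per_1k_output: float   # placeholder price
    input_tokens: int = 0
    output_tokens: int = 0
    peak_context: int = 0

    def record(self, prompt_tokens: int, completion_tokens: int) -> None:
        """Account for one model call; reject prompts over the window."""
        if prompt_tokens > self.context_window:
            raise ValueError("prompt exceeds the context window")
        self.input_tokens += prompt_tokens
        self.output_tokens += completion_tokens
        self.peak_context = max(self.peak_context, prompt_tokens)

    @property
    def cost_usd(self) -> float:
        return (self.input_tokens / 1000 * self.usd_per_1k_input
                + self.output_tokens / 1000 * self.usd_per_1k_output)


ledger = RunLedger(context_window=8000,
                   usd_per_1k_input=0.01, usd_per_1k_output=0.03)
ledger.record(prompt_tokens=1200, completion_tokens=300)
ledger.record(prompt_tokens=2500, completion_tokens=500)
print(ledger.input_tokens, ledger.output_tokens,
      ledger.peak_context, round(ledger.cost_usd, 4))
```

Keeping one ledger per run is what lets a comparison table report cost next to accuracy for each strategy.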

Hardware-neutral runners

API runners come first. Local and self-hosted runners are deployment choices, not the product definition. No hardware shopping list required.

Before and after production

Pre-production rehearsal and post-incident reconstruction use the same primitives. One tool for both confidence and accountability.

Where Tyche creates the most value

Pre-production rehearsal

Test whether an agent workflow behaves acceptably before it is allowed anywhere near live systems.

Post-incident replay

An approved agent sent the wrong vendor message on a Tuesday. The team grabs the trace, feeds its seed and scenario version into Tyche, reruns with alternate prompts, and within an afternoon has three candidate fixes, a scorecard comparing them, and a replay bundle the incident review can cite. The patched scenario becomes the next regression test.

Strategy comparison

Measure multiple prompts, models, or tool chains under the same conditions instead of arguing from vibes.

Cost and privacy tuning

Use local or self-hosted runners where the economics or data sensitivity justify it, without making hardware the core story.

Better together with Forseti

Forseti tells you whether an agent may act. Tyche tells you how that agent is likely to behave before you let it act. Together they form a credible enterprise control and rehearsal story. Winning policies from Tyche runs can graduate directly into Forseti policy packs.

Linear Tyche and Forseti timeline showing scenario authoring, sweep completion, policy extraction, production release, and incident replay.

Timeline: March 2, Tyche scenario authored for invoice-review version 0.1. March 5, Tyche sweep complete with 24 runs and claude-conservative winning. March 6, Forseti policy extracted: pay requires approval and payments of 5,000 dollars or more need 2 C-level approvals or 10 member approvals. March 12, Forseti live in production and first governed intent released. March 18, Tyche incident replay reconstructs a denied intent and patches the scenario.

Two products, one timeline — how a policy actually travels from Tyche into Forseti and back
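The policy extracted on March 6 can be read as a small predicate. The sketch below encodes it in plain Python; the function name and signature are hypothetical, and the handling of payments under 5,000 dollars (any single approval is enough) is an assumption, since the timeline only says that pay requires approval.

```python
HIGH_VALUE_THRESHOLD_USD = 5_000  # from the timeline: 5,000 dollars or more


def payment_may_proceed(amount_usd: float,
                        c_level_approvals: int,
                        member_approvals: int) -> bool:
    """Return True if the approvals collected satisfy the extracted policy."""
    if amount_usd >= HIGH_VALUE_THRESHOLD_USD:
        # High-value payments: 2 C-level approvals or 10 member approvals.
        return c_level_approvals >= 2 or member_approvals >= 10
    # Assumption: below the threshold, any single approval is sufficient.
    return (c_level_approvals + member_approvals) >= 1


# The 42,000 dollar invoice from the rerun transcript, with one C-level sign-off.
print(payment_may_proceed(42_000, c_level_approvals=1, member_approvals=0))
# The same invoice once a second C-level approval arrives.
print(payment_may_proceed(42_000, c_level_approvals=2, member_approvals=0))
```

A rule this small is easy to replay against recorded scenarios, which is how a denied intent like the one on March 18 could be reconstructed and turned into a regression case.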

Common questions

You do not need local hardware to start. API-backed runners are enough for the first pilots. Local hardware is an optional optimization path, not the product definition.
Tyche is not a model-training tool. Its core job is rehearsal, replay, comparison, and evidence generation around agent behavior.
Tyche and Forseti work together. The strongest story is Tyche before production for rehearsal, Forseti at the execution boundary for governance, and Tyche again for replay or postmortem after incidents.
A discovery sprint delivers one scenario family, one scoring rubric, one comparison pack, and a replay bundle fit for operator review. Most discovery sprints run 1–2 weeks.

Bring one workflow or one incident. Leave with a replay bundle.

A Tyche discovery sprint is 1–2 weeks. We take one high-value scenario or one real incident, turn it into a seeded, reproducible simulation, and hand back a replay bundle your team can open, rerun, and cite. If the problem actually belongs upstream, we’ll say so.