Rehearse your agent
before you let it act
Run deterministic scenario sweeps, compare strategies under fixed conditions, and export replay bundles you can actually trust.
The Problem
Lucky demos don’t prove production readiness
Without a rehearsal layer, agent systems jump from prompt experiments straight to production. Tyche creates the missing middle: a repeatable, measurable environment where decisions, memory, and evaluator outcomes can be inspected and rerun.
Two transcripts, side by side.
Left, titled Lucky demo, March 12: seed default; search vendor X invoice; summarize total $4,200 due March 30; schedule payment; verdict PASS.
Right, titled Same code, rerun March 14: same seed and same steps, but the summary line changes to total $42,000 due March 30, and the verdict still says PASS.
The message below reads: Tyche fixes this with seeds, replay manifests, and reproducible verdicts.
How It Works
From scenario to evidence in three steps
Scenario Pack
Define the environment, starting state, tools, memory settings, and scoring rules for the run.
Sweeps + Comparison
Run the same scenario across prompts, models, policies, or tool chains under controlled conditions.
Replay Bundle
Export deterministic run evidence with state snapshots, decisions, and outcomes for review or postmortem.
Replay bundle file tree:
  manifest.json (seed 0xA3F1, scenario version 1.4)
  transcripts: run_01 through run_03
  state snapshots: turn_00 through turn_12
  scorecard.csv
  comparison.html
  README.md
Compare-runs table:
  strategy               accuracy  cost   steps  failures
  gpt-4o + aggressive    0.87      $0.43  8      1
  claude + conservative  0.92      $1.20  12     0         (winner)
  llama-local + default  0.71      $0.08  6      3
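To make the comparison step concrete, here is a minimal Python sketch that ranks the three strategies from the compare-runs table. The `RunResult` type and the ranking rule (fewest failures, then highest accuracy, then lowest cost) are assumptions made for illustration; the page does not say how Tyche actually picks a winner.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunResult:
    """One row of the compare-runs table (hypothetical type name)."""
    strategy: str
    accuracy: float
    cost_usd: float
    steps: int
    failures: int


# Values copied from the compare-runs table above.
RESULTS = [
    RunResult("gpt-4o + aggressive", 0.87, 0.43, 8, 1),
    RunResult("claude + conservative", 0.92, 1.20, 12, 0),
    RunResult("llama-local + default", 0.71, 0.08, 6, 3),
]


def pick_winner(results):
    """Assumed rule: fewest failures, then highest accuracy, then lowest cost."""
    return min(results, key=lambda r: (r.failures, -r.accuracy, r.cost_usd))


if __name__ == "__main__":
    print(pick_winner(RESULTS).strategy)  # claude + conservative
```

Under this assumed rule the cheapest strategy loses because it fails most often, which matches the winner marked in the table. A team that weights cost more heavily would swap the order of the sort key.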
Capabilities
What Tyche gives your team
Deterministic seeds and loop controls
Runs carry seeds, scenario versions, adapter versions, and replay manifests so results can be reproduced — not just described.
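As a rough illustration of why a seed plus a manifest makes a run reproducible, the Python sketch below drives a stand-in agent loop from one seeded generator and records a digest of the resulting trace. Everything here is hypothetical: `run`, `make_manifest`, `replay_matches`, the action names, and the manifest fields are not Tyche's API. Real model calls would also need their own determinism controls, such as recorded responses or fixed sampling parameters.

```python
import hashlib
import json
import random

ACTIONS = ["search", "summarize", "schedule_payment"]  # hypothetical action names


def run(seed: int, steps: int = 5) -> list[str]:
    """Stand-in for an agent loop: every random choice comes from one seeded RNG."""
    rng = random.Random(seed)  # private generator; global random state is untouched
    return [rng.choice(ACTIONS) for _ in range(steps)]


def trace_digest(trace: list[str]) -> str:
    """Stable SHA-256 digest of a trace."""
    return hashlib.sha256(json.dumps(trace).encode("utf-8")).hexdigest()


def make_manifest(seed: int, scenario_version: str, trace: list[str]) -> dict:
    """Record what is needed to rerun the scenario and check the result."""
    return {
        "seed": seed,
        "scenario_version": scenario_version,
        "trace_sha256": trace_digest(trace),
    }


def replay_matches(manifest: dict, steps: int = 5) -> bool:
    """Rerun from the recorded seed and compare against the recorded digest."""
    return trace_digest(run(manifest["seed"], steps)) == manifest["trace_sha256"]


if __name__ == "__main__":
    manifest = make_manifest(0xA3F1, "1.4", run(0xA3F1))
    print(replay_matches(manifest))  # True
```

The point of the digest is that a reviewer does not have to trust a description of the run: rerunning from the manifest either reproduces the trace or it does not.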
Scenario packs and fixtures
Versioned definitions of actors, tools, environment rules, start states, stop conditions, and evaluator criteria. Shareable, reviewable, diffable.
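One way to picture a versioned, diffable scenario pack is as a small immutable record with a content fingerprint. The Python sketch below assumes hypothetical field names (`start_state`, `stop_conditions`, `evaluator_criteria`, and so on) and example values; it is not Tyche's actual schema.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ScenarioPack:
    """Hypothetical scenario pack; field names are illustrative only."""
    name: str
    version: str
    actors: tuple[str, ...]
    tools: tuple[str, ...]
    start_state: dict
    stop_conditions: tuple[str, ...]
    evaluator_criteria: dict

    def fingerprint(self) -> str:
        """Content hash: any change to the definition changes the fingerprint."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()


# Example values are made up for illustration.
pack = ScenarioPack(
    name="invoice-review",
    version="0.1",
    actors=("agent", "vendor"),
    tools=("search", "summarize", "schedule_payment"),
    start_state={"open_invoices": 1},
    stop_conditions=("payment_scheduled", "max_turns_reached"),
    evaluator_criteria={"total_matches_invoice": True},
)

if __name__ == "__main__":
    print(pack.name, pack.version, pack.fingerprint()[:12])
```

Because the fingerprint is derived from sorted JSON, two reviewers who hold the same definition get the same hash, and any edit to a rule or start state shows up as a different one.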
Replay bundles with evidence
Run metadata, scoring, state snapshots, and enough context to explain the result and justify the decision to widen autonomy.
Token and context accounting
Memory budgets, context windows, and cost are visible per run, not mystical. Know what each strategy costs before the production bill tells you.
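Per-run accounting can be as simple as a ledger that adds up tokens and converts them to cost. The Python sketch below is an assumption-laden illustration: `RunLedger`, its fields, and the price passed in are hypothetical, and real pricing usually differs between prompt and completion tokens.

```python
from dataclasses import dataclass


@dataclass
class RunLedger:
    """Hypothetical per-run ledger for token usage and cost."""
    token_budget: int          # assumed cap on total tokens for the run
    usd_per_1k_tokens: float   # assumed flat price; supply your own rate
    tokens_used: int = 0

    def record(self, prompt_tokens: int, completion_tokens: int) -> None:
        """Add one model call to the running total."""
        self.tokens_used += prompt_tokens + completion_tokens

    @property
    def cost_usd(self) -> float:
        return self.tokens_used / 1000 * self.usd_per_1k_tokens

    @property
    def over_budget(self) -> bool:
        return self.tokens_used > self.token_budget


if __name__ == "__main__":
    ledger = RunLedger(token_budget=1000, usd_per_1k_tokens=0.5)
    ledger.record(prompt_tokens=300, completion_tokens=200)
    print(ledger.tokens_used, ledger.over_budget)
```

Keeping one ledger per run is what lets a scorecard show cost next to accuracy for each strategy.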
Hardware-neutral runners
API runners first, with local and self-hosted options as deployment choices, not the product definition. No hardware shopping list required.
Before and after production
Pre-production rehearsal and post-incident reconstruction use the same primitives. One tool for both confidence and accountability.
Use Cases
Where Tyche creates the most value
Pre-production rehearsal
Test whether an agent workflow behaves acceptably before it is allowed anywhere near live systems.
Post-incident replay
An approved agent sent the wrong vendor message on a Tuesday. The team grabs the trace, feeds its seed and scenario version into Tyche, reruns with alternate prompts, and within an afternoon has three candidate fixes, a scorecard comparing them, and a replay bundle the incident review can cite. The patched scenario becomes the next regression test.
Strategy comparison
Measure multiple prompts, models, or tool chains under the same conditions instead of arguing from vibes.
Cost and privacy tuning
Use local or self-hosted runners where the economics or data sensitivity justify it, without making hardware the core story.
Suite Fit
Better together with Forseti
Forseti tells you whether an agent may act. Tyche tells you how that agent is likely to behave before you let it act. Together they form a credible enterprise control and rehearsal story. Winning policies from Tyche runs can graduate directly into Forseti policy packs.
Timeline:
March 2: Tyche scenario authored for invoice-review version 0.1.
March 5: Tyche sweep complete with 24 runs; claude-conservative wins.
March 6: Forseti policy extracted. Pay requires approval, and payments of $5,000 or more need 2 C-level approvals or 10 member approvals.
March 12: Forseti live in production; first governed intent released.
March 18: Tyche incident replay reconstructs a denied intent and patches the scenario.
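The March 6 policy in the timeline can be expressed as a small rule check. The Python sketch below is illustrative only: `PaymentIntent` and `may_pay` are hypothetical names, not Forseti's policy format, and the handling of payments under $5,000 (any single approval is enough) is an assumption, since the timeline only says that pay requires approval.

```python
from dataclasses import dataclass

THRESHOLD_USD = 5_000  # from the timeline: stricter rule at $5,000 or more


@dataclass(frozen=True)
class PaymentIntent:
    """Hypothetical payment request with the approvals collected so far."""
    amount_usd: float
    c_level_approvals: int = 0
    member_approvals: int = 0


def may_pay(intent: PaymentIntent) -> bool:
    """Return True if the intent satisfies the approval rule."""
    if intent.amount_usd >= THRESHOLD_USD:
        # 2 C-level approvals OR 10 member approvals
        return intent.c_level_approvals >= 2 or intent.member_approvals >= 10
    # Assumption: below the threshold, one approval of either kind suffices.
    return intent.c_level_approvals + intent.member_approvals >= 1


if __name__ == "__main__":
    print(may_pay(PaymentIntent(4_200, member_approvals=1)))     # True
    print(may_pay(PaymentIntent(42_000, c_level_approvals=1)))   # False
```

With a rule like this in place, the two totals from the lucky-demo transcripts take different paths: the $4,200 payment clears with one approval, while the $42,000 one is held until the stricter threshold is met.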
Get Started
Bring one workflow or one incident. Leave with a replay bundle.
A Tyche discovery sprint is 1–2 weeks. We take one high-value scenario or one real incident, turn it into a seeded, reproducible simulation, and hand back a replay bundle your team can open, rerun, and cite. If the problem actually belongs upstream, we’ll say so.