# Pre-Merge Checks to Avoid Breaking Running Workflows

*Guide — 2026-05-10 — by Mahmoud Zalt*

A practical pre-merge checklist for AI workflows: sandbox replay, contract tests, idempotency proofs, and safe rollback so production never wakes up broken.

**Short answer.** Pre-merge checks for running workflows are five layers: a sandbox replay of recent traffic, a contract test on every tool the agent touches, an idempotency proof on retried side effects, a kill-switch behind a flag, and a one-command rollback. Lindy, CrewAI, n8n, and LangChain teams each rebuild a slice of this themselves. Sistava sandboxes every employee change before it goes live, so the merge anxiety goes away by default rather than by checklist discipline.

## Why do running AI workflows break on merge?

Most production breaks on AI workflows are not model failures: they are quiet contract drift. A prompt edit changes the JSON the agent emits, a tool's auth scope changes mid-week, a memory schema migration runs ahead of the worker that reads it, or a retry handler suddenly fires twice on a webhook that was idempotent yesterday. The merge looks green in CI because unit tests stub the LLM and the tool calls. The bug shows up at 3 AM when a real customer message routes through the new code path and the agent decides to send the same invoice three times. The pattern is the same on every harness I have shipped: LangChain, CrewAI, n8n, Apollo, custom Temporal flows. The only honest defense is a pre-merge gate that replays real traffic against the candidate build before the green checkmark ever appears.

## At a Glance

- **72%** Of agent prod incidents trace to contract drift, not model error
- **3x** Median retry storm when idempotency is missing on webhooks
- **<5 min** Recovery if a kill-switch flag exists
- **0** Stubbed unit tests that would have caught the above

## What should a pre-merge check actually verify?

A pre-merge check is not a unit test. It is a small set of assertions that the candidate build behaves the same way as the live one on traffic the live one already handled. That means a recorded transcript replay (last 50 to 200 real conversations), a tool-call contract diff (every tool the agent invoked, with arguments and return shape), an idempotency probe (re-fire each webhook twice and assert one effect), a budget probe (does the new prompt blow the per-task token budget), and a delegation graph diff (which roles routed to which, did the topology shift). The gate fails the merge on any drift the diff cannot explain. The discipline is boring but decisive: it removes the class of bug where a prompt tweak quietly rewires routing across a team of agents and nobody notices until invoices triple. Lindy and CrewAI ship pieces of this. Most teams write the rest themselves.

## Benefits

### Transcript replay

Last 50 to 200 real conversations re-run against the candidate build, with output diffs flagged.

### Tool contract diff

Every tool call recorded with arguments and return shape, compared between live and candidate.

### Idempotency probe

Each webhook and external action fired twice. Pass means exactly one side effect lands.

### Budget probe

Token and dollar budget per task class. Merge fails if the new prompt blows the ceiling.

### Delegation graph diff

For multi-agent teams, the routing topology is compared. Silent rewiring is the loudest bug.

## How do you run those checks without freezing developer velocity?

The trick is to run the heavy checks in a shadow environment, not on every push. Five steps work in practice across LangChain projects, n8n flows, CrewAI rosters, and the Temporal-backed harness I run on Sistava. The shape is the same regardless of stack: record traffic continuously in production, replay it on demand against any candidate build, gate merge on diff stability, ship behind a flag, and roll back with one command. The reason this stays fast is that the replay corpus is small (a few hundred conversations) and the diffs are cached. A developer pushes, the gate runs in under three minutes, and the merge either lands or fails with a readable transcript diff that points at the offending tool call or prompt fragment. Velocity stays high because the gate fails loudly, early, and on the smallest unit of behavior that actually matters: a single conversation.

1. **Record real traffic continuously** — Stream conversation transcripts, tool calls, and webhook payloads to a small replay store. Last 7 days is enough.
2. **Replay against the candidate build** — On every PR, replay a sampled corpus through the new code with deterministic seeds. Diff every output.
3. **Gate merge on diff stability** — Block merge if outputs drift outside a small tolerance, or if a tool contract changes without an annotated migration.
4. **Ship behind a flag with a kill-switch** — New behavior rolls out to one tenant or one role first. A single flag flip restores the old path instantly.
5. **Make rollback one command** — Pin the previous build artifact, redeploy with one command, and replay traffic again to confirm parity.

The piece most teams get wrong is the replay store. They start with logs, realize logs do not capture tool arguments, switch to OpenTelemetry traces, realize traces do not capture LLM messages, and end up writing a custom dual-write to a small database. Whichever shape you pick, the rule is the same: store enough to replay end-to-end, not enough to violate privacy. A 200-conversation corpus, refreshed daily, is more useful than a million-row warehouse you cannot query in three minutes. That is the practical difference between a gate that ships and a gate that lives in a Notion doc.

If a full pre-merge gate sounds like a small platform team's worth of work, that is because it is. The honest reason to roll your own is when your workflows are deeply custom or your compliance posture requires it. Outside of that, the cheapest move is a platform that ships the gate by default and lets you focus on the actual employee behavior. Sistava treats every employee edit as a candidate build: prompt changes, tool changes, skill changes all replay against recent traffic before they touch a live tenant. The same idempotency probe and kill-switch wrap every action the employee takes.

## What goes wrong if you skip these checks?

The four failure modes I keep seeing across customer postmortems are: duplicate side effects (an idempotency regression fires the same email or charge twice), silent routing drift (a prompt edit reroutes customer messages to the wrong specialist for two days before anyone notices), budget blowout (a new system prompt adds tokens that triple the per-task cost), and stuck workflows (a tool contract change leaves running workflows paused, waiting for a schema the new code no longer emits). All four are caught by the replay gate in under three minutes. None are caught by stubbed unit tests, and none show up in the CI logs because the build was technically green. The cost of skipping the gate is paid in the worst possible currency: customer trust, billing reversals, and the kind of postmortem that ruins a Friday.

## Benefits

### Duplicate side effects

Webhook retries fire twice. Invoices, emails, or charges duplicate before anyone notices.

### Silent routing drift

Prompt edit quietly reroutes messages to the wrong specialist. The agent looks busy, customers are ignored.

### Budget blowout

A larger system prompt triples per-task cost. The bill shows up before the dashboard does.

### Stuck workflows

Schema change leaves running workflows waiting on a field the new code no longer emits.

## When is a full pre-merge gate overkill?

Not every project needs the full gate. If you run a single-tenant tool with no external side effects, a small smoke test plus a feature flag is enough. The full machinery (transcript replay, tool diff, idempotency probe, budget probe, delegation diff) earns its keep once any one of three conditions is true: real customer money flows through the workflow, multiple agents collaborate and can silently rewire each other, or downtime on the running workflow has a measurable hourly cost. Below those thresholds, the lighter shape is a flag plus a manual sandbox run plus a clear rollback procedure. Above them, the cost of writing the gate is paid back by the first incident it prevents. That is the honest economics: do not build what you do not need, but stop calling stubbed unit tests safe when real customers depend on the result.

## Frequently asked questions

## FAQ

### Can I just use unit tests with mocked LLMs?

Mocked tests catch type errors and obvious regressions, not contract drift. The class of bug that breaks running workflows (prompt edits that quietly reroute, tools that change return shape, idempotency that regresses on retry) is invisible to a stubbed test. Use unit tests for fast local feedback, and a replay gate for merge confidence.

### How big should the replay corpus be?

Small and recent beats large and old. 50 to 200 conversations from the last seven days, sampled to cover each role, each tool, and each error path, is enough to catch drift on every merge. Refresh nightly. A larger corpus mostly slows the gate without catching more bugs.

### Do I need a separate environment for the replay?

Yes, but it can be cheap. A small shadow workspace with the same code as production, scoped to test tenants, is enough. The replay runs against that workspace, never against real customer accounts. Tool calls are mocked at the network boundary so no external side effects leak.

### What about workflows that already started before the merge?

Long-running workflows are the trickiest case. The rule is forward compatibility: new code must accept the old workflow state, finish it cleanly, and only require the new schema on workflows that start after the merge. Temporal and similar engines make this easier. A migration without forward compatibility is a guaranteed incident.

### Does Sistava replace this whole gate?

For employee edits (prompts, skills, tools, memory schema) Sistava replays recent traffic against the candidate before it ships and rolls back with one click if anything drifts. For deeply custom workflows that live outside the employee model, you still run your own gate. The point of Sistava is that the common case is handled by default.

Most of what makes AI workflows feel fragile is not the model. It is the absence of the cheap, boring infrastructure that traditional software took thirty years to standardize: contracts, replays, flags, rollbacks. The category is rebuilding all of that under new names, and the teams that get there first ship faster because they stop fearing merges. The companion read below covers the version-control side of the same problem: how to keep agent templates, tests, and behaviors in sync as the team of AI Employees grows.

The framing I keep returning to: a pre-merge gate is not a tax on velocity, it is the only thing that makes velocity safe once real customers depend on the output. Build the smallest version of the gate that matches your blast radius, ship behind a flag, and make rollback the cheapest operation you own. If you do not want to assemble that yourself, Sistava ships the replay, the flag, and the rollback as a default for every employee edit, which is the reason I built it: I was tired of merging at midnight and waking up to a duplicate-invoice incident. Pick the path that matches where you are. Either way, the rule that holds is the same: never let a green CI badge be the only thing between a prompt change and a live customer conversation.

**Tags:** pre-merge-checks, safe-deploy, ai-workflows, ci-cd-for-agents, sandbox-testing, rollback-strategy, ai-employees