WDK gaps and how xFlow addresses each
Concrete pain points with Vercel's Workflow Development Kit and Worlds: hard limits, structural gaps, and operational sharp edges, each paired with working xFlow code that addresses it. Citations are inline and specific.
How to read this
Each case study is one specific gap, with citations and code.
Categories: hard limits are caps documented in WDK's pricing/foundations pages. Structural gaps are properties of the WDK execution model. Operational items are sharp edges teams hit in practice; issue numbers are cited where applicable.
Three categories
Documented platform caps: replay duration, run size, function duration, region pinning, retention windows. Mitigations exist; they push complexity onto you.
Properties of the execution model: single orchestrator, no browser participation, no first-class step placement, definitions as compiled artifacts. Not patchable without rewriting the engine.
Sharp edges teams hit in practice: determinism drift, bundler fragility, retries silently dropped. Open GitHub issues are cited.
Hard limit
240-second orchestrator replay cap
Every WDK orchestrator must replay deterministically from its event history within 240 seconds. Past ~2,000 events the docs explicitly recommend splitting into child workflows. Long agent loops, tool-heavy chats, and high-fan-out runs hit this ceiling and stop progressing.
WDK behavior
The orchestrator function is re-executed against the event journal on every resume. Replay must complete within 240 seconds, and per-step latency grows with the completed-step count. The workaround is to shard the run into child workflows, which fragments the audit trail and makes resumption messier.
// "use workflow" body re-runs from the top on every resume.
// Each await is an event; the sandbox replays the function until
// it reaches the next un-journaled await. 5,000 events โ near cap.
async function longAgent(input: AgentInput) {
"use workflow"
for (const tool of plan(input)) {
const result = await callTool(tool) // event
await persistTrace(result) // event
if (await needsApproval(result)) { // event
await waitForApproval() // event (suspend)
}
}
}xFlow's structural answer
xFlow has no orchestrator-replay step. The current FlowRun is a typed reducer over the event log, computed once per state change. The scheduler is a pure function over the current FlowRun + WorkflowDefinition. No 240s cap, no growth-with-history orchestrator cost, no determinism sandbox to escape from.
import { defineWorkflow, defineStep, link } from "@decoperations/xflow-core"

// Definitions are data. The runtime never re-runs the workflow body.
// On each appended event the substrate updates the FlowRun via reduceFlowRun;
// the scheduler reads current state and decides what's next.
const longAgent = defineWorkflow({
  id: "agent.long",
  version: "1.0.0",
  steps: {
    plan: defineStep({ id: "plan", type: "agent.plan", sideEffects: { kind: "external", idempotencyRequired: true } }),
    execute: defineStep({ id: "execute", type: "agent.execute", sideEffects: { kind: "external", idempotencyRequired: true } }),
    persist: defineStep({ id: "persist", type: "agent.persist", sideEffects: { kind: "idempotent", idempotencyRequired: true } }),
  },
  links: [
    link("plan", "execute", { when: { type: "step.succeeded" } }),
    link("execute", "persist", { when: { type: "step.succeeded" } }),
  ],
})

// 100,000 events on the run? Reducer cost is linear, runs once per change,
// no determinism sandbox, no replay timer. View is computed by the substrate
// (or cached). The scheduler reads it, picks the next step, fires it.
Coexistence vs replacement
Coexist: keep WDK orchestrators for short, deterministic flows; use xFlow for long-tail agent runs and high-fan-out workflows. Replace: when most of your runs are agent-shaped, the replay model is more cost than benefit.
Hard limit
Per-step function duration ceiling (300s default, 800s max)
WDK steps run as Vercel Functions (or equivalent under self-hosted Worlds). Hobby caps at 300s; Pro and Enterprise top out at 800s on Fluid Compute. Anything longer (ffmpeg transcodes, GPU inference, large Playwright runs) has to be wrapped in a webhook or queue boundary, even though the workflow already provides one.
WDK behavior
A step is a function invocation. Long-running compute means the step submits work to a worker and returns, and the workflow then waits for a webhook to resume. You're hand-rolling the durable boundary the workflow was supposed to give you.
xFlow's structural answer
A step's executor lives on whichever peer registers it. A Docker, VPS, or GPU peer with the right placement runs the step locally, with no function-runtime duration cap. The substrate-level claim mode (lease, with TTL renewal) covers the durability story.
defineStep({
  id: "transcode",
  type: "media.transcode",
  // Only Docker peers (with ffmpeg) compete for this step
  placement: { required: ["docker", "ffmpeg-binary"] },
  // Lease-mode claim with hourly heartbeat
  claim: { mode: "lease", ttlMs: 60 * 60_000 },
  sideEffects: { kind: "external", idempotencyRequired: true },
})

// On the Docker box:
import { createXFlowRuntime } from "@decoperations/xflow-runtime"
import { detectLocalNode } from "@decoperations/xflow-provider-local"
import { xsyncSubstrate } from "@decoperations/xflow-substrate-xsync"

const runtime = createXFlowRuntime({
  node: detectLocalNode({ id: "render-box-1", capabilities: ["docker", "ffmpeg-binary"] }),
  substrate: xsyncSubstrate({ client: xsync }),
})

runtime.register("media.transcode", async (ctx, input) => {
  // Runs as long as needed. No 800s function cap.
  // Lease auto-extends via the substrate.
  await ctx.progress({ phase: "downloading" })
  const result = await runFfmpeg(input)
  await ctx.progress({ phase: "uploading" })
  return result
})
Coexistence vs replacement
Coexist: keep short steps on WDK + Vercel; route long ones to xFlow Docker peers via cross-runtime federation. Replace: when long-running compute dominates your workload, the Vercel function ceiling is just a tax.
Hard limit
Per-run ceilings: 10,000 steps, 25,000 events, 50 MB payloads, 2 GB storage
These are documented hard caps. They cover most use cases, but high-fan-out batch jobs (tens of thousands of records, AI ranking pipelines, scrapers) routinely exceed them and need to be sharded into child runs.
WDK behavior
The recommended pattern: a parent workflow spawns child workflows in batches of N and joins on completion. The audit trail is now spread across runs; observability has to stitch them back together.
xFlow's structural answer
No platform-imposed ceilings on a single run. The substrate's storage adapter is what bounds you (S3, Postgres, Redis, etc.). Reducer cost is linear in events; the scheduler walks O(steps) per tick. Both are JS-level concerns, not platform-level limits.
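To make "reducer cost is linear" concrete, here is a minimal sketch that projects a run's full event history into a FlowRun in one pass; projectRun and the in-memory history array are illustrative, not part of the xFlow API.
import { reduceFlowRun, type XFlowEvent, type FlowRun } from "@decoperations/xflow-core"

// Illustrative only: fold a large history once. Cost is O(events), it runs
// when state changes (not on every resume), and the runtime itself imposes
// no replay timer, event cap, or step cap.
function projectRun(initial: FlowRun, history: XFlowEvent[]): FlowRun {
  let run = initial
  for (const event of history) {
    run = reduceFlowRun(run, event)
  }
  return run
}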
Coexistence vs replacement
If you're already sharding into child workflows on WDK, you're paying complexity cost to avoid an artificial limit. xFlow lets you keep the run together and rely on the storage layer's scale.
Hard limit
Region pinning (iad1-only) and short retention (1d / 7d / 30d post-completion)
The Vercel-managed World currently runs in iad1 only; global distribution is on the v5 roadmap but not shipped. Retention after run completion is plan-tier-bound and not user-configurable: 1 day on Hobby, 7 days on Pro, 30 days on Enterprise. For audit, compliance, or long-retention reporting needs, those windows are non-starters.
WDK behavior
Run state lives in Vercel's managed substrate. You can't choose the region, the redundancy, or the retention. Long retention means switching to Postgres World (which carries its own self-host operational story) or building your own export pipeline to S3.
xFlow's structural answer
Bring your own bucket. The xSync substrate is configured against any S3-compatible store (R2, B2, Storj, AWS S3 itself). Multi-region is whatever your bucket supports. Retention is whatever your S3 lifecycle policy says: keep forever, prune after a year, archive to Glacier.
import { createXSync } from "@decoperations/xsync-client"
import { xflowS3WormStore } from "@decoperations/xflow-s3worm"
import { xsyncSubstrate } from "@decoperations/xflow-substrate-xsync"

const store = xflowS3WormStore({
  bucket: "xflow-prod",
  region: "us-east-1",
  // Cross-region replication is an S3 bucket setting; nothing else changes.
  // Retention is an S3 lifecycle policy on this bucket.
  endpoint: process.env.S3_ENDPOINT,
  rootPrefix: "tenants/acme",
})

const xsync = await createXSync({ store, views: { /* ... */ } })
const substrate = xsyncSubstrate({ client: xsync })
Coexistence vs replacement
Coexist: keep short-lived runs on WDK + Vercel World; mirror the lifecycle events into xFlow on S3 for audit/compliance retention. Replace: when retention or region requirements are first-class.
Structural
No browser participation: clients consume streams, never author durable events
WDK clients can read streamed output via `WorkflowChatTransport` and submit hooks to resume waiting workflows. They cannot author durable lifecycle events on a run. A browser tab watching a multi-step agent sees the output stream, not a typed view of the run state, and certainly cannot contribute its own steps.
WDK behavior
The client opens a server-side stream, receives chunks, and renders them. To intervene, it POSTs to a hook endpoint, which the server re-injects into the orchestrator. Multiple clients each get their own stream; there is no shared state model.
xFlow's structural answer
A browser tab is a peer. It joins the run actor, runs the same reducer the server runs, sees the typed FlowRun update live, and can register executors for steps that have `placement: { required: ["browser-tab"] }`. Multiple tabs share state through the substrate; the audit log of every event a tab emits is signed by that tab's identity.
// In a Next.js client component:
"use client"
import { useEffect, useState } from "react"
import { useFlowRun } from "@decoperations/xflow-react"
import { createXFlowRuntime } from "@decoperations/xflow-runtime"
import { detectLocalNode } from "@decoperations/xflow-provider-local"
import { xsyncSubstrate, flowRunView } from "@decoperations/xflow-substrate-xsync"
import { createXSync } from "@decoperations/xsync-client"

export function AgentSession({ runId, workflow }: Props) {
  const [handle, setHandle] = useState(null as any)

  useEffect(() => {
    void (async () => {
      // Browser-side xSync client (IndexedDB store, WS transport).
      // indexedDb() and websocket() are the store/transport factories from your xSync setup.
      const xsync = await createXSync({
        store: indexedDb(),
        transports: [websocket("/api/xsync")],
        views: { flowRun: flowRunView },
      })
      const substrate = xsyncSubstrate({ client: xsync })
      const runtime = createXFlowRuntime({
        node: detectLocalNode({ id: "browser", capabilities: ["browser-tab"] }),
        substrate,
      })
      // The browser registers an executor for an interactive step
      runtime.register("ui.confirm", async (ctx, input) => {
        return await showDialog(input) // user clicks Approve
      })
      const h = await runtime.start({ workflow, runId })
      setHandle(h)
    })()
  }, [])

  // Reactive: re-renders on every event
  const run = useFlowRun(handle)
  return <RunTimeline run={run} />
}
Coexistence vs replacement
This is one of the highest-leverage xFlow additions. Even if you keep WDK on the server, mirror the run's lifecycle events into xFlow so the browser gets a typed live view and the audit log it needs.
Structural
Single orchestrator instance per run: no multi-writer coordination
Every WDK World assumes one orchestrator advances a run. Two clients can't both contribute steps to the same run, two stakeholders can't both watch and intervene with shared state, and a browser-side optimistic step can't race against a server-side authoritative one.
WDK behavior
If you need multi-actor coordination, you build a separate state machine on top (usually a Cloudflare Durable Object or a coordination row in Postgres) and have the workflow read from it.
xFlow's structural answer
Claim modes are first-class. `authority` pins irreversible side effects to a designated server. `deterministic-election` lets multiple peers compute the same result and the log resolves the winner. `optimistic-idempotent` accepts the first valid output. `lease` covers long compute with TTL.
import { defineStep } from "@decoperations/xflow-core"

// Irreversible: only the trusted server may execute
defineStep({
  id: "publish",
  type: "social.publish",
  claim: { mode: "authority", authorityActorId: "server-prod" },
  placement: { required: ["server"] },
  sideEffects: { kind: "irreversible", idempotencyRequired: true },
})

// Pure compute: any peer can compute, the log resolves the winner
defineStep({
  id: "embed",
  type: "ai.embed",
  claim: { mode: "deterministic-election", tieBreaker: "lowest-hash" },
  sideEffects: { kind: "pure", idempotencyRequired: false },
})

// Cacheable: first valid output wins
defineStep({
  id: "summarize",
  type: "ai.summarize",
  claim: { mode: "optimistic-idempotent" },
  sideEffects: { kind: "idempotent", idempotencyRequired: false },
})

// Long-running: lease until TTL expiry, then the queue re-opens
defineStep({
  id: "render",
  type: "media.render",
  claim: { mode: "lease", ttlMs: 30 * 60_000 },
  placement: { required: ["docker"] },
  sideEffects: { kind: "external", idempotencyRequired: true },
})
Coexistence vs replacement
If your runs are single-actor (server-only), this gap doesn't bite. If your product is collaborative, agentic-with-human, or local-first, you'll be re-implementing this on top of WDK; you may as well use xFlow's claim modes.
Structural
No first-class step placement across runtimes
WDK steps run in the deployment that hosts the workflow. There's no way to say 'this step needs Docker, this step needs the edge, this step needs a GPU' inside a single workflow definition. Multi-runtime always degrades to 'wrap external calls in a step.'
WDK behavior
All `"use step"` handlers run as functions in the same deployment. Calling a Docker box, a Cloudflare Worker, or a GPU service is `fetch()` from inside a step.
xFlow's structural answer
Placement is a step field. Every peer evaluates it against its own capabilities before attempting to claim. Wrong-host peers don't compete. The same workflow definition can target Vercel, Docker, Cloudflare, and the browser โ each step runs where its placement says it should.
const mediaPipeline = defineWorkflow({
  id: "media.pipeline",
  version: "1.0.0",
  steps: {
    // Vercel function: short, cheap
    ingest: defineStep({
      id: "ingest", type: "media.ingest",
      placement: { required: ["server"], forbidden: ["browser-tab"] },
      sideEffects: { kind: "idempotent", idempotencyRequired: true },
    }),
    // Cloudflare edge: global, low-latency
    cdnSeed: defineStep({
      id: "cdnSeed", type: "media.cdn-seed",
      placement: { required: ["edge-worker"] },
      sideEffects: { kind: "idempotent", idempotencyRequired: true },
    }),
    // Docker box: GPU + ffmpeg
    transcode: defineStep({
      id: "transcode", type: "media.transcode",
      placement: { required: ["docker", "gpu"] },
      claim: { mode: "lease", ttlMs: 60 * 60_000 },
      sideEffects: { kind: "external", idempotencyRequired: true },
    }),
    // Browser tab: preview, optional, optimistic
    preview: defineStep({
      id: "preview", type: "media.preview",
      placement: { required: ["browser-tab"] },
      claim: { mode: "optimistic-idempotent" },
      sideEffects: { kind: "pure", idempotencyRequired: false },
    }),
  },
  links: [
    link("ingest", "cdnSeed"),
    link("ingest", "transcode"),
    link("transcode", "preview"),
  ],
})
Coexistence vs replacement
Hard to coexist cleanly: placement is a property of the workflow definition, so either you have it or you don't. Adopting xFlow for multi-runtime workflows is the cleanest path.
Operational
Determinism trap on serialization or schema changes
WDK workflows that suspend mid-run are replayed against their event journal on resume. If a Zod schema, a custom serde codec, or even a `@__PURE__` annotation changes between deployments, replay can silently produce inconsistent state; workflows hang in `pending` rather than fail loudly. Several open issues document the pattern.
WDK behavior
Sandbox-enforced determinism is the load-bearing assumption. Schema or serde drift breaks replay; the failure mode is hangs rather than errors. Skew Protection mitigates by pinning runs to their original deploy, but old-deploy pinning means you cannot patch in-flight runs.
xFlow's structural answer
No replay sandbox. The reducer is a typed function over event types; schema changes are explicit reducer migrations, not implicit replay drift. The workflow body never re-runs โ only the reducer projects events into state.
import { reduceFlowRun, EVENT_TYPES, type XFlowEvent, type FlowRun } from "@decoperations/xflow-core"

// You can layer your own event types on top of xflow.* and migrate them
// explicitly when shapes change. No sandbox, no replay drift.
function reduceWithMigration(state: FlowRun, event: XFlowEvent): FlowRun {
  // Your own event type whose payload changed from v1 to v2
  if (event.type === "my-app.review.submitted") {
    const v = (event.payload as { version?: number }).version ?? 1
    if (v === 1) {
      // Migrate the v1 payload shape into v2 before applying the v2 logic
      return applyReviewV2(state, migrateV1ToV2(event))
    }
    return applyReviewV2(state, event)
  }
  return reduceFlowRun(state, event)
}
Coexistence vs replacement
The determinism trap is intrinsic to deterministic-replay engines. It's not patchable in WDK without rewriting its execution model. xFlow's reducer-over-events model side-steps it entirely.
Structural
Tenant-provided workflow code requires a separate product
WDK assumes workflow code is part of your deployment. For multi-tenant workflow builders, agent platforms, or CI/CD products where customers ship their own workflow logic, you need a different approach. Cloudflare ships *Dynamic Workflows* (Worker Loader + WorkflowEntrypoint) explicitly for this case.
WDK behavior
To support tenant-provided workflows on WDK, you ship a separate orchestration layer (a hand-rolled DAG executor or a different product). Either way, it's not the WDK runtime executing the tenant code.
xFlow's structural answer
Workflow definitions are JSON-serialisable values. Persist a tenant's `WorkflowDefinition` in your DB, load it at run time, hand it to `runtime.start()`. No build-time transform, no tenant-deploy pipeline, no separate product layer.
// API endpoint: POST /api/workflows (tenant uploads a definition)
import { defineWorkflow, type WorkflowDefinition } from "@decoperations/xflow-core"

export async function POST(req: Request) {
  const body = await req.json()
  const def = defineWorkflow(body) // throws if invalid
  await db.tenantWorkflows.put(req.tenantId, def.id, def)
  return Response.json({ id: def.id })
}

// Later: run a tenant-defined workflow
export async function startTenantWorkflow(tenantId: string, workflowId: string, input: unknown) {
  const def = await db.tenantWorkflows.get(tenantId, workflowId)
  return await runtime.start({ workflow: def, input })
}

// Definitions are just data. AI can generate them. UIs can edit them.
// Version history is a normal database concern.
Coexistence vs replacement
Coexist: keep your own workflows on WDK; route tenant-provided ones through xFlow. Replace: if multi-tenant workflow products are your business, definitions-as-data is the foundational primitive.
Operational
Bundler / build-time fragility: every framework finds new failure modes
WDK relies on an SWC plugin that rewrites workflow/step boundaries at build time. The plugin interacts with bundler internals (Next basePath, Nuxt externalize rules, ESM resolution, Windows paths), and each combination produces new edge cases. The bundler-discovery cluster is consistently the largest open-issue theme on `vercel/workflow`.
WDK behavior
Every framework integration (Next, Nuxt, Astro, Hono, Express, Vite) has its own SWC-plugin glue path, and each combination of bundler + OS + plugin order finds new bugs. The discovery rules (`instrumentation.ts`, the `.well-known/workflow/v1/*` route synthesis) are sensitive to bundler options.
xFlow's structural answer
No build-time transform. `defineWorkflow` and `defineStep` are runtime functions; `link` is a builder. The whole grammar layer is plain TypeScript that any bundler handles. Adding a new framework integration is `pnpm add @decoperations/xflow-runtime` plus call sites: no plugin, no discovery rules.
// Works on Next, Nuxt, Astro, Hono, Vite, Bun, Deno, and browser bundles:
// it's just JS. No "use workflow" directive, no SWC transform, no
// instrumentation.ts discovery rules.
import { defineWorkflow, defineStep, link } from "@decoperations/xflow-core"

export const ingest = defineWorkflow({
  id: "ingest.basic",
  version: "1.0.0",
  steps: {
    fetch: defineStep({ id: "fetch", type: "data.fetch", sideEffects: { kind: "external", idempotencyRequired: true } }),
    parse: defineStep({ id: "parse", type: "data.parse", sideEffects: { kind: "pure", idempotencyRequired: false } }),
  },
  links: [link("fetch", "parse")],
})
Coexistence vs replacement
If your team is fighting WDK bundler issues today, the fastest mitigation is to adopt xFlow's grammar layer for new workflows; its DX cost is just adding packages.
Recap
Three ways to use xFlow with an existing WDK + Vercel deployment.
Mirror
Keep WDK; emit xFlow lifecycle events from a step boundary so the browser gets a live typed view, S3 retains a signed audit log, and you have an interop hook for future peers.
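A minimal sketch of the Mirror pattern, assuming a small helper that appends events into the xFlow substrate: mirrorToXFlow, the my-app.* event names, TranscodeInput, and runTranscode are placeholders for your own code, since the exact append API isn't shown on this page.
// Existing WDK step; the only change is emitting xFlow lifecycle events at the boundary.
// mirrorToXFlow is a hypothetical helper that appends an event to your xFlow substrate.
async function transcode(input: TranscodeInput) {
  "use step"
  await mirrorToXFlow({ type: "my-app.transcode.started", runId: input.runId, payload: { assetId: input.assetId } })
  const result = await runTranscode(input)
  await mirrorToXFlow({ type: "my-app.transcode.succeeded", runId: input.runId, payload: result })
  return result
}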
Federate
xFlow runs the multi-runtime / multi-writer parts; long server steps call into a WDK workflow as the executor. Each system does what it's best at.
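A sketch of the Federate split under the same assumptions used elsewhere on this page: runtime and ctx come from createXFlowRuntime, while startWdkRunAndWait and the billing.reconcile step type are placeholders for however you invoke your existing WDK workflow from a server executor.
// xFlow owns the run: placement, claim modes, the audit log, the browser view.
// The server executor delegates the heavy phase to the existing WDK workflow.
runtime.register("billing.reconcile", async (ctx, input) => {
  await ctx.progress({ phase: "delegating-to-wdk" })
  // Placeholder: start the WDK workflow and await its result.
  return await startWdkRunAndWait("reconcile-invoices", input)
})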
Replace
Use xFlow as the durable runtime; pick a substrate (xSync, Postgres, WDK World via Phase-2 adapter) based on what you already operate.