Cloudflare Containers vs Fly.io: What I Learned (and Messed Up) Migrating a Discord Bot

This post is a technical account of a week spent migrating a Discord bot between two container providers, Cloudflare Containers and Fly.io, to fix a bug that, in the end, had nothing to do with the provider. The most honest title would be “how to avoid an unnecessary migration”, but the journey taught me a lot about the two products, so it’s worth recording.

If you only want the TL;DR: the root cause was Discord deprecating voice protocol v4 in 2025, not blocked UDP. The rest of this text explains how I first arrived at the wrong conclusion (blocked UDP), what I discovered about each platform along the way, and the resulting architecture.

Context

Tessel is a platform for tabletop RPG GMs. One of the features is to record the voice session on Discord, transcribe it, and extract events for the campaign timeline. The bot needs to:

  • Receive HTTP interactions from Discord (slash commands, via TCP/443)
  • Maintain a WebSocket connection with the Discord Gateway (TCP/443)
  • Receive audio via UDP from Discord voice servers
  • Save PCM to disk, fragment into 10-minute chunks, and send to the transcription pipeline
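For a sense of scale, the chunk size in the last step can be worked out from the audio format. The sketch below assumes 48 kHz stereo (the decoder settings used later in this post) and 16-bit samples, which is an assumption on my part; `chunkRanges` is a hypothetical helper, not code from the bot.

```typescript
// Audio format: 48 kHz stereo, matching the decoder settings used later in the post.
const SAMPLE_RATE_HZ = 48_000;
const CHANNELS = 2;
const BYTES_PER_SAMPLE = 2; // assumption: 16-bit signed PCM
const CHUNK_SECONDS = 10 * 60;

const BYTES_PER_SECOND = SAMPLE_RATE_HZ * CHANNELS * BYTES_PER_SAMPLE; // 192,000
const CHUNK_BYTES = BYTES_PER_SECOND * CHUNK_SECONDS; // 115,200,000 (about 110 MiB)

interface ChunkRange {
  index: number;
  start: number; // inclusive byte offset
  end: number; // exclusive byte offset
}

// Hypothetical helper: byte ranges of each 10-minute chunk in a PCM file.
function chunkRanges(totalBytes: number, chunkBytes: number = CHUNK_BYTES): ChunkRange[] {
  const ranges: ChunkRange[] = [];
  for (let start = 0, index = 0; start < totalBytes; start += chunkBytes, index += 1) {
    ranges.push({ index, start, end: Math.min(start + chunkBytes, totalBytes) });
  }
  return ranges;
}
```

Under these assumptions a 10-minute chunk is a whole number of 4-byte stereo frames, so cutting at these offsets never splits a sample.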

The first version ran on Cloudflare Containers with Durable Objects routing interactions to a Node.js container that maintained the voice session. It worked well in local tests. In production, it stopped working.

The bug

The symptom was frustrating: the voice connection was established successfully, the bot entered the channel, but no audio packets arrived. The logs showed VoiceConnectionStatus.Ready, and then silence. The final PCM file had a size of zero.

I reproduced it on CF, so to take the provider out of the equation I duplicated the stack on Fly.io. Same bug: the packets did not arrive.

The incorrect diagnosis

This is where I made a big mistake. I searched for “discord voice udp cloudflare containers” and found old discussions (from 2023) suggesting that CF Workers did not support outbound UDP. CF Containers is a new product (in open beta at the time) and the documentation was not explicit about UDP. Hasty conclusion: CF Containers inherits the limitation of Workers, let’s go to Fly.io.

I migrated everything. fly.toml configuration, flyctl deploy, machine config, auto-stop/start, secrets, monitoring. It took days.

And the bug persisted.

The real lead

It was only while browsing the discord.js issues that I saw:

Discord is deprecating voice gateway versions ≤ v6 in 2025. Use @discordjs/voice 0.19+ which negotiates v8 (DAVE encryption).

The bot was on @discordjs/voice@0.18 configured with daveEncryption: false. In 2024 this worked. In 2025, Discord started to silently reject the flag — the connection was accepted, but the voice server never sent packets because the session was never considered negotiated.

The fix was:

  1. @discordjs/voice 0.18 → 0.19.2 (forces v8 negotiation)
  2. Remove daveEncryption: false (no longer optional)
  3. Replace the pipeline subscribe → opus.Decoder → fs.WriteStream with per-packet decoding using OpusEncoder.decode() from @discordjs/opus

The last point deserves an explanation: with DAVE (Discord Audio Voice Encryption), each packet arrives individually encrypted. The old prism-media pipeline assumed a continuous stream — it broke with the first E2EE packets. The solution is to keep a decoder per user and decode packet by packet.

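// before (breaks with DAVE): treats the receive stream as one continuous Opus stream
receiver.subscribe(userId)
  .pipe(new prism.opus.Decoder({ rate: 48000, channels: 2, frameSize: 960 }))
  .pipe(fs.createWriteStream(path))

// after (compatible with v8): decode each Opus packet on its own
// requires: import { OpusEncoder } from "@discordjs/opus"
const decoder = new OpusEncoder(48000, 2) // 48 kHz stereo; one decoder per user
const fileStream = fs.createWriteStream(path)
receiver.subscribe(userId).on("data", (opusPacket) => {
  const pcm = decoder.decode(opusPacket)
  fileStream.write(pcm)
})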
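Keeping "a decoder per user" could look like the sketch below. It is an illustration, not the bot's actual code: `createDecoderRegistry` and the `PacketDecoder` interface are hypothetical names, and the decoder factory is injected so the registry does not depend on any particular Opus binding.

```typescript
// Minimal shape of anything that turns one encoded packet into PCM.
// Assumption: an OpusEncoder from @discordjs/opus can be wrapped to fit this.
interface PacketDecoder {
  decode(packet: Uint8Array): Uint8Array;
}

// Hypothetical registry: lazily creates one decoder per user ID and reuses it.
function createDecoderRegistry(makeDecoder: () => PacketDecoder) {
  const decoders = new Map<string, PacketDecoder>();
  return {
    decode(userId: string, packet: Uint8Array): Uint8Array {
      let decoder = decoders.get(userId);
      if (!decoder) {
        decoder = makeDecoder();
        decoders.set(userId, decoder);
      }
      return decoder.decode(packet);
    },
    // Drop a user's decoder, for example when they leave the channel.
    release(userId: string): boolean {
      return decoders.delete(userId);
    },
    size(): number {
      return decoders.size;
    },
  };
}
```

In the bot this would presumably be created once per voice session with a factory such as `() => new OpusEncoder(48000, 2)`, so that each speaker's decoder state is never shared with another speaker's packets.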

After the upgrade, it worked with both providers. UDP had never been the problem.

The test I should have done first

To close the diagnostic gap, I wrote an isolated 50-line project: cf-udp-test. A Worker running a minimal container that:

  1. Attempts DNS query (UDP/53) → ✅ worked
  2. Attempts STUN binding request (UDP/3478) → ✅ worked
  3. Attempts to send a packet to the Discord voice IP → ✅ worked

CF Containers supports outbound UDP without any issues. The assumption from old Workers docs was wrong. I documented this in the roadmap as a public mea-culpa.

The lesson is simple and old: isolate before migrating. A 50-line test would have saved a week of work.
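A probe of that kind does not need much. The sketch below is not the original cf-udp-test code; it is a minimal Node.js version of the STUN step only, with hypothetical function names. It builds a 20-byte STUN Binding Request, sends it over UDP, and reports whether a matching success response comes back before a timeout.

```typescript
import { createSocket } from "node:dgram";
import { randomBytes } from "node:crypto";

const STUN_MAGIC_COOKIE = 0x2112a442;

// 20-byte STUN header: type, length, magic cookie, 12-byte transaction ID.
function buildStunBindingRequest(transactionId: Uint8Array = randomBytes(12)): Buffer {
  if (transactionId.length !== 12) throw new Error("transaction ID must be 12 bytes");
  const packet = Buffer.alloc(20);
  packet.writeUInt16BE(0x0001, 0); // Binding Request
  packet.writeUInt16BE(0, 2); // message length: no attributes
  packet.writeUInt32BE(STUN_MAGIC_COOKIE, 4);
  packet.set(transactionId, 8);
  return packet;
}

// True if the reply is a Binding Success Response for this request.
function isStunBindingSuccess(reply: Uint8Array, request: Uint8Array): boolean {
  if (reply.length < 20 || request.length < 20) return false;
  const r = Buffer.from(reply);
  const q = Buffer.from(request);
  return (
    r.readUInt16BE(0) === 0x0101 &&
    r.readUInt32BE(4) === STUN_MAGIC_COOKIE &&
    r.subarray(8, 20).equals(q.subarray(8, 20))
  );
}

// Resolves true if a valid STUN reply arrives, false on error or timeout.
function probeUdp(host: string, port: number, timeoutMs = 3000): Promise<boolean> {
  return new Promise((resolve) => {
    const socket = createSocket("udp4");
    const request = buildStunBindingRequest();
    let settled = false;
    const finish = (ok: boolean): void => {
      if (settled) return;
      settled = true;
      clearTimeout(timer);
      socket.close();
      resolve(ok);
    };
    const timer = setTimeout(() => finish(false), timeoutMs);
    socket.once("message", (reply) => finish(isStunBindingSuccess(reply, request)));
    socket.once("error", () => finish(false));
    socket.send(request, port, host);
  });
}

// Usage (placeholder host, substitute a STUN server you are allowed to query):
// probeUdp("stun.example.org", 3478).then((ok) => console.log("outbound UDP:", ok));
```

A reply proves a full outbound and inbound UDP round trip from inside the container, which is exactly the property the original hypothesis questioned.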

Learnings about Cloudflare Containers

Even though the migration was unnecessary, I discovered several things about the product that are worth noting:

1. Durable Object migrations are strict

You can’t simply rename or replace a DO class. Each structural change needs a migration tag:

"migrations": [
  { "tag": "v1", "new_sqlite_classes": ["BotContainer"] },
  { "tag": "v2", "deleted_classes": ["BotContainer"] },
  { "tag": "v3", "new_sqlite_classes": ["BotContainer"] }
]

I had to do create → delete → create again during iteration. Skipping steps results in errors like “already exists with different namespace”.

2. Env vars are baked in at instance start

Updating a secret via wrangler secret put does not update running instances. The container already in memory continues with the old value until it is destroyed. This cost me hours debugging an “invalid token” after already rotating the token.

The fix is to force destruction:

wrangler containers delete <application-id>

Then on the next request the DO spins up a new container with the updated secrets.

3. Orphaned applications need manual cleanup

If you change the DO namespace name or recreate it, the container’s “application” becomes orphaned. Error:

There is already an application with the name X deployed that is associated with a different durable object namespace

Solution: wrangler containers delete <id> before redeploying.

4. Cold start is real and cascades

With the container stopped, /gravar start (the bot's record command) waits ~3-5 seconds for the container to start, plus ~2 more seconds for the login to the Discord Gateway. Total: ~5-7s from interaction to “I’m ready”.

Worse: the session creation in Postgres had a synchronous trigger that called an external Edge Function to index in Vectorize. On cold start, this trigger caused the INSERT to exceed the statement_timeout of 5s from PostgREST and the transaction rolled back. Result: cold start propagated as “Failed to create session” to the user.

The fix was to migrate the trigger to pg_net async:

CREATE OR REPLACE FUNCTION public.sync_note_to_vectorize()
RETURNS trigger LANGUAGE plpgsql SECURITY DEFINER SET search_path TO ''
AS $$
DECLARE
  request_id BIGINT;
BEGIN
  -- ... build payload ...
  SELECT net.http_post(
    url := edge_function_url,
    body := payload,
    headers := jsonb_build_object('Content-Type', 'application/json', ...)
  ) INTO request_id;
  RAISE LOG 'Queued vectorize sync (pg_net request_id: %)', request_id;
  RETURN COALESCE(NEW, OLD);
END;
$$;

Now the INSERT queues the request in net.http_request_queue and returns immediately. Delivery happens in the background, with automatic retry and logs in net._http_response.

Learnings about Fly.io

Fly.io is more mature for this use case (long-running Node.js containers) and behaves differently in a few ways that matter here:

1. auto-stop/start works well for burst loads

[[services]]
  auto_stop_machines = "stop"
  auto_start_machines = true
  min_machines_running = 0

Cold start is faster than on CF Containers (~1-2s for spin-up vs 3-5s), because with auto_stop_machines = "stop" the machine is stopped and kept around instead of being destroyed.

2. Structured logs are first-class

fly logs shows stdout directly, without needing to configure Logpush or tail. For quick debugging, it is more ergonomic than the equivalent in Cloudflare.

3. UDP works out of the box

No special config, no flag. That is what you would expect from a product that sells “any container.”

Final architecture: toggle Fly ↔ CF

Since both platforms work after the protocol fix, I kept both implementations active. A single env var in the Worker decides where to forward:

type Env = {
  BACKEND: "fly" | "cf"
  BOT_URL: string                     // Fly endpoint
  BOT: DurableObjectNamespace<...>    // CF binding
  // ...
}
 
async function callBot(env: Env, path: string, init?: RequestInit) {
  if (env.BACKEND === "cf") {
    const stub = env.BOT.get(env.BOT.idFromName("singleton"))
    return await stub.fetch(`https://bot${path}`, init)
  }
  return await fetch(`${env.BOT_URL}${path}`, init)
}
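One caveat with the toggle above: the "fly" | "cf" type only exists at compile time, while the Worker receives the variable as an arbitrary string. A small guard like the one below (a hypothetical `parseBackend` helper, not part of the original code) would turn a typo in the env var into an explicit error instead of a silent fallback to the Fly branch.

```typescript
type Backend = "fly" | "cf";

// Hypothetical guard: validate the raw env var before using it to route requests.
function parseBackend(value: string | undefined): Backend {
  const normalized = (value ?? "").trim().toLowerCase();
  if (normalized === "fly" || normalized === "cf") return normalized;
  throw new Error(`BACKEND must be "fly" or "cf", got ${JSON.stringify(value)}`);
}
```

Calling it once at the top of the request handler keeps the rest of the code working with the narrowed type.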

An important limitation: the Discord bot token can only be used by one instance at a time. Therefore, only one backend is active at a time. To run both in parallel, a second bot would need to be registered in the Discord Developer Portal. For A/B testing, this is worth it; for stateless HA, it is overhead.

General lessons

  1. Diagnose before migrating. A single 50-line probe is worth more than days of hypothesis-based migration.

  2. Documentation for new products ages quickly. “Workers did not support UDP in 2023” says nothing about “Containers in 2026”. Always confirm with a current test.

  3. Cold starts cascade through the entire stack. The bug appears in the bot, but the root cause might be in a Postgres trigger. When something synchronous in the critical path calls an external network, consider migrating to a queue/async.

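  4. Platform tools have peculiarities. CF Containers bake env vars at boot, and a running container keeps the old values until it is destroyed. On Fly, setting a secret normally restarts the app's machines, so running instances pick up the new value without a manual delete. Knowing this changes how you think about credential rotation.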

  5. Maintaining two implementations is expensive but validates hypotheses. If I had kept the toggle from the start, I would have reproduced the bug in CF and Fly in parallel on the same day, and the v8 protocol would have stood out before the migration began.

  6. Voice/realtime protocols change. Discord deprecated v4 silently — no 410 Gone, just “ok but doesn’t send packets”. For any WebRTC-like integration, it’s worth having a smoke test that validates data flow, not just handshake.
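A data-flow smoke test of the kind described in the last lesson can be very small. The sketch below uses hypothetical names and keeps the clock injectable so it is easy to test: start it when the connection reports Ready, feed it every received packet, and treat a "silent" verdict after the grace period as a failure. It assumes someone (or a test client) is actually speaking in the channel, since a quiet channel also produces no packets.

```typescript
type FlowVerdict = "waiting" | "flowing" | "silent";

interface FlowCheck {
  recordPacket(byteLength: number): void;
  verdict(nowMs: number): FlowVerdict;
}

// Hypothetical helper: distinguishes "handshake ok" from "data is actually arriving".
function createFlowCheck(readyAtMs: number, graceMs: number, minPackets = 1): FlowCheck {
  let packets = 0;
  return {
    recordPacket(byteLength: number): void {
      if (byteLength > 0) packets += 1; // empty payloads do not count as data
    },
    verdict(nowMs: number): FlowVerdict {
      if (packets >= minPackets) return "flowing";
      return nowMs - readyAtMs >= graceMs ? "silent" : "waiting";
    },
  };
}

// Possible wiring (names from the surrounding bot code are assumptions):
//   const check = createFlowCheck(Date.now(), 10_000);
//   receiver.subscribe(userId).on("data", (p) => check.recordPacket(p.length));
//   setTimeout(() => {
//     if (check.verdict(Date.now()) === "silent") console.error("Ready, but no audio packets");
//   }, 10_000);
```

With a check like this, the original symptom (Ready followed by a zero-byte PCM file) would have surfaced as an explicit error within seconds.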

Closure

The migration was not necessary, but it was not wasted. I learned the internals of both products, validated that both serve this use case, and ended up with an architecture that accepts provider swapping without a rewrite. The real bug (the missing v8 protocol negotiation) was invisible until I forced a reinvestigation, and the isolated UDP test remains as an artifact for the next time a similar hypothesis appears.

If you are building something voice-heavy on Discord in 2026: start with @discordjs/voice@0.19.2+, decode per-packet, and test the complete end-to-end pipeline before assuming that the infrastructure is the problem.