The signal
Cloudflare Analytics is one of the underrated monitoring tools. It surfaces every request to a Cloudflare Worker, including the ones our internal logs don't see — because the Worker rejected them before they reached our origin.
Our inbound mail path runs through a Cloudflare Email Worker (email-routing inbound → Worker fetch → POST to our /v1/inbound endpoint). Three weeks ago, while doing a routine sweep through CF Analytics, we noticed:
3 invocations / 3 errors / 63 subrequests [7-day window]
Three errors a day, every day. Not a spike. A floor.
We almost ignored it. Three errors a day on a service handling thousands of inbound messages is rounding-error territory. But the consistency was suspicious. Real bugs are noisy. Floors are bugs that are working.
The hypothesis
The Worker logged the error class as fetch_failed_4xx. The subrequest count was high (63 across three invocations, or 21 per invocation), which suggested retries. The actual error response wasn't logged.
We had a few hypotheses:
- Auth failure — our Worker → API path uses a shared secret. Maybe a sender's signature was failing.
- Rate limiting at the API edge.
- Body too large — but that should be rare.
The reproduction
To reproduce we needed an actual failing request. We added a debug logger inside the Worker that captured the full error response before failing the email-routing pipeline. Three days later, the first capture came in:
HTTP 400 Bad Request
Content-Type: text/plain

request body too large
Not auth. Not rate limit. Body too large. From Gmail.
The path
The request flow:
1. A sender emails an @relayly.io address.
2. Cloudflare Email Routing forwards the message to our Worker.
3. The Worker re-encodes the raw RFC822 message into a JSON envelope and POSTs it to https://api.relayly.io/v1/inbound/cf-worker.
4. Our Go API parses the JSON, base64-decodes the message, and inserts a row into the inbound queue table.
Step 4 was failing on the size check. Going back to the code:
// apps/api/internal/handlers/inbound_cf_worker.go
func InboundCFWorker(...) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        r.Body = http.MaxBytesReader(w, r.Body, 8*1024*1024) // 8 MB
        ...
    }
}
8 MB. Set in the very first commit of the inbound handler, presumably because someone (probably me, four weeks ago) thought "8 MB ought to be enough for an email."
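Part of why this took three days to pin down is that the rejection looked like any other 4xx. A minimal sketch of a handler that reports an oversize body distinctly is below. It is not our actual handler: the `envelope` type, its `raw` field, and the helper names are hypothetical, and returning 413 instead of 400 is a design choice made for the illustration. It relies on `*http.MaxBytesError`, which the standard library provides from Go 1.19.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"errors"
	"fmt"
	"log"
	"net/http"
	"net/http/httptest"
	"strings"
)

// envelope is a hypothetical shape for the Worker's JSON payload.
// encoding/json base64-decodes a JSON string into a []byte field.
type envelope struct {
	Raw []byte `json:"raw"`
}

// newInboundHandler caps the request body at limit bytes and reports an
// oversize body separately from other decode failures.
func newInboundHandler(limit int64) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		r.Body = http.MaxBytesReader(w, r.Body, limit)

		var env envelope
		if err := json.NewDecoder(r.Body).Decode(&env); err != nil {
			var tooLarge *http.MaxBytesError
			if errors.As(err, &tooLarge) {
				// Log the cap so the rejection is visible on the API side.
				log.Printf("inbound: body exceeded %d-byte cap", tooLarge.Limit)
				http.Error(w, "request body too large", http.StatusRequestEntityTooLarge)
				return
			}
			http.Error(w, "invalid JSON envelope", http.StatusBadRequest)
			return
		}
		w.WriteHeader(http.StatusAccepted)
	}
}

// envelopeJSON builds a JSON envelope carrying n zero bytes as base64.
func envelopeJSON(n int) string {
	return `{"raw":"` + base64.StdEncoding.EncodeToString(make([]byte, n)) + `"}`
}

// postJSON sends body to h in-process and returns the response status code.
func postJSON(h http.Handler, body string) int {
	req := httptest.NewRequest(http.MethodPost, "/v1/inbound/cf-worker", strings.NewReader(body))
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	h := newInboundHandler(1024) // tiny cap so the demo stays cheap
	fmt.Println("small envelope:", postJSON(h, envelopeJSON(16)))
	fmt.Println("oversize envelope:", postJSON(h, envelopeJSON(2000)))
	fmt.Println("malformed body:", postJSON(h, "not json"))
}
```

With a distinct status and a log line on the API side, an oversize rejection no longer has to be reconstructed from the Worker's view of the response.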
The truth about real Gmail forwards
It is not enough. Real-world email size distribution looks roughly like:
- Plain text or basic HTML: 1-50 KB
- Marketing email with images: 50 KB - 500 KB
- Email with one PDF attachment: 1-10 MB
- Email with multiple PDFs / a video / a thread of replies with images: 10-30 MB
And — this is the part that bit us — when Gmail forwards an email it inflates the size: it adds full original-message headers, often base64-encodes attachments that arrived as 7-bit, and tacks on its own forwarding metadata. A 6 MB original can become a 9 MB forwarded payload.
The 8 MB cap was rejecting a small but real slice of legitimate mail. Three a day across the volume we serve.
The fix
One line:
- r.Body = http.MaxBytesReader(w, r.Body, 8*1024*1024)
+ r.Body = http.MaxBytesReader(w, r.Body, 32*1024*1024)
32 MB is not an SMTP standard (SMTP defines no fixed maximum; each server advertises its own limit), but it sits above the 25 MB ceiling Gmail applies to attachments on outgoing mail, which leaves room for headers and encoding overhead at our edge.
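How much cushion a cap leaves depends on the envelope encoding. The sketch below works through the arithmetic under one assumption that should be checked against the real Worker: that the JSON envelope carries the entire raw message as standard padded base64, which adds roughly a third to its size. The function names are illustrative, and the JSON keys and quotes around the payload are ignored.

```go
package main

import (
	"encoding/base64"
	"fmt"
)

const mib = 1024 * 1024

// encodedSize returns the standard (padded) base64 length of rawBytes bytes.
func encodedSize(rawBytes int) int {
	return base64.StdEncoding.EncodedLen(rawBytes)
}

// maxRawForCap returns the largest raw message whose base64 form fits in
// capBytes, ignoring the surrounding JSON keys and quotes.
func maxRawForCap(capBytes int) int {
	return capBytes / 4 * 3
}

func main() {
	for _, c := range []int{8 * mib, 32 * mib} {
		fmt.Printf("cap %d MiB: largest raw message is about %d MiB\n",
			c/mib, maxRawForCap(c)/mib)
	}
	fmt.Printf("25 MiB raw encodes to %d bytes; a 32 MiB cap is %d bytes\n",
		encodedSize(25*mib), 32*mib)
}
```

Under these assumptions an 8 MB cap admits raw messages up to about 6 MB and a 32 MB cap up to about 24 MB, so the headroom is worth re-checking against the largest message the Worker can actually receive.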
The deeper fix — defense in depth
The bug existed for four weeks before we noticed. The fix is one line. The lesson is the four weeks.
What we changed:
- Swept for other size caps. Caddy's request_body limit, our Worker's body parser, the API's middleware: all audited. None were below 32 MB after the sweep.
- Added an oversize regression test. scripts/smoke-inbound.sh now sends a 16 MB payload as part of the daily smoke. If we ever drop the cap again, we'll know within 24 hours.
- CF Analytics on the verifier. Our daily verifier now pulls CF Email Worker analytics via API and alerts if the error count is non-zero. Three errors a day is no longer the floor we accept.
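The same oversize check can also run in-process, without a deployed environment. The sketch below is not the smoke script; it is a hypothetical Go equivalent in which `cappedHandler` stands in for the real inbound handler and simply drains the body under a cap. It shows a 16 MB payload passing a 32 MB cap and failing an 8 MB one.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// cappedHandler is a stand-in for the real inbound handler: it applies the
// body cap, drains the body, and returns 413 if the cap is exceeded.
func cappedHandler(limit int64) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		r.Body = http.MaxBytesReader(w, r.Body, limit)
		if _, err := io.Copy(io.Discard, r.Body); err != nil {
			http.Error(w, "request body too large", http.StatusRequestEntityTooLarge)
			return
		}
		w.WriteHeader(http.StatusAccepted)
	}
}

// statusForPayload posts size zero bytes to h in-process and returns the
// response status code.
func statusForPayload(h http.Handler, size int) int {
	body := bytes.NewReader(make([]byte, size))
	req := httptest.NewRequest(http.MethodPost, "/v1/inbound/cf-worker", body)
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	const mib = 1024 * 1024
	fmt.Println("16 MiB against 32 MiB cap:", statusForPayload(cappedHandler(32*mib), 16*mib))
	fmt.Println("16 MiB against 8 MiB cap:", statusForPayload(cappedHandler(8*mib), 16*mib))
}
```

A check like this fails the moment the cap drops below the probe size, which is the regression the daily smoke is meant to catch.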
What you can take from this
- Floors are bugs. Steady-state error counts on a service that should be at zero are signal, not noise. The "background error rate" is a red flag, not a state of the world.
- Default sizes are conservative for a reason. But "conservative" cuts both ways — too-small caps reject legitimate traffic. Audit your size caps against actual real-world payload distribution, not against what you imagine the distribution looks like.
- Test the cliff edges. If your service has a "max size" parameter, write an oversize regression test. The day someone changes the value to 8 MB during a refactor, your test will catch it.
- Cloudflare Analytics is useful even when you have other observability. CF sees errors that never reach your origin. We had Datadog metrics on every endpoint that responded. We had nothing on requests that timed out at CF.
If you've shipped a similar quiet-floor bug, we'd love to hear about it: hello@relayly.io.