Neural Tech Daily

Deploy a Python LLM App to the Edge: FastAPI + Anthropic API on Cloudflare Workers (May 2026)

End-to-end tutorial: scaffold a FastAPI Anthropic-API app, add a KV-backed rate-limit cache, deploy via Wrangler. Honest about Workers-Python beta status.


Image: Cloudflare Workers Python documentation, used for editorial coverage of the runtime taught in this tutorial.

Status check before you start (May 2026)

Two things you should verify in your own browser before committing an afternoon to this tutorial. First, Cloudflare’s Python runtime for Workers has been publicly available in open beta since the original April 2024 Developer Week launch, with the runtime built on Pyodide running inside the V8 isolate model 1 . Open-beta status means production hardening, full Python-package support, and SLA-grade reliability are not contractually guaranteed by Cloudflare. Verify the current banner on the docs page before depending on it for production traffic.

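Second, FastAPI does not run on Workers Python the way it runs on a server. The Workers Python runtime targets Pyodide-compatible packages and exposes the Workers fetch-handler shape; there is no uvicorn process to start. Per Cloudflare’s documentation on supported packages and the Pyodide runtime constraints 2 , you cannot pip install fastapi uvicorn and wrangler deploy. Cloudflare’s package documentation has described running FastAPI through a runtime-provided ASGI adapter; check that page for its current status and limits, since this tutorial does not rely on it. Two pragmatic paths exist: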

  • Path A, Workers Python with a FastAPI-style handler. Re-shape the FastAPI route signature into the Workers Python on_fetch handler. You get edge proximity and the Cloudflare network’s cold-start profile; you lose FastAPI’s Pydantic-validator ergonomics and the uvicorn ecosystem.
  • Path B, Cloudflare Pages Functions (TypeScript) proxying a FastAPI backend. Keep FastAPI on a managed Python runtime (Vercel, Modal, Fly.io, Render), put a Pages Functions TypeScript edge proxy in front for routing, KV-backed rate limiting, and CDN caching. The combination ships today on stable APIs.

This tutorial walks through both. Path A is the “pure-edge” target. Path B is the production-grade fallback, and the one that discussion on the Cloudflare developer forum and recent community write-ups currently treats as the safer default for FastAPI specifically. Pick based on whether your dependency graph is Pyodide-compatible.

What you’ll build

A small Anthropic-API-backed endpoint that takes a prompt, calls Claude, and returns the response. The endpoint enforces per-IP rate limiting via Cloudflare KV so a single client can’t burn through your Anthropic spend. You deploy with Wrangler, hit the deployed URL with curl, and watch token usage land on the Anthropic console.

By the end you will have:

  • A FastAPI app structure (Path B) or a Workers Python handler (Path A) that wraps the Anthropic Messages API.
  • A Cloudflare KV namespace storing a TTL-expired rate-limit counter per client IP.
  • A wrangler.toml configuration bound to that KV namespace and your Anthropic key as a secret.
  • A working curl test against the deployed *.workers.dev URL.
  • A live view of token spend on the Anthropic console.

Skills assumed: Python 3.11 or later, basic async/await, comfort with command-line Git and curl. You do not need prior Cloudflare experience; the Wrangler steps are reproducible.

Path A — Workers Python: scaffold

Cloudflare’s Python runtime exposes a fetch handler with a Request/Response shape close to the Web Fetch API. The minimum project layout per the Workers Python docs:

my-llm-edge/
  src/
    entry.py
  wrangler.toml
  requirements.txt

entry.py defines an async on_fetch(request, env) function — the Workers runtime calls it for every HTTP request hitting your *.workers.dev route 3 .

# src/entry.py
import json

# Response and fetch are provided by the Workers Python runtime's workers module.
from workers import Response, fetch

# Note: newer Workers Python docs also show a class-based entrypoint;
# check which handler shape the docs recommend for your compatibility date.


async def on_fetch(request, env):
    # Only POST is accepted; everything else gets a 405
    if request.method != "POST":
        return Response(
            json.dumps({"error": "POST only"}),
            status=405,
            headers={"content-type": "application/json"},
        )

    # Expect a JSON body of the form {"prompt": "..."}
    body = await request.json()
    prompt = body.get("prompt", "").strip()
    if not prompt:
        return Response(
            json.dumps({"error": "missing prompt"}),
            status=400,
            headers={"content-type": "application/json"},
        )

    # Call Anthropic via fetch (Pyodide-friendly path)
    api_key = env.ANTHROPIC_API_KEY
    upstream = await fetch(
        "https://api.anthropic.com/v1/messages",
        method="POST",
        headers={
            "x-api-key": api_key,
            "anthropic-version": "2023-06-01",
            "content-type": "application/json",
        },
        body=json.dumps({
            "model": "claude-sonnet-4-5",
            "max_tokens": 512,
            "messages": [{"role": "user", "content": prompt}],
        }),
    )
    # Relay Anthropic's status code and JSON payload to the caller unchanged
    payload = await upstream.json()
    return Response(
        json.dumps(payload),
        status=upstream.status,
        headers={"content-type": "application/json"},
    )

A few notes worth carrying:

  • The official anthropic Python SDK depends on httpx, which currently does not run unmodified on Pyodide. Calling the Anthropic REST endpoint directly via the Workers fetch global sidesteps the dependency issue.
  • The anthropic-version header is required by the Anthropic API. Check the current canonical value on the Anthropic getting-started docs before deploying 4 .
  • Replace claude-sonnet-4-5 with whichever current model name appears on the Anthropic model catalogue on the day you deploy. Model identifiers are added and retired as new versions ship; do not assume the example identifier above is current.
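
To make the request and response shapes used by the handler concrete, here is a minimal sketch in plain Python that runs without the Workers runtime. The helper names build_messages_body and extract_text are hypothetical (they come from neither the Anthropic SDK nor Cloudflare); the field names follow the Messages API payload this tutorial already sends and receives.

```python
import json

# Hypothetical helpers for illustration; neither name comes from a library.
DEFAULT_MODEL = "claude-sonnet-4-5"  # same placeholder as the handler; check the current catalogue


def build_messages_body(prompt: str, model: str = DEFAULT_MODEL, max_tokens: int = 512) -> str:
    """Serialise a single-turn Messages API request body."""
    if not prompt.strip():
        raise ValueError("prompt must not be empty")
    return json.dumps({
        "model": model,
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt.strip()}],
    })


def extract_text(payload: dict) -> str:
    """Join the text blocks of a Messages API response payload."""
    blocks = payload.get("content", [])
    return "".join(b.get("text", "") for b in blocks if b.get("type") == "text")
```

Keeping these two steps as pure functions makes them easy to unit-test locally before the handler is deployed; extract_text returns an empty string for an error payload, which has no content list.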

Image: Anthropic API getting started documentation, used for editorial coverage of the API surface this tutorial calls.

Path B — Pages Functions edge proxy + FastAPI backend

If your dependency graph rules out Path A, keep FastAPI on a Python-native runtime and put a Pages Functions TypeScript edge handler in front. The Pages Functions runtime is the same V8 isolate model as Workers, just packaged as part of a Pages project 5 .

The FastAPI side stays standard:

# api/main.py
import os

from anthropic import AsyncAnthropic
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI()
# Async client, so the awaited call below does not block the event loop
client = AsyncAnthropic(api_key=os.environ["ANTHROPIC_API_KEY"])


class PromptIn(BaseModel):
    prompt: str


@app.post("/v1/chat")
async def chat(payload: PromptIn):
    # Pydantic has already checked that prompt is a string; reject blank ones
    if not payload.prompt.strip():
        raise HTTPException(400, "missing prompt")
    msg = await client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=512,
        messages=[{"role": "user", "content": payload.prompt}],
    )
    # Return only the generated text and the token counts
    return {
        "text": msg.content[0].text,
        "usage": {
            "input_tokens": msg.usage.input_tokens,
            "output_tokens": msg.usage.output_tokens,
        },
    }

Deploy this to any Python runtime that publishes an HTTPS endpoint (Modal, Fly.io, Render, Vercel Python serverless). Note the endpoint URL.

The Pages Functions edge proxy lives in functions/api/chat.ts:

// functions/api/chat.ts
export const onRequestPost: PagesFunction<{
  RATE_LIMIT: KVNamespace;
  BACKEND_URL: string;
}> = async ({ request, env }) => {
  const ip = request.headers.get("cf-connecting-ip") ?? "unknown";
  const key = `rl:${ip}`;
  const count = parseInt((await env.RATE_LIMIT.get(key)) ?? "0", 10);
  if (count >= 10) {
    return new Response(
      JSON.stringify({ error: "rate limited; try in 60s" }),
      { status: 429, headers: { "content-type": "application/json" } },
    );
  }
  await env.RATE_LIMIT.put(key, String(count + 1), { expirationTtl: 60 });

  const upstream = await fetch(`${env.BACKEND_URL}/v1/chat`, {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: await request.text(),
  });
  return new Response(upstream.body, {
    status: upstream.status,
    headers: { "content-type": "application/json" },
  });
};

The counter pattern is intentionally simple: 10 requests per IP, KV-stored, TTL-expired. It is not a true sliding window: each accepted request rewrites the key with a fresh 60-second TTL, so the counter only resets once a client has gone 60 seconds without an accepted request. The read-then-write is also not atomic, and KV is eventually consistent across Cloudflare’s network, so under burst traffic a single client may exceed the cap before propagation catches up 6 . For tighter limits Cloudflare publishes a Durable-Objects-backed rate-limiting recipe; KV is the right primitive for “good enough” gating on a tutorial deployment.
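
The Path A handler shown earlier has no rate limiter of its own, so here is a sketch of the same counter idea in Python. It is a fixed-window variant: the window index is part of the key, so rewriting the key never extends how long a client stays blocked. FakeKV, its put(key, value, ttl_seconds) signature, and allow_request are all hypothetical stand-ins for local testing; the real KV binding's call shape from Python (option names, any conversion of arguments) is an assumption to check against the Workers KV and Workers Python docs.

```python
import asyncio
import time


class FakeKV:
    """In-memory stand-in for a KV namespace (hypothetical; local testing only)."""

    def __init__(self, clock=time.monotonic):
        self._data = {}
        self._clock = clock

    async def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._data[key]  # expired entries behave like missing keys
            return None
        return value

    async def put(self, key, value, ttl_seconds):
        self._data[key] = (value, self._clock() + ttl_seconds)


async def allow_request(kv, ip, now, limit=10, window_seconds=60):
    """Fixed-window check: True if this request fits in the current window."""
    window = int(now // window_seconds)
    key = f"rl:{ip}:{window}"
    count = int(await kv.get(key) or "0")
    if count >= limit:
        return False
    # Read-then-write is not atomic, so concurrent requests can overshoot.
    await kv.put(key, str(count + 1), ttl_seconds=2 * window_seconds)
    return True
```

In a handler, the caller would pass the client IP from the cf-connecting-ip header and the current Unix time, and return a 429 when allow_request is False. The TTL of two windows only exists to let old keys expire; the window index in the key is what bounds the count.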


Image: Cloudflare Workers KV documentation, used for editorial coverage of the storage primitive this tutorial uses for rate-limit state.

Wire up the KV namespace

Whichever path you take, the KV namespace creation is the same. From your project root:

npx wrangler login
npx wrangler kv namespace create RATE_LIMIT

Wrangler prints a namespace ID. Add it to wrangler.toml:

name = "my-llm-edge"
main = "src/entry.py"   # Path A; for Path B use a pages_build_output_dir
compatibility_date = "2026-05-20"
compatibility_flags = ["python_workers"]   # Path A only

[[kv_namespaces]]
binding = "RATE_LIMIT"
id = "<paste-the-printed-id-here>"

The compatibility_flags = ["python_workers"] line is the Workers Python opt-in per the docs 7 . For Path B (Pages Functions) the flag is unnecessary and the main line is replaced with a Pages-project build configuration.

Set the Anthropic key as a Wrangler secret rather than a plain environment variable. Secrets are encrypted at rest and not surfaced in wrangler.toml:

npx wrangler secret put ANTHROPIC_API_KEY
# (paste the key when prompted)

For Path B you set BACKEND_URL as a plain variable (it’s not sensitive) and ANTHROPIC_API_KEY as a secret on whatever Python runtime is hosting the FastAPI app.

Deploy

For Path A:

npx wrangler deploy

Wrangler builds the Pyodide bundle, uploads to Cloudflare, and prints your *.workers.dev URL. Cold-start times on Workers Python at the time of writing are documented by Cloudflare as higher than the JavaScript runtime because of Pyodide initialisation. Check the current published cold-start guidance on the Python language docs before designing latency-sensitive use cases 8 .

For Path B you push the Pages project (via Git connection or wrangler pages deploy) and your FastAPI backend deploys per its own runtime’s workflow (Modal modal deploy, Fly.io flyctl deploy, Render Git push, etc.).


Image: Wrangler CLI documentation, used for editorial coverage of the deployment tooling this tutorial uses.

Test the deployed endpoint

A single curl confirms the wiring:

curl -X POST https://my-llm-edge.<your-subdomain>.workers.dev/ \
  -H "content-type: application/json" \
  -d '{"prompt": "Explain Cloudflare KV in two sentences."}'

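A successful Path A response returns the Anthropic Messages API payload as JSON, including content[0].text and a usage block with input and output token counts. On Path B, point curl at your Pages URL and the /api/chat route instead; the FastAPI backend returns the trimmed {"text": ..., "usage": ...} shape. With the rate limiter in place (the Path B proxy above, or an equivalent check added to the Path A handler), hit the endpoint 11 times in 60 seconds and the 11th request returns {"error": "rate limited; try in 60s"} with HTTP 429.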
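
The curl check can also be scripted, which makes the 11-request burst easier to repeat. The following is a minimal sketch using only the Python standard library; build_test_request, send, and burst are hypothetical helper names, and the URL you pass in is whatever your own deployment printed.

```python
import json
import urllib.error
import urllib.request
from collections import Counter


def build_test_request(url: str, prompt: str) -> urllib.request.Request:
    """Build the same POST the curl command sends, without sending it."""
    data = json.dumps({"prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=data,
        headers={"content-type": "application/json"},
        method="POST",
    )


def send(req: urllib.request.Request) -> int:
    """Send the request and return the HTTP status code."""
    try:
        with urllib.request.urlopen(req, timeout=30) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # 4xx/5xx responses arrive as exceptions


def burst(url: str, prompt: str, n: int = 11, sender=send) -> Counter:
    """Fire n requests one after another and tally the status codes."""
    return Counter(sender(build_test_request(url, prompt)) for _ in range(n))
```

Calling burst with your deployed URL returns a tally of status codes; with a working limiter, at least one 429 should appear among them. The sender parameter exists so the tally logic can be exercised with a fake sender before any real tokens are spent.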

If you get a 500, the most common causes in order of frequency:

  • ANTHROPIC_API_KEY not set as a secret, or set on the wrong environment (Wrangler distinguishes production and preview secrets).
  • anthropic-version header missing or wrong. Verify the canonical version string on the Anthropic getting-started docs.
  • Model name retired. Anthropic retires model identifiers; check the current model catalogue and update the model field.
  • Path A only: a Python import you added in entry.py is not on the Pyodide supported-packages list. The Cloudflare Python packages reference enumerates what’s available.

Watch the bill

Anthropic’s console at console.anthropic.com surfaces a Usage view showing per-day token spend by model, broken down by input + output tokens. Both Cloudflare Workers and the Anthropic API price on token-or-request volume, so a runaway loop on either side surfaces fast:

  • Anthropic charges per million input tokens and per million output tokens, with model-specific rates published on the pricing page 9 . Output tokens are billed at a meaningfully higher rate than input tokens on every current model; the per-million-token figures are listed on docs.claude.com. Verify on the day you deploy because Anthropic adjusts the per-model rate sheet periodically.
  • Cloudflare Workers prices on requests + CPU time, with a free tier on the platform pricing page that covers tutorial-scale traffic comfortably 10 . KV reads and writes are metered separately; the free tier published on the platform pricing page covers the per-IP rate-limit counter pattern at low traffic.
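
The per-million-token arithmetic is simple enough to sketch. The rates below are placeholders chosen for illustration and are not Anthropic's prices; substitute the figures from the pricing page on the day you deploy. The function name estimate_cost_usd and the 200-token prompt size are likewise assumptions.

```python
def estimate_cost_usd(input_tokens, output_tokens, input_rate_per_mtok, output_rate_per_mtok):
    """Cost of a batch of calls given per-million-token rates in USD."""
    input_cost = (input_tokens / 1_000_000) * input_rate_per_mtok
    output_cost = (output_tokens / 1_000_000) * output_rate_per_mtok
    return input_cost + output_cost


# Placeholder rates for illustration only; NOT Anthropic's actual prices.
EXAMPLE_INPUT_RATE = 3.0
EXAMPLE_OUTPUT_RATE = 15.0

# Worst case for one client in one minute under a 10-request cap, assuming
# roughly 200 input tokens per prompt and the 512-token max_tokens ceiling.
per_minute = estimate_cost_usd(10 * 200, 10 * 512, EXAMPLE_INPUT_RATE, EXAMPLE_OUTPUT_RATE)
```

With these placeholder numbers, one client hitting the cap costs roughly 0.08 USD per minute, and the output side dominates. That is why the max_tokens ceiling and the rate limit are the two levers worth tuning first.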

The practical guardrail: set an Anthropic console spend limit before exposing the deployed URL publicly. The limit caps your downside if the rate-limit logic has a bug. Cloudflare’s Workers dashboard offers billing alerts; an alert notifies rather than caps, so configure both.


Image: Anthropic API pricing documentation, used for editorial coverage of the token-pricing surface this tutorial recommends monitoring.

Where each path falls short

A flat comparison of the two paths against the criteria a small production deployment usually weighs:

Axis | Path A (Workers Python) | Path B (Pages Functions + FastAPI backend)
Edge proximity | Full edge (every Cloudflare PoP) | Edge proxy at PoP; backend at single region
FastAPI Pydantic validators | Not available | Available
Python-package compatibility | Pyodide-supported list only | Full PyPI
Cold-start profile | Higher (Pyodide init) | Edge: low; backend: depends on runtime
Production maturity (May 2026) | Open beta | Stable
Operational pieces | One (Workers project) | Two (Pages + Python runtime)
Anthropic SDK usage | Direct REST via fetch | Full anthropic Python SDK

Discussion on the Cloudflare developer forum, the Workers Python GitHub discussions, and recent practitioner write-ups currently favours Path B as the production-safer default for FastAPI specifically. Path A is the right choice when the dependency graph is genuinely Pyodide-compatible and edge proximity matters more than full PyPI access. Re-evaluate after Cloudflare announces general availability; the trade-off may flip.

What to wire next

Three natural extensions, in order of operational payoff:

  • Structured logging. Both paths benefit from a request log capturing prompt length, token usage, latency, and rate-limit hit/miss. On Workers Python, writing to the console (print, or console.log through the JavaScript bridge) is sufficient; on Pages Functions, console.log; on the FastAPI backend, a structured-logging library like structlog plus your Python runtime’s log drain.
  • Streaming responses. The Anthropic Messages API supports server-sent-events streaming. Path B forwards the SSE stream through the Pages Functions proxy with little additional code. Path A streams via the Workers ReadableStream primitive; the pattern is documented in the Workers Python docs.
  • Authentication. Public *.workers.dev URLs with only IP-based rate limiting are not access controlled. Add a bearer-token check in the handler, or place the endpoint behind Cloudflare Access for SSO-gated traffic.
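
For the authentication item, a bearer-token check can be sketched in a few lines of standard-library Python. The function name is_authorized is hypothetical, and so is the idea of storing the expected token as a Wrangler secret under a name such as API_BEARER_TOKEN; the handler would read the Authorization header and return a 401 when the check fails.

```python
import hmac


def is_authorized(authorization_header, expected_token):
    """Return True only for 'Bearer <token>' matching expected_token."""
    if not authorization_header or not expected_token:
        return False
    scheme, _, presented = authorization_header.partition(" ")
    if scheme.lower() != "bearer":
        return False
    # compare_digest avoids leaking the match position through timing.
    return hmac.compare_digest(presented.strip().encode(), expected_token.encode())
```

Run the check before the rate-limit lookup so that unauthenticated traffic never touches KV or the Anthropic API. An empty expected token is rejected outright, which guards against a missing secret silently disabling the check.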

The rest of the production checklist (observability, structured error responses, retries with exponential backoff on the upstream Anthropic call, prompt-injection input validation) is shared across both paths and is its own tutorial.

Costs to plan for

Both clocks tick on usage, not on idle. The Workers free tier published on the platform pricing page covers tutorial-scale traffic with significant headroom; the Anthropic free credit on a new account typically covers a few hundred prompts depending on prompt length and model choice. The published per-million-token rate sheet on docs.claude.com is the authoritative reference; verify on the day you deploy, since Anthropic adjusts it periodically.

For a public deployment with even modest traffic, the practical cost driver is the Anthropic side, not the Cloudflare side. The Workers KV rate-limiter exists primarily to cap Anthropic spend, not to manage Cloudflare cost.

How this article was made: an autonomous AI pipeline researched, drafted, fact-checked, and reviewed this piece, aggregating publicly-available information from the sources consulted below. AI (artificial intelligence) can make mistakes, so please cross-check the consulted sources before acting on anything here. Neural Tech Daily is not liable for decisions or outcomes based on this article.

Sources consulted
