Build and Deploy a Sentiment-Analysis API: HuggingFace Transformers + FastAPI + Fly.io (May 2026)

End-to-end Python tutorial: load DistilBERT SST-2 with Transformers, expose a /analyze endpoint via FastAPI lifespan, batch inputs, Dockerize, ship to Fly.io.

20 May 2026 ~14 min read

Image: Hugging Face model card for DistilBERT SST-2, used for editorial coverage of the model loaded in this tutorial.

What you’ll build

A production-shaped HTTP API that wraps a Hugging Face sentiment classifier. A single POST /analyze endpoint accepts a JSON payload of one or more strings, runs DistilBERT inference, and returns a label and probability for each input. The model loads once at process start and every request shares that warm copy. Batched inputs go through transformers.pipeline in a single call, inference that runs past a wall-clock limit is answered with a 504, and the whole thing ships as a Docker image to Fly.io.

The Hugging Face Transformers pipeline documentation¹, the FastAPI lifespan-events guide², and Fly.io’s Dockerfile-deploy reference³ together support this stack as the shortest credible path from “a .py file on a laptop” to “a publicly-reachable endpoint a teammate can curl.”

The model: distilbert/distilbert-base-uncased-finetuned-sst-2-english. Hugging Face’s model card lists it at 67M parameters with Apache-2.0 licensing and a reported accuracy of 91.3 on the SST-2 dev set⁴. It is the canonical “small, fast, English binary sentiment” baseline; Step 4 covers the CPU-versus-GPU trade-off for serving it.

Prerequisites

Python 3.11 or 3.12.
Docker installed locally (Docker Desktop on macOS / Windows, or the Engine on Linux).
A Fly.io account with flyctl installed. Fly’s quickstart documents the install + fly auth signup flow⁵.
About 270 MB of disk for the model weights on first download (67M float32 parameters at 4 bytes each); Hugging Face caches them under ~/.cache/huggingface/.

Step 1: Project scaffold

Create a clean directory and the file skeleton:

mkdir sentiment-api && cd sentiment-api
python -m venv .venv
source .venv/bin/activate    # Windows: .venv\Scripts\activate
touch app.py requirements.txt Dockerfile fly.toml

Pin the dependencies in requirements.txt. The versions below are one mutually compatible set; check each project’s PyPI release page for the current stable line before you pin. Pinning means a future Transformers release won’t silently break your image build.

fastapi==0.115.0
uvicorn[standard]==0.32.0
transformers==4.46.0
torch==2.5.0
pydantic==2.9.0

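torch is the heaviest install in the list. On Linux x86_64, a plain pip install torch from PyPI resolves to the CUDA-enabled build, which pulls in several gigabytes of NVIDIA libraries. The CPU-only wheel is roughly 200 MB and is published on PyTorch’s own package index (https://download.pytorch.org/whl/cpu); installing from that index is what keeps a CPU-only Docker image small. Per the Transformers pipeline reference, the model runs on CPU by default unless device=0 is passed¹.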

Step 2: Load the model once with FastAPI lifespan

The naive pattern, calling pipeline(...) inside the request handler, re-loads the model on every request and is what every production deployment guide steers you away from. FastAPI’s documentation calls out machine-learning models as the canonical lifespan use-case: load on startup, share via app state, free on shutdown².

Paste this into app.py:

import asyncio
import logging
from contextlib import asynccontextmanager
from typing import List

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel, Field
from transformers import pipeline

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("sentiment-api")

MODEL_ID = "distilbert/distilbert-base-uncased-finetuned-sst-2-english"
MAX_BATCH = 32
REQUEST_TIMEOUT_S = 10.0

ml_state: dict = {}


@asynccontextmanager
async def lifespan(app: FastAPI):
    log.info("loading model %s ...", MODEL_ID)
    ml_state["classifier"] = pipeline(
        task="sentiment-analysis",
        model=MODEL_ID,
        device=-1,                # -1 = CPU; set to 0 for first CUDA GPU
        truncation=True,
        max_length=512,
    )
    # Warm-up forward pass so the first real request doesn't pay
    # the kernel-compile / lazy-init cost.
    ml_state["classifier"]("warm-up sentence for the JIT cache.")
    log.info("model ready")
    yield
    log.info("shutting down; releasing model")
    ml_state.clear()


app = FastAPI(title="sentiment-api", lifespan=lifespan)


class AnalyzeRequest(BaseModel):
    inputs: List[str] = Field(..., min_length=1, max_length=MAX_BATCH)


class AnalyzeResultItem(BaseModel):
    label: str
    score: float


class AnalyzeResponse(BaseModel):
    results: List[AnalyzeResultItem]


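@app.get("/healthz")
def healthz():
    # Answer 503 until the model is loaded: HTTP health checks look at the
    # status code, so a 200 carrying {"ok": false} would still count as healthy.
    if "classifier" not in ml_state:
        raise HTTPException(status_code=503, detail="model not ready")
    return {"ok": True}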


@app.post("/analyze", response_model=AnalyzeResponse)
async def analyze(req: AnalyzeRequest):
    classifier = ml_state.get("classifier")
    if classifier is None:
        raise HTTPException(status_code=503, detail="model not ready")

    def _run():
        return classifier(req.inputs, batch_size=len(req.inputs))

    try:
        results = await asyncio.wait_for(
            asyncio.to_thread(_run),
            timeout=REQUEST_TIMEOUT_S,
        )
    except asyncio.TimeoutError:
        raise HTTPException(
            status_code=504,
            detail=f"inference exceeded {REQUEST_TIMEOUT_S}s",
        )

    return AnalyzeResponse(results=results)

Four production-grade concerns are answered in that single file. The lifespan context manager loads the classifier once and pins it in ml_state. A throwaway warm-up call pays the lazy-init cost before the first real request lands. Inference runs in a worker thread via asyncio.to_thread so it does not block the event loop. And asyncio.wait_for enforces a wall-clock cap on how long the client waits: after REQUEST_TIMEOUT_S the handler returns a 504. One caveat: a Python thread cannot be cancelled from outside, so a timed-out inference call keeps running in its worker thread until it finishes. The cap bounds response latency, not CPU work.
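
The timeout behaviour is easy to misread, so here is a minimal standard-library sketch of it. slow_inference is a hypothetical stand-in for the classifier call (it just sleeps); nothing here touches Transformers or FastAPI. It shows that asyncio.wait_for hands control back to the caller at the deadline while the thread started by asyncio.to_thread carries on to completion.

```python
import asyncio
import threading
import time

# Set by the worker thread when the (simulated) inference call really ends.
finished = threading.Event()


def slow_inference() -> str:
    """Hypothetical stand-in for classifier(...): blocks its thread for 0.3 s."""
    time.sleep(0.3)
    finished.set()
    return "done"


async def call_with_timeout(timeout_s: float) -> str:
    """Same shape as the /analyze handler: to_thread wrapped in wait_for."""
    try:
        return await asyncio.wait_for(
            asyncio.to_thread(slow_inference), timeout=timeout_s
        )
    except asyncio.TimeoutError:
        return "timed out"


async def demo() -> tuple:
    outcome = await call_with_timeout(0.05)
    # The caller already has its answer, but the thread has not finished yet.
    thread_still_running = not finished.is_set()
    # Wait (off the event loop) for the abandoned call to complete.
    completed_later = await asyncio.to_thread(finished.wait, 2.0)
    return outcome, thread_still_running, completed_later


result = asyncio.run(demo())
print(result)
```

In practice this means a burst of timed-out requests can still occupy the thread pool for a while, which is one more reason to cap batch size and input length up front.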

Image: Hugging Face blog — Getting started with sentiment analysis using Python, used for editorial coverage of the model-loading pattern this tutorial extends.

Run it locally to confirm the wiring:

pip install -r requirements.txt
uvicorn app:app --host 0.0.0.0 --port 8080

The first launch downloads the model weights into the Hugging Face cache, prints model ready, then serves. Hit it from a second terminal:

curl -s -X POST http://localhost:8080/analyze \
    -H "content-type: application/json" \
    -d '{"inputs": ["I loved it.", "Worst film of the year."]}'

You should see a JSON response with POSITIVE and NEGATIVE labels and probabilities approaching 1.0 on both inputs.
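
If a teammate would rather call the endpoint from Python than from curl, a client can be sketched with the standard library alone. The helper names build_request and analyze are illustrative, not part of the tutorial's app; the URL, path, and payload shape are the ones used above.

```python
import json
import urllib.request
from typing import List


def build_request(base_url: str, inputs: List[str]) -> urllib.request.Request:
    """Build the POST /analyze request with the JSON body the API expects."""
    body = json.dumps({"inputs": inputs}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def analyze(base_url: str, inputs: List[str], timeout_s: float = 15.0) -> List[dict]:
    """Send the request and return the list under the "results" key."""
    request = build_request(base_url, inputs)
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return json.load(response)["results"]


# Building the request needs no running server, so it is safe to inspect.
req = build_request(
    "http://localhost:8080", ["I loved it.", "Worst film of the year."]
)
print(req.get_method(), req.full_url)

# With the server from this step running:
#   for item in analyze("http://localhost:8080", ["I loved it."]):
#       print(item["label"], item["score"])
```

The client timeout is set a little above the server's 10-second cap so the 504 from the API arrives before the client gives up on its own.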

Step 3: Batching strategy — why and when it helps

The pipeline reference is blunt on this: batch inference is disabled by default, and naive batching is not always faster⁶. On CPU, batching tends to help only when inputs are roughly the same length, because every sequence is padded to the longest one in the batch and the padding is wasted compute. On GPU, batching is usually a clear win because larger batches keep more of the device’s parallel hardware busy.

The handler above passes batch_size=len(req.inputs) so a client that sends ten strings gets a single forward pass for all ten. The MAX_BATCH=32 cap on the Pydantic request schema prevents a single client from passing 5,000 strings and starving co-tenants.

Two refinements once you have real traffic data:

Dynamic padding to the longest input in the batch. When batch_size is greater than 1, the pipeline pads each batch to its longest sequence; the truncation=True, max_length=512 arguments in the pipeline initialiser keep one pathological 50,000-token input from blowing up memory.
Server-side micro-batching across requests. Two requests that arrive within 5–10 ms of each other can be combined into one model call. The Roman Parykin write-up on Hugging Face pipeline inference optimisation walks through the queue-and-drain pattern; the cost is a few milliseconds of added latency for the requests that arrive first in a window.
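
The queue-and-drain idea can be sketched in plain asyncio. This is an illustrative sketch under stated assumptions, not the implementation from the write-up mentioned above: MicroBatcher, its parameters, and fake_classifier are all hypothetical names, and the fake classifier stands in for the real pipeline so the example runs without any model.

```python
import asyncio
from contextlib import suppress
from typing import Callable, Dict, List

BatchFn = Callable[[List[str]], List[Dict]]


class MicroBatcher:
    """Collects inputs that arrive within a short window into one model call."""

    def __init__(self, run_batch: BatchFn, max_batch: int = 32, window_s: float = 0.01):
        self._run_batch = run_batch
        self._max_batch = max_batch
        self._window_s = window_s
        self._queue: asyncio.Queue = asyncio.Queue()

    async def submit(self, text: str) -> Dict:
        """Enqueue one input and wait for its individual result."""
        future = asyncio.get_running_loop().create_future()
        await self._queue.put((text, future))
        return await future

    async def run(self) -> None:
        """Drain loop: take one item, then keep collecting until the window closes."""
        loop = asyncio.get_running_loop()
        while True:
            batch = [await self._queue.get()]
            deadline = loop.time() + self._window_s
            while len(batch) < self._max_batch:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self._queue.get(), remaining))
                except asyncio.TimeoutError:
                    break
            texts = [text for text, _ in batch]
            try:
                # Run the blocking model call off the event loop.
                results = await asyncio.to_thread(self._run_batch, texts)
            except Exception as exc:
                # Propagate the failure to every caller waiting on this batch.
                for _, future in batch:
                    if not future.done():
                        future.set_exception(exc)
                continue
            for (_, future), result in zip(batch, results):
                if not future.done():
                    future.set_result(result)


batch_sizes: List[int] = []


def fake_classifier(texts: List[str]) -> List[Dict]:
    """Hypothetical stand-in for the pipeline; records the size of each batch."""
    batch_sizes.append(len(texts))
    return [
        {"label": "POSITIVE" if "good" in text else "NEGATIVE", "score": 1.0}
        for text in texts
    ]


async def demo() -> List[Dict]:
    batcher = MicroBatcher(fake_classifier, window_s=0.05)
    worker = asyncio.create_task(batcher.run())
    results = await asyncio.gather(
        batcher.submit("good film"),
        batcher.submit("bad film"),
        batcher.submit("so good"),
    )
    worker.cancel()
    with suppress(asyncio.CancelledError):
        await worker
    return results


results = asyncio.run(demo())
print(batch_sizes, [item["label"] for item in results])
```

Three submissions made at the same moment end up in a single call to the classifier. The price is the window itself: the first request in each batch waits up to window_s before inference even starts.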

The Transformers docs and the Databricks NLP-inference reference land on the same recommendation: measure on your hardware before committing to a batch size⁶. The right number depends on input length distribution, hardware, and model size, so there is no universal default.
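
A measurement harness for that advice can be very small. The sketch below is one possible shape, with hypothetical names (chunked, throughput_by_batch_size) and a fake classifier whose cost model is invented purely so the example runs anywhere; the numbers it prints say nothing about DistilBERT.

```python
import time
from typing import Callable, Dict, List, Sequence


def chunked(items: Sequence[str], size: int) -> List[List[str]]:
    """Split items into consecutive chunks of at most `size` elements."""
    return [list(items[i:i + size]) for i in range(0, len(items), size)]


def throughput_by_batch_size(
    run_batch: Callable[[List[str]], list],
    inputs: Sequence[str],
    batch_sizes: Sequence[int],
    repeats: int = 3,
) -> Dict[int, float]:
    """Return inputs per second for each batch size, best of `repeats` runs."""
    throughput: Dict[int, float] = {}
    for size in batch_sizes:
        best_elapsed = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            for chunk in chunked(inputs, size):
                run_batch(chunk)
            best_elapsed = min(best_elapsed, time.perf_counter() - start)
        throughput[size] = len(inputs) / max(best_elapsed, 1e-9)
    return throughput


def fake_classifier(texts: List[str]) -> list:
    """Hypothetical cost model: fixed overhead per call plus a per-input cost."""
    time.sleep(0.001 + 0.0005 * len(texts))
    return [{"label": "POSITIVE", "score": 1.0} for _ in texts]


sample_inputs = ["short one", "a noticeably longer sentence about a film"] * 8
measured = throughput_by_batch_size(fake_classifier, sample_inputs, [1, 4, 16])
for size, rate in measured.items():
    print(f"batch_size={size:>2}: {rate:,.0f} inputs/s")
```

To measure the real model, pass something like lambda chunk: classifier(chunk, batch_size=len(chunk)) as run_batch, and use a sample of inputs whose length distribution resembles production traffic.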

Step 4: GPU vs CPU — the latency tradeoff

For the SST-2 DistilBERT model specifically, the practical numbers fall in a known band on modern hardware:

CPU (modern x86, 4–8 vCPU): single-input latency in the tens of milliseconds; throughput roughly 50–200 inputs per second depending on input length. The model is small enough that CPU is genuinely usable in production for low-QPS workloads.
GPU (entry-level T4 / L4 / A10): single-input latency in the low single-digit milliseconds; throughput an order of magnitude higher on batched calls. The cost is the GPU instance itself, which on most clouds runs 5–10x the price of a CPU machine.

To switch the code to GPU you change device=-1 to device=0 (first CUDA device) in the pipeline call, and your Docker base image to a CUDA-enabled one. For a binary-sentiment classifier serving fewer than ~50 requests per second, the source consensus across the Transformers pipeline tutorial and the inference-optimisation write-ups treats CPU as the default and reaches for GPU only once measured CPU latency stops fitting the request SLO. Fly.io publishes GPU machine types separately from the standard compute tier; Fly’s pricing page is the canonical reference for current hourly rates⁷.
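
Rather than editing the source to flip between CPU and GPU, the device can be read from the environment. The sketch below assumes a hypothetical SENTIMENT_DEVICE variable (the name is made up for this example) and maps it to the integer the pipeline call already uses: -1 for CPU, 0 and up for a CUDA device index.

```python
import os
from typing import Mapping, Optional


def resolve_device(env: Optional[Mapping[str, str]] = None) -> int:
    """Map the hypothetical SENTIMENT_DEVICE variable to a pipeline device index.

    Accepted values: "cpu" (or unset), "cuda", "cuda:N", or a bare integer N.
    """
    env = os.environ if env is None else env
    raw = env.get("SENTIMENT_DEVICE", "cpu").strip().lower()
    if raw in ("", "cpu", "-1"):
        return -1
    if raw == "cuda":
        return 0
    if raw.startswith("cuda:"):
        raw = raw.split(":", 1)[1]
    index = int(raw)  # raises ValueError for anything unrecognised
    if index < 0:
        raise ValueError(f"invalid device index: {index}")
    return index


print(resolve_device({}))
print(resolve_device({"SENTIMENT_DEVICE": "cuda:0"}))
```

In app.py the pipeline call would then take device=resolve_device() in place of the hard-coded -1, so the same image can run on either machine type.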

Image: Hugging Face Transformers Pipelines class reference, used for editorial coverage of the batched-inference pattern this tutorial uses.

Step 5: Dockerize

Create Dockerfile:

FROM python:3.12-slim

ENV PYTHONUNBUFFERED=1 \
    PIP_NO_CACHE_DIR=1 \
    HF_HOME=/app/.hf-cache \
    TRANSFORMERS_NO_ADVISORY_WARNINGS=1

WORKDIR /app

# System deps for tokenizers (Rust) and torch
RUN apt-get update && apt-get install -y --no-install-recommends \
        build-essential ca-certificates \
    && rm -rf /var/lib/apt/lists/*

COPY requirements.txt .
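# Install torch from PyTorch's CPU-only index; the default PyPI wheel on
# Linux bundles CUDA libraries and makes the image several GB larger.
RUN pip install --upgrade pip && \
    pip install --extra-index-url https://download.pytorch.org/whl/cpu -r requirements.txt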

# Pre-download the model into the image. This shifts the weight
# download from first-request time (slow, network-dependent) to
# build time (cached in the image layer).
RUN python -c "from transformers import pipeline; \
    pipeline('sentiment-analysis', \
    model='distilbert/distilbert-base-uncased-finetuned-sst-2-english')"

COPY app.py .

EXPOSE 8080
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8080"]

Two things worth calling out. First, pre-downloading the model into the image at build time means containers start ready, not “ready in 30 seconds once weights finish streaming.” Second, python:3.12-slim with a CPU-only torch build keeps the image around 1.2–1.5 GB compressed; a full python:3.12 base would land closer to 2 GB without buying much.

Build and run locally:

docker build -t sentiment-api .
docker run --rm -p 8080:8080 sentiment-api

The same curl from Step 2 should now hit the containerised version.

Step 6: Deploy to Fly.io

Fly’s fly launch walkthrough is the canonical reference for first-time deploy; it scans the working directory, detects the Dockerfile, generates a fly.toml, and provisions a machine⁵.

fly launch --no-deploy

Answer the prompts: pick an app name, pick a region near your users, decline the Postgres / Redis offers (the API is stateless), accept the Dockerfile when prompted. Fly writes a fly.toml. Open it and bump the machine size — the default 256 MB is too small to hold DistilBERT weights in memory. Edit the [[vm]] block:

[[vm]]
    size = "shared-cpu-2x"
    memory = "2gb"

DistilBERT plus the Python runtime needs roughly 800 MB to 1 GB resident; 2 GB gives headroom for batching overhead and the OS file cache.
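
The sizing can be sanity-checked with back-of-envelope arithmetic. The parameter count and the 1 GB resident figure come from this article; the float32 assumption, the 25% headroom fraction, and the fits helper are assumptions made for this sketch, not measurements.

```python
PARAMS = 67_000_000      # parameter count quoted from the model card
BYTES_PER_PARAM = 4      # assumes float32 weights

weights_mb = PARAMS * BYTES_PER_PARAM / 1_000_000
print(f"float32 weights: ~{weights_mb:.0f} MB")


def fits(
    vm_memory_mb: int,
    workers: int,
    per_worker_resident_mb: int = 1000,   # upper estimate quoted in the text
    headroom_fraction: float = 0.25,      # assumed reserve for OS and batching
) -> bool:
    """Rough check: do `workers` model copies fit with headroom left over?"""
    usable_mb = vm_memory_mb * (1 - headroom_fraction)
    return workers * per_worker_resident_mb <= usable_mb


print("1 worker on 2 GB:", fits(2048, workers=1))
print("2 workers on 2 GB:", fits(2048, workers=2))
```

The weights account for only part of the resident set; the interpreter, the torch libraries, and activation buffers make up the rest. Since each Uvicorn worker holds its own copy of the model, the same check is worth repeating before raising the worker count.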

Adjust the HTTP service block to point at port 8080 and the /healthz endpoint:

[http_service]
    internal_port = 8080
    force_https = true
    auto_stop_machines = "stop"
    auto_start_machines = true
    min_machines_running = 1

    [[http_service.checks]]
        grace_period = "30s"
        interval = "15s"
        method = "GET"
        timeout = "5s"
        path = "/healthz"

The 30-second grace_period matters. Without it, Fly’s health-checker hits /healthz while the model is still loading on a cold start, marks the machine unhealthy, and restarts it in a loop. The grace gives the lifespan startup time to complete.
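
The same wait-then-check logic is useful outside Fly, for example in a CI smoke test that must not fire requests before the model has loaded. The sketch below is a hypothetical helper pair (http_probe and wait_until_ready are names invented here) built on the standard library; the demo uses a fake probe so it runs without a server.

```python
import json
import time
import urllib.request
from typing import Callable


def http_probe(url: str) -> bool:
    """True when GET <url> answers 200 with a JSON body of {"ok": true}."""
    with urllib.request.urlopen(url, timeout=5) as response:
        return response.status == 200 and json.load(response).get("ok") is True


def wait_until_ready(
    probe: Callable[[], bool],
    timeout_s: float = 30.0,
    interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `probe` until it returns True or `timeout_s` has elapsed."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            if probe():
                return True
        except OSError:
            # Connection refused or an HTTP error status; urllib's URLError
            # and HTTPError both subclass OSError. Treat as "not ready yet".
            pass
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)


# Demo with a fake probe that becomes ready on the third attempt.
answers = iter([False, False, True])
attempts = []


def fake_probe() -> bool:
    attempts.append(1)
    return next(answers)


ready = wait_until_ready(fake_probe, timeout_s=5.0, interval_s=0.0, sleep=lambda _s: None)
never_ready = wait_until_ready(lambda: False, timeout_s=0.0, interval_s=0.0, sleep=lambda _s: None)
print(ready, len(attempts), never_ready)
```

Against a live deployment the call would look like wait_until_ready(lambda: http_probe("https://<app-name>.fly.dev/healthz"), timeout_s=60), with the placeholder replaced by the real hostname.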

Deploy:

fly deploy

Fly builds the image (locally or in the remote builder, depending on your config), pushes it, and rolls a machine. Per Fly’s fly deploy reference, the command handles image build + machine refresh in a single invocation³.

Image: Fly.io — Fly Machines blog post, used for editorial coverage of the deploy primitive this tutorial targets.

When the deploy finishes, Fly prints a hostname of the shape https://<app-name>.fly.dev. Hit it:

curl -s -X POST https://<app-name>.fly.dev/analyze \
    -H "content-type: application/json" \
    -d '{"inputs": ["Surprisingly good.", "Hated every minute."]}'

You should see the same JSON shape as the local response.

Step 7: Production hardening checklist

The endpoint works; “works” is not “production.” A short list of the gaps this tutorial does not yet close, in roughly the order most deployments hit them:

Authentication. The /analyze endpoint is publicly callable. Add a header-based API key, OAuth, or put Cloudflare Access in front. Anything internet-reachable that runs ML inference will get scraped.
Rate limiting per client. Per-IP token-bucket or Redis-backed sliding-window. The current setup caps per-request batch size but not per-client request rate.
Structured logging plus a request ID. Wire loguru or structlog, emit one JSON line per request with latency, batch size, status code. Fly’s log pipeline accepts JSON.
Metrics. Prometheus scrape endpoint at /metrics via prometheus-fastapi-instrumentator, or push to whichever observability vendor you already use. The two metrics worth alerting on are p99 /analyze latency and model_ready=false after the lifespan startup window.
Concurrency limit. Uvicorn defaults to one worker. For CPU-bound inference, one worker per available core is the rough starting point; add --workers 2 (or whatever fits your machine size). Each worker loads its own copy of the model — memory cost scales linearly.
Graceful shutdown. The code after the lifespan yield runs on shutdown; if you swap to a tokenizer or model that needs explicit cleanup, that’s where it goes.
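
Of the items above, per-client rate limiting is the easiest to prototype without extra services. The following is a minimal in-memory token-bucket sketch; TokenBucket and RateLimiter are hypothetical classes written for this example, and the injectable clock exists only to make the behaviour easy to demonstrate.

```python
import time
from typing import Callable, Dict


class TokenBucket:
    """Refills at `rate_per_s` tokens per second up to `capacity`."""

    def __init__(self, rate_per_s: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic):
        self.rate_per_s = rate_per_s
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity
        self.updated_at = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Spend `cost` tokens if available; otherwise refuse."""
        now = self.clock()
        refill = (now - self.updated_at) * self.rate_per_s
        self.tokens = min(self.capacity, self.tokens + refill)
        self.updated_at = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class RateLimiter:
    """Keeps one bucket per client identifier (an API key or an IP address)."""

    def __init__(self, rate_per_s: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic):
        self._rate_per_s = rate_per_s
        self._capacity = capacity
        self._clock = clock
        self._buckets: Dict[str, TokenBucket] = {}

    def allow(self, client_id: str) -> bool:
        if client_id not in self._buckets:
            self._buckets[client_id] = TokenBucket(
                self._rate_per_s, self._capacity, self._clock
            )
        return self._buckets[client_id].allow()


# Demo with a fake clock: a burst of 2 is allowed, then 1 request per second.
now = [0.0]
limiter = RateLimiter(rate_per_s=1.0, capacity=2, clock=lambda: now[0])
decisions = [limiter.allow("client-a") for _ in range(3)]
other_client = limiter.allow("client-b")
now[0] = 1.0
after_refill = limiter.allow("client-a")
print(decisions, other_client, after_refill)
```

In the API this check would sit in a FastAPI dependency that raises a 429 when allow returns False. The state lives in one process, so with several workers or machines each keeps its own counts; that is the gap the Redis-backed option closes.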

Image: FastAPI GitHub repository, used for editorial coverage of the framework this tutorial uses.

Troubleshooting

A few failure modes that recur on first deploys, with the fix each time:

“Killed” during fly deploy build, or container OOMs at startup. Memory undersized. Bump the [[vm]] block to 2 GB minimum.
Health check fails for the first 60 seconds, machine restarts in a loop. grace_period too short. Bump to 60s if the container is downloading model weights at startup (only happens if you skipped the pre-download RUN python -c ... step in the Dockerfile).
Request returns 504 after exactly 10 seconds. The REQUEST_TIMEOUT_S = 10.0 cap fired. Either inputs are too long, batch is too large, or CPU is genuinely overloaded. Profile before raising the cap.
First request after a long idle is slow or returns 503. Fly’s auto_stop_machines = "stop" setting stopped the machine, so the next request triggers a cold start and the lifespan startup runs again. This only happens when min_machines_running is 0; the fly.toml above sets min_machines_running = 1, which keeps at least one machine warm at the cost of paying for it continuously.
fly launch does not detect the Dockerfile. Run from the directory containing the Dockerfile, not a parent. Fly’s launch documentation walks through the auto-detect logic⁸.

What you’ve shipped

A FastAPI app that loads DistilBERT once at startup via the lifespan context manager, exposes POST /analyze with batched inference, enforces a 10-second per-request timeout, packages cleanly into a 1.2–1.5 GB Docker image with model weights baked in, and deploys to Fly.io with a /healthz probe that gates traffic until the model is genuinely ready.

The model itself is the same DistilBERT SST-2 checkpoint that has anchored the Hugging Face sentiment-analysis quickstart for years; the production wrapper around it is what turns a notebook into something a teammate can curl from a CI job. Swap the MODEL_ID constant for any other Transformers sequence-classification model and the rest of the pipeline carries over unchanged — that is the payoff of the lifespan pattern.

How this article was made: an autonomous AI pipeline researched, drafted, fact-checked, and reviewed this piece, aggregating publicly-available information from the sources consulted below. AI (artificial intelligence) can make mistakes, so please cross-check the consulted sources before acting on anything here. Neural Tech Daily is not liable for decisions or outcomes based on this article.

Sources consulted

Cited Sources

1. Transformers pipeline tutorial — default device, batch_size argument, and inference behaviour (accessed 2026-05-20)
2. FastAPI advanced — Lifespan Events documentation, machine-learning model use-case (accessed 2026-05-20)
3. Fly.io deploy-with-a-Dockerfile reference — fly deploy build-and-refresh flow (accessed 2026-05-20)
4. Hugging Face model card — distilbert-base-uncased-finetuned-sst-2-english, 67M params, Apache-2.0, 91.3 SST-2 dev-set accuracy (accessed 2026-05-20)
5. Fly.io Quickstart — flyctl install, fly auth signup, fly launch first-deploy flow (accessed 2026-05-20)
6. Transformers Pipelines class reference — batch inference disabled by default, tuning guidance (accessed 2026-05-20)
7. Fly.io pricing — shared-cpu, dedicated, and GPU machine tier hourly rates (accessed 2026-05-20)
8. Fly.io flyctl launch reference — working-directory scan and Dockerfile auto-detect (accessed 2026-05-20)