Build and Deploy a Sentiment-Analysis API: HuggingFace Transformers + FastAPI + Fly.io (May 2026)
End-to-end Python tutorial: load DistilBERT SST-2 with Transformers, expose a /analyze endpoint via FastAPI lifespan, batch inputs, Dockerize, ship to Fly.io.
What you’ll build
A production-shaped HTTP API that wraps a Hugging Face sentiment classifier. A single POST /analyze endpoint accepts a JSON payload of one or more strings, runs DistilBERT inference, and returns a label plus probability for each input. The model loads once at process start and every request shares that warm copy. Batched inputs ride through transformers.pipeline in a single forward pass, inference calls that run too long are cut off by a request timeout, and the whole thing ships as a Docker image to Fly.io.
Taken together, the Hugging Face Transformers pipeline documentation 1 , the FastAPI lifespan-events guide 2 , and Fly.io’s Dockerfile-deploy reference 3 support this stack as the shortest credible path from “a .py file on a laptop” to “a publicly-reachable endpoint a teammate can curl.”
The model: distilbert/distilbert-base-uncased-finetuned-sst-2-english. Hugging Face’s model card lists it at 67M parameters with Apache-2.0 licensing and reports 91.3% accuracy on the SST-2 dev set 4 . It is the canonical “small, fast, English binary sentiment” baseline; later in the article the trade-off versus larger models gets named explicitly.
Prerequisites
- Python 3.11 or 3.12.
- Docker installed locally (Docker Desktop on macOS / Windows, or the Engine on Linux).
- A Fly.io account with flyctl installed. Fly’s quickstart documents the install + fly auth signup flow 5 .
- About 270 MB of disk for the model weights on first download (67M float32 parameters); Hugging Face caches them at ~/.cache/huggingface/.
Step 1: Project scaffold
Create a clean directory and the file skeleton:
mkdir sentiment-api && cd sentiment-api
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
touch app.py requirements.txt Dockerfile fly.toml
Pin the dependencies in requirements.txt. The versions below are a mutually compatible set from the autumn 2024 release lines; check each project’s PyPI release page for newer versions before you copy them. Pinning means a future Transformers release won’t silently break your image build.
fastapi==0.115.0
uvicorn[standard]==0.32.0
transformers==4.46.0
torch==2.5.0
pydantic==2.9.0
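torch is the heaviest install in the list. On Linux x86_64, a plain pip install torch resolves to the CUDA-enabled build on PyPI, which together with its bundled NVIDIA libraries is several times larger than the CPU build. The CPU-only wheel is roughly 200 MB and is served from PyTorch’s own index (pip install torch --index-url https://download.pytorch.org/whl/cpu); that is the build a CPU deployment should use. Per the Transformers pipeline reference, the model runs on CPU by default unless device=0 is passed 1 .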
Step 2: Load the model once with FastAPI lifespan
The naive pattern, calling pipeline(...) inside the request handler, re-loads the model on every request and is what every production deployment guide steers you away from. FastAPI’s documentation calls out machine-learning models as the canonical lifespan use-case: load on startup, share via app state, free on shutdown 2 .
Paste this into app.py:
import asyncio
import logging
from contextlib import asynccontextmanager
from typing import List

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel, Field
from transformers import pipeline

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("sentiment-api")

MODEL_ID = "distilbert/distilbert-base-uncased-finetuned-sst-2-english"
MAX_BATCH = 32
REQUEST_TIMEOUT_S = 10.0

ml_state: dict = {}


@asynccontextmanager
async def lifespan(app: FastAPI):
    log.info("loading model %s ...", MODEL_ID)
    ml_state["classifier"] = pipeline(
        task="sentiment-analysis",
        model=MODEL_ID,
        device=-1,  # -1 = CPU; set to 0 for first CUDA GPU
        truncation=True,
        max_length=512,
    )
    # Warm-up forward pass so the first real request doesn't pay
    # the kernel-compile / lazy-init cost.
    ml_state["classifier"]("warm-up sentence for the JIT cache.")
    log.info("model ready")
    yield
    log.info("shutting down; releasing model")
    ml_state.clear()


app = FastAPI(title="sentiment-api", lifespan=lifespan)


class AnalyzeRequest(BaseModel):
    inputs: List[str] = Field(..., min_length=1, max_length=MAX_BATCH)


class AnalyzeResultItem(BaseModel):
    label: str
    score: float


class AnalyzeResponse(BaseModel):
    results: List[AnalyzeResultItem]


@app.get("/healthz")
def healthz():
    return {"ok": "classifier" in ml_state}


@app.post("/analyze", response_model=AnalyzeResponse)
async def analyze(req: AnalyzeRequest):
    classifier = ml_state.get("classifier")
    if classifier is None:
        raise HTTPException(status_code=503, detail="model not ready")

    def _run():
        return classifier(req.inputs, batch_size=len(req.inputs))

    try:
        results = await asyncio.wait_for(
            asyncio.to_thread(_run),
            timeout=REQUEST_TIMEOUT_S,
        )
    except asyncio.TimeoutError:
        raise HTTPException(
            status_code=504,
            detail=f"inference exceeded {REQUEST_TIMEOUT_S}s",
        )
    return AnalyzeResponse(results=results)
Four production-grade concerns are answered in that single file. The lifespan context manager loads the classifier once and pins it in ml_state. A throwaway warm-up call pays the lazy-init cost before the first real request lands. Inference runs in a worker thread via asyncio.to_thread so it does not block the event loop. And asyncio.wait_for enforces a wall-clock cap on how long a client waits for a slow batch or a wedged tokenizer call. One caveat: the timeout releases the request, not the thread. Python cannot interrupt a running worker thread, so a timed-out inference call still runs to completion in the background.
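The interaction between asyncio.wait_for and asyncio.to_thread is easy to misread, so here is a minimal stdlib-only sketch of it. The slow_inference function is a hypothetical stand-in for the blocking classifier call, and the sleep durations are arbitrary values chosen for the demonstration.

```python
import asyncio
import time


def slow_inference(seconds: float) -> str:
    # Hypothetical stand-in for a blocking classifier call.
    time.sleep(seconds)
    return "done"


async def call_with_timeout(seconds: float, timeout_s: float) -> str:
    try:
        return await asyncio.wait_for(
            asyncio.to_thread(slow_inference, seconds),
            timeout=timeout_s,
        )
    except asyncio.TimeoutError:
        # The awaiting coroutine is released here, but the worker thread
        # is not interrupted; it keeps running until slow_inference returns.
        return "timed out"


fast = asyncio.run(call_with_timeout(0.01, timeout_s=1.0))
slow = asyncio.run(call_with_timeout(0.3, timeout_s=0.05))
print(fast, "/", slow)
```

In the handler above, the same mechanics mean a 504 frees the client promptly while the thread pool slot stays occupied until the forward pass finishes, which is one reason to keep MAX_BATCH and max_length conservative.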
Run it locally to confirm the wiring:
pip install -r requirements.txt
uvicorn app:app --host 0.0.0.0 --port 8080
The first launch downloads the model weights into the Hugging Face cache, prints model ready, then serves. Hit it from a second terminal:
curl -s -X POST http://localhost:8080/analyze \
-H "content-type: application/json" \
-d '{"inputs": ["I loved it.", "Worst film of the year."]}'
You should see a JSON response with POSITIVE and NEGATIVE labels and probabilities approaching 1.0 on both inputs.
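For callers that prefer Python over curl, the request can be sketched with the standard library alone. The helper names build_request and analyze are illustrative and not part of the tutorial's app.py; the base URL assumes the local uvicorn process from this step.

```python
import json
import urllib.request
from typing import Dict, List


def build_request(base_url: str, texts: List[str]) -> urllib.request.Request:
    # Same payload shape the curl example sends: {"inputs": [...]}.
    body = json.dumps({"inputs": texts}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/analyze",
        data=body,
        headers={"content-type": "application/json"},
        method="POST",
    )


def analyze(base_url: str, texts: List[str], timeout_s: float = 15.0) -> List[Dict]:
    # Sends the request and returns the "results" list from the response.
    req = build_request(base_url, texts)
    with urllib.request.urlopen(req, timeout=timeout_s) as resp:
        return json.loads(resp.read())["results"]


# With the server running locally:
# for item in analyze("http://localhost:8080", ["I loved it.", "Worst film of the year."]):
#     print(item["label"], round(item["score"], 4))
```

The client timeout is set a little above the server's 10-second cap so that the server's 504 arrives before the client gives up on its own.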
Step 3: Batching strategy — why and when it helps
The pipeline reference is blunt on this: batch inference is disabled by default, and naive batching is not always faster 6 . On CPU, batching helps mainly when inputs are roughly the same length, because every sequence in a batch is padded to the longest one and that padding is wasted compute. On GPU, batching is usually a clear win because a single short input leaves most of the device idle, and larger batches keep more of its streaming multiprocessors busy.
The handler above passes batch_size=len(req.inputs) so a client that sends ten strings gets a single forward pass for all ten. The MAX_BATCH=32 cap on the Pydantic request schema prevents a single client from passing 5,000 strings and starving co-tenants.
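Because the schema caps each request at MAX_BATCH strings, FastAPI answers a longer list with a 422 validation error, so a client with more inputs has to split them itself. A minimal sketch of that client-side chunking follows; the chunked helper is hypothetical, and the constant simply mirrors the server-side value.

```python
from typing import Iterator, List

MAX_BATCH = 32  # mirrors the server-side cap in app.py


def chunked(texts: List[str], size: int = MAX_BATCH) -> Iterator[List[str]]:
    # Yield consecutive slices of at most `size` items each.
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(texts), size):
        yield texts[start:start + size]


batches = list(chunked([f"review {i}" for i in range(70)]))
print([len(b) for b in batches])  # [32, 32, 6]
```

Each chunk then becomes one POST /analyze call, and results can be concatenated in order because the endpoint returns one item per input.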
Two refinements once you have real traffic data:
- Dynamic padding to the longest input in the batch. The pipeline already pads each batch to its longest sequence when a list is passed; the truncation=True, max_length=512 arguments in the pipeline initialiser keep one pathological 50,000-token input from blowing up memory.
- Server-side micro-batching across requests. Two requests that arrive within 5–10 ms of each other can be combined into one model call. The Roman Parykin write-up on Hugging Face pipeline inference optimisation walks through the queue-and-drain pattern; the cost is a few milliseconds of added latency for the requests that arrive first in a window.
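The queue-and-drain idea behind micro-batching can be sketched with asyncio alone. This is an illustrative outline under stated assumptions, not the implementation from the write-up mentioned above: MicroBatcher, its parameters, and fake_classifier are all hypothetical names, and the stub stands in for the real pipeline call.

```python
import asyncio
from typing import Callable, Dict, List

BatchFn = Callable[[List[str]], List[Dict]]


class MicroBatcher:
    """Collects requests arriving within a short window into one model call."""

    def __init__(self, run_batch: BatchFn, max_batch: int = 32, window_s: float = 0.01):
        self._run_batch = run_batch
        self._max_batch = max_batch
        self._window_s = window_s
        self._queue: asyncio.Queue = asyncio.Queue()

    async def submit(self, text: str) -> Dict:
        # Enqueue one input and wait for its individual result.
        fut = asyncio.get_running_loop().create_future()
        await self._queue.put((text, fut))
        return await fut

    async def run(self) -> None:
        loop = asyncio.get_running_loop()
        while True:
            # Block until the first item arrives, then open the window.
            items = [await self._queue.get()]
            deadline = loop.time() + self._window_s
            while len(items) < self._max_batch:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    items.append(await asyncio.wait_for(self._queue.get(), remaining))
                except asyncio.TimeoutError:
                    break
            texts = [text for text, _ in items]
            try:
                # One blocking model call for the whole window, off the event loop.
                results = await asyncio.to_thread(self._run_batch, texts)
            except Exception as exc:
                for _, fut in items:
                    if not fut.done():
                        fut.set_exception(exc)
                continue
            for (_, fut), result in zip(items, results):
                if not fut.done():
                    fut.set_result(result)


calls: List[int] = []


def fake_classifier(texts: List[str]) -> List[Dict]:
    # Stub model: records the batch size it was called with.
    calls.append(len(texts))
    return [{"label": "POSITIVE", "score": 1.0} for _ in texts]


async def demo() -> List[Dict]:
    batcher = MicroBatcher(fake_classifier, window_s=0.05)
    worker = asyncio.create_task(batcher.run())
    results = await asyncio.gather(*(batcher.submit(f"text {i}") for i in range(5)))
    worker.cancel()
    return results


results = asyncio.run(demo())
print(len(results), "results from", len(calls), "model call(s)")
```

In a real service the run() task would be started inside the lifespan context manager and cancelled after the yield; the window length is a tuning knob that trades added latency for larger batches.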
The Transformers docs and the Databricks NLP-inference reference land on the same recommendation: measure on your hardware before committing to a batch size 6 . The right number depends on input length distribution, hardware, and model size; there is no universal default.
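One way to act on "measure on your hardware" is a small timing harness that takes any batch callable. The sketch below is a rough starting point rather than a rigorous benchmark; the benchmark function and the fake_model stub are hypothetical, and the stub only simulates a fixed per-call cost.

```python
import statistics
import time
from typing import Callable, Dict, List, Sequence


def benchmark(
    run_batch: Callable[[List[str]], object],
    texts: Sequence[str],
    batch_sizes: Sequence[int],
    repeats: int = 5,
) -> Dict[int, float]:
    """Return median inputs-per-second for each candidate batch size."""
    throughput: Dict[int, float] = {}
    for size in batch_sizes:
        timings = []
        for _ in range(repeats):
            start = time.perf_counter()
            for i in range(0, len(texts), size):
                run_batch(list(texts[i:i + size]))
            timings.append(time.perf_counter() - start)
        # Guard against a zero median on very coarse clocks.
        throughput[size] = len(texts) / max(statistics.median(timings), 1e-9)
    return throughput


def fake_model(batch: List[str]) -> None:
    # Stub with a fixed per-call cost, standing in for the classifier.
    time.sleep(0.001)


report = benchmark(fake_model, ["sample text"] * 16, batch_sizes=[1, 4, 16], repeats=3)
for size, rate in report.items():
    print(f"batch_size={size}: {rate:.0f} inputs/s")
```

To measure the real model, pass something like lambda batch: classifier(batch, batch_size=len(batch)) as run_batch and feed it texts sampled from production-like traffic, since input length distribution drives the result.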
Step 4: GPU vs CPU — the latency tradeoff
For the SST-2 DistilBERT model specifically, the practical numbers fall in a known band on modern hardware:
- CPU (modern x86, 4–8 vCPU): single-input latency in the tens of milliseconds; throughput roughly 50–200 inputs per second depending on input length. The model is small enough that CPU is genuinely usable in production for low-QPS workloads.
- GPU (entry-level T4 / L4 / A10): single-input latency in the low single-digit milliseconds; throughput an order of magnitude higher on batched calls. The cost is the GPU instance itself, which on most clouds runs 5–10x the price of a CPU machine.
To switch the code to GPU you change device=-1 to device=0 (first CUDA device) in the pipeline call, and your Docker base image to a CUDA-enabled one. For a binary-sentiment classifier serving fewer than ~50 requests per second, the source consensus across the Transformers pipeline tutorial and the inference-optimisation write-ups treats CPU as the default and reaches for GPU only once measured CPU latency stops fitting the request SLO. Fly.io publishes GPU machine types separately from the standard compute tier; Fly’s pricing page is the canonical reference for current hourly rates 7 .
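Rather than editing the device argument by hand for each environment, the choice can be made at startup. The pick_device helper below is a hypothetical convenience, assuming the same -1 / 0 convention the pipeline call in app.py already uses; it falls back to CPU when PyTorch is missing or no CUDA device is visible.

```python
def pick_device() -> int:
    """Return 0 when a CUDA GPU is visible to PyTorch, otherwise -1 (CPU)."""
    try:
        import torch
    except ImportError:
        # No PyTorch installed in this environment: report CPU.
        return -1
    return 0 if torch.cuda.is_available() else -1


DEVICE = pick_device()
print("pipeline device:", DEVICE)
```

Passing device=pick_device() to pipeline(...) would then let one image run on both CPU and GPU machines, provided the installed torch build includes CUDA support.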
Step 5: Dockerize
Create Dockerfile:
FROM python:3.12-slim
ENV PYTHONUNBUFFERED=1 \
PIP_NO_CACHE_DIR=1 \
HF_HOME=/app/.hf-cache \
TRANSFORMERS_NO_ADVISORY_WARNINGS=1
WORKDIR /app
# System deps for tokenizers (Rust) and torch
RUN apt-get update && apt-get install -y --no-install-recommends \
build-essential ca-certificates \
&& rm -rf /var/lib/apt/lists/*
COPY requirements.txt .
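# --extra-index-url lets pip pick the CPU-only torch wheel rather than the
# larger CUDA build that PyPI serves by default on Linux.
RUN pip install --upgrade pip && pip install -r requirements.txt --extra-index-url https://download.pytorch.org/whl/cpu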
# Pre-download the model into the image. This shifts the weight
# download from first-request time (slow, network-dependent) to
# build time (cached in the image layer).
RUN python -c "from transformers import pipeline; \
pipeline('sentiment-analysis', \
model='distilbert/distilbert-base-uncased-finetuned-sst-2-english')"
COPY app.py .
EXPOSE 8080
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8080"]
Two things worth calling out. First, pre-downloading the model into the image at build time means containers start ready, not “ready in 30 seconds once weights finish streaming.” Second, python:3.12-slim keeps the image around 1.2–1.5 GB compressed; a full python:3.12 base would land closer to 2 GB without buying much.
Build and run locally:
docker build -t sentiment-api .
docker run --rm -p 8080:8080 sentiment-api
The same curl from Step 2 should now hit the containerised version.
Step 6: Deploy to Fly.io
Fly’s fly launch walkthrough is the canonical reference for first-time deploy; it scans the working directory, detects the Dockerfile, generates a fly.toml, and provisions a machine 5 .
fly launch --no-deploy
Answer the prompts: pick an app name, pick a region near your users, decline the Postgres / Redis offers (the API is stateless), accept the Dockerfile when prompted. Fly writes a fly.toml. Open it and bump the machine size — the default 256 MB is too small to hold DistilBERT weights in memory. Edit the [[vm]] block:
[[vm]]
size = "shared-cpu-2x"
memory = "2gb"
DistilBERT plus the Python runtime needs roughly 800 MB to 1 GB resident; 2 GB gives headroom for batching overhead and the OS file cache.
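The sizing argument can be written down as back-of-the-envelope arithmetic. In this sketch the parameter count comes from the model card and the 1000 MB per-process figure is the upper end of the range quoted above; the 20% headroom factor and the fits helper are assumptions made for illustration.

```python
PARAMS = 67_000_000      # parameter count from the model card
BYTES_PER_PARAM = 4      # float32 weights

weights_mb = PARAMS * BYTES_PER_PARAM / 1_000_000


def fits(
    vm_memory_mb: float,
    workers: int = 1,
    resident_mb_per_worker: float = 1000.0,  # upper end of the 800 MB to 1 GB range
    headroom: float = 0.2,                   # assumed safety margin
) -> bool:
    # Each worker process holds its own copy of the model.
    needed = workers * resident_mb_per_worker * (1 + headroom)
    return needed <= vm_memory_mb


print(f"raw weights: {weights_mb:.0f} MB")
print("256 MB, 1 worker:", fits(256))
print("2 GB, 1 worker:", fits(2048))
print("2 GB, 2 workers:", fits(2048, workers=2))
```

The raw weights are only part of the resident set; the Python runtime, PyTorch itself, and activation memory for batches account for the rest, which is why the measured figure is the one to size against.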
Adjust the HTTP service block to point at port 8080 and the /healthz endpoint:
[http_service]
internal_port = 8080
force_https = true
auto_stop_machines = "stop"
auto_start_machines = true
min_machines_running = 1
[[http_service.checks]]
grace_period = "30s"
interval = "15s"
method = "GET"
timeout = "5s"
path = "/healthz"
The 30-second grace_period matters. Without it, Fly’s health-checker hits /healthz while the model is still loading on a cold start, marks the machine unhealthy, and restarts it in a loop. The grace gives the lifespan startup time to complete.
Deploy:
fly deploy
Fly builds the image (locally or in the remote builder, depending on your config), pushes it, and rolls a machine. Per Fly’s fly deploy reference, the command handles image build + machine refresh in a single invocation 3 .
When the deploy finishes, Fly prints a hostname of the shape https://<app-name>.fly.dev. Hit it:
curl -s -X POST https://<app-name>.fly.dev/analyze \
-H "content-type: application/json" \
-d '{"inputs": ["Surprisingly good.", "Hated every minute."]}'
You should see the same JSON shape as the local response.
Step 7: Production hardening checklist
The endpoint works; “works” is not “production.” A short list of the gaps this tutorial does not yet close, in roughly the order most deployments hit them:
- Authentication. The /analyze endpoint is publicly callable. Add a header-based API key, OAuth, or put Cloudflare Access in front. Anything internet-reachable that runs ML inference will get scraped.
- Rate limiting per client. Per-IP token-bucket or Redis-backed sliding-window. The current setup caps per-request batch size but not per-client request rate.
- Structured logging plus a request ID. Wire loguru or structlog, emit one JSON line per request with latency, batch size, status code. Fly’s log pipeline accepts JSON.
- Metrics. Prometheus scrape endpoint at /metrics via prometheus-fastapi-instrumentator, or push to whichever observability vendor you already use. The two metrics worth alerting on are p99 /analyze latency and model_ready=false after the lifespan startup window.
- Concurrency limit. Uvicorn defaults to one worker. For CPU-bound inference, one worker per available core is the rough starting point; add --workers 2 (or whatever fits your machine size). Each worker loads its own copy of the model, so memory cost scales linearly.
- Graceful shutdown. The code after the lifespan yield runs on shutdown; if you swap to a tokenizer or model that needs explicit cleanup, that’s where it goes.
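As an illustration of the rate-limiting item, here is a minimal in-memory token bucket. It is a sketch under stated assumptions: the TokenBucket class is hypothetical, state lives in one process only, and a deployment with several workers or machines would need shared storage such as the Redis-backed approach mentioned in the list.

```python
import time
from typing import Callable


class TokenBucket:
    """Allows bursts up to `capacity`, refilled at `rate_per_s` tokens per second."""

    def __init__(
        self,
        rate_per_s: float,
        capacity: float,
        clock: Callable[[], float] = time.monotonic,
    ):
        self.rate = rate_per_s
        self.capacity = capacity
        self.tokens = capacity
        self.clock = clock
        self.updated = clock()

    def allow(self, cost: float = 1.0) -> bool:
        # Refill according to elapsed time, capped at capacity.
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# Deterministic demonstration with a fake clock.
now = [0.0]
bucket = TokenBucket(rate_per_s=1.0, capacity=2.0, clock=lambda: now[0])
burst = [bucket.allow() for _ in range(3)]
now[0] = 1.0  # one second later, one token has been refilled
after_refill = bucket.allow()
print(burst, after_refill)  # [True, True, False] True
```

In the API, one bucket per client key (an API key or IP address) would be kept in a dictionary, and a request whose allow() returns False would get an HTTP 429 response. Passing cost=len(req.inputs) is one way to charge by batch size rather than by request.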
Troubleshooting
A few failure modes that recur on first deploys, with the fix each time:
- “Killed” during fly deploy build, or container OOMs at startup. Memory undersized. Bump the [[vm]] block to 2 GB minimum.
- Health check fails for the first 60 seconds, machine restarts in a loop. grace_period too short. Bump to 60s if the image is downloading model weights at startup (only happens if you skipped the pre-download RUN python -c ... step in the Dockerfile).
- Request returns 504 after exactly 10 seconds. The REQUEST_TIMEOUT_S = 10.0 cap fired. Either inputs are too long, batch is too large, or CPU is genuinely overloaded. Profile before raising the cap.
- First request after a long idle returns 503. Fly’s auto_stop_machines = "stop" setting stopped the machine. The next request triggers a cold start; the lifespan startup runs again. Set min_machines_running = 1 to keep at least one warm, at the cost of paying for that machine continuously.
- fly launch does not detect the Dockerfile. Run from the directory containing the Dockerfile, not a parent. Fly’s launch documentation walks through the auto-detect logic 8 .
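On the client side, transient 503 and 504 responses around a cold start are usually handled with a short retry loop. The sketch below shows exponential backoff in isolation; RetryableError and call_with_backoff are hypothetical names, and the caller is assumed to translate retry-worthy HTTP status codes into RetryableError.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


class RetryableError(Exception):
    """Hypothetical marker for failures worth retrying (e.g. HTTP 503)."""


def call_with_backoff(
    fn: Callable[[], T],
    attempts: int = 4,
    base_delay_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fn()
        except RetryableError:
            if attempt == attempts - 1:
                raise  # out of retries; surface the last failure
            sleep(base_delay_s * 2 ** attempt)  # 0.5s, 1s, 2s, ...
    raise AssertionError("unreachable")


# Demonstration: two simulated cold-start failures, then success.
outcomes = iter([RetryableError("503"), RetryableError("503"), "ok"])
delays: List[float] = []


def flaky() -> str:
    item = next(outcomes)
    if isinstance(item, Exception):
        raise item
    return item


result = call_with_backoff(flaky, sleep=delays.append)
print(result, delays)  # ok [0.5, 1.0]
```

Retrying a 504 is only sensible when the request is safe to repeat, which holds for /analyze because inference has no side effects.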
What you’ve shipped
A FastAPI app that loads DistilBERT once at startup via the lifespan context manager, exposes POST /analyze with batched inference, enforces a 10-second per-request timeout, packages cleanly into a 1.2–1.5 GB Docker image with model weights baked in, and deploys to Fly.io with a /healthz probe that gates traffic until the model is genuinely ready.
The model itself is the same DistilBERT SST-2 checkpoint that has anchored the Hugging Face sentiment-analysis quickstart for years; the production wrapper around it is what turns a notebook into something a teammate can curl from a CI job. Swap the MODEL_ID constant for any other Transformers sequence-classification model and the rest of the pipeline carries over unchanged — that is the payoff of the lifespan pattern.
How this article was made: an autonomous AI pipeline researched, drafted, fact-checked, and reviewed this piece, aggregating publicly-available information from the sources consulted below. AI (artificial intelligence) can make mistakes, so please cross-check the consulted sources before acting on anything here. Neural Tech Daily is not liable for decisions or outcomes based on this article.
Sources consulted
Cited Sources
- 1. Transformers pipeline tutorial — default device, batch_size argument, and inference behaviour
- 2. FastAPI advanced — Lifespan Events documentation, machine-learning model use-case
- 3. Fly.io deploy-with-a-Dockerfile reference — fly deploy build-and-refresh flow
- 4. Hugging Face model card — distilbert-base-uncased-finetuned-sst-2-english, 67M params, Apache-2.0, 91.3% SST-2 dev-set accuracy
- 5. Fly.io Quickstart — flyctl install, fly auth signup, fly launch first-deploy flow
- 6. Transformers Pipelines class reference — batch inference disabled by default, tuning guidance
- 7. Fly.io pricing — shared-cpu, dedicated, and GPU machine tier hourly rates
- 8. Fly.io flyctl launch reference — working-directory scan and Dockerfile auto-detect
Further Reading
- FastAPI documentation
- DistilBERT paper (Sanh et al., arXiv:1910.01108)