FastAPI · micro-batched · observable · runs offline

Read the feeling behind any sentence

A fine-tuned DistilBERT classifier, wrapped in a typed, batched, monitored API. Type a line and watch the live POST /predict response paint the full six-emotion spectrum — served right now by the offline stub.

Your sentence

Sent to POST /predict and classified into six emotions with full probabilities.

The widget is just one HTTP call

Everything above is POST /predict. Call it from anything — single sentence or a batch.
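A minimal sketch of that call using only the Python standard library. The JSON field names ("text" for one sentence, "texts" for a batch), the base URL http://localhost:8000, and both helper names are assumptions for illustration; the real request schema is whatever the service's /docs page reports.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: local dev server address


def build_predict_request(sentences, base_url=BASE_URL):
    """Build a POST /predict request for one sentence or a batch.

    The field names "text" and "texts" are hypothetical; check /docs
    for the actual request schema.
    """
    if isinstance(sentences, str):
        payload = {"text": sentences}
    else:
        payload = {"texts": list(sentences)}
    return urllib.request.Request(
        url=f"{base_url}/predict",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(sentences, base_url=BASE_URL, timeout=5.0):
    """Send the request and return the decoded JSON response."""
    request = build_predict_request(sentences, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the service running, predict("I can't wait for the weekend") would send one sentence and predict(["first line", "second line"]) would send a batch through the same endpoint.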

Built for throughput, measured under load

A dynamic micro-batcher coalesces concurrent requests into a single forward pass. The numbers below come from the included scripts/loadtest.py, run against the offline stub with a single worker on an Apple-silicon laptop, and are reproducible from a clean checkout.

Peak throughput

604 req/s

at 16 concurrent clients

p50 latency

8.3 ms

serial round-trip, full stack

Throughput scaling

~5×

serial → 8 concurrent (batcher)

Errors

across all benchmarked runs

Concurrency	Throughput (req/s)	p50 (ms)	p95 (ms)	p99 (ms)
1	118	8.27	8.87	13.12
8	595	13.35	16.49	19.97
16	604	19.03	67.49	107.39

These figures reflect the offline stub plus the full HTTP, validation, and batching overhead. Reproduce with make bench.
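The ~5× scaling figure can be checked directly against the table. A small sketch; the function and variable names are illustrative, and the numbers are copied from the rows above.

```python
# Throughput (req/s) by concurrency, copied from the benchmark table.
throughput = {1: 118, 8: 595, 16: 604}


def scaling(concurrency, baseline=1):
    """Throughput at `concurrency` relative to the serial baseline."""
    return throughput[concurrency] / throughput[baseline]


for clients in (8, 16):
    print(f"{clients} clients: {scaling(clients):.2f}x serial throughput")
# 8 clients: 5.04x serial throughput
# 16 clients: 5.12x serial throughput
```

Going from 8 to 16 clients adds under 2% throughput while p95 rises from 16.49 ms to 67.49 ms, which suggests the single worker saturates near 600 req/s.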

The production layer, not just a model

Everything you need to actually ship inference — typed contracts, health probes, metrics a dashboard can read, and a zero-download offline mode for CI and demos.

Health probes

GET /healthz serves as both readiness and liveness probe; it returns 503 until the model is loaded and the batcher is running. Wired to a Docker HEALTHCHECK.
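The probe logic can be sketched without any framework. ServiceState, its field names, and the response body shape are hypothetical; in a FastAPI app the same check would presumably live in the route handler and set the response status code.

```python
from dataclasses import dataclass


@dataclass
class ServiceState:
    """Hypothetical flags the service would flip during startup."""
    model_loaded: bool = False
    batcher_running: bool = False


def healthz(state):
    """Return (status_code, body) for a GET /healthz probe.

    503 until both the model and the batcher are up, then 200.
    """
    ready = state.model_loaded and state.batcher_running
    body = {
        "status": "ok" if ready else "starting",
        "model_loaded": state.model_loaded,
        "batcher_running": state.batcher_running,
    }
    return (200 if ready else 503), body
```

Reporting each flag in the body makes a failing probe easier to diagnose from container logs.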

Prometheus metrics

GET /metrics — request rate, latency histogram, in-flight gauge, errors, plus model inference latency and batch-size histograms.

Typed OpenAPI contract

Pydantic v2 validates every request; single and batch inputs share one endpoint. Interactive Swagger UI at /docs.

Dynamic micro-batching

Concurrent single requests are coalesced into one forward pass with a latency cap you control — ~5× throughput under load.
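One way such a batcher could work, sketched with asyncio. This is not the project's implementation: the MicroBatcher class, its max_batch_size and max_wait_ms parameters, and the predict_batch callable are all assumptions. The worker waits for a first request, keeps collecting until the batch is full or the latency cap expires, then runs one batched call and hands each caller its own result.

```python
import asyncio


class MicroBatcher:
    """Coalesce concurrent submit() calls into batched model calls."""

    def __init__(self, predict_batch, max_batch_size=32, max_wait_ms=5.0):
        self._predict_batch = predict_batch    # callable: list[str] -> list
        self._max_batch_size = max_batch_size
        self._max_wait = max_wait_ms / 1000.0  # latency cap, in seconds
        self._queue = asyncio.Queue()

    async def submit(self, text):
        """Enqueue one input and wait for its result."""
        future = asyncio.get_running_loop().create_future()
        await self._queue.put((text, future))
        return await future

    async def run(self):
        """Worker loop: gather a batch, run one forward pass, fan out."""
        loop = asyncio.get_running_loop()
        while True:
            batch = [await self._queue.get()]  # block until the first item
            deadline = loop.time() + self._max_wait
            while len(batch) < self._max_batch_size:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    item = await asyncio.wait_for(self._queue.get(), remaining)
                except asyncio.TimeoutError:
                    break  # latency cap reached; ship what we have
                batch.append(item)
            texts = [text for text, _ in batch]
            try:
                results = self._predict_batch(texts)
            except Exception as exc:
                # Propagate the failure to every caller in this batch.
                for _, future in batch:
                    if not future.done():
                        future.set_exception(exc)
                continue
            for (_, future), result in zip(batch, results):
                if not future.done():
                    future.set_result(result)
```

The sketch calls predict_batch synchronously, which blocks the event loop for the duration of the forward pass; a real model call would likely be offloaded, for example with loop.run_in_executor. Create the batcher inside the running event loop and start run() as a background task at application startup.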

Container + monitoring

Multi-stage non-root Docker image; docker compose up brings up the API with Prometheus and a provisioned Grafana dashboard.

Offline-first by default

A deterministic lexicon stub backs the API when OFFLINE=1 — the whole service, demo, tests, and load test run with zero downloads. Flip to OFFLINE=0 for the real model.
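A deterministic lexicon stub of this kind might look like the sketch below. The six label names, the tiny lexicon, and both function names are assumptions made for illustration; only the OFFLINE variable comes from the description above, and treating an unset OFFLINE as offline is a guess based on the "offline-first by default" wording.

```python
import os

# Assumption: label set and lexicon are illustrative, not the project's own.
EMOTIONS = ("sadness", "joy", "love", "anger", "fear", "surprise")
LEXICON = {
    "sad": "sadness", "cry": "sadness",
    "happy": "joy", "glad": "joy",
    "love": "love", "adore": "love",
    "angry": "anger", "furious": "anger",
    "afraid": "fear", "scared": "fear",
    "wow": "surprise", "unexpected": "surprise",
}


def stub_predict(text):
    """Score a sentence by keyword counts, with add-one smoothing.

    Pure function of the input, so the same text always yields the
    same probabilities and no model weights are needed.
    """
    counts = {emotion: 1.0 for emotion in EMOTIONS}  # smoothing prior
    for raw in text.lower().split():
        word = raw.strip(".,!?;:\"'")
        if word in LEXICON:
            counts[LEXICON[word]] += 1.0
    total = sum(counts.values())
    return {emotion: count / total for emotion, count in counts.items()}


def use_offline_stub(environ=os.environ):
    """True when OFFLINE=1. Defaulting to offline when unset is an assumption."""
    return environ.get("OFFLINE", "1") == "1"
```

Because the stub returns a full probability distribution over all six labels, the response shape stays the same whichever backend serves the request.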