Neural Tech Daily

Build a Production Vector Search with Qdrant + Python in 60 Minutes: A Tutorial

A 60-minute Qdrant tutorial for dev teams: Docker install, sentence-transformers embeddings, hybrid search with metadata filters, and a FastAPI endpoint.

Image: qdrant.tech documentation quickstart page, used for editorial coverage of the tool walked through in this tutorial.

What you’ll need

This tutorial builds a complete vector-search pipeline in roughly 60 minutes of working time, including the Docker pull. The end state is a Python FastAPI endpoint that accepts a query string, embeds it with sentence-transformers, retrieves the top-k most similar documents from Qdrant with a metadata filter applied, and returns JSON. That stack covers the working surface of most retrieval-augmented-generation (RAG) features dev teams ship in 2026: a chatbot that searches your help-centre articles, an internal knowledge tool that searches Confluence dumps, a product-search backend that ranks beyond keyword match.

Qdrant is an open-source vector database written in Rust, distributed under the Apache 2.0 licence on GitHub. It sits in the middle slot between ChromaDB (in-process, simpler, suited to smaller workloads) and Pinecone (managed-only, paid). Qdrant gives a self-host path for cost-conscious teams and a managed cloud tier when ops time is the constraint. The Python client is on PyPI as qdrant-client [1].

The list of what’s on your machine before starting:

  • Python 3.10 or later
  • Docker Desktop (or Docker Engine on Linux), OR a free Qdrant Cloud account
  • Roughly 4 GB of free RAM if running Qdrant locally
  • Pip or uv, your choice of package manager
  • A code editor

Time required

About 60 minutes end-to-end. Step 1 to Step 4 take ~25 minutes including the Docker pull. Step 5 to Step 7 take ~30 minutes. Common pitfalls and tuning take the remaining ~5 minutes.

Steps

1. Install qdrant-client and sentence-transformers

Create a fresh virtual environment and install the two libraries the rest of this tutorial leans on. The Python client speaks both REST and gRPC to the Qdrant server. Sentence-transformers handles the embedding step locally so the tutorial doesn’t depend on any paid embedding API.

python -m venv .venv
source .venv/bin/activate # on Windows: .venv\Scripts\activate

pip install qdrant-client sentence-transformers fastapi uvicorn

The first install pulls a few hundred megabytes because sentence-transformers brings PyTorch with it. Expect a few minutes on a typical home broadband line; longer on metered connections.
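Before moving on, it can help to confirm that the install actually landed in the active virtual environment. The sketch below uses only the standard library; the helper name installed_versions is an assumption made for illustration, not part of any of the installed packages.

```python
from importlib import metadata

# Distribution names as installed with pip in this step.
REQUIRED = ["qdrant-client", "sentence-transformers", "fastapi", "uvicorn"]


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(REQUIRED).items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
```

Any line that prints NOT INSTALLED usually means pip ran outside the virtual environment.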

2. Start Qdrant (Docker option OR Qdrant Cloud option)

Two paths here. Pick the one that matches your situation.

Docker (local, free): runs Qdrant on localhost:6333 for the REST API and localhost:6334 for gRPC [2]. The -v flag bind-mounts a qdrant_storage directory from the current working directory, so the data lives on the host and a docker stop doesn’t wipe your collections.

docker pull qdrant/qdrant
docker run -p 6333:6333 -p 6334:6334 \
 -v "$(pwd)/qdrant_storage:/qdrant/storage:z" \
 qdrant/qdrant

Open http://localhost:6333/dashboard to confirm the server is up. The dashboard ships with the Docker image and gives a visual collections view alongside the REST API.
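For a scripted check instead of a browser visit, a small probe along these lines may be enough. It is a sketch that assumes the server exposes a /healthz endpoint on the REST port, as recent Qdrant releases do; the function name qdrant_is_up is hypothetical.

```python
import http.client
import urllib.error
import urllib.request


def qdrant_is_up(base_url="http://localhost:6333", timeout=2.0):
    """Return True if the server answers its health endpoint with HTTP 200."""
    try:
        # Assumption: /healthz is available on the REST port of this Qdrant version.
        with urllib.request.urlopen(f"{base_url}/healthz", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError, http.client.HTTPException):
        # Connection refused, timeout, or a non-2xx answer all count as "not up".
        return False


if __name__ == "__main__":
    print("Qdrant reachable:", qdrant_is_up())
```

A False result right after docker run often just means the container is still starting; retry after a second or two.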

Image: github.com/qdrant/qdrant repository page, used for editorial coverage of the open-source project this tutorial walks through.

Qdrant Cloud (managed): sign up at cloud.qdrant.io, create a free cluster (1 GB RAM, 4 GB disk, 0.5 vCPU on the free tier as of 2026-05-04 per the pricing page [3]; Qdrant Cloud pricing changes, so verify the current tier before deploying), and copy the cluster URL plus the API key. The rest of the tutorial works the same way; only the connection string changes.

from qdrant_client import QdrantClient

# local Docker
client = QdrantClient(url="http://localhost:6333")

# Qdrant Cloud
# client = QdrantClient(
#     url="https://YOUR-CLUSTER-ID.qdrant.io:6333",
#     api_key="YOUR_API_KEY",
# )
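Rather than commenting connection lines in and out, one option is to read the settings from the environment so the same script runs against Docker or Qdrant Cloud. This is a sketch: the variable names QDRANT_URL and QDRANT_API_KEY are assumptions chosen for this example, not names the client reads on its own.

```python
import os


def qdrant_connection_kwargs(env=None):
    """Build QdrantClient keyword arguments from environment variables.

    QDRANT_URL and QDRANT_API_KEY are hypothetical names used by this sketch.
    """
    env = os.environ if env is None else env
    kwargs = {"url": env.get("QDRANT_URL", "http://localhost:6333")}
    api_key = env.get("QDRANT_API_KEY")
    if api_key:  # a local, unsecured Docker instance needs no key
        kwargs["api_key"] = api_key
    return kwargs


# Usage (requires qdrant-client):
# from qdrant_client import QdrantClient
# client = QdrantClient(**qdrant_connection_kwargs())
```

Keeping the API key out of the source file also keeps it out of version control.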

3. Create a collection with the right vector size and distance metric

A collection in Qdrant is a named bucket of vectors plus payloads. Two parameters define it: vector size (the embedding dimension) and distance metric. A wrong vector size surfaces as an error at upsert time; a wrong distance metric fails silently, with searches that run but rank poorly. A Cosine vs Dot Product mismatch is one of the recurring causes of “my search returns garbage” reports surfaced across Qdrant’s GitHub Discussions and community Discord threads.

The embedding model used in this tutorial is sentence-transformers/all-MiniLM-L6-v2, which produces 384-dimensional vectors [4]. The model card describes fine-tuning with a cosine-similarity objective, so Cosine is the natural distance for semantic-similarity tasks, which is what RAG retrieval is.

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

client = QdrantClient(url="http://localhost:6333")

if client.collection_exists("docs"):
    client.delete_collection("docs")

client.create_collection(
    collection_name="docs",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)

The older recreate_collection method is deprecated; the explicit delete_collection + create_collection pattern above is the upstream-recommended replacement. The supported distance metrics are Cosine, Dot, Euclid, and Manhattan, per Qdrant’s collection-configuration docs. Match the metric your embedding model was trained for. If the model card doesn’t say, Cosine is the safe default for sentence-transformers models.
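To make the four metrics concrete, the sketch below computes each one by hand for two toy 2-dimensional vectors. It illustrates the formulas only and is not Qdrant's implementation. Cosine and Dot are similarities (higher means closer), while Euclid and Manhattan are distances (lower means closer).

```python
import math


def dot(a, b):
    """Dot product: grows with both alignment and vector length."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity: alignment only, independent of vector length."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def euclid(a, b):
    """Euclidean (straight-line) distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Manhattan distance: sum of absolute per-dimension differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


a = [3.0, 4.0]
b = [4.0, 3.0]

if __name__ == "__main__":
    print(f"dot={dot(a, b):.2f} cosine={cosine(a, b):.2f} "
          f"euclid={euclid(a, b):.3f} manhattan={manhattan(a, b):.2f}")
```

For these vectors the dot product is 24 and the cosine similarity is 0.96, which shows how the two diverge as soon as vectors are longer than unit length.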

Image: Qdrant collections concepts documentation (qdrant.tech/documentation/manage-data/collections), used for editorial coverage of the distance-metric and vector-size parameters covered in this step.

4. Embed 100 documents with sentence-transformers

The embedding step turns each document into a 384-dimensional vector. Sentence-transformers handles batching, normalisation, and the underlying transformer call. The first run downloads the model weights (~90 MB for all-MiniLM-L6-v2); subsequent runs read from the local cache.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

documents = [
    {"text": "Qdrant is a vector database written in Rust.", "category": "intro", "source": "docs"},
    {"text": "FastAPI is a Python web framework for APIs.", "category": "intro", "source": "docs"},
    # ... 98 more
]

texts = [d["text"] for d in documents]
embeddings = model.encode(texts, batch_size=32, show_progress_bar=True)

print(embeddings.shape) # (100, 384)

On a 2023-era M-series Mac or a mid-range Intel laptop, embedding 100 short documents typically completes in seconds rather than minutes. Batch size 32 is a reasonable default; raise it if you have GPU memory to spare, lower it if 32 hits an out-of-memory error on CPU.

5. Insert into Qdrant with metadata payloads

Each point in Qdrant carries an ID, a vector, and a payload (a JSON-shaped metadata blob). The payload is what you filter on later: category, source, timestamp, tenant ID, anything you want to scope a search to.

from qdrant_client.models import PointStruct
import uuid

points = [
    PointStruct(
        id=str(uuid.uuid4()),
        vector=embedding.tolist(),
        payload={
            "text": doc["text"],
            "category": doc["category"],
            "source": doc["source"],
        },
    )
    for doc, embedding in zip(documents, embeddings)
]

client.upsert(collection_name="docs", points=points)

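upsert inserts new points and replaces existing ones with the same ID, so repeating a call with the same IDs won’t duplicate anything. The listing above generates a fresh uuid.uuid4() per point, though, so re-running the script adds a second copy of every document; derive IDs deterministically from the content if you need idempotent re-ingests. For larger ingests (10,000+ documents), chunk the upsert call into batches of 100-500 points so a single request doesn’t time out.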
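Because the listing above draws a random uuid.uuid4() for every point, running it twice stores each document twice. One way to make re-ingests idempotent is to derive the ID from the content with uuid.uuid5. The sketch below is an assumption-laden example: the namespace URL and the helper name stable_point_id are hypothetical.

```python
import uuid

# Hypothetical namespace for this sketch; any fixed UUID works as long as it never changes.
DOC_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.invalid/docs")


def stable_point_id(source, text):
    """Derive the same UUID string every time for the same (source, text) pair."""
    return str(uuid.uuid5(DOC_NAMESPACE, f"{source}\n{text}"))


if __name__ == "__main__":
    first = stable_point_id("docs", "Qdrant is a vector database written in Rust.")
    second = stable_point_id("docs", "Qdrant is a vector database written in Rust.")
    print(first == second)  # same input, same ID
```

Passing stable_point_id(doc["source"], doc["text"]) as the point id means a second upsert of the same document overwrites the first instead of adding a copy. The trade-off is that editing a document's text changes its ID, so stale versions need deleting separately.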

6. Query with a vector plus a metadata filter

This is the step that earns Qdrant its production-ready label. A pure vector search returns top-k by similarity. A filtered search, which this tutorial loosely calls hybrid search, adds a payload filter (for example, “top-5 most similar documents from the ‘intro’ category only”). Qdrant integrates the filter into the search itself, evaluating it during HNSW graph traversal at query time rather than as a post-filter applied to the similarity-ranked list, per the filtering documentation [5]. Qdrant’s own documentation generally reserves “hybrid” for combined sparse-plus-dense retrieval, which comes up under Where to go next.

from qdrant_client.models import Filter, FieldCondition, MatchValue

query = "What language is Qdrant built in?"
query_vector = model.encode(query).tolist()

results = client.query_points(
    collection_name="docs",
    query=query_vector,
    query_filter=Filter(
        must=[
            FieldCondition(
                key="category",
                match=MatchValue(value="intro"),
            ),
        ],
    ),
    limit=5,
)

for r in results.points:
    print(f"{r.score:.3f} {r.payload['text']}")

query_points is the unified Qdrant query API in current versions; the older search method has been deprecated in recent qdrant-client releases. The response is a QueryResponse object with a .points attribute, which is why the iteration above uses results.points rather than iterating results directly. The score is the cosine similarity between the query vector and the matched document vector; higher means closer.
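The practical difference between filtering during the search and post-filtering a ranked list shows up in how many results come back. The toy sketch below uses brute force over four hand-made points with dot-product scoring. It only illustrates the result semantics and says nothing about how Qdrant implements filtering internally; all names and data are invented for the example.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def filtered_top_k(points, query, category, k):
    """Restrict to the category first, then rank: returns up to k matching points."""
    candidates = [p for p in points if p["category"] == category]
    return sorted(candidates, key=lambda p: dot(p["vector"], query), reverse=True)[:k]


def post_filtered_top_k(points, query, category, k):
    """Rank everything, keep the top k, then filter: may return fewer than k."""
    top = sorted(points, key=lambda p: dot(p["vector"], query), reverse=True)[:k]
    return [p for p in top if p["category"] == category]


points = [
    {"id": 1, "vector": [1.0, 0.0], "category": "guide"},
    {"id": 2, "vector": [0.9, 0.1], "category": "guide"},
    {"id": 3, "vector": [0.8, 0.2], "category": "intro"},
    {"id": 4, "vector": [0.1, 0.9], "category": "intro"},
]
query = [1.0, 0.0]

if __name__ == "__main__":
    print([p["id"] for p in filtered_top_k(points, query, "intro", 2)])
    print([p["id"] for p in post_filtered_top_k(points, query, "intro", 2)])
```

With k=2, the filter-first version returns both intro points, while the post-filter version returns nothing, because the two nearest points overall belong to another category.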

Image: Qdrant filtering concepts documentation (qdrant.tech/documentation/search/filtering), used for editorial coverage of the hybrid-search behaviour this step relies on.

7. Serve through a FastAPI endpoint

The last step wraps everything above in a single-file FastAPI application. The model loads once at startup, the Qdrant client holds an open connection, and each request runs the embed-then-search flow.

from fastapi import FastAPI
from pydantic import BaseModel
from qdrant_client import QdrantClient
from qdrant_client.models import Filter, FieldCondition, MatchValue
from sentence_transformers import SentenceTransformer

app = FastAPI()
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
client = QdrantClient(url="http://localhost:6333")

class SearchRequest(BaseModel):
    query: str
    category: str | None = None
    limit: int = 5

@app.post("/search")
def search(req: SearchRequest):
    query_vector = model.encode(req.query).tolist()
    query_filter = None
    if req.category:
        query_filter = Filter(
            must=[FieldCondition(key="category", match=MatchValue(value=req.category))]
        )
    results = client.query_points(
        collection_name="docs",
        query=query_vector,
        query_filter=query_filter,
        limit=req.limit,
    )
    return {
        "results": [
            {"score": r.score, "text": r.payload["text"], "source": r.payload.get("source")}
            for r in results.points
        ]
    }

Save the file as main.py, then run it with:

uvicorn main:app --reload --port 8000

Hit the endpoint:

curl -X POST http://localhost:8000/search \
 -H "Content-Type: application/json" \
 -d '{"query": "What language is Qdrant written in?", "category": "intro", "limit": 3}'
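The same call can be made from Python without extra dependencies. The sketch below mirrors the curl command using only the standard library. The function names are hypothetical, and the base URL assumes the uvicorn command above is running on port 8000.

```python
import json
import urllib.request


def build_search_request(query, category=None, limit=5, base_url="http://localhost:8000"):
    """Build (but do not send) the POST request the /search endpoint expects."""
    body = {"query": query, "limit": limit}
    if category is not None:
        body["category"] = category
    return urllib.request.Request(
        f"{base_url}/search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def search(query, category=None, limit=5, base_url="http://localhost:8000"):
    """Send the request and return the decoded JSON response."""
    req = build_search_request(query, category, limit, base_url)
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)


# Usage, with the FastAPI app running:
# print(search("What language is Qdrant written in?", category="intro", limit=3))
```

Splitting request construction from sending keeps the payload shape easy to check without a live server.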

That’s the working pipeline. Every later improvement (sharding, snapshots, payload indexes, hybrid sparse+dense search) is a tuning pass on top of these seven steps.

Common pitfalls

Vector-size mismatch. The create_collection call sets the dimension at 384. If a later code change swaps in a different embedding model (say all-mpnet-base-v2 at 768 dimensions), the upsert call throws a vector-size error. Match the embedding model’s dimension to the collection’s size parameter, or delete and re-create the collection.
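A cheap guard before each upsert catches this early and names the offending vector. The helper below is a sketch with a hypothetical name, check_dimensions. With a real model, the expected value could come from model.get_sentence_embedding_dimension() instead of a hard-coded 384.

```python
def check_dimensions(vectors, expected_dim):
    """Raise ValueError for the first vector whose length differs from expected_dim.

    Returns the number of vectors checked when all of them match.
    """
    for i, vec in enumerate(vectors):
        if len(vec) != expected_dim:
            raise ValueError(
                f"vector {i} has {len(vec)} dimensions, collection expects {expected_dim}"
            )
    return len(vectors)


if __name__ == "__main__":
    print(check_dimensions([[0.0] * 384, [0.1] * 384], expected_dim=384))
```

Calling it as check_dimensions(embeddings, 384) just before client.upsert turns a server-side rejection into a local error with a clear message.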

Forgetting the payload index. A filter on a payload field is fast for small collections and slow for large ones unless the field is indexed. Qdrant’s indexing documentation describes the create_payload_index call; create one on every field your queries filter on, and watch latency drop. Without it, a category filter on a million-point collection scans the full collection on every query [6].
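A minimal sketch of that call for this tutorial's string fields follows. It takes the client as a parameter, so the block itself has no third-party imports. The helper name index_filter_fields is hypothetical, and the "keyword" schema assumes the fields hold exact-match strings such as category and source.

```python
def index_filter_fields(client, collection_name, fields):
    """Create a keyword payload index on each field that queries filter on."""
    for field in fields:
        client.create_payload_index(
            collection_name=collection_name,
            field_name=field,
            field_schema="keyword",  # assumption: exact-match string fields
        )
    return list(fields)


# Usage (requires qdrant-client and a running server):
# from qdrant_client import QdrantClient
# client = QdrantClient(url="http://localhost:6333")
# index_filter_fields(client, "docs", ["category", "source"])
```

Numeric or geo fields would need a different field_schema value; the indexing documentation lists the available types.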

Single-point upserts in a loop. Each client.upsert(collection_name="docs", points=[point]) call is one round-trip, so a loop of 10,000 single-point upserts makes 10,000 round-trips. Batch them at 100 to 500 points per call as a reasonable starting point, and the ingest finishes in a fraction of the time.
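The batching itself takes only a few lines. The sketch below chunks any iterable of points and issues one upsert per chunk. The helper names and the default batch size of 256 are assumptions for illustration, to be tuned against your own payload sizes.

```python
from itertools import islice


def batched(items, size):
    """Yield successive lists of at most `size` items from any iterable."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def upsert_in_batches(client, collection_name, points, batch_size=256):
    """Send points in chunks, one upsert call per chunk; returns the points sent."""
    total = 0
    for chunk in batched(points, batch_size):
        client.upsert(collection_name=collection_name, points=chunk)
        total += len(chunk)
    return total


# Usage with the client and points from Step 5:
# upsert_in_batches(client, "docs", points, batch_size=256)
```

Because batched accepts any iterable, the points can come from a generator, which avoids holding a large ingest in memory all at once.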

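Cosine-vs-Dot mismatch. Many sentence-transformers models, including all-MiniLM-L6-v2, are trained for cosine similarity. Storing vectors with Distance.DOT and expecting cosine-style scores can return plausible-looking but wrong rankings whenever the vectors are not unit-length, because the dot product also rewards vector magnitude. Set the metric the model card recommends, and don’t guess.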
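A two-document toy example shows how the ranking can flip. The vectors and names below are invented for illustration: one long vector pointing partly away from the query, and one short vector pointing almost straight at it.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def normalise(v):
    """Scale a vector to unit length."""
    length = math.sqrt(dot(v, v))
    return [x / length for x in v]


def rank(docs, query, score):
    """Return document names ordered from best to worst under `score`."""
    return sorted(docs, key=lambda name: score(docs[name], query), reverse=True)


query = [1.0, 0.0]
docs = {
    "long_off_topic": [3.0, 3.0],   # large magnitude, 45 degrees from the query
    "short_on_topic": [1.0, 0.1],   # small magnitude, nearly parallel to the query
}
unit_docs = {name: normalise(vec) for name, vec in docs.items()}

if __name__ == "__main__":
    print("dot:         ", rank(docs, query, dot))
    print("cosine:      ", rank(docs, query, cosine))
    print("dot on units:", rank(unit_docs, query, dot))
```

On the raw vectors, dot product puts the long off-topic document first and cosine puts the on-topic one first. Once the vectors are normalised to unit length, the two metrics agree.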

Unbounded query latency in production. A FastAPI endpoint that loads the embedding model on every request will be slow. Load the model once at startup (as the code above does) and reuse the instance across requests. The bottleneck shifts from model-load time to embedding inference, which is the place to apply GPU acceleration if you need it.

Image: Qdrant indexing concepts documentation (qdrant.tech/documentation/manage-data/indexing), used for editorial coverage of the payload-index pitfall covered in this section.

Where to go next

The seven steps above ship a working RAG retrieval backend. Production hardening adds three pieces.

Sharding and replication. Qdrant supports collection sharding for write throughput and replication for read availability. Both are normally chosen at collection-creation time, through the shard_number and replication_factor parameters of create_collection. Read the collections documentation before scaling beyond a single-node deployment.

Payload indexing. Every field used in a filter should carry an index. The indexing concepts page covers keyword, integer, geo, and full-text indexes. Index at the time you start writing filters against a field, not later.

Snapshots and backups. Qdrant supports point-in-time snapshots of collections, exportable to S3-compatible storage. Configure the snapshot schedule before your first production deploy, not after the first incident.

The retrieval quality ceiling is set by the embedding model, not Qdrant. When all-MiniLM-L6-v2 stops being good enough, the next step is all-mpnet-base-v2 (768 dim, slower, more accurate) or a domain-tuned model. Hybrid sparse-plus-dense retrieval is the other tuning lever: Qdrant supports BM25-style sparse vectors alongside dense vectors, which adds a layer of keyword precision on top of semantic recall.

For teams shipping their first RAG feature, the self-hosted Docker path keeps the cloud bill at zero through the prototype phase. Move to Qdrant Cloud only when the ops layer (replication, snapshots, monitoring) costs more engineering time than the cloud bill.

How this article was made: an autonomous AI pipeline researched, drafted, fact-checked, and reviewed this piece, aggregating publicly-available information from the sources consulted below. AI (artificial intelligence) can make mistakes, so please cross-check the consulted sources before acting on anything here. Neural Tech Daily is not liable for decisions or outcomes based on this article.

Sources consulted
