Neural Tech Daily

Build Semantic + Keyword Hybrid Search with Weaviate and Cohere: An End-to-End Python Tutorial

Wire Weaviate Cloud hybrid BM25 search, Cohere embed-v4.0 vectors, and rerank-v3.5 into a FastAPI /search endpoint over 100 Wikipedia chunks.

Image: Weaviate hybrid search documentation, used for editorial coverage of the framework taught in this tutorial.

What you’ll build

A working /search endpoint that combines dense vector similarity with BM25 keyword scoring, then re-ranks the top hits with a cross-encoder for better ordering. The corpus is 100 Wikipedia article chunks; the stack is Weaviate Cloud (free 14-day sandbox) for storage and hybrid scoring, Cohere embed-v4.0 for the dense side, Cohere rerank-v3.5 for the final ordering, and FastAPI for the HTTP surface.

Per Weaviate’s hybrid search documentation, the engine fuses a vector-similarity result set with a BM25F keyword result set into one ranked list, with a tunable alpha weight between the two signals. Per Cohere’s rerank overview, rerank-v3.5 is the recommended cross-encoder for typical RAG chunk sizes and operates as a second-pass reranker on a candidate list. The tutorial wires the two into the standard retrieve-then-rerank pattern.

End state: a single POST /search call returns the top 5 reranked passages with source titles, BM25 + vector hybrid score, and rerank relevance score.

Prerequisites

  • Python 3.10 or higher.
  • A Weaviate Cloud account (free; sandbox cluster valid 14 days per Weaviate’s Cloud FAQ).
  • A Cohere API key (free trial tier available at dashboard.cohere.com).
  • Roughly 30 minutes for first run.

Step 1: Project setup

Create a fresh virtual environment and install dependencies.

mkdir hybrid-search && cd hybrid-search
python -m venv .venv
source .venv/bin/activate     # Windows: .venv\Scripts\activate

pip install "weaviate-client>=4.10" cohere fastapi "uvicorn[standard]" \
    requests python-dotenv

The weaviate-client package is the v4 Python client; per its readthedocs reference, the v4 client targets Weaviate 1.23.7 and above and exposes the BM25 + hybrid query APIs used below. The cohere package is the official SDK; per Cohere’s embed API docs, the v2 client (cohere.ClientV2) is the current recommended surface.

Create a .env file in the project root:

WEAVIATE_URL=https://your-cluster.weaviate.network
WEAVIATE_API_KEY=your-admin-key
COHERE_API_KEY=your-cohere-key
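
A missing or blank variable in .env otherwise surfaces later as a KeyError or an authentication failure. A minimal sketch of an early check follows; the helper name missing_env is hypothetical and not part of the tutorial's files.

```python
import os

REQUIRED_VARS = ("WEAVIATE_URL", "WEAVIATE_API_KEY", "COHERE_API_KEY")

def missing_env(env=None) -> list[str]:
    """Return the names of required variables that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]

# Example with an explicit mapping instead of the real environment.
print(missing_env({"WEAVIATE_URL": "https://example", "WEAVIATE_API_KEY": "key"}))
```

In the tutorial's scripts this would be called after load_dotenv(), exiting with a readable message if the returned list is non-empty.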

Step 2: Provision the Weaviate Cloud sandbox

Sign in at console.weaviate.cloud, click Create Cluster, pick the Sandbox plan, and choose any region. Per Weaviate’s “Create a cluster” docs, the sandbox provisions in under a minute and returns a cluster URL plus an Admin API key from the cluster detail page. Copy both into .env.

Per Weaviate’s Cloud FAQ, sandbox clusters expire after 14 days and cannot be extended; an organisation may hold up to two sandbox clusters simultaneously. For a production workload, the Serverless Cloud tier or self-hosted Weaviate is the next step — both expose the same Python client API, so the tutorial code ports unchanged.

Image: Weaviate Cloud — Create a cluster, used for editorial coverage of the provisioning step.

Step 3: Fetch 100 Wikipedia chunks

Wikipedia’s REST API exposes article summaries at en.wikipedia.org/api/rest_v1/page/summary/<title>. The tutorial pulls 20 articles across four topical clusters (machine learning, climate science, history, biology) and chunks each summary into passages of up to 500 characters. The corpus is capped at 100 passages; because summary lengths vary, the actual count may be lower.
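
The endpoint pattern above takes the article title as a single URL path segment. As a rough sketch, assuming titles are kept in human-readable form (for example "El_Niño" rather than a pre-encoded string), the standard library can do the percent-encoding. The names SUMMARY_BASE and summary_url are hypothetical helpers, not part of ingest.py.

```python
from urllib.parse import quote

SUMMARY_BASE = "https://en.wikipedia.org/api/rest_v1/page/summary/"

def summary_url(title: str) -> str:
    # safe="" also encodes "/" so the title always stays one path segment.
    return SUMMARY_BASE + quote(title, safe="")

print(summary_url("El_Niño"))
```

Passing an already percent-encoded title through this helper would encode it a second time, so it only fits the unencoded convention.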

Create ingest.py:

import os
import requests
import cohere
import weaviate
from weaviate.classes.init import Auth
from weaviate.classes.config import Configure, Property, DataType
from dotenv import load_dotenv

load_dotenv()

TOPICS = [
    "Machine_learning", "Deep_learning", "Transformer_(machine_learning_model)",
    "Reinforcement_learning", "Natural_language_processing",
    "Climate_change", "Greenhouse_gas", "Carbon_capture",
    "Renewable_energy", "El_Niño",  # requests percent-encodes the ñ; the stored title stays readable
    "World_War_II", "Cold_War", "Industrial_Revolution",
    "French_Revolution", "Silk_Road",
    "DNA", "Photosynthesis", "Mitochondrion", "CRISPR", "Evolution",
]

def fetch_summary(title: str) -> str:
    url = f"https://en.wikipedia.org/api/rest_v1/page/summary/{title}"
    resp = requests.get(url, headers={"User-Agent": "tutorial/1.0"}, timeout=10)
    resp.raise_for_status()
    return resp.json().get("extract", "")

def chunk(text: str, size: int = 500) -> list[str]:
    text = text.replace("\n", " ").strip()
    return [text[i:i + size] for i in range(0, len(text), size) if text[i:i + size].strip()]

def build_corpus() -> list[dict]:
    docs = []
    for title in TOPICS:
        body = fetch_summary(title)
        for idx, piece in enumerate(chunk(body)):
            docs.append({"title": title.replace("_", " "), "chunk_id": idx, "text": piece})
    return docs[:100]

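build_corpus() truncates the result to at most 100 passages so the sandbox storage footprint stays small. Short summaries produce only one or two chunks each, so the corpus can come out smaller than 100.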
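
To see what the fixed-width chunker does, the sketch below repeats the chunk() function from ingest.py and runs it on a synthetic string (the sample text is made up for illustration). It shows that slices are cut purely by character offset, so a boundary can fall in the middle of a word.

```python
def chunk(text: str, size: int = 500) -> list[str]:
    text = text.replace("\n", " ").strip()
    return [text[i:i + size] for i in range(0, len(text), size) if text[i:i + size].strip()]

# 70 repetitions of a 17-character phrase: 1190 characters, 1189 after strip().
sample = "alpha beta gamma " * 70
pieces = chunk(sample)

print([len(p) for p in pieces])        # [500, 500, 189]
print(repr(pieces[0][-6:]), repr(pieces[1][:4]))  # the word "beta" is split across chunks
```

A word split across two chunks yields two partial tokens on the keyword side, which is one reason production pipelines often split on sentence or word boundaries instead.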

Step 4: Define the Weaviate collection

A Weaviate collection holds objects of one schema. The collection below stores Wikipedia chunks with title, chunk_id, and text properties, plus a 1536-dimension Cohere vector slot.

Add to ingest.py:

def get_client() -> weaviate.WeaviateClient:
    return weaviate.connect_to_weaviate_cloud(
        cluster_url=os.environ["WEAVIATE_URL"],
        auth_credentials=Auth.api_key(os.environ["WEAVIATE_API_KEY"]),
    )

def ensure_collection(client: weaviate.WeaviateClient, name: str = "WikiChunk"):
    if client.collections.exists(name):
        client.collections.delete(name)
    client.collections.create(
        name=name,
        vectorizer_config=Configure.Vectorizer.none(),
        properties=[
            Property(name="title", data_type=DataType.TEXT),
            Property(name="chunk_id", data_type=DataType.INT),
            Property(name="text", data_type=DataType.TEXT),
        ],
    )

Vectorizer is set to none() so the tutorial supplies Cohere embeddings explicitly rather than letting Weaviate’s text2vec-cohere module handle it server-side. The explicit path is easier to debug and decouples the embedding model from the cluster configuration.

Step 5: Embed with Cohere embed-v4.0

Per Cohere’s embeddings documentation, the embed-v4.0 model is the latest generation and supports both text and image inputs. The input_type parameter distinguishes ingest-time embeddings (search_document) from query-time embeddings (search_query), and the Cohere semantic-search guide recommends matching the two correctly for retrieval quality.

Add the embed + insert pass:

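def embed_documents(co: cohere.ClientV2, texts: list[str], batch_size: int = 96) -> list[list[float]]:
    # Cohere's embed endpoint accepts at most 96 texts per call, so send in batches.
    vectors: list[list[float]] = []
    for start in range(0, len(texts), batch_size):
        resp = co.embed(
            model="embed-v4.0",
            texts=texts[start:start + batch_size],
            input_type="search_document",
            embedding_types=["float"],
        )
        vectors.extend(resp.embeddings.float_)
    return vectors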

def ingest_all():
    co = cohere.ClientV2(api_key=os.environ["COHERE_API_KEY"])
    client = get_client()
    try:
        ensure_collection(client)
        corpus = build_corpus()
        texts = [d["text"] for d in corpus]
        vectors = embed_documents(co, texts)
        collection = client.collections.get("WikiChunk")
        with collection.batch.dynamic() as batch:
            for doc, vec in zip(corpus, vectors):
                batch.add_object(properties=doc, vector=vec)
        print(f"ingested {len(corpus)} chunks")
    finally:
        client.close()

if __name__ == "__main__":
    ingest_all()

Run it once:

python ingest.py
# prints "ingested N chunks", where N is at most 100

The dynamic batch context manager handles flushing automatically. For larger corpora, swap to collection.batch.fixed_size(batch_size=100) per Weaviate’s batch import docs.
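
Batch imports can fail for individual objects without raising an exception. A minimal sketch of a post-import check follows, assuming the v4 client's collection.batch.failed_objects attribute, whose entries carry a message field. The helper name summarize_failures and the stand-in object are hypothetical.

```python
from types import SimpleNamespace

def summarize_failures(collection, max_shown: int = 3) -> int:
    """Print the first few batch errors and return the total failure count."""
    failed = collection.batch.failed_objects
    for err in failed[:max_shown]:
        print(f"failed object: {err.message}")
    return len(failed)

# Hypothetical stand-in so the sketch runs without a cluster. With a real
# collection, call summarize_failures(collection) after the batch block exits.
fake_collection = SimpleNamespace(
    batch=SimpleNamespace(failed_objects=[SimpleNamespace(message="example error")])
)
failure_count = summarize_failures(fake_collection)
```

In ingest_all() the call would sit right after the with block, before the "ingested" message is printed.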

Image: Cohere — Introducing Embed 4 blog post, used for editorial coverage of the embedding model the tutorial uses.

Step 6: Hybrid query with BM25 + vectors

Per Weaviate’s hybrid search reference, the hybrid() query method on a collection accepts a text query, a precomputed vector, an alpha weighting (0.0 is pure BM25, 1.0 is pure vector, 0.5 splits evenly), and a limit. The fused score appears on each returned object’s metadata.

Create search.py:

import os
import cohere
import weaviate
from weaviate.classes.init import Auth
from weaviate.classes.query import MetadataQuery
from dotenv import load_dotenv

load_dotenv()

def get_client() -> weaviate.WeaviateClient:
    return weaviate.connect_to_weaviate_cloud(
        cluster_url=os.environ["WEAVIATE_URL"],
        auth_credentials=Auth.api_key(os.environ["WEAVIATE_API_KEY"]),
    )

def embed_query(co: cohere.ClientV2, query: str) -> list[float]:
    resp = co.embed(
        model="embed-v4.0",
        texts=[query],
        input_type="search_query",
        embedding_types=["float"],
    )
    return resp.embeddings.float_[0]

def hybrid_retrieve(query: str, alpha: float = 0.5, limit: int = 20) -> list[dict]:
    co = cohere.ClientV2(api_key=os.environ["COHERE_API_KEY"])
    client = get_client()
    try:
        collection = client.collections.get("WikiChunk")
        qvec = embed_query(co, query)
        resp = collection.query.hybrid(
            query=query,
            vector=qvec,
            alpha=alpha,
            limit=limit,
            return_metadata=MetadataQuery(score=True, explain_score=True),
        )
        return [
            {
                "title": obj.properties["title"],
                "chunk_id": obj.properties["chunk_id"],
                "text": obj.properties["text"],
                "hybrid_score": float(obj.metadata.score or 0.0),
            }
            for obj in resp.objects
        ]
    finally:
        client.close()

alpha=0.5 weights the two signals equally and is a reasonable starting point. It is not Weaviate’s own default, which leans toward the vector side (the docs have listed 0.75), so pass alpha explicitly. Drop alpha to 0.25 for keyword-heavy use cases (exact-name search, code lookup); raise to 0.75 for paraphrase-heavy queries.
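
To build intuition for what alpha does, the sketch below fuses two score sets by min-max normalising each and taking a weighted sum. It is an illustration only: the fusion in Weaviate runs server-side and differs in detail, and the function names and scores here are made up.

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to [0, 1]; a constant score set maps to 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {k: 1.0 for k in scores}
    return {k: (v - lo) / (hi - lo) for k, v in scores.items()}

def fuse(bm25: dict[str, float], vector: dict[str, float], alpha: float = 0.5):
    """alpha=0.0 keeps only the keyword signal, alpha=1.0 only the vector signal."""
    b, v = min_max(bm25), min_max(vector)
    fused = {i: alpha * v.get(i, 0.0) + (1 - alpha) * b.get(i, 0.0) for i in set(b) | set(v)}
    # Sort by score descending, then by id so ties are deterministic.
    return sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))

bm25_scores = {"a": 4.0, "b": 2.0, "c": 0.0}      # illustrative keyword scores
vector_scores = {"b": 0.9, "c": 0.8, "d": 0.5}    # illustrative similarities

for alpha in (0.0, 0.5, 1.0):
    print(alpha, [doc_id for doc_id, _ in fuse(bm25_scores, vector_scores, alpha)])
```

Document "b" scores moderately on both signals and wins at alpha=0.5, while "a" only leads when the keyword side carries all the weight.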

Step 7: Add the Cohere rerank pass

The retrieve stage returns 20 candidates. Per Cohere’s reranking quickstart, rerank-v3.5 scores each candidate against the query with a cross-encoder and returns a relevance score between 0 and 1; the recommended pattern is to feed the top 20-100 retrieved candidates and keep the top 5-10.

Append to search.py:

def rerank(query: str, candidates: list[dict], top_n: int = 5) -> list[dict]:
    co = cohere.ClientV2(api_key=os.environ["COHERE_API_KEY"])
    resp = co.rerank(
        model="rerank-v3.5",
        query=query,
        documents=[c["text"] for c in candidates],
        top_n=top_n,
    )
    out = []
    for r in resp.results:
        base = candidates[r.index]
        out.append({**base, "rerank_score": r.relevance_score})
    return out

def search(query: str, alpha: float = 0.5, top_n: int = 5) -> list[dict]:
    candidates = hybrid_retrieve(query, alpha=alpha, limit=20)
    if not candidates:
        return []
    return rerank(query, candidates, top_n=top_n)

Per Cohere’s rerank best-practices docs, the result object exposes results[i].index (position in the original list) and results[i].relevance_score. The wrapper above stitches those back to the original payload so the API consumer sees title + text + both scores.
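
The stitching step can be exercised without an API call. The sketch below uses a hypothetical StubResult class as a stand-in for one entry of resp.results, with made-up scores, and shows how index maps each rerank result back to its candidate.

```python
from dataclasses import dataclass

@dataclass
class StubResult:
    """Hypothetical stand-in exposing the two fields the wrapper reads."""
    index: int
    relevance_score: float

def stitch(candidates: list[dict], results: list[StubResult]) -> list[dict]:
    # Copy each candidate so the original list is left untouched.
    return [{**candidates[r.index], "rerank_score": r.relevance_score} for r in results]

candidates = [
    {"title": "DNA", "hybrid_score": 0.9},
    {"title": "CRISPR", "hybrid_score": 0.7},
    {"title": "Evolution", "hybrid_score": 0.6},
]
# The reranker returns results in its own order, best first.
results = [StubResult(index=1, relevance_score=0.95), StubResult(index=0, relevance_score=0.40)]

reranked = stitch(candidates, results)
print([(r["title"], r["rerank_score"]) for r in reranked])
```

The output order follows the reranker, not the hybrid score, and both scores travel together in each result.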

Step 8: Expose the /search endpoint with FastAPI

FastAPI’s first-steps tutorial covers the minimal app shape: declare an app, declare a route, run with uvicorn. The wrapper below adds a Pydantic request model so the request body is validated and the OpenAPI schema is generated automatically.

Create app.py:

from fastapi import FastAPI
from pydantic import BaseModel, Field
from search import search

app = FastAPI(title="Hybrid Search Demo")

class SearchRequest(BaseModel):
    query: str = Field(..., min_length=1, max_length=500)
    alpha: float = Field(0.5, ge=0.0, le=1.0)
    top_n: int = Field(5, ge=1, le=20)

@app.post("/search")
def post_search(req: SearchRequest):
    return {"query": req.query, "results": search(req.query, req.alpha, req.top_n)}

@app.get("/health")
def health():
    return {"status": "ok"}

Run it:

uvicorn app:app --reload --port 8000

Test from a second terminal:

curl -X POST http://localhost:8000/search \
  -H "Content-Type: application/json" \
  -d '{"query": "how do transformers handle long context", "alpha": 0.6, "top_n": 3}'

The response contains the top 3 passages with hybrid_score from Weaviate’s fused BM25 + vector score and rerank_score from Cohere’s cross-encoder.
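
The same call can be made from Python. The sketch below uses only the standard library and assumes the server from this step is listening on localhost:8000; the helper names build_search_request and post_search are hypothetical.

```python
import json
import urllib.request

def build_search_request(query: str, alpha: float = 0.5, top_n: int = 5,
                         base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build the POST /search request without sending it."""
    body = json.dumps({"query": query, "alpha": alpha, "top_n": top_n}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/search",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def post_search(query: str, **kwargs) -> dict:
    # Sends the request; requires the uvicorn server to be running.
    with urllib.request.urlopen(build_search_request(query, **kwargs), timeout=30) as resp:
        return json.load(resp)

req = build_search_request("how do transformers handle long context", alpha=0.6, top_n=3)
print(req.get_method(), req.full_url)
```

Calling post_search(...) with the server up returns the parsed JSON body, with the passages under the "results" key.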

Image: FastAPI GitHub repository, used for editorial coverage of the API framework the tutorial wires up.

Step 9: Tuning alpha and observing the difference

Hit the endpoint with three alpha values for the same query and compare. With alpha=0.0 (pure BM25), the engine ranks by exact-keyword overlap; with alpha=1.0 (pure vector), it ranks by Cohere embedding cosine similarity; with alpha=0.5, the two signals fuse.
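
Because /search reranks whatever the hybrid stage returns, the reranker can mask differences between alpha values; comparing at the retrieve stage makes them easier to see. A minimal sketch follows, assuming any callable with the same signature as hybrid_retrieve(query, alpha=..., limit=...). The helper compare_alphas and the stub with its made-up rankings are hypothetical.

```python
def compare_alphas(retrieve, query: str, alphas=(0.0, 0.5, 1.0), k: int = 3) -> dict:
    """Return {alpha: [top-k titles]} for a retrieve(query, alpha=, limit=) callable."""
    return {a: [hit["title"] for hit in retrieve(query, alpha=a, limit=k)] for a in alphas}

# Hypothetical stub with made-up rankings so the sketch runs offline.
def fake_retrieve(query: str, alpha: float = 0.5, limit: int = 20) -> list[dict]:
    keyword_first = ["Transformer", "Deep learning", "DNA"]
    vector_first = ["Deep learning", "Transformer", "Machine learning"]
    ranked = keyword_first if alpha < 0.5 else vector_first
    return [{"title": t} for t in ranked[:limit]]

table = compare_alphas(fake_retrieve, "attention mechanism")
for alpha, titles in table.items():
    print(alpha, titles)
```

With the real pipeline, pass hybrid_retrieve from search.py in place of the stub.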

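A useful debugging move is to request explain_score=True in the MetadataQuery (already wired above). Per the Weaviate Python client reference, each returned object then carries an explanation string in obj.metadata.explain_score that surfaces the per-component contributions, so the operator can see which signal dominated each result. hybrid_retrieve() requests the field but does not include it in its output; add it to the returned dict to inspect it.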
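
One way to surface the explanation is a small mapping helper, sketched here under the assumption that returned objects expose properties, metadata.score, and metadata.explain_score as in the v4 client. The name to_debug_hit and the placeholder explanation text are hypothetical; the real string's format is whatever Weaviate returns.

```python
from types import SimpleNamespace

def to_debug_hit(obj) -> dict:
    """Map one returned object to a dict that includes the explanation string."""
    meta = obj.metadata
    return {
        "title": obj.properties["title"],
        "hybrid_score": float(meta.score or 0.0),
        "explain": (meta.explain_score or "").strip(),
    }

# Hypothetical stand-in for one entry of resp.objects.
fake_obj = SimpleNamespace(
    properties={"title": "DNA"},
    metadata=SimpleNamespace(score=0.42, explain_score="  placeholder explanation text  "),
)
debug_hit = to_debug_hit(fake_obj)
print(debug_hit)
```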

The rerank pass is the second layer of correction: the hybrid result set may surface a keyword-rich but topically-off passage, and rerank-v3.5 (per Cohere’s changelog) is trained specifically to pull the topically-relevant passage to the top of a candidate set.

Common failure modes

Connection errors on Weaviate Cloud. The sandbox cluster URL includes a hyphenated subdomain; copy it verbatim from the cluster detail page. The Admin API key is distinct from the Read-only key — the ingest path needs Admin.

Vector dimension mismatch on insert. embed-v4.0 returns 1536-dimension vectors by default per Cohere’s embed API reference. Weaviate fixes a collection’s vector length from the first vector inserted, so later vectors of a different length (for example after switching the embedding model or its output dimension) are rejected. Drop and recreate the collection, then re-run the ingest.

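Empty results. A hybrid query against a populated collection returns candidates even when the corpus is small, so an empty list usually means the collection is empty or the ingest run did not complete. Verify the object count: collection.aggregate.over_all().total_count should match the number printed by ingest.py.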

Rerank-v3.5 latency. Per Cohere’s rerank-v3.5 changelog, the model is optimised for typical RAG chunk sizes (under ~4096 tokens per document) and reranking ~20 candidates typically returns in under a second; sending 100+ candidates per call increases latency materially.

Image: Cohere Rerank product page, used for editorial coverage of the reranking model the tutorial uses.

Where to go next

  • Swap the corpus. Replace build_corpus() with a loader for the actual document set: PDFs via pypdf, Markdown files via frontmatter, or a SQL extract. The collection schema stays the same.
  • Move past the sandbox. Per Weaviate’s pricing page, Serverless Cloud is the next tier up when the 14-day sandbox expires; the Python client connection code is unchanged.
  • Add citations to LLM answers. Pipe the reranked top 3-5 chunks into Cohere’s chat endpoint or any LLM as grounded context, and surface the chunk titles + chunk_ids back to the user as citations.
  • Benchmark alpha values. Build a small held-out query set with known relevant chunks, sweep alpha from 0.0 to 1.0 in steps of 0.1, and measure NDCG@5 or MRR at each setting.
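
For the alpha benchmark, the two metrics named above can be computed in a few lines. The sketch below assumes binary relevance (a chunk is either relevant or not); the function names and the example ranking are illustrative only.

```python
import math

def reciprocal_rank(ranked: list[str], relevant: set[str]) -> float:
    """1 / position of the first relevant item, or 0.0 if none appears."""
    for pos, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / pos
    return 0.0

def ndcg_at_k(ranked: list[str], relevant: set[str], k: int = 5) -> float:
    """Binary-relevance NDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(pos + 1)
        for pos, doc_id in enumerate(ranked[:k], start=1)
        if doc_id in relevant
    )
    ideal = sum(1.0 / math.log2(pos + 1) for pos in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0

ranked = ["x", "a", "y", "b"]   # made-up ranking for one query
relevant = {"a", "b"}           # made-up ground truth

print(reciprocal_rank(ranked, relevant))
print(round(ndcg_at_k(ranked, relevant, k=5), 4))
```

MRR is the mean of reciprocal_rank over all queries in the held-out set; averaging ndcg_at_k the same way gives NDCG@5 per alpha value.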

The full source for this tutorial fits in three files — ingest.py, search.py, app.py — and totals around 150 lines. The hybrid retrieve-then-rerank pattern, per both Weaviate’s and Cohere’s documentation, is the same shape that scales from this 100-passage demo to multi-million-document production deployments.

How this article was made: an autonomous AI pipeline researched, drafted, fact-checked, and reviewed this piece, aggregating publicly-available information from the sources consulted below. AI (artificial intelligence) can make mistakes, so please cross-check the consulted sources before acting on anything here. Neural Tech Daily is not liable for decisions or outcomes based on this article.

Sources consulted

Cited Sources

  1. Weaviate: Hybrid search documentation (alpha parameter, BM25F + vector fusion)
  2. Weaviate: Cloud FAQ (14-day sandbox expiry, two-cluster organisation limit)
  3. Cohere: embed-v4.0 model overview and input_type parameter
  4. Cohere: rerank-v3.5 cross-encoder, retrieve-then-rerank recommended pattern
  5. Cohere: reranking quickstart (top_n parameter, relevance_score 0-1 range)
  6. Weaviate Python Client: BM25 query reference (v4.20+ targets Weaviate 1.23.7+)
  7. FastAPI: First steps tutorial (app declaration, route handler, uvicorn run)
