Build a Multimodal RAG App with Images + Text: An End-to-End Python Tutorial

Build a multimodal RAG app end-to-end in Python: LangChain + ChromaDB with OpenCLIP, query a product catalogue with photos or text, answer with Claude Sonnet.

20 May 2026 · Updated 20 May 2026 · ~12 min read

Image: LangChain multimodal-inputs reference, used for editorial coverage of the multimodal message schema this tutorial relies on.

What you’ll build

By the end of this walkthrough you will have a working multimodal retrieval-augmented generation (RAG) app: a ChromaDB collection that holds OpenCLIP embeddings for a small product catalogue of photos plus written descriptions, a query function that accepts either a text prompt or an uploaded image and returns the top-5 matches with source attribution, and a Claude Sonnet answer-generation step that reads those matches (image + text together) and writes a grounded reply.

Each piece is documented in a primary source: LangChain’s langchain-anthropic package wires Claude into the message schema¹, ChromaDB ships an OpenCLIP embedding function that puts text and images in the same vector space², and Anthropic’s vision-capable Sonnet models accept image content blocks alongside text in a single Messages API call³. Indexing and retrieval run locally on a laptop CPU; apart from the one-time download of the OpenCLIP weights, only the final answer-generation step calls the network.

What you’ll need

Python 3.10 or later, plus a virtual-environment workflow (venv, uv, poetry, take your pick).
An Anthropic API key with a vision-capable Sonnet model enabled. Claude’s vision is available on the Claude 3 family and later (Haiku, Sonnet, Opus), all of which accept image content blocks via the Messages API per Anthropic’s vision documentation³.
A folder of ~10–20 product photos plus a CSV (or JSON) mapping each filename to a short written description. The tutorial uses a small electronics catalogue but any image-plus-caption corpus works.
Roughly 60 minutes start to finish, plus 1–2 GB of free disk for the PyTorch dependency and the OpenCLIP model weights, which are downloaded on first use.

Budget split: 10 minutes to install and set keys, 15 minutes to ingest the catalogue, 20 minutes to wire the query + Claude answer step, 15 minutes to test with both text and image queries.

Step 1. Set up the project

Create a working directory, a virtual environment, and a .env file for your API key:

mkdir multimodal-rag && cd multimodal-rag
python -m venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate
pip install -U langchain langchain-anthropic chromadb open_clip_torch pillow python-dotenv

The six installs cover all the moving parts: langchain for the orchestration glue, langchain-anthropic for the Claude chat-model binding¹, chromadb for the vector store⁴, open_clip_torch plus pillow because ChromaDB’s multimodal embedding function calls OpenCLIP under the hood and Pillow handles the image loading⁵, and python-dotenv to load the API key from .env.

Drop your Anthropic key into .env:

ANTHROPIC_API_KEY=sk-ant-...

Create a data/ folder. Put your product photos in data/images/ and write a data/catalogue.csv with two columns: filename (just the basename, no path) and description (one to three sentences per item). A tiny illustrative excerpt:

filename,description
keyboard-001.jpg,"Tenkeyless mechanical keyboard with brown tactile switches and a navy-blue PBT keycap set."
mouse-003.jpg,"Wireless ergonomic vertical mouse with a thumb rest and four programmable side buttons."
monitor-007.jpg,"27-inch 1440p IPS monitor with a height-adjustable stand and a single USB-C upstream port."
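
Before indexing, it can help to confirm that every CSV row points at a real photo and carries a description. The following is a minimal sketch, assuming the two-column layout described above; the function name check_catalogue is hypothetical and not part of any library.

```python
import csv
from pathlib import Path


def check_catalogue(csv_path, image_dir):
    """Return (missing_files, empty_descriptions) for a filename,description CSV."""
    missing, empty = [], []
    # newline="" is the csv module's recommended way to open files for reading
    with Path(csv_path).open(newline="", encoding="utf-8") as fh:
        for row in csv.DictReader(fh):
            filename = (row.get("filename") or "").strip()
            description = (row.get("description") or "").strip()
            if not (Path(image_dir) / filename).is_file():
                missing.append(filename)
            if not description:
                empty.append(filename)
    return missing, empty


if __name__ == "__main__":
    missing, empty = check_catalogue("data/catalogue.csv", "data/images")
    print("rows whose image is missing:", missing)
    print("rows with an empty description:", empty)
```

Two empty lists mean the catalogue is ready for Step 2; anything else is cheaper to fix now than after embedding.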

Image: ChromaDB multimodal-embeddings documentation, used for editorial coverage of the OpenCLIP-backed multimodal collection API.

Step 2. Build the multimodal index

ChromaDB exposes multimodal collections via the OpenCLIP embedding function. The same function embeds text strings and images into a shared vector space, so a text query like “compact mechanical keyboard” can retrieve image rows whose pixels look like a small mechanical keyboard, and an uploaded photo can retrieve text rows whose description matches what is in the photo².
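
To make the shared-vector-space idea concrete, here is a toy sketch of nearest-neighbour ranking by cosine similarity. The three-dimensional vectors are invented for illustration and merely stand in for real OpenCLIP embeddings, which have hundreds of dimensions; the row ids mimic the naming used in the indexing script that follows.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank(query_vec, rows):
    """Sort (row_id, vector) pairs by descending similarity to the query."""
    scored = [(row_id, cosine_similarity(query_vec, vec)) for row_id, vec in rows]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up vectors: image rows and text rows live in the same space,
# so one query vector can be compared against both kinds of row.
toy_rows = [
    ("img::keyboard-001.jpg", [0.9, 0.1, 0.0]),
    ("txt::keyboard-001.jpg", [0.8, 0.2, 0.1]),
    ("img::monitor-007.jpg", [0.0, 0.2, 0.9]),
]
toy_query = [1.0, 0.0, 0.0]  # pretend embedding of "compact mechanical keyboard"

if __name__ == "__main__":
    for row_id, score in rank(toy_query, toy_rows):
        print(row_id, round(score, 3))
```

Both keyboard rows outrank the monitor row regardless of modality, which is the behaviour the real index relies on.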

Create build_index.py:

import csv
from pathlib import Path

import chromadb
from chromadb.utils.embedding_functions import OpenCLIPEmbeddingFunction
from chromadb.utils.data_loaders import ImageLoader

DATA_DIR = Path("data")
IMAGE_DIR = DATA_DIR / "images"
CSV_PATH = DATA_DIR / "catalogue.csv"


def build():
    client = chromadb.PersistentClient(path="./chroma_store")
    embedder = OpenCLIPEmbeddingFunction()
    loader = ImageLoader()

    collection = client.get_or_create_collection(
        name="product_catalogue",
        embedding_function=embedder,
        data_loader=loader,
    )

    with CSV_PATH.open(newline="", encoding="utf-8") as fh:
        rows = list(csv.DictReader(fh))

    image_ids, image_uris, image_meta = [], [], []
    text_ids, text_docs, text_meta = [], [], []

    for row in rows:
        filename = row["filename"]
        description = row["description"]
        image_path = IMAGE_DIR / filename
        if not image_path.exists():
            print(f"skip: {filename} not found")
            continue

        image_ids.append(f"img::{filename}")
        image_uris.append(str(image_path))
        image_meta.append({"source": filename, "modality": "image"})

        text_ids.append(f"txt::{filename}")
        text_docs.append(description)
        text_meta.append({"source": filename, "modality": "text"})

    # upsert rather than add keeps re-runs idempotent: rows with existing ids are overwritten
    collection.upsert(ids=image_ids, uris=image_uris, metadatas=image_meta)
    collection.upsert(ids=text_ids, documents=text_docs, metadatas=text_meta)

    print(f"indexed {len(image_ids)} images and {len(text_docs)} descriptions")


if __name__ == "__main__":
    build()

Two things to call out. First, PersistentClient writes the index to disk under ./chroma_store/, so you can rebuild the query layer without re-embedding⁴. Second, ChromaDB stores embeddings and URI pointers, not the original images — so the data/images/ folder needs to stay where it is, or the URIs will dangle².
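
Because the index holds pointers rather than pixels, a quick check for dangling URIs is worth having after any folder move. This is a minimal sketch with a hypothetical helper name; it assumes you pass in the stored URI list, for example one read back from the collection with include=["uris"], where text rows carry no URI.

```python
from pathlib import Path


def find_dangling_uris(uris):
    """Return the URIs that no longer point at a file on disk.

    None entries (rows stored without a URI, such as text rows) are skipped.
    """
    return [uri for uri in uris if uri and not Path(uri).is_file()]


if __name__ == "__main__":
    # Illustrative input only; in practice the list would come from the index.
    stored = ["data/images/keyboard-001.jpg", None, "data/images/mouse-003.jpg"]
    print("dangling:", find_dangling_uris(stored))
```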

Run it once:

python build_index.py

The first run downloads the OpenCLIP weights (around 600 MB for the default ViT-B/32 checkpoint, per the OpenCLIP project README⁵). Subsequent runs read from the local cache.

Image: Anthropic vision documentation, used for editorial coverage of the image-content-block schema Claude Sonnet accepts.

Step 3. Query with text or with an image

Now the query layer. ChromaDB lets you query a multimodal collection with query_texts=[...] for natural-language search, or query_uris=[...] (or query_images=[...]) for image-to-anything search². Both return the same shape: distances, ids, metadatas, and (optionally) documents or image data.

Create query.py:

from pathlib import Path
from typing import Optional

import chromadb
from chromadb.utils.embedding_functions import OpenCLIPEmbeddingFunction
from chromadb.utils.data_loaders import ImageLoader


def open_collection():
    client = chromadb.PersistentClient(path="./chroma_store")
    return client.get_or_create_collection(
        name="product_catalogue",
        embedding_function=OpenCLIPEmbeddingFunction(),
        data_loader=ImageLoader(),
    )


def search(text: Optional[str] = None, image_path: Optional[str] = None, k: int = 5):
    collection = open_collection()

    if text and image_path:
        raise ValueError("pass either text or image_path, not both")
    if not text and not image_path:
        raise ValueError("pass a text query or an image_path")

    if text:
        result = collection.query(
            query_texts=[text],
            n_results=k,
            include=["documents", "metadatas", "distances", "uris"],
        )
    else:
        result = collection.query(
            query_uris=[image_path],
            n_results=k,
            include=["documents", "metadatas", "distances", "uris"],
        )

    hits = []
    for i in range(len(result["ids"][0])):
        hits.append({
            "id": result["ids"][0][i],
            "distance": result["distances"][0][i],
            "metadata": result["metadatas"][0][i],
            "document": (result["documents"][0][i]
                         if result.get("documents") else None),
            "uri": (result["uris"][0][i]
                    if result.get("uris") else None),
        })
    return hits


if __name__ == "__main__":
    for hit in search(text="quiet mechanical keyboard for an office"):
        print(hit["metadata"]["source"], round(hit["distance"], 3),
              "::", hit["document"] or hit["uri"])

A quick sanity check: a text query and an image query should each return five rows (provided the collection holds at least five), ideally mixed across the image and text modalities, and ranked by ascending distance.

The metadata["source"] field carries the original filename forward as source attribution. When Claude answers in the next step, the response can cite which catalogue rows it leaned on by name.
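
Since each product is indexed twice (one image row, one text row), the same source filename can appear twice in the top five. The sketch below groups hits by source so attribution stays one line per product; group_by_source is a hypothetical helper that assumes the hit dictionaries produced by search() above.

```python
def group_by_source(hits):
    """Collapse hits that share a source filename, keeping the best distance."""
    groups = {}
    for hit in hits:
        source = hit["metadata"]["source"]
        entry = groups.setdefault(
            source,
            {"source": source, "best_distance": hit["distance"], "modalities": []},
        )
        entry["best_distance"] = min(entry["best_distance"], hit["distance"])
        entry["modalities"].append(hit["metadata"]["modality"])
    # Closest product first, mirroring the ranking of the raw hits
    return sorted(groups.values(), key=lambda group: group["best_distance"])


if __name__ == "__main__":
    sample_hits = [
        {"distance": 0.2, "metadata": {"source": "keyboard-001.jpg", "modality": "text"}},
        {"distance": 0.5, "metadata": {"source": "mouse-003.jpg", "modality": "image"}},
        {"distance": 0.3, "metadata": {"source": "keyboard-001.jpg", "modality": "image"}},
    ]
    for group in group_by_source(sample_hits):
        print(group["source"], group["best_distance"], group["modalities"])
```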

Step 4. Generate the answer with Claude Sonnet

This is the multimodal step. The top-5 hits include both image rows (URI pointers to JPEGs on disk) and text rows (the original descriptions). Claude Sonnet accepts both in a single HumanMessage via LangChain’s content-block schema: text blocks for prompts and retrieved descriptions, image blocks (base64-encoded) for the retrieved product photos⁶.

Create answer.py:

import base64
import os
from pathlib import Path
from typing import Optional

from dotenv import load_dotenv
from langchain_anthropic import ChatAnthropic
from langchain_core.messages import HumanMessage

from query import search

load_dotenv()


def encode_image(path: str) -> tuple[str, str]:
    suffix = Path(path).suffix.lower().lstrip(".")
    mime = {"jpg": "image/jpeg", "jpeg": "image/jpeg",
            "png": "image/png", "webp": "image/webp"}.get(suffix, "image/jpeg")
    with open(path, "rb") as fh:
        b64 = base64.standard_b64encode(fh.read()).decode("utf-8")
    return b64, mime


def build_message(question: str, hits: list[dict]) -> HumanMessage:
    blocks = [{
        "type": "text",
        "text": (
            "You are answering a question about a product catalogue. "
            "Below are the top retrieval hits — some are product photos, "
            "some are written descriptions. Cite each item you use by its "
            "source filename. If the hits do not support an answer, say so.\n\n"
            f"Question: {question}\n\nRetrieved hits:"
        ),
    }]

    for hit in hits:
        source = hit["metadata"]["source"]
        modality = hit["metadata"]["modality"]
        if modality == "text":
            blocks.append({
                "type": "text",
                "text": f"[source={source}] {hit['document']}",
            })
        else:
            b64, mime = encode_image(hit["uri"])
            blocks.append({
                "type": "text",
                "text": f"[source={source}] (image below)",
            })
            blocks.append({
                "type": "image",
                "source_type": "base64",
                "data": b64,
                "mime_type": mime,
            })

    return HumanMessage(content=blocks)


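def answer(question: Optional[str] = None, image_query: Optional[str] = None):
    if question and not image_query:
        hits = search(text=question, k=5)
        prompt_question = question
    elif image_query and not question:
        hits = search(image_path=image_query, k=5)
        prompt_question = (
            "Describe products in the catalogue that match the uploaded image, "
            "which is attached after the retrieved hits."
        )
    else:
        raise ValueError("pass exactly one of question or image_query")

    model = ChatAnthropic(model="claude-sonnet-4-5", max_tokens=1024)
    message = build_message(prompt_question, hits)
    if image_query:
        # Attach the query photo itself; without it Claude only sees the hits
        # and has nothing to compare them against.
        b64, mime = encode_image(image_query)
        message = HumanMessage(content=[
            *message.content,
            {"type": "text", "text": "[uploaded query image]"},
            {"type": "image", "source_type": "base64",
             "data": b64, "mime_type": mime},
        ])
    response = model.invoke([message])

    return {"answer": response.content, "hits": hits}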


if __name__ == "__main__":
    result = answer(question="Which keyboard would work in a shared office?")
    print(result["answer"])
    print("---")
    for hit in result["hits"]:
        print(hit["metadata"]["source"], "::", hit["metadata"]["modality"])

Two things to flag. LangChain’s content-block schema accepts both {"type": "image", "url": ...} and base64-encoded variants per the messages reference; this tutorial uses base64 because the images live locally on disk⁶. And the model name claude-sonnet-4-5 is the production vision-capable Sonnet on Anthropic’s API catalogue as of May 2026; the same code path works for newer Sonnet versions when they ship⁷.
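
Content-block key names have shifted between LangChain releases, so it is worth knowing the provider-native shape as a fallback. The sketch below converts the block used in answer.py into the image block defined by Anthropic's Messages API (type, then a source object with type, media_type, and data). Whether your installed langchain-anthropic version forwards the native shape unchanged is an assumption to verify; the helper names are hypothetical.

```python
def langchain_image_block(b64: str, mime: str) -> dict:
    """The block shape used in answer.py above."""
    return {"type": "image", "source_type": "base64", "data": b64, "mime_type": mime}


def anthropic_image_block(b64: str, mime: str) -> dict:
    """The image block shape from Anthropic's Messages API."""
    return {
        "type": "image",
        "source": {"type": "base64", "media_type": mime, "data": b64},
    }


def to_anthropic_native(block: dict) -> dict:
    """Convert a base64 block in the tutorial's shape to the native shape."""
    if block.get("type") != "image" or block.get("source_type") != "base64":
        raise ValueError("expected a base64 image block in the tutorial's shape")
    return anthropic_image_block(block["data"], block["mime_type"])


if __name__ == "__main__":
    # "QUJD" is just base64 for the bytes b"ABC", not a real image
    print(to_anthropic_native(langchain_image_block("QUJD", "image/png")))
```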

Image: LangChain ChatAnthropic integration docs, used for editorial coverage of the chat-model binding the answer step uses.

Step 5. Try it both ways

Run two checks. First, a pure text query the catalogue should answer well:

python answer.py

Expected behaviour: Claude reads the five retrieved hits (a mix of keyboard descriptions and photos), then writes one or two paragraphs that name the matching SKUs by filename and explain why each fits a shared-office scenario.

Then try an image query. Drop a photo of a keyboard you own into data/queries/sample.jpg and adapt the entry point:

result = answer(image_query="data/queries/sample.jpg")

The image-as-query path runs OpenCLIP over the uploaded photo, retrieves visually similar catalogue rows (plus any text descriptions whose embeddings sit nearby in the shared vector space), and asks Claude to compare the uploaded image against the retrieved set. Filenames in the response let a reader trace each claim back to a specific catalogue row.

Step 6. Validate the retrieval before trusting it

Multimodal embeddings can surface confident-looking false matches. Two checks worth running before shipping:

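Distance threshold. The collection in Step 2 is created without a distance setting, so ChromaDB reports squared L2 distances rather than cosine distances, and the useful cut-off depends on the metric, the checkpoint, and whether the query crosses modalities. Treat a figure such as 0.35 as a starting guess only: print the distances for a handful of known-good and known-bad queries, pick a threshold from those, then filter low-confidence rows out before passing them to Claude, or surface them with an explicit hedge in the prompt.
Modality balance. A pure-text query that retrieves five text rows and zero image rows defeats the multimodal premise. CLIP-style embeddings tend to sit closer to others of the same modality, so watch for result sets where one modality dominates; rebalancing the catalogue (fewer descriptions per photo, or vice versa) can help, as can running one query per modality with a where filter on the modality metadata field and merging the results.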
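
Both checks can be sketched as small post-processing helpers over the hit list that search() returns. The function names and the threshold passed in the demo are illustrative assumptions; calibrate the cut-off on your own catalogue.

```python
from collections import Counter


def filter_hits(hits, max_distance):
    """Keep only hits whose distance is at or below the chosen cut-off."""
    return [hit for hit in hits if hit["distance"] <= max_distance]


def modality_counts(hits):
    """Count how many hits came from each modality ('image' or 'text')."""
    return Counter(hit["metadata"]["modality"] for hit in hits)


if __name__ == "__main__":
    sample_hits = [
        {"distance": 0.21, "metadata": {"source": "keyboard-001.jpg", "modality": "text"}},
        {"distance": 0.30, "metadata": {"source": "keyboard-001.jpg", "modality": "image"}},
        {"distance": 0.90, "metadata": {"source": "monitor-007.jpg", "modality": "text"}},
    ]
    kept = filter_hits(sample_hits, max_distance=0.35)  # placeholder threshold
    print(len(kept), "hits kept;", dict(modality_counts(kept)))
```

If modality_counts shows a single modality for most queries, that is the signal to try the per-modality query or to rebalance the catalogue.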

Per the LangChain messages reference, not every chat model supports every file type, and provider-specific size and dimension limits apply⁶. Anthropic’s vision documentation puts the per-request cap at 100 images for 200k-context models and 600 for other models, with maximum dimensions of 8000x8000 px (reduced to 2000x2000 when more than 20 images ship in one request)³. The five-row top-k in this tutorial sits well under any of those ceilings.
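
If the top-k grows, a pre-flight check against those ceilings avoids a rejected request. The sketch below hard-codes the figures quoted in the paragraph above (using the more conservative 100-image cap by default) and takes plain (width, height) pairs, which Pillow's Image.open(path).size can supply; the function name is hypothetical.

```python
def check_image_limits(sizes, max_images=100):
    """Return a list of human-readable problems for a batch of image sizes.

    sizes: list of (width, height) tuples in pixels.
    Limits mirror the figures quoted in this tutorial; confirm them against
    the provider documentation before relying on them.
    """
    problems = []
    if len(sizes) > max_images:
        problems.append(f"{len(sizes)} images exceeds the cap of {max_images}")
    # The per-side ceiling tightens once more than 20 images share a request
    max_side = 2000 if len(sizes) > 20 else 8000
    for index, (width, height) in enumerate(sizes):
        if width > max_side or height > max_side:
            problems.append(
                f"image {index}: {width}x{height} exceeds {max_side}x{max_side}"
            )
    return problems


if __name__ == "__main__":
    print(check_image_limits([(1200, 800)] * 5))   # five modest images
    print(check_image_limits([(9000, 100)]))       # one oversized image
```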

Image: OpenCLIP project README, used for editorial coverage of the embedding model ChromaDB downloads on first use.

Where to take it next

A few natural extensions once the base loop works:

Swap the embedder. ChromaDB’s multimodal interface accepts any embedding function that produces text and image vectors in the same space; OpenCLIP is the built-in option used here, but custom CLIP variants or domain-tuned alternatives plug in via the same embedding_function argument².
Persist user uploads. The current answer(image_query=...) reads a one-off file from disk. A small FastAPI wrapper (multipart form upload, save to a temp path, invoke answer) turns it into an HTTP endpoint without changing the retrieval logic.
Add a re-ranker. Cross-encoder re-rankers can re-sort the top-20 retrieval hits before the top-5 reach Claude; this is useful when the catalogue grows and CLIP’s nearest-neighbour ranking starts to lose precision at the head of the list.
Wire tool use. Instead of retrieving on every turn, expose search() to Claude as a tool the model can call when it needs more context. This publication’s prior single-modality RAG tutorial walks through that pattern with Pinecone.
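
As a starting point for the tool-use extension, the sketch below pairs a tool definition in the Anthropic Messages API shape (name, description, input_schema) with a dispatcher that runs a search function and trims each hit to JSON-friendly fields. The tool name, the schema wording, and the dispatcher are illustrative assumptions; how the definition is bound to the model depends on your LangChain version.

```python
SEARCH_TOOL = {
    "name": "search_catalogue",  # hypothetical tool name
    "description": "Search the product catalogue with a text query.",
    "input_schema": {
        "type": "object",
        "properties": {
            "text": {"type": "string", "description": "Natural-language query"},
            "k": {"type": "integer", "description": "Number of hits to return"},
        },
        "required": ["text"],
    },
}


def run_tool_call(name, tool_input, search_fn):
    """Execute a tool call by delegating to a search(text=..., k=...) callable."""
    if name != SEARCH_TOOL["name"]:
        raise ValueError(f"unknown tool: {name}")
    hits = search_fn(text=tool_input["text"], k=tool_input.get("k", 5))
    # Return only small, serialisable fields; image bytes are attached separately
    return [
        {
            "source": hit["metadata"]["source"],
            "modality": hit["metadata"]["modality"],
            "distance": hit["distance"],
            "document": hit.get("document"),
        }
        for hit in hits
    ]


if __name__ == "__main__":
    def fake_search(text, k):
        # Stand-in for query.search so the sketch runs without an index
        return [{"distance": 0.2, "document": "Tenkeyless keyboard",
                 "metadata": {"source": "keyboard-001.jpg", "modality": "text"}}][:k]

    print(run_tool_call("search_catalogue", {"text": "keyboard"}, fake_search))
```

In the real app, search_fn would be the search() function from query.py.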

Recap

You now have a working multimodal RAG loop: ChromaDB + OpenCLIP for the index, a query function that takes either text or an image and returns the top-5 matches with filename-level source attribution, and Claude Sonnet for the answer step with both text and image content blocks reaching the model in one request. Everything except the final Claude call runs locally; the index lives on disk so iteration cycles stay tight.

How this article was made: an autonomous AI pipeline researched, drafted, fact-checked, and reviewed this piece, aggregating publicly-available information from the sources consulted below. AI (artificial intelligence) can make mistakes, so please cross-check the consulted sources before acting on anything here. Neural Tech Daily is not liable for decisions or outcomes based on this article.

Sources consulted

Cited Sources

1. LangChain — ChatAnthropic integration (pip install -U langchain-anthropic; instantiation pattern with claude-sonnet / claude-haiku model names) (accessed 2026-05-20)
2. ChromaDB — Multimodal embeddings (OpenCLIP embedding function; add ids+uris and ids+documents; query_texts / query_uris / query_images) (accessed 2026-05-20)
3. Anthropic — Vision (Claude image inputs, model support, per-request image counts and dimension limits) (accessed 2026-05-20)
4. ChromaDB — Getting started (pip install chromadb; client + collection creation pattern) (accessed 2026-05-20)
5. OpenCLIP project README (mlfoundations/open_clip — ViT-B/32 default checkpoint; PyTorch-based reference implementation) (accessed 2026-05-20)
6. LangChain — Messages reference (multimodal HumanMessage content blocks; image url vs base64 patterns; provider-specific size limit caveat) (accessed 2026-05-20)
7. Anthropic — Models overview (current Sonnet model identifier on the Anthropic API catalogue) (accessed 2026-05-20)