Build Your First RAG App with LangChain and ChromaDB in 60 Minutes: A Tutorial

RAG in 60 minutes on a laptop: LangChain + ChromaDB, in-process vector store, free local embeddings, Claude or GPT for answers. No Docker, no Pinecone.

4 May 2026 | Updated 19 May 2026 | ~12 min read

Image: LangChain documentation social card for the official RAG tutorial page, used for editorial coverage.

What you’ll need

This tutorial walks the reader from an empty Python folder to a working Retrieval-Augmented Generation pipeline in roughly 60 minutes¹. The finished app ingests a small document corpus, embeds it, stores the vectors in ChromaDB on the local disk, retrieves the top-3 chunks per query², and passes them to a chat model (Claude Sonnet, GPT-5, or a local Ollama model) for grounded answers.

Why this specific stack: ChromaDB runs in-process, per its own getting-started guide². There is no Docker container, no separate vector-database server, no managed cloud account beyond the LLM API key. The whole pipeline lives in a single Python file on the developer’s laptop until the project is ready for hosting. A free local embedder plus a paid LLM call only at query time keeps fixed costs at zero while the corpus and traffic are still small.

The 60-minute clock assumes the reader already has a working Python install and a code editor. It does not assume prior LangChain or vector-database experience.

Prerequisites

Python 3.10 or newer (python --version).
pip working from a virtual environment.
One LLM access path: an Anthropic API key, an OpenAI API key, or Ollama installed locally for a free path.
2 GB of free disk for the embedding model and its Python dependencies (torch, transformers). The MiniLM weights themselves are ~90 MB, but the first install pulls in ~1.8 GB of ML libraries.
A text editor and a terminal.
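
The first and fourth prerequisites can be checked from Python itself before installing anything. The sketch below is a hypothetical preflight script (the names MIN_PYTHON, MIN_FREE_GB, python_ok, and free_gb are illustrative, not part of any library) that uses only the standard library:

```python
import shutil
import sys

MIN_PYTHON = (3, 10)   # from the prerequisites list above
MIN_FREE_GB = 2.0      # from the prerequisites list above


def python_ok(version_info=sys.version_info):
    """Return True if the interpreter is at least MIN_PYTHON."""
    return tuple(version_info[:2]) >= MIN_PYTHON


def free_gb(path="."):
    """Return free disk space, in GB, on the volume that holds `path`."""
    return shutil.disk_usage(path).free / 1e9


if __name__ == "__main__":
    print("Python OK:", python_ok())
    print(f"Free disk: {free_gb():.1f} GB (want >= {MIN_FREE_GB})")
```

Run it from the project folder so the disk check looks at the volume the virtual environment will live on.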

Step 1: Install LangChain, Chroma, and the embedder

Create a virtual environment first. The packages below pull in heavy ML dependencies; do not install them globally.

python -m venv .venv
source .venv/bin/activate    # macOS / Linux
# .venv\Scripts\activate     # Windows PowerShell

pip install --upgrade pip
pip install langchain langchain-community langchain-chroma \
    langchain-anthropic langchain-openai langchain-ollama \
    langchain-huggingface langchain-text-splitters \
    sentence-transformers chromadb

langchain is the core orchestration library; langchain-community carries the document loaders; langchain-chroma is the vector-store integration; the model-provider packages (langchain-anthropic, langchain-openai, langchain-ollama) give access to the three chat-model paths; langchain-huggingface provides the local-embedder bindings; langchain-text-splitters carries the chunking utilities; sentence-transformers is the free local embedder; chromadb is the vector store itself.

Pin versions in production. For this tutorial, accept whatever pip resolves on the day of install. LangChain releases breaking changes often, so re-read the official RAG tutorial³ if a snippet here drifts from the current API.
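
One way to capture "whatever pip resolves on the day" so it can be pinned later is to read the installed versions back out. This is a minimal sketch using the standard library's importlib.metadata; the helper names installed_versions and as_requirements are hypothetical, and the package list simply mirrors part of the install command above:

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names taken from the pip install command in this step.
PACKAGES = [
    "langchain",
    "langchain-community",
    "langchain-chroma",
    "langchain-huggingface",
    "langchain-text-splitters",
    "sentence-transformers",
    "chromadb",
]


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


def as_requirements(versions):
    """Turn the mapping into 'name==version' lines, skipping missing packages."""
    return [f"{name}=={ver}" for name, ver in versions.items() if ver is not None]


if __name__ == "__main__":
    print("\n".join(as_requirements(installed_versions(PACKAGES))))
```

Redirecting the output into a requirements file gives a starting point for pinning; pip freeze does the same job for the whole environment.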

Image: LangChain Python documentation introduction (docs.langchain.com/oss/python/langchain/overview), used for editorial coverage of the orchestration library this tutorial uses.

Step 2: Configure the chat model

Pick one of the three paths below. The rest of the tutorial calls a single llm object, so the integration only needs to happen once.

Path A: Anthropic Claude (paid, recommended for quality):

import os
from langchain_anthropic import ChatAnthropic

os.environ["ANTHROPIC_API_KEY"] = "sk-ant-..." # do not commit this
llm = ChatAnthropic(model="claude-sonnet-4-6", temperature=0)

Path B: OpenAI (paid, common default):

import os
from langchain_openai import ChatOpenAI

os.environ["OPENAI_API_KEY"] = "sk-..."
llm = ChatOpenAI(model="gpt-5", temperature=0)

OpenAI ships new flagship models often. Check the OpenAI models page⁷ when starting the project, and substitute the current flagship model name in the model= argument. The gpt-5 name in this snippet is illustrative; the pipeline shape is the same whichever current model the API accepts.

Path C: Ollama (free, fully local):

ollama pull llama3.1:8b

from langchain_ollama import ChatOllama
llm = ChatOllama(model="llama3.1:8b", temperature=0)

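temperature=0 keeps output as repeatable as the provider allows, which suits a tutorial; it does not guarantee identical answers across runs. Bump to 0.2–0.4 once the pipeline is stable and the answers feel too rigid. Some reasoning-oriented models accept only their default temperature, so if the API rejects the argument, drop it.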
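
The snippets above assign the API key inside the script for brevity. Both ChatAnthropic and ChatOpenAI read their key from the environment, so a common alternative is to export the variable in the shell and have the script fail early if it is missing. A minimal sketch, with require_env as a hypothetical helper name:

```python
import os


def require_env(name):
    """Return the value of environment variable `name`, or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; export it in the shell before running")
    return value


if __name__ == "__main__":
    # Hypothetical usage: check whichever key the chosen path needs.
    for key in ("ANTHROPIC_API_KEY", "OPENAI_API_KEY"):
        print(key, "is set" if os.environ.get(key) else "is NOT set")
```

Calling require_env("ANTHROPIC_API_KEY") before constructing the model keeps the key out of the source file and out of version control.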

Step 3: Configure the embedder

The embedder is the model that turns text into vectors. The free path uses all-MiniLM-L6-v2 from sentence-transformers, a small model that produces 384-dimensional vectors⁴ and runs comfortably on a laptop CPU.

from langchain_huggingface import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)

The first call downloads ~90 MB of model weights to ~/.cache/huggingface/. Subsequent runs load the weights from that cache and skip the download.

The paid alternative is OpenAI’s text-embedding-3-small, which produces 1536-dimensional vectors per the OpenAI embeddings guide⁸. Pricing and benchmark positioning shift over time; check OpenAI’s current pricing page before deciding. Swap it in if MiniLM’s recall proves insufficient on the actual corpus, and re-ingest the corpus afterwards, because vectors from the two models are not comparable:

from langchain_openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

Use the same embeddings object for both ingestion and querying. If the ingest-time and query-time embedders differ, retrieval will return nonsense.
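
One hedged way to catch a mismatch early is to record the embedder name in a small marker file the first time the collection is built, and compare it on every later run. This is an illustrative sketch, not something LangChain or Chroma does on its own; the helper name check_embedder and the marker file are assumptions:

```python
import json
import tempfile
from pathlib import Path


def check_embedder(marker_path, model_name):
    """Record the embedder name on first use; raise if a later run uses another."""
    marker = Path(marker_path)
    if marker.exists():
        recorded = json.loads(marker.read_text(encoding="utf-8"))["model_name"]
        if recorded != model_name:
            raise ValueError(
                f"collection was built with {recorded!r}, not {model_name!r}; re-ingest"
            )
        return recorded
    # First run: create the marker next to (not inside) the vector store folder.
    marker.parent.mkdir(parents=True, exist_ok=True)
    marker.write_text(json.dumps({"model_name": model_name}), encoding="utf-8")
    return model_name


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        marker = Path(tmp) / "embedder.json"
        check_embedder(marker, "sentence-transformers/all-MiniLM-L6-v2")
        check_embedder(marker, "sentence-transformers/all-MiniLM-L6-v2")
        print("embedder marker OK")
```

In the tutorial's layout the marker could sit beside the vector store folder, for example as a hypothetical ./chroma_db.embedder.json.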

Step 4: Build the document corpus

Real RAG systems ingest hundreds of PDFs. For the tutorial, ten short text snippets are enough to demonstrate the pipeline end-to-end⁵. Save the snippets as corpus.py in the project folder:

DOCS = [
    "ChromaDB is an open-source vector database that runs in-process via its Python client.",
    "LangChain is an orchestration library for building LLM applications with retrieval, tools, and agents.",
    "Retrieval-Augmented Generation grounds an LLM's answer in a specific document corpus.",
    "Sentence-transformers produces dense vector embeddings on local CPU without an API key.",
    "Chroma's PersistentClient writes the vector index to a folder on disk for reuse across runs.",
    "Top-k retrieval returns the k most similar chunks ranked by cosine similarity to the query embedding.",
    "Chunk size of 500 to 1000 characters with 50 to 100 character overlap is a reasonable starting default.",
    "Metadata filters narrow retrieval before the similarity search runs.",
    "The Anthropic Claude API and OpenAI Chat Completions API both accept a system prompt and a user prompt.",
    "Ollama runs open-weight models like Llama 3.1 locally without sending data to a remote server.",
]

In production, replace this with a document loader (PyPDFLoader, WebBaseLoader, or DirectoryLoader from langchain_community.document_loaders) and pipe its output through a text splitter before ingestion. The pipeline shape stays identical; the chunking step is what step 5 below adds.

Image: Chroma documentation getting-started page (docs.trychroma.com), used for editorial coverage of the vector store this tutorial uses.

Step 5: Chunk and ingest the corpus into ChromaDB

LangChain’s Chroma.from_documents integration handles two things in one call: embed each input document, and write the vectors plus metadata to disk. It does not chunk. Chunking is a separate step that has to happen first whenever the input documents are longer than the embedder’s context window (the all-MiniLM-L6-v2 model card⁴ lists a 256-word-piece input limit; longer inputs get truncated).

The 10-snippet DOCS list above happens to be one sentence per entry, so chunking is a no-op for the demo. The moment the reader replaces DOCS with a real loader (a PDF, a web page, a folder of docs), chunking becomes mandatory; a single 5,000-character page passed straight into the embedder produces a single low-quality vector and breaks retrieval silently.
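
To make chunk size and overlap concrete, here is a deliberately naive fixed-window chunker in plain Python. It is not how RecursiveCharacterTextSplitter works internally (that splitter prefers paragraph and sentence boundaries); the function name chunk_text is hypothetical and the sketch only shows what the two numbers mean:

```python
def chunk_text(text, chunk_size=500, chunk_overlap=50):
    """Split `text` into windows of `chunk_size` characters that share
    `chunk_overlap` characters with the previous window."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
        start += step
    return chunks


if __name__ == "__main__":
    for piece in chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1):
        print(piece)
```

With chunk_size=4 and chunk_overlap=1, each window starts three characters after the previous one, so neighbouring chunks share one character. The overlap is what keeps a sentence that straddles a boundary retrievable from at least one chunk.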

RecursiveCharacterTextSplitter from langchain_text_splitters is the standard splitter for prose. Apply it before Chroma.from_documents:

from langchain_chroma import Chroma
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter
from corpus import DOCS

raw_documents = [
    Document(page_content=text, metadata={"source": f"doc-{i}"})
    for i, text in enumerate(DOCS)
]

splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
documents = splitter.split_documents(raw_documents)

vectorstore = Chroma.from_documents(
    documents=documents,
    embedding=embeddings,
    persist_directory="./chroma_db",
    collection_name="tutorial",
)

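persist_directory is what makes the store durable. Chroma writes its SQLite-backed index to ./chroma_db/, so later runs can reload the same collection without re-embedding. Calling Chroma.from_documents a second time against the same folder and collection adds the documents again rather than replacing them, so ingest once and reload afterwards. To reuse the existing collection later: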

vectorstore = Chroma(
    embedding_function=embeddings,
    persist_directory="./chroma_db",
    collection_name="tutorial",
)
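
Re-running ingestion against an existing collection can leave duplicate chunks behind, because IDs are generated fresh unless the caller supplies them. A common workaround is to derive each ID from the chunk itself so that the same chunk always maps to the same ID. The sketch below is an assumption-laden illustration: stable_ids is a hypothetical helper, and whether passing its output as the ids argument of Chroma.from_documents overwrites existing entries should be verified against the installed langchain-chroma version.

```python
import hashlib
from collections import Counter


def stable_ids(chunks):
    """Return one deterministic ID per (source, text) pair, in order.

    The position of the chunk within its source is part of the hash, so two
    identical chunks from the same source still get distinct IDs.
    """
    seen = Counter()
    ids = []
    for source, text in chunks:
        position = seen[source]
        seen[source] += 1
        raw = f"{source}|{position}|{text}".encode("utf-8")
        ids.append(hashlib.sha256(raw).hexdigest()[:32])
    return ids


if __name__ == "__main__":
    demo = [("doc-0", "ChromaDB runs in-process."), ("doc-1", "LangChain orchestrates.")]
    print(stable_ids(demo))
```

In the tutorial's code the input pairs would come from the split documents, for example (d.metadata["source"], d.page_content) for each d in documents.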

The corpus here is small enough that ingestion takes seconds. As a rough guide, a 10,000-chunk corpus on MiniLM takes on the order of 3–5 minutes on a modest laptop CPU; actual times vary with hardware and chunk length. For much larger workloads, consider OpenAI embeddings via batched API calls.

Step 6: Retrieve and answer

The retrieve-and-answer step has three pieces: build a retriever from the vector store, define the prompt template that instructs the LLM to ground its answer in the retrieved chunks, and chain them together.

from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

retriever = vectorstore.as_retriever(search_kwargs={"k": 3})

prompt = ChatPromptTemplate.from_messages([
    ("system",
     "You are a precise assistant. Answer the user's question using ONLY the context below. "
     "If the answer isn't in the context, say 'I don't have enough information to answer that.' "
     "Do not invent facts.\n\nContext:\n{context}"),
    ("human", "{question}"),
])

def format_docs(docs):
    return "\n\n".join(d.page_content for d in docs)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

k=3 returns the three most similar chunks per query⁶. Higher k gives the LLM more context but also more noise; lower k is faster but risks missing the relevant chunk. Three is a sensible default for a small corpus; the LangChain RAG tutorial³ walks through retriever configuration in more depth, and larger production setups commonly pair a higher k (5 or 10) with a re-ranker stage to filter noise before generation.
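
What "the k most similar chunks" means can be shown without any vector database. The sketch below ranks toy two-dimensional vectors by cosine similarity in plain Python. It is an illustration of the idea only: the function names are hypothetical, real embeddings have hundreds of dimensions, and the distance metric a given Chroma collection actually uses depends on its configuration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=3):
    """Return the IDs of the k vectors most similar to query_vec, best first."""
    ranked = sorted(
        doc_vecs.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [doc_id for doc_id, _ in ranked[:k]]


if __name__ == "__main__":
    toy = {"a": [1, 0], "b": [0, 1], "c": [1, 1], "d": [-1, 0]}
    print(top_k([1, 0], toy, k=2))
```

Raising k in this toy example pulls in "b" and then "d", which point away from the query: the same noise trade-off described above.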

The system prompt’s “Answer using ONLY the context” instruction is what keeps the answer grounded. Retrieval puts the chunks into the prompt; the instruction tells the model to stay inside them. Without it, the model tends to fall back on its parametric knowledge and the grounding weakens.
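
For readers new to the pipe syntax, the data flow of the chain can be written as an ordinary function. The sketch below is a plain-Python approximation, not LangChain code: retrieve and generate are hypothetical stand-ins for the retriever and the chat model, passed in as callables so the flow can be exercised without any API key.

```python
SYSTEM_TEMPLATE = (
    "You are a precise assistant. Answer the user's question using ONLY the context below. "
    "If the answer isn't in the context, say 'I don't have enough information to answer that.' "
    "Do not invent facts.\n\nContext:\n{context}"
)


def format_context(chunks):
    """Join retrieved chunk texts with blank lines, as format_docs does."""
    return "\n\n".join(chunks)


def answer(question, retrieve, generate):
    """Retrieve chunks, fill the system prompt, and ask the model.

    retrieve(question) -> list of chunk strings      (stand-in for the retriever)
    generate(system, question) -> answer string      (stand-in for the chat model)
    """
    chunks = retrieve(question)
    system = SYSTEM_TEMPLATE.format(context=format_context(chunks))
    return generate(system, question)


if __name__ == "__main__":
    fake_retrieve = lambda q: ["ChromaDB is an open-source vector database."]
    fake_generate = lambda system, q: f"[model saw {len(system)} prompt chars] {q}"
    print(answer("What does ChromaDB do?", fake_retrieve, fake_generate))
```

The question travels down two paths, once into retrieval and once unchanged into the prompt, which is the role RunnablePassthrough plays in the chain.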

Image: LangChain official RAG tutorial (docs.langchain.com/oss/python/langchain/rag), used for editorial coverage of the retrieve-and-answer pattern this step implements.

Step 7: Run end-to-end and inspect

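Wire it all up in app.py. Paste the snippets from Step 2 (one path only), Step 3, Step 5, and Step 6 into the file in that order, then add this block at the bottom: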

if __name__ == "__main__":
    queries = [
        "What does ChromaDB do?",
        "How does top-k retrieval rank results?",
        "What is the capital of France?",
    ]
    for q in queries:
        print(f"\nQ: {q}")
        print(f"A: {rag_chain.invoke(q)}")

Run it: python app.py. The first two queries return grounded answers built from the corpus. The third query, capital of France, is not in the corpus, so the model should respond with the “I don’t have enough information” fallback. If it answers “Paris” instead, the grounding instruction is being ignored and the system prompt needs tightening.
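
The out-of-corpus check can be automated so it runs every time the prompt changes. The sketch below is a minimal, hypothetical smoke test: is_fallback looks for the fallback phrase from the system prompt, and grounding_failures takes any callable that maps a question to an answer string, which in the tutorial would be rag_chain.invoke.

```python
FALLBACK = "I don't have enough information"


def normalize(text):
    """Lowercase and fold typographic apostrophes so phrasing variants match."""
    return text.replace("\u2019", "'").lower()


def is_fallback(answer):
    """Return True if the answer contains the fallback phrase."""
    return normalize(FALLBACK) in normalize(answer)


def grounding_failures(ask, out_of_corpus_questions):
    """Return the questions that got a real answer when a fallback was expected."""
    return [q for q in out_of_corpus_questions if not is_fallback(ask(q))]


if __name__ == "__main__":
    # Stand-in for rag_chain.invoke that always ignores the grounding rule.
    leaky = lambda q: "Paris"
    print(grounding_failures(leaky, ["What is the capital of France?"]))
```

An empty list means every out-of-corpus question produced the fallback. A substring match is crude; a model that paraphrases the fallback will be flagged, so treat failures as prompts to look at the answer rather than as verdicts.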

To inspect what the retriever actually returns before it hits the LLM:

docs = retriever.invoke("How does top-k retrieval rank results?")
for d in docs:
    print(d.metadata, d.page_content[:80])

Run that snippet whenever an answer feels wrong. Bad answers usually come from bad retrieval, not bad generation.

Common pitfalls

Embedding model drift. Ingest with one embedder, query with another, and retrieval breaks silently. Pin the embedder name in code, not in a config file the team forgets to sync.

Chunk size mismatched to corpus. Default chunking is fine for prose. Source code, tables, and JSON need custom splitters from langchain_text_splitters. A chunk that splits a function in half retrieves badly.

Persistent vs in-memory confusion. Skipping persist_directory gives an in-memory Chroma client that vanishes when the Python process exits. The next run re-embeds from scratch. Always pass persist_directory once the corpus is real.

Metadata not used for filtering. Tagging documents with source, category, or date and then never filtering on them wastes the metadata layer. Pass the filter inside search_kwargs, for example as_retriever(search_kwargs={"k": 3, "filter": {"category": "billing"}}), to scope retrieval to a slice of the corpus.
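
Conceptually, an equality filter keeps only the chunks whose metadata matches every key in the filter, and similarity ranking then runs over what is left. The sketch below shows that idea in plain Python; it is not Chroma's implementation, and the names matches and filter_chunks plus the "category" values are hypothetical:

```python
def matches(metadata, where):
    """Return True if every key in `where` has the same value in `metadata`."""
    return all(metadata.get(key) == value for key, value in where.items())


def filter_chunks(chunks, where):
    """Keep only chunks (dicts with a 'metadata' key) that satisfy the filter."""
    return [chunk for chunk in chunks if matches(chunk["metadata"], where)]


if __name__ == "__main__":
    chunks = [
        {"text": "Refunds take 5 days.", "metadata": {"category": "billing"}},
        {"text": "Reset your password.", "metadata": {"category": "account"}},
    ]
    print(filter_chunks(chunks, {"category": "billing"}))
```

The tutorial corpus only carries a source key, so a category filter would need that key added to each Document's metadata at ingestion time.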

API rate limits at ingestion time. OpenAI embeddings batch requests by default; very large corpora still hit per-minute caps. Sentence-transformers has no rate limit but is CPU-bound. Plan ingestion windows accordingly.
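
When ingestion has to stay under a per-minute cap, one simple approach is to feed the corpus through in fixed-size batches and pause between them. The generator below is a minimal sketch of the batching half, written by hand because the tutorial targets Python 3.10; the name batches is hypothetical:

```python
def batches(items, size):
    """Yield successive lists of at most `size` items from any iterable."""
    if size <= 0:
        raise ValueError("size must be positive")
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly shorter, batch


if __name__ == "__main__":
    for group in batches(range(5), 2):
        print(group)
```

A loop over batches(documents, 500) that adds each group to the vector store and then sleeps for a few seconds is usually enough for a one-off ingest; the batch size and pause are assumptions to tune against the provider's published limits.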

Image: LangChain GitHub repository (github.com/langchain-ai/langchain), used for editorial coverage of the open-source library this tutorial draws from.

Where to go next

The pipeline above is a working baseline. Production-shaped extensions worth knowing about:

Pinecone or Weaviate for managed vector hosting once the corpus exceeds what fits comfortably on a laptop disk. The LangChain integration shape is similar: swap the Chroma vector-store class for the provider’s, supply its credentials and index settings, and re-ingest.
LangSmith for tracing and evaluation. Every chain call gets logged with the retrieved chunks, the prompt sent, and the answer returned. Debugging RAG without traces is painful.
LangGraph for agent loops where the model decides whether to retrieve, retrieve again, or stop. Multi-hop questions (“What did the CEO say about pricing in the latest call?”) often need agent control flow, not a single retrieve-and-answer pass.
Hybrid search, combining dense vector similarity with BM25 keyword search via reciprocal rank fusion. Pure vector search can miss exact-match queries (model numbers, error codes, brand names); hybrid search helps with those.
Re-rankers like Cohere Rerank or BGE-Reranker that re-score the top-20 retrieved chunks before sending the top-3 to the LLM. This improves answer quality on noisy corpora at the cost of one extra model call per query.
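
Reciprocal rank fusion, mentioned in the hybrid-search item above, is small enough to sketch in full. Each document scores 1 / (k + rank) in every ranking it appears in, and the scores are summed. The constant k=60 is the value commonly used in the literature; the function name and the toy rankings are illustrative:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several best-first ID lists into one list, best first.

    rankings: iterable of lists of document IDs, each ordered best to worst.
    k: smoothing constant; larger values flatten the advantage of top ranks.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    dense = ["a", "b", "c"]   # e.g. from vector similarity
    keyword = ["c", "a", "d"]  # e.g. from BM25
    print(reciprocal_rank_fusion([dense, keyword]))
```

Documents that appear in both lists ("a" and "c" here) rise above those found by only one retriever, which is how hybrid search recovers exact-match hits that dense retrieval alone ranks poorly.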

The 60-minute baseline is enough to take to a teammate, a hackathon, or an internal demo. Production deployment is a different conversation, scoped to the corpus, the latency budget, and the eval harness, none of which this tutorial sets up.

How this article was made: an autonomous AI pipeline researched, drafted, fact-checked, and reviewed this piece, aggregating publicly-available information from the sources consulted below. AI (artificial intelligence) can make mistakes, so please cross-check the consulted sources before acting on anything here. Neural Tech Daily is not liable for decisions or outcomes based on this article.

Sources consulted

Cited Sources

1. LangChain official RAG tutorial. Substantiates the pipeline shape (install, embed, ingest, retrieve, generate) and the order of steps; the 60-minute time-to-completion is the publication's own estimate based on that scope plus typical-laptop install time, not a figure from the LangChain docs. (accessed 2026-05-04)
2. Chroma documentation, getting-started page. Substantiates the in-process Python client and the retrieval API's `n_results` parameter (mapped to LangChain's `k`). (accessed 2026-05-04)
3. LangChain RAG tutorial. Current API surface for the `Chroma.from_documents` ingestion path and the `RunnablePassthrough` chain pattern. (accessed 2026-05-04)
4. Hugging Face model card for `sentence-transformers/all-MiniLM-L6-v2`. 384-dimensional output, ~90 MB model size. (accessed 2026-05-04)
5. LangChain RAG tutorial. Small-corpus pedagogical pattern; 10 documents is sufficient to exercise the full ingest-retrieve-generate path. (accessed 2026-05-04)
6. Chroma documentation. `n_results` defaults and similarity-ranking semantics. (accessed 2026-05-04)
7. OpenAI Platform models page. Canonical reference for the current default flagship chat model name to substitute into the `ChatOpenAI(model=...)` argument. (accessed 2026-05-10)
8. OpenAI Platform embeddings guide. `text-embedding-3-small` model card, including the 1536-dimensional output vector size. (accessed 2026-05-04)