Build a Customer-Support RAG Agent with Claude + Pinecone: An End-to-End Python Project

End-to-end Python tutorial: Pinecone serverless index, 10-doc PDF knowledge base, Claude Sonnet tool-use, FastAPI endpoint, evaluated on 5 support queries.

20 May 2026 | Updated 20 May 2026 | ~12 min read


Image: Pinecone documentation home, used for editorial coverage of the vector database taught in this tutorial.

What you’ll build

By the end of this tutorial you will have a working customer-support FAQ bot: a Pinecone serverless index holding embeddings of a small product knowledge base, a Claude Sonnet agent that retrieves relevant chunks via structured tool use, a FastAPI endpoint that exposes the bot over HTTP, and a tiny evaluation harness that scores it against five reference queries. The whole thing runs locally on a laptop in under an hour and costs well under one dollar end-to-end at current Pinecone Starter and Anthropic API rates¹².

The architecture is the standard retrieval-augmented generation (RAG) loop: chunk your documents, embed each chunk, store the embeddings in a vector index, and at query time retrieve the top-k nearest chunks and pass them to the model as context. The twist here is that Claude does the retrieval itself via a tool call rather than the application code stuffing context into the prompt unconditionally. That pattern lets the model decide when to search, what to search for, and whether one search is enough.
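To make the retrieve-then-answer loop concrete, here is a minimal in-memory sketch of the "retrieve the top-k nearest chunks" step. It stands in for Pinecone with a plain list, and the three-dimensional vectors and passage labels are made up for illustration; real embeddings from text-embedding-3-small have 1536 dimensions.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], store: list[tuple[str, list[float]]], k: int = 2) -> list[str]:
    """Return the k stored texts whose vectors are most similar to the query."""
    scored = [(cosine(query_vec, vec), text) for text, vec in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Hypothetical toy "index": (chunk text, embedding) pairs.
STORE = [
    ("refund passage", [0.9, 0.1, 0.0]),
    ("shipping passage", [0.1, 0.9, 0.0]),
    ("two-factor passage", [0.0, 0.1, 0.9]),
]

# A query vector that points mostly along the "shipping" direction.
context = top_k([0.2, 0.8, 0.1], STORE, k=1)
print(context)  # ['shipping passage']
```

In the real project the list is replaced by the Pinecone index and the call to top_k is triggered by Claude's tool call rather than by application code.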

What you’ll need

Python 3.10 or later, and working familiarity with virtual environments and pip.
A Pinecone account on the Starter (free) plan. Starter includes one project, indexes capped at 2 GB of storage and 2 million read units / 1 million write units per month — plenty for a 10-document knowledge base¹.
An Anthropic API key with the claude-sonnet-4-5 model enabled. Sonnet 4.5 is the current production-tier model on Anthropic’s API catalogue as of May 2026³.
An OpenAI API key for embeddings (we use text-embedding-3-small at $0.02 per 1M tokens⁴). You can substitute another embedding model as long as the index dimension matches its output size (1536 for text-embedding-3-small).

Budget about 60 minutes start to finish: 10 to set up keys and install, 15 to build the index, 20 to wire the agent and FastAPI endpoint, 15 to run the eval and read the results.

Step 1. Set up the project

Create a fresh folder and install dependencies.

mkdir support-rag && cd support-rag
python3 -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
pip install --upgrade pip
pip install "pinecone>=5.0" "anthropic>=0.40" "openai>=1.40" \
    "fastapi>=0.110" "uvicorn>=0.30" "pypdf>=4.0" "httpx>=0.27"

The pinecone package is the official v5+ client; older code that installs pinecone-client and calls pinecone.init() targets the deprecated v2 API, which has changed substantially. The anthropic package is the official Python SDK⁵.

Export your keys:

export PINECONE_API_KEY="pcsk_..."
export ANTHROPIC_API_KEY="sk-ant-..."
export OPENAI_API_KEY="sk-..."

Step 2. Create the Pinecone serverless index

Pinecone serverless is the default index type on the Starter plan and the recommended option for new projects per the create-an-index guide⁶. Serverless indexes auto-scale and bill on usage rather than on provisioned pod-hours.

Create index_setup.py:

# index_setup.py
import os
from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])

INDEX_NAME = "support-kb"

if not pc.has_index(INDEX_NAME):
    pc.create_index(
        name=INDEX_NAME,
        dimension=1536,
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1"),
    )

index = pc.Index(INDEX_NAME)
print(pc.describe_index(INDEX_NAME))

Run it once:

python index_setup.py

You should see a status block confirming the index is ready. The dimension=1536 matches text-embedding-3-small; switch to 3072 if you swap in text-embedding-3-large. Cosine similarity is the standard choice for text embeddings: it compares direction only and ignores vector length, so it behaves sensibly even when vectors are not normalised. Euclidean and dot-product metrics are also available.
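A small worked example may help show why the metric choice matters when vectors are not normalised. The two-dimensional vectors below are invented for illustration; the point is only that dot product rewards length while cosine does not.

```python
import math


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine(a: list[float], b: list[float]) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


query = [1.0, 0.0]
aligned_short = [0.5, 0.0]  # same direction as the query, small magnitude
off_axis_long = [3.0, 3.0]  # 45 degrees away from the query, large magnitude

# Dot product prefers the long vector even though it points elsewhere.
print(dot(query, aligned_short), dot(query, off_axis_long))

# Cosine prefers the vector that points the same way as the query.
print(cosine(query, aligned_short), cosine(query, off_axis_long))
```

For embeddings that are already unit length, cosine and dot product produce the same ranking, so the difference only shows up with unnormalised vectors.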


Image: Pinecone — Create a serverless index, used for editorial coverage of the index-creation flow.

Step 3. Chunk and embed the knowledge base

The sample knowledge base is ten short product-support PDFs: refund policy, shipping windows, warranty terms, account recovery, two-factor reset, payment failures, return-window edge cases, subscription pause, B2B invoicing, and contact escalation. Drop them in kb/ as 01.pdf through 10.pdf (or your equivalent files).

The chunking decision matters more than newcomers expect. Pinecone's own guides and the broader RAG literature commonly suggest chunks of around 512 tokens with ~50 tokens of overlap as a sensible default for general FAQ-style content: long enough to carry one coherent thought, short enough that retrieval surfaces precise context. To avoid a tokeniser dependency, this tutorial chunks by characters instead: 500 characters with 50 characters of overlap. At a rough four characters per English token that is only about 125 tokens per chunk, so these chunks are considerably smaller than that default; increase CHUNK_SIZE if retrieval returns fragments that lack context.
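To see what the sliding window does, the sketch below computes only the (start, end) character offsets for a document of a given length. The function name chunk_spans is hypothetical and is not part of ingest.py; the four-characters-per-token figure is a rule of thumb for English text, not a measured value.

```python
def chunk_spans(n_chars: int, size: int = 500, overlap: int = 50) -> list[tuple[int, int]]:
    """Return (start, end) offsets of overlapping windows over n_chars characters."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    spans = []
    start = 0
    step = size - overlap  # each window starts `step` characters after the last
    while start < n_chars:
        end = min(start + size, n_chars)
        spans.append((start, end))
        if end == n_chars:
            break  # the final window reached the end of the text
        start += step
    return spans


spans = chunk_spans(1200)
print(spans)  # [(0, 500), (450, 950), (900, 1200)]

# Rough token estimate per full chunk, assuming ~4 characters per token.
approx_tokens_per_chunk = 500 // 4
print(approx_tokens_per_chunk)  # 125
```

Each window repeats the last 50 characters of the previous one, which keeps a sentence that straddles a boundary retrievable from at least one chunk.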

Create ingest.py:

# ingest.py
import os
from pathlib import Path
from pypdf import PdfReader
from openai import OpenAI
from pinecone import Pinecone

CHUNK_SIZE = 500
CHUNK_OVERLAP = 50
EMBED_MODEL = "text-embedding-3-small"
INDEX_NAME = "support-kb"

oa = OpenAI()  # reads OPENAI_API_KEY from the environment
pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
index = pc.Index(INDEX_NAME)


def chunk_text(text: str) -> list[str]:
    """Split text into CHUNK_SIZE-character windows overlapping by CHUNK_OVERLAP."""
    chunks = []
    start = 0
    while start < len(text):
        end = start + CHUNK_SIZE
        chunk = text[start:end].strip()
        if chunk:  # skip whitespace-only windows
            chunks.append(chunk)
        if end >= len(text):
            break  # the last window already reached the end of the text
        start += CHUNK_SIZE - CHUNK_OVERLAP
    return chunks


def embed(texts: list[str]) -> list[list[float]]:
    """Embed a batch of strings with EMBED_MODEL."""
    resp = oa.embeddings.create(model=EMBED_MODEL, input=texts)
    return [d.embedding for d in resp.data]


def ingest_pdf(path: Path) -> int:
    reader = PdfReader(str(path))
    full_text = "\n".join(p.extract_text() or "" for p in reader.pages)
    chunks = chunk_text(full_text)
    if not chunks:
        # Scanned or image-only PDFs yield no text; nothing to embed.
        return 0
    vectors = embed(chunks)
    records = [
        {
            # Deterministic IDs: re-running the script overwrites chunks
            # instead of adding duplicates under fresh random IDs.
            "id": f"{path.stem}-{i}",
            "values": vec,
            "metadata": {
                "source": path.name,
                "chunk_index": i,
                "text": chunk,
            },
        }
        for i, (chunk, vec) in enumerate(zip(chunks, vectors))
    ]
    index.upsert(vectors=records)
    return len(records)


if __name__ == "__main__":
    total = 0
    for pdf in sorted(Path("kb").glob("*.pdf")):
        n = ingest_pdf(pdf)
        print(f"{pdf.name}: {n} chunks")
        total += n
    print(f"total upserted: {total}")

Run it:

python ingest.py

For a 10-PDF set with roughly 2-3 pages each, you should see 60-90 chunks upserted; the exact number depends on PDF length. A single upsert request accepts up to 1,000 vectors per Pinecone’s data-plane guide⁷, which is more than this knowledge base needs; for larger knowledge bases, split the records into batches of at most 1,000 before upserting.
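For larger knowledge bases, the manual batching could look like the sketch below. The helper name batched is hypothetical, and the placeholder records carry one-element vectors only so the example runs without an index; the commented-out loop shows where index.upsert from ingest.py would go.

```python
from typing import Iterator, Sequence


def batched(records: Sequence[dict], size: int = 1000) -> Iterator[Sequence[dict]]:
    """Yield consecutive slices of at most `size` records."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(records), size):
        yield records[start:start + size]


# Placeholder records standing in for real chunk embeddings.
records = [{"id": f"doc-{i}", "values": [0.0], "metadata": {}} for i in range(2500)]

sizes = [len(batch) for batch in batched(records)]
print(sizes)  # [1000, 1000, 500]

# In ingest.py the loop would be:
# for batch in batched(records):
#     index.upsert(vectors=batch)
```

Requests are also subject to a payload-size limit, so records with large metadata may need a smaller batch size than 1,000.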


Image: Pinecone — Upsert vectors, used for editorial coverage of the upsert schema.

Step 4. Wire Claude with a retrieval tool

Now the agent. Claude’s tool-use API lets you declare functions the model can call during generation; the SDK returns a tool_use block, your code runs the function, and you reply with a tool_result block to continue the conversation⁸.

Create agent.py:

# agent.py
import os
import json
from openai import OpenAI
from pinecone import Pinecone
from anthropic import Anthropic

MODEL = "claude-sonnet-4-5-20250929"
INDEX_NAME = "support-kb"
TOP_K = 4

oa = OpenAI()
pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
index = pc.Index(INDEX_NAME)
anthropic = Anthropic()

SEARCH_TOOL = {
    "name": "search_knowledge_base",
    "description": (
        "Search the customer-support knowledge base for relevant "
        "passages. Use for any question about refunds, shipping, "
        "warranties, account recovery, payments, returns, "
        "subscriptions, invoicing, or escalation."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {
                "type": "string",
                "description": "Natural-language search query.",
            }
        },
        "required": ["query"],
    },
}

SYSTEM = (
    "You are a customer-support assistant for an online store. "
    "Always call search_knowledge_base before answering a "
    "factual question about policy. Quote the relevant passage "
    "and cite the source filename. If the knowledge base does "
    "not cover the question, say so plainly and suggest the "
    "user contact human support."
)


def run_search(query: str) -> str:
    vec = oa.embeddings.create(
        model="text-embedding-3-small", input=[query]
    ).data[0].embedding
    res = index.query(vector=vec, top_k=TOP_K, include_metadata=True)
    passages = [
        {
            "source": m["metadata"]["source"],
            "text": m["metadata"]["text"],
            "score": m["score"],
        }
        for m in res["matches"]
    ]
    return json.dumps(passages)


def ask(question: str) -> str:
    messages = [{"role": "user", "content": question}]
    while True:
        resp = anthropic.messages.create(
            model=MODEL,
            max_tokens=1024,
            system=SYSTEM,
            tools=[SEARCH_TOOL],
            messages=messages,
        )
        if resp.stop_reason == "tool_use":
            # Echo the assistant turn back, then answer every tool_use block
            # in it: one response can contain more than one tool call, and
            # each call needs a matching tool_result.
            messages.append({"role": "assistant", "content": resp.content})
            messages.append({
                "role": "user",
                "content": [
                    {
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": run_search(block.input["query"]),
                    }
                    for block in resp.content
                    if block.type == "tool_use"
                ],
            })
            continue
        # Any other stop reason ends the loop; return the text blocks.
        return "".join(
            b.text for b in resp.content if b.type == "text"
        )


if __name__ == "__main__":
    print(ask("How long does shipping take to the US?"))

The while True loop is the standard Anthropic tool-use pattern: keep calling messages.create until stop_reason is no longer tool_use, feeding each tool result back as a user message⁸. For this tutorial the agent will usually search once and answer; for harder questions it may search twice with refined queries.
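An unbounded loop can keep calling the API if the model never stops requesting searches. One possible guard, sketched here under assumptions, is a turn cap. The names run_tool_loop and MAX_TURNS are hypothetical, and the model call is injected as a plain callable so the control flow can be exercised with a stub instead of the real Anthropic client.

```python
from types import SimpleNamespace
from typing import Any, Callable

MAX_TURNS = 4  # hypothetical cap on model calls per question


def run_tool_loop(
    create: Callable[[list], Any],
    run_tool: Callable[[dict], str],
    question: str,
    max_turns: int = MAX_TURNS,
) -> str:
    """Call `create` until it stops asking for tools or the cap is reached."""
    messages: list = [{"role": "user", "content": question}]
    for _ in range(max_turns):
        resp = create(messages)
        if resp.stop_reason != "tool_use":
            return "".join(b.text for b in resp.content if b.type == "text")
        messages.append({"role": "assistant", "content": resp.content})
        messages.append({
            "role": "user",
            "content": [
                {"type": "tool_result", "tool_use_id": b.id, "content": run_tool(b.input)}
                for b in resp.content
                if b.type == "tool_use"
            ],
        })
    raise RuntimeError(f"no final answer after {max_turns} model turns")


def fake_create(messages: list) -> Any:
    """Stub model: requests one search, then answers."""
    if len(messages) == 1:
        call = SimpleNamespace(type="tool_use", id="toolu_1", input={"query": "shipping"})
        return SimpleNamespace(stop_reason="tool_use", content=[call])
    text = SimpleNamespace(type="text", text="stub answer")
    return SimpleNamespace(stop_reason="end_turn", content=[text])


answer = run_tool_loop(fake_create, lambda args: "[]", "How long does shipping take?")
print(answer)  # stub answer
```

With the real client, create would be a small wrapper around anthropic.messages.create that passes the model, system prompt, and tools, and run_tool would call run_search with the query from the tool input.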


Image: Anthropic — Tool use with Claude, used for editorial coverage of the tool-use loop.

Step 5. Expose the bot via FastAPI

A REST endpoint takes the agent from a script to something a frontend or another service can hit. Create app.py:

# app.py
from fastapi import FastAPI
from pydantic import BaseModel
from agent import ask

app = FastAPI()


class Question(BaseModel):
    question: str


class Answer(BaseModel):
    answer: str


@app.post("/ask", response_model=Answer)
def ask_endpoint(q: Question) -> Answer:
    return Answer(answer=ask(q.question))


@app.get("/healthz")
def healthz() -> dict[str, str]:
    return {"status": "ok"}

Run the server locally:

uvicorn app:app --reload --port 8000

Then hit it:

curl -X POST http://localhost:8000/ask \
  -H "Content-Type: application/json" \
  -d '{"question": "What is the refund window for opened items?"}'

The response is a JSON object with an answer field containing the model’s reply, including the quoted passage and source citation per the system prompt.
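The same request can be made from Python. The sketch below uses only the standard library so it adds no dependency; the function names are hypothetical, and the base URL assumes the uvicorn command shown earlier. Only the request construction runs at import time; ask_over_http needs the server to be up.

```python
import json
import urllib.request


def build_ask_request(question: str, base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build the POST /ask request with a JSON body."""
    body = json.dumps({"question": question}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/ask",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask_over_http(question: str) -> str:
    """Send the question to the running server and return the answer field."""
    with urllib.request.urlopen(build_ask_request(question), timeout=60) as resp:
        return json.load(resp)["answer"]


req = build_ask_request("What is the refund window for opened items?")
print(req.get_method(), req.full_url)  # POST http://localhost:8000/ask
```

The generous timeout is deliberate: one question can involve two or more model calls plus a vector query before the endpoint responds.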

Step 6. Evaluate on five test queries

Evaluation is the step most first-time RAG projects skip and later regret. A five-query smoke test catches the obvious failure modes (missing retrieval, hallucinated policy, wrong source cited) in about five minutes.

Create eval.py with five queries and expected facts:

# eval.py
from agent import ask

TESTS = [
    {
        "q": "How long does shipping take to the US?",
        "must_include": ["business days"],
        "must_cite": "02.pdf",
    },
    {
        "q": "Can I get a refund 60 days after purchase?",
        "must_include": ["30", "no"],
        "must_cite": "01.pdf",
    },
    {
        "q": "How do I reset two-factor authentication?",
        "must_include": ["recovery code", "support"],
        "must_cite": "05.pdf",
    },
    {
        "q": "Does the warranty cover accidental damage?",
        "must_include": ["accidental"],
        "must_cite": "03.pdf",
    },
    {
        "q": "Can you tell me the CEO's salary?",
        "must_include": ["cannot", "human support"],
        "must_cite": None,
    },
]


def score():
    passed = 0
    for t in TESTS:
        ans = ask(t["q"]).lower()
        ok_phrase = all(p.lower() in ans for p in t["must_include"])
        ok_cite = (
            t["must_cite"] is None
            or t["must_cite"].lower() in ans
        )
        verdict = "PASS" if ok_phrase and ok_cite else "FAIL"
        print(f"[{verdict}] {t['q']}")
        if verdict == "PASS":
            passed += 1
    print(f"\n{passed}/{len(TESTS)} passed")


if __name__ == "__main__":
    score()

On a clean run with the sample knowledge base, expect four or five out of five to pass. The fifth (the out-of-scope CEO-salary query) is the most fragile: it tests whether the agent correctly declines instead of hallucinating an answer. If it fails, tighten the system prompt’s “knowledge base does not cover this” clause.
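One weakness of plain substring checks is that a short phrase such as "no" also matches inside longer words like "not" or "knowledge", so a test can pass by accident. A possible refinement, sketched here with a hypothetical helper contains_phrase, is to require that the phrase is not embedded in a longer word.

```python
import re


def contains_phrase(answer: str, phrase: str) -> bool:
    """True if `phrase` occurs in `answer` as a whole word or phrase, ignoring case."""
    pattern = r"(?<!\w)" + re.escape(phrase) + r"(?!\w)"
    return re.search(pattern, answer, flags=re.IGNORECASE) is not None


sample = "refunds are not available after 30 days"

# Substring check: "no" matches inside "not".
print("no" in sample)  # True

# Whole-word check: "no" is absent, "30" is present.
print(contains_phrase(sample, "no"))  # False
print(contains_phrase(sample, "30"))  # True
```

Swapping this into score() makes the checks stricter, so some tests that previously passed by coincidence may start failing; that is usually a sign the expected phrases need revisiting rather than the agent.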


Image: Anthropic Python SDK on GitHub, used for editorial coverage of the messages API the agent uses.

Cost notes

A full ingest-plus-eval run on the 10-document sample knowledge base costs roughly: $0.001 or less for the embedding pass (well under 100,000 tokens at $0.02 per 1M)⁴, $0 for Pinecone reads and writes within the Starter free tier¹, and roughly $0.02-0.05 for the Claude Sonnet calls during the five-query eval at Sonnet 4.5’s published rates of $3 per million input tokens and $15 per million output tokens². Production costs scale with query volume; the Starter tier is enough for low-traffic prototypes, but you will need to move to Standard once you exceed the included read-unit budget.
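The arithmetic behind those figures can be reproduced in a few lines. The prices are the ones quoted above; every token count below is an assumption chosen for illustration, not a measurement, so treat the result as an order-of-magnitude estimate.

```python
# Prices quoted in this article (USD per million tokens).
EMBED_PRICE_PER_M = 0.02
SONNET_INPUT_PER_M = 3.00
SONNET_OUTPUT_PER_M = 15.00

# Assumed workload (illustrative guesses, not measurements).
CHUNKS = 90                     # upper end of the 60-90 chunk range
TOKENS_PER_CHUNK = 125          # ~500 characters at ~4 characters per token
QUERIES = 5
INPUT_TOKENS_PER_QUERY = 1600   # both model calls: prompt, tool schema, passages
OUTPUT_TOKENS_PER_QUERY = 260   # tool call plus final answer


def usd(tokens: int, price_per_million: float) -> float:
    """Cost in USD for a token count at a per-million-token price."""
    return tokens * price_per_million / 1_000_000


embed_cost = usd(CHUNKS * TOKENS_PER_CHUNK, EMBED_PRICE_PER_M)
claude_cost = (
    usd(QUERIES * INPUT_TOKENS_PER_QUERY, SONNET_INPUT_PER_M)
    + usd(QUERIES * OUTPUT_TOKENS_PER_QUERY, SONNET_OUTPUT_PER_M)
)

print(f"embeddings: ${embed_cost:.6f}")
print(f"claude:     ${claude_cost:.4f}")
```

To replace the guesses with real numbers, sum the usage.input_tokens and usage.output_tokens fields that the Anthropic SDK returns on each response.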

Common pitfalls

Index dimension mismatch. If you upsert 1536-dim vectors into a 3072-dim index (or vice versa) Pinecone rejects the write with a clear error. The fix is to recreate the index at the right dimension or switch embedding models.
pinecone-client vs pinecone. The deprecated pinecone-client package is still pip-installable; new code should install the v5 pinecone package instead. Class names and method signatures changed from the v2 API, so snippets written for one will not run on the other.
Tool-use loop bugs. Forgetting to append the assistant’s tool_use block before the tool_result block produces an API error. The pattern is always: assistant turn (with tool_use) → user turn (with tool_result) → assistant turn (final text).
Embedding inconsistency. If you embed documents with one model and queries with another, the vectors live in different spaces and similarity scores become meaningless. Use the same embedding model for both ingestion and queries.
Over-chunking. Aggressively small chunks (under 200 characters) lose enough context that retrieval surfaces fragments the model can’t reason over. Aggressively large chunks (over 2,000 characters) waste retrieval precision. The 500-character setting in this tutorial is a starting point; tune to your content.
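The dimension-mismatch pitfall can be caught before any network call with a cheap local check. This is a sketch: check_dimensions and EXPECTED_DIM are hypothetical names, and the expected size assumes the 1536-dimension index created in Step 2.

```python
EXPECTED_DIM = 1536  # must match the `dimension` passed to create_index


def check_dimensions(vectors: list[list[float]], expected_dim: int = EXPECTED_DIM) -> None:
    """Raise ValueError if any vector does not have the expected length."""
    for i, vec in enumerate(vectors):
        if len(vec) != expected_dim:
            raise ValueError(
                f"vector {i} has {len(vec)} dimensions, expected {expected_dim}"
            )


# Passes silently for correctly sized vectors.
check_dimensions([[0.0] * 1536, [0.1] * 1536])

# Fails fast for vectors from a 3072-dimension model.
try:
    check_dimensions([[0.0] * 3072])
except ValueError as err:
    print(err)  # vector 0 has 3072 dimensions, expected 1536
```

Calling this on the output of embed() just before index.upsert turns a remote API error into an immediate local one.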

Where to go next

Three natural extensions: add a metadata filter to index.query so a region or product field narrows retrieval (Pinecone metadata filtering is documented in the query guide⁹); add a second tool that escalates to a human-support ticket via your CRM’s API; and move the Claude call behind an async worker so concurrent requests don’t queue on it.
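As a starting point for the first extension, the sketch below builds the keyword arguments for index.query with an optional filter. The helper name build_query_kwargs is hypothetical. It filters on the source field because that is the only categorical metadata ingest.py stores; a region or product filter would first require adding those fields at ingest time.

```python
from typing import Optional


def build_query_kwargs(
    vector: list[float],
    sources: Optional[list[str]] = None,
    top_k: int = 4,
) -> dict:
    """Assemble index.query arguments, restricting to given source files if any."""
    kwargs: dict = {"vector": vector, "top_k": top_k, "include_metadata": True}
    if sources:
        # Pinecone filters use MongoDB-style operators such as $eq and $in.
        kwargs["filter"] = {"source": {"$in": list(sources)}}
    return kwargs


kwargs = build_query_kwargs([0.0] * 1536, sources=["01.pdf", "07.pdf"])
print(kwargs["filter"])  # {'source': {'$in': ['01.pdf', '07.pdf']}}

# In agent.py the call would become:
# res = index.query(**kwargs)
```

To let Claude choose the filter, the tool's input_schema would need an extra optional property, and run_search would pass it through to this helper.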

For evaluation beyond smoke tests, RAGAS and LangSmith are commonly used for systematic faithfulness, context-precision, and answer-relevance scoring on a larger labelled set. Both wrap the same loop you built here in a richer metric harness.

How this article was made: an autonomous AI pipeline researched, drafted, fact-checked, and reviewed this piece, aggregating publicly-available information from the sources consulted below. AI (artificial intelligence) can make mistakes, so please cross-check the consulted sources before acting on anything here. Neural Tech Daily is not liable for decisions or outcomes based on this article.

Sources consulted

Cited Sources

1. Pinecone pricing page: Starter plan includes 2 GB storage, 2M read units / 1M write units per month, on the free tier; paid Standard tier begins at $50 per month minimum. (accessed 2026-05-20)
2. Anthropic API pricing: Claude Sonnet 4.5 at $3 per million input tokens and $15 per million output tokens at standard rates. (accessed 2026-05-20)
3. Anthropic models overview: Sonnet 4.5 (claude-sonnet-4-5-20250929) listed as the current Sonnet-tier production model. (accessed 2026-05-20)
4. OpenAI embeddings guide: text-embedding-3-small at $0.02 per 1M tokens, 1536-dimensional output by default. (accessed 2026-05-20)
5. Anthropic Python SDK repository: official client library, installation and messages API reference. (accessed 2026-05-20)
6. Pinecone, Create a serverless index: documents the ServerlessSpec class and cloud/region options. (accessed 2026-05-20)
7. Pinecone, Upsert vectors: documents the 1,000-vector batch ceiling per request. (accessed 2026-05-20)
8. Anthropic, Tool use with Claude: documents the tool_use / tool_result message-flow pattern. (accessed 2026-05-20)
9. Pinecone, Query data and semantic search: documents metadata filtering on the query operation. (accessed 2026-05-20)