Neural Tech Daily

Build a Flashcard Generator from PDF Notes: Claude + Anki Export Tutorial

Extract text from a PDF, prompt Claude to draft 10-30 Anki-style flashcards with tags, then export a ready-to-import .apkg deck via genanki.

What you’ll build

A single-file Python script that takes a PDF of study notes, extracts the text, prompts Claude to draft 10 to 30 Anki-style flashcards with a front, a back, and a small set of topical tags, and writes a .apkg file you can double-click to import into Anki Desktop. 1 The pipeline is three stages stitched end to end: pypdf for extraction, the Anthropic Python SDK for the generation call, and genanki for the deck export. 2

The whole thing lands at around 130 lines of Python. No background in spaced-repetition tooling is assumed; the relevant terms — note type, model, deck — are defined inline against Anki’s own documentation. 3

What you’ll need

  • Python 3.9 or later. Both pypdf and genanki support the current 3.9-plus range per their PyPI pages. 2 4
  • An Anthropic API key. Sign in at console.anthropic.com, create a key, and budget for a few cents per run — Claude API pricing is published on the Anthropic pricing page and varies by model tier. 5
  • A PDF of notes to test against. A lecture handout, a textbook chapter, a meeting transcript, or any text-heavy document. Scanned image-only PDFs will not work without an OCR step; pypdf reads embedded text only. 6
  • Anki Desktop installed if you want to import the deck locally; the .apkg format is the canonical import unit Anki documents. 1
  • Roughly 30 to 45 minutes if you copy-paste the code.

The three libraries doing the work

Per the cited PyPI and Anki documentation, three packages cover the full pipeline:

Stage | Library | What it does
PDF text extraction | pypdf | Pure-Python PDF library; PdfReader plus page.extract_text() returns the embedded text of each page. 6
Flashcard drafting | anthropic Python SDK | Official Anthropic SDK; wraps the Messages API and is published on PyPI. 8
Deck export | genanki | Third-party library by Kerrick Staley for generating .apkg files programmatically; documented on PyPI and GitHub. 2 4

Per genanki’s README, the library models the same three primitives Anki itself uses: a Model (the note type with its fields and card templates), a Note (one row of card data), and a Deck that bundles notes for export. 4 Matching genanki’s vocabulary to Anki’s own helps later when you want to add cloze cards or media. 9

Step 1: Install the packages

Create a fresh project directory and a virtual environment, then install the three dependencies:

mkdir flashcard-generator && cd flashcard-generator
python -m venv .venv
source .venv/bin/activate   # on Windows: .venv\Scripts\activate

pip install \
    anthropic \
    pypdf \
    genanki \
    python-dotenv

The anthropic package is the official Python SDK Anthropic publishes on PyPI. 8 pypdf is the maintained successor to the older PyPDF2 project, and the docs site at pypdf.readthedocs.io is the canonical reference for the extraction API used below. 6 genanki ships on PyPI under that exact name; the GitHub repository carries usage examples that match the deck-building code in Step 4. 2 4 python-dotenv is a convenience for loading the API key from a .env file.

Step 2: Set up the API key and project files

Store the Anthropic API key in a .env file at the project root. Create .env:

ANTHROPIC_API_KEY=sk-ant-api03-...your-key...

Add .env to .gitignore immediately so the key never lands in version control. The Anthropic SDK reads ANTHROPIC_API_KEY from the environment by default, per the SDK’s quickstart, so no further wiring is needed. 7
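With the script as written, a missing key surfaces only once execution reaches the API call, after extraction has already run. A minimal sketch of an earlier check follows; require_api_key is a hypothetical helper name, not part of the anthropic SDK, and the env parameter exists only so the check can be exercised with a plain dict.

```python
import os


def require_api_key(env=None) -> str:
    """Return the Anthropic API key or exit with a readable message.

    Hypothetical helper, not part of the anthropic SDK. `env` defaults to
    os.environ and is a parameter only to make the check easy to test.
    """
    env = os.environ if env is None else env
    key = (env.get("ANTHROPIC_API_KEY") or "").strip()
    if not key:
        raise SystemExit("ANTHROPIC_API_KEY is not set; add it to .env or export it.")
    return key
```

Calling it right after load_dotenv() would stop the run before any PDF work happens.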

Drop a PDF you want to study into the project directory and name it notes.pdf for the walkthrough. The script accepts any path at runtime; this just keeps the example tidy.

Step 3: Extract text from the PDF

Create generate_deck.py and start with the extraction step. pypdf’s PdfReader opens the document and exposes a list of pages; calling extract_text() on each page returns the embedded text as a string, per the pypdf user guide. 6

import os
import json
import re
from pathlib import Path

from dotenv import load_dotenv
from pypdf import PdfReader
from anthropic import Anthropic
import genanki

load_dotenv()

MODEL = "claude-sonnet-4-5"
MAX_CHARS = 30000


def extract_pdf_text(pdf_path: str) -> str:
    """Read a PDF and return the concatenated text of every page."""
    reader = PdfReader(pdf_path)
    pages = []
    for index, page in enumerate(reader.pages):
        text = page.extract_text() or ""
        text = text.strip()
        if text:
            pages.append(f"[page {index + 1}]\n{text}")
    full_text = "\n\n".join(pages)
    if len(full_text) > MAX_CHARS:
        full_text = full_text[:MAX_CHARS]
    return full_text

A few things worth understanding before moving on. extract_text() returns embedded text only — if the PDF is a scanned image (a photograph of a textbook page, for instance) the call returns an empty string and the script will not have anything to send Claude. 6 For scanned material, swap in an OCR step (Tesseract, pdf2image plus pytesseract, or a hosted OCR API) before this point.
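One way to catch an image-only PDF before spending an API call is to count how many pages returned any text. The sketch below works on a plain list of per-page strings, so it makes no assumptions about pypdf beyond what Step 3 already uses; the function names and the 0.5 threshold are illustrative choices, not pypdf settings.

```python
def count_text_pages(page_texts: list) -> tuple:
    """Return (pages_with_text, total_pages) for a list of per-page strings."""
    with_text = sum(1 for text in page_texts if (text or "").strip())
    return with_text, len(page_texts)


def check_extraction(page_texts: list, min_ratio: float = 0.5) -> None:
    """Exit early when most pages came back empty (likely an image-only PDF).

    min_ratio is an assumed threshold chosen for illustration.
    """
    with_text, total = count_text_pages(page_texts)
    if total == 0 or with_text / total < min_ratio:
        raise SystemExit(
            f"Only {with_text} of {total} pages contain embedded text; "
            "the PDF may be scanned and need OCR first."
        )
```

In this script the input could be built as [page.extract_text() for page in reader.pages] inside extract_pdf_text.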

The MAX_CHARS cap of 30,000 characters is a coarse safety rail to keep the prompt cost predictable. Claude’s input context is large enough to take far more, but token costs scale with input length and the cited Anthropic pricing page documents per-million-token rates for both input and output. 5 The cap truncates silently: anything past 30,000 characters never reaches Claude. A sparse, slide-style handout of 30 pages usually fits, but dense prose can pass the cap in far fewer pages, so for a textbook chapter you may want to chunk and call once per section.
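As an alternative to hard truncation, the text can be grouped into chunks along the [page N] markers that extract_pdf_text already inserts. This is a sketch under two assumptions: the truncation inside extract_pdf_text is removed or bypassed so the full text is available, and chunk_by_page is a hypothetical helper name.

```python
import re

# Zero-width split point in front of every "[page N]" marker line.
PAGE_MARKER = re.compile(r"(?m)^(?=\[page \d+\]\n)")


def chunk_by_page(full_text: str, max_chars: int = 30000) -> list:
    """Group whole [page N] blocks into chunks of at most max_chars each.

    A single page longer than max_chars is kept as its own oversized chunk
    rather than being cut mid-sentence.
    """
    blocks = [b.strip() for b in PAGE_MARKER.split(full_text) if b.strip()]
    chunks, current = [], ""
    for block in blocks:
        candidate = f"{current}\n\n{block}" if current else block
        if current and len(candidate) > max_chars:
            chunks.append(current)  # current chunk is full; start a new one
            current = block
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be sent through the generation step separately, at the cost of one API call per chunk.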

Step 4: Prompt Claude for structured flashcards

The next function sends the extracted text to Claude and asks for a JSON array of flashcards. A structured-output prompt is the simplest reliable way to get programmatically usable cards back; Anthropic’s Messages API supports plain text generation and the SDK handles the request, per the API quickstart. 7

FLASHCARD_PROMPT = """You are helping a student build an Anki deck from
their notes. Read the source text below and produce between 10 and 30
flashcards covering the most testable, atomic facts and concepts.

Rules:
- Each card has a short question on the front and a concise answer on the back.
- One fact per card. Split compound questions into separate cards.
- Avoid yes-or-no questions; prefer "what", "why", "how", "when".
- Add 1 to 3 lowercase tags per card describing the topic (e.g. "biology",
  "krebs-cycle"). Use hyphens, not spaces, inside a tag.
- Do not include answers in the question, or vice versa.
- Return ONLY a JSON array, no preamble, no markdown fences.

Schema for each item:
{"front": "string", "back": "string", "tags": ["string", ...]}

Source text:
---
%s
---
"""


def draft_flashcards(source_text: str) -> list[dict]:
    """Send the source text to Claude and parse the returned JSON array."""
    client = Anthropic()
    response = client.messages.create(
        model=MODEL,
        max_tokens=4000,
        messages=[
            {
                "role": "user",
                "content": FLASHCARD_PROMPT % source_text,
            }
        ],
    )
    raw = response.content[0].text.strip()
    raw = re.sub(r"^```(?:json)?|```$", "", raw, flags=re.MULTILINE).strip()
    cards = json.loads(raw)
    if not isinstance(cards, list):
        raise ValueError("Expected a JSON array of flashcards.")
    return cards

The prompt does three things worth flagging. It pins the schema to a simple {front, back, tags} object so the JSON is easy to parse with the standard library’s json module. 10 It explicitly bans markdown code fences in the response, since Claude often wraps JSON in ```json blocks by default; the re.sub line is a belt-and-suspenders strip in case the model adds them anyway. The tag-format hint (lowercase, hyphenated, no spaces) matches how Anki itself handles tags — space-separated, case-insensitive — so the tags survive the round-trip into Anki without surprises. 3
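The prompt asks for a specific shape, but nothing guarantees the model follows it on every run. Because Anki separates tags with spaces, a tag containing a space would not survive as a single tag. The sketch below normalizes tags and drops malformed items before they reach the deck builder; normalize_tag and clean_cards are hypothetical helper names.

```python
import re


def normalize_tag(tag: object) -> str:
    """Lowercase a tag and collapse runs of whitespace into single hyphens."""
    return re.sub(r"\s+", "-", str(tag).strip().lower())


def clean_cards(cards: list) -> list:
    """Keep only dict items with non-empty front and back; tidy their tags."""
    cleaned = []
    for item in cards:
        if not isinstance(item, dict):
            continue
        front = str(item.get("front", "")).strip()
        back = str(item.get("back", "")).strip()
        if not front or not back:
            continue
        raw_tags = item.get("tags") or []
        if isinstance(raw_tags, str):  # tolerate "a b" instead of ["a", "b"]
            raw_tags = raw_tags.split()
        tags = [t for t in (normalize_tag(x) for x in raw_tags) if t]
        cleaned.append({"front": front, "back": back, "tags": tags})
    return cleaned
```

Running the parsed list through clean_cards before build_deck keeps one bad item from affecting the rest of the deck.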

If the JSON parse fails, the most likely cause on long outputs is a response cut off at the max_tokens limit before the closing bracket. Re-run with a shorter source text or a higher max_tokens. A more defensive build would catch json.JSONDecodeError, log the raw response for inspection, and retry once with a clarifying follow-up message.
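A sketch of the catch, log, and retry pattern follows. To stay runnable without network access it takes the API call as an injected callable, and it simply repeats the request rather than sending a clarifying follow-up message. The name parse_with_retry and the logger name are illustrative.

```python
import json
import logging
from typing import Callable

log = logging.getLogger("generate_deck")


def parse_with_retry(fetch_raw: Callable[[], str], attempts: int = 2) -> list:
    """Call fetch_raw() and parse its result as a JSON array, retrying on failure.

    fetch_raw is any zero-argument callable returning the model's raw text,
    for example a small wrapper around the Messages API call.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        raw = fetch_raw()
        try:
            cards = json.loads(raw)
        except json.JSONDecodeError as exc:
            # Log the start of the raw text so the failure can be inspected.
            log.warning("Attempt %d: invalid JSON (%s). Raw start: %r",
                        attempt, exc, raw[:200])
            last_error = exc
            continue
        if isinstance(cards, list):
            return cards
        last_error = ValueError("Expected a JSON array of flashcards.")
    raise ValueError(f"No valid card list after {attempts} attempts") from last_error
```

The Messages API response also carries a stop_reason field; a value of "max_tokens" indicates the output was cut off, which is a cheaper signal to check than a failed parse.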

Step 5: Build the Anki deck with genanki

genanki requires three things to produce a .apkg: a Model describing the note type, a Deck to hold the notes, and one Note per card. The library’s README shows the canonical pattern, reproduced and adapted below. 4

DECK_ID = 1607392319
MODEL_ID = 1607392320


def build_deck(cards: list[dict], deck_name: str, output_path: str) -> str:
    """Convert flashcard dicts into a .apkg file and return the path."""
    model = genanki.Model(
        MODEL_ID,
        "Claude Flashcard",
        fields=[
            {"name": "Front"},
            {"name": "Back"},
        ],
        templates=[
            {
                "name": "Card 1",
                "qfmt": "{{Front}}",
                "afmt": "{{FrontSide}}<hr id=answer>{{Back}}",
            },
        ],
    )

    deck = genanki.Deck(DECK_ID, deck_name)

    for card in cards:
        front = str(card.get("front", "")).strip()
        back = str(card.get("back", "")).strip()
        tags = [str(t).strip() for t in card.get("tags", []) if str(t).strip()]
        if not front or not back:
            continue
        note = genanki.Note(
            model=model,
            fields=[front, back],
            tags=tags,
        )
        deck.add_note(note)

    package = genanki.Package(deck)
    package.write_to_file(output_path)
    return output_path

A few things worth understanding here. The DECK_ID and MODEL_ID are arbitrary integers, but genanki’s README is explicit that you must pick a stable, unique value per deck and per note type — Anki uses these IDs to merge updates rather than create duplicate decks on re-import. 4 If you regenerate the deck after editing notes upstream, keeping the same IDs lets Anki overlay the new version on the old.
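The genanki README suggests generating each ID once with random.randrange(1 << 30, 1 << 31) and hardcoding the result. When one script produces several decks, a derived ID is another option. The sketch below is a swapped-in technique, not something the README prescribes, and stable_id is a hypothetical helper name.

```python
import hashlib


def stable_id(name: str) -> int:
    """Map a name to the same integer in [2**30, 2**31) on every run.

    Hypothetical helper, not part of genanki. SHA-256 is used because
    Python's built-in hash() of a string is salted per process and would
    change between runs.
    """
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    span = (1 << 31) - (1 << 30)
    return (1 << 30) + int.from_bytes(digest[:8], "big") % span
```

The trade-off is that renaming a deck changes its derived ID, so Anki would treat the renamed deck as a new one.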

The qfmt and afmt strings are Anki’s card-template language: {{Front}} and {{Back}} are field substitutions, {{FrontSide}} repeats the question on the answer side, and <hr id=answer> is the divider Anki traditionally renders between question and answer. 9 Tags pass through as a list of strings on the Note; Anki imports them as-is. 3
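Anki renders field content as HTML, so a card whose text contains characters such as < or & (code snippets, inequalities) may not display as written. A minimal sketch of escaping fields before they go into a Note, using only the standard library; to_field_html is a hypothetical helper name.

```python
import html


def to_field_html(text: str) -> str:
    """Escape &, < and > and turn newlines into <br> for an Anki field."""
    escaped = html.escape(text.strip(), quote=False)
    return escaped.replace("\r\n", "\n").replace("\n", "<br>")
```

In build_deck this would wrap the front and back values, for example fields=[to_field_html(front), to_field_html(back)]. Skip it if you want Claude to emit HTML formatting on purpose.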

Step 6: Wire the entry point

Tie the three functions together with a small main block. The script takes a PDF path and an output deck name as command-line arguments, reports progress, and writes the .apkg to the current directory.

def main() -> None:
    import argparse

    parser = argparse.ArgumentParser(description="Generate an Anki deck from a PDF.")
    parser.add_argument("pdf", help="Path to the source PDF.")
    parser.add_argument(
        "--deck-name",
        default="Notes from PDF",
        help="Name of the Anki deck.",
    )
    parser.add_argument(
        "--output",
        default="deck.apkg",
        help="Output .apkg path.",
    )
    args = parser.parse_args()

    pdf_path = Path(args.pdf)
    if not pdf_path.exists():
        raise SystemExit(f"PDF not found: {pdf_path}")

    print(f"Extracting text from {pdf_path}...")
    source_text = extract_pdf_text(str(pdf_path))
    print(f"Got {len(source_text)} characters of text.")

    print("Asking Claude to draft flashcards...")
    cards = draft_flashcards(source_text)
    print(f"Got {len(cards)} cards back.")

    print(f"Writing deck to {args.output}...")
    build_deck(cards, args.deck_name, args.output)
    print("Done.")


if __name__ == "__main__":
    main()

Run the script against your PDF:

python generate_deck.py notes.pdf --deck-name "Biology Chapter 4"

Expect a 10-to-30-card run to take under a minute on Sonnet, dominated by the API round-trip. Double-click the resulting deck.apkg and Anki opens it via the import dialog, per the manual’s Importing chapter. 1

Tuning the output

A handful of practical knobs are worth knowing once the basic pipeline runs:

  • Card count. The 10 to 30 range in the prompt is a soft target. For a dense chapter, raise the upper bound and the max_tokens budget together. For a one-page handout, drop the lower bound to 5.
  • Card style. The prompt currently asks for short factual cards. Cloze-deletion cards (where part of a sentence is hidden) are an Anki-native feature documented in the Card types chapter; producing them needs a different genanki.Model configured with the Cloze type and a prompt instructing Claude to emit {{c1::hidden text}} markup. 9
  • Chunking long PDFs. The 30,000-character cap above handles small documents. For a full textbook chapter, split the source text on the [page N] markers that extract_pdf_text inserts or by page range (plain PDF text carries no ## headings to split on), call draft_flashcards per chunk, and concatenate the resulting lists before passing to build_deck.
  • Model choice. Sonnet 4.5 is the default above for cost and speed; for harder material (graduate-level mathematics, dense legal text) the larger Claude tier surfaces sharper card framing at a higher per-token rate, per Anthropic’s pricing page. 5 The Models overview page lists the current production model IDs. 11
  • Quality review. Treat the first run as a first draft. Hand-editing cards in the Anki UI after import is a normal part of the workflow. The IDs pinned as DECK_ID and MODEL_ID let you re-run the script later and re-import into the same deck and note type rather than a parallel copy; check how a re-import treats cards you have already edited before relying on it. 4
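When a document is processed in chunks, neighbouring chunks can yield near-identical cards. A sketch of concatenating the per-chunk lists while dropping repeated fronts follows; merge_card_lists is a hypothetical helper name, and matching on the normalized front text is an assumed heuristic that will miss paraphrased duplicates.

```python
def merge_card_lists(card_lists: list) -> list:
    """Concatenate per-chunk card lists, dropping repeats of the same front.

    Fronts are compared case-insensitively with whitespace collapsed; the
    first occurrence wins.
    """
    seen = set()
    merged = []
    for cards in card_lists:
        for card in cards:
            key = " ".join(str(card.get("front", "")).lower().split())
            if not key or key in seen:
                continue
            seen.add(key)
            merged.append(card)
    return merged
```

The merged list can be passed to build_deck in place of a single call's output.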

Common failure modes

Per the cited library and SDK documentation, three failure modes account for most first-run issues:

  • extract_text() returns empty strings. The PDF is image-only. Add an OCR step before calling draft_flashcards. The pypdf docs flag this as the expected behaviour for scanned documents. 6
  • json.JSONDecodeError from Claude’s response. The model wrapped the array in prose or trimmed a bracket on a long output. Re-run with a higher max_tokens, or wrap the parse in a try/except that logs the raw text and retries once. The Python json module’s error reporting names the line and column of the parse failure, which is usually enough to spot the issue. 10
  • Duplicate cards on re-import. The DECK_ID or MODEL_ID changed between runs. Pin them as module-level constants (as above) and Anki merges updates instead of creating a parallel deck, per the genanki README. 4

Where to go next

The script is intentionally minimal. Three natural extensions worth considering, each grounded in the cited primary sources:

  1. Add cloze cards by configuring a second genanki.Model with the Cloze note type and prompting Claude to emit {{c1::...}} markup on selected cards. Anki’s templates chapter documents the Cloze format. 9
  2. Persist source citations by storing the page number (extract_pdf_text already tags each chunk with [page N]) in a third field on the note model, so each card’s back side shows where it came from.
  3. Batch process a folder of PDFs by walking a directory, calling the pipeline once per file, and writing one deck per source. genanki.Package accepts multiple Deck instances per the README if you prefer one bundled .apkg. 4
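For the batch extension, the directory walk can be separated from the pipeline itself. The sketch below only plans the work: it lists each PDF with an output path and a deck name derived from the file name. plan_outputs, the decks output folder, and the naming rule are all illustrative assumptions.

```python
from pathlib import Path


def plan_outputs(folder: str, out_dir: str = "decks") -> list:
    """List (pdf_path, apkg_path, deck_name) for every PDF directly in folder."""
    out = Path(out_dir)
    jobs = []
    for pdf in sorted(Path(folder).glob("*.pdf")):
        # Turn "bio_chapter-4" into "bio chapter 4" for a readable deck name.
        deck_name = pdf.stem.replace("_", " ").replace("-", " ").strip()
        jobs.append((pdf, out / f"{pdf.stem}.apkg", deck_name))
    return jobs
```

A driver loop would then run extraction, generation, and build_deck once per job. Each deck still needs its own stable ID, per the genanki README, so the single DECK_ID constant would have to become a per-deck value.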

The pattern — pull text out, ask Claude for structured output, render to a file format another tool consumes — generalises beyond flashcards. The same three-stage shape works for quizzes, summaries, and study schedules; only the schema and the export library change.

How this article was made: an autonomous AI pipeline researched, drafted, fact-checked, and reviewed this piece, aggregating publicly-available information from the sources consulted below. AI (artificial intelligence) can make mistakes, so please cross-check the consulted sources before acting on anything here. Neural Tech Daily is not liable for decisions or outcomes based on this article.

Sources consulted

  1. Anki manual: Importing decks (docs.ankiweb.net)
  2. genanki on PyPI
  3. Anki manual home (docs.ankiweb.net)
  4. genanki GitHub repository (kerrickstaley/genanki)
  5. Anthropic: Claude API pricing
  6. pypdf documentation: PdfReader and extract_text
  7. Anthropic: Claude API quickstart
  8. anthropic Python SDK on PyPI
  9. Anki manual: Card types and templates
  10. Python docs: json module
  11. Anthropic: Models overview
