Build a YouTube Transcript Summariser CLI with Claude: An End-to-End Python Tutorial

A Python CLI that fetches a YouTube transcript via youtube-transcript-api, summarises it with Claude Sonnet, and outputs Markdown with timestamps, packaged for PyPI.

20 May 2026 · Updated 20 May 2026 · ~13 min read

Image: youtube-transcript-api GitHub repository, used for editorial coverage of the library taught in this tutorial.

What you’ll build

By the end of this walkthrough you will have a small Python command-line tool, ytsum, that takes a YouTube URL or video ID, pulls the transcript, sends it to Claude Sonnet with a structured prompt, and writes a Markdown summary with timestamped section anchors back to your terminal or to a file. The same tool handles videos longer than two hours through a chunk-and-merge step, and the final section walks through publishing it to PyPI for anyone to install with pip.

The two libraries doing the heavy lifting are youtube-transcript-api for transcript retrieval and the official anthropic SDK for the model call. The CLI shell uses Click, with an argparse variant noted for readers who prefer the standard library. The youtube-transcript-api README, the Anthropic Messages API reference, and the Click documentation all support this stack as a short path from idea to working tool¹²³.

Prerequisites

You’ll want Python 3.10 or newer, a working virtual environment, an Anthropic API key from console.anthropic.com, and a YouTube video that actually carries a transcript (automatic captions count). Claude Sonnet 4.5 is priced at $3 per 1M input tokens and $15 per 1M output tokens as of May 2026 per Anthropic’s published pricing⁴. A typical 30-minute video transcript is roughly 5,000 to 8,000 tokens in and 800 tokens out, which works out to about $0.03 to $0.04 per summary. Verify against Anthropic’s current pricing page before running at volume, since these rates can change.
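The arithmetic behind that per-summary figure can be sketched in a few lines. The rates are the ones quoted above; the function name estimate_cost_usd is hypothetical, and the token counts are this article's rough figures, not measurements.

```python
# Rates quoted in this article for Claude Sonnet 4.5 (USD per 1M tokens).
# Assumption: confirm current values on Anthropic's pricing page.
INPUT_USD_PER_MTOK = 3.00
OUTPUT_USD_PER_MTOK = 15.00


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of one summarisation call (hypothetical helper)."""
    return (
        input_tokens * INPUT_USD_PER_MTOK
        + output_tokens * OUTPUT_USD_PER_MTOK
    ) / 1_000_000


if __name__ == "__main__":
    # Low and high ends of the 30-minute transcript estimate, 800 tokens out.
    for tokens_in in (5_000, 8_000):
        cost = estimate_cost_usd(tokens_in, 800)
        print(f"{tokens_in} in / 800 out: ${cost:.3f}")
```

With the rates above, the two ends of the range come to $0.027 and $0.036.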

Budget about 90 minutes for a first walkthrough: 10 to install, 20 to build the transcript fetch, 30 to wire Claude in and shape the prompt, 20 to add the CLI flags, and 10 to either chunk a long video or push to PyPI.

Step 1: Set up the project

Create a fresh project folder and an isolated virtual environment so dependencies stay contained.

mkdir ytsum && cd ytsum
python3 -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate
pip install --upgrade pip
pip install "youtube-transcript-api>=0.6" "anthropic>=0.34" "click>=8.1"

The youtube-transcript-api pin tracks the current release line on PyPI as of May 2026⁵. The anthropic SDK pin pulls a build that supports the current Messages API surface⁶. Click 8.1 is the latest stable release per the Pallets documentation⁷.

Sketch the project layout you’re heading toward:

ytsum/
  ytsum/
    __init__.py
    cli.py
    transcript.py
    summarise.py
  pyproject.toml
  README.md

The single-package structure keeps imports clean and matches the layout the Python Packaging User Guide recommends for a PyPI-ready project⁸.

Step 2: Fetch the transcript

The transcript-fetch step is a thin wrapper around youtube-transcript-api. Two things to handle: extracting the video ID from a URL, and degrading gracefully when no transcript exists.

Create ytsum/transcript.py:

"""Fetch and normalise a YouTube transcript."""

from __future__ import annotations

import re
from dataclasses import dataclass
from typing import Optional

from youtube_transcript_api import (
    YouTubeTranscriptApi,
    TranscriptsDisabled,
    NoTranscriptFound,
)


@dataclass
class TranscriptSegment:
    """A single line of transcript with its start offset in seconds."""

    text: str
    start: float
    duration: float


_VIDEO_ID_RE = re.compile(
    r"(?:v=|youtu\.be/|/embed/|/shorts/)([A-Za-z0-9_-]{11})"
)


def extract_video_id(url_or_id: str) -> str:
    """Accept a full URL or a bare 11-char video ID and return the ID."""
    if re.fullmatch(r"[A-Za-z0-9_-]{11}", url_or_id):
        return url_or_id
    match = _VIDEO_ID_RE.search(url_or_id)
    if not match:
        raise ValueError(f"Could not extract a video ID from: {url_or_id}")
    return match.group(1)


def fetch_transcript(
    url_or_id: str,
    languages: Optional[list[str]] = None,
) -> list[TranscriptSegment]:
    """Return the transcript as a list of TranscriptSegment objects."""
    video_id = extract_video_id(url_or_id)
    languages = languages or ["en"]
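    try:
        if hasattr(YouTubeTranscriptApi, "fetch"):
            # Newer (1.x) releases: instance API; to_raw_data() returns dicts.
            api = YouTubeTranscriptApi()
            raw = api.fetch(video_id, languages=languages).to_raw_data()
        else:
            # 0.6.x releases: static helper returning a list of dicts.
            raw = YouTubeTranscriptApi.get_transcript(
                video_id, languages=languages
            )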
    except TranscriptsDisabled as exc:
        raise RuntimeError(
            f"Transcripts are disabled for video {video_id}."
        ) from exc
    except NoTranscriptFound as exc:
        raise RuntimeError(
            f"No transcript in {languages} for video {video_id}."
        ) from exc
    return [
        TranscriptSegment(
            text=item["text"],
            start=float(item["start"]),
            duration=float(item["duration"]),
        )
        for item in raw
    ]

The regex covers the four URL shapes YouTube hands out in 2026: watch?v=, youtu.be/, embed/, and shorts/. The two named exceptions are the documented failure modes in the youtube-transcript-api README⁹; mapping them to a clean RuntimeError keeps the CLI layer above from needing to know the library’s exception hierarchy.
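To see the ID extraction on each URL shape without touching the network, here is a minimal standalone check. The regex and function are copied from transcript.py above; the sample URLs are illustrative and reuse the video ID from the smoke test.

```python
import re

# Copied from ytsum/transcript.py so this snippet runs on its own.
_VIDEO_ID_RE = re.compile(
    r"(?:v=|youtu\.be/|/embed/|/shorts/)([A-Za-z0-9_-]{11})"
)


def extract_video_id(url_or_id: str) -> str:
    """Accept a full URL or a bare 11-char video ID and return the ID."""
    if re.fullmatch(r"[A-Za-z0-9_-]{11}", url_or_id):
        return url_or_id
    match = _VIDEO_ID_RE.search(url_or_id)
    if not match:
        raise ValueError(f"Could not extract a video ID from: {url_or_id}")
    return match.group(1)


# Illustrative inputs: one per URL shape, plus a bare ID.
SAMPLES = [
    "https://www.youtube.com/watch?v=dQw4w9WgXcQ&t=42s",
    "https://youtu.be/dQw4w9WgXcQ",
    "https://www.youtube.com/embed/dQw4w9WgXcQ",
    "https://www.youtube.com/shorts/dQw4w9WgXcQ",
    "dQw4w9WgXcQ",
]

for sample in SAMPLES:
    print(f"{sample} -> {extract_video_id(sample)}")
```

Every sample resolves to the same 11-character ID. A URL with no recognisable ID, such as https://example.com/, raises ValueError, which the CLI layer turns into an error message.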

Quick smoke test from a Python REPL inside the project venv:

>>> from ytsum.transcript import fetch_transcript
>>> segments = fetch_transcript("https://www.youtube.com/watch?v=dQw4w9WgXcQ")
>>> segments[0]
TranscriptSegment(text="We're no strangers to love", start=0.0, duration=2.5)

If that returns segments, the network and library install are working. If it raises RuntimeError, pick a different video; not every channel publishes transcripts.

Step 3: Send it to Claude

The summarisation step lives in ytsum/summarise.py. The job is to (a) format the transcript with timestamp anchors so Claude can cite them, (b) build a structured prompt that asks for Markdown with section headers, and (c) call the Messages API.

"""Summarise a transcript with Claude."""

from __future__ import annotations

import os
from typing import Literal

import anthropic

from .transcript import TranscriptSegment

Length = Literal["short", "medium", "long"]
Format = Literal["markdown", "bullets", "outline"]

MODEL = "claude-sonnet-4-5"

LENGTH_GUIDANCE = {
    "short": "Aim for roughly 150 words. One paragraph plus 3 bullet takeaways.",
    "medium": "Aim for roughly 400 words. Two to three paragraphs plus 5 bullet takeaways.",
    "long": "Aim for roughly 800 words with section headers covering each major topic.",
}

FORMAT_GUIDANCE = {
    "markdown": "Return clean Markdown. Use ## for section headers.",
    "bullets": "Return a single bulleted list. Each bullet is one self-contained idea.",
    "outline": "Return a hierarchical outline using nested bullets, three levels deep at most.",
}


def _format_seconds(seconds: float) -> str:
    """Render a float second-offset as H:MM:SS for transcripts > 1h, else M:SS."""
    total = int(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    if hours:
        return f"{hours}:{minutes:02d}:{secs:02d}"
    return f"{minutes}:{secs:02d}"


def format_transcript_for_prompt(
    segments: list[TranscriptSegment],
    anchor_every_seconds: int = 60,
) -> str:
    """Join segments into a prompt-ready string with timestamp anchors."""
    out: list[str] = []
    next_anchor = 0
    for seg in segments:
        if seg.start >= next_anchor:
            out.append(f"\n[{_format_seconds(seg.start)}] ")
            next_anchor = int(seg.start) + anchor_every_seconds
        out.append(seg.text.strip())
        out.append(" ")
    return "".join(out).strip()


def summarise(
    segments: list[TranscriptSegment],
    length: Length = "medium",
    fmt: Format = "markdown",
    api_key: str | None = None,
) -> str:
    """Send the formatted transcript to Claude and return the Markdown summary."""
    client = anthropic.Anthropic(
        api_key=api_key or os.environ["ANTHROPIC_API_KEY"]
    )
    transcript_text = format_transcript_for_prompt(segments)

    system = (
        "You are a careful summariser of video transcripts. "
        "Preserve the speaker's claims faithfully and do not invent details. "
        "Where the transcript supports it, cite timestamps inline using the "
        "anchors in square brackets that already appear in the transcript "
        "(for example, [12:34])."
    )

    user = (
        f"{LENGTH_GUIDANCE[length]}\n"
        f"{FORMAT_GUIDANCE[fmt]}\n\n"
        "Transcript follows. Timestamp anchors are in square brackets at "
        "approximately one-minute intervals; reuse them when citing.\n\n"
        f"{transcript_text}"
    )

    response = client.messages.create(
        model=MODEL,
        max_tokens=2048,
        system=system,
        messages=[{"role": "user", "content": user}],
    )
    return response.content[0].text

The Messages API call follows the shape documented in the Anthropic API reference¹⁰. The system parameter sets the persona; the actual prompt and transcript go in a single user message. max_tokens=2048 is enough headroom for the long length setting and keeps a runaway response bounded.
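Because max_tokens is a hard cap, a summary can be cut off mid-sentence. One way to notice is to check the response's stop_reason, which the Messages API sets to "max_tokens" when the cap was hit. The sketch below uses a hypothetical helper, extract_text, and a stand-in object in place of a real response so it runs without an API key.

```python
from types import SimpleNamespace


def extract_text(response) -> tuple[str, bool]:
    """Return (summary_text, was_truncated) for a Messages API response.

    Hypothetical helper: joins every text block and flags responses that
    stopped because they reached max_tokens.
    """
    text = "".join(
        block.text
        for block in response.content
        if getattr(block, "type", None) == "text"
    )
    return text, response.stop_reason == "max_tokens"


# Stand-in for a real response object; the shape is assumed for the demo.
fake_response = SimpleNamespace(
    content=[SimpleNamespace(type="text", text="## Summary\nPartial...")],
    stop_reason="max_tokens",
)

summary_text, truncated = extract_text(fake_response)
if truncated:
    print("warning: summary hit max_tokens and may be cut off")
```

In summarise(), the same check could replace the bare response.content[0].text return if you want the CLI to warn on truncated output.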

A note on the timestamp-anchor pattern: rather than asking Claude to compute timestamps from raw segment data (which large language models do poorly), the transcript itself is interleaved with [mm:ss] anchors roughly every 60 seconds. Each anchor carries the start time of the first segment at or after the next one-minute mark, and switches to h:mm:ss past the first hour. Claude is asked to quote those anchors, not invent them. This is in line with the pattern Anthropic’s own documentation recommends for grounded citations in long-context tasks¹¹.
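To make the anchor pattern concrete, the snippet below runs a condensed copy of the two formatting helpers from summarise.py on a few made-up segments. The segment text and timings are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class TranscriptSegment:
    text: str
    start: float
    duration: float


def _format_seconds(seconds: float) -> str:
    """Render seconds as H:MM:SS past the first hour, else M:SS."""
    total = int(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    if hours:
        return f"{hours}:{minutes:02d}:{secs:02d}"
    return f"{minutes}:{secs:02d}"


def format_transcript_for_prompt(segments, anchor_every_seconds=60):
    """Join segments, opening a new anchored line about once a minute."""
    out = []
    next_anchor = 0
    for seg in segments:
        if seg.start >= next_anchor:
            out.append(f"\n[{_format_seconds(seg.start)}] ")
            next_anchor = int(seg.start) + anchor_every_seconds
        out.append(seg.text.strip())
        out.append(" ")
    return "".join(out).strip()


# Made-up segments: two in the first minute, one just past it, one after 1h.
demo = [
    TranscriptSegment("Welcome to the show.", 0.0, 4.0),
    TranscriptSegment("Today we cover three topics.", 4.0, 5.0),
    TranscriptSegment("First, the setup.", 61.5, 3.0),
    TranscriptSegment("Then the results.", 3725.0, 4.0),
]

prompt_text = format_transcript_for_prompt(demo)
print(prompt_text)
```

The output has three anchored lines, starting with [0:00], [1:01], and [1:02:05]. The first two segments share one anchor because the second starts before the next one-minute mark.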

Image: Anthropic Messages API reference, used for editorial coverage of the API surface called in this tutorial.

Step 4: Wire up the CLI

The CLI shell sits in ytsum/cli.py. Click handles flag parsing, validation, and the --help output for free.

"""Command-line entry point for ytsum."""

from __future__ import annotations

import sys
from pathlib import Path

import click

from .summarise import summarise, Length, Format
from .transcript import fetch_transcript


@click.command()
@click.argument("url_or_id")
@click.option(
    "--format",
    "fmt",
    type=click.Choice(["markdown", "bullets", "outline"]),
    default="markdown",
    help="Output format. Default: markdown.",
)
@click.option(
    "--length",
    type=click.Choice(["short", "medium", "long"]),
    default="medium",
    help="Target summary length. Default: medium.",
)
@click.option(
    "--output-file",
    "-o",
    type=click.Path(dir_okay=False, writable=True, path_type=Path),
    default=None,
    help="Write the summary to this file. Default: print to stdout.",
)
@click.option(
    "--language",
    multiple=True,
    default=["en"],
    help="Preferred transcript language codes. Repeatable.",
)
def main(
    url_or_id: str,
    fmt: Format,
    length: Length,
    output_file: Path | None,
    language: tuple[str, ...],
) -> None:
    """Summarise a YouTube video by URL or video ID."""
    try:
        segments = fetch_transcript(url_or_id, languages=list(language))
    except (ValueError, RuntimeError) as exc:
        click.echo(f"error: {exc}", err=True)
        sys.exit(1)

    summary = summarise(segments, length=length, fmt=fmt)

    if output_file:
        output_file.write_text(summary, encoding="utf-8")
        click.echo(f"Wrote {output_file}", err=True)
    else:
        click.echo(summary)


if __name__ == "__main__":
    main()

Click’s Choice type rejects any value outside the documented set and shows the valid options in --help¹². The path_type=Path argument returns a real pathlib.Path instead of a string, which avoids type-juggling at the write step. Readers who prefer the standard library can swap Click for argparse, which expresses the same flag set in roughly the same number of lines¹³.
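For readers taking the standard-library route, a minimal argparse sketch of the same flags might look like this. The function names build_parser and parse_args are hypothetical, and the wiring to fetch_transcript and summarise is left out so the snippet runs on its own.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Build an argparse parser mirroring the Click flags (hypothetical)."""
    parser = argparse.ArgumentParser(
        prog="ytsum",
        description="Summarise a YouTube video by URL or video ID.",
    )
    parser.add_argument("url_or_id")
    parser.add_argument(
        "--format",
        dest="fmt",
        choices=["markdown", "bullets", "outline"],
        default="markdown",
        help="Output format. Default: markdown.",
    )
    parser.add_argument(
        "--length",
        choices=["short", "medium", "long"],
        default="medium",
        help="Target summary length. Default: medium.",
    )
    parser.add_argument(
        "-o",
        "--output-file",
        type=Path,
        default=None,
        help="Write the summary to this file. Default: print to stdout.",
    )
    parser.add_argument(
        "--language",
        action="append",
        default=None,
        help="Preferred transcript language codes. Repeatable.",
    )
    return parser


def parse_args(argv=None) -> argparse.Namespace:
    args = build_parser().parse_args(argv)
    # action="append" adds to a non-empty default instead of replacing it,
    # so the "en" fallback is applied after parsing.
    args.language = args.language or ["en"]
    return args


args = parse_args(["dQw4w9WgXcQ", "--length", "short", "-o", "summary.md"])
print(args.fmt, args.length, args.output_file, args.language)
```

The choices argument plays the role of Click's Choice type, and type=Path stands in for path_type=Path. Unlike click.Path, it does not check that the target is a writable file.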

Run it end-to-end from the project root:

export ANTHROPIC_API_KEY="sk-ant-..."
python -m ytsum.cli "https://www.youtube.com/watch?v=dQw4w9WgXcQ" \
  --length short \
  --format markdown \
  -o summary.md

The first time you run it, expect a 3 to 10 second wait while Claude responds. The summary lands in summary.md with section headers and inline [mm:ss] citations pointing back into the video.

Image: Click documentation home, used for editorial coverage of the CLI framework used in this tutorial.

Step 5: Handle long videos with chunk-and-merge

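Very long transcripts eventually crowd even Claude Sonnet’s 200,000-token context window per the Anthropic models overview¹⁴, and well before that point a single 2,048-token response has to compress a lot of material. By the Prerequisites estimate of 5,000 to 8,000 tokens per 30 minutes, a two-hour video comes to roughly 20,000 to 32,000 input tokens, so the two-hour cut-off used here is a conservative threshold, not a hard limit. The robust approach is to split the transcript into overlapping chunks, summarise each chunk independently, then merge the chunk summaries in a final pass.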

Add a chunked_summarise function to summarise.py:

def _chunk_segments(
    segments: list[TranscriptSegment],
    chunk_seconds: int = 1200,  # 20-minute chunks
    overlap_seconds: int = 60,
) -> list[list[TranscriptSegment]]:
    """Split segments into overlapping windows by start time."""
    if not segments:
        return []
    total = segments[-1].start + segments[-1].duration
    chunks: list[list[TranscriptSegment]] = []
    start = 0.0
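    if overlap_seconds >= chunk_seconds:
        # Otherwise the window would never advance and the loop would hang.
        raise ValueError("overlap_seconds must be smaller than chunk_seconds")
    while start < total:
        end = start + chunk_seconds
        window = [s for s in segments if start <= s.start < end]
        if window:
            chunks.append(window)
        if end >= total:
            break  # this window already reached the end of the video
        start = end - overlap_seconds
    return chunks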


def chunked_summarise(
    segments: list[TranscriptSegment],
    length: Length = "medium",
    fmt: Format = "markdown",
    api_key: str | None = None,
) -> str:
    """Summarise long transcripts via map-reduce over time-based chunks."""
    chunks = _chunk_segments(segments)
    if len(chunks) <= 1:
        return summarise(segments, length=length, fmt=fmt, api_key=api_key)

    partials = [
        summarise(chunk, length="short", fmt="bullets", api_key=api_key)
        for chunk in chunks
    ]

    client = anthropic.Anthropic(
        api_key=api_key or os.environ["ANTHROPIC_API_KEY"]
    )
    merge_prompt = (
        "Merge the following per-chunk summaries of the same video into a "
        "single coherent summary. Preserve timestamp anchors. "
        f"{LENGTH_GUIDANCE[length]} {FORMAT_GUIDANCE[fmt]}\n\n"
        + "\n\n---\n\n".join(partials)
    )
    response = client.messages.create(
        model=MODEL,
        max_tokens=2048,
        messages=[{"role": "user", "content": merge_prompt}],
    )
    return response.content[0].text

The 20-minute chunk size with 1-minute overlap leaves each chunk well under the per-call token ceiling while keeping enough context at chunk boundaries that the merge pass can stitch ideas across them. To activate it, branch on transcript length in cli.py:

total_seconds = segments[-1].start + segments[-1].duration if segments else 0
if total_seconds > 7200:  # 2 hours
    from .summarise import chunked_summarise
    summary = chunked_summarise(segments, length=length, fmt=fmt)
else:
    summary = summarise(segments, length=length, fmt=fmt)

The map-reduce pattern is the same one the Anthropic prompting guide suggests for long-document summarisation when a single call would otherwise hit context limits¹⁵.
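It helps to know how many API calls a long video triggers before running one. The sketch below reproduces only the window arithmetic (20-minute windows that advance by chunk length minus overlap) and counts one call per window plus one merge call. The function name chunk_starts is hypothetical, and the count is an upper bound because windows with no transcript segments are skipped in the real code.

```python
def chunk_starts(
    total_seconds: float,
    chunk_seconds: int = 1200,
    overlap_seconds: int = 60,
) -> list[int]:
    """Return the start offset of each time window (hypothetical helper)."""
    if overlap_seconds >= chunk_seconds:
        raise ValueError("overlap_seconds must be smaller than chunk_seconds")
    step = chunk_seconds - overlap_seconds  # each window advances by this much
    starts = []
    start = 0
    while start < total_seconds:
        starts.append(start)
        start += step
    return starts


if __name__ == "__main__":
    three_hours = 3 * 60 * 60
    starts = chunk_starts(three_hours)
    # One summarise() call per window, plus the final merge call.
    print(f"{len(starts)} windows, {len(starts) + 1} API calls")
```

For a three-hour video with the default settings, this gives 10 windows and therefore 11 calls.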

Step 6 (optional): Publish to PyPI

Skip this if you’re keeping the tool local. If you want anyone to pip install ytsum, the Python Packaging User Guide is the canonical reference¹⁶.

A minimal pyproject.toml:

[build-system]
requires = ["setuptools>=68", "wheel"]
build-backend = "setuptools.build_meta"

[project]
name = "ytsum"
version = "0.1.0"
description = "Summarise YouTube videos with Claude from the command line."
readme = "README.md"
requires-python = ">=3.10"
dependencies = [
    "youtube-transcript-api>=0.6",
    "anthropic>=0.34",
    "click>=8.1",
]

[project.scripts]
ytsum = "ytsum.cli:main"

The [project.scripts] entry creates the ytsum binary on install so users can run ytsum <url> directly without python -m. Build the distribution and upload it with twine:

pip install build twine
python -m build
twine upload dist/*

twine upload prompts for a PyPI API token, which you can create at pypi.org/manage/account/token/, per the twine documentation¹⁷. Before pushing to the real index, test on TestPyPI first via twine upload --repository testpypi dist/*.

Image: Python Packaging User Guide, used for editorial coverage of the PyPI-publish workflow.

Common pitfalls

A handful of failure modes catch most first-time runs. The youtube-transcript-api GitHub issues page documents recurring patterns worth pre-empting¹⁸:

“Transcripts are disabled” on a video you can clearly see captions for in the YouTube UI. YouTube distinguishes uploader-provided transcripts from auto-generated ones; some channels disable the public transcript surface even when captions render. The library exposes the underlying state via list_transcripts() (list() on a YouTubeTranscriptApi instance in newer 1.x releases) if you need to inspect what’s available.
IP blocks on long-running batch jobs. Per the library README, YouTube rate-limits transcript requests from cloud IP ranges aggressively. The README recommends running batch jobs from residential IPs or through a proxy. Don’t loop the CLI over hundreds of videos from a CI runner without expecting throttling.
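anthropic.AuthenticationError. Almost always a stale or mistyped ANTHROPIC_API_KEY. Confirm the key is exported in the same shell session and that it hasn’t been rotated in the Anthropic console. A missing variable fails earlier: the os.environ lookup in summarise() raises KeyError before any request is sent.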
Timestamps drifting in the final summary. Claude occasionally invents intermediate anchors when asked to summarise broadly. The fix is to tighten the system prompt: explicitly forbid timestamps the transcript doesn’t carry, and add one validation pass that regex-checks each [mm:ss] against the anchor set you actually inserted.
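The validation pass from the last pitfall can be sketched as a small regex check: collect the anchors that were actually inserted into the prompt text, then flag any bracketed timestamp in the summary that is not among them. The function name find_unknown_timestamps and the sample strings are hypothetical.

```python
import re

# Matches [m:ss], [mm:ss], and [h:mm:ss] style anchors.
ANCHOR_RE = re.compile(r"\[(\d+(?::\d{2}){1,2})\]")


def find_unknown_timestamps(prompt_text: str, summary: str) -> list[str]:
    """Return timestamps cited in the summary that the prompt never carried.

    Hypothetical helper; order follows first appearance in the summary.
    """
    known = set(ANCHOR_RE.findall(prompt_text))
    unknown: list[str] = []
    for stamp in ANCHOR_RE.findall(summary):
        if stamp not in known and stamp not in unknown:
            unknown.append(stamp)
    return unknown


# Illustrative inputs: the summary cites one anchor the prompt never had.
prompt_text = "[0:00] intro ... [1:01] setup ... [1:02:05] results"
summary = "Intro at [0:00], setup at [1:01], and a claim at [5:30]."

bad = find_unknown_timestamps(prompt_text, summary)
if bad:
    print(f"warning: summary cites unknown timestamps: {', '.join(bad)}")
```

Here the check flags 5:30. In the CLI, the prompt text would be the output of format_transcript_for_prompt, and a non-empty result could trigger a warning or a retry.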

Where to take it next

Three reasonable extensions sit one evening’s work each: pipe the transcript through a topic-extraction pass before summarising (gives the user a clickable table of contents), wire a --stream flag onto the client.messages.stream API so summaries print token-by-token, or swap in a different Claude tier per length setting — Haiku for short, Sonnet for medium, Opus for long — to trade cost against quality on the per-summary axis.

The full tool sits in around 200 lines of Python and one pyproject.toml. The two libraries doing real work are youtube-transcript-api and anthropic; the rest is plumbing.

How this article was made: an autonomous AI pipeline researched, drafted, fact-checked, and reviewed this piece, aggregating publicly-available information from the sources consulted below. AI (artificial intelligence) can make mistakes, so please cross-check the consulted sources before acting on anything here. Neural Tech Daily is not liable for decisions or outcomes based on this article.

Sources consulted

Cited Sources

1. youtube-transcript-api GitHub README — library purpose and install instructions (accessed 2026-05-20)
2. Anthropic API — Messages reference (request schema) (accessed 2026-05-20)
3. Click documentation home — Python CLI framework (accessed 2026-05-20)
4. Anthropic pricing page — Claude Sonnet 4.5 input / output rate per 1M tokens (accessed 2026-05-20)
5. youtube-transcript-api on PyPI — current release line (accessed 2026-05-20)
6. Anthropic Python SDK on PyPI — supported version range (accessed 2026-05-20)
7. Click documentation — version stability and release notes (accessed 2026-05-20)
8. Python Packaging User Guide — recommended project layout (accessed 2026-05-20)
9. youtube-transcript-api README — documented exception types (accessed 2026-05-20)
10. Anthropic Messages API — request body shape (accessed 2026-05-20)
11. Anthropic models overview — grounded-citation prompting pattern (accessed 2026-05-20)
12. Click documentation — Choice parameter type (accessed 2026-05-20)
13. argparse documentation — Python standard-library CLI parser (accessed 2026-05-20)
14. Anthropic models overview — Claude Sonnet context-window capacity (accessed 2026-05-20)
15. Anthropic documentation — long-document summarisation guidance (accessed 2026-05-20)
16. Python Packaging User Guide — pyproject.toml tutorial (accessed 2026-05-20)
17. Twine documentation — uploading to PyPI with API tokens (accessed 2026-05-20)
18. youtube-transcript-api README + Issues — documented failure modes (accessed 2026-05-20)
