Neural Tech Daily

Build a Real-Time Voice Chatbot: Web Speech API + Claude Streaming + Web Audio (May 2026)

A browser-only voice chatbot: mic input via the Web Speech API, streamed Claude responses, and spoken playback through SpeechSynthesis. Pure HTML/JS, deployed on GitHub Pages.

~9 min read

Image: Anthropic news page for Claude Opus 4.7 (anthropic.com), used for editorial coverage of the streaming API surface discussed below.

TL;DR

This tutorial walks through a single-page browser chatbot that listens to the microphone, streams a Claude response token by token, and speaks the answer back, with no server and no build step, deployable to GitHub Pages. The stack is pure HTML/JS: the Web Speech API’s SpeechRecognition interface for transcription, the Anthropic Messages API with stream: true for the streamed reply 1 , and the Web Speech API’s SpeechSynthesis interface for playback.

Per Anthropic’s August 2024 CORS announcement, the Messages API accepts cross-origin requests when the client sends the explicit anthropic-dangerous-direct-browser-access: true header. The header’s deliberately ominous name signals that embedding a production API key in client JavaScript exposes it to anyone who opens DevTools 2 . The pattern fits prototype demos, bring-your-own-key tools, and internal apps with trusted users; it does not fit a public production product.

Per MDN, SpeechRecognition is flagged “Limited availability” and is not Baseline. Chrome, Edge, and Safari expose it under the webkitSpeechRecognition prefix; Firefox does not implement it 3 . The tutorial targets Chromium browsers as the primary path and degrades to text-typed input where the API is missing.

What you’ll need

  • A modern Chromium browser (Chrome 33+, Edge, Brave). Per MDN, SpeechRecognition access requires HTTPS or localhost 3 .
  • An Anthropic API key from the Console. The browser pattern is bring-your-own-key: the user pastes their own key into a field, and the page never ships a hardcoded production key.
  • A GitHub account for free static hosting via GitHub Pages 4 .
  • A text editor (VS Code, Sublime, or anything that saves UTF-8).

Total walkthrough time is roughly 40 minutes for a developer comfortable with vanilla JavaScript and the fetch API.

Step 1: Scaffold the page

Create a folder voice-chatbot/ and add index.html. The app is two static files, index.html here and app.js in Step 2, with no npm, no bundler, and no build script. Drop the following skeleton in:

<!doctype html>
<html lang="en">
  <head>
    <meta charset="utf-8" />
    <title>Voice Chatbot</title>
    <meta name="viewport" content="width=device-width,initial-scale=1" />
    <style>
      body { font: 16px/1.5 system-ui; max-width: 640px; margin: 2rem auto; padding: 0 1rem; }
      #log { border: 1px solid #ccc; padding: 1rem; min-height: 240px; margin: 1rem 0; }
      .user { color: #246; }
      .bot  { color: #262; }
      button { padding: 0.6rem 1.2rem; font-size: 1rem; }
    </style>
  </head>
  <body>
    <h1>Voice Chatbot</h1>
    <p><input id="apiKey" type="password" placeholder="sk-ant-... (your Anthropic key)" size="40" /></p>
    <p><button id="talkBtn">Hold to talk</button></p>
    <div id="log"></div>
    <script type="module" src="./app.js"></script>
  </body>
</html>

The API-key input is a type="password" field so the value isn’t shown on screen; the script reads it at request time and never persists it. For a public-facing demo, swap this for a server-side proxy that holds the key.

Image: Simon Willison’s blog post on Anthropic’s CORS rollout (simonwillison.net), used for editorial coverage of the browser-access header.

Step 2: Wire up speech recognition

Create app.js next to index.html. Start with the recognition plumbing. Per MDN, the constructor can appear under two names: the prefixed webkitSpeechRecognition that Chromium and Safari builds ship, and the unprefixed SpeechRecognition on the standards track 3 . A one-line probe covers both, followed by a guard for browsers that expose neither:

const Recognition = window.SpeechRecognition || window.webkitSpeechRecognition;
if (!Recognition) {
    document.getElementById('talkBtn').disabled = true;
    document.getElementById('log').textContent =
        'This browser does not expose SpeechRecognition. Try Chrome or Edge.';
}
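
The guard above disables the button but does not yet provide the typed-input fallback mentioned in the TL;DR. A minimal sketch of one follows. The names pickInputMode and mountTextFallback and the form markup are illustrative assumptions, not part of the original page; log and askClaude are the helpers defined later in this tutorial.

```javascript
// Hypothetical helper: decide which input path the page should use.
// `scope` is normally `window`; it is a parameter so the check can run anywhere.
function pickInputMode(scope) {
    const Ctor = scope.SpeechRecognition || scope.webkitSpeechRecognition;
    return typeof Ctor === 'function' ? 'voice' : 'text';
}

// Hypothetical fallback UI: a one-line form that sends typed text down the
// same path a spoken transcript would take.
function mountTextFallback(doc, container, onSubmit) {
    const form = doc.createElement('form');
    const input = doc.createElement('input');
    input.type = 'text';
    input.placeholder = 'Type your message';
    form.appendChild(input);
    form.addEventListener('submit', (e) => {
        e.preventDefault();               // keep the page from reloading
        const text = input.value.trim();
        if (text) onSubmit(text);
        input.value = '';
    });
    container.appendChild(form);
    return form;
}

// Browser-only wiring; skipped when there is no DOM.
if (typeof window !== 'undefined' && pickInputMode(window) === 'text') {
    mountTextFallback(document, document.body, (text) => {
        log('user', text);
        askClaude(text);
    });
}
```

For the fallback to work, the rest of app.js has to keep loading in browsers without the API, so the recognizer setup that follows should only run when Recognition is defined.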

When the API is available, configure the instance. MDN documents continuous (keep listening across pauses) and interimResults (emit partial transcripts before the speaker finishes) as the two most useful properties 3 . For a push-to-talk button, the simplest config is single-shot final results:

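// Null when the browser has no SpeechRecognition; the button is already disabled in that case.
const recognition = Recognition ? new Recognition() : null;

if (recognition) {
    recognition.lang = 'en-US';
    recognition.continuous = false;      // stop after one utterance
    recognition.interimResults = false;  // deliver only the final transcript

    recognition.onresult = (event) => {
        // First result, top-ranked alternative.
        const transcript = event.results[0][0].transcript;
        log('user', transcript);
        askClaude(transcript);
    };

    recognition.onerror = (event) => log('bot', `[recognition error: ${event.error}]`);
    recognition.onend = () => {
        document.getElementById('talkBtn').textContent = 'Hold to talk';
    };
}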
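
If a live caption is wanted while the user is still speaking, interimResults can be switched on and the result list split by each entry's isFinal flag. The sketch below assumes that variation; splitTranscript is a hypothetical helper name, and it relies only on the documented shape of event.results (array-like, each entry exposing isFinal and a best alternative at index 0).

```javascript
// Hypothetical helper: separate finished text from in-progress text.
// `results` is array-like (event.results); each entry has `isFinal`
// and a best alternative at index 0 with a `transcript` string.
function splitTranscript(results) {
    let finalText = '';
    let interimText = '';
    for (let i = 0; i < results.length; i++) {
        const best = results[i][0].transcript;
        if (results[i].isFinal) finalText += best;
        else interimText += best;
    }
    return { finalText: finalText.trim(), interimText: interimText.trim() };
}

// Usage sketch in the browser (assumes recognition.interimResults = true):
// recognition.onresult = (event) => {
//     const { finalText, interimText } = splitTranscript(event.results);
//     document.getElementById('talkBtn').textContent = interimText || 'Listening...';
//     if (finalText) { log('user', finalText); askClaude(finalText); }
// };
```

Only the final text is sent to the model, so partial guesses never trigger a request.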

Wire the button to start and stop the recognizer. A hold-to-talk pattern feels more deliberate than a toggle and avoids the API swallowing background speech:

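const btn = document.getElementById('talkBtn');

function startListening() {
    try {
        recognition.start(); // throws InvalidStateError if a session is already running
        btn.textContent = 'Listening...';
    } catch (err) {
        console.warn('Could not start recognition:', err);
    }
}

btn.addEventListener('mousedown', startListening);
btn.addEventListener('mouseup', () => recognition?.stop());
// preventDefault on touch suppresses the emulated mouse events that would follow.
btn.addEventListener('touchstart', (e) => { e.preventDefault(); startListening(); });
btn.addEventListener('touchend',   (e) => { e.preventDefault(); recognition?.stop(); });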

The first time the page calls recognition.start(), the browser shows a microphone-permission prompt. Per MDN, the permission is scoped to the origin and persists once granted 3 .
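
When the user declines that prompt, the recognizer reports it through onerror with a short error code. The sketch below maps the codes defined by the Web Speech API to readable messages; describeRecognitionError is a hypothetical helper and the message wording is an assumption, so adjust it to taste.

```javascript
// Hypothetical helper: turn a SpeechRecognition error code into a message.
// The codes are those defined by the Web Speech API; the wording is illustrative.
const RECOGNITION_MESSAGES = new Map([
    ['not-allowed', 'Microphone access was denied. Allow it in the site settings and try again.'],
    ['service-not-allowed', 'The browser does not permit the speech service on this page.'],
    ['audio-capture', 'No microphone could be opened.'],
    ['no-speech', 'No speech was detected. Hold the button and try again.'],
    ['network', 'The speech service could not be reached.'],
    ['aborted', 'Listening was cancelled.']
]);

function describeRecognitionError(code) {
    return RECOGNITION_MESSAGES.get(code) || `Recognition failed (${code}).`;
}

// Usage sketch in the browser:
// recognition.onerror = (event) => log('bot', describeRecognitionError(event.error));
```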

Image: Medium tutorial header for an in-browser speech-to-text walkthrough (medium.com), used for editorial coverage of the recognizer event model discussed below.

Step 3: Stream the Claude response

Anthropic’s Messages API streams responses via server-sent events when the request body sets stream: true 1 . The browser cannot use the EventSource constructor here because EventSource only issues GET requests and cannot attach custom headers such as x-api-key; instead, read the response body as a stream using fetch plus ReadableStream. The streaming event types Anthropic emits include message_start, content_block_start, content_block_delta, content_block_stop, message_delta, and message_stop. The content_block_delta events carry the incremental text 1 .
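
To make the event sequence concrete, the sketch below folds a list of already-parsed events into the final text and stop reason. foldStreamEvents is a hypothetical helper, and the sample events are hand-written to mirror the documented shapes rather than captured from a live response.

```javascript
// Sketch: reduce parsed stream events to the reply text and stop reason.
function foldStreamEvents(events) {
    let text = '';
    let stopReason = null;
    for (const ev of events) {
        if (ev.type === 'content_block_delta' && ev.delta?.type === 'text_delta') {
            text += ev.delta.text;                 // incremental text
        } else if (ev.type === 'message_delta' && ev.delta?.stop_reason) {
            stopReason = ev.delta.stop_reason;     // e.g. 'end_turn' or 'max_tokens'
        } else if (ev.type === 'error') {
            throw new Error(ev.error?.message || 'stream error');
        }
    }
    return { text, stopReason };
}

// Illustrative event sequence (hand-written, not a captured response).
const sampleEvents = [
    { type: 'message_start', message: { role: 'assistant' } },
    { type: 'content_block_start', index: 0, content_block: { type: 'text', text: '' } },
    { type: 'content_block_delta', index: 0, delta: { type: 'text_delta', text: 'Hello' } },
    { type: 'content_block_delta', index: 0, delta: { type: 'text_delta', text: ' world' } },
    { type: 'content_block_stop', index: 0 },
    { type: 'message_delta', delta: { stop_reason: 'end_turn' } },
    { type: 'message_stop' }
];

const folded = foldStreamEvents(sampleEvents);
// folded.text is 'Hello world' and folded.stopReason is 'end_turn'
```

A stop reason of max_tokens would indicate the reply was cut off by the max_tokens limit set on the request.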

Add the request function:

async function askClaude(prompt) {
    const apiKey = document.getElementById('apiKey').value.trim();
    if (!apiKey) { log('bot', '[paste your API key first]'); return; }

    const response = await fetch('https://api.anthropic.com/v1/messages', {
        method: 'POST',
        headers: {
            'content-type': 'application/json',
            'x-api-key': apiKey,
            'anthropic-version': '2023-06-01',
            'anthropic-dangerous-direct-browser-access': 'true'
        },
        body: JSON.stringify({
            model: 'claude-opus-4-7',
            max_tokens: 512,
            stream: true,
            messages: [{ role: 'user', content: prompt }]
        })
    });

    if (!response.ok) {
        log('bot', `[API error ${response.status}]`);
        return;
    }

    await readStream(response.body);
}

The anthropic-dangerous-direct-browser-access: true header is the explicit opt-in that unlocks CORS for the Messages endpoint. Without it, the browser blocks the request at the preflight stage 2 .
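
The request function above reports only the HTTP status. Anthropic's error responses carry a JSON body with an error object holding a type and a message, so a small formatter can surface more detail. describeApiError is a hypothetical helper, and the bracketed output format simply mirrors the log style used in this tutorial.

```javascript
// Hypothetical helper: build a log line from a failed response.
// `bodyText` is the raw response body; it may or may not be JSON.
function describeApiError(status, bodyText) {
    try {
        const parsed = JSON.parse(bodyText);
        if (parsed?.error?.message) {
            return `[API error ${status}: ${parsed.error.type} - ${parsed.error.message}]`;
        }
    } catch {
        // Body was not JSON; fall through to the status-only message.
    }
    return `[API error ${status}]`;
}

// Usage sketch inside askClaude:
// if (!response.ok) {
//     log('bot', describeApiError(response.status, await response.text()));
//     return;
// }
```

A 401 with an authentication_error type, for example, tells the user the pasted key is wrong, which a bare status code does not.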

Now the streaming reader. Server-sent event lines arrive as event: <type> followed by data: <json> blocks separated by blank lines 5 . The reader parses each data: line, looks for content_block_delta events whose delta.text field carries the incremental token, and feeds the chunk to the speech queue:

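async function readStream(body) {
    const reader = body.getReader();
    const decoder = new TextDecoder();
    let buffer = '';    // holds a trailing partial line between reads
    let fullText = '';  // complete reply, returned to the caller

    while (true) {
        const { done, value } = await reader.read();
        if (done) break;
        // stream: true keeps multi-byte characters intact across chunk boundaries
        buffer += decoder.decode(value, { stream: true });
        const lines = buffer.split('\n');
        buffer = lines.pop(); // last element is incomplete until its newline arrives

        for (const line of lines) {
            if (!line.startsWith('data: ')) continue; // skip event: lines and blanks
            const payload = JSON.parse(line.slice(6));
            if (payload.type === 'content_block_delta'
                && payload.delta?.type === 'text_delta') {
                const chunk = payload.delta.text;
                fullText += chunk;
                appendToLog(chunk);
                queueForSpeech(chunk);
            } else if (payload.type === 'error') {
                // The API can report a failure mid-stream as an error event.
                log('bot', `[stream error: ${payload.error?.message ?? 'unknown'}]`);
            }
        }
    }
    flushSpeech();
    return fullText;
}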

The buffer-and-split pattern handles the case where a single read() returns a partial line. Without it, an SSE line that straddles two network chunks would reach JSON.parse as a fragment and throw.
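
The same buffering can be isolated and exercised without a network connection. The sketch below is illustrative only: createLineSplitter is a hypothetical name, and the two chunks are hand-made to simulate a line that arrives in two reads.

```javascript
// Sketch: a stateful splitter that only releases complete lines.
function createLineSplitter() {
    let buffer = '';
    return function push(text) {
        buffer += text;
        const lines = buffer.split('\n');
        buffer = lines.pop(); // incomplete tail, or '' right after a newline
        return lines;
    };
}

const push = createLineSplitter();
const first = push('data: {"type":"content_blo');  // line is cut mid-JSON
const second = push('ck_delta"}\n\n');             // remainder plus the blank separator
// first  is []
// second is ['data: {"type":"content_block_delta"}', '']
```

The first call releases nothing, so JSON.parse never sees the fragment; the second call releases the reassembled line along with the blank separator line, which the startsWith check then skips.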

Image: Anthropic news page for Claude 3.5 Sonnet (anthropic.com), used for editorial coverage of the Messages streaming event model.

Step 4: Speak the response

Per MDN, SpeechSynthesis accepts a SpeechSynthesisUtterance object via its speak() method and queues each utterance: calling speak() while one utterance is mid-flight enqueues the next rather than interrupting it 6 . The naive approach is to wait until the full response arrives and call speak() once; that defeats the point of streaming. A more responsive pattern chunks the stream at sentence boundaries and speaks each sentence the moment a terminator arrives:

let speechBuffer = '';

function queueForSpeech(chunk) {
    speechBuffer += chunk;
    // Speak every complete sentence currently buffered, not just the first.
    let match;
    while ((match = speechBuffer.match(/^(.+?[.!?])\s+(.*)$/s))) {
        speak(match[1]);
        speechBuffer = match[2];
    }
}

function flushSpeech() {
    // Speak whatever remains once the stream ends.
    if (speechBuffer.trim()) speak(speechBuffer.trim());
    speechBuffer = '';
}

function speak(text) {
    const utterance = new SpeechSynthesisUtterance(text);
    utterance.lang = 'en-US';
    utterance.rate = 1.0;
    speechSynthesis.speak(utterance);
}

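The regex peels a complete sentence off the front of the buffer, sends it to the speech queue, and keeps the trailing fragment for the next chunk. Because the pattern requires whitespace after the terminator, the final sentence of a reply is spoken by flushSpeech once the stream ends. Sentence-level chunking hands the synthesizer a whole sentence at a time, so intonation rises and falls at sentence ends, which sounds more natural than per-token playback.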
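
To check the splitting rule outside the browser, the same regex can be wrapped in a pure function and fed sample buffers. takeSentences is a hypothetical helper written for this illustration; it returns the sentences it would speak plus the fragment it would keep.

```javascript
// Sketch: pull every complete sentence off the front of a buffer.
// A sentence ends at '.', '!' or '?' followed by whitespace.
function takeSentences(buffer) {
    const sentences = [];
    let rest = buffer;
    let match;
    while ((match = rest.match(/^(.+?[.!?])\s+(.*)$/s))) {
        sentences.push(match[1]);
        rest = match[2];
    }
    return { sentences, rest };
}

const midStream = takeSentences('Hi there. How are');
// midStream.sentences is ['Hi there.'], midStream.rest is 'How are'

const decimal = takeSentences('Version 3.5 is out');
// decimal.sentences is [], because the '.' is not followed by whitespace
```

One limitation worth knowing: an abbreviation such as "e.g." followed by a space is treated as a sentence end, so the voice may pause there.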

Step 5: A small logging helper

The code above calls a log helper for whole lines and an appendToLog helper for streamed text. Both are small:

const logEl = document.getElementById('log');
let currentBotLine = null;

function log(role, text) {
    const p = document.createElement('p');
    p.className = role;
    p.textContent = text;
    logEl.appendChild(p);
    if (role === 'bot') currentBotLine = p;
    else currentBotLine = null;
}

function appendToLog(chunk) {
    if (!currentBotLine) {
        currentBotLine = document.createElement('p');
        currentBotLine.className = 'bot';
        logEl.appendChild(currentBotLine);
    }
    currentBotLine.textContent += chunk;
}

The currentBotLine reference is reset whenever the user speaks, so the next bot reply starts on a fresh paragraph rather than appending to the previous one.

Step 6: Deploy on GitHub Pages

Per GitHub’s getting-started docs, a public repository whose root contains index.html publishes automatically when Pages is enabled from the Settings tab 4 . The free tier serves the site at https://<user>.github.io/<repo>/ over HTTPS. The HTTPS requirement matters because SpeechRecognition refuses to start on plain HTTP origins 3 .

git init
git add index.html app.js
git commit -m "voice chatbot"
git branch -M main
git remote add origin https://github.com/<user>/voice-chatbot.git
git push -u origin main

In the repo’s Settings → Pages, set Source to “Deploy from a branch”, branch main, folder / (root). The first deploy takes roughly a minute; subsequent pushes redeploy automatically.

Architecture recap

The full data flow runs end-to-end inside the browser tab:

Stage        | API                    | Browser surface
-------------|------------------------|--------------------------------
Mic capture  | Web Speech API         | SpeechRecognition.start()
Transcript   | Web Speech API         | onresult event
LLM request  | Anthropic Messages API | fetch with stream: true
LLM stream   | Server-sent events     | ReadableStream on response.body
Speech       | Web Speech API         | SpeechSynthesis.speak()

The Web Audio API named in the title is not part of this build's playback path. SpeechSynthesis plays directly through the browser's audio output and does not expose its output as a Web Audio node, so the page cannot filter or mix the synthesized voice. If a future iteration needs to manipulate the output (apply a low-pass filter, mix in background ambience), the upgrade path is to swap SpeechSynthesis for a server-side TTS that returns audio buffers, then play them through AudioContext nodes per the MDN Web Audio guide.
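
A minimal sketch of that upgrade path follows, assuming the encoded audio bytes have already been fetched from some TTS service (the service itself is hypothetical and out of scope here). playFiltered is an illustrative name; the AudioContext calls it uses (decodeAudioData, createBufferSource, createBiquadFilter) are standard Web Audio methods.

```javascript
// Sketch: play encoded audio bytes through a low-pass filter.
// `ctx` is an AudioContext; `encodedAudio` is an ArrayBuffer of encoded
// audio (for example, bytes returned by a hypothetical TTS endpoint).
async function playFiltered(ctx, encodedAudio, cutoffHz = 4000) {
    const audioBuffer = await ctx.decodeAudioData(encodedAudio);

    const source = ctx.createBufferSource();
    source.buffer = audioBuffer;

    const filter = ctx.createBiquadFilter();
    filter.type = 'lowpass';
    filter.frequency.value = cutoffHz;   // frequencies above this are attenuated

    // source -> filter -> speakers
    source.connect(filter);
    filter.connect(ctx.destination);
    source.start();
    return source;
}

// Usage sketch in the browser (the URL is a placeholder assumption):
// const ctx = new AudioContext();
// const bytes = await (await fetch('/tts?text=hello')).arrayBuffer();
// await playFiltered(ctx, bytes, 2500);
```

Passing the context in as a parameter keeps a single AudioContext alive for the whole page, which browsers generally prefer over creating one per utterance.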

Where the browser-only pattern breaks

The pattern works for prototypes, internal tools, and bring-your-own-key apps. Three places it does not work:

  1. Public production with a single shared API key. Anyone who opens DevTools sees the key. The anthropic-dangerous-direct-browser-access header name signals this risk directly 2 . Production apps need a server-side proxy that holds the key and forwards the request.
  2. Cross-browser support. Per MDN, SpeechRecognition is “Limited availability” and not Baseline; Firefox does not ship it at all 3 . Production voice features typically use a cloud STT service (Deepgram, AssemblyAI, Whisper API) for parity across browsers.
  3. Long conversations. The prototype above sends a single user message per request. For multi-turn memory, accumulate the message history in a JS array and send the full array on each call. The Messages API expects the full conversation on every request, not a session ID 7 .
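
The multi-turn approach in the third item can be sketched as a small history object. createConversation and its maxMessages cap are illustrative assumptions rather than anything the Messages API prescribes; the only API facts relied on are that each request carries the whole message array and that the array starts with a user turn.

```javascript
// Sketch: keep the running conversation and hand back the array to send.
// `maxMessages` is an assumed cap to stop the request growing without bound.
function createConversation(maxMessages = 20) {
    const messages = [];
    return {
        addUser(text) { messages.push({ role: 'user', content: text }); },
        addAssistant(text) { messages.push({ role: 'assistant', content: text }); },
        // Returns a copy holding the most recent turns, starting on a user turn.
        toRequestMessages() {
            let recent = messages.slice(-maxMessages);
            while (recent.length && recent[0].role !== 'user') {
                recent = recent.slice(1);
            }
            return recent;
        }
    };
}

// Usage sketch:
// const convo = createConversation();
// convo.addUser(transcript);
// ...send { messages: convo.toRequestMessages(), stream: true, ... }
// ...once the streamed reply is assembled (fullText in readStream):
// convo.addAssistant(fullText);
```

Trimming from the oldest end trades long-range memory for a bounded request size; a larger cap or a summarization step would be alternatives.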

Closing

The browser-only stack collapses what would otherwise be a Node.js backend, a WebSocket server, and two cloud services into a pair of static files. The cost is the bring-your-own-key constraint and the Chromium-first browser story. Per Anthropic’s CORS announcement, the constraint is intentional: the API supports direct browser calls precisely because the team accepts the pattern for internal and BYOK use cases while flagging the production risk in the header name itself 2 .

How this article was made: an autonomous AI pipeline researched, drafted, fact-checked, and reviewed this piece, aggregating publicly-available information from the sources consulted below. AI (artificial intelligence) can make mistakes, so please cross-check the consulted sources before acting on anything here. Neural Tech Daily is not liable for decisions or outcomes based on this article.

Sources consulted
