Build a Real-Time Voice Chatbot: Web Speech API + Claude Streaming + Web Audio (May 2026)
Browser-only voice chatbot — mic input via Web Speech API, streamed Claude responses, Web Audio playback. Pure HTML/JS deployed on GitHub Pages.
Image: Anthropic news page for Claude Opus 4.7 (anthropic.com), used for editorial coverage of the streaming API surface discussed below.
TL;DR
This tutorial walks through a single-page browser chatbot that
listens to the microphone, streams a Claude response token-by-token,
and speaks the answer back, with no server and no build step,
deployable to GitHub Pages. The stack is pure HTML/JS: the Web Speech API’s
SpeechRecognition interface for transcription, the Anthropic
Messages API with stream: true for token-by-token responses 1 ,
and the Web Speech API’s SpeechSynthesis interface for playback.
Per Anthropic’s August 2024 CORS announcement, the Messages API
accepts cross-origin requests when the client sends the explicit
anthropic-dangerous-direct-browser-access: true header. The
header’s deliberately ominous name signals that embedding a
production API key in client JavaScript exposes it to anyone who
opens DevTools 2 . The pattern fits prototype demos,
bring-your-own-key tools, and internal apps with trusted users; it
does not fit a public production product.
Per MDN, SpeechRecognition is flagged “Limited availability” and
is not Baseline. Chrome, Edge, and Safari expose it under the
webkitSpeechRecognition prefix; Firefox does not implement it 3 . The tutorial
targets Chromium browsers as the primary path and degrades to
text-typed input where the API is missing.
What you’ll need
- A modern Chromium browser (Chrome 33+, Edge, Brave). Per MDN,
SpeechRecognition access requires HTTPS or localhost 3 .
- An Anthropic API key from the Console. The browser pattern is bring-your-own-key: the user pastes their own key into a field, and the page never ships a hardcoded production key.
- A GitHub account for free static hosting via GitHub Pages 4 .
- A text editor (VS Code, Sublime, or anything that saves UTF-8).
Total walkthrough time is roughly 40 minutes for a developer
comfortable with vanilla JavaScript and the fetch API.
Step 1: Scaffold the page
Create a folder voice-chatbot/ with one file: index.html. The
entire app lives in a single file, with no npm, no bundler, and
no build script. Drop the following skeleton in:
<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8" />
<title>Voice Chatbot</title>
<meta name="viewport" content="width=device-width,initial-scale=1" />
<style>
body { font: 16px/1.5 system-ui; max-width: 640px; margin: 2rem auto; padding: 0 1rem; }
#log { border: 1px solid #ccc; padding: 1rem; min-height: 240px; margin: 1rem 0; }
.user { color: #246; }
.bot { color: #262; }
button { padding: 0.6rem 1.2rem; font-size: 1rem; }
</style>
</head>
<body>
<h1>Voice Chatbot</h1>
<p><input id="apiKey" type="password" placeholder="sk-ant-... (your Anthropic key)" size="40" /></p>
<p><button id="talkBtn">Hold to talk</button></p>
<div id="log"></div>
<script type="module" src="./app.js"></script>
</body>
</html>
The API-key input is a type="password" field so the value isn’t
shown on screen; the script reads it at request time and never
persists it. For a public-facing demo, swap this for a server-side
proxy that holds the key.
Image: Simon Willison’s blog post on Anthropic’s CORS rollout (simonwillison.net), used for editorial coverage of the browser-access header.
Step 2: Wire up speech recognition
Create app.js next to index.html. Start with the recognition
plumbing. Per MDN, the constructor is exposed under two names:
webkitSpeechRecognition on Chromium and Safari’s older builds,
SpeechRecognition on the standards track 3 . A
two-line probe handles both:
const Recognition = window.SpeechRecognition || window.webkitSpeechRecognition;
if (!Recognition) {
document.getElementById('talkBtn').disabled = true;
document.getElementById('log').textContent =
'This browser does not expose SpeechRecognition. Try Chrome or Edge.';
}
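The TL;DR mentions degrading to typed input, but the probe above only disables the button. One way the fallback might look is sketched here. Both `detectInputMode` and `installTextFallback` are hypothetical helper names, and the form markup is an assumption rather than part of the original page.

```javascript
// Hypothetical helpers (not part of the original app.js).
// Decide which input path to use from a window-like object.
function detectInputMode(win) {
  const ctor = win.SpeechRecognition || win.webkitSpeechRecognition;
  return typeof ctor === 'function' ? 'voice' : 'text';
}

// Build a one-line text form and hand each submitted prompt to onSubmit.
// `doc` is a Document; `container` is the element the form is appended to.
function installTextFallback(doc, container, onSubmit) {
  const form = doc.createElement('form');
  const input = doc.createElement('input');
  input.type = 'text';
  input.placeholder = 'Type your message';
  const send = doc.createElement('button');
  send.type = 'submit';
  send.textContent = 'Send';
  form.append(input, send);
  form.addEventListener('submit', (e) => {
    e.preventDefault(); // keep the page from reloading
    const text = input.value.trim();
    if (text) onSubmit(text);
    input.value = '';
  });
  container.appendChild(form);
  return form;
}
```

In app.js this could be wired as `if (detectInputMode(window) === 'text') installTextFallback(document, document.body, (t) => { log('user', t); askClaude(t); });`, reusing the helpers defined in later steps.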
When the API is available, configure the instance. MDN documents
continuous (keep listening across pauses) and interimResults
(emit partial transcripts before the speaker finishes) as the two
most useful properties 3 . For a push-to-talk button,
the simplest config is single-shot final results:
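// Only construct the recognizer when the browser exposes the API;
// otherwise `new Recognition()` would throw and halt the script.
const recognition = Recognition ? new Recognition() : null;
if (recognition) {
  recognition.lang = 'en-US';
  recognition.continuous = false;     // stop after a single utterance
  recognition.interimResults = false; // deliver final transcripts only
  recognition.onresult = (event) => {
    const transcript = event.results[0][0].transcript;
    log('user', transcript);
    askClaude(transcript);
  };
  recognition.onerror = (event) => log('bot', `[recognition error: ${event.error}]`);
  recognition.onend = () => {
    document.getElementById('talkBtn').textContent = 'Hold to talk';
  };
}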
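The interimResults flag described above changes the shape of what onresult receives: each entry in event.results carries an isFinal flag, and partial entries keep arriving until the speaker stops. As a minimal sketch of how the two kinds could be separated (the helper name `collectTranscripts` is hypothetical):

```javascript
// Split a SpeechRecognition result list into final and interim text.
// Accepts the real event.results or any array-like with the same shape:
// results[i].isFinal and results[i][0].transcript.
function collectTranscripts(results) {
  let finalText = '';
  let interimText = '';
  for (let i = 0; i < results.length; i++) {
    const best = results[i][0]; // highest-ranked alternative
    if (results[i].isFinal) finalText += best.transcript;
    else interimText += best.transcript;
  }
  return { finalText, interimText };
}
```

With `recognition.interimResults = true`, a handler could show `interimText` as a live preview and call askClaude only once `finalText` is non-empty and `interimText` is empty.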
Wire the button to start and stop the recognizer. A hold-to-talk pattern feels more deliberate than a toggle and avoids the API swallowing background speech:
const btn = document.getElementById('talkBtn');
function startListening() {
  if (!recognition) return;
  try {
    recognition.start();
    btn.textContent = 'Listening...';
  } catch (err) {
    // start() throws InvalidStateError if the recognizer is already running.
  }
}
btn.addEventListener('mousedown', startListening);
btn.addEventListener('mouseup', () => recognition?.stop());
btn.addEventListener('touchstart', (e) => { e.preventDefault(); startListening(); });
btn.addEventListener('touchend', (e) => { e.preventDefault(); recognition?.stop(); });
The first time the page calls recognition.start(), the browser
shows a microphone-permission prompt. Per MDN, the permission is
origin-scoped and persists per origin once granted 3 .
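If the user declines that prompt, the recognizer's error event fires with a short code rather than a readable message. The sketch below maps a few codes from the Web Speech API's error set to friendlier text. The helper name `describeRecognitionError` and the wording of each hint are assumptions, and the list is not exhaustive.

```javascript
// Hypothetical lookup from SpeechRecognition error codes to user-facing hints.
const RECOGNITION_ERROR_HINTS = new Map([
  ['no-speech', 'No speech was detected. Hold the button and try again.'],
  ['audio-capture', 'No microphone was found or it could not be opened.'],
  ['not-allowed', 'Microphone permission was denied for this page.'],
  ['service-not-allowed', 'The browser blocked the speech service for this page.'],
  ['network', 'The recognition service could not be reached.'],
  ['aborted', 'Listening was cancelled.'],
  ['language-not-supported', 'The requested language is not supported.']
]);

// Return a hint for a known code, or a generic message that includes the code.
function describeRecognitionError(code) {
  return RECOGNITION_ERROR_HINTS.get(code) || `Recognition failed (${code}).`;
}
```

The onerror handler could then log `describeRecognitionError(event.error)` instead of the raw code.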
Image: Medium tutorial header for an in-browser speech-to-text walkthrough (medium.com), used for editorial coverage of the recognizer event model discussed below.
Step 3: Stream the Claude response
Anthropic’s Messages API streams responses via server-sent events
when the request body sets stream: true 1 . The
browser cannot use the EventSource constructor here because
EventSource only issues GET requests and cannot attach custom
headers such as x-api-key; instead, read the response
body as a stream using fetch plus ReadableStream. The streaming
event types Anthropic emits include message_start, content_block_start,
content_block_delta, content_block_stop, message_delta, and
message_stop. The delta events carry the incremental text 1 .
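To make the event sequence concrete, here is a small sketch that reduces a list of already-parsed events to the reply text. The `sampleEvents` array is hand-written and abbreviated for illustration; real events carry more fields. The helper name `textFromEvents` is hypothetical.

```javascript
// Concatenate the text carried by content_block_delta events.
function textFromEvents(events) {
  let text = '';
  for (const ev of events) {
    if (ev.type === 'content_block_delta' && ev.delta && ev.delta.type === 'text_delta') {
      text += ev.delta.text;
    }
  }
  return text;
}

// Abbreviated, hand-written sample in the order the event types are listed above.
const sampleEvents = [
  { type: 'message_start', message: { role: 'assistant' } },
  { type: 'content_block_start', index: 0, content_block: { type: 'text', text: '' } },
  { type: 'content_block_delta', index: 0, delta: { type: 'text_delta', text: 'Hello' } },
  { type: 'content_block_delta', index: 0, delta: { type: 'text_delta', text: ', world.' } },
  { type: 'content_block_stop', index: 0 },
  { type: 'message_delta', delta: { stop_reason: 'end_turn' } },
  { type: 'message_stop' }
];

console.log(textFromEvents(sampleEvents)); // prints "Hello, world."
```

Only the two delta events contribute text; the start and stop events mark boundaries and carry metadata.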
Add the request function:
async function askClaude(prompt) {
  // Read the key at request time; it is never stored.
  const apiKey = document.getElementById('apiKey').value.trim();
  if (!apiKey) { log('bot', '[paste your API key first]'); return; }
  let response;
  try {
    response = await fetch('https://api.anthropic.com/v1/messages', {
      method: 'POST',
      headers: {
        'content-type': 'application/json',
        'x-api-key': apiKey,
        'anthropic-version': '2023-06-01',
        // Explicit opt-in for direct browser (CORS) access.
        'anthropic-dangerous-direct-browser-access': 'true'
      },
      body: JSON.stringify({
        model: 'claude-opus-4-7',
        max_tokens: 512,
        stream: true,
        messages: [{ role: 'user', content: prompt }]
      })
    });
  } catch (err) {
    // fetch rejects on network or CORS failures instead of returning a response.
    log('bot', `[network error: ${err.message}]`);
    return;
  }
  if (!response.ok) {
    log('bot', `[API error ${response.status}]`);
    return;
  }
  await readStream(response.body);
}
The anthropic-dangerous-direct-browser-access: true header is the
explicit opt-in that unlocks CORS for the Messages endpoint. Without
it, the browser blocks the request at the preflight stage 2 .
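When a request does get through but fails (for example, a mistyped key), the status code alone is not very informative. The sketch below assumes the error body follows the JSON envelope `{ type: 'error', error: { type, message } }` and falls back to the bare status otherwise. The helper name `describeApiError` is hypothetical.

```javascript
// Turn a non-2xx response into a readable log line.
// Assumes an error envelope of { type: 'error', error: { type, message } };
// any other body shape falls back to the status code only.
function describeApiError(status, bodyText) {
  try {
    const body = JSON.parse(bodyText);
    if (body && body.error && body.error.message) {
      return `[API error ${status}: ${body.error.type}: ${body.error.message}]`;
    }
  } catch (err) {
    // Body was not JSON; fall through to the generic message.
  }
  return `[API error ${status}]`;
}
```

Inside the `!response.ok` branch, the log call could become `log('bot', describeApiError(response.status, await response.text()))`.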
Now the streaming reader. Server-sent event lines arrive as
event: <type> followed by data: <json> blocks separated by blank
lines 5 . The reader parses each data: line, looks
for content_block_delta events whose delta.text field carries
the incremental token, and feeds the chunk to the speech queue:
async function readStream(body) {
  const reader = body.getReader();
  const decoder = new TextDecoder();
  let buffer = '';   // holds any trailing partial line between reads
  let fullText = '';
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    // stream: true keeps multi-byte characters intact across chunk boundaries.
    buffer += decoder.decode(value, { stream: true });
    const lines = buffer.split('\n');
    buffer = lines.pop(); // last element is incomplete until its newline arrives
    for (const line of lines) {
      if (!line.startsWith('data: ')) continue; // skip event: lines and blanks
      const payload = JSON.parse(line.slice(6));
      if (payload.type === 'content_block_delta'
          && payload.delta?.type === 'text_delta') {
        const chunk = payload.delta.text;
        fullText += chunk;
        appendToLog(chunk);
        queueForSpeech(chunk);
      }
    }
  }
  flushSpeech();
  return fullText; // complete reply, for callers that keep history
}
The buffer-and-split pattern handles the case where a single
read() returns a partial line. Without it, a data: line that is
split across two network chunks would reach JSON.parse incomplete
and throw a SyntaxError.
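The same buffering idea can be isolated from the network code and exercised with plain strings. This is a sketch only; `createLineBuffer` and `parseDataLines` are hypothetical names, and the ping payload is just a convenient small example.

```javascript
// Accumulates decoded text and returns only complete lines.
function createLineBuffer() {
  let pending = '';
  return {
    push(text) {
      pending += text;
      const lines = pending.split('\n');
      pending = lines.pop(); // last piece may be an unfinished line
      return lines;
    }
  };
}

// Extract parsed JSON payloads from complete SSE lines.
function parseDataLines(lines) {
  const payloads = [];
  for (const line of lines) {
    if (line.startsWith('data: ')) payloads.push(JSON.parse(line.slice(6)));
  }
  return payloads;
}

// A data: line split across two chunks is held back until it is whole.
const lineBuffer = createLineBuffer();
const firstBatch = parseDataLines(lineBuffer.push('event: ping\ndata: {"type":"pi'));
const secondBatch = parseDataLines(lineBuffer.push('ng"}\n\n'));
console.log(firstBatch.length, secondBatch.length); // prints "0 1"
```

The first push yields no payloads because the data: line has no terminating newline yet; the second push completes it.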
Image: Anthropic news page for Claude 3.5 Sonnet (anthropic.com), used for editorial coverage of the Messages streaming event model.
Step 4: Speak the response
Per MDN, SpeechSynthesis accepts a SpeechSynthesisUtterance
object via its speak() method and queues each utterance: calling
speak() while one utterance is mid-flight enqueues the next
rather than interrupting it 6 . The naive approach is
to wait until the full response arrives and call speak() once;
that defeats the point of streaming. A more responsive pattern
chunks the stream at sentence boundaries and speaks each sentence
the moment a terminator arrives:
let speechBuffer = '';
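function queueForSpeech(chunk) {
  speechBuffer += chunk;
  // Speak every complete sentence currently buffered, not just the first,
  // because a single chunk can contain more than one terminator.
  let match;
  while ((match = speechBuffer.match(/^(.+?[.!?])\s+(.*)$/s))) {
    speak(match[1]);
    speechBuffer = match[2];
  }
}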
function flushSpeech() {
if (speechBuffer.trim()) speak(speechBuffer);
speechBuffer = '';
}
function speak(text) {
const utterance = new SpeechSynthesisUtterance(text);
utterance.lang = 'en-US';
utterance.rate = 1.0;
speechSynthesis.speak(utterance);
}
The regex captures the first complete sentence in the buffer, sends it to the speech queue, and keeps the trailing fragment for the next chunk. Sentence-level chunking gives the playback prosody, so the synth voice rises and falls at periods, which sounds more natural than per-token playback.
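Setting utterance.lang leaves the choice of voice to the browser. If a specific voice is wanted, one option is to pick from the list that speechSynthesis.getVoices() returns. The sketch below works on any array of objects with `name`, `lang`, and `default` fields; the helper name `pickVoice` and its preference order are assumptions.

```javascript
// Choose a voice for a BCP 47 language tag from a getVoices()-shaped list.
// Preference: exact tag match, then same base language; within the chosen
// pool, the voice flagged as default wins, otherwise the first entry.
function pickVoice(voices, lang) {
  const wanted = lang.toLowerCase();
  const base = wanted.split('-')[0];
  const exact = voices.filter((v) => v.lang.toLowerCase() === wanted);
  const sameLanguage = voices.filter(
    (v) => v.lang.toLowerCase().split(/[-_]/)[0] === base
  );
  const pool = exact.length ? exact : sameLanguage;
  return pool.find((v) => v.default) || pool[0] || null;
}
```

In speak(), this could be used as `const voice = pickVoice(speechSynthesis.getVoices(), 'en-US'); if (voice) utterance.voice = voice;`. In some browsers the voice list is empty until the `voiceschanged` event has fired, so a null result should be treated as "use the default".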
Step 5: A small logging helper
The body references a log helper and an appendToLog helper for
streaming text. Both are small:
const logEl = document.getElementById('log');
let currentBotLine = null;
function log(role, text) {
const p = document.createElement('p');
p.className = role;
p.textContent = text;
logEl.appendChild(p);
if (role === 'bot') currentBotLine = p;
else currentBotLine = null;
}
function appendToLog(chunk) {
if (!currentBotLine) {
currentBotLine = document.createElement('p');
currentBotLine.className = 'bot';
logEl.appendChild(currentBotLine);
}
currentBotLine.textContent += chunk;
}
The currentBotLine reference is reset whenever the user speaks,
so the next bot reply starts on a fresh paragraph rather than
appending to the previous one.
Step 6: Deploy on GitHub Pages
Per GitHub’s getting-started docs, a public repository whose root
contains index.html publishes automatically when Pages is enabled
from the Settings tab 4 . The free tier serves the
site at https://<user>.github.io/<repo>/ over HTTPS. The HTTPS
requirement matters because SpeechRecognition refuses to start
on plain HTTP origins 3 .
git init
git add index.html app.js
git commit -m "voice chatbot"
git branch -M main
git remote add origin https://github.com/<user>/voice-chatbot.git
git push -u origin main
In the repo’s Settings → Pages, set Source to “Deploy from a
branch”, branch main, folder / (root). The first deploy takes
roughly a minute; subsequent pushes redeploy automatically.
Architecture recap
The full data flow runs end-to-end inside the browser tab:
| Stage | API | Browser surface |
|---|---|---|
| Mic capture | Web Speech API | SpeechRecognition.start() |
| Transcript | Web Speech API | onresult event |
| LLM request | Anthropic Messages API | fetch with stream: true |
| LLM stream | Server-sent events | ReadableStream on response.body |
| Speech | Web Speech API | SpeechSynthesis.speak() |
The Web Audio API named in the title is not called directly by this
build. SpeechSynthesis plays through the browser's own speech
output, and the Web Speech API does not expose that audio as a
stream or node that an AudioContext could process. If a future
iteration needs to manipulate the synth
output (apply a low-pass filter, mix in background ambience), the
upgrade path is to swap SpeechSynthesis for a server-side TTS
that returns audio buffers, then play them through AudioContext
nodes per the MDN Web Audio guide.
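As a rough sketch of that upgrade path, the function below plays a decoded buffer through a low-pass filter. It assumes the audio has already been fetched from some TTS service and decoded into an AudioBuffer; the helper name `playFiltered` and the default cutoff are illustrative choices.

```javascript
// Play an AudioBuffer through a low-pass filter.
// `ctx` is an AudioContext; `audioBuffer` is a decoded AudioBuffer.
function playFiltered(ctx, audioBuffer, cutoffHz = 1200) {
  const source = ctx.createBufferSource();
  source.buffer = audioBuffer;

  const filter = ctx.createBiquadFilter();
  filter.type = 'lowpass';
  filter.frequency.value = cutoffHz; // attenuate content above the cutoff

  // Graph: source -> filter -> speakers
  source.connect(filter);
  filter.connect(ctx.destination);
  source.start();
  return source; // caller can stop() it or listen for 'ended'
}
```

In a browser, the inputs could come from `const ctx = new AudioContext();` and `const audioBuffer = await ctx.decodeAudioData(arrayBuffer);`, where `arrayBuffer` holds the bytes returned by the TTS service.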
Where the browser-only pattern breaks
The pattern works for prototypes, internal tools, and bring-your-own-key apps. Three places it does not work:
- Public production with a single shared API key. Anyone who
opens DevTools sees the key. The
anthropic-dangerous-direct-browser-access header name signals this risk directly 2 . Production apps need a server-side proxy that holds the key and forwards the request.
- Cross-browser support. Per MDN,
SpeechRecognition is “Limited availability” and not Baseline; Firefox does not ship it at all 3 . Production voice features typically use a cloud STT service (Deepgram, AssemblyAI, Whisper API) for parity across browsers.
- Long conversations. The prototype above sends a single user message per request. For multi-turn memory, accumulate the message history in a JS array and send the full array on each call. The Messages API expects the full conversation on every request, not a session ID 7 .
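The multi-turn point in the last bullet can be sketched as a small history holder. The name `createConversation` and its method names are hypothetical; the request fields mirror the ones the tutorial's askClaude already sends.

```javascript
// Keeps the running message history for a multi-turn chat.
function createConversation(model, maxTokens) {
  const messages = [];
  return {
    addUser(text) { messages.push({ role: 'user', content: text }); },
    addAssistant(text) { messages.push({ role: 'assistant', content: text }); },
    // Request body containing every turn so far (copied so callers
    // cannot mutate the stored history by accident).
    toRequestBody() {
      return { model, max_tokens: maxTokens, stream: true, messages: messages.slice() };
    }
  };
}

const chat = createConversation('claude-opus-4-7', 512);
chat.addUser('What is the capital of France?');
chat.addAssistant('Paris.');
chat.addUser('And its population?');
console.log(chat.toRequestBody().messages.length); // prints 3
```

To use it, askClaude would call `chat.addUser(prompt)` before the fetch, send `JSON.stringify(chat.toRequestBody())` as the body, and call `chat.addAssistant(...)` with the finished reply, which assumes readStream is adapted to return the accumulated text.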
Closing
The browser-only stack collapses what would otherwise be a Node.js backend, a WebSocket server, and two cloud services into a single HTML file. The cost is the bring-your-own-key constraint and the Chromium-first browser story. Per Anthropic’s CORS announcement, the constraint is intentional: the API supports direct browser calls precisely because the team accepts the pattern for internal and BYOK use cases while flagging the production risk in the header name itself 2 .
How this article was made: an autonomous AI pipeline researched, drafted, fact-checked, and reviewed this piece, aggregating publicly-available information from the sources consulted below. AI (artificial intelligence) can make mistakes, so please cross-check the consulted sources before acting on anything here. Neural Tech Daily is not liable for decisions or outcomes based on this article.
Sources consulted
Cited Sources
- 1. Anthropic: Streaming Messages reference describing `stream: true`, server-sent events transport, and `content_block_delta` event type
- 2. Simon Willison: Claude's API now supports CORS requests, documenting the `anthropic-dangerous-direct-browser-access` header and its intended scope
- 3. MDN: SpeechRecognition reference noting Limited availability status, `webkitSpeechRecognition` prefix, and HTTPS requirement
- 4. GitHub Docs: GitHub Pages getting started, documenting the deploy-from-a-branch flow for static sites
- 5. MDN: Server-sent events reference describing the `data:` line format and event framing
- 6. MDN: SpeechSynthesis reference documenting `speak()` queuing behaviour and `SpeechSynthesisUtterance`
- 7. Anthropic: Messages API reference describing the stateless full-history request shape
Further Reading
- MDN: Web Speech API
- MDN: SpeechSynthesisUtterance
- MDN: Web Audio API