High-Quality Structured Data
The largest structured dataset of American radio drama in existence. 233,264 audio files. 6,408 series. Six decades of broadcast history, organized, cataloged, and machine-readable.
What Makes This Different
There is no other dataset like this. Not on Hugging Face. Not on Kaggle. Not in any university archive. The Radio Drama Corpus is tens of thousands of hours of professional multi-speaker dramatic dialogue — written by professional writers, performed by professional actors, directed and produced for national broadcast. Every genre. Every major American network. 1920 through 1982.
This is not podcast audio. Not audiobook narration. Not synthetic speech. These are real performances — scripted scenes with multiple speakers, natural turn-taking, emotional range, sound effects, and music cues. The kind of data that speech and language models are starving for.
Storage Architecture
Every file lives in a structured object store with a deterministic path scheme.
files/{collection}/{Series Name}/{YYYY-MM-DD}/{filename}/{filename}.mp3
The derivative folder structure sits alongside each source file: HLS segments, waveform data, transcripts, and analysis outputs all share the same parent path. No guessing. No flat buckets. Every file addressable by series, airdate, and filename.
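As a minimal sketch of the path scheme above, the following Python builds an object key from its four components and derives the parent folder that a source file shares with its derivatives. The function names, the collection label "public-domain", and the sample series and filename are illustrative assumptions, not identifiers taken from the corpus.

```python
from datetime import date
from pathlib import PurePosixPath


def object_key(collection: str, series: str, airdate: date, filename: str) -> str:
    """Build a source-file key following the template
    files/{collection}/{Series Name}/{YYYY-MM-DD}/{filename}/{filename}.mp3
    """
    for part in (collection, series, filename):
        # A slash inside a component would silently add a path level.
        if not part or "/" in part:
            raise ValueError("path components must be non-empty and contain no '/'")
    return str(
        PurePosixPath(
            "files", collection, series, airdate.isoformat(), filename, f"{filename}.mp3"
        )
    )


def parent_prefix(key: str) -> str:
    """Return the folder prefix shared by a source file and its derivatives."""
    return str(PurePosixPath(key).parent) + "/"


# Illustrative values only (hypothetical collection label, series, and filename).
key = object_key("public-domain", "Example Series", date(1950, 1, 1), "ep001")
prefix = parent_prefix(key)
```

Because the key is a pure function of collection, series, airdate, and filename, any consumer can compute where a file and its derivatives live without listing the bucket.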
Object Storage
5 TB across 233,264 files. Two collection tiers: public domain (9,970 files, 184 GB) and research collection (223,294 files, 2.14 TB). Cloudflare R2 with global edge distribution.
Relational Database
Turso (libSQL) with structured tables for shows, episodes, airdates, networks, loglines, and file metadata. Edge-replicated. Sub-millisecond reads worldwide.
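To make the relational layout concrete, here is a rough sketch of what tables for networks, shows, and episodes could look like. libSQL is a fork of SQLite, so Python's standard-library sqlite3 module serves as a local stand-in here. Every table name, column name, and sample row below is a hypothetical illustration; the actual schema is not documented on this page.

```python
import sqlite3

# HYPOTHETICAL schema: table and column names are assumptions for illustration.
SCHEMA = """
CREATE TABLE networks (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE shows (
    id         INTEGER PRIMARY KEY,
    title      TEXT NOT NULL,
    network_id INTEGER REFERENCES networks(id)
);
CREATE TABLE episodes (
    id         INTEGER PRIMARY KEY,
    show_id    INTEGER NOT NULL REFERENCES shows(id),
    airdate    TEXT NOT NULL,          -- ISO 8601 date, YYYY-MM-DD
    logline    TEXT,
    object_key TEXT NOT NULL UNIQUE    -- path of the source file in object storage
);
CREATE INDEX idx_episodes_show_airdate ON episodes(show_id, airdate);
"""


def open_catalog() -> sqlite3.Connection:
    """Create an in-memory database with the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def episodes_for_show(conn: sqlite3.Connection, title: str) -> list:
    """Return (airdate, logline) rows for one show, oldest first."""
    cursor = conn.execute(
        """
        SELECT e.airdate, e.logline
        FROM episodes AS e
        JOIN shows AS s ON s.id = e.show_id
        WHERE s.title = ?
        ORDER BY e.airdate
        """,
        (title,),
    )
    return cursor.fetchall()


# Placeholder rows, inserted out of airdate order on purpose.
conn = open_catalog()
conn.execute("INSERT INTO networks (id, name) VALUES (1, 'Example Network')")
conn.execute("INSERT INTO shows (id, title, network_id) VALUES (1, 'Example Series', 1)")
conn.executemany(
    "INSERT INTO episodes (show_id, airdate, logline, object_key) VALUES (?, ?, ?, ?)",
    [
        (1, "1950-01-08", "Placeholder logline B.", "files/x/Example Series/1950-01-08/b/b.mp3"),
        (1, "1950-01-01", "Placeholder logline A.", "files/x/Example Series/1950-01-01/a/a.mp3"),
    ],
)
rows = episodes_for_show(conn, "Example Series")
```

Storing airdates as ISO 8601 text keeps lexical order equal to chronological order, so the index on (show_id, airdate) can serve a per-show timeline directly.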
Structured Metadata
Every series and episode cataloged with machine-readable fields.
Series-Level
Episode-Level
Loglines
Every episode has a logline — a one-sentence plot summary that captures the narrative hook of each broadcast. Consistent in style across the entire corpus. These are not transcripts — they are structured descriptions of what happens in each episode.
A man trapped in a coffin buried alive must convince the gravedigger above to dig him out before his air runs out.
A crew of astronauts discovers that the alien civilization they came to study has been studying them the entire time.
Sergeant Friday tracks a stolen shipment of industrial dynamite through the warehouses of downtown Los Angeles.
What You Can Build
Speech Synthesis
Train TTS models on real dramatic performances. Multi-speaker, multi-emotion, multi-genre. Tens of thousands of hours of professional voice acting.
Speaker Diarization
Full-cast productions with multiple speakers per scene. Natural conversational turn-taking. Sound effects and music provide segmentation challenges at scale.
Narrative Understanding
Genre classification. Plot structure analysis. Character relationship extraction. Dialogue generation. Style transfer between decades and genres.
Audio Retrieval
Cross-modal search (text-to-audio, audio-to-text). Logline-to-episode matching. Content-based recommendation. Duplicate detection at scale.
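Logline-to-episode matching can be prototyped with nothing more than word overlap. The sketch below ranks the three sample loglines quoted earlier against a free-text query using bag-of-words cosine similarity. It is a lexical stand-in only: a real retrieval system would more likely use learned text or audio embeddings, and the episode identifiers ("ep-a" and so on) are hypothetical.

```python
import math
import re
from collections import Counter

# Sample loglines from this page, keyed by hypothetical episode identifiers.
LOGLINES = {
    "ep-a": "A man trapped in a coffin buried alive must convince the gravedigger "
            "above to dig him out before his air runs out.",
    "ep-b": "A crew of astronauts discovers that the alien civilization they came "
            "to study has been studying them the entire time.",
    "ep-c": "Sergeant Friday tracks a stolen shipment of industrial dynamite "
            "through the warehouses of downtown Los Angeles.",
}


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def rank_episodes(query: str, loglines: dict[str, str]) -> list[str]:
    """Return episode ids ordered from best to worst match for the query."""
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(text))), ep) for ep, text in loglines.items()]
    # Sort by descending score, breaking ties by episode id for determinism.
    return [ep for _, ep in sorted(scored, key=lambda pair: (-pair[0], pair[1]))]


best = rank_episodes("astronauts study an alien civilization", LOGLINES)[0]
```

The consistent style of the loglines is what makes even this crude lexical baseline usable; swapping `cosine` over word counts for a similarity over embedding vectors would leave the ranking function unchanged.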
Cultural Analytics
Six decades of American popular culture. Representation studies. Linguistic drift analysis. Advertising history. Cold War propaganda. Gender roles in mid-century media.
Transcription & Alignment
Forced alignment between audio and transcripts. ASR benchmarking on pre-digital recordings with noise, compression, and varying audio quality.
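ASR benchmarking usually comes down to word error rate (WER): the number of word substitutions, deletions, and insertions needed to turn the system output into the reference transcript, divided by the reference length. The following is a small self-contained sketch of that metric, assuming whitespace tokenization and case-insensitive comparison; it is not a description of how the corpus transcripts are actually scored.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("air" -> "hair") plus one deletion ("out") over four words.
wer = word_error_rate("the air runs out", "the hair runs")
```

Note that WER can exceed 1.0 when the hypothesis contains many inserted words, which is worth keeping in mind on noisy recordings where music or sound effects are transcribed as speech.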
For Listeners
Not a researcher? Radio Index is the streaming platform built on top of this data. Browse by show, genre, decade, or network. Background playback on iOS. Full VoiceOver accessibility.
Technology Stack
Built on production infrastructure from the companies building the future of AI, cloud, and media processing.
Processing Pipeline
Every episode runs through a multi-stage enrichment pipeline.
Ingest
Audio files ingested into structured R2 storage. Filename parsed for series, airdate, and episode metadata.
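The filename convention itself is not documented here, so the following sketch assumes a purely hypothetical pattern, "<Series Name> - <YYYY-MM-DD> - <Episode Title>.mp3", to show what parsing a filename into series, airdate, and episode metadata might look like. The pattern, the class, and the function name are all illustrative assumptions.

```python
import re
from dataclasses import dataclass
from datetime import date
from typing import Optional

# HYPOTHETICAL filename convention, for illustration only:
#   "<Series Name> - <YYYY-MM-DD> - <Episode Title>.mp3"
FILENAME_PATTERN = re.compile(
    r"^(?P<series>.+?) - (?P<airdate>\d{4}-\d{2}-\d{2}) - (?P<title>.+)\.mp3$",
    re.IGNORECASE,
)


@dataclass(frozen=True)
class ParsedFilename:
    series: str
    airdate: date
    title: str


def parse_filename(name: str) -> Optional[ParsedFilename]:
    """Extract series, airdate, and title, or return None if the name does not fit."""
    match = FILENAME_PATTERN.match(name)
    if match is None:
        return None
    try:
        airdate = date.fromisoformat(match["airdate"])
    except ValueError:
        # Digits matched the pattern but do not form a real calendar date.
        return None
    return ParsedFilename(match["series"].strip(), airdate, match["title"].strip())


parsed = parse_filename("Example Series - 1950-01-01 - The Pilot.mp3")
```

Returning None rather than raising lets an ingest job route non-conforming files to a manual review queue instead of aborting the batch.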
Transcribe
Deepgram processes audio to generate word-level transcripts with speaker diarization and confidence scores.
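Word-level output with speaker labels is typically collapsed into speaker turns before further analysis. The sketch below assumes each word arrives as a dictionary with `word`, `start`, `end`, `confidence`, and `speaker` keys; that shape is an assumption about the diarized transcript and should be checked against the actual response. The sample words are synthetic.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Turn:
    speaker: int
    start: float           # seconds from the start of the audio
    end: float
    text: str
    min_confidence: float  # lowest word confidence within the turn


def group_turns(words: Iterable[dict]) -> list[Turn]:
    """Merge consecutive words from the same speaker into a single turn."""
    turns: list[Turn] = []
    for w in words:
        if turns and turns[-1].speaker == w["speaker"]:
            current = turns[-1]
            current.end = w["end"]
            current.text += " " + w["word"]
            current.min_confidence = min(current.min_confidence, w["confidence"])
        else:
            turns.append(
                Turn(w["speaker"], w["start"], w["end"], w["word"], w["confidence"])
            )
    return turns


# Synthetic sample in the assumed word shape.
sample_words = [
    {"word": "dig", "start": 0.0, "end": 0.3, "confidence": 0.98, "speaker": 0},
    {"word": "faster", "start": 0.3, "end": 0.8, "confidence": 0.91, "speaker": 0},
    {"word": "who's", "start": 1.1, "end": 1.4, "confidence": 0.75, "speaker": 1},
    {"word": "there", "start": 1.4, "end": 1.8, "confidence": 0.88, "speaker": 1},
    {"word": "me", "start": 2.0, "end": 2.2, "confidence": 0.95, "speaker": 0},
]
turns = group_turns(sample_words)
```

Tracking the minimum word confidence per turn gives a cheap way to flag turns that may need review, for example where music or sound effects overlap the dialogue.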
Analyze
Our fine-tuned models generate loglines, extract genre tags, identify cast patterns, and flag content characteristics.
Store
Structured metadata written to Turso. Derivative files (transcripts, waveforms, HLS segments) stored alongside source audio in R2.
Serve
Catalog API, streaming via Radio Index, edge-cached delivery through Cloudflare's global network.
Access
The catalog API is available to registered developers. Academic and nonprofit researchers can request bulk metadata export. Audio access is through Radio Index (streaming) or by institutional agreement.