High-Quality Structured Data

The largest structured dataset of American radio drama in existence. 233,264 audio files. 6,408 series. Six decades of broadcast history, organized, cataloged, and machine-readable.

6,408 Series Tracked
233,264 Audio Files
170,312 Episodes Cataloged
5 TB Audio

What Makes This Different

There is no other dataset like this. Not on Hugging Face. Not on Kaggle. Not in any university archive. The Radio Drama Corpus is tens of thousands of hours of professional multi-speaker dramatic dialogue — written by professional writers, performed by professional actors, directed and produced for national broadcast. Every genre. Every major American network. 1920 through 1982.

This is not podcast audio. Not audiobook narration. Not synthetic speech. These are real performances — scripted scenes with multiple speakers, natural turn-taking, emotional range, sound effects, and music cues. The kind of data that speech and language models are starving for.

Storage Architecture

Every file lives in a structured object store with a deterministic path scheme.

files/{collection}/{Series Name}/{YYYY-MM-DD}/{filename}/{filename}.mp3

The derivative folder structure sits alongside each source file — HLS segments, waveform data, transcripts, and analysis outputs all share the same parent path. No guessing. No flat buckets. Every file addressable by series, airdate, and filename.
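The path scheme can be sketched in a few lines. This is a minimal illustration, not the project's own code: the function names are hypothetical, `{filename}` is assumed to be the file stem without the `.mp3` extension, and series names are assumed not to contain `/`.

```python
from datetime import date
from typing import NamedTuple


class EpisodeKey(NamedTuple):
    collection: str
    series: str
    airdate: date
    filename: str  # assumed: stem only, without the .mp3 extension


def build_key(collection: str, series: str, airdate: date, filename: str) -> str:
    """Return the object key for a source MP3 under the documented scheme."""
    return f"files/{collection}/{series}/{airdate.isoformat()}/{filename}/{filename}.mp3"


def parent_prefix(key: str) -> str:
    """Return the folder that a source file shares with its derivatives."""
    return key.rsplit("/", 1)[0] + "/"


def parse_key(key: str) -> EpisodeKey:
    """Invert build_key; raise ValueError if the key does not fit the scheme."""
    parts = key.split("/")
    if len(parts) != 6 or parts[0] != "files":
        raise ValueError(f"unexpected key layout: {key!r}")
    _, collection, series, day, folder, leaf = parts
    if leaf != f"{folder}.mp3":
        raise ValueError(f"leaf does not match its folder: {key!r}")
    return EpisodeKey(collection, series, date.fromisoformat(day), folder)
```

Because the key is deterministic, the same four values that locate the source audio also locate every derivative: list everything under `parent_prefix(key)`.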

Object Storage

5 TB across 233,264 files. Two collection tiers: public domain (9,970 files, 184 GB) and research collection (223,294 files, 2.14 TB). Cloudflare R2 with global edge distribution.

Relational Database

Turso (libSQL) with structured tables for shows, episodes, airdates, networks, loglines, and file metadata. Edge-replicated. Sub-millisecond reads worldwide.

Structured Metadata

Every series and episode cataloged with machine-readable fields.

Series-Level

Field            Type     Description
name             TEXT     Series title
years            TEXT     Broadcast years (e.g. 1942–1962)
network          TEXT     CBS, NBC, ABC, Mutual, Blue Network, Syndicated
episode_count    INTEGER  Files in collection
collection_type  TEXT     public or private
in_books         BOOLEAN  Documented in radio encyclopedias
logline          TEXT     Series description

Episode-Level

Field      Type     Description
airdate    DATE     Original broadcast date
logline    TEXT     Episode plot summary
duration   INTEGER  Runtime in seconds
filename   TEXT     Source file reference
file_size  INTEGER  Bytes
series     TEXT     Parent series name
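To make the field lists concrete, here is a minimal sketch that loads them into an in-memory SQLite database (libSQL is SQLite-compatible). The two table names, the foreign key, and the helper functions are assumptions for illustration; the production schema is described as spanning more tables and is not reproduced here. Sample rows reuse the example loglines from this page, and fields the page does not state are left NULL.

```python
import sqlite3

# Hypothetical two-table layout; columns and types follow the field lists above.
SCHEMA = """
CREATE TABLE series (
    name            TEXT PRIMARY KEY,
    years           TEXT,
    network         TEXT,
    episode_count   INTEGER,
    collection_type TEXT,
    in_books        BOOLEAN,
    logline         TEXT
);
CREATE TABLE episodes (
    airdate   DATE,
    logline   TEXT,
    duration  INTEGER,  -- runtime in seconds
    filename  TEXT,
    file_size INTEGER,  -- bytes
    series    TEXT REFERENCES series(name)
);
"""

SAMPLE_EPISODES = [
    ("Suspense", "1943-01-19",
     "A man trapped in a coffin buried alive must convince the gravedigger "
     "above to dig him out before his air runs out."),
    ("X Minus One", "1955-06-24",
     "A crew of astronauts discovers that the alien civilization they came "
     "to study has been studying them the entire time."),
    ("Dragnet", "1951-03-08",
     "Sergeant Friday tracks a stolen shipment of industrial dynamite "
     "through the warehouses of downtown Los Angeles."),
]


def open_demo_catalog() -> sqlite3.Connection:
    """Create the sketch schema in memory and load the sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT OR IGNORE INTO series (name) VALUES (?)",
        [(series,) for series, _, _ in SAMPLE_EPISODES],
    )
    conn.executemany(
        "INSERT INTO episodes (series, airdate, logline) VALUES (?, ?, ?)",
        SAMPLE_EPISODES,
    )
    return conn


def episodes_between(conn: sqlite3.Connection, first: str, last: str):
    """Return (series, airdate) pairs aired in [first, last], oldest first."""
    rows = conn.execute(
        "SELECT series, airdate FROM episodes "
        "WHERE airdate BETWEEN ? AND ? ORDER BY airdate",
        (first, last),
    )
    return rows.fetchall()
```

Storing airdates as ISO `YYYY-MM-DD` strings means plain string comparison sorts them chronologically, so a decade filter is a single `BETWEEN`.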

Loglines

Every episode has a logline — a one-sentence plot summary that captures the narrative hook of each broadcast. Consistent in style across the entire corpus. These are not transcripts — they are structured descriptions of what happens in each episode.

Suspense, 1943-01-19

A man trapped in a coffin buried alive must convince the gravedigger above to dig him out before his air runs out.

X Minus One, 1955-06-24

A crew of astronauts discovers that the alien civilization they came to study has been studying them the entire time.

Dragnet, 1951-03-08

Sergeant Friday tracks a stolen shipment of industrial dynamite through the warehouses of downtown Los Angeles.

What You Can Build

Speech Synthesis

Train TTS models on real dramatic performances. Multi-speaker, multi-emotion, multi-genre. Tens of thousands of hours of professional voice acting.

Speaker Diarization

Full-cast productions with multiple speakers per scene. Natural conversational turn-taking. Sound effects and music provide segmentation challenges at scale.

Narrative Understanding

Genre classification. Plot structure analysis. Character relationship extraction. Dialogue generation. Style transfer between decades and genres.

Audio Retrieval

Cross-modal search (text-to-audio, audio-to-text). Logline-to-episode matching. Content-based recommendation. Duplicate detection at scale.
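As a rough idea of what logline-to-episode matching involves, the sketch below ranks the three example loglines from this page against a free-text query using token-set Jaccard overlap. This is a deliberately naive baseline chosen for illustration, not a method the project is known to use; a real system would more likely rely on learned text or audio embeddings. All function names are hypothetical.

```python
import re

# The three example loglines from this page, keyed by (series, airdate).
LOGLINES = {
    ("Suspense", "1943-01-19"):
        "A man trapped in a coffin buried alive must convince the gravedigger "
        "above to dig him out before his air runs out.",
    ("X Minus One", "1955-06-24"):
        "A crew of astronauts discovers that the alien civilization they came "
        "to study has been studying them the entire time.",
    ("Dragnet", "1951-03-08"):
        "Sergeant Friday tracks a stolen shipment of industrial dynamite "
        "through the warehouses of downtown Los Angeles.",
}


def tokens(text: str) -> set:
    """Lowercase the text and return its set of alphabetic tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_match(query: str, loglines: dict = LOGLINES):
    """Return the (series, airdate) key whose logline overlaps the query most."""
    q = tokens(query)
    return max(loglines, key=lambda key: jaccard(q, tokens(loglines[key])))
```

For example, `best_match("stolen dynamite in Los Angeles")` returns `("Dragnet", "1951-03-08")`.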

Cultural Analytics

Six decades of American popular culture. Representation studies. Linguistic drift analysis. Advertising history. Cold War propaganda. Gender roles in mid-century media.

Transcription & Alignment

Forced alignment between audio and transcripts. ASR benchmarking on pre-digital recordings with noise, compression, and varying audio quality.

For Listeners

Not a researcher? Radio Index is the streaming platform built on top of this data. Browse by show, genre, decade, or network. Background playback on iOS. Full VoiceOver accessibility.

Technology Stack

Built on production infrastructure from the companies building the future of AI, cloud, and media processing.

Anthropic: Our fine-tuned models for content analysis and metadata enrichment
OpenAI: Our fine-tuned models for summarization and extraction
Deepgram: Speech-to-text transcription across 233K+ audio files
Cloudflare: Workers for compute, R2 for object storage, edge network for global delivery
Turso: Edge-replicated libSQL database for catalog metadata
Modal: GPU compute for audio analysis and batch processing pipelines
Fly.io: Application hosting and service orchestration
Apple: Native iOS app with VoiceOver, background playback, and offline support

Processing Pipeline

Every episode runs through a multi-stage enrichment pipeline.

1. Ingest

Audio files ingested into structured R2 storage. Filename parsed for series, airdate, and episode metadata.

2. Transcribe

Deepgram processes audio to generate word-level transcripts with speaker diarization and confidence scores.

3. Analyze

Our fine-tuned models generate loglines, extract genre tags, identify cast patterns, and flag content characteristics.

4. Store

Structured metadata written to Turso. Derivative files (transcripts, waveforms, HLS segments) stored alongside source audio in R2.

5. Serve

Catalog API, streaming via Radio Index, edge-cached delivery through Cloudflare's global network.
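The Transcribe step produces word-level output with speaker labels and confidence scores. One common thing to do with that output is collapse it into speaker turns. The sketch below assumes each word arrives as a dict with `word`, `start`, `end`, `confidence`, and `speaker` keys; that shape is an assumption modeled on typical diarized transcripts and is not confirmed by this page. The `Turn` class and function name are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Turn:
    speaker: int
    start: float           # seconds
    end: float             # seconds
    text: str
    min_confidence: float  # lowest word confidence inside the turn


def words_to_turns(words: Iterable[dict]) -> List[Turn]:
    """Merge consecutive words from the same speaker into one turn."""
    turns: List[Turn] = []
    for w in words:
        if turns and turns[-1].speaker == w["speaker"]:
            # Same speaker keeps talking: extend the current turn.
            current = turns[-1]
            current.end = w["end"]
            current.text += " " + w["word"]
            current.min_confidence = min(current.min_confidence, w["confidence"])
        else:
            # Speaker changed (or first word): open a new turn.
            turns.append(
                Turn(w["speaker"], w["start"], w["end"], w["word"], w["confidence"])
            )
    return turns
```

Tracking the minimum confidence per turn gives a cheap way to flag passages where music, sound effects, or recording noise may have degraded the transcript.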

Access

The catalog API is available to registered developers. Academic and nonprofit researchers can request bulk metadata export. Audio access is through Radio Index (streaming) or by institutional agreement.

API: REST. JSON. Paginated. Free for non-commercial use.
Bulk Export: CSV and JSON metadata dumps available on request.
Audio Access: Streaming via Radio Index. Institutional agreements for research.
License: Open catalog metadata. Attribution required. Audio rights vary by series.
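This page does not document the API's endpoints or pagination parameters, so the following is only a sketch of how a client might walk a paginated JSON catalog. Everything here is hypothetical: the cursor-based paging, the `items` and `next_cursor` field names, and both function names. The page fetcher is passed in as a function so the example runs without any network access or real URL.

```python
from typing import Callable, Iterator, List, Optional

# Hypothetical page shape: {"items": [...], "next_cursor": "<token>" or None}
Page = dict


def iter_items(fetch_page: Callable[[Optional[str]], Page]) -> Iterator[dict]:
    """Yield every record, following cursors until the last page."""
    cursor: Optional[str] = None
    while True:
        page = fetch_page(cursor)
        yield from page["items"]
        cursor = page.get("next_cursor")
        if cursor is None:
            break


def make_fake_fetcher(records: List[dict], page_size: int = 2):
    """Stand-in for an HTTP call: serves `records` in fixed-size pages."""
    def fetch(cursor: Optional[str]) -> Page:
        start = int(cursor) if cursor else 0
        end = start + page_size
        next_cursor = str(end) if end < len(records) else None
        return {"items": records[start:end], "next_cursor": next_cursor}
    return fetch
```

In a real client, `fetch_page` would wrap an authenticated HTTP request; keeping it injectable makes the paging loop easy to test and to adapt once the actual parameter names are known.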