Methodology

Overview

The Cult Codex is a structured archive of the Cult of Psyche podcast. It uses a combination of automated transcription, AI-assisted analysis, and human curation to catalog episodes, identify participants, extract quotes, and map recurring topics and lore.

This page documents the processes, tools, and editorial standards used to build and maintain the archive.

Data Pipeline

1. Transcript Acquisition

Transcripts are sourced from YouTube captions where available. When captions are unavailable or disabled, the audio is downloaded and transcribed locally with OpenAI Whisper (medium model). All transcripts are stored as timestamped segments.
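
As an illustration of what a timestamped segment might look like, here is a minimal sketch. The Segment fields and both helper functions are assumptions made for this example, not the archive's actual storage schema.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Segment:
    """One transcript segment (hypothetical shape)."""
    start: float  # seconds from the beginning of the episode
    end: float    # seconds from the beginning of the episode
    text: str


def format_timestamp(seconds: float) -> str:
    """Render a second offset as HH:MM:SS, dropping fractional seconds."""
    total = int(seconds)
    hours, rem = divmod(total, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def find_segment(segments: Sequence[Segment], t: float) -> Optional[Segment]:
    """Return the first segment whose time range contains t, or None."""
    for seg in segments:
        if seg.start <= t < seg.end:
            return seg
    return None
```

Storing start and end offsets alongside the text is what would let a quote link back to the moment it was spoken.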

2. AI Enrichment

Each transcript is processed by an AI model (Claude) to extract structured data: episode summaries, guest identifications, topic tags, notable quotes, lore references, and content type classification. The AI is prompted with explicit editorial guidelines emphasizing neutral, factual language.
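
One way to keep model output usable downstream is to check its shape before import. The sketch below assumes the model returns a JSON object; the field names (summary, guests, topics, quotes, lore, content_type) are hypothetical labels for the categories listed above, not the archive's real schema.

```python
import json

# Hypothetical field names and types for one enriched episode record.
REQUIRED_FIELDS = {
    "summary": str,
    "guests": list,
    "topics": list,
    "quotes": list,
    "lore": list,
    "content_type": str,
}


def parse_enrichment(raw: str) -> dict:
    """Parse a JSON response and verify every expected field is present
    with the expected type. Raises ValueError on any problem."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("enrichment output must be a JSON object")
    for name, expected in REQUIRED_FIELDS.items():
        if name not in data:
            raise ValueError(f"missing field: {name}")
        if not isinstance(data[name], expected):
            raise ValueError(f"field {name} should be {expected.__name__}")
    return data
```

Rejecting malformed records at this stage keeps partial or misshapen output from reaching the database import.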

3. Database Import

Enriched data is imported into a PostgreSQL database (Neon) via Prisma ORM. The import process handles deduplication of people, topics, and lore entries using slug-based matching.
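
Slug-based matching can be sketched as follows. This is an illustration in Python with an in-memory dictionary standing in for the database; the slugify rules and the upsert_by_slug helper are assumptions, and the actual import runs through Prisma rather than this code.

```python
import re
import unicodedata


def slugify(name: str) -> str:
    """Lowercase, strip accents, and collapse non-alphanumerics to hyphens."""
    ascii_name = (
        unicodedata.normalize("NFKD", name)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    return re.sub(r"[^a-z0-9]+", "-", ascii_name.lower()).strip("-")


def upsert_by_slug(store: dict, name: str) -> str:
    """Insert a record keyed by slug unless one already exists.

    The first spelling seen is kept as the display name.
    """
    slug = slugify(name)
    store.setdefault(slug, {"slug": slug, "name": name})
    return slug
```

Under these rules "Jane Doe" and "jane  doe" collapse to the same entry. Slug matching only catches differences in case, spacing, and punctuation; it would not merge genuinely different spellings of the same person or topic.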

4. Quality Indicators

Each episode displays provenance badges indicating whether its data is “transcript-backed” (derived from a full transcript) or “inferred” (generated from title and metadata only). This helps members gauge the reliability of each entry.
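
The badge decision can be pictured as a simple rule. In the sketch below, the Episode shape and the choice to treat any stored segments as a full transcript are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Episode:
    """Hypothetical episode record; segments is empty when no transcript exists."""
    title: str
    segments: list = field(default_factory=list)


def provenance_badge(episode: Episode) -> str:
    """Return the badge label for an episode's data source."""
    return "transcript-backed" if episode.segments else "inferred"
```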

Editorial Standards

Summaries use neutral, descriptive language without editorial judgment
Guest identifications are based on in-stream introductions and display names
Quotes are extracted verbatim from transcripts where possible
Topic tags are normalized to avoid duplicates (e.g., “AI” vs “Artificial Intelligence”)
Lore entries distinguish between canonical (stated on stream), speculative, and community myth
Person pages include archive context notices explaining that profiles are auto-generated
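
Topic tag normalization, as described in the list above, could be approached with an alias table. This is a sketch only: the ALIASES entries and the choice of "Artificial Intelligence" as the canonical form are assumptions, since the source does not say which variant is kept.

```python
# Hypothetical alias table: lowercase variant -> canonical tag.
ALIASES = {
    "ai": "Artificial Intelligence",
    "artificial intelligence": "Artificial Intelligence",
}


def normalize_tag(tag: str) -> str:
    """Collapse whitespace and map known variants to one canonical tag."""
    cleaned = " ".join(tag.split())
    return ALIASES.get(cleaned.casefold(), cleaned)


def normalize_tags(tags: list) -> list:
    """Normalize each tag and drop case-insensitive duplicates, keeping order."""
    seen = set()
    result = []
    for tag in tags:
        canonical = normalize_tag(tag)
        key = canonical.casefold()
        if key not in seen:
            seen.add(key)
            result.append(canonical)
    return result
```

An explicit alias table handles abbreviations that slug matching alone would miss, at the cost of needing manual upkeep as new variants appear.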

Limitations

AI-generated summaries may occasionally misidentify speakers or misattribute statements
Whisper transcriptions of overlapping speech or low-quality audio may contain errors
Some early episodes lack transcripts entirely and have minimal metadata
Guest identification relies on display names, which may not reflect legal or preferred names
Topic and lore categorization involves subjective judgment by the AI model

Submit a Correction | Content Policy
