How oruk turns live broadcasts into corroborated news
oruk listens to ~200 live radio, TV, social, and structured feeds, transcribes them in real time on private GPU pods, then uses LLMs to extract events and corroborate them across independent sources before publishing. Here is exactly how that pipeline works, what we trust, and what we mark as uncertain.
How does the oruk pipeline work end-to-end?
oruk runs a four-step pipeline — ingest, ASR, LLM extraction, multi-source corroboration — that converts ~200 simultaneous live broadcast streams into corroborated news events. As of this writing the system has produced 42,000+ published stories from 670,000+ raw transcriptions across 240 monitored sources in 7 regions.
Every story you see on the wire has been through all four steps. The corroboration count next to a story is the number of independent sources that have confirmed the same event — same time, same place, same claim — not the number of mentions.
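The four steps above can be sketched as one staged function. Everything here — function names, the event shape, the toy corroboration rule — is illustrative stand-in code, not oruk's production pipeline:

```python
# Toy sketch of the four-stage pipeline: ingest -> ASR -> extraction ->
# corroboration. Every name and field here is hypothetical.

def ingest(chunk):
    return chunk["audio"]                     # 1. pull raw audio from the stream

def transcribe(audio):
    return audio.upper()                      # 2. stand-in for Whisper-class ASR

def extract_events(transcript):
    # 3. stand-in for LLM extraction: one structured event per window
    return [{"headline": transcript, "confidence": 0.9}]

def corroborate(event, wire):
    # 4. publish only if an independent source already reported the same claim
    matches = [s for s in wire if s["headline"] == event["headline"]]
    if matches:
        return {**event, "corroboration": {"count": 1 + len(matches)}}
    return None                               # uncorroborated: hold, don't publish

def run_pipeline(chunk, wire):
    transcript = transcribe(ingest(chunk))
    return [s for e in extract_events(transcript)
            if (s := corroborate(e, wire)) is not None]
```

The shape matters more than the stubs: each stage only ever sees the previous stage's output, which is what lets the corroboration count mean "independent confirmations" rather than "mentions".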
What sources does oruk listen to?
oruk monitors three independent source classes — audio broadcasts, social-media signal, and machine-readable agency feeds — and tags every story with the source medium so consumers can filter for the trust level they need. The current catalogue holds 240 sources across Europe (116), North America (38), Asia-Pacific (36), South America (27), Global (12), Middle East (7), and Africa (4).
- Audio broadcasts — live radio and TV news streams from public broadcasters (BBC, NPR, ABC Radio National, Deutschlandfunk, Radio Maryja, NHK World, France Info, …) and major commercial outlets across every region. The full station list lives on the Sources page and the /v1/sources endpoint.
- Social signal — Mastodon firehose and Bluesky public timeline, used as a corroboration signal for events first surfaced on broadcast.
- Structured feeds — USGS earthquake data, NOAA weather alerts, OpenFDA drug safety, GDELT events, and similar machine-readable feeds whose source-of-truth is the underlying agency.
Every story carries a source.medium tag of audio_radio, social, or structured, plus a per-source region and language, so consumers can filter for "radio-confirmed" stories, "structured-only" alerts, or any combination.
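Client-side filtering on those tags is straightforward. The sketch below assumes a story shape with the source.medium and region fields described above; exact field names in the live API may differ:

```python
# Hypothetical client-side filter over the source.medium and region tags.
# The story dict shape mirrors the fields described in this section.

def filter_stories(stories, mediums=None, regions=None):
    """Keep stories whose source matches any requested medium/region."""
    out = []
    for story in stories:
        src = story["source"]
        if mediums and src["medium"] not in mediums:
            continue
        if regions and src["region"] not in regions:
            continue
        out.append(story)
    return out
```

So "radio-confirmed only" is `filter_stories(stories, mediums={"audio_radio"})`, and "structured-only alerts" is the same call with `{"structured"}`.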
How does oruk transcribe live broadcasts in real time?
Audio is transcribed on dedicated GPU pods running Whisper-class ASR with on-pod voice activity detection and language identification. Non-English streams are transcribed in their source language and then translated to English on the same pod, so extraction always operates on a consistent English working copy alongside the original transcript for provenance.
Transcripts are batched into rolling 30-second windows. Headlines on breaking stories typically appear on the public wire within 30 to 90 seconds of being spoken on air.
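The windowing itself is simple bucketing. A minimal sketch, assuming timestamped (seconds, text) fragments coming off the ASR pod — the fragment shape is an assumption, not the real internal format:

```python
# Illustrative batching of timestamped transcript fragments into rolling
# 30-second windows, as described above.

WINDOW_SECONDS = 30

def window_fragments(fragments):
    """Group (timestamp_s, text) fragments into 30-second windows."""
    windows = {}
    for ts, text in fragments:
        bucket = int(ts // WINDOW_SECONDS)     # 0-29s -> 0, 30-59s -> 1, ...
        windows.setdefault(bucket, []).append(text)
    return [" ".join(texts) for _, texts in sorted(windows.items())]
```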
How does oruk extract news events from transcripts?
Each transcript window plus the past five minutes of context for the same source is passed to an LLM that returns a strict-format JSON event with headline, body, category, urgency, impact, location, and a verbatim source quote. A rules-based grounding step rejects any output whose claims are not directly supported by the transcript window, so vague spoken statements never sharpen into definitive headlines.
For every window, the prompt asks the model:

- Are there any reportable news events in this window?
- For each event: headline, summary, body, primary category, multi-category list, topics, urgency (breaking, developing, routine), impact (1–10), confidence (0.0–1.0), event city/country/lat/lon, and a verbatim source quote.
- Is this event a continuation of an event the source has been covering, or a new event?
If a transcript says "officials say they may revisit the agreement", the headline cannot become "officials revisit agreement". The grounding ruleset is conservative by design and accepts a higher false-negative rate to keep the false-positive rate close to zero. The extraction LLM runs locally on an H100 with vLLM, with a Gemini fallback on the rare occasions the local inference server is unavailable; outputs are validated against a JSON schema before they ever reach the corroboration layer.
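A grossly simplified stand-in for that grounding rule: reject any headline whose substantive words are not in the transcript window, and require that a hedge in the transcript ("may", "might") survive into the headline. The real ruleset is richer; this only illustrates the conservative direction, and the word lists are assumptions:

```python
# Simplified grounding check: hedged speech ("may revisit") must never
# sharpen into a definitive headline ("revisit" asserted as fact).
# Word lists and logic are illustrative, not oruk's production rules.

HEDGES = {"may", "might", "could", "reportedly", "allegedly"}
STOPWORDS = {"the", "a", "an", "to", "of", "in", "on", "say", "says", "officials"}

def is_grounded(headline, transcript):
    h_words = set(headline.lower().split())
    t_words = set(transcript.lower().split())
    # every substantive headline word must appear in the transcript window...
    if not (h_words - STOPWORDS) <= t_words:
        return False
    # ...and a hedge in the transcript must survive into the headline
    if (HEDGES & t_words) and not (HEDGES & h_words):
        return False
    return True
```

Run against the example above: "officials may revisit agreement" passes, "officials revisit agreement" is rejected.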
What counts as an independent corroborating source?
An independent source is a separate broadcaster, agency, or feed reporting the same event in its own words within the same rolling window — not a syndicated rebroadcast of a wire we already counted. Two different outlets republishing the same AP story add up to a corroboration count of one, not two.
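The independence rule reduces to counting distinct origins rather than distinct outlets. In the toy version below, a syndicated_from field stands in for however provenance is actually tracked:

```python
# Toy illustration of the independence rule: corroboration counts distinct
# origin outlets, collapsing syndicated copies of one wire story to one.
# The syndicated_from field is a hypothetical provenance stand-in.

def corroboration_count(reports):
    origins = set()
    for r in reports:
        # a rebroadcast counts as its origin wire, not as a new source
        origins.add(r.get("syndicated_from") or r["outlet"])
    return len(origins)
```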
An extracted event becomes a story only after it is reconciled against the existing wire. Reconciliation operates on:
- Time proximity — events from different sources within the same rolling window.
- Geographic proximity — same city, region, or specific actor.
- Semantic match — embedding-based similarity across the headline, entities, and topics.
The corroboration.sources array preserves the verbatim source name and per-source quote so consumers can audit the count themselves.
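The three reconciliation checks combine as a conjunction. In this sketch, token Jaccard overlap is a deliberately crude stand-in for the embedding-based semantic match the section describes, and both thresholds are assumptions:

```python
# Sketch of the three reconciliation checks: time proximity, geographic
# proximity, and semantic match. Real oruk uses embedding similarity;
# Jaccard word overlap here is a crude stand-in. Thresholds are assumed.

WINDOW_S = 300          # assumed rolling-window width in seconds
SIM_THRESHOLD = 0.5

def jaccard(a, b):
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

def same_event(e1, e2):
    time_ok = abs(e1["ts"] - e2["ts"]) <= WINDOW_S
    geo_ok = e1["city"] == e2["city"]
    sem_ok = jaccard(e1["headline"], e2["headline"]) >= SIM_THRESHOLD
    return time_ok and geo_ok and sem_ok
```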
How does oruk categorise stories?
Every story is assigned exactly one primary category from a closed set of twelve, plus a multi-category list of secondary tags exposed as the topics array. The closed set keeps queries deterministic — there is no long tail of synthesised category names to reconcile across releases.
The twelve primary categories are: politics, conflict, economy, disaster, diplomacy, science, health, technology, culture, environment, sports, and other. The current top five by published volume are politics (27%), other (17%), economy (17%), technology (7%), and conflict (6%).
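Because the set is closed, category validation is a one-line membership check — anything outside the set can be coerced to other. Illustrative only:

```python
# The closed category set makes validation trivial: unknown labels
# collapse to "other" instead of growing a long tail of synthesised names.

PRIMARY_CATEGORIES = {
    "politics", "conflict", "economy", "disaster", "diplomacy", "science",
    "health", "technology", "culture", "environment", "sports", "other",
}

def normalise_category(raw):
    cat = raw.strip().lower()
    return cat if cat in PRIMARY_CATEGORIES else "other"
```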
Sources also carry a medium classification:
- audio_radio — live radio or TV broadcast monitored by an ASR pod.
- social — social-media-first signal (Mastodon firehose, Bluesky public timeline, curated journalist accounts).
- structured — machine-readable feed where the source-of-truth is an agency (USGS, NOAA, OpenFDA, GDELT, etc.).
What quality controls catch hallucinations?
Three automated layers and one manual one. The automated stack rejects ungrounded headlines, malformed JSON, and low-confidence single-source events; the manual layer audits a random daily sample to catch drift in the LLM and adjust prompts before drift compounds.
- Headline grounding — rules that reject headlines whose claims are not directly supported by the transcript window.
- JSON-schema validation — every LLM output passes a strict schema check before reaching the database.
- Corroboration thresholds — events below a confidence floor are held until at least one independent source confirms.
- Manual review — we audit a random sample of stories every day and feed the failures back into the prompt.
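The corroboration-threshold layer reduces to a simple publish/hold decision. The 0.6 floor below is an assumption for illustration; the section above only says a floor exists:

```python
# Sketch of the corroboration-threshold layer: single-source events below
# a confidence floor are held until an independent source confirms.
# The 0.6 floor is an illustrative assumption.

CONFIDENCE_FLOOR = 0.6

def publish_decision(event):
    if event["corroboration_count"] >= 2:
        return "publish"                      # independently confirmed
    if event["confidence"] >= CONFIDENCE_FLOOR:
        return "publish"                      # high-confidence single source
    return "hold"                             # wait for a confirming source
```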
When we change something material — a new source class, a prompt revision, a corroboration rule — we note it on the public changelog.
How fresh is the oruk wire?
Headlines on breaking stories typically reach the public wire within 30 to 90 seconds of being spoken on air. Audio sources are continuous; structured agency feeds are polled at the cadence the underlying agency publishes (USGS earthquakes are typically <60 s after detection, NOAA alerts within minutes of issuance).
Each source has its own polling cadence, recorded in source.cadenceMs on /v1/sources. The public wire on oruk.ai is real-time for every visitor. Authenticated /v1/stories results for Free tier accounts are filtered to ~5 minutes behind the wire so casual API use cannot front-run the homepage. The SSE stream (/v1/stream) is available only on the Developer, Trader, and Enterprise tiers; once connected, it delivers events in real time with no added delay.
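The Free-tier hold-back is just a timestamp cutoff. A minimal sketch, assuming a published-timestamp field and the ~5-minute delay described above — both field names are assumptions against the live API:

```python
# Sketch of the Free-tier freshness filter: /v1/stories results are held
# back roughly five minutes behind the live wire. Field names and the
# exact cutoff are illustrative assumptions.

FREE_TIER_DELAY_S = 300

def visible_stories(stories, now_s, tier):
    if tier != "free":
        return stories                        # paid tiers see the live wire
    cutoff = now_s - FREE_TIER_DELAY_S
    return [s for s in stories if s["published_s"] <= cutoff]
```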
How accurate is each oruk story?
oruk is a signal layer, not a primary source — accuracy scales with corroboration count, and consumers should treat single-source stories as leads rather than confirmed facts. For automated decisioning we recommend corroboration.count ≥ 3 from media you trust for the use case (audio for breaking, structured for compliance).
Each story carries an explicit confidence (0.0–1.0) the extraction LLM assigned, plus the corroboration count, the medium of every confirming source, and the verbatim quote each source used. That's everything you need to audit a story without trusting our judgement on its face. Practical guidance:
- For automated decisioning, prefer events with corroboration.count ≥ 3 and at least two distinct mediums.
- Cross-reference any single-source story against the broadcaster directly before quoting it externally.
- Treat confidence < 0.6 as a developing-story flag, not a confirmed fact.
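For automated consumers, that guidance folds into a single gate. Field names follow the story shape described in this document; treat them as assumptions against the live API:

```python
# The practical guidance above as one gate for automated decisioning.
# Story field names mirror this document's description and may differ
# from the live API.

def safe_for_automation(story):
    corro = story["corroboration"]
    mediums = {s["medium"] for s in corro["sources"]}
    return (
        corro["count"] >= 3            # at least three independent sources
        and len(mediums) >= 2          # spanning two distinct mediums
        and story["confidence"] >= 0.6 # not a developing-story flag
    )
```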
What if I find a story that's wrong?
Email editorial@oruk.ai with the story permalink and the correction. We log every reported correction publicly on the changelog and update the story's storyStatus field to corrected or retracted within hours.
The canonical record always lives with the original broadcaster. We do not invent facts and we do not hide errors — when we get something wrong, the correction is public and the audit trail (timeline, source quotes, confidence) stays attached to the story so the failure mode is reproducible.