Speech encoders are trained to do one narrow thing. Whisper learns to transcribe. WavLM, HuBERT, and the Conformer learn to fill in masked audio. None of them is ever told who is speaking or how they sound. They pick it up anyway. The interesting question is where in the network that information lives.
To find out, we took 3,276 recordings (91 actors, the same 12 sentences, six acted emotions), ran them through four frozen encoders, and mapped every layer in 3-D. Each point below is one recording. Color the map by gender, emotion, speaker, or sentence, and drag the slider to move through the network from input to output.
[Interactive figure: 3-D map of per-layer activations with a layer slider. In the speaker view there is one hue per speaker; click a dot to follow a voice.]
what to look for
Four signals, four fates
Gender splits at the very first layer in every encoder and stays split; pitch and vocal-tract resonance are cheap features, and a simple linear classifier reads gender with about 90% accuracy from layer 0. Emotion is slower. The six categories start out tangled and pull apart with depth, even between recordings of the same sentence. Speaker identity goes the other way: strong early, then dissolved as the network reorganizes around content. And the twelve sentences condense into clean clusters in all four models.
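For readers who want to see what "a simple linear classifier reads gender" means in practice, here is a minimal sketch of a layer-wise probe, following the recipe in the Notes (logistic regression on PCA-50 of one layer's pooled features). The function name `probe_accuracy`, the five-fold cross-validation, and the synthetic features are all assumptions made for illustration; they are not taken from the preprint's code.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline


def probe_accuracy(features, labels, n_pca=50, folds=5, seed=0):
    """Cross-validated accuracy of a linear probe on one layer's features.

    features: (n_clips, dim) pooled hidden states from a single layer.
    labels:   (n_clips,) integer class labels (e.g. gender or emotion).
    The evaluation protocol (stratified k-fold) is an assumption.
    """
    probe = make_pipeline(
        PCA(n_components=n_pca, random_state=seed),   # PCA-50, refit per fold
        LogisticRegression(max_iter=1000),            # the linear readout
    )
    return cross_val_score(probe, features, labels, cv=folds).mean()


# Synthetic stand-in for one layer: 400 clips, 128 dimensions.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=400)
features = rng.normal(size=(400, 128))
features[:, :5] += 1.5 * labels[:, None]              # plant a linearly readable signal

acc = probe_accuracy(features, labels)
# Control: the same features with shuffled labels should sit near chance.
shuffled_acc = probe_accuracy(features, rng.permutation(labels))
print(f"probe: {acc:.2f}  shuffled-label control: {shuffled_acc:.2f}")
```

Running the same function on every layer gives a depth profile for each label. The shuffled-label control is a cheap check that a high score reflects structure in the features rather than the probe's capacity. For speaker-dependent labels such as gender, folds that keep each actor's clips together would be a stricter test than the plain split used here.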
the finding
Two architectures, two schedules
Speech has two encoder families at the frontier: Transformers, and Conformers, which mix convolutions into the attention layers. They score about the same on benchmarks. But in our preprint, “Categorize Early, Integrate Late,” we probed 24 of them, layer by layer, and found they organize information on different schedules. Conformers sort out categories like gender and phonemes early, in roughly the first fifth of the network. Transformers push phonemes, accent, and duration to the second half. The difference is consistent enough that a classifier can tell the two architectures apart from their probe profiles alone.
Same task, same scores, different solutions. The full analysis is in the preprint; code and data are on GitHub.
Notes
- Audio: CREMA-D (Cao et al. 2014, GitHub, ODbL). 91 professional actors, 12 fixed sentences, six acted emotions (anger, disgust, fear, happy, neutral, sad). We sample six clips per actor per emotion: 3,276 recordings.
- Encoders: openai/whisper-small (supervised ASR), plus microsoft/wavlm-base-plus, facebook/hubert-base-ls960, and facebook/wav2vec2-conformer-rel-pos-large (the latter three self-supervised). Hidden states are mean-pooled per layer over real (unpadded) frames.
- Projection: PCA-50 then 3-D t-SNE per layer (perplexity 30), each layer initialized from the previous layer’s embedding. Wattenberg, Viégas & Johnson, “How to Use t-SNE Effectively” (Distill 2016), explains what t-SNE does and doesn’t preserve.
- Probes: logistic regression on PCA-50 of each layer. A high score means the information is linearly readable at that layer, not that the model uses it downstream. Emotions in CREMA-D are acted, not spontaneous, so probe scores here are an upper bound on how separable emotion is in the wild.
- Preprint: Nathan Roll*, Pranav Bhalerao*, Martijn Bartelds, Arjun Pawar, Yuka Tatsumi, Tolulope Ogunremi, Chen Shani, Calbert Graham, Meghan Sumner, and Dan Jurafsky (2026). “Categorize Early, Integrate Late: Divergent Processing Strategies in Automatic Speech Recognition.” arXiv:2601.06972. *Equal contribution.
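The pooling and projection steps described in the notes above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code behind the figure: the function names `masked_mean_pool` and `project_layers` are hypothetical, random arrays stand in for real encoder hidden states, the first layer is assumed to start from a PCA initialization, and the previous layer's map is passed to t-SNE unchanged (the notes do not say whether it is rescaled).

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE


def masked_mean_pool(hidden, mask):
    """Average frame vectors over real (unpadded) frames only.

    hidden: (batch, frames, dim) hidden states from one layer.
    mask:   (batch, frames), 1 for real frames and 0 for padding.
    """
    mask = mask.astype(hidden.dtype)[..., None]       # (batch, frames, 1)
    summed = (hidden * mask).sum(axis=1)              # padded frames contribute 0
    counts = np.clip(mask.sum(axis=1), 1.0, None)     # guard against empty clips
    return summed / counts                            # (batch, dim)


def project_layers(layers, n_pca=50, perplexity=30, seed=0):
    """PCA-50 then 3-D t-SNE per layer, each seeded with the previous map."""
    maps, prev = [], None
    for feats in layers:                              # feats: (n_clips, dim)
        reduced = PCA(n_components=n_pca, random_state=seed).fit_transform(feats)
        # Assumption: the first layer has no predecessor, so it uses PCA init.
        init = "pca" if prev is None else prev
        tsne = TSNE(n_components=3, perplexity=perplexity,
                    init=init, random_state=seed)
        prev = tsne.fit_transform(reduced)            # (n_clips, 3)
        maps.append(prev)
    return maps


# Synthetic stand-in: 80 clips, up to 20 frames each, 64-dim states, 3 layers.
rng = np.random.default_rng(0)
lengths = rng.integers(5, 21, size=80)                # real frames per clip
mask = (np.arange(20)[None, :] < lengths[:, None]).astype(int)
layers = [masked_mean_pool(rng.normal(size=(80, 20, 64)), mask) for _ in range(3)]
maps = project_layers(layers)
print([m.shape for m in maps])
```

Seeding each layer with the previous layer's embedding keeps points from jumping to arbitrary new positions between slider steps, so movement in the map is more likely to reflect a change in the representation than a change in t-SNE's random start.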