Diarization (speaker turn recognition) sometimes happens unexpectedly? (medium.en model)

This is probably an upstream "issue", and not a problem per se, just something unexpected.

@khimaros [commented on Dec 1, 2023](https://github.com/ggerganov/whisper.cpp/pull/1058#issuecomment-1836910921):

> i'm not sure if this is expected, but with `medium.en-q5_0`, i'm seeing that speaker turns are pretty reliably marked with `>>`. i'm not using the `--diarize` or `--tdrz` flags.
> 
> i wasn't seeing this behavior with `large-v2`, `large-v3`, or `large-v3-q5_0`. any thoughts on why that would be happening?

I was curious and tried reproducing this with the `a13.wav` sample, which `make samples` downloads from https://upload.wikimedia.org/wikipedia/commons/transcoded/6/6f/Apollo13-wehaveaproblem.ogg/Apollo13-wehaveaproblem.ogg.mp3 (source page: https://commons.wikimedia.org/wiki/File:Apollo13-wehaveaproblem.ogg).

No diarization using: `tiny`, `tiny.en`, `base`, `base.en`, `small`, `small.en`, `medium`, `large-v1`, `large-v2`, `large-v2-q5_0`, `large-v3-q5_0`.

Diarization using: `medium.en`, `medium.en-q5_0`.

Using the latest from master, `1cf679d`. M1 macOS.
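For anyone who wants to repeat the comparison, here is a rough sketch of how the per-model runs could be scripted. It assumes a whisper.cpp checkout where the CLI binary is `./main`, models live at `models/ggml-<name>.bin`, and the sample is at `samples/a13.wav`; those paths and the helper names (`build_command`, `transcribe`, `run_all`) are my assumptions, so adjust them to your setup.

```python
import subprocess
from pathlib import Path

# Model names copied from the lists above.
MODELS = [
    "tiny", "tiny.en", "base", "base.en", "small", "small.en",
    "medium", "medium.en", "medium.en-q5_0",
    "large-v1", "large-v2", "large-v2-q5_0", "large-v3-q5_0",
]


def build_command(model, binary="./main", models_dir="models",
                  sample="samples/a13.wav"):
    """Build the argv list for one run (paths are assumed, not verified)."""
    model_path = (Path(models_dir) / f"ggml-{model}.bin").as_posix()
    return [binary, "-m", model_path, "-f", sample]


def transcribe(model):
    """Run the binary for one model and return whatever it prints to stdout."""
    result = subprocess.run(build_command(model), capture_output=True,
                            text=True, check=True)
    return result.stdout


def run_all(models=MODELS):
    """Print the transcript for each model so they can be compared by eye."""
    for model in models:
        print(f"=== {model} ===")
        print(transcribe(model))
```

Calling `run_all()` from the repository root prints one transcript per model; nothing runs on import.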

> [00:00:00.000 --> 00:00:07.000]   SC Okay Houston, we've had a problem here.
> [00:00:07.000 --> 00:00:12.000]   CAPCOM This is Houston. Say again please.
> [00:00:12.000 --> 00:00:15.000]   SC Houston, we've had a problem. We've had a main B plus 100 volts.
> [00:00:15.000 --> 00:00:20.000]   CAPCOM Roger. Main B, 100 volts. Okay, standby 13. We're looking at it.
> [00:00:20.000 --> 00:00:29.000]   SC Okay. Right now, Houston, the voltage is looking good. And we had a pretty large bang
> [00:00:29.000 --> 00:00:36.000]   associated with the caution and warning amp. And as I recall, main B was the one that had
> [00:00:36.000 --> 00:00:39.000]   a amp spike on it once before.
> [00:00:39.000 --> 00:00:42.000]   CAPCOM Roger, Fred.
> [00:00:42.000 --> 00:00:48.000]   SC And the interim air, we're starting to go ahead and button up the tunnel again.
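To check transcripts for this automatically rather than by eye, a small heuristic could flag segments whose text starts with `>>` or an all-caps token such as `SC` or `CAPCOM`. This is only a sketch: the regexes and the `speaker_labels` helper are my own, they assume the timestamped line format shown above, and an all-caps word at the start of ordinary speech would be a false positive.

```python
import re

# "[00:00:00.000 --> 00:00:07.000]   text" as printed in the output above.
LINE_RE = re.compile(
    r"^\[(\d{2}:\d{2}:\d{2}\.\d{3}) --> (\d{2}:\d{2}:\d{2}\.\d{3})\]\s+(.*)$"
)
# Heuristic: a leading ">>" or a run of 2+ capital letters, then whitespace.
LABEL_RE = re.compile(r"^(>>|[A-Z]{2,})\s")


def speaker_labels(transcript):
    """Return (start_time, label) for each segment that opens with a label."""
    found = []
    for raw in transcript.splitlines():
        line_match = LINE_RE.match(raw.strip())
        if not line_match:
            continue  # skip log lines and anything that is not a segment
        start, _end, text = line_match.groups()
        label_match = LABEL_RE.match(text)
        if label_match:
            found.append((start, label_match.group(1)))
    return found
```

An empty result for a model means no segment opened with a label under this heuristic, which is what the models in the "no diarization" list should give on this sample.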

I found this in the OpenAI Whisper model card:
> They may exhibit additional capabilities, particularly if fine-tuned on certain tasks like voice activity detection, speaker classification, or speaker diarization but have not been robustly evaluated in these areas.

-- https://github.com/openai/whisper/blob/main/model-card.md#evaluated-use

So maybe this is just a weird case: perhaps the `medium.en` model was trained on that audio sample plus a transcript? That wouldn't be too surprising. A number of transcripts use the same speaker identifiers (SC = Spacecraft, CAPCOM = Capsule Communicator), e.g. https://nssdc.gsfc.nasa.gov/planetary/lunar/apollo13.pdf

Mostly creating this just to have a placeholder for the topic, as I haven't encountered other discussions. I do recall reading something about how Whisper is trained to suppress this sort of thing... Oh yeah, here we go: https://github.com/openai/whisper/discussions/854

Diarization (speaker turn recognition) sometimes happens unexpectedly? (medium.en model) #1810
