Description
This is probably an upstream "issue", and it's not a problem per se, more just something unexpected.
@khimaros commented on Dec 1, 2023:
i'm not sure if this is expected, but with medium.en-q5_0, i'm seeing that speaker turns are pretty reliably marked with >>. i'm not using the --diarize or --tdrz flags. i wasn't seeing this behavior with large-v2, large-v3, or large-v3-q5_0. any thoughts on why that would be happening?
I was curious and tried reproducing this, using the a13.wav sample obtained via make samples from https://upload.wikimedia.org/wikipedia/commons/transcoded/6/6f/Apollo13-wehaveaproblem.ogg/Apollo13-wehaveaproblem.ogg.mp3 (https://commons.wikimedia.org/wiki/File:Apollo13-wehaveaproblem.ogg).
No diarization using: tiny, tiny.en, base, base.en, small, small.en, medium, large-v1, large-v2, large-v2-q5_0, large-v3-q5_0.
Diarization using: medium.en, medium.en-q5_0.bin.
Using the latest from master, 1cf679d. M1 macOS.
[00:00:00.000 --> 00:00:07.000] SC Okay Houston, we've had a problem here.
[00:00:07.000 --> 00:00:12.000] CAPCOM This is Houston. Say again please.
[00:00:12.000 --> 00:00:15.000] SC Houston, we've had a problem. We've had a main B plus 100 volts.
[00:00:15.000 --> 00:00:20.000] CAPCOM Roger. Main B, 100 volts. Okay, standby 13. We're looking at it.
[00:00:20.000 --> 00:00:29.000] SC Okay. Right now, Houston, the voltage is looking good. And we had a pretty large bang
[00:00:29.000 --> 00:00:36.000] associated with the caution and warning amp. And as I recall, main B was the one that had
[00:00:36.000 --> 00:00:39.000] a amp spike on it once before.
[00:00:39.000 --> 00:00:42.000] CAPCOM Roger, Fred.
[00:00:42.000 --> 00:00:48.000] SC And the interim air, we're starting to go ahead and button up the tunnel again.
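For anyone who wants to compare models on this, here is a rough Python sketch that counts how many timestamped lines start with a speaker label. It assumes output in the bracketed-timestamp format shown above; the function name `count_speaker_labels` and the label set (SC, CAPCOM) are my own assumptions taken from this one transcript, not anything provided by whisper.cpp.

```python
import re
from collections import Counter

# Matches lines shaped like: "[00:00:00.000 --> 00:00:07.000] text"
LINE_RE = re.compile(
    r"^\[\d{2}:\d{2}:\d{2}\.\d{3} --> \d{2}:\d{2}:\d{2}\.\d{3}\]\s*(.*)$"
)

# Assumption: the labels seen in the transcript above; extend as needed.
KNOWN_LABELS = ("SC", "CAPCOM")


def count_speaker_labels(transcript, labels=KNOWN_LABELS):
    """Count timestamped lines whose text begins with one of `labels`."""
    counts = Counter()
    for line in transcript.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip anything that is not a timestamped segment
        text = match.group(1)
        first_token = text.split(maxsplit=1)[0] if text else ""
        if first_token in labels:
            counts[first_token] += 1
    return counts


if __name__ == "__main__":
    sample = "\n".join([
        "[00:00:00.000 --> 00:00:07.000] SC Okay Houston, we've had a problem here.",
        "[00:00:07.000 --> 00:00:12.000] CAPCOM This is Houston. Say again please.",
        "[00:00:29.000 --> 00:00:36.000] associated with the caution and warning amp.",
    ])
    print(count_speaker_labels(sample))
```

Passing `labels=(">>",)` instead should pick up the `>>` turn markers mentioned in the quoted comment, assuming they appear at the start of the segment text.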
I found this:
They may exhibit additional capabilities, particularly if fine-tuned on certain tasks like voice activity detection, speaker classification, or speaker diarization but have not been robustly evaluated in these areas.
-- https://github.com/openai/whisper/blob/main/model-card.md#evaluated-use
... So maybe this is just a weird case: perhaps the medium.en model was trained on that audio sample plus a transcript? Wouldn't be too surprising. There are a number of transcripts that use the same speaker identifiers (SC = Spacecraft, CAPCOM = Capsule Communicator), e.g. https://nssdc.gsfc.nasa.gov/planetary/lunar/apollo13.pdf
Mostly creating this just to have a placeholder for the topic, as I haven't encountered other discussions. I do recall reading something about how Whisper is trained to suppress this sort of thing... Oh yeah, here we go: openai/whisper#854