Incorrect audio frame pts with nasa_13013.mp4.audio.mp3 #553

Closed
@NicolasHug

In nasa_13013.mp4.audio.mp3.stream0.all_frames_info.json, which we created using ffprobe, the first frame of nasa_13013.mp4.audio.mp3 has pts 0.138125 and duration 0.005875:

{
"duration_time": "0.005875",
"pts_time": "0.138125"
},
{
"duration_time": "0.072000",
"pts_time": "0.144000"
},
{
"duration_time": "0.072000",
"pts_time": "0.216000"
},

However, the corresponding AVFrame's pts and duration fields do not match these values. Instrumenting our decoder to print the pts, duration, and sample count of the first few decoded frames (the ones we return):

diff --git a/src/torchcodec/decoders/_core/VideoDecoder.cpp b/src/torchcodec/decoders/_core/VideoDecoder.cpp
index 0e287a5..5e9cc8a 100644
--- a/src/torchcodec/decoders/_core/VideoDecoder.cpp
+++ b/src/torchcodec/decoders/_core/VideoDecoder.cpp
@@ -1152,6 +1152,8 @@ VideoDecoder::FrameOutput VideoDecoder::convertAVFrameToFrameOutput(
       avFrame->pts, formatContext_->streams[streamIndex]->time_base);
   frameOutput.durationSeconds = ptsToSeconds(
       getDuration(avFrame), formatContext_->streams[streamIndex]->time_base);
+
+  printf("AVFrame pts = %f, duration = %f, num_samples = %d\n", frameOutput.ptsSeconds, frameOutput.durationSeconds, avFrame->nb_samples);
   if (streamInfo.avMediaType == AVMEDIA_TYPE_AUDIO) {
     convertAudioAVFrameToFrameOutputOnCPU(
         avFrameStream, frameOutput, preAllocatedOutputTensor);

This prints:

AVFrame pts = 0.072000, duration = 0.072000, num_samples = 47
AVFrame pts = 0.144000, duration = 0.072000, num_samples = 576
AVFrame pts = 0.216000, duration = 0.072000, num_samples = 576

We can see that the two sources disagree on the first frame. ffprobe is likely correct here, with pts 0.138125 and duration 0.005875: the file has a sample rate of 8000 Hz, and 47 samples at this rate yield exactly 0.005875 seconds (47 / 8000), while 0.144000 - 0.005875 == 0.138125. In contrast, a frame duration of 0.072000 is impossible with only 47 samples at this sample rate; 0.072000 corresponds to a full 576-sample frame (576 / 8000).
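To make the arithmetic above easy to re-check, here is a minimal sketch that compares a reported frame duration against the duration implied by the sample count. Both function names are hypothetical helpers written for this illustration; they are not part of torchcodec or FFmpeg, and the tolerance is an arbitrary assumption.

```cpp
#include <cmath>

// Duration in seconds implied by a frame's sample count and the stream's
// sample rate (e.g. 47 samples at 8000 Hz).
double expectedDurationSeconds(int numSamples, int sampleRate) {
  return static_cast<double>(numSamples) / sampleRate;
}

// Hypothetical helper (not a torchcodec or FFmpeg API): returns true when the
// reported duration agrees with the sample count within `tolerance` seconds.
bool isDurationConsistent(
    double reportedDurationSeconds,
    int numSamples,
    int sampleRate,
    double tolerance = 1e-6) {
  double expected = expectedDurationSeconds(numSamples, sampleRate);
  return std::fabs(reportedDurationSeconds - expected) <= tolerance;
}
```

With the values from this issue, `isDurationConsistent(0.072, 47, 8000)` is false for the first decoded frame, while `isDurationConsistent(0.072, 576, 8000)` is true for the following frames.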

I don't really know how to fix this for now. The pts we return comes from FFmpeg itself (it is set on the AVFrame!), and we're just trusting it, but it's clearly wrong for this first frame. ffprobe seems to do something smarter, and I don't know what it is. It's possible that there's a field in AVFrame that I'm missing? Or ffprobe is just realizing that with 47 samples at this rate, the start of the first frame must be 0.138125 - but that means it's looking at the second frame, and trusting that its values are correct??

I don't know.

Note: This bug only affects the first frame of nasa_13013.mp4.audio.mp3. All other frames are fine, as can be seen in #554.
