Commit 6a27ada (parent c937f72): Add audio encoder design

audio_encoder_design.md (1 file changed: +190, −0)
Audio encoding design
=====================

Let's talk about the design of our audio encoding capabilities. This design doc
is not meant to be merged into the repo. I'm creating a PR to start a discussion
and enable comments on the design proposal. The PR will eventually be closed
without merging.

Feature space and requirements
------------------------------

When users give us the samples to be encoded, they have to provide:

- the FLTP tensor of decoded samples (akin to what we decode)
- the sample rate of the samples. That's needed for FFmpeg to know when each
  sample should be played, and it cannot be inferred.

Those are naturally supplied as 2 separate parameters (1 for the tensor, 1 for
the sample rate), but it could be a good UX if our APIs also allowed users to
pass a single
[AudioSamples](https://docs.pytorch.org/torchcodec/stable/generated/torchcodec.AudioSamples.html#torchcodec.AudioSamples)
object as a parameter.

We want to enable users to encode these samples:

- to a file, like "output.mp3". When encoding to a file, we automatically infer
  the format (mp3) from the filename.
- to a file-like object (NYI, will come eventually). When encoding to a
  file-like, we can't infer the format, so users have to specify it to us.
- to a tensor. Same here, users have to specify the output format.
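Format inference from a filename is essentially an extension lookup. As a minimal sketch of that rule (the helper name is hypothetical, not part of the proposal):

```py
from pathlib import Path


def infer_format(filename) -> str:
    # Hypothetical helper: derive the encoding format from the file extension,
    # e.g. "output.mp3" -> "mp3". File-likes and tensors have no filename,
    # which is why those APIs need an explicit `format` parameter instead.
    suffix = Path(filename).suffix
    if not suffix:
        raise ValueError(f"cannot infer format from {filename!r}")
    return suffix[1:].lower()


print(infer_format("output.mp3"))  # mp3
```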

We want to allow users to specify additional encoding options:

- The encoded bit rate, for compressed formats like mp3.
- The number of channels, to automatically encode to mono or to stereo,
  potentially different from that of the input (NYI).
- The encoded sample rate, to automatically encode into a given sample rate,
  potentially different from that of the input (NYI).
- (Maybe) other parameters, like codec-specific stuff.

API proposal
------------

### Option 1

A natural option is to create 3 separate stateless functions: one for each kind
of output we want to support.

```py
from pathlib import Path
from typing import Optional, Union

import torch


def encode_audio_to_file(
    samples: torch.Tensor,
    sample_rate: int,
    filename: Union[str, Path],
    bit_rate: Optional[int] = None,
    num_channels: Optional[int] = None,
    output_sample_rate: Optional[int] = None,
) -> None:
    pass


def encode_audio_to_file_like(
    samples: torch.Tensor,
    sample_rate: int,
    file_like: object,
    format: str,
    bit_rate: Optional[int] = None,
    num_channels: Optional[int] = None,
    output_sample_rate: Optional[int] = None,
) -> None:
    pass


def encode_audio_to_tensor(
    samples: torch.Tensor,
    sample_rate: int,
    format: str,
    bit_rate: Optional[int] = None,
    num_channels: Optional[int] = None,
    output_sample_rate: Optional[int] = None,
) -> torch.Tensor:
    pass
```

A few notes:

- I'm not attempting to define where the keyword-only parameters should start;
  that can be discussed later or on the PRs.
- Both `to_file_like` and `to_tensor` need an extra `format` parameter, because
  it cannot be inferred. In `to_file`, it is inferred from `filename`.
- To avoid a collision between the input sample rate and the optional desired
  output sample rate, we have to use `output_sample_rate`. That's a bit meh.
  Technically, all of `format`, `bit_rate` and `num_channels` could also qualify
  for the `output_` prefix, but that would be very heavy.

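To make the flat-signature shape concrete, here is a runnable stand-in sketch (plain Python, no real encoding): the stub below only echoes how the `output_` prefix disambiguates the two sample rates within one signature, and is not the real implementation.

```py
from typing import Optional


def encode_audio_to_tensor(
    samples,
    sample_rate: int,
    format: str,
    bit_rate: Optional[int] = None,
    num_channels: Optional[int] = None,
    output_sample_rate: Optional[int] = None,
):
    # Stub: a real implementation would hand the samples to FFmpeg and
    # return the encoded bytes. Here we just echo the resolved options.
    return {
        "format": format,
        "input_sample_rate": sample_rate,
        "output_sample_rate": output_sample_rate or sample_rate,
    }


# The `output_` prefix keeps the desired encoded rate distinct from the
# input rate, at the cost of the asymmetric naming noted above.
opts = encode_audio_to_tensor([0.0] * 1_600, 16_000, "mp3", output_sample_rate=44_100)
print(opts["output_sample_rate"])  # 44100
```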
### Option 2

Another option is to expose each of these functions as methods on a stateless
object.

```py
class AudioEncoder:
    def __init__(
        self,
        samples: torch.Tensor,
        sample_rate: int,
    ):
        pass

    def to_file(
        self,
        filename: Union[str, Path],
        bit_rate: Optional[int] = None,
        num_channels: Optional[int] = None,
        sample_rate: Optional[int] = None,
    ) -> None:
        pass

    def to_file_like(
        self,
        file_like: object,
        format: str,
        bit_rate: Optional[int] = None,
        num_channels: Optional[int] = None,
        sample_rate: Optional[int] = None,
    ) -> None:
        pass

    def to_tensor(
        self,
        format: str,
        bit_rate: Optional[int] = None,
        num_channels: Optional[int] = None,
        sample_rate: Optional[int] = None,
    ) -> torch.Tensor:
        pass
```

Usually, we like to expose objects (instead of stateless functions) when there
is a clear state to be managed. That's not the case here: the `AudioEncoder` is
largely stateless.
Instead, we can justify exposing an object by noting that it allows us to
cleanly separate unrelated blocks of parameters:

- the parameters relating to the **input** are in `__init__()`
- the parameters relating to the **output** are in the `to_*` methods.

A nice consequence is that we no longer have a collision between the 2
`sample_rate` parameters, and their respective purposes are clear from the
methods in which they are exposed.

### Option 2.1

A natural extension of option 2 is to allow users to pass an `AudioSamples`
object to `__init__()`, instead of passing 2 separate parameters:

```py
samples = ...  # AudioSamples, e.g. coming from the decoder
AudioEncoder(samples).to_file("output.wav")
```

This can be enabled via this kind of logic:

```py
class AudioEncoder:
    def __init__(
        self,
        samples: Union[torch.Tensor, AudioSamples],
        sample_rate: Optional[int] = None,
    ):
        assert (
            isinstance(samples, torch.Tensor) and sample_rate is not None
        ) or (
            isinstance(samples, AudioSamples) and sample_rate is None
        )
```
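To show that dispatch running end to end, here is a minimal torch-free sketch of the same normalization logic; the `AudioSamples` dataclass and plain-list samples below are stand-ins for the real `torchcodec.AudioSamples` and `torch.Tensor` types.

```py
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class AudioSamples:
    # Hypothetical stand-in for torchcodec.AudioSamples: it bundles the
    # decoded samples together with their sample rate.
    data: list
    sample_rate: int


class AudioEncoder:
    def __init__(
        self,
        samples: Union[list, AudioSamples],
        sample_rate: Optional[int] = None,
    ):
        if isinstance(samples, AudioSamples):
            # The bundle already carries its sample rate; a second one is redundant.
            assert sample_rate is None, "sample_rate is carried by AudioSamples"
            self._samples, self._sample_rate = samples.data, samples.sample_rate
        else:
            # Raw samples: the sample rate must be given explicitly.
            assert sample_rate is not None, "sample_rate is required with raw samples"
            self._samples, self._sample_rate = samples, sample_rate


encoder = AudioEncoder(AudioSamples(data=[0.0, 0.1], sample_rate=16_000))
print(encoder._sample_rate)  # 16000
```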

### Thinking ahead

I don't want to be prescriptive on what the video encoder should look like, but I
suspect that we will soon need to expose a **multistream** encoder, i.e. an
encoder that can encode both an audio and a video stream at the same time (think
of video generation models). I suspect the API of such an encoder will look
something like this (a bit similar to what TorchAudio exposes):

```py
Encoder().add_audio(...).add_video(...).to_file(filename)
Encoder().add_audio(...).add_video(...).to_file_like(filelike)
encoded_bytes = Encoder().add_audio(...).add_video(...).to_tensor()
```

This too will involve exposing an object, despite the actual managed "state"
being very limited.
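As an illustration of the chaining mechanics only (not a proposed implementation), a fluent builder like this just returns `self` from each `add_*` call while accumulating per-stream parameters; all names and defaults below are hypothetical:

```py
class Encoder:
    # Hypothetical sketch of a fluent multistream builder; a real encoder
    # would hand the accumulated streams to FFmpeg for muxing.
    def __init__(self):
        self._streams = []

    def add_audio(self, samples, sample_rate):
        self._streams.append(("audio", samples, sample_rate))
        return self  # returning self is what enables the chained calls

    def add_video(self, frames, frame_rate):
        self._streams.append(("video", frames, frame_rate))
        return self

    def to_tensor(self, format: str = "mp4"):
        # Placeholder: a real implementation would return encoded bytes.
        return f"<{format} with {len(self._streams)} streams>"


out = Encoder().add_audio([0.0], 16_000).add_video([], 30).to_tensor()
print(out)  # <mp4 with 2 streams>
```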
