Commit 030a134
ylacombe authored and yiyixuxu committed
Stable Audio integration (#8716)

* WIP modeling code and pipeline
* add custom attention processor + custom activation + add to init
* correct ProjectionModel forward
* add stable audio to __init__
* add autoencoder and update pipeline and modeling code
* add half Rope
* add partial rotary v2
* add temporary modfis to scheduler
* add EDM DPM Solver
* remove TODOs
* clean GLU
* remove att.group_norm to attn processor
* revert back src/diffusers/schedulers/scheduling_dpmsolver_multistep.py
* refactor GLU -> SwiGLU
* remove redundant args
* add channel multiples in autoencoder docstrings
* changes in docstrings and copyright headers
* clean pipeline
* further cleaning
* remove peft and lora and fromoriginalmodel
* Delete src/diffusers/pipelines/stable_audio/diffusers.code-workspace
* make style
* dummy models
* fix copied from
* add fast oobleck tests
* add brownian tree
* oobleck autoencoder slow tests
* remove TODO
* fast stable audio pipeline tests
* add slow tests
* make style
* add first version of docs
* wrap is_torchsde_available to the scheduler
* fix slow test
* test with input waveform
* add input waveform
* remove some todos
* create stableaudio gaussian projection + make style
* add pipeline to toctree
* fix copied from
* make quality
* refactor timestep_features->time_proj
* refactor joint_attention_kwargs->cross_attention_kwargs
* remove forward_chunk
* move StableAudioDitModel to transformers folder
* correct convert + remove partial rotary embed
* apply suggestions from yiyixuxu -> removing attn.kv_heads
* remove temb
* remove cross_attention_kwargs
* further removal of cross_attention_kwargs
* remove text encoder autocast to fp16
* continue removing autocast
* make style
* refactor how text and audio are embedded
* add paper
* update example code
* make style
* unify projection model forward + fix device placement
* make style
* remove fuse qkv
* apply suggestions from review
* Update src/diffusers/pipelines/stable_audio/pipeline_stable_audio.py (Co-authored-by: YiYi Xu <yixu310@gmail.com>)
* make style
* smaller models in fast tests
* pass sequential offloading fast tests
* add docs for vae and autoencoder
* make style and update example
* remove useless import
* add cosine scheduler
* dummy classes
* cosine scheduler docs
* better description of scheduler

---------

Co-authored-by: YiYi Xu <yixu310@gmail.com>
1 parent 1a903d8 commit 030a134

30 files changed: +3771 −9 lines

docs/source/en/_toctree.yml (8 additions, 0 deletions)

```diff
@@ -239,6 +239,8 @@
       title: AsymmetricAutoencoderKL
     - local: api/models/autoencoder_tiny
       title: Tiny AutoEncoder
+    - local: api/models/autoencoder_oobleck
+      title: Oobleck AutoEncoder
     - local: api/models/consistency_decoder_vae
       title: ConsistencyDecoderVAE
     - local: api/models/transformer2d
@@ -259,6 +261,8 @@
       title: TransformerTemporalModel
     - local: api/models/sd3_transformer2d
       title: SD3Transformer2DModel
+    - local: api/models/stable_audio_transformer
+      title: StableAudioDiTModel
     - local: api/models/prior_transformer
       title: PriorTransformer
     - local: api/models/controlnet
@@ -362,6 +366,8 @@
       title: Semantic Guidance
     - local: api/pipelines/shap_e
       title: Shap-E
+    - local: api/pipelines/stable_audio
+      title: Stable Audio
     - local: api/pipelines/stable_cascade
       title: Stable Cascade
   - sections:
@@ -425,6 +431,8 @@
       title: CMStochasticIterativeScheduler
     - local: api/schedulers/consistency_decoder
       title: ConsistencyDecoderScheduler
+    - local: api/schedulers/cosine_dpm
+      title: CosineDPMSolverMultistepScheduler
     - local: api/schedulers/ddim_inverse
       title: DDIMInverseScheduler
     - local: api/schedulers/ddim
```
docs/source/en/api/models/autoencoder_oobleck.md (new file, 38 additions)

```md
<!--Copyright 2024 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# AutoencoderOobleck

The Oobleck variational autoencoder (VAE) model with KL loss was introduced in [Stability-AI/stable-audio-tools](https://github.com/Stability-AI/stable-audio-tools) and [Stable Audio Open](https://huggingface.co/papers/2407.14358) by Stability AI. The model is used in 🤗 Diffusers to encode audio waveforms into latents and to decode latent representations into audio waveforms.

The abstract from the paper is:

*Open generative models are vitally important for the community, allowing for fine-tunes and serving as baselines when presenting new models. However, most current text-to-audio models are private and not accessible for artists and researchers to build upon. Here we describe the architecture and training process of a new open-weights text-to-audio model trained with Creative Commons data. Our evaluation shows that the model's performance is competitive with the state-of-the-art across various metrics. Notably, the reported FDopenl3 results (measuring the realism of the generations) showcase its potential for high-quality stereo sound synthesis at 44.1kHz.*

## AutoencoderOobleck

[[autodoc]] AutoencoderOobleck
  - decode
  - encode
  - all

## OobleckDecoderOutput

[[autodoc]] models.autoencoders.autoencoder_oobleck.OobleckDecoderOutput

## AutoencoderOobleckOutput

[[autodoc]] models.autoencoders.autoencoder_oobleck.AutoencoderOobleckOutput
```
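The practical effect of this autoencoder is temporal compression: a long stereo waveform becomes a short latent sequence for the diffusion transformer to attend over. A minimal sketch of that arithmetic, assuming the downsampling ratios commonly reported for Oobleck-style configs (2, 4, 4, 8, 8 is an assumption here; the authoritative values live in the model's config):

```python
import math

# Sketch of the Oobleck autoencoder's temporal compression.
# The downsampling ratios below are an assumption (a typical Oobleck-style
# config); check the actual model config for the real values.
downsampling_ratios = [2, 4, 4, 8, 8]
hop_length = math.prod(downsampling_ratios)  # waveform samples per latent frame

sampling_rate = 44_100  # Hz, as used by Stable Audio Open
duration_s = 47.0       # maximum generation length reported in the paper
num_samples = int(duration_s * sampling_rate)

# Length of the latent sequence the diffusion transformer operates on.
latent_length = num_samples // hop_length
print(f"{num_samples} samples -> {latent_length} latent frames (x{hop_length} compression)")
```

Under these assumed ratios, 47 seconds of 44.1 kHz audio (about 2.07 million samples) compresses to roughly a thousand latent frames, which is what makes full attention over the sequence tractable.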
docs/source/en/api/models/stable_audio_transformer.md (new file, 19 additions)

```md
<!--Copyright 2024 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# StableAudioDiTModel

A Transformer model for audio waveforms from [Stable Audio Open](https://huggingface.co/papers/2407.14358).

## StableAudioDiTModel

[[autodoc]] StableAudioDiTModel
```
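The commit log mentions adding a partial ("half") rotary position embedding to this model. As a rough illustration of the idea — rotating only the first `rot_dim` channels of each attention head and passing the rest through unchanged — here is a NumPy sketch (an assumption about the mechanism for illustration, not the actual `StableAudioDiTModel` code):

```python
import numpy as np

def partial_rope(x: np.ndarray, rot_dim: int, theta: float = 10000.0) -> np.ndarray:
    """Apply rotary position embedding to the first `rot_dim` channels of x.

    x has shape (seq_len, head_dim). Channels beyond rot_dim pass through
    unchanged -- the "partial rotary" idea referenced in the commit message.
    """
    seq_len, head_dim = x.shape
    assert rot_dim % 2 == 0 and rot_dim <= head_dim
    x_rot, x_pass = x[:, :rot_dim], x[:, rot_dim:]

    half = rot_dim // 2
    freqs = 1.0 / theta ** (np.arange(half) / half)  # per-pair rotation frequencies
    angles = np.outer(np.arange(seq_len), freqs)     # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)

    # Rotate each (x1, x2) channel pair by its position-dependent angle.
    x1, x2 = x_rot[:, :half], x_rot[:, half:]
    rotated = np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x_pass], axis=-1)
```

Because the operation is a pure rotation, it preserves the norm of the rotated channels and leaves position 0 unchanged; the pass-through channels carry position-independent content.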

docs/source/en/api/pipelines/overview.md (1 addition, 0 deletions)

```diff
@@ -71,6 +71,7 @@ The table below lists all the pipelines currently available in 🤗 Diffusers an
 | [Semantic Guidance](semantic_stable_diffusion) | text2image |
 | [Shap-E](shap_e) | text-to-3D, image-to-3D |
 | [Spectrogram Diffusion](spectrogram_diffusion) | |
+| [Stable Audio](stable_audio) | text2audio |
 | [Stable Diffusion](stable_diffusion/overview) | text2image, image2image, depth2image, inpainting, image variation, latent upscaler, super-resolution |
 | [Stable Diffusion Model Editing](model_editing) | model editing |
 | [Stable Diffusion XL](stable_diffusion/stable_diffusion_xl) | text2image, image2image, inpainting |
```
docs/source/en/api/pipelines/stable_audio.md (new file, 42 additions)

```md
<!--Copyright 2024 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# Stable Audio

Stable Audio was proposed in [Stable Audio Open](https://arxiv.org/abs/2407.14358) by Zach Evans et al. It takes a text prompt as input and predicts the corresponding sound or music sample.

Stable Audio Open generates variable-length (up to 47s) stereo audio at 44.1kHz from text prompts. It comprises three components: an autoencoder that compresses waveforms into a manageable sequence length, a T5-based text embedding for text conditioning, and a transformer-based diffusion (DiT) model that operates in the latent space of the autoencoder.

Stable Audio is trained on a corpus of around 48k audio recordings, where around 47k are from Freesound and the rest are from the Free Music Archive (FMA). All audio files are licensed under CC0, CC BY, or CC Sampling+. This data is used to train the autoencoder and the DiT.

The abstract of the paper is the following:

*Open generative models are vitally important for the community, allowing for fine-tunes and serving as baselines when presenting new models. However, most current text-to-audio models are private and not accessible for artists and researchers to build upon. Here we describe the architecture and training process of a new open-weights text-to-audio model trained with Creative Commons data. Our evaluation shows that the model's performance is competitive with the state-of-the-art across various metrics. Notably, the reported FDopenl3 results (measuring the realism of the generations) showcase its potential for high-quality stereo sound synthesis at 44.1kHz.*

This pipeline was contributed by [Yoach Lacombe](https://huggingface.co/ylacombe). The original codebase can be found at [Stability-AI/stable-audio-tools](https://github.com/Stability-AI/stable-audio-tools).

## Tips

When constructing a prompt, keep in mind:

* Descriptive prompt inputs work best; use adjectives to describe the sound (for example, "high quality" or "clear") and make the prompt context specific where possible (e.g. "melodic techno with a fast beat and synths" works better than "techno").
* Using a *negative prompt* can significantly improve the quality of the generated audio. Try using a negative prompt of "low quality, average quality".

During inference:

* The _quality_ of the generated audio sample can be controlled by the `num_inference_steps` argument; higher steps give higher quality audio at the expense of slower inference.
* Multiple waveforms can be generated in one go: set `num_waveforms_per_prompt` to a value greater than 1 to enable. Automatic scoring will be performed between the generated waveforms and prompt text, and the audios ranked from best to worst accordingly.

## StableAudioPipeline

[[autodoc]] StableAudioPipeline
  - all
  - __call__
```
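An end-to-end usage sketch along these lines, mirroring the parameters discussed in the tips (it requires a CUDA GPU and downloads the `stabilityai/stable-audio-open-1.0` checkpoint; the third-party `soundfile` package is used only to write the result to disk):

```python
import soundfile as sf
import torch

from diffusers import StableAudioPipeline

# Load the released Stable Audio Open checkpoint in half precision.
pipe = StableAudioPipeline.from_pretrained(
    "stabilityai/stable-audio-open-1.0", torch_dtype=torch.float16
).to("cuda")

generator = torch.Generator("cuda").manual_seed(0)

# Descriptive prompt + negative prompt, as recommended above.
audio = pipe(
    "Melodic techno with a fast beat and synths, high quality.",
    negative_prompt="low quality, average quality",
    num_inference_steps=200,     # more steps -> higher quality, slower inference
    audio_end_in_s=10.0,         # length of the generated clip in seconds
    num_waveforms_per_prompt=3,  # generated waveforms are auto-ranked best-first
    generator=generator,
).audios

# audio[0] is the best-ranked waveform, shape (channels, samples).
output = audio[0].T.float().cpu().numpy()
sf.write("techno.wav", output, pipe.vae.sampling_rate)
```

The parameter values (200 steps, 10 seconds, 3 waveforms) are illustrative defaults for experimentation, not tuned recommendations.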
docs/source/en/api/schedulers/cosine_dpm.md (new file, 24 additions)

```md
<!--Copyright 2024 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->

# CosineDPMSolverMultistepScheduler

The [`CosineDPMSolverMultistepScheduler`] is a variant of [`DPMSolverMultistepScheduler`] with a cosine schedule, proposed by Nichol and Dhariwal (2021).
It is used in the [Stable Audio Open](https://arxiv.org/abs/2407.14358) paper and the [Stability-AI/stable-audio-tools](https://github.com/Stability-AI/stable-audio-tools) codebase.

This scheduler was contributed by [Yoach Lacombe](https://huggingface.co/ylacombe).

## CosineDPMSolverMultistepScheduler

[[autodoc]] CosineDPMSolverMultistepScheduler

## SchedulerOutput

[[autodoc]] schedulers.scheduling_utils.SchedulerOutput
```
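For reference, the cosine schedule of Nichol and Dhariwal (2021) that gives this scheduler its name defines the cumulative signal level as ᾱ(t) = cos²(((t/T + s)/(1 + s)) · π/2). A small self-contained sketch of that schedule (it illustrates the cosine ᾱ curve itself, not this scheduler's internal parameterization):

```python
import math

def alpha_bar(t: float, s: float = 0.008) -> float:
    """Cumulative signal level for continuous time t in [0, 1] (cosine schedule)."""
    return math.cos((t + s) / (1 + s) * math.pi / 2) ** 2

def cosine_betas(num_steps: int, max_beta: float = 0.999) -> list:
    """Discrete betas implied by the cosine alpha-bar curve, clipped at max_beta."""
    return [
        min(1 - alpha_bar((i + 1) / num_steps) / alpha_bar(i / num_steps), max_beta)
        for i in range(num_steps)
    ]

betas = cosine_betas(1000)
```

The small offset `s` keeps the betas finite near t = 0, and the clip at 0.999 prevents degenerate steps near t = 1, where the cosine curve drives the signal level to zero.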

0 commit comments

Comments
 (0)