
feat: add bpe #119

Merged · 7 commits into leejet:master · Dec 23, 2023

Conversation

@Cyberhan123 (Contributor) commented Dec 19, 2023

fix: #85

@leejet (Owner) commented Dec 19, 2023

I don't think it's necessary to load the merges file; the vocab is already available, just use it. The only thing we need to do is implement the actual BPE function.
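For reference, the core of the BPE function mentioned here can be sketched as a loop that repeatedly merges the adjacent symbol pair with the lowest merge rank. This is a minimal illustrative sketch, not the code from this PR; `merge_rank` is an assumed map from symbol pairs to their priority in merges.txt:

```cpp
#include <limits>
#include <map>
#include <string>
#include <vector>

using Pair = std::pair<std::string, std::string>;

// Greedy BPE: repeatedly merge the adjacent pair with the lowest rank
// until no adjacent pair appears in the merge table.
std::vector<std::string> bpe(std::vector<std::string> symbols,
                             const std::map<Pair, int>& merge_rank) {
    while (symbols.size() > 1) {
        int    best_rank = std::numeric_limits<int>::max();
        size_t best_i    = 0;
        for (size_t i = 0; i + 1 < symbols.size(); i++) {
            auto it = merge_rank.find({symbols[i], symbols[i + 1]});
            if (it != merge_rank.end() && it->second < best_rank) {
                best_rank = it->second;
                best_i    = i;
            }
        }
        if (best_rank == std::numeric_limits<int>::max()) {
            break;  // no applicable merge left
        }
        symbols[best_i] += symbols[best_i + 1];       // merge the pair
        symbols.erase(symbols.begin() + best_i + 1);  // drop the right half
    }
    return symbols;
}
```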

@Cyberhan123 (Contributor, Author)

Okay, I'll adjust it

@Cyberhan123 (Contributor, Author) commented Dec 19, 2023

> I don't think it's necessary to load the merges file; the vocab is already available, just use it. The only thing we need to do is implement the actual BPE function.

I actually want to implement customized CLIP loading, for example chinese-clip. I wonder if this idea is feasible?

Edit: a possible obstacle is that the tokenizer it uses is not BPE at all.

@leejet (Owner) commented Dec 19, 2023

The CLIP model works if it can produce output with a distribution similar to that of the CLIP model used by Stable Diffusion, or if there is a dedicated UNet model. I don't know whether chinese-clip is suitable for Stable Diffusion, because I haven't studied its training process in detail.

@Cyberhan123 (Contributor, Author)

> I don't think it's necessary to load the merges file; the vocab is already available, just use it. The only thing we need to do is implement the actual BPE function.

@leejet I just checked the built-in vocab.json. vocab.json cannot replace the function of merges.txt because it does not provide the subword merge order. Another problem is that the token IDs of the BPE algorithm are derived from word-occurrence frequency rather than a fixed mapping, so I am a bit worried that unexpected errors may occur when using the built-in vocabulary list.
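To make the ordering point concrete: the rank map that drives BPE comes straight from the line order of merges.txt, which vocab.json does not record. An illustrative sketch, not the PR's code:

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

using Pair = std::pair<std::string, std::string>;

// Line i of merges.txt becomes rank i: an earlier line means that pair
// is merged first. vocab.json alone carries no such ordering.
std::map<Pair, int> build_merge_ranks(const std::vector<Pair>& merges) {
    std::map<Pair, int> merge_rank;
    for (size_t rank = 0; rank < merges.size(); rank++) {
        merge_rank[merges[rank]] = (int)rank;
    }
    return merge_rank;
}
```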

@leejet (Owner) commented Dec 20, 2023

I've improved the tokenizer's handling of Unicode, generating the vocab from the merges information, similar to https://github.com/openai/CLIP/blob/main/clip/simple_tokenizer.py. Now you can merge the latest master branch into your PR, making it convenient to implement BPE. Please refer to that file for guidance.
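The vocab derivation described here follows simple_tokenizer.py's ordering: base byte tokens, their "</w>" end-of-word variants, one token per merge, then the special tokens. A hedged sketch; `byte_tokens` stands in for the bytes-to-unicode table and all names are illustrative assumptions:

```cpp
#include <string>
#include <vector>

// Token id == index into the returned list, matching simple_tokenizer.py.
std::vector<std::string> build_vocab(
        const std::vector<std::pair<std::string, std::string>>& merges,
        const std::vector<std::string>& byte_tokens) {
    std::vector<std::string> vocab = byte_tokens;   // base byte characters
    for (const auto& t : byte_tokens) {
        vocab.push_back(t + "</w>");                // end-of-word variants
    }
    for (const auto& m : merges) {
        vocab.push_back(m.first + m.second);        // one token per merge
    }
    vocab.push_back("<|startoftext|>");
    vocab.push_back("<|endoftext|>");
    return vocab;
}
```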

@FSSRepo (Contributor) commented Dec 20, 2023

I already have an implementation of embeddings (textual inversion) ready, and it works fine. However, it requires a specific way of loading the embedding from the prompt, e.g. `bad quality, ugly, face malformed, bad anatomy, embd:bad-art-neg, embd:<embedding filename>`, since custom embeddings are appended to the end of token_embed_weight, and "new tokens" that point to the positions of the new embeddings have to be created.

Another issue is that it is unclear how other projects handle very long prompts of more than 77 tokens. Do they only use the first 77 and ignore the rest? Some embeddings require up to 75 tokens, which covers almost the entire available context.
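For context, the scheme described above can be pictured as appending rows to the token embedding matrix and minting fresh token IDs that point at them. All names below (`TokenEmbeddings`, `add_embedding`) are illustrative assumptions, not the PR's actual identifiers:

```cpp
#include <string>
#include <unordered_map>
#include <vector>

struct TokenEmbeddings {
    size_t hidden_size = 768;                        // CLIP-L hidden size
    std::vector<float> weight;                       // [n_tokens * hidden_size], row-major
    std::unordered_map<std::string, int> token_ids;  // "embd:<name>" -> id

    // Append one embedding vector and return the new token id that
    // points at the appended row of token_embed_weight.
    int add_embedding(const std::string& name, const std::vector<float>& vec) {
        int new_id = (int)(weight.size() / hidden_size);
        weight.insert(weight.end(), vec.begin(), vec.end());
        token_ids["embd:" + name] = new_id;
        return new_id;
    }
};
```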

@Cyberhan123 (Contributor, Author)

> I've improved the tokenizer's handling of Unicode, generating the vocab from the merges information, similar to https://github.com/openai/CLIP/blob/main/clip/simple_tokenizer.py. Now you can merge the latest master branch into your PR, making it convenient to implement BPE. Please refer to that file for guidance.

Thx

Merge commit message:

    # Conflicts:
    #	model.cpp
    #	model.h
    #	stable-diffusion.cpp
@Cyberhan123 (Contributor, Author)

Now we have completed all the logic, but I feel the performance is not very good. After investigation, I found that load_from_merges does a lot of looping. Now we need to make a trade-off between faster loading and lower memory usage.

If we want to improve the speed, we can add a built-in vocab.json, which will increase memory usage by about 1 MB. If we want to use less memory, we can keep the current code.

@Cyberhan123 (Contributor, Author)

> I already have an implementation of embeddings (textual inversion) ready, and it works fine. However, it requires a specific way of loading the embedding from the prompt, e.g. `bad quality, ugly, face malformed, bad anatomy, embd:bad-art-neg, embd:<embedding filename>`, since custom embeddings are appended to the end of token_embed_weight, and "new tokens" that point to the positions of the new embeddings have to be created.
>
> Another issue is that it is unclear how other projects handle very long prompts of more than 77 tokens. Do they only use the first 77 and ignore the rest? Some embeddings require up to 75 tokens, which covers almost the entire available context.

@FSSRepo Please check out the link below:
https://github.com/AUTOMATIC1111/stable-diffusion-webui/blob/cf2772fab0af5573da775e7437e6acdca424f26e/modules/sd_hijack_clip.py#L203

In summary, the prompt is split into multiple chunks.
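Based on the linked AUTOMATIC1111 code, the idea is roughly: split the token stream into chunks of at most 75 IDs, wrap each with BOS/EOS (plus padding) into a full 77-token context, encode each chunk separately, and concatenate the results. A hedged sketch of the splitting step, with assumed names:

```cpp
#include <algorithm>
#include <vector>

// Split `tokens` into 77-token contexts: [BOS, up to 75 ids, EOS/padding].
std::vector<std::vector<int>> chunk_tokens(const std::vector<int>& tokens,
                                           int bos, int eos,
                                           size_t chunk_size = 75) {
    std::vector<std::vector<int>> chunks;
    size_t n = tokens.size();
    for (size_t i = 0; i < n || chunks.empty(); i += chunk_size) {
        size_t end = std::min(i + chunk_size, n);
        std::vector<int> chunk;
        chunk.push_back(bos);
        chunk.insert(chunk.end(), tokens.begin() + i, tokens.begin() + end);
        while (chunk.size() < chunk_size + 2) {
            chunk.push_back(eos);  // pad the tail of the last chunk
        }
        chunks.push_back(chunk);
    }
    return chunks;  // each chunk is encoded separately, then concatenated
}
```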

@leejet (Owner) commented Dec 21, 2023

> Now we have completed all the logic, but I feel the performance is not very good. After investigation, I found that load_from_merges does a lot of looping. Now we need to make a trade-off between faster loading and lower memory usage.
>
> If we want to improve the speed, we can add a built-in vocab.json, which will increase memory usage by about 1 MB. If we want to use less memory, we can keep the current code.

On my computer, load_merges takes 82 ms and load_vocab takes 53 ms. There isn't a significant difference. These operations, like loading the model weights, run only once, and the time spent is negligible compared to loading the weights. I believe there's no need to do this.

@Cyberhan123 (Contributor, Author)

We can merge this PR.

@Cyberhan123 (Contributor, Author)

@FSSRepo Looking forward to the merge of PR #104 so I can go ahead and work on the HIP PR.

@FSSRepo (Contributor) commented Dec 21, 2023

My pull request is ready; however, a user is experiencing issues with the VAE tiling function for reasons unknown to me. I have tested it with images of various resolutions and haven't encountered problems with incomplete images.

@Cyberhan123 (Contributor, Author)

> My pull request is ready; however, a user is experiencing issues with the VAE tiling function for reasons unknown to me. I have tested it with images of various resolutions and haven't encountered problems with incomplete images.

I tested on macOS and on Windows with CUDA and didn't encounter this problem.

@leejet (Owner) commented Dec 21, 2023

> My pull request is ready; however, a user is experiencing issues with the VAE tiling function for reasons unknown to me. I have tested it with images of various resolutions and haven't encountered problems with incomplete images.

I've finished reviewing your PR and I'm ready to merge it, but I'm unsure whether Green-Sky will still encounter the issue.

@leejet (Owner) commented Dec 21, 2023

Once I've reviewed the changes in this PR, I might merge it first.

@mwt11a commented Dec 21, 2023

I am finding that this PR produces images that have very little relation to the prompt. Running on Linux and using CPU only.

Even with a simple prompt like "photo of a cat next to a christmas tree, wearing a christmas hat": master will generate an image of a cat sitting next to a Christmas tree and wearing a red Santa-style hat, while this PR will generate a woman wearing a woolly hat, generally a close-up head shot of her.

@Cyberhan123 (Contributor, Author)

> I am finding that this PR produces images that have very little relation to the prompt. Running on Linux and using CPU only.
>
> Even with a simple prompt like "photo of a cat next to a christmas tree, wearing a christmas hat": master will generate an image of a cat sitting next to a Christmas tree and wearing a red Santa-style hat, while this PR will generate a woman wearing a woolly hat, generally a close-up head shot of her.

I just tried it and it happened to me too. Let me see what caused it.

@FSSRepo (Contributor) commented Dec 22, 2023

@Cyberhan123 Is this a complete implementation of BPE, or are there still things missing? If it's complete, you could remove "Implement BPE Tokenizer" from the TODO section of the README.

@Cyberhan123 (Contributor, Author) commented Dec 22, 2023

> @Cyberhan123 Is this a complete implementation of BPE, or are there still things missing? If it's complete, you could remove "Implement BPE Tokenizer" from the TODO section of the README.

@FSSRepo This is the complete implementation; I just made a low-level mistake, which has now been fixed.

@Cyberhan123 (Contributor, Author)

@mwt11a
If you want, please pull the latest code and test it; it works fine for me.

    ./sd -m ../../models/sd-v1-4.ckpt -p "photo of a cat next to a christmas tree, wearing a christmas hat" --type q8_0

[output image]

@Cyberhan123 (Contributor, Author)

emmm, the cat generated by this model looks a bit like Gargamel...

@Cyberhan123 (Contributor, Author) commented Dec 22, 2023

@leejet
When I build with MSVC and Ninja, I think I know what caused it: the -DCMAKE_BUILD_TYPE=Debug option.

[screenshot]
[screenshot]

But -DCMAKE_BUILD_TYPE=Release does not trigger it.

[screenshot]

So I changed the code to:

  auto end = merges.begin() + 49152 - 256 - 2 + 1; // this is the line that throws
  merges = std::vector<std::u32string>(merges.begin() + 1, end);

Then I stepped into the exception. The macro _ITERATOR_DEBUG_LEVEL, which checks whether an iterator moves out of range, only takes effect in Debug mode; in Release mode the check is compiled away to (void)_Off.

[screenshot]
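For reference, one way to avoid the Debug-mode assertion is to clamp the slice bound so the iterator can never advance past merges.end(), since advancing past the end is exactly what the _ITERATOR_DEBUG_LEVEL machinery asserts on. A sketch against the snippet above, assuming merges holds at least one entry:

```cpp
#include <algorithm>

// Clamp to merges.size() so begin() + take is always a valid iterator.
size_t take = std::min(merges.size(), (size_t)(49152 - 256 - 2 + 1));
merges = std::vector<std::u32string>(merges.begin() + 1, merges.begin() + take);
```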

@Cyberhan123 (Contributor, Author) commented Dec 22, 2023

I am more accustomed to using JetBrains products, such as WebStorm, GoLand, and CLion. Maybe due to lack of experience, I didn't find a suitable way to format the code, which caused extra work for you. I'm sorry.

@FSSRepo (Contributor) commented Dec 22, 2023

I'm working with VSCode 😂😂😂 and a lot of extensions

@leejet (Owner) commented Dec 23, 2023

@Cyberhan123 I've completed reviewing the code in this PR, and it looks good to me. I'm planning to merge it now. Thank you for your contribution.

leejet merged commit 0e64238 into leejet:master on Dec 23, 2023.
Cyberhan123 deleted the bpe branch on January 5, 2024.
Successfully merging this pull request may close these issues:

bug: tokenizer only matches exact words and no subwords (#85)