
Support MiniCPM-2B-128k #6602


Closed · wants to merge 8 commits into from

Conversation

@zkh2016 (Contributor) commented Apr 11, 2024

Comment on lines +4383 to +4385
if (!hparams.tie_lm_head) {
    model.output = ml.create_tensor(ctx_output_split, tn(LLM_TENSOR_OUTPUT, "weight"), {n_embd, n_vocab}, false);
}
Member:

We already handle tied tensors a few lines below. Maybe you simply have to remove the if (model.arch != LLM_ARCH_MINICPM) { check and this model would work?
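
For context, the generic fallback referred to follows roughly this pattern (a simplified sketch of llama.cpp's tied-embedding handling at the time, not a verbatim quote):

    // try to load a dedicated output head; the final `false` marks it optional
    model.output = ml.create_tensor(ctx_output_split, tn(LLM_TENSOR_OUTPUT, "weight"), {n_embd, n_vocab}, false);
    if (model.output == NULL) {
        // tied embeddings: no separate output head, so reuse token_embd.weight
        model.output = ml.create_tensor(ctx_output, tn(LLM_TENSOR_TOKEN_EMBD, "weight"), {n_embd, n_vocab});
    }

With that generic path in place, a per-architecture tie_lm_head check may be unnecessary.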

Contributor Author:

Done.

Contributor Author:

Hello @ggerganov, we are currently hitting a problem while adapting our new model: for contexts shorter than 4k the evaluation results are consistent with vllm, but beyond 4k the results are worse than vllm's. What could the reason be? We evaluate through example/server, with the following startup command and request parameters:

./server -m MiniCPM-2B-128k/ggml-model-f16.gguf --chat-template chatml --rope-freq-base 4129032.258 --host 0.0.0.0 -c 12000

request data:

   data = {"stream": False,
            "n_predict": max_token,
            "temperature": 0.3,
            "stop": ["<|im_end|>", "</s>"],
            "repeat_last_n": 256,
            "repeat_penalty": 1.0,
            "top_k": 40,
            "top_p": 0.5,
            "min_p": 0.05,
            "tfs_z": 1,
            "typical_p": 1,
            "presence_penalty": 0,
            "frequency_penalty": 0,
            "mirostat": 0,
            "mirostat_tau": 5,
            "mirostat_eta": 0.1,
            "grammar": "", "n_probs": 0, "min_keep": 0, "image_data": [], "cache_prompt": True,
            "api_key": "",
            "prompt": f"<|im_start|>user{prompt}<|im_end|><|im_start|>assistant\n"
            }

vllm params:

params_dict = {
    "n": 1,
    "best_of": None,
    "presence_penalty": 0.0,
    "frequency_penalty": 0.0,
    "repetition_penalty": 1.0,
    "temperature": 0.3,
    "top_p": 0.5,
    "top_k": -1,
    "use_beam_search": False,
    "length_penalty": 1.0,
    "early_stopping": False,
    # the original dict listed "stop" twice ("stop": None and this one);
    # the duplicate key is dropped since the later entry wins anyway
    "stop": ["<|im_end|>", "</s>"],
    "stop_token_ids": None,
    "ignore_eos": False,
    "logprobs": None,
    "prompt_logprobs": None,
    "skip_special_tokens": False,
}

Member:

Why do you use --rope-freq-base 4129032.258 when the config specifies 1e6:

https://huggingface.co/openbmb/MiniCPM-2B-128k/blob/main/config.json#L34

Also, this model seems to use some rope scaling:

https://huggingface.co/openbmb/MiniCPM-2B-128k/blob/main/config.json#L25

You need to apply the same scaling when starting the server.
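
For example, something along these lines (an illustration only: the flag values are placeholders to adapt, not verified settings for this model, and example/server exposes linear and YaRN scaling rather than the model's dynamic NTK):

./server -m MiniCPM-2B-128k/ggml-model-f16.gguf --chat-template chatml --rope-freq-base 1000000 --rope-scaling linear --rope-freq-scale 0.5 --host 0.0.0.0 -c 12000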

Contributor Author:

The model currently uses DynamicNTKScalingRotaryEmbedding. How should I pass parameters?
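
For reference, transformers' DynamicNTKScalingRotaryEmbedding rescales the RoPE base on the fly once the sequence outgrows the training context; a C++ sketch of that formula (transcribed from the HF implementation as an assumption, since llama.cpp has no flag for this mode):

    #include <cmath>

    // base' = base * ((factor * seq_len / max_pos) - (factor - 1)) ^ (dim / (dim - 2)),
    // applied only once seq_len exceeds max_position_embeddings
    float dynamic_ntk_base(float base, int dim, int seq_len, int max_pos, float factor) {
        if (seq_len <= max_pos) {
            return base; // inside the training context: no rescaling
        }
        const float t = factor * (float) seq_len / (float) max_pos - (factor - 1.0f);
        return base * std::pow(t, (float) dim / (float) (dim - 2));
    }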

Contributor:

It can run up to 64k without NTK scaling.

Contributor:

@foldl @zkh2016 hi, do we support Dynamic NTK scaling in llama.cpp? Thanks.

Contributor:

OK, thanks for your quick reply.

zhangkaihuo added 2 commits April 16, 2024 20:29
@@ -1548,6 +1548,8 @@ def set_gguf_parameters(self):
        self.gguf_writer.add_head_count_kv(self.hparams["num_key_value_heads"])
        self.gguf_writer.add_layer_norm_rms_eps(self.hparams["rms_norm_eps"])
        self.gguf_writer.add_file_type(self.ftype)
        if "tie_lm_head" in self.hparams:

The key is named tie_word_embeddings in huggingface's config, not tie_lm_head.
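
A corrected check could read (a sketch mirroring the diff above; "minicpm.tie_lm_head" is a hypothetical KV name standing in for whatever this PR actually writes):

        # read the standard HF config key instead of "tie_lm_head"
        if "tie_word_embeddings" in self.hparams:
            self.gguf_writer.add_bool("minicpm.tie_lm_head", self.hparams["tie_word_embeddings"])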

@mofosyne added the enhancement (New feature or request), model (Model specific), and Review Complexity : Medium labels May 10, 2024
Contributor:

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 548 iterations 🚀

Details (performance run):
  • Concurrent users: 8, duration: 10m
  • HTTP request: avg=8562.98ms p(95)=20759.53ms fails=, finish reason: stop=484 truncated=64
  • Prompt processing (pp): avg=94.97tk/s p(95)=397.46tk/s
  • Token generation (tg): avg=33.01tk/s p(95)=47.34tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=new_minicpm commit=f63f147471a2c45ba3c331d5f3578141243d3553

[Benchmark charts: prompt_tokens_seconds, predicted_tokens_seconds, kv_cache_usage_ratio, requests_processing — llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 548 iterations]

@zkh2016 (Contributor Author) commented May 13, 2024

Moved to #6919.

@zkh2016 closed this May 13, 2024
Labels
enhancement (New feature or request) · model (Model specific) · Review Complexity : Medium (generally requires more time to grok but manageable by beginner to medium expertise level)

6 participants