
Commit 34f7bed

Merge remote-tracking branch 'upstream/master' into Alcpz/mmvq_q4_0_reorder

2 parents: 351ef2b + d2b2031

117 files changed: +8161 −4008 lines changed


.clang-tidy

Lines changed: 1 addition & 0 deletions
@@ -13,6 +13,7 @@ Checks: >
     -readability-magic-numbers,
     -readability-uppercase-literal-suffix,
     -readability-simplify-boolean-expr,
+    -readability-math-missing-parentheses,
     clang-analyzer-*,
     -clang-analyzer-security.insecureAPI.DeprecatedOrUnsafeBufferHandling,
     performance-*,
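
For context, `readability-math-missing-parentheses` warns when operators of different precedence are mixed in one arithmetic expression without explicit grouping. A minimal illustration of the kind of code the now-silenced check would flag (hypothetical names, not from this commit):

    // The check would ask for explicit grouping here, since '*' binds tighter than '+'.
    int total = base + count * unit_cost; // suggested form: base + (count * unit_cost)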

.github/workflows/build.yml

Lines changed: 3 additions & 2 deletions
@@ -601,8 +601,9 @@ jobs:
            -DGGML_SYCL_F16=ON
          cmake --build build --config Release -j $(nproc)
 
-  build-linux-cross:
-    uses: ./.github/workflows/build-linux-cross.yml
+  # Disabled for now due to sporadic issue syncing.
+  # build-linux-cross:
+  #   uses: ./.github/workflows/build-linux-cross.yml
 
   macOS-latest-cmake-ios:
     runs-on: macos-latest

README.md

Lines changed: 3 additions & 2 deletions
@@ -16,8 +16,9 @@ Inference of Meta's [LLaMA](https://arxiv.org/abs/2302.13971) model (and others)
 
 ## Hot topics
 
-- **How to use [MTLResidencySet](https://developer.apple.com/documentation/metal/mtlresidencyset?language=objc) to keep the GPU memory active?** https://github.com/ggml-org/llama.cpp/pull/11427
-- **VS Code extension for FIM completions:** https://github.com/ggml-org/llama.vscode
+- **GGML developer experience survey (organized and reviewed by NVIDIA):** [link](https://forms.gle/Gasw3cRgyhNEnrwK9)
+- A new binary `llama-mtmd-cli` is introduced to replace `llava-cli`, `minicpmv-cli` and `gemma3-cli` https://github.com/ggml-org/llama.cpp/pull/13012, `libllava` will be deprecated
+- VS Code extension for FIM completions: https://github.com/ggml-org/llama.vscode
 - Universal [tool call support](./docs/function-calling.md) in `llama-server` https://github.com/ggml-org/llama.cpp/pull/9639
 - Vim/Neovim plugin for FIM completions: https://github.com/ggml-org/llama.vim
 - Introducing GGUF-my-LoRA https://github.com/ggml-org/llama.cpp/discussions/10123

SECURITY.md

Lines changed: 2 additions & 1 deletion
@@ -40,7 +40,8 @@ To protect sensitive data from potential leaks or unauthorized access, it is cru
 ### Untrusted environments or networks
 
 If you can't run your models in a secure and isolated environment or if it must be exposed to an untrusted network, make sure to take the following security precautions:
-* Confirm the hash of any downloaded artifact (e.g. pre-trained model weights) matches a known-good value
+* Do not use the RPC backend, [rpc-server](https://github.com/ggml-org/llama.cpp/tree/master/examples/rpc) and [llama-server](https://github.com/ggml-org/llama.cpp/tree/master/examples/server) functionality (see https://github.com/ggml-org/llama.cpp/pull/13061).
+* Confirm the hash of any downloaded artifact (e.g. pre-trained model weights) matches a known-good value.
 * Encrypt your data if sending it over the network.
 
 ### Multi-Tenant environments

common/arg.cpp

Lines changed: 135 additions & 56 deletions
@@ -38,6 +38,11 @@
 
 using json = nlohmann::ordered_json;
 
+std::initializer_list<enum llama_example> mmproj_examples = {
+    LLAMA_EXAMPLE_LLAVA,
+    // TODO: add LLAMA_EXAMPLE_SERVER when it's ready
+};
+
 common_arg & common_arg::set_examples(std::initializer_list<enum llama_example> examples) {
     this->examples = std::move(examples);
     return *this;
@@ -157,6 +162,10 @@ struct common_hf_file_res {
 
 #ifdef LLAMA_USE_CURL
 
+bool common_has_curl() {
+    return true;
+}
+
 #ifdef __linux__
 #include <linux/limits.h>
 #elif defined(_WIN32)
@@ -522,64 +531,89 @@ static bool common_download_model(
     return true;
 }
 
-/**
- * Allow getting the HF file from the HF repo with tag (like ollama), for example:
- * - bartowski/Llama-3.2-3B-Instruct-GGUF:q4
- * - bartowski/Llama-3.2-3B-Instruct-GGUF:Q4_K_M
- * - bartowski/Llama-3.2-3B-Instruct-GGUF:q5_k_s
- * Tag is optional, default to "latest" (meaning it checks for Q4_K_M first, then Q4, then if not found, return the first GGUF file in repo)
- *
- * Return pair of <repo, file> (with "repo" already having tag removed)
- *
- * Note: we use the Ollama-compatible HF API, but not using the blobId. Instead, we use the special "ggufFile" field which returns the value for "hf_file". This is done to be backward-compatible with existing cache files.
- */
-static struct common_hf_file_res common_get_hf_file(const std::string & hf_repo_with_tag, const std::string & bearer_token) {
-    auto parts = string_split<std::string>(hf_repo_with_tag, ':');
-    std::string tag = parts.size() > 1 ? parts.back() : "latest";
-    std::string hf_repo = parts[0];
-    if (string_split<std::string>(hf_repo, '/').size() != 2) {
-        throw std::invalid_argument("error: invalid HF repo format, expected <user>/<model>[:quant]\n");
-    }
-
-    // fetch model info from Hugging Face Hub API
+std::pair<long, std::vector<char>> common_remote_get_content(const std::string & url, const common_remote_params & params) {
     curl_ptr curl(curl_easy_init(), &curl_easy_cleanup);
     curl_slist_ptr http_headers;
-    std::string res_str;
+    std::vector<char> res_buffer;
 
-    std::string model_endpoint = get_model_endpoint();
-
-    std::string url = model_endpoint + "v2/" + hf_repo + "/manifests/" + tag;
     curl_easy_setopt(curl.get(), CURLOPT_URL, url.c_str());
     curl_easy_setopt(curl.get(), CURLOPT_NOPROGRESS, 1L);
+    curl_easy_setopt(curl.get(), CURLOPT_FOLLOWLOCATION, 1L);
     typedef size_t(*CURLOPT_WRITEFUNCTION_PTR)(void * ptr, size_t size, size_t nmemb, void * data);
     auto write_callback = [](void * ptr, size_t size, size_t nmemb, void * data) -> size_t {
-        static_cast<std::string *>(data)->append((char * ) ptr, size * nmemb);
+        auto data_vec = static_cast<std::vector<char> *>(data);
+        data_vec->insert(data_vec->end(), (char *)ptr, (char *)ptr + size * nmemb);
         return size * nmemb;
     };
     curl_easy_setopt(curl.get(), CURLOPT_WRITEFUNCTION, static_cast<CURLOPT_WRITEFUNCTION_PTR>(write_callback));
-    curl_easy_setopt(curl.get(), CURLOPT_WRITEDATA, &res_str);
+    curl_easy_setopt(curl.get(), CURLOPT_WRITEDATA, &res_buffer);
 #if defined(_WIN32)
     curl_easy_setopt(curl.get(), CURLOPT_SSL_OPTIONS, CURLSSLOPT_NATIVE_CA);
 #endif
-    if (!bearer_token.empty()) {
-        std::string auth_header = "Authorization: Bearer " + bearer_token;
-        http_headers.ptr = curl_slist_append(http_headers.ptr, auth_header.c_str());
+    if (params.timeout > 0) {
+        curl_easy_setopt(curl.get(), CURLOPT_TIMEOUT, params.timeout);
+    }
+    if (params.max_size > 0) {
+        curl_easy_setopt(curl.get(), CURLOPT_MAXFILESIZE, params.max_size);
     }
-    // Important: the User-Agent must be "llama-cpp" to get the "ggufFile" field in the response
     http_headers.ptr = curl_slist_append(http_headers.ptr, "User-Agent: llama-cpp");
-    http_headers.ptr = curl_slist_append(http_headers.ptr, "Accept: application/json");
+    for (const auto & header : params.headers) {
+        http_headers.ptr = curl_slist_append(http_headers.ptr, header.c_str());
+    }
     curl_easy_setopt(curl.get(), CURLOPT_HTTPHEADER, http_headers.ptr);
 
     CURLcode res = curl_easy_perform(curl.get());
 
     if (res != CURLE_OK) {
-        throw std::runtime_error("error: cannot make GET request to HF API");
+        std::string error_msg = curl_easy_strerror(res);
+        throw std::runtime_error("error: cannot make GET request: " + error_msg);
    }
 
     long res_code;
-    std::string ggufFile = "";
-    std::string mmprojFile = "";
     curl_easy_getinfo(curl.get(), CURLINFO_RESPONSE_CODE, &res_code);
+
+    return { res_code, std::move(res_buffer) };
+}
+
+/**
+ * Allow getting the HF file from the HF repo with tag (like ollama), for example:
+ * - bartowski/Llama-3.2-3B-Instruct-GGUF:q4
+ * - bartowski/Llama-3.2-3B-Instruct-GGUF:Q4_K_M
+ * - bartowski/Llama-3.2-3B-Instruct-GGUF:q5_k_s
+ * Tag is optional, default to "latest" (meaning it checks for Q4_K_M first, then Q4, then if not found, return the first GGUF file in repo)
+ *
+ * Return pair of <repo, file> (with "repo" already having tag removed)
+ *
+ * Note: we use the Ollama-compatible HF API, but not using the blobId. Instead, we use the special "ggufFile" field which returns the value for "hf_file". This is done to be backward-compatible with existing cache files.
+ */
+static struct common_hf_file_res common_get_hf_file(const std::string & hf_repo_with_tag, const std::string & bearer_token) {
+    auto parts = string_split<std::string>(hf_repo_with_tag, ':');
+    std::string tag = parts.size() > 1 ? parts.back() : "latest";
+    std::string hf_repo = parts[0];
+    if (string_split<std::string>(hf_repo, '/').size() != 2) {
+        throw std::invalid_argument("error: invalid HF repo format, expected <user>/<model>[:quant]\n");
+    }
+
+    std::string url = get_model_endpoint() + "v2/" + hf_repo + "/manifests/" + tag;
+
+    // headers
+    std::vector<std::string> headers;
+    headers.push_back("Accept: application/json");
+    if (!bearer_token.empty()) {
+        headers.push_back("Authorization: Bearer " + bearer_token);
+    }
+    // Important: the User-Agent must be "llama-cpp" to get the "ggufFile" field in the response
+    // User-Agent header is already set in common_remote_get_content, no need to set it here
+
+    // make the request
+    common_remote_params params;
+    params.headers = headers;
+    auto res = common_remote_get_content(url, params);
+    long res_code = res.first;
+    std::string res_str(res.second.data(), res.second.size());
+    std::string ggufFile;
+    std::string mmprojFile;
+
     if (res_code == 200) {
         // extract ggufFile.rfilename in json, using regex
         {
@@ -613,6 +647,10 @@ static struct common_hf_file_res common_get_hf_file(const std::string & hf_repo_
 
 #else
 
+bool common_has_curl() {
+    return false;
+}
+
 static bool common_download_file_single(const std::string &, const std::string &, const std::string &) {
     LOG_ERR("error: built without CURL, cannot download model from internet\n");
     return false;
@@ -635,17 +673,30 @@ static struct common_hf_file_res common_get_hf_file(const std::string &, const s
     return {};
 }
 
+std::pair<long, std::vector<char>> common_remote_get_content(const std::string & url, const common_remote_params &) {
+    if (!url.empty()) {
+        throw std::runtime_error("error: built without CURL, cannot download model from the internet");
+    }
+
+    return {};
+}
+
 #endif // LLAMA_USE_CURL
 
 //
 // utils
 //
 
-static void common_params_handle_model(
+struct handle_model_result {
+    bool found_mmproj = false;
+    common_params_model mmproj;
+};
+
+static handle_model_result common_params_handle_model(
         struct common_params_model & model,
         const std::string & bearer_token,
-        const std::string & model_path_default,
-        bool is_mmproj = false) { // TODO: move is_mmproj to an enum when we have more files?
+        const std::string & model_path_default) {
+    handle_model_result result;
     // handle pre-fill default model path and url based on hf_repo and hf_file
     {
         if (!model.hf_repo.empty()) {
@@ -657,7 +708,12 @@ static void common_params_handle_model(
                 exit(1); // built without CURL, error message already printed
             }
             model.hf_repo = auto_detected.repo;
-            model.hf_file = is_mmproj ? auto_detected.mmprojFile : auto_detected.ggufFile;
+            model.hf_file = auto_detected.ggufFile;
+            if (!auto_detected.mmprojFile.empty()) {
+                result.found_mmproj = true;
+                result.mmproj.hf_repo = model.hf_repo;
+                result.mmproj.hf_file = auto_detected.mmprojFile;
+            }
         } else {
             model.hf_file = model.path;
         }
@@ -694,6 +750,8 @@ static void common_params_handle_model(
             exit(1);
         }
     }
+
+    return result;
 }
 
 const std::vector<ggml_type> kv_cache_types = {
@@ -827,16 +885,25 @@ static bool common_params_parse_ex(int argc, char ** argv, common_params_context
         throw std::invalid_argument("error: --prompt-cache-all not supported in interactive mode yet\n");
     }
 
-    common_params_handle_model(params.model, params.hf_token, DEFAULT_MODEL_PATH);
-    common_params_handle_model(params.speculative.model, params.hf_token, "");
-    common_params_handle_model(params.vocoder.model, params.hf_token, "");
-
-    // allow --mmproj to be set from -hf
-    // assuming that mmproj is always in the same repo as text model
-    if (!params.model.hf_repo.empty() && ctx_arg.ex == LLAMA_EXAMPLE_LLAVA) {
-        params.mmproj.hf_repo = params.model.hf_repo;
+    // handle model and download
+    {
+        auto res = common_params_handle_model(params.model, params.hf_token, DEFAULT_MODEL_PATH);
+        if (params.no_mmproj) {
+            params.mmproj = {};
+        } else if (res.found_mmproj && params.mmproj.path.empty() && params.mmproj.url.empty()) {
+            // optionally, handle mmproj model when -hf is specified
+            params.mmproj = res.mmproj;
+        }
+        // only download mmproj if the current example is using it
+        for (auto & ex : mmproj_examples) {
+            if (ctx_arg.ex == ex) {
+                common_params_handle_model(params.mmproj, params.hf_token, "");
+                break;
+            }
+        }
+        common_params_handle_model(params.speculative.model, params.hf_token, "");
+        common_params_handle_model(params.vocoder.model, params.hf_token, "");
     }
-    common_params_handle_model(params.mmproj, params.hf_token, "", true);
 
     if (params.escape) {
         string_process_escapes(params.prompt);
@@ -968,28 +1035,25 @@ static void common_params_print_completion(common_params_context & ctx_arg) {
         "llama-embedding",
         "llama-eval-callback",
         "llama-export-lora",
-        "llama-gbnf-validator",
         "llama-gen-docs",
         "llama-gguf",
         "llama-gguf-hash",
         "llama-gguf-split",
         "llama-gritlm",
         "llama-imatrix",
         "llama-infill",
-        "llama-llava-cli",
+        "llama-mtmd-cli",
         "llama-llava-clip-quantize-cli",
         "llama-lookahead",
         "llama-lookup",
         "llama-lookup-create",
         "llama-lookup-merge",
         "llama-lookup-stats",
-        "llama-minicpmv-cli",
         "llama-parallel",
         "llama-passkey",
         "llama-perplexity",
         "llama-q8dot",
         "llama-quantize",
-        "llama-quantize-stats",
         "llama-qwen2vl-cli",
         "llama-retrieval",
         "llama-run",
@@ -2096,18 +2160,32 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
     ).set_examples({LLAMA_EXAMPLE_SERVER}).set_env("LLAMA_ARG_NO_CONT_BATCHING"));
     add_opt(common_arg(
         {"--mmproj"}, "FILE",
-        "path to a multimodal projector file for LLaVA. see examples/llava/README.md",
+        "path to a multimodal projector file. see examples/llava/README.md",
         [](common_params & params, const std::string & value) {
             params.mmproj.path = value;
         }
-    ).set_examples({LLAMA_EXAMPLE_LLAVA}));
+    ).set_examples(mmproj_examples));
     add_opt(common_arg(
         {"--mmproj-url"}, "URL",
-        "URL to a multimodal projector file for LLaVA. see examples/llava/README.md",
+        "URL to a multimodal projector file. see examples/llava/README.md",
         [](common_params & params, const std::string & value) {
             params.mmproj.url = value;
         }
-    ).set_examples({LLAMA_EXAMPLE_LLAVA}));
+    ).set_examples(mmproj_examples));
+    add_opt(common_arg(
+        {"--no-mmproj"},
+        "explicitly disable multimodal projector, useful when using -hf",
+        [](common_params & params) {
+            params.no_mmproj = true;
+        }
+    ).set_examples(mmproj_examples));
+    add_opt(common_arg(
+        {"--no-mmproj-offload"},
+        "do not offload multimodal projector to GPU",
+        [](common_params & params) {
+            params.mmproj_use_gpu = false;
+        }
+    ).set_examples(mmproj_examples));
     add_opt(common_arg(
         {"--image"}, "FILE",
         "path to an image file. use with multimodal models. Specify multiple times for batching",
@@ -2382,6 +2460,7 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
     add_opt(common_arg(
         {"-hf", "-hfr", "--hf-repo"}, "<user>/<model>[:quant]",
         "Hugging Face model repository; quant is optional, case-insensitive, default to Q4_K_M, or falls back to the first file in the repo if Q4_K_M doesn't exist.\n"
+        "mmproj is also downloaded automatically if available. to disable, add --no-mmproj\n"
        "example: unsloth/phi-4-GGUF:q4_k_m\n"
        "(default: unused)",
        [](common_params & params, const std::string & value) {
@@ -2726,7 +2805,7 @@ common_params_context common_params_parser_init(common_params & params, llama_ex
         [](common_params & params, const std::string & value) {
             params.chat_template = value;
         }
-    ).set_examples({LLAMA_EXAMPLE_MAIN, LLAMA_EXAMPLE_SERVER}).set_env("LLAMA_ARG_CHAT_TEMPLATE"));
+    ).set_examples({LLAMA_EXAMPLE_MAIN, LLAMA_EXAMPLE_SERVER, LLAMA_EXAMPLE_LLAVA}).set_env("LLAMA_ARG_CHAT_TEMPLATE"));
     add_opt(common_arg(
         {"--chat-template-file"}, "JINJA_TEMPLATE_FILE",
         string_format(
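
The new `mmproj_examples` list centralizes which tools expose the multimodal-projector options, so `--mmproj`, `--mmproj-url`, `--no-mmproj` and `--no-mmproj-offload` share a single declaration of where they apply. As a sketch of that design choice, any further projector-related flag would opt in the same way (hypothetical option, not part of this commit):

    // Hypothetical flag, shown only to illustrate reuse of mmproj_examples;
    // it would become visible in every mmproj-capable tool (currently LLAMA_EXAMPLE_LLAVA).
    add_opt(common_arg(
        {"--mmproj-dump-info"}, // not a real option
        "print projector metadata after loading",
        [](common_params & params) {
            (void) params; // no real field behind this sketch
        }
    ).set_examples(mmproj_examples));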

common/arg.h

Lines changed: 9 additions & 0 deletions
@@ -78,3 +78,12 @@ bool common_params_parse(int argc, char ** argv, common_params & params, llama_e
 
 // function to be used by test-arg-parser
 common_params_context common_params_parser_init(common_params & params, llama_example ex, void(*print_usage)(int, char **) = nullptr);
+bool common_has_curl();
+
+struct common_remote_params {
+    std::vector<std::string> headers;
+    long timeout = 0; // CURLOPT_TIMEOUT, in seconds ; 0 means no timeout
+    long max_size = 0; // max size of the response ; unlimited if 0 ; max is 2GB
+};
+// get remote file content, returns <http_code, raw_response_body>
+std::pair<long, std::vector<char>> common_remote_get_content(const std::string & url, const common_remote_params & params);
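
A minimal usage sketch for the two new helpers, assuming a caller inside the llama.cpp tree; the URL, header values and limits below are illustrative, not taken from this commit:

    #include "arg.h" // common_has_curl, common_remote_params, common_remote_get_content

    #include <string>

    static bool fetch_json_example() {
        if (!common_has_curl()) {
            return false; // no-CURL builds throw if asked to fetch a non-empty URL
        }
        common_remote_params params;
        params.headers.push_back("Accept: application/json");
        params.timeout  = 30;            // seconds; 0 keeps the default (no timeout)
        params.max_size = 1024 * 1024;   // reject bodies larger than 1 MiB
        auto res = common_remote_get_content("https://example.com/manifest.json", params);
        if (res.first != 200) {
            return false;                // res.first is the HTTP status code
        }
        std::string body(res.second.data(), res.second.size()); // raw response bytes
        return !body.empty();
    }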

common/common.h

Lines changed: 2 additions & 0 deletions
@@ -342,6 +342,8 @@ struct common_params {
 
     // multimodal models (see examples/llava)
     struct common_params_model mmproj;
+    bool mmproj_use_gpu = true; // use GPU for multimodal model
+    bool no_mmproj = false; // explicitly disable multimodal model
     std::vector<std::string> image; // path to image file(s)
 
     // embedding
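
A hypothetical consumer-side check (not code from this commit) showing how the two new fields are meant to be read by a multimodal tool:

    #include "common.h"

    // Decide whether a projector should be loaded at all; placement (GPU vs. CPU)
    // would then follow params.mmproj_use_gpu, which --no-mmproj-offload sets to false.
    static bool should_load_mmproj(const common_params & params) {
        if (params.no_mmproj) {
            return false; // user explicitly disabled the projector (--no-mmproj)
        }
        return !params.mmproj.path.empty() || !params.mmproj.url.empty() || !params.mmproj.hf_repo.empty();
    }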
