```diff
@@ -146,7 +147,7 @@ The project is under active development, and we are [looking for feedback and co
 |`--host HOST`| ip address to listen (default: 127.0.0.1)<br/>(env: LLAMA_ARG_HOST) |
 |`--port PORT`| port to listen (default: 8080)<br/>(env: LLAMA_ARG_PORT) |
 |`--path PATH`| path to serve static files from (default: )<br/>(env: LLAMA_ARG_STATIC_PATH) |
-|`--no-webui`|disable the Web UI<br/>(env: LLAMA_ARG_NO_WEBUI) |
+|`--no-webui`|Disable the Web UI (default: enabled)<br/>(env: LLAMA_ARG_NO_WEBUI) |
 |`--embedding, --embeddings`| restrict to only support embedding use case; use only with dedicated embedding models (default: disabled)<br/>(env: LLAMA_ARG_EMBEDDINGS) |
 |`--reranking, --rerank`| enable reranking endpoint on server (default: disabled)<br/>(env: LLAMA_ARG_RERANKING) |
 |`--api-key KEY`| API key to use for authentication (default: none)<br/>(env: LLAMA_API_KEY) |
@@ -164,13 +165,13 @@ The project is under active development, and we are [looking for feedback and co
 |`--chat-template JINJA_TEMPLATE`| set custom jinja chat template (default: template taken from model's metadata)<br/>if suffix/prefix are specified, template will be disabled<br/>list of built-in templates:<br/>chatglm3, chatglm4, chatml, command-r, deepseek, deepseek2, exaone3, gemma, granite, llama2, llama2-sys, llama2-sys-bos, llama2-sys-strip, llama3, minicpm, mistral-v1, mistral-v3, mistral-v3-tekken, mistral-v7, monarch, openchat, orion, phi3, rwkv-world, vicuna, vicuna-orca, zephyr<br/>(env: LLAMA_ARG_CHAT_TEMPLATE) |
 |`-sps, --slot-prompt-similarity SIMILARITY`| how much the prompt of a request must match the prompt of a slot in order to use that slot (default: 0.50, 0.0 = disabled)<br/> |
 |`--lora-init-without-apply`| load LoRA adapters without applying them (apply later via POST /lora-adapters) (default: disabled) |
-|`--draft-max, --draft, --draft-n N`| number of tokens to draft for speculative decoding (default: 16) |
-|`--draft-min, --draft-n-min N`| minimum number of draft tokens to use for speculative decoding (default: 5) |
-|`--draft-p-min P`| minimum speculative decoding probability (greedy) (default: 0.9) |
-|`-cd, --ctx-size-draft N`| size of the prompt context for the draft model (default: 0, 0 = loaded from model) |
+|`--draft-max, --draft, --draft-n N`| number of tokens to draft for speculative decoding (default: 16)<br/>(env: LLAMA_ARG_DRAFT_MAX) |
+|`--draft-min, --draft-n-min N`| minimum number of draft tokens to use for speculative decoding (default: 5)<br/>(env: LLAMA_ARG_DRAFT_MIN) |
+|`--draft-p-min P`| minimum speculative decoding probability (greedy) (default: 0.9)<br/>(env: LLAMA_ARG_DRAFT_P_MIN) |
+|`-cd, --ctx-size-draft N`| size of the prompt context for the draft model (default: 0, 0 = loaded from model)<br/>(env: LLAMA_ARG_CTX_SIZE_DRAFT) |
 |`-devd, --device-draft <dev1,dev2,..>`| comma-separated list of devices to use for offloading the draft model (none = don't offload)<br/>use --list-devices to see a list of available devices |
-|`-ngld, --gpu-layers-draft, --n-gpu-layers-draft N`| number of layers to store in VRAM for the draft model |
-|`-md, --model-draft FNAME`| draft model for speculative decoding (default: unused) |
+|`-ngld, --gpu-layers-draft, --n-gpu-layers-draft N`| number of layers to store in VRAM for the draft model<br/>(env: LLAMA_ARG_N_GPU_LAYERS_DRAFT) |
+|`-md, --model-draft FNAME`| draft model for speculative decoding (default: unused)<br/>(env: LLAMA_ARG_MODEL_DRAFT) |
```
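With this change, the speculative-decoding options can be configured through the environment as well as on the command line. A minimal sketch of an environment-driven launch, assuming the server binary is `llama-server` and using hypothetical model paths:

```sh
# Hypothetical model paths; the env vars mirror the new table entries above.
LLAMA_ARG_MODEL_DRAFT=models/draft.gguf \
LLAMA_ARG_DRAFT_MAX=16 \
LLAMA_ARG_DRAFT_MIN=5 \
./llama-server -m models/target.gguf
```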
Note: If both a command line argument and an environment variable are set for the same parameter, the command line argument takes precedence over the environment variable.
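For example (a sketch, again assuming a `llama-server` binary and a hypothetical model path):

```sh
# The explicit --port wins over LLAMA_ARG_PORT: the server listens on 8080, not 9000.
LLAMA_ARG_PORT=9000 ./llama-server -m models/target.gguf --port 8080
```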