Commit a77feb5

server : add some missing env variables (#9116)
* server : add some missing env variables
* add LLAMA_ARG_HOST to server dockerfile
* also add LLAMA_ARG_CONT_BATCHING
1 parent 2e59d61 commit a77feb5

7 files changed: 60 additions & 17 deletions

.devops/llama-server-cuda.Dockerfile

Lines changed: 2 additions & 0 deletions
@@ -24,6 +24,8 @@ ENV CUDA_DOCKER_ARCH=${CUDA_DOCKER_ARCH}
 ENV GGML_CUDA=1
 # Enable cURL
 ENV LLAMA_CURL=1
+# Must be set to 0.0.0.0 so it can listen to requests from host machine
+ENV LLAMA_ARG_HOST=0.0.0.0
 
 RUN make -j$(nproc) llama-server

.devops/llama-server-intel.Dockerfile

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -26,6 +26,8 @@ RUN apt-get update && \
2626
COPY --from=build /app/build/bin/llama-server /llama-server
2727

2828
ENV LC_ALL=C.utf8
29+
# Must be set to 0.0.0.0 so it can listen to requests from host machine
30+
ENV LLAMA_ARG_HOST=0.0.0.0
2931

3032
HEALTHCHECK CMD [ "curl", "-f", "http://localhost:8080/health" ]
3133

.devops/llama-server-rocm.Dockerfile

Lines changed: 2 additions & 0 deletions
@@ -39,6 +39,8 @@ ENV GPU_TARGETS=${ROCM_DOCKER_ARCH}
 ENV GGML_HIPBLAS=1
 ENV CC=/opt/rocm/llvm/bin/clang
 ENV CXX=/opt/rocm/llvm/bin/clang++
+# Must be set to 0.0.0.0 so it can listen to requests from host machine
+ENV LLAMA_ARG_HOST=0.0.0.0
 
 # Enable cURL
 ENV LLAMA_CURL=1

.devops/llama-server-vulkan.Dockerfile

Lines changed: 2 additions & 0 deletions
@@ -23,6 +23,8 @@ RUN cp /app/build/bin/llama-server /llama-server && \
     rm -rf /app
 
 ENV LC_ALL=C.utf8
+# Must be set to 0.0.0.0 so it can listen to requests from host machine
+ENV LLAMA_ARG_HOST=0.0.0.0
 
 HEALTHCHECK CMD [ "curl", "-f", "http://localhost:8080/health" ]

.devops/llama-server.Dockerfile

Lines changed: 2 additions & 0 deletions
@@ -21,6 +21,8 @@ RUN apt-get update && \
 COPY --from=build /app/llama-server /llama-server
 
 ENV LC_ALL=C.utf8
+# Must be set to 0.0.0.0 so it can listen to requests from host machine
+ENV LLAMA_ARG_HOST=0.0.0.0
 
 HEALTHCHECK CMD [ "curl", "-f", "http://localhost:8080/health" ]

common/common.cpp

Lines changed: 7 additions & 0 deletions
@@ -327,6 +327,10 @@ bool gpt_params_parse_ex(int argc, char ** argv, gpt_params & params) {
 void gpt_params_parse_from_env(gpt_params & params) {
     // we only care about server-related params for now
     get_env("LLAMA_ARG_MODEL",           params.model);
+    get_env("LLAMA_ARG_MODEL_URL",       params.model_url);
+    get_env("LLAMA_ARG_MODEL_ALIAS",     params.model_alias);
+    get_env("LLAMA_ARG_HF_REPO",         params.hf_repo);
+    get_env("LLAMA_ARG_HF_FILE",         params.hf_file);
     get_env("LLAMA_ARG_THREADS",         params.n_threads);
     get_env("LLAMA_ARG_CTX_SIZE",        params.n_ctx);
     get_env("LLAMA_ARG_N_PARALLEL",      params.n_parallel);
@@ -341,6 +345,9 @@ void gpt_params_parse_from_env(gpt_params & params) {
     get_env("LLAMA_ARG_EMBEDDINGS",      params.embedding);
     get_env("LLAMA_ARG_FLASH_ATTN",      params.flash_attn);
     get_env("LLAMA_ARG_DEFRAG_THOLD",    params.defrag_thold);
+    get_env("LLAMA_ARG_CONT_BATCHING",   params.cont_batching);
+    get_env("LLAMA_ARG_HOST",            params.hostname);
+    get_env("LLAMA_ARG_PORT",            params.port);
 }
 
 bool gpt_params_parse(int argc, char ** argv, gpt_params & params) {
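The `get_env` helper called above is defined elsewhere in `common.cpp` and is not part of this diff. Since it is invoked with `std::string`, `int`, `float`, and `bool` members of `gpt_params`, it presumably resolves to a small set of overloads; the following is a minimal sketch of how such overloads could look (the signatures and bodies are assumptions for illustration, not the committed code):

```cpp
#include <cstdlib>
#include <string>

// Sketch only: each overload copies the environment variable into the target
// field and leaves the field untouched when the variable is unset.
static void get_env(std::string name, std::string & target) {
    char * value = std::getenv(name.c_str());
    target = value ? value : target;
}

static void get_env(std::string name, int & target) {
    char * value = std::getenv(name.c_str());
    target = value ? std::stoi(value) : target;
}

static void get_env(std::string name, float & target) {
    char * value = std::getenv(name.c_str());
    target = value ? std::stof(value) : target;
}

static void get_env(std::string name, bool & target) {
    char * value = std::getenv(name.c_str());
    if (value) {
        std::string val(value);
        target = (val == "1" || val == "true");
    }
}
```

With overloads along these lines, `gpt_params_parse_from_env` only needs the one-line calls added in this hunk, and an unset variable leaves the default chosen during argument parsing intact.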

examples/server/README.md

Lines changed: 43 additions & 17 deletions
@@ -249,23 +249,49 @@ logging:
 
 Available environment variables (if specified, these variables will override parameters specified in arguments):
 
-- `LLAMA_CACHE` (cache directory, used by `--hf-repo`)
-- `HF_TOKEN` (Hugging Face access token, used when accessing a gated model with `--hf-repo`)
-- `LLAMA_ARG_MODEL`
-- `LLAMA_ARG_THREADS`
-- `LLAMA_ARG_CTX_SIZE`
-- `LLAMA_ARG_N_PARALLEL`
-- `LLAMA_ARG_BATCH`
-- `LLAMA_ARG_UBATCH`
-- `LLAMA_ARG_N_GPU_LAYERS`
-- `LLAMA_ARG_THREADS_HTTP`
-- `LLAMA_ARG_CHAT_TEMPLATE`
-- `LLAMA_ARG_N_PREDICT`
-- `LLAMA_ARG_ENDPOINT_METRICS`
-- `LLAMA_ARG_ENDPOINT_SLOTS`
-- `LLAMA_ARG_EMBEDDINGS`
-- `LLAMA_ARG_FLASH_ATTN`
-- `LLAMA_ARG_DEFRAG_THOLD`
+- `LLAMA_CACHE`: cache directory, used by `--hf-repo`
+- `HF_TOKEN`: Hugging Face access token, used when accessing a gated model with `--hf-repo`
+- `LLAMA_ARG_MODEL`: equivalent to `-m`
+- `LLAMA_ARG_MODEL_URL`: equivalent to `-mu`
+- `LLAMA_ARG_MODEL_ALIAS`: equivalent to `-a`
+- `LLAMA_ARG_HF_REPO`: equivalent to `--hf-repo`
+- `LLAMA_ARG_HF_FILE`: equivalent to `--hf-file`
+- `LLAMA_ARG_THREADS`: equivalent to `-t`
+- `LLAMA_ARG_CTX_SIZE`: equivalent to `-c`
+- `LLAMA_ARG_N_PARALLEL`: equivalent to `-np`
+- `LLAMA_ARG_BATCH`: equivalent to `-b`
+- `LLAMA_ARG_UBATCH`: equivalent to `-ub`
+- `LLAMA_ARG_N_GPU_LAYERS`: equivalent to `-ngl`
+- `LLAMA_ARG_THREADS_HTTP`: equivalent to `--threads-http`
+- `LLAMA_ARG_CHAT_TEMPLATE`: equivalent to `--chat-template`
+- `LLAMA_ARG_N_PREDICT`: equivalent to `-n`
+- `LLAMA_ARG_ENDPOINT_METRICS`: if set to `1`, it will enable metrics endpoint (equivalent to `--metrics`)
+- `LLAMA_ARG_ENDPOINT_SLOTS`: if set to `0`, it will **disable** slots endpoint (equivalent to `--no-slots`). This feature is enabled by default.
+- `LLAMA_ARG_EMBEDDINGS`: if set to `1`, it will enable embeddings endpoint (equivalent to `--embeddings`)
+- `LLAMA_ARG_FLASH_ATTN`: if set to `1`, it will enable flash attention (equivalent to `-fa`)
+- `LLAMA_ARG_CONT_BATCHING`: if set to `0`, it will **disable** continuous batching (equivalent to `--no-cont-batching`). This feature is enabled by default.
+- `LLAMA_ARG_DEFRAG_THOLD`: equivalent to `-dt`
+- `LLAMA_ARG_HOST`: equivalent to `--host`
+- `LLAMA_ARG_PORT`: equivalent to `--port`
+
+Example usage of docker compose with environment variables:
+
+```yml
+services:
+  llamacpp-server:
+    image: ghcr.io/ggerganov/llama.cpp:server
+    ports:
+      - 8080:8080
+    volumes:
+      - ./models:/models
+    environment:
+      # alternatively, you can use "LLAMA_ARG_MODEL_URL" to download the model
+      LLAMA_ARG_MODEL: /models/my_model.gguf
+      LLAMA_ARG_CTX_SIZE: 4096
+      LLAMA_ARG_N_PARALLEL: 2
+      LLAMA_ARG_ENDPOINT_METRICS: 1 # to disable, either remove or set to 0
+      LLAMA_ARG_PORT: 8080
+```
 
 ## Build
