examples/main/README.md: 4 additions & 0 deletions
@@ -274,6 +274,10 @@ These options help improve the performance and memory usage of the LLaMA models.
- `--no-mmap`: Do not memory-map the model. By default, models are mapped into memory, which allows the system to load only the necessary parts of the model as needed. However, if the model is larger than your total amount of RAM or if your system is low on available memory, using mmap might increase the risk of pageouts, negatively impacting performance. Disabling mmap results in slower load times but may reduce pageouts if you're not using `--mlock`. Note that if the model is larger than the total amount of RAM, turning off mmap would prevent the model from loading at all.
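
For context, here is a minimal, hypothetical C sketch of what memory-mapping a model file means in practice: the file is mapped read-only and pages are faulted in only when touched. This is not llama.cpp's actual loading code, and the file name is a placeholder.

```c
/* Hypothetical sketch only: how memory-mapping a model file behaves.
 * Pages are faulted in lazily when touched; nothing is read up front. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv) {
    if (argc < 2) { fprintf(stderr, "usage: %s model.gguf\n", argv[0]); return 1; }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }

    /* Map the whole file read-only; the kernel loads pages on first access. */
    void *data = mmap(NULL, (size_t) st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (data == MAP_FAILED) { perror("mmap"); return 1; }

    /* Touching one byte pulls in only the page(s) around it, not the whole model. */
    printf("first byte: 0x%02x\n", ((unsigned char *) data)[0]);

    munmap(data, (size_t) st.st_size);
    close(fd);
    return 0;
}
```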

+### Direct I/O
+
+- `--direct-io`: Use direct I/O. Potentially faster uncached loading, fewer pageouts, no page cache pollution. You may benefit from this option if you load a model for the first time (or after some time), load several different models consecutively, or simply want to keep the page cache clean. The faster your storage device is, the greater the gain you can expect. The effect may be greater on Linux due to Transparent HugePage support.
+
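
To make the "no page cache pollution" point concrete, here is a minimal, hypothetical Linux-only C sketch of an `O_DIRECT` read loop. It is not llama.cpp's actual implementation; the alignment, chunk size, and file name are illustrative assumptions.

```c
/* Hypothetical sketch only (Linux): a bulk read with O_DIRECT bypasses the
 * kernel page cache, so loading a large model once does not evict other
 * cached data. O_DIRECT needs the buffer, offset, and length to be aligned;
 * 4096 bytes is a safe alignment on most filesystems. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define ALIGN 4096
#define CHUNK (1 << 20)   /* 1 MiB per read, a multiple of ALIGN */

int main(int argc, char **argv) {
    if (argc < 2) { fprintf(stderr, "usage: %s model.gguf\n", argv[0]); return 1; }

    int fd = open(argv[1], O_RDONLY | O_DIRECT);
    if (fd < 0) { perror("open(O_DIRECT)"); return 1; }

    void *buf = NULL;
    if (posix_memalign(&buf, ALIGN, CHUNK) != 0) { fprintf(stderr, "posix_memalign failed\n"); return 1; }

    long long total = 0;
    ssize_t n;
    /* Data goes straight from the device into our buffer; the page cache is untouched. */
    while ((n = read(fd, buf, CHUNK)) > 0) {
        total += n;
    }
    if (n < 0) perror("read");

    printf("read %lld bytes with O_DIRECT\n", total);
    free(buf);
    close(fd);
    return 0;
}
```
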
### NUMA support
- `--numa distribute`: Pin an equal proportion of the threads to the cores on each NUMA node. This will spread the load amongst all cores on the system, utilizing all memory channels at the expense of potentially requiring memory to travel over the slow links between nodes.
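
As a rough, hypothetical illustration of what distributing threads across NUMA nodes involves, the sketch below round-robins worker threads over the configured nodes using libnuma (compile with `-pthread -lnuma`). The thread count and policy are assumptions; this is not llama.cpp's scheduler.

```c
/* Hypothetical sketch only: spread worker threads round-robin over the
 * configured NUMA nodes with libnuma. It illustrates the idea behind
 * "--numa distribute" rather than reproducing llama.cpp's behavior. */
#include <numa.h>
#include <pthread.h>
#include <stdio.h>

#define N_THREADS 8

static void *worker(void *arg) {
    int node = (int) (long) arg;
    numa_run_on_node(node);     /* pin this thread's execution to one NUMA node */
    numa_set_preferred(node);   /* prefer memory allocations from the same node */
    printf("worker running on node %d\n", node);
    return NULL;
}

int main(void) {
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not available on this system\n");
        return 1;
    }
    int n_nodes = numa_num_configured_nodes();

    pthread_t th[N_THREADS];
    for (long i = 0; i < N_THREADS; i++) {
        pthread_create(&th[i], NULL, worker, (void *) (i % n_nodes));
    }
    for (int i = 0; i < N_THREADS; i++) {
        pthread_join(th[i], NULL);
    }
    return 0;
}
```
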
|`--mlock`| force system to keep model in RAM rather than swapping or compressing |
|`--no-mmap`| do not memory-map model (slower load but may reduce pageouts if not using mlock) |
+|`--direct-io`| use direct I/O (potentially faster uncached loading, fewer pageouts, no page cache pollution) |
|`--numa TYPE`| attempt optimizations that help on some NUMA systems<br/>- distribute: spread execution evenly over all nodes<br/>- isolate: only spawn threads on CPUs on the node that execution started on<br/>- numactl: use the CPU map provided by numactl<br/>if run without this previously, it is recommended to drop the system page cache before using this<br/>see https://github.com/ggerganov/llama.cpp/issues/1437|
|`-ngl, --gpu-layers N`| number of layers to store in VRAM<br/>(env: LLAMA_ARG_N_GPU_LAYERS) |
|`-sm, --split-mode {none,layer,row}`| how to split the model across multiple GPUs, one of:<br/>- none: use one GPU only<br/>- layer (default): split layers and KV across GPUs<br/>- row: split rows across GPUs |