Prerequisites
Please answer the following questions for yourself before submitting an issue.
- I am running the latest code. Development is very rapid so there are no tagged versions as of now.
- I carefully followed the README.md.
- I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- I reviewed the Discussions, and have a new bug or useful enhancement to share.
Current Behavior
```python
from icecream import ic
from llama_cpp import Llama
from llama_cpp import ChatCompletionMessage

llm = Llama(
    model_path="/opt/models/WizardCoder-Python-34B-V1.0/wizardcoder-python-34b-v1.0.Q4_K_M.gguf",
    n_gpu_layers=-1,
    # n_gpu_layers=0,
)


def print_chat_streaming(output, debug_p=True):
    """
    Process and print out chat completions from a model when stream=True.

    Args:
        output (iterable): The output from the model with stream=True.
        debug_p (bool): If True, dump the last received chunk with icecream.
    """
    for r in output:
        delta = r["choices"][0]['delta']
        if 'role' in delta:
            print(f"\n{delta['role']}: ", end='')
        if 'content' in delta:
            print(f"{delta['content']}", end='')
    print("\n")
    if debug_p:
        ic(r)


output = llm.create_chat_completion(
    messages=[
        ChatCompletionMessage(
            # role="user",
            role="system",
            content=r"""You're a helpful programming assistant who answers the questions the user asks of you concisely and accurately. As you're a senior engineer working at Google with a PhD in distributed systems, you're extremely smart. You take a deep breath before answering the question and solve the question step by step.""",
        ),
        ChatCompletionMessage(
            role="user",
            content=r"""List groups my linux user is in""",
        ),
    ],
    max_tokens=256,
    stop=[],
    temperature=0,
    stream=True,
)
print_chat_streaming(output)
```
```
Llama.generate: prefix-match hit
assistant: To list all the groups that your Linux user belongs to, run the following command:
...
This will display a space-separated list of all the groups that you belong to.
ic| r: {'choices': [{'delta': {'content': ' '}, 'finish_reason': None, 'index': 0}],
'created': 1695045448,
'id': 'chatcmpl-4f8489ed-f56a-4fa9-b42f-cdc753de93b8',
'model': '/opt/models/WizardCoder-Python-34B-V1.0/wizardcoder-python-34b-v1.0.Q4_K_M.gguf',
'object': 'chat.completion.chunk'}
llama_print_timings: load time = 458.52 ms
llama_print_timings: sample time = 42.46 ms / 63 runs ( 0.67 ms per token, 1483.61 tokens per second)
llama_print_timings: prompt eval time = 411.41 ms / 12 tokens ( 34.28 ms per token, 29.17 tokens per second)
llama_print_timings: eval time = 2738.43 ms / 62 runs ( 44.17 ms per token, 22.64 tokens per second)
llama_print_timings: total time = 3349.54 ms
```
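For reference, the streamed chunks shown above can also be accumulated into a single message instead of being printed incrementally. The sketch below assumes only the chunk structure visible in the ic dump (`collect_chat_streaming` is an illustrative helper, not part of llama_cpp):

```python
def collect_chat_streaming(output):
    """Accumulate streamed chat-completion chunks into one message.

    Assumes each chunk looks like the ic dump above:
    {"choices": [{"delta": {...}, "finish_reason": ..., "index": 0}], ...}
    """
    role = None
    parts = []
    finish_reason = None
    for r in output:
        choice = r["choices"][0]
        delta = choice["delta"]
        if "role" in delta:
            role = delta["role"]
        if "content" in delta:
            parts.append(delta["content"])
        finish_reason = choice["finish_reason"]
    return {"role": role, "content": "".join(parts), "finish_reason": finish_reason}
```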
Environment and Context
Please provide detailed information about your computer setup. This is important in case the issue is not reproducible except for under certain specific conditions.
- Physical (or virtual) hardware you are using, e.g. for Linux:
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 45 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 64
On-line CPU(s) list: 0-63
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) Gold 6230 CPU @ 2.10GHz
CPU family: 6
Model: 85
Thread(s) per core: 1
Core(s) per socket: 32
Socket(s): 2
Stepping: 7
BogoMIPS: 4190.15
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon nopl xtopology tsc_reliable nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 invpcid avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat pku ospke avx512_vnni md_clear flush_l1d arch_capabilities
Hypervisor vendor: VMware
Virtualization type: full
L1d cache: 2 MiB (64 instances)
L1i cache: 2 MiB (64 instances)
L2 cache: 64 MiB (64 instances)
L3 cache: 55 MiB (2 instances)
NUMA node(s): 2
NUMA node0 CPU(s): 0-31
NUMA node1 CPU(s): 32-63
Vulnerability Itlb multihit: KVM: Mitigation: VMX unsupported
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Retbleed: Mitigation; Enhanced IBRS
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
- Operating System, e.g. for Linux:
$ uname -a
Linux gpu7 5.15.0-75-generic #82-Ubuntu SMP Tue Jun 6 23:10:23 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
- SDK version, e.g. for Linux:
$ python3 --version
Python 3.10.12
$ make --version
GNU Make 4.4.1
Built for x86_64-pc-linux-gnu
Copyright (C) 1988-2023 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <https://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
$ g++ --version
g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.