
[User] Implement Streaming LLM - Make inference more efficient #3440

Closed
@errorsandwarnings

Description



The context length limit is an issue for all LLMs. The repository and paper below demonstrate that keeping the first 4 tokens in the KV cache (as "attention sinks"), together with a rolling window of the most recent tokens, enables effectively unlimited context length on most common LLMs without sacrificing performance or efficiency.

Code : https://github.com/mit-han-lab/streaming-llm

The paper referenced in the repo demonstrates the attention-sink effect in LLMs and shows how to take advantage of it.
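
For illustration, here is a minimal Python sketch of the cache-eviction policy the paper describes: always keep the first few tokens (the attention sinks) plus a rolling window of the most recent tokens, and evict everything in between. The function name `evict_kv_cache`, its parameters, and the defaults `n_sink=4` and `window=2048` are assumptions made for this example, not APIs from llama.cpp or the streaming-llm repo:

```python
def evict_kv_cache(cache_positions, n_sink=4, window=2048):
    """Return the token positions to keep in the KV cache.

    cache_positions: absolute token positions currently cached,
    in ascending order. Illustrative sketch only.
    """
    if len(cache_positions) <= n_sink + window:
        return cache_positions            # nothing to evict yet
    sinks = cache_positions[:n_sink]      # always keep the initial "sink" tokens
    recent = cache_positions[-window:]    # keep the most recent window
    return sinks + recent                 # middle tokens are evicted

# Example: after 10,000 generated tokens with a 2,048-token window,
# the cache holds positions 0-3 plus 7952-9999 (2,052 entries total).
kept = evict_kv_cache(list(range(10_000)))
assert kept[:4] == [0, 1, 2, 3]
assert len(kept) == 4 + 2048
```

One detail from the paper worth noting for any real implementation: positional encodings are computed from each token's position within the cache after eviction, not from its original absolute position.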

Current Behavior

There is a limit on context length, defined mostly by pre-training. Other approaches such as RoPE scaling or plain sliding-window attention have their pros and cons, but none of them reaches a longer context length than this approach.
