LocalLLaMA

2249 readers

1 users here now

Community to discuss about LLaMA, the large language model created by Meta AI.

This is intended to be a replacement for r/LocalLLaMA on Reddit.

founded 1 year ago

MODERATORS

[email protected]

Vicuna-33B-1-3-SuperHOT-8K-GPTQ (huggingface.co)

submitted 1 year ago by [email protected] to c/[email protected]

9 comments fedilink hide all child comments

you are viewing a single comment's thread
view the rest of the comments

[–] [email protected] 2 points 1 year ago (8 children)

Has anyone gotten 8K context with a 33B model on a 4090?

[–] [email protected] 3 points 1 year ago (2 children)

Unfortunately there's just no way. KV cache size scales with the square of context length, so at 8k it's 16 times larger than at 2k, for 33b that's over 20GB for the cache alone, without weights or other buffers.

[–] [email protected] 1 points 1 year ago (1 children)

Wow that’s crazy. Would it be possible to offload the KV cache onto system RAM and keep model weights in VRAM or would that just slow everything down too much? I guess that’s kind of what llama.cpp does with GPU offload of layers? I’m still trying to figure out how this stuff actually works.

[–] [email protected] 2 points 1 year ago

That's what llama.cpp and kobold.cpp do, the KV cache is the last thing that gets offloaded so you can offload weights and keep the cache in RAM. Although neither support SuperHOT right now.

MQA models like Falcon-40B or MPT are going to be better for large context lengths. They have a tiny KV cache so even blown up 16x it's not a problem.

load more comments (5 replies)