this post was submitted on 25 Feb 2024

Self hosted LLM (sh.itjust.works)

submitted 8 months ago by [email protected] to c/[email protected]

Hello internet users. I have tried gpt4all and like it, but it is very slow on my laptop. I was wondering if anyone here knows of any solutions I could run on my server (debian 12, amd cpu, intel a380 gpu) through a web interface. Has anyone found any good way to do this?

[–] [email protected] 23 points 8 months ago* (last edited 8 months ago)

kobold.cpp is easy to use, fast and I like it.

If you're interested in more relevant Lemmy communities:

(another option: text-generation-webui has several backends bundled. Maybe one of those works for you.)

[–] [email protected] 8 points 8 months ago

text-generation-webui is kind of the standard from what I've seen to run it with a webui, but the vram stuff here is accurate. Text LLMs require an insane amount of vram to keep a conversation going.

[–] [email protected] 7 points 8 months ago

Ollama and localai can both be run on a server with no gpu. You'd need to point a different web ui to them if you want though
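To make "pointing a web UI at it" a bit more concrete: a front end only needs to send HTTP requests to the Ollama server. Here is a minimal sketch in Python, assuming Ollama is listening on its default port 11434 on the same machine and that a model called `mistral` has already been pulled. The host, the model name and the helper names `build_request` and `ask` are assumptions for illustration.

```python
import json
import urllib.request

# Assumption: Ollama runs on this machine on its default port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt, model="mistral", url=OLLAMA_URL):
    """Build a non-streaming generate request for Ollama's REST API."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, model="mistral", url=OLLAMA_URL):
    """Send the prompt to the server and return the generated text."""
    with urllib.request.urlopen(build_request(prompt, model, url)) as resp:
        return json.loads(resp.read())["response"]
```

Calling `ask("Why is the sky blue?")` should return the model's answer as a string once the server is up. Replace `localhost` with the server's address if you call it from another machine.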

[–] [email protected] 6 points 8 months ago

There is an easy way with OpenWebUI, but LLMs are mostly accelerated with CUDA or ROCm. CPU inference is slow, but you can try it.

[–] [email protected] 5 points 8 months ago

Ollama is a nice server base; there are lots of projects that plug in on top of it.

[–] [email protected] 4 points 8 months ago (3 children)

I tried Huggingface TGI yesterday, but all of the reasonable models need at least 16 gigs of vram. The only model I got working (on a desktop machine with an AMD 6700 XT gpu) was microsoft phi-2.

[–] [email protected] 4 points 8 months ago (1 children)

Have you been able to use it with your AMD GPU? I have a 6800 and would like to test something

[–] [email protected] 3 points 8 months ago* (last edited 8 months ago)

Yes, since we have similar gpus you could try the following to run it in a docker container on linux, taken from here and slightly modified:

#!/bin/bash

model=microsoft/phi-2
# share a volume with the Docker container to avoid downloading weights every run
volume="<path-to-your-data-directory>/data"

# the two -e variables are specific to this GPU architecture
docker run \
  -e HSA_OVERRIDE_GFX_VERSION=10.3.0 \
  -e PYTORCH_ROCM_ARCH="gfx1031" \
  --device /dev/kfd --device /dev/dri \
  --shm-size 1g \
  -p 8080:80 \
  -v "$volume:/data" \
  ghcr.io/huggingface/text-generation-inference:1.4-rocm \
  --model-id "$model"

Note how the rocm version has a different tag and that you need to mount your gpu device into the container. The two environment variables are specific to my (and maybe yours also) gpu architecture. It will need a while to download though.
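Once the container is up, a quick way to check that it answers is to send a prompt to TGI's `/generate` endpoint. This is a minimal Python sketch, assuming the server is reachable on host port 8080 as mapped by the docker command above. The helper names `build_tgi_request` and `generate` are made up for this example.

```python
import json
import urllib.request

# The docker command maps container port 80 to host port 8080.
TGI_URL = "http://localhost:8080/generate"


def build_tgi_request(prompt, max_new_tokens=50, url=TGI_URL):
    """Build a POST request in the JSON shape TGI's /generate expects."""
    payload = {
        "inputs": prompt,
        "parameters": {"max_new_tokens": max_new_tokens},
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, max_new_tokens=50, url=TGI_URL):
    """Send the prompt and return the generated text."""
    req = build_tgi_request(prompt, max_new_tokens, url)
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["generated_text"]
```

For example, `generate("def fibonacci(n):", max_new_tokens=40)` should come back with a completion once the model has finished loading.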

[–] [email protected] 3 points 8 months ago

Koboldcpp should allow you to run much larger models with a little bit of ram offloading. There's a fork that supports rocm for AMD cards: https://github.com/YellowRoseCx/koboldcpp-rocm

Make sure to use quantized models for the best performance, Q4_K_M being the standard.

[–] [email protected] 2 points 8 months ago (1 children)

I know the gpt4all models run fine on my desktop with 8gig vram. It does use a decent chunk of my normal ram though. Could the gpt4all models work on huggingface or do they use different formats? Sorry if I am completely misunderstanding huggingface, I haven't heard of it until now.

[–] [email protected] 2 points 8 months ago (1 children)

Huggingface TGI is just a piece of software handling the models, like gpt4all. Here is a list of models officially supported by TGI, although they state that you can try different ones as well. You follow the link and look for the files section. The size of the model files (safetensors or pickle binaries) gives a good estimate of how much vram you will need. Sadly this is more than most consumer graphics cards have except for santacoder and microsoft phi.
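The file-size rule of thumb can also be worked out from the parameter count: the weights take roughly (number of parameters) x (bits per weight) / 8 bytes. Below is a rough Python sketch of that arithmetic. The 2 GB headroom for context cache and runtime overhead is my own assumption and not a measured value, so treat the result as a lower bound.

```python
def weights_gb(params_billion, bits_per_weight):
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def fits(params_billion, bits_per_weight, vram_gb, headroom_gb=2.0):
    """Rough check: weights plus an assumed headroom must fit in VRAM."""
    return weights_gb(params_billion, bits_per_weight) + headroom_gb <= vram_gb


# phi-2 has about 2.7 billion parameters; 16-bit weights are 2 bytes each.
print(weights_gb(2.7, 16))   # about 5.4 GB
# A 7B model at 16 bits versus a 4-bit quantization of the same model.
print(weights_gb(7, 16))     # about 14 GB
print(weights_gb(7, 4))      # about 3.5 GB
```

This is why a 7B model in 16-bit form does not fit on a 12 GB card, while a 4-bit quantization of it does with room to spare.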

[–] [email protected] 1 points 8 months ago

I don't really want to try to get that to work. I wonder how hard it would be to create my own webui using gpt4all's Python package.
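A bare-bones version of that idea is probably not much code. Here is a sketch using only Python's standard library for the web part and the `gpt4all` package for generation. The model file name, the port and the 200-token limit are placeholder assumptions, and there is no authentication at all, so it should only run on a trusted network.

```python
import html
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs

PAGE = """<!doctype html>
<title>Local LLM</title>
<form method="post">
  <textarea name="prompt" rows="4" cols="60">{prompt}</textarea><br>
  <button type="submit">Send</button>
</form>
<pre>{answer}</pre>
"""


def render_page(prompt="", answer=""):
    """Fill the template, escaping user and model text for HTML."""
    return PAGE.format(prompt=html.escape(prompt), answer=html.escape(answer))


def make_handler(generate):
    """Create a request handler around any callable: prompt -> text."""

    class Handler(BaseHTTPRequestHandler):
        def _send(self, body):
            data = body.encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "text/html; charset=utf-8")
            self.send_header("Content-Length", str(len(data)))
            self.end_headers()
            self.wfile.write(data)

        def do_GET(self):
            self._send(render_page())

        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            form = parse_qs(self.rfile.read(length).decode("utf-8"))
            prompt = form.get("prompt", [""])[0]
            self._send(render_page(prompt, generate(prompt)))

    return Handler


def main():
    # Imported here so the rest of the file works without gpt4all installed.
    from gpt4all import GPT4All

    # Assumption: example model name; use whichever model you already have.
    model = GPT4All("orca-mini-3b-gguf2-q4_0.gguf")
    handler = make_handler(lambda p: model.generate(p, max_tokens=200))
    # Listens on all interfaces so other machines on the LAN can reach it.
    HTTPServer(("0.0.0.0", 8000), handler).serve_forever()
```

Running `main()` on the server and opening `http://<server-address>:8000/` in a browser gives a single text box. It handles one request at a time and keeps no conversation history, which is the part the existing web UIs do for you.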

[–] [email protected] 4 points 8 months ago

You'll need a good GPU for best results.

[–] [email protected] 4 points 8 months ago

Thanks to this post, and the other comments in here, I've discovered that the ultimate ui for ai-models may well be

https://github.com/ParisNeo/lollms-webui

and on HuggingFace ( that name is awful: to me it is the creepy-horrible FaceHugger, from the movie Alien, that I saw so many decades ago ) TheBloke has some models which are smaller

https://huggingface.co/TheBloke/

so you can choose a model that will actually-work on your hardware.

I think Llama-2 for brainstorming & CodeLlama-instruct for learning programming examples seem to be the cleanest pair, from what I've read, and he's got GGUF versions with different quantizations, so you can choose what will actually-fit on your hardware.

There are other models on huggingface which seem very useful, like

whisper-large-v3 for speech-to-text,
whisperspeech for text-to-speech,
sdxl-turbo for image-making ( for some copyright-free subjects to practice drawing with ), and so-on..

Some models require GPU, not all.

Damn things moved fast!

[–] [email protected] 4 points 8 months ago* (last edited 8 months ago)

I've had pretty good luck running llamafile on my laptop. The speeds aren't super fast, and I can only use the models that are Mistral 7B and smaller, but the results are good enough for casual use and general R and Python code.

Edit: my laptop doesn't have a dedicated GPU, and I don't think llamafile has support for Intel GPUs yet. CPU inference is still pretty quick.

[–] [email protected] 1 points 8 months ago (1 children)

Did you try LM studio?

[–] [email protected] 4 points 8 months ago

It's proprietary.

[–] [email protected] 1 points 8 months ago

OP should try H2OGPT, it is somewhat technical but the UI makes it easy to configure. You can select many models and prompt types, and you can even input your own documents so that the AI uses them to answer