UPDATE: I’ve found a message on Discord saying that Alpaca doesn’t expose an API, which is very likely what I need to make its ollama instance talk to other applications, right?
Therefore, I guess that installing ollama through Homebrew would better fit my needs, apart from the fact that if I try something like
```
brew install ollama
brew services start ollama
ollama run llama3.2
```
it will run using my CPU instead of taking advantage of the Nvidia GPU as Alpaca was doing… am I missing a step?
Niko
January 9, 2025, 8:35pm
I think you’re 100% right on all of the above (Alpaca doesn’t expose the API, and installing ollama with brew makes sense), BUT I’m not so sure ollama won’t automatically use your GPU, can you check that it is not? See the docs on GPU Discovery. Not sure if this applies to brew etc., but I think it should?
If it is not, it looks like you can force it to use your GPU with this? ollama/docs/gpu.md at 8bccae4f92bced8222efc04f0b25573df450bb89 · ollama/ollama · GitHub
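A quick way to sanity-check it (rough sketch, assuming nvidia-smi is installed and the brew-installed ollama is on your PATH):
```
# Watch GPU memory/utilization in one terminal while a model is loaded
watch -n 1 nvidia-smi

# In another terminal, load a model and ask ollama where it is running;
# recent ollama versions show a PROCESSOR column such as "100% GPU" or "100% CPU"
ollama run llama3.2 "hello"
ollama ps
```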
Yeah, sorry, my previous message was ambiguous: I meant to say that I guessed ollama was a better fit, so I did try installing it with brew and I did encounter an issue with GPU detection.
The GPU is an NVIDIA 4060 (laptop), configured and recognized by the system, as proven by its presence in the System Monitor and by the fact that Alpaca makes use of it.
However, even if I set the CUDA_VISIBLE_DEVICES env variable as suggested in the link you attached, the ollama installed by brew ignores it.
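For reference, what I tried was roughly this (just a sketch from memory, and the GPU index is a guess):
```
# as suggested in ollama/docs/gpu.md
export CUDA_VISIBLE_DEVICES=0
brew services restart ollama
ollama run llama3.2   # still ends up on the CPU
```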
Did you (or anyone else) get ollama from brew working with an nvidia GPU?
klmcw
January 10, 2025, 12:17am
I am not using ollama so verify what I am about to share.
I did a project with pytorch and CUDA support recently. I have an RTX 3050 Ti Mobile.
Here are 2 suggestions:
standard nvidia env vars (I ended up not needing these):
```
export __NV_PRIME_RENDER_OFFLOAD=1
export __GLX_VENDOR_LIBRARY_NAME=nvidia
```
Put this in a .desktop file for the app in ~/.local/share/applications (re-login required) - this did the trick for me in most apps like kitty, vscode, etc.:
```
PrefersNonDefaultGPU=true
```
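For example, a minimal override for kitty could look something like this (the filename, Name and Exec lines are just placeholders, adjust them for whatever app you want on the dGPU):
```
# write a launcher override that prefers the NVIDIA dGPU (re-login afterwards)
mkdir -p ~/.local/share/applications
cat > ~/.local/share/applications/kitty.desktop <<'EOF'
[Desktop Entry]
Type=Application
Name=kitty
Exec=kitty
PrefersNonDefaultGPU=true
EOF
```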
Niko
January 10, 2025, 5:36am
Gotcha! I’m very sorry - I don’t know!
The problem is that ollama is not recognizing the nvidia GPU even when run from the command line, not only when used in external IDEs, so I fear the problem is a different one.
Niko
January 11, 2025, 1:43am
Gah, unfortunately we may be out of luck with brew and ollama, judging by the mac folks’ experience. Maybe try the direct installer in this case and fiddle with the path and whatnot till it works. Sorry! I don’t have a strong GPU, so I never worried about it running on the CPU.
https://www.reddit.com/r/ollama/comments/1h7grjl/m3_macbook_pro_18gb_not_using_gpu/
j0rge
January 11, 2025, 2:02am
I’d just grab the old service unit and toss it into /etc/containers/systemd:
Sorry, but I may need a bit more detailed instructions about that. You mean I should copy just the part below [Service] and paste it into /etc/containers/systemd? And then?
Also, I see the linked script mentions containers, so I thought: can’t I just install ollama inside a docker or podman container?
I’ve seen in the ollama docs that it can also be installed inside a container as long as the Nvidia Container Toolkit is installed in the system. Does Bluefin have it included? Because if so, I might try that way as well…
j0rge
January 11, 2025, 3:06pm
Yep! Then you start/enable it like any other service unit, systemctl start ollama or whatever you call it. This is if you want to run it as a service on your machine, which is useful if you want a centralized ollama instance and then connect a bunch of apps to it, so you manage the LLM in one place.
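If it helps, here’s a stripped-down sketch of what such a quadlet could look like (the image tag and the GPU passthrough line are my assumptions here, the actual unit may differ a bit):
```
# drop a quadlet into /etc/containers/systemd and reload systemd
sudo tee /etc/containers/systemd/ollama.container >/dev/null <<'EOF'
[Unit]
Description=Ollama API server

[Container]
ContainerName=ollama
Image=docker.io/ollama/ollama:latest
PublishPort=11434:11434
Volume=ollama:/root/.ollama
# NVIDIA GPU via CDI, needs nvidia-container-toolkit on the host
AddDevice=nvidia.com/gpu=all

[Service]
Restart=always

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl start ollama
```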
can’t I just install ollama inside a docker or podman container?
Yeah, this is a systemd service unit that will handle that for you; it uses the ollama/ollama container from Docker Hub. You can also install it manually if you follow their instructions and run it that way: https://hub.docker.com/r/ollama/ollama
Pretty sure the nvidia image has everything it needs but if it’s missing anything we can add that.
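For the manual route, their docker hub page boils down to roughly this for NVIDIA (assuming the container toolkit is already configured):
```
# start an ollama container with access to all GPUs, exposing the API on 11434
docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
# then run a model inside it
docker exec -it ollama ollama run llama3.2
```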
For those with a 780M iGPU, I can confirm that GPU usage does work when running ollama in Docker using the compose file found in the discussion below. Not sure if this is worth documenting.
From that GitHub discussion (alexhegit:main → ollama:main):
> Hi. I'm interested to try this in my Minisforum UM790 Pro with AMD Ryzen 9 7940HS w/ Radeon 780M Graphics. Currently, I've allocated 16GB VRAM for my iGPU. Please let me know if it will work in my system.

Hi. I have a Minisforum UM790 Pro with AMD Ryzen 9 7940HS w/ Radeon 780M Graphics and 96 GB RAM, and I think it works.
We have 10 TOPS (based on the NPU).
Setup steps:
0/ bios: Frame buffer: Auto
1/ Install ubuntu 24.04
2/ Install ROCm
3/ Install Docker
// create docker-compose.yaml
```yaml
services:
  ollama:
    image: ollama/ollama:rocm
    container_name: ollama
    restart: unless-stopped
    devices:
      - "/dev/kfd"
      - "/dev/dri"
    volumes:
      - ollama_models:/root/.ollama
    environment:
      - HSA_OVERRIDE_GFX_VERSION=11.0.0
    ports:
      - "11434:11434"
    dns:
      - "8.8.8.8"
volumes:
  ollama_models:
```
4/ `docker compose up -d`
5/ `docker exec -it ollama ollama run llama3:8b`
6/ `/set verbose`
7/ input: `where was beethoven born?`
I receive the following llama3:8b stats:
```
total duration: 6.959157294s
load duration: 11.15427ms
prompt eval count: 16 token(s)
prompt eval duration: 568.396ms
prompt eval rate: 28.15 tokens/s
eval count: 76 token(s)
eval duration: 6.337827s
eval rate: 11.99 tokens/s
```
And the following docker ollama output:
<details>
<summary>See ollama docker logs</summary>
```
2024/10/30 21:16:07 routes.go:1158: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION:11.0.0 HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/root/.ollama/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
time=2024-10-30T21:16:07.556Z level=INFO source=images.go:754 msg="total blobs: 5"
time=2024-10-30T21:16:07.556Z level=INFO source=images.go:761 msg="total unused blobs removed: 0"
time=2024-10-30T21:16:07.556Z level=INFO source=routes.go:1205 msg="Listening on [::]:11434 (version 0.3.14)"
time=2024-10-30T21:16:07.556Z level=INFO source=common.go:49 msg="Dynamic LLM libraries" runners="[cpu cpu_avx cpu_avx2 rocm_v60102]"
time=2024-10-30T21:16:07.556Z level=INFO source=gpu.go:221 msg="looking for compatible GPUs"
time=2024-10-30T21:16:07.559Z level=INFO source=amd_linux.go:386 msg="skipping rocm gfx compatibility check" HSA_OVERRIDE_GFX_VERSION=11.0.0
time=2024-10-30T21:16:07.563Z level=INFO source=types.go:123 msg="inference compute" id=0 library=rocm variant="" compute=gfx1103 driver=6.8 name=1002:15bf total="2.0 GiB" available="1.9 GiB"
[GIN] 2024/10/30 - 21:16:27 | 200 | 33.694µs | 127.0.0.1 | HEAD "/"
[GIN] 2024/10/30 - 21:16:27 | 200 | 10.974227ms | 127.0.0.1 | POST "/api/show"
time=2024-10-30T21:16:27.372Z level=INFO source=server.go:105 msg="system memory" total="92.1 GiB" free="89.7 GiB" free_swap="8.0 GiB"
time=2024-10-30T21:16:27.372Z level=INFO source=memory.go:326 msg="offload to rocm" layers.requested=-1 layers.model=33 layers.offload=5 layers.split="" memory.available="[1.9 GiB]" memory.gpu_overhead="0 B" memory.required.full="5.5 GiB" memory.required.partial="1.8 GiB" memory.required.kv="256.0 MiB" memory.required.allocations="[1.8 GiB]" memory.weights.total="3.9 GiB" memory.weights.repeating="3.5 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="258.5 MiB" memory.graph.partial="677.5 MiB"
time=2024-10-30T21:16:27.372Z level=INFO source=server.go:388 msg="starting llama server" cmd="/usr/lib/ollama/runners/rocm_v60102/ollama_llama_server --model /root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa --ctx-size 2048 --batch-size 512 --embedding --n-gpu-layers 5 --threads 8 --parallel 1 --port 42097"
time=2024-10-30T21:16:27.372Z level=INFO source=sched.go:449 msg="loaded runners" count=1
time=2024-10-30T21:16:27.372Z level=INFO source=server.go:587 msg="waiting for llama runner to start responding"
time=2024-10-30T21:16:27.373Z level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server error"
INFO [main] starting c++ runner | tid="123887649047360" timestamp=1730322987
INFO [main] build info | build=10 commit="a04710e" tid="123887649047360" timestamp=1730322987
INFO [main] system info | n_threads=8 n_threads_batch=8 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="123887649047360" timestamp=1730322987 total_threads=16
INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="15" port="42097" tid="123887649047360" timestamp=1730322987
llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from /root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = Meta-Llama-3-8B-Instruct
llama_model_loader: - kv 2: llama.block_count u32 = 32
llama_model_loader: - kv 3: llama.context_length u32 = 8192
llama_model_loader: - kv 4: llama.embedding_length u32 = 4096
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336
llama_model_loader: - kv 6: llama.attention.head_count u32 = 32
llama_model_loader: - kv 7: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 8: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: general.file_type u32 = 2
llama_model_loader: - kv 11: llama.vocab_size u32 = 128256
llama_model_loader: - kv 12: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 13: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 14: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 15: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 16: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 17: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 18: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 19: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 20: tokenizer.chat_template str = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv 21: general.quantization_version u32 = 2
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type q4_0: 225 tensors
llama_model_loader: - type q6_K: 1 tensors
time=2024-10-30T21:16:27.624Z level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server loading model"
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.8000 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 8192
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 8192
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 8B
llm_load_print_meta: model ftype = Q4_0
llm_load_print_meta: model params = 8.03 B
llm_load_print_meta: model size = 4.33 GiB (4.64 BPW)
llm_load_print_meta: general.name = Meta-Llama-3-8B-Instruct
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: EOG token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
/opt/amdgpu/share/libdrm/amdgpu.ids: No such file or directory
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon Graphics, compute capability 11.0, VMM: no
llm_load_tensors: ggml ctx size = 0.27 MiB
llm_load_tensors: offloading 5 repeating layers to GPU
llm_load_tensors: offloaded 5/33 layers to GPU
llm_load_tensors: ROCm0 buffer size = 585.16 MiB
llm_load_tensors: CPU buffer size = 4437.80 MiB
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: ROCm0 KV buffer size = 40.00 MiB
llama_kv_cache_init: ROCm_Host KV buffer size = 216.00 MiB
llama_new_context_with_model: KV self size = 256.00 MiB, K (f16): 128.00 MiB, V (f16): 128.00 MiB
llama_new_context_with_model: ROCm_Host output buffer size = 0.50 MiB
llama_new_context_with_model: ROCm0 compute buffer size = 677.48 MiB
llama_new_context_with_model: ROCm_Host compute buffer size = 12.01 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 301
INFO [main] model loaded | tid="123887649047360" timestamp=1730322989
time=2024-10-30T21:16:29.632Z level=INFO source=server.go:626 msg="llama runner started in 2.26 seconds"
[GIN] 2024/10/30 - 21:16:29 | 200 | 2.292187393s | 127.0.0.1 | POST "/api/generate"
[GIN] 2024/10/30 - 21:16:53 | 200 | 6.959235313s | 127.0.0.1 | POST "/api/chat"
```
</details>
I don't know whether this is the optimal setup (BIOS, system, docker-compose and ollama) or not; I think not. But looking at the log, it seems to work.
radeontop output at text generation time:
<details>
<summary>See radeontop output</summary>
```
radeontop unknown, running on UNKNOWN_CHIP bus c4, 120 samples/sec
Graphics pipe 11.67% x
Event Engine 0.00% x
Vertex Grouper + Tesselator 0.00% x
Texture Addresser 0.00% x
Texture Cache 0.00% x
Shader Export 0.00% x
Sequencer Instruction Cache 0.00% x
Shader Interpolator 10.00% x
Shader Memory Exchange 0.00% x
Scan Converter 0.00% x
Primitive Assembly 0.00% x
Depth Block 0.00% x
Color Block 0.00% x
Clip Rectangle 12.50% x
76M / 1971M VRAM 3.88% x
1889M / 47148M GTT 4.01% x
2.80G / 2.80G Memory Clock 100.00% x
```
Unknown Radeon card. <= R500 won't work, new cards might.
</details>
j0rge:
You can also install it manually if you follow their instructions and run it that way: https://hub.docker.com/r/ollama/ollama
Pretty sure the nvidia image has everything it needs but if it’s missing anything we can add that.
Tried that and it works like a charm!
Following the instructions on docker hub, I ran these two commands first; no prior toolkit installation was needed (not sure if they are necessary, but I ran them just in case):
```
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
```
and then spun up an ollama container using the following compose file:
```yaml
---
services:
  ollama:
    image: ollama/ollama
    container_name: ollama
    restart: unless-stopped
    ports:
      - 11434:11434
    volumes:
      - ./ollama_v:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - capabilities:
                - gpu
```
It exposes the ollama API on localhost:11434 (or any other port mapped instead of that when spinning up the container), and with that I can connect it both to the Alpaca GUI (by specifying it as a “remote instance” in the settings) and to Zed and JetBrains IDEs.
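A quick way to check that the API is reachable (standard ollama endpoints):
```
# list the models the instance knows about
curl http://localhost:11434/api/tags

# one-shot generation against a pulled model (model name is just an example)
curl http://localhost:11434/api/generate -d '{"model": "llama3.2", "prompt": "Why is the sky blue?", "stream": false}'
```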
I don’t know how common my use case is, but if it’s something more people encounter, it could be worth adding Docker (instead of Homebrew) as a “more advanced” option in the Bluefin docs?
j0rge
January 12, 2025, 1:58pm
Yeah if you wouldn’t mind PRing it on the AI page that’d be sweet! I can take a look later today. Those last five lines are ridiculous, lol.
I’m kinda new to PRing big projects, but I should have done it here. Let me know if there’s something wrong.
Niko
January 13, 2025, 4:58pm
Wow! Really excellent documentation @shaked_coffee, thank you!
system
Closed
April 13, 2025, 4:58pm
This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.