UPDATE: I’ve found a message on Discord saying that Alpaca doesn’t expose an API, which is very likely what I need to make its ollama instance talk to other applications, right?
Therefore, I guess that installing ollama through Homebrew would better fit my needs, apart from the fact that if I try something like
```
brew install ollama
brew services start ollama
ollama run llama3.2
```
it will run using my CPU instead of taking advantage of the Nvidia GPU as Alpaca was doing… am I missing a step?
Niko
January 9, 2025, 8:35pm
I think you’re 100% right on all of the above (Alpaca doesn’t expose the API, and installing ollama with brew makes sense), BUT I’m not so sure ollama won’t automatically use your GPU, can you check that it is not? See the docs on GPU Discovery. Not sure if this applies to brew etc., but I think it should?
If it is not, it looks like you can force it to use your GPU with this? ollama/docs/gpu.md at 8bccae4f92bced8222efc04f0b25573df450bb89 · ollama/ollama · GitHub
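A quick way to sanity-check it (rough sketch, assuming nvidia-smi is installed and the brew-installed ollama is on your PATH):
```
# Watch GPU memory/utilization in one terminal while a model is loaded
watch -n 1 nvidia-smi

# In another terminal, load a model and ask ollama where it is running;
# recent ollama versions show a PROCESSOR column such as "100% GPU" or "100% CPU"
ollama run llama3.2 "hello"
ollama ps
```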
Yeah, sorry, my previous message was ambiguous: I meant to say that I guessed ollama was a better fit, so I did try installing it with brew and I did encounter an issue with GPU detection.
The GPU is an NVIDIA 4060 (laptop), configured and recognized by the system, as proven by its presence in the System Monitor and by the fact that Alpaca makes use of it.
However, even if I set the CUDA_VISIBLE_DEVICES env variable as suggested in the link you attached, the ollama installed by brew ignores it.
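For reference, what I tried was roughly this (just a sketch from memory, and the GPU index is a guess):
```
# as suggested in ollama/docs/gpu.md
export CUDA_VISIBLE_DEVICES=0
brew services restart ollama
ollama run llama3.2   # still ends up on the CPU
```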
Did you (or anyone else) get ollama from brew working with an nvidia GPU?
klmcw
January 10, 2025, 12:17am
I am not using ollama so verify what I am about to share.
I did a project with pytorch and CUDA support recently. I have an RTX 3050 Ti Mobile.
Here are 2 suggestions:
standard nvidia env vars (I ended up not needing these):
```
export __NV_PRIME_RENDER_OFFLOAD=1
export __GLX_VENDOR_LIBRARY_NAME=nvidia
```
Put this in a .desktop file for the app in ~/.local/share/applications (re-login required) - this did the trick for me in most apps like kitty, vscode, etc.:
```
PrefersNonDefaultGPU=true
```
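For example, a minimal override for kitty could look something like this (the filename, Name and Exec lines are just placeholders, adjust them for whatever app you want on the dGPU):
```
# write a launcher override that prefers the NVIDIA dGPU (re-login afterwards)
mkdir -p ~/.local/share/applications
cat > ~/.local/share/applications/kitty.desktop <<'EOF'
[Desktop Entry]
Type=Application
Name=kitty
Exec=kitty
PrefersNonDefaultGPU=true
EOF
```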
Niko
January 10, 2025, 5:36am
Gotcha! I’m very sorry - I don’t know!
The problem is that ollama is not recognizing the nvidia GPU even when run from the command line, not only when used in external IDEs, so I fear the problem is a different one.
Niko
January 11, 2025, 1:43am
Gah, unfortunately we may be out of luck with brew and ollama, judging by the mac folks’ experience. Maybe try the direct installer in this case and fiddle with the path and whatnot till it works. Sorry! I don’t have a strong GPU, so I never worried about it running on the CPU.
https://www.reddit.com/r/ollama/comments/1h7grjl/m3_macbook_pro_18gb_not_using_gpu/
j0rge
January 11, 2025, 2:02am
I’d just grab the old service unit and toss it into /etc/containers/systemd:
Sorry, but I may need a bit more detailed instructions about that. You mean I should copy just the part below [Service] and paste it into /etc/containers/systemd? And then?
Also, I see the linked script mentions containers, so I thought: can’t I just install ollama inside a docker or podman container?
I’ve seen in the ollama docs that it can also be installed inside a container as long as the Nvidia Container Toolkit is installed in the system. Does Bluefin have it included? Because if so, I might try that way as well…
j0rge
January 11, 2025, 3:06pm
Yep! Then you start/enable it like any other service unit, systemctl start ollama or whatever you call it. This is if you want to run it as a service on your machine, which is useful if you want a centralized ollama instance and then connect a bunch of apps to it, so you manage the LLM in one place.
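If it helps, here’s a stripped-down sketch of what such a quadlet could look like (the image tag and the GPU passthrough line are my assumptions here, the actual unit may differ a bit):
```
# drop a quadlet into /etc/containers/systemd and reload systemd
sudo tee /etc/containers/systemd/ollama.container >/dev/null <<'EOF'
[Unit]
Description=Ollama API server

[Container]
ContainerName=ollama
Image=docker.io/ollama/ollama:latest
PublishPort=11434:11434
Volume=ollama:/root/.ollama
# NVIDIA GPU via CDI, needs nvidia-container-toolkit on the host
AddDevice=nvidia.com/gpu=all

[Service]
Restart=always

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl start ollama
```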
can’t I just install ollama inside a docker or podman container?
Yeah, this is a systemd service unit that will handle that for you; it uses the ollama/ollama container from Docker Hub. You can also install it manually if you follow their instructions and run it that way: https://hub.docker.com/r/ollama/ollama
Pretty sure the nvidia image has everything it needs but if it’s missing anything we can add that.
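For the manual route, their docker hub page boils down to roughly this for NVIDIA (assuming the container toolkit is already configured):
```
# start an ollama container with access to all GPUs, exposing the API on 11434
docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
# then run a model inside it
docker exec -it ollama ollama run llama3.2
```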
For those with a 780M iGPU, I can confirm that GPU usage does work when running ollama in Docker using the compose file found in the discussion below. Not sure if this is worth documenting.
From that GitHub discussion (alexhegit:main → ollama:main):
> Hi. I'm interested to try this in my Minisforum UM790 Pro with AMD Ryzen 9 7940HS w/ Radeon 780M Graphics. Currently, I've allocated 16GB VRAM for my iGPU. Please let me know if it will work in my system.

Hi. I have a Minisforum UM790 Pro with AMD Ryzen 9 7940HS w/ Radeon 780M Graphics and 96 GB RAM, and I think it works.
We have 10 TOPS (based on the NPU).
Setup steps:
0/ bios: Frame buffer: Auto
1/ Install ubuntu 24.04
2/ Install ROCm
3/ Install Docker
// create docker-compose.yaml
```yaml
services:
  ollama:
    image: ollama/ollama:rocm
    container_name: ollama
    restart: unless-stopped
    devices:
      - "/dev/kfd"
      - "/dev/dri"
    volumes:
      - ollama_models:/root/.ollama
    environment:
      - HSA_OVERRIDE_GFX_VERSION=11.0.0
    ports:
      - "11434:11434"
    dns:
      - "8.8.8.8"
volumes:
  ollama_models:
```
4/ `docker compose up -d`
5/ `docker exec -it ollama ollama run llama3:8b`
6/ `/set verbose`
7/ input: `where was beethoven born?`
I receive the following llama3:8b stats:
```
total duration: 6.959157294s
load duration: 11.15427ms
prompt eval count: 16 token(s)
prompt eval duration: 568.396ms
prompt eval rate: 28.15 tokens/s
eval count: 76 token(s)
eval duration: 6.337827s
eval rate: 11.99 tokens/s
```
And the following docker ollama output:
<details>
<summary>See ollama docker logs</summary>
```
2024/10/30 21:16:07 routes.go:1158: INFO server config env="map[CUDA_VISIBLE_DEVICES: GPU_DEVICE_ORDINAL: HIP_VISIBLE_DEVICES: HSA_OVERRIDE_GFX_VERSION:11.0.0 HTTPS_PROXY: HTTP_PROXY: NO_PROXY: OLLAMA_DEBUG:false OLLAMA_FLASH_ATTENTION:false OLLAMA_GPU_OVERHEAD:0 OLLAMA_HOST:http://0.0.0.0:11434 OLLAMA_INTEL_GPU:false OLLAMA_KEEP_ALIVE:5m0s OLLAMA_LLM_LIBRARY: OLLAMA_LOAD_TIMEOUT:5m0s OLLAMA_MAX_LOADED_MODELS:0 OLLAMA_MAX_QUEUE:512 OLLAMA_MODELS:/root/.ollama/models OLLAMA_MULTIUSER_CACHE:false OLLAMA_NOHISTORY:false OLLAMA_NOPRUNE:false OLLAMA_NUM_PARALLEL:0 OLLAMA_ORIGINS:[http://localhost https://localhost http://localhost:* https://localhost:* http://127.0.0.1 https://127.0.0.1 http://127.0.0.1:* https://127.0.0.1:* http://0.0.0.0 https://0.0.0.0 http://0.0.0.0:* https://0.0.0.0:* app://* file://* tauri://*] OLLAMA_SCHED_SPREAD:false OLLAMA_TMPDIR: ROCR_VISIBLE_DEVICES: http_proxy: https_proxy: no_proxy:]"
time=2024-10-30T21:16:07.556Z level=INFO source=images.go:754 msg="total blobs: 5"
time=2024-10-30T21:16:07.556Z level=INFO source=images.go:761 msg="total unused blobs removed: 0"
time=2024-10-30T21:16:07.556Z level=INFO source=routes.go:1205 msg="Listening on [::]:11434 (version 0.3.14)"
time=2024-10-30T21:16:07.556Z level=INFO source=common.go:49 msg="Dynamic LLM libraries" runners="[cpu cpu_avx cpu_avx2 rocm_v60102]"
time=2024-10-30T21:16:07.556Z level=INFO source=gpu.go:221 msg="looking for compatible GPUs"
time=2024-10-30T21:16:07.559Z level=INFO source=amd_linux.go:386 msg="skipping rocm gfx compatibility check" HSA_OVERRIDE_GFX_VERSION=11.0.0
time=2024-10-30T21:16:07.563Z level=INFO source=types.go:123 msg="inference compute" id=0 library=rocm variant="" compute=gfx1103 driver=6.8 name=1002:15bf total="2.0 GiB" available="1.9 GiB"
[GIN] 2024/10/30 - 21:16:27 | 200 | 33.694µs | 127.0.0.1 | HEAD "/"
[GIN] 2024/10/30 - 21:16:27 | 200 | 10.974227ms | 127.0.0.1 | POST "/api/show"
time=2024-10-30T21:16:27.372Z level=INFO source=server.go:105 msg="system memory" total="92.1 GiB" free="89.7 GiB" free_swap="8.0 GiB"
time=2024-10-30T21:16:27.372Z level=INFO source=memory.go:326 msg="offload to rocm" layers.requested=-1 layers.model=33 layers.offload=5 layers.split="" memory.available="[1.9 GiB]" memory.gpu_overhead="0 B" memory.required.full="5.5 GiB" memory.required.partial="1.8 GiB" memory.required.kv="256.0 MiB" memory.required.allocations="[1.8 GiB]" memory.weights.total="3.9 GiB" memory.weights.repeating="3.5 GiB" memory.weights.nonrepeating="411.0 MiB" memory.graph.full="258.5 MiB" memory.graph.partial="677.5 MiB"
time=2024-10-30T21:16:27.372Z level=INFO source=server.go:388 msg="starting llama server" cmd="/usr/lib/ollama/runners/rocm_v60102/ollama_llama_server --model /root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa --ctx-size 2048 --batch-size 512 --embedding --n-gpu-layers 5 --threads 8 --parallel 1 --port 42097"
time=2024-10-30T21:16:27.372Z level=INFO source=sched.go:449 msg="loaded runners" count=1
time=2024-10-30T21:16:27.372Z level=INFO source=server.go:587 msg="waiting for llama runner to start responding"
time=2024-10-30T21:16:27.373Z level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server error"
INFO [main] starting c++ runner | tid="123887649047360" timestamp=1730322987
INFO [main] build info | build=10 commit="a04710e" tid="123887649047360" timestamp=1730322987
INFO [main] system info | n_threads=8 n_threads_batch=8 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 0 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | " tid="123887649047360" timestamp=1730322987 total_threads=16
INFO [main] HTTP server listening | hostname="127.0.0.1" n_threads_http="15" port="42097" tid="123887649047360" timestamp=1730322987
llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from /root/.ollama/models/blobs/sha256-6a0746a1ec1aef3e7ec53868f220ff6e389f6f8ef87a01d77c96807de94ca2aa (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = Meta-Llama-3-8B-Instruct
llama_model_loader: - kv 2: llama.block_count u32 = 32
llama_model_loader: - kv 3: llama.context_length u32 = 8192
llama_model_loader: - kv 4: llama.embedding_length u32 = 4096
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336
llama_model_loader: - kv 6: llama.attention.head_count u32 = 32
llama_model_loader: - kv 7: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 8: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: general.file_type u32 = 2
llama_model_loader: - kv 11: llama.vocab_size u32 = 128256
llama_model_loader: - kv 12: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 13: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 14: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 15: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 16: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 17: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 18: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 19: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 20: tokenizer.chat_template str = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv 21: general.quantization_version u32 = 2
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type q4_0: 225 tensors
llama_model_loader: - type q6_K: 1 tensors
time=2024-10-30T21:16:27.624Z level=INFO source=server.go:621 msg="waiting for server to become available" status="llm server loading model"
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.8000 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 128256
llm_load_print_meta: n_merges = 280147
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 8192
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 4
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 14336
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 8192
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 8B
llm_load_print_meta: model ftype = Q4_0
llm_load_print_meta: model params = 8.03 B
llm_load_print_meta: model size = 4.33 GiB (4.64 BPW)
llm_load_print_meta: general.name = Meta-Llama-3-8B-Instruct
llm_load_print_meta: BOS token = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 128009 '<|eot_id|>'
llm_load_print_meta: LF token = 128 'Ä'
llm_load_print_meta: EOT token = 128009 '<|eot_id|>'
llm_load_print_meta: EOG token = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
/opt/amdgpu/share/libdrm/amdgpu.ids: No such file or directory
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
Device 0: AMD Radeon Graphics, compute capability 11.0, VMM: no
llm_load_tensors: ggml ctx size = 0.27 MiB
llm_load_tensors: offloading 5 repeating layers to GPU
llm_load_tensors: offloaded 5/33 layers to GPU
llm_load_tensors: ROCm0 buffer size = 585.16 MiB
llm_load_tensors: CPU buffer size = 4437.80 MiB
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: ROCm0 KV buffer size = 40.00 MiB
llama_kv_cache_init: ROCm_Host KV buffer size = 216.00 MiB
llama_new_context_with_model: KV self size = 256.00 MiB, K (f16): 128.00 MiB, V (f16): 128.00 MiB
llama_new_context_with_model: ROCm_Host output buffer size = 0.50 MiB
llama_new_context_with_model: ROCm0 compute buffer size = 677.48 MiB
llama_new_context_with_model: ROCm_Host compute buffer size = 12.01 MiB
llama_new_context_with_model: graph nodes = 1030
llama_new_context_with_model: graph splits = 301
INFO [main] model loaded | tid="123887649047360" timestamp=1730322989
time=2024-10-30T21:16:29.632Z level=INFO source=server.go:626 msg="llama runner started in 2.26 seconds"
[GIN] 2024/10/30 - 21:16:29 | 200 | 2.292187393s | 127.0.0.1 | POST "/api/generate"
[GIN] 2024/10/30 - 21:16:53 | 200 | 6.959235313s | 127.0.0.1 | POST "/api/chat"
```
</details>
I don't know whether this is the optimal setup (BIOS, system, docker-compose and ollama) or not; I think not. But looking at the log, it seems to work.
radeontop output at text generation time:
<details>
<summary>See radeontop output</summary>
```
radeontop unknown, running on UNKNOWN_CHIP bus c4, 120 samples/sec
Graphics pipe 11.67% x
Event Engine 0.00% x
Vertex Grouper + Tesselator 0.00% x
Texture Addresser 0.00% x
Texture Cache 0.00% x
Shader Export 0.00% x
Sequencer Instruction Cache 0.00% x
Shader Interpolator 10.00% x
Shader Memory Exchange 0.00% x
Scan Converter 0.00% x
Primitive Assembly 0.00% x
Depth Block 0.00% x
Color Block 0.00% x
Clip Rectangle 12.50% x
76M / 1971M VRAM 3.88% x
1889M / 47148M GTT 4.01% x
2.80G / 2.80G Memory Clock 100.00% x
```
Unknown Radeon card. <= R500 won't work, new cards might.
</details>
j0rge:
You can also install it manually if you follow their instructions and run it that way: https://hub.docker.com/r/ollama/ollama
Pretty sure the nvidia image has everything it needs but if it’s missing anything we can add that.
Tried that and it works like a charm!
Following the instructions on docker hub, I ran these two commands first; no prior toolkit installation was needed (not sure if they are necessary, but I ran them just in case):
```
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
```
and then spun up an ollama container using the following compose file:
```yaml
---
services:
  ollama:
    image: ollama/ollama
    container_name: ollama
    restart: unless-stopped
    ports:
      - 11434:11434
    volumes:
      - ./ollama_v:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - capabilities:
                - gpu
```
It exposes the ollama API on localhost:11434 (or any other port mapped instead of that when spinning up the container), and with that I can connect it both to the Alpaca GUI (by specifying it as a “remote instance” in the settings) and to Zed and JetBrains IDEs.
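A quick way to check that the API is reachable (standard ollama endpoints):
```
# list the models the instance knows about
curl http://localhost:11434/api/tags

# one-shot generation against a pulled model (model name is just an example)
curl http://localhost:11434/api/generate -d '{"model": "llama3.2", "prompt": "Why is the sky blue?", "stream": false}'
```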
I don’t know how common my use case is, but if it’s something more people encounter, it could be worth adding Docker (instead of Homebrew) as a “more advanced” option in the Bluefin docs?
j0rge
January 12, 2025, 1:58pm
Yeah if you wouldn’t mind PRing it on the AI page that’d be sweet! I can take a look later today. Those last five lines are ridiculous, lol.
I’m kinda new to PRing big projects, but I should have done it here. Let me know if there’s something wrong.
Niko
January 13, 2025, 4:58pm
Wow! Really excellent documentation @shaked_coffee, thank you!
system
Closed
April 13, 2025, 4:58pm
This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.