Bluefin-dx: Nvidia woes

Hello,

I have a rather new Yoga Pro 9i (2024 model, 16IMH9) that has an Nvidia RTX 4060 alongside its Intel Arc graphics, and I am running the latest Bluefin-DX. My desktop environment is Gnome on Wayland.

Everything works fine when I disable the Nvidia GPU in the BIOS, but when I have it enabled I have quite a few issues:

  1. Battery life is terrible
  2. System log is spammed (probably related to 1 above) with "failure to set power limit" messages
  3. When my laptop goes to sleep, it sometimes will not wake up (it’s hit and miss)

Apr 26 20:50:40 myhost /usr/bin/nvidia-powerd[6008]: error setting power limit
Apr 26 20:50:40 myhost /usr/bin/nvidia-powerd[6008]: Error setting GPU limit: 55000.
Apr 26 20:50:40 myhost /usr/bin/nvidia-powerd[6008]: error setting power limit
Apr 26 20:50:40 myhost /usr/bin/nvidia-powerd[6008]: Error setting GPU limit: 55000.
...
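
For anyone wanting to quantify this kind of log spam, here is a minimal sketch that counts the matching lines. The helper name count_powerd_errors is made up for illustration, and the pattern is simply taken from the messages above.

```shell
# Count nvidia-powerd "Error setting GPU limit" lines read from stdin.
# count_powerd_errors is a hypothetical helper, not part of any NVIDIA tool.
count_powerd_errors() {
    # grep -c exits non-zero when the count is 0, so keep the function's status clean.
    grep -c 'nvidia-powerd\[[0-9]*\]: Error setting GPU limit' || true
}

# Usage on a live system (current boot only):
#   journalctl -b | count_powerd_errors
```

The lowercase "error setting power limit" line is deliberately not counted, so each failed attempt is counted once rather than twice.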

Is this typical of Nvidia due to its closed-source drivers? Is there any hope of getting things to work in a stable way?

Can you post the output of rpm-ostree status?

Of course, here is my status:

❯ rpm-ostree status
State: idle
AutomaticUpdates: stage; rpm-ostreed-automatic.timer: no runs since boot
Deployments:
● ostree-image-signed:docker://ghcr.io/ublue-os/bluefin-dx-nvidia:latest
                   Digest: sha256:54804a33df44857592b9639a2da8c6d3a9553bb96d78fa7ff6ea21677efecca5
                  Version: 40.20240423.0 (2024-04-24T14:54:46Z)
          LayeredPackages: adcli lm_sensors oddjob-mkhomedir samba-common-tools sssd-ad
                           touchegg

  ostree-image-signed:docker://ghcr.io/ublue-os/bluefin-dx-nvidia:latest
                   Digest: sha256:54804a33df44857592b9639a2da8c6d3a9553bb96d78fa7ff6ea21677efecca5
                  Version: 40.20240423.0 (2024-04-24T14:54:46Z)
          LayeredPackages: adcli lm_sensors oddjob-mkhomedir samba-common-tools sssd-ad
                           touchegg

Here is the lspci -k info for the adapters:

00:02.0 VGA compatible controller: Intel Corporation Meteor Lake-P [Intel Arc Graphics] (rev 08)
	Subsystem: Lenovo Device 3e47
	Kernel driver in use: i915
	Kernel modules: i915, xe
...
01:00.0 VGA compatible controller: NVIDIA Corporation AD107M [GeForce RTX 4060 Max-Q / Mobile] (rev a1)
	Subsystem: Lenovo Device 3e47
	Kernel driver in use: nvidia
	Kernel modules: nouveau, nvidia_drm, nvidia

Note that nouveau is not used due to the kernel parameters pre-configured by the bluefin-dx-nvidia image:

❯ journalctl -b | grep Command
Apr 26 21:54:44 fedora kernel: Command line: BOOT_IMAGE=(hd0,gpt2)/ostree/default-1a8fdd4d36ffc5ae1fe73ef692f3d5fa06ce6e5c5d16715f3299fa9ab32081f8/vmlinuz-6.8.7-300.fc40.x86_64 rd.luks.uuid=luks-0f45e4b2-02d1-4a30-9462-a67ed1db53bd rhgb quiet root=UUID=72ba393e-d809-4349-aec7-0b761a41a98e rootflags=subvol=root rw ostree=/ostree/boot.1/default/1a8fdd4d36ffc5ae1fe73ef692f3d5fa06ce6e5c5d16715f3299fa9ab32081f8/0 rd.driver.blacklist=nouveau modprobe.blacklist=nouveau nvidia-drm.modeset=1 rd.luks.options=discard initcall_blacklist=simpledrm_platform_driver_init
Apr 26 21:54:44 fedora kernel: Command line: BOOT_IMAGE=(hd0,gpt2)/ostree/default-1a8fdd4d36ffc5ae1fe73ef692f3d5fa06ce6e5c5d16715f3299fa9ab32081f8/vmlinuz-6.8.7-300.fc40.x86_64 rd.luks.uuid=luks-0f45e4b2-02d1-4a30-9462-a67ed1db53bd rhgb quiet root=UUID=72ba393e-d809-4349-aec7-0b761a41a98e rootflags=subvol=root rw ostree=/ostree/boot.1/default/1a8fdd4d36ffc5ae1fe73ef692f3d5fa06ce6e5c5d16715f3299fa9ab32081f8/0 rd.driver.blacklist=nouveau modprobe.blacklist=nouveau nvidia-drm.modeset=1 rd.luks.options=discard initcall_blacklist=simpledrm_platform_driver_init
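
Rather than reading the long command line by eye, a single parameter can be checked for directly. This is a small sketch; has_karg is a hypothetical helper name, and it only does an exact, whole-word match on the string it is given.

```shell
# Return success if the given kernel command line contains the exact parameter.
# has_karg is a hypothetical helper name used only for this example.
has_karg() {
    local cmdline="$1" wanted="$2" arg
    # Unquoted on purpose: split the command line on whitespace.
    for arg in $cmdline; do
        [ "$arg" = "$wanted" ] && return 0
    done
    return 1
}

# Usage on a live system:
#   has_karg "$(cat /proc/cmdline)" modprobe.blacklist=nouveau && echo "nouveau is blacklisted"
```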

Here is what is actually loaded:

❯ lsmod | grep -E "nouveau|nvidia"
nvidia_wmi_ec_backlight    12288  0
nvidia_drm            122880  3
nvidia_modeset       1605632  2 nvidia_drm
video                  77824  5 nvidia_wmi_ec_backlight,ideapad_laptop,xe,i915,nvidia_modeset
wmi                    36864  4 video,nvidia_wmi_ec_backlight,wmi_bmof,ideapad_laptop
nvidia_uvm           6656000  0
nvidia              60497920  42 nvidia_uvm,nvidia_modeset

Just adding that I found the docs for nvidia-powerd, and they seem to have “XFree86” in the URL. Could it be that this is simply not supported on Wayland?

Anyway, running the command from the documentation seems to indicate that dynamic boost is “not supported”, as I get:

❯ nvidia-settings -q DynamicBoostSupport

ERROR: Error resolving target specification '' (No targets match target specification), specified in query 'DynamicBoostSupport'.

EDIT: I have stopped the service (systemctl stop nvidia-powerd) to stop the log spamming.
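
Note that systemctl stop only lasts until the service is started again. If the unit is enabled on this image (an assumption, I have not checked how the image ships it), it would come back on the next boot. A hedged sketch of a helper that prints the stop-and-disable command by default and only runs it when asked; powerd_off is a made-up name:

```shell
# Print the command that stops nvidia-powerd and disables it for future boots.
# Pass "run" as the first argument to execute it instead of printing it.
# powerd_off is a hypothetical helper name; the unit name comes from the
# "systemctl stop nvidia-powerd" command used in this thread.
powerd_off() {
    local cmd="systemctl disable --now nvidia-powerd.service"
    if [ "${1:-}" = "run" ]; then
        # Unquoted on purpose so the string splits into command and arguments.
        sudo $cmd
    else
        echo "$cmd"
    fi
}

# Usage:
#   powerd_off        # dry run, prints the command
#   powerd_off run    # actually stops and disables the service
```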

I have logged into Gnome using Xorg and now:

❯ nvidia-settings -q DynamicBoostSupport

  Attribute 'DynamicBoostSupport' (aurelius.ad.home.lan:0[gpu:0]): 1.
    'DynamicBoostSupport' is a boolean attribute; valid values are: 1 (on/true) and 0
    (off/false).
    'DynamicBoostSupport' is a read-only attribute.
    'DynamicBoostSupport' can use the following target types: GPU.

So it would seem that nvidia-powerd is an Xorg-only service? But even so, after restarting the service, I get the same two messages every 2 seconds:

Apr 26 22:27:53 myhost /usr/bin/nvidia-powerd[29417]: error setting power limit
Apr 26 22:27:53 myhost /usr/bin/nvidia-powerd[29417]: Error setting GPU limit: 55000.
Apr 26 22:27:55 myhost /usr/bin/nvidia-powerd[29417]: error setting power limit
Apr 26 22:27:55 myhost /usr/bin/nvidia-powerd[29417]: Error setting GPU limit: 55000.
...
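
On the Wayland versus Xorg difference above: my reading (not something the output itself states) is that nvidia-settings finds its GPU targets through the NVIDIA X driver, so the query failing under Wayland says more about nvidia-settings than about nvidia-powerd. A small sketch that reports what to expect for a given session type; nvsettings_hint is a hypothetical helper and the messages are my own wording:

```shell
# Describe whether an nvidia-settings query is likely to find a target,
# based on the session type (normally taken from $XDG_SESSION_TYPE).
# nvsettings_hint is a hypothetical helper name.
nvsettings_hint() {
    case "${1:-unknown}" in
        x11)     echo "x11: nvidia-settings queries should find GPU targets" ;;
        wayland) echo "wayland: nvidia-settings may report no matching targets" ;;
        *)       echo "unknown session type: ${1:-unknown}" ;;
    esac
}

# Usage:
#   nvsettings_hint "$XDG_SESSION_TYPE"
```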

I tried to suspend/resume from Xorg and it failed. Here are the last messages from the journal:

Apr 26 22:30:27 myhost /usr/bin/nvidia-powerd[29417]: error setting power limit
Apr 26 22:30:27 myhost /usr/bin/nvidia-powerd[29417]: Error setting GPU limit: 55000.
Apr 26 22:30:29 myhost /usr/bin/nvidia-powerd[29417]: error setting power limit
Apr 26 22:30:29 myhost /usr/bin/nvidia-powerd[29417]: Error setting GPU limit: 55000.
Apr 26 22:30:31 myhost /usr/bin/nvidia-powerd[29417]: error setting power limit
Apr 26 22:30:31 myhost /usr/bin/nvidia-powerd[29417]: Error setting GPU limit: 55000.
Apr 26 22:30:31 myhost /usr/libexec/gdm-x-session[23845]: (EE) NVIDIA(GPU-0): WAIT (1, 8, 0x8000, 0x0000072c, 0x00000d2c)
Apr 26 22:30:31 myhost /usr/libexec/gdm-x-session[23845]: (II) NVIDIA(G0): Setting mode "NULL"
Apr 26 22:30:31 myhost kernel: BUG: kernel NULL pointer dereference, address: 0000000000000008
Apr 26 22:30:31 myhost kernel: #PF: supervisor read access in kernel mode
Apr 26 22:30:31 myhost kernel: #PF: error_code(0x0000) - not-present page
Apr 26 22:30:31 myhost kernel: PGD 0 P4D 0 
Apr 26 22:30:31 myhost kernel: Oops: 0000 [#1] PREEMPT SMP NOPTI
Apr 26 22:30:31 myhost kernel: CPU: 4 PID: 23845 Comm: Xorg Tainted: P           OE      6.8.7-300.fc40.x86_64 #1
Apr 26 22:30:31 myhost kernel: Hardware name: LENOVO 83DN/LNVNB161216, BIOS NKCN25WW 02/05/2024
Apr 26 22:30:31 myhost kernel: RIP: 0010:_nv002475kms+0x29/0xb0 [nvidia_modeset]
Apr 26 22:30:31 myhost kernel: Code: 00 f3 0f 1e fa 55 41 b8 14 00 00 00 48 89 e5 41 56 49 89 ce 41 55 48 8d 4d cc 41 89 d5 41 54 49 89 fc 53 48 89 f3 48 83 ec 20 <48>
Apr 26 22:30:31 myhost kernel: RSP: 0018:ffffadb2647e38e8 EFLAGS: 00010286
Apr 26 22:30:31 myhost kernel: RAX: ffffffffc438b170 RBX: 0000000000000000 RCX: ffffadb2647e38f4
Apr 26 22:30:31 myhost kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffadb2401a9008
Apr 26 22:30:31 myhost kernel: RBP: ffffadb2647e3928 R08: 0000000000000014 R09: ffffadb243251008
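
When reading an oops like the one above, the module named in square brackets on the RIP line is usually the first thing to look at. A minimal sketch that pulls it out of journal text; oops_module is a hypothetical helper name:

```shell
# Print the module named in square brackets on "RIP:" lines of a kernel oops.
# Reads journal text from stdin; prints nothing if the RIP line names no module.
# oops_module is a hypothetical helper name.
oops_module() {
    sed -n 's/.*kernel: RIP: .*\[\([A-Za-z0-9_]*\)\] *$/\1/p'
}

# Usage after a failed resume (kernel messages from the previous boot):
#   journalctl -k -b -1 | oops_module
```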

Well, this is definitely an Nvidia driver issue. I was worried about this; I haven’t owned an Nvidia device in years because of it…

Ironically the card works perfectly fine with PCI-passthrough into a Windows VM!

  • I’ve reverted to the non-nvidia image of Bluefin-DX (ostree-image-signed:docker://ghcr.io/ublue-os/bluefin-dx:latest)
  • I created a new Windows 11 virtual machine using Virt-Manager
  • I added a PCI device selecting the Nvidia card

Sure enough, I’m back to a stable host system (battery life is reasonable, suspend/resume works, etc). Meanwhile the Nvidia card works fine inside the Windows VM! I was even able to suspend/resume with the Windows VM running!
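
To see which host driver currently owns the card in a setup like this, the "Kernel driver in use" line from lspci -k can be extracted. As far as I understand, libvirt normally binds a passed-through device to vfio-pci while the VM runs, but that is an assumption I have not confirmed on this machine. driver_in_use is a hypothetical helper name.

```shell
# Print the "Kernel driver in use" value(s) from `lspci -k` output on stdin.
# driver_in_use is a hypothetical helper name.
driver_in_use() {
    sed -n 's/^[[:space:]]*Kernel driver in use: //p'
}

# Usage on the host (01:00.0 is the Nvidia card's address from the lspci
# output earlier in this thread):
#   lspci -k -s 01:00.0 | driver_in_use
```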

Somewhat regretting my Yoga Pro 9i purchase now… I wonder how long it will be before there is a stable kernel/nvidia-driver combination for my machine. Here is the nvidia module info; it seems to be version 550.76:

Apr 26 22:43:18 fedora kernel: nvidia: module license 'NVIDIA' taints kernel.
Apr 26 22:43:18 fedora kernel: Disabling lock debugging due to kernel taint
Apr 26 22:43:18 fedora kernel: nvidia: module license taints kernel.
Apr 26 22:43:18 fedora kernel: nvidia-nvlink: Nvlink Core is being initialized, major device number 236
Apr 26 22:43:18 fedora kernel: 
Apr 26 22:43:18 fedora kernel: nvidia 0000:01:00.0: enabling device (0000 -> 0003)
Apr 26 22:43:18 fedora kernel: nvidia 0000:01:00.0: vgaarb: VGA decodes changed: olddecodes=io+mem,decodes=none:owns=none
Apr 26 22:43:18 fedora kernel: NVRM: loading NVIDIA UNIX x86_64 Kernel Module  550.76  Wed Apr 10 20:41:20 UTC 2024
Apr 26 22:43:18 fedora kernel: nvidia_uvm: module uses symbols nvUvmInterfaceDisableAccessCntr from proprietary module nvidia, inheriting taint.
Apr 26 22:43:18 fedora kernel: nvidia-uvm: Loaded the UVM driver, major device number 234.
Apr 26 22:43:18 fedora kernel: nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  550.76  Wed Apr 10 20:05:49 UTC 2024
Apr 26 22:43:18 fedora kernel: [drm] [nvidia-drm] [GPU ID 0x00000100] Loading driver
Apr 26 22:43:19 fedora kernel: ACPI Warning: \_SB.NPCF._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20230628/nsarguments-61)
Apr 26 22:43:19 fedora kernel: ACPI Warning: \_SB.PC00.RP12.PXSX._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20230628/nsarguments-61)
Apr 26 22:43:20 fedora kernel: nvidia-modeset: WARNING: GPU:0: Unable to read EDID for display device DP-0
Apr 26 22:43:20 fedora kernel: nvidia-modeset: WARNING: GPU:0: Unable to read EDID for display device DP-0
Apr 26 22:43:20 fedora kernel: [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:01:00.0 on minor 4
Apr 26 22:43:20 fedora kernel: nvidia 0000:01:00.0: [drm] Cannot find any crtc or sizes
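
The driver version can also be pulled out of the NVRM line instead of reading it by eye. A small sketch, with nvrm_version as a hypothetical helper name; the pattern is based only on the log line shown above.

```shell
# Print the driver version from "NVRM: loading NVIDIA UNIX ... Kernel Module"
# lines read from stdin. nvrm_version is a hypothetical helper name.
nvrm_version() {
    sed -n 's/.*NVRM: loading NVIDIA UNIX .* Kernel Module  *\([0-9][0-9.]*\) .*/\1/p'
}

# Usage on a live system:
#   journalctl -k -b | nvrm_version
# Alternatives that do not need the journal:
#   cat /proc/driver/nvidia/version
#   modinfo -F version nvidia
```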

And of course the latest Bluefin-DX kernel:

❯ uname -a
Linux myhost 6.8.7-300.fc40.x86_64 #1 SMP PREEMPT_DYNAMIC Wed Apr 17 19:21:08 UTC 2024 x86_64 GNU/Linux

I was actually able to add the PCI card to a VM running Fedora 40 as well. That VM has two GPUs: a software QXL adapter and the physical Nvidia card. For some reason the QXL adapter doesn’t take control of the login screen: the VM comes up and I can ssh into it, but I can’t log in from GDM because the login screen is not visible. It seems to render on the Nvidia output, which may need a physical monitor attached. I’ll look into it more, as it may be a good way to try newer kernels/drivers to see if things stabilize…

I will update this post if I come across anything new, in case anyone else is using the same laptop…

The new kernel and drivers just arrived today and the situation is greatly improved:

  • suspend/resume works without issue so far
  • battery life is back to normal
  • there are no errors logged from nvidia-powerd
  • Unigine Superposition runs at ~110 FPS average on Nvidia (as opposed to ~50 for Arc)

Just a few hours on this new kernel, but things are looking great.

EDIT: just for reference, the packages are

❯ rpm -qa | grep kmod-nvidia
kmod-nvidia-6.8.8-300.fc40.x86_64-550.78-1.fc40.x86_64

~ 
❯ rpm -qa | grep kernel
kernel-modules-core-6.8.8-300.fc40.x86_64
kernel-core-6.8.8-300.fc40.x86_64
kernel-modules-6.8.8-300.fc40.x86_64
kernel-6.8.8-300.fc40.x86_64
kernel-modules-extra-6.8.8-300.fc40.x86_64
kernel-tools-libs-6.8.8-300.fc40.x86_64
kernel-tools-6.8.8-300.fc40.x86_64
kernel-headers-6.8.3-300.fc40.x86_64
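
Since the fix came from a matching kernel and driver pair, a quick check that the installed kmod-nvidia package was built for the running kernel can be handy after an update. This is only a sketch: kmod_matches_kernel is a hypothetical helper, and it assumes the package naming seen above (kmod-nvidia-<kernel release>-<driver version>-...).

```shell
# Succeed if a kmod-nvidia package name was built for the given kernel release.
# $1 = package name, $2 = kernel release as printed by `uname -r`.
# kmod_matches_kernel is a hypothetical helper name.
kmod_matches_kernel() {
    case "$1" in
        kmod-nvidia-"$2"-*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage on a live system (assumes exactly one matching package is installed):
#   kmod_matches_kernel "$(rpm -qa | grep '^kmod-nvidia-[0-9]')" "$(uname -r)" \
#       && echo "kmod matches running kernel"
```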