Machine Learning

Hello everyone,

I am working on a container with ROCm support. I’ve managed to get it working with PyTorch, but there are some quirks I need help with. Here is my mlbox-rocm.ini file:

[mlbox-rocm]
image=rocm/dev-ubuntu-22.04
additional_packages="git"
#pre-init-hooks="/init_script.sh"
nvidia=true
pull=false
root=false
# The init hooks fail, so they are commented out above; see below

Now, here are the problems:

  1. By default there is not enough space in the /tmp file system to install PyTorch with ROCm support.
  2. I need multiple init hooks, because I have to run two pip install commands that use different index URLs.

For (1), the cause is simple. By default, Bluefin-DX mounts /tmp as a tmpfs of roughly 8 GB:

> df -h /tmp
Filesystem      Size  Used Avail Use% Mounted on
tmpfs           7.8G  107M  7.7G   2% /tmp

As a result, once the container is created, I install PyTorch with TMPDIR overridden to point at a directory in my home:

> TMPDIR="/home/myuser/tmp" pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.0
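As a minimal sketch, the manual workaround above could be wrapped in a small helper so that every later pip call in the same shell picks up the larger directory. The function name use_home_tmpdir is hypothetical (it is not part of pip or any container tool), and the $HOME/tmp default simply mirrors the path used in the command above.

```shell
#!/bin/sh
# Sketch only: point TMPDIR at a directory on disk so pip does not unpack
# large wheels into the small tmpfs mounted on /tmp.
# use_home_tmpdir is a hypothetical helper name, not part of any tool.
use_home_tmpdir() {
    dir="${1:-$HOME/tmp}"
    mkdir -p "$dir" || return 1   # create the directory if it is missing
    TMPDIR="$dir"
    export TMPDIR                 # make it visible to child processes such as pip
}

# Switch to the home-based directory, then show the free space pip will see.
use_home_tmpdir "$HOME/tmp" && df -h "$TMPDIR"
```

With TMPDIR exported this way, the pip3 install command can be run without the inline TMPDIR= prefix.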

So my first question is: is it possible to override the temporary directory in the manifest instructions so that the user’s home is used?

Regarding the second problem: as you can see, --index-url https://download.pytorch.org/whl/rocm6.0 is required to get the correct PyTorch build for ROCm. I therefore need to run two separate commands: the first is the one immediately above, and the second pulls in the extra packages afterward. Something like:

# This hook must run first to install PyTorch for ROCm. Note that it must also override TMPDIR
init_hooks_01="TMPDIR=<what_can_i_use?> pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.0"

# This hook can run second. It will resolve PyTorch as already installed and only add the extras
init_hooks_02="pip3 install huggingface_hub tokenizers transformers accelerate datasets peft bitsandbytes"

I have created the container and run the two commands manually, and PyTorch works fine with my Radeon 6800.
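For reference, the two manual steps could be sketched as a pair of shell functions to run inside the container. This is only an illustration of the ordering described above, not a working init hook: the function names, the PYTORCH_INDEX variable, and the PIP override (handy for a dry run with PIP=echo) are all made up for this sketch, and the $HOME/tmp fallback is an assumption based on the path used earlier.

```shell
#!/bin/sh
# Sketch only: the two manual install steps from the post, as functions.
# All names below are hypothetical and exist only for illustration.
PYTORCH_INDEX="https://download.pytorch.org/whl/rocm6.0"

# Step 1: install the ROCm build of PyTorch, using a temp directory on disk
# because the default /tmp tmpfs is too small for the wheels.
install_pytorch_rocm() {
    tmp="${TMPDIR:-$HOME/tmp}"
    mkdir -p "$tmp" || return 1
    TMPDIR="$tmp" ${PIP:-pip3} install torch torchvision torchaudio \
        --index-url "$PYTORCH_INDEX"
}

# Step 2: install the extras from the default index. This relies on pip
# treating the torch installed in step 1 as already satisfied.
install_extras() {
    ${PIP:-pip3} install huggingface_hub tokenizers transformers accelerate \
        datasets peft bitsandbytes
}
```

Running install_pytorch_rocm && install_extras keeps the order fixed and skips the second step if the first one fails.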