Hello everyone,
I am working on a container with ROCm support. I've managed to get it working with PyTorch, but there are some quirks I need help with. Here is my mlbox-rocm.ini file:
[mlbox-rocm]
image=rocm/dev-ubuntu-22.04
additional_packages="git"
#pre-init-hooks="/init_script.sh"
nvidia=true
pull=false
root=false
# Init hooks will fail so commenting them out, see below
Now, here are the problems:
- By default, there is not enough space in the /tmp filesystem to install PyTorch with ROCm support.
- I need multiple init hooks, because I have to run two pip install commands that use different index sites.
The first problem is simple: by default, Bluefin-DX mounts an 8GB tmpfs at /tmp:
> df -h /tmp
Filesystem Size Used Avail Use% Mounted on
tmpfs 7.8G 107M 7.7G 2% /tmp
As a result, once the container is created, I install PyTorch by overriding TMPDIR:
> TMPDIR="/home/myuser/tmp" pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.0
So my first question is: is it possible to override the temporary directory in the manifest so that a directory under the user's home is used instead?
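To make the space problem concrete, here is a minimal sketch of a helper that falls back to another directory when /tmp is too small. The function names, the fallback path and the threshold are all hypothetical; I don't know the exact amount of space the ROCm wheels need, only that 7.7G was not enough.

```shell
# Hypothetical helpers, not part of any existing tool.

# Print the available space (in KiB) on the filesystem holding the given path.
avail_kib() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Print /tmp if it has at least $1 KiB free, otherwise print the fallback $2.
pick_tmpdir() {
    required_kib="$1"
    fallback="$2"
    if [ "$(avail_kib /tmp)" -ge "$required_kib" ]; then
        printf '%s\n' /tmp
    else
        printf '%s\n' "$fallback"
    fi
}

# Example use (the 16 GiB threshold is a guess, adjust as needed):
#   mkdir -p "$HOME/tmp"
#   TMPDIR="$(pick_tmpdir 16777216 "$HOME/tmp")" pip3 install torch ...
```

This only chooses a directory; it does not create the fallback, so it would have to exist before pip runs.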
Now, regarding the second problem: as you can see, you need to specify --index-url https://download.pytorch.org/whl/rocm6.0 to get the correct PyTorch build for ROCm. Therefore I need to run two separate commands: the first is the one immediately above, and the second pulls in the extra packages afterward. Something like:
# This hook must run first to install PyTorch for ROCm. Note that it must also override TMPDIR
init_hooks_01="TMPDIR=<what_can_i_use?> pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.0"
# This hook can run second, it will resolve pytorch as already-installed and only add the extras
init_hooks_02="pip3 install huggingface_hub tokenizers transformers accelerate datasets peft bitsandbytes"
I have created the container and run the two commands manually, and PyTorch works fine with my Radeon 6800.
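If multiple hooks turn out not to be possible, one workaround might be a single script that runs both installs in order. Below is a rough sketch under stated assumptions: the function names, the PIP_TMPDIR and DRY_RUN variables, and the $HOME/tmp default are made up for illustration, and I have not verified how such a script would be wired into the manifest. By default it only prints the commands; setting DRY_RUN=0 would actually run them.

```shell
#!/bin/sh
# Hypothetical install script; variable and function names are assumptions.

ROCM_INDEX_URL="https://download.pytorch.org/whl/rocm6.0"
PIP_TMPDIR="${PIP_TMPDIR:-$HOME/tmp}"   # assumed scratch directory outside /tmp
DRY_RUN="${DRY_RUN:-1}"                 # 1 = print commands only, 0 = execute them

# Either print the command or execute it, depending on DRY_RUN.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

install_all() {
    # Make sure the scratch directory exists before pip uses it.
    run mkdir -p "$PIP_TMPDIR" || return 1
    # Step 1: PyTorch for ROCm, from the ROCm wheel index, with TMPDIR overridden.
    run env TMPDIR="$PIP_TMPDIR" pip3 install torch torchvision torchaudio \
        --index-url "$ROCM_INDEX_URL" || return 1
    # Step 2: extras from the default index; torch should already be satisfied.
    run pip3 install huggingface_hub tokenizers transformers accelerate \
        datasets peft bitsandbytes
}

# To run for real: DRY_RUN=0 install_all
```

Stopping after a failed first step keeps the second pip call from pulling in a non-ROCm torch as a dependency.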