Creating a distrobox with Tensorflow and Nvidia support

Thanks @m2Giles for the patient assistance!

n.b. This is a work in progress. Feedback is welcome!

The procedure below will allow you to create a distrobox with Tensorflow and CUDA installed for running machine learning / artificial intelligence workflows. You will also be able to connect to the distrobox from Visual Studio Code using ssh (127.0.0.1 port 2222). From there, the fact that you’re working in a container will be transparent to you.

Prerequisites

An installation of Bluefin-dx-nvidia.
An Nvidia GPU. (Tested on an RTX A4500.)

Resources

Download the three files in the gist on my GitHub.

  • nvbox.ini
    – This is the ‘assemble’ file used to create the distrobox. (A sketch of the format follows this list.)
  • check-nvidia-cuda
    – A bash script that tests for the installed libraries libcuda.so, libcudnn, and libcudart
    – It also checks for nvcc, then runs nvidia-smi to show the installed Nvidia driver versions
  • tensorflow_mnist_test
    – Checks that the GPU is available
    – Creates a basic model to test the ability to train a model using the MNIST dataset
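
For reference, a distrobox assemble manifest is a small ini file. The sketch below is not the gist’s actual nvbox.ini (the real file also sets up the ssh server used later); it only illustrates the format, written out here with a heredoc:

# A sketch only — NOT the gist’s nvbox.ini. Field names follow distrobox’s assemble format.
cat > nvbox-example.ini <<'EOF'
[nvbox]
image=nvcr.io/nvidia/tensorflow:23.12-tf2-py3
nvidia=true
home=/home/yourname/.local/share/distrobox/nvbox
pull=true
start_now=true
EOF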

Creating the Distrobox

From the folder where you downloaded the files:

distrobox assemble create --file nvbox.ini

This will pull the Tensorflow image nvcr.io/nvidia/tensorflow:23.12-tf2-py3, then create and start the container.
The home folder for the distrobox is ~/.local/share/distrobox/nvbox. This avoids polluting your own home folder.
The contents of your ~/.ssh folder will have been copied to the distrobox’s home folder. This allows password-less login to the ssh server running in the distrobox.
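
At this point you can optionally check from the host that the distrobox’s ssh server is up (this assumes the gist’s manifest starts sshd on port 2222, which the VSCode section below relies on):

# Should log in without a password and print the container's hostname
ssh -p 2222 yourname@127.0.0.1 hostname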

Testing Libraries

Enter the distrobox:
distrobox enter nvbox
You should see your prompt change to nvbox%.
You will be in your user’s home folder (not the distrobox’s home folder).
Change directory to the folder where you downloaded the files from the gist, then run the library checks:
./check-nvidia-cuda
You should see messages about libcuda.so and others being installed, the version number of nvcc, and the information from nvidia-smi. Your GPU should be listed.
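
If you prefer to run the same checks by hand, standard commands like these cover roughly the same ground (this is not the script’s exact contents):

# Look for the CUDA, cuDNN, and CUDA runtime libraries in the linker cache
ldconfig -p | grep -E 'libcuda\.so|libcudnn|libcudart'

# Confirm the CUDA compiler is present and show its version
nvcc --version

# Show driver/CUDA versions and the GPUs visible to the driver
nvidia-smi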

Testing Tensorflow

Again from the folder where you downloaded the gist files, run:
./tensorflow_mnist_test
You will see many informational messages from Tensorflow (they carry the date, time, and an I for Information). You should also see your GPU listed, such as:

2024-01-05 16:10:06.596935: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1883] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 15098 MB memory:  -> device: 0, name: NVIDIA RTX A4500, pci bus id: 0000:01:00.0, compute capability: 8.6
Epoch 1/6
2024-01-05 16:10:07.678879: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7f51006a2170 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2024-01-05 16:10:07.678909: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): NVIDIA RTX A4500, Compute Capability 8.6

Then, there will be six epochs listed as the model trains. Finally, the last few lines will be something like:

Epoch 6/6
469/469 [==============================] - 0s 940us/step - loss: 0.0614 - sparse_categorical_accuracy: 0.9824 - val_loss: 0.0801 - val_sparse_categorical_accuracy: 0.9745
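
For a quicker sanity check that doesn’t involve training, you can also ask Tensorflow directly whether it sees the GPU (using the python and tensorflow bundled in the image):

# Should print a non-empty list, e.g. [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"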

If all the above works, you should be ready to test connecting with VSCode.

VSCode via SSH

Open VSCode, open the palette (Ctrl+Shift+P), then type ssh conn
Select Remote-SSH: Connect to host
Select + Add New SSH Host...

Enter ssh yourname@127.0.0.1:2222
Select /home/yourname/.ssh/config

You will see a message at the lower right that the host has been added. Click Open Config.
The host will look like this:

Host 127.0.0.1
     HostName 127.0.0.1
     Port 2222
     User yourname

I like to change the host name to something more intuitive:

Host nvbox
	HostName 127.0.0.1
	Port 2222
	User yourname

and save the file.
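
You can also confirm the new alias from a terminal before going back to VSCode:

# Uses the Host nvbox entry from ~/.ssh/config
ssh nvbox hostname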

Now open the command palette again with Ctrl+Shift+P.
Select Remote-SSH: Connect to host
You should see your host listed, which you can select.

From here, you should be able to run Python notebooks and scripts using Tensorflow.

SSH Strict Host Checking

The first time you ssh into a host, you are given the opportunity to add the host to your ~/.ssh/known_hosts file. If you recognize the host, you will want to add it.

If you rebuild the distrobox later, its ssh “identity” will change. When you attempt to ssh into the host, you will see a warning about the host’s identity changing, and that someone may be doing something nefarious. The message will also give you the line number of the offending host within the ~/.ssh/known_hosts file.

To fix this, just use your favorite editor to open ~/.ssh/known_hosts and delete the line. The next time you ssh into the host, you will be asked to add the host’s identity as a new host, and everything will be back to normal. (h/t @m2Giles for pointing this out.)
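
Alternatively, ssh-keygen can remove the stale entry for you. Because the distrobox is reached on a non-standard port, the entry is stored as [127.0.0.1]:2222:

# Remove the old key for the rebuilt distrobox from ~/.ssh/known_hosts
ssh-keygen -R "[127.0.0.1]:2222"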

Final Thoughts

Home

As briefly mentioned above, the home folder for the distrobox is ~/.local/share/distrobox/nvbox. This makes it easy to copy files from your host home folder /home/yourname to the distrobox’s home folder.
I also add a final cd command in my .zshrc so that when I’m given the shell prompt, I’m in ~/.local/share/distrobox/nvbox, rather than ~/.
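
In practice that last line can be as simple as the following (substitute your own username, and adjust the path if your distrobox home lives elsewhere):

# Final line of the .zshrc read inside the distrobox: start new shells in the distrobox home
cd /home/yourname/.local/share/distrobox/nvbox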

Installing other software

Since the distrobox is based on Ubuntu, after you distrobox enter nvbox, you can use apt to install software.
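
For example (htop is just a stand-in; install whatever you need):

sudo apt update
sudo apt install -y htop
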
I have a ~/.local/bin folder in my distrobox where I put things like


Update

Using vscode through ssh to the distrobox is painfully slow.

I’ve installed vscode in the distrobox, exported the app, and it is a much better experience. One drawback is that I haven’t been able to sync settings by logging into GitHub. This isn’t a huge deal for me, as I just install the extensions I need and change a few settings.

Download the vscode .deb file (from https://code.visualstudio.com/download).

Then:
distrobox enter nvbox

sudo apt install /home/yourname/Downloads/code_<version>_amd64.deb

distrobox-export --app code
And vscode will appear when you search for apps. You will have (at least) two entries: the one installed on the host and the one installed in the distrobox.


This seems to be a better solution. Probably old hat to cloud natives, but a long time coming for me.

Edit and build this Dockerfile, run the resulting image, then use Dev Containers in vscode to attach to the running container.

# Use the existing image as the base
FROM nvcr.io/nvidia/tensorflow:24.05-tf2-py3

# Set the working directory
WORKDIR /workspace

# Create a user and group with specified IDs
RUN groupadd -g 1000 john && \
    useradd -u 1000 -g 1000 -m -s /bin/bash john

USER root

# Set timezone to EST and configure tzdata non-interactively
ENV TZ=America/New_York
ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get update && \
    apt-get install -y tzdata && \
    ln -fs /usr/share/zoneinfo/America/New_York /etc/localtime && \
    dpkg-reconfigure --frontend noninteractive tzdata


# Install the desired Python version (Python 3.11.9)
RUN apt-get update && \
    apt-get install -y software-properties-common && \
    add-apt-repository ppa:deadsnakes/ppa && \
    apt-get update && \
    apt-get install -y python3.11 python3.11-dev python3.11-venv python3.11-distutils && \
    update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.11 1 && \
    update-alternatives --install /usr/bin/python python /usr/bin/python3.11 1

# Ensure pip is up-to-date and points to the new Python version
RUN python3.11 -m ensurepip && \
    python3.11 -m pip install --upgrade pip

USER john

# Install additional Python packages
RUN pip install --no-cache-dir lxml==5.2.1
RUN pip install --no-cache-dir numpy==1.26.4
RUN pip install --no-cache-dir scikit-learn==1.4.2
RUN pip install --no-cache-dir scipy==1.13.0
RUN pip install --no-cache-dir ipykernel==6.29.4

RUN pip install --no-cache-dir dvc


USER root
# Install VSCode packages (extensions)
RUN apt-get update && apt-get install -y wget \
    && wget -qO- https://packages.microsoft.com/keys/microsoft.asc | gpg --dearmor > microsoft.gpg \
    && install -o root -g root -m 644 microsoft.gpg /etc/apt/trusted.gpg.d/ \
    && sh -c 'echo "deb [arch=amd64] https://packages.microsoft.com/repos/vscode stable main" > /etc/apt/sources.list.d/vscode.list' \
    && apt-get update \
    && apt-get install -y code

# Other packages
RUN apt-get install -y sudo

RUN usermod -aG sudo john
RUN echo "john ALL=(ALL) NOPASSWD:ALL" >>/etc/sudoers

USER john

# Optionally install specific VSCode extensions
RUN code --install-extension jebbs.plantuml
RUN code --install-extension mechatroner.rainbow-csv
RUN code --install-extension mhutchie.git-graph
RUN code --install-extension ms-python.black-formatter
RUN code --install-extension ms-python.debugpy
RUN code --install-extension ms-python.isort
RUN code --install-extension ms-python.mypy-type-checker
RUN code --install-extension ms-python.pylint
RUN code --install-extension ms-python.python
RUN code --install-extension ms-python.vscode-pylance
RUN code --install-extension ms-toolsai.jupyter
RUN code --install-extension ms-toolsai.jupyter-keymap
RUN code --install-extension ms-toolsai.jupyter-renderers
RUN code --install-extension ms-toolsai.tensorboard
RUN code --install-extension ms-toolsai.vscode-jupyter-cell-tags
RUN code --install-extension ms-toolsai.vscode-jupyter-slideshow

USER root
RUN rm -rf /var/lib/apt/lists/* microsoft.gpg

RUN apt-get clean

# Switch to the new user
USER john

# Set the entrypoint (optional)
CMD ["bash"]
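
To build the image with the tag used by the run script below (project:2.0 is just the tag that script expects; pick whatever suits your project):

# Run from the directory containing the Dockerfile
docker build -t project:2.0 .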

I use this script to run the container, mounting a volume where my source is located. My understanding is that podman should be a drop-in replacement for docker.

# NVIDIA's container docs recommend unlimited locked memory: --ulimit memlock=-1
docker run --gpus all \
	--ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
	-u 1000:1000 \
	-v /home/john/work/project:/workspace \
	-it \
	--rm \
	--name project2 \
	project:2.0
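
If you use podman instead, the same flags should mostly carry over, but GPU access may need the CDI device syntax rather than --gpus all — something like:

podman run --device nvidia.com/gpu=all \
	--ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
	-u 1000:1000 \
	-v /home/john/work/project:/workspace \
	-it --rm --name project2 \
	project:2.0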

Thank you for writing this up! It was really useful to me.


You’re welcome!
I’m glad it was helpful.
I still don’t feel like I have a good grasp of dev containers.