Managing Fedora toolboxes with Ansible (including PySpark installation)

I’ve been using Toolbx to manage development toolboxes on Fedora for quite a few years now, but I’ve been a bit lazy about automating the setup of new toolbox containers. I’ve finally got a mostly-working Ansible playbook that I can share here. Getting PySpark to work took a bit of trial and error, so this will definitely be useful in the future.

Briefly, a toolbox is a mutable container that lets you install development toolchains - or anything else - in a way that is isolated from the host operating system. The advantage of using toolbox over ordinary Podman or Docker containers is that everything “just works”: your home directory is mounted, networking is configured, USB devices work, and so on.

The full playbook is embedded below, but I thought I’d call out the interesting parts.

Prerequisites

The playbook doesn’t create the toolbox, so we have to do that manually.

toolbox create fedora40-pyspark

A new toolbox doesn’t include Ansible, so we need to install that too.

toolbox enter fedora40-pyspark
sudo dnf install ansible

To get Spark to work, I needed to set a few environment variables. However, I didn’t want to add them to my main .bashrc, since they only apply inside this container.

I added the following section at the end of my .bashrc.

# toolbox-specific env vars

if [ -f /run/.toolboxenv ]; then
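    # /run/.containerenv defines $name, the name of this toolbox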
    source /run/.containerenv
    if [ -f ~/.toolbox-$name.rc ]; then
        source ~/.toolbox-$name.rc
    fi
fi

This detects whether we are running in a toolbox, and if so it sources /run/.containerenv (which defines $name) and looks for a supplementary file called .toolbox-<toolbox-name>.rc. This change to .bashrc is made manually, but the playbook will create the toolbox-specific file.

Details

At the beginning we make sure we are running inside a toolbox by checking for the same /run/.toolboxenv file. This is important, as the playbook relies on being able to write to our home directory as well as to the container filesystem. We also retrieve the name of the toolbox so we can use it to create the toolbox-specific environment file.
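A minimal sketch of what that guard can look like - the task names and the toolbox_name fact are my own illustration, reusing the same source-/run/.containerenv trick as the .bashrc snippet above:

- name: Check for the toolbox marker file
  ansible.builtin.stat:
    path: /run/.toolboxenv
  register: toolboxenv

- name: Stop unless we are inside a toolbox
  ansible.builtin.fail:
    msg: this playbook must be run inside a toolbox container
  when: not toolboxenv.stat.exists

- name: Retrieve the toolbox name (the /run/.containerenv trick again)
  ansible.builtin.shell: . /run/.containerenv && echo "$name"
  register: toolbox_name_out
  changed_when: false

- name: Record the name for later tasks
  ansible.builtin.set_fact:
    toolbox_name: "{{ toolbox_name_out.stdout }}"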

After that there is some generic setup stuff, mostly installing the tools I use for work.
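As an illustration, a single dnf task covers this - the package names here are placeholders for whatever tools you actually need:

- name: Install development tools (placeholder package list)
  become: true
  ansible.builtin.dnf:
    name:
      - git
      - gcc
      - python3-pip
      - java-17-openjdk-headless  # Spark needs a JRE
    state: present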

For Spark, we download and install to /opt/spark-x.y.z, where x.y.z is the Spark version number.
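A sketch of that download step, assuming a spark_version variable is defined elsewhere - the Apache archive URL is my assumption, and the upstream tarball actually unpacks to a directory with a -bin-hadoop3 suffix, which the real playbook can rename (or symlink) to plain /opt/spark-x.y.z:

- name: Download and unpack Spark under /opt (creates makes it idempotent)
  become: true
  ansible.builtin.unarchive:
    src: "https://archive.apache.org/dist/spark/spark-{{ spark_version }}/spark-{{ spark_version }}-bin-hadoop3.tgz"
    dest: /opt
    remote_src: true
    creates: "/opt/spark-{{ spark_version }}-bin-hadoop3"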

The Spark-specific environment variables are written to the toolbox-specific environment file, as sketched below.

The key to getting PySpark to work inside the toolbox turned out to be adding export SPARK_LOCAL_IP=localhost alongside the rest of the “standard” variables. This is needed because the hostname inside the toolbox is always toolbox, and that name can’t be resolved to an IP address; /etc/hosts is symlinked (indirectly) to the host’s /etc/hosts and isn’t writable, so we can’t just add an entry for it.
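Writing the rc file can then be a single blockinfile task - this is a sketch with an illustrative variable list; toolbox_name and spark_version come from the earlier tasks, and the SPARK_HOME path assumes the rename to /opt/spark-x.y.z mentioned above:

- name: Write the Spark variables to the toolbox-specific rc file
  ansible.builtin.blockinfile:
    path: "~/.toolbox-{{ toolbox_name }}.rc"
    create: true
    block: |
      export SPARK_HOME=/opt/spark-{{ spark_version }}
      export PATH=$PATH:$SPARK_HOME/bin
      export SPARK_LOCAL_IP=localhost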

Finally we do a bit of Spark config and install PySpark.
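The PySpark install can be a one-task pip step; pinning it to the Spark version keeps the two in step (whether to pin, and whether to install per-user, are choices I’m assuming here):

- name: Install PySpark, pinned to the installed Spark version
  ansible.builtin.pip:
    name: "pyspark=={{ spark_version }}"
    extra_args: --user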

Running the playbook

I run this from inside the container with

ansible-playbook ~/code/misc/toolbox-setup.yml --ask-become-pass 

Toolbox containers have passwordless sudo, but Ansible doesn’t know that by default, so entering the password by hand was easier than configuring it.

After running the playbook, it’s best to exit and re-enter the toolbox so that the new environment variables are picked up.

exit

toolbox enter fedora40-pyspark

The complete playbook follows; hopefully it’s a useful starting point for somebody else.