Local PySpark Development on Windows with WSL2, Docker Desktop, and VSCode

Installing Prerequisites

I’m not a frequent user of Windows, but I understand getting dependencies installed for local development can sometimes be a bit of a pain. I’m using an Azure VM¹, but these instructions should work on a regular Windows 10 installation. Since I’m not a “Windows Insider”, I followed the manual steps here to get WSL installed, then upgraded to WSL2. The steps are reproduced here for convenience:

Setting up WSL2

  1. Enable WSL
dism.exe /online /enable-feature /featurename:Microsoft-Windows-Subsystem-Linux /all /norestart
  2. Enable the “Virtual Machine Platform”
dism.exe /online /enable-feature /featurename:VirtualMachinePlatform /all /norestart
  3. Reboot

  4. Download and run the Linux Kernel update package from https://wslstorestorage.blob.core.windows.net/wslblob/wsl_update_x64.msi

  5. Set the default WSL version to v2

wsl --set-default-version 2
  6. Install a Linux distribution from the Microsoft Store. I chose Ubuntu 20.04. Naturally, this operation requires signing in with a Microsoft account.

When you launch the Ubuntu “app” for the first time, it will set up the filesystem, and prompt you for a username and password.

Docker Desktop

I installed this from https://www.docker.com/products/docker-desktop. This requires a logout/login after installation. I skipped the tutorial.

Visual Studio Code

I installed this from https://code.visualstudio.com. The first time I ran it, it prompted me to install the recommended extensions for WSL.

Installing the recommended extensions

At the time of writing, this is a single extension, Remote - WSL. The important extension for our purposes, however, is the Remote - Containers extension.

There are some tutorials that cover a range of scenarios for VS Code Remote Development at https://code.visualstudio.com/docs/remote/remote-overview , which are worth looking through.

Configuring the project

If you don’t have an existing project to set up, create a folder to hold a new project along with its configuration. I am using ~\source\repos\sparkstuff. Next, open this folder in VS Code. This is the folder where our Python files will be stored. Inside this folder, create another folder .devcontainer, and inside that create a file devcontainer.json.

The devcontainer.json file under the source folder

This file should have the following contents:


// For format details, see https://aka.ms/devcontainer.json. For config options, see the README at:
// https://github.com/microsoft/vscode-dev-containers/tree/v0.163.1/containers/debian
{
	"name": "pyspark",
	"image": "jupyter/pyspark-notebook",

	// Set *default* container specific settings.json values on container create.
	"settings": { 
		"terminal.integrated.shell.linux": "/bin/bash"
	},

	// Add the IDs of extensions you want installed when the container is created.
	"extensions": [],

	// Use 'forwardPorts' to make a list of ports inside the container available locally.
    "forwardPorts": [8888,4040],

	// Uncomment to use the Docker CLI from inside the container. See https://aka.ms/vscode-remote/samples/docker-from-docker.
	// "mounts": [ "source=/var/run/docker.sock,target=/var/run/docker.sock,type=bind" ],

	// Uncomment when using a ptrace-based debugger like C++, Go, and Rust
	// "runArgs": [ "--cap-add=SYS_PTRACE", "--security-opt", "seccomp=unconfined" ],

	// Comment out connect as root instead. More info: https://aka.ms/vscode-remote/containers/non-root.
	"remoteUser": "jovyan"
	// this is the user defined in the jupyter container
}


For this example we’re using the jupyter/pyspark-notebook image created by the Jupyter Docker Stacks project. The focus of this example isn’t Jupyter, but this image is a convenient way to get started with PySpark in Docker. If these images contain too much or too little for your purposes, they can be used as a starting point for building your own images. The ports being exposed are 8888 for the Jupyter server, and 4040 for the Spark UI.
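
Once the folder is opened in the container (next section), a quick way to confirm the image gives us a working PySpark installation is a minimal script like the one below. This is my own illustration, not part of the image or the gist; the file name sanity_check.py is hypothetical. It creates a SparkSession, prints the Spark version, and shows a one-row DataFrame, and can be run from the container’s terminal with python sanity_check.py.

# sanity_check.py - an illustrative check that PySpark works inside the container.
from pyspark.sql import SparkSession

# The jupyter/pyspark-notebook image runs Spark in local mode; local[*] uses all cores.
spark = SparkSession.builder.master("local[*]").appName("sanity-check").getOrCreate()

print("Spark version:", spark.version)

# Build a tiny in-memory DataFrame and display it, just to exercise an actual job.
df = spark.createDataFrame([("ok", 1)], ["status", "value"])
df.show()

spark.stop()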

Opening the folder in the container

The green button at the bottom left of the VS Code window opens the “Remote” menu.

Open the ‘Remote’ menu

After selecting Remote-Containers: Open Folder in Container, select the current folder. You may get a warning about poor performance of mounted folders. I concur with this warning; it does seem a bit slow. There are a few things you can do about this should you feel the need.
Performance warning

Once the container is started, you should see something like the following:
Success

Using Spark

Visual Studio Code has opened our folder on the Windows filesystem, and mounted it into the pyspark container. Inside the terminal window, which is attached to the container, we can run pyspark interactively:

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.1.1
      /_/

Using Python version 3.8.8 (default, Feb 20 2021 16:22:27)
Spark context Web UI available at http://c13572fdf7e7:4040
Spark context available as 'sc' (master = local[*], app id = local-1615335763786).
SparkSession available as 'spark'.
>>> from pyspark.sql.functions import *
>>> quizresults = spark.read.json('quizresults.json')
>>> winner = quizresults.orderBy(desc("Points")).first()
>>> winner
Row(Name='Roscoe P Coltrane', Points=68)
>>> 

or submit jobs with spark-submit:

jovyan@c13572fdf7e7:/workspaces/sparkstuff$ spark-submit quizresults.py
...
...
21/03/10 01:31:38 INFO DAGScheduler: Job 3 finished: showString at NativeMethodAccessorImpl.java:0, took 0.298366 s
21/03/10 01:31:38 INFO CodeGenerator: Code generated in 63.5178 ms
+-----------------+------+
|             Name|Points|
+-----------------+------+
|Roscoe P Coltrane|    68|
+-----------------+------+

21/03/10 01:31:38 INFO SparkUI: Stopped Spark web UI at http://c13572fdf7e7:4040
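
The contents of quizresults.py aren’t shown above, but reconstructed from the interactive session it could look roughly like the sketch below (the actual script in the gist may differ). Unlike the pyspark shell, a submitted script has to create its own SparkSession, and this sketch assumes quizresults.json is newline-delimited JSON with Name and Points fields.

# quizresults.py - a sketch of a job for spark-submit, reconstructed from the
# interactive example above; the real script in the gist may differ.
from pyspark.sql import SparkSession
from pyspark.sql.functions import desc

# spark-submit does not create a session for us, so build one here.
spark = SparkSession.builder.appName("quizresults").getOrCreate()

# Assumes quizresults.json is newline-delimited JSON with Name and Points fields.
quizresults = spark.read.json("quizresults.json")

# Print the highest-scoring row, which produces the table shown in the output above.
quizresults.orderBy(desc("Points")).limit(1).show()

spark.stop()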

Conclusion

This seems a bit slow, but it’s workable. I think that with a bit of effort spent slimming down the image, as well as following some of the other steps Microsoft suggests to improve performance, this is a good way to run PySpark locally. You can download the files used in this example from this gist.


  1. Not all types of Azure VM support nested virtualization; the list of types that do is currently here: https://docs.microsoft.com/en-us/azure/virtual-machines/acu##. I’m using a Standard_D2s_v4, which has 2 vCPUs and 8 GiB of memory. Windows versions sometimes matter for tasks like this; the version I am using is Windows 10 Professional, Version 2004, Build 19041.804. ↩︎