I’m not a frequent user of Windows, but I understand getting dependencies installed for local development can sometimes be a bit of a pain. I’m using an Azure VM¹, but these instructions should work on a regular Windows 10 installation. Since I’m not a “Windows Insider”, I followed the manual steps here to get WSL installed, then upgraded to WSL 2. The steps are reproduced here for convenience:
1. dism.exe /online /enable-feature /featurename:Microsoft-Windows-Subsystem-Linux /all /norestart
2. dism.exe /online /enable-feature /featurename:VirtualMachinePlatform /all /norestart
3. Reboot
4. Download and run the Linux kernel update package from https://wslstorestorage.blob.core.windows.net/wslblob/wsl_update_x64.msi
5. Set the default WSL version to 2: wsl --set-default-version 2
When you launch the Ubuntu “app” for the first time, it will set up the filesystem, and prompt you for a username and password.
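Once the distribution is set up, you can confirm it is running under WSL 2 from a Windows command prompt. If the VERSION column shows 1, the second command converts it (this assumes the distribution is named Ubuntu, which is the default):
wsl --list --verbose
wsl --set-version Ubuntu 2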
I installed Docker Desktop from https://www.docker.com/products/docker-desktop. This requires a logout/login after installation. I skipped the tutorial.
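To check that Docker Desktop and its WSL 2 integration are working, you can run a throwaway container from the Ubuntu terminal (this assumes WSL integration is enabled for the distribution, which recent versions of Docker Desktop do by default):
docker run --rm hello-world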
I installed Visual Studio Code from https://code.visualstudio.com. The first time I ran it, it prompted me to install the recommended extensions for WSL.
There are some tutorials that cover a range of scenarios for VS Code Remote Development at https://code.visualstudio.com/docs/remote/remote-overview, which are worth looking through.
If you don’t have an existing project to set up, create a folder to hold a new project along with its configuration. I am using ~\source\repos\sparkstuff. Next, open this folder in VS Code; this is the folder where our Python files will be stored. Inside this folder, create another folder called .devcontainer, and inside that create a file called devcontainer.json.
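The layout ends up looking like this:
~\source\repos\sparkstuff
└── .devcontainer
    └── devcontainer.json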
This file should have the following contents:
// For format details, see https://aka.ms/devcontainer.json. For config options, see the README at:
// https://github.com/microsoft/vscode-dev-containers/tree/v0.163.1/containers/debian
{
    "name": "pyspark",
    "image": "jupyter/pyspark-notebook",

    // Set *default* container specific settings.json values on container create.
    "settings": {
        "terminal.integrated.shell.linux": "/bin/bash"
    },

    // Add the IDs of extensions you want installed when the container is created.
    "extensions": [],

    // Use 'forwardPorts' to make a list of ports inside the container available locally.
    "forwardPorts": [8888, 4040],

    // Uncomment to use the Docker CLI from inside the container. See https://aka.ms/vscode-remote/samples/docker-from-docker.
    // "mounts": [ "source=/var/run/docker.sock,target=/var/run/docker.sock,type=bind" ],

    // Uncomment when using a ptrace-based debugger like C++, Go, and Rust
    // "runArgs": [ "--cap-add=SYS_PTRACE", "--security-opt", "seccomp=unconfined" ],

    // Comment out to connect as root instead. More info: https://aka.ms/vscode-remote/containers/non-root.
    "remoteUser": "jovyan"
    // this is the user defined in the jupyter container
}
For this example we’re using the jupyter/pyspark-notebook image created by the Jupyter Docker Stacks project. The focus of this example isn’t Jupyter, but this image is a convenient way to get started with PySpark in Docker. If these images contain too much or too little for your purposes, they can be used as a starting point to build your own images. The ports being exposed are 8888 for the Jupyter server, and 4040 for the Spark UI.
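If you want to sanity-check the image outside VS Code first, you can run it directly with Docker; this isn’t required for the dev container workflow, it just confirms the image pulls and starts (the Jupyter server prints a login URL containing a token):
docker run -it --rm -p 8888:8888 -p 4040:4040 jupyter/pyspark-notebook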
The green button at the bottom left of the VS Code window opens the “Remote” menu. From this menu, choose Remote-Containers: Open Folder in Container, and select the current folder. You may get a warning about poor performance of mounted folders. I concur with this warning; it does seem a bit slow, but there are a few things you can do about this should you feel the need.
Visual Studio Code has opened our folder on the Windows filesystem, and mounted it into the pyspark container. Inside the terminal window - which is attached to the container - we can run pyspark interactively:
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.1.1
      /_/
Using Python version 3.8.8 (default, Feb 20 2021 16:22:27)
Spark context Web UI available at http://c13572fdf7e7:4040
Spark context available as 'sc' (master = local[*], app id = local-1615335763786).
SparkSession available as 'spark'.
>>> from pyspark.sql.functions import *
>>> quizresults = spark.read.json('quizresults.json')
>>> winner = quizresults.orderBy(desc("Points")).first()
>>> winner
Row(Name='Roscoe P Coltrane', Points=68)
>>>
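By default, spark.read.json expects one JSON object per line (the JSON Lines format), so quizresults.json looks something like this; apart from the winning row shown above, the entries here are made up purely to illustrate the format:
{"Name": "Roscoe P Coltrane", "Points": 68}
{"Name": "Another Contestant", "Points": 42}
{"Name": "Yet Another Contestant", "Points": 35}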
We can also submit jobs with spark-submit:
jovyan@c13572fdf7e7:/workspaces/sparkstuff$ spark-submit quizresults.py
...
...
21/03/10 01:31:38 INFO DAGScheduler: Job 3 finished: showString at NativeMethodAccessorImpl.java:0, took 0.298366 s
21/03/10 01:31:38 INFO CodeGenerator: Code generated in 63.5178 ms
+-----------------+------+
| Name|Points|
+-----------------+------+
|Roscoe P Coltrane| 68|
+-----------------+------+
21/03/10 01:31:38 INFO SparkUI: Stopped Spark web UI at http://c13572fdf7e7:4040
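quizresults.py itself isn’t reproduced here (it’s included in the gist linked below), but based on the interactive session and the output above, it would be something along these lines:

from pyspark.sql import SparkSession
from pyspark.sql.functions import desc

# a sketch of quizresults.py: read the results and show the highest scorer
spark = SparkSession.builder.appName("quizresults").getOrCreate()

quizresults = spark.read.json("quizresults.json")
quizresults.orderBy(desc("Points")).limit(1).show()

spark.stop()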
This seems a bit slow, but it’s workable. I think with a bit of effort on slimming down the image, as well as some of the other steps Microsoft suggest to improve performance, this is a good way to run PySpark locally. You can download the files used in this example from this gist.
Not all types of Azure VM support nested virtualization; the list of types that do is currently here: https://docs.microsoft.com/en-us/azure/virtual-machines/acu##. I’m using a Standard_D2s_v4, which has 2 vCPUs and 8 GiB of memory. Windows versions sometimes matter for tasks like this; the version I am using is Windows 10 Professional, Version 2004, Build 19041.804. ↩︎