This lesson is in the early stages of development (Alpha version)

Running a parallel job

Overview

Teaching: 30 min
Exercises: 30 min
Questions
  • How do we execute a task in parallel?

Objectives
  • Understand how to run a parallel job on a cluster.

We now have the tools we need to run a multi-processor job. This is a very important aspect of HPC systems, as parallelism is one of the primary tools we have to improve the performance of computational tasks.

Running the Parallel Job

We will run an example that uses the Message Passing Interface (MPI) for parallelism — this is a common tool on HPC systems.

What is MPI?

The Message Passing Interface (MPI) is a standard set of tools that allows multiple parallel processes to communicate with each other. Typically, a single executable is run multiple times, possibly on different machines, and the MPI tools inform each instance of the executable how many instances there are and which instance it is. MPI also provides tools for communication and coordination between instances. Each MPI instance typically has its own copy of all local variables.

MPI jobs cannot generally be run as stand-alone executables. Instead, they should be started with the mpirun command, which ensures that the appropriate run-time support for parallelism is included.

On its own, mpirun can take many arguments specifying how many machines will participate in the process. In the context of our queuing system, however, we do not need to specify this information; mpirun obtains it from the queuing system by examining the environment variables set when the job is launched.

Our example implements a stochastic algorithm for estimating the value of π, the ratio of the circumference to the diameter of a circle. The program generates a large number of random points on a 1×1 square centered on (½,½), and checks how many of these points fall inside the unit circle. On average, π/4 of the randomly-selected points should fall in the circle, so π can be estimated from 4f, where f is the observed fraction of points that fall in the circle. Because each sample is independent, this algorithm is easily implemented in parallel.
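The sampling scheme can be sketched in a few lines of serial Python. This is an illustrative stand-alone version using only the standard library, not the provided pi.py; the function name, seed, and point count are our own choices:

```python
import random

def estimate_pi(num_points, seed=42):
    """Estimate pi by sampling random points on the unit square.

    A point (x, y) with x, y in [0, 1) lies inside the unit circle
    when x**2 + y**2 < 1; on average pi/4 of the samples do, so the
    observed fraction f gives the estimate 4 * f.
    """
    rng = random.Random(seed)
    inside = 0
    for _ in range(num_points):
        x = rng.random()
        y = rng.random()
        if x * x + y * y < 1.0:
            inside += 1
    return 4.0 * inside / num_points

print(f"pi is approximately {estimate_pi(100_000):.4f}")
```

With 100,000 points the estimate is typically correct to two decimal places; because each sample is independent, the loop can be split across as many workers as we like.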

We have provided a Python implementation, which uses MPI and NumPy, a popular library for efficient numerical operations.

Download the Python executable using the following command:

[[email protected] ~]$ wget https://learnhpc.github.io/2021-02-hpc-intro/files/pi.py

Let’s take a quick look inside the file. It is richly commented, and should explain itself clearly. Press “q” to exit the pager program (less).

[[email protected] ~]$ less pi.py

What’s pi.py doing?

One subroutine, inside_circle, does all the work. It randomly samples points with both x and y on the half-open interval [0, 1). It then computes their distances from the origin (i.e., radii), and returns those values. All of this is done using vectors of single-precision (32-bit) floating-point values.
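A minimal version of such a routine might look like the following. This is a sketch based on the description above, not the exact code in pi.py, and it assumes NumPy's `Generator.random` with a `dtype` argument:

```python
import numpy as np

def inside_circle(total_count):
    """Sample points on [0, 1) x [0, 1) and return their radii.

    All three arrays (x, y, radii) hold single-precision (32-bit)
    floating-point values, as described above.
    """
    rng = np.random.default_rng()
    x = rng.random(total_count, dtype=np.float32)
    y = rng.random(total_count, dtype=np.float32)
    radii = np.sqrt(x * x + y * y)  # distance of each point from the origin
    return radii

radii = inside_circle(1000)
count = int(np.sum(radii < 1.0))    # points that landed inside the circle
print(f"{count} of {radii.size} points fall inside the unit circle")
```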

The implicitly defined “main” function performs the overhead and accounting work required to subdivide the total number of points to be sampled and to partition the total workload among the available parallel processors. At the end, all the workers report back to “rank 0,” which prints out the result.

This relatively simple program exercises four important concepts:

  • COMM_WORLD: the default MPI Communicator, providing a channel for all the processes involved in this mpirun to exchange information with one another.
  • Scatter: A collective operation in which an array of data on one MPI rank is divided up, with separate portions being sent out to the partner ranks. Each partner rank receives data from the matching index of the host array.
  • Gather: The inverse of scatter. One rank populates a local array, with the array element at each index assigned the value provided by the corresponding partner rank — including the host’s own value.
  • Conditional Output: since every rank runs the same code, the general print statements are wrapped in a conditional so that only one rank (typically rank 0) writes output.
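To see the data movement these collectives imply, here is a plain-Python simulation of the scatter/gather pattern. No MPI is involved; the helper functions are ours, standing in for the mpi4py calls that pi.py would actually use:

```python
def scatter(data, num_ranks):
    """Split one rank's array into equal portions, one per rank."""
    chunk = len(data) // num_ranks
    return [data[i * chunk:(i + 1) * chunk] for i in range(num_ranks)]

def gather(partial_results):
    """Collect one value per rank into a single array on the host rank."""
    return list(partial_results)

num_ranks = 4
work = list(range(8))                  # data living on rank 0
portions = scatter(work, num_ranks)    # each rank receives its own slice
partials = [sum(p) for p in portions]  # every rank works on its portion
totals = gather(partials)              # rank 0 collects one result per rank

rank = 0                               # pretend we are rank 0
if rank == 0:                          # conditional output: only one rank prints
    print(f"partial sums: {totals}, grand total: {sum(totals)}")
    # prints: partial sums: [1, 5, 9, 13], grand total: 28
```

In real MPI code, `scatter` and `gather` would be single collective calls on COMM_WORLD, and each rank would see only its own portion rather than the whole list.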

In general, achieving a better estimate of π requires a greater number of points. Take a closer look at inside_circle: should we expect to get high accuracy on a single machine?

Probably not. The function allocates three arrays (x, y, and the computed radii), each of size N, the number of points belonging to this process. Using 32-bit floating-point numbers, the memory footprint of these arrays can get quite large. The default total number of points, 8,738,128, was selected to give a memory footprint of roughly 100 MB. Pushing this number to a billion yields a footprint of about 11.2 GB: if your machine has less RAM than that, it will grind to a halt. Even with 16 GB installed, you won't quite make it to 1.5 billion points.
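The footprint figures can be checked with a little arithmetic, assuming three per-point float32 arrays (x, y, and radii) at 4 bytes per value, and reading the lesson's MB and GB as the binary units MiB and GiB:

```python
def footprint_bytes(num_points, arrays=3, bytes_per_value=4):
    """Memory needed for the per-point float32 arrays, in bytes."""
    return num_points * arrays * bytes_per_value

MiB = 2 ** 20
GiB = 2 ** 30

print(f"{footprint_bytes(8_738_128) / MiB:.1f} MiB")  # default point count: 100.0 MiB
print(f"{footprint_bytes(10 ** 9) / GiB:.1f} GiB")    # one billion points: 11.2 GiB
```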

Our purpose here is to exercise the parallel workflow of the cluster, not to optimize the program to minimize its memory footprint. Rather than push our local machines to the breaking point (or, worse, the login node), let's hand the job to a cluster node with more resources.

Create a submission file, requesting more than one task on a single node:

[[email protected] ~]$ nano parallel-pi.sh
[[email protected] ~]$ cat parallel-pi.sh
#!/bin/bash
#SBATCH -J parallel-pi
#SBATCH -p cpubase_bycore_b1
#SBATCH -N 1
#SBATCH -n 4

module load Python
mpirun ./pi.py 1431652028

Then submit your job. We will use the batch file to set the options, rather than the command line.

[[email protected] ~]$ sbatch parallel-pi.sh

As before, use the status commands to check when your job runs. Use ls to locate the output file, and examine it. Is it what you expected?

Key Points

  • Parallelism is an important feature of HPC clusters.

  • MPI parallelism is a common case.

  • The queuing system facilitates executing parallel tasks.