High-performance computing and how to use it
This document introduces the high-performance computing resources available to researchers in Oxford (or people working on collaborative projects), particularly in the School of Geography and the Environment.
There’s a risk that this will go out of date - please see the SoGE intranet pages for more recent information.
- Why would I run code on a computer that’s not my laptop/desktop?
- What resources are available? How do I use them?
Why?
Reasons to consider using a server, a cluster or HPC:
- laptop fan is getting tired
- long-running process that stops you shutting down/unplugging
- too much data to handle, running out of disk space
- not enough memory
- way more model-runs/iterations/replicates than would be feasible to run before the end of the project
Essentially - wanting to run more or bigger or longer computations than are possible on a single, normal-sized machine.
What?
All of these things - clusters, HPC, the cloud - are essentially more or bigger computers, running somewhere else.
School of Geography and the Environment (SoGE) Linux cluster
The SoGE Linux cluster has (this may be out of date, so treat it as a lower bound):
- plenty of terabytes of storage
- Over 30 nodes with >12 cores each
- some reasonably high-memory nodes (>100GB RAM)
- some GPU nodes
Contact itsupport to request an account. You might want to be added to a project group to access a shared directory.
VPN - Virtual Private Network
You need to be connected to the VPN before you can log in to the server from outside of the department.
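The details depend on your operating system and the current IT Services instructions, but as a rough sketch, on Linux you might connect from the command line with the open-source openconnect client. The endpoint and username here are assumptions - check the official VPN guidance for the real settings:
# hypothetical example - connect to the central Oxford VPN service
# (endpoint and username are assumptions; see the IT Services VPN pages)
sudo openconnect --user=abcd1234 vpn.ox.ac.uk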
SSH - Secure Shell
The usual way to connect to and run programs on high-performance compute clusters is on the command line. This applies to both the SoGE and ARC clusters.
On Linux or Mac, open a Terminal:
$ echo "Hello"
> Hello
$ pwd
> /home/tom/some_directory
$ ls
> some_file.txt data.csv image.png ...
On Windows, the same commands would work in the Windows Subsystem for Linux (WSL) or Git Bash, or would be a little different in PowerShell or cmd.exe. Consider installing WSL and/or Git Bash - they let you learn the Linux command line on your local computer and mean you don’t need to learn and remember two different ways of working.
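As a rough sketch, on a recent version of Windows you can usually install WSL with a single command from an administrator PowerShell (this assumes the default Ubuntu distribution is fine for you):
# run in an administrator PowerShell; installs WSL and the default distribution
wsl --install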
On Linux or Mac (or in WSL or Git Bash), you should be able to use the ssh command to connect to a remote computer - remember to use your Oxford username:
ssh abcd1234@linux.ouce.ox.ac.uk
> prompt for password
> once you're logged in, try running "pwd" and "ls" to see where you are
On Windows, without installing a different command line, you can use the program PuTTY to connect - click through the forms to set up a new SSH connection.
One quirk when connecting to the OUCE server is that it randomly assigns you a node on the cluster to log in to (linux1, linux2, … linux32), and sometimes that node is overloaded so the login fails. Try again when that happens, or pick a specific node:
ssh abcd1234@linux3.ouce.ox.ac.uk
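If you connect regularly, an entry in your ~/.ssh/config file saves typing the full hostname and username every time. A minimal sketch, reusing the example hostname and username above (the alias soge is made up):
# ~/.ssh/config
Host soge
    HostName linux3.ouce.ox.ac.uk
    User abcd1234
After that, ssh soge should log you in, and the same alias works with the scp and rsync commands below.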
Moving files
scp - copy files to the cluster
Copy a single file to your home directory (~ for short) on the cluster:
scp data.csv abcd1234@linux.ouce.ox.ac.uk:~/
Copy a folder to the cluster, spelling out the full path:
scp -r ./data abcd1234@linux.ouce.ox.ac.uk:/ouce/staff/home/abcd1234/project/
Copy from the cluster to your local computer:
scp -r abcd1234@linux.ouce.ox.ac.uk:/ouce/staff/home/abcd1234/project/results ./
rsync - sync a directory with the cluster
Similar to scp, but can keep directories in sync and avoid sending files that already exist:
rsync -azP ./data abcd1234@linux.ouce.ox.ac.uk:/ouce/staff/home/abcd1234/project/
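If you’re not sure what rsync is about to transfer, a dry run lists what would be copied without sending anything - a quick sketch reusing the example paths above:
# list what would be transferred, without actually copying anything
rsync -azP --dry-run ./data abcd1234@linux.ouce.ox.ac.uk:/ouce/staff/home/abcd1234/project/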
Filezilla - file transfer
Filezilla is a graphical program for file transfer - set up a connection, then browse through local and remote folders.
Long jobs
screen - leave a job running and log out
- log in to a node
- remember which node number you’re on (e.g. linux24)
- type screen to start a screen session
- run your command
- type CTRL+A, D to detach from screen
- log out
- log back in to the specific node number (e.g. linux24.ouce.ox.ac.uk)
- type screen -r to reattach
- see how your process is doing
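Put together, a session might look something like the sketch below - long_model_run.sh is just a stand-in for whatever you actually want to run:
ssh abcd1234@linux24.ouce.ox.ac.uk    # log in and note the node (linux24 here)
screen                                # start a screen session
./long_model_run.sh                   # start the long-running command
# press CTRL+A, then D to detach, then "exit" to log out
ssh abcd1234@linux24.ouce.ox.ac.uk    # later: log back in to the same node
screen -r                             # reattach and check progress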
Many jobs
GNU parallel lets you run the same command lots of times in parallel, without necessarily making changes to the script or command-line tool you want to run. parallel has a book-style tutorial, documentation and videos which give a good introduction.
A few quick examples follow:
Count the lines in all the CSV files anywhere in the current directory:
find . -name '*.csv' | parallel wc -l
Use an argument list from a file (each line from list.txt gets passed to the command separately, replacing the {}):
cat list.txt | parallel echo {}
Use an argument list from a CSV (each row gets passed to the command separately, and each column value is available as a numbered argument).
For example, args.csv might contain:
a,b
1,2
3,4
Then parallel can read the file and pick up column names from the header:
parallel --header : -a args.csv --colsep ',' echo "a={a} b={b}"
One important option for parallel is -j - to limit the number of jobs that run in parallel. E.g.:
seq 1 8 | parallel -j4 'sleep 1 && echo {}'
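Another common pattern is running your own script once per parameter value, keeping a record of what ran. This is a sketch only - my_model.py and the seed values are made up, but ::: (arguments given on the command line) and --joblog are standard parallel options:
# run my_model.py once per seed, 4 jobs at a time, recording run times and exit codes
parallel -j4 --joblog runs.log python my_model.py --seed {} ::: 1 2 3 4 5 6 7 8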
Checking the cluster
Some handy commands to give a slightly clunky overview of what’s running and the resources available:
# check how many copies of my_program are running on each node
for i in {1..32}; do printf "linux$i "; ssh linux$i.ouce.ox.ac.uk ps -e | grep my_program | wc -l; done
# check number of cpus on each node
for i in {1..32}; do printf "linux$i "; ssh linux$i.ouce.ox.ac.uk nproc --all; done
# check memory usage on each node
for i in {1..32}; do printf "\nlinux$i\n"; ssh linux$i.ouce.ox.ac.uk free -g; done
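In the same spirit, checking the load average on each node can help you pick a quieter one (this assumes the same node naming as above):
# check load average and uptime on each node
for i in {1..32}; do printf "linux$i "; ssh linux$i.ouce.ox.ac.uk uptime; done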
Oxford Advanced Research Computing (ARC) clusters
The University of Oxford has some high-performance computing which is managed centrally by the ARC service, with two main clusters.
htc is optimised for single-core jobs and running serial applications many times - like a parameter sweep, sensitivity analysis or big scenario analysis.
arc is optimised for large parallel jobs spanning multiple nodes - like a climate model run or other gridded, massively parallel computation.
More information is available in the ARC documentation.
Scheduler
The ARC clusters run jobs using a scheduler (SLURM). Instead of running things directly or through screen, you write a short configuration script to tell the cluster what you want to run, and submit it to the job scheduler. Your jobs sit in a queue until there’s space to run them.
Here’s a simple example; there are plenty more details in the ARC documentation.
Save the following in a script test_job.sh:
#!/bin/bash
# set the number of nodes
#SBATCH --nodes=1
# set max wallclock time
#SBATCH --time=00:00:10
# set name of job
#SBATCH --job-name=test123
# mail alert at start, end and failure of execution
#SBATCH --mail-type=ALL
# send mail to this address
#SBATCH --mail-user=user.name@ouce.ox.ac.uk
# run the application
echo "Hello"
Then submit the job:
sbatch test_job.sh
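Once it’s submitted, you can keep an eye on the job with the standard SLURM commands - a brief sketch (the job ID 123456 is made up):
# list your queued and running jobs
squeue -u abcd1234
# cancel a job if you need to
scancel 123456
By default, anything the job prints goes to a file named slurm-<jobid>.out in the directory you submitted from.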
External facilities
For more storage or compute, access to specialist data or computational support, or to support broader collaborations, there are national facilities, with further international links through these.
- DAFNI (Data and Analytics Facility for National Infrastructure) for infrastructure systems data and modelling, mainly EPSRC-remit research
- JASMIN for environmental data and modelling, mainly NERC-remit research
- Tier-2 HPC facilities for access to various newer architectures (multicore/GPU/ARM…)
- ARCHER (Advanced Research Computing High End Resource) national supercomputer, “Tier-1”, for massive parallel workloads