All jobs on the general purpose cluster request resources via SLURM. SLURM is open-source software that allocates resources to users for their computations, provides a framework for starting, executing, and monitoring compute jobs, and arbitrates contention for resources by managing a queue of pending work. SLURM is widely used in the high-performance computing (HPC) landscape, and you are likely to encounter it outside of our systems. For more information, please see https://slurm.schedmd.com/
General Purpose Computing
All jobs on the general purpose cluster are submitted using the SLURM scheduler. For more information, please read the Frequently Asked Questions below. Jobs can be submitted from the following headnodes:
Or from the large memory machine:
All users have access to the "batch" partition for general purpose computing.
Frequently asked questions
SLURM documentation can be found at the SLURM website (https://slurm.schedmd.com). Below are answers to frequently asked questions that demonstrate several useful SLURM commands.
How can I view the current status and available resources of the batch nodes?
sinfo is commonly used to view the status of a given cluster or node, or how many resources are available to schedule.
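For example (these are standard sinfo options):

sinfo -p batch    # summary of node states in the batch partition
sinfo -N -l       # node-oriented, long-format listing of every node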
How can I view jobs currently running or waiting in the queue?
squeue will show jobs currently waiting in the queue or running, for all partitions that you have access to.
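For example (standard squeue options):

squeue            # all jobs in the partitions you have access to
squeue -u [netid] # only your own jobs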
At the time this command was run, there were 7 jobs running or waiting in the queue. JOBID 140574 is waiting in the queue due to inadequate available resources, while the other jobs have been running for a few days.
How can I view the resources requested for an active job?
scontrol show job [jobid] will generate a report with information about how a job was scheduled.
Note that once a job has completed, this report can no longer be generated via scontrol. See "How do I view the resources used by a completed job?" below for accessing similar information after completion.
Here, the job requested 32 CPUs on one node, with 87.5GB of memory, at 2019-02-13T07:48:25, with a constraint of Features=avx2. The CPU allocation appears in the report as:

NumNodes=1 NumCPUs=32 NumTasks=1 CPUs/Task=32 ReqB:S:C:T=0:0:*:*
What are the maximum resources I can request?
The batch partition has some important restrictions: a job can request at most 3 nodes, and it will run for at most 14 days before being automatically terminated. If you need an exception to these limits, please contact askIT@albany.edu
How can I request access to more nodes, or a longer time limit?
On a case-by-case basis, ARCC will grant users temporary access beyond the default job limits. Please contact askIT@albany.edu if you would like to request access to more nodes or a longer time limit.
How do I schedule a non-interactive job?
There are many ways to schedule jobs via SLURM. For non-interactive jobs, we recommend using sbatch with a shell script that runs your program; the script uses #SBATCH directives to allocate the resources it requires. Below is an example workflow for submitting a Python script to the batch partition via sbatch.
First, ssh into head.arcc.albany.edu. On Windows, you can use an SSH client such as PuTTY; on macOS, simply use the Terminal. Replace [netid] below with your username and type your password at the prompt. You will not see your password as you type, but it is being read.
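For example:

ssh [netid]@head.arcc.albany.edu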
Next, change directories to /network/rit/misc/software/examples/slurm/
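That is:

cd /network/rit/misc/software/examples/slurm/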
/network/rit/misc/software/examples/slurm/run.sh contains #SBATCH directives that request the appropriate amount of resources for our Python code and then execute it. The key directives are explained below, followed by a sketch of the script.
--cpus-per-task=4 tells SLURM how many cores we want to allocate on one node
--mem-per-cpu=100 tells SLURM how much memory to allocate per core (see also --mem)
In total, we are requesting 4 cores and 400MB of memory for this simple Python code.
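A minimal sketch of what run.sh can look like (the job name, output filename, and Python script name here are illustrative placeholders; the actual script in the examples directory may differ):

#!/bin/bash
#SBATCH --job-name=example-slurm       # illustrative job name
#SBATCH --partition=batch              # the general purpose partition
#SBATCH --cpus-per-task=4              # 4 cores on one node
#SBATCH --mem-per-cpu=100              # 100MB per core, 400MB total
#SBATCH --output=example-slurm-%j.out  # %j expands to the job ID

python example.py                      # example.py is a placeholder name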
To submit the job, we simply run sbatch run.sh. Keep note of the job ID that is output to the terminal; it will be different from what is shown below.
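sbatch prints a confirmation containing the job ID, for example:

sbatch run.sh
Submitted batch job [jobid]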
Note that you can use squeue to view the job status
The job will output a file to your home directory called ~/example-slurm-[jobid].out. We will view it using the "more" command. You should see output similar to below.
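For example:

more ~/example-slurm-[jobid].out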
- Congratulations, you just ran your first job on the cluster!
How do I schedule an interactive job?
To spawn a terminal session on a cluster node, with X11 forwarding, run:
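A sketch of such a command, assuming your SLURM installation supports the native --x11 flag:

srun --partition=batch --time=01:00:00 --cpus-per-task=4 --mem=400 --x11 --pty bash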
This will spawn a one-hour (01:00:00) session with 4 CPUs and 400MB of RAM. To spawn the same terminal without X11 forwarding:
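srun --partition=batch --time=01:00:00 --cpus-per-task=4 --mem=400 --pty bash   # same sketch as above, minus --x11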
How do I view the resources used by a completed job?
sacct is useful for viewing accounting information about completed jobs. Read the SLURM documentation for the full list of output fields.
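For example, a query along these lines (all standard sacct format fields) reports where a job ran, its memory high-water mark, and its run time:

sacct -j [jobid] --format=JobID,JobName,NodeList,MaxRSS,ReqMem,Elapsed,TotalCPU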
This job ran on rhea-09, and its maximum memory usage was ~52 GB. Note that I requested 60000MB, so I could refine this job to request slightly less memory. It ran for 14:50:14 and used about 350 CPU hours.
Can I restrict my job to a certain CPU architecture?
Yes! Use the --constraint flag with #SBATCH. To view the available architectures on individual nodes, use scontrol show node.
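For example, to restrict a job to nodes with the avx2 feature (seen in the scontrol report earlier), and to list each node's feature tags:

#SBATCH --constraint=avx2
scontrol show node | grep -i features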
How do I run jobs on the InfiniBand nodes?
You need to add the directive --constraint=mpi_ib
How can I allocate GPU resources?
You can request access to the GPUs on --partition=batch or --partition=ceashpc by adding the following flag:
--gres=gpu:1 # For half of the K80
--gres=gpu:2 # For the full K80
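For example, a minimal sketch of a GPU batch job using the flags above (the partition choice is yours; the directives follow what is already shown):

#SBATCH --partition=batch
#SBATCH --gres=gpu:1    # request half of the K80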
How can I run jupyter notebook on the cluster?
There are two ways to spawn jupyter notebooks on the cluster:
- https://jupyterlab.arcc.albany.edu ; please see How-to: Using Jupyterhub for more information
If you need more resources, or longer than the eight-hour time limit, you can run jupyter notebook interactively:
First, ssh into head.arcc.albany.edu and configure a notebook password; enter a password at the prompt (note that you will not see your password as you type, but it is being registered).
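The password setup is presumably the standard Jupyter command (assuming jupyter, e.g. ARCC's anaconda copy, is on your PATH):

jupyter notebook password   # prompts for a password and stores a hashed copy under ~/.jupyter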
Next, you can either run jupyter notebook interactively with srun, or submit the process via the sbatch script located at /network/rit/misc/software/examples/slurm/spawn_jhub.sh (see below).
Spawning jupyter notebook interactively using ARCC's anaconda (you may change the path to your own conda distribution):
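A sketch of such a command (the resource values are illustrative, and the original example invokes jupyter by its full anaconda path; the port matches the example below):

srun --partition=batch --time=08:00:00 --cpus-per-task=4 --mem=4000 jupyter notebook --no-browser --ip=0.0.0.0 --port=8889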
You should see Jupyter output related to launching the server. Once startup is complete, the output will report the node and port where the notebook is running.
Open up a web browser and navigate to the suggested location (in the example, uagc19-02.rit.albany.edu:8889), enter the configured password at the prompt, and you are all set!
- Spawning jupyter notebook via sbatch using ARCC's anaconda (you may change the path to your own conda distribution):
ssh into head.arcc.albany.edu, copy the script below to your home directory, and submit it with sbatch.
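For example:

cp /network/rit/misc/software/examples/slurm/spawn_jhub.sh ~/
sbatch ~/spawn_jhub.sh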
Note that you will want to edit the script to request the amount of resources that you need.
This script will create an output file called jupyter.[jobid].log. Open this file, replacing [jobid] with the job ID you were given (you can find it with squeue), and you will see the Jupyter startup log.
Open up a web browser and point it to the URL noted in the second-to-last line of the log (in the example above, http://uagc12-02.arcc.albany.edu:8888), enter your password, and you are all set!