[[_TOC_]]

## Introduction

To execute non-interactive work on Pyrene, users can request resources from the cluster compute nodes by submitting batch jobs.

The [Slurm](https://slurm.schedmd.com) scheduler manages the resource allocation for submitted batch jobs on Pyrene:

* The cluster nodes are grouped into **partitions**, depending on their hardware characteristics and the policy defined for their use.

* Partitions are accessible through **accounts**, which represent groups of users granted specific permissions.

A resource request must specify a partition, an account, and other parameters such as the maximum execution time, the number of cores and the maximum memory of the job. These Slurm parameters are written in **job scripts** and submitted to Slurm with the `sbatch` command:

```
sbatch script.sh
```

where *script.sh* is the job script.

## Job script examples

Slurm job scripts contain two parts (a minimal example is sketched below):

1. Slurm directives: lines starting with `#SBATCH` that specify Slurm options.

1. Unix commands: the commands executed by the job, such as loading modules, launching an executable program, etc.

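As an illustration, here is a minimal sketch of such a job script for the short partition; the module and program names are placeholders to adapt to your own case:

```
#!/bin/bash
#SBATCH --job-name=myjob        # job name shown by squeue
#SBATCH --partition=short       # partition (see "Partitions and limits" below)
#SBATCH --account=uppa          # account granting access to the partition
#SBATCH --time=02:00:00         # maximum execution time (hh:mm:ss)
#SBATCH --ntasks=1              # number of tasks
#SBATCH --cpus-per-task=4       # cores per task
#SBATCH --mem-per-cpu=2000      # memory per core, in MB

# Unix commands executed by the job
module load my_software         # placeholder module name
./my_program input.dat          # placeholder program and input file
```
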
Here are some self-explanatory job script examples for several software packages/applications on Pyrene:

* [Sequential job](https://git.univ-pau.fr/num-as/pyrene-cluster/-/blob/master/jobs/sseq.sh).

* [Parallel, shared memory (OpenMP type) job](https://git.univ-pau.fr/num-as/pyrene-cluster/-/blob/master/jobs/spar_shared.sh).

* [Parallel, distributed memory (MPI with OpenMPI) job](https://git.univ-pau.fr/num-as/pyrene-cluster/-/blob/master/jobs/spar_distrib_openmpi.sh).

* [Parallel, distributed memory (MPI with Intel MPI) job](https://git.univ-pau.fr/num-as/pyrene-cluster/-/blob/master/jobs/spar_distrib_intelmpi.sh).

* [Gaussian 09](https://git.univ-pau.fr/num-as/pyrene-cluster/-/blob/master/jobs/sg09.sh) / [Gaussian 16](https://git.univ-pau.fr/num-as/pyrene-cluster/-/blob/master/jobs/sg16.sh) job.

* [Molpro job](https://git.univ-pau.fr/num-as/pyrene-cluster/-/blob/master/jobs/smolpro.sh).

* [R job](https://git.univ-pau.fr/num-as/pyrene-cluster/-/blob/master/jobs/sR.sh).

* [ORCA job](https://git.univ-pau.fr/num-as/pyrene-cluster/-/blob/master/jobs/sorca.sh).

* [MATLAB job](https://git.univ-pau.fr/num-as/pyrene-cluster/-/blob/master/jobs/smatlab.sh).

## Usage of the scratch space in jobs

The `/scratch` space is accessible in jobs through the `SCRATCHDIR` environment variable.

A temporary directory, `$SCRATCHDIR=/scratch/$SLURM_JOB_ID`, is automatically created in `/scratch` when a job begins. Each job `$SLURM_JOB_ID` is therefore granted its own `$SCRATCHDIR` directory, where it can read and write its temporary execution files.

The `$SCRATCHDIR` directory is destroyed 5 days after the end of the job: **do not forget to copy the important files from `$SCRATCHDIR` to your home directory before `$SCRATCHDIR` is deleted!**

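For illustration, the Unix part of a job script could stage files through `$SCRATCHDIR` as follows; the directory, program and file names are placeholders:

```
# Copy the input files to the job's scratch directory
cp ~/mycase/input.dat $SCRATCHDIR/

# Run the program inside the scratch directory
cd $SCRATCHDIR
~/mycase/my_program input.dat > output.log

# Copy the results back to the home directory before the job ends
cp $SCRATCHDIR/output.log ~/mycase/
```
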
## Partitions and limits

| **Partition** | **Nodes** | **Account** | **Max cores<br/>per job** | **Min mem<br/>per core (MB)** | **Max mem<br/>per core (MB)** | **Time limit** | **Preempted by** | **Preemption behavior** |
| ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ |
| short | n[001-040], bigmem[01-05] | uppa | 64 | 1 | 2000 | 12 hours | - | - |
| standard | n[001-036] | uppa | 64 | 1 | 2000 | 5 days | project, short | job suspended |
| free | n[001-036] | free | 96 | 1 | 2000 | - | project, short, standard | job suspended |
| midmem | n[037-040] | midmem | 64 | 2001 | 7800 | 5 days | - | - |
| bigmem | bigmem[01-05] | bigmem | 64 | 2001 | 23000 | 5 days | - | - |
| gpu | gpu01 | gpu | 32 | - | - | - | - | - |
| bonus | n[001-040], bigmem[01-05], gpu01, visu01 | bonus | - | 1 | - | - | _any other partition_ | job requeued |
| project | n[001-040], bigmem[01-05] | _depends on project_ | 80 | 1 | - | 5 days | - | - |

### Regular partitions: short, standard

* Accessible to all users via the uppa account.

* The number of concurrently allocated cores per user, cumulated over the two partitions, is limited to 64. This means that if you have 64 cores running on short and standard (cumulated), your subsequent jobs will pend until some of your running jobs end (see the snippet below to check your current usage).

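For example, a quick way to count the cores you currently have allocated on these two partitions is sketched below; the `%C` field of `squeue` prints the CPU count of each job (replace `$USER` to check another user):

```
squeue -u $USER -t RUNNING -p short,standard -h -o "%C" | awk '{s+=$1} END {print s+0}'
```
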
### Special partitions: free, midmem, bigmem, gpu, bonus, project

* Every user is a member of the free, midmem, bigmem and bonus accounts. These accounts are needed to access the free, midmem, bigmem and bonus partitions, respectively.

* The gpu partition is only accessible to the members of the gpu account.

* The number of concurrently allocated cores per user on the free partition is limited to 96.

* The number of concurrently allocated cores per user on the midmem partition is limited to 64.

* The number of concurrently allocated cores per user on the bigmem partition is limited to 128.

* The number of concurrently allocated cores per user on the gpu and bonus partitions is not limited.

* The bonus partition is especially well suited:

  * To run a large number of jobs.

  * To concurrently use a large number of cores.

  * When the cluster occupation is low or the job does not last long (this limits the risk that the bonus job is requeued by the preemption mechanism).

  * For restartable jobs (so that the simulation does not have to start over completely if the bonus job is requeued by the preemption mechanism).

  * When having a job requeued is not a big issue.

* On the project partition:

  * Jobs related to special projects use priority hours.

  * The number of concurrently allocated cores per user is limited to 80.

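The partition, account and memory values in the table above map directly onto `#SBATCH` directives. As an example, a hypothetical request for 8 cores with 10 GB of memory per core on the bigmem partition could contain:

```
#SBATCH --partition=bigmem
#SBATCH --account=bigmem
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem-per-cpu=10000   # in MB; must stay between 2001 and 23000 on bigmem
#SBATCH --time=2-00:00:00     # 2 days; must not exceed the 5-day time limit
```
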
## Usual Slurm user commands

Submit a job:

```
sbatch script.sh
```

Cancel the *job_id* job (*job_id* is the number provided by Slurm to identify the job):

```
scancel job_id
```

Display the jobs in the waiting queue:

```
squeue
```

Display the jobs of user *username* in the waiting queue:

```
squeue -u username
```

## Job and cluster monitoring

In the output of the `squeue` command, the "ST" column provides the job state. The most common states are:

* **R**: running.

* **PD**: pending. The job is waiting for resources.

* **S**: suspended. This typically happens when the job is preempted by another job. In this case, no action is required: Slurm will resume the job when the preempting job ends.

The last column displayed by `squeue` is titled "NODELIST(REASON)":

* For running jobs, it displays the list of allocated nodes.

* For pending jobs, it displays the pending reason (see the example after this list):

  * **Resources**: the resources requested by the job are not currently available because they are used by other jobs.

  * **Priority**: the job priority is lower than the priority of other jobs.

  * **QOSMaxCpuPerUserLimit**: the maximum number of cores that *username* is allowed to allocate concurrently has been reached; the job is waiting for some of *username*'s running jobs to end.

  * **BeginTime**: the job's earliest start time has not been reached yet. This can happen when the job is requeued by Slurm to fix an issue: in this case, Slurm sets a delayed start time for the job.

  * **Held state**: the *job_id* job is held by Slurm. To release it, run `scontrol release job_id`.

  * **QOSMaxCpuPerJobLimit**: if a job specifies a memory per CPU limit that exceeds the partition limit, the job's count of CPUs per task is automatically increased, which may make the job exceed the CPU count limits. ***In this case, cancel your job and resubmit it with the correct parameters, otherwise it will pend forever***.

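For example, to check why one of your jobs is pending, you can list your pending jobs and then inspect a specific one (*username* and *job_id* are placeholders):

```
# List your pending jobs
squeue -u username -t PD

# Show the pending reason of a specific job
scontrol show job job_id | grep -i reason
```
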
The following command provides detailed information on a running or recently terminated job:

```
scontrol show job job_id
```

The `sinfo` command displays the current state of the compute nodes (see the example after the list):

* **STATE=alloc**: the node is fully allocated.

* **STATE=mix**: the node is partly allocated.

* **STATE=idle**: the node is not allocated.

* **STATE=drain**: the node does not accept new jobs, but the jobs currently allocated on the node keep running.

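For instance, to display only the nodes of a given partition (here the short partition), one can use the `-p` option of `sinfo`:

```
sinfo -p short
```
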
## Accounting

Slurm is connected to a database recording job accounting data. The `sacct` and `sreport` commands give access to this accounting information.

Show information on the *job_id* job:

```
# Short format
sacct -X -j job_id

# Long format
sacct -X -l -j job_id
```

Display jobs starting and ending between January 1, 2019 and January 1, 2020 on the bigmem01 node:

```
sacct -X --nodelist=bigmem01 --starttime=2019-01-01 --endtime=2020-01-01
```

Display the number of compute hours used by *username* between June 1, 2019 and January 1, 2020, for each account:

```
sreport -t hours user TopUsage Start=2019-06-01 End=2020-01-01 Users=username
```

Display the total number of compute hours used by *username* between June 1, 2019 and January 1, 2020 (all accounts combined):

```
sreport -t hours user TopUsage Group Start=2019-06-01 End=2020-01-01 Users=username
```

Display the real resources used by the terminated *job_id* job:

```
sacct -j job_id --format=JobID,State,TRESUsageInMax%100 | grep batch
```

The output of this command will look like this:

_job_id.batch COMPLETED **cpu=10:00:28**,energy=0,fs/disk=234247,**mem=13353272K**,pages=0,vmem=185824K_

In the above example, the cumulated CPU time used by the *job_id* job is about 10 hours, and the maximum total memory used during the execution is about 13 GB.

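These measured values can help size subsequent requests. For instance, if a 16-core job peaked at about 13 GB of total memory, the next submission could keep some headroom with roughly 1000 MB per core (hypothetical values, to be adapted to your own jobs and to the partition limits):

```
#SBATCH --ntasks=16
#SBATCH --mem-per-cpu=1000   # 16 x 1000 MB = 16 GB in total, above the 13 GB peak
```
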
Display the real resources used by the terminated jobs of *username* starting and ending between February 1, 2021 and March 1, 2021 on the bigmem partition:

```
sacct -u username --partition=bigmem --starttime=2021-02-01 --endtime=2021-03-01 --format=JobID,State,TRESUsageInMax%100 | grep batch
```