... | @@ -7,7 +7,7 @@ The [Slurm](https://slurm.schedmd.com) scheduler manages the resource allocation |
... | @@ -7,7 +7,7 @@ The [Slurm](https://slurm.schedmd.com) scheduler manages the resource allocation |
|
* The cluster nodes are put in **partitions**, depending on their hardware characteristics and the policy defined for their use.
|
|
* The cluster nodes are put in **partitions**, depending on their hardware characteristics and the policy defined for their use.
|
|
* Partitions are accessible by **accounts**, which represent groups of users granted certain permissions.
|
|
* Partitions are accessible by **accounts**, which represent groups of users granted certain permissions.
|
|
|
|
|
|
A resource allocation must specify a partition, an account, and other parameters such as the maximum execution time, the number of cores, and the maximum memory used by the job. These Slurm parameters are written in **job scripts** and submitted to Slurm with the following command:
|
|
A resource allocation must specify a partition, an account, and other parameters such as the maximum execution time, the number of cores, and the maximum memory used by the job. These Slurm parameters are written in **job scripts** and submitted to Slurm with the `sbatch` command:
|
|
```
|
|
```
|
|
sbatch script.sh
|
|
sbatch script.sh
|
|
```
|
|
```
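
As an illustration of the parameters listed above, here is a minimal sketch of what `script.sh` might contain. The partition name `my_partition` and the account name `my_account` are placeholders, not names taken from this cluster, and the time, core, and memory values are arbitrary.

```shell
#!/bin/bash
# Placeholder partition name (an assumption, replace with a real partition)
#SBATCH --partition=my_partition
# Placeholder account name (an assumption, replace with a real account)
#SBATCH --account=my_account
# Maximum execution time (HH:MM:SS)
#SBATCH --time=01:00:00
# Number of cores for the task
#SBATCH --cpus-per-task=4
# Maximum memory for the job
#SBATCH --mem=8G

# Slurm sets SLURM_CPUS_PER_TASK inside the job; default to 1 elsewhere.
cores="${SLURM_CPUS_PER_TASK:-1}"
echo "Running on $(uname -n) with ${cores} core(s)"
```

The `#SBATCH` lines are ordinary comments to the shell, so the same file also runs outside Slurm, which can help when checking the script body before submission.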
|
... | @@ -67,12 +67,12 @@ squeue -u username |
... | @@ -67,12 +67,12 @@ squeue -u username |
|
```
|
|
```
|
|
|
|
|
|
## Job and cluster monitoring
|
|
## Job and cluster monitoring
|
|
In the output of the *squeue* command, the "ST" column provides the state of the job. The most common states are:
|
|
In the output of the `squeue` command, the "ST" column provides the state of the job. The most common states are:
|
|
* **R**: running.
|
|
* **R**: running.
|
|
* **PD**: pending. The job is waiting for resources.
|
|
* **PD**: pending. The job is waiting for resources.
|
|
* **S**: suspended. This typically happens when the job is preempted by another job. In this case, no action is required. Slurm will resume the job when the preemptor job ends.
|
|
* **S**: suspended. This typically happens when the job is preempted by another job. In this case, no action is required. Slurm will resume the job when the preemptor job ends.
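
To get a quick summary of these states, the state codes can be counted with standard tools. The sketch below is only an illustration: `count_states` is a hypothetical helper name, and the sample input is made up so that the snippet runs without a cluster.

```shell
# Count jobs per state code read from standard input (one code per line).
count_states() {
    sort | uniq -c | awk '{ print $2 "=" $1 }'
}

# On a cluster, the input would come from squeue, for example:
#   squeue -h -u username -o "%t" | count_states
# Here, a made-up sample stands in for the squeue output.
printf 'R\nPD\nR\nS\nPD\nR\n' | count_states
```

With `squeue`, the `-h` option suppresses the header and `-o "%t"` prints only the compact state code, so each output line is one job.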
|
|
|
|
|
|
The title of the last column displayed by *squeue* is "NODELIST(REASON)":
|
|
The title of the last column displayed by `squeue` is "NODELIST(REASON)":
|
|
* For running jobs, displays the list of allocated nodes.
|
|
* For running jobs, displays the list of allocated nodes.
|
|
* For pending jobs, displays the pending reason:
|
|
* For pending jobs, displays the pending reason:
|
|
* **Resources**: the resources requested by the job are not currently available because they are in use by other jobs.
|
|
* **Resources**: the resources requested by the job are not currently available because they are in use by other jobs.
|
... | @@ -95,7 +95,7 @@ The `sinfo` command displays the current state of compute nodes: |
... | @@ -95,7 +95,7 @@ The `sinfo` command displays the current state of compute nodes: |
|
## Accounting
|
|
## Accounting
|
|
Slurm is connected to a database that records job accounting data. The `sacct` and `sreport` commands give access to this accounting information.
|
|
Slurm is connected to a database that records job accounting data. The `sacct` and `sreport` commands give access to this accounting information.
|
|
|
|
|
|
Show information on the job *job_id*:
|
|
Show information on the *job_id* job:
|
|
```
|
|
```
|
|
# Short format
|
|
# Short format
|
|
sacct -j job_id
|
|
sacct -j job_id
|
... | @@ -103,17 +103,17 @@ sacct -j job_id |
... | @@ -103,17 +103,17 @@ sacct -j job_id |
|
sacct -l -j job_id
|
|
sacct -l -j job_id
|
|
```
|
|
```
|
|
|
|
|
|
Display jobs starting and ending between January 1, 2019 and January 1, 2020 ont the bigmem01 node
|
|
Display jobs starting and ending between January 1, 2019 and January 1, 2020 on the bigmem01 node:
|
|
```
|
|
```
|
|
sacct --nodelist=bigmem01 --starttime=2019-01-01 --endtime=2020-01-01
|
|
sacct --nodelist=bigmem01 --starttime=2019-01-01 --endtime=2020-01-01
|
|
```
|
|
```
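
When the same query is repeated for several nodes or years, the date range can be generated rather than typed by hand. The following is a small sketch under that assumption; `sacct_year_cmd` is a hypothetical helper, and it only prints the command so that it can be inspected before being run.

```shell
# Print (rather than run) the sacct command covering one calendar year on a node.
sacct_year_cmd() {
    node="$1"
    year="$2"
    echo "sacct --nodelist=${node} --starttime=${year}-01-01 --endtime=$((year + 1))-01-01"
}

# Reproduces the command shown above for bigmem01 and 2019.
sacct_year_cmd bigmem01 2019
```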
|
|
|
|
|
|
Display the number of hours computed by a user *username* between June 1, 2019 and January 1, 2020 on each account.
|
|
Display the number of hours computed by *username* between June 1, 2019 and January 1, 2020 on each account:
|
|
```
|
|
```
|
|
sreport -t hours user TopUsage Start=2019-06-01 End=2020-01-01 Users=username
|
|
sreport -t hours user TopUsage Start=2019-06-01 End=2020-01-01 Users=username
|
|
```
|
|
```
|
|
|
|
|
|
Display the global number of hours computed by a user *username* between June 1, 2019 and January 1, 2020.
|
|
Display the global number of hours computed by *username* between June 1, 2019 and January 1, 2020:
|
|
```
|
|
```
|
|
sreport -t hours user TopUsage Group Start=2019-06-01 End=2020-01-01 Users=username
|
|
sreport -t hours user TopUsage Group Start=2019-06-01 End=2020-01-01 Users=username
|
|
```
|
```