[[_TOC_]]

## Introduction

To execute non-interactive work on Pyrene, users can request resources from the cluster compute nodes by submitting batch jobs.

The [Slurm](https://slurm.schedmd.com) scheduler manages the resource allocation for submitted batch jobs on Pyrene:

* The cluster nodes are grouped into **partitions**, depending on their hardware characteristics and the policy defined for their use.

* Partitions are accessible through **accounts**, which represent groups of users granted specific permissions.

A resource request must specify a partition, an account, and other parameters such as the maximum execution time, the number of cores and the maximum memory of the job. These Slurm parameters are written in **job scripts** and submitted to Slurm with the `sbatch` command:

```
sbatch script.sh
```

where *script.sh* is the job script.

## Job script examples

Slurm job scripts contain two parts (a minimal example is sketched below):

1. Slurm directives: lines starting with `#SBATCH` that specify Slurm options.

1. Unix commands: the commands executed by the job, such as loading modules, launching an executable program, etc.

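As an illustration, here is a minimal sketch of such a job script for the short partition; the module and program names are placeholders to adapt to your own case:

```
#!/bin/bash
#SBATCH --job-name=myjob        # job name shown by squeue
#SBATCH --partition=short       # partition (see "Partitions and limits" below)
#SBATCH --account=uppa          # account granting access to the partition
#SBATCH --time=02:00:00         # maximum execution time (hh:mm:ss)
#SBATCH --ntasks=1              # number of tasks
#SBATCH --cpus-per-task=4       # cores per task
#SBATCH --mem-per-cpu=2000      # memory per core, in MB

# Unix commands executed by the job
module load my_software         # placeholder module name
./my_program input.dat          # placeholder program and input file
```
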
Here are some self-explanatory job script examples for several software packages/applications on Pyrene:

* [Sequential job](https://git.univ-pau.fr/num-as/pyrene-cluster/-/blob/master/jobs/sseq.sh).

* [Parallel, shared memory (OpenMP type) job](https://git.univ-pau.fr/num-as/pyrene-cluster/-/blob/master/jobs/spar_shared.sh).

* [Parallel, distributed memory (MPI with OpenMPI) job](https://git.univ-pau.fr/num-as/pyrene-cluster/-/blob/master/jobs/spar_distrib_openmpi.sh).

* [Parallel, distributed memory (MPI with Intel MPI) job](https://git.univ-pau.fr/num-as/pyrene-cluster/-/blob/master/jobs/spar_distrib_intelmpi.sh).

* [Gaussian 09](https://git.univ-pau.fr/num-as/pyrene-cluster/-/blob/master/jobs/sg09.sh) / [Gaussian 16](https://git.univ-pau.fr/num-as/pyrene-cluster/-/blob/master/jobs/sg16.sh) job.

* [Molpro job](https://git.univ-pau.fr/num-as/pyrene-cluster/-/blob/master/jobs/smolpro.sh).

* [R job](https://git.univ-pau.fr/num-as/pyrene-cluster/-/blob/master/jobs/sR.sh).

* [ORCA job](https://git.univ-pau.fr/num-as/pyrene-cluster/-/blob/master/jobs/sorca.sh).

* [MATLAB job](https://git.univ-pau.fr/num-as/pyrene-cluster/-/blob/master/jobs/smatlab.sh).

## Usage of the scratch space in jobs

The `/scratch` space is accessible in jobs through the `SCRATCHDIR` environment variable.

A temporary directory, `$SCRATCHDIR=/scratch/$SLURM_JOB_ID`, is automatically created in `/scratch` when a job begins. Each job `$SLURM_JOB_ID` is therefore granted its own `$SCRATCHDIR` directory, where it can read and write its temporary execution files.

The `$SCRATCHDIR` directory is destroyed 5 days after the end of the job: **do not forget to copy the important files from `$SCRATCHDIR` to your home directory before `$SCRATCHDIR` is deleted!**

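For illustration, the Unix part of a job script could stage files through `$SCRATCHDIR` as follows; the directory, program and file names are placeholders:

```
# Copy the input files to the job's scratch directory
cp ~/mycase/input.dat $SCRATCHDIR/

# Run the program inside the scratch directory
cd $SCRATCHDIR
~/mycase/my_program input.dat > output.log

# Copy the results back to the home directory before the job ends
cp $SCRATCHDIR/output.log ~/mycase/
```
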
## Partitions and limits

| **Partition** | **Nodes** | **Account** | **Max cores<br/>per job** | **Min mem<br/>per core (MB)** | **Max mem<br/>per core (MB)** | **Time limit** | **Preempted by** | **Preemption behavior** |
| ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ | ------ |
| short | n[001-040], bigmem[01-05] | uppa | 64 | 1 | 2000 | 12 hours | - | - |
| standard | n[001-036] | uppa | 64 | 1 | 2000 | 5 days | project, short | job suspended |
| free | n[001-036] | free | 96 | 1 | 2000 | - | project, short, standard | job suspended |
| midmem | n[037-040] | midmem | 64 | 2001 | 7800 | 5 days | - | - |
| bigmem | bigmem[01-05] | bigmem | 64 | 2001 | 23000 | 5 days | - | - |
| gpu | gpu01 | gpu | 32 | - | - | - | - | - |
| bonus | n[001-040], bigmem[01-05], gpu01, visu01 | bonus | - | 1 | - | - | _any other partition_ | job requeued |
| project | n[001-040], bigmem[01-05] | _depends on project_ | 80 | 1 | - | 5 days | - | - |

### Regular partitions: short, standard

* Accessible to all users via the uppa account.

* The number of concurrently allocated cores per user, cumulated over the two partitions, is limited to 64. This means that if you have 64 cores running on short and standard (cumulated), your subsequent jobs will pend until some of your running jobs end (see the snippet below to check your current usage).

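For example, a quick way to count the cores you currently have allocated on these two partitions is sketched below; the `%C` field of `squeue` prints the CPU count of each job (replace `$USER` to check another user):

```
squeue -u $USER -t RUNNING -p short,standard -h -o "%C" | awk '{s+=$1} END {print s+0}'
```
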
### Special partitions: free, midmem, bigmem, gpu, bonus, project

* Every user is a member of the free, midmem, bigmem and bonus accounts. These accounts are needed to access the free, midmem, bigmem and bonus partitions, respectively.

* The gpu partition is only accessible to the members of the gpu account.

* The number of concurrently allocated cores per user on the free partition is limited to 96.

* The number of concurrently allocated cores per user on the midmem partition is limited to 64.

* The number of concurrently allocated cores per user on the bigmem partition is limited to 128.

* The number of concurrently allocated cores per user on the gpu and bonus partitions is not limited.

* The bonus partition is especially well suited:

  * To run a large number of jobs.

  * To concurrently use a large number of cores.

  * When the cluster occupation is low or the job does not last long (this limits the risk that the bonus job is requeued by the preemption mechanism).

  * For restartable jobs (so that the simulation does not have to start over completely if the bonus job is requeued by the preemption mechanism).

  * When having a job requeued is not a big issue.

* On the project partition:

  * Jobs related to special projects use priority hours.

  * The number of concurrently allocated cores per user is limited to 80.

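The partition, account and memory values in the table above map directly onto `#SBATCH` directives. As an example, a hypothetical request for 8 cores with 10 GB of memory per core on the bigmem partition could contain:

```
#SBATCH --partition=bigmem
#SBATCH --account=bigmem
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem-per-cpu=10000   # in MB; must stay between 2001 and 23000 on bigmem
#SBATCH --time=2-00:00:00     # 2 days; must not exceed the 5-day time limit
```
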
## Usual Slurm user commands

Submit a job:

```
sbatch script.sh
```

Cancel the *job_id* job (*job_id* is the number provided by Slurm to identify the job):

```
scancel job_id
```

Display the jobs in the waiting queue:

```
squeue
```

Display the jobs of user *username* in the waiting queue:

```
squeue -u username
```

## Job and cluster monitoring

In the output of the `squeue` command, the "ST" column provides the job state. The most common states are:

* **R**: running.

* **PD**: pending. The job is waiting for resources.

* **S**: suspended. This typically happens when the job is preempted by another job. In this case, no action is required: Slurm will resume the job when the preempting job ends.

The last column displayed by `squeue` is titled "NODELIST(REASON)":

* For running jobs, it displays the list of allocated nodes.

* For pending jobs, it displays the pending reason (see the example after this list):

  * **Resources**: the resources requested by the job are not currently available because they are used by other jobs.

  * **Priority**: the job priority is lower than the priority of other jobs.

  * **QOSMaxCpuPerUserLimit**: the maximum number of cores that *username* is allowed to allocate concurrently has been reached; the job is waiting for some of *username*'s running jobs to end.

  * **BeginTime**: the job's earliest start time has not been reached yet. This can happen when the job is requeued by Slurm to fix an issue: in this case, Slurm sets a delayed start time for the job.

  * **Held state**: the *job_id* job is held by Slurm. To release it, run `scontrol release job_id`.

  * **QOSMaxCpuPerJobLimit**: if a job specifies a memory per CPU limit that exceeds the partition limit, the job's count of CPUs per task is automatically increased, which may make the job exceed the CPU count limits. ***In this case, cancel your job and resubmit it with the correct parameters, otherwise it will pend forever***.

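For example, to check why one of your jobs is pending, you can list your pending jobs and then inspect a specific one (*username* and *job_id* are placeholders):

```
# List your pending jobs
squeue -u username -t PD

# Show the pending reason of a specific job
scontrol show job job_id | grep -i reason
```
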
The following command provides detailed information on a running or recently terminated job:

```
scontrol show job job_id
```

The `sinfo` command displays the current state of the compute nodes (see the example after the list):

* **STATE=alloc**: the node is fully allocated.

* **STATE=mix**: the node is partly allocated.

* **STATE=idle**: the node is not allocated.

* **STATE=drain**: the node does not accept new jobs, but the jobs currently allocated on the node keep running.

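For instance, to display only the nodes of a given partition (here the short partition), one can use the `-p` option of `sinfo`:

```
sinfo -p short
```
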
## Accounting

Slurm is connected to a database recording job accounting data. The `sacct` and `sreport` commands give access to this accounting information.

Show information on the *job_id* job:

```
# Short format
sacct -X -j job_id

# Long format
sacct -X -l -j job_id
```

Display jobs starting and ending between January 1, 2019 and January 1, 2020 on the bigmem01 node:

```
sacct -X --nodelist=bigmem01 --starttime=2019-01-01 --endtime=2020-01-01
```

Display the number of compute hours used by *username* between June 1, 2019 and January 1, 2020, for each account:

```
sreport -t hours user TopUsage Start=2019-06-01 End=2020-01-01 Users=username
```

Display the total number of compute hours used by *username* between June 1, 2019 and January 1, 2020 (all accounts combined):

```
sreport -t hours user TopUsage Group Start=2019-06-01 End=2020-01-01 Users=username
```

Display the real resources used by the terminated *job_id* job:

```
sacct -j job_id --format=JobID,State,TRESUsageInMax%100 | grep batch
```

The output of this command will look like this:

_job_id.batch COMPLETED **cpu=10:00:28**,energy=0,fs/disk=234247,**mem=13353272K**,pages=0,vmem=185824K_

In the above example, the cumulated CPU time used by the *job_id* job is about 10 hours, and the maximum total memory used during the execution is about 13 GB.

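These measured values can help size subsequent requests. For instance, if a 16-core job peaked at about 13 GB of total memory, the next submission could keep some headroom with roughly 1000 MB per core (hypothetical values, to be adapted to your own jobs and to the partition limits):

```
#SBATCH --ntasks=16
#SBATCH --mem-per-cpu=1000   # 16 x 1000 MB = 16 GB in total, above the 13 GB peak
```
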
Display the real resources used by the terminated jobs of *username* starting and ending between February 1, 2021 and March 1, 2021 on the bigmem partition:

```
sacct -u username --partition=bigmem --starttime=2021-02-01 --endtime=2021-03-01 --format=JobID,State,TRESUsageInMax%100 | grep batch
```