Usage

The cluster is made up of processors with 48 cores and 380 GB of memory each, for a total of 768 cores and 6 TB of memory. All computations and code executions must be run through the slurm resource manager:
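The per-partition layout (node count, cores and memory per node, time limits) can be checked at any time with sinfo; a minimal sketch, where the actual output depends on the current cluster configuration:

$ sinfo -o "%P %D %c %m %l"
# one line per partition: name, number of nodes, CPUs per node, memory per node (MB), time limit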

Interactive resource reservation

Resources can be reserved interactively through slurm with the srun command. These reservations can be used for any kind of computation, code, library or other tool, by using the interactive partition; example:

$ srun  -p interactive -n 24 --cpus-per-task=1  --pty bash -i
...
  • the interactive partition must be used (-p interactive)
  • the other options and their syntax are the same as for the usual sbatch command
  • the interactive job starts in the working directory where the srun command was run; you can then move to any directory in your account (see the sketch after this list).
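For example, a short interactive session could look like the following sketch (the allocation size and the project directory are only illustrative):

$ srun -p interactive -n 4 --cpus-per-task=1 --pty bash -i
$ echo $SLURM_JOB_ID $SLURM_NTASKS   # check the allocation granted by slurm
$ cd ~/my_project                    # hypothetical directory on your account
$ exit                               # release the reservation when done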

Slurm

Introduction

A few job management commands

  • Submit a job described in the script <jobscript> (a minimal sketch of such a script is given at the end of this list):

    $ sbatch <jobscript>
    
  • List the submitted jobs:

    $ squeue
    # or with a different output format
    $ squeue -o '%A %8u %12j   %3C  %8N %2t %12l %12M %23e %10m %5Q'
    
  • Cancel a job (referenced by its jobid):

    $ scancel <jobid>
    
  • Get information about a submitted job:

    $ scontrol show job <jobid>
    # or
    $ sacct -j <jobid> --format=User%20,JobID%20,Jobname%60,partition,state%20,time,start,end,elapsed,MaxRss,MaxVMSize,ReqMem,nnodes,ncpus,nodelist%60,submit%40
    
  • Override the script's options on the command line:

    $ sbatch  -n 96 -t 24:00:00 --qos batch_long ./script_with_small_walltime.sh
    
  • Add email notifications:

    $ sbatch --mail-user=homer.simpson@springfield.fr --mail-type=BEGIN,END,FAIL submit.sh
    
  • Run a code named ex17 directly via srun:

    $ srun -p interactive -n 12 --mem=48G -- mpirun -n 12 ./ex17 -dm_refine 9 
    
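As a reference for the sbatch commands above, a minimal <jobscript> could look like the following sketch (job name, resources and executable are placeholders to adapt):

#!/bin/bash
#SBATCH --job-name=my_job          # placeholder job name
#SBATCH --output=my_job.log        # standard output / error file
#SBATCH -n 24                      # number of tasks (placeholder)
#SBATCH -p calcul                  # compute partition
#SBATCH --time=01:00:00            # walltime

module add compilers/gcc/
srun ./my_code                     # hypothetical executable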

calcul "MPI"

Several MPI implementations are available, managed directly through the modules: openmpi, intelmpi (or the intel module compilers/intel/all_tools_2025.0.0, which loads all the intel tools).
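To list the installed MPI modules and load one of them (module names and versions follow the site's module tree; the openmpi version below is the one used in the next example):

$ module avail mpi
$ module add mpi/openmpi/5.0.5
$ mpirun --version    # check which implementation is now in the environment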

Example MPI script

  • request 48 MPI processes for 10 minutes
#!/bin/bash
#SBATCH --job-name=LULESH-MPI
#SBATCH --output=LULESH-MPI.log
#SBATCH -n 48
#SBATCH -p calcul
#SBATCH --time=10:00

module add compilers/gcc/ mpi/openmpi/5.0.5
srun  ./omp_4.0/lulesh2.0 -i 20 -p
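Assuming the script is saved as lulesh_mpi.sh (hypothetical filename), it is submitted and followed with the commands introduced above:

$ sbatch lulesh_mpi.sh
$ squeue -u $USER          # watch the job while it is pending or running
$ tail -f LULESH-MPI.log   # output file declared in the script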

calcul "OPENMP"

Example OpenMP script

  • request a single task with 48 threads (cores) for 10 minutes
#!/bin/bash
#SBATCH --job-name=LULESH-OMP4
#SBATCH --output=LULESH-OMP4.log
#SBATCH -n 1
#SBATCH --cpus-per-task=48
#SBATCH -p calcul
#SBATCH --time=10:00

module add compilers/gcc/
# pin the OpenMP thread count to the Slurm allocation
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
./omp_4.0/lulesh2.0 -i 20 -p
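Submission is identical; assuming the script is saved as lulesh_omp.sh (hypothetical filename), sacct can then report the time and memory actually used by the 48 threads:

$ sbatch lulesh_omp.sh
$ sacct -j <jobid> --format=JobID,Elapsed,ncpus,MaxRSS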

Job monitoring

In addition to the usual squeue, sinfo and sacct commands, other commands are available (provided by the project https://github.com/OleHolmNielsen/Slurm_tools.git); a few example invocations are sketched at the end of this section:

  • pestat : Print Slurm nodes status with 1 line per node including job info.

    $  pestat -h
    Usage: pestat [-p partition(s)] [-P] [-u username] [-g groupname] [-A accountname] [-a]
    [-q qoslist] [-s/-t statelist] [-n/-w hostlist] [-j joblist] [-G] [-N]
    [-f | -F | -m free_mem | -M free_mem ] [-1|-2] [-d] [-S] [-E] [-T] [-C|-c] [-V] [-h]
    where:
    -p partition: Select only partion <partition>
        -P: Include all partitions, including hidden and unavailable ones
    -u username: Print only jobs of a single user <username>
    -g groupname: Print only users in UNIX group <groupname>
    -A accountname: Print only jobs in Slurm account <accountname>
    -a: Print User(Account) information after each JobID
    -q qoslist: Print only QOS in the qoslist <qoslist>
    -R reservationlist: Print only node reservations <reservationlist>
    -s|-t statelist: Print only nodes with state in <statelist>
    -n|-w hostlist: Print only nodes in hostlist
    -j joblist: Print only nodes in job <joblist>
    -G: Print GRES (Generic Resources) in addition to JobID
    -N: Print JobName in addition to JobID
    -f: Print only nodes that are flagged by * (unexpected load etc.)
    -F: Like -f, but only nodes flagged in RED are printed.
    -m free_mem: Print only nodes with free memory LESS than free_mem MB
    -M free_mem: Print only nodes with free memory GREATER than free_mem MB (under-utilized)
    -d: Omit nodes with states: down down~ idle~ alloc# drain drng resv maint boot
    -1: Default: Only 1 line per node (unique nodes in multiple partitions are printed once only)
    -2: 2..N lines per node which participates in multiple partitions
    -S: Job StartTime is printed after each JobID/user
    -E: Job EndTime is printed after each JobID/user
    -T: Job TimeUsed is printed after each JobID/user
    -C: Color output is forced ON
    -c: Color output is forced OFF
    -h: Print this help information
    -V: Version information
    
  • showuserjobs : Print the current node status and batch jobs status broken down into userids.

    $  showuserjobs -h
    Usage: /usr/local/bin/showuserjobs [-u username] [-a account] [-p partition] [-P] [-q QOS] [-r] [-A] [-C] [-h]
    where:
    -u username: Print only user <username>
    -a account: Print only jobs in Slurm account <account>
    -A: Print only ACCT_TOTAL lines
    -C: Print comma separated lines for Excel
    -p partition: Print only jobs in partition <partition-list>
    -P: Include all partitions, including hidden and unavailable ones
    -q qos-list: Print only jobs in QOS <qos-list>
    -r: Print additional job Reason columns
    -h: Print this help information
    
  • showpartitions : Print a Slurm cluster partition status overview with 1 line per partition, and other tools.

    $  showpartitions -h
    Usage: /usr/local/bin/showpartitions [-p partition-list] [-g] [-m] [-a|-P] [-f] [-h]
    where:
    -p partition-list: Print only jobs in partition(s) <partition-list>
    -g: Print also GRES information.
    -m: Print minimum and maximum values for memory and cores/node.
    -a|-P: Display information about all partitions including hidden ones.
    -f: Show all partitions from the federation if a member of one. Only Slurm 18.08 and newer.
    -n: no headers or colors will be printed (for parsing).
    -h: Print this help information.
    
    $ showpartitions
    Partition statistics for cluster jarvis at Fri Feb 14 02:50:19 PM CET 2025
       Partition     #Nodes     #CPU_cores  Cores_pending   Job_Nodes MaxJobTime Cores Mem/Node
       Name State Total  Idle  Total   Idle Resorc  Other   Min   Max  Day-hr:mn /node     (GB)
    calcul:*    up     1     0    768    576      0    106     1 infin    5-00:00   768    6095
       visu    up     1     0    768    576      0      0     1 infin      12:00   768    6095
    interactive    up     1     0    768    576      0      0     1 infin      12:00   768    6095
    Note: The cluster default partition name is indicated by :*
    
  • showuserlimits : Print Slurm resource user limits and usage.

    $  showuserlimits -h
    Usage: /usr/local/bin/showuserlimits [-u username [-A account] [-p partition] [-M cluster] [-q qos] [-l limit] [-s sublimit1[,sublimit2,...]] [-n] | -h ]
    where:
        -u username: Print user <username> (Default is current user)
        -A accountname: Print only account <accountname>
        -p partition: Print only Slurm partition <partition>
        -M cluster: Print only cluster <cluster>
        -q Print only QOS=<qos>
        -l Print selected limits only
        -s Print selected sublimit1[,sublimit2,...] only
    -n Print also limits with value None
        -h Print help information
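
A few typical invocations of these tools (the output depends on the current cluster load and on your account):

$ pestat -p calcul -T       # node status of the calcul partition, with time used per job
$ showuserjobs -u $USER     # batch job summary for the current user
$ showuserlimits            # limits and usage for the current user (default)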