====== OAR Tutorial ======

On this page, you will learn how to properly use the CPU and GPU clusters of the team, and especially the OAR submission system. In particular, you are highly encouraged to read the [[oar_tutorial#etiquette|section]] about the "rules" and "good practices" when submitting jobs, to ensure a fair use of resources among all users of the team.

----

===== Basics =====

The CPU cluster of the team is now part of the "shared cluster" of the center, where you can use computing resources from different teams. You have priority access to the computing nodes from THOTH. The CPU cluster frontal node is called ''access2-cp''. You can ssh to it, but you **should not use it to do computations** (it is just a bridge). Instead, you have to submit jobs to the computation/worker nodes. A job is a task (generally a script in bash, python, etc.) that you cannot run on your workstation (because of a lack of computing power, memory, etc.).

The GPU cluster is not shared and is dedicated to our team. Its frontal node is called ''edgar''. Likewise, it is not a machine to do computations on.

In both cases you cannot directly SSH to the computation nodes; you can only use them by submitting jobs. Job management is handled by a program called [[http://oar.imag.fr|OAR]]. OAR is a resource manager: it allows you to reserve computing resources (xx CPU cores, a GPU) for a certain amount of time to do a task (run your script). When you submit a job, OAR allocates the resources that you requested and your job runs on these resources.

**Important:** You have to estimate the resources and time that your job will require when you submit it. If you reserve 20 CPU cores for 6 hours, your job should finish before the end of its allocated time. If not, your computations will be lost. To avoid this, you can use **checkpointing** and intermediate saving.

==== Using OAR ====

The following commands will be useful to use OAR, especially to submit and monitor jobs:

  * To show currently running and waiting jobs, and who they belong to, use:
<code>
oarstat
</code>
You can pipe the result to a grep on your username to print only your jobs, e.g. ''oarstat | grep username'', or you can use the following option: ''oarstat -u username''.

**Note:** any submitted job is assigned a job id or job number (denoted ''<job_id>'' in the following) that can be used to monitor it. To monitor a running or finished job, you can use (replace ''<job_id>'' by your job number):
<code>
oarstat -fj <job_id>
</code>

  * To submit a job, you have to use the ''oarsub'' command. The simplest reservation looks like this:
<code>
oarsub -I # Interactive
</code>
If the cluster isn't full, your job will be executed almost immediately. The ''-I'' option stands for "interactive": it opens a shell on a computation node where you can run your experiment. This interactive mode is useful to run some small tests (the session is limited in time, at most 2 hours) before using the standard submission mode.

=== Standard job submission ===

  * The following command submits the given script as a job; it will be run on a node when a slot becomes available:
<code>
oarsub ./my_script.sh
</code>
The command gives the prompt back immediately and will execute the job at a later time; this is called "batch mode". The standard output and error are redirected to ''OAR.<job_id>.stdout'' and ''OAR.<job_id>.stderr'' in the directory that was current when you submitted the job.

**Attention:** Your script ''my_script.sh'' must have execution rights (more on Linux file permissions can be found [[https://www.linux.com/learn/understanding-linux-file-permissions|here]]). You can give it execution rights with the following command: ''chmod u+x my_script.sh''.
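For illustration, ''my_script.sh'' could be as simple as the following sketch (the project directory and the python command are placeholders, to be replaced by your own experiment):

<code bash>
#! /bin/bash
# my_script.sh -- a minimal batch job (paths and program are placeholders)
cd "$HOME/experiments/my_project"   # move to the experiment directory
python train.py --epochs 10         # the actual computation
</code>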
  * In case you would like to delete some of your jobs, use:
<code>
oardel <job_id>
</code>
(you can get the ''<job_id>'' from ''oarstat'')

A visual overview of the nodes is given by the Monika software ([[http://visu-cp.inrialpes.fr/monika|cpu cluster]] and [[http://edgar/monika|gpu cluster]]).

**Note:** Some tools (python scripts) were designed in the team to parse ''oarstat'' results into a more human-readable format; check out the ''thoth_utils'' [[https://gitlab.inria.fr/thoth/thoth_utils|Gitlab webpage]]. These scripts are also available in the directory ''/home/thoth/gdurif/shared/thoth_utils'' (which you can add to the ''PATH'' in your ''.bashrc'' or equivalent, for instance). The most interesting scripts here are ''gpustat'', which summarizes the current use of the GPU cluster, and ''oarstat++.py'', which can be used on both clusters (''oarstat++.py -s edgar'' or ''oarstat++.py -s access2-cp''). Other options are documented (see ''oarstat++.py -h'').

  * If you want to see what is going on with your job on the cluster, you can ssh to the node where it is running with:
<code>
oarsub -C <job_id>
</code>

  * Many options can be passed to the ''oarsub'' command to specify the resources that you want to reserve, with the ''-l'' option. These options can be specific to the CPU or GPU cluster. In both cases, to specify the duration of your job, use the following option:
<code>
oarsub -l "walltime=16:0:0"
</code>

  * You can also specify some properties of the node on which you want to run your job with the ''-p'' option. These properties can be combined, e.g. ''oarsub -p "property1='XX' AND property2>'YY'"'' (note the importance of all single '' ' '' and double quotes '' " ''). It is also possible to use the ''OR'' keyword. More details can be found in the corresponding section below, depending on the CPU or GPU cluster. The available properties can be found on the dedicated ''monika'' web page of each cluster ([[http://visu-cp.inrialpes.fr/monika|cpu cluster]] and [[http://edgar/monika|gpu cluster]]).

=== Default and besteffort jobs ===

Unless specified otherwise, all jobs that you submit are **default** jobs. Once they have started, they run without interruption until they finish, crash or reach the walltime. To ensure a fair share of the resources between users, the resources allocated to the default jobs of a single user are limited (c.f. [[tutorials:oar_tutorial#etiquette|below]]).

To submit a job in best effort, you should add the options ''-t besteffort -t idempotent'' to your submission:
<code>
oarsub -l xxx -t besteffort -t idempotent
</code>
Besteffort jobs will be killed when resources are needed (for waiting default jobs) and restarted automatically (if you do not forget the option ''-t idempotent''). However, you have to handle checkpointing yourself.

==== CPU cluster ====

  * The ''-l'' option allows you to specify the resources that you want to reserve. To get a machine with 16 cores, use:
<code>
oarsub -l "nodes=1/core=16"
</code>
To get 8 cores on 1 machine for at most 16 hours, use:
<code>
oarsub -l "nodes=1/core=8,walltime=16:0:0"
</code>
You are then sharing the RAM with other jobs. Make sure you specify "nodes=1", so that all your cores and RAM are reserved on the same machine.

**Attention:** Your main concern should be the use of memory. Before submitting a job, you should estimate the memory it will require and request a computation node with sufficient memory (see below). Since OAR does not monitor the quantity of memory requested by submitted jobs, multiple memory-heavy jobs can be assigned to the same computation node, creating issues and potentially crashing the node. It is therefore recommended that your experiments do not waste memory; for safety, you can use the ''ulimit'' bash command in your script, as sketched below.
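A minimal sketch of capping memory with ''ulimit'' (the limit and the command are placeholders; adapt them to your own job):

<code bash>
#! /bin/bash
# Cap the virtual memory available to this job at ~32 GB: if the experiment
# exceeds the limit, it fails with an allocation error instead of exhausting
# the memory of the node.
ulimit -v 33554432             # value in kilobytes (32 * 1024 * 1024)
python my_heavy_experiment.py  # placeholder for the actual computation
</code>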
  * The ''-p'' option allows you to specify the characteristics of the computation node that you want to reserve. It can be mixed with the ''-l'' option. For instance, if you want 8 cores on a single node with more than 64 GB of memory, you can use:
<code>
oarsub -l "nodes=1/core=8,walltime=16:0:0" -p "mem>64"
</code>

**Note:** More information about the shared cluster can be found on the [[tutorials:shared_cluster|dedicated page]], in particular regarding the use of the resources from our team (THOTH) or from other teams.

The CPU computation nodes are Xeons with 8 to 48 cores. They have at least 32 GB of RAM (you can find out how much RAM each machine has with the command ''oarnodes'', property 'mem'). All nodes run Ubuntu 16.04 x86_64 with a set of commonly used packages installed by default. You can install packages on the cluster nodes; however, if you feel you need them everywhere, you should seek the help of a [[:system_administrators|system administrator]]. On all nodes you can use /tmp and /dev/shm for (local) temporary files. Global data has to be on scratches.

==== GPU cluster ====

**Important:** On the GPU cluster, many nodes have more than 1 GPU. You are responsible for using the GPU that was allocated to you by OAR. To do so, you have to source the file ''gpu_setVisibleDevices.sh'' (available on all nodes) at the beginning of the script that you submit to OAR. This file sets the ''CUDA_VISIBLE_DEVICES'' environment variable. Using this, you no longer have to specify a GPU ID when running your single-GPU experiments. For example, if you are assigned GPU number 5 on gpuhost21, it will show up as "gpuid=0" in your job. Thus, your (bash/zsh) script should be something like:
<code>
source gpu_setVisibleDevices.sh
...
GPUID=0 # ALWAYS
RunCudaCode(... , GPUID , ...)
</code>

**Note 1:** the previous procedure based on the script ''gpu_getIDs.sh'' is now deprecated.

**Note 2:** If you don't use a bash/zsh script and submit a command directly to OAR, you have to modify your submission to source ''gpu_setVisibleDevices.sh'' before running your command, for instance:
<code>
oarsub -p ... -l ... "source gpu_setVisibleDevices.sh; python myscript.py"
</code>

**Attention:** You have to check that the program/code that you use is compatible with setting the environment variable ''CUDA_VISIBLE_DEVICES''. Standard libraries such as TensorFlow, Caffe, PyTorch (and anything based on CUDA) are compatible. In doubt, you can contact your [[:system_administrators|system administrators]] before submitting your job on the cluster.

To check GPU memory consumption and use, you can connect to the node where your computations are running (with ''oarsub -C <job_id>'') and run:
<code>
nvidia-smi
</code>
If you are doing everything right, you will see a decent GPU-Util percentage for the GPU you are using.
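Putting the pieces together, a complete single-GPU submission could look like the following sketch (the script name, GPU model and walltime are only examples):

<code bash>
#! /bin/bash
# gpu_job.sh -- sketch of a single-GPU job script (train.py is a placeholder)
source gpu_setVisibleDevices.sh   # only the GPU allocated by OAR remains visible
python train.py                   # CUDA code now sees a single GPU, with id 0
</code>

submitted for instance with:

<code bash>
oarsub -p "gpumodel='titan_xp'" -l "walltime=24:0:0" /path/to/gpu_job.sh
</code>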
To reserve 2 GPUs for a single job (if you use multi-GPU computations, for instance), you can use the option ''-l "host=1/gpuid=2"'' for ''oarsub'', which can be combined with a walltime option, for instance:
<code>
oarsub -l "host=1/gpuid=2,walltime=48:0:0"
</code>
In this case (multi-GPU code), sourcing ''gpu_setVisibleDevices.sh'' will set up the visible GPU ids accordingly, and your (bash/zsh) script should be something like:
<code>
source gpu_setVisibleDevices.sh
...
GPUID=0,1 # ALWAYS
RunCudaCode(... , GPUID , ...)
</code>

The [[oar_tutorial#gpu_cluster_oar_cheat_sheet|next section]] will give you other examples of ''oarsub'' commands. See the man pages or the online OAR documentation for more info.

The GPU worker nodes are called ''gpuhost1'', ..., ''gpuhost22''. Some of them are desktop machines, some of them are racked. They run the same base configuration as desktop computers, and you also have the right to install packages. They generally have a scratch attached, which means you can get high-volume local storage if that is one of your computation requirements. You can run the command ''oarnodes'' on ''edgar'' to see the available resources (which GPU model) on the cluster, or check the files listing the machines of the team ([[http://thoth/private/machines.list|txt]] or [[http://thoth/private/machines.ods|spreadsheet]]).

The latest version (or a very recent one) of the NVIDIA drivers is installed on each node. However, you have to install your own version of CUDA somewhere in your scratchdir (or use a version installed by one of your colleagues) and correctly set the ''PATH'' and ''LD_LIBRARY_PATH'' in your scripts to use it.

----

===== GPU cluster OAR cheat sheet =====

<code>
oarsub -I  # reserve 1 GPU interactively for 2 hours
oarsub -l "walltime=48:0:0" "/path/to/my/script.sh"  # reserve 1 GPU for 48 hours
oarsub -p "gpumodel='titan_xp'" -l "walltime=48:0:0" "/path/to/my/script.sh"  # reserve 1 Titan Xp for 48 hours
oarsub -p "host='gpuhost5'" -l "walltime=48:0:0" "/path/to/my/script.sh"  # reserve 1 GPU on gpuhost5 for 48 hours
oarsub -p "host='gpuhost3' and gpuid='1'" -l "walltime=48:0:0" "/path/to/my/script.sh"  # reserve GPU id 1 on gpuhost3 for 48 hours
oarsub -p "host='gpuhost1'" -l "gpuid=4,walltime=48:0:0" "/path/to/my/script.sh"  # reserve 4 GPUs on gpuhost1
oarsub -p "gpumodel='titan_black'" -l "host=1/gpuid=2" -l "walltime=48:0:0" "/path/to/my/script.sh"  # reserve 2 Titan Blacks on a single host
oarsub -r "2015-05-24 15:20:00" -l "walltime=48:0:0" "/path/to/my/script.sh"  # reserve a GPU in advance
oarsub -p "host='gpuhost3'" -l "host=1" -t besteffort -t idempotent "/path/to/my/script.sh"  # reserve a whole node in besteffort (acceptable for CPU jobs)
oarsub -p "gpuid='0'" -l "walltime=48:0:0" "/path/to/my/script.sh"  # reserve exclusively GPU id 0, thus no need to modify your code (default GPU is 0)
</code>
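Since besteffort jobs can be killed and restarted at any time, their job scripts typically need to resume from a checkpoint. A minimal sketch of such an idempotent script (the checkpoint path, script name and options are placeholders; the checkpointing itself must be implemented in your own code):

<code bash>
#! /bin/bash
# besteffort_job.sh -- sketch of an idempotent besteffort job
CKPT="$HOME/experiments/run42/checkpoint.pth"   # placeholder checkpoint file
if [ -f "$CKPT" ]; then
    # A previous run was killed and left a checkpoint: resume from it
    python train.py --resume "$CKPT"
else
    # First run: start from scratch, saving checkpoints along the way
    python train.py --checkpoint "$CKPT"
fi
</code>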
----

===== Etiquette =====

**Some rules to use the cluster efficiently and show courtesy towards your colleagues.** :)

Do not run any jobs on the submission nodes (''edgar'', ''access{1,2}-cp''), which are just gateways to the different clusters, nor on ''clear'' (the former frontend of the CPU cluster), which is a vital point in the THOTH network, notably as DHCP and NFS server (you don't want it to crash!).

The use of persistent terminal multiplexers (screen, tmux) is **forbidden** on the nodes themselves. Run your screen session from the OAR server or from your desktop machine.

Some of these rules are not enforced automatically, but if some jobs do not follow them, the administrators may suspend or kill them. Of course, exceptions can be made, especially during paper submission periods, but always by first discussing with your colleagues, and in a good spirit.

==== CPU cluster ====

Submitting a job on //n// cores means that your job must be able to fully use //n// cores (cf. parallelization, at code or thread/process level). An exception is made for jobs that require large amounts of RAM: in those cases, inflate the number of cores you allocate according to the RAM you use. **If you use too much memory, the machine might stall completely** (what we call swap thrashing), requiring a manual reboot by a sysadmin.

Avoid overloading the OAR server by launching too many jobs in a row. There should not be many more jobs in the waiting state than there are cores. In the same spirit, you should not occupy more than **100** cores on the CPU cluster with default jobs. If you need more, switch to besteffort jobs.

It is forbidden to reserve a node via the command ''sleep''.

==== GPU cluster ====

You are limited to **3** default jobs on the GPU cluster, to prevent a single user from reserving all the resources. If you need more, you can use besteffort jobs.

You should **control** the **memory** (as in RAM) and **CPU consumption** of your jobs using GPUs, since your jobs share the worker nodes with jobs from other people (nodes have 1, 2, 4 or 8 GPUs). In addition, you should also avoid using "big" GPUs if your task does not require a lot of GPU memory. You can specify the GPU model or GPU memory with the options ''-p "gpumodel='XX'"'' or ''-p "gpumem>'10000'"'' (note the importance of the single quotes around 10000).

Reserving a GPU via the command ''sleep'' can be tolerated but is highly discouraged.

**Equalizer script**: in the past, it happened that a single user took all empty resources with besteffort jobs, forcing besteffort jobs from other users to wait (when all users already had their quota of default jobs), which was not a fair repartition of resources. To avoid this, we set up an "equalizer" script that regularly and automatically kills the last submitted job of the user with the largest number of besteffort jobs whenever besteffort jobs from other users are waiting, enforcing a redistribution of the resources. So do not be surprised if some of your besteffort jobs are killed and resubmitted from time to time.

**Fixing OAR lockup**: sometimes OAR becomes completely unresponsive. Jobs can still be submitted using ''oarsub'', but the list of jobs in the "Waiting" state keeps growing, even though resources are available. In those cases, and when admins are not around, you can use the following command on ''edgar'':
<code>
sudo /bin/OAR_hard_reset.sh
</code>
It will attempt to restart OAR forcefully. Double-check that OAR is locked up before running this command.

----

===== OAR Scripts =====

OAR has another way of specifying all options for a job. Just pass it a script name:
<code>
oarsub --scanscript ./myscript.sh
</code>
OAR will then process any line starting with ''#OAR''. Here is an example:
<code bash>
#! /bin/bash
#OAR -l nodes=1/core=10
#OAR -n mybigbeautifuljob
#OAR -t besteffort
#OAR -t idempotent
echo "start at $(date +%c)"
sleep 10
echo "end at $(date +%c)"
</code>
In this way every job can be a fairly self-contained description of what to do and which resources are needed.

In case you are not a big fan of wrapping all your python scripts in shell scripts, this seems to work with arbitrary text files:
<code python>
#! /usr/bin/env python
#OAR -t besteffort
#OAR -t idempotent
#OAR -l nodes=1/core=10
import time
from datetime import datetime

print(datetime.now())
time.sleep(20)
print(datetime.now())
</code>

==== Notes ====

  * The script needs to be executable! I.e. ''chmod +x myscript.sh''
  * Since you only pass the script name to OAR, you need a shebang (e.g. ''#! /bin/bash'') in the script itself.
  * Only one option per line (e.g. one for ''besteffort'', one for ''idempotent'')

===== Utilities =====

Some useful scripts are available in the ''thoth_utils'' repository, available [[https://gitlab.inria.fr/thoth/thoth_utils|here]] and on any machine of the team in the directory ''/home/thoth/utils/thoth_utils''. Here is a short description of their usage:

  * ''oarstat++.py'' is an improved version of ''oarstat''. It displays the names of the nodes, their number of cores and some statistics of the busy/free/waiting nodes as a function of their number of cores. For your convenience, just paste the following shortcut into your ''~/.bashrc'' file: ''alias oarstat++="python /path/to/oarstat++.py"''. To get help about it, call ''oarstat++ -h''. Idem for ''oarstatr''.
  * ''oargetnode []'' is an improved version of job submission. To use it, paste the following shortcut into your ''~/.bashrc'' file: ''alias oargetnode="/path/to/oargetnode"''
  * In case you want to kill ALL your jobs, you can use the python script ''/path/to/oardelall''. It will ask you for confirmation, first deleting your waiting jobs, and then your running jobs after a second confirmation.

=== Shell Aliases ===

Two useful commands to start interactive sessions on the CPU and GPU cluster, respectively:
<code bash>
###################################################
# Start an interactive CPU job in the current directory
# (2 cores by default)
# E.g. to make a 20 core job:
# $ c 20
###################################################
function c () {
    ssh -tA access2-cp "oarsub -I -d $(pwd) -l nodes=1/core=${1:-2}"
}

###################################################
# Start an interactive GPU job in the current directory.
# I use gtx 980 for debugging to not block the more powerful GPUs
###################################################
alias g="ssh -t edgar oarsub -I -d $(pwd) -p \"\\\"gpumodel='gtx980'\\\"\""
</code>

===== Web visualisations =====

  * [[http://visu-cp.inrialpes.fr/monika]]
  * [[http://visu-cp.inrialpes.fr/drawgantt]]
  * [[http://edgar/monika]]
  * [[http://edgar/drawgantt]]