Grid Engine
Please see ComputeFarmIntro for a brief introduction for newcomers to the CLASSE Compute Farm.
Introduction
The CLASSE Compute Farm uses the
Son of Grid Engine - SGE batch queuing system. Interactive, batch, and parallel compute jobs can be launched from CLASSE Linux machines. All CLASSE batch nodes run 64-bit SL7 with Intel processors.
You should find using the Compute Farm to be faster, more reliable, easier, and more efficient than using individual desktops.
Maximum Running and Queued Job Limits
It is important to remember that the batch queues, our network, and remote filesystems are all shared resources available for use by anyone in our lab. In this spirit, each user is limited to 120 jobs running at once across all queues. If you submit more than this number of jobs (and sufficient queue slots are available), 120 jobs will start running and the rest will be held until some of your running jobs finish. The queued job limit is 1000.
Summary:
- The maximum number of concurrent jobs that any user can run, total in all queues, is now 120.
- The maximum number of concurrent queued jobs that any user can queue up for execution, total in all queues, is now 1000.
These limits will be increased as execution nodes are added into the Compute Farm.
To see what the maximum running job limit is, please type
qconf -ssconf | grep maxujobs
To see what the maximum queued job limit is, please type
qconf -sconf | grep max_u_jobs
LINK CURRENTLY DOWN - The
SGE Manuals are available for our installation of
Son of Grid Engine. Older documentation can be found at
older SGE Manuals. Documentation guaranteed to match our installed version can be found with the command
man -M /nfs/sge/root/man sge_intro and commands like
man -M /nfs/sge/root/man qsub. Note that a simple
man qsub will likely bring up the documentation for a generic POSIX-compliant qsub, not the one specific to our installation.
Son of Grid Engine source code for version 8.1.9
The
Son of Grid Engine source code is found here on the
SGE github site.
A second site, with AlmaLinux 9.2 build support, is found here.
Job Submission
There are two basic steps to submitting a job on the CLASSE Compute Farm:
- Create a shell script containing the commands to be run (examples below).
- Submit this script to the Compute Farm using the qsub command.
For example, if you created a shell script called
myscript.sh
, then you would submit that script with the command
qsub -q all.q myscript.sh
Then, you can monitor the status of your job using the
qstat command (see Useful Commands below).
Creating and submitting a batch job script
If your shell script is named "shell_script", it would be submitted to the Compute Farm using
qsub -q all.q shell_script
(if "shell_script" is not in your PATH, you must use the full path:
qsub /path/to/shell_script
). See
man -M /nfs/sge/root/man qsub for more information and examples. Any shell script can be run on the Compute Farm. For example:
#!/bin/bash
echo Running on host: `hostname`.
echo Starting on: `date`.
sleep 10
echo Ending on: `date`.
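As a minimal sketch, if this script were saved as myscript.sh (the name is illustrative), a typical submit-and-monitor sequence would be:
qsub -q all.q myscript.sh
qstat -u $USER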
In addition, arguments to the Grid Engine scheduler can be included in the shell script. For example:
#$ -q all.q
#$ -S /bin/bash
#$ -l mem_free=8G
/path/to/program.exe
- The lines starting with
#$
are used to pass arguments to the qsub command and are not executed as part of the job (see Environment Setup below).
- Instead of including these lines in the script, one could have issued the command
qsub -q all.q -S /bin/bash -l mem_free=8G shell_script
.
After submitting with
qsub
, the job will be assigned a number (e.g. 87835). Unless specified otherwise, the standard output will be written to a file called shell_script.o87835 and the error output to a file called shell_script.e87835, both in your home directory.
Environment Setup
To properly set up one's environment, please be sure to source any project- or program-related setup scripts in your batch script. For example, CESR users need to source CESRDEFS or acc_vars.sh in their batch scripts, and CMS users source the CMS setup scripts. To ensure portability across multiple shell environments, please include a shell identifier line (shebang) at the top of your batch script identifying the shell your script is written in.
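As a minimal sketch (the setup script and program paths are illustrative; substitute your own project's setup script and executable):
#!/bin/bash
#$ -S /bin/bash
# Source your project's environment setup (example path only)
source /path/to/acc_vars.sh
/path/to/program.exe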
Filesystem usage
You can improve performance (and avoid negatively impacting others) by making use of local /tmp filesystems on the client nodes (100GB, or as large as allowed by the Compute Farm node's disks - please see
TemDisk for automatic deletion policies of files in /tmp). Wherever possible, copy files to and from these local filesystems at the beginning and end of your jobs to avoid continuously tying up network resources. Of course, it also helps to delete files from the tmp filesystems when your jobs complete. To view the amount of free, total, and used tmp space on each Compute Farm member, use
qhost -F tmp_free,tmptot,tmpused
. To schedule a job on a node with at least 90G free in /tmp, for example, use
qsub -l tmp_free=90G myscript.sh
.
Don't forget that opening and closing files has a lot of disk-bound overhead in order to create entries in directories. A single file is written much more quickly than multiple files containing the same number of bytes.
Also, writing files to a local disk increases throughput substantially. An NFS write operation is not allowed to start until the previous output has been written successfully to the remote disk.
Finally, if you are submitting a set of particularly data intensive jobs to the queue, we recommend using the
-hold_jid option to limit the number of jobs you have running simultaneously.
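For example, a minimal sketch (the job and script names are illustrative) that keeps a second data-intensive job queued until the first has finished:
qsub -q all.q -N stage1 myscript.sh
# stage2 will not start until stage1 has completed
qsub -q all.q -N stage2 -hold_jid stage1 myscript.sh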
Template Script
This is an example script that you can copy and modify for your needs. It has a couple of variables you can set to copy files in at the beginning of a job and back out at the end, following the recommended best practices above.
#!/bin/bash
#This is a template script for use in the batch queueing system. Please only edit the user sections.
LOCAL_DIR=${TMPDIR}
################USER SETTINGS###########################
#You can edit the variables here, but something valid must be present.
# SOURCE_DIR defaults to a files subdirectory of the directory you submitted the job from
#Comment out to not import any files for the job. Edit for your specific needs.
SOURCE_DIR=${SGE_O_WORKDIR}/files
DESTINATION_DIR=/cdat/tem/sge/${USER}/${JOB_ID}
########################################################
if [ ! -z "${SOURCE_DIR}" ]; then
rsync -avz ${SOURCE_DIR}/ ${LOCAL_DIR}
fi
#Put your code in the user section below. You can delete the entire
#section between USER SECTION and END USER SECTION - it is a very simple
#example script that does a loop and echos some job data for testing.
#################USER SECTION###########################
for i in {1..10}
do
echo `date` >> $LOCAL_DIR/myfile
sleep 1
done
echo "TMPDIR ${TMPDIR}" > $LOCAL_DIR/test
echo "JOB_ID ${JOB_ID}" >> $LOCAL_DIR/test
echo "USER ${USER}" >> $LOCAL_DIR/test
#################END USER SECTION#######################
mkdir -p ${DESTINATION_DIR}
ls ${DESTINATION_DIR}
rsync -avz ${LOCAL_DIR}/ ${DESTINATION_DIR}
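If the template is saved as, for example, template.sh (the name is illustrative), it is submitted like any other batch script:
qsub -q all.q template.sh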
Parallel Jobs
If you want to submit a parallel job that uses multiple threads on a single system, specify
-pe sge_pe number_of_threads
(for example,
qsub -q all.q -pe sge_pe 12)
To spread your job across as many machines as possible (for jobs that have little or no communication requirements) use
-pe sge_pe_rr 100 (for 100 threads). This will distribute your jobs in "round robin" fashion across all the nodes in the Compute Farm. To spread your job over fewer systems (but potentially more than one system), use
-pe sge_pe_fillup 100 which will fill up all the available slots of each system allocated for the job.
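As a sketch of a complete parallel job script (the program path is illustrative), using the NSLOTS variable that Grid Engine sets to the number of granted slots (see Execution Time below):
#!/bin/bash
#$ -S /bin/bash
#$ -q all.q
#$ -pe sge_pe 12
# Limit threading libraries to the number of granted slots
export OMP_NUM_THREADS=$NSLOTS
export MKL_NUM_THREADS=$NSLOTS
/path/to/threaded_program.exe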
Submitting a Mathematica Job
A somewhat more elaborate example of submitting a Mathematica job to run in Grid Engine can be seen at
SubmitMathematicaJob.
Submitting a Matlab Job
In your job script, please use the following syntax to run Matlab in batch (replace YOURINPUTFILE with the full path of your own matlab file with the ".m" file extension):
#$ -cwd
#$ -q all.q
ulimit -S -s 8192
/usr/local/bin/matlab -singleCompThread -nosplash -nodisplay -r "run('YOURINPUTFILE');exit;"
For more on
MatLAB, including instructions for specifying other versions of matlab, please see
MatLAB. For multithreaded matlab jobs, see the section below on execution time.
Submitting a CUDA Job
To submit a job that uses one cuda resource, add
-l cuda_free=1
to your qsub or qrsh command (where "l" is a lowercase L). For example:
-
qsub -l cuda_free=1 myjob.sh
For details on programming
CUDA, see
CUDA.
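The same resource request can also be embedded in the job script itself (the program path is illustrative):
#!/bin/bash
#$ -S /bin/bash
#$ -q all.q
#$ -l cuda_free=1
/path/to/cuda_program.exe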
Interactive Jobs -- qrsh
Scripted from Windows
We have a script
CLASSE GRID that simplifies some of this.
Manual from Linux
The best way to start interactive jobs using Grid Engine is to use the command
qrsh
. After typing
qrsh
, you will be logged into a node as soon as a slot is available. Once you're logged in, you can run interactive processes as needed, such as
mathematica,
matlab, etc.
You can also specify the command or program you'd like to run in the qrsh command. For example, to run
matlab interactively on the Compute Farm nodes reserved for interactive use, type:
qrsh -q interactive.q matlab -desktop
When submitting an interactive job, it is very helpful to specify the number of threads you expect to use. For example, to start an interactive job that uses 12 threads on a single system, type the command
qrsh -pe sge_pe 12
Please see
Resource Requirements,
Parallel Jobs, and
Useful options for more information.
-
qrsh
by itself uses all.q
and will go to any available Compute Farm node with the same priority as batch jobs.
-
qrsh -q interactive.q
will go to interactive.q
which contains nodes allocated for interactive use (a subset of all Compute Farm nodes). On those nodes, interactive jobs have priority over batch submissions.
- To end a
qrsh
session, type logout
or exit
, which returns you to the parent session (where qrsh
was invoked from).
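For example, to start an interactive session on an interactive node with 4 threads reserved (the thread count is illustrative):
qrsh -q interactive.q -pe sge_pe 4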
Resource Requirements (specifying which Compute Farm nodes a job is executed on)
To view the available resources (number of processor cores, amount of free memory, architecture type, etc) of each Compute Farm node, use the command
qhost
. If your job requires more resources (memory, CPUs, etc.) than are available on the Compute Farm node it executes on, the kernel will spend much of its time swapping and paging, which will severely degrade the performance of the system.
If you know your jobs are going to need a certain amount of memory or number of CPUs, you can specify the attributes of the Compute Farm nodes to execute a job on by using the "-l" flag (lowercase L) followed by any
Resource Attribute. Available attributes can be listed by typing
qconf -sc
. Multiple "-l" flags can be used in a single submission, and the "-l" flag can be used with either the
qsub
or
qrsh
commands.
Examples of some of the most useful queries are:
- qhost -F tmp_free,tmp_total,tmp_used,cuda_free,cudatot,cudaused will also show the free, total, and used /tmp space and cuda resources for each Compute Farm member.
- qhost -l mem_free=6G will show all the nodes that satisfy that criterion.
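As a sketch of combining multiple resource requests in a single submission (the values are illustrative):
qsub -q all.q -l mem_free=8G -l tmp_free=50G myscript.sh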
Useful options
See
man -M /nfs/sge/root/man qsub for more information and examples. All options can be used with either
qsub
or
qrsh
commands.
| Option Example | Explanation |
| -q all.q | Access all available batch nodes in the Farm |
| -N logname | Set the output file name. In this example, logname.o87835 |
| -j y | Combine the error and output streams into a single file (for example, logname.o87835) |
| -m a -M user@cornell.edu | Send an e-mail if the job is aborted |
| -m b -M user@cornell.edu | Send an e-mail at the beginning of the job's execution |
| -m e -M user@cornell.edu | Send an e-mail when the job ends |
| -e /PATH/TO/DIRECTORY | Specify where the error log will be saved |
| -o /PATH/TO/DIRECTORY | Specify where the output log will be saved |
| -o `pwd` -e `pwd` | Direct both the output log and the error log into the directory the job(s) was submitted from |
| -S /bin/bash | Specify that your program will run using bash. This will fix warnings like "no access to tty (Bad file descriptor)". |
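These options can be combined in a single submission. For example (the email address and output name are illustrative):
qsub -q all.q -N mylog -j y -m e -M user@cornell.edu -o `pwd` myscript.sh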
Execution Time
Each job run under Grid Engine is limited to two days (48 hours) of wall clock time. CPU time is limited to 48 hours per core. This time limit is the result of much negotiation and is unlikely to be changed, since the queues have to be shared by many people. Jobs which require more than 48 hours must be split into jobs which complete in less time. If your job terminates unexpectedly in fewer than 48 hours, it may have exceeded the time that you specified with a
ulimit
command (
ulimit -t
), or it may have used more cores than you requested. Use of more cores than you requested can happen by accident if your program is using the Intel Math Kernel Libraries (used implicitly by some python packages) or OpenMP threading. In those cases you should set
export MKL_NUM_THREADS=$NSLOTS
export OMP_NUM_THREADS=$NSLOTS
in your job script. For matlab jobs,
matlab -singleCompThread
should be used for single-slot jobs (the default). For parallel jobs, use something like
maxNumCompThreads(str2num(getenv('NSLOTS')))
For python scripts using thread pools with the concurrent futures module, limit the maximum number of worker threads when the pool executor is created with
executor = ThreadPoolExecutor(max_workers=int(os.getenv("NSLOTS")))
Useful Commands
- qstat -f - shows the status of the queues, including the job you just submitted
- qstat -f -u "*" - shows the status of all jobs of all users in all queues
- qstat -f -j 87835 - gives full information on job 87835 while it is in the queues.
- qacct -j 87835 - gives information on job 87835 after it has finished
- qdel -f 87835 - aborts job 87835
Specific projects for local groups
There are a few special project names allocated for groups to run under (rather than the individual user account), which allow the project's tokens to be raised if priority is needed. The list of projects can be found with "qconf -sprjl", and a project can be specified using the -P option to the qsub or qrsh command.
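For example (the project name here is hypothetical; use one listed by qconf -sprjl):
qsub -q all.q -P myproject myscript.sh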
Errors
If jobs are submitted with the
qsub -m a command, an abort mail is sent to the address specified with the
-M user[@host] option. The abort mail contains diagnosis information about job errors. Abort mail is the recommended source of information for users.
Otherwise, you can use "qacct -j job_id" to get some of the information after the fact.
Please notify the computer group of any errors that seem related to a specific node. If needed, you can exclude nodes from your jobs using the
-q option (see
man -M /nfs/sge/root/man qsub). You can also blacklist queues for jobs that have already been submitted using
qalter -q.
Links