Grid Engine
Please see ComputeFarmIntro for a brief introduction for newcomers to the CLASSE Compute Farm.
Introduction
The CLASSE Compute Farm uses the
Son of Grid Engine - SGE batch queuing system. Interactive, batch, and parallel compute jobs can be launched from CLASSE Linux machines. All CLASSE batch nodes run 64-bit SL7 with Intel processors.
You should find using the Compute Farm to be faster, more reliable, easier, and more efficient than using individual desktops.
Maximum Running and Queued Job Limits
It is important to remember that the batch queues, our network, and remote filesystems are all shared resources available for use by anyone in our lab. In this spirit, each user is limited to 120 jobs running at once across all queues. If you submit more than this number of jobs (and sufficient queue slots are available), 120 jobs will start running and the rest will be held until some of your running jobs finish. The queued job limit is 1000.
Summary:
- The maximum number of concurrent jobs that any user can run, total in all queues, is now 120.
- The maximum number of concurrent queued jobs that any user can queue up for execution, total in all queues, is now 1000.
These limits will be increased as execution nodes are added into the Compute Farm.
To see what the maximum running job limit is, please type
qconf -ssconf | grep maxujobs
To see what the maximum queued job limit is, please type
qconf -sconf | grep max_u_jobs
LINK CURRENTLY DOWN - The
SGE Manuals are available for our installation of
Son of Grid Engine. Older documentation can be found at
older SGE Manuals. Documentation guaranteed to match our installed version can be found with the command
man -M /nfs/sge/root/man sge_intro and commands like
man -M /nfs/sge/root/man qsub. Note that a simple
man qsub will likely bring up the documentation for a generic POSIX-compliant qsub, not the one specific to our installation.
Son of Grid Engine source code for version 8.1.9
The
Son of Grid Engine source code is found here on the
SGE github site.
A second site, with AlmaLinux 9.2 build support, is found here.
Job Submission
There are two basic steps to submitting a job on the CLASSE Compute Farm:
- Create a shell script containing the commands to be run (examples below).
- Submit this script to the Compute Farm using the qsub command.
For example, if you created a shell script called
myscript.sh
, then you would submit that script with the command
qsub -q all.q myscript.sh
Then, you can monitor the status of your job using the
qstat command (see Useful Commands below).
Creating and submitting a batch job script
If your shell script is named "shell_script", it would be submitted to the Compute Farm using
qsub -q all.q shell_script
(if "shell_script" is not in your PATH, you must use the full path:
qsub /path/to/shell_script
). See
man -M /nfs/sge/root/man qsub for more information and examples. Any shell script can be run on the Compute Farm. For example:
#!/bin/bash
echo Running on host: `hostname`.
echo Starting on: `date`.
sleep 10
echo Ending on: `date`.
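As a minimal sketch, if this script were saved as myscript.sh (the name is illustrative), a typical submit-and-monitor sequence would be:
qsub -q all.q myscript.sh
qstat -u $USER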
In addition, arguments to the Grid Engine scheduler can be included in the shell script. For example:
#$ -q all.q
#$ -S /bin/bash
#$ -l mem_free=8G
/path/to/program.exe
- The lines starting with
#$
are used to pass arguments to the qsub command and are not executed as part of the job (see Environment Setup below).
- Instead of including these lines in the script, one could have issued the command
qsub -q all.q -S /bin/bash -l mem_free=8G shell_script
.
After submitting with
qsub
, the job will be assigned a number (e.g. 87835). Unless specified otherwise, the standard output will be written to a file called shell_script.o87835 and the error output to a file called shell_script.e87835, both in your home directory.
Environment Setup
To properly set up one's environment, please be sure to source any project- or program-related setup scripts in your batch script. For example, CESR users need to source CESRDEFS or acc_vars.sh in their batch scripts, and CMS users source the CMS setup scripts. To ensure portability across multiple shell environments, please include a shell identifier line (shebang) at the top of your batch script identifying the shell your script is written in.
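As a minimal sketch (the setup script and program paths are illustrative; substitute your own project's setup script and executable):
#!/bin/bash
#$ -S /bin/bash
# Source your project's environment setup (example path only)
source /path/to/acc_vars.sh
/path/to/program.exe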
Filesystem usage
You can improve performance (and avoid negatively impacting others) by making use of local /tmp filesystems on the client nodes (100GB, or as large as allowed by the Compute Farm node's disks - please see
TemDisk for automatic deletion policies of files in /tmp). Wherever possible, copy files to and from these local filesystems at the beginning and end of your jobs to avoid continuously tying up network resources. Of course, it also helps to delete files from the tmp filesystems when your jobs complete. To view the amount of free, total, and used tmp space on each Compute Farm member, use
qhost -F tmp_free,tmptot,tmpused
. To schedule a job on a node with at least 90G free in /tmp, for example, use
qsub -l tmp_free=90G myscript.sh
.
Don't forget that opening and closing files has a lot of disk-bound overhead in order to create entries in directories. A single file is written much more quickly than multiple files containing the same number of bytes.
Also, writing files to a local disk increases throughput substantially. An NFS write operation is not allowed to start until the previous output has been written successfully to the remote disk.
Finally, if you are submitting a set of particularly data intensive jobs to the queue, we recommend using the
-hold_jid option to limit the number of jobs you have running simultaneously.
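For example, a minimal sketch (the job and script names are illustrative) that keeps a second data-intensive job queued until the first has finished:
qsub -q all.q -N stage1 myscript.sh
# stage2 will not start until stage1 has completed
qsub -q all.q -N stage2 -hold_jid stage1 myscript.sh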
Template Script
This is an example script that you can copy and modify for your needs. It has a couple of variables you can set to copy files in at the beginning of a job and back out at the end, following the recommended best practices above.
#!/bin/bash
#This is a template script for use in the batch queueing system. Please only edit the user sections.
LOCAL_DIR=${TMPDIR}
################USER SETTINGS###########################
#You can edit the variables here, but something valid must be present.
# SOURCE_DIR defaults to a files subdirectory of the directory you submitted the job from
#Comment out to not import any files for the job. Edit for your specific needs.
SOURCE_DIR=${SGE_O_WORKDIR}/files
DESTINATION_DIR=/cdat/tem/sge/${USER}/${JOB_ID}
########################################################
if [ ! -z "${SOURCE_DIR}" ]; then
rsync -avz ${SOURCE_DIR}/ ${LOCAL_DIR}
fi
#Put your code in the user section below. You can delete the entire
#section between USER SECTION and END USER SECTION - it is a very simple
#example script that does a loop and echos some job data for testing.
#################USER SECTION###########################
for i in {1..10}
do
echo `date` >> $LOCAL_DIR/myfile
sleep 1
done
echo "TMPDIR ${TMPDIR}" > $LOCAL_DIR/test
echo "JOB_ID ${JOB_ID}" >> $LOCAL_DIR/test
echo "USER ${USER}" >> $LOCAL_DIR/test
#################END USER SECTION#######################
mkdir -p ${DESTINATION_DIR}
ls ${DESTINATION_DIR}
rsync -avz ${LOCAL_DIR}/ ${DESTINATION_DIR}
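If the template is saved as, for example, template.sh (the name is illustrative), it is submitted like any other batch script:
qsub -q all.q template.sh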
Parallel Jobs
If you want to submit a parallel job that uses multiple threads on a single system, specify
-pe sge_pe number_of_threads
(for example,
qsub -q all.q -pe sge_pe 12)
To spread your job across as many machines as possible (for jobs that have little or no communication requirements) use
-pe sge_pe_rr 100 (for 100 threads). This will distribute your jobs in "round robin" fashion across all the nodes in the Compute Farm. To spread your job over fewer systems (but potentially more than one system), use
-pe sge_pe_fillup 100 which will fill up all the available slots of each system allocated for the job.
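As a sketch of a complete parallel job script (the program path is illustrative), using the NSLOTS variable that Grid Engine sets to the number of granted slots (see Execution Time below):
#!/bin/bash
#$ -S /bin/bash
#$ -q all.q
#$ -pe sge_pe 12
# Limit threading libraries to the number of granted slots
export OMP_NUM_THREADS=$NSLOTS
export MKL_NUM_THREADS=$NSLOTS
/path/to/threaded_program.exe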
Submitting a Mathematica Job
A somewhat more elaborate example of submitting a Mathematica job to run in Grid Engine can be seen at
SubmitMathematicaJob.
Submitting a Matlab Job
In your job script, please use the following syntax to run Matlab in batch (replace YOURINPUTFILE with the full path of your own matlab file with the ".m" file extension):
#$ -cwd
#$ -q all.q
ulimit -S -s 8192
/usr/local/bin/matlab -singleCompThread -nosplash -nodisplay -r "run('YOURINPUTFILE');exit;"
For more on
MatLAB, including instructions for specifying other versions of matlab, please see
MatLAB. For multithreaded matlab jobs, see the section below on execution time.
Submitting a CUDA Job
To submit a job that uses one cuda resource, add
-l cuda_free=1
to your qsub or qrsh command (where "l" is a lowercase L). For example:
-
qsub -l cuda_free=1 myjob.sh
For details on programming
CUDA, see
CUDA.
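The same resource request can also be embedded in the job script itself (the program path is illustrative):
#!/bin/bash
#$ -S /bin/bash
#$ -q all.q
#$ -l cuda_free=1
/path/to/cuda_program.exe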
Interactive Jobs -- qrsh
Scripted from Windows
We have a script
CLASSE GRID that simplifies some of this.
Manual from Linux
The best way to start interactive jobs using Grid Engine is to use the command
qrsh
. After typing
qrsh
, you will be logged into a node as soon as a slot is available. Once you're logged in, you can run interactive processes as needed, such as
mathematica,
matlab, etc.
You can also specify the command or program you'd like to run in the qrsh command. For example, to run
matlab interactively on the Compute Farm nodes reserved for interactive use, type:
qrsh -q interactive.q matlab -desktop
When submitting an interactive job, it is very helpful to specify the number of threads you expect to use. For example, to start an interactive job that uses 12 threads on a single system, type the command
qrsh -pe sge_pe 12
Please see
Resource Requirements,
Parallel Jobs, and
Useful options for more information.
-
qrsh
by itself uses all.q
and will go to any available Compute Farm node with the same priority as batch jobs.
-
qrsh -q interactive.q
will go to interactive.q
which contains nodes allocated for interactive use (a subset of all Compute Farm nodes). On those nodes, interactive jobs have priority over batch submissions.
- To end a
qrsh
session, type logout
or exit
, which returns you to the parent session (where qrsh
was invoked from).
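For example, to start an interactive session on an interactive node with 4 threads reserved (the thread count is illustrative):
qrsh -q interactive.q -pe sge_pe 4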
Resource Requirements (specifying which Compute Farm nodes a job is executed on)
To view the available resources (number of processor cores, amount of free memory, architecture type, etc) of each Compute Farm node, use the command
qhost
. If your job requires more resources (memory, CPUs, etc.) than are available on the Compute Farm node it executes on, the kernel will spend much of its time swapping and paging, which will severely degrade the performance of the system.
If you know your jobs are going to need a certain amount of memory or number of CPUs, you can specify the attributes of the Compute Farm nodes to execute a job on by using the "-l" flag (lowercase L) followed by any
Resource Attribute. Available attributes can be listed by typing
qconf -sc
. Multiple "-l" flags can be used in a single submission, and the "-l" flag can be used with either the
qsub
or
qrsh
commands.
Examples of some of the most useful queries are:
- qhost -F tmp_free,tmp_total,tmp_used,cuda_free,cudatot,cudaused will also show the free, total, and used /tmp space and cuda resources for each Compute Farm member.
- qhost -l mem_free=6G will show all the nodes that satisfy that criterion.
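As a sketch of combining multiple resource requests in a single submission (the values are illustrative):
qsub -q all.q -l mem_free=8G -l tmp_free=50G myscript.sh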
Useful options
See
man -M /nfs/sge/root/man qsub for more information and examples. All options can be used with either
qsub
or
qrsh
commands.
| Option Example | Explanation |
| -q all.q | Access all available batch nodes in the Farm |
| -N logname | Set the output file name. In this example, logname.o87835 |
| -j y | Combine the error and output streams into a single file (for example, logname.o87835) |
| -m a -M user@cornell.edu | Send an e-mail if the job is aborted |
| -m b -M user@cornell.edu | Send an e-mail at the beginning of the job's execution |
| -m e -M user@cornell.edu | Send an e-mail when the job ends |
| -e /PATH/TO/DIRECTORY | Specify where the error log will be saved |
| -o /PATH/TO/DIRECTORY | Specify where the output log will be saved |
| -o `pwd` -e `pwd` | Direct both the output log and the error log into the directory the job(s) was submitted from |
| -S /bin/bash | Specify that your program will run using bash. This will fix warnings like "no access to tty (Bad file descriptor)". |
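These options can be combined in a single submission. For example (the email address and output name are illustrative):
qsub -q all.q -N mylog -j y -m e -M user@cornell.edu -o `pwd` myscript.sh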
Execution Time
Each job run under Grid Engine is limited to two days (48 hours) of wall clock time. CPU time is limited to 48 hours per core. This time limit is the result of much negotiation and is unlikely to be changed, since the queues have to be shared by many people. Jobs which require more than 48 hours must be split into jobs which complete in less time. If your job terminates unexpectedly in fewer than 48 hours, it may have exceeded the time that you specified with a
ulimit
command (
ulimit -t
), or it may have used more cores than you requested. Use of more cores than you requested can happen by accident if your program is using the Intel Math Kernel Libraries (used implicitly by some python packages) or OpenMP threading. In those cases you should set
export MKL_NUM_THREADS=$NSLOTS
export OMP_NUM_THREADS=$NSLOTS
in your job script. For matlab jobs,
matlab -singleCompThread
should be used for single-slot jobs (the default). For parallel jobs, use something like
maxNumCompThreads(str2num(getenv('NSLOTS')))
For python scripts using thread pools with the concurrent futures module, limit the maximum number of worker threads when the pool executor is created with
executor = ThreadPoolExecutor(max_workers=int(os.getenv("NSLOTS")))
Useful Commands
- qstat -f - shows the status of the queues, including the job you just submitted
- qstat -f -u "*" - shows the status of all jobs of all users in all queues
- qstat -f -j 87835 - gives full information on job 87835 while it is in the queues.
- qacct -j 87835 - gives information on job 87835 after it has finished
- qdel -f 87835 - aborts job 87835
Specific projects for local groups
There are a few special project names allocated for groups to run under (rather than the individual user account), which allow the project's tokens to be raised if priority is needed. The list of projects can be found with "qconf -sprjl", and a project can be specified using the -P option to the qsub or qrsh command.
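For example (the project name here is hypothetical; use one listed by qconf -sprjl):
qsub -q all.q -P myproject myscript.sh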
Errors
If jobs are submitted with the
qsub -m a command, an abort mail is sent to the address specified with the
-M user[@host] option. The abort mail contains diagnosis information about job errors. Abort mail is the recommended source of information for users.
Otherwise, you can use "qacct -j job_id" to get some of the information after the fact.
Please notify the computer group of any errors that seem related to a specific node. If needed, you can exclude nodes from your jobs using the
-q option (see
man -M /nfs/sge/root/man qsub). You can also blacklist queues for jobs that have already been submitted using
qalter -q.
Links