- Please see ComputeFarmIntro for a brief introduction for newcomers to the CLASSE Compute Farm.
- Please see the main GridEngine wiki page for detailed instructions on using the CLASSE Compute Farm.
- This wiki page was prepared for the "the-more-you-know" CHESS talk on 13-Jun-2019.
Please see the Maximum Running and Queued Job Limits section of the GridEngine page.
REMINDER: How to open a terminal session on a CLASSE Linux System (e.g. lnx201).
Please use any of the following to initiate an lnx201 terminal session:
The SGE Manuals are available for our installation of Son of Grid Engine.
Checking on Farm and Job Status
To see the complete Farm load using qstat, please type:
qstat -f -u "*"
This command shows all jobs submitted by all users.
- Notice that the " * " character is interpreted by SGE as a wildcard, much like the Linux shell wildcard, and it can be used with all SGE commands. Quoting it keeps the shell from expanding it before SGE sees it.
To see all jobs submitted by all users to all the interactive.q nodes:
qstat -f -u "*" -q "*interactive*"
Either au or adu appearing under the states heading in the output of qstat denotes that a node/queue is DOWN, as in this example:
[amd275@lnx201 ~]$ qstat -f -u "*" -q *interactive*
queuename qtype resv/used/tot. load_avg arch states
---------------------------------------------------------------------------------
cesrta_ilc_interactive.q@ilc20 IP 0/0/64 3.04 lx-amd64
---------------------------------------------------------------------------------
chess_interactive.q@lnx335.cla IP 0/0/24 -NA- lx-amd64 adu
It's safe to assume that we already know about DOWN nodes, but if you notice a node/queue status change right in front of you, please send email to service-classe@cornell.edu.
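As a quick way to pick out DOWN queues from a long listing, here is a minimal sketch. It assumes the states value is the sixth whitespace-separated column, as in the example above. The function name down_queues is hypothetical and not part of SGE.

```shell
#!/bin/sh
# down_queues: read "qstat -f" output on stdin and print the names of
# queues whose states column is "au" or "adu" (i.e. DOWN).
# Assumption: states is the 6th whitespace-separated field, as in the
# example output shown on this page.
down_queues() {
    awk '$6 == "au" || $6 == "adu" { print $1 }'
}

# Usage on a Farm login node (not run here):
#   qstat -f -u "*" -q "*interactive*" | down_queues
```

Queues with an empty states column have only five fields, so they are skipped automatically.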
To see a table of cores/memory, both available and in use, of all Farm nodes, please type:
qhost
The " - " character in any column of the qhost output means that the node is NOT available to the queueing system. Some of the DOWN nodes listed are past failed nodes that are now just hostname placeholders for future node purchases.
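To list only the nodes that qhost reports as unavailable, something like the following sketch may help. It assumes the usual qhost layout: a HOSTNAME header line, a dashed separator, and a "global" summary row that always contains dashes. The function name unavailable_hosts is hypothetical.

```shell
#!/bin/sh
# unavailable_hosts: read "qhost" output on stdin and print each host
# that has a "-" in any column after the hostname.
# Assumptions: the header starts with HOSTNAME, the separator line is
# all dashes, and the "global" row should be ignored.
unavailable_hosts() {
    awk '
        $1 == "HOSTNAME" || $1 == "global" || $1 ~ /^-+$/ { next }
        {
            for (i = 2; i <= NF; i++) {
                if ($i == "-") { print $1; next }
            }
        }
    '
}

# Usage on a Farm login node (not run here):
#   qhost | unavailable_hosts
```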
To see the initial submission information about any job, please use the " -j " flag to qstat, followed by the JOB_ID number, e.g.:
qstat -j 3088984
This information can be very helpful when diagnosing "why is my job not running?" inquiries.
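The qstat -j output can be long, so here is a rough sketch that keeps only a few fields that tend to matter when a job is not running. The field names in the pattern (job_number, submission_time, owner, hard resource_list, hard_queue_list, parallel environment, scheduling info) are assumptions based on typical Grid Engine output, so adjust them to what your job actually prints. The function name job_summary is hypothetical.

```shell
#!/bin/sh
# job_summary: read "qstat -j JOB_ID" output on stdin and keep selected
# fields, plus any indented continuation lines that follow them.
# Assumption: each field starts at column 1 as "name:" and continuation
# lines begin with whitespace.
job_summary() {
    awk '
        /^(job_number|submission_time|owner|hard resource_list|hard_queue_list|parallel environment|scheduling info):/ {
            keep = 1; print; next
        }
        /^[^ \t]/ { keep = 0 }   # any other field ends the kept block
        keep { print }           # continuation of a kept field
    '
}

# Usage on a Farm login node (not run here):
#   qstat -j 3088984 | job_summary
```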
If you need to see how your batch job is running (using top, ps, pidstat, strace, etc.), please first type qstat, then use qrsh (NOT SSH) to log in to the node(s) running your job(s). So if your job is running in all.q@lnx326, then log in to lnx326 using:
qrsh -q all.q@lnx326
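If you have several running jobs, a small helper can turn the qstat listing into ready-to-paste qrsh commands. This is only a sketch. It assumes the default qstat column order (job-ID, prior, name, user, state, date, time, queue, slots), and it assumes that the short hostname is enough for qrsh, as in the example above. The function name qrsh_targets is hypothetical.

```shell
#!/bin/sh
# qrsh_targets: read plain "qstat" output on stdin and print one
# "qrsh -q QUEUE@HOST" command per queue instance with a running job.
# Assumptions: state is field 5, queue@host is field 8, and the host
# may carry a (possibly truncated) domain suffix that can be dropped.
qrsh_targets() {
    awk '$5 == "r" && $8 ~ /@/ {
        split($8, part, "@")         # part[1] = queue, part[2] = host
        split(part[2], host, "[.]")  # host[1] = short hostname
        print "qrsh -q " part[1] "@" host[1]
    }' | sort -u
}

# Usage on a Farm login node (not run here):
#   qstat | qrsh_targets
```

The helper only prints the commands and does not run them, so you can choose which node to visit.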
Other useful Farm commands can be found on the main GridEngine wiki; please see Useful Commands.
Notes on running jobs and use of the Farm
- SGE is a resource RESERVATION system, not a complete ENFORCEMENT system. At this time, users can still submit jobs that use more cores and memory than they RESERVED at job submission. We are continually working on improving ENFORCEMENT of user-requested resource limits. We will be moving to the "Slot" terminology for a group of resources (cores and memory).
- However, jobs requiring multiple cores ARE limited to a single core if the " -pe sge_pe " flag is NOT used. Please see the Job Execution Time wiki for an explanation of how the 48 hour runtime limit affects multi-core jobs.
- Using the CLASSE GRID Script (which automatically lands the user's session on an interactive Farm node), a user can set the number of cores needed. Click the box to the left of "More Power", then select a value from the "Slots" dialog; the numerical value corresponds to the number of cores requested.
- Increasing node availability: We are eager to purchase more Farm nodes as funding allows. Any projects or groups with available funds can purchase nodes and be given priority on those nodes to ensure they're available when needed.
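To illustrate the " -pe sge_pe " note above, here is a minimal sketch of a batch script that reserves more than one core. The job name and the slot count of 4 are illustrative assumptions. The -N and -cwd options and the NSLOTS variable are standard Grid Engine features, but please check the GridEngine page for the options recommended on the Farm.

```shell
#!/bin/bash
# Minimal multi-core job script (a sketch, not an official template).
#$ -N multicore_example   # job name (illustrative)
#$ -cwd                   # run from the submission directory
#$ -pe sge_pe 4           # reserve 4 slots (cores); 4 is an example value

# Grid Engine sets NSLOTS to the number of slots granted; fall back to 1
# so the script also runs outside the queueing system.
slots="${NSLOTS:-1}"
echo "Running on $(hostname) with ${slots} slot(s)"
```

If saved as, say, multicore_example.sh (a hypothetical file name), it would be submitted with qsub multicore_example.sh. Your program should then be told to use ${slots} threads or processes so that its usage matches the reservation.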