sdiag - Scheduling diagnostic tool for Slurm
sdiag shows information related to slurmctld execution: threads, agents,
jobs, and scheduling algorithms. The goal is to obtain data about slurmctld
behaviour that helps adjust configuration parameters or queue policies. The
main motivation is to understand Slurm behaviour on systems with high
throughput.
It has two execution modes. The default mode, --all, shows
several counters and statistics explained below; the other mode, --reset,
resets those values.
By default, the values are also reset at midnight UTC.
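As a minimal sketch of both modes (the default invocation is equivalent to
--all; resetting requires operator or administrator privileges, as noted
under --reset below):

    $ sdiag
    $ sdiag --reset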
The first block of information is related to global slurmctld
execution:
- Server thread count
- The number of currently active slurmctld threads. A high number indicates a
high load processing events such as job submission, dispatch, and
completion. If this is often close to MAX_SERVER_THREADS it could point to a
potential bottleneck.
- Agent queue size
- Slurm is designed with scalability in mind, and sending messages to
thousands of nodes is not a trivial task. The agent mechanism controls
communication between slurmctld and the slurmd daemons on a best-effort
basis. This value is the count of enqueued outgoing RPC requests on an
internal retry list.
- Agent count
- Number of agent threads. Each of these agent threads can in turn create a
group of up to 2 + AGENT_THREAD_COUNT active threads at a time.
- Agent thread count
- Total count of active threads created by all the agent threads.
- DBD Agent queue size
- Slurm queues up the messages intended for the SlurmDBD and processes them
in a separate thread. If the SlurmDBD, or database, is down then this
number will increase.
The maximum queue size is configured in slurm.conf with
MaxDBDMsgs (a configuration sketch follows this list). If this number grows
beyond half of the maximum queue size, the slurmdbd and the database should
be investigated immediately.
- Jobs submitted
- Number of jobs submitted since last reset.
- Jobs started
- Number of jobs started since last reset. This includes backfilled jobs.
- Jobs completed
- Number of jobs completed since last reset.
- Jobs canceled
- Number of jobs canceled since last reset.
- Jobs failed
- Number of jobs failed due to slurmd or other internal issues since last
reset.
- Job states ts:
- Lists the timestamp of when the following job state counts were gathered.
- Jobs pending:
- Number of jobs pending at the time of the above timestamp.
- Jobs running:
- Number of jobs running at the time of the above timestamp.
- Jobs running ts:
- Time stamp of when the running job count was taken.
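As a minimal sketch of the MaxDBDMsgs setting referenced above (the value
shown is illustrative only; see slurm.conf(5) for the default and valid
range):

    # slurm.conf (excerpt) -- illustrative value, not a recommendation
    MaxDBDMsgs=20000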
The next block of information is related to the main scheduling
algorithm, which is based on job priorities. A scheduling cycle acquires the
job_write_lock, then tries to allocate resources to pending jobs, starting
with the highest priority job and proceeding in descending priority order.
Once a job cannot obtain resources, the loop continues, but only for jobs
requesting other partitions. Jobs with dependencies or affected by account
limits are not processed. A configuration sketch follows the list below.
- Last cycle
- Time in microseconds for last scheduling cycle.
- Max cycle
- Maximum time in microseconds for any scheduling cycle since last reset.
- Total cycles
- Total run time in microseconds for all scheduling cycles since last reset.
Scheduling is performed periodically and (depending upon configuration)
when a job is submitted or a job is completed.
- Mean cycle
- Mean time in microseconds for all scheduling cycles since last reset.
- Mean depth cycle
- Mean of cycle depth. Depth means number of jobs processed in a scheduling
cycle.
- Cycles per minute
- Counter of scheduling executions per minute.
- Last queue length
- Length of jobs pending queue.
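Since these statistics are typically used to tune scheduling, a hedged
slurm.conf sketch is shown below. The parameters and values are illustrative
assumptions only; see the SchedulerParameters description in slurm.conf(5)
for the authoritative list and defaults:

    # slurm.conf (excerpt) -- illustrative values, not recommendations
    SchedulerParameters=sched_interval=60,default_queue_depth=100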
The next block of information is related to the backfilling scheduling
algorithm. A backfilling scheduling cycle acquires locks on the job, node
and partition objects, then tries to allocate resources to pending jobs.
Jobs are processed in priority order. If a job cannot get resources, the
algorithm calculates when it could get them, obtaining a future start time
for the job. The next job is then processed, and the algorithm tries to
allocate resources to it without affecting the previously planned jobs,
again calculating a future start time if no resources are currently
available. The backfilling algorithm takes more time for each new job it
processes, since higher priority jobs must not be affected. The algorithm
itself takes measures to avoid an excessively long execution cycle and to
avoid holding all the locks for too long.
- Total backfilled jobs (since last slurm start)
- Number of jobs started thanks to backfilling since last slurm start.
- Total backfilled jobs (since last stats cycle start)
- Number of jobs started thanks to backfilling since the last time stats were
reset. By default these values are reset at midnight UTC.
- Total backfilled heterogeneous job components
- Number of heterogeneous job components started thanks to backfilling since
last Slurm start.
- Total cycles
- Number of backfill scheduling cycles since last reset.
- Last cycle when
- Time when last backfill scheduling cycle happened in the format
"weekday Month MonthDay hour:minute.seconds year"
- Last cycle
- Time in microseconds of the last backfill scheduling cycle. It counts only
execution time, removing sleep time inside a scheduling cycle when it
executes for an extended period of time. Note that locks are released during
the sleep time so that other work can proceed.
- Max cycle
- Time in microseconds of the maximum backfill scheduling cycle execution
since the last reset. It counts only execution time, removing sleep time
inside a scheduling cycle when it executes for an extended period of time.
Note that locks are released during the sleep time so that other work can
proceed.
- Mean cycle
- Mean time in microseconds of backfilling scheduling cycles since last
reset.
- Last depth cycle
- Number of processed jobs during the last backfilling scheduling cycle. It
counts every job, even if that job cannot be started due to dependencies or
limits.
- Last depth cycle (try sched)
- Number of processed jobs during the last backfilling scheduling cycle. It
counts only jobs with a chance to start using available resources. These
jobs consume more scheduling time than jobs that are found to be unable to
start due to dependencies or limits.
- Depth Mean
- Mean count of jobs processed during all backfilling scheduling cycles
since last reset. Jobs which are found to be ineligible to run when
examined by the backfill scheduler are not counted (e.g. jobs submitted to
multiple partitions and already started, jobs which have reached a QOS or
account limit such as maximum running jobs for an account, etc).
- Depth Mean (try sched)
- The subset of Depth Mean that the backfill scheduler attempted to
schedule.
- Last queue length
- Number of jobs pending to be processed by the backfilling algorithm. A job is
counted once for each partition it is queued to use. A pending job array
will normally be counted as one job (tasks of a job array which have
already been started/requeued or individually modified will already have
individual job records and are each counted as a separate job).
- Queue length Mean
- Mean count of jobs pending to be processed by the backfilling algorithm. A job
is counted once for each partition it requested. A pending job array will
normally be counted as one job (tasks of a job array which have already
been started/requeued or individually modified will already have
individual job records and are each counted as a separate job).
- Last table size
- Count of different time slots tested by the backfill scheduler in its last
iteration.
- Mean table size
- Mean count of different time slots tested by the backfill scheduler.
Larger counts increase the time required for the backfill operation. The
table size is influenced by many scheduling parameters, including
bf_min_age_reserve, bf_min_prio_reserve, bf_resolution, and bf_window (a
configuration sketch follows this list).
- Latency for 1000 calls to gettimeofday()
- Latency of 1000 calls to the gettimeofday() syscall in microseconds, as
measured at controller startup.
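As a minimal sketch of the backfill-related SchedulerParameters mentioned
above (the values are illustrative assumptions only; see slurm.conf(5) for
defaults and exact semantics):

    # slurm.conf (excerpt) -- illustrative values, not recommendations
    SchedulerParameters=bf_window=2880,bf_resolution=300,bf_min_age_reserve=600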
The next blocks of information report the most frequently issued
remote procedure calls (RPCs), which are requests for the slurmctld daemon
to perform some action. The fourth block reports the RPCs issued by message
type. These RPC codes can be looked up in the Slurm source code, in the file
src/common/slurm_protocol_defs.h. The report includes the number of times
each RPC was invoked, the total time consumed by all of those RPCs, and the
average time per RPC in microseconds. The fifth block reports the RPCs
issued by user ID, the total number of RPCs each user has issued, the total
time consumed by all of those RPCs, and the average time per RPC in
microseconds. RPC statistics are collected for the life of the slurmctld
process unless explicitly cleared with --reset.
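For example, one way to look up a message type in the source tree
(REQUEST_JOB_INFO is just one of the message types defined in that header):

    $ grep -n 'REQUEST_JOB_INFO' src/common/slurm_protocol_defs.h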
The sixth block of information, labeled Pending RPC Statistics,
shows information about pending outgoing RPCs on the slurmctld agent queue.
The first section of this block shows types of RPCs on the queue and the
count of each. The second section shows up to the first 25 individual RPCs
pending on the agent queue, including the type and the destination host
list. This information is cached and only refreshed at 30 second
intervals.
- -a, --all
- Get and report information. This is the default mode of operation.
- -h, --help
- Print description of options and exit.
- -i, --sort-by-id
- Sort Remote Procedure Call (RPC) data by message type ID and user ID.
- -M, --cluster=<string>
- The cluster to issue commands to. Only one cluster name may be specified.
Note that the SlurmDBD must be up for this option to work properly.
- -r, --reset
- Reset scheduler and RPC counters to 0. Only supported for Slurm operators
and administrators.
- -t, --sort-by-time
- Sort Remote Procedure Call (RPC) data by total run time.
- -T, --sort-by-time2
- Sort Remote Procedure Call (RPC) data by average run time.
- --usage
- Print list of options and exit.
- -V, --version
- Print current version number and exit.
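For example, to report RPC data sorted by total or by average run time,
combining the options documented above:

    $ sdiag --sort-by-time
    $ sdiag --sort-by-time2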
Executing sdiag sends a remote procedure call to slurmctld. If
enough calls from sdiag or other Slurm client commands that send remote
procedure calls to the slurmctld daemon come in at once, it can result
in a degradation of performance of the slurmctld daemon, possibly
resulting in a denial of service.
Do not run sdiag or other Slurm client commands that send
remote procedure calls to slurmctld from loops in shell scripts or
other programs. Ensure that programs limit calls to sdiag to the
minimum necessary for the information you are trying to gather.
Some sdiag options may be set via environment variables. These
environment variables, along with their corresponding options, are listed
below. (Note: command line options will always override these settings.)
- SLURM_CLUSTERS
- Same as --cluster (see the example after this list).
- SLURM_CONF
- The location of the Slurm configuration file.
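For example, selecting a cluster through the environment rather than on the
command line (the cluster name "mycluster" is hypothetical):

    $ SLURM_CLUSTERS=mycluster sdiag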
Copyright (C) 2010-2011 Barcelona Supercomputing Center.
Copyright (C) 2010-2019 SchedMD LLC.
Slurm is free software; you can redistribute it and/or modify it
under the terms of the GNU General Public License as published by the Free
Software Foundation; either version 2 of the License, or (at your option)
any later version.
Slurm is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
more details.
sinfo(1), squeue(1), scontrol(1), slurm.conf(5),