nonstop.conf - Slurm configuration file for fault-tolerant computing.
nonstop.conf is an ASCII file which describes the configuration used for
fault-tolerant computing with Slurm using the optional slurmctld/nonstop
plugin. This plugin provides a means for users to notify Slurm of nodes they
believe are suspect, replace the job's failing or failed nodes, and extend a
job's time limit in response to failures. The file location can be modified at system
build time using the DEFAULT_SLURM_CONF parameter or at execution time by
setting the SLURM_CONF environment variable. The file will always be located
in the same directory as the slurm.conf file.
Parameter names are case insensitive. Any text following a
"#" in the configuration file is treated as a comment through the
end of that line. Changes to the configuration file take effect upon restart
of Slurm daemons, daemon receipt of the SIGHUP signal, or execution of the
command "scontrol reconfigure" unless otherwise noted. The
configuration parameters available include:
- BackupAddr
- Communications address used for the backup slurmctld daemon. This can either be a
hostname or IP address. This value would typically be the same as the
secondary SlurmctldHost in the slurm.conf file, when applicable.
- ControlAddr
- Communications address used for the slurmctld daemon. This can either be a
hostname or IP address. This value would typically be the same as the
SlurmctldHost in the slurm.conf file.
- Debug
- A number indicating the level of additional logging desired for the
plugin. The default value is zero, which generates no additional logging.
- HotSpareCount
- This identifies how many nodes in each partition should be maintained as
spare resources. When a job replaces a failed node, this pool of resources
is depleted and then replenished when possible using idle resources. The
value should be a comma-delimited list of pairs, each giving a partition
name and a node count separated by a colon (for example,
batch:6,interactive:0).
- MaxSpareNodeCount
- This identifies the maximum number of nodes any single job may replace
over the job's entire lifetime. Setting a limit can prevent a single job
from causing all of the nodes in a cluster to fail. By default, there is no
maximum node count.
- Port
- Port used for communications. The default value is 6820.
- TimeLimitDelay
- If a job requires replacement resources and none are immediately
available, then permit a job to extend its time limit by the length of
time required to secure replacement resources up to the number of minutes
specified by TimeLimitDelay. This option will only take effect if
no hot spare resources are available at the time replacement resources are
requested. This time limit extension is in addition to the value
calculated using TimeLimitExtend. The default value is zero (no
time limit extension). The value may not exceed 65533.
- TimeLimitDrop
- Specifies the number of minutes that a job can extend its time limit for
each failed or failing node removed from the job's allocation. The default
value is zero (no time limit extension). The value may not exceed 65533.
- TimeLimitExtend
- Specifies the number of minutes that a job can extend its time limit for
each replaced node. The default value is zero (no time limit extension).
The value may not exceed 65533.
- UserDrainAllow
- This identifies a comma-delimited list of user names or user IDs of users
who are authorized to drain nodes they believe are failing. Specify a
value of "ALL" to permit any user to drain nodes. By default, no
users may drain nodes using this interface.
- UserDrainDeny
- This identifies a comma-delimited list of user names or user IDs of users
who are NOT authorized to drain nodes they believe are failing. Specifying
a value for UserDrainDeny implicitly allows all other users to
drain nodes (sets the value of UserDrainAllow to "ALL").
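The parameter descriptions above can be illustrated with a small Python
sketch. It is not part of Slurm and does not reproduce the plugin's
implementation; the function names are hypothetical, and the way the three
time limit values are combined is one reading of the text above (a cap on
the waiting time plus a per-node allowance), not a confirmed formula.

```python
def parse_hot_spare_count(value):
    """Turn 'batch:6,interactive:0' into {'batch': 6, 'interactive': 0}."""
    spares = {}
    for pair in value.split(","):
        partition, _, count = pair.strip().partition(":")
        spares[partition] = int(count)
    return spares


def max_extension_minutes(wait_minutes, replaced_nodes, dropped_nodes,
                          delay=0, extend=0, drop=0):
    """Assumed upper bound on a job's time limit extension, in minutes.

    delay, extend and drop stand for TimeLimitDelay, TimeLimitExtend and
    TimeLimitDrop. All default to zero (no extension), as documented.
    """
    waited = min(wait_minutes, delay)  # waiting time is capped by TimeLimitDelay
    return waited + extend * replaced_nodes + drop * dropped_nodes


def may_drain(user, allow=(), deny=()):
    """Apply UserDrainAllow / UserDrainDeny to a single user name or ID.

    Names and IDs are compared as plain strings here; no lookup between
    user names and numeric IDs is attempted.
    """
    if deny:
        # A deny list implicitly sets UserDrainAllow to "ALL".
        return user not in deny
    return "ALL" in allow or user in allow
```

For example, `max_extension_minutes(45, 2, 1, delay=30, extend=20, drop=10)`
caps the 45 minute wait at 30 and adds 20 minutes for each of the two
replaced nodes and 10 for the dropped node, giving 80 minutes under these
assumptions.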
#
# Sample nonstop.conf file
# Date: 12 Feb 2013
#
ControlAddr=12.34.56.78
BackupAddr=12.34.56.79
Port=1234
#
HotSpareCount=batch:6,interactive:0
MaxSpareNodeCount=4
TimeLimitDelay=30
TimeLimitExtend=20
TimeLimitDrop=10
UserDrainAllow=adam,brenda
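The file syntax described earlier (Name=Value lines, case-insensitive
parameter names, "#" starting a comment that runs to the end of the line)
can be sketched as a short Python reader. This is an illustrative sketch
only, not the parser Slurm uses; the function name is hypothetical, and
letting a repeated parameter overwrite the earlier value is an assumption.

```python
def parse_nonstop_conf(text):
    """Return a dict of lowercased parameter names to string values."""
    params = {}
    for raw_line in text.splitlines():
        # Everything after "#" is a comment through the end of the line.
        line = raw_line.split("#", 1)[0].strip()
        if not line:
            continue
        name, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"expected Name=Value, got: {line!r}")
        # Parameter names are case insensitive, so normalize to lowercase.
        # Assumption: a repeated parameter overwrites the earlier value.
        params[name.strip().lower()] = value.strip()
    return params


sample = """\
# Sample nonstop.conf file
ControlAddr=12.34.56.78
Port=1234  # trailing comment
HOTSPARECOUNT=batch:6,interactive:0
"""
conf = parse_nonstop_conf(sample)
```

With this input, `conf["port"]` is the string "1234" and the upper-case
HOTSPARECOUNT line is stored under the key "hotsparecount".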
Copyright (C) 2013-2014 SchedMD LLC. All rights reserved.
Slurm is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
more details.