acct_gather.conf - Slurm configuration file for the acct_gather plugins
acct_gather.conf is an ASCII file that defines parameters used by
Slurm's acct_gather plugins. The file is always located in the same
directory as the slurm.conf file, so its location changes with that of
slurm.conf: at system build time through the DEFAULT_SLURM_CONF parameter,
or at execution time through the SLURM_CONF environment variable.
Parameter names are case insensitive. Any text following a
"#" in the configuration file is treated as a comment through the
end of that line. The size of each line in the file is limited to 1024
characters. Changes to the configuration file take effect upon restart of
Slurm daemons, daemon receipt of the SIGHUP signal, or execution of the
command "scontrol reconfigure" unless otherwise noted.
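As an illustration of the syntax rules above, the following fragment is a minimal sketch. The value is arbitrary and chosen only for this example. It shows a full-line comment, a trailing comment, and a parameter name written in a different case.

```conf
# A full-line comment
EnergyIPMIFrequency=10    # text after "#" is ignored through the end of the line
# Parameter names are case insensitive, so a line such as the following
# would set the same parameter as the line above:
# energyipmifrequency=10
```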
The following acct_gather.conf parameters are defined to control
the general behavior of various plugins in Slurm.
The acct_gather.conf file differs from other Slurm .conf files in that
each plugin defines which options are available. If the plugin for an
option is not loaded, Slurm treats that option as unknown, which could
prevent Slurm from loading. If you change plugin types, you may also have
to change the related options.
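The relationship between a loaded plugin and its options can be sketched as follows. This example assumes the IPMI energy plugin is selected through the slurm.conf parameter AcctGatherEnergyType, which is not described in this page. The frequency value is illustrative.

```conf
# slurm.conf: select the plugin (assumed for this example)
AcctGatherEnergyType=acct_gather_energy/ipmi

# acct_gather.conf: use only options that belong to loaded plugins
EnergyIPMIFrequency=30
# An option of a plugin that is not loaded, for example ProfileHDF5Dir
# without acct_gather_profile/hdf5, would be treated as unknown.
```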
- EnergyIPMI
- Options used for acct_gather_energy/ipmi are as follows:
- EnergyIPMIFrequency=<number>
- This parameter is the number of seconds between BMC access samples.
- EnergyIPMICalcAdjustment=<yes|no>
- If set to "yes", the consumption between the last BMC access
sample and a step consumption update is approximated to get more accurate
task consumption. The adjustment is made at the step start and each time
the consumption is updated, including the step end. The approximations are
not accumulated; only the first and last adjustments are used to
calculate the consumption. The default is "no".
- EnergyIPMIPowerSensors=<key=values>
- Optionally specify the IDs of the sensors to use. Multiple
<key=values> pairs can be set with ";" separators. The key
"Node" is mandatory and is used to determine the consumed energy for
nodes (scontrol show node) and jobs (sacct). Other keys are optional and
are named by the administrator. These keys are useful only when profiling is
activated for energy, to store the power (in watts) of each key. <values>
are integers; multiple values can be set with "," separators.
The sum of the listed sensors is used for each key. EnergyIPMIPowerSensors
is optional; the default value is "Node=number", where
"number" is the ID of the first power sensor returned by
ipmi-sensors.
Examples:
EnergyIPMIPowerSensors=Node=16,19,23,26;Socket0=16,23;Socket1=19,26;SSUP=23,26;KNC=16,19
EnergyIPMIPowerSensors=Node=29,32;SSUP0=29;SSUP1=32
EnergyIPMIPowerSensors=Node=1280
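To make the per-key summation concrete, the second example above can be read as annotated below. This is only a sketch: the sensor IDs are those of the example, and the IDs on a given node are whatever ipmi-sensors reports there.

```conf
# Node  = sensor 29 + sensor 32  (energy reported for nodes and jobs)
# SSUP0 = sensor 29              (administrator-named key, stored only
#                                 when energy profiling is active)
# SSUP1 = sensor 32              (administrator-named key, stored only
#                                 when energy profiling is active)
EnergyIPMIPowerSensors=Node=29,32;SSUP0=29;SSUP1=32
```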
The following acct_gather.conf parameters are defined to control
the IPMI config default values for libipmiconsole.
- EnergyIPMIUsername=USERNAME
- Specify BMC Username.
- EnergyIPMIPassword=PASSWORD
- Specify BMC Password.
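A minimal sketch of the credential settings follows. Both values are placeholders invented for this example and must be replaced with the account actually configured on the BMC.

```conf
# Placeholder credentials (hypothetical values)
EnergyIPMIUsername=bmcuser
EnergyIPMIPassword=bmcpassword
```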
- EnergyXCC
- The acct_gather_energy/xcc plugin uses only in-band communication
with the XClarity Controller, so only a reduced set of options is
supported:
- EnergyIPMIFrequency=<number>
- This parameter is the number of seconds between XCC access samples.
Default is 30 seconds.
- EnergyIPMITimeout=<number>
- Timeout, in seconds, for initializing the IPMI XCC context for a new
gathering thread. Default is 10 seconds.
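A possible XCC setup is sketched below, with both options set explicitly to their documented defaults. The slurm.conf line is an assumption about how the plugin is selected and is not described in this page.

```conf
# slurm.conf: select the plugin (assumed for this example)
AcctGatherEnergyType=acct_gather_energy/xcc

# acct_gather.conf
EnergyIPMIFrequency=30   # seconds between XCC access samples (default)
EnergyIPMITimeout=10     # seconds to initialize the IPMI XCC context (default)
```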
- ProfileHDF5
- Options used for acct_gather_profile/hdf5 are as follows:
- ProfileHDF5Dir=<path>
- This parameter is the path to the shared folder into which the
acct_gather_profile plugin will write detailed data (usually as an HDF5
file). The directory is assumed to be on a file system shared by the
controller and all compute nodes. This is a required parameter.
- ProfileHDF5Default
- A comma-delimited list of data types to be collected for each job
submission. Allowed values are:
- All
- All data types are collected. (Cannot be combined with other values.)
- None
- No data types are collected. This is the default. (Cannot be combined with
other values.)
- Energy
- Energy data is collected.
- Filesystem
- File system (Lustre) data is collected.
- Network
- Network (InfiniBand) data is collected.
- Task
- Task (I/O, Memory, ...) data is collected.
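The HDF5 options above might be combined as in the following sketch. The directory is the one used in the example file of this page. The slurm.conf line is an assumption about how the plugin is selected, and the choice of Energy and Task is illustrative.

```conf
# slurm.conf: select the plugin (assumed for this example)
AcctGatherProfileType=acct_gather_profile/hdf5

# acct_gather.conf
# The directory must be on a file system shared by the controller
# and all compute nodes.
ProfileHDF5Dir=/app/slurm/profile_data
# All and None cannot be combined with other values.
ProfileHDF5Default=Energy,Task
```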
- ProfileInfluxDB
- Options used for acct_gather_profile/influxdb are as follows:
- ProfileInfluxDBDatabase
- InfluxDB database name where profiling information is to be written.
- ProfileInfluxDBDefault
- A comma-delimited list of data types to be collected for each job
submission. Allowed values are:
- All
- All data types are collected. (Cannot be combined with other values.)
- None
- No data types are collected. This is the default. (Cannot be combined with
other values.)
- Energy
- Energy data is collected.
- Filesystem
- File system (Lustre) data is collected.
- Network
- Network (InfiniBand) data is collected.
- Task
- Task (I/O, Memory, ...) data is collected.
- ProfileInfluxDBHost=<hostname>:<port>
- The hostname of the machine where the influxd instance is executed and the
port used by the HTTP API. The port used by the HTTP API is the one
configured through the bind-address influxdb.conf option in the [http]
section. Example:
ProfileInfluxDBHost=myinfluxhost:8086
- ProfileInfluxDBPass
- Optional password for username configured in ProfileInfluxDBUser.
- ProfileInfluxDBRTPolicy
- The InfluxDB retention policy name for the database configured in
ProfileInfluxDBDatabase option.
- ProfileInfluxDBUser
- Optional InfluxDB username that should be used to gain access to the
database configured in ProfileInfluxDBDatabase. This is only needed if
InfluxDB is configured with authentication enabled in the [http] config
section and a user has been granted at least WRITE access to the database.
See also ProfileInfluxDBPass.
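The InfluxDB options above could be put together as in this sketch. The host is taken from the ProfileInfluxDBHost example. The database name, retention policy name, user and password are placeholders invented for illustration, and the slurm.conf line is an assumption about how the plugin is selected.

```conf
# slurm.conf: select the plugin (assumed for this example)
AcctGatherProfileType=acct_gather_profile/influxdb

# acct_gather.conf (database, policy and credentials are placeholders)
ProfileInfluxDBHost=myinfluxhost:8086
ProfileInfluxDBDatabase=slurm_profile
ProfileInfluxDBRTPolicy=autogen
ProfileInfluxDBDefault=Task
# Needed only if InfluxDB authentication is enabled
ProfileInfluxDBUser=slurm
ProfileInfluxDBPass=secret
```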
- NOTE:
- This plugin requires the libcurl development files to be installed.
- NOTE:
- Information on how to install and configure InfluxDB and manage databases,
retention policies and such is available on the official InfluxDB website.
- NOTE:
- Collected information is written from every compute node where a job runs
to the influxd instance listening on the ProfileInfluxDBHost. In order to
avoid overloading the influxd instance with incoming connection requests,
the plugin uses an internal buffer which is filled with samples. Once the
buffer is full, an HTTP API write request is performed and the buffer is
emptied to hold subsequent samples. A final request is also performed when
a task ends even if the buffer isn't full.
- NOTE:
- Failed HTTP API write requests are discarded. This means that collected
profile information in the plugin buffer is lost if it can't be written to
the influxd database for any reason.
- NOTE:
- Plugin messages are logged along with the slurmstepd logs to
SlurmdLogFile. In order to troubleshoot any issues, it is recommended to
temporarily increase the slurmd debug level to debug3 and add Profile to
the debug flags. This can be accomplished by setting the slurm.conf
SlurmdDebug and DebugFlags parameters, respectively, or dynamically
through scontrol setdebug and setdebugflags.
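The slurm.conf variant of the troubleshooting settings described above might look like the following sketch. It is intended as a temporary change, to be reverted once the issue is understood.

```conf
# slurm.conf: temporary settings for troubleshooting the profile plugin
SlurmdDebug=debug3
DebugFlags=Profile
```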
- NOTE:
- Consider using a monitoring and analytics tool such as Grafana on top
of InfluxDB. Tools of this kind let you create dashboards, tables, and
other graphics from the stored time series. This makes it easier to
correlate resource usage peaks reported by other node monitoring tools,
such as Ganglia, with specific job step tasks.
- InfinibandOFED
- Options used for acct_gather_interconnect/ofed are as follows:
- InfinibandOFEDPort=<number>
- This parameter is the port number of the local InfiniBand card
to monitor. The default port is 1.
###
# Slurm acct_gather configuration file
###
# Parameters for acct_gather_energy/ipmi plugin
EnergyIPMIFrequency=10
EnergyIPMICalcAdjustment=yes
#
# Parameters for acct_gather_profile/hdf5 plugin
ProfileHDF5Dir=/app/slurm/profile_data
# Parameters for acct_gather_interconnect/ofed plugin
InfinibandOFEDPort=1
Copyright (C) 2012-2013 Bull. Produced at Bull (cf, DISCLAIMER).
This file is part of Slurm, a resource management program. For
details, see <https://slurm.schedmd.com/>.
Slurm is free software; you can redistribute it and/or modify it
under the terms of the GNU General Public License as published by the Free
Software Foundation; either version 2 of the License, or (at your option)
any later version.
Slurm is distributed in the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for
more details.