qgpureset - reset GPU error counts
qgpureset -H host -g gpuid -p -v
The qgpureset command asks a MOM to reset the ECC error counts on one
of its NVIDIA GPUs. The error counts are reset by sending a GPU
Control batch request to the batch server.
Resetting the GPU error counts requires PBS Operator or Manager privilege.
It also requires that Torque be configured with --enable-nvidia-gpus.
- -H host
- Specifies the host within the cluster on which the GPU is located. The
argument is the name of a host that is a member of the cluster of hosts
managed by the server.
- -g gpuid
- Specifies the ID of the GPU.
- -p
- Resets the GPU's permanent ECC error count.
- -v
- Resets the GPU's volatile ECC error count.
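As an illustration of how the options above combine, the following is a minimal sketch of a shell helper that resets both counts on one GPU, following the synopsis. The function name reset_gpu_ecc is hypothetical and not part of Torque, and the host name and GPU ID are placeholders for your own values.

```shell
# Hypothetical helper (not part of Torque): reset both ECC counts on one GPU.
# Arguments: $1 = host name, $2 = GPU ID.
reset_gpu_ecc() {
    host=$1
    gpuid=$2
    # -p resets the permanent count, -v resets the volatile count.
    qgpureset -H "$host" -g "$gpuid" -p -v
}
```

For example, assuming a host named node01 with a GPU whose ID is 0, the helper would be called as: reset_gpu_ecc node01 0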
The qgpureset command writes a diagnostic message to standard error for
each error occurrence.
Upon successful processing of all the operands presented to the
qgpureset command, the exit status will be a value of zero.
If the qgpureset command fails to process any operand, the command
exits with a value greater than zero.
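A script can branch on the exit status described above. The following is a minimal sketch; the wrapper name reset_and_report and its messages are hypothetical, and it relies only on the zero or greater-than-zero convention stated here.

```shell
# Hypothetical wrapper (not part of Torque): run qgpureset with the given
# options and report the outcome based on its exit status.
reset_and_report() {
    if qgpureset "$@"; then
        echo "reset succeeded"
    else
        # In the else branch, $? still holds the exit status of qgpureset.
        status=$?
        echo "reset failed with exit status $status" >&2
        return "$status"
    fi
}
```

For example, with placeholder host and GPU ID values: reset_and_report -H node01 -g 0 -v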
pbs_mom(8B) and pbs_server(8B)