PT-STALK(1)    User Contributed Perl Documentation    PT-STALK(1)
pt-stalk - Collect forensic data about MySQL when problems occur.
Usage: pt-stalk [OPTIONS]
pt-stalk waits for a trigger condition to occur, then collects
data to help diagnose problems. The tool is designed to run as a daemon with
root privileges, so that you can diagnose intermittent problems that you
cannot observe directly. You can also use it to execute a custom command, or
to collect data on demand without waiting for the trigger to occur.
Percona Toolkit is mature, proven in the real world, and well tested, but all
database tools can pose a risk to the system and the database server. Before
using this tool, please:
- Read the tool's documentation
- Review the tool's known "BUGS"
- Test the tool on a non-production server
- Back up your production server and verify the backups
Sometimes a problem happens infrequently and for a short time, giving you no
chance to see the system when it happens. How do you solve intermittent MySQL
problems when you can't observe them? That's why pt-stalk exists. In addition
to using it when there's a known problem on your servers, it is a good idea to
run pt-stalk all the time, even when you think nothing is wrong. You will
appreciate the data it collects when a problem occurs, because problems such
as MySQL lockups or spikes in activity typically leave no evidence to use in
root cause analysis.
pt-stalk does two things: it watches a MySQL server and waits for
a trigger condition to occur, and it collects diagnostic data when that
trigger occurs. To avoid false positives caused by short-lived problems, the
trigger condition must be true at least "--cycles" times before a
"--collect" is triggered.
To use pt-stalk effectively, you need to define a good trigger. A
good trigger is sensitive enough to fire reliably when a problem occurs, so
that you don't miss a chance to solve problems. On the other hand, a good
trigger isn't prone to false positives, so you don't gather information when
the server is functioning normally.
The most reliable triggers for MySQL tend to be the number of
connections to the server, and the number of queries running concurrently.
These are available in the SHOW GLOBAL STATUS command as Threads_connected
and Threads_running. Sometimes Threads_connected is not a reliable indicator
of trouble, but Threads_running usually is. Your job, as the tool's user, is
to define an appropriate trigger condition for the tool. Choose carefully,
because the quality of your results will depend on the trigger you
choose.
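Before choosing a trigger, it helps to look at the counters themselves. The following is a minimal sketch that pulls one value out of SHOW GLOBAL STATUS output. The helper name status_value is hypothetical, and the sketch assumes the tab-separated name/value rows that the mysql client prints in batch mode.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read "name<TAB>value" rows on stdin and print
# the value of the named status counter.
status_value() {
    awk -F '\t' -v name="$1" '$1 == name { print $2 }'
}

# Typical use (needs a reachable server, so it is commented out here):
#   mysql -N -e "SHOW GLOBAL STATUS" | status_value Threads_running

# Demonstration with canned input:
printf 'Threads_connected\t12\nThreads_running\t3\n' | status_value Threads_running
```

Sampling the counter this way during normal load gives a baseline for picking a threshold.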
You define the trigger with the "--function",
"--variable", "--threshold", and "--cycles"
options. The default values for these options define a reasonable trigger,
but you should adjust or change them to suit your particular system and
needs.
By default, pt-stalk watches MySQL forever until the trigger
occurs, then it collects diagnostic data for a while, and sleeps afterwards
to avoid repeatedly collecting data if the trigger remains true. The general
order of operations is:
while true; do
   if --variable from --function > --threshold; then
      cycles_true++
      if cycles_true >= --cycles; then
         --notify-by-email
         if --collect; then
            if --disk-bytes-free and --disk-pct-free ok; then
               (--collect for --run-time seconds) &
            fi
            rm files in --dest older than --retention-time
         fi
         iter++
         cycles_true=0
      fi
      if iter < --iterations; then
         sleep --sleep seconds
      else
         break
      fi
   else
      if iter < --iterations; then
         sleep --interval seconds
      else
         break
      fi
   fi
done
rm old --dest files older than --retention-time
if --collect processes are still running; then
   wait up to --run-time * 3 seconds
   kill any remaining --collect processes
fi
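The counting part of the loop above can be tried out in isolation. The sketch below is not the tool's code: simulate_trigger is a hypothetical function that takes a threshold, a cycle count, and a list of sampled values, and reports at which check a collection would start. It follows the pseudocode literally, so a non-matching check leaves the counter untouched; whether the tool itself resets the counter in that case is not stated here. Sleeping, iterations, and the collection itself are left out.

```shell
#!/usr/bin/env bash
# Hypothetical helper: simulate the trigger counter from the pseudocode.
# Usage: simulate_trigger THRESHOLD CYCLES VALUE...
# Prints "collect at check N" each time the counter reaches CYCLES.
simulate_trigger() {
    local threshold=$1 cycles=$2
    shift 2
    local cycles_true=0 check=0 value
    for value in "$@"; do
        check=$((check + 1))
        if [ "$value" -gt "$threshold" ]; then
            cycles_true=$((cycles_true + 1))
            if [ "$cycles_true" -ge "$cycles" ]; then
                echo "collect at check $check"
                cycles_true=0
            fi
        fi
        # As in the pseudocode, a non-matching check changes nothing.
    done
}

# Threshold 25, three matching checks required, seven sampled values.
simulate_trigger 25 3 30 31 10 40 50 60 70
```

With these inputs the sketch reports a collection at check 4 and another at check 7.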
The diagnostic data is written to files whose names begin with a
timestamp, so you can distinguish samples from each other in case the tool
collects data multiple times. The pt-sift tool is designed to help you
browse and analyze the resulting data samples.
Although this sounds simple enough, in practice there are a number
of subtleties, such as detecting when the disk is beginning to fill up so
that the tool doesn't cause the server to run out of disk space. This tool
handles these types of potential problems, so it's a good idea to use this
tool instead of writing something from scratch and possibly experiencing
some of the hazards this tool is designed to avoid.
You can use standard Percona Toolkit configuration files to set command line
options.
You will probably want to run the tool as a daemon and customize
at least the "--threshold". Here's a sample configuration file for
triggering when there are more than 20 queries running at once:
daemonize
threshold=20
If you don't run the tool as root, then you will need to specify
several options, such as "--pid", "--log", and
"--dest"; otherwise the tool will probably fail to start.
- --ask-pass
- Prompt for a password when connecting to MySQL.
- --collect
- default: yes; negatable: yes
Collect diagnostic data when the trigger occurs. Specify
"--no-collect" to make the tool watch
the system but not collect data.
See also "--stalk".
- --collect-gdb
- Collect GDB stacktraces. This is achieved by attaching to MySQL and
printing stack traces from all threads. This will freeze the server for
some period of time, ranging from a second or so to much longer on very
busy systems with a lot of memory and many threads in the server. For this
reason, it is disabled by default. However, if you are trying to diagnose
a server stall or lockup, freezing the server causes no additional harm,
and the stack traces can be vital for diagnosis.
In addition to freezing the server, there is also some risk of
the server crashing or performing badly after GDB detaches from it.
- --collect-oprofile
- Collect oprofile data. This is achieved by starting an oprofile session,
letting it run for the collection time, and then stopping and saving the
resulting profile data in the system's default location. Please read your
system's oprofile documentation to learn more about this.
- --collect-strace
- Collect strace data. This is achieved by attaching strace to the server,
which will make it run very slowly until strace detaches. The same
cautions apply as those listed in --collect-gdb. You should not enable
this option together with --collect-gdb, because GDB and strace can't
attach to the server process simultaneously.
- --collect-tcpdump
- Collect tcpdump data. This option causes tcpdump to capture all traffic on
all interfaces for the port on which MySQL is listening. You can later use
pt-query-digest to decode the MySQL protocol and extract a log of query
traffic from it.
- --config
- type: string
Read this comma-separated list of config files. If specified,
this must be the first option on the command line.
- --cycles
- type: int; default: 5
How many times "--variable" must be greater than
"--threshold" before triggering "--collect". This
helps prevent false positives, and makes the trigger condition less
likely to fire when the problem recovers quickly.
- --daemonize
- Daemonize the tool. This causes the tool to fork into the background and
log its output as specified in --log.
- --defaults-file
- short form: -F; type: string
Only read mysql options from the given file. You must give an
absolute pathname.
- --dest
- type: string; default: /var/lib/pt-stalk
Where to save diagnostic data from "--collect". Each
time the tool collects data, it writes to a new set of files, which are
named with the current system timestamp.
- --disk-bytes-free
- type: size; default: 100M
Do not "--collect" if the disk has less than this
much free space. This prevents the tool from filling up the disk with
diagnostic data.
If the "--dest" directory contains a previously
captured sample of data, the tool will measure its size and use that as
an estimate of how much data is likely to be gathered this time, too. It
will then be even more pessimistic, and will refuse to collect data
unless the disk has enough free space to hold the sample and still have
the desired amount of free space. For example, if you'd like 100MB of
free space and the previous diagnostic sample consumed 100MB, the tool
won't collect any data unless the disk has 200MB free.
Valid size value suffixes are k, M, G, and T.
- --disk-pct-free
- type: int; default: 5
Do not "--collect" if the disk has less than this
percent free space. This prevents the tool from filling up the disk with
diagnostic data.
This option works similarly to "--disk-bytes-free"
but specifies a percentage margin of safety instead of a bytes margin of
safety. The tool honors both options, and will not collect any data
unless both margins are satisfied.
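The two free-space margins can be illustrated with plain integer arithmetic. This is a sketch under stated assumptions, not the tool's implementation: disk_space_ok is a hypothetical function, all sizes are given in bytes, and the size of the previous sample is added to the bytes margin only.

```shell
#!/usr/bin/env bash
# Hypothetical helper illustrating both margins.
# Usage: disk_space_ok FREE TOTAL MIN_BYTES_FREE MIN_PCT_FREE [PREV_SAMPLE]
# Returns 0 when collecting looks safe, 1 otherwise.
disk_space_ok() {
    local free=$1 total=$2 min_bytes=$3 min_pct=$4 prev=${5:-0}
    # Bytes margin: room for a sample as large as the previous one,
    # plus the requested free space afterwards.
    [ "$free" -ge $((min_bytes + prev)) ] || return 1
    # Percentage margin: integer percent of the disk that is free.
    [ $((free * 100 / total)) -ge "$min_pct" ] || return 1
    return 0
}

MB=1048576
# 100MB margin and a 100MB previous sample: 150MB free is not enough.
if disk_space_ok $((150 * MB)) $((1000 * MB)) $((100 * MB)) 5 $((100 * MB)); then
    echo "collect"
else
    echo "skip"
fi
```

This mirrors the example in the text: with a 100MB margin and a 100MB previous sample, nothing is collected until 200MB is free.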
- --function
- type: string; default: status
What to watch for the trigger. The default value watches
"SHOW GLOBAL STATUS", but you can also
watch "SHOW PROCESSLIST" or specify a
file with your own custom code. This function supplies the value of
"--variable", which is then compared against
"--threshold" to see if the trigger condition is met.
Additional options may be required as well; see below. Possible values
are:
- status
Watch "SHOW GLOBAL STATUS"
for the trigger. The value of "--variable" then defines which
status counter is the trigger.
- processlist
Watch "SHOW FULL
PROCESSLIST" for the trigger. The trigger value is the count
of processes whose "--variable" column matches the
"--match" option. For example, to trigger
"--collect" when more than 10 processes are in the
"statistics" state, specify:
--function processlist \
--variable State \
--match statistics \
--threshold 10
In addition, you can specify a file that contains your custom
trigger function, written in Unix shell script. This can be a wrapper that
executes anything you wish. If the argument to "--function" is a
file, then it takes precedence over built-in functions, so if there is a
file in the working directory named "status" or
"processlist" then the tool will use that file even though these are
valid built-in values.
The file works by providing a function called
"trg_plugin", and the tool simply sources
the file and executes the function. For example, the file might contain:
trg_plugin() {
   mysql $EXT_ARGV -e "SHOW ENGINE INNODB STATUS" \
     | grep -c "has waited at"
}
This snippet will count the number of mutex waits inside InnoDB.
It illustrates the general principle: the function must output a number,
which is then compared to "--threshold" as usual. The
$EXT_ARGV variable contains the MySQL options
mentioned in the "SYNOPSIS" above.
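A trigger file does not have to query MySQL at all. As a further illustration of the same principle, the sketch below triggers on the system load average instead. It assumes a Linux host where /proc/loadavg exists; the variable PLUGIN_LOADAVG_FILE is a hypothetical name chosen for this example, not something the tool defines.

```shell
# Hypothetical trigger file, to be passed to --function.
# PLUGIN_LOADAVG_FILE is an illustrative variable; it defaults to the
# Linux load average file and can be pointed elsewhere for testing.
PLUGIN_LOADAVG_FILE="${PLUGIN_LOADAVG_FILE:-/proc/loadavg}"

trg_plugin() {
    # Print the 1-minute load average, truncated to an integer,
    # so that the output is a single number.
    awk '{ print int($1) }' "$PLUGIN_LOADAVG_FILE"
}
```

With such a file, "--threshold" is compared against the integer load average rather than a MySQL counter.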
The file should not alter the tool's existing global variables.
Prefix any file-specific global variables with
"PLUGIN_" or make them local.
- --help
- Print help and exit.
- --host
- short form: -h; type: string
Host to connect to.
- --interval
- type: int; default: 1
How often to check if the trigger is true, in seconds.
- --iterations
- type: int
How many times to "--collect" diagnostic data. By
default, the tool runs forever and collects data every time the trigger
occurs. Specify "--iterations" to collect data a limited
number of times. This option is also useful with
"--no-stalk" to collect data once and
exit, for example.
- --log
- type: string; default: /var/log/pt-stalk.log
Print all output to this file when daemonized.
- --match
- type: string
The pattern to use when watching SHOW PROCESSLIST. See
"--function" for details.
- --notify-by-email
- type: string
Send an email to these addresses for every
"--collect".
- --password
- short form: -p; type: string
Password to use when connecting. If the password contains commas,
they must be escaped with a backslash: "exam\,ple".
- --pid
- type: string; default: /var/run/pt-stalk.pid
Create the given PID file. The tool won't start if the PID
file already exists and the PID it contains is different than the
current PID. However, if the PID file exists and the PID it contains is
no longer running, the tool will overwrite the PID file with the current
PID. The PID file is removed automatically when the tool exits.
- --plugin
- type: string
Load a plugin to hook into the tool and extend its
functionality. The specified file does not need to be executable, nor
does its first line need to be a shebang line. It only needs to define one
or more of these Bash functions:
- before_stalk
- Called before stalking.
- before_collect
- Called when the trigger occurs, before running a "--collect"
subprocess in the background.
- after_collect
- Called after running a collector process. The PID of the collector process
is passed as the first argument. This hook is called before
"after_collect_sleep".
- after_collect_sleep
- Called after sleeping "--sleep" seconds for the collector
process to finish. This hook is called after
"after_collect".
- after_interval_sleep
- Called after sleeping "--interval" seconds after each trigger
check.
- after_stalk
- Called after stalking. Since pt-stalk stalks forever by default, this hook
is only called if "--iterations" is specified.
For example, a very simple plugin that touches a file when
"--collect" is triggered:
before_collect() {
   touch /tmp/foo
}
Since the plugin is completely sourced (imported) into the tool's
namespace, be careful not to define other functions or global variables that
already exist in the tool. You should prefix all plugin-specific functions
and global variables with "plugin_" or
"PLUGIN_".
Plugins have access to all command line options but they should
not modify them. Each option is a global variable like
$OPT_DEST which corresponds to "--dest".
In general, the global variable for each command line option is
"OPT_" plus the option name in all caps
with hyphens replaced by underscores.
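The naming rule can be written down as a small function. This is only an illustration of the rule as described; opt_var_name is a hypothetical helper and is not part of the tool.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map an option name to its global variable name.
opt_var_name() {
    local name=${1#--}     # strip the leading "--"
    name=${name//-/_}      # hyphens become underscores
    # Upper-case the result and add the OPT_ prefix.
    printf 'OPT_%s\n' "$(printf '%s' "$name" | tr '[:lower:]' '[:upper:]')"
}

opt_var_name --dest
opt_var_name --disk-bytes-free
```

The two calls print OPT_DEST and OPT_DISK_BYTES_FREE.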
Plugins can stop the tool by setting the global variable
"OKTORUN" to 1. In
this case, the global variable
"EXIT_REASON" should also be set to
indicate why the tool was stopped.
Plugin writers should keep in mind that the file destination
prefix currently in use should be accessed through the
$prefix variable, rather than
$OPT_PREFIX.
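Putting these rules together, a plugin might look like the following sketch. It defines only the after_collect hook, reads $OPT_DEST and $prefix without modifying them, and keeps its own state in a variable with the "PLUGIN_" prefix. The variable name PLUGIN_NOTE_FILE and its default path are assumptions made for this example.

```shell
# Hypothetical plugin file, to be passed to --plugin.
# PLUGIN_NOTE_FILE is an illustrative name, not something the tool defines.
PLUGIN_NOTE_FILE="${PLUGIN_NOTE_FILE:-/tmp/pt-stalk-plugin.log}"

after_collect() {
    local collector_pid=$1   # PID of the collector process (first argument)
    # Read-only use of the tool's globals $OPT_DEST and $prefix.
    echo "collector=$collector_pid dest=$OPT_DEST prefix=$prefix" \
        >> "$PLUGIN_NOTE_FILE"
}
```

Each collection then appends one line to the note file, which makes it easy to correlate collector PIDs with sample prefixes.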
- --mysql-only
- Trigger only MySQL-related captures, ignoring all others. The only
non-MySQL value collected is the disk space, because it is
needed to calculate the free disk space available for writing the result
files. This option is useful for RDS instances.
- --port
- short form: -P; type: int
Port number to use for connection.
- --prefix
- type: string
The filename prefix for diagnostic samples. By default, all
files created by the same "--collect" instance have a
timestamp prefix based on the current local time, like
"2011_12_06_14_02_02", which is
December 6, 2011 at 14:02:02.
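A prefix in the same shape can be produced with date(1), which is handy when passing an explicit "--prefix" from a wrapper script. This is a sketch that assumes the format shown above; sample_prefix is a hypothetical name, and the tool's own way of building the prefix is not shown here.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a timestamp such as 2011_12_06_14_02_02,
# based on the current local time.
sample_prefix() {
    date +%Y_%m_%d_%H_%M_%S
}

sample_prefix
```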
- --retention-count
- type: int; default: 0
Keep the data for the last N runs. If N > 0, the program
will keep the data for the last N runs and will delete the older
data.
- --retention-size
- type: int; default: 0
Keep up to --retention-size MB of data. The tool keeps at least
one run even if its size exceeds the value specified for this
option.
- --retention-time
- type: int; default: 30
Number of days to retain collected samples. Any samples that
are older will be purged.
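The effect of a time-based purge can be approximated with find(1). The sketch below is an assumption about how such a purge could be written, not the tool's implementation; purge_old_samples is a hypothetical function. Note that find counts age in whole days, so "-mtime +30" matches files at least 31 days old.

```shell
#!/usr/bin/env bash
# Hypothetical helper: delete regular files under DEST that are older
# than DAYS days (find rounds file ages down to whole days).
# Usage: purge_old_samples DEST DAYS
purge_old_samples() {
    local dest=$1 days=$2
    find "$dest" -type f -mtime +"$days" -exec rm -f {} +
}

# Example (commented out because it deletes files):
#   purge_old_samples /var/lib/pt-stalk 30
```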
- --run-time
- type: int; default: 30
How long to "--collect" diagnostic data when the
trigger occurs. The value is in seconds and should not be longer than
"--sleep". It is usually not necessary to change this; if the
default 30 seconds doesn't collect enough data, running longer is not
likely to help because the system or MySQL server is probably too busy
to respond. In fact, in many cases a shorter collection period is
appropriate.
This value is used two other times. After collecting, the
collect subprocess will wait another "--run-time" seconds for
its commands to finish. Some commands can take a while if the system is
running very slowly (which is likely, given that a
collection was triggered). Since empty files are deleted, the extra wait
gives commands time to finish and write their data. The value is
potentially used again just before the tool exits to wait again for any
collect subprocesses to finish. In most cases this won't happen because
of the aforementioned extra wait. If it happens, the tool will log
"Waiting up to N seconds for subprocesses to finish..." where
N is three times "--run-time". In both cases, after waiting,
the tool kills all of its subprocesses.
- --sleep
- type: int; default: 300
How long to sleep after "--collect". This prevents
the tool from triggering continuously, which might be a problem if the
collection process is intrusive. It also prevents filling up the disk or
gathering too much data to analyze reasonably.
- --sleep-collect
- type: int; default: 1
How long to sleep between collection loop cycles. This is
useful with "--no-stalk" to do long
collections. For example, to collect data every minute for an hour,
specify: "--no-stalk --run-time 3600
--sleep-collect 60".
- --socket
- short form: -S; type: string
Socket file to use for connection.
- --stalk
- default: yes; negatable: yes
Watch the server and wait for the trigger to occur. Specify
"--no-stalk" to collect diagnostic
data immediately, that is, without waiting for the trigger to occur. You
probably also want to specify values for "--interval",
"--iterations", and "--sleep". For example, to
immediately collect data for 1 minute then exit, specify:
--no-stalk --run-time 60 --iterations 1
"--cycles", "--daemonize",
"--log" and "--pid" have no effect with
"--no-stalk". Safeguard options, like
"--disk-bytes-free" and "--disk-pct-free", are still
respected.
See also "--collect".
- --threshold
- type: int; default: 25
The maximum acceptable value for "--variable".
"--collect" is triggered when the value of
"--variable" is greater than "--threshold" for
"--cycles" many times. Currently, there is no way to define a
lower threshold to check for a "--variable" value that is too
low.
See also "--function".
- --user
- short form: -u; type: string
User for login if not current user.
- --variable
- type: string; default: Threads_running
The variable to compare against "--threshold". See
also "--function".
- --verbose
- type: int; default: 2
Print more or less information while running. Since the tool
is designed to be a long-running daemon, the default verbosity level
only prints the most important information. If you run the tool
interactively, you may want to use a higher verbosity level.
LEVEL  PRINTS
=====  =====================================
0      Errors
1      Warnings
2      Matching triggers and collection info
3      Non-matching triggers
- --version
- Print tool's version and exit.
This tool does not require any environment variables for configuration, although
its behavior can be changed through several variables. Keep in
mind that these are expert settings, and should not be used in most cases.
Specifically, the variables that can be set are:
- CMD_GDB
- CMD_IOSTAT
- CMD_MPSTAT
- CMD_MYSQL
- CMD_MYSQLADMIN
- CMD_OPCONTROL
- CMD_OPREPORT
- CMD_PMAP
- CMD_STRACE
- CMD_SYSCTL
- CMD_TCPDUMP
- CMD_VMSTAT
For example, during collection iostat is called with a -dx
argument, but if you have an NFS partition, you may also need the -n flag.
Instead of editing the source, you can call pt-stalk as
CMD_IOSTAT="iostat -n" pt-stalk ...
which will do exactly what you need. Combined with the plugin
hooks, this gives you fine-grained control over what the tool does.
You can enable mysqladmin's "debug" mode by specifying:
CMD_MYSQLADMIN='mysqladmin debug' pt-stalk params ...
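Overrides of this kind are commonly implemented in shell with a default-value expansion. The sketch below shows that general pattern; it is an assumption about the mechanism, not a copy of the tool's code, and resolve_cmd is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the command to run for a given program,
# honoring a CMD_<NAME> environment variable when it is set.
# Usage: resolve_cmd NAME DEFAULT
resolve_cmd() {
    local var="CMD_$1" default=$2
    # ${!var} is Bash indirect expansion: the value of the variable
    # whose name is stored in $var.
    printf '%s\n' "${!var:-$default}"
}

CMD_IOSTAT="iostat -n"
resolve_cmd IOSTAT iostat    # override is set, so it wins
resolve_cmd VMSTAT vmstat    # prints the default unless CMD_VMSTAT is set
```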
This tool requires Bash v3 or newer. Certain options require other programs:
- "--collect-gdb" requires "gdb"
- "--collect-oprofile" requires "opcontrol" and
"opreport"
- "--collect-strace" requires "strace"
- "--collect-tcpdump" requires "tcpdump"
For a list of known bugs, see <http://www.percona.com/bugs/pt-stalk>.
Please report bugs at
<https://jira.percona.com/projects/PT>. Include the following
information in your bug report:
- Complete command-line used to run the tool
- Tool "--version"
- MySQL version of all servers involved
- Output from the tool including STDERR
- Input files (log/dump/config files, etc.)
If possible, include debugging output by running the tool with
"PTDEBUG"; see
"ENVIRONMENT".
Visit <http://www.percona.com/software/percona-toolkit/> to download the
latest release of Percona Toolkit. Or, get the latest release from the command
line:
wget percona.com/get/percona-toolkit.tar.gz
wget percona.com/get/percona-toolkit.rpm
wget percona.com/get/percona-toolkit.deb
You can also get individual tools from the latest release:
wget percona.com/get/TOOL
Replace "TOOL" with the name of
any tool.
Baron Schwartz, Justin Swanhart, Fernando Ipar, Daniel Nichter, and Brian Fraser
This tool is part of Percona Toolkit, a collection of advanced command-line
tools for MySQL developed by Percona. Percona Toolkit was forked from two
projects in June, 2011: Maatkit and Aspersa. Those projects were created by
Baron Schwartz and primarily developed by him and Daniel Nichter. Visit
<http://www.percona.com/software/> to learn about other free,
open-source software from Percona.
This program is copyright 2011-2018 Percona LLC and/or its affiliates, 2010-2011
Baron Schwartz.
THIS PROGRAM IS PROVIDED "AS IS" AND WITHOUT ANY EXPRESS
OR IMPLIED WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES
OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.
This program is free software; you can redistribute it and/or
modify it under the terms of the GNU General Public License as published by
the Free Software Foundation, version 2; OR the Perl Artistic License. On
UNIX and similar systems, you can issue `man perlgpl' or `man perlartistic'
to read these licenses.
You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software Foundation,
Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA.