rwsplit - Divide a SiLK file into a (sampled) collection of subfiles
rwsplit --basename=BASENAME
{ --ip-limit=LIMIT | --flow-limit=LIMIT
| --packet-limit=LIMIT | --byte-limit=LIMIT }
[--seed=NUMBER] [--sample-ratio=SAMPLE_RATIO]
[--file-ratio=FILE_RATIO] [--max-outputs=MAX_OUTPUTS]
[--note-add=TEXT] [--note-file-add=FILE]
[--compression-method=COMP_METHOD]
[--print-filenames] [--site-config-file=FILENAME]
[--xargs[=FILE] | FILE [FILES...]]
rwsplit --help
rwsplit --version
rwsplit reads SiLK Flow records from the standard input or from files
named on the command line and writes the flows into a set of subfiles based on
the splitting criterion. In its simplest form, rwsplit
partitions the file, meaning that each input flow will appear in one
(and only one) of the subfiles.
In addition to splitting the file, rwsplit can generate
files containing sample flows. Sampling is specified by using the
--sample-ratio and --file-ratio switches.
rwsplit reads SiLK Flow records from the files named on the
command line or from the standard input when no file names are specified and
--xargs is not present. To read the standard input in addition to the
named files, use "-" or
"stdin" as a file name. If an input file
name ends in ".gz", the file is
uncompressed as it is read. When the --xargs switch is provided,
rwsplit reads the names of the files to process from the named text
file or from the standard input if no file name argument is provided to the
switch. The input to --xargs must contain one file name per line.
If you wish to use the size of the output files as the splitting
criterion, use the --flow-limit switch. The parameter to this switch
should be the size of the desired output files divided by the record size.
The record size can be determined by rwfileinfo(1).
When the output files are compressed (see the description of
--compression-method below), you should assume about a 50%
compression ratio.
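The arithmetic above can be sketched in Python. This is only an illustration: the function name, the 50-byte record size, and the reading of "50% compression ratio" as "compressed files are about half the uncompressed size" are assumptions, and the file header is ignored. Take the real record size from rwfileinfo(1).

```python
def flow_limit_for_size(target_bytes, record_size, compression_ratio=1.0):
    """Estimate a --flow-limit value for output files of about target_bytes.

    compression_ratio is the assumed on-disk size as a fraction of the
    uncompressed size: 1.0 for no compression, 0.5 for the rough 50%
    figure. The file header is ignored in this estimate.
    """
    if record_size <= 0:
        raise ValueError("record_size must be positive")
    if not 0 < compression_ratio <= 1:
        raise ValueError("compression_ratio must be in (0, 1]")
    # Uncompressed bytes that fit in the target once compression is applied.
    uncompressed_budget = target_bytes / compression_ratio
    return max(1, int(uncompressed_budget // record_size))


# Hypothetical 50-byte records and a 10 MB target per subfile.
uncompressed_limit = flow_limit_for_size(10_000_000, 50)
compressed_limit = flow_limit_for_size(10_000_000, 50, compression_ratio=0.5)
```

The resulting number would be passed as --flow-limit=LIMIT; treat it as a starting point, since actual compression depends on the data.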
Option names may be abbreviated if the abbreviation is unique or is an exact
match for an option. A parameter to an option may be specified as
--arg=param or --arg param, though the first form
is required for options that take optional parameters.
The splitting criterion is defined using one of the limit
specifiers; one and only one must be specified. They are:
- --ip-limit=LIMIT
- Close the current subfile and begin a new subfile when the count of unique
source and destination IPs in the current subfile meets or exceeds
LIMIT. The next-hop-IP does not count toward LIMIT.
- --flow-limit=LIMIT
- Close the current subfile and begin a new subfile when the number of SiLK
Flow records in the current subfile meets LIMIT.
- --packet-limit=LIMIT
- Close the current subfile and begin a new subfile when the sum of the
packet counts across all SiLK Flow records in the current subfile meets or
exceeds LIMIT.
- --byte-limit=LIMIT
- Close the current subfile and begin a new subfile when the sum of the byte
counts across all SiLK Flow records in the current subfile meets or
exceeds LIMIT. This switch does not specify the size of the
subfiles.
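As a rough model of the three additive criteria (--flow-limit, --packet-limit, --byte-limit), the following Python sketch appends each record to the current subfile and closes the subfile once the running total meets or exceeds LIMIT. The function name and the dictionary record layout are hypothetical; this is not rwsplit's implementation.

```python
def split_by_total(records, limit, weight):
    """Partition records; close a subfile when its running total >= limit.

    weight(record) returns what the record contributes: 1 for a flow
    limit, its packet count for a packet limit, its byte count for a
    byte limit.
    """
    subfiles, current, total = [], [], 0
    for rec in records:
        # The record always goes into the current subfile first ...
        current.append(rec)
        total += weight(rec)
        # ... and only then is the limit checked, so totals can overshoot.
        if total >= limit:
            subfiles.append(current)
            current, total = [], 0
    if current:
        subfiles.append(current)  # final, possibly short, subfile
    return subfiles


# Five made-up records with 3 packets and 1000 bytes each.
records = [{"packets": 3, "bytes": 1000} for _ in range(5)]
by_flow = split_by_total(records, 2, lambda r: 1)
by_packet = split_by_total(records, 5, lambda r: r["packets"])
by_byte = split_by_total(records, 2500, lambda r: r["bytes"])
```

In this model a byte limit of 2500 yields a first subfile holding 3000 bytes of traffic, which illustrates that the limit bounds the summed byte counts in the records, not the size of the subfile on disk.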
The other switches are:
- --basename=BASENAME
- Specifies the basename of the output files; this switch is required. The
flows are written sequentially to a set of subfiles whose names follow the
format BASENAME.ORDER.rwf, where
ORDER is an 8-digit zero-formatted sequence number (i.e., 00000000,
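00000001, and so on). The sequence number will begin at zero and increase
by one for every file written, unless --file-ratio is
specified, in which case only the randomly chosen names are written and
the sequence numbers on disk have gaps.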
- --seed=NUMBER
- Use NUMBER to seed the pseudo-random number generator for the
--sample-ratio or --file-ratio switch. This can be used to
put the random number generator into a known state, which is useful for
testing.
- --sample-ratio=SAMPLE_RATIO
- Writes one flow record, chosen at random, from every SAMPLE_RATIO
flows that are read.
- --file-ratio=FILE_RATIO
- Picks one subfile, chosen at random, out of every FILE_RATIO
names generated, for writing to disk.
- --max-outputs=NUMBER
- Limits the number of files that are written to disk to NUMBER.
- --note-add=TEXT
- Add the specified TEXT to the header of the output file as an
annotation. This switch may be repeated to add multiple annotations to a
file. To view the annotations, use the rwfileinfo(1)
tool.
- --note-file-add=FILENAME
- Open FILENAME and add the contents of that file to the header of
the output file as an annotation. This switch may be repeated to add
multiple annotations. Currently the application makes no effort to ensure
that FILENAME contains text; be careful that you do not attempt to
add a SiLK data file as an annotation.
- --compression-method=COMP_METHOD
- Specify the compression library to use when writing output files. If this
switch is not given, the value in the SILK_COMPRESSION_METHOD environment
variable is used if the value names an available compression method. When
no compression method is specified, the output files are compressed using
the default chosen when SiLK was compiled. The valid values for
COMP_METHOD are determined by which external libraries were found
when SiLK was compiled. To see the available compression methods and the
default method, use the --help or --version switch. SiLK can
support the following COMP_METHOD values when the required
libraries are available.
- none
- Do not compress the output using an external library.
- zlib
- Use the zlib(3) library for compressing the output.
Using zlib produces the smallest output files at the cost of speed.
- lzo1x
- Use the lzo1x algorithm from the LZO real time compression library
for compression. This compression provides good compression with less
memory and CPU overhead.
- snappy
- Use the snappy library for compression, and always compress the
output regardless of the destination. This compression provides good
compression with less memory and CPU overhead. Since SiLK
3.13.0.
- best
- Use lzo1x if available, otherwise use snappy if available, otherwise use
zlib if available.
- --print-filenames
- Print to the standard error the names of input files as they are
opened.
- --site-config-file=FILENAME
- Read the SiLK site configuration from the named file FILENAME. When
this switch is not provided, rwsplit searches for the site
configuration file in the locations specified in the "FILES"
section.
- --xargs
- --xargs=FILENAME
- Read the names of the input files from FILENAME or from the
standard input if FILENAME is not provided. The input is expected
to have one filename per line. rwsplit opens each named file in
turn and reads records from it as if the filenames had been listed on the
command line.
- --help
- Print the available options and exit.
- --version
- Print the version number and information about how SiLK was configured,
then exit the application.
In the following examples, the dollar sign
("$") represents the shell prompt. The text
after the dollar sign represents the command line. Lines have been wrapped for
improved readability, and the backslash
("\") is used to indicate a wrapped line.
Assume a source file source.rwf; to split that file into
files that each contain about 100 unique IP addresses:
$ rwsplit --basename=result --ip-limit=100 source.rwf
To split source.rwf into files that each contain 100
flows:
$ rwsplit --basename=result --flow-limit=100 source.rwf
The following causes rwsplit to sample 1 out of every 10
records from source.rwf; because each subfile holds 100 flows,
rwsplit will read about 1000 flow records to produce each subfile:
$ rwsplit --basename=result --flow-limit=100 --sample-ratio=10 source.rwf
When --file-ratio is specified, the file names are
generated as usual (e.g., base-00000000, base-00000001, ...); however, one
of these names will be chosen randomly from each set of FILE_RATIO
candidates, and only that file will be written to disk.
$ rwsplit --basename=result --flow-limit=100 --file-ratio=5 source.rwf
$ ls
result-00000002.rwf
result-00000008.rwf
result-00000013.rwf
result-00000016.rwf
rwsplit can take exactly 1 partitioning switch per invocation.
Partitioning is not exact: rwsplit keeps appending flow
records to a file until the file meets or exceeds the specified LIMIT. For
example, if you specify --ip-limit=100, rwsplit fills
the file until it holds at least 100 IP addresses; if the file has 99
addresses and a new record with 2 previously unseen addresses arrives,
rwsplit puts that record in the current file, resulting in a 101-address
file. Similarly, if you specify --byte-limit=2000 and rwsplit
receives a flow record whose byte count is 10 kB, that flow record is
placed in the current subfile.
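The 101-address case can be reproduced with a small Python model. It is a sketch under the assumption that a record is added first and the unique-address count is checked afterward; the function name and the (source, destination) tuple layout are hypothetical.

```python
def split_by_ip_limit(records, limit):
    """Return (records, unique_ip_count) pairs, one per subfile.

    records is a sequence of (source_ip, dest_ip) tuples; the next-hop
    address is not modeled because it does not count toward the limit.
    """
    subfiles, current, seen = [], [], set()
    for sip, dip in records:
        current.append((sip, dip))
        seen.update((sip, dip))
        if len(seen) >= limit:  # checked only after the record is added
            subfiles.append((current, len(seen)))
            current, seen = [], set()
    if current:
        subfiles.append((current, len(seen)))
    return subfiles


def addr(n):
    return f"10.0.0.{n}"


# 49 records with two fresh addresses each give 98 unique addresses.
flows = [(addr(2 * i), addr(2 * i + 1)) for i in range(49)]
flows.append((addr(0), addr(98)))    # one new address: 99 unique
flows.append((addr(99), addr(100)))  # two new addresses: 101 unique
flows.append((addr(0), addr(1)))     # starts the next subfile

counts = [unique for _, unique in split_by_ip_limit(flows, 100)]
# counts == [101, 2]
```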
The switches --sample-ratio, --file-ratio, and
--max-outputs are processed in that order. So, when you specify
$ rwsplit --basename=result --sample-ratio=10 --ip-limit=100 \
--file-ratio=10 --max-outputs=20 source.rwf
rwsplit will pick 1 out of every 10 flow records, write
the picked records to a candidate subfile until it holds 100 IPs, pick 1
out of every 10 candidate subfiles to write, and write up to 20 files. If
there are 1000 records, each with 2 unique IPs, then rwsplit will
write at most 1 file: the 100 sampled records contain 200 unique IP
addresses, enough for 2 candidate subfiles, but the name chosen from the
set of 10 may not be one of those 2, in which case nothing is written.
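The order of processing can be walked through with a Python sketch of the 1000-record case. It assumes that every record contributes only previously unseen addresses and that --file-ratio picks one sequence number uniformly from each consecutive group of FILE_RATIO names; both are modeling assumptions, and the function name is hypothetical.

```python
import random


def simulate_outputs(num_records, ips_per_record, sample_ratio, ip_limit,
                     file_ratio, max_outputs, seed=None):
    """Return the sequence numbers of subfiles written in this model."""
    rng = random.Random(seed)
    # 1. --sample-ratio: one record survives out of each group read.
    sampled = num_records // sample_ratio
    # 2. --ip-limit: a candidate subfile closes once ip_limit unique
    #    addresses accumulate (ceiling division on both lines).
    records_per_file = -(-ip_limit // ips_per_record)
    candidates = -(-sampled // records_per_file)
    # 3. --file-ratio: pick one name per group of file_ratio names; the
    #    pick is wasted if that name is never generated.
    written = []
    for group_start in range(0, candidates, file_ratio):
        chosen = group_start + rng.randrange(file_ratio)
        if chosen < candidates:
            written.append(chosen)
    # 4. --max-outputs: cap the number of files written.
    return written[:max_outputs]


# 1000 records, 2 addresses each, with the switches from the example.
runs = [simulate_outputs(1000, 2, 10, 100, 10, 20, seed=s) for s in range(50)]
most_files_written = max(len(r) for r in runs)
```

In this model only sequence numbers 0 and 1 are ever generated, so a run writes a file only when the random pick lands on one of them, which matches the "at most 1 file" outcome.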
- SILK_CLOBBER
- The SiLK tools normally refuse to overwrite existing files. Setting
SILK_CLOBBER to a non-empty value removes this restriction.
- SILK_COMPRESSION_METHOD
- This environment variable is used as the value for
--compression-method when that switch is not provided. Since
SiLK 3.13.0.
- SILK_CONFIG_FILE
- This environment variable is used as the value for the
--site-config-file switch when that switch is not provided.
- SILK_DATA_ROOTDIR
- This environment variable specifies the root directory of the data repository.
As described in the "FILES" section, rwsplit may use this
environment variable when searching for the SiLK site configuration
file.
- SILK_PATH
- This environment variable gives the root of the install tree. When
searching for configuration files, rwsplit may use this environment
variable. See the "FILES" section for details.
- ${SILK_CONFIG_FILE}
- ${SILK_DATA_ROOTDIR}/silk.conf
- /data/silk.conf
- ${SILK_PATH}/share/silk/silk.conf
- ${SILK_PATH}/share/silk.conf
- /usr/local/share/silk/silk.conf
- /usr/local/share/silk.conf
- Possible locations for the SiLK site configuration file which are checked
when the --site-config-file switch is not provided.
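The list of locations can be expanded for a given environment with a short Python helper. This is a sketch: the function name is hypothetical, and treating the listed order as the search order, with entries skipped when their environment variable is unset, is an assumption rather than documented behavior.

```python
import os


def site_config_candidates(env=None):
    """Return the candidate silk.conf paths in the order listed above."""
    env = os.environ if env is None else env
    candidates = []
    config_file = env.get("SILK_CONFIG_FILE")
    if config_file:
        candidates.append(config_file)
    data_root = env.get("SILK_DATA_ROOTDIR")
    if data_root:
        candidates.append(f"{data_root}/silk.conf")
    candidates.append("/data/silk.conf")
    silk_path = env.get("SILK_PATH")
    if silk_path:
        candidates.append(f"{silk_path}/share/silk/silk.conf")
        candidates.append(f"{silk_path}/share/silk.conf")
    candidates.append("/usr/local/share/silk/silk.conf")
    candidates.append("/usr/local/share/silk.conf")
    return candidates


# Made-up environment values for illustration.
example = site_config_candidates({
    "SILK_CONFIG_FILE": "/etc/my-silk.conf",
    "SILK_DATA_ROOTDIR": "/repo",
    "SILK_PATH": "/opt/silk",
})
```

Printing the returned list next to the output of a failing rwsplit run can help show which file the tool was expected to find.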
rwfileinfo(1), silk(7),
zlib(3)