Discrete(3) User Contributed Perl Documentation
Statistics::Descriptive::Discrete - Compute descriptive statistics for discrete
data sets.
To install, use the CPAN module
(https://metacpan.org/pod/Statistics::Descriptive::Discrete).
use Statistics::Descriptive::Discrete;
my $stats = Statistics::Descriptive::Discrete->new();
$stats->add_data(1,10,2,1,1,4,5,1,10,8,7);
print "count = ",$stats->count(),"\n";
print "uniq = ",$stats->uniq(),"\n";
print "sum = ",$stats->sum(),"\n";
print "min = ",$stats->min(),"\n";
print "min index = ",$stats->mindex(),"\n";
print "max = ",$stats->max(),"\n";
print "max index = ",$stats->maxdex(),"\n";
print "mean = ",$stats->mean(),"\n";
print "geometric mean = ",$stats->geometric_mean(),"\n";
print "harmonic mean = ", $stats->harmonic_mean(),"\n";
print "standard_deviation = ",$stats->standard_deviation(),"\n";
print "variance = ",$stats->variance(),"\n";
print "sample_range = ",$stats->sample_range(),"\n";
print "mode = ",$stats->mode(),"\n";
print "median = ",$stats->median(),"\n";
my $f = $stats->frequency_distribution_ref(3);
for (sort {$a <=> $b} keys %$f) {
    print "key = $_, count = $f->{$_}\n";
}
This module provides basic functions used in descriptive statistics. It borrows
very heavily from Statistics::Descriptive::Full (which is included with
Statistics::Descriptive), with one major difference: this module is optimized
for discretized data, such as data from an A/D conversion that has a discrete
set of possible values. For example, if your data is produced by an 8-bit A/D
converter, your data set contains at most 256 distinct values. Even if you
have a million data points, those million points take only 256 different
values. Instead of storing the entire data set as Statistics::Descriptive
does, this module stores only the values seen and the number of times each
value occurs.
For very large data sets, this storage method results in
significant speed and memory improvements. For example, for an 8-bit data
set (256 possible values), with 1,000,000 data points, this module is about
10x faster than Statistics::Descriptive::Full or
Statistics::Descriptive::Sparse.
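The storage idea described above can be sketched in plain Perl. This is only an illustration of the value-to-count technique, not the module's actual implementation; the helper name tally and the simulated sample data are assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: tally each value instead of storing every data point.
sub tally {
    my %count_of;
    $count_of{$_}++ for @_;
    return \%count_of;
}

# Simulated 8-bit samples (values 0..255); a real data set might hold millions.
my @samples  = map { $_ % 256 } 1 .. 100_000;
my $count_of = tally(@samples);

# Count, sum and mean can be derived from the tallies alone.
my ($n, $sum) = (0, 0);
for my $value (keys %$count_of) {
    $n   += $count_of->{$value};
    $sum += $value * $count_of->{$value};
}

printf "distinct values = %d, count = %d, mean = %.3f\n",
    scalar(keys %$count_of), $n, $sum / $n;
```

However many points are added, the hash never grows beyond the number of distinct values, which is where the memory saving comes from.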
The run time of Statistics::Descriptive grows with the size of the
data set. In particular, repeated calls to
"add_data" are slow.
Statistics::Descriptive::Discrete's
"add_data" is optimized for speed. For a
given number of data points, this module's run time increases as the
number of unique data values in the data set increases. For example, while
this module runs at about 10x the speed of Statistics::Descriptive::Full for an
8-bit data set, the advantage drops to about 3x for an equivalently sized
20-bit data set.
See sdd_prof.pl in the examples directory to play with profiling
this module against Statistics::Descriptive::Full.
- $stat = Statistics::Descriptive::Discrete->new();
- Create a new statistics object.
- $stat->add_data(1,2,3,4,5);
- Adds data to the statistics object. Sets a flag so that the statistics
will be recomputed the next time they're needed.
- $stat->add_data_tuple(1,2,42,3);
- Adds data to the statistics object, where each pair of elements is a value
followed by a count (how many times that value occurred). The above is
equivalent to "$stat->add_data(1,1,42,42,42);".
Use this when your data is already in the form of ($value,
$occurrence) pairs.
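The equivalence between the tuple form and the flat form can be sketched as follows. The helper expand_tuples is hypothetical and exists only to show the pairing; the module itself does not need to expand the data.

```perl
use strict;
use warnings;

# Hypothetical helper: expand (value, count, value, count, ...) into a flat list.
sub expand_tuples {
    my @pairs = @_;
    die "expected an even number of arguments\n" if @pairs % 2;
    my @flat;
    while (my ($value, $count) = splice(@pairs, 0, 2)) {
        push @flat, ($value) x $count;    # repeat the value $count times
    }
    return @flat;
}

my @flat = expand_tuples(1, 2, 42, 3);
print "@flat\n";    # prints "1 1 42 42 42"
```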
- $stat->max();
- Returns the maximum value of the data set.
- $stat->min();
- Returns the minimum value of the data set.
- $stat->mindex();
- Returns the index of the minimum value of the data set. The index returned
is that of the first occurrence of the minimum value.
Note: the index is determined by the order in which data was added
using "add_data()" or "add_data_tuple()". It is meaningless in the
context of "get_data()", because "get_data()" does not return values
in the order in which they were added. This behavior differs from
Statistics::Descriptive, which does preserve order.
- $stat->maxdex();
- Returns the index of the maximum value of the data set. The index returned
is that of the first occurrence of the maximum value.
Note: the index is determined by the order in which data was added
using "add_data()" or "add_data_tuple()". It is meaningless in the
context of "get_data()", because "get_data()" does not return values
in the order in which they were added. This behavior differs from
Statistics::Descriptive, which does preserve order.
- $stat->count();
- Returns the total number of elements in the data set.
- $stat->uniq();
- If called in scalar context, returns the total number of unique elements
in the data set. For example, if your data set is (1,2,2,3,3,3), uniq
returns 3.
If called in list context, returns a list of the distinct data
values in sorted order. In the above example,
"@uniq = $stats->uniq();" would
return (1,2,3).
This function is specific to Statistics::Descriptive::Discrete
and is not implemented in Statistics::Descriptive.
It is useful for getting a frequency distribution for each
discrete value in the data set:
my $stats = Statistics::Descriptive::Discrete->new();
$stats->add_data_tuple(1,1,2,2,3,3,4,4,5,5,6,6,7,7);
my @bins = $stats->uniq();
my $f = $stats->frequency_distribution_ref(\@bins);
for (sort {$a <=> $b} keys %$f) {
    print "value = $_, count = $f->{$_}\n";
}
- $stat->sum();
- Returns the sum of all the values in the data set.
- $stat->mean();
- Returns the mean of the data.
- $stat->harmonic_mean();
- Returns the harmonic mean of the data. Because the harmonic mean is
undefined if any of the data are zero or if the sum of the reciprocals is
zero, it returns undef in both of those cases.
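The two undefined cases can be made concrete with a short sketch that works on a value-to-count hash. The helper harmonic_mean_from_counts is hypothetical and is not the module's own code; it only mirrors the behavior described above.

```perl
use strict;
use warnings;

# Hypothetical helper: harmonic mean computed from a value => count hash.
sub harmonic_mean_from_counts {
    my ($count_of) = @_;
    my ($n, $sum_recip) = (0, 0);
    for my $value (keys %$count_of) {
        return if $value == 0;                 # 1/0 is undefined
        $n         += $count_of->{$value};
        $sum_recip += $count_of->{$value} / $value;
    }
    return if $sum_recip == 0;                 # e.g. (-1, 1), or no data at all
    return $n / $sum_recip;
}

# Data set (1, 2, 4, 4): 4 / (1 + 0.5 + 0.25 + 0.25) = 2
my $hm = harmonic_mean_from_counts({ 1 => 1, 2 => 1, 4 => 2 });
print "harmonic mean = $hm\n";
```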
- $stat->geometric_mean();
- Returns the geometric mean of the data. Returns "undef" if any of the
data are less than 0. Returns 0 if any of the data are 0.
- $stat->median();
- Returns the median value of the data.
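A median can be found from value counts without expanding them back into the full data set. The sketch below assumes the usual convention of averaging the two middle values when the count is even; the helper median_from_counts is hypothetical and may differ from what the module does internally.

```perl
use strict;
use warnings;

# Hypothetical helper: median from a value => count hash, without expanding it.
sub median_from_counts {
    my ($count_of) = @_;
    my @values = sort { $a <=> $b } keys %$count_of;
    my $n = 0;
    $n += $count_of->{$_} for @values;
    return unless $n;                          # no data, no median

    # 1-based ranks of the middle element(s); they coincide when $n is odd.
    my $lo_rank = int(($n + 1) / 2);
    my $hi_rank = int($n / 2) + 1;

    my ($seen, $lo, $hi) = (0);
    for my $value (@values) {
        $seen += $count_of->{$value};          # running count of items <= $value
        $lo = $value if !defined $lo && $seen >= $lo_rank;
        if ($seen >= $hi_rank) { $hi = $value; last }
    }
    return ($lo + $hi) / 2;
}

# Tallies for the data set (1,10,2,1,1,4,5,1,10,8,7).
my %count_of;
$count_of{$_}++ for (1, 10, 2, 1, 1, 4, 5, 1, 10, 8, 7);
print "median = ", median_from_counts(\%count_of), "\n";    # prints "median = 4"
```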
- $stat->mode();
- Returns the mode of the data.
- $stat->variance();
- Returns the variance of the data.
- $stat->standard_deviation();
- Returns the standard deviation of the data.
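As an illustration of how variance and standard deviation can be derived from value counts, here is a minimal sketch. It assumes the sample form with an n-1 divisor, which is not stated in this document; check the module's source if the distinction matters. The helper variance_from_counts is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: sample variance (n-1 divisor, an assumption) from a
# value => count hash.
sub variance_from_counts {
    my ($count_of) = @_;
    my ($n, $sum) = (0, 0);
    for my $value (keys %$count_of) {
        $n   += $count_of->{$value};
        $sum += $value * $count_of->{$value};
    }
    return unless $n > 1;                      # needs at least two data points
    my $mean = $sum / $n;

    # Each distinct value contributes its squared deviation once per occurrence.
    my $sum_sq_dev = 0;
    for my $value (keys %$count_of) {
        $sum_sq_dev += $count_of->{$value} * ($value - $mean)**2;
    }
    return $sum_sq_dev / ($n - 1);
}

# Data set (2, 4, 4, 4, 6): mean 4, squared deviations sum to 8, variance 8/4.
my $variance = variance_from_counts({ 2 => 1, 4 => 3, 6 => 1 });
printf "variance = %g, standard deviation = %.4f\n", $variance, sqrt($variance);
```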
- $stat->sample_range();
- Returns the sample range (max - min) of the data set.
- $stat->frequency_distribution_ref($num_partitions);
- $stat->frequency_distribution_ref(\@bins);
- $stat->frequency_distribution_ref();
- "frequency_distribution_ref($num_partitions)"
slices the data into $num_partitions sets (where
$num_partitions is greater than 1) and counts the
number of items that fall into each partition. It returns a reference to a
hash whose keys are the numerical values of the partitions used. The
minimum value of the data set is not a key, and the maximum value of the
data set is always a key. The number of entries for a particular partition
key is the number of items that are greater than the previous partition
key and less than or equal to the current partition key. As an example,
$stat->add_data(1,1.5,2,2.5,3,3.5,4);
$f = $stat->frequency_distribution_ref(2);
for (sort {$a <=> $b} keys %$f) {
    print "key = $_, count = $f->{$_}\n";
}
prints
key = 2.5, count = 4
key = 4, count = 3
since there are four items less than or equal to 2.5, and three
items greater than 2.5 and less than or equal to 4.
"frequency_distribution_ref(\@bins)"
provides the bins that are to be used for the distribution. This allows
for non-uniform distributions as well as trimmed or sample distributions
to be found. @bins must be monotonic and must
contain at least one element. Note that unless the set of bins contains
the full range of the data, the total counts returned will be less than
the sample size.
Calling
"frequency_distribution_ref()" with no
arguments returns the last distribution calculated, if such exists.
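The binning rule can be sketched over a value-to-count hash. Both helpers below are hypothetical. The equal-width partition formula and the choice to count values below the first bin in that first bin are assumptions, picked because they reproduce the documented example; the module's own code may differ in detail.

```perl
use strict;
use warnings;

# Hypothetical helper: upper bounds for $num_partitions equal-width partitions.
# The last bound is exactly the maximum; the minimum is never a bound.
sub partition_bins {
    my ($min, $max, $num_partitions) = @_;
    my $width = ($max - $min) / $num_partitions;
    my @bins  = map { $min + $width * $_ } 1 .. $num_partitions - 1;
    return (@bins, $max);
}

# Hypothetical helper: each value is counted in the first bin it does not
# exceed. Values above the last bin are not counted at all.
sub frequency_from_counts {
    my ($count_of, @bins) = @_;                # @bins assumed sorted ascending
    my %freq = map { $_ => 0 } @bins;
    for my $value (keys %$count_of) {
        for my $bin (@bins) {
            next if $value > $bin;
            $freq{$bin} += $count_of->{$value};
            last;
        }
    }
    return \%freq;
}

my %count_of;
$count_of{$_}++ for (1, 1.5, 2, 2.5, 3, 3.5, 4);

# Two partitions over [1, 4] give the keys 2.5 and 4, as in the example above.
my $f = frequency_from_counts(\%count_of, partition_bins(1, 4, 2));
for (sort { $a <=> $b } keys %$f) {
    print "key = $_, count = $f->{$_}\n";
}

# Bins that stop short of the maximum leave some items uncounted.
my $trimmed = frequency_from_counts(\%count_of, 2, 3);
my $total   = 0;
$total += $_ for values %$trimmed;
print "counted $total of 7 items\n";           # prints "counted 5 of 7 items"
```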
- my %hash = $stat->frequency_distribution($partitions);
- my %hash = $stat->frequency_distribution(\@bins);
- my %hash = $stat->frequency_distribution();
- Same as "frequency_distribution_ref()",
except that it returns the hash flattened into the return list. It is kept
for compatibility with previous versions of
Statistics::Descriptive::Discrete, and using it is discouraged.
Note: in earlier versions of Statistics::Descriptive::Discrete,
"frequency_distribution()" behaved
differently from the Statistics::Descriptive implementation. Any code
that uses this function should be carefully checked to ensure
compatibility with the current implementation.
- $stat->get_data();
- Returns a copy of the data array. Note: this array could be very large and
would thus defeat the purpose of using this module. Make sure you really
need it before using "get_data()".
The returned array contains the values sorted by value. It
does not preserve the order in which the values were added. Preserving
order would defeat the purpose of this module, which gives up order
in exchange for speed and lower memory usage. If order is important, use
Statistics::Descriptive.
- $stat->clear();
- Clears all data and resets the instance as if it were newly created.
Effectively the same as:
my $class = ref($stat);
undef $stat;
$stat = $class->new();
The interface for this module strives to be identical to
Statistics::Descriptive. Any differences are noted in the description for each
method.
- Code for calculating mode is not as robust as it should be.
- Other bugs are lurking I'm sure.
- Add the rest of the methods (at least those that don't depend on the
original order of the data) from Statistics::Descriptive.
Rhet Turnbull, rturnbull+cpan@gmail.com
Thanks to the following individuals for finding bugs, providing feedback, and
submitting changes:
- Peter Dienes for finding and fixing a bug in the variance
calculation.
- Bill Dueber for suggesting the add_data_tuple method.
Copyright (c) 2002, 2019 Rhet Turnbull. All rights reserved. This
program is free software; you can redistribute it and/or modify it
under the same terms as Perl itself.
Portions of this code are from Statistics::Descriptive, which is under
the following copyrights:
Copyright (c) 1997,1998 Colin Kuskie. All rights reserved. This
program is free software; you can redistribute it and/or modify it
under the same terms as Perl itself.
Copyright (c) 1998 Andrea Spinelli. All rights reserved. This program
is free software; you can redistribute it and/or modify it under the
same terms as Perl itself.
Copyright (c) 1994,1995 Jason Kastner. All rights
reserved. This program is free software; you can redistribute it
and/or modify it under the same terms as Perl itself.
Statistics::Descriptive
Statistics::Discrete