math::statistics(3tcl) | Tcl Math Library | math::statistics(3tcl) |
math::statistics - Basic statistical functions and procedures
package require Tcl 8.5
package require math::statistics 1
::math::statistics::mean data
::math::statistics::min data
::math::statistics::max data
::math::statistics::number data
::math::statistics::stdev data
::math::statistics::var data
::math::statistics::pstdev data
::math::statistics::pvar data
::math::statistics::median data
::math::statistics::basic-stats data
::math::statistics::histogram limits values ?weights?
::math::statistics::histogram-alt limits values ?weights?
::math::statistics::corr data1 data2
::math::statistics::interval-mean-stdev data confidence
::math::statistics::t-test-mean data est_mean est_stdev alpha
::math::statistics::test-normal data significance
::math::statistics::lillieforsFit data
::math::statistics::test-Duckworth list1 list2 significance
::math::statistics::test-anova-F alpha args
::math::statistics::test-Tukey-range alpha args
::math::statistics::test-Dunnett alpha control args
::math::statistics::quantiles data confidence
::math::statistics::quantiles limits counts confidence
::math::statistics::autocorr data
::math::statistics::crosscorr data1 data2
::math::statistics::mean-histogram-limits mean stdev number
::math::statistics::minmax-histogram-limits min max number
::math::statistics::linear-model xdata ydata intercept
::math::statistics::linear-residuals xdata ydata intercept
::math::statistics::test-2x2 n11 n21 n12 n22
::math::statistics::print-2x2 n11 n21 n12 n22
::math::statistics::control-xbar data ?nsamples?
::math::statistics::control-Rchart data ?nsamples?
::math::statistics::test-xbar control data
::math::statistics::test-Rchart control data
::math::statistics::test-Kruskal-Wallis confidence args
::math::statistics::analyse-Kruskal-Wallis args
::math::statistics::test-Levene groups
::math::statistics::test-Brown-Forsythe groups
::math::statistics::group-rank args
::math::statistics::test-Wilcoxon sample_a sample_b
::math::statistics::spearman-rank sample_a sample_b
::math::statistics::spearman-rank-extended sample_a sample_b
::math::statistics::kernel-density data opt -option value ...
::math::statistics::bootstrap data sampleSize ?numberSamples?
::math::statistics::wasserstein-distance prob1 prob2
::math::statistics::kl-divergence prob1 prob2
::math::statistics::logistic-model xdata ydata
::math::statistics::logistic-probability coeffs x
::math::statistics::tstat dof ?alpha?
::math::statistics::mv-wls wt1 weights_and_values
::math::statistics::mv-ols values
::math::statistics::pdf-normal mean stdev value
::math::statistics::pdf-lognormal mean stdev value
::math::statistics::pdf-exponential mean value
::math::statistics::pdf-uniform xmin xmax value
::math::statistics::pdf-triangular xmin xmax value
::math::statistics::pdf-symmetric-triangular xmin xmax value
::math::statistics::pdf-gamma alpha beta value
::math::statistics::pdf-poisson mu k
::math::statistics::pdf-chisquare df value
::math::statistics::pdf-student-t df value
::math::statistics::pdf-gamma a b value
::math::statistics::pdf-beta a b value
::math::statistics::pdf-weibull scale shape value
::math::statistics::pdf-gumbel location scale value
::math::statistics::pdf-pareto scale shape value
::math::statistics::pdf-cauchy location scale value
::math::statistics::pdf-laplace location scale value
::math::statistics::pdf-kumaraswamy a b value
::math::statistics::pdf-negative-binomial r p value
::math::statistics::cdf-normal mean stdev value
::math::statistics::cdf-lognormal mean stdev value
::math::statistics::cdf-exponential mean value
::math::statistics::cdf-uniform xmin xmax value
::math::statistics::cdf-triangular xmin xmax value
::math::statistics::cdf-symmetric-triangular xmin xmax value
::math::statistics::cdf-students-t degrees value
::math::statistics::cdf-gamma alpha beta value
::math::statistics::cdf-poisson mu k
::math::statistics::cdf-beta a b value
::math::statistics::cdf-weibull scale shape value
::math::statistics::cdf-gumbel location scale value
::math::statistics::cdf-pareto scale shape value
::math::statistics::cdf-cauchy location scale value
::math::statistics::cdf-F nf1 nf2 value
::math::statistics::cdf-laplace location scale value
::math::statistics::cdf-kumaraswamy a b value
::math::statistics::cdf-negative-binomial r p value
::math::statistics::empirical-distribution values
::math::statistics::random-normal mean stdev number
::math::statistics::random-lognormal mean stdev number
::math::statistics::random-exponential mean number
::math::statistics::random-uniform xmin xmax number
::math::statistics::random-triangular xmin xmax number
::math::statistics::random-symmetric-triangular xmin xmax number
::math::statistics::random-gamma alpha beta number
::math::statistics::random-poisson mu number
::math::statistics::random-chisquare df number
::math::statistics::random-student-t df number
::math::statistics::random-beta a b number
::math::statistics::random-weibull scale shape number
::math::statistics::random-gumbel location scale number
::math::statistics::random-pareto scale shape number
::math::statistics::random-cauchy location scale number
::math::statistics::random-laplace location scale number
::math::statistics::random-kumaraswamy a b number
::math::statistics::random-negative-binomial r p number
::math::statistics::histogram-uniform xmin xmax limits number
::math::statistics::incompleteGamma x p ?tol?
::math::statistics::incompleteBeta a b x ?tol?
::math::statistics::estimate-pareto values
::math::statistics::estimate-exponential values
::math::statistics::estimate-laplace values
::math::statistics::estimate-negative-binomial r values
::math::statistics::filter varname data expression
::math::statistics::map varname data expression
::math::statistics::samplescount varname list expression
::math::statistics::subdivide
::math::statistics::plot-scale canvas xmin xmax ymin ymax
::math::statistics::plot-xydata canvas xdata ydata tag
::math::statistics::plot-xyline canvas xdata ydata tag
::math::statistics::plot-tdata canvas tdata tag
::math::statistics::plot-tline canvas tdata tag
::math::statistics::plot-histogram canvas counts limits tag
The math::statistics package contains functions and procedures for basic statistical data analysis, such as:
It is meant to help in developing data analysis applications or doing ad hoc data analysis. It is not in itself a full application, nor is it intended to rival full-fledged (commercial or non-commercial) statistical packages.
The purpose of this document is to describe the implemented procedures and provide some examples of their usage. As there is ample literature on the algorithms involved, we refer to relevant textbooks for more explanation.
The package contains a fairly large number of public procedures. They can be divided into four sets: general procedures, procedures that deal with specific statistical distributions, list procedures to select or transform data, and simple plotting procedures (these require Tk).
Note: The data to be analyzed are always contained in a simple list. Missing values are represented as empty list elements.
Note: With version 1.0.1 a mistake in the procedures pdf-lognormal, cdf-lognormal and random-lognormal has been corrected. In previous versions the argument for the standard deviation was actually used as if it were the variance.
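The convention for missing values can be illustrated with a minimal sketch. The sample values are invented, and the sketch assumes that the basic statistical procedures skip empty list elements rather than treating them as zero:

```tcl
package require math::statistics

# Hypothetical measurements; the empty element {} marks a missing value
set data {1.0 2.0 {} 3.0}

set n    [::math::statistics::number $data]
set mean [::math::statistics::mean $data]

puts "Number of valid values: $n"
puts "Mean of valid values:   $mean"
```

If the assumption holds, the missing element contributes neither to the count nor to the mean.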
The general statistical procedures are:
(This routine is called whenever any or all of the basic statistical parameters are required. Hence all calculations are done and the relevant values are returned.)
Optionally, you can use weights to influence the histogram.
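A small sketch of the histogram procedure with and without weights. The limits, values and weights are made up for illustration; the sketch assumes that the result holds one entry per interval, including the two open-ended intervals below the first and above the last limit:

```tcl
package require math::statistics

set limits  {0.0 5.0 10.0}
set values  {1.0 2.0 6.0 11.0}
set weights {1.0 1.0 2.0 0.5}   ;# hypothetical weights, one per value

set counts   [::math::statistics::histogram $limits $values]
set weighted [::math::statistics::histogram $limits $values $weights]

puts "Unweighted: $counts"
puts "Weighted:   $weighted"
```

With weights, each value adds its weight instead of 1 to the interval it falls into, so the entries need not be integers.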
Compatibility issue: the original implementation and documentation used the term "confidence" and used a value 1-significance (see ticket 2812473fff). This has been corrected as of version 0.9.3.
    test-anova-F 0.05 $A $B $C
    #
    # Or equivalently:
    #
    test-anova-F 0.05 [list $A $B $C]
Note: some care is required if there is only one group to compare the control with:
    test-Dunnett 0.05 $control [list $A]
The correlation is determined in such a way that the first value is always 1 and all others are equal to or smaller than 1. The number of values involved will diminish as the "time" (the index in the list of returned values) increases.
The correlation is determined in such a way that the values can never exceed 1 in magnitude. The number of values involved will diminish as the "time" (the index in the list of returned values) increases.
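The property stated for the autocorrelation function (the first value is 1) can be checked with a short sketch. The periodic series is synthetic, and the choice of 48 points with a period of 12 is arbitrary:

```tcl
package require math::statistics

# A synthetic periodic series: four periods of 12 points each
set pi   [expr {acos(-1.0)}]
set data {}
for {set i 0} {$i < 48} {incr i} {
    lappend data [expr {sin(2.0 * $pi * $i / 12.0)}]
}

set autoc [::math::statistics::autocorr $data]

puts "Number of lags returned: [llength $autoc]"
puts "Lag 0: [format %.3f [lindex $autoc 0]]"
puts "Lag 6: [format %.3f [lindex $autoc 6]]"
```

For a series like this, the value at half a period (lag 6) would be expected to be clearly negative, but the exact numbers depend on how the procedure normalizes the sums.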
Convenience function - the result is suitable for the histogram function.
The result consists of the following list:
Returns a list of the differences between the actual data and the predicted values.
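A minimal sketch of a straight-line fit and its residuals. The data are invented, and the sketch assumes that the first two elements of the list returned by linear-model are the intercept and the slope, in that order:

```tcl
package require math::statistics

set xdata {1.0 2.0 3.0 4.0 5.0}
set ydata {2.1 3.9 6.2 7.8 10.1}

# Third argument 1: fit an intercept as well as a slope
set model     [::math::statistics::linear-model     $xdata $ydata 1]
set residuals [::math::statistics::linear-residuals $xdata $ydata 1]

# Assumed layout: intercept first, slope second
set intercept [lindex $model 0]
set slope     [lindex $model 1]

puts [format "y = %.3f + %.3f * x" $intercept $slope]
puts "Residuals: $residuals"
```

The residuals are returned in the same order as the input data, so they can be plotted or tested against xdata directly.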
Returns the "chi-square" value, which can be used to determine the significance.
Returns a short report, useful in an interactive session.
Returns the mean, the lower limit, the upper limit and the number of data per subsample.
Returns the mean range, the lower limit, the upper limit and the number of data per subsample.
Returns a list of subsamples (their indices) that indeed violate the limits.
Returns a list of subsamples (their indices) that indeed violate the limits.
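The control chart procedures are typically used in two steps: determine the limits from reference data, then test new data against them. The following sketch does this for the xbar chart with synthetic data. The subsample size of 4 and the level shift in the new data are arbitrary choices for illustration:

```tcl
package require math::statistics

# Synthetic reference data: 100 values, i.e. 25 subsamples of 4
set data {}
for {set i 0} {$i < 100} {incr i} {
    lappend data [expr {10.0 + sin(0.7 * $i)}]
}

# Control limits for the xbar chart
set control [::math::statistics::control-xbar $data 4]
lassign $control mean lower upper nsamples
puts [format "mean %.3f, limits %.3f .. %.3f" $mean $lower $upper]

# New data whose level shifts from 10 to 14 halfway through
set newdata {}
for {set i 0} {$i < 40} {incr i} {
    lappend newdata [expr {($i < 20 ? 10.0 : 14.0) + sin(0.7 * $i)}]
}

set violations [::math::statistics::test-xbar $control $newdata]
puts "Violating subsamples: $violations"
```

The same pattern applies to control-Rchart and test-Rchart, which monitor the range within each subsample instead of its mean.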
The return value consists of three lists: the centres of the bins, the associated probability density and a list of computational parameters (begin and end of the interval, mean and standard deviation and the used bandwidth). The computational parameters can be used for further analysis.
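As a sketch of how the three returned lists might be unpacked, the following calls kernel-density without any options, on the assumption that the defaults are acceptable. The two-cluster sample is synthetic:

```tcl
package require math::statistics

# Synthetic sample: two clusters of values
set data {}
for {set i 0} {$i < 30} {incr i} {
    lappend data [expr {2.0 + 0.1 * ($i % 10)}]
    lappend data [expr {5.0 + 0.2 * ($i % 7)}]
}

# Unpack the bin centres, the density and the computational parameters
lassign [::math::statistics::kernel-density $data] centres density params

puts "Number of bins:           [llength $centres]"
puts "Computational parameters: $params"
```

The centres and density lists have matching lengths, so they can be passed together to a plotting routine.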
Note: the histograms are assumed to be based on the same equidistant intervals. As the bounds are not passed, the value is expressed in the length of the intervals.
Note: the histograms are assumed to be based on the same equidistant intervals. As the bounds are not passed, the value is expressed in the length of the intervals.
Note also that the KL divergence is not symmetric and that the second histogram should not contain zeroes in places where the first histogram has non-zero values.
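The following sketch compares two hypothetical histograms defined on the same three equidistant intervals. Neither contains zeroes, so the restriction on the KL divergence does not come into play:

```tcl
package require math::statistics

# Two hypothetical histograms over the same equidistant bins
set prob1 {0.2 0.3 0.5}
set prob2 {0.1 0.6 0.3}

set w   [::math::statistics::wasserstein-distance $prob1 $prob2]
set kl12 [::math::statistics::kl-divergence $prob1 $prob2]
set kl21 [::math::statistics::kl-divergence $prob2 $prob1]

puts "Wasserstein distance: $w"
puts "KL divergence 1 -> 2: $kl12"
puts "KL divergence 2 -> 1: $kl21"
```

Swapping the arguments of kl-divergence generally gives a different value, which is the asymmetry mentioned in the note.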
Besides the linear regression with a single independent variable, the statistics package provides two procedures for doing ordinary least squares (OLS) and weighted least squares (WLS) linear regression with several variables. They were written by Eric Kemp-Benedict.
In addition to these two, it provides a procedure (tstat) for calculating the value of the t-statistic for the specified number of degrees of freedom that is required to demonstrate a given level of significance.
Note: These procedures depend on the math::linearalgebra package.
Description of the procedures
    P(t*)  =  1 - alpha/2
    P(-t*) =  alpha/2
Given a sample of normally distributed data x, with an estimate xbar for the mean and sbar for the standard deviation, the (1 - alpha) confidence interval for the estimate of the mean can be calculated as
( xbar - t* sbar , xbar + t* sbar)
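The interval formula can be worked through with a short sketch. The summary values (10 observations, xbar and sbar) are hypothetical, and sbar is used exactly as it appears in the formula above:

```tcl
package require math::statistics

# Hypothetical sample summary
set n     10     ;# number of observations
set xbar  5.2    ;# estimate of the mean
set sbar  0.4    ;# estimate of the standard deviation, as used in the formula
set alpha 0.05

# t* for n-1 degrees of freedom
set tstar [::math::statistics::tstat [expr {$n - 1}] $alpha]

set lower [expr {$xbar - $tstar * $sbar}]
set upper [expr {$xbar + $tstar * $sbar}]

puts [format "t* = %.3f, interval = (%.3f, %.3f)" $tstar $lower $upper]
```

For 9 degrees of freedom and alpha = 0.05, tables list t* as roughly 2.26, which gives a rough check on the computed value.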
The linear model is of the form
y = b0 + b1 * x1 + b2 * x2 ... + bN * xN + error
yi = b0 + b1 * xi1 + b2 * xi2 + ... + bN * xiN + Residual_i
The procedure returns a list with the following elements:
This procedure simply calls ::math::statistics::mv-wls with the weights set to 1.0, and returns the same information.
Example of the use:
    # Store the value of the unicode value for the "+/-" character
    set pm "\u00B1"

    # Provide some data
    set data {{  -.67  14.18  60.03  -7.5  }
              { 36.97  15.52  34.24  14.61 }
              {-29.57  21.85  83.36  -7.   }
              {-16.9   11.79  51.67  -6.56 }
              { 14.09  16.24  36.97 -12.84 }
              { 31.52  20.93  45.99 -25.4  }
              { 24.05  20.69  50.27  17.27 }
              { 22.23  16.91  45.07  -4.3  }
              { 40.79  20.49  38.92   -.73 }
              {-10.35  17.24  58.77  18.78 }}

    # Call the ols routine
    set results [::math::statistics::mv-ols $data]

    # Pretty-print the results
    puts "R-squared: [lindex $results 0]"
    puts "Adj R-squared: [lindex $results 1]"
    puts "Coefficients $pm s.e. -- \[95% confidence interval\]:"
    foreach val [lindex $results 2] se [lindex $results 3] bounds [lindex $results 4] {
        set lb [lindex $bounds 0]
        set ub [lindex $bounds 1]
        puts "   $val $pm $se -- \[$lb to $ub\]"
    }
In the literature a large number of probability distributions can be found. The statistics package supports:
In principle for each distribution one has procedures for:
The following procedures have been implemented:
                  1       / x               p-1
    P(p,x) =  --------   |   dt exp(-t) * t
              Gamma(p)  / 0
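The definition can be checked against a closed form: for p = 1 the integrand reduces to exp(-t) and Gamma(1) = 1, so P(1,x) = 1 - exp(-x). A minimal sketch of that check, with x = 1 chosen arbitrarily:

```tcl
package require math::statistics

set x 1.0
set p 1.0

# For p = 1 the integrand reduces to exp(-t), so P(1,x) = 1 - exp(-x)
set computed [::math::statistics::incompleteGamma $x $p]
set expected [expr {1.0 - exp(-$x)}]

puts [format "incompleteGamma: %.6f, closed form: %.6f" $computed $expected]
```

Note the argument order in the call: x comes first and p second, the reverse of the notation P(p,x).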
TO DO: more function descriptions to be added
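As an illustration of the three kinds of procedures (density, cumulative probability, random numbers), the following sketch uses the normal distribution. The mean, standard deviation and sample size are arbitrary choices:

```tcl
package require math::statistics

set mean  0.0
set stdev 1.0

# Density and cumulative probability at the mean
set pdf [::math::statistics::pdf-normal $mean $stdev 0.0]
set cdf [::math::statistics::cdf-normal $mean $stdev 0.0]

# Draw 1000 random numbers and look at their sample mean
set sample [::math::statistics::random-normal $mean $stdev 1000]

puts [format "pdf(0) = %.4f, cdf(0) = %.4f" $pdf $cdf]
puts [format "sample mean = %.3f" [::math::statistics::mean $sample]]
```

The other distributions follow the same naming pattern (pdf-, cdf- and random- prefixes), with the distribution parameters first and the value or the number of samples last.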
The data manipulation procedures act on lists or lists of lists:
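A short sketch of two of these procedures, filter and map, applied to an invented list. The variable name x is arbitrary; it is the name by which each element is referred to inside the expression:

```tcl
package require math::statistics

set data {-1.0 2.0 -3.0 4.0}

# Keep only the positive values
set positive [::math::statistics::filter x $data {$x > 0}]

# Square every value
set squares [::math::statistics::map x $data {$x * $x}]

puts "Positive: $positive"
puts "Squares:  $squares"
```

The expression is passed in braces so that $x is substituted by the procedure for each element rather than by the caller.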
The following simple plotting procedures are available:
The following procedures are yet to be implemented:
The code below is a small example of how you can examine a set of data:
    # Simple example:
    # - Generate data (as a cheap way of getting some)
    # - Perform statistical analysis to describe the data
    #
    package require math::statistics

    #
    # Two auxiliary procs
    #
    proc pause {time} {
        # vwait operates on global variables, so use ::wait throughout
        set ::wait 0
        after [expr {$time*1000}] {set ::wait 1}
        vwait ::wait
    }

    proc print-histogram {counts limits} {
        foreach count $counts limit $limits {
            if { $limit != {} } {
                puts [format "<%12.4g\t%d" $limit $count]
                set prev_limit $limit
            } else {
                puts [format ">%12.4g\t%d" $prev_limit $count]
            }
        }
    }

    #
    # Our source of arbitrary data
    #
    proc generateData { data1 data2 } {
        upvar 1 $data1 _data1
        upvar 1 $data2 _data2

        set d1 0.0
        set d2 0.0
        for { set i 0 } { $i < 100 } { incr i } {
            set d1 [expr {10.0-2.0*cos(2.0*3.1415926*$i/24.0)+3.5*rand()}]
            set d2 [expr {0.7*$d2+0.3*$d1+0.7*rand()}]
            lappend _data1 $d1
            lappend _data2 $d2
        }
        return {}
    }

    #
    # The analysis session
    #
    package require Tk
    console show
    canvas .plot1
    canvas .plot2
    pack   .plot1 .plot2 -fill both -side top

    generateData data1 data2

    puts "Basic statistics:"
    set b1 [::math::statistics::basic-stats $data1]
    set b2 [::math::statistics::basic-stats $data2]
    foreach label {mean min max number stdev var} v1 $b1 v2 $b2 {
        puts "$label\t$v1\t$v2"
    }
    puts "Plot the data as function of \"time\" and against each other"
    ::math::statistics::plot-scale .plot1  0 100  0 20
    ::math::statistics::plot-scale .plot2  0 20   0 20
    ::math::statistics::plot-tline .plot1 $data1
    ::math::statistics::plot-tline .plot1 $data2
    ::math::statistics::plot-xydata .plot2 $data1 $data2

    puts "Correlation coefficient:"
    puts [::math::statistics::corr $data1 $data2]

    pause 2
    puts "Plot histograms"
    .plot2 delete all
    ::math::statistics::plot-scale .plot2  0 20 0 100
    set limits         [::math::statistics::minmax-histogram-limits 7 16]
    set histogram_data [::math::statistics::histogram $limits $data1]
    ::math::statistics::plot-histogram .plot2 $histogram_data $limits

    puts "First series:"
    print-histogram $histogram_data $limits

    pause 2
    set limits         [::math::statistics::minmax-histogram-limits 0 15 10]
    set histogram_data [::math::statistics::histogram $limits $data2]
    ::math::statistics::plot-histogram .plot2 $histogram_data $limits d2
    .plot2 itemconfigure d2 -fill red

    puts "Second series:"
    print-histogram $histogram_data $limits

    puts "Autocorrelation function:"
    set autoc [::math::statistics::autocorr $data1]
    # map takes the variable name (x) before the data
    puts [::math::statistics::map x $autoc {[format "%.2f" $x]}]
    puts "Cross-correlation function:"
    set crossc [::math::statistics::crosscorr $data1 $data2]
    puts [::math::statistics::map x $crossc {[format "%.2f" $x]}]

    ::math::statistics::plot-scale .plot1  0 100 -1  4
    ::math::statistics::plot-tline .plot1  $autoc "autoc"
    ::math::statistics::plot-tline .plot1  $crossc "crossc"
    .plot1 itemconfigure autoc  -fill green
    .plot1 itemconfigure crossc -fill yellow

    puts "Quantiles: 0.1, 0.2, 0.5, 0.8, 0.9"
    puts "First:  [::math::statistics::quantiles $data1 {0.1 0.2 0.5 0.8 0.9}]"
    puts "Second: [::math::statistics::quantiles $data2 {0.1 0.2 0.5 0.8 0.9}]"
This document, and the package it describes, will undoubtedly contain bugs and other problems. Please report such in the category math :: statistics of the Tcllib Trackers [http://core.tcl.tk/tcllib/reportlist]. Please also report any ideas for enhancements you may have for either package and/or documentation.
When proposing code changes, please provide unified diffs, i.e. the output of diff -u.
Note further that attachments are strongly preferred over inlined patches. Attachments can be made by going to the Edit form of the ticket immediately after its creation, and then using the left-most button in the secondary navigation bar.
data analysis, mathematics, statistics
Mathematics
1 | tcllib |