Glossary: List of Mathematical/Statistical Terms
The terms in this list are given in alphabetical order, and terms that are themselves defined in this glossary appear in bold. Each definition includes a reference, given in brackets, to a page number and a source from the literature list at the end of this document. For example, [pg. 92, 5] refers to page 92 of source number 5 in the source list.
Amplitude is the maximum absolute deviation from the
mean of a time series when described by a sin/cos relationship like
Xt = R cos(ωt + θ) + Zt,
where R is the amplitude, ω is the angular frequency (i.e., the frequency is f = ω/2π), θ is the phase, and Zt is a random term. [pg. 92, 5]
Analysis of Variance (ANOVA) is a statistical test
that is commonly used in regression analysis. ANOVA is used to compare mean responses to a number of different factors, or to different levels of the same factor. [pg. 171-190, 2]
Anomaly is the difference between the value of a variable (for example, temperature) at a given location and its "normal" or long-term time average at that location. The anomaly may vary depending on what is used to define the "normal". [1]
Asymmetry (Skewness) is a distribution property computed as the third central moment: if the distribution is perfectly symmetrical, the coefficient of asymmetry is zero; positively skewed distributions have their longer tail to the right, and negatively skewed distributions to the left. [pg. 32, 2]
Autocorrelation is a type of serial correlation and is the correlation (usually the linear correlation) between members of a time series of observations and the same values at a fixed time interval (lag) later. [1]
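For illustration, a minimal Python sketch (using NumPy; the function name is ours) that estimates the autocorrelation of a series at a given lag:

    import numpy as np

    def lag_autocorrelation(x, lag=1):
        # Correlation between the series and itself shifted by `lag` steps.
        x = np.asarray(x, dtype=float)
        anomalies = x - x.mean()
        numerator = np.sum(anomalies[:-lag] * anomalies[lag:])
        denominator = np.sum(anomalies ** 2)
        return numerator / denominator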
Autoregressive process AR(p) (see Autocorrelation, Damped Persistence, Markov Process) of order p of a time-dependent random variable X is described by the relationship
Xt = a0 + Σ ak Xt-k + Zt, with the sum taken over k = 1, …, p,
where a0, …, ap are constants and Z is a white noise process. [pg 204, 2]
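As an illustration of the relationship above, a short Python sketch (NumPy assumed; the function name is ours) that simulates an AR(1) process, the simplest case with p = 1:

    import numpy as np

    def simulate_ar1(a1, n, a0=0.0, seed=0):
        # X_t = a0 + a1 * X_{t-1} + Z_t, with Z_t drawn as white noise.
        rng = np.random.default_rng(seed)
        z = rng.standard_normal(n)
        x = np.zeros(n)
        for t in range(1, n):
            x[t] = a0 + a1 * x[t - 1] + z[t]
        return x

    series = simulate_ar1(a1=0.7, n=500)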
Bias is the expected error of an estimator of a random
variable. [pg 84-85, 2]
Bootstrap (see Random Sampling, Resampling, Permutation Procedures) is the construction of artificial data batches by sampling with replacement. [pg. 146, 3]
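A minimal Python sketch of the idea (NumPy assumed; taking the sample mean as the statistic of interest is an illustrative choice):

    import numpy as np

    def bootstrap_means(data, n_batches=1000, seed=0):
        # Each artificial batch is drawn with replacement and has the
        # same size as the original sample.
        rng = np.random.default_rng(seed)
        data = np.asarray(data)
        batches = rng.choice(data, size=(n_batches, data.size), replace=True)
        return batches.mean(axis=1)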
Canonical Correlation Analysis (CCA) is a statistical
technique that identifies a sequence of pairs of patterns in two multivariate
data sets, and constructs sets of transformed variables by projecting
the original data onto these patterns. The patterns are chosen such
that the new variables exhibit maximum correlation. [pg. 398-403, 3]
Cluster analysis is an exploratory data analysis tool that separates data into groups using a degree of similarity or difference (a distance measure) between individual observations. [pg 419-428, 3]
Coefficient of Variation is a scale measure of a random variable: the ratio of its standard deviation to its mean (Cx = σx/μx). [pg. 32, 2]
Compositing is sampling done according to a specific criterion. For example, one could produce a composite
of the rainfall at a station for all years where the temperature was
much above average. [1; pg 378, 2]
Confidence Interval/Level is a range of values that
has a specified probability of containing the parameter being estimated.
For example, a true mean value Xm might have a 95% probability of being between X1 and X2, where X1 and X2 are determined from sampled values: P(X1 ≤ Xm ≤ X2) = 95%. [1]
Correlation (time series, autocorrelation, serial, and spatial) is the linear statistical relationship between two random variables. The correlation that describes a relationship (1) in time is serial correlation or lagged correlation (see also autocorrelation); (2) in space is spatial correlation; and (3) between different climate variables is cross-correlation. [pg 4-6, 2]
Covariance is the expected value of the product of the differences between two random variables and their respective expected values. If the random variables are independent, their covariance is zero. [pg 146, 2]
Cross-Validation is a re-sampling technique used in forecast verification when independent data for forecast testing are limited. Cross-validation repeatedly divides all available data into development and verification subsets; the forecast algorithm is built on the development subset and its performance is evaluated on the verification subset, which serves as an independent sample. [pg 194-195, 3]
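A minimal Python sketch of the procedure (NumPy assumed; the linear forecast model and the 5-fold split are illustrative choices, not the only possibility):

    import numpy as np

    def cross_validate(x, y, k=5):
        # x and y are NumPy arrays of equal length. Repeatedly hold out
        # one fold for verification and develop the forecast (here a
        # linear fit y ~ a + b*x) on the remaining data.
        n = len(x)
        folds = np.array_split(np.arange(n), k)
        errors = []
        for held_out in folds:
            train = np.setdiff1d(np.arange(n), held_out)
            b, a = np.polyfit(x[train], y[train], 1)
            prediction = a + b * x[held_out]
            errors.append(np.mean((y[held_out] - prediction) ** 2))
        return np.mean(errors)  # average verification error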
Damped persistence (see Autoregressive Process,
Autocorrelation, Markov Process)
Degrees of Freedom is the number of independent pieces of information contained in a statistic. If observations are independent, it is computed as the number of observations minus the number of estimated parameters. [6]
Distance measure is the statistic that describes dissimilarities
between random variables (see cluster analysis).
Distributions (Binomial, Normal, Lognormal, Gamma, etc.) are probability functions that describe the relative frequency of occurrence of data values when sampled from a population. [6]
Binomial distribution gives the probability of observing a given number of successes in a fixed number (n) of independent trials, each with success probability p. It is characterized by the two parameters p and n.
Normal (or Gaussian) distribution is a symmetric, bell-shaped distribution that is completely determined by its two parameters: the mean and the standard deviation.
Lognormal distribution is used for random variables which are constrained to be greater than 0. It is characterized by two parameters: the mean and standard deviation (of the variable's logarithm).
Gamma distribution is used for continuous random variables which are constrained to be greater than or equal to 0. It is characterized by two parameters: shape and scale.
Eigenvalue and Eigenvector (see PCA, EOF) of a square matrix B are a scalar λ and a vector e that satisfy the equation Be = λe, where λ is the eigenvalue and e is the eigenvector. [pg 369, 3]
Empirical Orthogonal Function (EOF, see also PCA) analysis is a special case of eigenvalue-eigenvector analysis in which the square matrix is also symmetric. The eigenvectors have unit amplitude and are orthogonal, and each eigenvalue is the variance accounted for by its eigenvector. Other scalings of the eigenvalues and eigenvectors are also possible. [pg 373, 3]
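For illustration, a minimal Python sketch (NumPy assumed; the function name is ours) that computes EOFs as the eigenvectors of a symmetric covariance matrix built from a (time x space) data matrix:

    import numpy as np

    def eof_analysis(field):
        # field: 2-D array with time along rows and space along columns.
        anomalies = field - field.mean(axis=0)   # remove the time mean
        cov = np.cov(anomalies, rowvar=False)    # symmetric covariance matrix
        eigvals, eigvecs = np.linalg.eigh(cov)   # eigh handles symmetric matrices
        order = np.argsort(eigvals)[::-1]        # sort by explained variance
        return eigvals[order], eigvecs[:, order]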
Field Significance (also Map, Pattern
and Global Significance) is a measure of statistical
significance of patterns (fields) of a spatially distributed variable
[pg 151-153, 3]
Filter is a linear operation that converts one time series into another. Low-pass, band-pass and high-pass filters are used to separate different signals in a time series. An example is a running mean, which removes high-frequency fluctuations from a series (a low-pass filter). [pg. 13-15, 5]
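A minimal Python sketch of the running-mean example (NumPy assumed):

    import numpy as np

    def running_mean(x, window=5):
        # Centered moving average: a simple low-pass filter that damps
        # fluctuations shorter than the window length.
        weights = np.ones(window) / window
        return np.convolve(x, weights, mode="valid")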
Frequency (low, high): see the definition under Amplitude. Low-frequency fluctuations are the long-term, and high-frequency fluctuations the short-term, variations of a time series. [pg. 13-15, 5]
F-test is a statistical test that compares the variances of two populations. [1]
Goodness of Fit tests are statistical tests (like the Chi-square test, ANOVA, the Kolmogorov-Smirnov test, etc.) that determine how well an empirical distribution fits a tested theoretical distribution. [pg 129-135, 3]
Harmonics are integer multiples of a fundamental frequency (2π/n, where n is the length of the data record). For example, the seasonal cycle is often described by the first 3 or 4 annual harmonics with n = 365 days. [pg 113, 5; pg 325-332, 3]
High-Pass (see Filters and Frequency)
Histogram is a graphical display showing the distribution
of data values in a sample by dividing the range of the data into non-overlapping
intervals and counting the number of values that fall into each interval.
These counts can be divided by the total number to give a frequency
of occurrence. Bars are plotted with height proportional to the frequencies.
[6]
Independent is a property which results when the outcome of one trial does not depend in any way on what happens in other trials (example: tossing dice). Two observations are said to be statistically independent when the value of one observation does not influence, or change, the value of the other. Most statistical procedures assume that the available data represent a random sample of independent observations. [6]
Kurtosis (Peakedness) is a property of the distribution of a random variable, computed as the fourth central moment, that characterizes how peaked (positive kurtosis, leptokurtic) or flat (negative kurtosis, platykurtic) the distribution is. [pg 32, 2]
Linear model is a model which takes the form: Y =
a + b*X [6]
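A minimal Python sketch of fitting such a model by least squares (NumPy assumed; the data values are made up for illustration):

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
    b, a = np.polyfit(x, y, deg=1)  # least-squares slope b and intercept a
    y_hat = a + b * x               # values fitted by the linear model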
Low-Pass (see Filters and Frequency)
Markov process (see Random Walk, Autocorrelation, Autoregressive process, Damped Persistence) is an autoregressive process of first order, AR(1).
Mean is a statistic that measures the center of a sample of data (the first moment) by adding up the observations and dividing by the number of data points. It may be thought of as the center of mass or balancing point for the data, i.e., the point at which a ruler would balance if all the data values were placed along it at their appropriate numerical values. [6]
Median is a statistic which measures the center of
a set of data by finding that value which divides the data in half.
A technical definition is that the median is the value which is greater
than or equal to half of the values in the data set and less than or
equal to half the values. [6]
Mode is a statistic defined as the most frequently
occurring data value. It is sometimes used as an alternative to the
mean or median as a measure of central tendency [6]
Moments are the expected values of the difference between a random variable and its expected value raised to a power (the moment number). Moments characterize distribution parameters: the first moment (about zero) is the mean, the second central moment is the variance, the third central moment determines the skewness, etc. [pg 32, 2]
Moving Average (see Filters)
Multivariate Analysis is performed when datasets are composed of vector observations. These can consist of observations of different variables at one location, or of gridpoint values for a particular time sequence. Different multivariate data analysis techniques are used in climate research (see CCA, EOF, SVD, PCA, etc.). [pg 359, 3]
Noise (see random sample
and statistical noise) is the error term in every model
fit to the observations. [pg 90, 5; pg 185, 2]
Normalization (see standardization) is a data transformation (the ratio of the difference between each observation and the sample mean to the standard deviation) that produces a distribution with mean 0 and variance 1, close to the standard normal distribution N(0, 1).
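A minimal Python sketch of the transformation (NumPy assumed; the function name is ours):

    import numpy as np

    def standardize(x):
        # Subtract the sample mean and divide by the standard deviation,
        # giving a series with mean 0 and variance 1.
        x = np.asarray(x, dtype=float)
        return (x - x.mean()) / x.std()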
Non-linear is not linear: in other words, functions that lack either of the properties f(x+y) = f(x) + f(y) and f(ax) = af(x). [1]
Periodogram (see spectra) is a plot of the variance in each fixed frequency interval versus frequency. [pg 110, 5]
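For illustration, a minimal Python sketch (NumPy assumed; the function name is ours) that computes a raw periodogram from the discrete Fourier transform of a series:

    import numpy as np

    def periodogram(x):
        # Squared magnitude of the Fourier transform of the anomalies,
        # i.e., the variance attributed to each Fourier frequency.
        x = np.asarray(x, dtype=float)
        n = x.size
        power = np.abs(np.fft.rfft(x - x.mean())) ** 2 / n
        freqs = np.fft.rfftfreq(n)  # frequencies in cycles per time step
        return freqs, power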
Permutation Procedures (see Bootstrap, Random Sampling, Resample) construct artificial data batches by reordering (permuting) the original data, i.e., by sampling without replacement.
Phase Shift – see Amplitude
Principal Component Analysis (PCA) (see EOF, CCA). [pg 373, 3]
Probability is a number between 0 (never occurs) and 1 (always occurs) which represents how likely an event is to occur. Probability is normally defined in terms of the relative frequency of occurrence of an event which can be repeated many times. [6]
Probability Density Function (pdf, see Distributions) is a continuous function f(x) where f(x) ≥ 0 and ∫ f(x) dx = 1. The probability for x to be between a and b is equal to ∫_a^b f(x) dx. [2]
Probability Distributions (see Distributions,
Binomial; Normal; Lognormal;
Gamma)
Probability of Exceedance is the probability that a certain value of interest will be exceeded, given a forecast shift in a distribution. It is computed as the area under the pdf from that value to +∞. [2]
Quintiles divide a sample into five equally-populated,
rank-ordered classes. [6]
Random Sampling is a sampling method in which all elements in the population have an equal chance of being selected, and in which the value of one observation does not affect the outcome of other observations. [6]
Random Variables (Continuous; Discrete) are functions which assign a numerical value to each possible outcome of an experiment. The values of random variables differ from one observation to the next in a manner described by their probability distribution. [6]
Continuous is a type of random variable which may
take any value over an interval.
Discrete is a type of random variable which may
take on only a limited set of values, such as 1,2,3,...,10. The list
may be finite, or there may be an infinite number of values. A discrete
random variable is to be contrasted with a continuous random variable.
Random walk is a process {Xt} with Xt = Xt-1 + Zt, where {Zt} are mutually independent, identically distributed random variables. [pg 32, 5]
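A minimal Python sketch of the process (NumPy assumed):

    import numpy as np

    rng = np.random.default_rng(0)
    z = rng.standard_normal(1000)  # i.i.d. increments Z_t
    x = np.cumsum(z)               # X_t = X_{t-1} + Z_t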
Regression is a statistical model relating a dependent
variable to one or more independent variables. [6]
Resample (Bootstrap, Permutation
Procedures, Random Sampling)
is an artificial data set constructed from a given collection of real
data [pg. 145, 3]
Residual is the remainder after a filter or a model has been applied to the data series.
Sample is a set of observations, usually considered
to have been taken from a much larger population. Statistics are numerical
or graphical quantities calculated from a sample. Since the data in
one sample will vary from that of another, so will the statistics calculated
from those samples. [6]
Seasonality is the typical or "mean" variation
of a variable throughout the year.[1]
Skewness (asymmetry) is the third central moment of a random variable.
Skill Scores are measures of the accuracy of forecasts relative to a reference forecast. [pg. 9, 131, 2]
Smoothing (see low-pass filters)
Spectrum (plural: spectra) (see Periodogram).
Spectral analysis is the name given to methods of
estimating the spectral density function, or spectrum.
Standard Deviation (see variance) is the positive square root of the variance. If a random variable is normally distributed, about 68% (more than 99%) of its values lie within one (three) standard deviation(s) of its mean. [pg 149, 163, 4; 6]
Standardization (see normalization)
Stationary (Non-stationary) time
series is the characteristic of a time series whose distribution does
not (does) change over time. [6]
Statistical Homogeneity is the property of random variables having the same statistical characteristics.
Statistical Noise (white, red) is a model applied to the error term after the predictable portion of a time series has been removed.
Red noise refers to any linear stochastic process
in which power declines monotonically with increasing frequency. [1]
White noise refers to a purely random process in which
power is equal at all frequencies.
Statistical Significance is the probability that the tested signal or hypothesis would not be masked by noise in another, identical but statistically independent sample. [pg 15, 3]
T-test is a hypothesis test based on Student's t
distribution. This is commonly used to test hypotheses about one or
more population means or coefficients in a linear regression model.
[6]
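For illustration, a minimal Python sketch (NumPy and SciPy assumed; the two samples are synthetic) of a two-sample t-test of equal population means:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    sample_a = rng.normal(loc=0.0, scale=1.0, size=30)
    sample_b = rng.normal(loc=0.5, scale=1.0, size=30)
    # Small p-values argue against the hypothesis of equal means.
    t_stat, p_value = stats.ttest_ind(sample_a, sample_b)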
Terciles are the statistics which divide the observations
in a numeric sample into 3 intervals, each containing 33.333% of the
data. The lower, middle, and upper terciles are computed by ordering
the data from smallest to largest and then finding the values below
which fall 1/3 and 2/3 of the data.
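A minimal Python sketch of the computation (NumPy assumed; the class names follow common climate-forecast usage and are our own labels):

    import numpy as np

    data = np.random.default_rng(0).normal(size=90)
    # Values below which 1/3 and 2/3 of the data fall.
    lower, upper = np.percentile(data, [100 / 3, 200 / 3])
    below_normal = data[data < lower]
    near_normal = data[(data >= lower) & (data <= upper)]
    above_normal = data[data > upper]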
Transformation is a function (exponentiation, square root, etc.) applied to the data to make them fit a normal distribution more closely (which is useful in statistical models). [pg 168-169, 2]
Trend is long-term change in the mean level [pg 10,
5].
Validation is the test of a model.
Variance (see moments, standard
deviation) is the second central moment,
and is a scale parameter of a probability distribution.
[pg 32, 2]
Verification is a test either of the model procedure (whether the model computes correctly) or of a model forecast (whether the forecast agrees with the observations). [3]
Literature:
- von Storch, H., and Zwiers, F.W., 1999. Statistical Analysis in Climate Research. Cambridge University Press.
- Wilks, D.S., 1995. Statistical Methods in the Atmospheric Sciences. Academic Press.
- Dowdy, S., and Wearden, S., 1991. Statistics for Research, 2nd edition. John Wiley & Sons.
- Chatfield, C., 1995. The Analysis of Time Series, 4th edition. Chapman & Hall.