Glossary: List of Mathematical/Statistical Terms
The terms in this list are given in alphabetical order, and terms that are themselves defined in this glossary appear in bold. Each definition includes a reference, given in brackets, to a page number and a source from the literature list at the end of this document. For example, [pg. 92, 5] refers to page 92 of source number 5 in the source list.
Amplitude is the maximum absolute deviation from the
mean of a time series when described by a sin/cos relationship like
Xt = R cos(ωt + θ) + Zt,
where R is the amplitude, ω is the angular frequency (i.e., the frequency is f = ω/2π), θ is the phase, and Zt is a random term. [pg. 92, 5]
Analysis of Variance (ANOVA) is a statistical test
that is commonly used in regression analysis. ANOVA is used to compare mean responses to a number of different factors, or to different levels of the same factor. [pg. 171-190, 2]
Anomaly is the difference between the value of a variable (for example, temperature) at a given location and its "normal" or long-term time average at that location. The anomaly may vary depending on what is used to define the "normal". [1]
Asymmetry (Skewness) is a distribution property computed as the third central moment: if the distribution is perfectly symmetrical, the coefficient of asymmetry is zero; positively skewed distributions have their longer tail to the right, and negatively skewed distributions to the left. [pg. 32, 2]
Autocorrelation is a type of serial correlation and is the correlation (usually the linear correlation) between members of a time series of observations and the same values at a fixed time interval (lag) later. [1]
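For illustration, a minimal Python sketch (using NumPy; the function name is ours) that estimates the autocorrelation of a series at a given lag:

    import numpy as np

    def lag_autocorrelation(x, lag=1):
        # Correlation between the series and itself shifted by `lag` steps.
        x = np.asarray(x, dtype=float)
        anomalies = x - x.mean()
        numerator = np.sum(anomalies[:-lag] * anomalies[lag:])
        denominator = np.sum(anomalies ** 2)
        return numerator / denominator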
Autoregressive process AR(p) (see Autocorrelation, Damped Persistence, Markov Process) of order p of a time-dependent random variable X is described by the relationship
Xt = a0 + Σ ak Xt-k + Zt, with the sum taken over k = 1, …, p,
where a0, …, ap are constants and Z is a white noise process. [pg 204, 2]
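As an illustration of the relationship above, a short Python sketch (NumPy assumed; the function name is ours) that simulates an AR(1) process, the simplest case with p = 1:

    import numpy as np

    def simulate_ar1(a1, n, a0=0.0, seed=0):
        # X_t = a0 + a1 * X_{t-1} + Z_t, with Z_t drawn as white noise.
        rng = np.random.default_rng(seed)
        z = rng.standard_normal(n)
        x = np.zeros(n)
        for t in range(1, n):
            x[t] = a0 + a1 * x[t - 1] + z[t]
        return x

    series = simulate_ar1(a1=0.7, n=500)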
Bias is the expected error of an estimator of a random
variable. [pg 84-85, 2]
Bootstrap (see Random Sampling, Resampling, Permutation Procedures) is the construction of artificial data batches by sampling with replacement. [pg. 146, 3]
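A minimal Python sketch of the idea (NumPy assumed; taking the sample mean as the statistic of interest is an illustrative choice):

    import numpy as np

    def bootstrap_means(data, n_batches=1000, seed=0):
        # Each artificial batch is drawn with replacement and has the
        # same size as the original sample.
        rng = np.random.default_rng(seed)
        data = np.asarray(data)
        batches = rng.choice(data, size=(n_batches, data.size), replace=True)
        return batches.mean(axis=1)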
Canonical Correlation Analysis (CCA) is a statistical
technique that identifies a sequence of pairs of patterns in two multivariate
data sets, and constructs sets of transformed variables by projecting
the original data onto these patterns. The patterns are chosen such
that the new variables exhibit maximum correlation. [pg. 398-403, 3]
Cluster analysis is an exploratory data analysis tool that separates data into groups using a degree of similarity or difference (a distance measure) between individual observations. [pg 419-428, 3]
Coefficient of Variation is a scale measure of a random variable: the ratio of its standard deviation to its mean (Cx = σx/μx). [pg. 32, 2]
Compositing is sampling done according to a specific criterion. For example, one could produce a composite
of the rainfall at a station for all years where the temperature was
much above average. [1; pg 378, 2]
Confidence Interval/Level is a range of values that
has a specified probability of containing the parameter being estimated.
For example, a true mean value Xm might have a 95% probability of being between X1 and X2, where X1 and X2 are determined from sampled values: P(X1 ≤ Xm ≤ X2) = 95%. [1]
Correlation (time series, autocorrelation, serial, and spatial) is the linear statistical relationship between two random variables. The correlation that describes a relationship (1) in time is serial correlation or lagged correlation (see also autocorrelation); (2) in space is spatial correlation; and (3) between different climate variables is cross-correlation. [pg 4-6, 2]
Covariance is the expected value of the product of the differences between two random variables and their respective expected values. If the random variables are independent, their covariance is zero. [pg 146, 2]
Cross-Validation is a re-sampling technique used in forecast verification when independent data for forecast testing are limited. Cross-validation repeatedly divides all available data into development and verification subsets; the forecast algorithm is built on the development subset and its performance is evaluated on the verification subset, which serves as an independent sample. [pg 194-195, 3]
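A minimal Python sketch of the procedure (NumPy assumed; the linear forecast model and the 5-fold split are illustrative choices, not the only possibility):

    import numpy as np

    def cross_validate(x, y, k=5):
        # x and y are NumPy arrays of equal length. Repeatedly hold out
        # one fold for verification and develop the forecast (here a
        # linear fit y ~ a + b*x) on the remaining data.
        n = len(x)
        folds = np.array_split(np.arange(n), k)
        errors = []
        for held_out in folds:
            train = np.setdiff1d(np.arange(n), held_out)
            b, a = np.polyfit(x[train], y[train], 1)
            prediction = a + b * x[held_out]
            errors.append(np.mean((y[held_out] - prediction) ** 2))
        return np.mean(errors)  # average verification error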
Damped persistence (see Autoregressive Process,
Autocorrelation, Markov Process)
Degrees of Freedom is the number of independent pieces of information contained in a statistic. If observations are independent, it is computed as the number of observations minus the number of estimated parameters. [6]
Distance measure is the statistic that describes dissimilarities
between random variables (see cluster analysis).
Distributions (Binomial, Normal, Lognormal, Gamma, etc.) are probability functions that describe the relative frequency of occurrence of data values when sampled from a population. [6]
Binomial distribution gives the probability of observing a given number of successes in a fixed number (n) of independent trials, each with success probability p. It is characterized by the two parameters p and n.
Normal (or Gaussian) distribution is a symmetric, bell-shaped distribution that is completely determined by its two parameters: the mean and the standard deviation.
Lognormal distribution is used for random variables which are constrained to be greater than 0. It is characterized by two parameters: the mean and standard deviation (of the variable's logarithm).
Gamma distribution is used for continuous random variables which are constrained to be greater than or equal to 0. It is characterized by two parameters: shape and scale.
Eigenvalue and Eigenvector (see PCA, EOF) of a square matrix B are a scalar λ and a vector e that satisfy the equation Be = λe, where λ is the eigenvalue and e is the eigenvector. [pg 369, 3]
Empirical Orthogonal Function (EOF, see also PCA) analysis is a special case of eigenvalue-eigenvector analysis in which the square matrix is also symmetric. The eigenvectors have unit amplitude and are orthogonal, and each eigenvalue is the variance accounted for by its eigenvector. Other scalings of the eigenvalues and eigenvectors are also possible. [pg 373, 3]
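For illustration, a minimal Python sketch (NumPy assumed; the function name is ours) that computes EOFs as the eigenvectors of a symmetric covariance matrix built from a (time x space) data matrix:

    import numpy as np

    def eof_analysis(field):
        # field: 2-D array with time along rows and space along columns.
        anomalies = field - field.mean(axis=0)   # remove the time mean
        cov = np.cov(anomalies, rowvar=False)    # symmetric covariance matrix
        eigvals, eigvecs = np.linalg.eigh(cov)   # eigh handles symmetric matrices
        order = np.argsort(eigvals)[::-1]        # sort by explained variance
        return eigvals[order], eigvecs[:, order]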
Field Significance (also Map, Pattern
and Global Significance) is a measure of statistical
significance of patterns (fields) of a spatially distributed variable
[pg 151-153, 3]
Filter is a linear operation that converts one time series into another. Low-pass, band-pass and high-pass filters are used to separate different signals in a time series. An example is a running mean, which removes high-frequency fluctuations from a series (a low-pass filter). [pg. 13-15, 5]
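A minimal Python sketch of the running-mean example (NumPy assumed):

    import numpy as np

    def running_mean(x, window=5):
        # Centered moving average: a simple low-pass filter that damps
        # fluctuations shorter than the window length.
        weights = np.ones(window) / window
        return np.convolve(x, weights, mode="valid")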
Frequency (low, high): see the definition under Amplitude. Low-frequency fluctuations are the long-term, and high-frequency fluctuations the short-term, variations of a time series. [pg. 13-15, 5]
F-test is a statistical test that compares the variances of two populations. [1]
Goodness of Fit tests are statistical tests (like the Chi-square test, ANOVA, the Kolmogorov-Smirnov test, etc.) that determine how well an empirical distribution fits a tested theoretical distribution. [pg 129-135, 3]
Harmonics are integer multiples of a fundamental frequency (2π/n, where n is the length of the data record). For example, the seasonal cycle is often described by the first 3 or 4 annual harmonics with n = 365 days. [pg 113, 5; pg 325-332, 3]
High-Pass (see Filters and Frequency)
Histogram is a graphical display showing the distribution
of data values in a sample by dividing the range of the data into non-overlapping
intervals and counting the number of values that fall into each interval.
These counts can be divided by the total number to give a frequency
of occurrence. Bars are plotted with height proportional to the frequencies.
[6]
Independent is a property which results when the outcome of one trial does not depend in any way on what happens in other trials (example: tossing dice). Two observations are said to be statistically independent when the value of one observation does not influence, or change, the value of the other. Most statistical procedures assume that the available data represent a random sample of independent observations. [6]
Kurtosis (Peakedness) is a property of the distribution of a random variable, computed as the fourth central moment, that characterizes how peaked (positive kurtosis, leptokurtic) or flat (negative kurtosis, platykurtic) the distribution is. [pg 32, 2]
Linear model is a model which takes the form: Y =
a + b*X [6]
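A minimal Python sketch of fitting such a model by least squares (NumPy assumed; the data values are made up for illustration):

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
    b, a = np.polyfit(x, y, deg=1)  # least-squares slope b and intercept a
    y_hat = a + b * x               # values fitted by the linear model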
Low-Pass (see Filters and Frequency)
Markov process (see Random Walk, Autocorrelation, Autoregressive process, Damped Persistence) is an autoregressive process of first order, AR(1).
Mean is a statistic that measures the center of a sample of data (the first moment) by adding up the observations and dividing by the number of data points. It may be thought of as the center of mass or balancing point for the data, i.e., the point at which a ruler would balance if all the data values were placed along it at their appropriate numerical values. [6]
Median is a statistic which measures the center of
a set of data by finding that value which divides the data in half.
A technical definition is that the median is the value which is greater
than or equal to half of the values in the data set and less than or
equal to half the values. [6]
Mode is a statistic defined as the most frequently
occurring data value. It is sometimes used as an alternative to the
mean or median as a measure of central tendency [6]
Moments are the expected values of the difference between a random variable and its expected value raised to a power (the moment number). Moments characterize distribution parameters: the first moment (about zero) is the mean, the second central moment is the variance, the third central moment determines the skewness, etc. [pg 32, 2]
Moving Average (see Filters)
Multivariate Analysis is performed when datasets are composed of vector observations. These can consist of observations of different variables at one location, or of gridpoint values for a particular time sequence. Different multivariate data analysis techniques are used in climate research (see CCA, EOF, SVD, PCA, etc.). [pg 359, 3]
Noise (see random sample
and statistical noise) is the error term in every model
fit to the observations. [pg 90, 5; pg 185, 2]
Normalization (see standardization) is a data transformation (the ratio of the difference between each observation and the sample mean to the standard deviation) that produces a distribution with mean 0 and variance 1, close to the standard normal distribution N(0, 1).
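A minimal Python sketch of the transformation (NumPy assumed; the function name is ours):

    import numpy as np

    def standardize(x):
        # Subtract the sample mean and divide by the standard deviation,
        # giving a series with mean 0 and variance 1.
        x = np.asarray(x, dtype=float)
        return (x - x.mean()) / x.std()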
Non-linear is not linear: in other words, functions that lack either of the properties f(x+y) = f(x) + f(y) and f(ax) = af(x). [1]
Periodogram (see spectra) is a plot of the variance in each fixed frequency interval versus frequency. [pg 110, 5]
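For illustration, a minimal Python sketch (NumPy assumed; the function name is ours) that computes a raw periodogram from the discrete Fourier transform of a series:

    import numpy as np

    def periodogram(x):
        # Squared magnitude of the Fourier transform of the anomalies,
        # i.e., the variance attributed to each Fourier frequency.
        x = np.asarray(x, dtype=float)
        n = x.size
        power = np.abs(np.fft.rfft(x - x.mean())) ** 2 / n
        freqs = np.fft.rfftfreq(n)  # frequencies in cycles per time step
        return freqs, power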
Permutation Procedures (see Bootstrap, Random Sampling, Resample) construct artificial data batches by reordering (permuting) the original data, i.e., by sampling without replacement.
Phase Shift – see Amplitude
Principal Component Analysis (PCA) (see EOF, CCA). [pg 373, 3]
Probability is a number between 0 (never occurs) and 1 (always occurs) which represents how likely an event is to occur. Probability is normally defined in terms of the relative frequency of occurrence of an event which can be repeated many times. [6]
Probability Density Function (pdf, see Distributions) is a continuous function f(x) where f(x) ≥ 0 and ∫ f(x) dx = 1. The probability for x to be between a and b is equal to ∫_a^b f(x) dx. [2]
Probability Distributions (see Distributions,
Binomial; Normal; Lognormal;
Gamma)
Probability of Exceedance is the probability that a certain value of interest will be exceeded, given a forecast shift in a distribution. It is computed as the area under the pdf from that value to +∞. [2]
Quintiles divide a sample into five equally-populated,
rank-ordered classes. [6]
Random Sampling is a sampling method in which all elements in the population have an equal chance of being selected, and in which the value of one observation does not affect the outcome of other observations. [6]
Random Variables (Continuous; Discrete) are functions which assign a numerical value to each possible outcome of an experiment. The values of random variables differ from one observation to the next in a manner described by their probability distribution. [6]
Continuous is a type of random variable which may
take any value over an interval.
Discrete is a type of random variable which may
take on only a limited set of values, such as 1,2,3,...,10. The list
may be finite, or there may be an infinite number of values. A discrete
random variable is to be contrasted with a continuous random variable.
Random walk is a process {Xt} with Xt = Xt-1 + Zt, where {Zt} are mutually independent, identically distributed random variables. [pg 32, 5]
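A minimal Python sketch of the process (NumPy assumed):

    import numpy as np

    rng = np.random.default_rng(0)
    z = rng.standard_normal(1000)  # i.i.d. increments Z_t
    x = np.cumsum(z)               # X_t = X_{t-1} + Z_t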
Regression is a statistical model relating a dependent
variable to one or more independent variables. [6]
Resample (Bootstrap, Permutation
Procedures, Random Sampling)
is an artificial data set constructed from a given collection of real
data [pg. 145, 3]
Residual is the remainder after a filter or a model has been applied to the data series.
Sample is a set of observations, usually considered
to have been taken from a much larger population. Statistics are numerical
or graphical quantities calculated from a sample. Since the data in
one sample will vary from that of another, so will the statistics calculated
from those samples. [6]
Seasonality is the typical or "mean" variation
of a variable throughout the year.[1]
Skewness (asymmetry) is the third central moment of a random variable.
Skill Scores are measures of the accuracy of forecasts relative to a reference forecast. [pg. 9, 131, 2]
Smoothing (see low-pass filters)
Spectrum (plural: spectra) (see Periodogram).
Spectral analysis is the name given to methods of
estimating the spectral density function, or spectrum.
Standard Deviation (see variance) is the positive square root of the variance. If a random variable is normally distributed, about 68% (more than 99%) of its values lie within one (three) standard deviation(s) of its mean. [pg 149, 163, 4; 6]
Standardization (see normalization)
Stationary (Non-stationary) time
series is the characteristic of a time series whose distribution does
not (does) change over time. [6]
Statistical Homogeneity is the property of random variables having the same statistical characteristics.
Statistical Noise (white, red) is a model applied to the error term after the predictable portion of a time series has been removed.
Red noise refers to any linear stochastic process
in which power declines monotonically with increasing frequency. [1]
White noise refers to a purely random process in which
power is equal at all frequencies.
Statistical Significance is the probability that the tested signal or hypothesis would not be masked by noise in another, identical but statistically independent sample. [pg 15, 3]
T-test is a hypothesis test based on Student's t
distribution. This is commonly used to test hypotheses about one or
more population means or coefficients in a linear regression model.
[6]
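For illustration, a minimal Python sketch (NumPy and SciPy assumed; the two samples are synthetic) of a two-sample t-test of equal population means:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    sample_a = rng.normal(loc=0.0, scale=1.0, size=30)
    sample_b = rng.normal(loc=0.5, scale=1.0, size=30)
    # Small p-values argue against the hypothesis of equal means.
    t_stat, p_value = stats.ttest_ind(sample_a, sample_b)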
Terciles are the statistics which divide the observations
in a numeric sample into 3 intervals, each containing 33.333% of the
data. The lower, middle, and upper terciles are computed by ordering
the data from smallest to largest and then finding the values below
which fall 1/3 and 2/3 of the data.
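A minimal Python sketch of the computation (NumPy assumed; the class names follow common climate-forecast usage and are our own labels):

    import numpy as np

    data = np.random.default_rng(0).normal(size=90)
    # Values below which 1/3 and 2/3 of the data fall.
    lower, upper = np.percentile(data, [100 / 3, 200 / 3])
    below_normal = data[data < lower]
    near_normal = data[(data >= lower) & (data <= upper)]
    above_normal = data[data > upper]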
Transformation is a function (exponentiation, square root, etc.) applied to the data to make them fit a normal distribution more closely (which is useful in statistical models). [pg 168-169, 2]
Trend is long-term change in the mean level [pg 10,
5].
Validation is the test of a model.
Variance (see moments, standard
deviation) is the second central moment,
and is a scale parameter of a probability distribution.
[pg 32, 2]
Verification is a test either of the model procedure (whether the model computes correctly) or of a model forecast (whether the forecast agrees with the observations). [3]
Literature:
- von Storch, H., and Zwiers, F.W., 1999. Statistical Analysis in Climate Research. Cambridge University Press.
- Wilks, D.S., 1995. Statistical Methods in the Atmospheric Sciences. Academic Press.
- Dowdy, S., and Wearden, S., 1991. Statistics for Research, 2nd edition. John Wiley & Sons.
- Chatfield, C., 1995. The Analysis of Time Series, 4th edition. Chapman & Hall.