Numerical Algorithms Group

NAG Data Mining Components - Functionality

The functionality in NAG's Data Mining Components falls into the following areas, each described in more detail below:

Data Cleaning

Data Imputation
Outlier Detection

Data Transformations

Scaling Data
Principal Component Analysis

Cluster Analysis

k-means Clustering
Hierarchical Clustering

Classification

Classification Trees
Generalised Linear Models
Nearest Neighbours

Regression

Regression Trees
Linear Regression
Multi-layer Perceptron Neural Networks
Nearest Neighbours
Radial Basis Function Models

Association Rules
Utility Functions

Data Cleaning

Data Imputation. Missing values in the data are replaced with suitable values using one of three fundamental approaches: summary statistics; distance-based measures; or the EM algorithm for multivariate Normal data.
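
As a concrete illustration of the summary-statistics approach, the sketch below replaces each missing value with its column mean. It is illustrative Python only and does not use the NAG DMC interface.

    # Sketch of summary-statistic imputation: replace missing values
    # (NaN) in each column with that column's mean.
    import numpy as np

    def mean_impute(x):
        """Return a copy of x with NaNs replaced by column means."""
        x = np.array(x, dtype=float)
        col_means = np.nanmean(x, axis=0)     # per-column mean, ignoring NaNs
        rows, cols = np.where(np.isnan(x))    # positions of missing values
        x[rows, cols] = col_means[cols]       # fill with each column's mean
        return x

    data = [[1.0, 2.0], [float("nan"), 4.0], [3.0, float("nan")]]
    print(mean_impute(data))   # the NaNs become 2.0 and 3.0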

Outlier Detection. Outlier detection is concerned with finding suspect records in a data set. Records are identified as suspect if they do not appear to have been drawn from an assumed data distribution.
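
For example, under an assumed Normal distribution a record can be flagged as suspect when it lies unusually far from the mean. A minimal sketch (illustrative Python, not NAG DMC code):

    # Flag values whose z-score under an assumed Normal distribution
    # exceeds a chosen threshold.
    import numpy as np

    def flag_outliers(x, threshold=2.5):
        """Return a boolean mask marking suspect values."""
        z = np.abs(x - x.mean()) / x.std()
        return z > threshold

    x = np.array([9.8, 10.1, 9.9, 10.0, 10.2, 9.7, 10.1, 9.9, 25.0])
    print(flag_outliers(x))   # only the value 25.0 is flagged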

Data Transformations

Scaling Data. The contribution to a distance computation by data values on a continuous variable depends on the range of its values. Thus, for functions which do not include scaling as an option, it is often preferable to transform all data values on continuous variables to be on the same scale.
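
A minimal sketch of one common choice, standardising each variable to zero mean and unit standard deviation (illustrative Python, not the NAG DMC interface):

    # Scale each continuous variable so that all variables contribute
    # comparably to distance computations.
    import numpy as np

    def standardise(x):
        """Scale each column of x to mean 0 and standard deviation 1."""
        return (x - x.mean(axis=0)) / x.std(axis=0)

    x = np.array([[1.0, 1000.0], [2.0, 2000.0], [3.0, 3000.0]])
    print(standardise(x))   # both columns now lie on the same scale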

Principal Component Analysis. Principal component analysis (PCA) is a tool for reducing the number of variables that you need to consider in an analysis. PCA derives a set of orthogonal, i.e., uncorrelated, variables that contain most of the information in the original data. The new variables, called principal components, are calculated as linear transformations of the original data.
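
The sketch below shows the standard construction via an eigendecomposition of the sample covariance matrix; it is illustrative Python and does not reflect the NAG DMC interface.

    # PCA: uncorrelated components are the eigenvectors of the covariance
    # matrix, ordered by the variance they explain.
    import numpy as np

    def pca(x, n_components):
        """Project centred data onto the leading principal components."""
        centred = x - x.mean(axis=0)
        cov = np.cov(centred, rowvar=False)      # sample covariance
        eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
        order = np.argsort(eigvals)[::-1]        # largest variance first
        components = eigvecs[:, order[:n_components]]
        return centred @ components              # the new variables (scores)

    rng = np.random.default_rng(0)
    x = rng.normal(size=(100, 5))
    print(pca(x, n_components=2).shape)   # (100, 2)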

Cluster Analysis

Cluster analysis is the statistical name for techniques that aim to find groups of similar data records in a study.

*k-means Clustering. In k-means clustering the analyst decides in advance how many groups, or clusters, there are in the data; the records are then grouped around that many cluster centres.
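
A minimal sketch of the usual iteration (Lloyd's algorithm), in illustrative Python rather than NAG DMC code:

    # k-means: assign records to the nearest centre, recompute centres,
    # and repeat until the centres stop moving.
    import numpy as np

    def kmeans(x, k, n_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        centres = x[rng.choice(len(x), size=k, replace=False)]
        for _ in range(n_iter):
            # distance from every record to every centre
            dists = np.linalg.norm(x[:, None, :] - centres[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # move each centre to the mean of its assigned records
            new_centres = np.array([x[labels == j].mean(axis=0)
                                    if np.any(labels == j) else centres[j]
                                    for j in range(k)])
            if np.allclose(new_centres, centres):
                break
            centres = new_centres
        return labels, centres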

*Hierarchical Clustering. Hierarchical clustering starts from the collection of data records and agglomerates them step-by-step until there is only one group. The analyst uses results from the hierarchical clustering to determine a natural number of clusters.
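
The sketch below illustrates one agglomeration rule, single linkage, where the distance between two groups is the smallest distance between their members. It is illustrative Python, not NAG DMC code, and single linkage is just one possible merging criterion.

    # Agglomerative clustering: start with one cluster per record and
    # repeatedly merge the two closest clusters until one remains.
    import numpy as np

    def single_linkage(x):
        """Return the merge history as (cluster_a, cluster_b, distance)."""
        clusters = [[i] for i in range(len(x))]
        history = []
        while len(clusters) > 1:
            best = None
            for a in range(len(clusters)):
                for b in range(a + 1, len(clusters)):
                    # smallest member-to-member distance between the groups
                    d = min(np.linalg.norm(x[i] - x[j])
                            for i in clusters[a] for j in clusters[b])
                    if best is None or d < best[0]:
                        best = (d, a, b)
            d, a, b = best
            history.append((list(clusters[a]), list(clusters[b]), d))
            clusters[a] = clusters[a] + clusters[b]
            del clusters[b]
        return history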

Classification

Classification Trees. NAG Data Mining Components includes functions to calculate binary and n-ary decision trees for classification. The binary classification tree uses the Gini index criterion at nodes, whereas the n-ary classification tree uses an entropy-based criterion.
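
Both criteria measure how mixed the class labels at a node are; a split is chosen to reduce the impurity as much as possible. A minimal sketch of the two measures (illustrative Python):

    # Gini index (binary tree criterion) and entropy (n-ary tree
    # criterion) for the class labels reaching a node.
    import numpy as np

    def gini(labels):
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return 1.0 - np.sum(p ** 2)

    def entropy(labels):
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))

    labels = np.array(["a", "a", "a", "b"])
    print(gini(labels), entropy(labels))   # 0.375 and about 0.811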

Generalised Linear Models. Generalised linear models allow a wide range of models to be fitted. These include logistic and probit regression models for binary data, and log-linear models for contingency tables. In NAG DMC the following distributions are available: binomial distribution (for binary classification tasks) and Poisson distribution (typically used for count data).
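
As an illustration of the binomial case, the sketch below fits a logistic regression by simple gradient ascent on the log-likelihood. NAG DMC's own fitting procedure is not shown; this stand-in is illustrative only.

    # Logistic regression: model p(y=1) = sigmoid(x @ w + b), y in {0, 1}.
    import numpy as np

    def fit_logistic(x, y, lr=0.1, n_iter=1000):
        w = np.zeros(x.shape[1])
        b = 0.0
        for _ in range(n_iter):
            p = 1.0 / (1.0 + np.exp(-(x @ w + b)))   # predicted probabilities
            w += lr * x.T @ (y - p) / len(y)         # log-likelihood gradient
            b += lr * (y - p).mean()
        return w, b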

Nearest Neighbours. k-nearest neighbour models predict values based on values of the k most similar data records in a training set of data. The measure of similarity is taken to be one of two distance functions. Prior probabilities can be set for the classes in the data. Training data are stored in a binary tree to enable efficient searching for nearest neighbours.
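
A brute-force sketch of the prediction step, using Euclidean distance and a majority vote (NAG DMC's binary-tree search and class priors are omitted; illustrative Python only):

    # k-nearest-neighbour classification by exhaustive distance search.
    import numpy as np
    from collections import Counter

    def knn_classify(train_x, train_y, query, k=3):
        dists = np.linalg.norm(train_x - query, axis=1)   # similarity measure
        nearest = np.argsort(dists)[:k]                   # k most similar records
        return Counter(train_y[nearest]).most_common(1)[0][0]

    train_x = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]])
    train_y = np.array([0, 0, 1, 1])
    print(knn_classify(train_x, train_y, np.array([4.8, 5.2])))   # predicts 1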

Regression

Regression Trees. The two decision trees available in NAG DMC for regression tasks are both binary trees. Each regression tree minimises the sum of squares about the mean for data at a node. However, one of the regression trees uses a robust estimate of the mean, whereas the other uses the sample average.
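
The sketch below shows the split criterion on a single variable: the threshold is chosen to minimise the combined sum of squares about the mean in the two child nodes (using the plain sample average; illustrative Python only).

    # Find the best binary split point for a regression tree node.
    import numpy as np

    def best_split(x, y):
        """Return (threshold, total_ss) of the best split on one feature."""
        order = np.argsort(x)
        x, y = x[order], y[order]
        best = (None, np.inf)
        for i in range(1, len(x)):
            left, right = y[:i], y[i:]
            ss = (((left - left.mean()) ** 2).sum()
                  + ((right - right.mean()) ** 2).sum())
            if ss < best[1]:
                best = ((x[i - 1] + x[i]) / 2.0, ss)
        return best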

Linear Regression. Linear regression models can be used to predict an outcome y from a number of independent variables. The predictive model is a linear combination of independent variables and a constant term. NAG DMC can automatically select a good subset of independent variables to use in a model by using stepwise selection.
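
A minimal sketch of the fit itself, by ordinary least squares (stepwise selection is not shown; illustrative Python, not the NAG DMC interface):

    # Fit y as a linear combination of the variables plus a constant term.
    import numpy as np

    def fit_linear(x, y):
        design = np.column_stack([x, np.ones(len(x))])   # append constant term
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        return coef[:-1], coef[-1]                       # weights, intercept

    x = np.array([[1.0], [2.0], [3.0], [4.0]])
    y = np.array([3.1, 5.0, 6.9, 9.1])
    print(fit_linear(x, y))   # approximately ([2.0], 1.0), i.e. y = 2x + 1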

Multi-layer Perceptron Neural Networks. Multi-layer perceptrons (MLPs) are flexible non-linear models that may be represented by a directed graph. The process of optimising values of the free parameters in an MLP is known as training. Training involves minimising the sum of squared errors between MLP predictions and training data values. NAG DMC uses a conjugate gradients optimiser to train MLPs.
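
The sketch below trains a one-hidden-layer MLP on the squared-error objective; for brevity it substitutes plain gradient descent for NAG DMC's conjugate gradients optimiser, so it is illustrative only.

    # One-hidden-layer MLP trained by minimising the sum of squared errors.
    import numpy as np

    def train_mlp(x, y, n_hidden=8, lr=0.05, n_iter=2000, seed=0):
        rng = np.random.default_rng(seed)
        w1 = rng.normal(scale=0.5, size=(x.shape[1], n_hidden))
        b1 = np.zeros(n_hidden)
        w2 = rng.normal(scale=0.5, size=n_hidden)
        b2 = 0.0
        for _ in range(n_iter):
            h = np.tanh(x @ w1 + b1)                     # hidden activations
            err = (h @ w2 + b2) - y                      # prediction error
            grad_h = np.outer(err, w2) * (1 - h ** 2)    # backpropagated error
            w2 -= lr * h.T @ err / len(y)
            b2 -= lr * err.mean()
            w1 -= lr * x.T @ grad_h / len(y)
            b1 -= lr * grad_h.mean(axis=0)
        return w1, b1, w2, b2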

Nearest Neighbours. k-nearest neighbour models predict values based on values of the k most similar data records in a training set of data. The measure of similarity is taken to be one of two distance functions. Training data are stored in a binary tree to enable efficient searching for nearest neighbours.
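
For regression the majority vote is replaced by an average; a brute-force sketch (again omitting the binary-tree search; illustrative Python only):

    # k-nearest-neighbour regression: predict the mean response of the
    # k most similar training records.
    import numpy as np

    def knn_regress(train_x, train_y, query, k=3):
        dists = np.linalg.norm(train_x - query, axis=1)
        nearest = np.argsort(dists)[:k]
        return train_y[nearest].mean()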

Radial Basis Function Models. A radial basis function (RBF) computes a scalar function of the Euclidean distance from its centre location to data records. A linear combination of RBF outputs defines an RBF model. The advantages of RBF models are that the centres can be positioned to reflect domain knowledge and that the optimisation is fast and accurate.
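
A minimal sketch with Gaussian basis functions and weights found by a linear least-squares solve, which is what makes fitting fast (illustrative Python; the Gaussian form and the width parameter here are arbitrary assumptions, not NAG DMC defaults):

    # RBF model: a linear combination of Gaussian functions of the
    # Euclidean distance to fixed centres.
    import numpy as np

    def rbf_design(x, centres, width=1.0):
        d = np.linalg.norm(x[:, None, :] - centres[None, :, :], axis=2)
        return np.exp(-(d / width) ** 2)      # one column per centre

    def fit_rbf(x, y, centres, width=1.0):
        weights, *_ = np.linalg.lstsq(rbf_design(x, centres, width), y,
                                      rcond=None)
        return weights

    def predict_rbf(x, centres, weights, width=1.0):
        return rbf_design(x, centres, width) @ weights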

Association Rules

The goal of association analysis is to determine relationships between nominal data values. These models are typically used for market basket analysis.
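
The two basic measures behind such rules are support (how often an item set occurs) and confidence (how often the rule holds when its antecedent occurs). A minimal sketch (illustrative Python, not the NAG DMC interface):

    # Support and confidence over market-basket transactions.
    def support(transactions, items):
        items = set(items)
        return sum(items <= set(t) for t in transactions) / len(transactions)

    def confidence(transactions, antecedent, consequent):
        return (support(transactions, set(antecedent) | set(consequent))
                / support(transactions, antecedent))

    baskets = [["bread", "milk"], ["bread", "butter"],
               ["bread", "milk", "butter"], ["milk"]]
    print(support(baskets, ["bread", "milk"]))        # 0.5
    print(confidence(baskets, ["bread"], ["milk"]))   # about 0.67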

Utility Functions

The utility functions are designed to support the main functions described above, and to help with prototyping. Utility functions are included for:

  • Random number generation
  • Rank ordering
  • Sorting
  • †Mean and sum of squares updates (see the sketch after this list)
  • Two-way classification comparison
  • Saving and loading models
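
The mean and sum of squares update is the kind of one-pass computation that makes out-of-core processing possible: each new value refines the running statistics without revisiting earlier data. A minimal sketch (Welford's update, in illustrative Python):

    # One-pass update of the count, mean and sum of squares about the mean.
    def update(count, mean, m2, new_value):
        count += 1
        delta = new_value - mean
        mean += delta / count
        m2 += delta * (new_value - mean)   # sum of squared deviations
        return count, mean, m2

    count, mean, m2 = 0, 0.0, 0.0
    for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
        count, mean, m2 = update(count, mean, m2, value)
    print(mean, m2 / count)   # 5.0 and 4.0 (mean, population variance)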

† indicates an optional out-of-core memory optimisation
* indicates a memory-efficient implementation of the method

© The Numerical Algorithms Group 2008

Visit NAG on the web at:

www.nag.co.uk (Europe and ROW)
www.nag.com (North America)
www.nag-j.co.jp (Japan)
