
Data Mining Glossary of Terms

Analytical model - a data structure and data process for analyzing a dataset.

Example: a decision tree is a model for the classification of a dataset.

Anomalous data - data that result from errors or that represent unusual events. Anomalous data should be examined carefully because they may carry important information.

Example: data entry keying errors.

Artificial neural networks - non-linear predictive models that learn through training and resemble biological neural networks in structure.
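A single artificial neuron can be sketched as a weighted sum of inputs passed through a non-linear activation (the weights and inputs below are hypothetical; in practice the weights are learned through training):

```python
import math

# A minimal sketch of one artificial neuron: a weighted sum of inputs
# passed through a sigmoid (non-linear) activation. The weights here
# are hypothetical; real networks learn them during training.
def neuron(inputs, weights, bias):
    total = sum(i * w for i, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-total))   # sigmoid activation

# With these made-up values the weighted sum is 0, so the output is 0.5.
output = neuron([1.0, 0.5], [0.4, 0.2], -0.5)
```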

CART (Classification and Regression Trees) - a decision tree technique used for classifying a dataset. It provides a set of rules that can be applied to a new, unclassified dataset to predict which records will have a given outcome, and segments a dataset by creating two-way splits. This technique requires less data preparation than a CHAID decision tree.
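The two-way split that CART searches for can be sketched with a toy Gini-impurity split on a single numeric feature (the data and field names are made up for illustration; production CART implementations handle many features, pruning, and regression targets):

```python
# A minimal sketch of a CART-style two-way split: try each threshold on
# one numeric feature and keep the one with the lowest weighted Gini
# impurity. Data below are hypothetical.
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(values, labels):
    """Return (threshold, impurity) of the best two-way split."""
    best = (None, float("inf"))
    for t in sorted(set(values)):
        left = [l for v, l in zip(values, labels) if v <= t]
        right = [l for v, l in zip(values, labels) if v > t]
        if not left or not right:
            continue
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best

# Hypothetical data: income (in $1000s) vs. creditworthiness.
income = [20, 25, 40, 60, 80, 95]
credit = ["Bad", "Bad", "Bad", "Good", "Good", "Good"]
threshold, impurity = best_split(income, credit)
# Splitting at income <= 40 separates the classes perfectly (impurity 0).
```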

CHAID (Chi Square Automatic Interaction Detection) - a decision tree technique used for classifying a dataset. It provides a set of rules that can be applied to a new, unclassified dataset to predict which records will have a given outcome, and segments a dataset by using chi-square tests to create multi-way splits. This technique requires more data preparation than a CART decision tree.

Classification - the process of dividing a dataset into mutually exclusive groups. Members of each group are as "close" as possible to one another, and different groups are as "far" as possible from one another. Distance is measured with respect to the specific variable(s) you are attempting to predict.

Example: a typical classification problem is to divide a database of companies into groups that are as homogeneous as possible with respect to a creditworthiness variable, with values "Good" and "Bad".

Clustering - the process of dividing a dataset into mutually exclusive groups. Members of each group are as "close" as possible to one another, and different groups are as "far" as possible from one another. Distance is measured with respect to all available variables.
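The grouping idea can be sketched with a toy k-means pass over a single variable (illustrative only; real clustering measures distance over all available variables, and the data below are made up):

```python
# A minimal sketch of clustering via k-means on one variable:
# assign each point to its nearest center, then move each center
# to the mean of its assigned points, and repeat.
def kmeans_1d(points, centers, iterations=10):
    for _ in range(iterations):
        groups = {c: [] for c in centers}
        for p in points:
            nearest = min(centers, key=lambda c: abs(p - c))
            groups[nearest].append(p)
        centers = [sum(g) / len(g) for g in groups.values() if g]
    return sorted(centers)

# Two obvious clusters in this hypothetical data: around 1.5 and 11.
data = [1.0, 1.5, 2.0, 10.0, 11.0, 12.0]
centers = kmeans_1d(data, [1.0, 10.0])
```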

Data cleansing - the process of ensuring that all values in the dataset are consistent and have been correctly recorded.

Data mining - the extraction of hidden predictive information from large databases.

Data navigation - the process of viewing different dimensions, different slices, and different levels of detail of a multidimensional database. Also see OLAP.

Data visualization - the visual interpretation of complex relationships in multidimensional data.

Data warehouse - a system for storing and delivering massive quantities of data.

Decision tree - a tree-shaped data structure that represents a set of decisions. These decisions then generate rules for the classification of a dataset. Also see CART and CHAID.

Dimension - in a flat or relational database, each field in a record represents a dimension. In a multidimensional database, a dimension is a set of similar entities.

Example: a multidimensional sales database may contain the following dimensions: Product, Time, and City.

Exploratory data analysis - the use of graphical and descriptive statistical techniques to learn about the structure of a dataset.

Genetic algorithms - optimization techniques based on the concepts of natural evolution, using processes such as genetic combination, mutation, and natural selection.

Linear model - an analytical model that assumes linear relationships in the coefficients of the variables being studied.

Linear regression - a statistical technique used to find the best-fitting linear relationship between the target (the dependent variable) and its predictors (the independent variables).
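For a single predictor, the best-fitting line can be computed in closed form by least squares; a minimal sketch with made-up data:

```python
# A minimal sketch of simple linear regression (one predictor) by
# least squares: slope = covariance / variance of the predictor.
def fit_line(xs, ys):
    """Return (slope, intercept) of the best-fitting line."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]   # made-up data lying exactly on y = 2x + 1
slope, intercept = fit_line(xs, ys)
```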

Logistic regression - a regression technique, based on the linear model, that predicts the probability that a categorical target (the dependent variable) takes a given value.

Example: predicting the probability that a customer in a statistical population is of a given type.
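A fitted logistic model maps a linear combination of the predictors through the logistic (sigmoid) function to a probability between 0 and 1; a minimal sketch with hypothetical, pre-fitted coefficients:

```python
import math

# A minimal sketch of applying a logistic regression model. The
# intercept and slope below are hypothetical, pre-fitted values;
# in practice they are estimated from data.
def predict_proba(x, intercept=-4.0, slope=0.1):
    """Probability that the categorical target equals 1
    (e.g. that a customer is of a given type), given predictor x."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * x)))

# At x = 40 the linear score is 0, so the predicted probability is 0.5.
p = predict_proba(40)
```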

Multidimensional database - a database designed for online analytical processing, structured as a multidimensional hypercube with one axis per dimension.

Multiprocessor computer - a computer that consists of multiple processors. A multiprocessor computer can also be created by networking multiple single-processor workstations or PCs. Also see Parallel processing.

Nearest neighbor - a classification technique that classifies each record in a dataset based on the classes of the k record(s) most similar to it in a historical dataset. Sometimes called the k-nearest neighbor technique.
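The technique can be sketched on a single numeric feature with a made-up historical dataset (real implementations measure distance over many features):

```python
from collections import Counter

# A minimal sketch of k-nearest-neighbor classification: find the k
# historical records most similar to the new one and take a majority
# vote over their classes. Data below are hypothetical.
def knn_classify(new_value, history, k=3):
    """history is a list of (value, class) pairs."""
    nearest = sorted(history, key=lambda vc: abs(vc[0] - new_value))[:k]
    votes = Counter(cls for _, cls in nearest)
    return votes.most_common(1)[0][0]

history = [(20, "Bad"), (25, "Bad"), (30, "Bad"),
           (70, "Good"), (80, "Good"), (90, "Good")]
# The three records nearest to 28 are all "Bad".
label = knn_classify(28, history)
```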

Noise - a synonym for outliers. Also see Outliers.

Non-linear model - an analytical model that does not assume linear relationships in the coefficients of variables being studied.

OLAP - Online analytical processing. Refers to array-oriented database applications that allow users to view, navigate through, manipulate, and analyze multidimensional databases.

Outlier - a data item whose value falls outside the bounds enclosing most of the other corresponding values in the sample. Outliers may indicate anomalous data and should be examined carefully; they may carry important information.
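One common rule of thumb flags values that lie more than a chosen number of standard deviations from the mean; a minimal sketch with a hypothetical keying error:

```python
import statistics

# A minimal sketch of outlier detection: flag values more than
# `threshold` standard deviations from the mean. A loose threshold
# is used here because a single extreme value also inflates the
# standard deviation itself.
def find_outliers(values, threshold=2.0):
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    return [v for v in values if abs(v - mean) > threshold * stdev]

# Hypothetical data: the 200 represents a data entry keying error.
data = [10, 11, 10, 12, 11, 10, 11, 200]
outliers = find_outliers(data)
```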

Parallel processing - the coordinated use of multiple processors to perform computational tasks. Parallel processing can occur on a multiprocessor computer or on a network of workstations or PCs.

Predictive model - a structure and process for predicting the values of specified variables in a dataset.

Prospective data analysis - data analysis that predicts future trends, behaviors, or events based on past historical data.

RAID - Redundant Array of Inexpensive Disks. A technology for the efficient parallel storage of data from high-performance computer systems.

Retrospective data analysis - data analysis that provides insight into trends, behaviors, or events that have already occurred.

Rule induction - the extraction of useful if-then rules from data, based on statistical significance.
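A candidate if-then rule can be scored against data by its support and confidence; a minimal sketch with made-up records (full rule induction also applies statistical significance tests and searches over many candidate rules):

```python
# A minimal sketch of evaluating one if-then rule against data.
# Support: fraction of records matching the "if" part.
# Confidence: of those, the fraction also matching the "then" part.
def rule_stats(records, if_field, if_value, then_field, then_value):
    matches = [r for r in records if r[if_field] == if_value]
    hits = [r for r in matches if r[then_field] == then_value]
    return len(matches) / len(records), len(hits) / len(matches)

# Hypothetical records for the rule: IF income = high THEN credit = Good.
records = [
    {"income": "high", "credit": "Good"},
    {"income": "high", "credit": "Good"},
    {"income": "high", "credit": "Bad"},
    {"income": "low",  "credit": "Bad"},
    {"income": "low",  "credit": "Bad"},
]
support, confidence = rule_stats(records, "income", "high", "credit", "Good")
# support = 3/5 of records have high income; confidence = 2/3 of those are Good.
```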

SMP - Symmetric Multi-processor. A computer in which memory is shared among the processors.

Terabyte - one trillion (10^12) bytes.

Time series analysis - the analysis of a sequence of measurements made at a specified time interval. Time is usually the dominant dimension of the data.
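A simple moving average is one of the most basic time series techniques; a minimal sketch with made-up measurements:

```python
# A minimal sketch of smoothing a time series with a moving average:
# average each run of `window` consecutive measurements.
def moving_average(series, window=3):
    return [sum(series[i:i + window]) / window
            for i in range(len(series) - window + 1)]

# Hypothetical monthly measurements.
sales = [10, 12, 14, 13, 15, 17]
smoothed = moving_average(sales)
```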