Christopher Paul's Introduction to Data Mining


An Introduction to Data Mining

Artificial Intelligence (IFSM427)

Last Update December 23, 2003



Executive Summary

        The purpose of this paper is to provide the reader with an introduction to and survey of data mining technology. It is by no means exhaustive, and the reader is spared the complex details of the many algorithms that operate behind the scenes of popular data mining software from vendors such as SPSS, SAS, and IBM.

        The paper explains what data mining is and what it is not, introduces the terminology used in data mining, and surveys the applications data mining currently supports as well as possible future applications.


What is Data Mining?

        Data mining is the process of automatically extracting hidden predictive information from large databases and interpreting previously unknown patterns in the data to solve various problems. A large database can be thought of as a mountain. Somewhere within the mountain, the database, it is known that there is gold, precious hidden resources, hidden information. Data mining is the process of unearthing those unknown patterns, the gold, from the database. Data mining software packages are powerful tools that help predict future trends and behaviors, converting detailed data into competitive knowledge that organizations can use to make proactive, knowledge-driven decisions. Data mining tools quickly answer questions that were traditionally too time consuming to resolve, when teams of statisticians and data analysts had to sift through massive amounts of data trying to establish patterns hidden within the database. These tools scour databases for hidden patterns, finding predictive information that may be missed by experts because the prediction lies outside the scope of the expert's expectations. A well-known example is the discovery that beer and diapers are often purchased together on a Saturday afternoon.

        Data mining techniques can be implemented on software and hardware platforms existing today. The techniques are utilized to enhance the value of existing information resources, and can easily integrate with new products and systems as they are brought on-line. When data mining tools are implemented on high performance client/server or parallel processing computer systems, they can analyze massive databases to deliver answers to questions such as, "Who is exhibiting suspicious behavior that may lead to a terrorist act within the United States?"

        Data mining is not, and should not be confused with, data warehousing, SQL, ad hoc queries, software agents, on-line analytical processing (OLAP), or data visualization. These are all technologies that may be used in conjunction with data mining, but they are not the same thing.


The Iterative Process of Discovering Knowledge

        Data mining is not a linear process but an iterative one, in which each cycle further refines the result and increases its relevance and predictive power. Data mining is a process in which discovery evolves. At the end of any step, it is possible to go back to a previous step to make amendments, and upper management may decide to scrap the project at any one of the phases.

There are five phases in data mining (Yoon):



Fig 1: Steps involved in Data Mining

Phase 1: Selecting the application domain and defining the problem

        Selecting an appropriate domain area is absolutely crucial in any data mining project. The developers must assess whether data mining is a viable approach to the problem assigned, and sufficient, quality data must exist and be available for the domain. It is vitally important that the specific objectives and goals are clearly understood and defined.

Phase 2: Selecting the target data

        The types of data to be used in producing the discovery are then selected. Once a target data set has been created for discovery, data mining can be performed on a subset of variables or data samples within the larger database.

Phase 3: Exploring and preprocessing the data

        After the target data has been acquired and selected, it is preprocessed. Preprocessing consists of cleaning, scrubbing, and transforming the data to improve the effectiveness of the discovery. Developers must remove noise and outliers where deemed necessary, decide on strategies for dealing with missing data fields, and account for time sequence information or any known changes in the data. The data is often transformed to reduce the number of variables being considered, either by converting one data type to another (for example, converting categorical data to numerical data) or by deriving new attributes through mathematical or logical operators. As an example, a categorical gender field can be recoded so that a value of 1 is assigned if the subject is male and 0 if the subject is female, replacing a text attribute with a single numeric one.
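
        As a minimal sketch of this kind of transformation (the field names, records, and coding scheme below are illustrative assumptions, not taken from any particular system), the recoding and missing-value handling described above might look like this in Python:

# Minimal preprocessing sketch: recode a categorical field to a numeric flag.
# All field names and values are illustrative assumptions.
raw_records = [
    {"customer_id": 1, "gender": "male", "age": 34},
    {"customer_id": 2, "gender": "female", "age": 29},
    {"customer_id": 3, "gender": None, "age": 41},   # record with a missing value
]

def preprocess(record):
    cleaned = dict(record)
    # Convert categorical data to numerical data: male -> 1, female -> 0.
    if record["gender"] == "male":
        cleaned["gender"] = 1
    elif record["gender"] == "female":
        cleaned["gender"] = 0
    else:
        # Strategy for missing data decided during this phase: keep it as None.
        cleaned["gender"] = None
    return cleaned

prepared = [preprocess(r) for r in raw_records]
print(prepared)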

Phase 4: Extracting information/knowledge

        Extracting information consists of a series of activities aimed at discovering knowledge buried deep within the data.

Data mining tools sweep through databases and identify previously hidden patterns in one step. An example of pattern discovery is the analysis of retail sales data to identify seemingly unrelated products that are often purchased together. Other pattern discovery problems include detecting fraudulent credit card transactions, such as unusually frequent gasoline purchases or large jewelry purchases, and identifying anomalous data that could represent data entry keying errors.

        Data mining techniques yield benefits of automation on existing software and hardware platforms, and are easily implemented on new systems as existing platforms are upgraded and new products are developed. When data mining tools are implemented on high performance parallel processing systems, they can analyze massive databases in minutes. Faster processing means users can automatically experiment with more models to understand complex data, and high-speed processors make it possible for end users to analyze huge quantities of data. Databases can be larger in both depth and breadth due to the decreasing costs of hard drives. Larger database operations, in turn, yield improved predictions.

        Data mining operations are segmented into classification, regression, segmentation, link analysis, and deviation detection. Once an operation has been chosen for the particular application, an appropriate mining technique is selected. A classification model is often developed by supervised neural net and induction techniques. Link analysis is supported by association discovery and sequence discovery techniques. Clustering techniques are used to segment the database, and deviation detection is done by statistical as well as visualization techniques.

        Once a data mining technique is chosen, the next step is to select a particular algorithm within that technique. Choosing a data mining algorithm includes selecting a method to search for patterns within the data, deciding which parameters may be appropriate, and matching the algorithm with the overall objective of the data mining project. Once the appropriate algorithm has been selected, the data is finally mined using the algorithm to search for and extract hidden patterns. Statisticians or analytical modelers, who have the expertise to extract meaningful patterns from the data using advanced statistical techniques, typically perform this procedure.

        This results in the development and the validation of a model that represents knowledge, a set of rules, or mathematical formulas that describe a pattern or characteristics within the patterns. The model is updated and modified as trends or behaviors change over time. Data mining techniques automate the process of creating predictive and descriptive models that assist in making strategic and tactical decisions. Questions that traditionally required extensive hands-on analysis, by either statisticians or data analysts, can now be answered directly from the data - quickly.

        Predictive models are created from situations where the outcome is known, then the models are applied to other situations where the outcome is unknown. Predictive techniques build models based on historical data with known outcomes, such as prior buying patterns. The algorithm analyzes the values of all input variables and identifies which variables are significant predictors for a desired outcome, such as repeat purchases or purchasing due to targeted advertising campaigns.

        An example of a predictive modeling problem is targeted marketing. Data mining uses data on past promotional mailings to identify the targets that are most likely to maximize the return on investment in future mailings. Other predictive problems include forecasting bankruptcy and other forms of default, and identifying segments of the general population that are likely to respond similarly.

        Another example of a predictive model is CenterParcs, an online European holiday company, which anticipates how its customers will behave by applying predictive analytics to forecast the outcomes of particular actions. At its most basic, predictive analysis takes attributes about a customer or transaction, applies a simple linear equation, and produces a score. This score could reflect the likelihood of customer attrition, the expected response to an offer, the expected profit, or the lifetime value of a customer.
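
        As a rough sketch of such a "simple linear equation" (the attribute weights, intercept, and customer fields below are invented for illustration and are not CenterParcs figures), a score might be computed like this:

# Sketch of a linear scoring equation; weights, intercept, and fields are illustrative assumptions.
weights = {"months_as_customer": -0.02, "bookings_last_year": 0.15, "complaints": -0.30}
intercept = 0.5

def attrition_score(customer):
    # Weighted sum of customer attributes plus an intercept gives a raw score;
    # a higher score is read as a higher predicted likelihood of attrition.
    return intercept + sum(weight * customer[name] for name, weight in weights.items())

customer = {"months_as_customer": 24, "bookings_last_year": 3, "complaints": 1}
print(attrition_score(customer))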

        To make these predictions more accurate, organizations can combine these simple equations with more complex algorithmic techniques such as neural networking or rule trees, which support more complex relationships between different sets of data. Analytic tools provided by software companies such as NCR, Sand Technology, SAS Institute, Data Distilleries, IBM's ProbE, and SPSS, allow organizations to run many thousands of these algorithms at any one time. Once the number crunching is complete, the system produces a series of models that can then be tested against a data sample, modified and put into use. As customer behavior changes, these models can then be adapted as often as required.

        One of the reasons predictive analysis has gained popularity more recently is a steep drop in the cost of hardware required to process the huge number of mathematical equations involved in deep statistical analysis. Only a few years ago these models would have required the use of a supercomputer. Now, an inexpensive cluster of Intel servers can do the job at hand.

        Predictive analysis produces so many more scenarios than business managers could think of on their own that the likelihood of identifying a useful one greatly increases. Applications such as Oracle and SAP simply identify patterns in behavior and do not interpret them. The greater the number of potential variables an organization can feed into the system, the greater the potential for accuracy. Predictive analysis works best for companies that can amass huge amounts of data on their customers, such as telecommunications, retail, or banking firms. When speaking of huge amounts of data, we are addressing systems that amass data in terabytes.

        Descriptive models, on the other hand, are used to describe a particular pattern where the outcome is completely unknown. A commonly used descriptive technique is clustering, in which the data is grouped into subsets based on common attributes such as age group, gender, or education level. Descriptive models can be used to determine customer segments and their profitability to the corporation.

Phase 5: Interpretation and evaluation

        Interpreting and evaluating discovered patterns involves filtering the information to be presented by removing redundant or irrelevant patterns, visualizing graphically or logically the useful patterns, and then translating them into terms that may be easily understood by the end user. The resulting interpretation of the data determines and resolves potential conflicts with previously discovered knowledge. This evaluation may also lead to the decision to redo any one or all of the previous steps. The extracted knowledge is also evaluated for its usefulness to a decision-maker and to the subsequent objective of the project. The knowledge is then used to support human decision making such as predicting trends or explaining observed phenomena.


Data Mining Operations

        The primary goal of data mining is to unearth information and knowledge that may be used for making decisions. Data mining builds a real-world model from the data collected from various sources such as customer histories, transactions, and credit bureau information. The model is designed to discover patterns and relationships within the data set. Data mining models provide useful knowledge that can assist end users in making predictions and understanding latent patterns to determine appropriate actions to be taken.

        There are primarily five types of data mining operations: classification, regression, link analysis, segmentation, and deviation detection. Classification and regression are useful for predicting trends, whereas link analysis, segmentation, and deviation detection describe the patterns in the data. Most data mining applications typically require a combination of two or more of these operations.

Classification

        The goal of classification is to develop a model that maps a data item into one of several predefined classes. Once created, the model is used to classify a new instance into one of the classes. One example would be the classification of bankruptcy patterns based on the financial ratios of the firm. Another example would be customer buying patterns based on demographic information to assist a firm in the targeting of effective advertising toward the appropriate customer base.

Regression

        Regression builds a model that maps data items onto a real-valued prediction variable using existing databases. Such models have traditionally been created using statistical methods such as linear and logistic regression. Both classification and regression are used for predicting trends; the distinction between the two is that the output variable of classification is categorical, whereas that of regression is numeric and continuous. Examples of regression include predicting the change in a rate between the yen and the government bond market, or predicting the crime rate of a city from input variables such as population, average income level, and education.
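
        As a minimal regression sketch (the city-level figures below are invented purely for illustration), a continuous output such as a crime rate can be fit from numeric input variables with ordinary least squares:

import numpy as np

# Illustrative historical records: [population, average income] -> crime rate per 1,000 residents.
X = np.array([[120_000, 41_000],
              [250_000, 38_000],
              [560_000, 52_000],
              [890_000, 47_000]], dtype=float)
y = np.array([3.1, 5.4, 4.2, 6.8])

# Ordinary least squares: add an intercept column and solve for the coefficients.
X1 = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(X1, y, rcond=None)

# The prediction for a new city is numeric and continuous, unlike a class label.
new_city = np.array([1.0, 400_000, 45_000])
print(float(new_city @ coef))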

Link Analysis

        Link analysis establishes relevant connections between database records. It is typically applied in market-basket analysis, where point-of-sale transaction data is analyzed to identify product affinities. The use of laser scanners and UPC codes has enabled retailers to analyze transactions, making link analysis possible. A retail store is primarily concerned with which items sell together - for example, the purchase of a new computer usually leads to the purchase of a mouse pad and a surge protector - so it can determine which items to display together for effective marketing. Another application could find relationships among procedures by analyzing claim forms submitted to an insurance firm. Link analysis is most often applied in conjunction with database segmentation.

Segmentation

        The goal of segmentation is to identify clusters of records that exhibit similar behaviors or characteristics hidden within the data. Clusters may be mutually exclusive and exhaustive or they may consist of a richer representation such as hierarchical or overlapping categories. Sets of data records are partitioned into segments so separate predictive models can be developed from each segment. This type of modeling is popular among data analysts and applied statisticians. Segmentation is approached as a sequential process in which the data is first segmented, and then the predictive models are developed for those segments. The disadvantage of the sequential approach is that it ignores the strong influence that segmentation exerts on the predictive accuracies of the models within each segment. Good segmentations tend to be obtained only through trial and error by varying the segmentation criteria. This used to be a long and tedious process.

        Examples of segmentation include discovering homogeneous groups of consumers in marketing databases and segmenting the records that describe sales during "Mother's Day," "Father's Day," or "Graduation Day." Once the database has been segmented, link analysis can be performed on each segment to identify associations among the records within each cluster.
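
        A minimal clustering sketch along these lines (two invented numeric attributes and two segments; the records and segment count are assumptions for illustration) might use k-means:

import numpy as np
from sklearn.cluster import KMeans

# Illustrative customer records: [age, annual spend in dollars].
customers = np.array([[23, 300], [25, 350], [27, 280],
                      [52, 2200], [55, 2500], [58, 2100]], dtype=float)

# Partition the records into two clusters of similar behavior.
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
for record, segment in zip(customers, model.labels_):
    print(record, "-> segment", segment)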

Detecting Deviations

        Deviation detection focuses on the discovery of deviations. There are various types: unusual patterns that do not fit into previously measured or normative classes; significant changes in the data from one time period to the next; outlying points in a dataset, known as outliers, which are records that do not belong to any particular cluster; and discrepancies between an observation and a reference. Deviation detection is usually performed after the database has been segmented, to determine whether the deviations represent noise in the data or genuine anomalies. Deviation detection is often the source of true discovery, since deviations represent departures from known expectations and normality.


Who is Developing Data Mining?

        Researchers, who primarily work in the fields of computer science and statistics, have been responsible for the development of most data mining technology currently available. From a business standpoint this was a problem, since academic researchers are excellent at developing and evaluating data mining technologies but tend to get caught up in the minute details of the technology. Scholars are not concerned with the fact that the core technology is only a small part of providing a business solution, and they miss the point that compromises must be made in order to deliver a piece of software that is usable by the casual computer user.

        Another group of data mining researchers are known as, "downsized data miners." These people primarily have research backgrounds and have worked on data mining research projects until cutbacks and company downsizing forced them into product development. When downsized data miners develop software, the end product is usually a complex tool, as opposed to a problem solving application. These complex data mining tools compete with other high-end analysis tools, e.g., SAS or S-Plus, but require the users to have sophisticated skills and statistical training. In the end, very few of these researchers will directly impact the development of database marketing as a business solution.

        The other group of people trying to create database marketing software applications for business users are the software developers. Unlike data mining tools, these applications do not require end users to know how to set up statistical experiments or build data models. The developers of database marketing applications start with the business problem and try to determine whether some piece of data mining technology might be useful in resolving it. The data mining technology inside such an application is just one small part of the overall product, and it is designed using techniques developed by researchers.


The Foundations of Data Mining

        Data mining techniques are the result of a long process of research and product development. This evolution began when business data was first stored on computers, continued with improvements in data access, and more recently generated technologies that allow users to navigate through their data in real time. Data mining takes this evolutionary process beyond retrospective data access and navigation to prospective and proactive information delivery. Data mining is finally practical for the business community because it is supported by three technologies that are now sufficiently mature: massive data collection, powerful multiprocessor computers, and data mining algorithms.

Commercial databases are growing at unprecedented rates. The accompanying need for improved computational engines can now be met in a cost effective manner with parallel multiprocessor technology. Data mining algorithms embody techniques that have existed for at least thirteen years, but have only recently been implemented as mature, reliable, understandable tools that consistently outperform older statistical methods.

        In the evolution from business data to business information, each new step has built upon the previous one. For example, dynamic data access is critical for drill-through in data navigation applications, and the ability to store large databases is critical to data mining. From the end user's point of view, the four steps listed in Table 1 were revolutionary because they allowed new business questions to be answered accurately and quickly.

 

Data Collection (1960s)
    Business question: "What was my total revenue in the last five years?"
    Enabling technologies: computers, tapes, disks
    Product providers: IBM, CDC
    Characteristics: retrospective, static data delivery

Data Access (1980s)
    Business question: "What were unit sales in New England last March?"
    Enabling technologies: relational databases (RDBMS), Structured Query Language (SQL), ODBC
    Product providers: Oracle, Sybase, Informix, IBM, Microsoft
    Characteristics: retrospective, dynamic data delivery at record level

Data Warehousing & Decision Support (1990s)
    Business question: "What were unit sales in New England last March? Drill down to Boston."
    Enabling technologies: on-line analytic processing (OLAP), multidimensional databases, data warehouses
    Product providers: Pilot, Comshare, Arbor, Cognos, Microstrategy
    Characteristics: retrospective, dynamic data delivery at multiple levels

Data Mining (Emerging Today)
    Business question: "What's likely to happen to Boston unit sales next month? Why?"
    Enabling technologies: advanced algorithms, multiprocessor computers, massive databases
    Product providers: Pilot, Lockheed, IBM, SGI, numerous startups (nascent industry)
    Characteristics: prospective, proactive information delivery

Table 1. Steps in the Evolution of Data Mining.

http://ww.thearling.com. Sourced: 02/27/2003.

        The core components of data mining technology have been under development for decades, in research areas such as statistics, artificial intelligence, and machine learning. Today, the maturity of these techniques, coupled with high-performance relational database engines and broad data interrogation efforts, makes these technologies practical for current warehouse environments.


How Data Mining Works

        Data mining can tell users important things they did not know were in their data, and can make predictions based on existing data, through a technique known as modeling. Modeling is simply the act of building a mathematical model in one situation where the outcome is known and then applying it to another situation in which the outcome is unknown. Model building is something people have been doing for a long time, certainly before the advent of computers or data mining technology. What happens on computers, however, is not much different from the way people build models: the computer is loaded with large amounts of information about situations where the outcome is known, and the data mining software runs through that data and distills the characteristics that should go into the model. Once the model is built, it can be used in similar situations where the outcome is unknown.

        For example, the director of marketing for a telecommunications company would like to acquire some new long distance phone customers. The marketing director has access to a great deal of information about the existing customer base: age, sex, credit history, and long distance calling usage. The company also has a great deal of information about prospective customers: their age, sex, and credit history, to list a few fields. The problem is that the long distance calling usage of the prospects is unknown, since they are most likely customers of the competition. The marketing director wants to concentrate on those prospects who use long distance heavily. This may be accomplished by building a model. Table 2 illustrates the data used for building a model for new customer prospecting in a data warehouse.

 

 

                                                          Customers    Prospects
General information (e.g. demographic data)               Known        Known
Proprietary information (e.g. customer transactions)      Known        Target

Table 2: Data Mining for Prospecting

http://ww.thearling.com. Sourced: 02/27/2003.

 

        The goal in prospecting is to make some calculated guesses about the information in the lower right hand quadrant, based on the model that is built from customer general information to predict customer proprietary information. A simple model for a telecommunications company might be: 98% of customers who make more than $60,000 per year spend more than $80 per month on long distance phone calls.

        This model would then be applied to the prospect data to try to discover something about the proprietary information that the telecommunications company does not currently have access to. With this model, new customers can be selectively targeted.
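
        A minimal sketch of applying such a rule-style model to prospect records (the field names, income figures, and threshold are illustrative assumptions) could look like this:

# Prospect records with known general information; long distance usage is the unknown target.
prospects = [
    {"name": "A. Jones", "income": 72_000},
    {"name": "B. Smith", "income": 41_000},
    {"name": "C. Lee", "income": 65_000},
]

# Simple model learned from existing customers: high earners tend to be heavy
# long distance users, so they are flagged as targets for the campaign.
INCOME_THRESHOLD = 60_000
targets = [p for p in prospects if p["income"] > INCOME_THRESHOLD]
for p in targets:
    print("Target prospect:", p["name"])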

        Test marketing is an excellent source of data for this type of modeling. Mining the results of a test market representing a broad but relatively small sample of prospects can provide a foundation for identifying good prospects in the overall market. Table 3 shows another common scenario for building models: predicting future trends.

 

 

                                                          Yesterday    Today    Tomorrow
Static information and current plans
(e.g. demographic data, marketing plans)                  Known        Known    Known
Dynamic information (e.g. customer transactions)          Known        Known    Target

Table 3: Data Mining for Predictions

http://ww.thearling.com. Sourced: 02/27/2003.

 

        When using data mining, the best way to establish whether a model is good is to set aside some of the data in a "vault" to isolate it from the mining process. Once the mining is complete, the results can be tested against the data held in the vault to confirm the model's validity. If the model works, its observations should hold for the vaulted data.
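
        A minimal sketch of this "vault" idea (the records and the 80/20 split are illustrative assumptions) is shown below:

import random

# Illustrative labeled records: (input attributes, known outcome).
records = [({"income": 30_000 + 1_000 * i}, i % 2) for i in range(100)]

random.seed(0)
random.shuffle(records)

# Set aside 20% of the data in a "vault" that the mining process never sees.
split = int(0.8 * len(records))
mining_set, vault = records[:split], records[split:]

# The model is built on mining_set only. Afterwards its predictions are compared
# with the known outcomes in the vault; if they hold, the model is more likely
# to generalize to genuinely unknown cases.
print(len(mining_set), "records for mining,", len(vault), "records held in the vault")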


System Architecture for Data Mining

        To best utilize advanced data mining techniques, the data must be fully integrated with a data warehouse as well as a flexible business analysis tool. Many data mining tools currently operate outside of the warehouse, requiring extra steps for extracting, importing, and analyzing the data. Furthermore, when new insights require operational implementation, integration with the warehouse simplifies the application of results from data mining. The resulting analytic data warehouse can be applied to improve business processes throughout the organization, in areas such as promotional campaign management, fraud protection, new product rollout, and so on. Figure 2 illustrates an architecture for advanced analysis in a large data warehouse.

Figure 2: Integrated Data Mining Architecture

http://ww.thearling.com. Sourced: 02/27/2003.

 

        The ideal starting point is a data warehouse containing a combination of internal data, tracking all customer contact, coupled with external market data about competitor activity. Background information on potential customers provides an excellent basis for prospecting. The data warehouse can be implemented in a variety of relational database systems (Sybase, Oracle, IBM DB2, to name a few) and should be optimized for fast data access.

        An online analytical processing (OLAP) server enables a more sophisticated end-user business model to be applied when navigating the data warehouse. The multidimensional structures allow the user to analyze the data as they want to view the business, summarizing by product line, region, or other key perspectives. The data mining server must be integrated with the data warehouse and the OLAP server to embed return on investment (ROI)-focused business analysis directly into the infrastructure. A process-centric metadata template defines the data mining objectives for specific business issues such as campaign management, prospecting, and promotion optimization. Integration with the data warehouse enables operational decisions to be directly implemented and tracked. As the warehouse grows with new decisions and results, the organization can continually mine the best practices and apply them to future decisions.

        This design represents a fundamental shift from conventional decision support systems. Rather than simply delivering data to the end user through query and reporting software, the Advanced Analysis Server applies users' business models directly to the warehouse and returns a proactive analysis of the most relevant information. These results enhance the metadata in the OLAP server by providing a layer that represents a distilled view of the data. Reporting, visualization, and other analysis tools can then be applied to plan future actions and confirm the impact of those plans.


Data Mining Techniques

        There are a variety of data mining techniques available to support data mining operations. Induction techniques, artificial neural networks, and other example-based methods are used to develop classification models. Induction methods, artificial neural networks, and statistical techniques are used to create regression models. Link analysis is done by association discovery and sequence discovery techniques. Database segmentation is done by clustering techniques. Deviation detection is performed by statistical as well as visualization methods. Although visualization methods are useful in detecting deviations, they are also used in conjunction with other data mining techniques to augment their utility and functionality and to help the end user better understand the information those techniques extract.

Induction Technique

        This technique develops a classification model from a set of records - the training set of examples. The training set may be a sample database, a data mart, or an entire data warehouse. Each record in the training set belongs to one of many predefined classes, and an induction technique induces a general concept to develop a classification model. The induced model consists of patterns that distinguish each class. Once trained, the model can be used to predict the class of unclassified records automatically. Induction techniques represent a model in the form of either decision trees or decision rules. These representations are easier to understand, and their implementation is more efficient than that of neural networks or genetic algorithms.

Artificial Neural Networks

        Artificial neural networks are non-linear predictive models that learn through training and resemble biological neural networks in structure. Neural net methods can be used to develop classification, regression, association, and segment models. The artificial neural network technique represents its model in a form of nodes arranged in layers and weighted links between nodes.

Example-Based Methods

        Example-based methods include algorithms such as nearest-neighbor and case-based reasoning. The nearest-neighbor algorithm classifies each record in a dataset based on a combination of the classes of the k most similar records in a historical dataset, for some k ≥ 1. This is sometimes referred to as the k-nearest-neighbor technique.

Association Discovery

        Association rule induction is the extraction of useful if-then rules from data based on statistical significance. There are a variety of algorithms for identifying association rules, such as AIS and BAM. Bayesian nets can also be used to identify distinctions and causal relationships between variables.

Sequence Discovery

        Sequence discovery is similar to association discovery except that the collection of items occurs over a period of time; it is treated as an association in which the items are linked by time. When customer names are available, their purchases over time can be analyzed. For example, it could be found that if a customer buys a tie, he will buy men's shoes within one month 25% of the time. A dynamic programming approach, based on the dynamic recognition area, is available to identify such patterns in temporal databases.

Clustering

        Clustering techniques are employed to segment a database into clusters, each of which shares common and interesting properties. The purpose of segmenting a database is often to summarize its contents by considering the common characteristics shared within a cluster. Clusters are also created to support other types of data mining operations, e.g., link analysis within a cluster.

        The database can be segmented by traditional pattern recognition techniques; by unsupervised neural nets such as ART and Kohonen's Feature Map; by conceptual clustering techniques such as COBWEB and UNIMEM; and by Bayesian approaches such as AutoClass. Conceptual clustering algorithms consider all the attributes that characterize each record and identify the subset of attributes that will describe each created cluster to form concepts. The concepts can be represented as conjunctions of attributes and their values. Bayesian clustering algorithms automatically discover a clustering that is maximally probable with respect to the data. The various clustering algorithms can be characterized by the type of acceptable attribute values, such as continuous, discrete, or qualitative values, and by their methods of organizing the set of clusters, either hierarchically or as flat groupings.

Visualization

        Visualizations are particularly useful for detecting phenomena hidden in a relatively small subset of the data. The techniques are often used in conjunction with other data mining techniques to augment their utility. For example, features that are difficult to detect by scanning numbers may become obvious when a summary of the data is presented graphically. Visualization techniques can also guide end users when they do not know what to look for to discover the novelty they are seeking, and they facilitate end users' comprehension of information extracted by other data mining techniques. Specific visualization techniques include projection pursuit, which successively displays two-dimensional graphs showing the structure between different variable pairings, and parallel coordinates, which translate N-dimensional data onto a series of parallel vertical axes representing different variables. The latter enables users to analyze the hyper-geometric structure of multidimensional data by displaying the patterns and trends in multivariate data.

        Many of these technologies have been in use for more than a decade in specialized analysis tools that work with relatively small volumes of data. These capabilities are now evolving to integrate directly with industry-standard data warehouse and OLAP platforms.


Types of Data Mining Algorithms

Induction Algorithms:

        CART

        Over the years, researchers in machine learning have developed many induction algorithms, including ID3 and Version Space. Independent studies in statistics have developed similar induction algorithms such as CART and AID. Unlike other induction algorithms, CART can be used to develop a classification model as well as a regression model that predicts the continuous output value of a dependent variable.

        ID3

        Among the many inductive algorithms, ID3 has been widely used in many applications. Its objective is to develop a decision tree that requires a minimum number of attribute tests to classify a set of examples. To minimize the decision tree, ID3 chooses the attribute with the largest discriminating power and splits the examples into subsets according to that attribute. Each subset is then classified by the attribute with the largest discriminating power among the remaining attributes. The process is repeated until all examples are appropriately classified or until no other attribute is available for classification.
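
        The attribute-selection step at the heart of ID3 can be sketched with an entropy-based information gain calculation (the toy training set below is an assumption for illustration):

import math
from collections import Counter

def entropy(labels):
    # Shannon entropy of the class labels in a set of examples.
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(examples, attribute):
    # Reduction in entropy obtained by splitting the examples on one attribute.
    labels = [label for _, label in examples]
    before = entropy(labels)
    after = 0.0
    for value in {row[attribute] for row, _ in examples}:
        subset = [label for row, label in examples if row[attribute] == value]
        after += (len(subset) / len(examples)) * entropy(subset)
    return before - after

# Toy training set: attributes -> class label. Purely illustrative.
examples = [
    ({"income": "high", "student": "no"}, "no"),
    ({"income": "high", "student": "yes"}, "yes"),
    ({"income": "low", "student": "yes"}, "yes"),
    ({"income": "low", "student": "no"}, "no"),
]

# ID3 would split next on the attribute with the largest gain (here, "student").
for attribute in ("income", "student"):
    print(attribute, round(information_gain(examples, attribute), 3))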


Neural Net Algorithms:

        Supervised

        Supervised neural net algorithms such as Backpropagation and the Perceptron require predefined output values to develop a classification model. Among the many algorithms, Backpropagation is the most popular supervised neural net algorithm. It uses a gradient descent approach in which network connection weights are iteratively modified to reduce the difference between the output of the model and the expected output over all examples. Backpropagation can be used to develop not only a classification model but also a regression model. Other supervised neural net algorithms, such as BAM, are useful for detecting associations among input patterns.
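
        The weight-update idea behind this gradient descent approach can be sketched for a single sigmoid output unit (a drastic simplification of a real multi-layer network; the training data, learning rate, and epoch count are illustrative assumptions):

import math

# Training examples: (inputs, expected output). Purely illustrative.
data = [((0.0, 1.0), 1.0), ((1.0, 0.0), 0.0), ((1.0, 1.0), 1.0), ((0.0, 0.0), 0.0)]
weights, bias, rate = [0.1, -0.1], 0.0, 0.5

def predict(x):
    # Sigmoid unit: squash the weighted sum into (0, 1).
    s = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-s))

for epoch in range(1000):
    for x, expected in data:
        out = predict(x)
        # Gradient of the squared error for a sigmoid unit; the weights are
        # iteratively adjusted to reduce the difference between output and target.
        delta = (expected - out) * out * (1.0 - out)
        weights = [w + rate * delta * xi for w, xi in zip(weights, x)]
        bias += rate * delta

print([round(predict(x), 2) for x, _ in data])   # should approach [1, 0, 1, 0]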

        Unsupervised

        Unsupervised neural net algorithms such as ART and Kohonen's Feature Map do not require predefined output values for the input data; their training schemes segment the target data. Kohonen's Feature Map is a two-layered network that performs clustering through a competitive learning method called "winner takes all." In this method, the output node with the largest activation value for a given input pattern is declared the winner of the competition, and the weights of the links connected to this output node are modified to encode its association with the input pattern presented. Once trained, the weight vector of each output node encodes the information of a group of similar input patterns. Although the neural net technique has strong representational power, interpreting the information encapsulated in the weighted links can be very difficult.


Genetic Algorithms

        These are optimization techniques that use processes such as genetic combination, mutation, and natural selection in a design, based on the concepts of evolution.

Nearest-Neighbor Algorithm

        The nearest-neighbor algorithm uses a set of examples to approximate a classification model that decides in which class to place a new instance by measuring similarity - counting the number of matches for each instance. The technique identifies the instances with the highest degree of match, called nearest neighbors, and assigns the new instance to the class to which most of the identified neighbors belong. It requires a well-defined distance metric for evaluating the distance between instances. The number of neighbors to use for prediction is a parameter of the model to be estimated.
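
        A minimal k-nearest-neighbor sketch (Euclidean distance over two invented attributes, k = 3; all records are illustrative assumptions):

import math
from collections import Counter

# Historical records: ((age, income), class label). Purely illustrative values.
history = [((25, 30_000), "no"), ((30, 42_000), "no"), ((45, 80_000), "yes"),
           ((50, 95_000), "yes"), ((38, 60_000), "yes")]

def classify(new_instance, k=3):
    # Distance metric: Euclidean distance between attribute vectors.
    def dist(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    # Take the k most similar historical records (the nearest neighbors)
    # and assign the class held by the majority of them.
    neighbors = sorted(history, key=lambda rec: dist(rec[0], new_instance))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

print(classify((40, 70_000)))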

Case-Based Reasoning

        Case-based reasoning takes a related approach: instead of retaining only class labels, it stores entire solved cases and, when a new problem arrives, retrieves the most similar past cases and reuses or adapts their solutions. Like the nearest-neighbor algorithm, it depends on a well-defined similarity measure between instances.

Association Rule Algorithm

        Given a collection of items and a set of records containing some of these items, association discovery techniques discover rules that identify affinities among the items as reflected in the examined records. As an example, 65% of the records that contain item A also contain item B. An association rule uses two measures, called support and confidence, to represent the strength of an association. The percentage of occurrences, 65% in this case, is the confidence factor of the association. The algorithms find the affinity rules by sorting the data while counting occurrences to calculate confidence. The efficiency with which they can organize the records or transactions is one of the differentiators among association discovery algorithms. There are a variety of algorithms for identifying association rules, such as AIS and BAM. Bayesian nets can also be used to identify distinctions and causal relationships between variables.
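
        The support and confidence measures can be sketched directly from transaction records (the basket data below is an illustrative assumption, and this brute-force count is not how production algorithms such as AIS are implemented):

# Illustrative market-basket transactions.
transactions = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"beer", "bread"},
    {"diapers", "wipes"},
    {"beer", "diapers", "wipes"},
]

def rule_strength(antecedent, consequent):
    # Support: fraction of all records containing both item sets.
    both = sum(1 for t in transactions if antecedent <= t and consequent <= t)
    ante = sum(1 for t in transactions if antecedent <= t)
    support = both / len(transactions)
    # Confidence: of the records containing the antecedent, the fraction that
    # also contain the consequent (the "65%"-style figure in the text).
    confidence = both / ante if ante else 0.0
    return support, confidence

print(rule_strength({"beer"}, {"diapers"}))   # e.g. (0.6, 0.75) for these baskets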


Clustering Algorithms:


Segmentation Algorithms:

        Tree-based segmentation algorithms can be used in conjunction with a wide range of multivariate statistical models as leaf models. Beginning with a single root node, an overall tree-based model is generated by recursively applying a model expansion procedure to the leaf nodes of the current tree. This expansion step involves two distinct and complementary mechanisms. The first is a node split, comparable to those traditionally used to build trees, which involves a univariate binary split of an existing leaf node into two descendant leaves. The second is a leaf-model extension, which involves adding a single new feature to the multivariate statistical model in a leaf node of the current tree. Examples of such multivariate leaf models include linear regression models, naïve Bayes models, Classification and Regression Trees (CART), and Chi-Square Automatic Interaction Detection (CHAID).

  • Linear Regression Trees

            The combination of the tree-based segmentation algorithm with stepwise linear regression modeling at the leaves yields an overall algorithm that is referred to as a linear regression tree (LTR). The segment models in this case employ a Gaussian probability model.

  • Naïve Bayes Trees

            The combination of the tree-based segmentation algorithm described above with stepwise logistic regression modeling at the leaves yields an overall algorithm that is referred to as naïve Bayes trees (NBT). In contrast to LTR, the implementation of NBT with logistic regression segment models is more problematic, because the usual logistic regression algorithms require several data scans to fit even a single model, and there is no efficient way to use holdout data for feature selection.

  • Visualization

            Parallel coordinates translate N-dimensional data onto a series of parallel vertical axes representing different variables. This technique enables users to analyze the hyper-geometric structure of multidimensional data by displaying the patterns and trends in multivariate data.


    Data Analysis and Modeling

            The optimization process begins by modeling desired responses to marketing offers over a particular channel. Modeling is the process whereby data mining algorithms analyze the historical offer data and create mathematical functions, or models, that can be used to predict customer responses for a specific offer and channel combination. In this example all of the offers are sent out via a single channel, direct mail, but it is easy to extend the scenario so that each offer could be sent out via multiple channels.

            The simplest kind of modeling is to use the average historical response rate for an offer and assign it to all customers eligible for the offer. If a mortgage offer has traditionally received a 3.2% response rate, then on average any new prospect receiving the offer is assumed to respond at the same rate. This kind of modeling is very simple and can be used to get started with optimization. Once the end user has some experience, it makes sense to move on to more sophisticated data mining algorithms to improve the accuracy of the response predictions.

            The process of data mining can be broken down into sub-processes, each of which involves creating models for each of the different offers. At this point, the analysis for each offer is independent of the other offers. There might be some overlap in the customers you use to carry out the analyses, but the actual model building processes would be independent. For example, one customer might buy a house with a mortgage and then later on add a second mortgage.

            Once the offer models have been generated, each can be applied to new customer data in order to make predictions about those customers. The scores are simply the outputs of the models and might correspond to the predicted probability that a customer will purchase a specific mortgage product two months into the future. Since there are three different offers, there will be three different scores for each customer. This produces a matrix of scores, with one row for each customer and one column for each offer score. The final step of the process is the optimization of the scoring matrix, which selects which of the multiple offers will be made to each customer.


    Scoring

            Once the end user has generated the three models for the three different offers, it is time to apply them to new customer data to determine which offers should be used to target which people. Deciding which customers to score with a particular model may require some thought, since there is usually some eligibility criterion used to pre-select customers for consideration for a particular offer. As an example, the end user might score a customer with the re-finance model only if it is not known that the customer does not own a house. Similarly, the end user might score a customer with the "New Mortgage Model" even though that customer might have already signed a mortgage with another company, because the end user does not know that. The end user might also use a set of risk criteria to remove from the process those customers who are considered to have a high risk of non-payment.

            In the end, the end user will generate a matrix like the one shown in Table 4. For each customer, there will be three different scores. In some cases a customer's score entry is NULL because the customer was not eligible to receive that offer.

    Table 4: Customer Score Matrix

    Customer             New Mortgage Score    Re-finance Score    Second Mortgage Score
    Tom Adams            0.2422                0.4926              0.0872
    George Castle        0.8600                0.4465              0.0982
    Betty Hanson         NULL                  0.9700              0.4453
    Robert Marcus        0.7854                NULL                NULL
    Carol Smith          0.5063                NULL                NULL
    Beverly Thompson     0.8210                0.5014              0.6386
    Bill Samuels         NULL                  0.5057              0.9177
    Terry Jones          0.2226                0.1352              0.0888
    Chris Peters         0.2928                0.1732              0.5244

     

    http://ww.thearling.com. Sourced: 02/27/2003.


    Turning Scores into Value

            Once the scores are available, the next step is to convert each of the scores into profitability values. In the simplest approach, each offer has an economic value associated with it and each offer's value can be different. This value is the average over the potential customers and is usually determined by looking at characteristics of existing customers in the historical dataset. For example, the value to the company for a new home mortgage might be, on average, $6,000 per customer.

            Since the previously computed offer scores are the probabilities that a customer will respond to a particular offer, multiplying the economic values for each offer by the model scores will generate the expected average economic value for each customer/offer combination. For each customer, the offer with the highest expected economic return is highlighted.
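
            A minimal sketch of this conversion (using the $6,000 new-mortgage value from the text, illustrative values for the other two offers, and two of the customer scores from Table 4; NULL scores are skipped):

# Average economic value of each offer; the new mortgage figure comes from the text,
# the other two are illustrative assumptions.
offer_values = {"new_mortgage": 6_000, "re_finance": 2_000, "second_mortgage": 5_000}

# Response-probability scores for two customers (subset of Table 4; None = not eligible).
scores = {
    "Tom Adams":    {"new_mortgage": 0.2422, "re_finance": 0.4926, "second_mortgage": 0.0872},
    "Betty Hanson": {"new_mortgage": None,   "re_finance": 0.9700, "second_mortgage": 0.4453},
}

for customer, offer_scores in scores.items():
    # Expected value = response probability x average offer value, per customer/offer pair.
    values = {offer: score * offer_values[offer]
              for offer, score in offer_scores.items() if score is not None}
    best = max(values, key=values.get)
    print(customer, "-> highest expected value:", best, round(values[best], 2))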

            These financial characteristics of an offer can get complicated, with numerous facts about each customer used to evaluate the value of a particular offer/customer combination. For example, the end user might want to compute the net present value (NPV) of a customer, then evaluate the customer's risk of filing for bankruptcy and factor that into a risk-adjusted NPV, and then combine this with the expected value for the mortgage offering. In some cases, it might be possible to use data mining technology to model the value directly, skipping the response probability stage. Assuming that the models are reliable, this would increase the accuracy of the optimization process.


    Constraints

            Before the optimization process can go any further, any constraints that it must handle need to be specified. Constraints put limits on the marketing campaign based on external factors. As an example, there could be a budget limitation for the campaign, which limits the number of offers that can be made and means more of the inexpensive offers are made than would otherwise be the case. Some possible constraints include:

            Sometimes it is not possible to satisfy all of the external constraints that are specified. It might be that some constraints are contradictory, or simply that the characteristics of the customers do not allow the constraints to be met. As an example, one constraint might be that customers selected for a particular offer come from each of the 50 United States and that each state have at least 1,000 recipients of the offer. If there were only 500 potential customers from North Dakota in the pool, it would be impossible to select 1,000 to receive the offer, regardless of the specified criteria.

            In the event that all constraints cannot be met, the process would either "hard fail" or it would fail gracefully. In a "hard fail," the process aborts when it is determined that all constraints cannot be met. The user is then informed of the problem and they can reformulate the constraints and try again. In the case of graceful failure, the optimization process attempts to meet as many of the constraints as possible despite the fact that it is not possible to satisfy all of them. The constraints may be weighted so that the process takes into consideration their relative importance.


    Optimization

            Optimization operates on the constraints and the values that were generated by combining the scores with the financial information about each offer. The primary objective of optimization is to maximize the overall value of the marketing campaign without violating any of the user specified constraints.

            Extending the earlier example further, assume that a new home mortgage is worth $6,000 per customer, a second mortgage is worth $5,000, and a re-financed mortgage is worth $2,000. In addition, assume that each offer must be assigned to the same number of customers as the other offers, and that each customer can receive one and only one offer. Table 5 shows the value of each offer for each customer as well as the offer that the optimization selected for that customer.

    Table 5: Values of Each Offer for Each Customer

    Customer             New Mortgage Value    Re-finance Value    Second Mortgage Value
    Tom Adams            $1,452.94             $985.28             $435.95
    George Castle        $5,160.14             $893.04             $490.80
    Betty Hanson         NULL                  $1,939.97           $2,226.57
    Robert Marcus        $4,712.41             NULL                NULL
    Carol Smith          $3,037.68             NULL                NULL
    Beverly Thompson     $4,925.71             $1,002.75           $3,192.89
    Bill Samuels         NULL                  $1,011.48           $4,588.70
    Terry Jones          $1,335.85             $270.34             $443.90
    Chris Peters         $1,757.01             $346.34             $2,621.97

     

    http://ww.thearling.com. Sourced: 02/27/2003.

            Note that the most valuable offer is not always selected. In fact, for four customers (George, Betty, Beverly, and Terry) the most valuable offer was not selected because of the constraints.
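
            A minimal sketch of this kind of constrained selection (a greedy heuristic over the Table 5 values, with a capacity of three customers per offer and at most one offer per customer; a production optimizer as described in the text would search more thoroughly and resolve conflicts this pass may leave):

# Expected values from Table 5 (NULL entries omitted).
values = {
    "Tom Adams":        {"new": 1452.94, "refi": 985.28, "second": 435.95},
    "George Castle":    {"new": 5160.14, "refi": 893.04, "second": 490.80},
    "Betty Hanson":     {"refi": 1939.97, "second": 2226.57},
    "Robert Marcus":    {"new": 4712.41},
    "Carol Smith":      {"new": 3037.68},
    "Beverly Thompson": {"new": 4925.71, "refi": 1002.75, "second": 3192.89},
    "Bill Samuels":     {"refi": 1011.48, "second": 4588.70},
    "Terry Jones":      {"new": 1335.85, "refi": 270.34, "second": 443.90},
    "Chris Peters":     {"new": 1757.01, "refi": 346.34, "second": 2621.97},
}

capacity = {"new": 3, "refi": 3, "second": 3}   # same number of customers per offer
assignment = {}

# Greedy heuristic: visit the highest-value customer/offer pairs first and accept
# each one only if the customer is still unassigned and the offer has capacity left.
pairs = sorted(((value, customer, offer)
                for customer, offers in values.items()
                for offer, value in offers.items()), reverse=True)
for value, customer, offer in pairs:
    if customer not in assignment and capacity[offer] > 0:
        assignment[customer] = (offer, value)
        capacity[offer] -= 1

for customer, (offer, value) in assignment.items():
    print(f"{customer}: {offer} (${value:,.2f})")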


    Application Selection

            For successful implementation of data mining techniques, an organization should carefully select its application. There are a number of characteristics to consider when selecting an application suitable for data mining techniques - both technical and non-technical issues. Although it is unlikely that any application will meet all desirable characteristics, missing characteristics frequently indicate weak points of the application, which software developers should attempt to minimize, eliminate, or at the very least, monitor carefully for successful implementation.

    Non-Technical Criteria

            The success of a newly developed system is measured by the benefits it generates. A desirable application is a task whose solution has large potential benefits and payoffs. The benefits of data mining are measured by the novelty and quality of the discovered knowledge in terms of producing greater revenue, lower costs, higher quality, or savings in time. The benefits can be either tangible, such as profits or cost savings for the company, or intangible, such as improved product quality, increased customer loyalty, or an improved corporate image. The development of a data mining application also requires an investment of organizational resources, so the payoff of the discovered knowledge must be large enough to justify the costs of the project and generate a high return on investment. Another non-technical consideration is that no better alternative exists: the solution should not be easily obtainable by standard traditional approaches, so that upper management has a vested interest in pursuing the data mining venture. Organizational support for an application is another important consideration. Support from upper management is critical for success, especially in the initial phase of a data mining project, so it is essential to select an application in which top management has a strong vested interest. For the application itself, there should also be a cooperating domain expert available to define a proper measure for the domain and to evaluate the value of the discovered knowledge. Finally, selection criteria should include the potential for privacy and legal issues; some kinds of knowledge discovery may be inappropriate or actually illegal, i.e., discovering patterns that raise legal, ethical, or privacy concerns.

    Technical Criteria

            In addition to the non-technical criteria for a desirable data mining application, there are many important technical attributes to consider when evaluating the appropriateness of an application. First, for any data mining application, there must be a sufficient amount of historical data. Although the amount of data necessary for knowledge discovery varies significantly depending on the data mining operation pursued and the techniques adopted, the more attributes used, the more instances will be required. Second, there must be data attributes pertinent to the discovery task. A large amount of data does not necessarily guarantee the quality of the discovered patterns.

            Attributes must encapsulate the information required to discover the novel patterns. Third, the data available must be of high quality, having low noise levels and few data errors. Data mining with poor data is unreliable and often infeasible, if not impossible. Noise in the data makes it difficult to discover knowledge unless a large number of cases can alleviate the random noise. Fourth, although there are debates about the utility and pitfalls of using prior knowledge in data mining, it is often useful to have prior knowledge about the domain, such as previously observed patterns, the likely relationships between attributes, and the user's utility function. Some application problems are so large that the use of prior domain knowledge is necessary to limit the search space. However, the developer should note that the use of prior knowledge might limit the patterns that can be found.

    Return to the top of the page

    Applications

    Data Mining and Customer Relationship Management

            Customer relationship management (CRM) is a process that manages the interactions between a company and its customers. The primary users of CRM software applications are database marketers who are looking to automate the process of interacting with customers.

            To be successful, database marketers must first identify market segments containing customers or prospects with high-profit potential. The database marketers then build and execute campaigns that favorably impact the behavior of these individuals. The first task is identifying market segments. This requires significant amounts of data about prospective customers and their buying behaviors. In theory, the more data collected the better off the models will be. In practice, however, massive data stores often impede marketers, who struggle to sift through the data to find the nuggets of valuable information.

            Recently, marketers have added a new class of software to their targeting arsenal. Data mining applications automate the process of searching the mountains of data to find patterns that are good predictors of purchasing behaviors. After mining the data, marketers must feed the results into campaign management software that, as the name implies, manages the campaign directed at the defined market segment.
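            The kind of model such applications produce can be sketched in a few lines. The following example, which assumes the scikit-learn library is available, trains a simple response model on past campaign results and scores new prospects; the attributes, toy data, and threshold are assumptions for illustration, not any vendor's actual implementation.

# A minimal sketch, assuming scikit-learn is available, of the kind of model
# described: learning from past campaign responses and scoring prospects by
# their likelihood to purchase. Attributes and toy data are illustrative only.
from sklearn.linear_model import LogisticRegression

# Historical data: [age, annual_spend, visits_last_quarter] and whether the
# customer responded to a previous offer (1 = responded).
X_train = [
    [25, 200, 1], [40, 1500, 6], [31, 300, 2],
    [52, 2200, 8], [46, 900, 4], [29, 150, 1],
]
y_train = [0, 1, 0, 1, 1, 0]

model = LogisticRegression()
model.fit(X_train, y_train)

# Score new prospects; campaign management software would then target
# only those whose predicted response probability exceeds a threshold.
prospects = [[38, 1200, 5], [27, 180, 1]]
for features, prob in zip(prospects, model.predict_proba(prospects)[:, 1]):
    print(features, f"response probability: {prob:.2f}")

            The scores produced this way are what the campaign management software would consume when deciding which names go into a campaign.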

            In the past, the link between data mining and campaign management software was mostly manual. In worst case scenarios, it involved a "sneaker net," creating a physical file on tape or disk, which someone then carried to another computer and loaded into the marketing database.

            This separation of the data mining and campaign management software introduces considerable inefficiency and opens the door for human errors. Tightly integrating the two disciplines presents an opportunity for companies to gain a competitive advantage.

            An excellent example of a CRM solution in action comes from the online travel agency Travelocity. In order to give customers highly personalized treatment, Travelocity added a CRM solution offered by Teradata to its existing terabyte-sized data warehouse, which serves as Travelocity's root repository. The root repository is where Travelocity maintains data on its customers and partners.

            The key feature of the Teradata product in Travelocity's implementation has been its "event detective" feature, which gives the company a way to quickly grasp the travel habits and preferences of customers. The event detective matches the customer with what is currently available from the airlines and car rental companies. This can boil down to something as simple as automatically knowing that a customer is inclined to travel between two specific cities on certain days of the week or times of the year, and steering that customer to special fares when they are offered.
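            The following is a minimal sketch, not Teradata's actual implementation, of the kind of recurring-travel pattern the "event detective" is described as finding: a customer who repeatedly books the same city pair on the same weekday, which could then be matched against current fare offers. The booking records and the three-trip threshold are made up.

# A minimal sketch (not Teradata's implementation) of finding a customer who
# repeatedly books the same city pair on the same weekday. Data is made up.
from collections import Counter
from datetime import date

bookings = [  # (customer_id, origin, destination, travel_date)
    ("C1", "BWI", "ORD", date(2003, 1, 6)),
    ("C1", "BWI", "ORD", date(2003, 2, 3)),
    ("C1", "BWI", "ORD", date(2003, 3, 10)),
    ("C1", "BWI", "MIA", date(2003, 2, 14)),
]

def recurring_routes(records, min_trips=3):
    """Yield (customer, route, weekday) patterns seen at least min_trips times."""
    counts = Counter(
        (cust, f"{orig}-{dest}", d.strftime("%A")) for cust, orig, dest, d in records
    )
    for (cust, route, weekday), n in counts.items():
        if n >= min_trips:
            yield cust, route, weekday, n

for cust, route, weekday, n in recurring_routes(bookings):
    # A campaign system could now match this habit against current fare offers.
    print(f"{cust} flies {route} on {weekday}s ({n} trips); send matching fares")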

            Two keys to Travelocity's success were the selection of a specific business process that could be measured, and the use of a system that did not require entire back-end systems to be rebuilt along the way. Teradata's CRM implementation mainly called on data warehouse administrators to map a middle, or metadata, layer into their data warehouse-CRM equation. In addition, the CRM effort had significant senior management support, a capable IT staff, and engaged business managers, all of which are known to be critical to any e-commerce or e-government effort.

            Another interesting approach utilizing customer relationship management was taken by the retail chain giant Wal-Mart in 2001. Wal-Mart had reached a point of market saturation and was looking for new ways to increase revenue from its existing customers. It was able to do this by analyzing its data and tailoring its merchandise to local tastes. Wal-Mart's regional merchandising strategy identified items that sold extremely well in targeted regions. For example, in rural areas Wal-Mart wanted to sell food to hunters who frequently shopped in the sporting-goods department. Wal-Mart asked Hormel Foods to come up with a snack that it could place on the shelves alongside rifles and fishing gear. The result was "Spamouflage," Spam in camouflage cans, which became a hot seller in the 760 rural Wal-Mart stores. By tailoring merchandise to local tastes, Wal-Mart was able to increase revenue from its existing customers instead of going out and finding new ones.
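            Behind anecdotes such as "Spamouflage" is basket analysis: counting how often two item categories appear together in shoppers' transactions. The sketch below is a deliberately simple illustration of that idea, with made-up transactions and thresholds; production tools use far more sophisticated association-rule algorithms.

# A minimal sketch of basket analysis: count how often two item categories
# co-occur in the same transaction and report simple support and confidence
# figures. The transactions and the support threshold are made up.
from itertools import combinations
from collections import Counter

transactions = [
    {"rifle ammo", "canned meat", "soda"},
    {"fishing gear", "canned meat"},
    {"rifle ammo", "canned meat"},
    {"diapers", "beer"},
    {"rifle ammo", "soda"},
]

pair_counts = Counter()
item_counts = Counter()
for basket in transactions:
    item_counts.update(basket)
    pair_counts.update(frozenset(p) for p in combinations(sorted(basket), 2))

n = len(transactions)
for pair, count in pair_counts.most_common():
    a, b = sorted(pair)
    support = count / n                      # fraction of baskets containing both
    confidence = count / item_counts[a]      # P(b in basket | a in basket)
    if support >= 0.4:
        print(f"{a} & {b}: support={support:.2f}, confidence={confidence:.2f}")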

    Return to the top of the page

    How Data Mining Helps Database Marketing

    Data mining helps marketing users target marketing campaigns more accurately and align campaigns more closely with the needs, wants, and attitudes of customers and prospects.

            If the necessary information exists in the database, the data mining process can model virtually any customer activity. The key is to find patterns relevant to current business problems. Typical questions that data mining addresses include the following: Which customers are most likely to respond to a particular campaign? Which customers are at risk of leaving for a competitor? Which additional products is a given customer most likely to buy?

    Answers to these questions can help retain customers and increase campaign response rates, which, in turn, increase buying, cross-selling, and return on investment (ROI).

    Return to the top of the page

    The Role of Campaign Management Software

            Database marketing software enables companies to deliver timely, pertinent, and coordinated messages and value propositions (offers or gifts perceived as valuable) to customers and prospective customers.

            Today's campaign management software goes considerably further. It manages and monitors customer communications across multiple touch-points, such as direct-mail, telemarketing, customer service, point-of-sale, interactive web, the branch office, and so on.

            Campaign management automates and integrates the planning, execution, assessment, and refinement of possibly tens to hundreds of highly segmented campaigns that run monthly, weekly, daily, or intermittently. The software can also run campaigns with multiple "communication points," triggered by time or customer behavior such as the opening of a new account.

            Finally, by tracking responses and following rules for attributing customer behavior, the campaign management software can help measure the profitability and return on investment of all ongoing campaigns.


    Return to the top of the page

    Increasing Customer Lifetime Value

            Consider customers of a bank who use the institution only for a checking account. An analysis reveals that after depositing large annual income bonuses, some customers wait for their funds to clear before moving the money quickly into their stock-brokerage or mutual fund accounts outside the bank. This represents a loss of potential business for the bank.

            To persuade these customers to keep their money in the bank, marketing managers can use campaign management software to immediately identify large deposits and trigger a response. The system might automatically schedule a direct mail or telemarketing promotion as soon as the customer’s balance exceeds a predetermined amount. Based on the size of the deposit, the triggered promotion can then provide an appropriate incentive that encourages customers to invest their money in the bank’s other products and services.
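            The trigger logic described above can be sketched as a simple rule that fires on each deposit event; the balance threshold, offer tiers, and account data here are illustrative assumptions rather than any bank's actual rules.

# A minimal sketch of the event trigger described above: when a deposit pushes a
# checking balance past a threshold, schedule a promotion sized to the deposit.
# Account data, thresholds, and offer tiers are illustrative assumptions.

TRIGGER_BALANCE = 25_000

def on_deposit(customer_id, deposit, new_balance):
    """Return a campaign action for this deposit event, or None."""
    if new_balance < TRIGGER_BALANCE:
        return None
    if deposit >= 50_000:
        offer = "call from a personal investment advisor"
    elif deposit >= 10_000:
        offer = "telemarketing offer for the bank's mutual funds"
    else:
        offer = "direct-mail brochure for savings products"
    return {"customer": customer_id, "channel_offer": offer}

# Simulated deposit events as they might arrive from the transaction system.
events = [("C1001", 60_000, 72_500), ("C1002", 2_000, 4_500), ("C1003", 12_000, 30_000)]
for cust, amount, balance in events:
    action = on_deposit(cust, amount, balance)
    if action:
        print("schedule:", action)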


    Return to the top of the page

    Combining Data Mining and Campaign Management

            The closer data mining and campaign management work together, the better the business results. Today, campaign management software uses the scores generated by the data mining model to sharpen the focus of targeted customers or prospects, thereby increasing response rates and campaign effectiveness. Ideally, marketers who build campaigns should be able to apply any model logged in the campaign management system to a defined target segment.
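            As a hypothetical illustration of what applying a registered model to a defined segment might look like, the sketch below scores every customer in a segment and keeps only those above a cutoff for the campaign list; the scoring function stands in for whatever model the mining tool actually produced, and all names and numbers are assumptions.

# A minimal sketch of applying a registered mining model to a defined segment:
# every customer in the segment is scored and only those above a cutoff make the
# campaign list. The segment, cutoff, and scoring function are stand-ins.

def registered_model(customer):
    """Placeholder for a model exported from the mining tool (e.g., a scorecard)."""
    return 0.02 * customer["visits"] + 0.00001 * customer["annual_spend"]

segment = [
    {"id": "C1", "visits": 12, "annual_spend": 18_000},
    {"id": "C2", "visits": 2,  "annual_spend": 1_200},
    {"id": "C3", "visits": 9,  "annual_spend": 9_500},
]

CUTOFF = 0.15
campaign_list = [c["id"] for c in segment if registered_model(c) >= CUTOFF]
print("target these customers:", campaign_list)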

    Return to the top of the page

    Supply Chain Focused Data Mining Tools for Component Distributors

            Component distributors are now fielding data mining tools that can help their customers reduce costs and time to market. Arrow, Avnet, IC Solutions, Pioneer-Standard and Sager are among those offering this kind of data mining tool.

            Arrow's "Risk Manager" helps identify and reduce procurement risks by addressing issues such as product family life cycle, part status and availability, lead time, multi-source profile and breadth of use. An alert service automatically notifies engineers and buyers of changes in a part's life cycle. This feature can be particularly helpful in the early stages of design, when engineers can catch components that are entering decline and find a fit-to-function replacement.
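            The alert rule itself can be pictured as a simple filter over a bill of materials. The sketch below is not Arrow's product; it merely illustrates the idea of flagging parts whose lifecycle stage or lead time suggests procurement risk, using made-up part records and thresholds.

# A minimal sketch (not Arrow's actual product) of the alert rule described:
# flag bill-of-materials parts whose lifecycle stage or lead time suggests
# procurement risk. Part records, stages, and thresholds are made up.

AT_RISK_STAGES = {"decline", "end-of-life", "obsolete"}
MAX_LEAD_TIME_WEEKS = 16

bom = [
    {"part": "74HC595", "stage": "mature",      "lead_time_weeks": 6},
    {"part": "XC9536",  "stage": "decline",     "lead_time_weeks": 10},
    {"part": "AD7705",  "stage": "mature",      "lead_time_weeks": 20},
    {"part": "MAX232",  "stage": "end-of-life", "lead_time_weeks": 8},
]

for part in bom:
    reasons = []
    if part["stage"] in AT_RISK_STAGES:
        reasons.append(f"lifecycle stage '{part['stage']}'")
    if part["lead_time_weeks"] > MAX_LEAD_TIME_WEEKS:
        reasons.append(f"{part['lead_time_weeks']}-week lead time")
    if reasons:
        # The alert service would notify the design engineer and the buyer.
        print(f"ALERT {part['part']}: " + "; ".join(reasons))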

            Avnet Electronics Marketing has added to its "Premiere" tool suite a database called "Component Selector." The database lessens the probability of an engineer selecting an obsolete or unavailable component.

            Another tool, Spin TT, lets engineers integrate component design and supply chain information into their desktop electronic-design-automation tools, giving them access to part numbers, performance parameters and supply chain information such as part changes, obsolescence notices, lead time, availability and pricing. Design engineers may spend eight to ten weeks on a design, but the purchasing staff and the manufacturing engineer then have to live with those design decisions for the next year or two.

            Sager's SynerSpec design tool is said to give its users visibility into every commodity part number for every supplier on Sager's line card. If a designer is working on a board, they click "build a board." The tool then displays the steps the designer must take, such as choosing logic components, building or buying a power supply and choosing thermal-management solutions. Users have access to all available product data by product group or supplier, and can create a bill of materials (BOM), have it quoted or order a sample, all within SynerSpec.

    Return to the top of the page

    Fraud Detection and Control

            Fraud and scams are among the dominant white-collar crimes in today's business world. An unfortunate but well-known fact is that many businesses and government organizations, particularly financial and related services, suffer from fraud of various kinds. Fraud bleeds companies of hundreds of billions of dollars worldwide each year.

            Evidence of fraud lies partly hidden in enormous warehouses of data. Data analysis techniques can help businesses perform effective fraud management to prevent losses and bring the culprits to justice.

            Fraud management involves a whole gamut of activities: early warnings and alarms; telltale symptoms and patterns of various types of fraud; profiles of users and activities; fraud prevention and avoidance; minimizing false alarms and avoiding customer dissatisfaction; estimating losses; risk analysis; surveillance and monitoring; security of computers, data, networks, and physical facilities; data and records management; collection of evidence from data and other sources; reports, summaries, and data visualization; links to management information systems and operational systems such as billing and accounting; and control actions such as prosecution, employee education and ethics programs, hotlines, and cooperation with partners and law enforcement agencies.

            Several critical issues make building fraud management systems a challenging and difficult task: enormous volumes of data with complex structure; the changing behavior of users, business activities, and fraudsters; the continuous evolution of new fraud techniques, particularly those designed to bypass existing detection techniques; the need for fast and accurate fraud detection without undue burden on business operations; the risk of false alarms; and social issues such as privacy and discrimination.

            The techniques used for fraud detection fall into two primary classes: statistical techniques and artificial intelligence (AI) techniques. Many commercial fraud detection tools are available that provide a variety of techniques from either of these areas, although usually not in any single integrated tool. Statistical data analysis techniques commonly applied to fraud detection range from simple profiling and the calculation of summary statistics to deviation detection, clustering, classification, and time-series analysis.

            In addition, a number of auxiliary tools can help surveillance personnel quickly grasp the nature of business data and activities. These include canned queries, summary reports, data visualization in various forms, software filters in the form of early warning indicators, alarm conditions, and so on. Usually, these techniques require considerable human expertise and active participation. They are also used iteratively: suspicious transactions are first identified and then further investigated to locate the victims, the suspects, and their methods, which are then analyzed to enable prevention or to gather evidence.
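            One of the simplest statistical techniques of this kind is deviation detection: flagging a transaction whose amount lies far from the account's historical average. The sketch below illustrates such a z-score rule with made-up charge history and an assumed threshold; real systems combine many such indicators.

# A minimal sketch of a simple statistical fraud indicator: flag a transaction
# whose amount lies far from the account's historical mean (a z-score rule).
# The charge history and the threshold are illustrative assumptions.
from statistics import mean, stdev

def is_suspicious(history, amount, z_threshold=3.0):
    """Flag amounts more than z_threshold standard deviations above the mean."""
    if len(history) < 2:
        return False  # not enough history to judge
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return amount != mu
    return (amount - mu) / sigma > z_threshold

past_charges = [42.0, 55.5, 38.0, 61.0, 47.5, 52.0]
for charge in (58.0, 950.0):
    flag = "SUSPICIOUS" if is_suspicious(past_charges, charge) else "ok"
    print(f"charge {charge:8.2f} -> {flag}")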

            Applications of knowledge-based techniques from AI are a natural next step. Important AI techniques used for fraud management include expert systems that encode detection rules, pattern recognition, machine learning methods such as neural networks and decision trees, and intelligent agents for monitoring transactions.

    Other techniques such as Bayesian networks, decision theory, and sequence matching are also used for fraud detection.
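            As an illustration of the machine learning side, the sketch below (assuming the scikit-learn library) trains a small decision tree on previously investigated cases and uses it to flag new transactions for review; the features, labels, and tree settings are toy assumptions only.

# A minimal sketch, assuming scikit-learn is available, of one AI technique the
# text mentions: a decision tree trained on previously investigated cases and
# used to score new transactions. Features and labels are toy data only.
from sklearn.tree import DecisionTreeClassifier

# Features: [amount, hours_since_last_txn, is_foreign_merchant]
X_train = [
    [35, 12, 0], [1200, 1, 1], [60, 30, 0],
    [2500, 0.5, 1], [45, 8, 0], [900, 2, 1],
]
y_train = [0, 1, 0, 1, 0, 1]  # 1 = confirmed fraud in a past investigation

tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X_train, y_train)

new_transactions = [[50, 20, 0], [1800, 1, 1]]
for txn, label in zip(new_transactions, tree.predict(new_transactions)):
    print(txn, "-> flag for review" if label == 1 else "-> pass")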

    Return to the top of the page

    Ethical Issues

    Right to Privacy

            Privacy. It's a loaded issue. In recent years privacy concerns have taken on a more significant role in American society as merchants, insurance companies, and government agencies amass warehouses containing personal data. The concerns that people have over the collection of this data will naturally extend to any analytic capabilities applied to the data. Users of data mining should start thinking about how their use of this technology will be impacted by legal issues related to privacy.

            Consider the recent uproar over CVS drug stores and their use of Elensys, a Massachusetts direct marketing company that sends reminders to customers who have not renewed their prescriptions (Boston Globe, Feb. 19, 1998). After receiving criticism over what was considered a violation of the privacy of their customers' medical records, CVS terminated its agreement with Elensys. Although there was no direct mention of data mining during the controversy, Evan Hendricks, editor of Privacy Times, said that recent Senate hearings on medical privacy (S.1368) included discussions of Elensys and the use of data mining for marketing activities. If and when this legislation is enacted, it could very well contain one of the first legal limitations on the use of data mining technology.

            This is just the tip of the iceberg. In several recent issues of Privacy Times, Hendricks describes a number of US, Canadian, and European policy decisions that could directly or indirectly affect the use of data mining technology by database marketing organizations. Although the United States takes a more laissez-faire approach to privacy, these policies are facing challenges. Beginning in October 1998, the European Union's Directive on Data Protection bars the movement of personal data to countries that do not have sufficient data privacy laws in place. Industry groups argue that the voluntary controls currently in place in the United States are sufficient. Privacy advocates, on the other hand, argue that any controls must be backed by legislation. Whether or not voluntary measures will suffice is still an open question, but recent comments by EU officials have been critical of the voluntary approach.

            Another critical evaluation of data mining and privacy was recently released in a report by the Ontario Information and Privacy Commissioner, Ann Cavoukian. The report, "Data Mining: Staking a Claim on Your Privacy," says data mining "may be the most fundamental challenge that privacy advocates will face in the next decade..." The report looks at data mining and privacy in the context of international "fair information practice" principles. These principles, established in 1982, dictate how personal data should be protected in terms of quality, purpose, use, security, openness, individual participation, and accountability. According to Commissioner Cavoukian, a number of these principles conflict with many current uses of data mining technology. Looking at the "purpose principle," for example, she writes: "For example, if the primary purpose of the collection of transactional information is to permit a credit card payment, then using the information for other purposes, such as data mining, without having identified the purpose before or at the time of collection, is in violation of the purpose and use limitation principles. The primary purpose of the collection must be clearly understood by the consumer and identified at the time of the collection. Data mining, however, is a secondary, future use. As such it requires the explicit consent of the data subject or consumer."

            Although broadly written use statements could be added to customer agreements to allow data mining, Cavoukian questions whether or not these waivers are truly meaningful to consumers. Since data mining is based on the extraction of unknown patterns from a database, "data mining does not know, cannot know, at the onset, what personal data will be of value or what relationships will emerge. Therefore, identifying a primary purpose at the beginning of the process, and then restricting one's use of the data to that purpose are the antithesis of a data mining exercise."

            Cavoukian sees informed consumer consent as the key issue. Customers should be told how the data collected about them will be used and whether or not it will be disclosed to third parties. The report recommends that customers be given three levels of "opt-out" choices for any data that has been collected (a sketch of how such choices might be enforced appears after the list):

    1. Do not allow any data mining of customer's data.

    2. Allow data mining only for internal use.

    3. Allow data mining for both internal and external use.
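            The sketch below illustrates how such opt-out levels might be enforced in practice: records are filtered by their consent level before they ever reach a mining job. The consent codes and customer records are illustrative assumptions and are not part of the Cavoukian report.

# A minimal sketch of enforcing the three opt-out levels above before any
# records reach a mining job. Consent codes and records are made up.

NO_MINING, INTERNAL_ONLY, INTERNAL_AND_EXTERNAL = 1, 2, 3

customers = [
    {"id": "C1", "consent": NO_MINING},
    {"id": "C2", "consent": INTERNAL_ONLY},
    {"id": "C3", "consent": INTERNAL_AND_EXTERNAL},
]

def minable_records(records, external_use=False):
    """Return only the records whose consent level permits this mining run."""
    required = INTERNAL_AND_EXTERNAL if external_use else INTERNAL_ONLY
    return [r for r in records if r["consent"] >= required]

print("internal campaign:", [r["id"] for r in minable_records(customers)])
print("shared with partner:", [r["id"] for r in minable_records(customers, external_use=True)])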

            Although Canada (except Quebec) currently does not have laws limiting the use of personal information by private companies, Cavoukian calls for controls that are "codified through government enactment of data protection legislation for private sector businesses."

            These collisions between data mining and privacy are just the beginning. Over the next few years we should expect to see increased scrutiny of data mining in terms of its impact on privacy. The sheer amount of data collected about individuals, coupled with powerful new technologies such as data mining, will generate a great deal of concern among consumers. Unless this concern is effectively addressed, we can expect to see legal challenges to the use of data mining technology.

    Return to the top of the page

    National Security

            The Defense Advanced Research Projects Agency in the Defense Department has a budget of $137 million for fiscal 2003 to conduct research on a data mining project known as "Total Information Awareness." Total Information Awareness is aimed at helping identify potential domestic terrorists by amassing a database about people's lives. The system would mine the database, searching for suspicious patterns in data contained not only in federal government databases but also in state, local-authority, and commercial databases such as those held by car rental agencies. Work has already begun to integrate terrorist-related information from databases across government agencies. The next step in this effort will be to adopt common metadata standards that will allow this information to be easily found and understood, as well as to use advanced data mining techniques to reveal patterns of criminal behavior.

            Various legislators are concerned about the violation of civil liberties protected by the US Constitution and the introduction of "virtual bloodhounds" capable of compiling records of a person's financial, travel, credit-card and health history. Jan Walker, spokesperson for the Defense Advanced Research Projects Agency, says that the agency is involved in technology research and development and has no intention of collecting data on individuals. According to Walker, the objective is to create better computer tools for law enforcement and intelligence agencies, and the data used in the agency's experiments is either made up or information that the intelligence community can already obtain legally.

            What John Poindexter wishes to do is take the data mining tools familiar in the "one-to-one marketing" commercial world, and then use them to create templates of intelligence information to identify suspicious behavior. Since many of these tools were based on database platforms originally developed for the NSA, it's easy to reassign them to an expanded intelligence domain. Mr. Poindexter is also looking at making information appliances of all types more aware of data and voice flows being sent, and capable of reporting certain statistics to network nodes in public Internet and telephony networks.

            The technology for this, developed under CALEA (the Communications Assistance for Law Enforcement Act), is being inserted into telco central offices and points of presence worldwide. The NSA already uses this technology globally, since the Supreme Court recognizes no civil liberties as applying to foreigners. All it would take to put CALEA-enabled equipment to broader use at home is a few amendments to the Homeland Security Act.

    The Future of Data Mining

            A recent Gartner Group Advanced Technology Research Note listed data mining and artificial intelligence among the top five key technology areas that "will clearly have a major impact across a wide range of industries within the next 3 to 5 years." Gartner also listed parallel architectures and data mining as two of the top 10 new technologies in which companies will invest during the next 5 years. According to a recent Gartner HPC Research Note, "With the rapid advance in data capture, transmission and storage, large-systems users will increasingly need to implement new and innovative ways to mine the after-market value of their vast stores of detail data, employing MPP (massively parallel processing) systems to create new sources of business advantage (0.9 probability)."

    Return to the top of the page

    Bibliography

    Abrams, Jim. "Senate Democrats Try to Stop Pentagon Data Mining Project; Amendment would cut off funding until Congress reviews the intention of the programs." Information Week. January 17, 2003. Source: University of Maryland Baltimore County Computer Database. Sourced: 02/27/2003.

    ACM SIGKDD Special Interest Group on Knowledge Discovery in Data and Data Mining. http://www.acm.org/sigkdd/. Sourced: 02/27/2003.

    Apte, C.V., R. Natarajan, E.P.D. Pednault, F.A. Tipu. "A probabilistic estimation framework for predictive modeling analytics." IBM Systems Journal. September 2002. v41 i3 p438 (11). Source: University of Maryland Baltimore County Computer Database. Sourced: 02/27/2003.

    Bjorck, A. "Numerical Methods for Least Squares Problems." Philadelphia, PA: SIAM, 1996.

    Berthold, Michael (Editor), Hand, D. (Editor). Intelligent Data Analysis: An Introduction. 2nd ed. New York: Springer-Verlag, June 2002.

    Data Mining 2002-Post Conference Report. http://wessex.ac.uk/conferences/2002/datamining02/ . Sourced: 02/27/2003.

    Data Mining: Concepts and Techniques. http://www.cs.sfu.ca/~han/dmbook . Sourced: 02/27/2003.

    Data Mining and Knowledge Discovery. http://www.digimine.com/usama/datamine/ . Sourced: 02/27/2003.

    Data Mining Group. http://www.dmg.org/ . Sourced: 02/27/2003.

    Data Mining in Bioinformatics. http://industry.ebi.ac.uk/~brazma/dm.html . Sourced: 02/27/2003.

    DM Review: The Premier Publication for Business Intelligence and Analytics. http://www.dmreview.com/ . Sourced: 02/27/2003.

    Directory of Data Warehouse, Data Mining, and Decision Support Resources. http://www.infogoal.com/dmc/dmcdwh.htm . Sourced: 02/27/2003.

    Enable Soft Inc., Foxtrot Data Mining Software (free 15-day trial). http://www.enablesoft.com/ . Sourced: 02/27/2003.

    FDK: Data mining and machine learning. http://www.cs.helsinki.fi/research/pmdm/datamining/ . Sourced: 02/27/2003.

    "Future gazing." Information Age. September 10, 2002. Source: University of Maryland Baltimore County Computer Database. Sourced: 02/27/2003.

    Goodwin, Bill. "Burglars captured by police data mining kit." Computer Weekly. August 8, 2002. p3. Source: University of Maryland Baltimore County Computer Database. Sourced: 02/27/2003.

    Grossman, David, Ophir Frieder. "Integrated Structured Data." Intelligent Enterprise. September 18, 2001.v4 i14 p22. Source: University of Maryland Baltimore County Computer Database. Sourced: 02/27/2003.

    Grushkin, Barry. "A Meeting of the Minds." Intelligent Enterprise. November 10, 2000. v3 i17 p20. Source: University of Maryland Baltimore County Computer Database. Sourced: 02/27/2003.

    http://www.bitpipe.com. (IT source for white papers, webcasts, case studies, analytic reports and product information.) Sourced: 04/16/2003.

    Hastie, T., R. Tibshirani, and J.H. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York: Springer-Verlag, October 2001.

    Hellerstein, J.L., C.S. Perng. "Discovering actionable patterns in event data." IBM Systems Journal. September 2002. v41 i3 p475 (19). Source: University of Maryland Baltimore County Computer Database. Sourced: 02/27/2003.

    ICDM-02: 2002 IEEE International Conference on Data Mining. http://kis.maebashi-it.ac.jp/icdm02/ . Sourced: 02/27/2003.

    Johnson, Collin R. "Protocol aimed at data mining." Electronic Engineering Times. September 18, 2000. p72. Source: University of Maryland Baltimore County Computer Database. Sourced: 02/27/2003.

    Johnson, Gretel. "Data mining suggested to deter terror; Theory met with skepticism by government." InfoWorld.com. November 5, 2002. Source: University of Maryland Baltimore County Computer Database. Sourced: 02/27/2003.

    Kemp, Ted. "Customer Analytics At A Cost - Expensive Integration Required To Get True Readings of Customer Preferences and Buying Habits." InternetWeek. December 10, 2001. p15. Source: University of Maryland Baltimore County Computer Database. Sourced: 02/27/2003.

    Kestelyn, Justin. "The White House puts information sharing and analytics at the top of the Homeland Security priority list." Intelligent Enterprise. September 17, 2002. p12 (1). Source: University of Maryland Baltimore County Computer Database. Sourced: 02/27/2003.

    Knowledge Discovery in Databases and Data Mining. http://db.cs.sfu.ca/sections/publication/kdd/kdd.html . Sourced: 02/27/2003.

    Montana, John. "Data mining: a slippery slope." Information Management Journal. October 2001. v35 i4 p50 (3). Source: University of Maryland Baltimore County Computer Database. Sourced: 02/27/2003.

    Murphy, Don. "Predictive analytics as the proverbial early bird." Customer Interaction Solutions. January 2002. v20 i7 p26 (2). Source: University of Maryland Baltimore County Computer Database. Sourced: 02/27/2003.

    NCBI Tools for Bioinformatics Research. http://www.ncbi.nlm.nih.gov/Tools/ . Sourced: 02/27/2003.

    NCDM (National Center for Data Mining). http://www.ncdm.uic.edu/ . Sourced: 02/27/2003.

    New Architect Feature Article. http://www.webtechniques.com/archives/2000/01/greening/ . Sourced: 02/27/2003.

    Newman, Christine. "Growing your revenue and profitability; it's all about your data." Intelligent Enterprise. September 17, 2002. v5 i15 p28 (1). Source: University of Maryland Baltimore County Computer Database. Sourced: 02/27/2003.

    Oblinger, D.A., M. Reid, M. Brodie, R. de Salvo Braz. "Cross training and its application to skill mining." IBM Systems Journal. September 2002.v41 i3 p449 (12). Source: University of Maryland Baltimore County Computer Database. Sourced: 02/27/2003.

    Palshikar, Girish Keshav. "The hidden truth: Data analysis can be a strategic weapon in your company's management and control of fraud." Intelligent Enterprise. May 28, 2002. v5 i9 p46 (5). Source: University of Maryland Baltimore County Computer Database. Sourced: 02/27/2003.

    Quest: The Data Mining Group. http://www.almaden.ibm.com/cs/quest/ . Sourced: 02/27/2003.

    Roos, Gina. "Distributors go mining for information." Electronic Engineering Times. October 21, 2002. p81. Source: University of Maryland Baltimore County Computer Database. Sourced: 02/27/2003.

    SIAM International Conference on Data Mining. http://www.siam.org/meetings/sdm02/ . Sourced: 02/27/2003.

    Stodder, David. "A Discovery with No Name," Intelligent Enterprise. October 20, 2000. v3 i16 p14. Source: University of Maryland Baltimore County Computer Database. Sourced: 02/27/2003.

    "Teradata at Travelocity: The private sector's CRM-ROI formula can work in agencies too. (CRM in Action). Government Computer News. November 18, 2002. v21 i33 pS6 (1). Source: University of Maryland Baltimore County Computer Database. Sourced: 02/27/2003.

    The Data Mine-Data Mining and KDD Information. http://www.andypryke.com/university/TheDataMine.html . Sourced: 02/27/2003.

    Thearling, Kurt. "An Introduction to Data Mining: Discovering hidden value in your data warehouse." http://www.thearling.com . Online. Internet. Sourced 2/27/2003.

    Toigo, Jon William. "Data Mining Gets Real." Enterprise Systems Journal. April 1999. v14 i4 p60 (1). Source: University of Maryland Baltimore County Computer Database. Sourced: 02/27/2003.

    Two Crows: Data Mining Glossary. http://twocrows.com/glossary.htm . Sourced: 02/27/2003.

    Whiting, Rick. "Biz Intelligence Turns to Suppliers - Concerns about recession and reliability of supply chains lead companies to analyze links." Information Week. October 15, 2001. p57. Source: University of Maryland Baltimore County Computer Database. Sourced: 02/27/2003.

    Wirbel, Loring. "True stateful inspection." Electronic Engineering Times. November 25, 2002. p35. Source: University of Maryland Baltimore County Computer Database. Sourced: 02/27/2003.

    Yoon, Younghoc. "Discovering Knowledge in Corporate Databases." Information Systems Management. Spring 1999.

    Return to the top of the page

    The Integral Worm, Inc. • Christopher Paul • Independent Senior Technical Writer/Editor
