Data mining

Table of Contents

  1. Concept and motivation
  2. Objective and classification
  3. Components
  4. Areas of application
  5. Problems and Outlook

Concept and motivation

Data mining is the application of methods and algorithms for automatically extracting empirical relationships between planning objects whose data is held in a database set up for this purpose. For example, it can determine which products are often bought together (typical shopping carts) or which factors are decisive for customer loyalty. Efforts to use data mining are motivated by the obvious gap between the volume of data a company collects and holds, made possible by ERP systems and modern integrated company databases, and the poor use made of this potential in tactical and strategic decisions and in the management process. “We are drowning in data but starving for knowledge” is a statement that vividly describes this phenomenon. What is needed, therefore, are efficient analysis tools that "dig out" the interesting and important statements from the data in order to generate knowledge that goes beyond the usual controlling key figures, which are merely aggregates of large data sets.

Objective and classification

The objective is to identify those relationships in the data that are interesting and useful for the decision maker, i.e. that help improve their decisions. The nontrivial problem to be solved here is how to operationalize the interestingness and usefulness of statements. Statistical significance alone is certainly not sufficient for this.

Data mining integrates methods and procedures from artificial intelligence and statistics with models of the application area. In contrast to the classic approaches from these fields, data mining covers not only the testing of manually formulated hypotheses but also the generation of new hypotheses. Data mining is part of a comprehensive process called Knowledge Discovery in Databases (KDD).

Components

Data mining procedures comprise the following components:

- Data access: A data mining procedure must be able to access the company data. Ideally, this is done via the ODBC interface. Access is usually preceded by narrowing the focus to a particular area of analysis (e.g. customer segmentation), whose characteristic data is consolidated in a specially created database (data warehouse) or in a single data table.

- Model type: The model type determines, on the one hand, the kind of hypotheses that can be generated and, on the other hand, the size of the solution space for the data mining procedure. Simple rule models in the form of if-then statements are often used in data mining. Decision trees are a frequently used representation of special sets of rules. More complex methods are based on predicate logic or use special neural networks such as Kohonen networks.

- Interestingness measure: In data mining, the patterns found have to be evaluated in terms of how interesting they are for a specific application. This measurement problem is addressed by splitting interestingness into several measured variables that are as independent of one another as possible. One of these is conspicuousness: the more a statement deviates from other (average) statements, the more interesting it is. Another is generality: the more general a statement is, the more interesting it is. Furthermore, the value of a statement depends on the objective of the decision maker. The statements must hold with a certain probability, i.e. they must be valid. The form of presentation influences the interpretability and thus the comprehensibility of the patterns. Finally, the potential usability of a statement is expressed by its operationality.

- Search procedure: The task of the search procedure is to search the solution space for the most interesting model (i.e. for the most interesting set of statements). The more complex the models of the selected model type can become, the greater the solution space for data mining. As a rule, such a solution space cannot be searched using exact procedures, so that a heuristic search procedure must be selected.
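The interplay of rule model, interest measurement and search procedure can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the toy customer records, the attribute names, and the scoring formula, which multiplies generality (share of records a rule covers) by conspicuousness (how far the rule's validity deviates from the overall average). This product is known as weighted relative accuracy in the subgroup-discovery literature; the text above does not prescribe it. The search is a simple greedy heuristic that adds one condition at a time.

```python
from typing import Dict, FrozenSet, List, Tuple

Condition = Tuple[str, str]        # (attribute, value)
Rule = FrozenSet[Condition]        # "if all conditions hold, then loyal"

TARGET = "loyal"

# Hypothetical toy data; attribute names and values are invented.
RECORDS: List[Dict[str, object]] = [
    {"contract": "annual",  "calls": "few",  "loyal": True},
    {"contract": "annual",  "calls": "few",  "loyal": True},
    {"contract": "annual",  "calls": "many", "loyal": True},
    {"contract": "annual",  "calls": "many", "loyal": True},
    {"contract": "monthly", "calls": "few",  "loyal": True},
    {"contract": "monthly", "calls": "few",  "loyal": False},
    {"contract": "monthly", "calls": "many", "loyal": False},
    {"contract": "monthly", "calls": "many", "loyal": False},
]


def covers(rule: Rule, record: Dict[str, object]) -> bool:
    """A rule covers a record if every condition matches."""
    return all(record[attr] == value for attr, value in rule)


def interestingness(rule: Rule, records: List[Dict[str, object]]) -> float:
    """Generality times conspicuousness (assumed scoring formula)."""
    covered = [r for r in records if covers(rule, r)]
    if not covered:
        return 0.0
    generality = len(covered) / len(records)
    validity = sum(r[TARGET] for r in covered) / len(covered)
    base_rate = sum(r[TARGET] for r in records) / len(records)
    return generality * (validity - base_rate)


def greedy_search(records: List[Dict[str, object]],
                  max_conditions: int = 2) -> Tuple[Rule, float]:
    """Heuristic search: extend the rule by the best single condition
    until no extension improves the score."""
    candidates = sorted({(a, str(r[a])) for r in records for a in r if a != TARGET})
    best: Rule = frozenset()
    best_score = interestingness(best, records)   # empty rule scores 0.0
    for _ in range(max_conditions):
        used = {attr for attr, _ in best}
        extensions = [best | {c} for c in candidates if c[0] not in used]
        if not extensions:
            break
        top = max(extensions, key=lambda rule: interestingness(rule, records))
        top_score = interestingness(top, records)
        if top_score <= best_score:
            break                                  # no improvement: stop
        best, best_score = top, top_score
    return best, best_score


best_rule, best_score = greedy_search(RECORDS)
print(sorted(best_rule), best_score)
```

On this toy data the search stops after one condition, because narrowing the rule further makes it less general without making it any more valid. A greedy search of this kind can miss better rules that an exhaustive search would find, which is the price paid for not enumerating the whole solution space.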

Areas of application

Data mining offers a range of potential applications for description, explanation and forecasting models. Identifying buyer profiles (e.g. for cross-selling) and market segmentation are examples of description models. Association statements, such as shopping basket analysis, and characterization statements, such as determining the success factors of a website, belong to the group of explanatory models. Forecasting models include direct forecasting tasks, such as forecasting contract terms for insurance companies or exchange rates for financial planning, as well as classification tasks, such as diagnosing errors or diseases, assessing employees, or assigning a policyholder to a specific tariff class.
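As an illustration of the association statements mentioned above, the following Python sketch derives simple "if A is bought, then B is bought" rules from shopping baskets. The baskets, the function name pair_rules and the min_support threshold are assumptions made for this example; support and confidence are the measures commonly used for such rules, not ones the text itself specifies.

```python
from collections import Counter
from itertools import combinations
from typing import List, Set, Tuple

# Hypothetical baskets; product names are invented.
BASKETS: List[Set[str]] = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "cereal"},
    {"bread", "butter", "jam"},
]


def pair_rules(baskets: List[Set[str]],
               min_support: float = 0.3) -> List[Tuple[str, str, float, float]]:
    """Return (lhs, rhs, support, confidence) for every product pair whose
    support reaches min_support, sorted by descending confidence."""
    n = len(baskets)
    item_counts = Counter(item for basket in baskets for item in basket)
    pair_counts = Counter(
        pair for basket in baskets for pair in combinations(sorted(basket), 2)
    )
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n                 # share of baskets with both items
        if support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            # Share of baskets containing lhs that also contain rhs.
            confidence = count / item_counts[lhs]
            rules.append((lhs, rhs, support, confidence))
    return sorted(rules, key=lambda r: (-r[3], r[0], r[1]))


rules = pair_rules(BASKETS)
for lhs, rhs, support, confidence in rules:
    print(f"if {lhs} then {rhs}: support={support:.2f}, confidence={confidence:.2f}")
```

The sketch only looks at pairs; rules with more than two products enlarge the solution space considerably, which is where the search procedure described earlier becomes important.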

Problems and Outlook

The objective of data mining often results in extremely large solution spaces which, together with the complex algorithms, lead to long runtimes. In the data mining tools currently available, the data access and interestingness assessment components have only been implemented in rudimentary form. How to operationalize and measure interestingness, which controls the extraction process, is also often the subject of controversial discussion. Furthermore, the (ideal) requirement that patterns be extracted autonomously from the data can hardly be met without knowledge of the environment, of the potential relationships and of how the resulting statements will be used. In practice, the database itself often contains defects such as missing or incorrect data, which can negatively affect the results of data mining, so that preparatory data cleansing is necessary.
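A preparatory cleansing step of the kind mentioned above might look like the following Python sketch. The field names, the plausibility ranges and the helper names problems and split_clean are hypothetical; real cleansing rules depend on the application area. The sketch separates records with missing or implausible values from the rest instead of silently dropping them, so that the rejected records can be inspected.

```python
from typing import Dict, List, Sequence, Tuple

Record = Dict[str, object]

# Hypothetical raw customer data with typical defects.
RAW: List[Record] = [
    {"customer_id": 1, "age": 34,   "revenue": 120.0},
    {"customer_id": 2, "age": None, "revenue": 80.0},    # missing value
    {"customer_id": 3, "age": 250,  "revenue": 45.5},    # implausible age
    {"customer_id": 4, "age": 51,   "revenue": -10.0},   # negative revenue
]

REQUIRED = ("customer_id", "age", "revenue")
# Assumed plausibility ranges (inclusive) for numeric fields.
VALID_RANGES = {"age": (0, 120), "revenue": (0.0, float("inf"))}


def problems(record: Record,
             required: Sequence[str],
             valid_ranges: Dict[str, Tuple[float, float]]) -> List[str]:
    """List the defects found in a single record."""
    issues = []
    for field in required:
        if record.get(field) in (None, ""):
            issues.append(f"missing {field}")
    for field, (low, high) in valid_ranges.items():
        value = record.get(field)
        if isinstance(value, (int, float)) and not low <= value <= high:
            issues.append(f"implausible {field}")
    return issues


def split_clean(records: List[Record],
                required: Sequence[str],
                valid_ranges: Dict[str, Tuple[float, float]]
                ) -> Tuple[List[Record], List[Tuple[Record, List[str]]]]:
    """Separate usable records from defective ones, keeping the reasons."""
    clean, rejected = [], []
    for record in records:
        issues = problems(record, required, valid_ranges)
        if issues:
            rejected.append((record, issues))
        else:
            clean.append(record)
    return clean, rejected


clean, rejected = split_clean(RAW, REQUIRED, VALID_RANGES)
print(len(clean), "clean,", len(rejected), "rejected")
for record, issues in rejected:
    print(record["customer_id"], issues)
```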

Future data mining tools can be expected to reduce the problems described above, with further development in all four of the components listed earlier. Attempts to extend data mining from structured databases to less structured data, such as texts, images or HTML documents, are also of interest.