The path that leads from raw data to a deployable, mined model looks straightforward enough, but is in fact replete with backtracks and detours. Developing mined models requires a lot of working with the data: making changes and taking actions that depend entirely on the needs of the business problem and on what the miner discovers in the data. Anyone unfamiliar with data mining who reads through the next chapters could easily assume that the miner’s progress follows the order in which the material is presented. However, you must expect that your path will not be so straightforward. Do not be at all surprised if you return to a previous stage several times!
Mining data is not magic, and it is not something that computer software will do for you. Essentially, data mining is a structured way of playing with data, of finding out what potential information it contains and how it applies to solving the business problem. The tools used to mine data aren’t magic, either. Most of those currently in use were developed from three main areas: statistics, artificial intelligence, and machine learning. Despite apparently different roots, these tools essentially do only one thing: discover a relationship that more or less maps measurements in one part of a data set to measurements in another, linked part of the data set. That’s it. Nothing startling, nothing fancy. Just an expression of the relationship between two linked parts of a data set.
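To make "a relationship that maps measurements in one part of a data set to measurements in another" concrete, here is a minimal sketch in Python. The data values, the names `ad_spend` and `sales`, and the choice of a least-squares straight line are all assumptions made purely for illustration; real mining tools discover far more varied relationships, but the principle is the same.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Spread of the inputs, and how inputs and outputs vary together.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical, made-up measurements: one part of the data set (ad_spend)
# is linked record by record to another part (sales).
ad_spend = [1.0, 2.0, 3.0, 4.0]
sales = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(ad_spend, sales)


def predict(x):
    """Apply the discovered relationship to a new measurement."""
    return slope * x + intercept
```

The fitted line is nothing more than an expression of the relationship between the two columns. Deciding whether that relationship matters to the business problem remains the miner’s job.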
When brought down to such basics, it’s pretty obvious that the wonderful tools of data mining aren’t going to solve problems, especially business problems, by themselves. Data mining is a human activity, and it is the miner that produces the results of data mining, not the tools. Yet data mining can produce powerful, startling, insightful, surprising, and highly profitable results – but never forget that the results come from the insight and understanding applied by human intelligence.
Most of a miner’s time in any data mining project is devoted to preparing data sets, and such preparation is the place where a miner begins the work of mining. You can expect to invest between 60 and 90% of your time simply preparing the data for mining. Data preparation for data mining is a complex subject, and this chapter can provide only the basic foundations of this activity. (For a more extensive description of data preparation techniques, see the author’s book Data Preparation for Data Mining, also from Morgan Kaufmann Publishers.) Nonetheless, this chapter is not an abbreviated recapitulation of that book. Instead, it lays out a basic method for preparing data that results in useable data sets.
The first three stages of mining data – the assay, feature extraction, and the data survey – are discussed in this chapter. Taken together, these three stages make up most of data preparation.