Data mining is one of the hottest areas in computer science today.  It sits at the intersection of several fields, including statistics, database technology, machine learning, and visualization.  There are already plenty of tools for processing data, such as R and the built-in data mining functionality in Microsoft SQL Server.  These let you do things like build decision trees to classify data.

These tools can be used in a data mining pipeline, which includes data collection, mining, and visualization of the results.  One problem with blindly using them is that there are so many different ways to model and mine the data that one needs some knowledge of which models are appropriate for which kinds of data.  Ideally, one should also have some metric that measures the accuracy of the data mining procedure, such as the error on data held out from training.  Too often, one finds a model that fits the training data closely but overfits it, and so predicts poorly on new data.
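
To make the overfitting point concrete, here is a minimal sketch in plain Python.  Everything in it is an assumption for illustration: the synthetic data (a true relation y = 2x + 1 with an alternating +/-0.5 disturbance on the training targets) and the helper names fit_line, fit_interpolant, and mse are hypothetical.  It compares a least squares line with a polynomial forced through every training point, and measures both on held-out points.

```python
def fit_line(xs, ys):
    """Closed-form ordinary least squares fit of y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return lambda x: a * x + b


def fit_interpolant(xs, ys):
    """Lagrange polynomial passing through every training point."""
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return predict


def mse(model, xs, ys):
    """Mean squared error of a model on the given points."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Synthetic training data (assumed for illustration): y = 2x + 1 plus an
# alternating +/-0.5 disturbance.
train_x = [float(i) for i in range(10)]
train_y = [2 * x + 1 + (0.5 if i % 2 == 0 else -0.5)
           for i, x in enumerate(train_x)]

# Held-out points sit between the training points and follow the true relation.
test_x = [i + 0.5 for i in range(9)]
test_y = [2 * x + 1 for x in test_x]

line = fit_line(train_x, train_y)
wiggly = fit_interpolant(train_x, train_y)

for name, model in [("line", line), ("interpolant", wiggly)]:
    print(name,
          "train MSE:", mse(model, train_x, train_y),
          "held-out MSE:", mse(model, test_x, test_y))
```

The interpolating polynomial reproduces the training data exactly, yet its error on the held-out points is far larger than the line's.  A training-set score alone would have picked the wrong model, which is why the accuracy metric should come from data the model has not seen.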

So how does one mine data effectively without falling into these pitfalls?  Good question.  One approach is to get a better understanding of the models being applied and the domains in which they work.  What is a Naive Bayes classifier?  What is a decision tree?  I hope to find some of these answers in a data mining class I am currently taking.  The text we are using is The Elements of Statistical Learning.  It presents a good overview of many of the models available.

One of the simplest models is a linear model fit by least squares.  This may seem like a crude model that wouldn't be effective in general, but it is instructive.  There are also some interesting things one can do with linear algebra.  In particular, the Singular Value Decomposition (SVD) of a data set of, say, n rows and k columns allows one to remap the data into a new vector space.  The SVD produces the relation DataSet = USV', where S is a diagonal matrix of singular values, the columns of U and V are the left and right singular vectors, and V' is the transpose of V.  The singular values in S are ordered from largest to smallest.  The idea is that the data spreads out along different dimensions, with the strongest components containing the most information.  One can then remap the data rows into this new space, keeping only the 3 largest singular values, to map a multi-dimensional data set into 3 dimensions.  The article Using Linear Algebra for Intelligent Information Retrieval explores some of the ideas of using these techniques for indexing.
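
The remapping described above can be sketched with NumPy's numpy.linalg.svd.  This is only an illustration under stated assumptions: the random data set (100 rows, 6 columns, built from 3 underlying directions plus a little noise) and the function name project_with_svd are hypothetical and not taken from the class or the textbook.

```python
import numpy as np


def project_with_svd(data, k=3):
    """Map each row of `data` to k coordinates using the k largest singular values."""
    # full_matrices=False gives the compact form: data == U @ diag(s) @ Vt
    U, s, Vt = np.linalg.svd(data, full_matrices=False)
    # NumPy returns s sorted from largest to smallest, so the first k
    # columns of U (scaled by s) are the strongest components.
    coords = U[:, :k] * s[:k]
    return coords, s, Vt


# Hypothetical data set: 100 rows, 6 columns, mostly 3-dimensional structure.
rng = np.random.default_rng(0)
latent = rng.normal(size=(100, 3))
mixing = rng.normal(size=(3, 6))
data = latent @ mixing + 0.01 * rng.normal(size=(100, 6))

coords, s, Vt = project_with_svd(data, k=3)
print("reduced shape:", coords.shape)
print("singular values:", s)

# Rebuild the data from only the 3 kept components and measure what was lost.
approx = coords @ Vt[:3]
rel_err = np.linalg.norm(data - approx) / np.linalg.norm(data)
print("relative reconstruction error:", rel_err)
```

Each row of coords is a point in 3 dimensions that can be plotted directly.  The reduced coordinates are the same as multiplying the data by the first 3 columns of V.  In practice the columns are often mean-centered before taking the SVD, which is a choice this sketch leaves out.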

Well, this is just the tip of the iceberg in this field, but I hope to have more later.