Tuesday, May 13, 2008

Mine/d your data

I just participated in a week-long datamining course at the university. It was hard work, but actually a lot of fun. We plowed through a lot of things, including association rules, clustering, logistic regression, decision trees, and neural networks. We also learned (well, made acquaintance with) some of the datamining software, like SAS Enterprise Miner, and used MATLAB to check out the neural networks. What a strange world.

In one exercise we used the German credit dataset and wanted to come up with a decision tree to sort out the bad customers from the good ones. After lots of clicking and choosing values and setting roles, we came up with a tree that had an error rate of 47%. Wow. The banker might as well just flip a coin to decide which customers get credit and which don't. OK, probably a bad example; after that we did learn about the cost of misclassification, so we were able to make something better. But anyway, it just kind of made me laugh.
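To make the cost-of-misclassification idea a bit more concrete, here is a minimal Python sketch. It is not what we did in SAS Enterprise Miner, and everything in it is an assumption for illustration: the 5:1 cost ratio (giving credit to a bad customer hurts five times more than refusing a good one), the function names, and the six made-up customers with their made-up "probability of being bad" scores.

```python
# Assumed costs, for illustration only: granting credit to a bad customer
# is taken to be five times as expensive as refusing a good one.
COST_BAD_AS_GOOD = 5.0
COST_GOOD_AS_BAD = 1.0


def cost_threshold(cost_bad_as_good=COST_BAD_AS_GOOD,
                   cost_good_as_bad=COST_GOOD_AS_BAD):
    """Probability of 'bad' above which refusing credit is the cheaper choice.

    Refuse when p * cost_bad_as_good > (1 - p) * cost_good_as_bad,
    which rearranges to p > cost_good_as_bad / (cost_good_as_bad + cost_bad_as_good).
    """
    return cost_good_as_bad / (cost_good_as_bad + cost_bad_as_good)


def evaluate(labels, p_bads, threshold):
    """Return (error rate, average cost per customer) for a given threshold."""
    errors = 0
    total_cost = 0.0
    for label, p_bad in zip(labels, p_bads):
        predicted = "bad" if p_bad > threshold else "good"
        if predicted != label:
            errors += 1
            # A missed bad customer costs more than a wrongly refused good one.
            total_cost += COST_BAD_AS_GOOD if label == "bad" else COST_GOOD_AS_BAD
    n = len(labels)
    return errors / n, total_cost / n


# Made-up toy data: true class and a model's estimated probability of "bad".
labels = ["good", "good", "good", "bad", "bad", "good"]
p_bads = [0.10, 0.30, 0.05, 0.40, 0.80, 0.20]

naive = evaluate(labels, p_bads, 0.5)                   # plain majority cut-off
cost_aware = evaluate(labels, p_bads, cost_threshold())  # cut-off shifted by costs

print("threshold 0.5:        error rate %.2f, avg cost %.2f" % naive)
print("cost-based threshold: error rate %.2f, avg cost %.2f" % cost_aware)
```

On this toy data the cost-based cut-off makes more mistakes than the plain 0.5 cut-off, but the mistakes are the cheap kind, so the average cost goes down. That is the reason the raw error rate alone can be a misleading number.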

I was reading this blog and came across this interesting information about datamining methods that "miners" choose to use. Now that I know what all those words mean, this became an interesting piece of information for me :)

• Correspondingly, the most commonly used algorithms are regression (79 percent), decision trees (77 percent) and cluster analysis (72 percent). Again, this reflects what we have seen in our own work. Regression certainly remains the algorithm of choice for large sections of the academic community and within the financial services sector. More and more data miners, however, are using decision trees, and cluster analysis has long been the bedrock of the marketing community.
I personally thought that the most useful techniques for me could be association rule mining, cluster analysis and maybe decision trees. That remains to be seen.

What actually amazed me was that datamining is closely related to predicting missing values, i.e. it uses the same methods that many recommender systems/studies use to predict missing ratings. Another thing that was totally new to me was that datamining and machine learning are actually closely related, well, quasi-overlapping, I guess.
