Where to Start With Machine Learning and Data Mining?
I have just started reading about machine learning and predictive analytics in general, so I wanted to share my planned journey for those who wish to start down the same path.
We are collecting more and more data (i.e. multivariate), and it is becoming tricky to make sense of it and to base decisions on it. Sport science needs to catch up with business, especially when it comes to data-driven decision making. What is the difference between predicting churn (a very skewed dataset on customers who will leave the service, e.g. switch to another phone company) and predicting injury? What is the difference between the predicted value in marketing projects ($$$) and in sport injury (days lost/gained, or even contract money here as well)?
One simple problem you can probably relate to (and I wrote about it HERE and HERE): if we collect a bunch of physical (or other, like game stats, psychology, technical skill rating, tactical skill rating) testing data, how do we create athlete profiles that make individualization in training easier when we are working with larger groups? Which athletes are similar (clustering), and in what (PCA; factor analysis)?
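To make the clustering idea a bit more concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the athlete names, the three tests and their values are made up, and the k-means below is a bare-bones version with a simplistic starting point (the first k athletes), not what a proper R or Python package would do.

```python
import math
import statistics

# Hypothetical testing data, made up for illustration:
# (10 m sprint time in s, countermovement jump in cm, Yo-Yo test distance in m)
athletes = {
    "Athlete A": (1.65, 48.0, 1600.0),
    "Athlete B": (1.85, 36.0, 2400.0),
    "Athlete C": (1.68, 46.0, 1700.0),
    "Athlete D": (1.82, 38.0, 2300.0),
    "Athlete E": (1.66, 47.0, 1650.0),
    "Athlete F": (1.84, 35.0, 2450.0),
}


def standardize(rows):
    """Turn every column into z-scores so no test dominates because of its units."""
    columns = list(zip(*rows))
    means = [statistics.mean(col) for col in columns]
    sds = [statistics.stdev(col) for col in columns]
    return [
        tuple((value - m) / s for value, m, s in zip(row, means, sds))
        for row in rows
    ]


def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
        for p in points
    ]


def kmeans(points, k, n_iter=20):
    """Bare-bones k-means; the first k points serve as starting centroids."""
    centroids = list(points[:k])
    for _ in range(n_iter):
        labels = assign(points, centroids)
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[j] = tuple(statistics.mean(col) for col in zip(*members))
    return assign(points, centroids)


names = list(athletes)
labels = kmeans(standardize(list(athletes.values())), k=2)
for name, label in zip(names, labels):
    print(name, "-> profile", label)
```

With this toy data the fast, powerful athletes (A, C, E) should land in one profile and the endurance types (B, D, F) in the other. Standardizing first matters, because otherwise the Yo-Yo distance in meters would swamp the sprint times in seconds.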
Anyway, here are the recommended books to start with, which I plan to read this year.
1. Data Science for Business
Excellent book to start with. It is very low on math and covers the basic principles of predictive analytics, like the confusion matrix, ROC curve, overfitting, cross-validation, and so forth. This is a great book to read if you are going to work with analysts, because you will start to understand their work and lingo. The only problem with this book is that it uses business/marketing problems, but with a little imagination you can figure out the use in sport.
2. Data Smart: Using Data Science to Transform Information into Insight
This book explains predictive analysis in an understandable way, using Excel. Working in Excel is more “hands on”, and the concepts tend to be grasped more easily.
3. Machine Learning with R
Once you are familiar with how to do things in Excel, it is time to move to R. This is a pretty basic book, and after it the major players should come into play.
4. An Introduction to Statistical Learning: with Applications in R
This is a lightweight (read: lighter on math and less thick) version of a pretty famous book by some of the same authors: The Elements of Statistical Learning: Data Mining, Inference, and Prediction. This is much more serious reading, but it shouldn’t be hard since we have prepared by reading the preceding books. The PDF of the book is made freely available by the publisher HERE.
5. Applied Predictive Modeling
This book should round everything off in a nice, useful package, where you are able to quickly and painlessly apply the knowledge using the caret package in R.
6. An Introduction to Applied Multivariate Analysis with R
This book provides a great introduction to multivariate (i.e. a lot of variables) analysis. It gives an overview of PCA (principal components analysis), factor analysis, clustering and visualizing data.
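To give a taste of what the first book on the list covers, here is a small sketch of a confusion matrix on a skewed dataset, written in plain Python. The numbers are invented for illustration: 20 athletes of whom 2 got injured, plus two hypothetical "models" that I made up (one never predicts injury, the other flags four athletes). Treat it as a toy example under those assumptions, not as a real injury model.

```python
def confusion_counts(actual, predicted):
    """Count true/false positives and negatives; 1 = injured, 0 = not injured."""
    pairs = list(zip(actual, predicted))
    return {
        "tp": sum(a == 1 and p == 1 for a, p in pairs),
        "fp": sum(a == 0 and p == 1 for a, p in pairs),
        "fn": sum(a == 1 and p == 0 for a, p in pairs),
        "tn": sum(a == 0 and p == 0 for a, p in pairs),
    }


def summarize(actual, predicted):
    """Accuracy, precision and recall derived from the confusion matrix counts."""
    c = confusion_counts(actual, predicted)
    flagged = c["tp"] + c["fp"]   # athletes the model called "injured"
    injured = c["tp"] + c["fn"]   # athletes who really were injured
    return {
        "accuracy": (c["tp"] + c["tn"]) / len(actual),
        # Precision is undefined when nobody is flagged; 0.0 is used by convention here.
        "precision": c["tp"] / flagged if flagged else 0.0,
        "recall": c["tp"] / injured if injured else 0.0,
    }


# Hypothetical, skewed data: 18 healthy athletes followed by 2 injured ones.
actual = [0] * 18 + [1] * 2

# Hypothetical model 1: always predicts "not injured".
never_injured = [0] * 20

# Hypothetical model 2: flags four athletes, two of whom really got injured.
flags_four = [0] * 16 + [1] * 4

print("never injured:", summarize(actual, never_injured))
print("flags four:   ", summarize(actual, flags_four))
```

Both models are right for 18 of the 20 athletes, so accuracy alone cannot tell them apart. Only the second one catches the injuries (recall), at the price of two false alarms (precision). This is why accuracy by itself says little when the dataset is skewed, whether the rare event is churn or injury.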
Please note that the above links are Amazon associate links, and by clicking on them you increase my chances of getting gift cards from Amazon and allow me to buy more books :). I am also in the process of reading the above books myself and will expand more on machine learning / predictive analytics in the days to come.
FREE Online courses (MOOC)
There is a FREE machine learning course on Coursera by Andrew Ng. Here are an interesting video of his and the introduction video for the Coursera course:
Stanford University is offering a free course on Statistical Learning by Trevor Hastie and Rob Tibshirani. This course follows the Introduction to Statistical Learning book.