Practical Machine Learning
1.1 Prediction motivation
About this course
This course covers the basic ideas behind machine learning/prediction:
· Study design: training vs. test sets
· Conceptual issues: out-of-sample error, ROC curves
· Practical implementation: the caret package
What this course depends on:
· The Data Scientist's Toolbox
· R Programming
What would be useful:
· Exploratory analysis
· Reporting Data and Reproducible Research
· Regression models
Uses of machine learning
· Local governments > pension payments
· Google > whether you will click on an ad
· Amazon > what movies you will watch
· Insurance companies > what your risk of death is
· Johns Hopkins > who will succeed in their programs
Recommended books and resources
· The elements of statistical learning
· Machine learning (more advanced material)
· List of machine learning resources on Quora
· List of machine learning resources from Science
· Advanced notes from MIT open courseware
· Advanced notes from CMU
· Kaggle machine learning competitions
1.2 What is prediction
The central dogma of prediction
The task: predict for each of these dots whether it is red or blue.
Choosing the right dataset and knowing what the specific question is are again paramount.
What can go wrong
An example: the Google Flu Trends algorithm did not account for the fact that the search terms people use would change over time. As people searched with different terms, the algorithm's performance suffered. In addition, the way those terms were actually used in the algorithm was not well understood, so when the function of a particular search term changed, it could cause problems.
Components of a predictor
question -> input data -> features -> algorithm -> parameters -> evaluation
Note: the question step asks what you are trying to predict and what you are trying to predict it with.
A prediction example: SPAM
question -> input data -> features -> algorithm -> parameters -> evaluation
Start with a general question
Can I automatically detect emails that are SPAM versus those that are not?
Make it concrete
Can I use quantitative characteristics of the emails to classify them as SPAM/HAM?
Note: try to make the question as concrete as possible.
Input data: the spam dataset from the kernlab package, documented at rss.acs.unt.edu/Rdoc/library/kernlab/html/spam.html
Features: load the data and look at the first rows.
library(kernlab)  # provides the spam dataset
data(spam)        # load the dataset into the workspace
head(spam)        # show the first rows and the feature columns
Our simple algorithm
Find a value C. If the frequency of ‘your’ > C, predict "spam".
Note: a reasonable cutoff here is C = 0.5. If the frequency of ‘your’ is above 0.5 we say the email is SPAM; otherwise we say it is HAM.
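The cutoff rule above can be sketched in R as follows. This is a minimal sketch, not the course's own code: the vectors your_freq and type are toy values invented for illustration, and predict_by_cutoff is a hypothetical helper name. The last lines show one simple way to evaluate the rule, by cross-tabulating predictions against the labels and computing the fraction classified correctly.

```r
# Toy data, invented for illustration only: the frequency of "your"
# in eight messages, with a hand-assigned label for each message.
your_freq <- c(0.00, 0.10, 0.30, 0.45, 0.60, 0.90, 1.20, 2.00)
type      <- c("nonspam", "nonspam", "nonspam", "spam",
               "nonspam", "spam", "spam", "spam")

# Hypothetical helper implementing the rule from the notes:
# predict "spam" when the frequency exceeds the cutoff C.
predict_by_cutoff <- function(freq, cutoff = 0.5) {
  ifelse(freq > cutoff, "spam", "nonspam")
}

prediction <- predict_by_cutoff(your_freq, cutoff = 0.5)

# Evaluation: cross-tabulate predictions against the labels,
# then compute the fraction of messages classified correctly.
confusion <- table(prediction, type)
accuracy  <- mean(prediction == type)

print(confusion)
print(accuracy)
```

With the kernlab data loaded as in the notes, the same helper could presumably be applied as predict_by_cutoff(spam$your) and compared against spam$type, which labels HAM as nonspam.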
1.3 Relative importance of steps
This section is about the tradeoffs among the different components of building a machine learning algorithm.
Relative order of importance: question > data > features > algorithms
…
Creating features is then an important component: if you don't compress the data in the right way, you might lose all of the relevant and valuable information. Finally, in my experience the algorithm is often the least important part of building a machine learning algorithm. It can still be very important depending on the exact modality of the data you are using. For example, image data and voice data can require certain kinds of prediction algorithms that might not necessarily be as …
An important point
"The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data." (John Tukey)