Machine Learning – Andrew Ng Course Study Notes
Multivariate Linear Regression
(linear regression with multiple variables, i.e. multiple features)
Multiple Features (Variables)
{x with superscript i denotes the i-th training example; x with subscript i denotes the i-th value within a particular training example}
The hypothesis for linear regression with multiple features (variables)
An additional zeroth feature x0 is added (for convenience of notation)
For every example i there is a feature vector x^(i), and x^(i)_0 is set equal to 1.
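As a concrete illustration (a minimal NumPy sketch, not code from the course), the x0 = 1 convention amounts to prepending a column of ones, after which the hypothesis is just a matrix-vector product:

```python
import numpy as np

def add_intercept(X):
    """Prepend the constant feature x0 = 1 to every training example (row)."""
    return np.hstack([np.ones((X.shape[0], 1)), X])

def hypothesis(theta, X):
    """h_theta(x) = theta^T x, evaluated for all examples at once."""
    return X @ theta

# Two hypothetical training examples with two features each (e.g. size, #bedrooms).
X = np.array([[2104.0, 3.0],
              [1416.0, 2.0]])
Xb = add_intercept(X)                 # each row now starts with x0 = 1
theta = np.array([1.0, 0.1, 10.0])    # illustrative parameter values
preds = hypothesis(theta, Xb)
```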
Gradient Descent for Multiple Variables
Model representation
Solve for the parameters θ by using the gradient descent algorithm to minimize the cost function
{Left: the gradient descent algorithm for solving the parameters of single-variable linear regression;
right: the corresponding algorithm for multivariate linear regression}
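The multivariate update rule can be sketched in vectorized NumPy (an illustrative implementation, assuming X already contains the x0 = 1 column):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, iters=5000):
    """Batch gradient descent for multivariate linear regression.
    Simultaneously updates every parameter:
        theta_j := theta_j - alpha * (1/m) * sum_i (h(x^(i)) - y^(i)) * x_j^(i)
    X is assumed to already include the x0 = 1 column."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        error = X @ theta - y               # h_theta(x^(i)) - y^(i) for all i
        theta -= alpha * (X.T @ error) / m  # vectorized simultaneous update
    return theta

# Toy data set generated from y = 1 + 2x.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
theta = gradient_descent(X, y)
```

The key point is that `theta` is updated in one vectorized step, which keeps the update simultaneous across all θ_j, exactly as the algorithm requires.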
Gradient Descent in Practice I – Feature Scaling
{Feature scaling: getting the features onto similar ranges of values as each other. This makes gradient descent work well: it runs much faster and converges in far fewer iterations.}
Why:
If you make sure the features are on a similar scale, gradient descent can converge more quickly (right figure below). Without feature scaling, gradient descent may oscillate back and forth and take a long time before it finally finds its way to the global minimum (left figure below).
How to do feature scaling?
1. Divide by the maximum value (max) or by the range (max − min)
Sometimes feature scaling is performed simply by dividing by the maximum value. {Divide each feature by its max so that the values fall roughly into [−1, 1]; anything in that neighborhood is fine.}
If you end up having a different feature that winds up being between −2 and +0.5, this is close enough to minus one and plus one, and that's fine. {x1, x2, x3 need not all lie exactly in [−1, 1]; being reasonably close is enough}
If you have a different feature, say x3, that ranges over [−100, +100], or x4 takes on values between [−0.0001, +0.0001], these are very different from minus 1 and plus 1, so such a feature is poorly scaled. {The ranges must not differ too much.} A good rule of thumb for a feature's range: values in roughly [−3, +3] should be just fine.
2. Mean normalization
That is, x_i := (x_i − μ_i) / s_i, where μ_i is the average value of feature i over the training set and s_i is its range (max − min) or, alternatively, its standard deviation.
Note: x1 or x2 can actually come out slightly larger than 0.5, but that is close enough; any values that get the features into anything close to these sorts of ranges will do fine. In short: feature scaling does not have to be too exact, only good enough to get gradient descent to run quite a lot faster.
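Both techniques can be sketched as small hypothetical helpers (the data below is illustrative, not from the notes' figures):

```python
import numpy as np

def scale_by_max(X):
    """Technique 1: divide each feature (column) by its maximum absolute value."""
    return X / np.abs(X).max(axis=0)

def mean_normalize(X):
    """Technique 2 (mean normalization): x_j := (x_j - mu_j) / s_j,
    where mu_j is the column mean and s_j the column range (max - min)."""
    mu = X.mean(axis=0)
    s = X.max(axis=0) - X.min(axis=0)
    return (X - mu) / s

# House sizes in [800, 2104] and bedroom counts in [1, 3] end up on
# comparable scales after either transformation.
X = np.array([[2104.0, 3.0], [1416.0, 2.0], [800.0, 1.0]])
Xs = scale_by_max(X)
Xn = mean_normalize(X)
```

On this data the first normalized column's maximum comes out just above 0.5, which, as noted above, is close enough.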
Gradient Descent in Practice II – Learning Rate α
How to make sure gradient descent is working correctly, i.e. how to tell that the iterations have converged:
1. Plotting (left figure): plot the cost J(θ) against the number of iterations; by looking at this plot you can usually tell whether gradient descent has converged.
2. Automatic convergence test (right figure): for example, declare convergence if J(θ) decreases by less than some small threshold ε in a single iteration.
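The two checks can be sketched together (an illustrative helper, not course code): record J(θ) every iteration so it can be plotted, and stop early once the per-step decrease falls below ε:

```python
import numpy as np

def gradient_descent_monitored(X, y, alpha=0.1, iters=1000, eps=1e-9):
    """Gradient descent that records the cost J(theta) at every iteration.
    The returned history can be plotted against the iteration number
    (check 1); training stops early once J decreases by less than eps
    in a single iteration (check 2, the automatic convergence test)."""
    m, n = X.shape
    theta = np.zeros(n)
    history = []
    for _ in range(iters):
        error = X @ theta - y
        J = (error @ error) / (2 * m)      # squared-error cost
        if history and history[-1] - J < eps:
            break                          # converged: J barely decreased
        history.append(J)
        theta -= alpha * (X.T @ error) / m
    return theta, history

# Toy data generated from y = x; J should decrease every iteration.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 1.0, 2.0])
theta, history = gradient_descent_monitored(X, y)
```

If α is well chosen, the history curve decreases on every iteration; a curve that rises or oscillates signals that α is too large.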
Fear not that the road ahead holds no friend; who in all the world does not know you?