Ensemble methods: bagging, boosting and stacking

1. What are ensemble methods?

“Three cobblers with their wits combined equal Zhuge Liang”; “Unity is strength”.

1) what

Ensemble learning is a machine learning paradigm where multiple models (often called “weak learners”) are trained to solve the same problem and combined to get better results.

2) why

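As the bias-variance figure above shows, bias and variance usually move in opposite directions, while we want both low bias and low variance. Hence the idea of ensembling models that each have either high bias or high variance to obtain a stronger model.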

3) how

We need to choose how to aggregate the base models, namely:

If we choose base models with low bias but high variance, it should be with an aggregating method that tends to reduce variance whereas if we choose base models with low variance but high bias, it should be with an aggregating method that tends to reduce bias.

2. Focus on bagging

Bagging stands for “bootstrap aggregating”.

1) Bootstrapping

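Bootstrapping is sampling with replacement: new samples are redrawn from a small observed sample and used to estimate properties of the underlying population. It relies on two assumptions: 1) representativity: the observed sample is representative of the true distribution; 2) independence: the observations are independent of one another.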
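
A minimal sketch of bootstrapping, using only the Python standard library. The data values and the helper name `bootstrap_means` are made up for illustration; the point is only that each resample has the same size as the original sample and is drawn with replacement.

```python
import random
import statistics

def bootstrap_means(data, n_resamples=1000, seed=0):
    """Return the mean of each bootstrap resample of `data`."""
    rng = random.Random(seed)
    n = len(data)
    means = []
    for _ in range(n_resamples):
        # Draw n observations WITH replacement from the original sample.
        resample = rng.choices(data, k=n)
        means.append(statistics.fmean(resample))
    return means

sample = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8, 4.7]  # made-up observations
means = bootstrap_means(sample)

# The spread of the bootstrap means is an estimate of the standard error
# of the sample mean.
print(round(statistics.stdev(means), 3))
```

The estimate is only as good as the two assumptions above: if the original sample is not representative or its observations are dependent, the resamples inherit the problem.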

2) Bagging

Core idea: if we fit independent models on independent, identically distributed datasets and average their predictions, we obtain a model with lower variance. Principle: see "Variance of an average of random variables".

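The proof goes as follows:

Let $X$ be a random variable with variance $\sigma^2$. Then the variance of $aX$ is:

$$\mathrm{Var}(aX) = a^2\,\mathrm{Var}(X) = a^2\sigma^2$$

Now average $X_1, \dots, X_n$ and compute the variance of $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$. If the variables are independent and share the same variance $\sigma^2$, then:

$$\mathrm{Var}(\bar{X}) = \frac{1}{n^2}\sum_{i=1}^{n}\mathrm{Var}(X_i) = \frac{\sigma^2}{n}$$

Therefore, as long as $n > 1$, the variance decreases.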
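
The variance reduction can be checked numerically. The sketch below is an illustration, not part of the proof: it assumes Gaussian variables with standard deviation `sigma = 2.0` (an arbitrary choice), averages `n` independent draws, and compares the empirical variance of that average with sigma squared divided by n.

```python
import random
import statistics

def variance_of_mean(n, sigma=2.0, trials=5000, seed=0):
    """Empirical variance of the average of n independent N(0, sigma^2) draws."""
    rng = random.Random(seed)
    means = [
        statistics.fmean(rng.gauss(0.0, sigma) for _ in range(n))
        for _ in range(trials)
    ]
    return statistics.pvariance(means)

for n in (1, 5, 25):
    # Empirical value next to the theoretical sigma^2 / n.
    print(n, round(variance_of_mean(n), 3), "theory:", 2.0 ** 2 / n)
```

In bagging the base models are trained on overlapping bootstrap samples, so they are not fully independent and the reduction is smaller than this idealized case suggests.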

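Specific averaging methods: for regression, average the outputs of the base models; for classification, either take a majority vote over the predicted classes (hard voting) or average the predicted class probabilities and pick the most probable class (soft voting).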
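
The usual ways of averaging base-model outputs can be sketched in a few lines. The function names and the toy predictions are hypothetical; the three functions cover averaging for regression, majority (hard) voting, and probability (soft) voting for classification.

```python
from collections import Counter

def average_regression(predictions):
    """Regression: plain average of the base models' outputs."""
    return sum(predictions) / len(predictions)

def hard_vote(labels):
    """Classification: the class predicted by most base models."""
    return Counter(labels).most_common(1)[0][0]

def soft_vote(probabilities):
    """Classification: average the class probabilities, then take the argmax.

    `probabilities` holds one {class: probability} dict per base model.
    """
    classes = probabilities[0].keys()
    avg = {c: sum(p[c] for p in probabilities) / len(probabilities) for c in classes}
    return max(avg, key=avg.get)

print(average_regression([2.0, 3.0, 4.0]))   # 3.0
print(hard_vote(["cat", "dog", "cat"]))      # cat
print(soft_vote([
    {"cat": 0.9, "dog": 0.1},
    {"cat": 0.4, "dog": 0.6},
    {"cat": 0.4, "dog": 0.6},
]))                                          # cat
```

In the last call, two of the three models lean towards "dog", so a hard vote on their individual decisions would return "dog"; soft voting returns "cat" because the first model is much more confident.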

3) random forests

Forests are composed of multiple trees, which can be chosen to be either shallow or deep. Shallow trees have lower variance but higher bias. Deep trees have low bias but high variance, which makes them a good choice for bagging.

random forests tricks:

i. sampling over the observations in the dataset to generate a bootstrap sample

ii. sample over features and keep only a random subset of them to build the tree. Besides reducing the correlation between the trees fitted on the different samples, this also makes the model more robust.
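
The two sampling tricks can be sketched as follows. This only shows how the inputs of each tree are drawn, not the tree fitting itself; `draw_tree_inputs` and its parameters are hypothetical names. It follows the per-tree feature sampling described above, whereas many implementations redraw the feature subset at every split.

```python
import random

def draw_tree_inputs(n_rows, n_features, max_features, rng):
    """Pick the rows and features one tree of the forest is trained on."""
    # Trick i: bootstrap sample of the observations (with replacement).
    rows = [rng.randrange(n_rows) for _ in range(n_rows)]
    # Trick ii: random subset of the features (without replacement).
    features = sorted(rng.sample(range(n_features), max_features))
    return rows, features

rng = random.Random(0)
for tree_id in range(3):
    rows, features = draw_tree_inputs(n_rows=8, n_features=6, max_features=2, rng=rng)
    print(tree_id, rows, features)
```

Because each tree sees different rows and different features, the trees make less correlated errors, which is what averaging needs in order to reduce variance.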

3. Focus on boosting (sequential)

The idea is to fit models iteratively, such that the training of the model at a given step depends on the models fitted at the previous steps.

1) Boosting

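Each model in the sequence is fitted giving more importance to observations in the dataset that were badly handled by the previous models in the sequence. Learning is progressive: each step concentrates on the harder part of the problem. Because boosting mainly aims at reducing bias, the weak learners are usually shallow trees when trees are used as base models. We still need to define how to pick out the harder part and how to aggregate the models, which gives rise to two algorithms: AdaBoost and gradient boosting.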

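2) Adaptive boosting (AdaBoost)

The core idea: train a weak classifier, then repeatedly change the probability distribution over the data so that misclassified samples get larger weights and correctly classified samples get smaller weights, and train further weak classifiers one after another. The weak classifiers are then combined by a weighted sum, where each weight depends on the classifier's classification error rate: the lower the error, the higher the weight.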

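The algorithm in detail, for labels $y_i \in \{-1, 1\}$ and $N$ training samples:

1. Initialize the sample weights $w_{1,i} = 1/N$.

2. For $m = 1, \dots, M$: fit a weak classifier $b_m$ that minimizes the weighted error; compute its weighted error rate $e_m = \sum_i w_{m,i} I(y_i \neq b_m(x_i)) / \sum_i w_{m,i}$; set its coefficient $\beta_m = \frac{1}{2}\ln\frac{1-e_m}{e_m}$; update the weights $w_{m+1,i} = w_{m,i}\exp(-\beta_m y_i b_m(x_i))$ and renormalize them so that they sum to 1.

3. Output the final classifier $\mathrm{sign}\big(\sum_{m=1}^{M}\beta_m b_m(x)\big)$.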

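Derivation of the algorithm:

AdaBoost is a linear combination of weak learners, that is:

$$f(x) = \sum_{m=1}^{M} \beta_m b_m(x)$$

where $b_m(x)$ is a weak learner and $\beta_m$ is its coefficient.

We want to minimize the loss over the $N$ training samples, that is:

$$\min_{\beta_m, b_m} \sum_{i=1}^{N} L\big(y_i, f(x_i)\big)$$

or:

$$\min_{\beta_m, b_m} \sum_{i=1}^{N} L\Big(y_i, \sum_{m=1}^{M} \beta_m b_m(x_i)\Big)$$

The loss used by AdaBoost is the exponential loss:

$$L\big(y, f(x)\big) = \exp\big(-y f(x)\big)$$

Optimizing all weak learners at the same time is very hard, so the forward stagewise algorithm is used. With $f_{m-1}$ fixed, the loss at step $m$ can be written as:

$$(\beta_m, b_m) = \arg\min_{\beta, b} \sum_{i=1}^{N} \exp\Big(-y_i\big(f_{m-1}(x_i) + \beta\, b(x_i)\big)\Big)$$

Let $w_{m,i} = \exp\big(-y_i f_{m-1}(x_i)\big)$. It does not depend on $\beta$ or $b$, and can be seen as the weight given to each sample before training step $m$. The loss can therefore be rewritten as:

$$\sum_{i=1}^{N} w_{m,i} \exp\big(-\beta\, y_i\, b(x_i)\big)$$

(Note: the output of $b$ is either 1 or -1, and the labels are binary as well, $y_i \in \{-1, 1\}$.)

For the base learner $b$: the loss equals $e^{-\beta}\sum_i w_{m,i} + (e^{\beta} - e^{-\beta})\sum_i w_{m,i} I\big(y_i \neq b(x_i)\big)$, so if $\beta > 0$, minimizing the loss amounts to:

$$b_m = \arg\min_{b} \sum_{i=1}^{N} w_{m,i}\, I\big(y_i \neq b(x_i)\big)$$

For the coefficient $\beta$ of the learner:

$$\sum_{i=1}^{N} w_{m,i} \exp\big(-\beta\, y_i\, b_m(x_i)\big) = e^{-\beta} \sum_{y_i = b_m(x_i)} w_{m,i} + e^{\beta} \sum_{y_i \neq b_m(x_i)} w_{m,i}$$

Differentiating with respect to $\beta$ and setting the derivative to 0 gives:

$$-e^{-\beta} \sum_{y_i = b_m(x_i)} w_{m,i} + e^{\beta} \sum_{y_i \neq b_m(x_i)} w_{m,i} = 0$$

So:

$$\beta_m = \frac{1}{2} \ln \frac{\sum_{y_i = b_m(x_i)} w_{m,i}}{\sum_{y_i \neq b_m(x_i)} w_{m,i}}$$

Define the weighted error rate as:

$$e_m = \frac{\sum_{i=1}^{N} w_{m,i}\, I\big(y_i \neq b_m(x_i)\big)}{\sum_{i=1}^{N} w_{m,i}}$$

So:

$$\beta_m = \frac{1}{2} \ln \frac{1 - e_m}{e_m}$$

This expression shows that the lower the classification error rate, the larger the weight in the linear combination.

For the sample weights of the next round:

$$w_{m+1,i} = \exp\big(-y_i f_m(x_i)\big) = w_{m,i} \exp\big(-\beta_m\, y_i\, b_m(x_i)\big)$$

This expression shows that if $\beta_m > 0$, the weights of misclassified samples grow and the weights of correctly classified samples shrink.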
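
A minimal AdaBoost sketch in pure Python, assuming labels in {-1, 1}, one-dimensional inputs and decision stumps as weak learners (all of these are illustration choices, as are the function names and the toy data). Each round picks the stump with the lowest weighted error e, sets its coefficient to 0.5 * ln((1 - e) / e), and multiplies each sample weight by exp(-beta * y * b(x)) before renormalizing.

```python
import math

def stump_predict(x, threshold, polarity):
    """Decision stump: `polarity` if x >= threshold, otherwise -polarity."""
    return polarity if x >= threshold else -polarity

def fit_stump(xs, ys, weights):
    """Return (weighted_error, threshold, polarity) of the best stump."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            error = sum(
                w for x, y, w in zip(xs, ys, weights)
                if stump_predict(x, threshold, polarity) != y
            )
            if best is None or error < best[0]:
                best = (error, threshold, polarity)
    return best

def adaboost(xs, ys, n_rounds):
    n = len(xs)
    weights = [1.0 / n] * n          # uniform initial sample weights
    model = []                       # list of (beta, threshold, polarity)
    for _ in range(n_rounds):
        error, threshold, polarity = fit_stump(xs, ys, weights)
        error = min(max(error, 1e-10), 1 - 1e-10)   # guard against log(0)
        beta = 0.5 * math.log((1 - error) / error)
        model.append((beta, threshold, polarity))
        # Misclassified samples (y * b(x) = -1) are scaled up, others down.
        weights = [
            w * math.exp(-beta * y * stump_predict(x, threshold, polarity))
            for x, y, w in zip(xs, ys, weights)
        ]
        total = sum(weights)
        weights = [w / total for w in weights]      # renormalize to sum to 1
    return model

def predict(model, x):
    score = sum(beta * stump_predict(x, t, s) for beta, t, s in model)
    return 1 if score >= 0 else -1

xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]            # toy data
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
model = adaboost(xs, ys, n_rounds=3)
print([predict(model, x) for x in xs])
```

No single stump can separate this toy data, but the weighted sum of three stumps classifies every training point correctly.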

3) gradient boosting

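The basic idea: generate several weak learners sequentially, where each weak learner aims to fit the negative gradient of the loss of the model accumulated so far, so that adding it moves the loss of the accumulated model in the direction of the negative gradient.

A simple example: suppose a sample has a true value of 10 and the first weak learner predicts 7, so the residual is 10 - 7 = 3. The residual 3 becomes the fitting target of the next learner. If the second weak learner predicts 2, the boosting model made of these two weak learners predicts 7 + 2 = 9 for the sample. More weak learners can be added in the same way to improve performance.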
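
The arithmetic of this example can be traced step by step. The learner outputs 7 and 2 are taken from the example itself; nothing is actually trained here.

```python
target = 10.0
learner_outputs = [7.0, 2.0]   # outputs of the first and second weak learner

prediction = 0.0
for step, output in enumerate(learner_outputs, start=1):
    fitting_target = target - prediction   # what this learner tries to fit
    prediction += output                    # add the learner to the ensemble
    remaining = target - prediction         # residual left for the next learner
    print(step, fitting_target, prediction, remaining)
# Step 1 fits 10 and leaves a residual of 3; step 2 fits 3 and leaves 1.
```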

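Gradient descent is a classic numerical optimization method. Its parameter update is:

$$\theta_t = \theta_{t-1} - \eta\, \nabla_\theta L(\theta_{t-1})$$

where $\eta$ is the step size.

Gradient boosting uses the same additive model as AdaBoost. At iteration $m$, the first $m-1$ base learners are fixed, so:

$$f_m(x) = f_{m-1}(x) + \beta_m b_m(x)$$

At step $m$, to minimize the loss, we differentiate it with respect to $f_{m-1}(x)$ and take a gradient descent step:

$$f_m(x) = f_{m-1}(x) - \beta_m \frac{\partial L\big(y, f_{m-1}(x)\big)}{\partial f_{m-1}(x)}$$

Comparing the two expressions, we can set $b_m(x) = -\dfrac{\partial L\big(y, f_{m-1}(x)\big)}{\partial f_{m-1}(x)}$, that is, the base learner fits the negative gradient of the loss of the previous round's model. The algorithm proceeds as follows:

1. Initialize $f_0(x) = \arg\min_c \sum_{i=1}^{N} L(y_i, c)$.

2. For $m = 1, \dots, M$: compute the negative gradients $r_{m,i} = -\dfrac{\partial L\big(y_i, f_{m-1}(x_i)\big)}{\partial f_{m-1}(x_i)}$; fit a base learner $b_m$ to the targets $r_{m,i}$; choose the step $\beta_m$ (for example by line search on the loss); update $f_m(x) = f_{m-1}(x) + \beta_m b_m(x)$.

3. Output $f_M(x)$.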
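
A minimal gradient boosting sketch for regression, under assumptions that are not in the text: squared loss 0.5 * (y - f)^2 (whose negative gradient is simply the residual y - f), one-dimensional inputs, regression stumps as base learners, and a fixed `learning_rate` in place of a per-step line search. All names and data are made up.

```python
def fit_regression_stump(xs, targets):
    """Best single-split stump under squared error: (threshold, left, right)."""
    best = None
    for threshold in sorted(set(xs))[1:]:      # keeps both sides non-empty
        left = [t for x, t in zip(xs, targets) if x < threshold]
        right = [t for x, t in zip(xs, targets) if x >= threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((t - left_mean) ** 2 for t in left)
               + sum((t - right_mean) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]

def stump_value(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x < threshold else right_mean

def gradient_boost(xs, ys, n_rounds, learning_rate=0.5):
    f0 = sum(ys) / len(ys)                     # best constant for squared loss
    preds = [f0] * len(xs)
    stumps = []
    for _ in range(n_rounds):
        # Negative gradient of the squared loss = current residuals.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_regression_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump_value(stump, x)
                 for x, p in zip(xs, preds)]
    return f0, learning_rate, stumps

def predict(model, x):
    f0, learning_rate, stumps = model
    return f0 + learning_rate * sum(stump_value(s, x) for s in stumps)

def training_mse(model, xs, ys):
    return sum((y - predict(model, x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs = [1, 2, 3, 4, 5, 6, 7, 8]                       # made-up data
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 2.8, 5.2, 5.0]
for n_rounds in (0, 1, 5, 20):
    model = gradient_boost(xs, ys, n_rounds)
    print(n_rounds, round(training_mse(model, xs, ys), 4))
```

The training error shrinks as rounds are added. With a different loss, only the line that computes `residuals` would change, since it is the one place where the negative gradient enters.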

4. Overview of stacking

1) Differences from bagging and boosting:

Stacking uses heterogeneous weak learners as base learners, whereas bagging and boosting use homogeneous weak learners.

Stacking uses a meta-model to combine the weak models, whereas bagging and boosting combine them following deterministic algorithms.

2) stacking

The idea of stacking is to learn several different weak learners and combine them by training a meta-model to output predictions based on the multiple predictions returned by these weak models.

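Procedure:

1. Split the training data into two folds.

2. Choose several weak learners and fit them on the first fold.

3. For each weak learner, make predictions for the observations in the second fold.

4. Fit the meta-model on the second fold, using the predictions made by the weak learners as inputs.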
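
A minimal stacking sketch for one-dimensional regression, assuming a two-fold split: the weak learners are fitted on one fold and the meta-model on their predictions for the other. Everything here is an illustration choice: the two weak learners (a least-squares line and a 1-nearest-neighbour regressor), the deliberately simple meta-model (a single weight w so that the output is w * line + (1 - w) * nearest, fitted by least squares), and the data.

```python
def fit_line(xs, ys):
    """Weak learner A: ordinary least-squares line."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return lambda x: my + slope * (x - mx)

def fit_nearest(xs, ys):
    """Weak learner B: 1-nearest-neighbour regressor."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]

def fit_stacking(xs, ys):
    # Split the training data into two folds (even / odd positions).
    fold1_x, fold1_y = xs[0::2], ys[0::2]
    fold2_x, fold2_y = xs[1::2], ys[1::2]
    # Fit the heterogeneous weak learners on the first fold.
    learners = [fit_line(fold1_x, fold1_y), fit_nearest(fold1_x, fold1_y)]
    # Predict the second fold with each weak learner.
    p1 = [learners[0](x) for x in fold2_x]
    p2 = [learners[1](x) for x in fold2_x]
    # Fit the meta-model on those predictions: closed-form least squares
    # for w in  y ~ w * p1 + (1 - w) * p2.
    num = sum((y - b) * (a - b) for y, a, b in zip(fold2_y, p1, p2))
    den = sum((a - b) ** 2 for a, b in zip(p1, p2))
    w = num / den if den else 0.5
    stacked = lambda x: w * learners[0](x) + (1 - w) * learners[1](x)
    return w, learners, stacked

def sse(model, xs, ys):
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys))

xs = [float(i) for i in range(12)]     # made-up data
ys = [0.1 * x * x for x in xs]
w, learners, stacked = fit_stacking(xs, ys)
hold_x, hold_y = xs[1::2], ys[1::2]
print("w =", round(w, 3))
for name, model in (("line", learners[0]), ("nearest", learners[1]), ("stacked", stacked)):
    print(name, round(sse(model, hold_x, hold_y), 4))
```

On the second fold the stacked model cannot do worse than either weak learner, because w = 1 and w = 0 reproduce them exactly. That fold was used to fit the meta-model, so this is not an unbiased estimate of generalization error.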

That's a wrap~~ 🌻🌻