Rate this post

Techniques that create multiple models and combine them to produce improved results are called ensemble methods. A single model can produce less accurate solutions than ensemble methods. In a number of machine learning competitions, the winning solutions used ensemble methods. The winner of the popular competition used an ensemble method. The winner of KDD 2009 used ensemble methods. You can find winners who used these methods in Kaggle, for example here is the interview with the winner of CrowdFlower.

We need to understand a few terminologies before we start. I used the term model throughout the article to describe the output of the training program. This model is used to make predictions. Any machine learning algorithm can be used. Base models are used as inputs of ensemble methods.

Voting, stacking, bagging and boosting are some of the methods of ensemble that I will describe in this post.

Voting and averaging methods are used.

The easiest methods to use are voting and averaging. They are easy to implement. Voting and averaging are used for regression.

Voting and Averaging Based Ensemble Methods

The first step in both methods is to create multiple classification/regression models. Different splits of the same training dataset can be used to create different base models. The use of the same training dataset is shown in the following pseudocode.

train = load_csv(“train.csv”)
target = train[“target”]
train = train.drop(“target”)
test = load_csv(“test.csv”)

algorithms = [logistic_regression, decision_tree_classification, …] #for classification
algorithms = [linear_regression, decision_tree_regressor, …] #for regression

predictions = matrix(row_length=len(target), column_length=len(algorithms))

for i,algorithm in enumerate(algorithms):
predictions[,i] = algorithm.fit(train, target).predict(test)

We created predictions for each model and saved them in a matrix where each column contained predictions from one model, according to the above pseudocode.

Most of the time.

The final output prediction is the one that gets more than half of the votes. The ensemble method could not make a stable prediction if none of the predictions get more than half of the votes. The final prediction may be the most voted prediction, even if that is less than half of the votes. This method is referred to as “ plurality voting ” in some articles.

Weighted voting.

Unlike majority voting, where each model has the same rights, we can increase the importance of one or more models. Weighted voting counts the prediction of the better models multiple times. It ‘s up to you to find a reasonable set of weights.

Averaging is simple.

The average predictions are calculated in the simple averaging method. A smoother regression model is created by this method. This simple averaging method is shown in the following pseudocode code.

final_predictions = []
for row_number in len(predictions):
mean(prediction[row_number, ])

Weighted Averaging is done.

Weighted averaging is a slightly modified version of simple averaging, where the prediction of each model is multiplied by the weight and then their average is calculated. The weighted averaging is shown in the following pseudocode code.

weights = […, …, …] #length is equal to len(algorithms)
final_predictions = []
for row_number in len(predictions):
mean(prediction[row_number, ]*weights)

There are multiple machine learning models.

Stacking, also known as stacked generalization, is an ensemble method where the models are combined. The basic idea is to train machine learning models with a training dataset. This new dataset is used for machine learning.

The pseudocode of a stacking procedure is summarized.

base_algorithms = [logistic_regression, decision_tree_classification, …] #for classification

stacking_train_dataset = matrix(row_length=len(target), column_length=len(algorithms))
stacking_test_dataset = matrix(row_length=len(test), column_length=len(algorithms))

for i,base_algorithm in enumerate(base_algorithms):
stacking_train_dataset[,i] = base_algorithm.fit(train, target).predict(train)
stacking_test_dataset[,i] = base_algorithm.predict(test)

final_predictions = combiner_algorithm.fit(stacking_train_dataset, target).predict(stacking_test_dataset)

In the above pseudocode, you can see that the training dataset for combiner is generated using the outputs of the base algorithms. The same dataset is used again and again to make predictions in the pseudocode. In the real world, we do n’t use the same training dataset for prediction, so you may see some implementations of stacking where training dataset is splitted. There is a pseudocode where the training dataset is split.

base_algorithms = [logistic_regression, decision_tree_classification, …] #for classification

stacking_train_dataset = matrix(row_length=len(target), column_length=len(algorithms))
stacking_test_dataset = matrix(row_length=len(test), column_length=len(algorithms))

for i,base_algorithm in enumerate(base_algorithms):
for trainix, testix in split(train, k=10): #you may use sklearn.cross_validation.KFold of sklearn library
stacking_train_dataset[testcv,i] = base_algorithm.fit(train[trainix], target[trainix]).predict(train[testix])
stacking_test_dataset[,i] = base_algorithm.fit(train).predict(test)

final_predictions = combiner_algorithm.fit(stacking_train_dataset, target).predict(stacking_test_dataset)

Aggregating using the bootstrap.

The key elements of this strategy are summarized in the name Bootstrap Aggregating. Creating multiple models is the first step in bagging. The models are generated using the same method with random sub-samples of the dataset which are drawn from the original dataset. Some original examples are not present in the sample. You should choose a random element from the original dataset if you want to create a sub-dataset with m elements. You follow the steps if the goal is generating a dataset.

Bootstrap Aggregating

The number of elements in each dataset is what we have at the end. bootstrap sampling is shown in the following Python-esque pseudocode.

def bootstrap_sample(original_dataset, m):
sub_dataset = []
for i in range(m):
return sub_dataset

Aggregating the generated models is the second step in bagging. Voting and averaging are used for this purpose.

The pseudocode looks like this :

def bagging(n, m, base_algorithm, train_dataset, target, test_dataset):
predictions = matrix(row_length=len(target), column_length=n)
for i in range(n):
sub_dataset = bootstrap_sample(train_dataset, m)
predictions[,i] = base_algorithm.fit(original_dataset, target).predict(test_dataset)

final_predictions = voting(predictions) # for classification
final_predictions = averaging(predictions) # for regression

return final_predictions

Each sub-sample can be generated on its own. Generation and training can be done in parallel.

Implementation of the bagging strategy can be found. Random Forest uses a different bagging technique. Random Forest has a decision tree algorithm that uses random feature selection.

Weak models can be converted to strong ones.

The term “ boosting ” is used to describe a family of algorithms that can convert weak models to strong models. The model is weak if it has a large error rate and the performance is not random. 5 for classification. When the weights of instances are adjusted according to the error of the last prediction, boosting builds an ensemble by training each model with the same dataset. The models have to focus on instances which are hard. You ca n’t use parallel operations because boosting is a sequential method.

The boosting algorithm has a general procedure.

def adjust_dataset(_train, errors):
#create a new dataset by using the hardest instances
ix = get_highest_errors_index(train)
return concat(_train[ix], random_select(train))

models = []
_train = random_select(train)
for i in range(n): #n rounds
model = base_algorithm.fit(_train)
predictions = model.predict(_train)
errors = calculate_error(predictions)
_train = adjust_dataset(_train, errors)

final_predictions = combine(models, test)

A new dataset containing the hardest instances can be returned by the adjust_dataset function.

A boosting method called adaboost is widely known. They won the Gdel Prize for their work. Most of the time, decision tree is the default base algorithm for Adaboost. The same method applies to Adaboost. The information gathered at each step of the AdaBoost is fed into the model. Weighted voting is used to calculate thecombining models step, which is different from theadjusting dataset step.


In industries where interpretability is more important, ensemble methods are often not preferred. The benefits of these methods can be tremendous. Even the smallest improvement in the accuracy of machine learning is valuable in healthcare.

Source: https://nhadep247.net
Category: Machine