Overfitting and Regularization

Updated March 24, 2023

Machine learning models are trained to make accurate predictions on new data. Sometimes, however, a model becomes too complex and fits the training data too closely. This phenomenon is known as overfitting, and it leads to poor performance on new data. Regularization is a family of techniques used to prevent overfitting and improve the generalization performance of machine learning models. In this article, we’ll explore overfitting, regularization, and three common regularization techniques: L1 regularization, L2 regularization, and dropout.

What is Overfitting?

Overfitting occurs when a model learns the noise in the training data instead of the underlying patterns. This results in a model that performs well on the training data but poorly on new data.

Overfitting can be caused by several factors, including using too many features, having too few training examples, and training for too long. The complexity of the model is also a significant factor in overfitting. A more complex model can fit the training data better but may not generalize well.

To understand overfitting better, let’s take an example of fitting a polynomial curve to a dataset. In the following code example, we generate a dataset of 20 points and fit a polynomial curve of degree 19 to it.

import numpy as np
import matplotlib.pyplot as plt

# Generate data
np.random.seed(42)
X = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * X) + np.random.normal(scale=0.1, size=20)

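# Fit a polynomial curve (degree 19 on 20 points is ill-conditioned,
# so NumPy may emit a RankWarning)
p = np.polyfit(X, y, deg=19)

# Evaluate on a dense grid so the behavior between training points is visible
X_dense = np.linspace(0, 1, 200)
y_dense = np.polyval(p, X_dense)

# Plot the data, the fitted curve, and the noise-free function
plt.scatter(X, y, label="training data")
plt.plot(X_dense, y_dense, label="degree-19 fit")
plt.plot(X_dense, np.sin(2 * np.pi * X_dense), linestyle="--", label="true function")
plt.legend()
plt.show()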

The resulting plot shows that the polynomial curve follows the training points very closely, but it does so by fitting the noise: the curve is far more complex than the underlying sine wave and doesn’t generalize well to new data.
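One way to make this concrete is to compare the error on the training points with the error on fresh points drawn from the same process. The sketch below is illustrative only: the degrees (3 and 15), the 200 held-out points, and the helper name train_and_test_mse are assumptions, and it uses numpy.polynomial.Polynomial.fit instead of np.polyfit because that method rescales the inputs for better numerical stability.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(42)

def noisy_sine(x):
    """Same data-generating process as above: a sine wave plus Gaussian noise."""
    return np.sin(2 * np.pi * x) + rng.normal(scale=0.1, size=x.shape)

def train_and_test_mse(degree, x_train, y_train, x_test, y_test):
    """Fit a polynomial on the training set; return (train MSE, test MSE)."""
    poly = Polynomial.fit(x_train, y_train, degree)
    train_mse = np.mean((poly(x_train) - y_train) ** 2)
    test_mse = np.mean((poly(x_test) - y_test) ** 2)
    return train_mse, test_mse

# 20 evenly spaced training points, 200 random held-out points
x_train = np.linspace(0, 1, 20)
y_train = noisy_sine(x_train)
x_test = rng.uniform(0, 1, 200)
y_test = noisy_sine(x_test)

results = {d: train_and_test_mse(d, x_train, y_train, x_test, y_test) for d in (3, 15)}
for degree, (train_mse, test_mse) in results.items():
    print(f"degree {degree:2d}: train MSE = {train_mse:.4f}, test MSE = {test_mse:.4f}")
```

The higher-degree fit reaches a lower training error, and the gap between its training error and its test error is the signature of overfitting.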

What is Regularization?

Regularization refers to techniques that prevent overfitting by constraining the model during training, most commonly by adding a penalty term to the loss function. The penalty term discourages the model from learning overly complex representations that fit the noise in the training data.
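As a rough illustration of what "adding a penalty term" means, the sketch below computes a mean squared error plus a weighted penalty on the model parameters. The function name penalized_mse, the parameter name lam, and the toy data are all assumptions made for this example; two common penalty forms are shown, the sum of absolute values and the sum of squares.

```python
import numpy as np

def penalized_mse(w, X, y, lam, kind="l2"):
    """Mean squared error plus a penalty on the weights w.

    lam (hypothetical name) scales the penalty; kind selects its form.
    """
    mse = np.mean((X @ w - y) ** 2)
    if kind == "l1":
        penalty = np.sum(np.abs(w))   # sum of absolute values
    elif kind == "l2":
        penalty = np.sum(w ** 2)      # sum of squares
    else:
        raise ValueError(f"unknown penalty kind: {kind!r}")
    return mse + lam * penalty

# Toy data: w fits both points exactly, so the unpenalized loss is zero
X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, 2.0])
w = np.array([1.0, 2.0])

print(penalized_mse(w, X, y, lam=0.0))             # data term only
print(penalized_mse(w, X, y, lam=0.1, kind="l1"))  # adds 0.1 * (1 + 2)
print(penalized_mse(w, X, y, lam=0.1, kind="l2"))  # adds 0.1 * (1 + 4)
```

Even a perfect fit pays a cost for large weights, so the optimizer is pushed toward simpler solutions as lam grows.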

There are several regularization techniques used in machine learning, including L1 regularization, L2 regularization, and dropout. Let’s explore each of these techniques in detail.

L1 Regularization

L1 regularization, also known as Lasso regularization, adds a penalty term to the loss function that is proportional to the sum of the absolute values of the model parameters. This penalty drives some parameters to exactly zero, so the model learns a sparse representation that keeps only the most important features.
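To see why an absolute-value penalty produces exact zeros, consider a single parameter w: minimizing 0.5 * (w - z)^2 + t * |w| has a closed-form solution known as soft-thresholding, which returns zero whenever |z| <= t. The NumPy sketch below applies it elementwise; the function name soft_threshold and the sample values are illustrative assumptions, not part of any library.

```python
import numpy as np

def soft_threshold(z, t):
    """Minimizer of 0.5 * (w - z)**2 + t * |w|, applied elementwise."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

z = np.array([3.0, -0.5, 0.2, -2.0])   # unpenalized parameter values
shrunk = soft_threshold(z, t=1.0)
print(shrunk)
```

Values whose magnitude is below the threshold become zero, while the others are shrunk toward zero by the threshold amount.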

L1 regularization can be implemented in scikit-learn with the Lasso class, whose alpha parameter controls the strength of the penalty. For example, the following code trains a linear regression model with L1 regularization.

from sklearn.linear_model import Lasso

# Create a Lasso model with alpha=0.1
model = Lasso(alpha=0.1)

# Fit the model to the training data
# (X_train and y_train are assumed to be defined already)
model.fit(X_train, y_train)
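The snippet above assumes that X_train and y_train already exist. A self-contained sketch might look like the following, where the synthetic data (10 features, only the first two of which affect the target) is an assumption made purely for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data (an assumption for illustration): 10 features,
# of which only the first two influence the target.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 10))
true_coef = np.zeros(10)
true_coef[:2] = [3.0, -2.0]
y_train = X_train @ true_coef + rng.normal(scale=0.1, size=100)

model = Lasso(alpha=0.1)
model.fit(X_train, y_train)

print("coefficients:", np.round(model.coef_, 2))
print("non-zero coefficients:", np.count_nonzero(model.coef_))
```

Inspecting model.coef_ shows which features the L1 penalty kept and which it set to zero.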

L2 Regularization

L2 regularization, also known as Ridge regularization, adds a penalty term to the loss function that is proportional to the sum of the squares of the model parameters. This penalty encourages the model to learn small parameter values, which reduces the effective complexity of the model. Unlike L1 regularization, it shrinks parameters toward zero without usually setting them to exactly zero.

L2 regularization can be implemented in scikit-learn with the Ridge class, whose alpha parameter controls the strength of the penalty. For example, the following code trains a linear regression model with L2 regularization.

from sklearn.linear_model import Ridge

# Create a Ridge model with alpha=0.1
model = Ridge(alpha=0.1)

# Fit the model to the training data
model.fit(X_train, y_train)
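For a linear model without an intercept, the L2-penalized least-squares problem has a closed-form solution, which makes the shrinking effect easy to observe. The sketch below is a NumPy illustration and not scikit-learn's implementation; the function name ridge_closed_form, the synthetic data, and the alpha values are assumptions.

```python
import numpy as np

def ridge_closed_form(X, y, alpha):
    """Solve min_w ||X w - y||^2 + alpha * ||w||^2 (no intercept term)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)

# Synthetic data, assumed for illustration only
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=50)

alphas = [0.0, 1.0, 10.0, 100.0]
norms = [np.linalg.norm(ridge_closed_form(X, y, a)) for a in alphas]
for a, norm in zip(alphas, norms):
    print(f"alpha = {a:6.1f}  ->  ||w|| = {norm:.3f}")
```

The norm of the weight vector decreases as alpha increases, which is the sense in which L2 regularization favors small parameter values.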

Dropout

Dropout is a regularization technique used in neural networks that randomly drops out some of the neurons during training. This forces the network to learn redundant representations and can reduce overfitting.
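The mechanism can be sketched in a few lines of NumPy. This is a simplified illustration of "inverted" dropout, the variant in which surviving activations are scaled up by 1 / (1 - rate) during training so that nothing needs to change at prediction time; the function name dropout and the toy activations are assumptions.

```python
import numpy as np

def dropout(activations, rate, rng, training=True):
    """Inverted dropout: zero each unit with probability `rate` during
    training and rescale the survivors so the expected value is unchanged."""
    if not training or rate == 0.0:
        return activations
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
hidden = np.ones((2, 8))   # stand-in for the output of a hidden layer

dropped = dropout(hidden, rate=0.5, rng=rng)
print(dropped)                                              # entries are 0.0 or 2.0
print(dropout(hidden, rate=0.5, rng=rng, training=False))   # unchanged at inference
```

Because a different random mask is drawn on every training step, no single neuron can be relied on, which is what pushes the network toward redundant representations.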

Dropout can be implemented in Keras by adding a Dropout layer to the network. For example, the following code example trains a neural network with a single hidden layer and a dropout rate of 0.5.

from keras.models import Sequential
from keras.layers import Dense, Dropout

# Create a neural network model with dropout
model = Sequential()
model.add(Dense(64, activation='relu', input_dim=X_train.shape[1]))
model.add(Dropout(0.5))
model.add(Dense(1))

# Compile the model
model.compile(loss='mse', optimizer='adam')

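# Fit the model to the training data; the loss on the held-out
# validation_data is reported after each epoch but is not used for training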
model.fit(X_train, y_train, epochs=100, batch_size=32, validation_data=(X_test, y_test))

Conclusion

In summary, overfitting is a common problem in machine learning that occurs when a model learns to fit the noise in the training data. Regularization counters overfitting by constraining the model during training, most commonly by adding a penalty term to the loss function.

There are several regularization techniques used in machine learning, including L1 regularization, L2 regularization, and dropout. Each technique has its advantages and disadvantages, and the choice of technique depends on the nature of the problem being solved.

By understanding overfitting and regularization and implementing appropriate techniques, we can build machine learning models that can generalize well and make accurate predictions on new data.