Supervised Learning

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import time

Introduction

Data pre-processing

Data

The dataset used in this part is the Bank Marketing dataset from the UC Irvine Machine Learning Repository. The data is licensed under CC BY, allowing it to be freely used for this exercise.

The dataset contains data from direct marketing campaigns (phone calls) of a Portuguese banking institution. The data comprises 16 features and a labelled variable, “y”, which indicates if the client subscribed to a term deposit.

Load data

The data is loaded into a pandas dataframe from the downloaded CSV file. After loading the data, the first five rows of the dataframe are inspected using the head() function. The dataframe is then checked for any missing values. This dataset is clean with no missing values.

# Load the data file into a dataframe
bank_df = pd.read_csv("bank-full.csv", sep=";", comment="#")

# Inspect the head of the data
print(bank_df.head())
   age           job  marital  education default  balance housing loan  \
0   58    management  married   tertiary      no     2143     yes   no   
1   44    technician   single  secondary      no       29     yes   no   
2   33  entrepreneur  married  secondary      no        2     yes  yes   
3   47   blue-collar  married    unknown      no     1506     yes   no   
4   33       unknown   single    unknown      no        1      no   no   

   contact  day month  duration  campaign  pdays  previous poutcome   y  
0  unknown    5   may       261         1     -1         0  unknown  no  
1  unknown    5   may       151         1     -1         0  unknown  no  
2  unknown    5   may        76         1     -1         0  unknown  no  
3  unknown    5   may        92         1     -1         0  unknown  no  
4  unknown    5   may       198         1     -1         0  unknown  no  

Check for null values

# Check how many null values are in the data frame
print()
print("Feature Name           Number of missing entries")
print(bank_df.isnull().sum())

Feature Name           Number of missing entries
age          0
job          0
marital      0
education    0
default      0
balance      0
housing      0
loan         0
contact      0
day          0
month        0
duration     0
campaign     0
pdays        0
previous     0
poutcome     0
y            0
dtype: int64

Check for duplicate values

# Count duplicate rows
duplicate_count = bank_df.duplicated().sum()

print(f"Number of duplicate rows: {duplicate_count}")
Number of duplicate rows: 0

Drop irrelevant features

The dataset metadata states that the “duration” feature should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.

bank_df.drop(["duration"], axis=1, inplace=True)

Feature encoding

The source of the dataset provides information relating to each variable, such as data type.

from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

Check datatypes

bank_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45211 entries, 0 to 45210
Data columns (total 16 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   age        45211 non-null  int64 
 1   job        45211 non-null  object
 2   marital    45211 non-null  object
 3   education  45211 non-null  object
 4   default    45211 non-null  object
 5   balance    45211 non-null  int64 
 6   housing    45211 non-null  object
 7   loan       45211 non-null  object
 8   contact    45211 non-null  object
 9   day        45211 non-null  int64 
 10  month      45211 non-null  object
 11  campaign   45211 non-null  int64 
 12  pdays      45211 non-null  int64 
 13  previous   45211 non-null  int64 
 14  poutcome   45211 non-null  object
 15  y          45211 non-null  object
dtypes: int64(6), object(10)
memory usage: 5.5+ MB

Check categorical features

# List of columns to inspect
selected_columns = ["job", "marital", "education", "default", "housing", "loan", "contact", "month", "poutcome"]

# Print unique values for each selected column
for column in selected_columns:
    unique_values = bank_df[column].unique()
    print(f"Unique values in column '{column}':")
    print(unique_values)
    print("-" *70)
Unique values in column 'job':
['management' 'technician' 'entrepreneur' 'blue-collar' 'unknown'
 'retired' 'admin.' 'services' 'self-employed' 'unemployed' 'housemaid'
 'student']
----------------------------------------------------------------------
Unique values in column 'marital':
['married' 'single' 'divorced']
----------------------------------------------------------------------
Unique values in column 'education':
['tertiary' 'secondary' 'unknown' 'primary']
----------------------------------------------------------------------
Unique values in column 'default':
['no' 'yes']
----------------------------------------------------------------------
Unique values in column 'housing':
['yes' 'no']
----------------------------------------------------------------------
Unique values in column 'loan':
['no' 'yes']
----------------------------------------------------------------------
Unique values in column 'contact':
['unknown' 'cellular' 'telephone']
----------------------------------------------------------------------
Unique values in column 'month':
['may' 'jun' 'jul' 'aug' 'oct' 'nov' 'dec' 'jan' 'feb' 'mar' 'apr' 'sep']
----------------------------------------------------------------------
Unique values in column 'poutcome':
['unknown' 'failure' 'other' 'success']
----------------------------------------------------------------------

Encoding nominal categorical features

The categorical features require encoding. One-hot encoding was applied to the nominal features: “job”, “marital”, “contact”, “month”, “default”, “housing”, “loan” and “poutcome”.

# Columns to encode
one_hot_cols = ["job", "marital", "default", "housing", "loan", "contact", "month", "poutcome"]

# OneHotEncoder setup
encoder = OneHotEncoder(sparse_output=False)
encoded = encoder.fit_transform(bank_df[one_hot_cols])

# Create DataFrame from encoded data
encoded_df = pd.DataFrame(encoded, columns=encoder.get_feature_names_out(one_hot_cols))

print(encoded_df.head())

#Copy the bank dataframe
en_bank_df= bank_df.copy()

# Drop original columns
en_bank_df = en_bank_df.drop(one_hot_cols, axis=1)

# Insert encoded columns before the last column
insert_position = en_bank_df.shape[1] - 1
for col in reversed(encoded_df.columns):
    en_bank_df.insert(insert_position, col, encoded_df[col])
   job_admin.  job_blue-collar  job_entrepreneur  job_housemaid  \
0         0.0              0.0               0.0            0.0   
1         0.0              0.0               0.0            0.0   
2         0.0              0.0               1.0            0.0   
3         0.0              1.0               0.0            0.0   
4         0.0              0.0               0.0            0.0   

   job_management  job_retired  job_self-employed  job_services  job_student  \
0             1.0          0.0                0.0           0.0          0.0   
1             0.0          0.0                0.0           0.0          0.0   
2             0.0          0.0                0.0           0.0          0.0   
3             0.0          0.0                0.0           0.0          0.0   
4             0.0          0.0                0.0           0.0          0.0   

   job_technician  ...  month_jun  month_mar  month_may  month_nov  month_oct  \
0             0.0  ...        0.0        0.0        1.0        0.0        0.0   
1             1.0  ...        0.0        0.0        1.0        0.0        0.0   
2             0.0  ...        0.0        0.0        1.0        0.0        0.0   
3             0.0  ...        0.0        0.0        1.0        0.0        0.0   
4             0.0  ...        0.0        0.0        1.0        0.0        0.0   

   month_sep  poutcome_failure  poutcome_other  poutcome_success  \
0        0.0               0.0             0.0               0.0   
1        0.0               0.0             0.0               0.0   
2        0.0               0.0             0.0               0.0   
3        0.0               0.0             0.0               0.0   
4        0.0               0.0             0.0               0.0   

   poutcome_unknown  
0               1.0  
1               1.0  
2               1.0  
3               1.0  
4               1.0  

[5 rows x 40 columns]

Encoding ordinal categorical features

Ordinal encoding was applied to the ordinal feature “education”, as its ordering is important.

# Ordinal encoding
# Education ordinal mapping
education_mapping = {
    "unknown": 0,
    "primary": 1,
    "secondary": 2,
    "tertiary": 3
}

# Apply the mapping
en_bank_df["education"] = en_bank_df["education"].map(education_mapping)

# Display the updated dataframe with encoded columns.
print(en_bank_df[["education"]].head())
   education
0          3
1          2
2          2
3          0
4          0

Target encoding

# Create label encoder
le = LabelEncoder()

en_bank_df = en_bank_df.rename(columns={"y": "target"})

en_bank_df["target"] = le.fit_transform(en_bank_df["target"])
en_bank_df.head(5)
age education balance day campaign pdays previous job_admin. job_blue-collar job_entrepreneur ... month_mar month_may month_nov month_oct month_sep poutcome_failure poutcome_other poutcome_success poutcome_unknown target
0 58 3 2143 5 1 -1 0 0.0 0.0 0.0 ... 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0
1 44 2 29 5 1 -1 0 0.0 0.0 0.0 ... 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0
2 33 2 2 5 1 -1 0 0.0 0.0 1.0 ... 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0
3 47 0 1506 5 1 -1 0 0.0 1.0 0.0 ... 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0
4 33 0 1 5 1 -1 0 0.0 0.0 0.0 ... 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0

5 rows × 48 columns

Dataset balance

Here I inspect the balance of the classes in the full dataset. The figure below shows there are many more "no" (0) responses than "yes" (1) responses.

# Count distribution of 0s and 1s
target_counts = en_bank_df["target"].value_counts()

# Plot the distribution
plt.figure(figsize=(6, 4))
target_counts.plot(kind="bar", color=["skyblue", "orange"])
plt.title("Distribution of Target Variable")
plt.xlabel("Target Class")
plt.ylabel("Count")
plt.xticks(ticks=[0, 1], labels=["0 (no)", "1 (yes)"], rotation=0)
plt.tight_layout()
plt.show()

Handling dataset imbalance

It has been observed that the dataset is imbalanced, with approximately 11% of the data in the "yes" class and the remainder in the "no" class. There are several ways to handle imbalanced datasets, such as under-sampling the majority class, oversampling the minority class, SMOTE, and cost-sensitive learning, where class weights are adjusted so that the cost function penalises errors on each class differently. I tested SMOTE and found that the increased training set size caused problems with overfitting in the K-NN section. Therefore, here I have opted to use cost-sensitive learning.
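To make the cost-sensitive option concrete, the short sketch below (illustrative only, not part of the modelling pipeline) shows how scikit-learn derives "balanced" class weights; the explicit weight dictionaries tuned in the grid searches later follow the same idea of up-weighting the minority class.

# Illustrative sketch: "balanced" class weights in scikit-learn are computed as
# weight_c = n_samples / (n_classes * count_c), so the minority class gets a larger weight.
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y_all = en_bank_df["target"].to_numpy()          # encoded target built above
classes = np.unique(y_all)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_all)
print(dict(zip(classes, weights)))               # minority class 1 receives the larger weight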

Train test split

Stratified sampling has been used to ensure the class proportions remain the same in the train and test sets.
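As a side note (a hedged sketch, not the approach used below), the same stratified 80/20 split could also be obtained with train_test_split and its stratify argument; StratifiedShuffleSplit is what is actually used in the following cell.

# Alternative sketch only: train_test_split can stratify on the target directly.
from sklearn.model_selection import train_test_split

X_all = en_bank_df.drop(columns=["target"])
y_all = en_bank_df["target"]
X_tr, X_te, y_tr, y_te = train_test_split(X_all, y_all, test_size=0.2,
                                          stratify=y_all, random_state=8)
print(y_tr.mean(), y_te.mean())   # class proportions in both splits should be nearly identical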

from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.preprocessing import StandardScaler

# Split the data
# Stratified sampling to help maintain similar distributions between the test and train sets
stratSplit = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=8)
for train_index, test_index in stratSplit.split(en_bank_df.iloc[:, :-1], en_bank_df["target"]):
    X_train = en_bank_df.iloc[:, :-1].iloc[train_index]
    X_test = en_bank_df.iloc[:, :-1].iloc[test_index]
    y_train = en_bank_df["target"].iloc[train_index]
    y_test = en_bank_df["target"].iloc[test_index]

# Fit scaler on training data
scaler = StandardScaler()
X_train_scald = scaler.fit_transform(X_train)

# Transform test data using the same scaler
X_test_scald = scaler.transform(X_test)

# check class distribution in test set
test_counts = y_test.value_counts()

# check null accuracy score
null_accuracy = (test_counts[0]/(test_counts[0] + test_counts[1]))

print(f"Null accuracy score: {null_accuracy:.4f}")
Null accuracy score: 0.8830

First logistic regression (no regularisation)

Here, a logistic regression model is fitted without any regularisation. The logistic regression model in scikit learn has several hyperparameters. Those specific to the non-regularised version are:

  • Solver: The type of solver used.
  • Maximum Iterations: Controls how long the solver runs.
  • Tolerance: Determines the stopping criteria for the optimisation.
  • Class Weight: Useful for imbalanced datasets.

Since I will be creating regularised logistic models for this data later, I decided to remove the solver from the hyperparameter tuning. There is only one solver (“saga”) that handles no regularisation and the L1 and L2 regularisation, and I want to use the same solver for each model.
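As a minimal sketch of that design choice, all three variants can be constructed with the same saga solver, differing only in the penalty argument (the C values here simply mirror the regularised models defined later):

# Sketch only: one solver ("saga") covers all three penalty settings used in this work.
from sklearn.linear_model import LogisticRegression

logreg_plain = LogisticRegression(penalty=None, solver="saga", random_state=8)
logreg_lasso = LogisticRegression(penalty="l1", C=0.1, solver="saga", random_state=8)
logreg_ridge = LogisticRegression(penalty="l2", C=0.1, solver="saga", random_state=8)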

Hyperparameter tuning

For the model evaluation, I have used the F1 score. F1 is the harmonic mean of precision and recall. The main objective of this modelling work is to predict whether a customer will subscribe to a term deposit. Therefore, I aim to strike a balance between the recall (don’t want to miss potential conversions) and the precision (don’t want to waste effort on uninterested people).
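For reference, the F1 score is simply the harmonic mean of precision and recall; the values in this small sketch are illustrative, not results from this dataset.

# Illustrative only: F1 as the harmonic mean of precision and recall.
precision, recall = 0.42, 0.50
f1 = 2 * precision * recall / (precision + recall)
print(f"F1 = {f1:.4f}")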

The scikit learn GridSearchCV is used to execute the parameter tuning. It runs through all combinations of the parameters and uses cross-validation to assess the performance of the model. The results showed that the maximum number of iterations and convergence tolerance were not significant compared to the class weights.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Define pipeline (logistic regression)
pipeline = Pipeline([
    ("logreg", LogisticRegression(penalty=None, solver="saga", random_state=8))
])

param_grid = {
    "logreg__class_weight": [
        None,
        'balanced',
        {0: 1, 1: 2.5},  
        {0: 1, 1: 5},
        {0: 1, 1: 10}
    ],
    "logreg__max_iter": [100, 500, 1000],
    "logreg__tol": [1e-5, 1e-4, 1e-3],
}

# Set up GridSearchCV
# GridSearchCV uses stratified sampling internally, so nothing special is needed here.
# Only interested in the minority class, so the F1 score is not weighted
grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring="f1", n_jobs=-1, return_train_score=True)

# Fit the model
grid_search.fit(X_train_scald, y_train)

# Convert results to DataFrame
results_df = pd.DataFrame(grid_search.cv_results_)

# Create a readable label for each parameter combination
results_df["param_combo"] = results_df.apply(
    lambda row: f"{row['param_logreg__max_iter']}, tol={row['param_logreg__tol']}, weight={row['param_logreg__class_weight']}",
    axis=1
)

# Sort by mean test score
results_df = results_df.sort_values(by="mean_test_score", ascending=False)

# Plot performance of each combination in grid search
plt.figure(figsize=(12, 12))
plt.barh(results_df["param_combo"], results_df["mean_test_score"], color="skyblue")
plt.xlabel("Mean Score")
plt.title("Grid Search Results for Logistic Regression (No Regularisation)")
plt.gca().invert_yaxis()
plt.tight_layout()
plt.grid(True)
plt.show()

# Best parameters and score
print("Best Parameters:", grid_search.best_params_)
print(f"Best Score: {grid_search.best_score_:.4f}")

Best Parameters: {'logreg__class_weight': {0: 1, 1: 5}, 'logreg__max_iter': 100, 'logreg__tol': 0.001}
Best Score: 0.4336

Create a model with the best parameters and run cross-validation

Here, I’m using cross-validation to check the model’s ability to generalise. Looking for divergence between the training and validation scores. The results show the model is stable and not over-fitted.

from sklearn.model_selection import cross_validate

def performCV(model, X_train, y_train):
    """
    Function to perform cross-validation for a given model and training data.

    :param model: The model to validate.
    :param X_train: Training data.
    :param y_train: Training labels.
    :returns: Dictionary of results from the cross-validation.
    """
    results = cross_validate(model, X=X_train, y=y_train,
                    scoring=["f1", "precision", "recall", "accuracy", "roc_auc"],                         
                    cv=5, verbose=True, return_train_score=True,
                    return_estimator=True)
    return results


def visualiseResults(results):
    """
    Function to plot different scoring metrics from a cross-validation
    study.

    :param results: Dictionary containing the cross-validation results.
    :returns: None.
    """
    cvData = {"val_f1":results["test_f1"],
            "train_f1":results["train_f1"],
            "val_precision":results["test_precision"],
            "train_precision":results["train_precision"],
            "val_recall":results["test_recall"],
            "train_recall":results["train_recall"],
            "val_acc":results["test_accuracy"],
            "train_acc":results["train_accuracy"],
            "val_roc_auc":results["test_roc_auc"],
            "train_roc_auc":results["train_roc_auc"]}

    cv_df = pd.DataFrame(cvData)
    fig = plt.figure(figsize=(8,6))
    ax = fig.add_subplot()
    plt.plot(cv_df.index, cv_df["train_acc"], label="train_acc", color="r")
    plt.plot(cv_df.index, cv_df["val_acc"], label="val_acc", c="r", ls="--")
    plt.plot(cv_df.index, cv_df["train_f1"], label="train_f1", c="b")
    plt.plot(cv_df.index, cv_df["val_f1"], label="val_f1", c="b", ls="--")
    plt.plot(cv_df.index, cv_df["train_recall"], label="train_recall", c="g")
    plt.plot(cv_df.index, cv_df["val_recall"], label="val_recall", c="g", ls="--")
    plt.plot(cv_df.index, cv_df["train_precision"], label="train_precision", c="purple")
    plt.plot(cv_df.index, cv_df["val_precision"], label="val_precision", c="purple", ls="--")
    plt.plot(cv_df.index, cv_df["train_roc_auc"], label="train_roc_auc", c="magenta")
    plt.plot(cv_df.index, cv_df["val_roc_auc"], label="val_roc_auc", c="magenta", ls="--")
    plt.legend()
    plt.ylabel("Score")
    plt.title("Model performance based on multiple scoring metrics")
    plt.xticks(range(len(results["fit_time"])),["run "+str(i+1) for i in range(len(results["fit_time"]))])
    plt.show()
# Create the model using the best class weight and tolerance from the tuning step
# (max_iter raised to 400, since the grid search showed it had little effect on the score)
log_reg_nr = LogisticRegression(penalty=None, solver="saga", max_iter=400, tol=0.001, class_weight={0: 1, 1: 5}, random_state=8)

# Run cross-validation
cv_results = performCV(log_reg_nr, X_train_scald, y_train)
visualiseResults(cv_results)
[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed:    1.1s finished

Evaluate model performance

Here, I evaluate the model’s performance by calculating the precision, recall, F1 score and the ROC_AUC score. The confusion matrix is also plotted. The recall and precision are very good for the majority class, as expected, due to the class imbalance. The minority class shows the precision is approximately 42% and the recall is approximately 50%.

from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.metrics import roc_auc_score

log_reg_nr.fit(X_train_scald, y_train)
y_pred = log_reg_nr.predict(X_test_scald)

print(classification_report(y_test, y_pred, target_names=["no", "yes"], digits=4))

# Predict probabilities
y_probs = log_reg_nr.predict_proba(X_test_scald)[:, 1]  # Probabilities for class 1

# Compute ROC AUC
roc_auc = roc_auc_score(y_test, y_probs)
print(f"ROC AUC Score: {roc_auc:.4f}")

cm_nr = confusion_matrix(y_test, y_pred, labels=[0,1])
              precision    recall  f1-score   support

          no     0.9319    0.9088    0.9202      7985
         yes     0.4204    0.4991    0.4564      1058

    accuracy                         0.8609      9043
   macro avg     0.6762    0.7039    0.6883      9043
weighted avg     0.8721    0.8609    0.8660      9043

ROC AUC Score: 0.7735

Second and Third logistic regression models

In this section, two additional logistic regression models are created with differing regularisation to investigate whether regularisation affects the performance of the model for this application.

LASSO regularisation (L1)

This section creates a regularised model using the L1 norm. Firstly, I run a grid search to find the value of C and the class weights that give the highest F1 score. Other parameters, such as the convergence tolerance and maximum solver iterations, have been set to the same values used for the non-regularised model. As with the non-regularised model, the class weights have the most effect on the model performance. The value of the regularisation strength (C) has little effect once the optimal class weights are selected.

# LASSO - L1
# Define pipeline
lasso_pipeline = Pipeline([
    ("lasso", LogisticRegression(penalty="l1", solver="saga", random_state=8,
                                 tol=1e-3, max_iter=400, class_weight=None))
])

# Define parameter grid
lasso_params = {
    "lasso__C": [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0],
     "lasso__class_weight": [
        None,
        'balanced',
        {0: 1, 1: 2.5},  
        {0: 1, 1: 5},
        {0: 1, 1: 10}
    ]
}

# Set up GridSearchCV
lasso_grid_search = GridSearchCV(lasso_pipeline, lasso_params, cv=5, scoring="f1", n_jobs=-1, return_train_score=True)

# Fit the model
lasso_grid_search.fit(X_train_scald, y_train)

# Convert results to DataFrame
lasso_results_df = pd.DataFrame(lasso_grid_search.cv_results_)

# Create a readable label for each parameter combination
lasso_results_df["param_combo"] = lasso_results_df.apply(
    lambda row: f"{row['param_lasso__C']},  weight={row['param_lasso__class_weight']}",
    axis=1
)

# Sort by mean test score
lasso_results_df = lasso_results_df.sort_values(by="mean_test_score", ascending=False)

# Plot performance of each combination in grid search
plt.figure(figsize=(12, 8))
plt.barh(lasso_results_df["param_combo"], lasso_results_df["mean_test_score"], color="skyblue")
plt.xlabel("Mean Score")
plt.title("Grid Search Results for Logistic Regression (LASSO Regularisation)")
plt.gca().invert_yaxis()
plt.tight_layout()
plt.grid(True)
plt.show()

# Best parameters and score
print("Best Parameters:", lasso_grid_search.best_params_)
print(f"Best Score: {lasso_grid_search.best_score_:.4f}")

Best Parameters: {'lasso__C': 0.1, 'lasso__class_weight': {0: 1, 1: 5}}
Best Score: 0.4337
log_reg_l1 = LogisticRegression(penalty="l1", solver="saga", max_iter=400, tol=0.001, class_weight={0: 1, 1: 5}, C=0.1, random_state=8)

Ridge regularisation (L2)

This section creates a regularised model using the L2 norm. Firstly, I run a grid search to find the values of C and the class weights that give the highest F1 score. Other parameters, such as the convergence tolerance and maximum solver iterations, have been set to the same values used for the non-regularised model. Similar to the other two models, the class weights are the most important parameter. The regularisation strength has no effect for values above 0.1.

# Ridge - L2
# Define pipeline
ridge_pipeline = Pipeline([
    ("ridge", LogisticRegression(penalty="l2", solver="saga", random_state=8,
                                 tol=1e-5, max_iter=100, class_weight=None))
])

# Define parameter grid
ridge_params = {
    "ridge__C": [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0],
     "ridge__class_weight": [
        None,
        'balanced',
        {0: 1, 1: 2.5},  
        {0: 1, 1: 5},
        {0: 1, 1: 10}
    ]
}

# Set up GridSearchCV
ridge_grid_search = GridSearchCV(ridge_pipeline, ridge_params, cv=5, scoring="f1", n_jobs=-1, return_train_score=True)

# Fit the model
ridge_grid_search.fit(X_train_scald, y_train)

# Convert results to DataFrame
ridge_results_df = pd.DataFrame(ridge_grid_search.cv_results_)

# Create a readable label for each parameter combination
ridge_results_df["param_combo"] = ridge_results_df.apply(
    lambda row: f"{row['param_ridge__C']},  weight={row['param_ridge__class_weight']}",
    axis=1
)

# Sort by mean test score
ridge_results_df = ridge_results_df.sort_values(by="mean_test_score", ascending=False)

# Plot performance of each combination in grid search
plt.figure(figsize=(12, 8))
plt.barh(ridge_results_df["param_combo"], ridge_results_df["mean_test_score"], color="skyblue")
plt.xlabel("Mean Score")
plt.title("Grid Search Results for Logistic Regression (Ridge Regularisation)")
plt.gca().invert_yaxis()
plt.tight_layout()
plt.grid(True)
plt.show()

# Best parameters and score
print("Best Parameters:", ridge_grid_search.best_params_)
print(f"Best Score: {ridge_grid_search.best_score_:.4f}")

Best Parameters: {'ridge__C': 0.1, 'ridge__class_weight': {0: 1, 1: 5}}
Best Score: 0.4337
log_reg_l2 = LogisticRegression(penalty="l2", solver="saga", random_state=8,
                                 max_iter=400, tol=1e-3, class_weight= {0: 1, 1: 5}, C=0.1)

Comparison of 1st, 2nd and 3rd models

In this section, I compare the performance of the three logistic models.

The performance metrics are virtually identical across all three models (differences in the third decimal place), which confirms that regularisation isn't providing a meaningful benefit here. This suggests the first model already generalises well without regularisation; since regularisation has essentially no effect, the models are not overfitting. The logistic regression coefficients for each model have also been printed and compared.

Comparison Summary

Model                 Precision (yes)    Recall (yes)    F1 Score (yes)    ROC AUC
No Regularisation     0.4204             0.4991          0.4564            0.7735
L1 Regularisation     0.4213             0.4981          0.4565            0.7738
L2 Regularisation     0.4204             0.4991          0.4564            0.7736

The coefficients are remarkably similar across all three models (no regularisation, L1, and L2), which suggests:

  • Model Stability - The fact that regularisation barely changes the coefficients suggests the first model wasn’t significantly overfitting. If there were overfitting, I’d expect to see much larger differences when regularisation is applied.
  • Feature importance is consistent between the non-regularised and L2 models. The two largest coefficients are consistent across all models:
    • poutcome_success: ~0.36 (strongest positive predictor)
    • contact_unknown: ~-0.33 (strongest negative predictor)
  • L1 vs L2 Differences - L1 regularisation typically drives some coefficients to exactly zero (feature selection), while L2 shrinks them toward zero. The L1 model drives only four coefficients to zero, which reinforces that most features contribute meaningful information.
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# Fit the regularised models and compute confusion matrices from their test-set predictions

# L1 regularization
log_reg_l1.fit(X_train_scald, y_train)
y_pred_l1 = log_reg_l1.predict(X_test_scald)
cm_l1 = confusion_matrix(y_test, y_pred_l1, labels=[0,1])
print("L1 regularisation results:")
print(classification_report(y_test, y_pred_l1, target_names=["no", "yes"], digits=4))
y_probs_l1 = log_reg_l1.predict_proba(X_test_scald)[:, 1]  # Probabilities for class 1
roc_auc_l1 = roc_auc_score(y_test, y_probs_l1)
print(f"ROC AUC Score: {roc_auc_l1:.4f}")

# L2 regularization
start_fit = time.perf_counter()
log_reg_l2.fit(X_train_scald, y_train)
end_fit = time.perf_counter()

start_pred = time.perf_counter()
y_pred_l2 = log_reg_l2.predict(X_test_scald)
end_pred =time.perf_counter()

cm_l2 = confusion_matrix(y_test, y_pred_l2, labels=[0,1])
print("L2 regularisation results:")
print(classification_report(y_test, y_pred_l2, target_names=["no", "yes"], digits=4))
y_probs_l2 = log_reg_l2.predict_proba(X_test_scald)[:, 1]  # Probabilities for class 1
roc_auc_l2 = roc_auc_score(y_test, y_probs_l2)
print(f"ROC AUC Score: {roc_auc_l2:.4f}")

# Report timings
log_reg_l2_fit_time = end_fit - start_fit
log_reg_l2_pred_time = end_pred - start_pred

print(f"Log Regression (L2) fitting time: {log_reg_l2_fit_time:.4f} seconds")
print(f"Log Regression (L2) prediction time: {log_reg_l2_pred_time:.4e} seconds")

# Plotting function
def plot_confusion_matrix(cm, title, ax):
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', cbar=False, ax=ax)
    ax.set_title(title)
    ax.set_xlabel('Predicted')
    ax.set_ylabel('Actual')
    ax.set_xticklabels(['no', 'yes'])
    ax.set_yticklabels(['no', 'yes'])

# Create subplots
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

plot_confusion_matrix(cm_nr, 'No Regularization', axes[0])
plot_confusion_matrix(cm_l1, 'L1 Regularization', axes[1])
plot_confusion_matrix(cm_l2, 'L2 Regularization', axes[2])

plt.tight_layout()
plt.show()
L1 regularisation results:
              precision    recall  f1-score   support

          no     0.9319    0.9093    0.9205      7985
         yes     0.4213    0.4981    0.4565      1058

    accuracy                         0.8612      9043
   macro avg     0.6766    0.7037    0.6885      9043
weighted avg     0.8721    0.8612    0.8662      9043

ROC AUC Score: 0.7738
L2 regularisation results:
              precision    recall  f1-score   support

          no     0.9319    0.9088    0.9202      7985
         yes     0.4204    0.4991    0.4564      1058

    accuracy                         0.8609      9043
   macro avg     0.6762    0.7039    0.6883      9043
weighted avg     0.8721    0.8609    0.8660      9043

ROC AUC Score: 0.7736
Log Regression (L2) fitting time: 0.5195 seconds
Log Regression (L2) prediction time: 1.3113e-03 seconds

feature_names = X_train.columns
coefs_nr = log_reg_nr.coef_.flatten()
coefs_l1 = log_reg_l1.coef_.flatten()
coefs_l2 = log_reg_l2.coef_.flatten()

logistic_nr = pd.DataFrame({'feature_name': feature_names, 'coefficients': coefs_nr})
logistic_l1 = pd.DataFrame({'feature_name': feature_names, 'coefficients': coefs_l1})
logistic_l2 = pd.DataFrame({'feature_name': feature_names, 'coefficients': coefs_l2})
results = [logistic_nr, logistic_l1, logistic_l2]

fig, axes = plt.subplots(3, 1, figsize=(8, 21))
subplot_titles=["No regularisation", "L1 regularisation", "L2 regularisation"]

for i, ax in enumerate(axes):
    # Sort the features by the absolute value of their coefficient
    results[i]["abs_value"] = results[i]["coefficients"].apply(lambda x: abs(x))
    #results[i]["colors"] = results[i]["coefficients"].apply(lambda x: "green" if x > 0 else "red")
    results[i] = results[i].sort_values("abs_value", ascending=False)
    
    sns.barplot(x="feature_name", y="coefficients", data=results[i].head(20),ax=ax)

    # Colour bars based on coefficient sign
    for bar, coeff in zip(ax.patches, results[i]['coefficients']):
        bar.set_color('green' if coeff >= 0 else 'red')

    axes[i].set_xlabel("Feature Name", fontsize=12)
    
    tick_positions = range(20)
    ax.set_xticks(tick_positions)
    ax.set_xticklabels(results[i]["feature_name"].head(20), rotation=45, ha='right')
    axes[i].set_ylabel("Coef", fontsize=12)    
    axes[i].tick_params(axis='x', labelrotation=45)
    ax.set_title(subplot_titles[i])

# Rotate x-axis labels for better readability
plt.xticks(rotation=45, ha='right')
fig.suptitle("Top 20 Features Logistic Regression", fontsize=14)
plt.tight_layout(rect=[0, 0, 1, 0.99])
plt.show()

# Compare feature selection between L1 and L2
print(f"L1 non-zero features: {np.sum(log_reg_l1.coef_[0] != 0)}")
print(f"L2 non-zero features: {np.sum(log_reg_l2.coef_[0] != 0)}")
print(f"Total features: {len(log_reg_l1.coef_[0])}")
L1 non-zero features: 43
L2 non-zero features: 47
Total features: 47
print("Features removed from L1 model:")
logistic_l1[logistic_l1["coefficients"] == 0]
Features removed from L1 model:
feature_name coefficients abs_value
16 job_technician 0.0 0.0
19 marital_divorced 0.0 0.0
39 month_may 0.0 0.0
46 poutcome_unknown 0.0 0.0
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import roc_curve

yRawScores_nr = log_reg_nr.decision_function(X_train_scald)
yRawScores_l1 = log_reg_l1.decision_function(X_train_scald)
yRawScores_l2 = log_reg_l2.decision_function(X_train_scald)

fpr_nr, tpr_nr, thresholds_nr = roc_curve(y_train, yRawScores_nr)
fpr_l1, tpr_l1, thresholds_l1 = roc_curve(y_train, yRawScores_l1)
fpr_l2, tpr_l2, thresholds_l2 = roc_curve(y_train, yRawScores_l2)

plt.figure(figsize=(8,8))
plt.plot(fpr_nr,tpr_nr, label="no regularisation")
plt.plot(fpr_l1,tpr_l1, label="L1 regularisation")
plt.plot(fpr_l2,tpr_l2, label="L2 regularisation")
plt.xlabel("FPR",fontsize=15)
plt.ylabel("Recall (TPR)",fontsize=15)
plt.title("ROC",fontsize=20)

randomAssignment = np.random.normal(size=len(y_train))
fprRand, tprRand, thresholdsRand = roc_curve(y_train, randomAssignment)
plt.plot(fprRand, tprRand, ls="--", label="random classifier")
plt.legend()
plt.show()

K-Nearest-Neighbour (KNN)

In this section, I fit a KNN model to the data. First, I use a grid search to find the optimal settings for K (the number of neighbours) and the weighting scheme. The model is scored with the F1 score, to be consistent with the logistic regression models produced earlier in this work.

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
import numpy as np

# Create pipeline
pipeline = Pipeline([
    ('knn', KNeighborsClassifier())
])

# Define K range to search
n_samples = len(X_train_scald)
sqrt_n = int(np.sqrt(n_samples))

# Search K from 1 up to twice the square root of n, capped at 50
k_range = list(range(1, min(sqrt_n * 2, 50))) 

param_grid = {
    'knn__n_neighbors': k_range,
    'knn__weights': ['uniform', 'distance'],
}

# Grid search with CV
knn_grid_search = GridSearchCV(
    pipeline, 
    param_grid, 
    cv=5,
    scoring='f1',
    n_jobs=-1
)

knn_grid_search.fit(X_train_scald, y_train)
print(f"Best K: {knn_grid_search.best_params_}")
print(f"Best Score: {knn_grid_search.best_score_:.4f}")
Best K: {'knn__n_neighbors': 11, 'knn__weights': 'distance'}
Best Score: 0.3377
import matplotlib.pyplot as plt
from sklearn.model_selection import cross_val_score

# Extract results
k_scores = []
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k, weights="distance")
    scores = cross_val_score(knn, X_train_scald, y_train, cv=5, scoring="f1")
    k_scores.append(scores.mean())

plt.plot(k_range, k_scores)
plt.xlabel("K")
plt.ylabel("F1 Score")
plt.title("KNN Performance vs K")
plt.show()

knn_model = KNeighborsClassifier(n_neighbors=11, weights="distance")

start_fit = time.perf_counter()
knn_model.fit(X_train_scald, y_train)
end_fit = time.perf_counter()

start_pred = time.perf_counter()
y_pred_knn = knn_model.predict(X_test_scald)
end_pred = time.perf_counter()

print(classification_report(y_test, y_pred_knn, target_names=["no", "yes"], digits=4))
y_probs_knn = knn_model.predict_proba(X_test_scald)[:, 1]  # Probabilities for class 1
roc_auc_knn = roc_auc_score(y_test, y_probs_knn)
print(f"ROC AUC Score: {roc_auc_knn:.4f}")
# Report timings
knn_fit_time = end_fit - start_fit
knn_pred_time = end_pred - start_pred

print(f"KNN fitting time: {knn_fit_time:.4f} seconds")
print(f"KNN prediction time: {knn_pred_time:.4f} seconds")
              precision    recall  f1-score   support

          no     0.9077    0.9752    0.9402      7985
         yes     0.5733    0.2514    0.3495      1058

    accuracy                         0.8905      9043
   macro avg     0.7405    0.6133    0.6449      9043
weighted avg     0.8686    0.8905    0.8711      9043

ROC AUC Score: 0.7391
KNN fitting time: 0.0062 seconds
KNN prediction time: 0.6241 seconds

Comparison between Logistic Regression and KNN models

# Calculate confusion matrix for KNN
knn_cm = confusion_matrix(y_test, y_pred_knn, labels=[0,1])

# Plot confusion matrices
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

labels = ['no', 'yes']

# KNN Confusion Matrix
sns.heatmap(knn_cm, annot=True, fmt='d', cmap='Blues', ax=axes[0])
axes[0].set_title('KNN Confusion Matrix')
axes[0].set_xlabel('Predicted')
axes[0].set_ylabel('Actual')
axes[0].set_xticklabels(labels)
axes[0].set_yticklabels(labels)

# Logistic Regression Confusion Matrix
sns.heatmap(cm_l2, annot=True, fmt='d', cmap='Greens', ax=axes[1])
axes[1].set_title('Logistic Regression (L2) Confusion Matrix')
axes[1].set_xlabel('Predicted')
axes[1].set_ylabel('Actual')
axes[1].set_xticklabels(labels)
axes[1].set_yticklabels(labels)

plt.tight_layout()
plt.show()

Performance Comparison: KNN vs Logistic Regression

Minority class (“yes”) metrics

Metric       Logistic Regression (L2)    KNN
Precision    0.4204                      0.5733
Recall       0.4991                      0.2514
F1 Score     0.4564                      0.3495

Observation: KNN has low recall for the “yes” class, meaning it misses most of the actual positive cases. This is critical in marketing, where identifying potential responders is key. It does have higher precision, which means less time would be wasted on potential “no” responders.

Overall metrics

Metric        Logistic Regression (L2)    KNN
Accuracy      0.8609                      0.8905
Macro F1      0.6883                      0.6449
Weighted F1   0.8660                      0.8711
ROC AUC       0.7736                      0.7391

Observation: KNN performs well on the majority class (“no”) but poorly on the minority class (“yes”), dragging down macro and weighted averages.

Training time

Model                       Training Time (s)    Prediction Time (s)
Logistic Regression (L2)    0.1785               1.1542e-03
KNN                         0.01                 0.3590

KNN is faster at training because it simply stores the training data and performs no computation at that stage. However, it is much slower at prediction, because the distance calculations are deferred to that point. Based on the sum of training and prediction time, Logistic Regression is faster overall.
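To illustrate why the cost sits at prediction time, the sketch below shows a brute-force version of a single KNN prediction: the query is compared against every stored training row (scikit-learn may use more efficient index structures internally, so this is illustrative only).

# Brute-force illustration of one KNN prediction (sketch, not the scikit-learn implementation).
import numpy as np

def knn_predict_one(query, X_train_arr, y_train_arr, k=11):
    # Distance from the query to every training point - this is the expensive step
    dists = np.linalg.norm(X_train_arr - query, axis=1)
    # Majority vote among the k nearest neighbours (binary labels assumed)
    nearest = y_train_arr[np.argsort(dists)[:k]]
    return int(nearest.mean() >= 0.5)

# Example usage with the scaled arrays from earlier:
# knn_predict_one(X_test_scald[0], X_train_scald, y_train.to_numpy())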

Number of trainable parameters

  • Logistic Regression: learns a weight for each feature plus a bias term. For binary classification with n features, that is n + 1 parameters.
  • KNN: stores the training data and makes decisions at prediction time; it has no trainable parameters.

Logistic Regression is more compact and interpretable.
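As a quick check of the parameter counts described above (assuming the fitted models from the earlier cells are still in scope):

# Count the learned parameters of the fitted L2 logistic regression: n weights + 1 intercept.
n_params = log_reg_l2.coef_.size + log_reg_l2.intercept_.size
print(f"Logistic Regression (L2) learned parameters: {n_params}")
print("KNN learned parameters: 0 (it stores the training set instead)")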

Conclusion

For this application, KNN was worse than Logistic Regression because:

  • KNN was worse at identifying “yes” cases, which is crucial in marketing.
  • KNN had a faster training time but slower prediction time. This was noticeable when testing SMOTE (not shown in this notebook).