Generalized Linear Models

[1]:
%load_ext autoreload
%autoreload 2

import plotly.io as pio
pio.renderers.default = "sphinx_gallery"
import plotly.express as px
import pandas as pd
from sklearn import datasets
import statsmodels.api as sm
import statsmodels.formula.api as smf

Background: Ordinary Linear Regression

A generalized linear model is a generalization of ordinary linear regression. In ordinary linear regression, we assume that the response variable, \(y\), can be described (up to random noise) by the equation:

\[y^{(i)} = \beta_0 + \beta_1 x_1^{(i)} + \beta_2 x_2^{(i)} + \ldots + \beta_n x_n^{(i)}\]

where:

  • \(i\) is the index of the sample

  • \(n\) is the number of features

  • \(\beta_0, \beta_1, \ldots, \beta_n\) are the coefficients of the model

  • \(x_1^{(i)}, x_2^{(i)}, \ldots, x_n^{(i)}\) are the features of the \(i\)-th sample

  • \(y^{(i)}\) is the response variable for the \(i\)-th sample
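
As a minimal sketch of what this equation computes, with made-up coefficient and feature values (not fitted to any data):

[ ]:
import numpy as np

# Hypothetical coefficients [beta_0, beta_1, beta_2, beta_3] for n = 3 features
beta = np.array([1.0, 0.5, -2.0, 0.25])

# Features x_1, x_2, x_3 of a single sample i
x_i = np.array([0.3, 1.2, -0.7])

# y_i = beta_0 + beta_1 * x_1 + beta_2 * x_2 + beta_3 * x_3
y_i = beta[0] + beta[1:] @ x_i
print(y_i)  # -1.425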

For example, let’s look at the diabetes dataset:

[20]:
# Load diabetes dataset
dataset = datasets.load_diabetes()
x_data = pd.DataFrame(dataset.data, columns=dataset.feature_names)
y_data = pd.DataFrame(dataset.target, columns=["disease_progression"])
data = pd.concat([x_data, y_data], axis=1)
data.sort_values(["age", "sex"], inplace=True)

The features in the diabetes dataset have been mean-centered and scaled, so their raw values are hard to interpret; for example:

[21]:
print(f"Sex is one of: {data['sex'].unique()}")
print(f"Age is one of:\n{data['age'].unique()}")
Sex is one of: [-0.04464164  0.05068012]
Age is one of:
[-0.10722563 -0.10359309 -0.09996055 -0.09632802 -0.09269548 -0.08906294
 -0.0854304  -0.08179786 -0.07816532 -0.07453279 -0.07090025 -0.06726771
 -0.06363517 -0.06000263 -0.05637009 -0.05273755 -0.04910502 -0.04547248
 -0.04183994 -0.0382074  -0.03457486 -0.03094232 -0.02730979 -0.02367725
 -0.02004471 -0.01641217 -0.01277963 -0.00914709 -0.00551455 -0.00188202
  0.00175052  0.00538306  0.0090156   0.01264814  0.01628068  0.01991321
  0.02354575  0.02717829  0.03081083  0.03444337  0.03807591  0.04170844
  0.04534098  0.04897352  0.05260606  0.0562386   0.05987114  0.06350368
  0.06713621  0.07076875  0.07440129  0.07803383  0.08166637  0.08529891
  0.08893144  0.09256398  0.09619652  0.11072668]

Before using this dataset, we would like sex and age to have meaningful values. To do so, we transform the data using information from Table 1 in the original study:

[16]:
# Original ages
ages_orig = data["age"].unique()

# Correct ages
MIN_AGE = 19
MAX_AGE = MIN_AGE + len(ages_orig)
ages = list(range(MIN_AGE, MAX_AGE))

# Transform the ages in the dataset (the unique values are in ascending
# order because the data was sorted by age above)
age_map = {age_orig: age for age_orig, age in zip(ages_orig, ages)}
data["age"] = data["age"].map(age_map)
[17]:
# Original sexes
sexes_orig = data["sex"].unique()

# Correct sexes
sexes = [1, 2]

# Transform the sexes in the dataset
sex_map = {sex_orig: sex for sex_orig, sex in zip(sexes_orig, sexes)}
data["sex"] = data["sex"].map(sex_map)
[18]:
data.head()
[18]:
     age  sex       bmi        bp        s1        s2        s3        s4        s5        s6  disease_progression
26    19    1 -0.077342 -0.026328 -0.089630 -0.096198  0.026550 -0.076395 -0.042571 -0.005220                137.0
344   19    1 -0.011595 -0.040099  0.049341  0.064447 -0.013948  0.034309  0.007027 -0.030072                200.0
374   19    1 -0.034229 -0.067642 -0.063487 -0.070520  0.008142 -0.039493 -0.000612 -0.079778                140.0
79    20    1 -0.037463 -0.026328  0.002559  0.019980  0.011824 -0.002592 -0.068332 -0.025930                113.0
226   20    2 -0.046085 -0.026328 -0.024960 -0.024800  0.030232 -0.039493 -0.039809 -0.054925                 77.0

Here we have 10 features:

[4]:
print(dataset.feature_names)
['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']

and a response variable \(y\) = disease_progression, a quantitative measure of disease progression one year after baseline.

Our equation for linear regression would be:

\[y^{(i)} = \beta_0 + \beta_1 x_1^{(i)} + \ldots + \beta_{10} x_{10}^{(i)}\]

For the first data point (sample), this would be:

\[137.0 = \beta_0 + \beta_1 \cdot 19 + \beta_2 \cdot 1 + \beta_3 \cdot (-0.077342) + \ldots + \beta_{10} \cdot (-0.005220)\]

In ordinary linear regression, we minimize a cost function, typically the least-squares cost, to find the values of \(\beta\) that best predict the \(y\)-values.
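
As a sketch of this step using the statsmodels API imported above (sm.OLS is the direct least-squares interface; the formula-based interface appears later in this notebook):

[ ]:
# Ordinary least squares on the untransformed diabetes features;
# add_constant appends the intercept column corresponding to beta_0
ols_fit = sm.OLS(y_data, sm.add_constant(x_data)).fit()
print(ols_fit.params)  # beta_0 (const), then beta_1 ... beta_10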

GLM: Linear Predictor

For a generalized linear model (GLM), we assume that the model is linear in the coefficients \(\beta\), but the individual terms aren’t necessarily linear in the features.

Examples

For example, we may want to do a polynomial fit (I’m dropping the \((i)\) superscript here for readability):

\[\eta = \beta_0 + \beta_{11} x_1+ \beta_{12} x_1^2 + \beta_{21} x_2 + \beta_{22} x_2^2 + \beta_{31} x_3 + \beta_{32} x_3^2 + \beta_{41} x_4 + \beta_{42} x_4^2\]

or, perhaps some of the variables may interact:

\[\eta = \beta_0 + \beta_{12} x_1 x_2 + \beta_{3} x_3 + \beta_{4} x_4\]

or, maybe there’s an exponential function:

\[\eta = \beta_0 + \beta_1 e^{x_1} + \beta_{2} x_2 + \beta_{3} x_3 + \beta_{4} x_4\]
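
All three of these predictors can be expressed with the R-style formula interface used later in this notebook; a sketch, with hypothetical column names y, x1, …, x4 (and assuming numpy is imported as np in the calling scope):

[ ]:
# Polynomial terms: I(...) makes ** act as numerical exponentiation
poly_formula = "y ~ x1 + I(x1**2) + x2 + I(x2**2)"

# An interaction between x1 and x2, plus plain x3 and x4
interaction_formula = "y ~ x1:x2 + x3 + x4"

# An exponential transform of x1
exp_formula = "y ~ np.exp(x1) + x2 + x3 + x4"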

Invalid Examples

An example of what is not a valid linear predictor for a GLM, because the coefficient \(\beta_5\) enters the model nonlinearly (as an exponent) rather than as a linear weight:

\[\eta = \beta_0 + \beta_1 x_1^{\beta_5} + \beta_{2} x_2 + \beta_{3} x_3 + \beta_{4} x_4\]
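
Fitting such a model requires general nonlinear least squares instead; a sketch using scipy (not used elsewhere in this notebook) on synthetic data:

[ ]:
import numpy as np
from scipy.optimize import curve_fit

# A predictor that is nonlinear in beta_5 and therefore outside the GLM framework
def model(X, b0, b1, b2, b3, b4, b5):
    x1, x2, x3, x4 = X
    return b0 + b1 * x1**b5 + b2 * x2 + b3 * x3 + b4 * x4

# Synthetic data; x1 is kept positive so x1**b5 stays real-valued
rng = np.random.default_rng(0)
X = rng.uniform(0.1, 2.0, size=(4, 50))
y = model(X, 1.0, 0.5, -1.0, 2.0, 0.3, 1.7) + rng.normal(0, 0.1, size=50)

params, _ = curve_fit(model, X, y, p0=np.ones(6))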

GLM: Exponential Distribution

The second component of a GLM is called the random component. We assume that the observed data \(Y\) come from an exponential family of distributions (e.g. Gaussian, Poisson, binomial, gamma), which has the general form:

\[p(y^{(i)}; \theta^{(i)}) = a(\theta^{(i)})b(y^{(i)})e^{y^{(i)}Q(\theta^{(i)})}\]

where:

  • \(p\) is the probability density function

  • \(y^{(i)}\) is the observed target value, e.g. the disease_progression in the diabetes example

  • \(\theta^{(i)}\) is a function of the explanatory variables \(x_n^{(i)}\)

  • \(Q(\theta)\) is the natural parameter
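
For example, the Poisson distribution with mean \(\lambda\) fits this template (dropping the \((i)\) superscripts):

\[p(y; \lambda) = \dfrac{\lambda^y e^{-\lambda}}{y!} = e^{-\lambda} \cdot \dfrac{1}{y!} \cdot e^{y \log \lambda}\]

so that \(a(\theta) = e^{-\lambda}\), \(b(y) = 1/y!\), and the natural parameter is \(Q(\theta) = \log \lambda\).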

The third component of a GLM is called the link function:

\[g(\mu^{(i)}) = \eta^{(i)}\]

We differentiate here between \(\eta^{(i)}\), the value of the linear predictor for the \(i^{th}\) sample, and \(y^{(i)} \in Y\), the observed value of the \(i^{th}\) sample. We can think of \(\mu^{(i)}\) as the expected value (mean) of the response for the \(i^{th}\) sample:

\[\mu^{(i)} = E(Y | X = x^{(i)})\]

Our goal is to estimate \(\mu\), not \(\eta\).

The link function that transforms the mean \(\mu\) to the natural parameter \(Q(\theta)\) is called the canonical link function. For the Poisson example above, \(\mu = \lambda\) and \(Q(\theta) = \log \lambda\), so the canonical link is the logarithm.

Let’s start with the example of ordinary linear regression. In that case, the predicted output is simply the expectation value; in other words:

\[g(\mu^{(i)}) = \mu^{(i)} = E(Y | X = x^{(i)}) = \eta^{(i)}\]

So, the link function \(g(\mu)\) is simply the identity function.
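
In statsmodels, this pairing is the default for the Gaussian family; a minimal sketch, assuming a recent statsmodels version (0.14 or later) in which the link classes are capitalized:

[ ]:
# Gaussian family with an explicit identity link; omitting the link
# argument gives the same default
family = sm.families.Gaussian(link=sm.families.links.Identity())
print(type(family.link).__name__)  # Identity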

The random component, the exponential-family distribution, is given by the Gaussian distribution, where \(\theta = \mu\):

\[p(y^{(i)}; \mu^{(i)}) = \dfrac{1}{\sigma\sqrt{2\pi}}e^{-\frac{(y^{(i)}-\mu^{(i)})^2}{2\sigma^2}}\]
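
Expanding the square shows how this matches the exponential-family template above, treating \(\sigma\) as fixed and dropping the \((i)\) superscripts:

\[p(y; \mu) = \underbrace{\dfrac{e^{-\mu^2/2\sigma^2}}{\sigma\sqrt{2\pi}}}_{a(\theta)} \cdot \underbrace{e^{-y^2/2\sigma^2}}_{b(y)} \cdot \underbrace{e^{y\mu/\sigma^2}}_{e^{yQ(\theta)}}\]

so the natural parameter is \(Q(\theta) = \mu/\sigma^2\), proportional to \(\mu\), which is why the identity link is the canonical link for the Gaussian distribution.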

Applying this to the diabetes example, we first define a formula (see Fitting models using R-style formulas for details on how to write formulas in Python):

[7]:
formula = "disease_progression ~ age + bmi + sex"
[8]:
model = smf.glm(formula=formula, data=data, family=sm.families.Gaussian())
results = model.fit(maxiter=100)

print(results.summary())
                  Generalized Linear Model Regression Results
===============================================================================
Dep. Variable:     disease_progression   No. Observations:                  442
Model:                             GLM   Df Residuals:                      438
Model Family:                 Gaussian   Df Model:                            3
Link Function:                Identity   Scale:                          3885.1
Method:                           IRLS   Log-Likelihood:                -2451.7
Date:                 Mon, 17 Mar 2025   Deviance:                   1.7017e+06
Time:                         13:55:12   Pearson chi2:                 1.70e+06
No. Iterations:                      3   Pseudo R-squ. (CS):             0.4146
Covariance Type:             nonrobust
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    132.8957     13.567      9.795      0.000     106.304     159.487
age            0.4998      0.234      2.136      0.033       0.041       0.958
bmi          927.0815     63.541     14.590      0.000     802.543    1051.620
sex           -3.4098      6.043     -0.564      0.573     -15.253       8.433
==============================================================================
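
Since a Gaussian GLM with the identity link is mathematically equivalent to ordinary least squares, a quick sanity check (a sketch) is that smf.ols recovers the same coefficients:

[ ]:
# OLS on the same formula should reproduce the GLM coefficients above
ols_results = smf.ols(formula=formula, data=data).fit()
print(ols_results.params)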

Now that we have trained our GLM, which in this case is just an ordinary linear regression, let’s see how well it predicts on our data:

[9]:
disease_progression_pred = results.predict(data, which="mean", transform=True)
disease_progression_pred
[9]:
26      67.280918
344    128.233317
374    107.249704
79     104.751886
226     93.348294
          ...
211    200.689430
311    204.186929
321    215.178345
204    170.301126
402    133.329999
Length: 442, dtype: float64
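
Because the link is the identity, requesting the linear predictor \(\eta\) instead of the mean \(\mu\) gives the same numbers; a sketch, assuming a statsmodels version that supports which="linear":

[ ]:
import numpy as np

# eta (the linear predictor) equals mu under the identity link
eta_pred = results.predict(data, which="linear")
assert np.allclose(eta_pred, disease_progression_pred)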
[10]:
df = data.copy()
df["disease_progression_pred"] = disease_progression_pred
[15]:
fig = px.line(
    df.sort_values("bmi"),
    x="bmi",
    y="disease_progression",
    facet_col="sex",
    title="",
)

# Build the predicted-value traces separately so they can be restyled
# before being added to the same figure
px_data = px.line(
    df.sort_values("bmi"),
    x="bmi",
    y="disease_progression_pred",
    facet_col="sex",
    title="",
).data
px_data[0]["line"] = {"color": "orange", "dash": "solid"}
px_data[1]["line"] = {"color": "orange", "dash": "solid"}
px_data[0]["legendgroup"] = "predicted"
px_data[1]["legendgroup"] = "predicted"
px_data[1]["showlegend"] = True
px_data[1]["name"] = "predicted"
fig.add_traces(px_data)

# Label the original traces as the observed data
fig.data[0].legendgroup = "observed"
fig.data[1].legendgroup = "observed"
fig.data[1].name = "observed"
fig.data[1].showlegend = True

fig.layout.yaxis.title = {"text": "Disease Progression"}
fig.update_xaxes(title_text="BMI")
fig.update_layout(title="Diabetes Disease Progression Linear Regression")
fig.show()