Generalized Linear Models
[1]:
%load_ext autoreload
%autoreload 2
import plotly.io as pio
pio.renderers.default = "sphinx_gallery"
import plotly.express as px
import pandas as pd
from sklearn import datasets
import statsmodels.api as sm
import statsmodels.formula.api as smf
Background: Ordinary Linear Regression
A generalized linear model is a generalization of ordinary linear regression. In ordinary linear regression, we assume that the response variable, \(y\), can be described by the equation:

\[y^{(i)} = \beta_0 + \beta_1 x_1^{(i)} + \beta_2 x_2^{(i)} + \ldots + \beta_n x_n^{(i)}\]
where:
\(i\) is the index of the sample
\(n\) is the number of features
\(\beta_0, \beta_1, \ldots, \beta_n\) are the coefficients of the model
\(x_1^{(i)}, x_2^{(i)}, \ldots, x_n^{(i)}\) are the features of the \(i\)-th sample
\(y^{(i)}\) is the response variable for the \(i\)-th sample
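In matrix form, each prediction is a dot product of the coefficient vector with the sample's features (plus the intercept). A minimal sketch with made-up numbers — the coefficients and features below are illustrative, not from the diabetes data:

```python
import numpy as np

# Illustrative coefficients: beta_0 (intercept) plus three feature weights
beta = np.array([1.0, 2.0, -0.5, 0.25])

# Two samples with three features each
X = np.array([
    [3.0, 4.0, 8.0],
    [1.0, 0.0, 2.0],
])

# Prepend a column of ones so the intercept is handled by the same dot product
X_design = np.hstack([np.ones((X.shape[0], 1)), X])

# y_hat^(i) = beta_0 + beta_1 x_1^(i) + ... + beta_n x_n^(i)
y_hat = X_design @ beta
print(y_hat)  # [7.  3.5]
```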
For example, let’s suppose we look at the diabetes dataset:
[20]:
# Load diabetes dataset
dataset = datasets.load_diabetes()
x_data = pd.DataFrame(dataset.data, columns=dataset.feature_names)
y_data = pd.DataFrame(dataset.target, columns=["disease_progression"])
data = pd.concat([x_data, y_data], axis=1)
data.sort_values(["age", "sex"], inplace=True)
The diabetes dataset has been transformed into numbers that are somewhat meaningless; for example:
[21]:
print(f"Sex is one of: {data['sex'].unique()}")
print(f"Age is one of:\n{data['age'].unique()}")
Sex is one of: [-0.04464164 0.05068012]
Age is one of:
[-0.10722563 -0.10359309 -0.09996055 -0.09632802 -0.09269548 -0.08906294
-0.0854304 -0.08179786 -0.07816532 -0.07453279 -0.07090025 -0.06726771
-0.06363517 -0.06000263 -0.05637009 -0.05273755 -0.04910502 -0.04547248
-0.04183994 -0.0382074 -0.03457486 -0.03094232 -0.02730979 -0.02367725
-0.02004471 -0.01641217 -0.01277963 -0.00914709 -0.00551455 -0.00188202
0.00175052 0.00538306 0.0090156 0.01264814 0.01628068 0.01991321
0.02354575 0.02717829 0.03081083 0.03444337 0.03807591 0.04170844
0.04534098 0.04897352 0.05260606 0.0562386 0.05987114 0.06350368
0.06713621 0.07076875 0.07440129 0.07803383 0.08166637 0.08529891
0.08893144 0.09256398 0.09619652 0.11072668]
Before using this dataset, we would like sex and age to have meaningful values. To do so, we transform the data using information from Table 1 in the original study:
[16]:
# Original ages
ages_orig = data["age"].unique()
# Correct ages
MIN_AGE = 19
MAX_AGE = MIN_AGE + len(ages_orig)
ages = list(range(MIN_AGE, MAX_AGE))
# Transform the ages in the dataset
age_map = {age_orig: age for age_orig, age in zip(ages_orig, ages)}
data["age"] = data.apply(lambda x: age_map[x["age"]], axis=1)
[17]:
# Original sexes
sexes_orig = data["sex"].unique()
# Correct sexes
sexes = [1, 2]
# Transform the sexes in the dataset
sex_map = {sex_orig: sex for sex_orig, sex in zip(sexes_orig, sexes)}
data["sex"] = data.apply(lambda x: sex_map[x["sex"]], axis=1)
[18]:
data.head()
[18]:
|     | age | sex | bmi       | bp        | s1        | s2        | s3        | s4        | s5        | s6        | disease_progression |
|-----|-----|-----|-----------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|---------------------|
| 26  | 19  | 1   | -0.077342 | -0.026328 | -0.089630 | -0.096198 | 0.026550  | -0.076395 | -0.042571 | -0.005220 | 137.0               |
| 344 | 19  | 1   | -0.011595 | -0.040099 | 0.049341  | 0.064447  | -0.013948 | 0.034309  | 0.007027  | -0.030072 | 200.0               |
| 374 | 19  | 1   | -0.034229 | -0.067642 | -0.063487 | -0.070520 | 0.008142  | -0.039493 | -0.000612 | -0.079778 | 140.0               |
| 79  | 20  | 1   | -0.037463 | -0.026328 | 0.002559  | 0.019980  | 0.011824  | -0.002592 | -0.068332 | -0.025930 | 113.0               |
| 226 | 20  | 2   | -0.046085 | -0.026328 | -0.024960 | -0.024800 | 0.030232  | -0.039493 | -0.039809 | -0.054925 | 77.0                |
Here we have 10 features:
[4]:
print(dataset.feature_names)
['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']
and a response variable y = disease_progression, indicating the disease progression one year after baseline.
Our equation for linear regression would be:

\[\text{disease\_progression}^{(i)} = \beta_0 + \beta_1\,\text{age}^{(i)} + \beta_2\,\text{sex}^{(i)} + \beta_3\,\text{bmi}^{(i)} + \beta_4\,\text{bp}^{(i)} + \ldots + \beta_{10}\,\text{s6}^{(i)}\]

For the first data point (sample), this would be:

\[137.0 = \beta_0 + 19\,\beta_1 + 1\,\beta_2 - 0.077342\,\beta_3 - 0.026328\,\beta_4 - \ldots - 0.005220\,\beta_{10}\]
In ordinary linear regression, we would try to optimize a cost function, typically the least squares cost function, to find values of \(\beta\) which optimize the correct prediction of the \(y\)-values.
GLM: Linear Predictor
For a generalized linear model (GLM), we assume that the model is linear in the coefficients \(\beta\), but the individual terms themselves aren’t necessarily linear in the features.
Examples
For example, we may want to do a polynomial fit (I’m dropping the \((i)\) superscript here for readability):

\[\eta = \beta_0 + \beta_1 x_1 + \beta_2 x_1^2 + \beta_3 x_1^3\]

or, perhaps some of the variables may interact:

\[\eta = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2\]

or, maybe there’s an exponential function:

\[\eta = \beta_0 + \beta_1 e^{x_1}\]
Invalid Examples
An example of what is not a valid function for a GLM:

\[\eta = \beta_0 + \beta_1 x_1^{\beta_2}\]

Here the coefficient \(\beta_2\) enters non-linearly, so the model is not linear in the coefficients.
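The polynomial example above illustrates the key point: even though the fit is cubic in \(x_1\), it is still linear in the coefficients, so ordinary least squares applies once we treat \(x_1^2\) and \(x_1^3\) as extra columns. A small sketch with synthetic data (the coefficients here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-2, 2, 50)
# Synthetic cubic data with a little noise; true coefficients: 1.0, 2.0, -0.5, 0.3
y = 1.0 + 2.0 * x - 0.5 * x**2 + 0.3 * x**3 + rng.normal(scale=0.01, size=x.size)

# Design matrix with columns 1, x, x^2, x^3:
# nonlinear in x, but linear in the coefficients beta
X = np.column_stack([np.ones_like(x), x, x**2, x**3])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)  # close to [1.0, 2.0, -0.5, 0.3]
```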
GLM: Exponential Distribution
The second component of a GLM is called the random component. We assume that the observed data \(Y\) comes from an exponential family of distributions, such as: Gaussian, Poisson, binomial, gamma, etc.:

\[p(y^{(i)}; \theta^{(i)}) = a(\theta^{(i)})\, b(y^{(i)}) \exp\!\left[y^{(i)}\, Q(\theta^{(i)})\right]\]

where:
\(p\) is the probability density function
\(y^{(i)}\) is the observed target value, e.g. the disease_progression in the diabetes example
\(\theta^{(i)}\) is a function of the explanatory variables \(x_n^{(i)}\)
\(Q(\theta)\) is the natural parameter
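As a concrete check, the Poisson distribution fits this form with \(a(\theta) = e^{-\theta}\), \(b(y) = 1/y!\), and \(Q(\theta) = \log\theta\). A quick numeric sketch comparing the factored form to the usual pmf:

```python
import math

def poisson_pmf(y, lam):
    """Standard Poisson pmf: lam^y * exp(-lam) / y!"""
    return lam**y * math.exp(-lam) / math.factorial(y)

def poisson_exp_family(y, theta):
    """Same pmf written as a(theta) * b(y) * exp(y * Q(theta))."""
    a = math.exp(-theta)          # a(theta)
    b = 1 / math.factorial(y)     # b(y)
    Q = math.log(theta)           # natural parameter Q(theta)
    return a * b * math.exp(y * Q)

# The two forms agree for any count y and rate theta
for y in range(5):
    assert math.isclose(poisson_pmf(y, 3.5), poisson_exp_family(y, 3.5))
```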
GLM: Link Function
The third component of a GLM is called the link function:

\[g(\mu^{(i)}) = \eta^{(i)} = \beta_0 + \beta_1 x_1^{(i)} + \ldots + \beta_n x_n^{(i)}\]

We differentiate here between \(\eta^{(i)}\), the predicted value of the \(i^{th}\) sample, and \(y^{(i)} \in Y\), the observed value of the \(i^{th}\) sample. We can think of \(\mu^{(i)}\) as the expectation value (mean) of an exponential distribution:

\[\mu^{(i)} = E\left[Y^{(i)}\right]\]

Our goal is to estimate \(\mu\), not \(\eta\).
The link function that transforms the mean \(\mu\) to the natural parameter \(Q(\theta)\) is called the canonical link function.
Example 1: Identity Link Function, Ordinary Linear Regression
Let’s start with the example of ordinary linear regression. In that case, the predicted output is simply the expectation value; in other words:

\[\eta^{(i)} = \mu^{(i)}\]

So, the link function \(g(\mu)\) is simply the identity function.
The random component, the exponential family distribution, is the Gaussian distribution, where \(\theta = \mu\):

\[p(y^{(i)}; \mu^{(i)}, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left[-\frac{\left(y^{(i)} - \mu^{(i)}\right)^2}{2\sigma^2}\right]\]
Applying this to the diabetes example, we first define a formula (see Fitting models using R-style formulas for details on how to write formulas in Python):
[7]:
formula = "disease_progression ~ age + bmi + sex"
[8]:
model = smf.glm(formula=formula, data=data, family=sm.families.Gaussian())
results = model.fit(maxiter=100)
print(results.summary())
Generalized Linear Model Regression Results
===============================================================================
Dep. Variable: disease_progression No. Observations: 442
Model: GLM Df Residuals: 438
Model Family: Gaussian Df Model: 3
Link Function: Identity Scale: 3885.1
Method: IRLS Log-Likelihood: -2451.7
Date: Mon, 17 Mar 2025 Deviance: 1.7017e+06
Time: 13:55:12 Pearson chi2: 1.70e+06
No. Iterations: 3 Pseudo R-squ. (CS): 0.4146
Covariance Type: nonrobust
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 132.8957 13.567 9.795 0.000 106.304 159.487
age 0.4998 0.234 2.136 0.033 0.041 0.958
bmi 927.0815 63.541 14.590 0.000 802.543 1051.620
sex -3.4098 6.043 -0.564 0.573 -15.253 8.433
==============================================================================
Now that we have trained our GLM, which is technically just a linear regression, let’s see how well it predicts on our data:
[9]:
disease_progression_pred = results.predict(data, which="mean", transform=True)
disease_progression_pred
[9]:
26 67.280918
344 128.233317
374 107.249704
79 104.751886
226 93.348294
...
211 200.689430
311 204.186929
321 215.178345
204 170.301126
402 133.329999
Length: 442, dtype: float64
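As a sanity check, the first fitted value can be reproduced by hand from the coefficient table in the regression summary and the first row of the data:

```python
# Coefficients copied (rounded) from the regression summary
intercept, b_age, b_bmi, b_sex = 132.8957, 0.4998, 927.0815, -3.4098

# First row of the sorted data: age=19, sex=1, bmi=-0.077342
pred = intercept + b_age * 19 + b_bmi * (-0.077342) + b_sex * 1
print(round(pred, 2))  # ~67.28, matching the first predicted value above
```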
[10]:
df = data.copy()
df["disease_progression_pred"] = disease_progression_pred
[15]:
fig = px.line(
df.sort_values("bmi"),
x="bmi",
y="disease_progression",
facet_col="sex",
title=""
)
px_data = px.line(
df.sort_values("bmi"),
x="bmi",
y="disease_progression_pred",
facet_col="sex",
title=""
).data
px_data[0]["line"] = {"color": "orange", "dash": "solid"}
px_data[1]["line"] = {"color": "orange", "dash": "solid"}
px_data[0]["legendgroup"] = "predicted"
px_data[1]["legendgroup"] = "predicted"
px_data[1]["showlegend"] = True
px_data[1]["name"] = "predicted"
fig.add_traces(px_data)
fig.data[0].legendgroup = "observed"
fig.data[1].legendgroup = "observed"
fig.data[1].name = "observed"
fig.data[1].showlegend = True
fig.layout.yaxis.title = {'text': 'Disease Progression'}
fig.update_xaxes(title_text="BMI")
fig.update_layout(title="Diabetes Disease Progression Linear Regression")
fig.show()