Generalized Linear Models

[1]:
%load_ext autoreload
%autoreload 2

import plotly.io as pio
pio.renderers.default = "sphinx_gallery"
import plotly.express as px
import pandas as pd
from sklearn import datasets
import statsmodels.api as sm
import statsmodels.formula.api as smf

Background: Ordinary Linear Regression

A generalized linear model is a generalization of ordinary linear regression. In ordinary linear regression, we assume that the response variable, \(y\), can be described (up to random noise) by the equation:

\[y^{(i)} = \beta_0 + \beta_1 x_1^{(i)} + \beta_2 x_2^{(i)} + \ldots + \beta_n x_n^{(i)}\]

where:

  • \(i\) is the index of the sample

  • \(n\) is the number of features

  • \(\beta_0, \beta_1, \ldots, \beta_n\) are the coefficients of the model

  • \(x_1^{(i)}, x_2^{(i)}, \ldots, x_n^{(i)}\) are the features of the \(i\)-th sample

  • \(y^{(i)}\) is the response variable for the \(i\)-th sample
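
As a minimal sketch of what this equation computes, with made-up coefficient and feature values (not fitted to any data):

[ ]:
import numpy as np

# Hypothetical coefficients [beta_0, beta_1, beta_2, beta_3] for n = 3 features
beta = np.array([1.0, 0.5, -2.0, 0.25])

# Features x_1, x_2, x_3 of a single sample i
x_i = np.array([0.3, 1.2, -0.7])

# y_i = beta_0 + beta_1 * x_1 + beta_2 * x_2 + beta_3 * x_3
y_i = beta[0] + beta[1:] @ x_i
print(y_i)  # -1.425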

For example, let’s look at the diabetes dataset:

[20]:
# Load diabetes dataset
dataset = datasets.load_diabetes()
x_data = pd.DataFrame(dataset.data, columns=dataset.feature_names)
y_data = pd.DataFrame(dataset.target, columns=["disease_progression"])
data = pd.concat([x_data, y_data], axis=1)
data.sort_values(["age", "sex"], inplace=True)

The features in the diabetes dataset have been mean-centered and scaled, so their raw values are hard to interpret; for example:

[21]:
print(f"Sex is one of: {data['sex'].unique()}")
print(f"Age is one of:\n{data['age'].unique()}")
Sex is one of: [-0.04464164  0.05068012]
Age is one of:
[-0.10722563 -0.10359309 -0.09996055 -0.09632802 -0.09269548 -0.08906294
 -0.0854304  -0.08179786 -0.07816532 -0.07453279 -0.07090025 -0.06726771
 -0.06363517 -0.06000263 -0.05637009 -0.05273755 -0.04910502 -0.04547248
 -0.04183994 -0.0382074  -0.03457486 -0.03094232 -0.02730979 -0.02367725
 -0.02004471 -0.01641217 -0.01277963 -0.00914709 -0.00551455 -0.00188202
  0.00175052  0.00538306  0.0090156   0.01264814  0.01628068  0.01991321
  0.02354575  0.02717829  0.03081083  0.03444337  0.03807591  0.04170844
  0.04534098  0.04897352  0.05260606  0.0562386   0.05987114  0.06350368
  0.06713621  0.07076875  0.07440129  0.07803383  0.08166637  0.08529891
  0.08893144  0.09256398  0.09619652  0.11072668]

Before using this dataset, we would like sex and age to have meaningful values. To do so, we transform the data using information from Table 1 in the original study:

[16]:
# Original ages
ages_orig = data["age"].unique()

# Correct ages
MIN_AGE = 19
MAX_AGE = MIN_AGE + len(ages_orig)
ages = list(range(MIN_AGE, MAX_AGE))

# Transform the ages in the dataset (the unique values are in ascending
# order because the data was sorted by age above)
age_map = {age_orig: age for age_orig, age in zip(ages_orig, ages)}
data["age"] = data["age"].map(age_map)
[17]:
# Original sexes
sexes_orig = data["sex"].unique()

# Correct sexes
sexes = [1, 2]

# Transform the sexes in the dataset
sex_map = {sex_orig: sex for sex_orig, sex in zip(sexes_orig, sexes)}
data["sex"] = data["sex"].map(sex_map)
[18]:
data.head()
[18]:
     age  sex       bmi        bp        s1        s2        s3        s4        s5        s6  disease_progression
26    19    1 -0.077342 -0.026328 -0.089630 -0.096198  0.026550 -0.076395 -0.042571 -0.005220                137.0
344   19    1 -0.011595 -0.040099  0.049341  0.064447 -0.013948  0.034309  0.007027 -0.030072                200.0
374   19    1 -0.034229 -0.067642 -0.063487 -0.070520  0.008142 -0.039493 -0.000612 -0.079778                140.0
79    20    1 -0.037463 -0.026328  0.002559  0.019980  0.011824 -0.002592 -0.068332 -0.025930                113.0
226   20    2 -0.046085 -0.026328 -0.024960 -0.024800  0.030232 -0.039493 -0.039809 -0.054925                 77.0

Here we have 10 features:

[4]:
print(dataset.feature_names)
['age', 'sex', 'bmi', 'bp', 's1', 's2', 's3', 's4', 's5', 's6']

and a response variable \(y\) = disease_progression, a quantitative measure of disease progression one year after baseline.

Our equation for linear regression would be:

\[y^{(i)} = \beta_0 + \beta_1 x_1^{(i)} + \ldots + \beta_{10} x_{10}^{(i)}\]

For the first data point (sample), this would be:

\[137.0 = \beta_0 + \beta_1 \cdot 19 + \beta_2 \cdot 1 + \beta_3 \cdot (-0.077342) + \ldots + \beta_{10} \cdot (-0.005220)\]

In ordinary linear regression, we minimize a cost function, typically the least-squares cost, to find the values of \(\beta\) that best predict the \(y\)-values.
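
As a sketch of this step using the statsmodels API imported above (sm.OLS is the direct least-squares interface; the formula-based interface appears later in this notebook):

[ ]:
# Ordinary least squares on the untransformed diabetes features;
# add_constant appends the intercept column corresponding to beta_0
ols_fit = sm.OLS(y_data, sm.add_constant(x_data)).fit()
print(ols_fit.params)  # beta_0 (const), then beta_1 ... beta_10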

GLM: Linear Predictor

For a generalized linear model (GLM), we assume that the model is linear in the coefficients \(\beta\), but the individual terms aren’t necessarily linear in the features.

Examples

For example, we may want to do a polynomial fit (I’m dropping the \((i)\) superscript here for readability):

\[\eta = \beta_0 + \beta_{11} x_1+ \beta_{12} x_1^2 + \beta_{21} x_2 + \beta_{22} x_2^2 + \beta_{31} x_3 + \beta_{32} x_3^2 + \beta_{41} x_4 + \beta_{42} x_4^2\]

or, perhaps some of the variables may interact:

\[\eta = \beta_0 + \beta_{12} x_1 x_2 + \beta_{3} x_3 + \beta_{4} x_4\]

or, maybe there’s an exponential function:

\[\eta = \beta_0 + \beta_1 e^{x_1} + \beta_{2} x_2 + \beta_{3} x_3 + \beta_{4} x_4\]
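
All three of these predictors can be expressed with the R-style formula interface used later in this notebook; a sketch, with hypothetical column names y, x1, …, x4 (and assuming numpy is imported as np in the calling scope):

[ ]:
# Polynomial terms: I(...) makes ** act as numerical exponentiation
poly_formula = "y ~ x1 + I(x1**2) + x2 + I(x2**2)"

# An interaction between x1 and x2, plus plain x3 and x4
interaction_formula = "y ~ x1:x2 + x3 + x4"

# An exponential transform of x1
exp_formula = "y ~ np.exp(x1) + x2 + x3 + x4"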

Invalid Examples

An example of what is not a valid linear predictor for a GLM, because the coefficient \(\beta_5\) enters the model nonlinearly (as an exponent) rather than as a linear weight:

\[\eta = \beta_0 + \beta_1 x_1^{\beta_5} + \beta_{2} x_2 + \beta_{3} x_3 + \beta_{4} x_4\]
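
Fitting such a model requires general nonlinear least squares instead; a sketch using scipy (not used elsewhere in this notebook) on synthetic data:

[ ]:
import numpy as np
from scipy.optimize import curve_fit

# A predictor that is nonlinear in beta_5 and therefore outside the GLM framework
def model(X, b0, b1, b2, b3, b4, b5):
    x1, x2, x3, x4 = X
    return b0 + b1 * x1**b5 + b2 * x2 + b3 * x3 + b4 * x4

# Synthetic data; x1 is kept positive so x1**b5 stays real-valued
rng = np.random.default_rng(0)
X = rng.uniform(0.1, 2.0, size=(4, 50))
y = model(X, 1.0, 0.5, -1.0, 2.0, 0.3, 1.7) + rng.normal(0, 0.1, size=50)

params, _ = curve_fit(model, X, y, p0=np.ones(6))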

GLM: Exponential Distribution

The second component of a GLM is called the random component. We assume that the observed data \(Y\) come from an exponential family of distributions (e.g. Gaussian, Poisson, binomial, gamma), which has the general form:

\[p(y^{(i)}; \theta^{(i)}) = a(\theta^{(i)})b(y^{(i)})e^{y^{(i)}Q(\theta^{(i)})}\]

where:

  • \(p\) is the probability density function

  • \(y^{(i)}\) is the observed target value, e.g. the disease_progression in the diabetes example

  • \(\theta^{(i)}\) is a function of the explanatory variables \(x_n^{(i)}\)

  • \(Q(\theta)\) is the natural parameter
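
For example, the Poisson distribution with mean \(\lambda\) fits this template (dropping the \((i)\) superscripts):

\[p(y; \lambda) = \dfrac{\lambda^y e^{-\lambda}}{y!} = e^{-\lambda} \cdot \dfrac{1}{y!} \cdot e^{y \log \lambda}\]

so that \(a(\theta) = e^{-\lambda}\), \(b(y) = 1/y!\), and the natural parameter is \(Q(\theta) = \log \lambda\).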

The third component of a GLM is called the link function:

\[g(\mu^{(i)}) = \eta^{(i)}\]

We differentiate here between \(\eta^{(i)}\), the value of the linear predictor for the \(i^{th}\) sample, and \(y^{(i)} \in Y\), the observed value of the \(i^{th}\) sample. We can think of \(\mu^{(i)}\) as the expected value (mean) of the response for the \(i^{th}\) sample:

\[\mu^{(i)} = E(Y | X = x^{(i)})\]

Our goal is to estimate \(\mu\), not \(\eta\).

The link function that transforms the mean \(\mu\) to the natural parameter \(Q(\theta)\) is called the canonical link function. For the Poisson example above, \(\mu = \lambda\) and \(Q(\theta) = \log \lambda\), so the canonical link is the logarithm.

Let’s start with the example of ordinary linear regression. In that case, the predicted output is simply the expectation value; in other words:

\[g(\mu^{(i)}) = \mu^{(i)} = E(Y | X = x^{(i)}) = \eta^{(i)}\]

So, the link function \(g(\mu)\) is simply the identity function.
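
In statsmodels, this pairing is the default for the Gaussian family; a minimal sketch, assuming a recent statsmodels version (0.14 or later) in which the link classes are capitalized:

[ ]:
# Gaussian family with an explicit identity link; omitting the link
# argument gives the same default
family = sm.families.Gaussian(link=sm.families.links.Identity())
print(type(family.link).__name__)  # Identity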

The random component, the exponential-family distribution, is given by the Gaussian distribution, where \(\theta = \mu\):

\[p(y^{(i)}; \mu^{(i)}) = \dfrac{1}{\sigma\sqrt{2\pi}}e^{-\frac{(y^{(i)}-\mu^{(i)})^2}{2\sigma^2}}\]
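
Expanding the square shows how this matches the exponential-family template above, treating \(\sigma\) as fixed and dropping the \((i)\) superscripts:

\[p(y; \mu) = \underbrace{\dfrac{e^{-\mu^2/2\sigma^2}}{\sigma\sqrt{2\pi}}}_{a(\theta)} \cdot \underbrace{e^{-y^2/2\sigma^2}}_{b(y)} \cdot \underbrace{e^{y\mu/\sigma^2}}_{e^{yQ(\theta)}}\]

so the natural parameter is \(Q(\theta) = \mu/\sigma^2\), proportional to \(\mu\), which is why the identity link is the canonical link for the Gaussian distribution.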

Applying this to the diabetes example, we first define a formula (see Fitting models using R-style formulas for details on how to write formulas in Python):

[7]:
formula = "disease_progression ~ age + bmi + sex"
[8]:
model = smf.glm(formula=formula, data=data, family=sm.families.Gaussian())
results = model.fit(maxiter=100)

print(results.summary())
                  Generalized Linear Model Regression Results
===============================================================================
Dep. Variable:     disease_progression   No. Observations:                  442
Model:                             GLM   Df Residuals:                      438
Model Family:                 Gaussian   Df Model:                            3
Link Function:                Identity   Scale:                          3885.1
Method:                           IRLS   Log-Likelihood:                -2451.7
Date:                 Mon, 17 Mar 2025   Deviance:                   1.7017e+06
Time:                         13:55:12   Pearson chi2:                 1.70e+06
No. Iterations:                      3   Pseudo R-squ. (CS):             0.4146
Covariance Type:             nonrobust
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    132.8957     13.567      9.795      0.000     106.304     159.487
age            0.4998      0.234      2.136      0.033       0.041       0.958
bmi          927.0815     63.541     14.590      0.000     802.543    1051.620
sex           -3.4098      6.043     -0.564      0.573     -15.253       8.433
==============================================================================
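
Since a Gaussian GLM with the identity link is mathematically equivalent to ordinary least squares, a quick sanity check (a sketch) is that smf.ols recovers the same coefficients:

[ ]:
# OLS on the same formula should reproduce the GLM coefficients above
ols_results = smf.ols(formula=formula, data=data).fit()
print(ols_results.params)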

Now that we have trained our GLM, which in this case is just an ordinary linear regression, let’s see how well it predicts on our data:

[9]:
disease_progression_pred = results.predict(data, which="mean", transform=True)
disease_progression_pred
[9]:
26      67.280918
344    128.233317
374    107.249704
79     104.751886
226     93.348294
          ...
211    200.689430
311    204.186929
321    215.178345
204    170.301126
402    133.329999
Length: 442, dtype: float64
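
Because the link is the identity, requesting the linear predictor \(\eta\) instead of the mean \(\mu\) gives the same numbers; a sketch, assuming a statsmodels version that supports which="linear":

[ ]:
import numpy as np

# eta (the linear predictor) equals mu under the identity link
eta_pred = results.predict(data, which="linear")
assert np.allclose(eta_pred, disease_progression_pred)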
[10]:
df = data.copy()
df["disease_progression_pred"] = disease_progression_pred
[15]:
fig = px.line(
    df.sort_values("bmi"),
    x="bmi",
    y="disease_progression",
    facet_col="sex",
    title="",
)

# Build the predicted-value traces separately so they can be restyled
# before being added to the same figure
px_data = px.line(
    df.sort_values("bmi"),
    x="bmi",
    y="disease_progression_pred",
    facet_col="sex",
    title="",
).data
px_data[0]["line"] = {"color": "orange", "dash": "solid"}
px_data[1]["line"] = {"color": "orange", "dash": "solid"}
px_data[0]["legendgroup"] = "predicted"
px_data[1]["legendgroup"] = "predicted"
px_data[1]["showlegend"] = True
px_data[1]["name"] = "predicted"
fig.add_traces(px_data)

# Label the original traces as the observed data
fig.data[0].legendgroup = "observed"
fig.data[1].legendgroup = "observed"
fig.data[1].name = "observed"
fig.data[1].showlegend = True

fig.layout.yaxis.title = {"text": "Disease Progression"}
fig.update_xaxes(title_text="BMI")
fig.update_layout(title="Diabetes Disease Progression Linear Regression")
fig.show()