Antibiotic Exposure Model¶
In this section, we will describe the model used to predict the number of antibiotics prescribed to infants in their first year of life. This model will be incorporated into the risk factors for developing asthma later in life.
Datasets¶
We obtained data from the BC Ministry of Health on the number of antibiotics prescribed to
infants in their first year of life. The data is available for the years 2000 to 2018. The
data is formatted as follows:
| Column | Type | Description |
|---|---|---|
year |
int
|
format XXXX, e.g 2000, range [2000, 2018]
|
sex |
int
|
1 = Female, 2 = Male |
n_abx |
int
|
The total number of antibiotic courses prescribed to infants in BC in a given year. Note that this is not per infant, it is the total for all infants. |
Since the n_abx column gives us the total number of antibiotics prescribed, we need to use
population data to convert this to a per infant value. We obtained population data from the
StatCan census data (the same that is used in the Birth module):
| Column | Type | Description |
|---|---|---|
year |
int
|
format XXXX, e.g 2000, range [1999, 2021]
|
sex |
int
|
"M" = male, "F" = female
|
n_birth |
int
|
The total number of births in BC in a given year. |
Model: Generalized Linear Model - Negative Binomial¶
Since our model projects into the future, we would like to be able to extend this data beyond
2018. To obtain these projections, we use a Generalized Linear Model (GLM). A GLM is a
type of regression analysis which is a generalized form of linear regression.
See Generalized Linear Models for more information on GLMs.
Probability Distribution¶
When fitting a GLM, first you must choose a distribution for the response variable. In our
case, the response variable is the number of antibiotics prescribed during the first year of life.
The number of antibiotics prescribed is a count variable, in a given time interval
(a year, in our case). Since it is count data, we need a discrete probability distribution.
The Poisson distribution is a good choice for our data, but it has some limitations. The
Poisson distribution assumes that the mean and variance are equal, i.e:
However, in our data, the variance is greater than the mean. This is a common problem in count data,
and it is called overdispersion. The Negative Binomial distribution is a generalization of
the Poisson distribution that allows for overdispersion. The Negative Binomial distribution
has an extra parameter, \(\theta\), which controls the amount of overdispersion. Typically,
the distribution is written as:
where:
\(k\) is the number of failures before \(r\) successes occur
\(p\) is the probability of a success
\(r\) is the number of successes
We can reparametrize this with \(\mu\) and \(\theta\) using the following equations:
where \(\mu\) is the mean, \(\sigma^2\) is the variance, and \(\theta\) is the overdispersion parameter.
Doing some algebra, we have:
Letting \(y = k\), we have:
We added an upper bound on the mean parameter to prevent unrealistic extrapolation:
In other words, we are saying that the mean number of antibiotics prescribed to an infant in their
first year of life is less than or equal to 0.05.
So we have:
Link Function¶
We also need to choose a link function. Recall that the link function \(g(\mu^{(i)})\)
is used to relate the mean to the predicted value \(\eta^{(i)}\):
How do we choose a link function? Well, we are free to choose any link function we like, but there
are some constraints. For example, in the Negative Binomial distribution, the mean is always
>= 0. However, \(\eta^{(i)}\) can be any real number. Therefore, we need a link function
that maps real numbers to non-negative numbers. The log link function is a good choice for this:
Formula¶
Now that we have our distribution and link function, we need to decide on a formula for \(\eta^{(i)}\). We are permitted to use linear combinations of functions of the features in our dataset.
For our dataset, we want a formula using sex and year. Since prescribing practices change
over time, and since infections requiring antibiotic prescriptions also change over time,
we should include year in our formula. We also want to include sex, since there are sex differences
in antibiotic prescriptions.
There is an additional factor specific to BC regulations. In 2005, the BC government introduced
an antibiotic conservation program, which reduced the number of antibiotics prescribed
[Mamun, 2019]. It stands to reason that the formula may change before and after 2005. To
account for this, we will introduce a Heaviside step function, which returns 0 for values
below a given threshold, and 1 for values above the threshold. In our case, the threshold
is 2005.
where:
Variable |
Description |
|---|---|
\(\eta^{(i)}\) |
the predicted number of antibiotics per capita prescribed for an infant. |
\(\beta_0\) |
an intercept constant |
\(s^{(i)}\) |
sex |
\(\beta_{s}\) |
sex constant |
\(t^{(i)}\) |
year of birth of the infant |
\(\beta_t\) |
year constant |
\(H(t^{(i)} - 2005)\) |
Heaviside step function, which is |
\(\beta_h\) |
Heaviside step function constant |
\(\beta_{th}\) |
year and Heaviside step function constant |
Example¶
Setup¶
%load_ext autoreload
%autoreload 2
import plotly.io as pio
pio.renderers.default = "sphinx_gallery"
import plotly.express as px
import pandas as pd
import numpy as np
import itertools
import statsmodels.api as sm
import statsmodels.formula.api as smf
from leap.data_generation.utils import heaviside
from leap.data_generation.antibiotic_data import load_birth_data
# Plot Settings
pio.templates.default = "plotly_white"
config = {
'toImageButtonOptions': {
'format': 'png',
'scale': 2
}
}
Data¶
Since the antibiotic dosing data is private, we will create here a mock dataset. In this dataset, we will have the total number of courses of antibiotics prescribed to all infants in BC, stratified by sex and year.
# Data was collected for the years 2000-2018
years = np.arange(2000, 2018)
# Mock data for antibiotic courses prescribed to infants using an approximate linear trend
n_abx_female = [-640 * i + 16000 for i, _ in enumerate(years)]
n_abx_male = [-880 * i + 22000 for i, _ in enumerate(years)]
n_abx = n_abx_female + n_abx_male
# Create a DataFrame for antibiotic courses prescribed to infants
df_abx = pd.DataFrame(
data=list(itertools.product(["F", "M"], years)),
columns=["sex", "year"]
)
df_abx["n_abx"] = n_abx
df_abx.head()
| sex | year | n_abx | |
|---|---|---|---|
| 0 | F | 2000 | 16000 |
| 1 | F | 2001 | 15360 |
| 2 | F | 2002 | 14720 |
| 3 | F | 2003 | 14080 |
| 4 | F | 2004 | 13440 |
Next, since the antibiotic data measures the number of antibiotics prescribed at a population level,
we need to use population data from Statistics Canada to determine the number of antibiotics
prescribed per capita:
df_birth = load_birth_data()
df_birth.head()
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
File /opt/hostedtoolcache/Python/3.10.19/x64/lib/python3.10/site-packages/pandas/core/indexes/base.py:3812, in Index.get_loc(self, key)
3811 try:
-> 3812 return self._engine.get_loc(casted_key)
3813 except KeyError as err:
File pandas/_libs/index.pyx:167, in pandas._libs.index.IndexEngine.get_loc()
File pandas/_libs/index.pyx:196, in pandas._libs.index.IndexEngine.get_loc()
File pandas/_libs/hashtable_class_helper.pxi:7088, in pandas._libs.hashtable.PyObjectHashTable.get_item()
File pandas/_libs/hashtable_class_helper.pxi:7096, in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: 'year'
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
Cell In[4], line 1
----> 1 df_birth = load_birth_data()
2 df_birth.head()
File /opt/hostedtoolcache/Python/3.10.19/x64/lib/python3.10/site-packages/leap/data_generation/antibiotic_data.py:88, in load_birth_data(province, min_year, max_year)
81 df.rename(
82 columns={"REF_DATE": "year", "GEO": "province", "SEX": "sex", "VALUE": "n_birth"},
83 inplace=True
84 )
86 # select only the age = 0 age group and the years where min_year <= year <= max_year
87 df = df.loc[
---> 88 (df["year"] >= min_year) &
89 (df["year"] <= max_year) &
90 (df["AGE_GROUP"] == "0 years")
91 ]
93 # select only the columns we need
94 df = df[["year", "province", "sex", "n_birth"]]
File /opt/hostedtoolcache/Python/3.10.19/x64/lib/python3.10/site-packages/pandas/core/frame.py:4113, in DataFrame.__getitem__(self, key)
4111 if self.columns.nlevels > 1:
4112 return self._getitem_multilevel(key)
-> 4113 indexer = self.columns.get_loc(key)
4114 if is_integer(indexer):
4115 indexer = [indexer]
File /opt/hostedtoolcache/Python/3.10.19/x64/lib/python3.10/site-packages/pandas/core/indexes/base.py:3819, in Index.get_loc(self, key)
3814 if isinstance(casted_key, slice) or (
3815 isinstance(casted_key, abc.Iterable)
3816 and any(isinstance(x, slice) for x in casted_key)
3817 ):
3818 raise InvalidIndexError(key)
-> 3819 raise KeyError(key) from err
3820 except TypeError:
3821 # If we have a listlike key, _check_indexing_error will raise
3822 # InvalidIndexError. Otherwise we fall through and re-raise
3823 # the TypeError.
3824 self._check_indexing_error(key)
KeyError: 'year'
We can now merge the df_abx and df_birth dataframes:
df = pd.merge(df_abx, df_birth, on=["year", "sex"], how="left")
df.head()
color_map={
"M": "#09bfc4",
"F": "#f39c12",
}
fig = px.line(
df,
x="year",
y="n_abx",
facet_col="sex",
color="sex",
labels={"n_abx": "Total Number of Antibiotic Courses Prescribed to Infant Population"},
color_discrete_map=color_map
)
fig.update_xaxes(title_text="Birth Year", nticks=10)
fig.update_yaxes(title_font=dict(size=12))
fig.show()
Model¶
df["sex"] = df.apply(lambda x: 1 if x["sex"] == "F" else 2, axis=1)
formula = "n_abx ~ year + sex + heaviside(year, 2005) * year"
θ = 800
alpha = 1 / θ
# Fit the GLM model
model = smf.glm(
formula=formula,
data=df,
family=sm.families.NegativeBinomial(alpha=alpha),
offset=np.log(df["n_birth"])
)
results = model.fit(maxiter=1000)
print(results.summary())
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: n_abx No. Observations: 36
Model: GLM Df Residuals: 31
Model Family: NegativeBinomial Df Model: 4
Link Function: Log Scale: 1.0000
Method: IRLS Log-Likelihood: -260.35
Date: Tue, 12 Aug 2025 Deviance: 18.039
Time: 13:53:48 Pearson chi2: 18.0
No. Iterations: 5 Pseudo R-squ. (CS): 1.000
Covariance Type: nonrobust
==============================================================================================
coef std err z P>|z| [0.025 0.975]
----------------------------------------------------------------------------------------------
Intercept 74.9737 16.193 4.630 0.000 43.236 106.711
year -0.0377 0.008 -4.664 0.000 -0.054 -0.022
sex 0.2595 0.012 21.258 0.000 0.236 0.283
heaviside(year, 2005) 87.6702 16.652 5.265 0.000 55.033 120.308
heaviside(year, 2005):year -0.0437 0.008 -5.259 0.000 -0.060 -0.027
==============================================================================================
Now that the model has been trained, we can obtain the $\beta$ parameters:
β_0 = results.params["Intercept"]
β_t = results.params["year"]
β_s = results.params["sex"]
β_h = results.params["heaviside(year, 2005)"]
β_th = results.params["heaviside(year, 2005):year"]
Example: Sex = Male, Year = 2008¶
sex = 2
year = 2008
η = β_0 + β_s * sex + β_t * year + β_h * heaviside(year, 2005) + β_th * heaviside(year, 2005) * year
Now, $\eta$ is the number of antibiotics prescribed per number of infants; what we want to know, is
given an age, birth year, and sex, how many antibiotics were prescribed to that person during
infancy? To do so, we need to use the Negative Binomial distribution.
First, we compute $\mu$, using the link function.
µ = np.exp(η)
Next, we make sure that $\mu > 0.05$:
µ = max(µ, 0.05)
print(f"µ = {µ:.6f}")
µ = 0.674037
The standard parametrization of the Negative Binomial function uses $p$ and $r$, not $\mu$. So, we need to convert $\mu$ to $p$:
p = θ / (θ + µ)
r = θ
print(f"p = {p:.6f}, r = {r:.4f}")
p = 0.999158, r = 800.0000
The Negative Binomial distribution gives us the probability of an infant being prescribed $k$
courses of antibiotics during infancy:
$$ \begin{aligned} P(Y = k; r, p) := \binom{k+r-1}{k}(1-p)^k p^r \end{aligned} $$
n_abx = np.random.negative_binomial(r, p)
sex = "female" if sex == 1 else "male"
print(f"Number of courses of antibiotics prescribed during infancy to a {sex} person born in {year}: {n_abx}")
Number of courses of antibiotics prescribed during infancy to a male person born in 2008: 1
The number of antibiotics obtained is just a sampling from the distribution; if we repeated the selection multiple times, we could get different values. To visualize this, we can look at the distribution:
n_abx = np.random.negative_binomial(r, p, size=100000)
fig = px.histogram(
n_abx,
nbins=10,
title=f"Distribution of Antibiotic Courses Prescribed in Infancy<br>Sex = {sex}, Birth Year = {year}",
color_discrete_sequence=["#09bfc4"],
width=600
)
fig.update_xaxes(title_text="Number of Courses of Antibiotics", nticks=10)
fig.update_yaxes(title_text="Count")
fig.update_layout(
legend_title_text="",
title_x=0.5,
title_y=0.95,
margin=dict(t=50), # Adjust top margin for title
showlegend=False
)
fig.show(config=config)
Example: Visualize Multiple Years / Sexes¶
The example above was just the distribution for a single year and sex; we can visualize multiple years / sexes together:
def compute_abx_distribution_parameters(sex, year, β_0, β_s, β_t, β_h, β_th, θ):
η = (
β_0 + β_s * sex + β_t * year +
β_h * heaviside(year, 2005) +
β_th * heaviside(year, 2005) * year
)
µ = np.exp(η)
µ = max(µ, 0.05) # Ensure µ is greater than 0.05
p = θ / (θ + µ)
return p
Let’s create a dataframe with even years from 2000 to 2018:
sexes = [1, 2]
years = np.arange(2000, 2018, step=2)
df_pred = pd.DataFrame(
data=list(itertools.product(sexes, years)),
columns=["sex", "year"],
dtype='Int64'
)
df_pred.head()
| sex | year | |
|---|---|---|
| 0 | 1 | 2000 |
| 1 | 1 | 2002 |
| 2 | 1 | 2004 |
| 3 | 1 | 2006 |
| 4 | 1 | 2008 |
Next, let’s find $p$ and $r$ for each year / sex combination:
df_pred["p"] = df_pred.apply(
lambda x: compute_abx_distribution_parameters(x["sex"], x["year"], β_0, β_s, β_t, β_h, β_th, θ),
axis=1
)
df_pred["r"] = df_pred.apply(
lambda x: r,
axis=1
)
df_pred.head()
| sex | year | p | r | |
|---|---|---|---|---|
| 0 | 1 | 2000 | 0.998991 | 800 |
| 1 | 1 | 2002 | 0.999064 | 800 |
| 2 | 1 | 2004 | 0.999132 | 800 |
| 3 | 1 | 2006 | 0.999236 | 800 |
| 4 | 1 | 2008 | 0.999350 | 800 |
We now create a second dataframe, which we will use to sample the Negative Binomial
distribution.
df_dist = pd.DataFrame(
data=list(itertools.product(sexes, years, np.zeros(1000))),
columns=["sex", "year", "nabx"],
dtype=int
)
df_dist = pd.merge(df_dist, df_pred, on=["year", "sex"])
df_dist.head()
| sex | year | nabx | p | r | |
|---|---|---|---|---|---|
| 0 | 1 | 2000 | 0 | 0.998991 | 800 |
| 1 | 1 | 2000 | 0 | 0.998991 | 800 |
| 2 | 1 | 2000 | 0 | 0.998991 | 800 |
| 3 | 1 | 2000 | 0 | 0.998991 | 800 |
| 4 | 1 | 2000 | 0 | 0.998991 | 800 |
grouped_df = df_dist.groupby(["sex", "year"])
df_dist["nabx"] = grouped_df.apply(
lambda x: np.random.negative_binomial(x["r"], x["p"], size=len(x)), include_groups=False
).explode().astype(int).values
df_dist.head()
| sex | year | nabx | p | r | |
|---|---|---|---|---|---|
| 0 | 1 | 2000 | 1 | 0.998991 | 800 |
| 1 | 1 | 2000 | 2 | 0.998991 | 800 |
| 2 | 1 | 2000 | 1 | 0.998991 | 800 |
| 3 | 1 | 2000 | 3 | 0.998991 | 800 |
| 4 | 1 | 2000 | 1 | 0.998991 | 800 |
Now we can visualize our results:
color_map={
2: "#09bfc4",
1: "#f39c12",
}
fig = px.histogram(
df_dist,
x="nabx",
facet_col="year",
facet_col_wrap=3,
color="sex",
nbins=10,
title=f"Distribution of Antibiotic Courses Prescribed in Infancy",
height=1000,
width=850,
facet_row_spacing=0.1,
color_discrete_map=color_map,
barmode="group"
)
fig.for_each_trace(lambda t: t.update(name="female" if t.name == "1" else "male"))
fig.for_each_annotation(lambda a: a.update(text=a.text.replace("year=", "Year: ")))
fig.update_xaxes(
title_text="Number of Courses of Antibiotics",
nticks=10,
row=1,
title=dict(font=dict(size=12))
)
fig.update_xaxes(showticklabels=True, nticks=10)
fig.update_yaxes(title_text="Count", col=1)
fig.update_layout(
legend_title_text="",
title_x=0.5,
title_y=0.95,
margin=dict(t=150), # Adjust top margin for title
showlegend=True
)
fig.show(config=config)