Antibiotic Exposure Model

In this section, we will describe the model used to predict the number of antibiotics prescribed to infants in their first year of life. This model will be incorporated into the risk factors for developing asthma later in life.

Datasets

We obtained data from the BC Ministry of Health on the number of antibiotics prescribed to infants in their first year of life. The data is available for the years 2000 to 2018. The data is formatted as follows:

Column Type Description
year int format XXXX, e.g 2000, range [2000, 2018]
sex int 1 = Female, 2 = Male
n_abx int The total number of antibiotic courses prescribed to infants in BC in a given year. Note that this is not per infant, it is the total for all infants.

Since the n_abx column gives us the total number of antibiotics prescribed, we need to use population data to convert this to a per infant value. We obtained population data from the StatCan census data (the same that is used in the Birth module):

Column Type Description
year int format XXXX, e.g 2000, range [1999, 2021]
sex int "M" = male, "F" = female
n_birth int The total number of births in BC in a given year.

Model: Generalized Linear Model - Negative Binomial

Since our model projects into the future, we would like to be able to extend this data beyond 2018. To obtain these projections, we use a Generalized Linear Model (GLM). A GLM is a type of regression analysis which is a generalized form of linear regression. See Generalized Linear Models for more information on GLMs.

Probability Distribution

When fitting a GLM, first you must choose a distribution for the response variable. In our case, the response variable is the number of antibiotics prescribed during the first year of life. The number of antibiotics prescribed is a count variable, in a given time interval (a year, in our case). Since it is count data, we need a discrete probability distribution. The Poisson distribution is a good choice for our data, but it has some limitations. The Poisson distribution assumes that the mean and variance are equal, i.e:

\[\mu = \sigma^2\]

However, in our data, the variance is greater than the mean. This is a common problem in count data, and it is called overdispersion. The Negative Binomial distribution is a generalization of the Poisson distribution that allows for overdispersion. The Negative Binomial distribution has an extra parameter, \(\theta\), which controls the amount of overdispersion. Typically, the distribution is written as:

\[P(Y = k; r, p) := \binom{k+r-1}{k}(1-p)^k p^r\]

where:

  • \(k\) is the number of failures before \(r\) successes occur

  • \(p\) is the probability of a success

  • \(r\) is the number of successes

We can reparametrize this with \(\mu\) and \(\theta\) using the following equations:

\[\begin{split}p &= \dfrac{\mu}{\sigma^2} \\ r &= \dfrac{\mu^2}{\sigma^2 - \mu} \\ \sigma^2 &= \mu + \dfrac{\mu^2}{\theta}\end{split}\]

where \(\mu\) is the mean, \(\sigma^2\) is the variance, and \(\theta\) is the overdispersion parameter.

Doing some algebra, we have:

\[\begin{split}p &= \dfrac{\mu}{\sigma^2} = \dfrac{\mu}{\mu + \dfrac{\mu^2}{\theta}} = \dfrac{\theta}{\theta + \mu} \\ r &= \dfrac{\mu^2}{\sigma^2 - \mu} = \dfrac{\mu^2}{\mu + \dfrac{\mu^2}{\theta} - \mu} = \dfrac{\mu^2}{\dfrac{\mu^2}{\theta}} = \theta\end{split}\]

Letting \(y = k\), we have:

\[\begin{split}P(Y = y; \mu, \theta) &= \binom{y + \theta - 1}{y} \left(1-\dfrac{\theta}{\theta + \mu}\right)^y \left(\dfrac{\theta}{\theta + \mu}\right)^{\theta} \\ &= \binom{y + \theta - 1}{y} \left(\dfrac{\theta + \mu - \theta}{\theta + \mu}\right)^y \left(\dfrac{\theta}{\theta + \mu}\right)^{\theta} \\ &= \binom{y + \theta - 1}{y} \dfrac{\mu^y \theta^{\theta}}{(\theta + \mu)^{y+\theta}}\end{split}\]

We added an upper bound on the mean parameter to prevent unrealistic extrapolation:

\[\mu^{(i)} = E(Y | X = x^{(i)}) \leq 0.05\]

In other words, we are saying that the mean number of antibiotics prescribed to an infant in their first year of life is less than or equal to 0.05.

So we have:

\[\begin{split}P(Y = y; \mu, \theta) &= p(y; \mu^{(i)}, \theta^{(i)}) = \binom{y + \theta^{(i)} - 1}{y} \dfrac{(\mu^{(i)})^y (\theta^{(i)})^{\theta^{(i)}}}{(\theta^{(i)} + \mu^{(i)})^{y + \theta^{(i)}}} \\ \mu^{(i)} &= \text{max}(0.05, \mu^{(i)})\end{split}\]

We also need to choose a link function. Recall that the link function \(g(\mu^{(i)})\) is used to relate the mean to the predicted value \(\eta^{(i)}\):

\[\begin{split}g(\mu^{(i)}) &= \eta^{(i)} \\ \mu^{(i)} &= E(Y | X = x^{(i)})\end{split}\]

How do we choose a link function? Well, we are free to choose any link function we like, but there are some constraints. For example, in the Negative Binomial distribution, the mean is always >= 0. However, \(\eta^{(i)}\) can be any real number. Therefore, we need a link function that maps real numbers to non-negative numbers. The log link function is a good choice for this:

\[g(\mu^{(i)}) = \log(\mu^{(i)}) = \eta^{(i)}\]

Formula

Now that we have our distribution and link function, we need to decide on a formula for \(\eta^{(i)}\). We are permitted to use linear combinations of functions of the features in our dataset.

For our dataset, we want a formula using sex and year. Since prescribing practices change over time, and since infections requiring antibiotic prescriptions also change over time, we should include year in our formula. We also want to include sex, since there are sex differences in antibiotic prescriptions.

There is an additional factor specific to BC regulations. In 2005, the BC government introduced an antibiotic conservation program, which reduced the number of antibiotics prescribed [Mamun, 2019]. It stands to reason that the formula may change before and after 2005. To account for this, we will introduce a Heaviside step function, which returns 0 for values below a given threshold, and 1 for values above the threshold. In our case, the threshold is 2005.

\[\eta^{(i)} = \beta_0 + \beta_s \cdot s^{(i)} + \beta_t \cdot t^{(i)} + \beta_h \cdot H(t^{(i)} - 2005) + \beta_{th} \cdot t \cdot H(t^{(i)} - 2005)\]

where:

Variable

Description

\(\eta^{(i)}\)

the predicted number of antibiotics per capita prescribed for an infant.

\(\beta_0\)

an intercept constant

\(s^{(i)}\)

sex

\(\beta_{s}\)

sex constant

\(t^{(i)}\)

year of birth of the infant

\(\beta_t\)

year constant

\(H(t^{(i)} - 2005)\)

Heaviside step function, which is 0 for years before 2005 and 1 for years after 2005

\(\beta_h\)

Heaviside step function constant

\(\beta_{th}\)

year and Heaviside step function constant

Example

Setup

%load_ext autoreload
%autoreload 2

import plotly.io as pio
pio.renderers.default = "sphinx_gallery"
import plotly.express as px
import pandas as pd
import numpy as np
import itertools
import statsmodels.api as sm
import statsmodels.formula.api as smf
from leap.data_generation.utils import heaviside
from leap.data_generation.antibiotic_data import load_birth_data
# Plot Settings
pio.templates.default = "plotly_white"
config = {
    'toImageButtonOptions': {
        'format': 'png',
        'scale': 2
    }
}

Data

Since the antibiotic dosing data is private, we will create here a mock dataset. In this dataset, we will have the total number of courses of antibiotics prescribed to all infants in BC, stratified by sex and year.

# Data was collected for the years 2000-2018
years = np.arange(2000, 2018)

# Mock data for antibiotic courses prescribed to infants using an approximate linear trend
n_abx_female = [-640 * i + 16000 for i, _ in enumerate(years)]
n_abx_male = [-880 * i + 22000 for i, _ in enumerate(years)]
n_abx = n_abx_female + n_abx_male

# Create a DataFrame for antibiotic courses prescribed to infants
df_abx = pd.DataFrame(
    data=list(itertools.product(["F", "M"], years)),
    columns=["sex", "year"]
)
df_abx["n_abx"] = n_abx
df_abx.head()
sex year n_abx
0 F 2000 16000
1 F 2001 15360
2 F 2002 14720
3 F 2003 14080
4 F 2004 13440

Next, since the antibiotic data measures the number of antibiotics prescribed at a population level, we need to use population data from Statistics Canada to determine the number of antibiotics prescribed per capita:

df_birth = load_birth_data()
df_birth.head()
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File /opt/hostedtoolcache/Python/3.10.19/x64/lib/python3.10/site-packages/pandas/core/indexes/base.py:3812, in Index.get_loc(self, key)
   3811 try:
-> 3812     return self._engine.get_loc(casted_key)
   3813 except KeyError as err:

File pandas/_libs/index.pyx:167, in pandas._libs.index.IndexEngine.get_loc()

File pandas/_libs/index.pyx:196, in pandas._libs.index.IndexEngine.get_loc()

File pandas/_libs/hashtable_class_helper.pxi:7088, in pandas._libs.hashtable.PyObjectHashTable.get_item()

File pandas/_libs/hashtable_class_helper.pxi:7096, in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'year'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Cell In[4], line 1
----> 1 df_birth = load_birth_data()
      2 df_birth.head()

File /opt/hostedtoolcache/Python/3.10.19/x64/lib/python3.10/site-packages/leap/data_generation/antibiotic_data.py:88, in load_birth_data(province, min_year, max_year)
     81 df.rename(
     82     columns={"REF_DATE": "year", "GEO": "province", "SEX": "sex", "VALUE": "n_birth"},
     83     inplace=True
     84 )
     86 # select only the age = 0 age group and the years where min_year <= year <= max_year
     87 df = df.loc[
---> 88     (df["year"] >= min_year) & 
     89     (df["year"] <= max_year) & 
     90     (df["AGE_GROUP"] == "0 years")
     91 ]
     93 # select only the columns we need
     94 df = df[["year", "province", "sex", "n_birth"]]

File /opt/hostedtoolcache/Python/3.10.19/x64/lib/python3.10/site-packages/pandas/core/frame.py:4113, in DataFrame.__getitem__(self, key)
   4111 if self.columns.nlevels > 1:
   4112     return self._getitem_multilevel(key)
-> 4113 indexer = self.columns.get_loc(key)
   4114 if is_integer(indexer):
   4115     indexer = [indexer]

File /opt/hostedtoolcache/Python/3.10.19/x64/lib/python3.10/site-packages/pandas/core/indexes/base.py:3819, in Index.get_loc(self, key)
   3814     if isinstance(casted_key, slice) or (
   3815         isinstance(casted_key, abc.Iterable)
   3816         and any(isinstance(x, slice) for x in casted_key)
   3817     ):
   3818         raise InvalidIndexError(key)
-> 3819     raise KeyError(key) from err
   3820 except TypeError:
   3821     # If we have a listlike key, _check_indexing_error will raise
   3822     #  InvalidIndexError. Otherwise we fall through and re-raise
   3823     #  the TypeError.
   3824     self._check_indexing_error(key)

KeyError: 'year'

We can now merge the df_abx and df_birth dataframes:

df = pd.merge(df_abx, df_birth, on=["year", "sex"], how="left")
df.head()
color_map={
    "M": "#09bfc4",
    "F": "#f39c12",
}

fig = px.line(
    df,
    x="year",
    y="n_abx",
    facet_col="sex",
    color="sex",
    labels={"n_abx": "Total Number of Antibiotic Courses Prescribed to Infant Population"},
    color_discrete_map=color_map
)
fig.update_xaxes(title_text="Birth Year", nticks=10)
fig.update_yaxes(title_font=dict(size=12))

fig.show()

Model

df["sex"] = df.apply(lambda x: 1 if x["sex"] == "F" else 2, axis=1)
formula = "n_abx ~ year + sex + heaviside(year, 2005) * year"
θ = 800
alpha = 1 / θ

# Fit the GLM model
model = smf.glm(
    formula=formula,
    data=df,
    family=sm.families.NegativeBinomial(alpha=alpha),
    offset=np.log(df["n_birth"])
)
results = model.fit(maxiter=1000)

print(results.summary())
                 Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:                  n_abx   No. Observations:                   36
Model:                            GLM   Df Residuals:                       31
Model Family:        NegativeBinomial   Df Model:                            4
Link Function:                    Log   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -260.35
Date:                Tue, 12 Aug 2025   Deviance:                       18.039
Time:                        13:53:48   Pearson chi2:                     18.0
No. Iterations:                     5   Pseudo R-squ. (CS):              1.000
Covariance Type:            nonrobust                                         
==============================================================================================
                                 coef    std err          z      P>|z|      [0.025      0.975]
----------------------------------------------------------------------------------------------
Intercept                     74.9737     16.193      4.630      0.000      43.236     106.711
year                          -0.0377      0.008     -4.664      0.000      -0.054      -0.022
sex                            0.2595      0.012     21.258      0.000       0.236       0.283
heaviside(year, 2005)         87.6702     16.652      5.265      0.000      55.033     120.308
heaviside(year, 2005):year    -0.0437      0.008     -5.259      0.000      -0.060      -0.027
==============================================================================================

Now that the model has been trained, we can obtain the $\beta$ parameters:

β_0 = results.params["Intercept"]
β_t = results.params["year"]
β_s = results.params["sex"]
β_h = results.params["heaviside(year, 2005)"]
β_th = results.params["heaviside(year, 2005):year"]

Example: Sex = Male, Year = 2008

sex = 2
year = 2008
η = β_0 + β_s * sex + β_t * year + β_h * heaviside(year, 2005) + β_th * heaviside(year, 2005) * year

Now, $\eta$ is the number of antibiotics prescribed per number of infants; what we want to know, is given an age, birth year, and sex, how many antibiotics were prescribed to that person during infancy? To do so, we need to use the Negative Binomial distribution. First, we compute $\mu$, using the link function.

µ = np.exp(η)

Next, we make sure that $\mu > 0.05$:

µ = max(µ, 0.05)
print(f"µ = {µ:.6f}")
µ = 0.674037

The standard parametrization of the Negative Binomial function uses $p$ and $r$, not $\mu$. So, we need to convert $\mu$ to $p$:

p = θ / (θ + µ)
r = θ
print(f"p = {p:.6f}, r = {r:.4f}")
p = 0.999158, r = 800.0000

The Negative Binomial distribution gives us the probability of an infant being prescribed $k$ courses of antibiotics during infancy:

$$ \begin{aligned} P(Y = k; r, p) := \binom{k+r-1}{k}(1-p)^k p^r \end{aligned} $$

n_abx = np.random.negative_binomial(r, p)
sex = "female" if sex == 1 else "male"
print(f"Number of courses of antibiotics prescribed during infancy to a {sex} person born in {year}: {n_abx}")
Number of courses of antibiotics prescribed during infancy to a male person born in 2008: 1

The number of antibiotics obtained is just a sampling from the distribution; if we repeated the selection multiple times, we could get different values. To visualize this, we can look at the distribution:

n_abx = np.random.negative_binomial(r, p, size=100000)
fig = px.histogram(
    n_abx,
    nbins=10,
    title=f"Distribution of Antibiotic Courses Prescribed in Infancy<br>Sex = {sex}, Birth Year = {year}",
    color_discrete_sequence=["#09bfc4"],
    width=600
)
fig.update_xaxes(title_text="Number of Courses of Antibiotics", nticks=10)
fig.update_yaxes(title_text="Count")
fig.update_layout(
    legend_title_text="",
    title_x=0.5,
    title_y=0.95,
    margin=dict(t=50), # Adjust top margin for title
    showlegend=False
)
fig.show(config=config)

Example: Visualize Multiple Years / Sexes

The example above was just the distribution for a single year and sex; we can visualize multiple years / sexes together:

def compute_abx_distribution_parameters(sex, year, β_0, β_s, β_t, β_h, β_th, θ):
    η = (
        β_0 + β_s * sex + β_t * year +
        β_h * heaviside(year, 2005) +
        β_th * heaviside(year, 2005) * year
    )
    µ = np.exp(η)
    µ = max(µ, 0.05)  # Ensure µ is greater than 0.05
    p = θ / (θ + µ)
    return p
    

Let’s create a dataframe with even years from 2000 to 2018:

sexes = [1, 2]
years = np.arange(2000, 2018, step=2)
df_pred = pd.DataFrame(
    data=list(itertools.product(sexes, years)),
    columns=["sex", "year"],
    dtype='Int64'
)
df_pred.head()
sex year
0 1 2000
1 1 2002
2 1 2004
3 1 2006
4 1 2008

Next, let’s find $p$ and $r$ for each year / sex combination:

df_pred["p"] = df_pred.apply(
    lambda x: compute_abx_distribution_parameters(x["sex"], x["year"], β_0, β_s, β_t, β_h, β_th, θ),
    axis=1
)
df_pred["r"] = df_pred.apply(
    lambda x: r,
    axis=1
)
df_pred.head()
sex year p r
0 1 2000 0.998991 800
1 1 2002 0.999064 800
2 1 2004 0.999132 800
3 1 2006 0.999236 800
4 1 2008 0.999350 800

We now create a second dataframe, which we will use to sample the Negative Binomial distribution.

df_dist = pd.DataFrame(
    data=list(itertools.product(sexes, years, np.zeros(1000))),
    columns=["sex", "year", "nabx"],
    dtype=int
)
df_dist = pd.merge(df_dist, df_pred, on=["year", "sex"])
df_dist.head()
sex year nabx p r
0 1 2000 0 0.998991 800
1 1 2000 0 0.998991 800
2 1 2000 0 0.998991 800
3 1 2000 0 0.998991 800
4 1 2000 0 0.998991 800
grouped_df = df_dist.groupby(["sex", "year"])
df_dist["nabx"] = grouped_df.apply(
    lambda x: np.random.negative_binomial(x["r"], x["p"], size=len(x)), include_groups=False
).explode().astype(int).values
df_dist.head()
sex year nabx p r
0 1 2000 1 0.998991 800
1 1 2000 2 0.998991 800
2 1 2000 1 0.998991 800
3 1 2000 3 0.998991 800
4 1 2000 1 0.998991 800

Now we can visualize our results:

color_map={
    2: "#09bfc4",
    1: "#f39c12",
}

fig = px.histogram(
    df_dist,
    x="nabx",
    facet_col="year",
    facet_col_wrap=3,
    color="sex",
    nbins=10,
    title=f"Distribution of Antibiotic Courses Prescribed in Infancy",
    height=1000,
    width=850,
    facet_row_spacing=0.1,
    color_discrete_map=color_map,
    barmode="group"
)
fig.for_each_trace(lambda t: t.update(name="female" if t.name == "1" else "male"))
fig.for_each_annotation(lambda a: a.update(text=a.text.replace("year=", "Year: ")))
fig.update_xaxes(
    title_text="Number of Courses of Antibiotics",
    nticks=10,
    row=1,
    title=dict(font=dict(size=12))
)
fig.update_xaxes(showticklabels=True, nticks=10)
fig.update_yaxes(title_text="Count", col=1)
fig.update_layout(
    legend_title_text="",
    title_x=0.5,
    title_y=0.95,
    margin=dict(t=150), # Adjust top margin for title
    showlegend=True
)
fig.show(config=config)