Asthma Occurrence Model¶
This document describes the asthma occurrence model, which is used to predict the incidence and prevalence of asthma in British Columbia. The model is divided into two parts:
Occurrence Model 1: Crude Occurrence: A
Generalized Linear Model (GLM)
that predicts asthma incidence and prevalence based on age, sex, and year.Occurrence Model 2: Risk Factors: A model that incorporates risk factors such as family history and antibiotic use during infancy to predict asthma incidence and prevalence, along with the results from the first model.
Occurrence Model 1: Crude Occurrence¶
In the first model, we will use data collected from the BC Ministry of Health
on the
incidence and prevalence of asthma in British Columbia. We will use this data to fit a
Generalized Linear Model (GLM)
to predict the incidence and prevalence of asthma
based on the age, sex, and year. However, asthma occurrence doesn’t just depend on someone’s age or
sex, but it also depends on risk factors such as family history and antibiotic use during
infancy. We will address these in the second model: Occurrence Model 2: Risk Factors.
Datasets¶
The BC Ministry of Health Administrative Dataset contains asthma incidence and prevalence data
for the years 2000-2019
, in 5-year age intervals.
The data is formatted as follows:
Column | Type | Description |
---|---|---|
year |
int
|
format XXXX , e.g 2000 , range [2000, 2019]
|
age |
int
|
The midpoint of the age group, e.g. 3 for the age group 1-5 years |
sex |
int
|
1 = Female, 2 = Male |
incidence |
float
|
The incidence of asthma in BC for a given year, age group, and sex, per 100 people |
prevalence |
float
|
The prevalence of asthma in BC for a given year, age group, and sex, per 100 people |
Model: Generalized Linear Model - Poisson¶
Since our model projects into the future, we would like to be able to extend this data beyond
2019
. Our model also makes predictions at 1-year age intervals, not 5-year age intervals.
To obtain these projections, we use a Generalized Linear Model (GLM)
. A GLM
is a type of
regression analysis which is a generalized form of linear regression. See Generalized Linear Models for more
information on GLMs
.
Probability Distribution¶
When fitting a GLM
, first you must choose a distribution for the response variable
. In our
case, the response variable is the asthma prevalence or incidence. Incidence and prevalence are
counts of the number of people diagnosed with asthma and the number of people with asthma,
respectively, in a given time interval (a year, in our case). Since these are counts, we need a
discrete probability distribution. The Poisson distribution
is a good choice for our data.
Link Function¶
We also need to choose a link function
. Recall that the link function \(g(\mu^{(i)})\)
is used to relate the mean to the predicted value \(\eta^{(i)}\):
How do we choose a link function? Well, we are free to choose any link function we like, but there
are some constraints. For example, in the Poisson distribution, the mean is always positive.
However, \(\eta^{(i)}\) can be any real number. Therefore, we need a link function that maps
real numbers to positive numbers. The log link function
is a good choice for this:
Formula¶
Now that we have our distribution and link function, we need to decide on a formula for \(\eta^{(i)}\). We are permitted to use linear combinations of functions of the features in our dataset.
Let’s start with incidence
. We want a formula using age
, sex
, and year
.
Since asthma depends on factors such as pollution and antibiotic use, and these factors change
from year to year, it follows that asthma incidence should depend on the year. Antibiotic use
also depends on age, so we should include age in our formula. Finally, there is a sex difference
in asthma incidence, so we should include sex in our formula.
where:
\(\beta_{k\ell m}\) is the coefficient for the feature \((a^{(i)})^k \cdot (t^{(i)})^{\ell} \cdot (s^{(i)})^m\)
\(a^{(i)}\) is the age
\(t^{(i)}\) is the year
\(s^{(i)}\) is the sex
There are \(2 + 6 * 2 = 14\) coefficients in the incidence model.
Next we have the prevalence
. We again want a formula using age
, sex
, and year
.
Since asthma prevalence depends on the number of people who have asthma, and this number changes
from year to year, we should include year in our formula. Asthma prevalence also depends on age,
so we should include age in our formula. Finally, there is a sex difference
in asthma incidence and hence prevalence, so we should include sex in our formula.
There are \(6 * 3 * 2 = 36\) coefficients in the prevalence model.
Occurrence Model 2: Risk Factors¶
In the second model, we will use the predicted asthma incidence and prevalence from the first model, \(\eta\), as our target asthma prevalence / incidence in this model. We would now like to incorporate the risk factors of family history and antibiotic use on asthma incidence and prevalence.
Datasets¶
We use the predicted asthma incidence and prevalence from the first model, \(\eta\), as our target asthma prevalence / incidence in this model. The data is formatted as follows:
Column | Type | Description |
---|---|---|
year |
int
|
format XXXX , e.g 2000 , range [2000, 2019]
|
age |
int
|
The age in years, a value in [3, 100]
|
sex |
str
|
"F" = Female "M" = Male
|
incidence |
float
|
The predicted incidence of asthma in BC for a given year, age, and sex, per 100 people |
prevalence |
float
|
The predicted prevalence of asthma in BC for a given year, age, and sex, per 100 people |
Model: Risk Factors¶
We want to incorporate the effects of family history and antibiotic use on asthma incidence and prevalence.
Risk Factor | Values | Description |
---|---|---|
Family History |
A value in {0, 1}
|
1 : at least one parent has asthma0 : neither parent has asthma
|
Antibiotic Dose |
An integer in [0, 3]
|
This variable represents the number of courses of antibiotics taken during the first year of life. The maximum value is 3, since the likelihood of taking more than 3 courses of antibiotics in the first year of life is very low. The upper value of 3 indicates 3 or more courses of antibiotics taken during the first year of life. |
Formula¶
Before we begin, let us define some terms. We have two risk factors we are interested in: family history and antibiotic use. There are \(2 * 4 = 8\) possible combinations of these two risk factors:
λ | Family History | Antibiotic Dose |
---|---|---|
0 |
0 |
0 |
1 |
1 |
0 |
2 |
0 |
1 |
3 |
1 |
1 |
4 |
0 |
2 |
5 |
1 |
2 |
6 |
0 |
3 |
7 |
1 |
3 |
We can represent each combination as a vector of the form:
where \(f_{\lambda}\) is the family history and \(d_{\lambda}\) is the antibiotic dose.
We next define the odds ratio for a given risk factor as:
where \(A\) is the asthma incidence or prevalence and \(r\) is the risk factor. To combine odds ratios, we have:
Since these are multiplicative, the log of the odds ratios is additive:
We can now define our formula for the calibration model:
where:
Variable |
Description |
---|---|
\(\beta_{\eta} = \sigma^{-1}(\eta^{(i)})\) |
determined by the output of the first model |
\(\eta^{(i)}\) |
the predicted incidence or prevalence from the first model |
\(\sigma(x)\) |
the logistic function |
\(\alpha = \sum_{\lambda=1}^{n} p(\lambda) \cdot \beta_{\lambda}\) |
the correction / calibration term for either the incidence or prevalence |
\(\zeta^{(i)} = \sum_{\lambda=0}^{n} p(\lambda) \zeta_{\lambda}^{(i)}\) |
predicted asthma prevalence / incidence for the model. We want this to be as close as possible to \(\eta^{(i)}\). |
\(\zeta_{\lambda}^{(i)}\) |
the predicted asthma incidence or prevalence from the model for the risk factor combination indexed by \(\lambda\) |
\(p(\lambda)\) |
the probability of the risk factor combination indexed by \(\lambda\) |
Let’s break this formula down:
Antibiotic Risk Factors¶
The antibiotic terms were fit by Lee et al. [4], using a random effects meta-regression model:
where:
\(\beta_{\text{abx_xxx}}\) is a constant coefficient
\(a^{(i)}\) is the age
\(d^{(i)}\) is the number of courses of antibiotics taken during the first year of life
The beta coefficients were found to be:
\(\beta_{\text{abx_0}} = 1.82581\)
\(\beta_{\text{abx_age}} = 0.2253\)
\(\beta_{\text{abx_dose}} = 0.0531475\)
Family History Risk Factors¶
The family history terms were fit using the CHILD Study
data, in the paper by Patrick et al.
[6], using logistic regression:
where:
\(\beta_{\text{hx_xxx}}\) is a constant coefficient
\(a^{(i)}\) is the age
\(f^{(i)}\) is the family history of asthma; 1 = at least one parent has asthma, 0 = neither parent has asthma
The beta coefficients were found to be:
\(\beta_{\text{hx_0}} = \log(1.13) = 0.122\)
\(\beta_{\text{hx_age}} = \dfrac{\log(2.4) - \log(1.13)}{2} = 0.377\)
So, the only unknown term in our formula is the correction term \(\alpha\). To solve this, we separate the formulae for incidence and prevalence. We will begin with prevalence.
Solving for the Correction Term: Prevalence¶
We want to find a correction term \(\alpha\) such that the predicted asthma prevalence
\(\zeta\) is as close as possible to the predicted asthma prevalence \(\eta\). To do this,
we use the Broyden-Fletcher-Goldfarb-Shanno (BFGS)
algorithm to minimize the absolute
difference between \(\zeta\) and \(\eta\).
Solving for the Correction Term: Incidence¶
In our model, asthma incidence is defined as the number of new diagnoses between the previous year and the current year, divided by the total population. To calibrate the incidence, we first find the calibrated prevalence for the previous year:
Now, what we want to find is the joint probability of each risk factor combination, \(p(\lambda, A = 0 \mid t-1)\), for the population without asthma.
Now, we must have:
So, we can rewrite the joint probability as:
Next, we find the calibrated asthma incidence for the current year:
where we recall that:
Variable |
Description |
---|---|
\(\beta_{\eta} = \sigma^{-1}(\eta^{(i)}(t))\) |
determined by the output of the first model |
\(\eta^{(i)}(t)\) |
defined above; the predicted incidence from the first model |
\(\alpha = \sum_{\lambda=1}^{n} p(\lambda, A = 0 \mid t-1) \cdot \beta_{\lambda}\) |
the correction / calibration term for the incidence |
\(\zeta^{(i)} = \sum_{\lambda=0}^{n} p(\lambda, A = 0 \mid t-1) \zeta_{\lambda}^{(i)}\) |
predicted asthma incidence for the model. We want this to be as close as possible to \(\eta^{(i)}\). |
\(\zeta_{\lambda}^{(i)}\) |
the predicted asthma incidence from the model for the risk factor combination indexed by \(\lambda\) |
\(p(\lambda, A = 0 \mid t-1)\) |
the joint probability of the risk factor combination indexed by \(\lambda\), for a person who did not have asthma at time \(t-1\) |
We again want to find a correction term \(\alpha\) such that the predicted asthma incidence
\(\zeta\) is as close as possible to the asthma incidence from the first model, \(\eta\).
To do this, we use the Broyden-Fletcher-Goldfarb-Shanno (BFGS)
algorithm to minimize the
absolute difference between \(\zeta\) and \(\eta\).
Optimizing the Initial Beta Parameters for the Incidence Equation¶
For the incidence equation, we need to optimize two of the initial beta parameters:
\(\beta_{\text{fhx}_\text{age}}\)
\(\beta_{\text{abx}_\text{age}}\)
Example¶
Before we begin, let us first define what we mean by a contingency table
. A contingency
table is a table that displays two categorical variables and their joint frequency distribution.
For example, suppose we have two categorical variables: smoking
and lung cancer
, with
n = 300
patients in total:
lung cancer | no lung cancer | ||
---|---|---|---|
smoker | a0 = 100 |
b0 = 50 |
n1 = 150 |
non-smoker | c0 = 25 |
d0 = 125 |
|
n2 = 125 |
n = 300 |
The variables n1
and n2
are called the marginal totals:
The variable n
is the total number of patients:
In our model, we want to compute the contingency table for the risk factor combinations \(\lambda\) and the asthma diagnosis.
Past Contingency Table¶
variable 2, outcome + | variable 2, outcome - | ||
---|---|---|---|
variable 1, outcome + | a0 |
b0 |
n1 |
variable 1, outcome - | c0 |
d0 |
|
n2 |
n |
We want to calculate \(a_0\), \(b_0\), \(c_0\), and \(d_0\) using \(n_1\), \(n_2\), \(n\), and \(\omega_{\lambda}\). Now, we have the probabilities of each of the risk factor combinations, \(p(\lambda)\), but for the contingency table, we only want to consider one risk factor combination at a time. To do this, we compute the conditional probability:
To obtain \(n_1\), the number of people with risk factor combination \(\lambda\) with or without an asthma diagnosis, we multiply the conditional probability by the total population \(n\):
To obtain \(n_2\), the number of people diagnosed with asthma with or without risk factor combination \(\lambda\):
From this, we can calculate the values for the contingency table:
To obtain \(a_0\), we follow the methods described in the paper [2]. See conv_2x2 for the Python implementation of this method.
Current Contingency Table: Reassessment¶
According to our model, an asthma diagnosis is not static; a patient may be diagnosed with asthma and then later be reassessed as not having asthma. We would like to compute the updated contingency table:
asthma, outcome + | asthma, outcome - | ||
---|---|---|---|
risk factor λ , outcome + |
a1_ra |
b1_ra |
n1 |
risk factor λ , outcome - |
c1_ra |
d1_ra |
|
n2 |
n |
To calculate the updated contingency table, we have:
where:
Risk Factors | t=0 | t=1 | |
---|---|---|---|
a1_ra |
λ |
has asthma diagnosis | reassessed: has asthma diagnosis |
b1_ra |
λ |
has asthma diagnosis | reassessed: no asthma diagnosis |
c1_ra |
None | has asthma diagnosis | reassessed: has asthma diagnosis |
d1_ra |
None | has asthma diagnosis | reassessed: no asthma diagnosis |
\(a_{1, \text{ra}}\) is the proportion of the population with risk factor combination \(\lambda\) who had an asthma diagnosis at \(t=0\) and still have it at \(t=1\)
\(b_{1, \text{ra}}\) is the proportion of the population with risk factor combination \(\lambda\) who had an asthma diagnosis at \(t=0\) but no longer have it at \(t=1\)
\(c_{1, \text{ra}}\) is the proportion of the population with no risk factors (\(\lambda = 0\)) who had an asthma diagnosis at \(t=0\) and still have it at \(t=1\)
\(d_{1, \text{ra}}\) is the proportion of the population with no risk factors (\(\lambda = 0\)) who had an asthma diagnosis at \(t=0\) but no longer have it at \(t=1\)
\(\rho\) is the probability that a person would be reassessed as having an asthma diagnosis at \(t=1\) given that they had an asthma diagnosis at \(t=0\)
Current Contingency Table: New Diagnosis¶
For the reassessment table, we considered only the patients who were diagnosed with asthma. We will now consider those who were not diagnosed with asthma:
asthma, outcome + | asthma, outcome - | ||
---|---|---|---|
risk factor λ , outcome + |
a1_dx |
b1_dx |
n1 |
risk factor λ , outcome - |
c1_dx |
d1_dx |
|
n2 |
n |
To calculate the updated contingency table, we have:
where:
Risk Factors | t=0 | t=1 | |||
---|---|---|---|---|---|
incidence | net | ||||
a1_dx |
λ |
no asthma diagnosis | new asthma diagnosis | asthma | |
b1_dx |
λ |
no asthma diagnosis | no new asthma diagnosis | no asthma | |
c1_dx |
None | no asthma diagnosis | new asthma diagnosis | asthma | |
d1_dx |
None | no asthma diagnosis | no new asthma diagnosis | no asthma |
\(a_{1, \text{dx}}\) is the proportion of the population with risk factor combination \(\lambda\) who didn’t have an asthma diagnosis at \(t=0\) and were diagnosed at \(t=1\) \(\rightarrow\) have asthma at \(t=1\)
\(b_{1, \text{dx}}\) is the proportion of the population with risk factor combination \(\lambda\) who didn’t have an asthma diagnosis at \(t=0\) and were not diagnosed with asthma at \(t=1\), \(\rightarrow\) don’t have asthma at \(t=1\)
\(c_{1, \text{dx}}\) is the proportion of the population with no risk factors (\(\lambda = 0\)) who didn’t have an asthma diagnosis at \(t=0\) and were diagnosed at \(t=1\) \(\rightarrow\) have asthma at \(t=1\)
\(d_{1, \text{dx}}\) is the proportion of the population with no risk factors (\(\lambda = 0\)) who didn’t have an asthma diagnosis at \(t=0\) and were not diagnosed with asthma at \(t=1\), \(\rightarrow\) don’t have asthma at \(t=1\)
Current Contingency Table¶
Finally, we can compute the contingency table for the current year, \(t=1\):
asthma, outcome + | asthma, outcome - | ||
---|---|---|---|
risk factor λ , outcome + |
a1 |
b1 |
n1 |
risk factor λ , outcome - |
c1 |
d1 |
|
n2 |
n |
where:
From these values, we can compute the odds ratio:
Optimization¶
We want to find the beta parameters that minimize the difference between the predicted odds ratio \(\Omega\) and the observed odds ratio \(\omega_{\lambda}\).