Asthma Occurrence Model

This document describes the asthma occurrence model, which is used to predict the incidence and prevalence of asthma in British Columbia. The model is divided into two parts:

  1. Occurrence Model 1: Crude Occurrence: A Generalized Linear Model (GLM) that predicts asthma incidence and prevalence based on age, sex, and year.

  2. Occurrence Model 2: Risk Factors: A model that incorporates risk factors such as family history and antibiotic use during infancy to predict asthma incidence and prevalence, along with the results from the first model.

Occurrence Model 1: Crude Occurrence

In the first model, we will use data collected from the BC Ministry of Health on the incidence and prevalence of asthma in British Columbia. We will use this data to fit a Generalized Linear Model (GLM) to predict the incidence and prevalence of asthma based on the age, sex, and year. However, asthma occurrence doesn’t just depend on someone’s age or sex, but it also depends on risk factors such as family history and antibiotic use during infancy. We will address these in the second model: Occurrence Model 2: Risk Factors.

Datasets

The BC Ministry of Health Administrative Dataset contains asthma incidence and prevalence data for the years 2000-2019, in 5-year age intervals.

The data is formatted as follows:

Column Type Description
year int format XXXX, e.g 2000, range [2000, 2019]
age int The midpoint of the age group, e.g. 3 for the age group 1-5 years
sex int 1 = Female, 2 = Male
incidence float The incidence of asthma in BC for a given year, age group, and sex, per 100 people
prevalence float The prevalence of asthma in BC for a given year, age group, and sex, per 100 people

Model: Generalized Linear Model - Poisson

Since our model projects into the future, we would like to be able to extend this data beyond 2019. Our model also makes predictions at 1-year age intervals, not 5-year age intervals. To obtain these projections, we use a Generalized Linear Model (GLM). A GLM is a type of regression analysis which is a generalized form of linear regression. See Generalized Linear Models for more information on GLMs.

Probability Distribution

When fitting a GLM, first you must choose a distribution for the response variable. In our case, the response variable is the asthma prevalence or incidence. Incidence and prevalence are counts of the number of people diagnosed with asthma and the number of people with asthma, respectively, in a given time interval (a year, in our case). Since these are counts, we need a discrete probability distribution. The Poisson distribution is a good choice for our data.

\[P(Y = y) = p(y; \mu^{(i)}) = \dfrac{(\mu^{(i)})^{y} ~ e^{-\mu^{(i)}}}{y!}\]

We also need to choose a link function. Recall that the link function \(g(\mu^{(i)})\) is used to relate the mean to the predicted value \(\eta^{(i)}\):

\[\begin{split}g(\mu^{(i)}) &= \eta^{(i)} \\ \mu^{(i)} &= E(Y \mid X = x^{(i)})\end{split}\]

How do we choose a link function? Well, we are free to choose any link function we like, but there are some constraints. For example, in the Poisson distribution, the mean is always positive. However, \(\eta^{(i)}\) can be any real number. Therefore, we need a link function that maps real numbers to positive numbers. The log link function is a good choice for this:

\[g(\mu^{(i)}) = \log(\mu^{(i)}) = \eta^{(i)}\]

Formula

Now that we have our distribution and link function, we need to decide on a formula for \(\eta^{(i)}\). We are permitted to use linear combinations of functions of the features in our dataset.

Let’s start with incidence. We want a formula using age, sex, and year. Since asthma depends on factors such as pollution and antibiotic use, and these factors change from year to year, it follows that asthma incidence should depend on the year. Antibiotic use also depends on age, so we should include age in our formula. Finally, there is a sex difference in asthma incidence, so we should include sex in our formula.

\[\eta^{(i)} = \sum_{m=0}^1 \beta_{01m} t^{(i)} \cdot (s^{(i)})^m + \sum_{k=0}^{5} \sum_{m=0}^{1} \beta_{k0m} \cdot (a^{(i)})^k \cdot (s^{(i)})^m\]

where:

  • \(\beta_{k\ell m}\) is the coefficient for the feature \((a^{(i)})^k \cdot (t^{(i)})^{\ell} \cdot (s^{(i)})^m\)

  • \(a^{(i)}\) is the age

  • \(t^{(i)}\) is the year

  • \(s^{(i)}\) is the sex

There are \(2 + 6 * 2 = 14\) coefficients in the incidence model.

Next we have the prevalence. We again want a formula using age, sex, and year. Since asthma prevalence depends on the number of people who have asthma, and this number changes from year to year, we should include year in our formula. Asthma prevalence also depends on age, so we should include age in our formula. Finally, there is a sex difference in asthma incidence and hence prevalence, so we should include sex in our formula.

\[\eta^{(i)} = \sum_{k=0}^{5} \sum_{\ell=0}^2 \sum_{m=0}^1 \beta_{k \ell m} \cdot (a^{(i)})^k \cdot (t^{(i)})^{\ell} \cdot (s^{(i)})^m\]

There are \(6 * 3 * 2 = 36\) coefficients in the prevalence model.

Occurrence Model 2: Risk Factors

In the second model, we will use the predicted asthma incidence and prevalence from the first model, \(\eta\), as our target asthma prevalence / incidence in this model. We would now like to incorporate the risk factors of family history and antibiotic use on asthma incidence and prevalence.

Datasets

We use the predicted asthma incidence and prevalence from the first model, \(\eta\), as our target asthma prevalence / incidence in this model. The data is formatted as follows:

Column Type Description
year int format XXXX, e.g 2000, range [2000, 2019]
age int The age in years, a value in [3, 100]
sex str "F" = Female
"M" = Male
incidence float The predicted incidence of asthma in BC for a given year, age, and sex, per 100 people
prevalence float The predicted prevalence of asthma in BC for a given year, age, and sex, per 100 people

Model: Risk Factors

We want to incorporate the effects of family history and antibiotic use on asthma incidence and prevalence.

Risk Factor Values Description
Family History A value in {0, 1} 1: at least one parent has asthma
0: neither parent has asthma
Antibiotic Dose An integer in [0, 3] This variable represents the number of courses of antibiotics taken during the first year of life. The maximum value is 3, since the likelihood of taking more than 3 courses of antibiotics in the first year of life is very low. The upper value of 3 indicates 3 or more courses of antibiotics taken during the first year of life.

Formula

Before we begin, let us define some terms. We have two risk factors we are interested in: family history and antibiotic use. There are \(2 * 4 = 8\) possible combinations of these two risk factors:

λ Family History Antibiotic Dose
0 0 0
1 1 0
2 0 1
3 1 1
4 0 2
5 1 2
6 0 3
7 1 3

We can represent each combination as a vector of the form:

\[\begin{split}\begin{bmatrix} f_{\lambda} \\ d_{\lambda} \end{bmatrix}\end{split}\]

where \(f_{\lambda}\) is the family history and \(d_{\lambda}\) is the antibiotic dose.

We next define the odds ratio for a given risk factor as:

\[\omega(r=k) = \dfrac{P(A = 1 \mid r = k)}{P(A = 1 \mid r = 0)}\]

where \(A\) is the asthma incidence or prevalence and \(r\) is the risk factor. To combine odds ratios, we have:

\[\begin{split}\omega_{\lambda} &= \omega(f = f_{\lambda}, d = d_{\lambda}) \\ &= \dfrac{P(A = 1 \mid f = f_{\lambda}, d = d_{\lambda})}{P(A = 1 \mid f = 0, d = 0)} \\ &= \dfrac{P(A = 1 \mid f = f_{\lambda})}{P(A = 1 \mid f = 0)} \cdot \dfrac{P(A = 1 \mid d = d_{\lambda})}{P(A = 1 \mid d = 0)} \\ &= \omega(f = f_{\lambda}) \cdot \omega(d = d_{\lambda})\end{split}\]

Since these are multiplicative, the log of the odds ratios is additive:

\[\log(\omega_{\lambda}) = \log(\omega(f = f_{\lambda})) + \log(\omega(d = d_{\lambda}))\]

We can now define our formula for the calibration model:

\[\zeta_{\lambda}^{(i)} = \sigma\left(\beta_{\eta} + \log(\omega_{\lambda}^{(i)}) + \alpha\right)\]

where:

Variable

Description

\(\beta_{\eta} = \sigma^{-1}(\eta^{(i)})\)

determined by the output of the first model

\(\eta^{(i)}\)

the predicted incidence or prevalence from the first model

\(\sigma(x)\)

the logistic function

\(\alpha = \sum_{\lambda=1}^{n} p(\lambda) \cdot \beta_{\lambda}\)

the correction / calibration term for either the incidence or prevalence

\(\zeta^{(i)} = \sum_{\lambda=0}^{n} p(\lambda) \zeta_{\lambda}^{(i)}\)

predicted asthma prevalence / incidence for the model. We want this to be as close as possible to \(\eta^{(i)}\).

\(\zeta_{\lambda}^{(i)}\)

the predicted asthma incidence or prevalence from the model for the risk factor combination indexed by \(\lambda\)

\(p(\lambda)\)

the probability of the risk factor combination indexed by \(\lambda\)

Let’s break this formula down:

Antibiotic Risk Factors

The antibiotic terms were fit by Lee et al. [4], using a random effects meta-regression model:

\[\begin{split}\log(\omega(d_{\lambda})) = \begin{cases} \beta_{\text{abx_0}} + \beta_{\text{abx_age}} \cdot \text{min}(a^{(i)}, 7) + \beta_{\text{abx_dose}} \cdot \text{min}(d^{(i)}, 3) && d^{(i)} > 0 \text{ and } a^{(i)} \leq 7 \\ \\ 0 && \text{otherwise} \end{cases}\end{split}\]

where:

  • \(\beta_{\text{abx_xxx}}\) is a constant coefficient

  • \(a^{(i)}\) is the age

  • \(d^{(i)}\) is the number of courses of antibiotics taken during the first year of life

The beta coefficients were found to be:

  • \(\beta_{\text{abx_0}} = 1.82581\)

  • \(\beta_{\text{abx_age}} = 0.2253\)

  • \(\beta_{\text{abx_dose}} = 0.0531475\)

Family History Risk Factors

The family history terms were fit using the CHILD Study data, in the paper by Patrick et al. [6], using logistic regression:

\[\log(\omega(f_{\lambda})) = \beta_{\text{hx_0}} \cdot f^{(i)} + \beta_{\text{hx_age}} \cdot (\text{min}(a^{(i)}, 5) - 3) \cdot f^{(i)}\]

where:

  • \(\beta_{\text{hx_xxx}}\) is a constant coefficient

  • \(a^{(i)}\) is the age

  • \(f^{(i)}\) is the family history of asthma; 1 = at least one parent has asthma, 0 = neither parent has asthma

The beta coefficients were found to be:

  • \(\beta_{\text{hx_0}} = \log(1.13) = 0.122\)

  • \(\beta_{\text{hx_age}} = \dfrac{\log(2.4) - \log(1.13)}{2} = 0.377\)

So, the only unknown term in our formula is the correction term \(\alpha\). To solve this, we separate the formulae for incidence and prevalence. We will begin with prevalence.

Solving for the Correction Term: Prevalence

\[\begin{split}\zeta_{\text{prev}} &= \sum_{\lambda=0}^{n} p(\lambda) \zeta_{\lambda} \\ &= \sum_{\lambda=0}^{n} p(\lambda) \sigma(\beta_{\eta} + \log(\omega_{\lambda}) - \alpha)\end{split}\]

We want to find a correction term \(\alpha\) such that the predicted asthma prevalence \(\zeta\) is as close as possible to the predicted asthma prevalence \(\eta\). To do this, we use the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm to minimize the absolute difference between \(\zeta\) and \(\eta\).

Solving for the Correction Term: Incidence

In our model, asthma incidence is defined as the number of new diagnoses between the previous year and the current year, divided by the total population. To calibrate the incidence, we first find the calibrated prevalence for the previous year:

\[\begin{split}\zeta_{\text{prev}}(t-1) &= \sum_{\lambda=0}^{n} p(\lambda, t-1) \zeta_{\text{prev}, \lambda}(t-1) \\ &= \sum_{\lambda=0}^{n} p(\lambda, t-1) \sigma(\beta_{\eta} + \log(\omega_{\lambda}) - \alpha)\end{split}\]

Now, what we want to find is the joint probability of each risk factor combination, \(p(\lambda, A = 0 \mid t-1)\), for the population without asthma.

\[P(\lambda, A = 0) = P(A = 0 \mid \lambda) \cdot P(\lambda)\]

Now, we must have:

\[P(A = 0 \mid \lambda) = 1 - P(A = 1 \mid \lambda) = 1 - \zeta_{\text{prev}, \lambda}(t-1)\]

So, we can rewrite the joint probability as:

\[p(\lambda, A = 0 \mid t-1) = (1 - \zeta_{\text{prev}, \lambda}(t-1)) \cdot p(\lambda, t-1)\]

Next, we find the calibrated asthma incidence for the current year:

\[\begin{split}\zeta_{\text{inc}}(t) &= \sum_{\lambda=0}^{n} p(\lambda, A = 0 \mid t-1) \zeta_{\text{inc}, \lambda}(t) \\ &= \sum_{\lambda=0}^{n} p(\lambda, A = 0 \mid t-1) \sigma(\beta_{\eta} + \log(\omega_{\lambda}) - \alpha)\end{split}\]

where we recall that:

Variable

Description

\(\beta_{\eta} = \sigma^{-1}(\eta^{(i)}(t))\)

determined by the output of the first model

\(\eta^{(i)}(t)\)

defined above; the predicted incidence from the first model

\(\alpha = \sum_{\lambda=1}^{n} p(\lambda, A = 0 \mid t-1) \cdot \beta_{\lambda}\)

the correction / calibration term for the incidence

\(\zeta^{(i)} = \sum_{\lambda=0}^{n} p(\lambda, A = 0 \mid t-1) \zeta_{\lambda}^{(i)}\)

predicted asthma incidence for the model. We want this to be as close as possible to \(\eta^{(i)}\).

\(\zeta_{\lambda}^{(i)}\)

the predicted asthma incidence from the model for the risk factor combination indexed by \(\lambda\)

\(p(\lambda, A = 0 \mid t-1)\)

the joint probability of the risk factor combination indexed by \(\lambda\), for a person who did not have asthma at time \(t-1\)

We again want to find a correction term \(\alpha\) such that the predicted asthma incidence \(\zeta\) is as close as possible to the asthma incidence from the first model, \(\eta\). To do this, we use the Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm to minimize the absolute difference between \(\zeta\) and \(\eta\).

Optimizing the Initial Beta Parameters for the Incidence Equation

For the incidence equation, we need to optimize two of the initial beta parameters:

  • \(\beta_{\text{fhx}_\text{age}}\)

  • \(\beta_{\text{abx}_\text{age}}\)

Example

Before we begin, let us first define what we mean by a contingency table. A contingency table is a table that displays two categorical variables and their joint frequency distribution. For example, suppose we have two categorical variables: smoking and lung cancer, with n = 300 patients in total:

lung cancer no lung cancer
smoker a0 = 100 b0 = 50 n1 = 150
non-smoker c0 = 25 d0 = 125
n2 = 125 n = 300

The variables n1 and n2 are called the marginal totals:

\[\begin{split}n_1 &= a_0 + b_0 = \text{total number of patients with lung cancer} \\ n_2 &= c_0 + d_0 = \text{total number of patients who smoke}\end{split}\]

The variable n is the total number of patients:

\[n = a_0 + b_0 + c_0 + d_0 = \text{total number of patients}\]

In our model, we want to compute the contingency table for the risk factor combinations \(\lambda\) and the asthma diagnosis.

Past Contingency Table
variable 2, outcome + variable 2, outcome -
variable 1, outcome + a0 b0 n1
variable 1, outcome - c0 d0
n2 n

We want to calculate \(a_0\), \(b_0\), \(c_0\), and \(d_0\) using \(n_1\), \(n_2\), \(n\), and \(\omega_{\lambda}\). Now, we have the probabilities of each of the risk factor combinations, \(p(\lambda)\), but for the contingency table, we only want to consider one risk factor combination at a time. To do this, we compute the conditional probability:

\[p(\Lambda = \lambda \mid \Lambda \in \{0, \lambda\}) = \dfrac{p(\Lambda = \lambda)}{p(\Lambda = \lambda) + p(\Lambda = 0)}\]

To obtain \(n_1\), the number of people with risk factor combination \(\lambda\) with or without an asthma diagnosis, we multiply the conditional probability by the total population \(n\):

\[n_1 = p(\Lambda = \lambda \mid \Lambda \in \{0, \lambda\}) \cdot n\]

To obtain \(n_2\), the number of people diagnosed with asthma with or without risk factor combination \(\lambda\):

\[n_2 = (1 - p(\Lambda = \lambda \mid \Lambda \in \{0, \lambda\})) \cdot \zeta_{\text{prev}, 0}(t=0) \cdot n + p(\Lambda = \lambda \mid \Lambda \in \{0, \lambda\}) \cdot \zeta_{\text{prev}, \lambda}(t=0) \cdot n\]

From this, we can calculate the values for the contingency table:

\[\begin{split}b_0 &= n_1 - a_0 \\ c_0 &= n_2 - a_0 \\ d_0 &= n - n_1 - n_2 - a_0\end{split}\]

To obtain \(a_0\), we follow the methods described in the paper [2]. See conv_2x2 for the Python implementation of this method.

Current Contingency Table: Reassessment

According to our model, an asthma diagnosis is not static; a patient may be diagnosed with asthma and then later be reassessed as not having asthma. We would like to compute the updated contingency table:

asthma, outcome + asthma, outcome -
risk factor λ, outcome + a1_ra b1_ra n1
risk factor λ, outcome - c1_ra d1_ra
n2 n

To calculate the updated contingency table, we have:

\[\begin{split}a_{1, \text{ra}} &= a_0 \cdot \rho \\ b_{1, \text{ra}} &= a_0 \cdot (1 - \rho) \\ c_{1, \text{ra}} &= c_0 \cdot \rho \\ d_{1, \text{ra}} &= c_0 \cdot (1 - \rho)\end{split}\]

where:

Risk Factors t=0 t=1
a1_ra λ has asthma diagnosis reassessed: has asthma diagnosis
b1_ra λ has asthma diagnosis reassessed: no asthma diagnosis
c1_ra None has asthma diagnosis reassessed: has asthma diagnosis
d1_ra None has asthma diagnosis reassessed: no asthma diagnosis
  • \(a_{1, \text{ra}}\) is the proportion of the population with risk factor combination \(\lambda\) who had an asthma diagnosis at \(t=0\) and still have it at \(t=1\)

  • \(b_{1, \text{ra}}\) is the proportion of the population with risk factor combination \(\lambda\) who had an asthma diagnosis at \(t=0\) but no longer have it at \(t=1\)

  • \(c_{1, \text{ra}}\) is the proportion of the population with no risk factors (\(\lambda = 0\)) who had an asthma diagnosis at \(t=0\) and still have it at \(t=1\)

  • \(d_{1, \text{ra}}\) is the proportion of the population with no risk factors (\(\lambda = 0\)) who had an asthma diagnosis at \(t=0\) but no longer have it at \(t=1\)

  • \(\rho\) is the probability that a person would be reassessed as having an asthma diagnosis at \(t=1\) given that they had an asthma diagnosis at \(t=0\)

Current Contingency Table: New Diagnosis

For the reassessment table, we considered only the patients who were diagnosed with asthma. We will now consider those who were not diagnosed with asthma:

asthma, outcome + asthma, outcome -
risk factor λ, outcome + a1_dx b1_dx n1
risk factor λ, outcome - c1_dx d1_dx
n2 n

To calculate the updated contingency table, we have:

\[\begin{split}a_{1, \text{dx}} &= b_0 \cdot \zeta_{\text{inc}, \lambda}(t=1) \\ b_{1, \text{dx}} &= b_0 \cdot (1 - \zeta_{\text{inc}, \lambda}(t=1)) \\ c_{1, \text{dx}} &= d_0 \cdot \zeta_{\text{inc}, 0}(t=1) \\ d_{1, \text{dx}} &= d_0 \cdot (1 - \zeta_{\text{inc}, 0}(t=1))\end{split}\]

where:

Risk Factors t=0 t=1
incidence net
a1_dx λ no asthma diagnosis new asthma diagnosis asthma
b1_dx λ no asthma diagnosis no new asthma diagnosis no asthma
c1_dx None no asthma diagnosis new asthma diagnosis asthma
d1_dx None no asthma diagnosis no new asthma diagnosis no asthma
  • \(a_{1, \text{dx}}\) is the proportion of the population with risk factor combination \(\lambda\) who didn’t have an asthma diagnosis at \(t=0\) and were diagnosed at \(t=1\) \(\rightarrow\) have asthma at \(t=1\)

  • \(b_{1, \text{dx}}\) is the proportion of the population with risk factor combination \(\lambda\) who didn’t have an asthma diagnosis at \(t=0\) and were not diagnosed with asthma at \(t=1\), \(\rightarrow\) don’t have asthma at \(t=1\)

  • \(c_{1, \text{dx}}\) is the proportion of the population with no risk factors (\(\lambda = 0\)) who didn’t have an asthma diagnosis at \(t=0\) and were diagnosed at \(t=1\) \(\rightarrow\) have asthma at \(t=1\)

  • \(d_{1, \text{dx}}\) is the proportion of the population with no risk factors (\(\lambda = 0\)) who didn’t have an asthma diagnosis at \(t=0\) and were not diagnosed with asthma at \(t=1\), \(\rightarrow\) don’t have asthma at \(t=1\)

Current Contingency Table

Finally, we can compute the contingency table for the current year, \(t=1\):

asthma, outcome + asthma, outcome -
risk factor λ, outcome + a1 b1 n1
risk factor λ, outcome - c1 d1
n2 n

where:

\[\begin{split}a_1 &= a_{1, \text{ra}} + a_{1, \text{dx}} \\ b_1 &= b_{1, \text{ra}} + b_{1, \text{dx}} \\ c_1 &= c_{1, \text{ra}} + c_{1, \text{dx}} \\ d_1 &= d_{1, \text{ra}} + d_{1, \text{dx}}\end{split}\]

From these values, we can compute the odds ratio:

\[\Omega = \dfrac{a_1 \cdot d_1}{b_1 \cdot c_1}\]
Optimization

We want to find the beta parameters that minimize the difference between the predicted odds ratio \(\Omega\) and the observed odds ratio \(\omega_{\lambda}\).

\[\sum_{i=1}^{N}\sum_{\lambda=1}^{n} \dfrac{\left| \log(\Omega^{(i)}) - \log(\omega_{\lambda}^{(i)}) \right|}{N}\]