Occurrence Model

The BC Ministry of Health Administrative Dataset contains asthma incidence and prevalence data for the years 2000-2019, in 5-year age intervals.

The data is formatted as follows:

| Column | Type | Description |
|---|---|---|
| year | int | Four-digit year, e.g. 2000; range [2000, 2019] |
| age | int | The midpoint of the age group, e.g. 3 for the age group 1-5 years |
| sex | int | 1 = Female, 2 = Male |
| incidence | float | The incidence of asthma in BC for a given year, age group, and sex, per 100 people |
| prevalence | float | The prevalence of asthma in BC for a given year, age group, and sex, per 100 people |
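As an illustration of the layout above, one row of the dataset could be represented as follows. This is only a sketch: the class name `OccurrenceRow` and the example values are hypothetical, and the checks encode nothing beyond the ranges stated in the column descriptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OccurrenceRow:
    """One row of the occurrence data (hypothetical container, for illustration)."""

    year: int          # four-digit year in [2000, 2019]
    age: int           # midpoint of the 5-year age group
    sex: int           # 1 = Female, 2 = Male
    incidence: float   # per 100 people
    prevalence: float  # per 100 people

    def __post_init__(self):
        # Reject values outside the ranges given in the column descriptions.
        if not 2000 <= self.year <= 2019:
            raise ValueError(f"year out of range: {self.year}")
        if self.sex not in (1, 2):
            raise ValueError(f"sex must be 1 or 2, got {self.sex}")
        if self.incidence < 0 or self.prevalence < 0:
            raise ValueError("rates must be non-negative")


# Made-up values, not taken from the dataset.
example = OccurrenceRow(year=2000, age=3, sex=1, incidence=0.5, prevalence=4.0)
```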

Since our model projects into the future, we need to extend this data beyond 2019. Our model also makes predictions at 1-year age intervals rather than 5-year age intervals. To obtain these projections, we use a Generalized Linear Model (GLM), a generalization of linear regression that allows the response variable to follow a distribution other than the normal distribution. See Generalized Linear Models for more information on GLMs.

When fitting a GLM, the first step is to choose a distribution for the response variable. In our case, the response variable is the asthma prevalence or incidence. Incidence and prevalence are counts of the number of people newly diagnosed with asthma and the number of people living with asthma, respectively, in a given time interval (a year, in our case). Since these are counts, we need a discrete probability distribution. The Poisson distribution, a standard model for count data, is a good choice here.

\[P(Y = y) = p(y; \mu^{(i)}) = \dfrac{(\mu^{(i)})^{y} ~ e^{-\mu^{(i)}}}{y!}\]
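The probability mass function above can be evaluated directly. The following is a minimal sketch using only the Python standard library; the function name `poisson_pmf` is chosen here for illustration.

```python
import math


def poisson_pmf(y: int, mu: float) -> float:
    """P(Y = y) for a Poisson distribution with mean mu > 0."""
    if y < 0:
        return 0.0  # counts cannot be negative
    return mu ** y * math.exp(-mu) / math.factorial(y)


# The probabilities over all counts sum to 1 (up to truncation of the tail).
total = sum(poisson_pmf(y, 3.0) for y in range(60))
```

For large counts the direct formula can overflow; working with the logarithm of the probability is a common workaround.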

We also need to choose a link function. Recall that the link function \(g(\mu^{(i)})\) relates the mean \(\mu^{(i)}\) to the linear predictor \(\eta^{(i)}\):

\[\begin{split}g(\mu^{(i)}) &= \eta^{(i)} \\ \mu^{(i)} &= E(Y | X = x^{(i)})\end{split}\]

How do we choose a link function? In principle we are free to choose any link function, subject to some constraints. In the Poisson distribution, the mean \(\mu^{(i)}\) is always positive, whereas \(\eta^{(i)}\) can be any real number. We therefore need a link function that maps positive numbers onto the real line, so that its inverse turns any real \(\eta^{(i)}\) into a positive mean. The log link function is a good choice for this:

\[g(\mu^{(i)}) = \log(\mu^{(i)}) = \eta^{(i)}\]
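To make the direction of the mapping concrete, here is a small sketch of the log link and its inverse. The function names are illustrative only.

```python
import math


def log_link(mu: float) -> float:
    """g(mu) = log(mu): maps a positive mean to the real line."""
    if mu <= 0:
        raise ValueError("the Poisson mean must be positive")
    return math.log(mu)


def inverse_log_link(eta: float) -> float:
    """g^{-1}(eta) = exp(eta): maps any real eta to a positive mean."""
    return math.exp(eta)


# Any real linear predictor, even a negative one, yields a positive mean.
mean_from_negative_eta = inverse_log_link(-5.0)
```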

Now that we have our distribution and link function, we need to decide on a formula for \(\eta^{(i)}\). We are permitted to use linear combinations of functions of the features in our dataset.
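Once a formula for \(\eta^{(i)}\) is fixed, the coefficients are estimated by maximum likelihood. The sketch below shows the idea for the simplest possible case, \(\eta^{(i)} = \beta_0 + \beta_1 x^{(i)}\), using Newton-Raphson iterations on the Poisson log-likelihood with a log link. It is not the fitting procedure used for the occurrence model, which is not described here; the function name and the toy data are made up for illustration.

```python
import math


def fit_poisson_log_link(xs, ys, iterations=25):
    """Fit log(mu) = b0 + b1 * x by Newton-Raphson (single feature only)."""
    # Start from the intercept-only fit: mu equal to the mean count, slope 0.
    b0 = math.log(sum(ys) / len(ys))
    b1 = 0.0
    for _ in range(iterations):
        mus = [math.exp(b0 + b1 * x) for x in xs]
        # Gradient of the log-likelihood with respect to (b0, b1).
        g0 = sum(y - mu for y, mu in zip(ys, mus))
        g1 = sum((y - mu) * x for x, y, mu in zip(xs, ys, mus))
        # Fisher information matrix [[h00, h01], [h01, h11]].
        h00 = sum(mus)
        h01 = sum(mu * x for x, mu in zip(xs, mus))
        h11 = sum(mu * x * x for x, mu in zip(xs, mus))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system in closed form.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


# Toy counts that grow with x (made-up data).
toy_x = [0, 1, 2, 3, 4]
toy_y = [2, 3, 6, 7, 12]
b0_hat, b1_hat = fit_poisson_log_link(toy_x, toy_y)
```

With many features the same iteration is carried out with matrices, which is what statistical packages do internally.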

Let’s start with incidence. We want a formula using age, sex, and year. Since asthma depends on factors such as pollution and antibiotic use, and these factors change from year to year, it follows that asthma incidence should depend on the year. Antibiotic use also depends on age, so we should include age in our formula. Finally, there is a sex difference in asthma incidence, so we should include sex in our formula.

TODO: Why was this formula chosen?

\[\eta^{(i)} = \sum_{m=0}^{1} \beta_{01m} \cdot t^{(i)} \cdot (s^{(i)})^m + \sum_{k=0}^{5} \sum_{m=0}^{1} \beta_{k0m} \cdot (a^{(i)})^k \cdot (s^{(i)})^m\]

where:

  • \(\beta_{k\ell m}\) is the coefficient for the feature \((a^{(i)})^k \cdot (t^{(i)})^{\ell} \cdot (s^{(i)})^m\)

  • \(a^{(i)}\) is the age

  • \(t^{(i)}\) is the year

  • \(s^{(i)}\) is the sex

There are \(2 + 6 \times 2 = 14\) coefficients in the incidence model.
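The incidence formula can be read as a dot product between 14 coefficients and 14 features. The sketch below enumerates those features for a single observation. The function names are illustrative, and the sex code is passed through unchanged as an assumption; the actual model may recode or rescale its inputs.

```python
def incidence_features(age, year, sex):
    """Return the 14 features of the incidence formula for one observation."""
    # Year terms: t * s^m for m = 0, 1 (2 features).
    year_terms = [year * sex ** m for m in range(2)]
    # Age terms: a^k * s^m for k = 0..5 and m = 0, 1 (12 features).
    age_terms = [age ** k * sex ** m for k in range(6) for m in range(2)]
    return year_terms + age_terms


def linear_predictor(features, betas):
    """eta = sum of coefficient * feature."""
    if len(features) != len(betas):
        raise ValueError("one coefficient is needed per feature")
    return sum(b * f for b, f in zip(betas, features))


features = incidence_features(age=3, year=2000, sex=2)
```

The predicted mean is then obtained by applying the inverse link, \(\mu^{(i)} = \exp(\eta^{(i)})\).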

Next we have the prevalence. We again want a formula using age, sex, and year. Since the number of people living with asthma changes from year to year, we should include year in our formula. Asthma prevalence also depends on age, so we should include age in our formula. Finally, there is a sex difference in asthma incidence and hence prevalence, so we should include sex in our formula.

\[\eta^{(i)} = \sum_{k=0}^{5} \sum_{\ell=0}^2 \sum_{m=0}^1 \beta_{k \ell m} \cdot (a^{(i)})^k \cdot (t^{(i)})^{\ell} \cdot (s^{(i)})^m\]

There are \(6 \times 3 \times 2 = 36\) coefficients in the prevalence model.
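The prevalence formula uses every combination of the age, year, and sex powers, so its features can be enumerated with a triple loop. This is a sketch with an illustrative function name; the inputs are used as given, which is an assumption about how the model encodes them.

```python
from itertools import product


def prevalence_features(age, year, sex):
    """Return the 36 features a^k * t^l * s^m, k = 0..5, l = 0..2, m = 0, 1."""
    return [
        age ** k * year ** l * sex ** m
        for k, l, m in product(range(6), range(3), range(2))
    ]


# Small toy inputs keep the numbers readable.
toy_features = prevalence_features(age=2, year=3, sex=2)
```

With raw calendar years, terms such as \((a^{(i)})^5 \cdot (t^{(i)})^2\) become very large, so predictors are often centred or scaled before fitting. Whether this model does so is not stated here.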