Occurrence Model
The BC Ministry of Health Administrative Dataset contains asthma incidence and prevalence data for the years 2000-2019, in 5-year age intervals.
The data is formatted as follows:
Column | Type | Description |
---|---|---|
year | int | Format XXXX, e.g. 2000; range [2000, 2019] |
age | int | The midpoint of the age group, e.g. 3 for the age group 1-5 years |
sex | int | 1 = Female, 2 = Male |
incidence | float | The incidence of asthma in BC for a given year, age group, and sex, per 100 people |
prevalence | float | The prevalence of asthma in BC for a given year, age group, and sex, per 100 people |
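As a minimal sketch of how one row of this table could be represented and checked in code, the block below defines a small record type. The names `OccurrenceRow` and `age_group_midpoint` are hypothetical, and the example row uses placeholder values rather than real data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OccurrenceRow:
    """One row of the dataset, following the column table above (hypothetical type)."""
    year: int          # four-digit year in [2000, 2019]
    age: int           # midpoint of the 5-year age group
    sex: int           # 1 = Female, 2 = Male
    incidence: float   # per 100 people
    prevalence: float  # per 100 people

    def __post_init__(self):
        # Reject rows that fall outside the documented ranges.
        if not 2000 <= self.year <= 2019:
            raise ValueError(f"year out of range: {self.year}")
        if self.sex not in (1, 2):
            raise ValueError(f"sex must be 1 or 2, got {self.sex}")
        if self.incidence < 0 or self.prevalence < 0:
            raise ValueError("rates must be non-negative")


def age_group_midpoint(low: int, high: int) -> int:
    """Midpoint of an inclusive age group, e.g. (1, 5) gives 3."""
    return (low + high) // 2


# Placeholder values for illustration only, not real data.
row = OccurrenceRow(year=2000, age=age_group_midpoint(1, 5), sex=1,
                    incidence=0.0, prevalence=0.0)
```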
Since our model projects into the future, we would like to be able to extend this data beyond 2019. Our model also makes predictions at 1-year age intervals, not 5-year age intervals. To obtain these projections, we use a Generalized Linear Model (GLM). A GLM is a type of regression analysis that generalizes linear regression. See Generalized Linear Models for more information on GLMs.
When fitting a GLM, you must first choose a distribution for the response variable. In our case, the response variable is the asthma prevalence or incidence. Incidence and prevalence are counts of the number of people diagnosed with asthma and the number of people with asthma, respectively, in a given time interval (a year, in our case). Since these are counts, we need a discrete probability distribution. The Poisson distribution is a good choice for our data.
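To make the choice of distribution concrete, here is a small sketch of the Poisson probability mass function, \(P(Y = k) = \mu^k e^{-\mu} / k!\), written with the standard library only. The mean `mu = 4.0` is a hypothetical value chosen purely for illustration; it is not taken from the dataset.

```python
import math


def poisson_pmf(k: int, mu: float) -> float:
    """P(Y = k) for Y ~ Poisson(mu), with mu > 0 and k = 0, 1, 2, ..."""
    if k < 0:
        return 0.0
    # Work in log space so that larger k does not overflow the factorial.
    return math.exp(k * math.log(mu) - mu - math.lgamma(k + 1))


mu = 4.0  # hypothetical mean count, for illustration only
probs = [poisson_pmf(k, mu) for k in range(100)]

# For a Poisson distribution the mean and the variance both equal mu.
mean = sum(k * p for k, p in enumerate(probs))
variance = sum((k - mean) ** 2 * p for k, p in enumerate(probs))
```

The distribution has a single parameter, its mean, which is the quantity the GLM models.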
We also need to choose a link function. Recall that the link function \(g\) relates the mean \(\mu^{(i)}\) to the linear predictor \(\eta^{(i)}\):
\[ g(\mu^{(i)}) = \eta^{(i)} \]
How do we choose a link function? We are free to choose any link function we like, subject to some constraints. For example, the mean of a Poisson distribution is always positive, whereas \(\eta^{(i)}\) can be any real number. We therefore need a link function that maps positive numbers onto the real numbers, so that its inverse turns any real \(\eta^{(i)}\) into a positive mean. The log link function is a good choice for this:
\[ g(\mu^{(i)}) = \log(\mu^{(i)}) = \eta^{(i)} \]
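A short sketch of the log link and its inverse shows why the constraint is satisfied: whatever real value \(\eta^{(i)}\) takes, \(\exp(\eta^{(i)})\) is positive. The function names here are illustrative only.

```python
import math


def log_link(mu: float) -> float:
    """g(mu) = log(mu): takes a positive mean to the real line."""
    return math.log(mu)


def inverse_log_link(eta: float) -> float:
    """g^{-1}(eta) = exp(eta): takes any real number to a positive mean."""
    return math.exp(eta)


# Negative, zero, and positive linear predictors all give positive means.
etas = [-5.0, -1.0, 0.0, 1.0, 5.0]
means = [inverse_log_link(eta) for eta in etas]
```

One consequence of this choice is that an additive change in \(\eta^{(i)}\) becomes a multiplicative change in the mean.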
Now that we have our distribution and link function, we need to decide on a formula for \(\eta^{(i)}\). We are permitted to use linear combinations of functions of the features in our dataset.
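To show how the distribution, the link, and a formula for \(\eta^{(i)}\) fit together, the sketch below fits a Poisson GLM with a log link to a single feature using Newton's method. The names `fit_poisson_log_link`, `xs`, and `ys` are hypothetical, the data are synthetic, and this is not the model's actual fitting code; in practice a statistics library would normally handle this step.

```python
import math


def fit_poisson_log_link(xs, ys, iterations=25):
    """Fit mu = exp(b0 + b1 * x) by Newton's method on the Poisson log-likelihood."""
    b0 = math.log(sum(ys) / len(ys))  # start from the intercept-only fit
    b1 = 0.0
    for _ in range(iterations):
        mus = [math.exp(b0 + b1 * x) for x in xs]
        # Gradient of the log-likelihood: sum of (y - mu) times each feature.
        g0 = sum(y - m for y, m in zip(ys, mus))
        g1 = sum((y - m) * x for x, y, m in zip(xs, ys, mus))
        # Negative Hessian: sum of mu * feature * feature.
        h00 = sum(mus)
        h01 = sum(m * x for x, m in zip(xs, mus))
        h11 = sum(m * x * x for x, m in zip(xs, mus))
        det = h00 * h11 - h01 * h01
        # Solve the 2x2 system (negative Hessian) * step = gradient.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


# Synthetic, noise-free responses generated from known coefficients,
# so the fit should recover b0 = 0.5 and b1 = 0.1.
xs = list(range(10))
ys = [math.exp(0.5 + 0.1 * x) for x in xs]
b0, b1 = fit_poisson_log_link(xs, ys)
```

The same procedure extends to more coefficients by replacing the hand-written 2x2 solve with a general linear solve.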
Let’s start with incidence. We want a formula using age, sex, and year.
Since asthma depends on factors such as pollution and antibiotic use, and these factors change
from year to year, it follows that asthma incidence should depend on the year. Antibiotic use
also depends on age, so we should include age in our formula. Finally, there is a sex difference
in asthma incidence, so we should include sex in our formula.
TODO: Why was this formula chosen?
where:
\(\beta_{k\ell m}\) is the coefficient for the feature \((a^{(i)})^k \cdot (t^{(i)})^{\ell} \cdot (s^{(i)})^m\)
\(a^{(i)}\) is the age
\(t^{(i)}\) is the year
\(s^{(i)}\) is the sex
There are \(2 + 6 * 2 = 14\) coefficients in the incidence model.
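One linear predictor consistent with the coefficient indexing and with the count \(2 + 6 * 2 = 14\) would be the following. It assumes a linear term in year and a degree-5 polynomial in age, each with and without the sex factor; this structure is inferred from the count and is an assumption, not something the text confirms.

\[ \eta^{(i)} = \sum_{m=0}^{1} \beta_{01m} \, t^{(i)} (s^{(i)})^m + \sum_{k=0}^{5} \sum_{m=0}^{1} \beta_{k0m} \, (a^{(i)})^k (s^{(i)})^m \]

The sketch below enumerates the index triples under that assumption and evaluates the resulting linear predictor. All names and coefficient values are hypothetical.

```python
# Index triples (k, ell, m) for the features a**k * t**ell * s**m.
# Assumed structure: year terms (k = 0, ell = 1) plus a degree-5 age polynomial (ell = 0).
year_terms = [(0, 1, m) for m in range(2)]
age_terms = [(k, 0, m) for k in range(6) for m in range(2)]
incidence_terms = year_terms + age_terms


def linear_predictor(coefs, age, year, sex):
    """eta = sum over terms of coefs[(k, ell, m)] * age**k * year**ell * sex**m."""
    return sum(
        beta * age**k * year**ell * sex**m
        for (k, ell, m), beta in coefs.items()
    )


# Hypothetical coefficients: zero everywhere except the intercept.
coefs = {term: 0.0 for term in incidence_terms}
coefs[(0, 0, 0)] = 1.0
eta = linear_predictor(coefs, age=3, year=2000, sex=1)
```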
Next we have the prevalence. We again want a formula using age, sex, and year.
Since asthma prevalence depends on the number of people who have asthma, and this number changes
from year to year, we should include year in our formula. Asthma prevalence also depends on age,
so we should include age in our formula. Finally, there is a sex difference
in asthma incidence and hence prevalence, so we should include sex in our formula.
There are \(6 * 3 * 2 = 36\) coefficients in the prevalence model.
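A linear predictor consistent with the count \(6 * 3 * 2 = 36\) would be a full product of a degree-5 polynomial in age, a degree-2 polynomial in year, and the sex factor. The degrees are inferred from the count and are an assumption, not something the text confirms.

\[ \eta^{(i)} = \sum_{k=0}^{5} \sum_{\ell=0}^{2} \sum_{m=0}^{1} \beta_{k\ell m} \, (a^{(i)})^k (t^{(i)})^{\ell} (s^{(i)})^m \]

The sketch below enumerates the index triples under that assumption and builds the corresponding feature vector for one observation. The names are hypothetical.

```python
from itertools import product

# Assumed degrees: 5 in age (k), 2 in year (ell), 1 in sex (m).
prevalence_terms = list(product(range(6), range(3), range(2)))


def feature_vector(age, year, sex):
    """Evaluate every feature a**k * t**ell * s**m, in the order of prevalence_terms."""
    return [age**k * year**ell * sex**m for k, ell, m in prevalence_terms]


features = feature_vector(age=3, year=2000, sex=1)
```

Raw powers of the calendar year are very large numbers, so a fitting routine would typically work with a centred or rescaled year; whether that is done here is not stated.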