Occurrence Calibration Data

To run the data generation for the incidence/prevalence calibration:

cd LEAP
python3 leap/data_generation/occurrence_calibration_data.py

This will update the file leap/processed_data/asthma_occurrence_correction.csv. This file contains the incidence / prevalence correction parameters and is formatted as follows:

Column	Type	Description
`year`	`int`	format `XXXX`, e.g. `2000`, range `[2000, 2026]`
`age`	`int`	The age in years, a value in `[3, 63]`
`sex`	`str`	`"F"` = Female, `"M"` = Male
`correction`	`float`	The correction term for the asthma incidence / prevalence equation
`type`	`str`	One of `"incidence"` or `"prevalence"`
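
As a quick illustration of this layout, the sketch below parses rows in the same format with Python's standard csv module and looks up one correction value. It assumes a comma-separated file with a header row matching the column names above. The two sample rows and their correction values are invented for the example, and the helper name find_correction is hypothetical, not part of LEAP.

```python
import csv
import io

# Illustrative rows only: the correction values below are invented for this sketch.
SAMPLE_CSV = """year,age,sex,correction,type
2000,3,F,0.10,incidence
2000,3,F,0.20,prevalence
"""


def find_correction(rows, year, age, sex, occurrence_type):
    """Return the correction for one (year, age, sex, type) row, or None if absent."""
    for row in rows:
        if (
            int(row["year"]) == year
            and int(row["age"]) == age
            and row["sex"] == sex
            and row["type"] == occurrence_type
        ):
            return float(row["correction"])
    return None


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
prevalence_correction = find_correction(rows, 2000, 3, "F", "prevalence")
```

To read the real file instead, replace the io.StringIO object with an open file handle for leap/processed_data/asthma_occurrence_correction.csv.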

If you want to rerun the optimization for the \(\beta\) parameters, add the flag --retrain-beta:

cd LEAP
python3 leap/data_generation/occurrence_calibration_data.py --retrain-beta

This will update the file leap/processed_data/asthma_occurrence_correction.csv as described above, and will also update leap/processed_data/occurrence_calibration_parameters.json:

{
   "β_fhx_age": 0.6445257,
   "β_abx_age": -0.2968535
}
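
A minimal sketch of reading these parameters back in Python is shown here. The JSON content is embedded as a string so the example is self-contained; the variable names are illustrative. When reading the file from disk, opening it with encoding="utf-8" is advisable because the keys contain the non-ASCII character β.

```python
import json

# Same content as the example above, embedded as a string for a self-contained sketch.
PARAMS_JSON = '{"β_fhx_age": 0.6445257, "β_abx_age": -0.2968535}'

params = json.loads(PARAMS_JSON)

# Ordered as [β_fhx_age, β_abx_age], the order documented for β_risk_factors_age.
beta_risk_factors_age = [params["β_fhx_age"], params["β_abx_age"]]
```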

Warning

Rerunning the beta parameter optimization is slow and can take up to 24 hours.

leap.data_generation.occurrence_calibration_data module

leap.data_generation.occurrence_calibration_data.get_asthma_occurrence_prediction(age: int, sex: str, year: int, occurrence_type: str, max_asthma_age: int = 62, stabilization_year: int = 2025) → float[source]¶

Predicts the asthma prevalence or incidence based on the given parameters.

Parameters:¶

age: int¶: Age of the individual in years.
sex: str¶: One of "M" or "F".
year: int¶: Year of the prediction.
occurrence_type: str¶: One of "prevalence" or "incidence".
max_asthma_age: int = 62¶: The maximum age for asthma prediction (default is 62).
stabilization_year: int = 2025¶: The year when asthma stabilization occurs (default is 2025).

Returns:¶

A float representing the predicted asthma prevalence or incidence.

leap.data_generation.occurrence_calibration_data.load_occurrence_data(province: str = 'CA', min_year: int = 2000, max_year: int = 2065) → tuple[pandas.core.frame.DataFrame, pandas.core.frame.DataFrame][source]¶

Load the asthma incidence and prevalence data for the given province and year range.

Parameters:¶

province: str = 'CA'¶: The province to load data for.
min_year: int = 2000¶: The minimum year to load data for.
max_year: int = 2065¶: The maximum year to load data for.

Returns:¶

A tuple of two DataFrames. Details:

Incidence dataframe with columns:
- year (int): The year of the prediction.
- age (int): The age of the individual in years, ranging from 3 to 110.
- sex (str): One of "M" or "F".
- incidence (float): The predicted asthma incidence for the given year, age, and sex.
Prevalence dataframe with columns:
- year (int): The year of the prediction.
- age (int): The age of the individual in years, ranging from 3 to 110.
- sex (str): One of "M" or "F".
- prevalence (float): The predicted asthma prevalence for the given year, age, and sex.

leap.data_generation.occurrence_calibration_data.load_reassessment_data(province: str = 'CA') → pandas.core.frame.DataFrame[source]¶

Load the asthma reassessment data for the given province.

Parameters:¶

province: str = 'CA'¶: The province to load data for.

Returns:¶

A DataFrame containing the reassessment data. Columns:

year (int): The year of the reassessment.
age (int): The age of the individual in years, a value in [4, 110].
sex (str): One of "M" or "F".
province (str): The two-letter code for the province, e.g. "CA".
prob (float): The probability that someone diagnosed with asthma previously will maintain their asthma diagnosis in the given year; a value in [0, 1].

leap.data_generation.occurrence_calibration_data.load_family_history_data(β_fam_history: dict[str, float] | None = None) → pandas.core.frame.DataFrame[source]¶

Load the family history odds ratio data.

Parameters:¶

β_fam_history: dict[str, float] | None = None¶

A dictionary of two beta parameters for the odds ratio calculation:

β_fhx_0: The beta parameter for the constant term in the equation
β_fhx_age: The beta parameter for the age term in the equation.

Returns:¶

A dataframe containing the family history odds ratios. It contains the following columns:

age (int): The age of the individual. Ranges from 3 to 5.
fam_history (int): Whether or not there is a family history of asthma:
- 0: one or more parents has asthma,
- 1: neither parent has asthma.
odds_ratio (float): The odds ratio for asthma prevalence based on family history and age. The odds ratio is calculated based on the CHILD study data.

leap.data_generation.occurrence_calibration_data.load_abx_exposure_data(β_abx: dict[str, float] | None = None) → pandas.core.frame.DataFrame[source]¶

Load the antibiotic exposure data.

Parameters:¶

β_abx: dict[str, float] | None = None¶

A dictionary of 3 beta parameters for the odds ratio calculation:

β_abx_0: The beta parameter for the constant term in the equation
β_abx_age: The beta parameter for the age term in the equation
β_abx_dose: The beta parameter for the antibiotic dose term in the equation.

Returns:¶

A dataframe with the odds ratios of asthma prevalence given the number of courses of antibiotics taken during the first year of life. It contains the following columns:

age (int): The age of the individual. An integer in [3, 8].
abx_dose (int): The number of antibiotic courses taken in the first year of life, an integer in [0, 5], where 5 indicates 5 or more courses.
odds_ratio (float): The odds ratio for asthma prevalence based on antibiotic exposure during the first year of life and age. The odds ratio is calculated based on the CHILD study data.

leap.data_generation.occurrence_calibration_data.compute_antibiotic_dose_prob(year: int, sex: str, model_abx: statsmodels.genmod.generalized_linear_model.GLMResultsWrapper) → pandas.core.frame.DataFrame[source]¶

Compute the probability of each number of courses of antibiotics during infancy.

Parameters:¶

year: int¶: The birth year of the person.
sex: str¶: The sex of the infant, one of "M" or "F".
model_abx: statsmodels.genmod.generalized_linear_model.GLMResultsWrapper¶: The fitted Negative Binomial model for the number of courses of antibiotics. This model was fitted using BC Ministry of Health data on antibiotic prescriptions.

Returns:¶

A dataframe with the probability of the number of courses of antibiotics, ranging from 0 - 5+. Columns:

n_abx (int): The number of courses of antibiotics, an integer in [0, 5], where 5 indicates 5 or more courses.
prob (float): The probability that a person of the given sex and birth year was given n_abx courses of antibiotics during the first year of their life.
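
To make the shape of this output concrete, the sketch below turns a negative binomial distribution into six probabilities, one for each of 0 to 4 courses plus a lumped "5 or more" category. This is not the library's implementation: the mean and dispersion values are hypothetical, the function names are invented for the example, and how the fitted model maps year and sex to a mean is not shown.

```python
import math


def nb_pmf(k, mean, alpha):
    """Negative binomial pmf with mean `mean` and variance mean + alpha * mean**2."""
    r = 1.0 / alpha
    p = r / (r + mean)
    log_pmf = (
        math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
        + r * math.log(p) + k * math.log(1.0 - p)
    )
    return math.exp(log_pmf)


def dose_probabilities(mean, alpha, max_dose=5):
    """Probabilities for 0..max_dose-1 courses, with the tail lumped into max_dose."""
    probs = [nb_pmf(k, mean, alpha) for k in range(max_dose)]
    probs.append(1.0 - sum(probs))  # P(n_abx >= max_dose)
    return probs


# Hypothetical mean and dispersion, chosen only for illustration.
probs = dose_probabilities(mean=0.8, alpha=0.5)
```

The list index plays the role of n_abx, and the entries sum to 1 by construction.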

leap.data_generation.occurrence_calibration_data.calculate_odds_ratio_abx(age: int, dose: int, β_abx: dict[str, float] | None = None) → float[source]¶

Calculate the odds ratio for asthma prevalence based on antibiotic exposure.

\[\begin{split}\log(\omega(d_{\lambda})) = \begin{cases} \beta_{\text{abx_0}} + \beta_{\text{abx_age}} \cdot \text{min}(a^{(i)}, 7) + \beta_{\text{abx_dose}} \cdot \text{min}(d^{(i)}, 3) && d^{(i)} > 0 \text{ and } a^{(i)} \leq 7 \\ \\ 0 && \text{otherwise} \end{cases}\end{split}\]

where:

\(\beta_{\text{abx_xxx}}\) is a constant coefficient
\(a^{(i)}\) is the age
\(d^{(i)}\) is the number of courses of antibiotics taken during the first year of life
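
The piecewise equation above can be sketched directly in Python. This is an illustration of the formula, not the module's source: the function name odds_ratio_abx_sketch is hypothetical, and the coefficient values are copied from the β_risk_factors default shown in the calibrate_asthma_incidence signature. Whether calculate_odds_ratio_abx falls back to those same values when β_abx is None is an assumption.

```python
import math

# Values copied from the β_risk_factors default in the calibrate_asthma_incidence
# signature; using them as the defaults here is an assumption.
DEFAULT_BETA_ABX = {"β_abx_0": 1.82581, "β_abx_age": -0.225, "β_abx_dose": 0.053}


def odds_ratio_abx_sketch(age, dose, beta_abx=DEFAULT_BETA_ABX):
    """Evaluate the piecewise log odds ratio and exponentiate it."""
    if dose > 0 and age <= 7:
        log_or = (
            beta_abx["β_abx_0"]
            + beta_abx["β_abx_age"] * min(age, 7)
            + beta_abx["β_abx_dose"] * min(dose, 3)
        )
        return math.exp(log_or)
    return 1.0  # log odds ratio of 0 in the "otherwise" branch
```

With these coefficients, no antibiotic exposure or an age above 7 gives an odds ratio of 1, and the dose effect stops growing after 3 courses.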

Parameters:¶

age: int¶

The age of the individual in years.

dose: int¶

The number of antibiotic courses taken in the first year of life, an integer in [0, 5], where 5 indicates 5 or more courses.

β_abx: dict[str, float] | None = None¶

The parameters for the odds ratio calculation. If None, the default parameters are used. Dictionary keys:

β_abx_0: The beta parameter for the constant term in the equation
β_abx_age: The beta parameter for the age term in the equation
β_abx_dose: The beta parameter for the antibiotic dose term in the equation.

Returns:¶

A float representing the odds ratio for asthma prevalence based on antibiotic exposure and age.

leap.data_generation.occurrence_calibration_data.calculate_odds_ratio_fam_history(age: int, fam_hist: int, β_fam_hist: dict[str, float] | None = None) → float[source]¶

Calculate the odds ratio for asthma prevalence based on family history.

Parameters:¶

age: int¶

The age of the individual in years.

fam_hist: int¶

Whether or not there is a family history of asthma:

0: one or more parents has asthma
1: neither parent has asthma

β_fam_hist: dict[str, float] | None = None¶

The beta parameters for the odds ratio calculation:

β_fhx_0: The beta parameter for the constant term in the odds ratio equation
β_fhx_age: The beta parameter for the age term in the odds ratio equation.

Returns:¶

A float representing the odds ratio for asthma prevalence based on family history and age.

leap.data_generation.occurrence_calibration_data.calculate_odds_ratio_risk_factors(fam_hist: int, age: int, dose: int, β_risk_factors: dict[str, dict[str, float]] | None = None) → float[source]¶

Calculate the odds ratio for asthma prevalence based on family history and antibiotic exposure.

Parameters:¶

fam_hist: int¶

Whether or not there is a family history of asthma:

0: one or more parents has asthma
1: neither parent has asthma

age: int¶

The age of the individual in years.

dose: int¶

The number of antibiotic courses taken in the first year of life, an integer in [0, 5], where 5 indicates 5 or more courses.

β_risk_factors: dict[str, dict[str, float]] | None = None¶

A dictionary of beta parameters for the risk factor equations. Must contain the following keys:

fam_history: A dictionary with the beta parameters for the family history odds ratio calculation. Must contain the keys β_fhx_0 and β_fhx_age.
abx: A dictionary with the beta parameters for the antibiotic exposure odds ratio calculation. Must contain the keys β_abx_0, β_abx_age, and β_abx_dose.

Returns:¶

The odds ratio for asthma prevalence based on family history and antibiotic exposure.

leap.data_generation.occurrence_calibration_data.risk_factor_generator(year: int, age: int, sex: str, model_abx: statsmodels.genmod.generalized_linear_model.GLMResultsWrapper, β_fam_history: dict[str, float] | None = None, β_abx: dict[str, float] | None = None) → pandas.core.frame.DataFrame[source]¶

Compute the combined antibiotic exposure and family history odds ratio.

Parameters:¶

year: int¶

The current year.

age: int¶

The age of the person in years.

sex: str¶

One of "M" or "F".

model_abx: statsmodels.genmod.generalized_linear_model.GLMResultsWrapper¶

The fitted Negative Binomial model for the number of courses of antibiotics.

β_fam_history: dict[str, float] | None = None¶

A dictionary of 2 beta parameters for the calculation of the odds ratio of having asthma given family history:

β_fhx_0: The beta parameter for the constant term in the equation.
β_fhx_age: The beta parameter for the age term in the equation.

β_abx: dict[str, float] | None = None¶

A dictionary of 3 beta parameters for the calculation of the odds ratio of having asthma given antibiotic exposure during infancy:

β_abx_0: The beta parameter for the constant term in the equation.
β_abx_age: The beta parameter for the age term in the equation.
β_abx_dose: The beta parameter for the antibiotic dose term in the equation.

Returns:¶

A dataframe with the combined probabilities and odds ratios for the antibiotic exposure and family history risk factors. Columns:

fam_history (int): Whether or not there is a family history of asthma.
- 0: one or more parents has asthma
- 1: neither parent has asthma
abx_exposure (int): The number of antibiotic courses taken in the first year of life; an integer in [0, 5], where 5 indicates 5 or more courses.
year (int): The given year.
sex (str): One of M or F.
age (int): The age of the person in years.
prob (float): The probability of antibiotic exposure * probability of one or more parents having asthma given that the person has asthma.

odds_ratio (float): The combined odds ratio:

odds_ratio = odds_ratio_abx * odds_ratio_fam_history
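
The sketch below shows one way to build such a table of risk factor combinations using only the standard library: every family history level is crossed with every antibiotic exposure level, and both the probabilities and the odds ratios are multiplied. The function name and all input numbers are hypothetical placeholders with no epidemiological meaning. The real function returns a pandas DataFrame and derives its inputs from the fitted model and the β parameters.

```python
from itertools import product


def combine_risk_factors(fam_history_levels, abx_levels):
    """Cross two risk factor tables, multiplying probabilities and odds ratios.

    Each input is a list of (level, prob, odds_ratio) tuples.
    """
    rows = []
    for (fhx, p_fhx, or_fhx), (abx, p_abx, or_abx) in product(
        fam_history_levels, abx_levels
    ):
        rows.append({
            "fam_history": fhx,
            "abx_exposure": abx,
            "prob": p_fhx * p_abx,
            "odds_ratio": or_abx * or_fhx,
        })
    return rows


# Placeholder inputs for illustration only.
fam_history_levels = [(0, 0.7, 1.0), (1, 0.3, 1.5)]
abx_levels = [(0, 0.5, 1.0), (1, 0.3, 2.0), (2, 0.2, 2.5)]
rows = combine_risk_factors(fam_history_levels, abx_levels)
```

Because each input's probabilities sum to 1, the combined prob column also sums to 1 across all rows.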

class leap.data_generation.occurrence_calibration_data.ResultsPrevalence¶

Bases: dict

α : float¶

β : list[float] | numpy.ndarray¶

ζ_λ : list[float] | numpy.ndarray¶

ζ : float¶

leap.data_generation.occurrence_calibration_data.calibrate_asthma_prevalence(year: int, sex: str, age: int, model_abx: statsmodels.genmod.generalized_linear_model.GLMResultsWrapper, df_prevalence: pandas.core.frame.DataFrame) → leap.data_generation.occurrence_calibration_data.ResultsPrevalence[source]¶

Calibrate the asthma prevalence for the given year, age, and sex.

Our goal is to find the correction term for the asthma prevalence, \(\alpha\), which is given by:

\[\alpha = \sum_{\lambda=1}^{n} p(\lambda) \cdot \beta_{\lambda}\]

where:

\(p(\lambda)\) is the probability of risk factor level \(\lambda\), risk_factor_prob[λ]
\(\beta_{\lambda}\) is the parameter for risk factor level \(\lambda\), asthma_prev_risk_factor_params[λ]

To do so, we need to determine \(\beta_{\lambda}\) for each risk factor level. To find the \(\beta_{\lambda}\) parameters, we minimize the absolute difference \(\Delta\) between the calibrated and the target prevalence:

\[\text{min}(\Delta) = \text{min}(| \zeta - \eta |)\]

where:

\(\zeta\) is the predicted / calibrated asthma prevalence
\(\eta\) is the target asthma prevalence, asthma_prev_target, from the model of the BC Ministry of Health data.

We have \(\eta\) from the occurrence model, so we only need to find \(\zeta\). We can write \(\zeta\) in terms of \(\zeta_{\lambda}\), the predicted asthma prevalence at risk factor level \(\lambda\):

\[\zeta = \sum_{\lambda=0}^{n} p(\lambda) \cdot \zeta_{\lambda}\]

We also already know \(p(\lambda)\), the probability of risk factor level \(\lambda\). Finally, we can express \(\zeta_{\lambda}\) as:

\[\begin{split}\zeta_{\lambda} &= \sigma(\beta_0 + \log(\omega_{\lambda}) - \alpha) \\ \beta_0 &= \sigma^{-1}(\eta)\end{split}\]

where:

\(\omega_{\lambda}\) is the odds ratio for risk factor level \(\lambda\), odds_ratio_target[λ]
\(\alpha\) is the correction term for the asthma prevalence

Putting everything together, the calibrated asthma prevalence is given by:

\[\zeta = \sum_{\lambda=0}^{n} p(\lambda) \cdot \sigma\left(\sigma^{-1}(\eta) + \log(\omega_{\lambda}) - \sum_{\gamma=1}^{n} p(\gamma) \cdot \beta_{\gamma}\right)\]
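
As a worked illustration of the formulas above, the following sketch evaluates \(\alpha\), \(\zeta_{\lambda}\), \(\zeta\), and \(\Delta\) for a given set of \(\beta_{\lambda}\) values. It only evaluates the expressions and does not perform the minimization. The function names and all numeric inputs are hypothetical, and lists are indexed by risk factor level 0..n, with index 0 skipped in the sum for \(\alpha\) to match the summation limits shown.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def logit(p):
    return math.log(p / (1.0 - p))


def calibrated_prevalence(eta, probs, odds_ratios, betas):
    """Evaluate alpha, zeta_lambda and zeta for risk factor levels 0..n.

    probs[l], odds_ratios[l] and betas[l] describe level l; betas[0] is unused
    because the sum defining alpha starts at level 1.
    """
    alpha = sum(p * b for p, b in zip(probs[1:], betas[1:]))
    beta_0 = logit(eta)
    zeta_lambda = [sigmoid(beta_0 + math.log(w) - alpha) for w in odds_ratios]
    zeta = sum(p * z for p, z in zip(probs, zeta_lambda))
    return alpha, zeta_lambda, zeta


# Hypothetical inputs chosen only for illustration.
eta = 0.10
probs = [0.6, 0.3, 0.1]
odds_ratios = [1.0, 2.0, 3.0]
betas = [0.0, math.log(2.0), math.log(3.0)]

alpha, zeta_lambda, zeta = calibrated_prevalence(eta, probs, odds_ratios, betas)
delta = abs(zeta - eta)  # the quantity the calibration minimizes
```

A useful sanity check is that when every odds ratio is 1 and every \(\beta_{\lambda}\) is 0, the calibrated prevalence equals the target \(\eta\).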

Parameters:¶

year: int¶

The integer year.

sex: str¶

One of M = male, F = female.

age: int¶

The age in years.

model_abx: statsmodels.genmod.generalized_linear_model.GLMResultsWrapper¶

The fitted Negative Binomial model for the number of courses of antibiotics.

df_prevalence: pandas.core.frame.DataFrame¶

A dataframe with the prevalence of asthma, with the following columns:

year (int): the year
age (int): the age in years
sex (str): M = male, F = female
prevalence (float): the prevalence of asthma for the given year, age, and sex

Returns:¶

A dictionary containing the calibrated asthma prevalence for the given year, age, and sex. The dictionary has the following keys:

α (float): The prevalence correction factor.
β (list[float]): A vector of the beta parameters for the risk factors.
ζ_λ (list[float]): A vector of the calibrated asthma prevalence for each risk factor combination λ for the current year.
ζ (float): The calibrated asthma prevalence for the current year.

class leap.data_generation.occurrence_calibration_data.ResultsIncidence¶

Bases: dict

α : float¶

ζ_λ : list[float] | numpy.ndarray¶

ζ_prev_λ : list[float] | numpy.ndarray¶

risk_sets : Dict[Literal['inc', 'past_prev'], pandas.core.frame.DataFrame]¶

leap.data_generation.occurrence_calibration_data.calibrate_asthma_incidence(year: int, sex: str, age: int, model_abx: statsmodels.genmod.generalized_linear_model.GLMResultsWrapper, df_incidence: pandas.core.frame.DataFrame, df_prevalence: pandas.core.frame.DataFrame, β_risk_factors: dict[str, dict[str, float]] = {'abx': {'β_abx_0': 1.82581, 'β_abx_age': -0.225, 'β_abx_dose': 0.053}, 'fam_history': {'β_fhx_0': np.float64(0.12221763272424911), 'β_fhx_age': np.float64(0.37662555231482536)}}, min_year: int = 2000) → leap.data_generation.occurrence_calibration_data.ResultsIncidence[source]¶

Calibrate the asthma incidence for the given year, age, and sex.

Parameters:¶

year: int¶

The integer year.

age: int¶

The age in years.

sex: str¶

One of M = male, F = female.

model_abx: statsmodels.genmod.generalized_linear_model.GLMResultsWrapper¶

The fitted Negative Binomial model for the number of courses of antibiotics.

df_incidence: pandas.core.frame.DataFrame¶

A dataframe with the incidence of asthma, with the following columns:

year (int): the year
age (int): the age in years
sex (str): M = male, F = female
incidence (float): the incidence of asthma for the given year, age, and sex.

df_prevalence: pandas.core.frame.DataFrame¶

A dataframe with the prevalence of asthma, with the following columns:

year (int): the year
age (int): the age in years
sex (str): M = male, F = female
prevalence (float): the prevalence of asthma for the given year, age, and sex.

β_risk_factors: dict[str, dict[str, float]] = {'abx': {'β_abx_0': 1.82581, 'β_abx_age': -0.225, 'β_abx_dose': 0.053}, 'fam_history': {'β_fhx_0': np.float64(0.12221763272424911), 'β_fhx_age': np.float64(0.37662555231482536)}}¶

A dictionary of beta parameters for the risk factor equations:

fam_history: A dictionary with the beta parameters for the family history odds ratio calculation. Must contain the keys β_fhx_0 and β_fhx_age.
abx: A dictionary with the beta parameters for the antibiotic exposure odds ratio calculation. Must contain the keys β_abx_0, β_abx_age, and β_abx_dose.

min_year: int = 2000¶

The minimum year to consider for the calibration.

Returns:¶

A dictionary containing the calibrated asthma incidence for the given year, age, and sex. The dictionary has the following keys:

α: The incidence correction factor.
ζ_λ: A vector of the calibrated asthma incidence for each risk factor combination λ for the current year.
ζ_prev_λ: A vector of the calibrated asthma prevalence for each risk factor combination λ for the previous year.
risk_sets: A dictionary with two keys:
- inc: A dataframe with the incidence risk factors and their probabilities.
- past_prev: A dataframe with the risk factors and their probabilities and odds ratios for the previous year.

leap.data_generation.occurrence_calibration_data.compute_mean_diff_log_OR(β_risk_factors_age: list[float], df: pandas.core.frame.DataFrame, model_abx: statsmodels.genmod.generalized_linear_model.GLMResultsWrapper, df_incidence: pandas.core.frame.DataFrame, df_prevalence: pandas.core.frame.DataFrame, df_reassessment: pandas.core.frame.DataFrame, min_year: int = 2000) → float[source]¶

Compute the mean difference in log odds ratio for the given model and data.

Parameters:¶

β_risk_factors_age: list[float]¶

A list of two beta parameters, β_fhx_age and β_abx_age.

df: pandas.core.frame.DataFrame¶

A dataframe with the following columns:

year (int): the year
age (int): the age in years
sex (str): M = male, F = female
index (int): the index of the row in the dataframe

model_abx: statsmodels.genmod.generalized_linear_model.GLMResultsWrapper¶

The fitted Negative Binomial model for the number of courses of antibiotics.

df_incidence: pandas.core.frame.DataFrame¶

A dataframe with the incidence of asthma, with the following columns:

year (int): the year
age (int): the age in years
sex (str): M = male, F = female
incidence (float): the incidence of asthma for the given year, age, and sex.

df_prevalence: pandas.core.frame.DataFrame¶

A dataframe with the prevalence of asthma, with the following columns:

year (int): the year
age (int): the age in years
sex (str): M = male, F = female
prevalence (float): the prevalence of asthma for the given year, age, and sex.

df_reassessment: pandas.core.frame.DataFrame¶

A dataframe with the reassessment of asthma, with the following columns:

year (int): the year
age (int): the age in years
sex (str): M = male, F = female
prob (float): the probability that someone diagnosed with asthma previously will maintain their asthma diagnosis in the given year.

min_year: int = 2000¶

The minimum year to consider for the calibration.

Returns:¶

The mean difference in log odds ratio for the given model and data.

leap.data_generation.occurrence_calibration_data.beta_params_age_optimizer(model_abx: statsmodels.genmod.generalized_linear_model.GLMResultsWrapper, df_incidence: pandas.core.frame.DataFrame, df_prevalence: pandas.core.frame.DataFrame, df_reassessment: pandas.core.frame.DataFrame, baseline_year: int = 2001, stabilization_year: int = 2025, max_age: int = 9, min_year: int = 2000, β_risk_factors_age: list[float] = [np.float64(0.37662555231482536), -0.225]) → None[source]¶

Optimize the risk factor beta parameters for the age terms.

Parameters:¶

model_abx: statsmodels.genmod.generalized_linear_model.GLMResultsWrapper¶

The fitted Negative Binomial model for the number of courses of antibiotics.

df_incidence: pandas.core.frame.DataFrame¶

A dataframe with the incidence of asthma, with the following columns:

year (int): the year
age (int): the age in years
sex (str): M = male, F = female
incidence (float): the incidence of asthma for the given year, age, and sex.

df_prevalence: pandas.core.frame.DataFrame¶

A dataframe with the prevalence of asthma, with the following columns:

year (int): the year
age (int): the age in years
sex (str): M = male, F = female
prevalence (float): the prevalence of asthma for the given year, age, and sex.

df_reassessment: pandas.core.frame.DataFrame¶

A dataframe with the reassessment of asthma, with the following columns:

year (int): the year
age (int): the age in years
sex (str): M = male, F = female
prob (float): the probability that someone diagnosed with asthma previously will maintain their asthma diagnosis in the given year.

baseline_year: int = 2001¶

The baseline year for the calibration.

stabilization_year: int = 2025¶

The stabilization year for the calibration.

max_age: int = 9¶

The maximum age to consider for the calibration.

min_year: int = 2000¶

The minimum year to consider for the calibration.

β_risk_factors_age: list[float] = [np.float64(0.37662555231482536), -0.225]¶

A list of two beta parameters, β_fhx_age and β_abx_age, to be used as the initial values in the optimization.

leap.data_generation.occurrence_calibration_data.generate_occurrence_calibration_data(province: str = 'CA', min_year: int = 2000, max_year: int = 2065, baseline_year: int = 2001, stabilization_year: int = 2025, max_age: int = 9, retrain_beta: bool = False)[source]¶

Generate the occurrence calibration data for the given province and year range.

Parameters:¶

province: str = 'CA'¶: The province to load data for.
min_year: int = 2000¶: The minimum year to load data for.
max_year: int = 2065¶: The maximum year to load data for.
baseline_year: int = 2001¶: The baseline year for the calibration.
stabilization_year: int = 2025¶: The stabilization year for the calibration.
max_age: int = 9¶: The maximum age to consider for the calibration.
retrain_beta: bool = False¶: If True, re-run the fit for the β_risk_factors. Otherwise, load the saved parameters from a json file.