Occurrence Calibration Data

To run the data generation for the incidence/prevalence calibration:

cd LEAP
python3 leap/data_generation/occurrence_calibration_data.py

This will update the file leap/processed_data/asthma_occurrence_correction.csv. This file contains the incidence / prevalence correction parameters and is formatted as follows:

Column      Type   Description
year        int    The year in format XXXX, e.g. 2000; a value in [2000, 2026]
age         int    The age in years; a value in [3, 63]
sex         str    "F" = Female, "M" = Male
correction  float  The correction term for the asthma incidence / prevalence equation
type        str    One of "incidence" or "prevalence"
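As a quick check of this layout, the sketch below parses a few rows in the documented column format and looks up one correction term. It assumes the file is comma-separated with a header row matching the column names; the rows are invented placeholder values, and the helper names load_corrections and lookup_correction are hypothetical, not part of LEAP.

```python
import csv
import io

# Placeholder rows in the documented column layout (values are invented).
SAMPLE_CSV = """year,age,sex,correction,type
2000,3,F,-0.12,incidence
2000,3,M,-0.08,incidence
2000,3,F,0.05,prevalence
"""


def load_corrections(handle):
    """Parse rows, converting each column to its documented type."""
    rows = []
    for row in csv.DictReader(handle):
        rows.append({
            "year": int(row["year"]),
            "age": int(row["age"]),
            "sex": row["sex"],
            "correction": float(row["correction"]),
            "type": row["type"],
        })
    return rows


def lookup_correction(rows, year, age, sex, occurrence_type):
    """Return the correction term for one (year, age, sex, type) cell."""
    for row in rows:
        key = (row["year"], row["age"], row["sex"], row["type"])
        if key == (year, age, sex, occurrence_type):
            return row["correction"]
    raise KeyError((year, age, sex, occurrence_type))


rows = load_corrections(io.StringIO(SAMPLE_CSV))
correction = lookup_correction(rows, 2000, 3, "F", "prevalence")
```

To read the real file, open leap/processed_data/asthma_occurrence_correction.csv and pass the file handle to load_corrections in place of the in-memory sample.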

If you want to rerun the optimization for the \(\beta\) parameters, add the flag --retrain-beta:

cd LEAP
python3 leap/data_generation/occurrence_calibration_data.py --retrain-beta

This will update the file leap/processed_data/asthma_occurrence_correction.csv as described above, and will also update leap/processed_data/occurrence_calibration_parameters.json:

{
   "β_fhx_age": 0.6445257,
   "β_abx_age": -0.2968535
}
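The saved file holds only the two age coefficients, while the calibration functions documented below take a nested β_risk_factors dictionary. The following sketch shows one way the two could be combined. Whether LEAP merges them exactly like this is an assumption; the helper merge_age_params is hypothetical, and the default values are copied from the calibrate_asthma_incidence signature.

```python
import json

# The JSON content shown above, inlined so the sketch runs on its own.
SAVED_PARAMS = '{"β_fhx_age": 0.6445257, "β_abx_age": -0.2968535}'

# Default values copied from the calibrate_asthma_incidence signature.
DEFAULT_BETA = {
    "fam_history": {
        "β_fhx_0": 0.12221763272424911,
        "β_fhx_age": 0.37662555231482536,
    },
    "abx": {"β_abx_0": 1.82581, "β_abx_age": -0.225, "β_abx_dose": 0.053},
}


def merge_age_params(defaults, saved):
    """Return a copy of defaults with the two age terms replaced by saved values."""
    merged = {group: dict(params) for group, params in defaults.items()}
    merged["fam_history"]["β_fhx_age"] = saved["β_fhx_age"]
    merged["abx"]["β_abx_age"] = saved["β_abx_age"]
    return merged


saved = json.loads(SAVED_PARAMS)
beta_risk_factors = merge_age_params(DEFAULT_BETA, saved)
```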

Warning

Rerunning the beta parameter optimization is slow and can take up to 24 hours.

leap.data_generation.occurrence_calibration_data module

leap.data_generation.occurrence_calibration_data.get_asthma_occurrence_prediction(age: int, sex: str, year: int, occurrence_type: str, max_asthma_age: int = 62, stabilization_year: int = 2025) -> float

Predicts the asthma prevalence or incidence based on the given parameters.

Parameters:
age: int

Age of the individual in years.

sex: str

One of "M" or "F".

year: int

Year of the prediction.

occurrence_type: str

One of "prevalence" or "incidence".

max_asthma_age: int = 62

The maximum age for asthma prediction (default is 62).

stabilization_year: int = 2025

The year when asthma stabilization occurs (default is 2025).

Returns:

A float representing the predicted asthma prevalence or incidence.

leap.data_generation.occurrence_calibration_data.load_occurrence_data(province: str = 'CA', min_year: int = 2000, max_year: int = 2065) -> tuple[pandas.core.frame.DataFrame, pandas.core.frame.DataFrame]

Load the asthma incidence and prevalence data for the given province and year range.

Parameters:
province: str = 'CA'

The province to load data for.

min_year: int = 2000

The minimum year to load data for.

max_year: int = 2065

The maximum year to load data for.

Returns:

A tuple of two DataFrames. Details:

  1. Incidence dataframe with columns:

    • year (int): The year of the prediction.

    • age (int): The age of the individual in years, ranging from 3 to 110.

    • sex (str): One of "M" or "F".

    • incidence (float): The predicted asthma incidence for the given year, age, and sex.

  2. Prevalence dataframe with columns:

    • year (int): The year of the prediction.

    • age (int): The age of the individual in years, ranging from 3 to 110.

    • sex (str): One of "M" or "F".

    • prevalence (float): The predicted asthma prevalence for the given year, age, and sex.

leap.data_generation.occurrence_calibration_data.load_reassessment_data(province: str = 'CA') -> pandas.core.frame.DataFrame

Load the asthma reassessment data for the given province.

Parameters:
province: str = 'CA'

The province to load data for.

Returns:

A DataFrame containing the reassessment data. Columns:

  • year (int): The year of the reassessment.

  • age (int): The age of the individual in years, a value in [4, 110].

  • sex (str): One of "M" or "F".

  • province (str): The two-letter code for the province, e.g. "CA".

  • prob (float): The probability that someone diagnosed with asthma previously will maintain their asthma diagnosis in the given year; a value in [0, 1].

leap.data_generation.occurrence_calibration_data.load_family_history_data(β_fam_history: dict[str, float] | None = None) -> pandas.core.frame.DataFrame

Load the family history data.

Parameters:
β_fam_history: dict[str, float] | None = None

A dictionary of two beta parameters for the odds ratio calculation:

  • β_fhx_0: The beta parameter for the constant term in the equation

  • β_fhx_age: The beta parameter for the age term in the equation.

Returns:

A dataframe containing the family history odds ratios. It contains the following columns:

  • age (int): The age of the individual. Ranges from 3 to 5.

  • fam_history (int): Whether or not there is a family history of asthma:

    • 0: one or more parents has asthma,

    • 1: neither parent has asthma.

  • odds_ratio (float): The odds ratio for asthma prevalence based on family history and age. The odds ratio is calculated based on the CHILD study data.

leap.data_generation.occurrence_calibration_data.load_abx_exposure_data(β_abx: dict[str, float] | None = None) -> pandas.core.frame.DataFrame

Load the antibiotic exposure data.

Parameters:
β_abx: dict[str, float] | None = None

A dictionary of 3 beta parameters for the odds ratio calculation:

  • β_abx_0: The beta parameter for the constant term in the equation

  • β_abx_age: The beta parameter for the age term in the equation

  • β_abx_dose: The beta parameter for the antibiotic dose term in the equation.

Returns:

A dataframe with the odds ratios of asthma prevalence given the number of courses of antibiotics taken during the first year of life. It contains the following columns:

  • age (int): The age of the individual. An integer in [3, 8].

  • abx_dose (int): The number of antibiotic courses taken in the first year of life, an integer in [0, 5], where 5 indicates 5 or more courses.

  • odds_ratio (float): The odds ratio for asthma prevalence based on antibiotic exposure during the first year of life and age. The odds ratio is calculated based on the CHILD study data.

leap.data_generation.occurrence_calibration_data.compute_antibiotic_dose_prob(year: int, sex: str, model_abx: statsmodels.genmod.generalized_linear_model.GLMResultsWrapper) -> pandas.core.frame.DataFrame

Compute the probability of each number of courses of antibiotics during infancy.

Parameters:
year: int

The birth year of the person.

sex: str

The sex of the infant, one of "M" or "F".

model_abx: statsmodels.genmod.generalized_linear_model.GLMResultsWrapper

The fitted Negative Binomial model for the number of courses of antibiotics. This model was fitted using BC Ministry of Health data on antibiotic prescriptions.

Returns:

A dataframe with the probability of each number of courses of antibiotics, from 0 to 5 or more. Columns:

  • n_abx (int): The number of courses of antibiotics, an integer in [0, 5], where 5 indicates 5 or more courses.

  • prob (float): The probability that a person of the given sex and birth year was given n_abx courses of antibiotics during the first year of their life.
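To illustrate how a negative binomial count model can be turned into the six-row table described here, the sketch below computes P(0) through P(4) directly and assigns the remaining probability mass to the "5 or more" bucket. It assumes the mean/dispersion parametrization with variance mu + alpha * mu^2; the values of mu and alpha are invented, and in the real function the mean would presumably come from model_abx for the given birth year and sex. The helper names are hypothetical.

```python
import math


def negative_binomial_pmf(k, mu, alpha):
    """P(K = k) for a negative binomial with mean mu and dispersion alpha."""
    r = 1.0 / alpha  # size parameter
    log_p = (
        math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
        + r * math.log(r / (r + mu))
        + k * math.log(mu / (r + mu))
    )
    return math.exp(log_p)


def dose_probabilities(mu, alpha, max_dose=5):
    """Return rows for n_abx = 0..max_dose, where max_dose means 'max_dose or more'."""
    probs = [negative_binomial_pmf(k, mu, alpha) for k in range(max_dose)]
    probs.append(1.0 - sum(probs))  # tail mass: max_dose or more courses
    return [{"n_abx": n, "prob": p} for n, p in enumerate(probs)]


# Placeholder mean and dispersion, not fitted values.
table = dose_probabilities(mu=0.8, alpha=0.5)
```

Lumping the tail into the last row keeps the probabilities summing to one, which matches the description of 5 as "5 or more courses".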

leap.data_generation.occurrence_calibration_data.calculate_odds_ratio_abx(age: int, dose: int, β_abx: dict[str, float] | None = None) -> float

Calculate the odds ratio for asthma prevalence based on antibiotic exposure.

\[\begin{split}\log(\omega(d_{\lambda})) = \begin{cases} \beta_{\text{abx_0}} + \beta_{\text{abx_age}} \cdot \text{min}(a^{(i)}, 7) + \beta_{\text{abx_dose}} \cdot \text{min}(d^{(i)}, 3) && d^{(i)} > 0 \text{ and } a^{(i)} \leq 7 \\ \\ 0 && \text{otherwise} \end{cases}\end{split}\]

where:

  • \(\beta_{\text{abx_xxx}}\) is a constant coefficient

  • \(a^{(i)}\) is the age

  • \(d^{(i)}\) is the number of courses of antibiotics taken during the first year of life
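The piecewise formula above can be sketched in a few lines. This is an illustration rather than the library's implementation: the function name odds_ratio_abx_sketch is hypothetical, and the coefficients are the abx defaults shown in the calibrate_asthma_incidence signature.

```python
import math

# Default abx coefficients as shown in the calibrate_asthma_incidence signature.
BETA_ABX = {"β_abx_0": 1.82581, "β_abx_age": -0.225, "β_abx_dose": 0.053}


def odds_ratio_abx_sketch(age, dose, beta=BETA_ABX):
    """Odds ratio for antibiotic exposure, following the piecewise formula."""
    if dose > 0 and age <= 7:
        log_odds_ratio = (
            beta["β_abx_0"]
            + beta["β_abx_age"] * min(age, 7)
            + beta["β_abx_dose"] * min(dose, 3)
        )
        return math.exp(log_odds_ratio)
    # "otherwise" branch: log odds ratio of 0, i.e. an odds ratio of 1.
    return 1.0


unexposed = odds_ratio_abx_sketch(age=3, dose=0)
exposed = odds_ratio_abx_sketch(age=3, dose=1)
```

Note that the dose term is capped at 3, so under this formula 3, 4, and 5 or more courses give the same odds ratio at a given age.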

Parameters:
age: int

The age of the individual in years.

dose: int

The number of antibiotic courses taken in the first year of life, an integer in [0, 5], where 5 indicates 5 or more courses.

β_abx: dict[str, float] | None = None

The parameters for the odds ratio calculation. If None, the default parameters are used. Dictionary keys:

  • β_abx_0: The beta parameter for the constant term in the equation

  • β_abx_age: The beta parameter for the age term in the equation

  • β_abx_dose: The beta parameter for the antibiotic dose term in the equation.

Returns:

A float representing the odds ratio for asthma prevalence based on antibiotic exposure and age.

leap.data_generation.occurrence_calibration_data.calculate_odds_ratio_fam_history(age: int, fam_hist: int, β_fam_hist: dict[str, float] | None = None) -> float

Calculate the odds ratio for asthma prevalence based on family history.

Parameters:
age: int

The age of the individual in years.

fam_hist: int

Whether or not there is a family history of asthma:

  • 0: one or more parents has asthma

  • 1: neither parent has asthma

β_fam_hist: dict[str, float] | None = None

The beta parameters for the odds ratio calculation:

  • β_fhx_0: The beta parameter for the constant term in the odds ratio equation

  • β_fhx_age: The beta parameter for the age term in the odds ratio equation.

Returns:

A float representing the odds ratio for asthma prevalence based on family history and age.

leap.data_generation.occurrence_calibration_data.calculate_odds_ratio_risk_factors(fam_hist: int, age: int, dose: int, β_risk_factors: dict[str, dict[str, float]] | None = None) -> float

Calculate the odds ratio for asthma prevalence based on family history and antibiotic exposure.

Parameters:
fam_hist: int

Whether or not there is a family history of asthma:

  • 0: one or more parents has asthma

  • 1: neither parent has asthma

age: int

The age of the individual in years.

dose: int

The number of antibiotic courses taken in the first year of life, an integer in [0, 5], where 5 indicates 5 or more courses.

β_risk_factors: dict[str, dict[str, float]] | None = None

A dictionary of beta parameters for the risk factor equations. Must contain the following keys:

  • fam_history: A dictionary with the beta parameters for the family history odds ratio calculation. Must contain the keys β_fhx_0 and β_fhx_age.

  • abx: A dictionary with the beta parameters for the antibiotic exposure odds ratio calculation. Must contain the keys β_abx_0, β_abx_age, and β_abx_dose.

Returns:

The odds ratio for asthma prevalence based on family history and antibiotic exposure.

leap.data_generation.occurrence_calibration_data.risk_factor_generator(year: int, age: int, sex: str, model_abx: statsmodels.genmod.generalized_linear_model.GLMResultsWrapper, β_fam_history: dict[str, float] | None = None, β_abx: dict[str, float] | None = None) -> pandas.core.frame.DataFrame

Compute the combined antibiotic exposure and family history odds ratio.

Parameters:
year: int

The current year.

age: int

The age of the person in years.

sex: str

One of "M" or "F".

model_abx: statsmodels.genmod.generalized_linear_model.GLMResultsWrapper

The fitted Negative Binomial model for the number of courses of antibiotics.

β_fam_history: dict[str, float] | None = None

A dictionary of 2 beta parameters for the calculation of the odds ratio of having asthma given family history:

  • β_fhx_0: The beta parameter for the constant term in the equation.

  • β_fhx_age: The beta parameter for the age term in the equation.

β_abx: dict[str, float] | None = None

A dictionary of 3 beta parameters for the calculation of the odds ratio of having asthma given antibiotic exposure during infancy:

  • β_abx_0: The beta parameter for the constant term in the equation.

  • β_abx_age: The beta parameter for the age term in the equation.

  • β_abx_dose: The beta parameter for the antibiotic dose term in the equation.

Returns:

A dataframe with the combined probabilities and odds ratios for the antibiotic exposure and family history risk factors. Columns:

  • fam_history (int): Whether or not there is a family history of asthma.

    • 0: one or more parents has asthma

    • 1: neither parent has asthma

  • abx_exposure (int): The number of antibiotic courses taken in the first year of life; an integer in [0, 5], where 5 indicates 5 or more courses.

  • year (int): The given year.

  • sex (str): One of M or F.

  • age (int): The age of the person in years.

  • prob (float): The probability of antibiotic exposure * probability of one or more parents having asthma given that the person has asthma.

  • odds_ratio (float): The combined odds ratio:

    odds_ratio = odds_ratio_abx * odds_ratio_fam_history
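As a rough illustration of how such a table could be assembled, the sketch below crosses the two family history levels with the six antibiotic dose levels and multiplies the per-factor probabilities and odds ratios. All input values are arbitrary placeholders with no clinical meaning, the helper combine_risk_factors is hypothetical, and the year, sex, and age columns are left out for brevity.

```python
from itertools import product


def combine_risk_factors(p_fam, p_abx, or_fam, or_abx):
    """Cross family history levels with dose levels, multiplying probs and odds ratios."""
    rows = []
    for fam, dose in product(sorted(p_fam), sorted(p_abx)):
        rows.append({
            "fam_history": fam,
            "abx_exposure": dose,
            "prob": p_fam[fam] * p_abx[dose],
            "odds_ratio": or_fam[fam] * or_abx[dose],
        })
    return rows


# Arbitrary placeholder inputs, keyed by risk factor level.
p_fam = {0: 0.7, 1: 0.3}
p_abx = {0: 0.5, 1: 0.25, 2: 0.12, 3: 0.07, 4: 0.04, 5: 0.02}
or_fam = {0: 2.0, 1: 4.0}
or_abx = {dose: 1.0 + 0.5 * dose for dose in range(6)}

risk_set = combine_risk_factors(p_fam, p_abx, or_fam, or_abx)
```

With two family history levels and six dose levels, the result has twelve rows whose probabilities sum to one.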
    

class leap.data_generation.occurrence_calibration_data.ResultsPrevalence

Bases: dict

α : float
β : list[float] | numpy.ndarray
ζ_λ : list[float] | numpy.ndarray
ζ : float

leap.data_generation.occurrence_calibration_data.calibrate_asthma_prevalence(year: int, sex: str, age: int, model_abx: statsmodels.genmod.generalized_linear_model.GLMResultsWrapper, df_prevalence: pandas.core.frame.DataFrame) -> leap.data_generation.occurrence_calibration_data.ResultsPrevalence

Calibrate the asthma prevalence for the given year, age, and sex.

Our goal is to find the correction term for the asthma prevalence, \(\alpha\), which is given by:

\[\alpha = \sum_{\lambda=1}^{n} p(\lambda) \cdot \beta_{\lambda}\]

where:

  • \(p(\lambda)\) is the probability of risk factor level \(\lambda\), risk_factor_prob[λ]

  • \(\beta_{\lambda}\) is the parameter for risk factor level \(\lambda\), asthma_prev_risk_factor_params[λ]

To do so, we need to determine \(\beta_{\lambda}\) for each risk factor level. To find the \(\beta_{\lambda}\) parameters, we minimize the absolute difference between the calibrated and target prevalence:

\[\text{min}(\Delta) = \text{min}(| \zeta - \eta |)\]

where:

  • \(\zeta\) is the predicted / calibrated asthma prevalence

  • \(\eta\) is the target asthma prevalence, asthma_prev_target, from the model of the BC Ministry of Health data.

We have \(\eta\) from the occurrence model, so we only need to find \(\zeta\). We can write \(\zeta\) in terms of \(\zeta_{\lambda}\), the predicted asthma prevalence at risk factor level \(\lambda\):

\[\zeta = \sum_{\lambda=0}^{n} p(\lambda) \cdot \zeta_{\lambda}\]

We also already know \(p(\lambda)\), the probability of risk factor level \(\lambda\). We can express \(\zeta_{\lambda}\) finally as:

\[\begin{split}\zeta_{\lambda} &= \sigma(\beta_0 + \log(\omega_{\lambda}) - \alpha) \\ \beta_0 &= \sigma^{-1}(\eta)\end{split}\]

where:

  • \(\omega_{\lambda}\) is the odds ratio for risk factor level \(\lambda\), odds_ratio_target[λ]

  • \(\alpha\) is the correction term for the asthma prevalence

Putting everything together, the calibrated asthma prevalence is given by:

\[\zeta = \sum_{\lambda=0}^{n} p(\lambda) \cdot \sigma\left(\sigma^{-1}(\eta) + \log(\omega_{\lambda}) - \sum_{\gamma=1}^{n} p(\gamma) \cdot \beta_{\gamma}\right)\]
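The final expression can be evaluated directly once the risk factor probabilities, odds ratios, and \(\beta_{\lambda}\) values are known. The sketch below does so with the standard library only. It follows the formulas as written (the sum for \(\alpha\) starts at level 1, the sum for \(\zeta\) at level 0), uses invented inputs, and omits the optimizer that searches for the \(\beta_{\lambda}\) values; the function name calibrated_prevalence is hypothetical.

```python
import math


def sigmoid(x):
    """Logistic function σ(x)."""
    return 1.0 / (1.0 + math.exp(-x))


def logit(p):
    """Inverse of the logistic function, σ⁻¹(p)."""
    return math.log(p / (1.0 - p))


def calibrated_prevalence(eta, probs, odds_ratios, betas):
    """Evaluate α, ζ_λ, ζ and Δ = |ζ - η| for risk factor levels 0..n."""
    # α sums over levels 1..n, as in the first equation.
    alpha = sum(p * b for p, b in zip(probs[1:], betas[1:]))
    beta_0 = logit(eta)
    zeta_lambda = [sigmoid(beta_0 + math.log(w) - alpha) for w in odds_ratios]
    # ζ sums over levels 0..n.
    zeta = sum(p * z for p, z in zip(probs, zeta_lambda))
    return {"α": alpha, "ζ_λ": zeta_lambda, "ζ": zeta, "Δ": abs(zeta - eta)}


# Invented inputs: three risk factor levels and a 10% target prevalence.
result = calibrated_prevalence(
    eta=0.10,
    probs=[0.5, 0.3, 0.2],
    odds_ratios=[1.0, 1.5, 2.5],
    betas=[0.0, 0.4, 0.9],
)
```

A useful sanity check is the degenerate case: when every odds ratio is 1 and every \(\beta_{\lambda}\) is 0, the correction vanishes and \(\zeta\) equals \(\eta\).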
Parameters:
year: int

The integer year.

sex: str

One of M = male, F = female.

age: int

The age in years.

model_abx: statsmodels.genmod.generalized_linear_model.GLMResultsWrapper

The fitted Negative Binomial model for the number of courses of antibiotics.

df_prevalence: pandas.core.frame.DataFrame

A dataframe with the prevalence of asthma, with the following columns:

  • year (int): the year

  • age (int): the age in years

  • sex (str): M = male, F = female

  • prevalence (float): the prevalence of asthma for the given year, age, and sex

Returns:

A dictionary containing the calibrated asthma prevalence for the given year, age, and sex. The dictionary has the following keys:

  • α (float): The prevalence correction factor.

  • β (list[float]): A vector of the beta parameters for the risk factors.

  • ζ_λ (list[float]): A vector of the calibrated asthma prevalence for each risk factor combination λ for the current year.

  • ζ (float): The calibrated asthma prevalence for the current year.

class leap.data_generation.occurrence_calibration_data.ResultsIncidence

Bases: dict

α : float
ζ_λ : list[float] | numpy.ndarray
ζ_prev_λ : list[float] | numpy.ndarray
risk_sets : Dict[Literal['inc', 'past_prev'], pandas.core.frame.DataFrame]

leap.data_generation.occurrence_calibration_data.calibrate_asthma_incidence(year: int, sex: str, age: int, model_abx: statsmodels.genmod.generalized_linear_model.GLMResultsWrapper, df_incidence: pandas.core.frame.DataFrame, df_prevalence: pandas.core.frame.DataFrame, β_risk_factors: dict[str, dict[str, float]] = {'abx': {'β_abx_0': 1.82581, 'β_abx_age': -0.225, 'β_abx_dose': 0.053}, 'fam_history': {'β_fhx_0': np.float64(0.12221763272424911), 'β_fhx_age': np.float64(0.37662555231482536)}}, min_year: int = 2000) -> leap.data_generation.occurrence_calibration_data.ResultsIncidence

Calibrate the asthma incidence for the given year, age, and sex.

Parameters:
year: int

The integer year.

age: int

The age in years.

sex: str

One of M = male, F = female.

model_abx: statsmodels.genmod.generalized_linear_model.GLMResultsWrapper

The fitted Negative Binomial model for the number of courses of antibiotics.

df_incidence: pandas.core.frame.DataFrame

A dataframe with the incidence of asthma, with the following columns:

  • year (int): the year

  • age (int): the age in years

  • sex (str): M = male, F = female

  • incidence (float): the incidence of asthma for the given year, age, and sex.

df_prevalence: pandas.core.frame.DataFrame

A dataframe with the prevalence of asthma, with the following columns:

  • year (int): the year

  • age (int): the age in years

  • sex (str): M = male, F = female

  • prevalence (float): the prevalence of asthma for the given year, age, and sex.

β_risk_factors: dict[str, dict[str, float]] = {'abx': {'β_abx_0': 1.82581, 'β_abx_age': -0.225, 'β_abx_dose': 0.053}, 'fam_history': {'β_fhx_0': np.float64(0.12221763272424911), 'β_fhx_age': np.float64(0.37662555231482536)}}

A dictionary of beta parameters for the risk factor equations:

  • fam_history: A dictionary with the beta parameters for the family history odds ratio calculation. Must contain the keys β_fhx_0 and β_fhx_age.

  • abx: A dictionary with the beta parameters for the antibiotic exposure odds ratio calculation. Must contain the keys β_abx_0, β_abx_age, and β_abx_dose.

min_year: int = 2000

The minimum year to consider for the calibration.

Returns:

A dictionary containing the calibrated asthma incidence for the given year, age, and sex. The dictionary has the following keys:

  • α: The incidence correction factor.

  • ζ_λ: A vector of the calibrated asthma incidence for each risk factor combination λ for the current year.

  • ζ_prev_λ: A vector of the calibrated asthma prevalence for each risk factor combination λ for the previous year.

  • risk_sets: A dictionary with two keys:

    • inc: A dataframe with the incidence risk factors and their probabilities.

    • past_prev: A dataframe with the risk factors and their probabilities and odds ratios for the previous year.

leap.data_generation.occurrence_calibration_data.compute_mean_diff_log_OR(β_risk_factors_age: list[float], df: pandas.core.frame.DataFrame, model_abx: statsmodels.genmod.generalized_linear_model.GLMResultsWrapper, df_incidence: pandas.core.frame.DataFrame, df_prevalence: pandas.core.frame.DataFrame, df_reassessment: pandas.core.frame.DataFrame, min_year: int = 2000) -> float

Compute the mean difference in log odds ratio for the given model and data.

Parameters:
β_risk_factors_age: list[float]

A list of two beta parameters, β_fhx_age and β_abx_age.

df: pandas.core.frame.DataFrame

A dataframe with the following columns:

  • year (int): the year

  • age (int): the age in years

  • sex (str): M = male, F = female

  • index (int): the index of the row in the dataframe

model_abx: statsmodels.genmod.generalized_linear_model.GLMResultsWrapper

The fitted Negative Binomial model for the number of courses of antibiotics.

df_incidence: pandas.core.frame.DataFrame

A dataframe with the incidence of asthma, with the following columns:

  • year (int): the year

  • age (int): the age in years

  • sex (str): M = male, F = female

  • incidence (float): the incidence of asthma for the given year, age, and sex.

df_prevalence: pandas.core.frame.DataFrame

A dataframe with the prevalence of asthma, with the following columns:

  • year (int): the year

  • age (int): the age in years

  • sex (str): M = male, F = female

  • prevalence (float): the prevalence of asthma for the given year, age, and sex.

df_reassessment: pandas.core.frame.DataFrame

A dataframe with the reassessment of asthma, with the following columns:

  • year (int): the year

  • age (int): the age in years

  • sex (str): M = male, F = female

  • prob (float): the probability that someone diagnosed with asthma previously will maintain their asthma diagnosis in the given year.

min_year: int = 2000

The minimum year to consider for the calibration.

Returns:

The mean difference in log odds ratio for the given model and data.

leap.data_generation.occurrence_calibration_data.beta_params_age_optimizer(model_abx: statsmodels.genmod.generalized_linear_model.GLMResultsWrapper, df_incidence: pandas.core.frame.DataFrame, df_prevalence: pandas.core.frame.DataFrame, df_reassessment: pandas.core.frame.DataFrame, baseline_year: int = 2001, stabilization_year: int = 2025, max_age: int = 9, min_year: int = 2000, β_risk_factors_age: list[float] = [np.float64(0.37662555231482536), -0.225]) -> None

Optimize the risk factor beta parameters for the age terms.

Parameters:
model_abx: statsmodels.genmod.generalized_linear_model.GLMResultsWrapper

The fitted Negative Binomial model for the number of courses of antibiotics.

df_incidence: pandas.core.frame.DataFrame

A dataframe with the incidence of asthma, with the following columns:

  • year (int): the year

  • age (int): the age in years

  • sex (str): M = male, F = female

  • incidence (float): the incidence of asthma for the given year, age, and sex.

df_prevalence: pandas.core.frame.DataFrame

A dataframe with the prevalence of asthma, with the following columns:

  • year (int): the year

  • age (int): the age in years

  • sex (str): M = male, F = female

  • prevalence (float): the prevalence of asthma for the given year, age, and sex.

df_reassessment: pandas.core.frame.DataFrame

A dataframe with the reassessment of asthma, with the following columns:

  • year (int): the year

  • age (int): the age in years

  • sex (str): M = male, F = female

  • prob (float): the probability that someone diagnosed with asthma previously will maintain their asthma diagnosis in the given year.

baseline_year: int = 2001

The baseline year for the calibration.

stabilization_year: int = 2025

The stabilization year for the calibration.

max_age: int = 9

The maximum age to consider for the calibration.

min_year: int = 2000

The minimum year to consider for the calibration.

β_risk_factors_age: list[float] = [np.float64(0.37662555231482536), -0.225]

A list of two beta parameters, β_fhx_age and β_abx_age, to be used as the initial values in the optimization.

leap.data_generation.occurrence_calibration_data.generate_occurrence_calibration_data(province: str = 'CA', min_year: int = 2000, max_year: int = 2065, baseline_year: int = 2001, stabilization_year: int = 2025, max_age: int = 9, retrain_beta: bool = False)

Generate the occurrence calibration data for the given province and year range.

Parameters:
province: str = 'CA'

The province to load data for.

min_year: int = 2000

The minimum year to load data for.

max_year: int = 2065

The maximum year to load data for.

baseline_year: int = 2001

The baseline year for the calibration.

stabilization_year: int = 2025

The stabilization year for the calibration.

max_age: int = 9

The maximum age to consider for the calibration.

retrain_beta: bool = False

If True, rerun the fit for the β_risk_factors. Otherwise, load the saved parameters from a JSON file.