Occurrence Data

To run the data processing for the occurrence data:

cd LEAP
python3 leap/data_generation/occurrence_data.py

leap.data_generation.occurrence_data module

leap.data_generation.occurrence_data.poly(x: list[float] | numpy.ndarray, degree: int = 1, alpha: list[float] | numpy.ndarray | None = None, norm2: list[float] | numpy.ndarray | None = None, orthogonal: bool = False) -> numpy.ndarray | tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray]

Generate a polynomial basis for a vector.

See "Orthogonal polynomial regression in Python" for more information on this function.

Parameters:
x: list[float] | numpy.ndarray

The vector to generate the polynomial basis for.

degree: int = 1

The degree of the polynomial.

orthogonal: bool = False

Whether to generate an orthogonal polynomial basis.

Returns:

The polynomial basis, as a 2D NumPy array. If orthogonal is True, the function instead returns a tuple of three NumPy arrays: the orthogonal polynomial basis, the alpha values, and the norm2 values.

Examples

>>> poly([1, 2, 3], degree=2) 
array([[1, 1],
       [2, 4],
       [3, 9]])
>>> poly([1, 2, 3], degree=2, orthogonal=True) 
(array([[-7.07106781e-01,  4.08248290e-01],
       [-5.55111512e-17, -8.16496581e-01],
       [ 7.07106781e-01,  4.08248290e-01]]), array([2., 2.]), array([3.        , 2.        , 0.66666667]))
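
The orthogonal example above can be reproduced with a QR decomposition of the centered Vandermonde matrix. The following is a minimal sketch, not the module's implementation: the helper name ortho_poly_sketch is hypothetical, and the return order (basis, alpha, norm2) is assumed from the Returns description.

```python
import numpy as np


def ortho_poly_sketch(x, degree=1):
    """Hypothetical helper: orthogonal polynomial basis via QR decomposition."""
    x = np.asarray(x, dtype=float).flatten()
    if degree >= len(np.unique(x)):
        raise ValueError("degree must be less than the number of unique points")
    xbar = x.mean()
    centered = x - xbar
    # Columns are centered**0, centered**1, ..., centered**degree.
    vander = np.vander(centered, degree + 1, increasing=True)
    q, r = np.linalg.qr(vander)
    # Rescale each orthonormal column by the matching diagonal entry of R.
    # This also cancels the sign ambiguity of the QR factors.
    raw = q * np.diag(r)
    norm2 = np.sum(raw**2, axis=0)
    alpha = (np.sum(raw**2 * centered[:, None], axis=0) / norm2 + xbar)[:degree]
    # Normalize each column to unit length and drop the constant column.
    basis = (raw / np.sqrt(norm2))[:, 1:]
    return basis, alpha, norm2


basis, alpha, norm2 = ortho_poly_sketch([1, 2, 3], degree=2)
print(np.round(basis, 4))
print(alpha, norm2)
```

The columns of the resulting basis are orthonormal, which avoids the strong collinearity between x and x**2 that the raw basis has.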

leap.data_generation.occurrence_data.parse_age_group(x: str, max_age: int = 65) -> tuple[int, int]

Parse an age group string into a tuple of integers.

Parameters:
x: str

The age group string. Must be in one of the formats “X-Y”, “X+”, “X-Y years”, or “<1 year”.

Returns:

A tuple of integers representing the lower and upper age of the age group.

Examples

>>> parse_age_group("0-4")
(0, 4)
>>> parse_age_group("5-9 years")
(5, 9)
>>> parse_age_group("10+")
(10, 65)
>>> parse_age_group("<1 year")
(0, 1)
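
One way to handle these formats is with regular expressions. The sketch below is a hypothetical re-implementation (parse_age_group_sketch is not part of the module) that reproduces the four examples above. It assumes that max_age supplies the upper bound for open-ended groups such as "10+", as the third example suggests.

```python
import re


def parse_age_group_sketch(x: str, max_age: int = 65) -> tuple[int, int]:
    """Hypothetical parser for "X-Y", "X+", "X-Y years", and "<1 year"."""
    # Drop a trailing "year" or "years" suffix, if present.
    text = re.sub(r"\s*years?$", "", x.strip().lower())
    if m := re.fullmatch(r"(\d+)\s*-\s*(\d+)", text):
        return int(m.group(1)), int(m.group(2))
    if m := re.fullmatch(r"(\d+)\+", text):
        # Assumption: open-ended groups are capped at max_age.
        return int(m.group(1)), max_age
    if m := re.fullmatch(r"<\s*(\d+)", text):
        return 0, int(m.group(1))
    raise ValueError(f"Unrecognized age group: {x!r}")


for label in ["0-4", "5-9 years", "10+", "<1 year"]:
    print(label, parse_age_group_sketch(label))
```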

leap.data_generation.occurrence_data.load_asthma_df(starting_year: int = 2000) -> pandas.core.frame.DataFrame

Load the asthma incidence and prevalence data.

Parameters:
starting_year: int = 2000

The starting year for the data. Data before this year will be excluded from the analysis.

Returns:

The asthma incidence and prevalence data. Columns:

  • year (int): The calendar year.

  • age_group (str): The age group.

  • age (int): The average age of the age group.

  • sex (str): One of F = female, M = male.

  • incidence (float): The incidence of asthma.

  • prevalence (float): The prevalence of asthma.
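
To make the column layout and the starting_year cutoff concrete, here is a small sketch on a toy dataframe. The values are invented for illustration only and are not taken from the real data. It assumes that rows from starting_year itself are kept.

```python
import pandas as pd

# Toy data with the documented columns; all numbers are made up.
toy = pd.DataFrame(
    {
        "year": [1998, 2000, 2005, 2005],
        "age_group": ["5-9", "5-9", "10-14", "10-14"],
        "age": [7, 7, 12, 12],
        "sex": ["F", "F", "F", "M"],
        "incidence": [1.2, 1.1, 0.8, 0.9],
        "prevalence": [9.5, 9.8, 11.0, 12.1],
    }
)

starting_year = 2000
# Exclude data before starting_year (assumption: the bound is inclusive).
filtered = toy[toy["year"] >= starting_year].reset_index(drop=True)
print(filtered)
```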

leap.data_generation.occurrence_data.generate_occurrence_model(df_asthma: pandas.core.frame.DataFrame, formula: str, occ_type: str, maxiter: int = 1000) -> statsmodels.genmod.generalized_linear_model.GLMResultsWrapper

Generate a GLM model for asthma incidence or prevalence.

Parameters:
df_asthma: pandas.core.frame.DataFrame

The asthma dataframe. Must have columns:

  • year (int): The calendar year.

  • sex (str): One of M = male, F = female.

  • age (int): The age in years.

  • incidence (float): The incidence of asthma.

  • prevalence (float): The prevalence of asthma.

formula: str

The formula for the GLM model. See the statsmodels documentation for more information.

occ_type: str

The type of occurrence data to model. Must be one of "incidence" or "prevalence".

maxiter: int = 1000

The maximum number of iterations to perform while fitting the model.

Returns:

The fitted GLM model.

leap.data_generation.occurrence_data.generate_incidence_model(df_asthma: pandas.core.frame.DataFrame, maxiter: int = 1000) -> statsmodels.genmod.generalized_linear_model.GLMResultsWrapper

Generate a GLM model for asthma incidence.

Parameters:
df_asthma: pandas.core.frame.DataFrame

The asthma dataframe. Must have columns:

  • year (int): The calendar year.

  • sex (str): One of M = male, F = female.

  • age (int): The age in years.

  • incidence (float): The incidence of asthma.

  • prevalence (float): The prevalence of asthma.

maxiter: int = 1000

The maximum number of iterations to perform while fitting the model.

Returns:

The fitted GLM model.

leap.data_generation.occurrence_data.generate_prevalence_model(df_asthma: pandas.core.frame.DataFrame, maxiter: int = 1000) -> statsmodels.genmod.generalized_linear_model.GLMResultsWrapper

Generate a GLM model for asthma prevalence.

Parameters:
df_asthma: pandas.core.frame.DataFrame

The asthma dataframe. Must have columns:

  • year (int): The calendar year.

  • sex (str): One of M = male, F = female.

  • age (int): The age in years.

  • incidence (float): The incidence of asthma.

  • prevalence (float): The prevalence of asthma.

maxiter: int = 1000

The maximum number of iterations to perform while fitting the model.

Returns:

The fitted GLM model.

leap.data_generation.occurrence_data.get_predicted_data(model: statsmodels.genmod.generalized_linear_model.GLMResultsWrapper, pred_col: str, min_age: int = 3, max_age: int = 100, min_year: int = 2000, max_year: int = 2019) -> pandas.core.frame.DataFrame

Get predicted data from a GLM model.

The GLM model must be fitted on the following columns:

  • year (int): The calendar year.

  • sex (int): One of 0 = female, 1 = male.

  • age (int): The age in years.

Parameters:
model: statsmodels.genmod.generalized_linear_model.GLMResultsWrapper

The fitted GLM model.

pred_col: str

The name of the column to store the predicted data.

min_age: int = 3

The minimum age to predict.

max_age: int = 100

The maximum age to predict.

min_year: int = 2000

The minimum year to predict.

max_year: int = 2019

The maximum year to predict.

Returns:

A dataframe containing the predicted data. Columns:

  • year (int): The calendar year.

  • sex (str): One of M = male, F = female.

  • age (int): The age in years.

  • pred_col (float): The predicted data, stored in a column whose name is given by the pred_col argument.
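
The returned dataframe has one row per combination of year, sex, and age. A sketch of such a prediction grid follows. It assumes that both the age and year bounds are inclusive, which the source does not confirm.

```python
import itertools

import pandas as pd

min_age, max_age = 3, 100
min_year, max_year = 2000, 2019

# One row for every (year, sex, age) combination; bounds assumed inclusive.
grid = pd.DataFrame(
    list(
        itertools.product(
            range(min_year, max_year + 1),
            ["F", "M"],
            range(min_age, max_age + 1),
        )
    ),
    columns=["year", "sex", "age"],
)
print(grid.shape)

# With a fitted statsmodels formula model, predictions could then be stored
# under the requested column name, for example:
#   grid[pred_col] = model.predict(grid)
```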

leap.data_generation.occurrence_data.plot_occurrence(df: pandas.core.frame.DataFrame, y: str, title: str = '', file_path: pathlib.Path | None = None, min_year: int = 2000, max_year: int = 2019, year_interval: int = 2, max_age: int = 110, width: int = 1000, height: int = 800)

Plot the incidence or prevalence of asthma.

Parameters:
df: pandas.core.frame.DataFrame

A dataframe containing either incidence or prevalence data. Must have columns:

  • year (int): The calendar year.

  • sex (str): One of F = female, M = male.

  • age (int): The age in years.

  • y (float): The column named by the y argument, containing the data to plot on the y axis.

  • y_pred (float): Optional, the predicted y data. If this column is present, it will be plotted alongside the actual data. The column name must be the same as y with _pred appended. For example, if y is incidence, then the predicted data must be incidence_pred.

y: str

The name of the column in the dataframe which will be plotted as the y data.

title: str = ''

The title of the plot.

file_path: pathlib.Path | None = None

The path to save the plot to. If None, the plot will be displayed.

min_year: int = 2000

The minimum year to plot.

max_year: int = 2019

The maximum year to plot.

year_interval: int = 2

The interval between years. This is used if you don’t want to plot every year.

max_age: int = 110

The maximum age to plot.

width: int = 1000

The width of the plot.

height: int = 800

The height of the plot.

Returns:

If file_path is None, the plot will be displayed. Otherwise, the plot will be saved to the specified path.
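
The interaction between min_year, max_year, year_interval, max_age, and the optional _pred column can be sketched as a row-selection step that runs before any plotting. The helper select_plot_rows below is hypothetical. It assumes inclusive bounds and that the year sequence starts at min_year.

```python
import pandas as pd


def select_plot_rows(df, y, min_year=2000, max_year=2019, year_interval=2, max_age=110):
    """Hypothetical helper: pick the rows and prediction column to plot."""
    # Keep every year_interval-th year, starting from min_year.
    years = list(range(min_year, max_year + 1, year_interval))
    subset = df[df["year"].isin(years) & (df["age"] <= max_age)]
    # The predicted column, if present, is the y column name plus "_pred".
    pred_col = f"{y}_pred"
    return subset, (pred_col if pred_col in subset.columns else None)


# Toy data; all values are invented for illustration.
toy = pd.DataFrame(
    {
        "year": [2000, 2001, 2002, 2002],
        "sex": ["F", "F", "M", "M"],
        "age": [5, 5, 5, 120],
        "incidence": [1.0, 1.1, 1.2, 1.3],
        "incidence_pred": [1.0, 1.0, 1.2, 1.2],
    }
)
subset, pred_col = select_plot_rows(toy, "incidence")
print(subset)
print(pred_col)
```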

leap.data_generation.occurrence_data.generate_occurrence_data()

Generate the asthma incidence and prevalence data.

Saves the data to a CSV file: processed_data/asthma_occurrence_predictions.csv.

The data is also plotted and saved to the following files:

  • data_generation/figures/asthma_incidence_predicted.png: The predicted asthma incidence per 100 in BC.

  • data_generation/figures/asthma_prevalence_predicted.png: The predicted asthma prevalence per 100 in BC.