Occurrence Data

To run the data processing for the occurrence data:

cd LEAP
python3 leap/data_generation/occurrence_data.py

leap.data_generation.occurrence_data module

leap.data_generation.occurrence_data.poly(x: list[float] | numpy.ndarray, degree: int = 1, alpha: list[float] | numpy.ndarray | None = None, norm2: list[float] | numpy.ndarray | None = None, orthogonal: bool = False) → numpy.ndarray | tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray]

Generate a polynomial basis for a vector.

See Orthogonal polynomial regression in Python for more information on this function.

Parameters:

x: list[float] | numpy.ndarray: The vector to generate the polynomial basis for.
degree: int = 1: The degree of the polynomial.
orthogonal: bool = False: Whether to generate an orthogonal polynomial basis.

Returns:

The polynomial basis, as a 2D NumPy array. If orthogonal is True, the function instead returns a tuple of three NumPy arrays: the orthogonal polynomial basis, the alpha values, and the norm2 values.

Examples

>>> poly([1, 2, 3], degree=2) 
array([[1, 1],
       [2, 4],
       [3, 9]])

>>> poly([1, 2, 3], degree=2, orthogonal=True) 
(array([[-7.07106781e-01,  4.08248290e-01],
       [-5.55111512e-17, -8.16496581e-01],
       [ 7.07106781e-01,  4.08248290e-01]]), array([2., 2.]), array([3.        , 2.        , 0.66666667]))
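
The two examples above can be reproduced with a short NumPy sketch. This is not the module's implementation: the helper names raw_basis and orthogonal_basis are hypothetical, and the QR-based construction (centre x, factorise the Vandermonde matrix, normalise the columns) is an assumption about how such a basis is commonly computed.

```python
import numpy as np


def raw_basis(x, degree=1):
    """Hypothetical helper: columns x, x**2, ..., x**degree (no intercept)."""
    x = np.asarray(x, dtype=float).ravel()
    return np.column_stack([x ** k for k in range(1, degree + 1)])


def orthogonal_basis(x, degree=1):
    """Hypothetical helper: orthonormal polynomial columns of degree 1..degree."""
    x = np.asarray(x, dtype=float).ravel()
    centred = x - x.mean()
    # Vandermonde matrix with columns 1, x, x**2, ...
    vander = np.vander(centred, degree + 1, increasing=True)
    q, r = np.linalg.qr(vander)
    # Scaling each column of q by the matching diagonal entry of r
    # removes the sign ambiguity of the QR factorisation.
    raw = q * np.diag(r)
    norm2 = (raw ** 2).sum(axis=0)
    # Drop the constant column and normalise the remaining ones.
    return (raw / np.sqrt(norm2))[:, 1:]


print(raw_basis([1, 2, 3], degree=2))
print(orthogonal_basis([1, 2, 3], degree=2))
```

Each orthogonal column has unit length and is orthogonal to the constant vector and to the other columns, which is what makes the basis useful for polynomial regression terms.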

leap.data_generation.occurrence_data.parse_age_group(x: str, max_age: int = 65) → tuple[int, int]

Parse an age group string into a tuple of integers.

Parameters:

x: str: The age group string. Must be in one of the formats “X-Y”, “X+”, “X-Y years”, or “<1 year”.
max_age: int = 65: The upper age assigned to an open-ended group such as “X+”.

Returns:

A tuple of integers representing the lower and upper age of the age group.

Examples

>>> parse_age_group("0-4")
(0, 4)
>>> parse_age_group("5-9 years")
(5, 9)
>>> parse_age_group("10+")
(10, 65)
>>> parse_age_group("<1 year")
(0, 1)
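
The parsing rules implied by these examples can be sketched as follows. The function name parse_age_group_sketch is hypothetical and the regular-expression approach is an assumption; the real function may be implemented differently and may handle other inputs.

```python
import re


def parse_age_group_sketch(x: str, max_age: int = 65) -> tuple[int, int]:
    """Hypothetical re-implementation covering the four documented formats."""
    text = x.strip()
    numbers = [int(n) for n in re.findall(r"\d+", text)]
    if text.startswith("<") and len(numbers) == 1:
        # "<1 year": everything below the stated age.
        return (0, numbers[0])
    if "+" in text and len(numbers) == 1:
        # "X+": open-ended group, capped at max_age.
        return (numbers[0], max_age)
    if len(numbers) == 2:
        # "X-Y" or "X-Y years".
        return (numbers[0], numbers[1])
    raise ValueError(f"Unsupported age group format: {x!r}")


for label in ["0-4", "5-9 years", "10+", "<1 year"]:
    print(label, parse_age_group_sketch(label))
```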

leap.data_generation.occurrence_data.load_asthma_df(starting_year: int = 2000) → pandas.core.frame.DataFrame

Load the asthma incidence and prevalence data.

Parameters:

starting_year: int = 2000: The starting year for the data. Data before this year will be excluded from the analysis.

Returns:

The asthma incidence and prevalence data. Columns:

year (int): The calendar year.
age_group (str): The age group.
age (int): The average age of the age group.
sex (str): One of F = female, M = male.
incidence (float): The incidence of asthma.
prevalence (float): The prevalence of asthma.

leap.data_generation.occurrence_data.generate_occurrence_model(df_asthma: pandas.core.frame.DataFrame, formula: str, occ_type: str, maxiter: int = 1000) → statsmodels.genmod.generalized_linear_model.GLMResultsWrapper

Generate a GLM model for asthma incidence or prevalence.

Parameters:

df_asthma: pandas.core.frame.DataFrame

The asthma dataframe. Must have columns:

year (int): The calendar year.
sex (str): One of M = male, F = female.
age (int): The age in years.
incidence (float): The incidence of asthma.
prevalence (float): The prevalence of asthma.

formula: str

The formula for the GLM model. See the statsmodels documentation for more information.

occ_type: str

The type of occurrence data to model. Must be one of "incidence" or "prevalence".

maxiter: int = 1000

The maximum number of iterations to perform while fitting the model.

Returns:

The fitted GLM model.

leap.data_generation.occurrence_data.generate_incidence_model(df_asthma: pandas.core.frame.DataFrame, maxiter: int = 1000) → statsmodels.genmod.generalized_linear_model.GLMResultsWrapper

Generate a GLM model for asthma incidence.

Parameters:

df_asthma: pandas.core.frame.DataFrame

The asthma dataframe. Must have columns:

year (int): The calendar year.
sex (str): One of M = male, F = female.
age (int): The age in years.
incidence (float): The incidence of asthma.
prevalence (float): The prevalence of asthma.

maxiter: int = 1000

The maximum number of iterations to perform while fitting the model.

Returns:

The fitted GLM model.

leap.data_generation.occurrence_data.generate_prevalence_model(df_asthma: pandas.core.frame.DataFrame, maxiter: int = 1000) → statsmodels.genmod.generalized_linear_model.GLMResultsWrapper

Generate a GLM model for asthma prevalence.

Parameters:

df_asthma: pandas.core.frame.DataFrame

The asthma dataframe. Must have columns:

year (int): The calendar year.
sex (str): One of M = male, F = female.
age (int): The age in years.
incidence (float): The incidence of asthma.
prevalence (float): The prevalence of asthma.

maxiter: int = 1000

The maximum number of iterations to perform while fitting the model.

Returns:

The fitted GLM model.

leap.data_generation.occurrence_data.get_predicted_data(model: statsmodels.genmod.generalized_linear_model.GLMResultsWrapper, pred_col: str, min_age: int = 3, max_age: int = 100, min_year: int = 2000, max_year: int = 2019) → pandas.core.frame.DataFrame

Get predicted data from a GLM model.

The GLM model must be fitted on the following columns:

year (int): The calendar year.
sex (int): One of 0 = female, 1 = male.
age (int): The age in years.

Parameters:

model: statsmodels.genmod.generalized_linear_model.GLMResultsWrapper: The fitted GLM model.
pred_col: str: The name of the column to store the predicted data.
min_age: int = 3: The minimum age to predict.
max_age: int = 100: The maximum age to predict.
min_year: int = 2000: The minimum year to predict.
max_year: int = 2019: The maximum year to predict.

Returns:

A dataframe containing the predicted data. Columns:

year (int): The calendar year.
sex (str): One of M = male, F = female.
age (int): The age in years.
pred_col (float): The predicted values, stored in the column named by the pred_col argument.
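
The prediction input can be pictured as a grid with one row per combination of year, sex, and age. The sketch below builds such a grid with pandas; build_prediction_grid is a hypothetical helper, and treating both bounds as inclusive is an assumption rather than documented behaviour.

```python
import itertools

import pandas as pd


def build_prediction_grid(min_age=3, max_age=100, min_year=2000, max_year=2019):
    """Hypothetical helper: one row per (year, sex, age) combination."""
    # Assumption: the age and year bounds are both inclusive.
    rows = itertools.product(
        range(min_year, max_year + 1),
        (0, 1),  # model encoding: 0 = female, 1 = male
        range(min_age, max_age + 1),
    )
    return pd.DataFrame(list(rows), columns=["year", "sex", "age"])


grid = build_prediction_grid()
# A fitted model would add its predictions at this point,
# for example: grid[pred_col] = model.predict(grid)
# Convert the integer sex code back to the string labels used in the output.
grid["sex"] = grid["sex"].map({0: "F", 1: "M"})
print(grid.head())
```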

leap.data_generation.occurrence_data.plot_occurrence(df: pandas.core.frame.DataFrame, y: str, title: str = '', file_path: pathlib.Path | None = None, min_year: int = 2000, max_year: int = 2019, year_interval: int = 2, max_age: int = 110, width: int = 1000, height: int = 800)

Plot the incidence or prevalence of asthma.

Parameters:

df: pandas.core.frame.DataFrame

A dataframe containing either incidence or prevalence data. Must have columns:

year (int): The calendar year.
sex (str): One of F = female, M = male.
age (int): The age in years.
y (float): The observed data, in the column named by the y argument.
y_pred (float): Optional, the predicted y data. If this column is present, it will be plotted alongside the actual data. The column name must be the same as y with _pred appended. For example, if y is incidence, then the predicted data must be incidence_pred.

y: str

The name of the column in the dataframe which will be plotted as the y data.

title: str = ''

The title of the plot.

file_path: pathlib.Path | None = None

The path to save the plot to. If None, the plot will be displayed.

min_year: int = 2000

The minimum year to plot.

max_year: int = 2019

The maximum year to plot.

year_interval: int = 2

The interval between plotted years. Use a value greater than 1 if you don’t want to plot every year.

max_age: int = 110

The maximum age to plot.

width: int = 1000

The width of the plot.

height: int = 800

The height of the plot.

Returns:

If file_path is None, the plot will be displayed. Otherwise, the plot will be saved to the specified path.

leap.data_generation.occurrence_data.generate_occurrence_data()

Generate the asthma incidence and prevalence data.

Saves the data to a CSV file: processed_data/asthma_occurrence_predictions.csv.

The data is also plotted and saved to the following files:

data_generation/figures/asthma_incidence_predicted.png: The predicted asthma incidence per 100 in BC.
data_generation/figures/asthma_prevalence_predicted.png: The predicted asthma prevalence per 100 in BC.