Occurrence Data
To run the data processing for the occurrence data:
cd LEAP
python3 leap/data_generation/occurrence_data.py
leap.data_generation.occurrence_data module
- leap.data_generation.occurrence_data.poly(x: list[float] | numpy.ndarray, degree: int = 1, alpha: list[float] | numpy.ndarray | None = None, norm2: list[float] | numpy.ndarray | None = None, orthogonal: bool = False) → numpy.ndarray | tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray]
Generate a polynomial basis for a vector.
See Orthogonal polynomial regression in Python for more information on this function.
- Parameters:
- Returns:
The polynomial basis, as a 2D NumPy array. If orthogonal is True, the function returns a tuple of three NumPy arrays: the orthogonal polynomial basis, the alpha values, and the norm2 values.
Examples
>>> poly([1, 2, 3], degree=2)
array([[1, 1],
       [2, 4],
       [3, 9]])
>>> poly([1, 2, 3], degree=2, orthogonal=True)
(array([[-7.07106781e-01,  4.08248290e-01],
        [-5.55111512e-17, -8.16496581e-01],
        [ 7.07106781e-01,  4.08248290e-01]]),
 array([2., 2.]),
 array([3.        , 2.        , 0.66666667]))
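The orthogonal case can be hard to picture from the output alone. The following is a minimal sketch of one common way to build such a basis (centre the input, take a QR decomposition, rescale), modelled on the usual ports of R's poly. It is not taken from the LEAP source. The name poly_sketch is hypothetical, and the sketch omits the alpha and norm2 input arguments that the documented function accepts.

```python
import numpy as np


def poly_sketch(x, degree=1, orthogonal=False):
    """Hypothetical stand-in for `poly`; an illustration, not the LEAP source."""
    x = np.asarray(x, dtype=float)
    if not orthogonal:
        # Raw basis: columns are x**1, x**2, ..., x**degree.
        return np.column_stack([x ** d for d in range(1, degree + 1)])

    # Centre x, then build the matrix with columns 1, x, x**2, ..., x**degree.
    xbar = x.mean()
    xc = x - xbar
    X = np.vander(xc, degree + 1, increasing=True)

    # QR decomposition; scaling each column of Q by the matching diagonal
    # entry of R gives mutually orthogonal (not yet normalised) columns.
    q, r = np.linalg.qr(X)
    raw = q @ np.diag(np.diag(r))

    # Squared column norms and the centring constants of the recurrence.
    norm2 = (raw ** 2).sum(axis=0)
    alpha = ((xc[:, None] * raw ** 2).sum(axis=0) / norm2 + xbar)[:degree]

    # Normalise each column and drop the constant column.
    basis = (raw / np.sqrt(norm2))[:, 1:]
    return basis, alpha, norm2
```

For x = [1, 2, 3] and degree=2 this sketch gives the same basis, alpha, and norm2 values as the documented example, up to floating-point rounding.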
- leap.data_generation.occurrence_data.parse_age_group(x: str, max_age: int = 65) → tuple[int, int]
Parse an age group string into a tuple of integers.
- Parameters:
- x: str
The age group string. Must be in one of the formats “X-Y”, “X+”, “X-Y years”, or “<1 year”.
- Returns:
A tuple of integers representing the lower and upper age of the age group.
Examples
>>> parse_age_group("0-4")
(0, 4)
>>> parse_age_group("5-9 years")
(5, 9)
>>> parse_age_group("10+")
(10, 65)
>>> parse_age_group("<1 year")
(0, 1)
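As an illustration of the parsing rules implied by these examples, here is a minimal sketch using regular expressions. It is written only from the documented formats and outputs, not from the LEAP source. The name parse_age_group_sketch is hypothetical. Two behaviours are assumptions inferred from the examples: max_age is used as the upper bound for open-ended "X+" groups, and any "<N" group maps to (0, N).

```python
import re


def parse_age_group_sketch(x: str, max_age: int = 65) -> tuple[int, int]:
    """Hypothetical re-implementation based only on the documented examples."""
    text = x.strip()

    # "<1 year" is documented to give (0, 1); other "<N" values are an assumption.
    match = re.fullmatch(r"<\s*(\d+)\s*(?:years?)?", text)
    if match:
        return 0, int(match.group(1))

    # "X+": open-ended group, capped at max_age (assumed from the "10+" example).
    match = re.fullmatch(r"(\d+)\s*\+\s*(?:years?)?", text)
    if match:
        return int(match.group(1)), max_age

    # "X-Y" or "X-Y years".
    match = re.fullmatch(r"(\d+)\s*-\s*(\d+)\s*(?:years?)?", text)
    if match:
        return int(match.group(1)), int(match.group(2))

    raise ValueError(f"Unrecognised age group: {x!r}")
```

Raising ValueError on unknown formats is a design choice of this sketch. The documented function may handle such input differently.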
- leap.data_generation.occurrence_data.load_asthma_df(starting_year: int = 2000) → pandas.core.frame.DataFrame
Load the asthma incidence and prevalence data.
- Parameters:
- starting_year: int = 2000
The starting year for the data. Data before this year will be excluded from the analysis.
- Returns:
The asthma incidence and prevalence data. Columns:
year (int): The calendar year.
age_group (str): The age group.
age (int): The average age of the age group.
sex (str): One of F = female, M = male.
incidence (float): The incidence of asthma.
prevalence (float): The prevalence of asthma.
- leap.data_generation.occurrence_data.generate_occurrence_model(df_asthma: pandas.core.frame.DataFrame, formula: str, occ_type: str, maxiter: int = 1000) → statsmodels.genmod.generalized_linear_model.GLMResultsWrapper
Generate a GLM model for asthma incidence or prevalence.
- Parameters:
- df_asthma: pandas.core.frame.DataFrame
The asthma dataframe. Must have columns:
year (int): The calendar year.
sex (str): One of M = male, F = female.
age (int): The age in years.
incidence (float): The incidence of asthma.
prevalence (float): The prevalence of asthma.
- formula: str
The formula for the GLM model. See the statsmodels documentation for more information.
- occ_type: str
The type of occurrence data to model. Must be one of "incidence" or "prevalence".
- maxiter: int = 1000
The maximum number of iterations to perform while fitting the model.
- Returns:
The fitted GLM model.
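To make the formula and maxiter arguments more concrete, here is a minimal sketch of fitting a GLM with the statsmodels formula API on a dataframe in the documented column layout. Everything specific in it is an assumption: the data are synthetic, the formula is a placeholder, and the family is left at the statsmodels default (Gaussian). The formula and family that LEAP actually uses are not stated in this page.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic, made-up data in the documented column layout (not real asthma data).
rng = np.random.default_rng(0)
rows = [
    {"year": year, "sex": sex, "age": age}
    for year in range(2000, 2006)
    for sex in ("F", "M")
    for age in range(5, 65, 5)
]
df_example = pd.DataFrame(rows)
df_example["incidence"] = (
    0.02
    - 0.0002 * df_example["age"]
    + 0.002 * (df_example["sex"] == "M")
    + rng.normal(0, 0.0005, len(df_example))
)

# Placeholder formula and default (Gaussian) family; both are assumptions.
formula = "incidence ~ year + sex + age"
results = smf.glm(formula=formula, data=df_example).fit(maxiter=1000)

# The fitted results object can predict on any dataframe with the same columns.
predicted = results.predict(df_example)
```

Because sex is a string column, the formula interface treats it as categorical and estimates one coefficient for M relative to F.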
- leap.data_generation.occurrence_data.generate_incidence_model(df_asthma: pandas.core.frame.DataFrame, maxiter: int = 1000) → statsmodels.genmod.generalized_linear_model.GLMResultsWrapper
Generate a GLM model for asthma incidence.
- Parameters:
- df_asthma: pandas.core.frame.DataFrame
The asthma dataframe. Must have columns:
year (int): The calendar year.
sex (str): One of M = male, F = female.
age (int): The age in years.
incidence (float): The incidence of asthma.
prevalence (float): The prevalence of asthma.
- maxiter: int = 1000
The maximum number of iterations to perform while fitting the model.
- Returns:
The fitted GLM model.
- leap.data_generation.occurrence_data.generate_prevalence_model(df_asthma: pandas.core.frame.DataFrame, maxiter: int = 1000) → statsmodels.genmod.generalized_linear_model.GLMResultsWrapper
Generate a GLM model for asthma prevalence.
- Parameters:
- df_asthma: pandas.core.frame.DataFrame
The asthma dataframe. Must have columns:
year (int): The calendar year.
sex (str): One of M = male, F = female.
age (int): The age in years.
incidence (float): The incidence of asthma.
prevalence (float): The prevalence of asthma.
- maxiter: int = 1000
The maximum number of iterations to perform while fitting the model.
- Returns:
The fitted GLM model.
- leap.data_generation.occurrence_data.get_predicted_data(model: statsmodels.genmod.generalized_linear_model.GLMResultsWrapper, pred_col: str, min_age: int = 3, max_age: int = 100, min_year: int = 2000, max_year: int = 2019) → pandas.core.frame.DataFrame
Get predicted data from a GLM model.
The GLM model must be fitted on the following columns:
year (int): The calendar year.
sex (int): One of 0 = female, 1 = male.
age (int): The age in years.
- Parameters:
- model: statsmodels.genmod.generalized_linear_model.GLMResultsWrapper
The fitted GLM model.
- pred_col: str
The name of the column to store the predicted data.
- min_age: int = 3
The minimum age to predict.
- max_age: int = 100
The maximum age to predict.
- min_year: int = 2000
The minimum year to predict.
- max_year: int = 2019
The maximum year to predict.
- Returns:
A dataframe containing the predicted data. Columns:
year (int): The calendar year.
sex (str): One of M = male, F = female.
age (int): The age in years.
pred_col (float): The predicted data.
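The returned dataframe has one row per combination of year, sex, and age. A minimal sketch of building such a grid is shown here. The helper name build_prediction_grid is hypothetical, and treating all four bounds as inclusive is an assumption. The recoding between the integer sex values used for fitting and the string values in the output is left out.

```python
from itertools import product

import pandas as pd


def build_prediction_grid(min_age=3, max_age=100, min_year=2000, max_year=2019):
    """Hypothetical helper: one row per (year, sex, age) combination."""
    # Assumption: both the age bounds and the year bounds are inclusive.
    rows = list(
        product(
            range(min_year, max_year + 1),
            ("F", "M"),
            range(min_age, max_age + 1),
        )
    )
    return pd.DataFrame(rows, columns=["year", "sex", "age"])


grid = build_prediction_grid()
```

Under these assumptions, predictions from a fitted model could then be stored with something like grid[pred_col] = model.predict(grid), provided the grid uses the same sex coding the model was fitted on.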
- leap.data_generation.occurrence_data.plot_occurrence(df: pandas.core.frame.DataFrame, y: str, title: str = '', file_path: pathlib.Path | None = None, min_year: int = 2000, max_year: int = 2019, year_interval: int = 2, max_age: int = 110, width: int = 1000, height: int = 800)
Plot the incidence or prevalence of asthma.
- Parameters:
- df: pandas.core.frame.DataFrame
A dataframe containing either incidence or prevalence data. Must have columns:
year (int): The calendar year.
sex (str): One of F = female, M = male.
age (int): The age in years.
y (float): Specified by the y argument, this will be the y data.
y_pred (float): Optional, the predicted y data. If this column is present, it will be plotted alongside the actual data. The column name must be the same as y with _pred appended. For example, if y is incidence, then the predicted data must be incidence_pred.
- y: str
The name of the column in the dataframe which will be plotted as the y data.
- title: str = ''
The title of the plot.
- file_path: pathlib.Path | None = None
The path to save the plot to. If None, the plot will be displayed.
- min_year: int = 2000
The minimum year to plot.
- max_year: int = 2019
The maximum year to plot.
- year_interval: int = 2
The interval between years. This is used if you don’t want to plot every year.
- max_age: int = 110
The maximum age to plot.
- width: int = 1000
The width of the plot.
- height: int = 800
The height of the plot.
- Returns:
If file_path is None, the plot will be displayed. Otherwise, the plot will be saved to the specified path.
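Two of the conventions above are easy to get wrong: the name of the prediction column and the effect of year_interval. The sketch below spells them out. Both helper names are hypothetical, and the exact set of years that plot_occurrence selects (in particular whether max_year is inclusive) is an assumption.

```python
def predicted_column_name(y: str) -> str:
    """Name of the optional prediction column for a given y column."""
    return f"{y}_pred"


def years_to_plot(
    min_year: int = 2000, max_year: int = 2019, year_interval: int = 2
) -> list[int]:
    """Hypothetical helper: every `year_interval`-th year starting at min_year."""
    # Assumption: max_year is included when it falls on the interval.
    return list(range(min_year, max_year + 1, year_interval))
```

With the default arguments, this sketch selects the even years from 2000 to 2018.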
- leap.data_generation.occurrence_data.generate_occurrence_data()
Generate the asthma incidence and prevalence data.
Saves the data to a CSV file: processed_data/asthma_occurrence_predictions.csv.
The data is also plotted and saved to the following files:
data_generation/figures/asthma_incidence_predicted.png: The predicted asthma incidence per 100 in BC.
data_generation/figures/asthma_prevalence_predicted.png: The predicted asthma prevalence per 100 in BC.