opendvp.tl.stats_bootstrap#
- opendvp.tl.stats_bootstrap(dataframe, n_bootstrap=100, subset_sizes=None, summary_func=<function mean>, replace=True, return_raw=False, return_summary=True, plot=True, random_seed=42, nan_policy='omit', cv_threshold=None)#
Evaluate the variability of feature-level coefficient of variation (CV) via bootstrapping.
This function samples subsets from the input DataFrame and computes the CV (standard deviation divided by mean) of each feature (column) for each bootstrap replication. For each subset size, the function aggregates the CVs across bootstraps and then summarizes them with a user-specified statistic (e.g., mean, median). Optionally, the function can generate a violin plot of the summarized CVs across different subset sizes, and it returns the bootstrapped raw CVs and/or the summarized results.
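The procedure described above can be sketched in a few lines of pandas and NumPy. This is a minimal illustration, not the library's implementation: the helper name `bootstrap_cv_sketch`, the toy data, and the use of the sample standard deviation (`ddof=1`) are all assumptions made for the example.

```python
import numpy as np
import pandas as pd


def bootstrap_cv_sketch(df, subset_size, n_bootstrap=100, replace=True, seed=42):
    """Hypothetical helper: per-feature CVs, one row per feature and replicate."""
    rng = np.random.default_rng(seed)
    records = []
    for bootstrap_id in range(n_bootstrap):
        # Draw row positions for this replicate.
        idx = rng.choice(len(df), size=subset_size, replace=replace)
        sample = df.iloc[idx]
        # CV = standard deviation / mean, per column (ddof=1 is an assumption).
        cv = sample.std(ddof=1) / sample.mean()
        for feature, value in cv.items():
            records.append(
                {
                    "feature": feature,
                    "cv": value,
                    "subset_size": subset_size,
                    "bootstrap_id": bootstrap_id,
                }
            )
    return pd.DataFrame.from_records(records)


# Toy data with clearly positive means so the CV is well behaved.
rng = np.random.default_rng(0)
toy = pd.DataFrame(
    rng.normal(loc=10.0, scale=2.0, size=(200, 3)), columns=["A", "B", "C"]
)

raw = bootstrap_cv_sketch(toy, subset_size=50, n_bootstrap=20)

# Aggregate the CVs across replicates with the mean, one value per feature.
summary = (
    raw.groupby(["subset_size", "feature"], as_index=False)["cv"]
    .mean()
    .rename(columns={"cv": "cv_summary"})
)
```

Repeating the call for several values of `subset_size` and concatenating the results would give the kind of table that the violin plot summarizes.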
Parameters#
- dataframe : pandas.DataFrame
The input DataFrame containing the data (features in columns, samples in rows).
- n_bootstrap : int, optional (default=100)
Number of bootstrap replicates to perform for each subset size.
- subset_sizes : list of int, optional (default=None, treated as [10, 50, 100])
List of subset sizes (number of rows to sample) to use during the bootstrapping.
- summary_func : callable or "count_above_threshold", optional (default=np.mean)
Function to aggregate the per-feature CVs across bootstraps, for example np.mean or np.median. If set to "count_above_threshold", counts the number of CVs above cv_threshold for each feature.
- replace : bool, optional (default=True)
Whether to sample with replacement (True, standard bootstrapping) or without (False, subsampling).
- cv_threshold : float, optional (default=None)
Threshold for counting CVs above this value when summary_func is “count_above_threshold”.
- return_raw : bool, optional (default=False)
If True, returns the raw bootstrapped CVs in long format.
- return_summary : bool, optional (default=True)
If True, returns a summary DataFrame where the per-feature bootstrapped CVs have been aggregated using summary_func for each subset size.
- plot : bool, optional (default=True)
If True, displays a violin plot of the summarized CVs (one aggregated value per feature) across subset sizes.
- random_seed : int or None, optional (default=42)
Seed for the random number generator, ensuring reproducibility.
- nan_policy : {"omit", "raise", "propagate"}, optional (default="omit")
How to handle NaN values. Options are:
"omit": ignore NaNs during calculations,
"raise": raise an error if NaNs are encountered,
"propagate": allow NaNs to propagate in the output.
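To make the three nan_policy options concrete, the sketch below applies them to the CV of a single feature. The helper `cv_with_nan_policy` is hypothetical and only mirrors the documented behaviour; how the library handles NaNs internally is not shown in this page.

```python
import numpy as np


def cv_with_nan_policy(values, nan_policy="omit"):
    """Hypothetical helper: CV of a 1-D array under the three documented policies."""
    arr = np.asarray(values, dtype=float)
    if nan_policy == "raise":
        if np.isnan(arr).any():
            raise ValueError("NaN values encountered")
    elif nan_policy == "omit":
        # Drop NaNs before computing the statistics.
        arr = arr[~np.isnan(arr)]
    elif nan_policy != "propagate":
        raise ValueError(f"unknown nan_policy: {nan_policy!r}")
    # With "propagate", any NaN left in arr makes the result NaN.
    return np.std(arr, ddof=1) / np.mean(arr)


values = [1.0, 2.0, np.nan, 3.0]
cv_omit = cv_with_nan_policy(values, "omit")       # std/mean of [1, 2, 3] = 0.5
cv_prop = cv_with_nan_policy(values, "propagate")  # NaN
```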
Returns:#
- pandas.DataFrame or tuple of pandas.DataFrame
Depending on the flags return_raw and return_summary, the function returns:
- If both are True: a tuple (raw_df, summary_df)
  - raw_df: DataFrame in long format with columns "feature", "cv", "subset_size", and "bootstrap_id".
  - summary_df: DataFrame with the aggregated CV (using summary_func) per feature and subset size, with columns "subset_size", "feature", and "cv_summary".
- If only one of the flags is True, only that DataFrame is returned.
- If neither is True, returns None.
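The relationship between the two returned tables can be illustrated by deriving a summary from a small, hand-written raw table. The data and the `summarize` helper below are hypothetical; the column names follow the description above, and the "count_above_threshold" branch shows one plausible reading of that option.

```python
import numpy as np
import pandas as pd

# Hypothetical raw CVs in the long format described above.
raw_df = pd.DataFrame(
    {
        "feature": ["A", "A", "A", "B", "B", "B"],
        "cv": [0.10, 0.35, 0.40, 0.05, 0.20, 0.31],
        "subset_size": [10] * 6,
        "bootstrap_id": [0, 1, 2, 0, 1, 2],
    }
)


def summarize(raw, summary_func, cv_threshold=None):
    """Hypothetical helper: aggregate CVs per subset size and feature."""
    grouped = raw.groupby(["subset_size", "feature"])["cv"]
    if isinstance(summary_func, str) and summary_func == "count_above_threshold":
        if cv_threshold is None:
            raise ValueError("cv_threshold is required for count_above_threshold")
        # Count how many bootstrap CVs exceed the threshold for each feature.
        out = grouped.agg(lambda s: int((s > cv_threshold).sum()))
    else:
        # Apply the user-supplied callable to the CVs of each group.
        out = grouped.agg(lambda s: summary_func(s.to_numpy()))
    return out.rename("cv_summary").reset_index()


summary_mean = summarize(raw_df, np.mean)
summary_count = summarize(raw_df, "count_above_threshold", cv_threshold=0.3)
# summary_count["cv_summary"] is [2, 1]: two CVs of A and one CV of B exceed 0.3.
```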
Raises:#
- ValueError
If any of the specified subset sizes is larger than the number of rows in dataframe.
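A check of this kind could look like the following sketch. The function name `validate_subset_sizes` and the error message are assumptions for illustration; only the condition (a subset size exceeding the number of rows) comes from the description above.

```python
def validate_subset_sizes(n_rows, subset_sizes):
    """Hypothetical helper: reject subset sizes larger than the number of rows."""
    too_large = [size for size in subset_sizes if size > n_rows]
    if too_large:
        raise ValueError(
            f"subset sizes {too_large} exceed the number of rows ({n_rows})"
        )


validate_subset_sizes(100, [10, 50, 100])  # passes silently

try:
    validate_subset_sizes(100, [10, 200])
except ValueError as err:
    message = str(err)
```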
Examples:#
>>> import pandas as pd
>>> import numpy as np
>>> from opendvp.tl import stats_bootstrap
>>> df = pd.DataFrame(np.random.randn(100, 5))  # 100 samples, 5 features
>>> raw_results, summary_results = stats_bootstrap(df, subset_sizes=[10, 20, 50], return_raw=True)
>>> summary_results.head()
   subset_size feature  cv_summary
0           10       A    0.123456
1           10       B    0.098765
2           20       A    0.110987
3           20       B    0.102345
4           50       A    0.095432