opendvp.tl.stats_bootstrap

opendvp.tl.stats_bootstrap(dataframe, n_bootstrap=100, subset_sizes=None, summary_func=<function mean>, replace=True, return_raw=False, return_summary=True, plot=True, random_seed=42, nan_policy='omit', cv_threshold=None)

Evaluate the variability of feature-level coefficient of variation (CV) via bootstrapping.

This function samples subsets of rows from the input DataFrame and computes the CV (standard deviation divided by mean) of each feature (column) for each bootstrap replicate. For each subset size, the per-feature CVs are collected across bootstraps and reduced to one value per feature with a user-specified statistic (e.g., mean, median). Optionally, the function generates a violin plot of the summarized CVs across subset sizes. It returns the raw bootstrapped CVs, the summarized results, or both.
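The procedure described above can be sketched with plain NumPy and pandas. This is a minimal illustration, not the library's implementation: the helper name `bootstrap_cv_sketch` is hypothetical, and the use of the sample standard deviation (`ddof=1`) is an assumption.

```python
import numpy as np
import pandas as pd

def bootstrap_cv_sketch(df, subset_size, n_bootstrap, seed=42):
    """Hypothetical helper: per-feature CV for each bootstrap replicate."""
    rng = np.random.default_rng(seed)
    rows = []
    for b in range(n_bootstrap):
        # Draw row positions with replacement (standard bootstrap).
        idx = rng.integers(0, len(df), size=subset_size)
        sample = df.iloc[idx]
        # CV = standard deviation / mean, column-wise (ddof=1 is an assumption).
        cv = sample.std(ddof=1) / sample.mean()
        for feature, value in cv.items():
            rows.append({"feature": feature, "cv": value,
                         "subset_size": subset_size, "bootstrap_id": b})
    return pd.DataFrame(rows)

# Demo data: 200 samples, 3 features with mean 10 and standard deviation 1.
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(loc=10, scale=1, size=(200, 3)),
                    columns=["A", "B", "C"])
raw = bootstrap_cv_sketch(demo, subset_size=50, n_bootstrap=20)
# Reduce the 20 replicate CVs of each feature to a single value.
summary = raw.groupby("feature")["cv"].mean()
```

With data centred well away from zero, as here, the CVs are stable. For features with a mean near zero the ratio becomes erratic, which is worth keeping in mind when interpreting results.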

Parameters

dataframe : pandas.DataFrame

The input DataFrame containing the data (features in columns, samples in rows).

n_bootstrap : int, optional (default=100)

Number of bootstrap replicates to perform for each subset size.

subset_sizes : list of int, optional (default=None)

List of subset sizes (number of rows to sample) to use during the bootstrapping. Defaults to [10, 50, 100] when None.

summary_func : callable or ‘count_above_threshold’, optional (default=np.mean)

Function used to aggregate the per-feature CVs across bootstraps, for example np.mean or np.median. If set to “count_above_threshold”, the number of CVs above cv_threshold is counted for each feature instead.

replace : bool, optional (default=True)

Whether to sample with replacement (True, standard bootstrapping) or without (False, subsampling).
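The difference between the two sampling modes can be illustrated with `pandas.DataFrame.sample`. This is only a sketch of the concept. Whether the function uses this call internally is not stated here.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": np.arange(10)})

# With replacement: a row may be drawn more than once in the same sample.
boot = df.sample(n=10, replace=True, random_state=0)

# Without replacement: every row appears at most once (subsampling).
sub = df.sample(n=5, replace=False, random_state=0)
```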

cv_threshold : float, optional (default=None)

Threshold for counting CVs above this value when summary_func is “count_above_threshold”.
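As a rough illustration of the “count_above_threshold” behaviour, the sketch below counts, per feature and subset size, how many replicate CVs exceed a threshold. The data are made up, and the strict comparison (`>`) is an assumption based on the word "above".

```python
import pandas as pd

# Made-up raw CVs in the long format described in the Returns section.
raw = pd.DataFrame({
    "feature": ["A", "A", "A", "B", "B", "B"],
    "cv": [0.10, 0.35, 0.40, 0.05, 0.20, 0.31],
    "subset_size": [10] * 6,
    "bootstrap_id": [0, 1, 2, 0, 1, 2],
})
cv_threshold = 0.3

# Flag each replicate CV, then count the flags per (subset_size, feature).
counts = (
    raw.assign(above=raw["cv"] > cv_threshold)
       .groupby(["subset_size", "feature"])["above"]
       .sum()
       .reset_index(name="cv_summary")
)
```

Here feature A has two replicates above 0.3 and feature B has one.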

return_raw : bool, optional (default=False)

If True, returns the raw bootstrapped CVs in long format.

return_summary : bool, optional (default=True)

If True, returns a summary DataFrame where the per-feature bootstrapped CVs have been aggregated using summary_func for each subset size.

plot : bool, optional (default=True)

If True, displays a violin plot of the summarized CVs (one aggregated value per feature) across subset sizes.

random_seed : int or None, optional (default=42)

Seed for the random number generator, ensuring reproducibility.

nan_policy : {‘omit’, ‘raise’, ‘propagate’}, optional (default=“omit”)

How to handle NaN values. Options are:
  • “omit”: ignore NaNs during calculations,
  • “raise”: raise an error if NaNs are encountered,
  • “propagate”: allow NaNs to propagate in the output.
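The three policies can be sketched for a single feature as follows. The helper `cv_with_nan_policy` is hypothetical and only mirrors the behaviour described above. The error message and the choice of `ddof=1` are assumptions.

```python
import numpy as np

def cv_with_nan_policy(values, nan_policy="omit"):
    """Hypothetical helper: CV of one feature under a given NaN policy."""
    values = np.asarray(values, dtype=float)
    if nan_policy == "raise":
        if np.isnan(values).any():
            raise ValueError("NaN values encountered")
        return np.std(values, ddof=1) / np.mean(values)
    if nan_policy == "omit":
        # NaN-aware reductions skip missing entries.
        return np.nanstd(values, ddof=1) / np.nanmean(values)
    if nan_policy == "propagate":
        # Plain reductions return NaN as soon as one NaN is present.
        return np.std(values, ddof=1) / np.mean(values)
    raise ValueError(f"unknown nan_policy: {nan_policy!r}")

data = [1.0, 2.0, np.nan, 3.0]
omit = cv_with_nan_policy(data, "omit")       # CV of [1, 2, 3]
prop = cv_with_nan_policy(data, "propagate")  # NaN
```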

Returns

pandas.DataFrame or tuple of pandas.DataFrame

Depending on the flags return_raw and return_summary, the function returns:
  • If both are True: a tuple (raw_df, summary_df), where
      ◦ raw_df: DataFrame in long format with columns “feature”, “cv”, “subset_size”, and “bootstrap_id”.
      ◦ summary_df: DataFrame with the aggregated CV (using summary_func) per feature and subset size, with columns “subset_size”, “feature”, and “cv_summary”.
  • If only one of the flags is True, only that DataFrame is returned.
  • If neither is True, returns None.
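The relationship between the two DataFrames can be sketched with a pandas group-by. The values below are made up. The column names follow the description above, and np.median stands in for summary_func.

```python
import numpy as np
import pandas as pd

# Made-up raw CVs: 2 features, 2 subset sizes, 2 bootstrap replicates each.
raw_df = pd.DataFrame({
    "feature": ["A", "B"] * 4,
    "cv": [0.10, 0.20, 0.14, 0.24, 0.08, 0.18, 0.10, 0.22],
    "subset_size": [10, 10, 10, 10, 50, 50, 50, 50],
    "bootstrap_id": [0, 0, 1, 1, 0, 0, 1, 1],
})
summary_func = np.median

# Reduce the replicate CVs to one value per (subset_size, feature).
summary_df = (
    raw_df.groupby(["subset_size", "feature"])["cv"]
          .apply(summary_func)
          .reset_index(name="cv_summary")
)
```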

Raises

ValueError

If any of the specified subset sizes is larger than the number of rows in dataframe.
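A check of this kind might look like the sketch below. The helper name `check_subset_sizes` and the error message are hypothetical and are not taken from the library.

```python
def check_subset_sizes(n_rows, subset_sizes):
    """Hypothetical validation: reject subset sizes larger than the data."""
    too_large = [s for s in subset_sizes if s > n_rows]
    if too_large:
        raise ValueError(
            f"Subset sizes {too_large} exceed the {n_rows} available rows."
        )

check_subset_sizes(100, [10, 50, 100])  # passes silently
```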

Examples

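>>> import numpy as np
>>> import pandas as pd
>>> from opendvp.tl import stats_bootstrap
>>> # 100 samples, 2 features with mean 10 and standard deviation 1
>>> df = pd.DataFrame(np.random.normal(10, 1, size=(100, 2)), columns=["A", "B"])
>>> # return_raw defaults to False, so request both DataFrames explicitly
>>> raw_results, summary_results = stats_bootstrap(df, subset_sizes=[10, 20, 50], return_raw=True)
>>> summary_results.head()  # illustrative output; exact values will differ
   subset_size feature  cv_summary
0           10       A    0.123456
1           10       B    0.098765
2           20       A    0.110987
3           20       B    0.102345
4           50       A    0.095432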