Proccessing of proteomic data#

Package import#

import opendvp as dvp
import geopandas as gpd
import pandas as pd
import anndata as ad
import numpy as np
import scanpy as sc

import seaborn as sns
import matplotlib.pyplot as plt

Load adata object#

adata = ad.read_h5ad("data/checkpoints/1_loaded/20250701_1252_1_loaded_adata.h5ad")

adata

AnnData object with n_obs × n_vars = 10 × 6763
    obs: 'Precursors.Identified', 'Proteins.Identified', 'Average.Missed.Tryptic.Cleavages', 'LCMS_run_id', 'RCN', 'RCN_long', 'QuPath_class'
    var: 'Protein.Group', 'Protein.Names', 'Genes', 'First.Protein.Description'

First step, describe the dataset#

dvp.plotting.dual_axis_boxplots(adata_obs=adata.obs, feature_key="RCN")

../_images/54d1f9ff5bb72b63ffd825e23321cfefd924071b0bc075a32b8eceaf1469f62e.png

Let’s plot some data#

# Log2 transform the data
adata.X = np.log2(adata.X)

dvp.plotting.density(adata=adata, color_by="RCN")
dvp.plotting.density(adata=adata, color_by="QuPath_class")

../_images/555da664bf2db55b10a74255a30fcc65f02046abcf2f3fafe670d95d264edd7e.png

../_images/8a0a8bed359401d2887767f49a3a8b8863d0e6d36dfdbf5848c11c02c3dad4f3.png

Filter dataset by NaNs#

We expect a lot of proteins to be removed because this is a subsampled dataset.
Many protein hits were present in other groups not present here.

adata_filtered = dvp.tl.filter_features_byNaNs(adata=adata, threshold=0.7, grouping="RCN")

17:07.39 | INFO | Filtering protein with at least 70.0% valid values in ANY group
17:07.39 | INFO | Calculating overall QC metrics for all features.
17:07.40 | INFO | Filtering by groups, RCN: ['RCN1', 'RCN3']
17:07.40 | INFO |  RCN1 has 5 samples
17:07.40 | INFO |  RCN3 has 5 samples
17:07.41 | INFO | Keeping proteins that pass 'ANY' group criteria.
17:07.41 | INFO | Complete QC metrics for all initial features stored in `adata.uns['filter_features_byNaNs_qc_metrics']`.
17:07.41 | INFO | 4637 proteins were kept.
17:07.41 | INFO | 2126 proteins were removed.
17:07.41 | SUCCESS | filter_features_byNaNs complete.

adata_filtered.uns['filter_features_byNaNs_qc_metrics'].head()

	Protein.Group	Protein.Names	Genes	First.Protein.Description	overall_mean	overall_nan_count	overall_valid_count	overall_nan_proportions	overall_valid	RCN1_mean	...	RCN1_nan_proportions	RCN1_valid	RCN3_mean	RCN3_nan_count	RCN3_valid_count	RCN3_nan_proportions	RCN3_valid	valid_in_all_groups	valid_in_any_group	not_valid_in_any_group
Gene
TMA7	A0A024R1R8;Q9Y2S6	TMA7B_HUMAN;TMA7_HUMAN	TMA7;TMA7B	Translation machinery-associated protein 7B	13.391	5	5	0.5	False	13.029	...	0.4	False	13.933	3	2	0.6	False	False	False	True
IGLV8-61	A0A075B6I0	LV861_HUMAN	IGLV8-61	Immunoglobulin lambda variable 8-61	13.666	6	4	0.6	False	12.395	...	0.8	False	14.090	2	3	0.4	False	False	False	True
IGLV3-10	A0A075B6K4	LV310_HUMAN	IGLV3-10	Immunoglobulin lambda variable 3-10	14.216	6	4	0.6	False	13.335	...	0.6	False	15.097	3	2	0.6	False	False	False	True
IGLV3-9	A0A075B6K5	LV39_HUMAN	IGLV3-9	Immunoglobulin lambda variable 3-9	15.293	0	10	0.0	True	13.680	...	0.0	True	16.906	0	5	0.0	True	True	True	False
IGKV2-28	A0A075B6P5;P01615	KV228_HUMAN;KVD28_HUMAN	IGKV2-28;IGKV2D-28	Immunoglobulin kappa variable 2-28	15.343	1	9	0.1	True	14.014	...	0.2	True	16.407	0	5	0.0	True	True	True	False

5 rows × 22 columns

adata_filtered.uns['filter_features_byNaNs_qc_metrics'].columns

Index(['Protein.Group', 'Protein.Names', 'Genes', 'First.Protein.Description',
       'overall_mean', 'overall_nan_count', 'overall_valid_count',
       'overall_nan_proportions', 'overall_valid', 'RCN1_mean',
       'RCN1_nan_count', 'RCN1_valid_count', 'RCN1_nan_proportions',
       'RCN1_valid', 'RCN3_mean', 'RCN3_nan_count', 'RCN3_valid_count',
       'RCN3_nan_proportions', 'RCN3_valid', 'valid_in_all_groups',
       'valid_in_any_group', 'not_valid_in_any_group'],
      dtype='object')

In this dataframe, stored away in adata.uns, you can see all th qc metrics of the filtering

# Store filtered adata
dvp.io.export_adata(adata=adata_filtered, path_to_dir="data/checkpoints", checkpoint_name="2_filtered")

15:17:12.78 | INFO | Writing h5ad
15:17:12.84 | SUCCESS | Wrote h5ad file

Imputation#

adata_imputed = dvp.tl.impute_gaussian(adata=adata_filtered, mean_shift=-1.8, std_dev_shift=0.3)

17:31.45 | INFO | Storing original data in `adata.layers['unimputed']`.
17:31.45 | INFO | Imputation with Gaussian distribution PER PROTEIN
17:31.46 | INFO | Mean number of missing values per sample: 572.6 out of 4637 proteins
17:31.46 | INFO | Mean number of missing values per protein: 1.23 out of 10 samples
17:33.35 | INFO | Imputation complete. QC metrics stored in `adata.uns['impute_gaussian_qc_metrics']`.

adata_imputed

AnnData object with n_obs × n_vars = 10 × 4637
    obs: 'Precursors.Identified', 'Proteins.Identified', 'Average.Missed.Tryptic.Cleavages', 'LCMS_run_id', 'RCN', 'RCN_long', 'QuPath_class'
    var: 'Protein.Group', 'Protein.Names', 'Genes', 'First.Protein.Description', 'mean', 'nan_proportions'
    uns: 'filter_features_byNaNs_qc_metrics', 'impute_gaussian_qc_metrics'
    layers: 'unimputed'

Like the previous process, the imputation stores two quality control datasets.
First, the impute_gaussian_qc_metrics Showing you per protein:

how many values were imputed
the distribution used
the values used to impute with

adata_imputed.uns['impute_gaussian_qc_metrics']

	n_imputed	imputation_mean	imputation_stddev	imputed_values
Gene
IGLV3-9	0	15.292778	1.840310	NAN
IGKV2-28	1	15.343103	1.346909	[12.4703]
IGHV3-64	6	13.710045	1.247562	[11.1396, 12.2713, 12.2389, 10.8992, 11.9149, ...
IGKV2D-29	0	16.213394	1.481762	NAN
IGKV1-27	0	13.452261	1.394043	NAN
...	...	...	...	...
WASF2	0	15.135411	0.255652	NAN
MAU2	1	12.475482	0.455856	[11.636]
ENPP4	0	12.008373	0.630813	NAN
MORC2	1	12.330734	0.783775	[10.7335]
SEC23IP	0	14.050191	0.227216	NAN

4637 rows × 4 columns

Second, the unimputed values are stored inside the layers compartment of the adata object.
This is a backup in case imputation has done something wrong.
You can always call those values by adata_imputed.layers['unimputed']

dvp.io.export_adata(adata=adata_imputed, path_to_dir="data/checkpoints", checkpoint_name="3_imputed")

15:18:23.45 | INFO | Writing h5ad
15:18:23.51 | SUCCESS | Wrote h5ad file

Scanpy’s PCA#

sc.pp.pca(adata_imputed)

Scanpy is a very powerful data analysis package created for single-cell RNA sequencing.
We use it here because it is very convenient, and it already expects the AnnData format we have.
Beware of using Scanpy for proteomics datasets, assumptions will vary.

adata_imputed

AnnData object with n_obs × n_vars = 10 × 4637
    obs: 'Precursors.Identified', 'Proteins.Identified', 'Average.Missed.Tryptic.Cleavages', 'LCMS_run_id', 'RCN', 'RCN_long', 'QuPath_class'
    var: 'Protein.Group', 'Protein.Names', 'Genes', 'First.Protein.Description', 'mean', 'nan_proportions'
    uns: 'filter_features_byNaNs_qc_metrics', 'impute_gaussian_qc_metrics', 'pca'
    obsm: 'X_pca'
    varm: 'PCs'
    layers: 'unimputed'

# let's plot it
sc.pl.pca(adata_imputed, color="RCN", annotate_var_explained=True, size=300)

../_images/f36b58ef4cefbdd026f0c199598e55e93a3e0cd53d92fea6403a6a29bd5882e7.png

# let's plot the direction of each feature
dvp.plotting.pca_loadings(adata_imputed)

4 [-0.38130396 -0.40484619]
52 [-0.39167196  0.581337  ]

../_images/bc3f8ca6d91d58768a6a55d8f1565ebc8183131fb7859763db989ad91e6762ad.png

dvp.io.export_adata(adata=adata_imputed, path_to_dir="data/checkpoints", checkpoint_name="4_pca")

15:19:19.65 | INFO | Writing h5ad
15:19:19.71 | SUCCESS | Wrote h5ad file

heatmap#

dataframe = pd.DataFrame(data=adata_imputed.X, columns=adata_imputed.var_names, index=adata_imputed.obs.RCN)
sns.clustermap(data=dataframe.T, z_score=0, cmap="bwr", vmin=-2, vmax=2)

<seaborn.matrix.ClusterGrid at 0x156e4c790>

../_images/617d5afd9887da34f2ee59511b4de45ae8c5ad49eb9a06bf3390118a616d23ca.png

Differential analysis#

# ttest
adata_DAP = dvp.tl.stats_ttest(adata_imputed, grouping="RCN", group1="RCN1", group2="RCN3", FDR_threshold=0.05)

20:10.81 | INFO | Using pingouin.ttest to perform unpaired two-sided t-test between RCN1 and RCN3
20:10.81 | INFO | Using Benjamini-Hochberg for FDR correction, with a threshold of 0.05
20:10.81 | INFO | The test found 1997 proteins to be significantly

adata_DAP

AnnData object with n_obs × n_vars = 10 × 4637
    obs: 'Precursors.Identified', 'Proteins.Identified', 'Average.Missed.Tryptic.Cleavages', 'LCMS_run_id', 'RCN', 'RCN_long', 'QuPath_class'
    var: 'Protein.Group', 'Protein.Names', 'Genes', 'First.Protein.Description', 'mean', 'nan_proportions', 't_val', 'p_val', 'mean_diff', 'sig', 'p_corr', '-log10_p_corr'
    uns: 'filter_features_byNaNs_qc_metrics', 'impute_gaussian_qc_metrics', 'pca', 'RCN_colors'
    obsm: 'X_pca'
    varm: 'PCs'
    layers: 'unimputed'

dvp.io.export_adata(adata=adata_DAP, path_to_dir="data/checkpoints", checkpoint_name="5_DAP")

15:21:26.28 | INFO | Writing h5ad
15:21:26.37 | SUCCESS | Wrote h5ad file

plotting with volcano plot#

dvp.plotting.volcano(adata_DAP, x="mean_diff", y="-log10_p_corr", FDR=0.05, significant=True, tag_top=30, group1="Tumor samples", group2="Immune samples")

../_images/aa9a055aa2f2514d1dd6e2c82b4365b8b96d3b868da87f186a770b57e3ea4738.png

Proccessing of proteomic data

Contents