Integrating proteomics and imaging data#

Import necessary packages#

import opendvp as dvp
import geopandas as gpd
import ast
import spatialdata
import napari_spatialdata
from dask_image import imread

Load DIANN data into an adata object#

adata = dvp.io.DIANN_to_adata(
    DIANN_path="data/proteomics/DIANN_subset_demo.pg_matrix.csv",
    DIANN_sep="\t",
    metadata_path="data/proteomics/DIANN_metadata_subset_demo.csv",
    metadata_sep=";",
    n_of_protein_metadata_cols=4
)
adata

50:10.86 | INFO | Starting DIANN matrix shape (7030, 14)
50:10.87 | INFO | Removing 264 contaminants
50:10.87 | INFO | Filtering 3 genes that are NaN
50:10.87 | INFO | ['A0A0G2JRQ6_HUMAN', 'A0A0J9YY99_HUMAN', 'YJ005_HUMAN']
50:10.87 | INFO | 10 samples, and 6763 proteins
50:10.88 | INFO | 52 gene lists (eg 'TMA7;TMA7B') were simplified to ('TMA7').
50:10.88 | SUCCESS | Anndata object has been created :)

AnnData object with n_obs × n_vars = 10 × 6763
    obs: 'Precursors.Identified', 'Proteins.Identified', 'Average.Missed.Tryptic.Cleavages', 'LCMS_run_id', 'RCN', 'RCN_long', 'QuPath_class'
    var: 'Protein.Group', 'Protein.Names', 'Genes', 'First.Protein.Description'

Let’s see the loaded metadata

adata.obs

	Precursors.Identified	Proteins.Identified	Average.Missed.Tryptic.Cleavages	LCMS_run_id	RCN	RCN_long	QuPath_class
Sample_filepath
E:\Proteomics\Jose\P26E17\rawfiles\Olive_20241204_JoN_Evo2_80SPD_Rplate_P26E17_K20_S6-G4_1_8674.d	26358	4760	0.139	8674	RCN1	Tumor enriched	P12_Tumor_1
E:\Proteomics\Jose\P26E17\rawfiles\Olive_20241129_JoN_Evo2_80SPD_P26E17_991_H21_S3-F9_1_8452.d	24705	4553	0.134	8452	RCN1	Tumor enriched	P12_Tumor_2
E:\Proteomics\Jose\P26E17\rawfiles\Olive_20241129_JoN_Evo2_80SPD_P26E17_991_E4_S2-C2_1_8414.d	26750	4835	0.134	8414	RCN1	Tumor enriched	P12_Tumor_3
E:\Proteomics\Jose\P26E17\rawfiles\Olive_20241204_JoN_Evo2_80SPD_Rplate_P26E17_L10_S4-H8_1_8551.d	24268	4502	0.116	8551	RCN1	Tumor enriched	P12_Tumor_4
E:\Proteomics\Jose\P26E17\rawfiles\Olive_20241129_JoN_Evo2_80SPD_P26E17_991_F4_S2-D2_1_8424.d	27883	4858	0.136	8424	RCN1	Tumor enriched	P12_Tumor_5
E:\Proteomics\Jose\P26E17\rawfiles\Olive_20241129_JoN_Evo2_80SPD_P26E17_991_C22_S3-A10_1_8521.d	20460	4116	0.129	8521	RCN3	Immune enriched	P12_Immune_1
E:\Proteomics\Jose\P26E17\rawfiles\Olive_20241129_JoN_Evo2_80SPD_P26E17_991_D4_S2-B2_1_8404.d	20446	3955	0.129	8404	RCN3	Immune enriched	P12_Immune_2
E:\Proteomics\Jose\P26E17\rawfiles\Olive_20241129_JoN_Evo2_80SPD_P26E17_991_D3_S2-B1_1_8403.d	21077	4212	0.127	8403	RCN3	Immune enriched	P12_Immune_3
E:\Proteomics\Jose\P26E17\rawfiles\Olive_20241129_JoN_Evo2_80SPD_P26E17_991_C10_S2-A8_1_8390.d	18013	3749	0.117	8390	RCN3	Immune enriched	P12_Immune_4
E:\Proteomics\Jose\P26E17\rawfiles\Olive_20241129_JoN_Evo2_80SPD_P26E17_991_C23_S3-A11_1_8520.d	19740	3799	0.123	8520	RCN3	Immune enriched	P12_Immune_5

Export adata for transparency#

dvp.io.export_adata(adata=adata, path_to_dir="data/checkpoints", checkpoint_name="1_loaded")

12:52:16.33 | INFO | Writing h5ad
12:52:16.37 | SUCCESS | Wrote h5ad file

Load shapes of cut samples#

gdf = gpd.read_file("data/proteomics/collection_shapes.geojson")
gdf.head()

	id	objectType	classification	geometry
0	78422794-3ed4-4e09-a9aa-6486c467149d	annotation	{ "name": "P12_Immune_3", "color": [ 0, 255, 2...	POLYGON ((3417 955.5, 3417.23 962.94, 3417.93 ...
1	70a57ffd-50a0-4449-b873-2dd2a38bf190	annotation	{ "name": "P12_Immune_3", "color": [ 0, 255, 2...	POLYGON ((3547 715.5, 3547.23 722.94, 3547.93 ...
2	edfda1ca-8de8-44f5-af4e-f16e15bcd4ca	annotation	{ "name": "P12_Immune_3", "color": [ 0, 255, 2...	POLYGON ((3324 568.5, 3324.23 575.94, 3324.93 ...
3	a0a868c6-c7e3-4fb0-8f4f-071454565080	annotation	{ "name": "P12_Immune_4", "color": [ 0, 255, 2...	POLYGON ((4398 2191.5, 4398.21 2198.06, 4398.8...
4	ce97d787-321e-4b7d-97e7-55b203fb0373	annotation	{ "name": "P12_Immune_4", "color": [ 0, 255, 2...	POLYGON ((4173 2349, 4173.22 2356.16, 4173.9 2...

Create spatialdata object#

sdata = spatialdata.SpatialData()

Load multiplex immunofluorescence into spatialdata object#

# first load as array using dask-image
image_array = imread.imread("data/image/mIF.ome.tif")
image_array

	Array	Chunk
Bytes	357.63 MiB	23.84 MiB
Shape	(15, 5000, 5000)	(1, 5000, 5000)
Dask graph	15 chunks in 4 graph layers
Data type	uint8 numpy.ndarray

# load image to spatialdata object
sdata['mIF'] = spatialdata.models.Image2DModel.parse(image_array)

INFO     no axes information specified in the object, setting `dims` to: ('c', 'y', 'x')

sdata

SpatialData object
└── Images
      └── 'mIF': DataArray[cyx] (15, 5000, 5000)
with coordinate systems:
    ▸ 'global', with elements:
        mIF (Images)

Load proteomics matrix (adata object) to spatialdata object#

First we must label the matrix, to let spatialdata know which coordinate system to use.
In this case, this means labelling which slide it was.

adata.obs["Slide_id"] = "Slide_P12"

adata

AnnData object with n_obs × n_vars = 10 × 6763
    obs: 'Precursors.Identified', 'Proteins.Identified', 'Average.Missed.Tryptic.Cleavages', 'LCMS_run_id', 'RCN', 'RCN_long', 'QuPath_class', 'Slide_id'
    var: 'Protein.Group', 'Protein.Names', 'Genes', 'First.Protein.Description'

Now we will pass the matrix to spatialdata.
We need to tell spatialdata what parts of the adata object to use for visualization.
region is which set of shapes to project these data into, in this case Slide_P12.
region_key is which column in adata.obs to use for finding the region parameter.
instance_key refers to the column that links this adata object to the shapes.
We use QuPath_class because that is the column that matches between the adata and the geodataframe.

sdata['proteomics'] = spatialdata.models.TableModel.parse(
                            adata=adata, 
                            region="Slide_P12",
                            region_key="Slide_id",
                            instance_key="QuPath_class"
                            )

sdata

SpatialData object
├── Images
│     └── 'mIF': DataArray[cyx] (15, 5000, 5000)
└── Tables
      └── 'proteomics': AnnData (10, 6763)
with coordinate systems:
    ▸ 'global', with elements:
        mIF (Images)

Load the geodataframe with shapes that match proteomic samples#

gdf.head()

	id	objectType	classification	geometry
0	78422794-3ed4-4e09-a9aa-6486c467149d	annotation	{ "name": "P12_Immune_3", "color": [ 0, 255, 2...	POLYGON ((3417 955.5, 3417.23 962.94, 3417.93 ...
1	70a57ffd-50a0-4449-b873-2dd2a38bf190	annotation	{ "name": "P12_Immune_3", "color": [ 0, 255, 2...	POLYGON ((3547 715.5, 3547.23 722.94, 3547.93 ...
2	edfda1ca-8de8-44f5-af4e-f16e15bcd4ca	annotation	{ "name": "P12_Immune_3", "color": [ 0, 255, 2...	POLYGON ((3324 568.5, 3324.23 575.94, 3324.93 ...
3	a0a868c6-c7e3-4fb0-8f4f-071454565080	annotation	{ "name": "P12_Immune_4", "color": [ 0, 255, 2...	POLYGON ((4398 2191.5, 4398.21 2198.06, 4398.8...
4	ce97d787-321e-4b7d-97e7-55b203fb0373	annotation	{ "name": "P12_Immune_4", "color": [ 0, 255, 2...	POLYGON ((4173 2349, 4173.22 2356.16, 4173.9 2...

Here we see that the QuPath class name is inside that classification column.
Let’s get it out

gdf['QuPath_class'] = gdf['classification'].apply(
    lambda row: ast.literal_eval(row).get('name') if isinstance(row,str) else row.get('name'))

# this line of code says, if classification cells are strings, 
# convert to a dictionary and get name attribute, 
# otherwise we assume it is already a dictionary and we get the name attribute.

gdf.head()

	id	objectType	classification	geometry	QuPath_class
0	78422794-3ed4-4e09-a9aa-6486c467149d	annotation	{ "name": "P12_Immune_3", "color": [ 0, 255, 2...	POLYGON ((3417 955.5, 3417.23 962.94, 3417.93 ...	P12_Immune_3
1	70a57ffd-50a0-4449-b873-2dd2a38bf190	annotation	{ "name": "P12_Immune_3", "color": [ 0, 255, 2...	POLYGON ((3547 715.5, 3547.23 722.94, 3547.93 ...	P12_Immune_3
2	edfda1ca-8de8-44f5-af4e-f16e15bcd4ca	annotation	{ "name": "P12_Immune_3", "color": [ 0, 255, 2...	POLYGON ((3324 568.5, 3324.23 575.94, 3324.93 ...	P12_Immune_3
3	a0a868c6-c7e3-4fb0-8f4f-071454565080	annotation	{ "name": "P12_Immune_4", "color": [ 0, 255, 2...	POLYGON ((4398 2191.5, 4398.21 2198.06, 4398.8...	P12_Immune_4
4	ce97d787-321e-4b7d-97e7-55b203fb0373	annotation	{ "name": "P12_Immune_4", "color": [ 0, 255, 2...	POLYGON ((4173 2349, 4173.22 2356.16, 4173.9 2...	P12_Immune_4

Now we see that the QuPath_class column has our matching names.
One last thing we must do, is set these names as the index of the geodataframe.

gdf = gdf.set_index("QuPath_class")
gdf.head()

	id	objectType	classification	geometry
QuPath_class
P12_Immune_3	78422794-3ed4-4e09-a9aa-6486c467149d	annotation	{ "name": "P12_Immune_3", "color": [ 0, 255, 2...	POLYGON ((3417 955.5, 3417.23 962.94, 3417.93 ...
P12_Immune_3	70a57ffd-50a0-4449-b873-2dd2a38bf190	annotation	{ "name": "P12_Immune_3", "color": [ 0, 255, 2...	POLYGON ((3547 715.5, 3547.23 722.94, 3547.93 ...
P12_Immune_3	edfda1ca-8de8-44f5-af4e-f16e15bcd4ca	annotation	{ "name": "P12_Immune_3", "color": [ 0, 255, 2...	POLYGON ((3324 568.5, 3324.23 575.94, 3324.93 ...
P12_Immune_4	a0a868c6-c7e3-4fb0-8f4f-071454565080	annotation	{ "name": "P12_Immune_4", "color": [ 0, 255, 2...	POLYGON ((4398 2191.5, 4398.21 2198.06, 4398.8...
P12_Immune_4	ce97d787-321e-4b7d-97e7-55b203fb0373	annotation	{ "name": "P12_Immune_4", "color": [ 0, 255, 2...	POLYGON ((4173 2349, 4173.22 2356.16, 4173.9 2...

Now we see that QuPath_class is the index.
Spatialdata needs this to match the proteomics adata object to these shapes.
Let’s add these prepared shapes to the spatialdata object.

sdata['Slide_P12'] = spatialdata.models.ShapesModel.parse(gdf)

sdata

SpatialData object
├── Images
│     └── 'mIF': DataArray[cyx] (15, 5000, 5000)
├── Shapes
│     └── 'Slide_P12': GeoDataFrame shape: (20, 4) (2D shapes)
└── Tables
      └── 'proteomics': AnnData (10, 6763)
with coordinate systems:
    ▸ 'global', with elements:
        mIF (Images), Slide_P12 (Shapes)

Visualize data interactively#

napari_spatialdata.Interactive(sdata)

<napari_spatialdata._interactive.Interactive at 0x2ad487610>

Write spatialdata object for future use#

sdata.write("outputs/spatialdata.zarr")

INFO     The Zarr backing store has been changed from None the new file path: outputs/spatialdata.zarr

Integrating proteomics and imaging data

Contents