Integrating proteomics and imaging data#

Import necessary packages#

import opendvp as dvp
import geopandas as gpd
import ast
import spatialdata
import napari_spatialdata
from dask_image import imread

Load DIANN data into an adata object#

adata = dvp.io.DIANN_to_adata(
    DIANN_path="data/proteomics/DIANN_subset_demo.pg_matrix.csv",
    DIANN_sep="\t",
    metadata_path="data/proteomics/DIANN_metadata_subset_demo.csv",
    metadata_sep=";",
    n_of_protein_metadata_cols=4
)
adata
13:50:10.86 | INFO | Starting DIANN matrix shape (7030, 14)
13:50:10.87 | INFO | Removing 264 contaminants
13:50:10.87 | INFO | Filtering 3 genes that are NaN
13:50:10.87 | INFO | ['A0A0G2JRQ6_HUMAN', 'A0A0J9YY99_HUMAN', 'YJ005_HUMAN']
13:50:10.87 | INFO | 10 samples, and 6763 proteins
13:50:10.88 | INFO | 52 gene lists (eg 'TMA7;TMA7B') were simplified to ('TMA7').
13:50:10.88 | SUCCESS | Anndata object has been created :)
AnnData object with n_obs × n_vars = 10 × 6763
    obs: 'Precursors.Identified', 'Proteins.Identified', 'Average.Missed.Tryptic.Cleavages', 'LCMS_run_id', 'RCN', 'RCN_long', 'QuPath_class'
    var: 'Protein.Group', 'Protein.Names', 'Genes', 'First.Protein.Description'

Let’s see the loaded metadata

adata.obs
Precursors.Identified Proteins.Identified Average.Missed.Tryptic.Cleavages LCMS_run_id RCN RCN_long QuPath_class
Sample_filepath
E:\Proteomics\Jose\P26E17\rawfiles\Olive_20241204_JoN_Evo2_80SPD_Rplate_P26E17_K20_S6-G4_1_8674.d 26358 4760 0.139 8674 RCN1 Tumor enriched P12_Tumor_1
E:\Proteomics\Jose\P26E17\rawfiles\Olive_20241129_JoN_Evo2_80SPD_P26E17_991_H21_S3-F9_1_8452.d 24705 4553 0.134 8452 RCN1 Tumor enriched P12_Tumor_2
E:\Proteomics\Jose\P26E17\rawfiles\Olive_20241129_JoN_Evo2_80SPD_P26E17_991_E4_S2-C2_1_8414.d 26750 4835 0.134 8414 RCN1 Tumor enriched P12_Tumor_3
E:\Proteomics\Jose\P26E17\rawfiles\Olive_20241204_JoN_Evo2_80SPD_Rplate_P26E17_L10_S4-H8_1_8551.d 24268 4502 0.116 8551 RCN1 Tumor enriched P12_Tumor_4
E:\Proteomics\Jose\P26E17\rawfiles\Olive_20241129_JoN_Evo2_80SPD_P26E17_991_F4_S2-D2_1_8424.d 27883 4858 0.136 8424 RCN1 Tumor enriched P12_Tumor_5
E:\Proteomics\Jose\P26E17\rawfiles\Olive_20241129_JoN_Evo2_80SPD_P26E17_991_C22_S3-A10_1_8521.d 20460 4116 0.129 8521 RCN3 Immune enriched P12_Immune_1
E:\Proteomics\Jose\P26E17\rawfiles\Olive_20241129_JoN_Evo2_80SPD_P26E17_991_D4_S2-B2_1_8404.d 20446 3955 0.129 8404 RCN3 Immune enriched P12_Immune_2
E:\Proteomics\Jose\P26E17\rawfiles\Olive_20241129_JoN_Evo2_80SPD_P26E17_991_D3_S2-B1_1_8403.d 21077 4212 0.127 8403 RCN3 Immune enriched P12_Immune_3
E:\Proteomics\Jose\P26E17\rawfiles\Olive_20241129_JoN_Evo2_80SPD_P26E17_991_C10_S2-A8_1_8390.d 18013 3749 0.117 8390 RCN3 Immune enriched P12_Immune_4
E:\Proteomics\Jose\P26E17\rawfiles\Olive_20241129_JoN_Evo2_80SPD_P26E17_991_C23_S3-A11_1_8520.d 19740 3799 0.123 8520 RCN3 Immune enriched P12_Immune_5

Export adata for transparency#

dvp.io.export_adata(adata=adata, path_to_dir="data/checkpoints", checkpoint_name="1_loaded")
12:52:16.33 | INFO | Writing h5ad
12:52:16.37 | SUCCESS | Wrote h5ad file

Load shapes of cut samples#

gdf = gpd.read_file("data/proteomics/collection_shapes.geojson")
gdf.head()
id objectType classification geometry
0 78422794-3ed4-4e09-a9aa-6486c467149d annotation { "name": "P12_Immune_3", "color": [ 0, 255, 2... POLYGON ((3417 955.5, 3417.23 962.94, 3417.93 ...
1 70a57ffd-50a0-4449-b873-2dd2a38bf190 annotation { "name": "P12_Immune_3", "color": [ 0, 255, 2... POLYGON ((3547 715.5, 3547.23 722.94, 3547.93 ...
2 edfda1ca-8de8-44f5-af4e-f16e15bcd4ca annotation { "name": "P12_Immune_3", "color": [ 0, 255, 2... POLYGON ((3324 568.5, 3324.23 575.94, 3324.93 ...
3 a0a868c6-c7e3-4fb0-8f4f-071454565080 annotation { "name": "P12_Immune_4", "color": [ 0, 255, 2... POLYGON ((4398 2191.5, 4398.21 2198.06, 4398.8...
4 ce97d787-321e-4b7d-97e7-55b203fb0373 annotation { "name": "P12_Immune_4", "color": [ 0, 255, 2... POLYGON ((4173 2349, 4173.22 2356.16, 4173.9 2...

Create spatialdata object#

sdata = spatialdata.SpatialData()

Load multiplex immunofluorescence into spatialdata object#

# first load as array using dask-image
image_array = imread.imread("data/image/mIF.ome.tif")
image_array
Array Chunk
Bytes 357.63 MiB 23.84 MiB
Shape (15, 5000, 5000) (1, 5000, 5000)
Dask graph 15 chunks in 4 graph layers
Data type uint8 numpy.ndarray
5000 5000 15
# load image to spatialdata object
sdata['mIF'] = spatialdata.models.Image2DModel.parse(image_array)
INFO     no axes information specified in the object, setting `dims` to: ('c', 'y', 'x')
sdata
SpatialData object
└── Images
      └── 'mIF': DataArray[cyx] (15, 5000, 5000)
with coordinate systems:
    ▸ 'global', with elements:
        mIF (Images)

Load proteomics matrix (adata object) to spatialdata object#

First we must label the matrix, to let spatialdata know which coordinate system to use.
In this case, this means labelling which slide it was.

adata.obs["Slide_id"] = "Slide_P12"
adata
AnnData object with n_obs × n_vars = 10 × 6763
    obs: 'Precursors.Identified', 'Proteins.Identified', 'Average.Missed.Tryptic.Cleavages', 'LCMS_run_id', 'RCN', 'RCN_long', 'QuPath_class', 'Slide_id'
    var: 'Protein.Group', 'Protein.Names', 'Genes', 'First.Protein.Description'

Now we will pass the matrix to spatialdata.
We need to tell spatialdata what parts of the adata object to use for visualization.
region is which set of shapes to project these data into, in this case Slide_P12.
region_key is which column in adata.obs to use for finding the region parameter.
instance_key refers to the column that links this adata object to the shapes.
We use QuPath_class because that is the column that matches between the adata and the geodataframe.

sdata['proteomics'] = spatialdata.models.TableModel.parse(
                            adata=adata, 
                            region="Slide_P12",
                            region_key="Slide_id",
                            instance_key="QuPath_class"
                            )
sdata
SpatialData object
├── Images
│     └── 'mIF': DataArray[cyx] (15, 5000, 5000)
└── Tables
      └── 'proteomics': AnnData (10, 6763)
with coordinate systems:
    ▸ 'global', with elements:
        mIF (Images)

Load the geodataframe with shapes that match proteomic samples#

gdf.head()
id objectType classification geometry
0 78422794-3ed4-4e09-a9aa-6486c467149d annotation { "name": "P12_Immune_3", "color": [ 0, 255, 2... POLYGON ((3417 955.5, 3417.23 962.94, 3417.93 ...
1 70a57ffd-50a0-4449-b873-2dd2a38bf190 annotation { "name": "P12_Immune_3", "color": [ 0, 255, 2... POLYGON ((3547 715.5, 3547.23 722.94, 3547.93 ...
2 edfda1ca-8de8-44f5-af4e-f16e15bcd4ca annotation { "name": "P12_Immune_3", "color": [ 0, 255, 2... POLYGON ((3324 568.5, 3324.23 575.94, 3324.93 ...
3 a0a868c6-c7e3-4fb0-8f4f-071454565080 annotation { "name": "P12_Immune_4", "color": [ 0, 255, 2... POLYGON ((4398 2191.5, 4398.21 2198.06, 4398.8...
4 ce97d787-321e-4b7d-97e7-55b203fb0373 annotation { "name": "P12_Immune_4", "color": [ 0, 255, 2... POLYGON ((4173 2349, 4173.22 2356.16, 4173.9 2...

Here we see that the QuPath class name is inside that classification column.
Let’s get it out

gdf['QuPath_class'] = gdf['classification'].apply(
    lambda row: ast.literal_eval(row).get('name') if isinstance(row,str) else row.get('name'))

# this line of code says, if classification cells are strings, 
# convert to a dictionary and get name attribute, 
# otherwise we assume it is already a dictionary and we get the name attribute.
gdf.head()
id objectType classification geometry QuPath_class
0 78422794-3ed4-4e09-a9aa-6486c467149d annotation { "name": "P12_Immune_3", "color": [ 0, 255, 2... POLYGON ((3417 955.5, 3417.23 962.94, 3417.93 ... P12_Immune_3
1 70a57ffd-50a0-4449-b873-2dd2a38bf190 annotation { "name": "P12_Immune_3", "color": [ 0, 255, 2... POLYGON ((3547 715.5, 3547.23 722.94, 3547.93 ... P12_Immune_3
2 edfda1ca-8de8-44f5-af4e-f16e15bcd4ca annotation { "name": "P12_Immune_3", "color": [ 0, 255, 2... POLYGON ((3324 568.5, 3324.23 575.94, 3324.93 ... P12_Immune_3
3 a0a868c6-c7e3-4fb0-8f4f-071454565080 annotation { "name": "P12_Immune_4", "color": [ 0, 255, 2... POLYGON ((4398 2191.5, 4398.21 2198.06, 4398.8... P12_Immune_4
4 ce97d787-321e-4b7d-97e7-55b203fb0373 annotation { "name": "P12_Immune_4", "color": [ 0, 255, 2... POLYGON ((4173 2349, 4173.22 2356.16, 4173.9 2... P12_Immune_4

Now we see that the QuPath_class column has our matching names.
One last thing we must do, is set these names as the index of the geodataframe.

gdf = gdf.set_index("QuPath_class")
gdf.head()
id objectType classification geometry
QuPath_class
P12_Immune_3 78422794-3ed4-4e09-a9aa-6486c467149d annotation { "name": "P12_Immune_3", "color": [ 0, 255, 2... POLYGON ((3417 955.5, 3417.23 962.94, 3417.93 ...
P12_Immune_3 70a57ffd-50a0-4449-b873-2dd2a38bf190 annotation { "name": "P12_Immune_3", "color": [ 0, 255, 2... POLYGON ((3547 715.5, 3547.23 722.94, 3547.93 ...
P12_Immune_3 edfda1ca-8de8-44f5-af4e-f16e15bcd4ca annotation { "name": "P12_Immune_3", "color": [ 0, 255, 2... POLYGON ((3324 568.5, 3324.23 575.94, 3324.93 ...
P12_Immune_4 a0a868c6-c7e3-4fb0-8f4f-071454565080 annotation { "name": "P12_Immune_4", "color": [ 0, 255, 2... POLYGON ((4398 2191.5, 4398.21 2198.06, 4398.8...
P12_Immune_4 ce97d787-321e-4b7d-97e7-55b203fb0373 annotation { "name": "P12_Immune_4", "color": [ 0, 255, 2... POLYGON ((4173 2349, 4173.22 2356.16, 4173.9 2...

Now we see that QuPath_class is the index.
Spatialdata needs this to match the proteomics adata object to these shapes.
Let’s add these prepared shapes to the spatialdata object.

sdata['Slide_P12'] = spatialdata.models.ShapesModel.parse(gdf)
sdata
SpatialData object
├── Images
│     └── 'mIF': DataArray[cyx] (15, 5000, 5000)
├── Shapes
│     └── 'Slide_P12': GeoDataFrame shape: (20, 4) (2D shapes)
└── Tables
      └── 'proteomics': AnnData (10, 6763)
with coordinate systems:
    ▸ 'global', with elements:
        mIF (Images), Slide_P12 (Shapes)

Visualize data interactively#

napari_spatialdata.Interactive(sdata)
<napari_spatialdata._interactive.Interactive at 0x2ad487610>

Write spatialdata object for future use#

sdata.write("outputs/spatialdata.zarr")
INFO     The Zarr backing store has been changed from None the new file path: outputs/spatialdata.zarr