Querying a SOMA experiment
Source:vignettes/soma-experiment-queries.Rmd
soma-experiment-queries.Rmd
Overview
In this notebook, we’ll take a quick look at how to query a
SOMAExperiment
using the
SOMAExperimentAxisQuery
class. This allows for easy
selection of data from a SOMAMeasurement
by filtering on
annotations stored in each axis data frame (i.e., obs
and
var
).
Example data
Load the bundled SOMAExperiment
containing a subsetted
version of the 10X genomics PBMC
dataset provided by SeuratObject. This will return a
SOMAExperiment
object.
experiment <- load_dataset("soma-exp-pbmc-small")
experiment
#> <SOMAExperiment>
#> uri: /tmp/RtmpXjeQmR/soma-exp-pbmc-small
Querying basics
To perform a query we’ll need to initialize a new
SOMAExperimentAxisQuery
object, specifying the
SOMAExperiment
and the SOMAMeasurement
within
the experiment we want to query.
We can see that our current experiment contains only a single
measurement: "RNA"
.
experiment$ms
#> <SOMACollection>
#> uri: file:///tmp/RtmpXjeQmR/soma-exp-pbmc-small/ms
To use larger (or smaller) buffer sizes:
ctx <- SOMATileDBContext$new(c(soma.init_buffer_bytes = as.character(2 * 1024**3)))
experiment <- SOMAExperimentOpen(experiment$uri, tiledbsoma_ctx = ctx)
Alternatively, you can have in your environment
export TILEDB_SOMA_INIT_BUFFER_BYTES=2147483648
before
loading the data.
Now we can construct our query object.
query <- SOMAExperimentAxisQuery$new(
experiment = experiment,
measurement_name = "RNA"
)
Once it’s created, we can use the query
object to
inspect, select, and extract filtered data from the experiment.
For example, we can use n_obs
and n_vars
to
determine the number of observations and variables that passed our
filtering criteria. Since we didn’t specify any filtering criteria,
these numbers will match the full size of the experiment.
Number of observations:
query$n_obs
#> [1] 80
Number of variables:
query$n_vars
#> [1] 230
We can also extract any data component from the experiment. Here
we’ll read in the obs
data frame from the query using
obs()
which returns an iterator of
arrow::Table
. The iterator is useful when the data is too
large to load in memory allowing to stream the data in chunks. This
applies to var()
as well.
To load the data in memory you can concatenate all chunks of the iterator as shown below.
iterator <- query$obs()
obs <- iterator$concat()
obs
#> Table
#> 80 rows x 9 columns
#> $soma_joinid <int64 not null>
#> $orig.ident <dictionary<values=string, indices=int8>>
#> $nCount_RNA <double>
#> $nFeature_RNA <int32>
#> $RNA_snn_res.0.8 <dictionary<values=string, indices=int8>>
#> $letter.idents <dictionary<values=string, indices=int8>>
#> $groups <large_string>
#> $RNA_snn_res.1 <dictionary<values=string, indices=int8>>
#> $obs_id <large_string>
As a reminder arrow:Table
can be easily cast into a
tibble
obs$to_data_frame()
#> soma_joinid orig.ident nCount_RNA nFeature_RNA RNA_snn_res.0.8
#> 1 0 SeuratProject 70 47 0
#> 2 1 SeuratProject 85 52 0
#> 3 2 SeuratProject 87 50 1
#> 4 3 SeuratProject 127 56 0
#> 5 4 SeuratProject 173 53 0
#> 6 5 SeuratProject 70 48 0
#> 7 6 SeuratProject 64 36 0
#> 8 7 SeuratProject 72 45 0
#> 9 8 SeuratProject 52 36 0
#> 10 9 SeuratProject 100 41 0
#> 11 10 SeuratProject 62 31 0
#> 12 11 SeuratProject 191 61 0
#> 13 12 SeuratProject 101 41 0
#> 14 13 SeuratProject 51 26 0
#> 15 14 SeuratProject 99 45 0
#> 16 15 SeuratProject 168 44 0
#> 17 16 SeuratProject 67 33 0
#> 18 17 SeuratProject 135 45 0
#> 19 18 SeuratProject 79 43 0
#> 20 19 SeuratProject 109 41 0
#> 21 20 SeuratProject 298 65 1
#> 22 21 SeuratProject 406 74 1
#> 23 22 SeuratProject 213 48 1
#> 24 23 SeuratProject 231 49 1
#> 25 24 SeuratProject 463 77 1
#> 26 25 SeuratProject 87 42 1
#> 27 26 SeuratProject 327 62 1
#> 28 27 SeuratProject 224 50 1
#> 29 28 SeuratProject 361 76 1
#> 30 29 SeuratProject 353 80 1
#> 31 30 SeuratProject 246 59 0
#> 32 31 SeuratProject 115 51 0
#> 33 32 SeuratProject 189 53 0
#> 34 33 SeuratProject 187 61 0
#> 35 34 SeuratProject 156 48 0
#> 36 35 SeuratProject 164 47 0
#> 37 36 SeuratProject 221 67 0
#> 38 37 SeuratProject 151 59 0
#> 39 38 SeuratProject 126 53 0
#> 40 39 SeuratProject 316 65 0
#> 41 40 SeuratProject 156 60 0
#> 42 41 SeuratProject 139 61 0
#> 43 42 SeuratProject 108 44 0
#> 44 43 SeuratProject 41 32 0
#> 45 44 SeuratProject 146 47 0
#> 46 45 SeuratProject 104 40 0
#> 47 46 SeuratProject 126 48 0
#> 48 47 SeuratProject 94 55 0
#> 49 48 SeuratProject 204 52 0
#> 50 49 SeuratProject 99 45 0
#> 51 50 SeuratProject 371 75 1
#> 52 51 SeuratProject 387 83 1
#> 53 52 SeuratProject 139 50 1
#> 54 53 SeuratProject 99 42 1
#> 55 54 SeuratProject 443 77 1
#> 56 55 SeuratProject 417 75 0
#> 57 56 SeuratProject 502 81 1
#> 58 57 SeuratProject 324 76 1
#> 59 58 SeuratProject 292 71 1
#> 60 59 SeuratProject 443 81 0
#> 61 60 SeuratProject 787 88 0
#> 62 61 SeuratProject 612 69 1
#> 63 62 SeuratProject 286 68 0
#> 64 63 SeuratProject 462 86 1
#> 65 64 SeuratProject 872 96 1
#> 66 65 SeuratProject 709 94 1
#> 67 66 SeuratProject 745 84 1
#> 68 67 SeuratProject 328 72 1
#> 69 68 SeuratProject 389 73 1
#> 70 69 SeuratProject 754 83 0
#> 71 70 SeuratProject 212 38 0
#> 72 71 SeuratProject 172 29 0
#> 73 72 SeuratProject 168 37 0
#> 74 73 SeuratProject 210 33 0
#> 75 74 SeuratProject 228 39 0
#> 76 75 SeuratProject 527 47 0
#> 77 76 SeuratProject 202 30 0
#> 78 77 SeuratProject 157 29 0
#> 79 78 SeuratProject 150 30 0
#> 80 79 SeuratProject 233 76 1
#> letter.idents groups RNA_snn_res.1 obs_id
#> 1 A g2 0 ATGCCAGAACGACT
#> 2 A g1 0 CATGGCCTGTGCAT
#> 3 B g2 0 GAACCTGATGAACC
#> 4 A g2 0 TGACTGGATTCTCA
#> 5 A g2 0 AGTCAGACTGCACA
#> 6 A g1 0 TCTGATACACGTGT
#> 7 A g1 0 TGGTATCTAAACAG
#> 8 A g1 0 GCAGCTCTGTTTCT
#> 9 A g1 0 GATATAACACGCAT
#> 10 A g1 0 AATGTTGACAGTCA
#> 11 A g2 2 AGGTCATGAGTGTC
#> 12 A g1 2 AGAGATGATCTCGC
#> 13 A g2 2 GGGTAACTCTAGTG
#> 14 A g2 2 CATGAGACACGGGA
#> 15 A g2 2 TACGCCACTCCGAA
#> 16 A g1 2 CTAAACCTGTGCAT
#> 17 A g2 2 GTAAGCACTCATTC
#> 18 A g1 2 TTGGTACTGAATCC
#> 19 A g1 2 CATCATACGGAGCA
#> 20 A g2 2 TACATCACGCTAAC
#> 21 B g1 1 TTACCATGAATCGC
#> 22 B g1 1 ATAGGAGAAACAGA
#> 23 B g2 1 GCGCACGACTTTAC
#> 24 B g2 1 ACTCGCACGAAAGT
#> 25 B g1 1 ATTACCTGCCTTAT
#> 26 B g2 1 CCCAACTGCAATCG
#> 27 B g2 1 AAATTCGAATCACG
#> 28 B g2 1 CCATCCGATTCGCC
#> 29 B g2 1 TCCACTCTGAGCTT
#> 30 B g1 1 CATCAGGATGCACA
#> 31 A g1 0 CTAAACCTCTGACA
#> 32 A g1 2 GATAGAGAAGGGTG
#> 33 A g1 0 CTAACGGAACCGAT
#> 34 A g2 0 AGATATACCCGTAA
#> 35 A g1 0 TACTCTGAATCGAC
#> 36 A g1 0 GCGCATCTTGCTCC
#> 37 A g2 0 GTTGACGATATCGG
#> 38 A g1 0 ACAGGTACTGGTGT
#> 39 A g1 0 GGCATATGCTTATC
#> 40 A g2 0 CATTACACCAACTG
#> 41 A g1 0 TAGGGACTGAACTC
#> 42 A g2 2 GCTCCATGAGAAGT
#> 43 A g2 0 TACAATGATGCTAG
#> 44 A g2 0 CTTCATGACCGAAT
#> 45 A g1 2 CTGCCAACAGGAGC
#> 46 A g2 2 TTGCATTGAGCTAC
#> 47 A g1 0 AAGCAAGAGCTTAG
#> 48 A g2 0 CGGCACGAACTCAG
#> 49 A g1 0 GGTGGAGATTACTC
#> 50 A g2 0 GGCCGATGTACTCT
#> 51 B g1 1 CGTAGCCTGTATGC
#> 52 B g2 1 TGAGCTGAATGCTG
#> 53 B g2 2 CCTATAACGAGACG
#> 54 B g2 1 ATAAGTTGGTACGT
#> 55 B g1 1 AAGCGACTTTGACG
#> 56 A g1 1 ACCAGTGAATACCG
#> 57 B g1 1 ATTGCACTTGCTTT
#> 58 B g1 1 CTAGGTGATGGTTG
#> 59 B g2 1 GCACTAGACCTTTA
#> 60 A g1 0 CATGCGCTAGTCAC
#> 61 A g1 2 TTGAGGACTACGCA
#> 62 B g1 1 ATACCACTCTAAGC
#> 63 A g1 2 CATATAGACTAAGC
#> 64 B g1 1 TTTAGCTGTACTCT
#> 65 B g1 2 GACATTCTCCACCT
#> 66 B g2 1 ACGTGATGCCATGA
#> 67 B g2 1 ATTGTAGATTCCCG
#> 68 B g1 1 GATAGAGATCACGA
#> 69 B g1 1 AATGCGTGGACGGA
#> 70 A g1 2 GCGTAAACACGGTT
#> 71 A g2 0 ATTCAGCTCATTGG
#> 72 A g1 0 GGCATATGGGGAGT
#> 73 A g2 0 ATCATCTGACACCA
#> 74 A g2 0 GTCATACTTCGCCT
#> 75 A g1 0 TTACGTACGTTCAG
#> 76 A g1 0 GAGTTGTGGTAGCT
#> 77 A g2 0 GACGCTCTCTCTCG
#> 78 A g1 0 AGTCTTACTTCGGA
#> 79 A g2 0 GGAACACTTCAGAC
#> 80 B g1 1 CTTGATTGATCTTC
Alternatively, you can use the iterator, which retrieves data in
chunks that are smaller than the soma.init_buffer_bytes
context field. You can use the iterator’s method
$read_next()
to load a chunk in memory.
iterator <- query$obs()
iterator$read_next()
#> Table
#> 80 rows x 9 columns
#> $soma_joinid <int64 not null>
#> $orig.ident <dictionary<values=string, indices=int8>>
#> $nCount_RNA <double>
#> $nFeature_RNA <int32>
#> $RNA_snn_res.0.8 <dictionary<values=string, indices=int8>>
#> $letter.idents <dictionary<values=string, indices=int8>>
#> $groups <large_string>
#> $RNA_snn_res.1 <dictionary<values=string, indices=int8>>
#> $obs_id <large_string>
In this example the full obs
table is relatively small
and fits all in one chunk.
For a bigger SOMADataFrame
you can check if the
iteration has finished by checking the logical
$read_complete()
.
Here we demonstrate by creating a new iterator.
iterator <- experiment$obs$read()
iterator$read_complete()
#> [1] FALSE
iterator$read_next()
#> Table
#> 80 rows x 9 columns
#> $soma_joinid <int64 not null>
#> $orig.ident <dictionary<values=string, indices=int8>>
#> $nCount_RNA <double>
#> $nFeature_RNA <int32>
#> $RNA_snn_res.0.8 <dictionary<values=string, indices=int8>>
#> $letter.idents <dictionary<values=string, indices=int8>>
#> $groups <large_string>
#> $RNA_snn_res.1 <dictionary<values=string, indices=int8>>
#> $obs_id <large_string>
iterator$read_complete()
#> [1] TRUE
We can also access the expression via X()
.
Similarly to obs()
and var()
,
X()
is intended for iteration, but in this case we have
access to two different iterators, and thus X()
returns a
reader that gives you access to an iterator for
arrow::Table
and one for
Matrix::sparse_matrix
.
Let’s take a look at the Arrow Table iterator:
reader <- query$X(layer_name = "counts")
table_irerator <- reader$tables()
table_irerator$read_next()
#> Table
#> 4456 rows x 3 columns
#> $soma_dim_0 <int64 not null>
#> $soma_dim_1 <int64 not null>
#> $soma_data <double not null>
As in the obs
example the data is small enough to fit in
one chunk. For bigger data you can user
iterator$read_complete()
to check the status of iteration
and iterator$concat()
to concatenate the rest of the
chunks.
The iterator for Matrix::sparse_matrix
works in the same
way. Keep in mind that the matrix format is dgTMatrix
as it
is the most memory-efficient and the only format type that can be easily
iterated. And most importantly, the resulting object is a “view” of the
full matrix with the original shape and indexes but only with data
corresponding to the query coordinates or filters (see section
below).
reader <- query$X(layer_name = "counts")
iterator <- reader$sparse_matrix()
str(iterator$read_next())
#> Formal class 'dgTMatrix' [package "Matrix"] with 6 slots
#> ..@ i : int [1:4456] 0 0 0 0 0 0 0 0 0 0 ...
#> ..@ j : int [1:4456] 1 5 8 11 22 30 33 34 36 38 ...
#> ..@ Dim : int [1:2] 80 230
#> ..@ Dimnames:List of 2
#> .. ..$ : NULL
#> .. ..$ : NULL
#> ..@ x : num [1:4456] 1 1 3 1 1 4 1 5 1 1 ...
#> ..@ factors : list()
Adding filters
Adding filters requires creating a SOMAAxisQuery
object
that allows you to define coordinates, value filters, or both for an
axis.
Here we’ll create a query for obs
that slices the first
40 rows, and then filters that subset based on the
nCount_RNA
column.
obs_query <- SOMAAxisQuery$new(
coords = list(soma_joinid = 0:39),
value_filter = "nCount_RNA > 100"
)
To apply this filter we’ll pass it to a new
SOMAExperimentAxisQuery
object.
query <- SOMAExperimentAxisQuery$new(
experiment = experiment,
measurement_name = "RNA",
obs_query = obs_query
)
Let’s see how many observations this query identified.
query$n_obs
#> [1] 26
As before, we can load the obs
data frame into memory
but now it only includes the filtered observations.
obs <- query$obs(column_names = c("obs_id", "nCount_RNA"))$concat()
obs$to_data_frame()
#> obs_id nCount_RNA
#> 1 TGACTGGATTCTCA 127
#> 2 AGTCAGACTGCACA 173
#> 3 AGAGATGATCTCGC 191
#> 4 GGGTAACTCTAGTG 101
#> 5 CTAAACCTGTGCAT 168
#> 6 TTGGTACTGAATCC 135
#> 7 TACATCACGCTAAC 109
#> 8 TTACCATGAATCGC 298
#> 9 ATAGGAGAAACAGA 406
#> 10 GCGCACGACTTTAC 213
#> 11 ACTCGCACGAAAGT 231
#> 12 ATTACCTGCCTTAT 463
#> 13 AAATTCGAATCACG 327
#> 14 CCATCCGATTCGCC 224
#> 15 TCCACTCTGAGCTT 361
#> 16 CATCAGGATGCACA 353
#> 17 CTAAACCTCTGACA 246
#> 18 GATAGAGAAGGGTG 115
#> 19 CTAACGGAACCGAT 189
#> 20 AGATATACCCGTAA 187
#> 21 TACTCTGAATCGAC 156
#> 22 GCGCATCTTGCTCC 164
#> 23 GTTGACGATATCGG 221
#> 24 ACAGGTACTGGTGT 151
#> 25 GGCATATGCTTATC 126
#> 26 CATTACACCAACTG 316
As well as the X matrix in two different formats:
query$X("counts")$tables()$concat()
#> Table
#> 1485 rows x 3 columns
#> $soma_dim_0 <int64 not null>
#> $soma_dim_1 <int64 not null>
#> $soma_data <double not null>
Matrix::sparse_matrix
in dgTMatrix
format.
str(query$X("counts")$sparse_matrix()$concat())
#> Formal class 'dgTMatrix' [package "Matrix"] with 6 slots
#> ..@ i : int [1:1485] 3 3 3 3 3 3 3 3 3 3 ...
#> ..@ j : int [1:1485] 8 11 22 30 31 32 33 34 37 38 ...
#> ..@ Dim : int [1:2] 80 230
#> ..@ Dimnames:List of 2
#> .. ..$ : NULL
#> .. ..$ : NULL
#> ..@ x : num [1:1485] 13 1 6 5 2 1 2 2 1 1 ...
#> ..@ factors : list()
For a re-indexed and re-shaped sparse matrix see the section below.
Export to an annotated re-indexed sparse matrix
Any component of the queried SOMAExperiment
can be
exported to a [sparse matrix][Matrix::sparseMatrix-class] using the
to_sparse_matrix()
method.
For example, let’s create a sparse matrix of the filtered expression data. We’ll create a new query that returns a smaller subset of the data to make the output easier to read.
query <- SOMAExperimentAxisQuery$new(
experiment = experiment,
measurement_name = "RNA",
obs_query = SOMAAxisQuery$new(coords = 0:9),
var_query = SOMAAxisQuery$new(coords = 0:9)
)
Then we indicate that we want to access the "counts"
layer of the "X"
collection.
query$to_sparse_matrix(
collection = "X",
layer = "counts"
)
#> 10 x 10 sparse Matrix of class "dgTMatrix"
#> [[ suppressing 10 column names '0', '1', '2' ... ]]
#>
#> 0 . 1 . . . 1 . . 3 .
#> 1 . . . 1 . . . . 7 .
#> 2 . . . . . . . . 11 .
#> 3 . . . . . . . . 13 .
#> 4 . . . 1 . . . . 3 .
#> 5 . . . 1 . . . . 4 .
#> 6 . . . . . . . . 6 .
#> 7 . . . 1 . . . . 4 .
#> 8 . . . . . . . . 2 .
#> 9 . 1 . . . . . . 21 .
By default, the dimensions are named using soma_joinid
values which are unique to each observation and variable. However,
dimension names can come from any column in the obs
and
var
arrays that uniquely identifies each record. For an
expression matrix it makes sense to name the dimensions using cell
barcodes and gene names, which are stored in the obs_id
and
var_id
columns, respectively.
query$to_sparse_matrix(
collection = "X",
layer = "counts",
obs_index = "obs_id",
var_index = "var_id"
)
#> 10 x 10 sparse Matrix of class "dgTMatrix"
#> [[ suppressing 10 column names 'MS4A1', 'CD79B', 'CD79A' ... ]]
#>
#> ATGCCAGAACGACT . 1 . . . 1 . . 3 .
#> CATGGCCTGTGCAT . . . 1 . . . . 7 .
#> GAACCTGATGAACC . . . . . . . . 11 .
#> TGACTGGATTCTCA . . . . . . . . 13 .
#> AGTCAGACTGCACA . . . 1 . . . . 3 .
#> TCTGATACACGTGT . . . 1 . . . . 4 .
#> TGGTATCTAAACAG . . . . . . . . 6 .
#> GCAGCTCTGTTTCT . . . 1 . . . . 4 .
#> GATATAACACGCAT . . . . . . . . 2 .
#> AATGTTGACAGTCA . 1 . . . . . . 21 .
We can use this method for any of the SOMAExperiment
’s
collections. Let’s access the t-SNE coordinates stored in the
obsm
collection’s X_tsne
layer, populating the
row names with cell barcodes.
query$to_sparse_matrix(
collection = "obsm",
layer = "X_tsne",
obs_index = "obs_id"
)
#> 10 x 2 sparse Matrix of class "dgTMatrix"
#> 0 1
#> ATGCCAGAACGACT 0.8675977 -8.1007483
#> CATGGCCTGTGCAT -7.3925306 -8.7717451
#> GAACCTGATGAACC -28.2064258 0.2410102
#> TGACTGGATTCTCA 16.3480689 -11.1633255
#> AGTCAGACTGCACA 1.9113998 -11.1929311
#> TCTGATACACGTGT 3.1475998 -9.9369312
#> TGGTATCTAAACAG 17.8526863 -9.8978901
#> GCAGCTCTGTTTCT -6.4912187 -8.3874434
#> GATATAACACGCAT 1.3305297 -9.6807936
#> AATGTTGACAGTCA 16.9642732 -9.4295980
Export to Seurat
The query
object also contains methods for loading in
results as a Seurat object (or any of Seurat’s component classes). As
with the to_sparse_matrix()
method, we can specify the
obs_index
and var_index
to use for naming the
dimensions of the resulting object.
query <- SOMAExperimentAxisQuery$new(
experiment = experiment,
measurement_name = "RNA"
)
query$to_seurat(
X_layers = c(counts = "counts", data = "data"),
obs_index = "obs_id",
var_index = "var_id"
)
#> Warning: Adding a command log without an assay associated with it
#> An object of class Seurat
#> 230 features across 80 samples within 1 assay
#> Active assay: RNA (230 features, 0 variable features)
#> 2 layers present: counts, data
#> 2 dimensional reductions calculated: pca, tsne