Vignette - How to use GammaGateR
library(GammaGateR)
library(gridExtra)
library(ggpubr)
#> Loading required package: ggplot2
library(hrbrthemes)
#> NOTE: Either Arial Narrow or Roboto Condensed fonts are required to use these themes.
#> Please use hrbrthemes::import_roboto_condensed() to install Roboto Condensed and
#> if Arial Narrow is not on your system, please see https://bit.ly/arialnarrow
1 Introduction
GammaGateR is a semi-automated soft marker-gating procedure designed specifically for the marker intensity distributions observed in multiplexed imaging. The algorithm uses a zero-inflated two-component Gamma Mixture Model to model marker expression for each slide. The Gamma mixture model naturally identifies marker-positive and marker-negative cell distributions and returns the probability of belonging to the marker-positive cell distribution for each cell. The returned probabilities can either be used directly in subsequent analysis or combined and dichotomized to define cell phenotypes. GammaGateR incorporates user-specified constraints to provide consistent model fit across a large number of slides. The model evaluation methods included in GammaGateR allow the user to evaluate the constraints and quality check results. The power source of GammaGateR is the closed-form Gamma mixture model, which is a novel approach to phenotyping for mIF data that makes it more computationally efficient than traditional GMMs.
2 Loading mIF data from
VectraPolarisData
The VectraPolarisData
package on Bioconductor contains
multiplex immunohistochemistry (mIHC) in a sample of 153 patients with
small cell lung cancer(Johnson et al. 2021;
Wrobel and Ghosh 2022). To download the dataset, follow the
instruction at https://bioconductor.org/packages/release/data/experiment/html/VectraPolarisData.html.
After installation, run the following to store data as the
cell
object. The data is are cell-level, meaning that each
row of the data represents a single cell.
3 Prepare the data for GammaGateR
3.1 Normalization and Visualization
Before fitting the model, the expression levels should be normalized to reduce batch effects and variation among slides. We use a log normalization, based on prior work (Harris et al. 2022).
allMarkers = grep('^entire_cell_.*total$', names(cell), value=TRUE)
# normalize the data
nzNormedMarkers = paste0('nzNorm_',allMarkers)
cell = do.call(rbind, lapply(split(cell, cell$slide_id), function(df){df[,nzNormedMarkers] = log10(1+sweep(df[,allMarkers], 2, colMeans(replace(df[,allMarkers], df[,allMarkers]==0, NA), na.rm = TRUE ), FUN = '/' ) ); df }) )
Because GammaGateR can use user specified constraints based on quantile and/or fixed value boundaries, it is helpful to visualize histograms of each channel by slide to get a coarse understanding of the overall distribution of normalized expression data, which serves as a good visual aid for a first guess of the constraints.
# plot histogram
for(i in 1:length(allMarkers)){
cat('\n### ', allMarkers[i], " \n")
col.name <- nzNormedMarkers[i]
slidetest <- cell[, c("slide_id", col.name)]
slidetest <- slidetest[slidetest[,col.name]!=0,]
colnames(slidetest)[2] <- "mkr.val"
p1 <- ggplot() + ggtitle(col.name)+
geom_freqpoly(data=slidetest, aes(mkr.val,y=after_stat(density), color=slide_id,group=slide_id), binwidth = 0.05, alpha=0.1, linewidth=1)+theme_ipsum()+theme(legend.position = "none")
print(p1)
cat('\n\n')
}
3.1.1 entire_cell_cd19_opal_650_total
3.1.2 entire_cell_cd3_opal_520_total
3.1.3 entire_cell_cd14_opal_540_total
3.1.4 entire_cell_cd8_opal_620_total
3.1.5 entire_cell_hladr_opal_690_total
3.1.6 entire_cell_ck_opal_570_total
3.1.7 entire_cell_dapi_dapi_total
3.1.8 entire_cell_autofluorescence_total
3.2 Picking Constraints
Different constraints can be selected for each marker but should be the same for all slides in the same channel. There are two ways to identify the constraints: by specifying a quantile or a particular value on the x-axis. For example, when setting the upp. Both can be passed as arguments to the GammaGateR function, and when applied to the slide, the threshold with the wider range will be chosen by the function automatically.
The quantile boundaries can be determined if there are known ranges for the proportions of positive cells, though these values don’t map specifically to constraints on the estimated cell proportions. The constraints are passed to the groupGammaGateR function as a list. Each constraint is specified with a 2x2 matrix, where the rows correspond to the unexpressed and expressed cell populations, and the columns are the lower and upper constraints, respectively. For example, the first constraint in quantileBoundaries restricts the mode of the unexpressed cells to be between the 0th and 80th quantiles of the expression distribution and the expressed cell mode between the 90th and 100th quantiles. The order of the markers is as shown in the histograms above. If no reasonable constraint can be determined for a marker, it can be left blank.
Constraints can also be specified according to values on the x-axis of the histograms. For example, the mode for the unexpressed cells for the HLADR marker in the is constrained between 0 and 0.3 and the mode for the expressed cells is constrained between 0.45 and infinity. This ensures that the modes of the two estimated distributions are separated by a gap of 0.15.
The last line of the code assigns the names of the constraints to keep track of which were used for which markers. groupGammagater does not use the names of the constraints in fitting, but assumes the order of the constraints matches the order of the columns of the input data.
# set up boundary for all markers
quantileBoundaries = list(matrix(c(0,.8, .9, 1), byrow = TRUE, nrow=2),
matrix(c(0,.7, .8, 1), byrow = TRUE, nrow=2),
matrix(c(0,.6, .8, 1), byrow = TRUE, nrow=2),
matrix(c(0,.6, .8, 1), byrow = TRUE, nrow=2),
matrix(c(0,.7, .8, 1), byrow = TRUE, nrow=2),
NULL,
NULL,
NULL)
boundaries = list(matrix(c(0,.3, .5, 1), byrow = TRUE, nrow=2),
NULL,
NULL,
NULL,
matrix(c(0,.3, .45, Inf), byrow = TRUE, nrow=2),
matrix(c(0,.3, .5, Inf), byrow = TRUE, nrow=2),
matrix(c(0,.4, .45, Inf), byrow = TRUE, nrow=2),
NULL)
names(quantileBoundaries) = names(boundaries) = nzNormedMarkers
4 Running GammaGateR
The groupGammaGateR function takes a data frame with the cell level
data for each slide, a vector identifying slide membership, and list(s)
of the boundaries on the expression or quantile scales. The
n.cores
argument can be used to run model fitting in
parallel across slides.The last line saves the fitted model object as an
.RData
file.
# fit cfGMM with groupGammaGateR
groupPolarisFit <- groupGammaGateR(cell[,nzNormedMarkers],
slide = cell$slide_id,
boundaryMarkers = boundaries,
qboundaryMarkers=quantileBoundaries,
n.cores = 8)
save(groupPolarisFit, file="groupPolarisFit.RData")
At this stage, checking if the model fits well through visualization has higher priority than extracting information in the output. The output structure will be explained in the later section.
5 Evaluating Model fit
After fitting, convCheck()
can be used to see if the
model failed to converge in any slides or channels. The reason for
non-convergence could be due to the constraints, or that too few
observations are included in each slide. If the model fails to converge
estimated parameters and probabilities are returned as NA.
load("groupPolarisFit.RData")
convCheck(groupPolarisFit)
#> [1] "Convergence NA:"
#> marker slide
#> nzNorm_entire_cell_dapi_dapi_total #24 5-261-485
#> nzNorm_entire_cell_dapi_dapi_total #55 6-231-821
#> nzNorm_entire_cell_cd19_opal_650_total #83 6-404-198
#> [1] "All converged for other slides."
The reason for non-convergence could be bad threshold, or that too few observations are included in each slide. This can be checked by examining the density distribution of the expression value of the slide:
par(mfrow=c(2,2))
hist(cell$nzNorm_entire_cell_dapi_dapi_total[cell$slide_id=="#24 5-261-485"],
xlab="Expression Values", main="dapi_#24 5-261-485", freq = FALSE)
hist(cell$nzNorm_entire_cell_dapi_dapi_total[cell$slide_id=="#55 6-231-821"],
xlab="Expression Values", main="dapi_#55 6-231-821", freq = FALSE)
hist(cell$nzNorm_entire_cell_cd19_opal_650_total[cell$slide_id=="#83 6-404-198"],
xlab="Expression Values", main="cd19_#83 6-404-198", freq = FALSE)
By comparing the histogram layout, it appears that the boundaries might be too stringent for these slides. We can keep this in mind, and modify the boundaries in the next round of model fitting.
5.1 Diagnostic Plots
For datasets with large numbers of slides, it may be infeasible to visually check each slide for “goodness of fit”. Under such circumstance, the GammaGateR diagnostic plot will be an easier way to spot potential problematic models. The diagnostic plot below illustrates two key parameters for each slide that are indicator of goodness of fit: the mode and proportion (\(\lambda\)) of expressed component. Often, if the estimated proportion of expressed cells is outside a reasonable range for the marker channel or if the slide is an outlier on the mode or lambda, it is likely that the slide did not fit well with the model set up. The plots are interactive, so we can see the name of the slide by hovering mouse over the point and scroll below to look at the fitted gamma distributions and the expression histogram for the slide. If not, or if the two components are too close then this can be fixed by refitting with different constraints.
For example, the diagnostic plot of CD19 shows that slide
#43 1-786-092
has extremely small second component
proportion. Therefore, in the next step, user can pay attention to this
slide and see if the fitted model for that particular slide looks
reasonable.
nzNormedMarkers <- names(groupPolarisFit[[1]][[1]])
m1 <- strsplit(nzNormedMarkers[1], split="_")[[1]][4]
cat('\n### ', m1, " \n")
5.1.1 cd19
cat('\n\n') for (i in (nzNormedMarkers[-1])){ mkr.name <- strsplit(i, split="_")[[1]][4] cat('\n### ', mkr.name, " \n") print(htmltools::tagList(plot(groupPolarisFit, marker=i, diagnostic=TRUE,interactive=TRUE, histogram=FALSE, print=FALSE)[[1]])) cat('\n\n') }5.1.2 cd3
5.1.3 cd14
5.1.4 cd8
5.1.5 hladr
5.1.6 ck
5.1.7 dapi
#> Warning: Ignoring 2 observations
#> Warning: Ignoring 2 observations
5.1.8 autofluorescence
5.2 Individual slide histograms
The individual density curves of fitted models for each slide can be used to further evaluate model fit. There are two parts for each model in the individual slide:
- The plot: visualization of model “goodness-of-fit”.
- The fitted gamma density curve for the two components.
- The histogram of actual normalized pixel value. Plotted in the back for comparison.
- The table: shows the estimated parameters for the model.
- Components 0-2 correspond to the estimates for the zeros, unexpressed, and expressed cells. - Lambda represents the estimated proportion of cells in that group
- Alpha and beta parameters are the estimates for that control the shape of the distribution for that component.
- Also included model convergence on the boundary or not
- Mode of each component. This can to some extent show whether adjusting the boundary can help improve the fitted models.
The fitted curves of the model should approximately align with the
histogram. In addition, we expect that the marker-positive cell
distribution should sit to the right of the unexpressed cells. This is
not always possible if the amount of noise in the dataset is too much to
distinguish expressed and unexpressed cells. For the purpose of display,
only histograms of 4 slides are displyed for each marker. Users can
print out the rest of the histograms, if they wish to, by modifying the
slide
argument of the plot function.
for (i in (nzNormedMarkers)){
mkr.name <- strsplit(i, split="_")[[1]][4]
cat('\n### ', mkr.name, " \n")
temp <- groupPolarisFit[1:4]
class(temp) <- "groupGammaGateR"
plotss <- plot(temp, marker=i, diagnostic=FALSE, histogram=TRUE, print=FALSE, tabl=TRUE)
print(do.call(ggarrange,plotss))
cat('\n\n')
}
5.2.1 cd19
5.2.2 cd3
5.2.3 cd14
5.2.4 cd8
5.2.5 hladr
5.2.6 ck
5.2.7 dapi
5.2.8 autofluorescence
6 Refitting the models
Here are a few observation based on the model diagnostic plots and slide histograms:
- CD14 and CD8 could use a lower boundary constraint of 0.5 for the expressed cells, e.g. 0.4.
- Lower bound of CK marker should be set to 0.25, because it looks like it’s missing some expressed cell distributions and most slides are at the boundary.
- CD19 and DAPI should use lower boundary constraint.
The following constraints reflects the changes.
# set up boundary for all markers
quantileBoundaries2 = list(matrix(c(0,.8, .9, 1), byrow = TRUE, nrow=2),
matrix(c(0,.7, .8, 1), byrow = TRUE, nrow=2),
matrix(c(0,.6, .8, 1), byrow = TRUE, nrow=2),
matrix(c(0,.6, .8, 1), byrow = TRUE, nrow=2),
matrix(c(0,.7, .8, 1), byrow = TRUE, nrow=2),
NULL,
matrix(c(0,.7, .9, 1), byrow = TRUE, nrow=2),
NULL)
boundaries2 = list(matrix(c(0,.45, .45, 1), byrow = TRUE, nrow=2),
matrix(c(0,.4, .4, Inf), byrow = TRUE, nrow=2),
matrix(c(0,.4, .6, 1), byrow = TRUE, nrow=2),
matrix(c(0,.4, .5, 1), byrow = TRUE, nrow=2),
matrix(c(0,.3, .5, Inf), byrow = TRUE, nrow=2),
matrix(c(0,.25, .25, Inf), byrow = TRUE, nrow=2),
matrix(c(0,.4, .4, Inf), byrow = TRUE, nrow=2),
matrix(c(0,.4, .4, Inf), byrow = TRUE, nrow=2))
names(quantileBoundaries2) = names(boundaries2) = nzNormedMarkers
# refit cfGMM with groupGammaGateR
groupPolarisFit2 <- groupGammaGateR(cell[,nzNormedMarkers],
slide = cell$slide_id,
boundaryMarkers = boundaries2,
qboundaryMarkers=quantileBoundaries2,
n.cores = 8)
save(groupPolarisFit2, file="groupPolarisFit2.RData")
Again, we need to check if all models converged.
7 Evaluating new model fit
In order to decide if the new boundary actually improved model fit,
the model fitted with old and new boundaries should be visualized
together to compare. GammaGateR
can compare two model fits
in both diagnostic plots and the histograms.
7.1 Compare two fittings: Diagnostic plots
diag_contrast
in GammaGateR
returns a
scatter plot similar to the previous diagnostic plot, color coded by
fittings. Points that correspond to the same slide be different models
are connected by a dashed line, which makes it easy to see how model
fitting outcome of expressed cells moves under different boundaries.
for (i in nzNormedMarkers){
mkr.name <- strsplit(i, split="_")[[1]][4]
cat('\n### ', mkr.name, " \n")
print(diag_contrast(groupPolarisFit, groupPolarisFit2, marker = i, title = paste0("contrastive: ", mkr.name)))
cat('\n\n')
}
7.1.1 cd19
7.1.2 cd3
7.1.3 cd14
7.1.4 cd8
7.1.5 hladr
7.1.6 ck
7.1.7 dapi
7.1.8 autofluorescence
7.2 Compare two fittings: Histograms
In hist_sr_contrast
, the histogram is plotted very
similar to the plot
function previously. In order to
compare the results, the density curve of both models are plot in the
same histogram, color coded the same way as in the contrastive
diagnostic plot: green is the first fit object and red is the second fit
object. This provides a closer look to the individual slide
fittings.
for (i in nzNormedMarkers){
mkr.name <- strsplit(i, split="_")[[1]][4]
cat('\n### ', mkr.name, " \n")
temp1 <- groupPolarisFit[1:4]
class(temp1) <- "groupGammaGateR"
temp2 <- groupPolarisFit2[1:4]
class(temp2) <- "groupGammaGateR"
plotss <- list()
for(j in 1:4){
plotss[[j]] <- hist_sr_constrast(temp1 ,temp2, marker=i, slide=j ,diagnostic=FALSE, histogram=TRUE, print=FALSE, tabl=TRUE)
}
cat('\n\n')
}
7.2.1 cd19
7.2.2 cd3
7.2.3 cd14
7.2.4 cd8
7.2.5 hladr
7.2.6 ck
7.2.7 dapi
7.2.8 autofluorescence
8 Model results output
The output structure is as follows:
Specifically, the output contains lists of slides output, where the output includes a variety of model and input information:
expressionZ
is the posterior probabilitiesexpressionX
is the input expression valuesexpressionW
is the marginal probabilitiesparams
is the zero-inflated prarameteresfit
is thecfGMM
package model outputsubBatch
is the subBatch model fit parameter, if fitted with futher breakdown parameters within a slide. Each output facet includes a list of the marker-specific outputs.
At this stage, the model output is ready for next-step analysis. The
major part of the output are the expression probability of each cell.
GammaGateR
provides posterior and marginal probability in
expressionZ
and expressionW
separately. Since
GammaGateR
model fitting output belong to class
list
, the output can simply be extracted by going to the
specific list. Suppose we need posterior probability from slide
#01 0-889-121
.
Of course, we can also easily extract posterior probability of all cells:
In the examples we demonstrated in paper, posterior probability always outperforms marginal probability. However, marginal probability might be a more reasonable choice when the distribution is unimodal. In this scenario, the two components in the mixture model will appear to be overlapping a lot and the posterior probability will be zero for all cells.
ploti <- plot(groupPolarisFit2, slide="#84 6-409-858", marker="nzNorm_entire_cell_autofluorescence_total", diagnostic=FALSE, histogram=TRUE)