Vignette - How to use GammaGateR

library(GammaGateR)
library(gridExtra)
library(ggpubr)
#> Loading required package: ggplot2
library(hrbrthemes)
#> NOTE: Either Arial Narrow or Roboto Condensed fonts are required to use these themes.
#>       Please use hrbrthemes::import_roboto_condensed() to install Roboto Condensed and
#>       if Arial Narrow is not on your system, please see https://bit.ly/arialnarrow

1 Introduction

GammaGateR is a semi-automated soft marker-gating procedure designed specifically for the marker intensity distributions observed in multiplexed imaging. The algorithm uses a zero-inflated two-component Gamma Mixture Model to model marker expression for each slide. The Gamma mixture model naturally identifies marker-positive and marker-negative cell distributions and returns the probability of belonging to the marker-positive cell distribution for each cell. The returned probabilities can either be used directly in subsequent analysis or combined and dichotomized to define cell phenotypes. GammaGateR incorporates user-specified constraints to provide consistent model fit across a large number of slides. The model evaluation methods included in GammaGateR allow the user to evaluate the constraints and quality check results. The power source of GammaGateR is the closed-form Gamma mixture model, which is a novel approach to phenotyping for mIF data that makes it more computationally efficient than traditional GMMs.

2 Loading mIF data from VectraPolarisData

The VectraPolarisData package on Bioconductor contains multiplex immunohistochemistry (mIHC) in a sample of 153 patients with small cell lung cancer(Johnson et al. 2021; Wrobel and Ghosh 2022). To download the dataset, follow the instruction at https://bioconductor.org/packages/release/data/experiment/html/VectraPolarisData.html. After installation, run the following to store data as the cell object. The data is are cell-level, meaning that each row of the data represents a single cell.

library(VectraPolarisData)
spe_lung <- HumanLungCancerV3()
# get the cell level imaging data with spatial coordinates
cell = cbind(as.data.frame(colData(spe_lung)), as.data.frame(spatialCoords(spe_lung)) )

3 Prepare the data for GammaGateR

3.1 Normalization and Visualization

Before fitting the model, the expression levels should be normalized to reduce batch effects and variation among slides. We use a log normalization, based on prior work (Harris et al. 2022).

allMarkers = grep('^entire_cell_.*total$', names(cell), value=TRUE)
# normalize the data 
nzNormedMarkers = paste0('nzNorm_',allMarkers)
cell = do.call(rbind, lapply(split(cell, cell$slide_id), function(df){df[,nzNormedMarkers] = log10(1+sweep(df[,allMarkers], 2, colMeans(replace(df[,allMarkers], df[,allMarkers]==0, NA), na.rm = TRUE ), FUN = '/' ) ); df }) )

Because GammaGateR can use user specified constraints based on quantile and/or fixed value boundaries, it is helpful to visualize histograms of each channel by slide to get a coarse understanding of the overall distribution of normalized expression data, which serves as a good visual aid for a first guess of the constraints.

# plot histogram
for(i in 1:length(allMarkers)){
  cat('\n### ', allMarkers[i], " \n")
  col.name <- nzNormedMarkers[i]
  slidetest <- cell[, c("slide_id", col.name)]
  slidetest <- slidetest[slidetest[,col.name]!=0,]
  colnames(slidetest)[2] <- "mkr.val"
  p1 <- ggplot() + ggtitle(col.name)+
  geom_freqpoly(data=slidetest, aes(mkr.val,y=after_stat(density), color=slide_id,group=slide_id), binwidth = 0.05, alpha=0.1, linewidth=1)+theme_ipsum()+theme(legend.position = "none")
  print(p1)
  cat('\n\n')
}

3.1.1 entire_cell_cd19_opal_650_total

3.1.2 entire_cell_cd3_opal_520_total

3.1.3 entire_cell_cd14_opal_540_total

3.1.4 entire_cell_cd8_opal_620_total

3.1.5 entire_cell_hladr_opal_690_total

3.1.6 entire_cell_ck_opal_570_total

3.1.7 entire_cell_dapi_dapi_total

3.1.8 entire_cell_autofluorescence_total

3.2 Picking Constraints

Different constraints can be selected for each marker but should be the same for all slides in the same channel. There are two ways to identify the constraints: by specifying a quantile or a particular value on the x-axis. For example, when setting the upp. Both can be passed as arguments to the GammaGateR function, and when applied to the slide, the threshold with the wider range will be chosen by the function automatically.

The quantile boundaries can be determined if there are known ranges for the proportions of positive cells, though these values don’t map specifically to constraints on the estimated cell proportions. The constraints are passed to the groupGammaGateR function as a list. Each constraint is specified with a 2x2 matrix, where the rows correspond to the unexpressed and expressed cell populations, and the columns are the lower and upper constraints, respectively. For example, the first constraint in quantileBoundaries restricts the mode of the unexpressed cells to be between the 0th and 80th quantiles of the expression distribution and the expressed cell mode between the 90th and 100th quantiles. The order of the markers is as shown in the histograms above. If no reasonable constraint can be determined for a marker, it can be left blank.

Constraints can also be specified according to values on the x-axis of the histograms. For example, the mode for the unexpressed cells for the HLADR marker in the is constrained between 0 and 0.3 and the mode for the expressed cells is constrained between 0.45 and infinity. This ensures that the modes of the two estimated distributions are separated by a gap of 0.15.

The last line of the code assigns the names of the constraints to keep track of which were used for which markers. groupGammagater does not use the names of the constraints in fitting, but assumes the order of the constraints matches the order of the columns of the input data.

# set up boundary for all markers
quantileBoundaries = list(matrix(c(0,.8, .9, 1), byrow = TRUE, nrow=2),
                          matrix(c(0,.7, .8, 1), byrow = TRUE, nrow=2),
                          matrix(c(0,.6, .8, 1), byrow = TRUE, nrow=2),
                          matrix(c(0,.6, .8, 1), byrow = TRUE, nrow=2),
                          matrix(c(0,.7, .8, 1), byrow = TRUE, nrow=2),
                          NULL,
                          NULL,
                          NULL)
boundaries = list(matrix(c(0,.3, .5, 1), byrow = TRUE, nrow=2),
                  NULL,
                  NULL,
                  NULL,
                  matrix(c(0,.3, .45, Inf), byrow = TRUE, nrow=2),
                  matrix(c(0,.3, .5, Inf), byrow = TRUE, nrow=2),
                  matrix(c(0,.4, .45, Inf), byrow = TRUE, nrow=2),
                  NULL)
names(quantileBoundaries) = names(boundaries) = nzNormedMarkers

4 Running GammaGateR

The groupGammaGateR function takes a data frame with the cell level data for each slide, a vector identifying slide membership, and list(s) of the boundaries on the expression or quantile scales. The n.cores argument can be used to run model fitting in parallel across slides.The last line saves the fitted model object as an .RData file.

# fit cfGMM with groupGammaGateR
groupPolarisFit <- groupGammaGateR(cell[,nzNormedMarkers], 
                                slide = cell$slide_id,
                                 boundaryMarkers = boundaries,
                                qboundaryMarkers=quantileBoundaries,
                                n.cores = 8)
save(groupPolarisFit, file="groupPolarisFit.RData")

At this stage, checking if the model fits well through visualization has higher priority than extracting information in the output. The output structure will be explained in the later section.

5 Evaluating Model fit

After fitting, convCheck() can be used to see if the model failed to converge in any slides or channels. The reason for non-convergence could be due to the constraints, or that too few observations are included in each slide. If the model fails to converge estimated parameters and probabilities are returned as NA.

load("groupPolarisFit.RData")
convCheck(groupPolarisFit)
#> [1] "Convergence NA:"
#>                                  marker         slide
#>      nzNorm_entire_cell_dapi_dapi_total #24 5-261-485
#>      nzNorm_entire_cell_dapi_dapi_total #55 6-231-821
#>  nzNorm_entire_cell_cd19_opal_650_total #83 6-404-198
#> [1] "All converged for other slides."

The reason for non-convergence could be bad threshold, or that too few observations are included in each slide. This can be checked by examining the density distribution of the expression value of the slide:

par(mfrow=c(2,2))
hist(cell$nzNorm_entire_cell_dapi_dapi_total[cell$slide_id=="#24 5-261-485"],
     xlab="Expression Values", main="dapi_#24 5-261-485", freq = FALSE)
hist(cell$nzNorm_entire_cell_dapi_dapi_total[cell$slide_id=="#55 6-231-821"],
     xlab="Expression Values", main="dapi_#55 6-231-821", freq = FALSE)
hist(cell$nzNorm_entire_cell_cd19_opal_650_total[cell$slide_id=="#83 6-404-198"],
     xlab="Expression Values", main="cd19_#83 6-404-198", freq = FALSE)

By comparing the histogram layout, it appears that the boundaries might be too stringent for these slides. We can keep this in mind, and modify the boundaries in the next round of model fitting.

5.1 Diagnostic Plots

For datasets with large numbers of slides, it may be infeasible to visually check each slide for “goodness of fit”. Under such circumstance, the GammaGateR diagnostic plot will be an easier way to spot potential problematic models. The diagnostic plot below illustrates two key parameters for each slide that are indicator of goodness of fit: the mode and proportion (\(\lambda\)) of expressed component. Often, if the estimated proportion of expressed cells is outside a reasonable range for the marker channel or if the slide is an outlier on the mode or lambda, it is likely that the slide did not fit well with the model set up. The plots are interactive, so we can see the name of the slide by hovering mouse over the point and scroll below to look at the fitted gamma distributions and the expression histogram for the slide. If not, or if the two components are too close then this can be fixed by refitting with different constraints.

For example, the diagnostic plot of CD19 shows that slide #43 1-786-092 has extremely small second component proportion. Therefore, in the next step, user can pay attention to this slide and see if the fitted model for that particular slide looks reasonable.

nzNormedMarkers <- names(groupPolarisFit[[1]][[1]])
m1 <- strsplit(nzNormedMarkers[1], split="_")[[1]][4]
cat('\n### ', m1, " \n")

5.1.1 cd19

htmltools::tagList(plot(groupPolarisFit, marker=nzNormedMarkers[1], diagnostic=TRUE,interactive=TRUE, histogram=FALSE, print=FALSE)[[1]])
#> Warning: Ignoring 1 observations

#> Warning: Ignoring 1 observations
cat('\n\n') for (i in (nzNormedMarkers[-1])){ mkr.name <- strsplit(i, split="_")[[1]][4] cat('\n### ', mkr.name, " \n") print(htmltools::tagList(plot(groupPolarisFit, marker=i, diagnostic=TRUE,interactive=TRUE, histogram=FALSE, print=FALSE)[[1]])) cat('\n\n') }

5.1.2 cd3

5.1.3 cd14

5.1.4 cd8

5.1.5 hladr

5.1.6 ck

5.1.7 dapi

#> Warning: Ignoring 2 observations
#> Warning: Ignoring 2 observations

5.1.8 autofluorescence

5.2 Individual slide histograms

The individual density curves of fitted models for each slide can be used to further evaluate model fit. There are two parts for each model in the individual slide:

  1. The plot: visualization of model “goodness-of-fit”.
    • The fitted gamma density curve for the two components.
    • The histogram of actual normalized pixel value. Plotted in the back for comparison.
  2. The table: shows the estimated parameters for the model.
    • Components 0-2 correspond to the estimates for the zeros, unexpressed, and expressed cells. - Lambda represents the estimated proportion of cells in that group
    • Alpha and beta parameters are the estimates for that control the shape of the distribution for that component.
    • Also included model convergence on the boundary or not
    • Mode of each component. This can to some extent show whether adjusting the boundary can help improve the fitted models.

The fitted curves of the model should approximately align with the histogram. In addition, we expect that the marker-positive cell distribution should sit to the right of the unexpressed cells. This is not always possible if the amount of noise in the dataset is too much to distinguish expressed and unexpressed cells. For the purpose of display, only histograms of 4 slides are displyed for each marker. Users can print out the rest of the histograms, if they wish to, by modifying the slide argument of the plot function.

for (i in (nzNormedMarkers)){
  mkr.name <- strsplit(i, split="_")[[1]][4]
  cat('\n### ', mkr.name, " \n")
  temp <- groupPolarisFit[1:4]
  class(temp) <- "groupGammaGateR"
  plotss <- plot(temp, marker=i, diagnostic=FALSE, histogram=TRUE, print=FALSE, tabl=TRUE)
  print(do.call(ggarrange,plotss))

  cat('\n\n')
}

5.2.1 cd19

5.2.2 cd3

5.2.3 cd14

5.2.4 cd8

5.2.5 hladr

5.2.6 ck

5.2.7 dapi

5.2.8 autofluorescence

6 Refitting the models

Here are a few observation based on the model diagnostic plots and slide histograms:

  • CD14 and CD8 could use a lower boundary constraint of 0.5 for the expressed cells, e.g. 0.4.
  • Lower bound of CK marker should be set to 0.25, because it looks like it’s missing some expressed cell distributions and most slides are at the boundary.
  • CD19 and DAPI should use lower boundary constraint.

The following constraints reflects the changes.

# set up boundary for all markers
quantileBoundaries2 = list(matrix(c(0,.8, .9, 1), byrow = TRUE, nrow=2),
                          matrix(c(0,.7, .8, 1), byrow = TRUE, nrow=2),
                          matrix(c(0,.6, .8, 1), byrow = TRUE, nrow=2),
                          matrix(c(0,.6, .8, 1), byrow = TRUE, nrow=2),
                          matrix(c(0,.7, .8, 1), byrow = TRUE, nrow=2),
                          NULL,
                          matrix(c(0,.7, .9, 1), byrow = TRUE, nrow=2),
                          NULL)
boundaries2 = list(matrix(c(0,.45, .45, 1), byrow = TRUE, nrow=2),
                  matrix(c(0,.4, .4, Inf), byrow = TRUE, nrow=2),
                  matrix(c(0,.4, .6, 1), byrow = TRUE, nrow=2),
                  matrix(c(0,.4, .5, 1), byrow = TRUE, nrow=2),
                  matrix(c(0,.3, .5, Inf), byrow = TRUE, nrow=2),
                  matrix(c(0,.25, .25, Inf), byrow = TRUE, nrow=2),
                  matrix(c(0,.4, .4, Inf), byrow = TRUE, nrow=2),
                  matrix(c(0,.4, .4, Inf), byrow = TRUE, nrow=2))
names(quantileBoundaries2) = names(boundaries2) = nzNormedMarkers

# refit cfGMM with groupGammaGateR
groupPolarisFit2 <- groupGammaGateR(cell[,nzNormedMarkers], 
                                slide = cell$slide_id,
                                 boundaryMarkers = boundaries2,
                                qboundaryMarkers=quantileBoundaries2,
                                n.cores = 8)
save(groupPolarisFit2, file="groupPolarisFit2.RData")

Again, we need to check if all models converged.

load("groupPolarisFit2.RData")
convCheck(groupPolarisFit2)
#> [1] "All converged."

7 Evaluating new model fit

In order to decide if the new boundary actually improved model fit, the model fitted with old and new boundaries should be visualized together to compare. GammaGateR can compare two model fits in both diagnostic plots and the histograms.

7.1 Compare two fittings: Diagnostic plots

diag_contrast in GammaGateR returns a scatter plot similar to the previous diagnostic plot, color coded by fittings. Points that correspond to the same slide be different models are connected by a dashed line, which makes it easy to see how model fitting outcome of expressed cells moves under different boundaries.

for (i in nzNormedMarkers){
  mkr.name <- strsplit(i, split="_")[[1]][4]
  cat('\n### ', mkr.name, " \n")
  print(diag_contrast(groupPolarisFit, groupPolarisFit2, marker = i, title = paste0("contrastive: ", mkr.name)))
  cat('\n\n')
}

7.1.1 cd19

7.1.2 cd3

7.1.3 cd14

7.1.4 cd8

7.1.5 hladr

7.1.6 ck

7.1.7 dapi

7.1.8 autofluorescence

7.2 Compare two fittings: Histograms

In hist_sr_contrast, the histogram is plotted very similar to the plot function previously. In order to compare the results, the density curve of both models are plot in the same histogram, color coded the same way as in the contrastive diagnostic plot: green is the first fit object and red is the second fit object. This provides a closer look to the individual slide fittings.

for (i in nzNormedMarkers){
  mkr.name <- strsplit(i, split="_")[[1]][4]
  cat('\n### ', mkr.name, " \n")
  temp1 <- groupPolarisFit[1:4]
  class(temp1) <- "groupGammaGateR"
  temp2 <- groupPolarisFit2[1:4]
  class(temp2) <- "groupGammaGateR"
  plotss <- list()
  for(j in 1:4){
    plotss[[j]] <- hist_sr_constrast(temp1 ,temp2, marker=i, slide=j ,diagnostic=FALSE, histogram=TRUE, print=FALSE, tabl=TRUE)
  }
  cat('\n\n')
}

7.2.1 cd19

7.2.2 cd3

7.2.3 cd14

7.2.4 cd8

7.2.5 hladr

7.2.6 ck

7.2.7 dapi

7.2.8 autofluorescence

8 Model results output

The output structure is as follows:

Figure: Output Structure
Figure: Output Structure

Specifically, the output contains lists of slides output, where the output includes a variety of model and input information:

  • expressionZ is the posterior probabilities
  • expressionX is the input expression values
  • expressionW is the marginal probabilities
  • params is the zero-inflated prarameteres
  • fit is the cfGMM package model output
  • subBatch is the subBatch model fit parameter, if fitted with futher breakdown parameters within a slide. Each output facet includes a list of the marker-specific outputs.

At this stage, the model output is ready for next-step analysis. The major part of the output are the expression probability of each cell. GammaGateR provides posterior and marginal probability in expressionZ and expressionW separately. Since GammaGateR model fitting output belong to class list, the output can simply be extracted by going to the specific list. Suppose we need posterior probability from slide #01 0-889-121.

post_01_0_889_121 <- groupPolarisFit2[["#01 0-889-121"]][["expressionZ"]]

Of course, we can also easily extract posterior probability of all cells:

post_all <- do.call(rbind, (lapply(groupPolarisFit2, "[[", "expressionZ")))

In the examples we demonstrated in paper, posterior probability always outperforms marginal probability. However, marginal probability might be a more reasonable choice when the distribution is unimodal. In this scenario, the two components in the mixture model will appear to be overlapping a lot and the posterior probability will be zero for all cells.

ploti <- plot(groupPolarisFit2, slide="#84 6-409-858", marker="nzNorm_entire_cell_autofluorescence_total", diagnostic=FALSE, histogram=TRUE)

Harris, Coleman R, Eliot T McKinley, Joseph T Roland, Qi Liu, Martha J Shrubsole, Ken S Lau, Robert J Coffey, Julia Wrobel, and Simon N Vandekar. 2022. “Quantifying and Correcting Slide-to-Slide Variation in Multiplexed Immunofluorescence Images.” Bioinformatics 38 (6): 1700–1707.
Johnson, AM, JM Boland, J Wrobel, EK Klezcko, M Weiser-Evans, K Hopp, L Heasley, et al. 2021. “Cancer Cell-Specific MHCII Expression as a Determinant of the Immune Infiltrate Organization and Function in the Non-Small Cell Lung Cancer Tumor Microenvironment.” Journal of Thoracic Oncology: Official Publication of the International Association for the Study of Lung Cancer.
Wrobel, J, and T Ghosh. 2022. VectraPolarisData: Vectra Polaris and Vectra 3 Multiplex Single-Cell Imaging Data.