1 Overview

NeuroVault, is an open repository for sharing statistical maps, parcellations, and atlases of the human brain. Scientists can upload the brain images that they use for their studies to the online repository, to allow for interactive visualization of images while also contributing data entries for meta-analysis.

The images in Neurovault are highly heterogeneous and susceptible to human reporting errors in manually enter image metadata. This document automatically curates the Neurovault repository to produce a final dataset that is amenable to voxel-level meta-analysis. Here, we develop and run a preprocessing pipeline to ensure the Neurovault repository entries correspond to group-level statistical images and that images are registered to a common template (MNI 2mm). Details of preprocessing are described below. As a final step, we convert the images to effect sizes (the robust effect size index; RESI) and perform statistical analysis to remove outliers.

NeuroVault is living growing database of images. This preprocessing pipeline is automatically run on the first of each month to build a comprehensive preprocessed dataset that is freely available for download. The curated dataset is available on Zenodo. In our preprint, we demonstrate the utility of the curated dataset through unsupervised and supervised meta-analyses. This document and the dataset were last updated on 2026-05-19.

2 Download the Neurovault data

2.1 Metadata download and exclusionary criteria

The Neurovault metadata include manually entered information at the study- and image-level describing the type of data that were uploaded, including:

analysis_level: indicates whether the image is a group-level or individual-level statistical map.
not_mni: indicates whether the image is in MNI space.
is_thresholded: indicates whether the image is thresholded or unthresholded.
map_type: indicates the statistical type of the image (e.g., T, Z, F).
modality: indicates the imaging modality (e.g., structural, functional).
perc_bad_voxels: Proportion of voxels with 0 intensity or NaN values.
perc_voxels_outside: Proportion of voxels outside the brain.
number_of_subjects: Sample size of the study
DOI: The DOI for the publication associated with the repository entry.
paper_url: The URL for the publication associated with the repository entry.

We download the Neurovault study-level and image-level metadata, then apply the following quality exclusionary criteria to the metadata to ensure uniformity of the group-level statistical images.

Retain only group-level images, which represent group-level statistical maps rather than individual-level contrasts (analysis_level == "group").
Exclude images that are not in MNI space (not_mni == FALSE).
Keep only unthresholded maps (is_thresholded == FALSE), which ensures all statistical values computed in the image are included. This avoids biased caused by only reporting significant results
Restrict to statistical image types T and Z images, which are most commonly used in group-level inference. This step removes less frequently reported F images, and non-statistical images.
Exclude those that have sample size of 0 or NaN.

See Table 1 and Figure 1 for a summary of the intersection of these exclusionary criteria.

## This chunk would take some time, as it creates the file neurovaultMetadata, containing all the study-level images

library(callr)

# Check if data was already downloaded this month
rds_file_path <- get_monthly_rds_path("neurovaultMetadata")
file_exists_this_month <- file.exists(rds_file_path)

if (!file_exists_this_month) {
  # Run in background using callr::r - function saves results internally
  download_and_save_metadata_fn <- function(rds_dir) {
    library(httr)
    library(jsonlite)
    
    # Helper function to get date suffix
    get_date_suffix <- function() {
      format(Sys.Date(), "%Y-%m")
    }
    
    # Helper function to get monthly RDS path
    get_monthly_rds_path <- function(base_name, dir = rds_dir) {
      date_suffix <- get_date_suffix()
      current_file <- file.path(dir, paste0(base_name, "_", date_suffix, ".rds"))
      
      pattern <- paste0("^", base_name, "_[0-9]{4}-[0-9]{2}\\.rds$")
      all_files <- list.files(dir, pattern = pattern, full.names = TRUE)
      
      old_files <- all_files[all_files != current_file]
      if (length(old_files) > 0) {
        invisible(lapply(old_files, file.remove))
      }
      
      return(current_file)
    }
    
    Sys.setenv(CURL_CA_BUNDLE = "/etc/ssl/certs/ca-certificates.crt")
    httr::GET("https://api.github.com")
    
    url <- 'https://neurovault.org/api/collections/?format=json'
    
    nv = list()
    i = 0
    while (!is.null(url)) {
      i = i + 1
      cat(sprintf("[%s] Downloading collections batch %d\n", Sys.time(), i))
      response <- GET(url, config(cainfo="/etc/ssl/certs/ca-certificates.crt"))
      data <- content(response, "text")
      json_data <- fromJSON(data, flatten = TRUE)
      nv[[i]] = json_data$results
      
      url <- json_data[['next']]
    }
    nv = do.call(rbind, nv)
    # subset to studies that have more than 0 images
    nv = nv[nv$number_of_images>0,]
    
    # Save directly to versioned RDS file
    if (!dir.exists(rds_dir)) dir.create(rds_dir, showWarnings = FALSE, recursive = TRUE)
    rds_path <- get_monthly_rds_path("neurovaultMetadata")
    saveRDS(nv, file = rds_path)
    
    cat(sprintf("[%s] Neurovault metadata saved to %s\n", Sys.time(), rds_path))
  }
  
  # Start background process - DON'T WAIT for completion
  logfiles = c('logfiles/neurovault_collection_download.log', 'logfiles/neurovault_collection_download.err')
  unlink(logfiles)
  bg_process <- callr::r(download_and_save_metadata_fn, args = list(rds_dir), stdout = logfiles[1], stderr=logfiles[2])
  
  message(sprintf("[%s] Download complete. Output saved to %s", 
          Sys.time(), rds_file_path))
} else {
  # Load already downloaded metadata
  nv <- load_monthly_rds("neurovaultMetadata")
  message("Loaded previously downloaded metadata from ", rds_file_path)
}

nv$DOIexists = factor(ifelse(is.na(nv$DOI), "No", "Yes"), levels=c('Yes', 'No'))
nv$paperURLexists = factor(ifelse(is.na(nv$paper_url), "No", "Yes"), levels=c('Yes', 'No'))
label(nv$paperURLexists) = "Paper URL exists"
label(nv$number_of_images) = "Number of images"
label(nv$DOIexists) = "DOI exists"
table1(~number_of_images + DOIexists + paperURLexists, data = nv, caption = "Study-level metadata.")

## The following two chunks would take some time, as it extracts and downloads all the image-level data and creates the first image tabular dataset that we work with. More than 2 hours

## May 2026: 10743 collections

library(callr)
api_cores = 4
# Check if image list was already downloaded this month
rds_file_path <- get_monthly_rds_path("imageList")
file_exists_this_month <- file.exists(rds_file_path)

if (!file_exists_this_month) {
  # Define the download function to run in background - function saves results internally
  download_and_save_image_list_fn <- function(collection_ids, n_cores, rds_dir) {
    library(httr)
    library(jsonlite)
    library(parallel)
    
    # Ensure collection_ids is a proper character vector (fix for list coercion error)
    if (is.list(collection_ids)) {
      collection_ids <- unlist(collection_ids)
    }
    collection_ids <- as.character(collection_ids)
    
    cat(sprintf("[%s] [BG] Starting download with %d collection IDs\n", Sys.time(), length(collection_ids)))
    
    # Helper function to get monthly RDS path
    get_monthly_rds_path <- function(base_name, dir = rds_dir) {
      get_date_suffix <- function() {
        format(Sys.Date(), "%Y-%m")
      }
      date_suffix <- get_date_suffix()
      current_file <- file.path(dir, paste0(base_name, "_", date_suffix, ".rds"))
      
      pattern <- paste0("^", base_name, "_[0-9]{4}-[0-9]{2}\\.rds$")
      all_files <- list.files(dir, pattern = pattern, full.names = TRUE)
      
      old_files <- all_files[all_files != current_file]
      if (length(old_files) > 0) {
        invisible(lapply(old_files, file.remove))
      }
      
      return(current_file)
    }
    
    # Download function with pagination support and error handling
    download_collection_safe <- function(collection_id) {
      tryCatch({
        collection_id <- as.character(collection_id)
        cat(sprintf("[%s] [BG] Downloading collection %s\n", Sys.time(), collection_id))
        
        url <- paste0("https://neurovault.org/api/collections/", collection_id, "/images/?format=json")
        all_results <- list()
        page <- 0
        
        # Paginate through all results
        while (!is.null(url)) {
          page <- page + 1
          response <- GET(url, timeout(30))
          
          if (status_code(response) != 200) {
            cat(sprintf("[%s] [BG] Collection %s page %s: HTTP %s\n", Sys.time(), collection_id, page, status_code(response)))
            break
          }
          
          data <- content(response, "text")
          json_data <- fromJSON(data, flatten = TRUE)
          
          # Extract results from this page
          if (!is.null(json_data$results) && is.data.frame(json_data$results) && nrow(json_data$results) > 0) {
            all_results[[page]] <- json_data$results
          }
          
          # Check for next page
          url <- json_data[['next']]
        }
        
        # Combine all pages
        if (length(all_results) == 0) {
          cat(sprintf("[%s] [BG] Collection %s: No results\n", Sys.time(), collection_id))
          return(data.frame())
        }
        
        result <- do.call(rbind, all_results)
        
        # Validate result is a dataframe
        if (!is.data.frame(result)) {
          cat(sprintf("[%s] [BG] Collection %s: invalid result type (%s)\n", Sys.time(), collection_id, class(result)[1]))
          return(data.frame())
        }
        
        cat(sprintf("[%s] [BG] Collection %s: OK (%s pages, %s total rows, %s cols)\n", Sys.time(), collection_id, page, nrow(result), ncol(result)))
        return(result)
        
      }, error = function(e) {
        cat(sprintf("[%s] [BG] Collection %s: ERROR - %s\n", Sys.time(), collection_id, e$message))
        return(data.frame())
      })
    }
    
    # Download all collections
    cat(sprintf("[%s] [BG] Starting mclapply", Sys.time(), n_cores))
    image_list <- mclapply(collection_ids, download_collection_safe, mc.cores = n_cores)
    
    # Validate we got a list
    if (!is.list(image_list)) {
      stop(sprintf("ERROR: mclapply did not return a list, got %s", class(image_list)))
    }
    cat(sprintf("[%s] [BG] mclapply returned %d results\n", Sys.time(), length(image_list)))
    
    # Keep only data frames with >0 rows
    cat(sprintf("[%s] [BG] Filtering dataframes with >0 rows\n", Sys.time()))
    is_valid <- sapply(image_list, function(x) is.data.frame(x) && nrow(x) > 0)
    valid_count <- sum(is_valid)
    cat(sprintf("[%s] [BG] Valid dataframes: %d of %d\n", Sys.time(), valid_count, length(image_list)))
    
    if (valid_count == 0) {
      stop("ERROR: No valid dataframes downloaded. Check collection_ids and API.")
    }
    
    image_list <- image_list[is_valid]
    
    # Save list directly to versioned RDS file
    if (!dir.exists(rds_dir)) dir.create(rds_dir, showWarnings = FALSE, recursive = TRUE)
    rds_path <- get_monthly_rds_path("imageList")
    
    cat(sprintf("[%s] [BG] Saving image list with %d dataframes to %s\n", Sys.time(), length(image_list), rds_path))
    saveRDS(image_list, file = rds_path, compress = "gzip")
    
    cat(sprintf("[%s] [BG] ✓ Image list saved successfully\n", Sys.time()))
  }
  
  
  if (is.null(nv) || nrow(nv) == 0) {
    stop("ERROR: neurovaultMetadata is empty. Run getStudyLevelData chunk first.")
  }
  
  # Extract and validate collection IDs
  collection_ids <- nv$id
  if (is.list(collection_ids)) {
    collection_ids <- unlist(collection_ids)
  }
  collection_ids <- as.character(collection_ids)
  
  cat(sprintf("[%s] Extracted %d collection IDs\n", Sys.time(), length(collection_ids)))
  
  # Create logfiles directory if needed
  logdir <- "logfiles"
  if (!dir.exists(logdir)) dir.create(logdir, showWarnings = FALSE)
  logfiles = c('logfiles/neurovault_download.log', 'logfiles/neurovault_download.err')
  unlink(logfiles)
  
  # Start background process - DON'T WAIT for completion
  bg_process <- callr::r(download_and_save_image_list_fn, 
                            args = list(collection_ids, api_cores, rds_dir), 
                            stdout = logfiles[1], 
                            stderr = logfiles[2])
  
  message(sprintf("[%s] Background process started (PID: %s). Downloading %d collections. Output: %s", 
            Sys.time(), length(collection_ids), logfiles[1]))
} else {
  # Load already downloaded list
  image_list <- load_monthly_rds("imageList")
  message("Loaded previously downloaded image list from ", rds_file_path)
}

# Check if merged image-level data already exists this month
rds_file_path <- get_monthly_rds_path("fullimageMetadata")
file_exists_this_month <- file.exists(rds_file_path)

#res = unlist(by(images$collection_id, images$collection_id, length)); res = data.frame(collection_id = names(res), numberinImages=res) 
#test = merge(res, nv[, c('id', 'number_of_images')], by.x='collection_id', by.y='id')

if (!file_exists_this_month) {
  
  if (is.null(image_list) || length(image_list) == 0) {
    stop("ERROR: imageList is empty. Run getImageLevelData_section1.1_download chunk first and wait for it to complete.")
  }
  
  cat(sprintf("[%s] Loaded image list with %d dataframes\n", Sys.time(), length(image_list)))
  
  # Find common column names across all dataframes (columns in ALL collections)
  cat(sprintf("[%s] Analyzing column overlap across %d collections\n", Sys.time(), length(image_list)))
  all_colnames <- lapply(image_list, colnames)
  
  # Columns that appear in ALL collections
  # SNV: This excludes important columns
  # common_cols <- Reduce(intersect, all_colnames)
  # cat(sprintf("[%s] Found %d columns in ALL collections\n", Sys.time(), length(common_cols)))
  
  # Find columns that appear in at least 80% of collections
  col_freq <- table(unlist(all_colnames))
  common_cols <- names(col_freq)[col_freq >= 0.8 * length(image_list)]
  cat(sprintf("[%s] Fallback: %d columns in ≥80%% of collections\n", Sys.time(), length(common_cols)))
  
  # Subset each dataframe to common columns only
 cat(sprintf("[%s] Subsetting to %d common columns\n", Sys.time(), length(common_cols)))
  image_list <- lapply(image_list, function(df) {
    # Get columns that exist in this dataframe
    cols_exist <- intersect(common_cols, colnames(df))
    
    # Get missing columns
    cols_missing <- setdiff(common_cols, colnames(df))
    
    # Start with existing columns
    result <- df[, cols_exist, drop = FALSE]
    
    # Add missing columns filled with NA
    if (length(cols_missing) > 0) {
      result[, cols_missing] <- NA
    }
    
    # Reorder to match common_cols order
    result[, common_cols, drop = FALSE]
  })
  
  # Combine dataframes using rbind (much faster than merge)
  cat(sprintf("[%s] Combining %d dataframes via rbind...\n", Sys.time(), length(image_list)))
  start_time <- Sys.time()
  images <- do.call(rbind, image_list)
  elapsed <- difftime(Sys.time(), start_time, units = "secs")
  cat(sprintf("[%s] rbind complete in %.1f sec: %d rows × %d columns\n", 
              Sys.time(), elapsed, nrow(images), ncol(images)))
  
  # Sort all columns by missingness
  cat(sprintf("[%s] Computing column completeness for %d columns\n", Sys.time(), ncol(images)))
  col_miss <- colMeans(is.na(images))
  col_miss = sort(col_miss)
  col_keep <- names(col_miss)[col_miss < 0.8]
  images <- images[, col_keep]
  
  
  # Save to versioned RDS file
  if (!dir.exists(rds_dir)) dir.create(rds_dir, showWarnings = FALSE, recursive = TRUE)
  rds_file_path <- get_monthly_rds_path("fullimageMetadata")
  
  cat(sprintf("[%s] Saving merged data to %s\n", Sys.time(), rds_file_path))
  saveRDS(images, file = rds_file_path, compress = "gzip")
  
  cat(sprintf("[%s] ✓ Full image metadata saved successfully\n", Sys.time()))
} else {
  # Load already merged metadata
  images <- load_monthly_rds("fullimageMetadata")
  message("Loaded previously merged image metadata from ", rds_file_path)
}

images <- images %>% mutate(
  image_type      = factor(image_type,      levels = names(sort(table(image_type),      decreasing = TRUE))),
  map_type        = factor(map_type,        levels = names(sort(table(map_type),        decreasing = TRUE))),
  analysis_level  = factor(analysis_level,  levels = names(sort(table(analysis_level),  decreasing = TRUE))),
  is_thresholded  = factor(is_thresholded,  levels = names(sort(table(is_thresholded),  decreasing = TRUE))),
  not_mni         = factor(not_mni,         levels = names(sort(table(not_mni),         decreasing = TRUE)))
)

table1Form = ~ number_of_subjects + modality + image_type + map_type + analysis_level + is_thresholded + not_mni + perc_bad_voxels + perc_voxels_outside

table1(table1Form,
       data = images,
       caption = "Table 1: Neurovault image-level metadata summary.")

Table 1: Neurovault image-level metadata summary.
	Overall (N=618582)
number_of_subjects
Mean (SD)	244 (62800)
Median [Min, Max]	30.0 [-275, 32200000]
Missing	355365 (57.4%)
modality
	41801 (6.8%)
Diffusion MRI	1026 (0.2%)
EEG	43 (0.0%)
fMRI-BOLD	442445 (71.5%)
fMRI-CBF	275 (0.0%)
fMRI-CBV	26 (0.0%)
MEG	84 (0.0%)
Other	126908 (20.5%)
PET FDG	179 (0.0%)
PET other	274 (0.0%)
Structural MRI	4322 (0.7%)
Missing	1199 (0.2%)
image_type
statistic_map	617383 (99.8%)
NIDM results statistic map	1181 (0.2%)
atlas	18 (0.0%)
map_type
univariate-beta map	147640 (23.9%)
variance	137908 (22.3%)
ROI/mask	111407 (18.0%)
T map	75813 (12.3%)
other	69067 (11.2%)
Z map	66055 (10.7%)
P map (given null hypothesis)	6812 (1.1%)
parcellation	1245 (0.2%)
F map	1052 (0.2%)
1-P map ("inverted" probability)	541 (0.1%)
anatomical	490 (0.1%)
multivariate-beta map	478 (0.1%)
Chi squared map	56 (0.0%)
Missing	18 (0.0%)
analysis_level
single-subject	365261 (59.0%)
meta-analysis	117877 (19.1%)
group	54178 (8.8%)
other	2332 (0.4%)
	27 (0.0%)
Missing	78907 (12.8%)
is_thresholded
FALSE	451736 (73.0%)
TRUE	159548 (25.8%)
Missing	7298 (1.2%)
not_mni
FALSE	615842 (99.6%)
TRUE	2722 (0.4%)
Missing	18 (0.0%)
perc_bad_voxels
Mean (SD)	79.4 (16.7)
Median [Min, Max]	78.1 [0, 100]
Missing	18 (0.0%)
perc_voxels_outside
Mean (SD)	30.4 (38.9)
Median [Min, Max]	10.1 [0, 100]
Missing	909 (0.1%)

Table 1 summarizes the full NeuroVault repository prior to any exclusions (N = 618582). Several patterns are worth noting. The repository is dominated by single-subject images and meta-analytic maps; group-level images — the target of this pipeline — comprise only 10.0% of the total. The most common map types are non-statistical maps such as univariate-beta map and variance; T and Z maps together account for a minority of all images. A large proportion of images have no recorded sample size (57.4% missing), reflecting the manually curated nature of the repository. The high median perc_bad_voxels reflects expected zero-valued voxels outside the brain mask rather than data corruption.

# SNV: Warning mapping missing to FALSE in upset plot is suppressed.
library(ComplexUpset)
## UpSet table
images$binarized_group_level <- ifelse(images$analysis_level == "group", TRUE, FALSE)
images$binarized_map_type <- ifelse(images$map_type == "T map" | images$map_type == "Z map", TRUE, FALSE)
images$unthresholded <- ifelse(images$is_thresholded == FALSE, TRUE, FALSE)
images$mni <- ifelse(images$not_mni == FALSE, TRUE, FALSE)
images$nonzero_sample <- ifelse(is.na(images$number_of_subjects) | images$number_of_subjects<=0, FALSE, TRUE)

images_clean <- images %>%
  transmute(
    "Group-level analysis" = as.logical(binarized_group_level),
    "T/Z map"  = as.logical(binarized_map_type),
    "Unthresholded" = as.logical(unthresholded),
    "MNI space" = as.logical(mni),
    "Nonzero sample" = as.logical(nonzero_sample)
  )

upset(
  images_clean,
  intersect = c(
    "Group-level analysis",
    "T/Z map",
    "Unthresholded",
    "MNI space",
    'Nonzero sample'
  ),
  name='Group',
  min_size = 50,
  base_annotations = list(
    "Intersection size" = intersection_size(
      text = list(
        angle = 90,
        vjust = 0.5,
        hjust = -0.02
      )
    )
  )
)

Figure 1: Intersection of NeuroVault image-level exclusionary criteria.

# SNV: clean up larger unused objects
rm(images_clean, image_list)

Figure 1 shows the an UpSet plot showing the intersection of the five exclusionary criteria applied to the raw NeuroVault image metadata. Each column in the upper bar chart represents a unique combination of criteria, with the bar height indicating how many images fall into that combination. The dot matrix below each bar shows which criteria are active — filled dots connected by a vertical line indicate criteria that are simultaneously satisfied. The horizontal bars on the left show the total number of images satisfying each criterion individually (set size), regardless of the others. The final retained images satisfy all five criteria simultaneously. See Table 2 below for summary statistics after exclusion.

Note:

The analysis_level variable has other types of data, and group-level images make up less than 10% of data.
Including only T and Z maps removes approximately 70% of the data.
0 images have a reported sample size greater than 100,000 (maximum: 32,000). Our approach here is by spot checking the entry with 32,222,222, and it turns out to be a sample size of 32. We have manually fixed it in our dataset. However, for everything else and future additional studies to our pipeline, a study with more than 0 images can reflect metadata entry errors. Thus, implausibly large values will be flagged and removed here (except for the one that we spot checked.

# SNV: I deleted reloading the dataset, here
images_before <- nrow(images)
# subset to group-level analysis
images = images[ which(images$analysis_level=='group'), ]
# subset to unthresholded images
images = images[which(images$is_thresholded==FALSE),]
# Including only T and Z map types
images = images[images$map_type %in% c("T map", "Z map"), ]
# Exclude all not mni
images = images[which(images$not_mni == FALSE), ]
# Exclude sample size 0
images <- images[which(!is.na(images$number_of_subjects) & images$number_of_subjects>0),]

table1(table1Form, data = images, caption = "Table 2: Summary statistics after metadata exclusions.")

Table 2: Summary statistics after metadata exclusions.
	Overall (N=25706)
number_of_subjects
Mean (SD)	79.7 (226)
Median [Min, Max]	45.0 [1.00, 30000]
modality
Diffusion MRI	39 (0.2%)
fMRI-BOLD	17647 (68.6%)
fMRI-CBF	14 (0.1%)
fMRI-CBV	2 (0.0%)
MEG	39 (0.2%)
Other	7361 (28.6%)
PET FDG	50 (0.2%)
PET other	43 (0.2%)
Structural MRI	511 (2.0%)
image_type
statistic_map	25706 (100%)
NIDM results statistic map	0 (0%)
atlas	0 (0%)
map_type
univariate-beta map	0 (0%)
variance	0 (0%)
ROI/mask	0 (0%)
T map	23143 (90.0%)
other	0 (0%)
Z map	2563 (10.0%)
P map (given null hypothesis)	0 (0%)
parcellation	0 (0%)
F map	0 (0%)
1-P map ("inverted" probability)	0 (0%)
anatomical	0 (0%)
multivariate-beta map	0 (0%)
Chi squared map	0 (0%)
analysis_level
single-subject	0 (0%)
meta-analysis	0 (0%)
group	25706 (100%)
other	0 (0%)
	0 (0%)
is_thresholded
FALSE	25706 (100%)
TRUE	0 (0%)
not_mni
FALSE	25706 (100%)
TRUE	0 (0%)
perc_bad_voxels
Mean (SD)	74.0 (12.2)
Median [Min, Max]	75.9 [0, 86.1]
perc_voxels_outside
Mean (SD)	13.2 (21.3)
Median [Min, Max]	6.41 [0, 100]

After applying these exclusions, the dataset is reduced from 618,582 to 25,706 images (Table 2). The retained images are exclusively group-level, unthresholded T and Z maps in MNI space with a valid nonzero sample size. As expected, variables used as exclusionary criteria — analysis_level, is_thresholded, not_mni, and map_type — show no remaining variation, confirming the filters were applied correctly. The dataset is dominated by fMRI-BOLD images (68.6%), with a median sample size of 45 subjects. The mean sample size (80) is notably larger than the median, suggesting a small number of studies with disproportionately large sample sizes that will be addressed in the next section.

2.2 Imaging data download

Here, we download Nifti images from each NeuroVault collection after applying the metadata exclusions. After preparing local file paths, it checks for missing files and downloads them in parallel using 36 cores.

## This chunk would take some time, as it downloads the images to local directories, to allow further work on the images.


library(callr)
# download the images
outdir = '/media/big/neurovault'
images$localfile = gsub('http://neurovault.org/media/images', outdir, images$file)

# creates the directories
invisible(capture.output(lapply(unique(dirname(images$localfile)), dir.create, showWarnings=FALSE, recursive = TRUE)))

# Actually download the images, here, if they aren't already
# downloaded. I am not sure if there are usage limits on
# neurovault. This parallel call would probably be a problem
# if there is. There were issues the first time.
images$download = !file.exists(images$localfile)

# Define the download function
download_images_fn <- function(url_vec, destfile_vec, n_cores) {
  library(parallel)
  
  downloadFile = function(url, destfile){ 
    tryCatch(download.file(url=url, destfile=destfile), error=function(e){NA})
  }
  
  invisible(mcmapply(downloadFile, url=url_vec, destfile=destfile_vec, mc.cores = n_cores))
}

# Create logfiles directory if needed
logdir <- "logfiles"
if (!dir.exists(logdir)) dir.create(logdir, showWarnings = FALSE)
logfiles = c('logfiles/image_download.log', 'logfiles/image_download.err')
unlink(logfiles)

# Run and wait for completion
cat(sprintf("[%s] Starting image download (%d files)...\n", Sys.time(), sum(images$download==TRUE)))

## [2026-05-19 12:39:46.195016] Starting image download (46 files)...

result <- callr::r(download_images_fn, 
                   args = list(images$file[images$download==TRUE], 
                              images$localfile[images$download==TRUE], 
                              n_cores),
                   stdout = logfiles[1],
                   stderr = logfiles[2])

cat(sprintf("[%s] ✓ Download completed. Output: %s\n", Sys.time(), logfiles[1]))

## [2026-05-19 12:39:48.345042] ✓ Download completed. Output: logfiles/image_download.log

# recheck what files exist
images$download = !file.exists(images$localfile)
# SNV: Haozheng, I added this, but it is not documented. Excludes failed downloads
images = images[ !images$download,]

3 Image derived exclusionary criteria

After downloading the images, we compute image dimensions (imgdim), voxel dimensions (voxdim), number of nonzero voxels (nvox), image intensity range (range), and orientation from the Nifti images. The image and voxel dimensions are used to compute the physical dimensions of each image in millimeters. We screen for duplicate images based on both the qualitative and quantitative aspects, including the file’s basename imgdim, voxeldim, nvox, range, and bounding_box. It would be rare for two different images to have exact identical values across all these metrics, especially bounding_box and range. After identifying duplicates, we check whether the image dimensions in mm are within 2.5% of the MNI template image. Finally, we check the image range; if it is less than 0.01, we exclude the image (Figure 2). This threshold was chosen because for images where the maximum and minimum intensity values differ by less than 0.01, they contain effectively no signal variation, and are therefore unlikely to represent valid statistical maps. We do not exclude based on large ranges because the test statistic under the alternative converges to infinity as the sample size gets large. Further effect size based exclusionary criteria are applied below. Tables 4 and 5 show the dataset before and after these exclusionary criteria, respectively.

3.1 Duplicates handling

library(callr)

# Check if image dimensions/voxels metadata was already computed this month
rds_file_path <- get_monthly_rds_path("imageMetadataWithDimensions")
file_exists_this_month <- file.exists(rds_file_path)

if (!file_exists_this_month) {
  # Define the function to compute image dimensions - runs in background
  compute_and_save_image_dimensions_fn <- function(rds_dir, images, n_cores, c3d_bin) {
    library(parallel)
    
    # Helper function to get monthly RDS path
    get_monthly_rds_path <- function(base_name, dir = rds_dir) {
      get_date_suffix <- function() {
        format(Sys.Date(), "%Y-%m")
      }
      date_suffix <- get_date_suffix()
      current_file <- file.path(dir, paste0(base_name, "_", date_suffix, ".rds"))
      
      pattern <- paste0("^", base_name, "_[0-9]{4}-[0-9]{2}\\.rds$")
      all_files <- list.files(dir, pattern = pattern, full.names = TRUE)
      
      old_files <- all_files[all_files != current_file]
      if (length(old_files) > 0) {
        invisible(lapply(old_files, file.remove))
      }
      
      return(current_file)
    }
    
    # Helper function to extract value by keyword from a string
    extract_info_field <- function(lines, keyword) {
      val <- sapply(lines, function(x) {
        first_line <- x[1]
        splt <- strsplit(first_line, split = ";")[[1]]
        match <- grep(keyword, splt, value = TRUE)
        if (length(match) > 0) {
          return(trimws(sub(paste0("^.*", keyword, "\\s*=\\s*"), "", match[1])))
        } else {
          return(NA)
        }
      })
      return(val)
    }
    
    # Get image dimensions and voxel info
    cat(sprintf("[%s] [BG] Computing image dimensions with c3d -info...\n", Sys.time()))
    cmd = paste(c3d_bin, images$localfile, '-info')
    voxdim = mclapply(cmd, system, intern=TRUE, mc.cores=n_cores)
    images$voxdim = unlist(lapply(voxdim, function(x){splt = strsplit(x[1], split = ';')[[1]]; splt[grep('vox', splt)] }))
    images$imgdim = unlist(lapply(voxdim, function(x){splt = strsplit(x[1], split = ';')[[1]]; splt[grep('dim', splt)] }))
    images$voxdim <- trimws(images$voxdim)
    images$imgdim <- trimws(images$imgdim)
    cat(sprintf("[%s] [BG] Image dimensions computed\n", Sys.time()))
    
    # Count non-zero voxels
    cat(sprintf("[%s] [BG] Counting non-zero voxels with c3d -threshold...\n", Sys.time()))
    cmd_vox <- paste(c3d_bin, images$localfile, '-threshold 0 0 0 1 -voxel-sum')
    nvox <- unlist(mclapply(cmd_vox, system, intern = TRUE, mc.cores = n_cores))
    images$nvox <- as.numeric(gsub("[^0-9.]", "", nvox))
    cat(sprintf("[%s] [BG] Non-zero voxel count complete\n", Sys.time()))
    
    # Get orientation, range, and bounding box from `-info` output
    cat(sprintf("[%s] [BG] Extracting orientation, range, and bounding box...\n", Sys.time()))
    cmd_info <- paste(c3d_bin, images$localfile, '-info')
    info <- mclapply(cmd_info, system, intern = TRUE, mc.cores = n_cores)
    
    images$orient <- extract_info_field(info, "orient")
    images$range <- extract_info_field(info, "range")
    images$bounding_box <- extract_info_field(info, "bb")
    images$voxdim <- trimws(gsub("vox = ", "", images$voxdim))
    images$imgdim <- trimws(gsub("Image #1: dim = ", "", images$imgdim))
    
    # Save to versioned RDS file
    if (!dir.exists(rds_dir)) dir.create(rds_dir, showWarnings = FALSE, recursive = TRUE)
    rds_path <- get_monthly_rds_path("imageMetadataWithDimensions")
    
    cat(sprintf("[%s] [BG] Saving enhanced metadata to %s\n", Sys.time(), rds_path))
    saveRDS(images, file = rds_path, compress = "gzip")
    
    cat(sprintf("[%s] [BG] ✓ Image dimension metadata saved successfully\n", Sys.time()))
  }
  
  # Create logfiles directory if needed
  logdir <- "logfiles"
  if (!dir.exists(logdir)) dir.create(logdir, showWarnings = FALSE)
  logfiles = c('logfiles/c3d_dimensions.log', 'logfiles/c3d_dimensions.err')
  unlink(logfiles)
  
  # Start background process - DON'T WAIT for completion
  bg_process <- callr::r(compute_and_save_image_dimensions_fn, 
                            args = list(rds_dir, images, n_cores, c3d_bin), 
                            stdout = logfiles[1], 
                            stderr = logfiles[2])
  
  message(sprintf("[%s] Background process started (PID: %s). Computing image dimensions. Output: %s", 
            Sys.time(), logfiles[1]))
} else {
  # Load already computed metadata
  images <- load_monthly_rds("imageMetadataWithDimensions")
  message("Loaded previously computed image dimensions from ", rds_file_path)
}

Table 3: Duplicate images identified and removed based on identical intensity range and filename.

images$localfile_base <- basename(as.character(images$localfile))
cand_keys <- c("range", "localfile_base")

keyDF <- as.data.frame(lapply(images[, cand_keys], function(x) {
  if (is.list(x)) sapply(x, paste, collapse = ",") else as.character(x)
}))

images$duplicated <- duplicated(keyDF)
n_dup <- sum(images$duplicated)

reactable(
  images[images$duplicated == TRUE, c('collection', 'id', 'range', 'localfile_base')],
  striped = TRUE,
  highlight = TRUE,
  bordered = TRUE,
  searchable = TRUE,
  defaultPageSize = 10,
  columns = list(
    collection     = colDef(name = "Collection", minWidth = 150),
    id             = colDef(name = "ID",         minWidth = 80),
    range          = colDef(name = "Range",      minWidth = 150),
    localfile_base = colDef(name = "Filename",   minWidth = 200)
  )
)

images <- images[!images$duplicated, ]
if (exists("keyDF")) rm(keyDF)

We screen for duplicate images based on identical intensity range and filename, as two images sharing both values are almost certainly the same statistical map uploaded to multiple collections. A total of 2066 duplicate images were identified and removed, leaving 23594 images for downstream analysis. The duplicates are displayed in the table above. The table above shows an interactive table of the objects being dropped here. Note: Only the second occurrence of each duplicate pair is shown here; the first occurrence is retained in the dataset and will not appear in this table, so it is normal for no two rows in the table to look identical to each other.

3.2 Proportionality checking

Using a proportionality tolerance of 0.025, we then check whether the physical dimensions of each image are approximately proportional to the MNI template (182x218x182 mm) as shown in Figure 2.

# compute dim_mm by multiplying imgdim with voxdim after converting to numeric values
imgdims = lapply(gsub('\\]', ')', gsub('\\[', 'c(', as.character(images$imgdim))), function(x) eval(parse(text=x)))
voxdims = lapply(gsub('\\]', ')', gsub('\\[', 'c(', as.character(images$voxdim))), function(x) eval(parse(text=x)))
dims_mat = t(mapply(function(x,y){ x*y}, imgdims, voxdims))
colnames(dims_mat) <- c("dim_x", "dim_y", "dim_z")

# Split dim_mm into x, y, z components
# SNV: dim_mm was not defined and only existed because it was loading an old dataset version.
images$dim_mm = apply(dims_mat, 1, function(x) paste(x, collapse = "*"))


# Compute ratios
ratio_xy <- dims_mat[,1] / dims_mat[,2]
ratio_xz <- dims_mat[,1] / dims_mat[,3]
ratio_yz <- dims_mat[,2] / dims_mat[,3]


# Define reference ratios (from MNI: 182x218x182)
ref_xy <- 182 / 218       # ≈ 0.8349
ref_xz <- 1               # since x = z
ref_yz <- 218 / 182       # ≈ 1.1978

images[, paste('log10', c('ratio_xy', 'ratio_xz', 'ratio_yz'), sep = '_')] <- cbind(ratio_xy/ref_xy, ratio_xz/ref_xz, ratio_yz/ref_yz)

images[, c('ratio_x', 'ratio_y', 'ratio_z') ] <- sweep(dims_mat, 2, c(182, 218, 182), FUN = "/")


# Reference thresholds by dimension
ref_thresholds <- list(
  ratio_x = c(0.75, 1.25),
  ratio_y = c(0.75, 1.25),
  ratio_z = c(0.6, 1.2)
)

images$is_proportional = apply(images[, c('ratio_x', 'ratio_y', 'ratio_z')], 1, function(r) {
  all(r >= ref_thresholds$ratio_x[1] & r <= ref_thresholds$ratio_x[2]) &
    all(r >= ref_thresholds$ratio_y[1] & r <= ref_thresholds$ratio_y[2]) &
    all(r >= ref_thresholds$ratio_z[1] & r <= ref_thresholds$ratio_z[2])
})

# Logical mask for proportional dimensions
# Set epsilon
epsilon <- 0.025
# images$is_proportional <- abs(ratio_xy - ref_xy) < epsilon &
#                    abs(ratio_xz - ref_xz) < epsilon &
#                    abs(ratio_yz - ref_yz) < epsilon

# SNV: Changed to not repeat code you had above and use the same (more compact) syntax you used above
images <- images %>% mutate(
  dim_mm      = factor(dim_mm,      levels = names(sort(table(dim_mm),      decreasing = TRUE)))
)

prop_near_mni <- sum(images$is_proportional)
mni_total <- length(images$is_proportional)


# PLOT

library(GGally)
library(ggplot2)
library(scales)
library(rlang)

# Create ratio dataframe
ratio_df <- images[, c('ratio_x', 'ratio_y', 'ratio_z')]
ratio_df <- na.omit(ratio_df)

# Diagonal function (histograms)
diag_fun_ratio <- function(data, mapping, ...) {
  p <- GGally::ggally_barDiag(data, mapping, ...)
  
  var <- as_label(mapping$x)
  thrs <- ref_thresholds[[var]]
  
  if (!is.null(thrs)) {
    for (thr in thrs) {
      p <- p + geom_vline(xintercept = thr, linetype = "dashed", color = "red", alpha = 0.7)
    }
  }
  p
}

# Lower function (scatter plots)
lower_fun_ratio <- function(data, mapping, ...) {
  p <- GGally::ggally_points(data, mapping, ...)
  
  xvar <- as_label(mapping$x)
  yvar <- as_label(mapping$y)
  
  x_thrs <- ref_thresholds[[xvar]]
  y_thrs <- ref_thresholds[[yvar]]
  
  if (!is.null(x_thrs)) {
    for (thr in x_thrs) {
      p <- p + geom_vline(xintercept = thr, linetype = "dashed", color = "red", alpha = 0.7)
    }
  }
  
  if (!is.null(y_thrs)) {
    for (thr in y_thrs) {
      p <- p + geom_hline(yintercept = thr, linetype = "dashed", color = "red", alpha = 0.7)
    }
  }
  
  p
}

# Upper function (correlations)
upper_fun_ratio <- function(data, mapping, ...) {
  GGally::ggally_cor(data, mapping, ...) +
    theme_void()
}

# Create the plot
GGally::ggpairs(
  ratio_df,
  diag = list(continuous = diag_fun_ratio),
  lower = list(continuous = lower_fun_ratio),
  upper = list(continuous = upper_fun_ratio),
  title = "Figure 2: Image dimension ratios (relative to MNI 182×218×182 template)"
)

We find that 23191 of 23594 images are approximately proportional to the MNI template.

3.3 Range Exclusion

We convert the image dimensions to mm by multiplying the number of voxels by the pixel dimension. Using a proportionality tolerance of \(\varepsilon = 0.025\), we then check whether the physical dimensions of each image are approximately proportional to the MNI template (182×218×182 mm). We find that 23191 of 23594 images are approximately proportional to the MNI template.

Images with a low intensity range (\(|\max - \min|\)) are unlikely to be valid statistical images, so we exclude all images that have a range smaller than 0.01. A visual illustration is available in Figure 3 below. Everything below the red dashed line are removed due to small range.

library(stringr)
library(ggplot2)
library(scales)

# SNV: removed loading and saving dataset in this chunk

# parse [lo, hi]
hi <- str_match(
  images[["range"]],
  "\\[\\s*([+-]?(?:\\d*\\.?\\d+(?:[eE][+-]?\\d+)?))\\s*,\\s*([+-]?(?:\\d*\\.?\\d+(?:[eE][+-]?\\d+)?))\\s*\\]"
)
lo <- as.numeric(hi[, 2])
hi <- as.numeric(hi[, 3])

# attach & compute range
images$range_low  <- lo
images$range_high <- hi
images$range_diff <- images$range_high - images$range_low
rm(lo, hi)

# histogram of log10(range_diff) (positive values only)
xpos <- images$range_diff
# SNV: commented this because it was already satisfied.
# xpos <- xpos[is.finite(xpos) & xpos > 0]

# xpos already defined above as positive, finite range_diff values
df_sc <- data.frame(
  idx = seq_along(xpos),
  l10 = log10(xpos)
)

tick_min <- floor(min(df_sc$l10, na.rm = TRUE))
tick_max <- ceiling(max(df_sc$l10, na.rm = TRUE))
ticks    <- tick_min:tick_max

# SNV: changed hard coding of horizontal line
# SNV: Change this to -2, since it is thresholding on the log scale
thr <- -2
ggplot(df_sc, aes(idx, l10)) +
  geom_point(alpha = 0.5, size = 0.6) +
  scale_y_continuous(breaks = ticks, labels = scales::math_format(10^.x)) +
  geom_hline(yintercept = ticks, linetype = "dotted", linewidth = 0.3, color = "grey70") +
  labs(
    title = "Figure 3: Scatterplot of image range (log10 scale) by row index",
    x = "Image index",
    y = "Image numerical range (log10 scale)",
    subtitle = sprintf("n = %d ", length(xpos))
  ) +
  geom_hline(yintercept = thr,linetype = "dashed", color = "red") + 
  theme_minimal(base_size = 12)

rm(df_sc)

# drop rows with log10(range_diff) < 0.1 (guard against non-positive / NA)
images$range_exclusion = log10(images$range_diff) < thr

Table 4 below summarizes the dataset after image-level exclusions, showing the distribution of our dataset after the three additional criteria derived from the downloaded images: duplicated (identical range and filename across collections), is_proportional (physical dimensions within 2.5% of the MNI 182×218×182 template), and range_exclusion (intensity range below 0.01). The dim_mm column shows the distribution of image dimensions in millimeters across the dataset.

The remaining images are proportionally sized relative to the MNI template, contain meaningful intensity variation, and have no detected duplicates. The dim_mm distribution in Table 4 reflects only the image dimensions present in the retained dataset, with the most common dimension being 182218182.

# SNV: perc_of_voxels_outside was not included in table1Form, so I did not include it here.
table2Form = update.formula(table1Form, ~ duplicated + is_proportional + range_exclusion + dim_mm + .)
images <- images[which(images$is_proportional & !images$duplicated & !images$range_exclusion), ]
dimtab = table(images$dim_mm); dimtab = dimtab[dimtab > 0]
images$dim_mm = factor(images$dim_mm, levels = names(sort(dimtab,      decreasing = TRUE)))

table1(table2Form, data = images, caption = "Table 4: After image-level exclusionary criteria with prior metadata.")

Table 4: After image-level exclusionary criteria with prior metadata.
	Overall (N=22833)
duplicated
Yes	0 (0%)
No	22833 (100%)
is_proportional
Yes	22833 (100%)
No	0 (0%)
range_exclusion
Yes	0 (0%)
No	22833 (100%)
dim_mm
182218182	10756 (47.1%)
195231196	2329 (10.2%)
194230194	2061 (9.0%)
195231195	1174 (5.1%)
192228192	1094 (4.8%)
158190158	825 (3.6%)
195232.5195	648 (2.8%)
159189138	544 (2.4%)
181.5217.5181.5	465 (2.0%)
195231198	457 (2.0%)
196.749230.755194.4	388 (1.7%)
183219183	296 (1.3%)
159189156	291 (1.3%)
194230194.4	288 (1.3%)
150186160	108 (0.5%)
184.25220184	77 (0.3%)
157.5192.5157.5	76 (0.3%)
162192150	76 (0.3%)
180216180	63 (0.3%)
194.4230.4194.4	59 (0.3%)
158190156	56 (0.2%)
158190138	45 (0.2%)
194230195	45 (0.2%)
196.218231.894196.42	33 (0.1%)
170210145	30 (0.1%)
197233189	28 (0.1%)
193.2229.2193.2	26 (0.1%)
181217181	24 (0.1%)
180217.5180	20 (0.1%)
188220184	19 (0.1%)
158190168	18 (0.1%)
182217182	18 (0.1%)
148184156	16 (0.1%)
196232197.6	15 (0.1%)
158190174	14 (0.1%)
162.4191.4150.8	14 (0.1%)
194.247230.346195	14 (0.1%)
195.8231195.8	14 (0.1%)
157.5190137.5	13 (0.1%)
182.5220182.5	13 (0.1%)
159195156	12 (0.1%)
169.5205.5169.5	12 (0.1%)
194230195.8	12 (0.1%)
157.5190.5157.5	11 (0.0%)
192.5229.25192.5	11 (0.0%)
196.875231.25198	11 (0.0%)
153.9189160.38	10 (0.0%)
164192140	10 (0.0%)
196.552231.384195.25	10 (0.0%)
152184152	9 (0.0%)
161190.75150.5	9 (0.0%)
195.816231.636198.75	9 (0.0%)
156186147	8 (0.0%)
157.5190157.5	8 (0.0%)
158190136.5	8 (0.0%)
194.4230.4196.56	8 (0.0%)
153189147	7 (0.0%)
156.6189158.4	6 (0.0%)
195231197.12	6 (0.0%)
157.5189136.5	5 (0.0%)
159195159	5 (0.0%)
196.472232.412194.4	5 (0.0%)
198.75232.5197.6	5 (0.0%)
198231195	5 (0.0%)
154190164	4 (0.0%)
157.5195171	4 (0.0%)
182218183	4 (0.0%)
193.5229.5195	4 (0.0%)
195231196.56	4 (0.0%)
195231198.8	4 (0.0%)
139.01492173.76865151.65264	3 (0.0%)
192.5227.5192.5	3 (0.0%)
196.735231.77197.1	3 (0.0%)
144180146	2 (0.0%)
157189156	2 (0.0%)
159192156	2 (0.0%)
160192140	2 (0.0%)
164192160	2 (0.0%)
169205169	2 (0.0%)
171207171	2 (0.0%)
178.5214.5178.5	2 (0.0%)
182.5217.5182.5	2 (0.0%)
183249183	2 (0.0%)
192.5230192.5	2 (0.0%)
195.468230.373195.5	2 (0.0%)
195.966233.784197.2	2 (0.0%)
195232.5198	2 (0.0%)
200235191.4	2 (0.0%)
202232192	2 (0.0%)
150.5185158	1 (0.0%)
159192138	1 (0.0%)
159192159	1 (0.0%)
161.25191.25136.8	1 (0.0%)
161.25191.25140	1 (0.0%)
161191151	1 (0.0%)
180219184	1 (0.0%)
183219174	1 (0.0%)
183219183.6	1 (0.0%)
184.8220.8182.4	1 (0.0%)
188224184	1 (0.0%)
190.9227.7190.9	1 (0.0%)
192.5230.3125192.5	1 (0.0%)
193.6228.8193.6	1 (0.0%)
193229193	1 (0.0%)
198234190	1 (0.0%)
number_of_subjects
Mean (SD)	1490 (213000)
Median [Min, Max]	53.0 [1.00, 32200000]
modality
Diffusion MRI	39 (0.2%)
fMRI-BOLD	15000 (65.7%)
fMRI-CBF	14 (0.1%)
fMRI-CBV	1 (0.0%)
MEG	39 (0.2%)
Other	7351 (32.2%)
PET FDG	31 (0.1%)
PET other	42 (0.2%)
Structural MRI	316 (1.4%)
image_type
statistic_map	22833 (100%)
NIDM results statistic map	0 (0%)
atlas	0 (0%)
map_type
univariate-beta map	0 (0%)
variance	0 (0%)
ROI/mask	0 (0%)
T map	20356 (89.2%)
other	0 (0%)
Z map	2477 (10.8%)
P map (given null hypothesis)	0 (0%)
parcellation	0 (0%)
F map	0 (0%)
1-P map ("inverted" probability)	0 (0%)
anatomical	0 (0%)
multivariate-beta map	0 (0%)
Chi squared map	0 (0%)
analysis_level
single-subject	0 (0%)
meta-analysis	0 (0%)
group	22833 (100%)
other	0 (0%)
	0 (0%)
is_thresholded
FALSE	22833 (100%)
TRUE	0 (0%)
not_mni
FALSE	22833 (100%)
TRUE	0 (0%)
perc_bad_voxels
Mean (SD)	74.7 (10.6)
Median [Min, Max]	77.0 [0, 86.1]
perc_voxels_outside
Mean (SD)	12.3 (21.3)
Median [Min, Max]	3.47 [0, 100]

3.4 Image value processing

Aside from the above three approaches, there are other attributes of certain images that stops registration.

There could be NaN values in the images that causes FLIRT to fail. We want to remove them as well as some values that are very close to zero due to prior registration artifact.
There are images with active voxels everywhere, or with very extremely low intensity voxels. We set a fixed \(\varepsilon\) (0.001) here, to filter out voxels with values around zero by \(\varepsilon\) in all images.

This value was chosen by eroding the non-brain mask, defined below, and looking at a histogram of the 90th quantile of the values outside the brain.

library(callr)

# Define the image processing function
process_images_c3d_fn <- function(images_data, n_cores, eps_val) {
  library(parallel)
  
  # Create output filenames
  images_data$localfileadj <- sub("\\.nii(\\.gz)?$", "_eps.nii.gz", 
                                   images_data$localfile, ignore.case = TRUE)
  
  # Step 1: Convert NaN into zeros
  cat(sprintf("[%s] [BG] Converting NaN to zeros...\n", Sys.time()))
  cmd1 <- paste('/usr/local/c3d-1.1.0-Linux-x86_64/bin/c3d', 
                images_data$localfile, '-replace NaN 0.0 -o', 
                images_data$localfileadj)
  invisible(mclapply(cmd1, function(c) system(paste(c, ">/dev/null")), 
                     mc.cores = n_cores))
  cat(sprintf("[%s] [BG] NaN conversion complete\n", Sys.time()))
  
  # Step 2: Apply threshold filter
  cat(sprintf("[%s] [BG] Applying threshold filter (eps = %g)...\n", Sys.time(), eps_val))
  cmd2 <- sprintf("%s %s -dup -threshold -%g %g 0 1 -multiply -o %s",
                  shQuote("/usr/local/c3d-1.1.0-Linux-x86_64/bin/c3d"),
                  shQuote(images_data$localfileadj), eps_val, eps_val, 
                  shQuote(images_data$localfileadj))
  invisible(mclapply(cmd2, function(c) system(paste(c, ">/dev/null")), 
                     mc.cores = n_cores))
  cat(sprintf("[%s] [BG] Threshold filtering complete\n", Sys.time()))
  
  # Return updated filenames
  return(images_data$localfileadj)
}

# Create logfiles directory if needed
logdir <- "logfiles"
if (!dir.exists(logdir)) dir.create(logdir, showWarnings = FALSE)
logfiles = c('logfiles/image_processing_c3d.log', 'logfiles/image_processing_c3d.err')
unlink(logfiles)

# Run and wait for completion
# SNV: @copilot, should only run this if not run
eps <- 1e-3
cat(sprintf("[%s] Starting image processing (%d images)...\n", Sys.time(), nrow(images)))

## [2026-05-19 12:39:56.208183] Starting image processing (22833 images)...

localfileadj_result <- callr::r(process_images_c3d_fn, 
                                args = list(images, n_cores, eps),
                                stdout = logfiles[1],
                                stderr = logfiles[2])

# Update images dataframe with processed filenames
images$localfileadj <- localfileadj_result
cat(sprintf("[%s] ✓ Image processing completed. Output: %s\n", Sys.time(), logfiles[1]))

## [2026-05-19 12:43:15.195242] ✓ Image processing completed. Output: logfiles/image_processing_c3d.log

4 Image realignment and Registration

Despite the images being reported as MNI space. After visual inspection, we noticed they were not properly aligned with the MNI template. Examples of some misaligned images from the image dimension categories shown in Table 4 are shown in Figure 3.

Figure 4: Examples of image misalignment.

To resolve image misalignment, we evaluated registration of varying complexity:

Reslicing – no registration is performed; the image is simply resampled into the MNI 2mm grid.
Rigid registration (DOF = 6) – allows only rotation and translation, preserving the original shape and scale of the brain.
Affine registration (DOF = 12) – adds scaling and shearing to the rigid transform.

For both rigid and linear registrations, we use mutual information as the cost function (-cost mutualinfo using FLIRT) to accommodate differences in contrast and intensity between the template and statistical images. Results are shown for the seven most common image dimensions covering the majority of the dataset.

4.1 Registration snapshots

4.1.1 158x190x158

4.1.2 182x218x182

4.1.3 192x228x192

4.1.4 194x230x194

4.1.5 195x231x195

4.1.6 195x231x196

4.1.7 195x232.5x195

Based on these results, we chose rigid registration, as a balance between flexibility and accuracy across the image dimension categories. With the registered images, we further extracted information from the data:

voxdim_new: indicates the new voxel dimension of the images, now all images should be 2mm isotropic.
imgdim_new: indicates the new image dimension, now all images should be 91*109*91.
range_new: indicates the new range of voxel intensity.

4.2 Run registration

library(RNifti)
library(callr)

# Add resliced output filenames to dataframe (fast operation, no need for background)
cat(sprintf("[%s] Adding resliced output filenames\n", Sys.time()))

## [2026-05-19 12:43:15.278141] Adding resliced output filenames

images$resliced <- gsub('.nii.gz', '_rigid_cost_step1.nii.gz', images$localfile)

# Define the registration function to run in background
register_images_fn <- function(images_data, template_path, n_cores, outdir) {
  library(parallel)
  
  ## FSL Setup - Robust
  if (!nzchar(Sys.which("flirt"))) {
    if (!nzchar(Sys.getenv("FSLDIR"))) {
      fsl_candidates <- c("/usr/local/fsl", "/usr/share/fsl", "/opt/fsl")
      existing <- file.exists(fsl_candidates)
      if (any(existing)) {
        Sys.setenv(FSLDIR = fsl_candidates[which(existing)[1]])
      } else {
        stop("FSL not found. Please install FSL and set FSLDIR environment variable")
      }
    }
    Sys.setenv(FSLOUTPUTTYPE = "NIFTI_GZ")
    fslbin <- file.path(Sys.getenv("FSLDIR"), "bin")
    Sys.setenv(PATH = paste(fslbin, Sys.getenv("PATH"), sep = ":"))
  }
  
  ## Reference image
  ref <- file.path(Sys.getenv("FSLDIR"), "data", "standard", "MNI152_T1_2mm_brain.nii.gz")
  
  ## Helpers
  validate_flirt <- function() {
    flirt <- Sys.which("flirt")
    if (!nzchar(flirt)) stop("FLIRT not found in PATH")
    if (system2(flirt, args = "-version", stdout = FALSE, stderr = FALSE) != 0) {
      stop("FLIRT exists but failed to run. Please check FSL installation")
    }
    flirt
  }
  
  flirt <- validate_flirt()
  
  reslice_image <- function(input, output, ref) {
    if (!file.exists(output)) {
      message(sprintf("[%s] Reslicing: %s", Sys.time(), basename(input)))
      cmd <- sprintf("%s -in '%s' -ref '%s' -out '%s' -dof 6 -cost mutualinfo -interp trilinear",
                     flirt, input, ref, output)
      status <- system(cmd, ignore.stdout = TRUE, ignore.stderr = TRUE)
      if (status != 0 || !file.exists(output) || file.size(output) == 0) {
        warning("FLIRT failed for: ", input)
        return(NA_character_)
      }
    }
    return(output)
  }
  
  ## Reslice all images
  cat(sprintf("[%s] [BG] Starting reslicing with %d cores...\n", Sys.time(), n_cores))
  results <- mcmapply(
    FUN = reslice_image,
    input = images_data$localfileadj,
    output = images_data$resliced,
    MoreArgs = list(ref = ref),
    mc.cores = n_cores
  )
  
  cat(sprintf("[%s] [BG] Image registration completed\n", Sys.time()))
}

# Create logfiles directory if needed
logdir <- "logfiles"
if (!dir.exists(logdir)) dir.create(logdir, showWarnings = FALSE)
logfiles = c('logfiles/flirt_registration.log', 'logfiles/flirt_registration.err')
unlink(logfiles)


# Start background process - DON'T WAIT for completion
bg_process <- callr::r(register_images_fn, 
                          args = list(images, template, n_cores, outdir),
                          stdout = logfiles[1],
                          stderr = logfiles[2])

message(sprintf("[%s] Background process started. Running image registration. Output: %s", 
                Sys.time(), logfiles[1]))

library(stringr)

# Check if post-registration extraction and coverage data already exists this month
rds_file_path <- get_monthly_rds_path("imageExtractionAndCoverage")
file_exists_this_month <- file.exists(rds_file_path)

if (!file_exists_this_month) {
  # ============================================================================
  # PART 1: FEATURE EXTRACTION FROM REGISTERED IMAGES
  # ============================================================================
  
  # Count non-zero voxels with error handling
  cmd_vox <- paste(c3d_bin, images$resliced, '-threshold 0 0 0 1 -voxel-sum')
  nvox_result <- tryCatch({
    unlist(mclapply(cmd_vox, function(cmd) {
      tryCatch(system(cmd, intern = TRUE), error = function(e) "")
    }, mc.cores = n_cores))
  }, error = function(e) {
    cat(sprintf("[%s] Warning: c3d voxel count failed\n", Sys.time()))
    rep("", length(cmd_vox))
  })
  images$nvox_new <- as.numeric(gsub("[^0-9.]", "", nvox_result))
  images$nvox_new[is.na(images$nvox_new)] <- NA
  
  # Extract new info with c3d -info
  cat(sprintf("[%s] Extracting image info (voxdim, imgdim, orientation, range)...\n", Sys.time()))
  cmd_info <- paste(c3d_bin, images$resliced, "-info")
  info_result <- tryCatch({
    mclapply(cmd_info, function(cmd) {
      tryCatch(system(cmd, intern = TRUE), error = function(e) "")
    }, mc.cores = n_cores)
  }, error = function(e) {
    cat(sprintf("[%s] Warning: c3d -info failed\n", Sys.time()))
    rep(list(""), length(cmd_info))
  })
  
  # Enhanced extractField function with error handling
  extractField <- function(x, field_regex) {
    tryCatch({
      if (length(x) == 0 || all(x == "")) return(NA_character_)
      line <- x[1]
      if (is.na(line) || line == "") return(NA_character_)
      match <- regmatches(line, regexpr(field_regex, line))
      if (length(match) == 0) return(NA_character_)
      return(match)
    }, error = function(e) NA_character_)
  }
  
  # Extract fields with safe mapping
  cat(sprintf("[%s] Parsing extracted fields...\n", Sys.time()))
  images$voxdim_new <- sapply(info_result, extractField, "vox = \\[.*?\\]")
  images$imgdim_new <- sapply(info_result, extractField, "dim = \\[.*?\\]")
  images$orient_new <- sapply(info_result, extractField, "orient = [A-Z]+")
  images$range_new  <- sapply(info_result, extractField, "range = \\[.*?\\]")
  
  # Count how many extractions were successful
  na_count <- sum(is.na(images$voxdim_new))
  cat(sprintf("[%s] Extraction complete: %d successful, %d with missing/failed output\n", 
              Sys.time(), nrow(images) - na_count, na_count))
  
  # ============================================================================
  # PART 2: COMPUTE COVERAGE STATISTICS
  # ============================================================================
  
  # Get files for coverage computation
  files_all <- images$resliced
  keep <- !is.na(files_all) & file.exists(files_all)
  files <- files_all[keep]
  idx   <- which(keep)
  
  # Function to compute mask overlap
  maskOverlap = function(mask_path, files){
    mask = readNifti(mask_path)
    unlist(mclapply(files, function(file){
      out = sum(readNifti(file)[ mask==1]!=0)
    }, mc.cores = n_cores))
  }
  
  # Compute coverage for different masks
  nnz_outside = maskOverlap(nonbrain_mask, files)
  nnz_gm = maskOverlap(gm_mask, files)
  nnz_wm = maskOverlap(wm_mask, files)
  
  # Get minimum nonzero value in the brain complement
  maskcomp = readNifti(nonbrain_eroded_mask)
  q90_outside = unlist(mclapply(files, function(file){
    out = abs(readNifti(file)[ maskcomp==1])
    quantile(out, probs=0.9)
  }, mc.cores = n_cores))
  
  # Write back to images (order-preserving; NAs for missing files)
  images$nvox_outside[idx] <- nnz_outside
  images$nvox_gm[idx]      <- nnz_gm
  images$nvox_wm[idx]      <- nnz_wm
  images$q90_outside[idx] = q90_outside
  
  # Save to versioned RDS file
  if (!dir.exists(rds_dir)) dir.create(rds_dir, showWarnings = FALSE, recursive = TRUE)
  rds_file_path <- get_monthly_rds_path("imageExtractionAndCoverage")
  
  cat(sprintf("[%s] Saving extracted features and coverage data to %s\n", Sys.time(), rds_file_path))
  saveRDS(images, file = rds_file_path, compress = "gzip")
  
  cat(sprintf("[%s] ✓ Extraction and coverage data saved successfully\n", Sys.time()))
} else {
  # Load previously computed data
  images <- load_monthly_rds("imageExtractionAndCoverage")
  message("Loaded previously computed extraction and coverage data from ", rds_file_path)
}

## Loaded previously computed extraction and coverage data from /media/disk2/ree/metadataset/rds/imageExtractionAndCoverage_2026-05.rds

# Plot histogram of q90_outside for reference
#hist(log10(images$q90_outside[images$q90_outside>0]), xlab='log10(90th quantile)', main='90th image quantile')

4.3 Exclusionary criteria based on template overlap

After registration, we compute the proportion of nonzero voxels overlapping with three spatial masks derived from the MNI template: gray matter, white matter, and the region outside both (i.e., non-brain voxels). These coverage metrics serve as a proxy for registration quality. A well-registered brain image should have high overlap with gray and white matter and minimal signal outside the brain mask.

Figure 5 shows the distribution of these three coverage proportions across all registered images. Based on these histograms, we selected thresholds that separate the bulk of the distribution from a low-coverage tail, and visually evaluated excluded images to confirm the thresholds were not discarding well-registered data.

# thresholds
GM_MIN  <- 100000
WM_MIN  <- 20000
OUT_MAX <- 100000

GM_total <- sum(readNifti(gm_mask))
WM_total <- sum(readNifti(wm_mask))

out_total <- 91*109*91 - GM_total - WM_total
images$gm_perc <- images$nvox_gm / GM_total
images$wm_perc <- images$nvox_wm / WM_total
images$outside_perc <- images$nvox_outside / out_total


library(GGally)
library(ggplot2)
library(scales)
library(rlang)

# --- proportions already computed above ---
coverage_df <- images[, c("gm_perc", "wm_perc", "outside_perc")]
coverage_df <- na.omit(coverage_df)

colnames(coverage_df) <- c("Gray Matter", "White Matter", "Outside Mask")

# convert to percent for plotting
coverage_pct <- coverage_df * 100

# convert voxel-count thresholds to percent thresholds
ref_pct <- c(
  "Gray Matter"  = 100 * GM_MIN  / GM_total,
  "White Matter" = 100 * WM_MIN  / WM_total,
  "Outside Mask" = 100 * OUT_MAX / out_total
)

# formatter for axes
pct_lab <- scales::label_percent(scale = 1, accuracy = 1)  # since data already in 0-100

diag_fun <- function(data, mapping, ...) {
  p <- GGally::ggally_barDiag(data, mapping, ...)
  var <- as_label(mapping$x)
  thr <- ref_pct[[var]]
  if (!is.null(thr)) p <- p + geom_vline(xintercept = thr, linetype = "dashed")
  p + scale_x_continuous(labels = pct_lab)
}

lower_fun <- function(data, mapping, ...) {
  p <- GGally::ggally_points(data, mapping, ...)

  xvar <- as_label(mapping$x)
  yvar <- as_label(mapping$y)

  # Reference identity line y = x
  p <- p + geom_abline(
    slope = 1, intercept = 0,
    linetype = "solid", color = "black"
  )

  # Optional reference thresholds
  if (!is.null(ref_pct[[xvar]]))
    p <- p + geom_vline(xintercept = ref_pct[[xvar]], linetype = "dashed")

  if (!is.null(ref_pct[[yvar]]))
    p <- p + geom_hline(yintercept = ref_pct[[yvar]], linetype = "dashed")

  p +
    scale_x_continuous(labels = pct_lab) +
    scale_y_continuous(labels = pct_lab)
}

upper_fun <- function(data, mapping, ...) {
  GGally::ggally_cor(data, mapping, ...) +
    theme_void()
}

GGally::ggpairs(
  coverage_pct,
  title = "Figure 5: Spatial Coverage Percentage",
  upper = list(continuous = "cor"),
  diag  = list(continuous = diag_fun),
  lower = list(continuous = wrap(lower_fun, alpha = 0.35, size = 0.6))
)

We apply the following three exclusions:

Gray matter coverage above 55% — images with fewer than this proportion of nonzero voxels inside gray matter are likely missing substantial cortical signal, possibly due to partial brain coverage or failed registration.
White matter coverage above 32% — low white matter coverage similarly suggests the image does not span the full brain volume.
Non-brain signal below 15% — images with a high proportion of nonzero voxels outside the brain mask likely contain registration artifacts or were not properly skull-stripped prior to upload.

Together, these criteria remove images that are unlikely to represent valid whole-brain statistical maps in MNI space, regardless of whether they passed the earlier metadata-based filters.

# keep ONLY rows that pass all three checks (and are finite)
images$gm_exclude = images$nvox_gm <= GM_MIN
images$wm_exclude = images$nvox_wm <= WM_MIN
images$nvox_outside_exclude = (images$nvox_outside >= OUT_MAX)

# SNV: Hoazheng, to be consistent with the prior sections, lets use a table structure to show the exclusionary criteria
table3Form = update.formula(table2Form, ~ gm_exclude + wm_exclude + nvox_outside_exclude + .)
table1(table3Form, data=images,caption = "Table 5: Coverage-based exclusionary criteria prior to removal.")

Table 5: Coverage-based exclusionary criteria prior to removal.
	Overall (N=22954)
gm_exclude
Yes	212 (0.9%)
No	22742 (99.1%)
wm_exclude
Yes	77 (0.3%)
No	22877 (99.7%)
nvox_outside_exclude
Yes	787 (3.4%)
No	22167 (96.6%)
duplicated
Yes	0 (0%)
No	22954 (100%)
is_proportional
Yes	22954 (100%)
No	0 (0%)
range_exclusion
Yes	0 (0%)
No	22954 (100%)
dim_mm
182218182	10757 (46.9%)
195231196	2350 (10.2%)
194230194	2083 (9.1%)
195231195	1213 (5.3%)
192228192	1094 (4.8%)
158190158	825 (3.6%)
195232.5195	648 (2.8%)
159189138	544 (2.4%)
195231198	475 (2.1%)
181.5217.5181.5	465 (2.0%)
196.749230.755194.4	408 (1.8%)
183219183	296 (1.3%)
159189156	291 (1.3%)
194230194.4	288 (1.3%)
150186160	108 (0.5%)
184.25220184	77 (0.3%)
157.5192.5157.5	76 (0.3%)
162192150	76 (0.3%)
180216180	63 (0.3%)
194.4230.4194.4	59 (0.3%)
158190156	56 (0.2%)
158190138	45 (0.2%)
194230195	45 (0.2%)
196.218231.894196.42	33 (0.1%)
170210145	30 (0.1%)
197233189	28 (0.1%)
193.2229.2193.2	26 (0.1%)
181217181	24 (0.1%)
180217.5180	20 (0.1%)
188220184	19 (0.1%)
158190168	18 (0.1%)
182217182	18 (0.1%)
148184156	16 (0.1%)
196232197.6	15 (0.1%)
158190174	14 (0.1%)
162.4191.4150.8	14 (0.1%)
194.247230.346195	14 (0.1%)
195.8231195.8	14 (0.1%)
157.5190137.5	13 (0.1%)
182.5220182.5	13 (0.1%)
169.5205.5169.5	12 (0.1%)
159195156	12 (0.1%)
194230195.8	12 (0.1%)
157.5190.5157.5	11 (0.0%)
192.5229.25192.5	11 (0.0%)
196.875231.25198	11 (0.0%)
153.9189160.38	10 (0.0%)
164192140	10 (0.0%)
196.552231.384195.25	10 (0.0%)
152184152	9 (0.0%)
161190.75150.5	9 (0.0%)
195.816231.636198.75	9 (0.0%)
156186147	8 (0.0%)
157.5190157.5	8 (0.0%)
158190136.5	8 (0.0%)
194.4230.4196.56	8 (0.0%)
153189147	7 (0.0%)
156.6189158.4	6 (0.0%)
195231197.12	6 (0.0%)
157.5189136.5	5 (0.0%)
159195159	5 (0.0%)
196.472232.412194.4	5 (0.0%)
198.75232.5197.6	5 (0.0%)
198231195	5 (0.0%)
154190164	4 (0.0%)
157.5195171	4 (0.0%)
182218183	4 (0.0%)
193.5229.5195	4 (0.0%)
195231196.56	4 (0.0%)
195231198.8	4 (0.0%)
139.01492173.76865151.65264	3 (0.0%)
192.5227.5192.5	3 (0.0%)
196.735231.77197.1	3 (0.0%)
144180146	2 (0.0%)
157189156	2 (0.0%)
159192156	2 (0.0%)
160192140	2 (0.0%)
164192160	2 (0.0%)
169205169	2 (0.0%)
171207171	2 (0.0%)
178.5214.5178.5	2 (0.0%)
182.5217.5182.5	2 (0.0%)
183249183	2 (0.0%)
192.5230192.5	2 (0.0%)
195.468230.373195.5	2 (0.0%)
195.966233.784197.2	2 (0.0%)
195232.5198	2 (0.0%)
200235191.4	2 (0.0%)
202232192	2 (0.0%)
150.5185158	1 (0.0%)
159192138	1 (0.0%)
159192159	1 (0.0%)
161.25191.25136.8	1 (0.0%)
161.25191.25140	1 (0.0%)
161191151	1 (0.0%)
180219184	1 (0.0%)
183219174	1 (0.0%)
183219183.6	1 (0.0%)
184.8220.8182.4	1 (0.0%)
188224184	1 (0.0%)
190.9227.7190.9	1 (0.0%)
192.5230.3125192.5	1 (0.0%)
193.6228.8193.6	1 (0.0%)
193229193	1 (0.0%)
198234190	1 (0.0%)
number_of_subjects
Mean (SD)	1490 (213000)
Median [Min, Max]	52.0 [1.00, 32200000]
modality
Diffusion MRI	39 (0.2%)
fMRI-BOLD	15121 (65.9%)
fMRI-CBF	14 (0.1%)
fMRI-CBV	1 (0.0%)
MEG	39 (0.2%)
Other	7351 (32.0%)
PET FDG	31 (0.1%)
PET other	42 (0.2%)
Structural MRI	316 (1.4%)
image_type
statistic_map	22954 (100%)
NIDM results statistic map	0 (0%)
atlas	0 (0%)
map_type
univariate-beta map	0 (0%)
variance	0 (0%)
ROI/mask	0 (0%)
T map	20477 (89.2%)
other	0 (0%)
Z map	2477 (10.8%)
P map (given null hypothesis)	0 (0%)
parcellation	0 (0%)
F map	0 (0%)
1-P map ("inverted" probability)	0 (0%)
anatomical	0 (0%)
multivariate-beta map	0 (0%)
Chi squared map	0 (0%)
analysis_level
single-subject	0 (0%)
meta-analysis	0 (0%)
group	22954 (100%)
other	0 (0%)
	0 (0%)
is_thresholded
FALSE	22954 (100%)
TRUE	0 (0%)
not_mni
FALSE	22954 (100%)
TRUE	0 (0%)
perc_bad_voxels
Mean (SD)	74.7 (10.6)
Median [Min, Max]	76.8 [0, 86.1]
perc_voxels_outside
Mean (SD)	12.4 (21.3)
Median [Min, Max]	3.65 [0, 100]

## The following chunk finds out the ones that should be excluded, i.e. low coverage, and visualize them

par(mfrow = c(3, 4), mar = c(1, 1, 3.5, 1), pty = "m")

tempBG <- readNifti(template)

low_gm   <- images[which(images$gm_exclude == TRUE), ][seq_len(min(4, sum(images$gm_exclude,             na.rm = TRUE))), ]
low_wm   <- images[which(images$wm_exclude == TRUE), ][seq_len(min(4, sum(images$wm_exclude,             na.rm = TRUE))), ]
high_out <- images[which(images$nvox_outside_exclude == TRUE), ][seq_len(min(4, sum(images$nvox_outside_exclude, na.rm = TRUE))), ]

spot_check <- rbind(low_gm, low_wm, high_out)
spot_check$reason <- c(
  rep("Low gray matter",    nrow(low_gm)),
  rep("Low white matter",   nrow(low_wm)),
  rep("High outside mask",  nrow(high_out))
)

for (i in seq_len(nrow(spot_check))) {
  img <- readNifti(spot_check$resliced[i])
  
  pbj:::image.niftiImage(
    img, BGimg = tempBG, lo = FALSE, index = 35,
    limits = quantile(abs(img), probs = 0.01),
    plane  = 'sagittal'
  )
  
  title(
    main = paste0(
      spot_check$reason[i], "\n",
      "GM: ",      round(100 * spot_check$gm_perc[i],      1), "% | ",
      "WM: ",      round(100 * spot_check$wm_perc[i],      1), "% | ",
      "Outside: ", round(100 * spot_check$outside_perc[i], 1), "%"
    ),
    cex.main = 0.65,
    line     = 0.3
  )
}

Figure 6: Representative sagittal slices of excluded images grouped by exclusion reason. Top row: low gray matter coverage (images with sparse or misaligned cortical signal). Middle row: low white matter coverage (images with minimal subcortical signal, likely due to partial brain coverage or poor registration). Bottom row: excessive non-brain signal (images where signal extends well beyond the brain mask, indicating registration failure or absent skull-stripping).

n_before <- nrow(images)
images   <- images[which(!(images$gm_exclude | images$wm_exclude | images$nvox_outside_exclude)), ]
n_after  <- nrow(images)
cat(sprintf("Excluded %d images. Remaining: %d\n", n_before - n_after, n_after))

## Excluded 999 images. Remaining: 21955

The following shows what the dataset looks like after excluding.

# SNV: Changed to coverage_pct to be consistent
# --- proportions already computed above ---
coverage_df <- images[, c("gm_perc", "wm_perc", "outside_perc")]
coverage_df <- na.omit(coverage_df)

colnames(coverage_df) <- c("Gray Matter", "White Matter", "Outside Mask")

# convert to percent for plotting
coverage_pct <- coverage_df * 100

GGally::ggpairs(
  coverage_pct,
  title = "Figure 7: Spatial Coverage Percentage",
  upper = list(continuous = "cor"),
  diag  = list(continuous = diag_fun),
  lower = list(continuous = wrap(lower_fun, alpha = 0.35, size = 0.6))
)

5 Effect size analysis and exclusionary criteria

5.1 Effect size estimation

Here, we convert nonzero values in the T and Z statistical images into the signed version of the robust effect size index (RESI; using RESI::t2S for T maps and RESI::z2S for Z maps), based on each metadata sample size reported with the images. We save the resulting effect size images and summary statistics within gray and white matter (mean and 0, 0.25, 0.5, 0.75, 1 quantiles).

library(RNifti)
library(callr)
library(RESI)

# Check if effect size computation was already done this month
rds_file_path <- get_monthly_rds_path("imageMetadataEffectSizes")
file_exists_this_month <- file.exists(rds_file_path)

if (!file_exists_this_month) {
  # Define the effect size computation function to run in background - function saves results internally
  compute_and_save_effect_sizes_fn <- function(images_data, gm_mask_path, wm_mask_path, 
                                              mask_2mm_path, n_cores, rds_dir) {
    library(RNifti)
    library(parallel)
    library(RESI)
    
    # Helper function to get monthly RDS path
    get_monthly_rds_path <- function(base_name, dir = rds_dir) {
      get_date_suffix <- function() {
        format(Sys.Date(), "%Y-%m")
      }
      date_suffix <- get_date_suffix()
      current_file <- file.path(dir, paste0(base_name, "_", date_suffix, ".rds"))
      
      pattern <- paste0("^", base_name, "_[0-9]{4}-[0-9]{2}\\.rds$")
      all_files <- list.files(dir, pattern = pattern, full.names = TRUE)
      
      old_files <- all_files[all_files != current_file]
      if (length(old_files) > 0) {
        invisible(lapply(old_files, file.remove))
      }
      
      return(current_file)
    }
    
    # Load masks once
    gm_mask_nifti <- RNifti::readNifti(gm_mask_path)
    wm_mask_nifti <- RNifti::readNifti(wm_mask_path)
    brain_mask_nifti <- RNifti::readNifti(mask_2mm_path)
    
    # Validate that required columns exist
    cat(sprintf("[%s] [BG] Starting effect size computation for %d images...\n", Sys.time(), nrow(images_data)))
    
    # Get indices of nonzero voxels once for efficiency
    gm_idx   <- which(gm_mask_nifti != 0)
    wm_idx   <- which(wm_mask_nifti != 0)
    brain_idx <- which(brain_mask_nifti != 0)
    
    # Worker function that takes ALL arguments explicitly (no lexical scoping)
    process_single_image <- function(idx, file_path, n_subj, map_type, gm_idx, wm_idx, brain_idx) {
      tryCatch({
        if (is.na(file_path) || !nzchar(file_path) || !file.exists(file_path)) return(NULL)
        if (is.na(n_subj) || is.na(map_type)) return(NULL)
        
        img <- RNifti::readNifti(file_path)
        
        # Only extract nonzero values
        nonzero_mask <- (img != 0) & !is.na(img)
        valid_idx <- which(nonzero_mask)
        
        if (length(valid_idx) == 0) return(NULL)
        
        valid_vals <- img[valid_idx]
        
        # Convert map_type to character to handle factors
        map_type_char <- as.character(map_type)
        
        # Convert effect sizes
        if (map_type_char == "T map") {
          converted <- RESI::t2S(valid_vals, rdf = n_subj - 1, n = n_subj)
        } else if (map_type_char == "Z map") {
          converted <- RESI::z2S(valid_vals, n = n_subj, unbiased = TRUE)
        } else {
          return(NULL)
        }
        
        # Build effect size image only with converted values
        effect_img <- img
        effect_img[valid_idx] <- converted
        effect_img[!is.finite(effect_img)] <- 0
        
        output_dir <- dirname(file_path)
        if (!dir.exists(output_dir)) dir.create(output_dir, recursive = TRUE)
        base_name  <- sub("\\.nii\\.gz$", "", basename(file_path))
        out_file   <- file.path(output_dir, paste0(base_name, "_effect_size.nii.gz"))
        RNifti::writeNifti(effect_img, out_file)
        
        # Get stats from converted values only (not from full image)
        effect_only <- converted[converted != 0 & is.finite(converted)]
        es_stats <- if (length(effect_only) > 0) {
          c(
            quantile(effect_only, type = 7),
            es_mean = mean(effect_only)
          )
        } else rep(NA_real_, 6)
        
        # Compute masked statistics using indices only (no intermediate arrays)
        grey_vals  <- effect_img[gm_idx]
        white_vals <- effect_img[wm_idx]
        
        # Write brain-masked version
        brain_file <- sub("\\.nii\\.gz$", "_masked_mni.nii.gz", out_file)
        brain_masked <- effect_img
        brain_masked[-brain_idx] <- 0  # Zero out non-brain voxels
        RNifti::writeNifti(brain_masked,  brain_file)
        
        list(
          index       = idx,
          effect_size_file = out_file,
          brain_file = brain_file,
          es_min      = es_stats[1],
          es_1qt      = es_stats[2],
          es_median   = es_stats[3],
          es_3qt      = es_stats[4],
          es_max      = es_stats[5],
          es_mean     = es_stats[6],
          grey_mean   = if (length(grey_vals)  > 0) mean(grey_vals[grey_vals != 0], na.rm = TRUE)  else NA_real_,
          grey_min    = if (length(grey_vals)  > 0) min(grey_vals[grey_vals != 0], na.rm = TRUE)   else NA_real_,
          grey_max    = if (length(grey_vals)  > 0) max(grey_vals[grey_vals != 0], na.rm = TRUE)   else NA_real_,
          white_mean  = if (length(white_vals) > 0) mean(white_vals[white_vals != 0], na.rm = TRUE) else NA_real_,
          white_min   = if (length(white_vals) > 0) min(white_vals[white_vals != 0], na.rm = TRUE)  else NA_real_,
          white_max   = if (length(white_vals) > 0) max(white_vals[white_vals != 0], na.rm = TRUE)  else NA_real_
        )
      }, error = function(e) {
        message(" [", Sys.time(), "] Skipping index ", idx, ": ", e$message)
        NULL
      })
    }
    
    # Use mcmapply to pass only necessary vectors to workers (not entire environment)
    cat(sprintf("[%s] [BG] Starting effect size computation for %d images...\n", Sys.time(), nrow(images_data)))
    results <- mcmapply(
      process_single_image,
      idx = 1:nrow(images_data),
      file_path = images_data$resliced,
      n_subj = images_data$number_of_subjects,
      map_type = images_data$map_type,
      MoreArgs = list(gm_idx = gm_idx, wm_idx = wm_idx, brain_idx = brain_idx),
      mc.cores = n_cores,
      SIMPLIFY = FALSE
    )
    
    # Filter and bind
    cat(sprintf("[%s] Processing results...\n", Sys.time()))
    results <- Filter(Negate(is.null), results)
    df_results <- do.call(rbind, lapply(results, as.data.frame))
    
    # Add index column to images_data for matching
    images_data$index <- 1:nrow(images_data)
    
    # merge keeping all of images, matched by index
    images_data = merge(images_data, df_results, all.x=TRUE)
    
    # Remove the temporary index column (not in original)
    images_data$index <- NULL
    
    # Save directly to versioned RDS file
    if (!dir.exists(rds_dir)) dir.create(rds_dir, showWarnings = FALSE, recursive = TRUE)
    rds_path <- get_monthly_rds_path("imageMetadataEffectSizes")
    saveRDS(images_data, file = rds_path)
    
    cat(sprintf("[%s] Effect size computation completed. Results saved to %s\n", Sys.time(), rds_path))
  }
  
  logdir <- "logfiles"
  if (!dir.exists(logdir)) dir.create(logdir, showWarnings = FALSE)
  logfiles = c('logfiles/effectSize_estimation.log', 'logfiles/effectSize_estimation.err')
  unlink(logfiles)

  # Start background process - DON'T WAIT for completion
  bg_process <- callr::r(compute_and_save_effect_sizes_fn, 
                            args = list(images, gm_mask, wm_mask, mask_2mm_file, n_cores, rds_dir),
                            stdout = logfiles[1],
                            stderr = logfiles[2])
  
  message(sprintf("[%s] Background process started, Effect size computation will be saved to %s", 
            Sys.time(), rds_file_path))
} else {
  # Load already computed effect sizes
  images <- load_monthly_rds("imageMetadataEffectSizes")
  message("Loaded previously computed effect sizes from ", rds_file_path)
}

5.2 Meta-analysis outlier detection model

To screen for outlier images, we model the expected variability of these mean effect sizes as a function of (i) study sample size and (ii) voxel coverage, and flag observations that fall outside the implied funnel. The mean value of the effect size, \(y_{ij}\), for image \(j\) in collection \(i\), is proportional to \(n_i^{-1} V_i^{-2}\), where \(n_i\) denotes the number of subjects (sample size) and \(V_{ij}\) denotes the number of nonzero voxels in image \(j\) (see preprint for details).

We fit an intercept-only linear mixed-effects model with a collection-specific random intercept:

\[y_{ij} = \mu + u_{i} + \varepsilon_{ij},\]

where \(\mu\) is the grand mean effect size and \(u_{i} \sim \mathcal{N}(0, \tau^2)\) captures between-collection heterogeneity.

To account for heteroskedasticity across images, we model the residual variance using a power-of-covariates structure implemented via the \(\texttt{varPower}\) variance function in \(\texttt{nlme}\), and combine multiple variance components using . Specifically, for study-level covariates \(N_{ij}\) (sample size) and \(V_{ij}\) (scaled voxel coverage), we assume

\[\mathrm{Var}(\varepsilon_{ij}) = \sigma^2 \, |N_{ij}|^{2\delta_1} \, |V_{ij}|^{2\delta_2},\]

so that residual variability decreases as \(N\) and/or \(V\) increase when \(\delta_1 < 0\) and/or \(\delta_2 < 0\).

The total per-observation variance incorporates both residual and between-collection variability:

\[\sigma_{ij}^2 = \tau^2 + \sigma^2 \, |N_{ij}|^{2\delta_1} \, |V_{ij}|^{2\delta_2}.\]

This quantity represents the model-implied variance of observed mean effect sizes as a function of sample size and voxel coverage.

We flag whole images as outliers if their deviation from the fitted grand mean exceeds a multiple of its modeled standard deviation:

\[\bigl| y_{ij} - \hat{\mu} \bigr| > k \, \sigma_{ij},\]

with \(k = 3\).

This criterion standardizes residuals across studies while accounting for heteroskedasticity and within-collection correlation. This outlier identification procedure was applied only to the gray matter mean effect sizes. To mitigate the effect of outliers on the model fit used for outlier identification, we windsorized the data at the 1st and 99th percentiles. We did this after noting that a few extreme image values had a huge effect on the residual variance. The same procedure was applied separately to gray matter and white matter mean effect sizes. These analyses serve both to validate the effect-size transformation and to provide a principled, model-based mechanism for identifying implausible images prior to downstream inference. While alternative filtering strategies were explored, we selected the mixed-effects variance modeling approach as the most statistically grounded and interpretable method for screening artifactual data points. A summary table consolidating key model estimates and variance components is provided below.

# SNV: excludes missing es_mean -- effect size could not be computed?
images = images[ !is.na(images$es_mean),]
winsorize_vec <- function(x, probs = c(0.01, 0.99)) {
  qs <- quantile(x, probs = probs, na.rm = TRUE, names = FALSE, type = 7)
  x[x < qs[1]] <- qs[1]
  x[x > qs[2]] <- qs[2]
  x
}

# Load effect size data
images$es_mean_winsorized <- winsorize_vec(images$es_mean)
images$voxel_scaled <- images$nvox_new/1000

m_power_winsorized <- lme(
  es_mean_winsorized ~ 1,
  random  = ~ 1 | collection_id,
  weights = varComb(
  varPower(form = ~ number_of_subjects),
  varPower(form = ~ voxel_scaled)),
  data    = images,
  method  = "REML"
)

m_power_original <- lme(
  es_mean ~ 1,
  random  = ~ 1 | collection_id,
  weights = varComb(
  varPower(form = ~ number_of_subjects),
  varPower(form = ~ voxel_scaled)),
  data    = images,
  method  = "REML"
)

summary(m_power_winsorized)

## Linear mixed-effects model fit by REML
##   Data: images 
##         AIC       BIC   logLik
##   -27068.31 -27028.34 13539.16
## 
## Random effects:
##  Formula: ~1 | collection_id
##         (Intercept) Residual
## StdDev:  0.06573642 116.8415
## 
## Combination of variance functions: 
##  Structure: Power of variance covariate
##  Formula: ~number_of_subjects 
##  Parameter estimates:
##      power 
## 0.02178354 
##  Structure: Power of variance covariate
##  Formula: ~voxel_scaled 
##  Parameter estimates:
##     power 
## -1.270854 
## Fixed effects:  es_mean_winsorized ~ 1 
##                  Value   Std.Error    DF  t-value p-value
## (Intercept) 0.00204014 0.001561378 18299 1.306627  0.1914
## 
## Standardized Within-Group Residuals:
##         Min          Q1         Med          Q3         Max 
## -5.44353378 -0.39624050  0.01803653  0.42848458  7.00064001 
## 
## Number of Observations: 21920
## Number of Groups: 3621

summary(m_power_original)

## Linear mixed-effects model fit by REML
##   Data: images 
##       AIC      BIC    logLik
##   23272.3 23312.28 -11631.15
## 
## Random effects:
##  Formula: ~1 | collection_id
##         (Intercept)    Residual
## StdDev:   0.5330694 61200986042
## 
## Combination of variance functions: 
##  Structure: Power of variance covariate
##  Formula: ~number_of_subjects 
##  Parameter estimates:
##      power 
## -0.8910117 
##  Structure: Power of variance covariate
##  Formula: ~voxel_scaled 
##  Parameter estimates:
##     power 
## -4.099169 
## Fixed effects:  es_mean ~ 1 
##                  Value  Std.Error    DF  t-value p-value
## (Intercept) 0.01813216 0.01029684 18299 1.760944  0.0783
## 
## Standardized Within-Group Residuals:
##           Min            Q1           Med            Q3           Max 
## -4.858372e+01 -1.252907e-01  7.317224e-04  1.384293e-01  4.238303e+01 
## 
## Number of Observations: 21920
## Number of Groups: 3621

m_power <- m_power_winsorized

The following chunk uses the outlier boundaries, as described above, from the winsorized model to make exclusions.

overall_effect <- as.numeric(fixef(m_power))                
sigma2 <- sigma(m_power)^2                                   
tau2   <- as.numeric(VarCorr(m_power)[1, "Variance"])    

# varPower exponents (delta_n, delta_v)
expParams <- intervals(m_power)$varStruct[, 2]
delta_n <- expParams[1]
delta_v <- expParams[2]

images$se_model <- sqrt(
  tau2 + sigma2 *
    (images$number_of_subjects ^ (2*delta_n)) *
    (images$voxel_scaled       ^ (2*delta_v))
)

# standardized residuals
images$res_std <- residuals(m_power, type = "pearson")

k <- 3
images$outlier <- abs(images$es_mean - overall_effect) > k * images$se_model

# SNV: These are all satisfied already in the dataset
# dat_plot <- subset(
#   images,
#   is.finite(number_of_subjects) & number_of_subjects > 0 &
#     is.finite(es_mean) & is.finite(se_model)
# )
n_out <- sum(images$outlier)
n_tot <- nrow(images)

grid <- data.frame(
  number_of_subjects = seq(1, 10000, length.out = 800)
)

# choose a constant voxel_scaled for the band (median is a sensible default)
grid$voxel_scaled <- median(images$voxel_scaled, na.rm = TRUE)

grid$se_model <- sqrt(
  tau2 + sigma2 *
    (grid$number_of_subjects ^ (2*delta_n)) *
    (grid$voxel_scaled       ^ (2*delta_v))
)

grid$upper <- overall_effect + k * grid$se_model
grid$lower <- overall_effect - k * grid$se_model
grid$log_n <- log10(grid$number_of_subjects)

library(plotly)

# SNV: dat_plot was the same as images, just replace it here
# dat_plot$se_model <- sqrt(
#   tau2 + sigma2 *
#     (dat_plot$number_of_subjects ^ (2 * delta_n)) *
#     (dat_plot$voxel_scaled       ^ (2 * delta_v))
# )
# dat_plot$upper_bound <- overall_effect + k * dat_plot$se_model
# dat_plot$lower_bound <- overall_effect - k * dat_plot$se_model
# 
# dat_plot$outlier <- with(dat_plot, es_mean > upper_bound | es_mean < lower_bound)

n_seq <- 10^seq(0, 4, length.out = 60)
v_seq <- seq(
  quantile(images$voxel_scaled, 0.05, na.rm = TRUE),
  quantile(images$voxel_scaled, 0.95, na.rm = TRUE),
  length.out = 60
)

grid3d <- expand.grid(number_of_subjects = n_seq, voxel_scaled = v_seq)

grid3d$se_model <- sqrt(
  tau2 + sigma2 *
    (grid3d$number_of_subjects ^ (2 * delta_n)) *
    (grid3d$voxel_scaled       ^ (2 * delta_v))
)

grid3d$upper <- overall_effect + k * grid3d$se_model
grid3d$lower <- overall_effect - k * grid3d$se_model


xg <- log10(n_seq) 
yg <- v_seq        

UP <- t(matrix(grid3d$upper, nrow = length(n_seq), ncol = length(v_seq)))
LO <- t(matrix(grid3d$lower, nrow = length(n_seq), ncol = length(v_seq)))
MU <- matrix(overall_effect, nrow = length(yg), ncol = length(xg))

fig8 <- plot_ly() |>
  add_markers(
    data   = subset(images, !outlier),
    x      = ~log10(number_of_subjects),
    y      = ~voxel_scaled,
    z      = ~es_mean,
    marker = list(size = 2, opacity = 0.25, color = "gray"),
    name   = "Inside funnel"
  ) |>
  add_markers(
    data   = subset(images, outlier),
    x      = ~log10(number_of_subjects),
    y      = ~voxel_scaled,
    z      = ~es_mean,
    marker = list(size = 4, color = "red"),
    name   = sprintf("Outliers (n = %d / %d)", n_out, n_tot)
  ) |>
  add_surface(
    x         = xg, y = yg, z = MU,
    opacity   = 0.35,
    colorscale = list(c(0,1), c("royalblue","royalblue")),
    showscale = FALSE,
    name      = "Overall effect (μ)"
  ) |>
  add_surface(
    x          = xg, y = yg, z = UP,
    opacity    = 0.20,
    colorscale = list(c(0,1), c("orange","orange")),
    showscale  = FALSE,
    name       = sprintf("Upper bound (μ + %d·SD)", k)
  ) |>
  add_surface(
    x          = xg, y = yg, z = LO,
    opacity    = 0.20,
    colorscale = list(c(0,1), c("orange","orange")),
    showscale  = FALSE,
    name       = sprintf("Lower bound (μ − %d·SD)", k)
  ) |>
  layout(
    title = list(
      text = paste0(
        "<b>Figure 8: Outlier Detection via Winsorized Variance-Power Meta-Analytic Model</b><br>",
        "<sup>Effect sizes flagged as outliers if |ES − μ| > ", k, "·SD(n, v); ",
        "SD modeled as a function of sample size and voxel coverage. ",
        "Red points (n = ", n_out, " of ", n_tot, " images) excluded from subsequent analyses.</sup>"
      ),
      x    = 0.5,
      xref = "paper",
      font = list(size = 13)
    ),
    scene = list(
      xaxis = list(
        title = "log\u2081\u2080(Sample size)",
        titlefont = list(size = 11)
      ),
      yaxis = list(
        title = "Voxel coverage (nonzero voxels / 1e5)",
        titlefont = list(size = 11)
      ),
      zaxis = list(
        title = "Mean effect size (ES)",
        titlefont = list(size = 11)
      ),
      camera = list(
        eye = list(x = 1.6, y = -1.6, z = 0.8)
      )
    ),
    legend = list(
      x = 0.01, y = 0.99,
      bgcolor = "rgba(255,255,255,0.7)",
      bordercolor = "gray",
      borderwidth = 1
    ),
    margin = list(t = 120)   # extra top margin for the title
  )

fig8

Figure \(\ref{fig:fig8}\) presents a three-dimensional visualization of the outlier detection procedure applied to the meta-analytic image database. Each point represents a single neuroimaging study, plotted along three axes: log-transformed sample size (\(\log_{10}(n)\)), voxel coverage scaled by \(1 \times 10^5\) (y-axis), and mean effect size (\(\bar{\theta}\), z-axis).

The blue surface represents the overall pooled effect estimate (\(\hat{\mu}\)) derived from the winsorized variance-power mixed-effects model, which accounts for heteroscedasticity in effect size variance as a joint function of sample size and voxel coverage. The orange surfaces define the upper and lower outlier boundaries, set at \(\hat{\mu} \pm k \cdot \hat{\sigma}(n, v)\) where \(k = 3\), and where \(\hat{\sigma}(n, v)\) is modeled as:

\[ \hat{\sigma}(n, v) = \sqrt{\tau^2 + \sigma^2 \cdot n^{2\delta_n} \cdot v^{2\delta_v}} \]

with \(\hat{\tau}^2 = 0.0043\), \(\hat{\sigma}^2 = 1.3651935\times 10^{4}\), \(\hat{\delta}_n = 0.022\), and \(\hat{\delta}_v = -1.271\) estimated from the data.

Gray points fall within the acceptable range and are retained for subsequent analyses. Red points (\(n = 609\) of \(21920\) total images) fall outside the funnel boundaries and are flagged as outliers, indicating that their effect sizes deviate more than \(k = 3\) standard deviations from the pooled estimate given their sample size and voxel coverage characteristics. These images are excluded from all downstream analyses.

This approach is more principled than applying a fixed effect size threshold, as it explicitly acknowledges that smaller studies and those with lower voxel coverage are expected to show greater variability under the variance-power model, and calibrates the exclusion boundaries accordingly.

5.3 Model refit with cleaned dataset

With the above exclusion, a new dataset is created with the outliers as defined from the model above excluded. The following chunk refits the model with the new subset without Windsorization. Outliers are identified using the same procedure, but not excluded.

# Summary table for the final curated dataset using table1 for consistency
table1( ~ outlier + number_of_subjects + es_mean + voxel_scaled + gm_perc + wm_perc + outside_perc + grey_mean + white_mean, 
        data = images, 
        caption = "Table 6: Summary statistics for the final curated Neurovault dataset before outlier exclusions.")

Table 6: Summary statistics for the final curated Neurovault dataset before outlier exclusions.
	Overall (N=21920)
outlier
Yes	609 (2.8%)
No	21311 (97.2%)
number_of_subjects
Mean (SD)	81.8 (240)
Median [Min, Max]	52.0 [3.00, 30000]
es_mean
Mean (SD)	0.0253 (0.845)
Median [Min, Max]	0.00391 [-52.7, 45.9]
voxel_scaled
Mean (SD)	240 (43.9)
Median [Min, Max]	228 [130, 344]
gm_perc
Mean (SD)	0.885 (0.0991)
Median [Min, Max]	0.902 [0.555, 1.00]
wm_perc
Mean (SD)	0.923 (0.0896)
Median [Min, Max]	0.944 [0.342, 1.00]
outside_perc
Mean (SD)	0.0328 (0.0381)
Median [Min, Max]	0.00999 [0, 0.152]
grey_mean
Mean (SD)	0.0282 (0.971)
Median [Min, Max]	0.00585 [-64.9, 52.4]
white_mean
Mean (SD)	0.0196 (0.763)
Median [Min, Max]	0.00191 [-43.2, 38.2]

dat = images[ images$outlier==FALSE,]
m_power <- lme(
  es_mean ~ 1,
  random  = ~ 1 | collection_id,
  weights = varComb(
  varPower(form = ~ number_of_subjects),
  varPower(form = ~ voxel_scaled)),
  data    = dat,
  method  = "REML"
)

overall_effect <- as.numeric(fixef(m_power))                
sigma2 <- sigma(m_power)^2                                   
tau2   <- as.numeric(VarCorr(m_power)[1, "Variance"])    

# varPower exponents (delta_n, delta_v)
expParams <- intervals(m_power)$varStruct[, 2]
delta_n <- expParams[1]
delta_v <- expParams[2]

dat$se_model <- sqrt(
  tau2 + sigma2 *
    (dat$number_of_subjects ^ (2*delta_n)) *
    (dat$voxel_scaled       ^ (2*delta_v))
)

# standardized residuals (optional panel A)
dat$res_std <- residuals(m_power, type = "pearson")

k <- 3
dat$outlier <- abs(dat$es_mean - overall_effect) > k * dat$se_model

# dat_plot <- subset(
#   dat,
#   is.finite(number_of_subjects) & number_of_subjects > 0 &
#     is.finite(es_mean) & is.finite(se_model)
# )
# dat_sorted <- dat_plot[order(dat_plot$number_of_subjects), ]
# dat_clean  <- subset(dat_sorted, !outlier)
# n_out <- sum(dat_plot$outlier)
# n_tot <- nrow(dat_plot)
# dat_plot$se_model <- sqrt(
#   tau2 + sigma2 *
#     (dat_plot$number_of_subjects ^ (2 * delta_n)) *
#     (dat_plot$voxel_scaled       ^ (2 * delta_v))
# )
# dat_plot$upper_bound <- overall_effect + k * dat_plot$se_model
# dat_plot$lower_bound <- overall_effect - k * dat_plot$se_model
# dat_plot$outlier <- with(dat_plot, es_mean > upper_bound | es_mean < lower_bound)

grid <- data.frame(
  number_of_subjects = seq(1, 10000, length.out = 800)
)

# choose a constant voxel_scaled for the band (median is a sensible default)
grid$voxel_scaled <- median(dat$voxel_scaled, na.rm = TRUE)

grid$se_model <- sqrt(
  tau2 + sigma2 *
    (grid$number_of_subjects ^ (2*delta_n)) *
    (grid$voxel_scaled       ^ (2*delta_v))
)

grid$upper <- overall_effect + k * grid$se_model
grid$lower <- overall_effect - k * grid$se_model
grid$log_n <- log10(grid$number_of_subjects)

library(plotly)


n_seq <- 10^seq(0, 4, length.out = 60)
v_seq <- seq(
  quantile(dat$voxel_scaled, 0.05, na.rm = TRUE),
  quantile(dat$voxel_scaled, 0.95, na.rm = TRUE),
  length.out = 60
)

grid3d <- expand.grid(number_of_subjects = n_seq, voxel_scaled = v_seq)

grid3d$se_model <- sqrt(
  tau2 + sigma2 *
    (grid3d$number_of_subjects ^ (2 * delta_n)) *
    (grid3d$voxel_scaled       ^ (2 * delta_v))
)

grid3d$upper <- overall_effect + k * grid3d$se_model
grid3d$lower <- overall_effect - k * grid3d$se_model


xg <- log10(n_seq) 
yg <- v_seq        

UP <- t(matrix(grid3d$upper, nrow = length(n_seq), ncol = length(v_seq)))
LO <- t(matrix(grid3d$lower, nrow = length(n_seq), ncol = length(v_seq)))
MU <- matrix(overall_effect, nrow = length(yg), ncol = length(xg))

n_out_refit <- sum(dat$outlier)
n_tot_refit <- nrow(dat)

fig9 <- plot_ly() |>
  add_markers(
    data   = subset(dat, !outlier),
    x      = ~log10(number_of_subjects),
    y      = ~voxel_scaled,
    z      = ~es_mean,
    marker = list(size = 2, opacity = 0.25, color = "gray"),
    name   = "Inside funnel"
  ) |>
  add_markers(
    data   = subset(dat, outlier),
    x      = ~log10(number_of_subjects),
    y      = ~voxel_scaled,
    z      = ~es_mean,
    marker = list(size = 4, color = "red"),
    name   = sprintf("Outliers (n = %d / %d)", n_out_refit, n_tot_refit)
  ) |>
  add_surface(
    x          = xg, y = yg, z = MU,
    opacity    = 0.35,
    colorscale = list(c(0,1), c("royalblue","royalblue")),
    showscale  = FALSE,
    name       = "Overall effect (μ)"
  ) |>
  add_surface(
    x          = xg, y = yg, z = UP,
    opacity    = 0.20,
    colorscale = list(c(0,1), c("orange","orange")),
    showscale  = FALSE,
    name       = sprintf("Upper bound (μ + %d·SD)", k)
  ) |>
  add_surface(
    x          = xg, y = yg, z = LO,
    opacity    = 0.20,
    colorscale = list(c(0,1), c("orange","orange")),
    showscale  = FALSE,
    name       = sprintf("Lower bound (μ − %d·SD)", k)
  ) |>
  layout(
    title = list(
      text = paste0(
        "<b>Figure 9: Refitted Variance-Power Model After Outlier Exclusion</b><br>",
        "<sup>Model refit on cleaned dataset (n = ", n_tot_refit, " images). ",
        "Remaining outliers flagged in red (n = ", n_out_refit, "). ",
        "Final parameters: μ = ", round(overall_effect, 4),
        ", τ² = ", round(tau2, 4),
        ", σ² = ", round(sigma2, 4),
        ", δ_n = ", round(delta_n, 3),
        ", δ_v = ", round(delta_v, 3), ".</sup>"
      ),
      x    = 0.5,
      xref = "paper",
      font = list(size = 13)
    ),
    scene = list(
      xaxis = list(title = "log\u2081\u2080(Sample size)",       titlefont = list(size = 11)),
      yaxis = list(title = "Voxel coverage (nonzero voxels / 1e5)", titlefont = list(size = 11)),
      zaxis = list(title = "Mean effect size (ES)",              titlefont = list(size = 11))
    ),
    legend = list(
      x = 0.01, y = 0.99,
      bgcolor     = "rgba(255,255,255,0.7)",
      bordercolor = "gray",
      borderwidth = 1
    ),
    margin = list(t = 120)
  )

fig9

images <- dat
saveRDS(images, get_monthly_rds_path('curatedMetadataset'))
cat(sprintf("Saved curated dataset: %d images retained\n", nrow(images)))

## Saved curated dataset: 21311 images retained

6 Table Summary of the Three LMMs

To evaluate the stability of parameter estimates across data curation steps, we fit three variants of the same variance-power linear mixed-effects model:

Original: fit on the full raw dataset prior to any quality-based exclusion
Winsorized: fit after winsorizing extreme effect sizes to the 1st and 99th percentiles, used as the reference model for outlier boundary construction
Cleaned: refit on the final curated dataset after removing images flagged as outliers by the winsorized model

The model specification is identical across all three fits:

\[ \bar{\theta}_{ij} = \mu + b_i + \varepsilon_{ij}, \quad b_i \sim \mathcal{N}(0, \tau^2), \quad \varepsilon_{ij} \sim \mathcal{N}\!\left(0,\; \sigma^2 \cdot n_{ij}^{2\delta_n} \cdot v_{ij}^{2\delta_v}\right) \]

where \(\mu\) is the overall pooled effect, \(b_i\) is the collection-level random intercept, and the residual variance is modeled as a power function of sample size \(n_{ij}\) and voxel coverage \(v_{ij}\).

Table \(\ref{tab:tablesummary}\) presents the fixed effect intercept (\(\hat{\mu}\)), random effect standard deviation (\(\hat{\tau}\)), residual standard deviation (\(\hat{\sigma}\)), and the two variance-power exponents (\(\hat{\delta}_n\), \(\hat{\delta}_v\)) across all three model fits.

Across the three models, the fixed effect estimate \(\hat{\mu}\) is expected to remain stable if the overall pooled effect is robust to outliers and winsorization. Changes in \(\hat{\tau}\) across models reflect shifts in between-collection heterogeneity as extreme images are progressively removed. Similarly, reductions in \(\hat{\sigma}\) from the original to the cleaned model indicate that outlier exclusion successfully reduced residual noise. The variance-power exponents \(\hat{\delta}_n\) and \(\hat{\delta}_v\) characterize how strongly effect size variability depends on sample size and voxel coverage respectively — values close to zero indicate near-homoscedastic residuals, while larger values indicate stronger heteroscedasticity.

library(dplyr)
library(knitr)
library(kableExtra)
library(nlme)
library(tidyr)

# Extract fixed and random effects parameters from the three models
extract_model_params <- function(fit, model_name) {
  fixef_val <- as.numeric(fixef(fit))
  tau2 <- as.numeric(VarCorr(fit)[1, "Variance"])
  sigma2 <- sigma(fit)^2
  
  expParams <- intervals(fit)$varStruct[, 2]
  delta_n <- expParams[1]
  delta_v <- expParams[2]
  
  # Convert variances to standard deviations
  tau_sd <- sqrt(tau2)
  sigma_sd <- sqrt(sigma2)
  
  data.frame(
    Model = model_name,
    Parameter = c("Fixed effect (intercept)", "Random effect SD (tau)", "Residual SD (sigma)", "delta_n", "delta_v"),
    Estimate = c(fixef_val, tau_sd, sigma_sd, delta_n, delta_v),
    stringsAsFactors = FALSE
  )
}

# Extract parameters from all three models
params_combined <- bind_rows(
  extract_model_params(m_power_original, "Original"),
  extract_model_params(m_power_winsorized, "Winsorized"),
  extract_model_params(m_power, "Cleaned")
)

# Format the table
tab_display <- params_combined %>%
  pivot_wider(
    names_from = Model,
    values_from = Estimate
  ) %>%
  mutate(
    across(c(Original, Winsorized, Cleaned), function(x) {
      case_when(
        grepl("Fixed effect", Parameter) ~ sprintf("%.3f", x),
        grepl("Residual SD", Parameter) ~ sprintf("%.2e", x),
        grepl("delta", Parameter) ~ sprintf("%.3f", x),
        TRUE ~ sprintf("%.4f", x)
      )
    })
  )

kable(
  tab_display,
  format = "html",
  escape = FALSE,
  col.names = c("Parameter", "Original", "Winsorized", "Cleaned"),
  caption = "Fixed and random effects parameter estimates from the three mixed-effects models."
) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover"),
    full_width = FALSE,
    position = "center"
  )

Fixed and random effects parameter estimates from the three mixed-effects models.
Parameter	Original	Winsorized	Cleaned
Fixed effect (intercept)	0.018	0.002	0.002
Random effect SD (tau)	0.5331	0.0657	0.0413
Residual SD (sigma)	6.12e+10	1.17e+02	1.27e+03
delta_n	-0.891	0.022	0.024
delta_v	-4.099	-1.271	-1.737

7 Summary of the curated Neurovault dataset

Following a three-stage quality control pipeline, the final curated dataset comprises \(21311\) neuroimaging maps retained for downstream analysis.

The first stage examined tabular image properties available in the NeuroVault metadata, including sample size, voxel dimensions, and image type, removing maps that were duplicated, missing key descriptors, or outside the scope of the analysis (Section 3).

The second stage assessed spatial coverage by loading each image directly and computing gray matter coverage (\(88.5\%\) mean retained), white matter coverage (\(92.3\%\) mean retained), and outside-mask signal (\(3.3\%\) mean retained). Images with insufficient brain coverage or excessive non-brain signal were excluded at this stage (Section 4).

The third stage applied a model-based outlier detection procedure using a variance-power linear mixed-effects model fit on winsorized effect sizes. Images whose effect sizes deviated more than \(k = 3\) standard deviations from the pooled estimate — after accounting for heteroscedasticity as a function of sample size and voxel coverage — were flagged and removed (Section 5).

Table \(\ref{tab:curation_summary}\) summarizes the key image-level characteristics of the curated dataset. The distribution of sample sizes (\(n\)) reflects the typical range seen in neuroimaging meta-analyses, with substantial right skew driven by a small number of large studies. Mean effect size (\(\bar{\theta}\)) and voxel coverage metrics (gray matter percentage, white matter percentage, and outside-mask signal) characterize the spatial quality of the retained images.

Gray matter coverage (\(88.5\%\) mean) and white matter coverage (\(92.3\%\) mean) confirm that retained images provide adequate brain signal. The outside-mask percentage (\(3.3\%\) mean) confirms that non-brain signal is minimal across the curated set, consistent with successful skull-stripping and registration.

# Summary table for the final curated dataset using table1 for consistency
table1( ~ number_of_subjects + es_mean + voxel_scaled + gm_perc + wm_perc + outside_perc + grey_mean + white_mean, 
        data = images, 
        caption = "Table 6: Summary statistics for the final curated Neurovault dataset after outlier exclusions.")

Table 6: Summary statistics for the final curated Neurovault dataset after outlier exclusions.
	Overall (N=21311)
number_of_subjects
Mean (SD)	82.5 (243)
Median [Min, Max]	54.0 [3.00, 30000]
es_mean
Mean (SD)	0.0104 (0.120)
Median [Min, Max]	0.00340 [-0.576, 0.551]
voxel_scaled
Mean (SD)	240 (43.9)
Median [Min, Max]	228 [130, 343]
gm_perc
Mean (SD)	0.885 (0.0993)
Median [Min, Max]	0.901 [0.557, 1.00]
wm_perc
Mean (SD)	0.923 (0.0892)
Median [Min, Max]	0.942 [0.342, 1.00]
outside_perc
Mean (SD)	0.0327 (0.0379)
Median [Min, Max]	0.00987 [0, 0.151]
grey_mean
Mean (SD)	0.0124 (0.130)
Median [Min, Max]	0.00512 [-0.683, 0.613]
white_mean
Mean (SD)	0.00544 (0.123)
Median [Min, Max]	0.00152 [-0.823, 0.625]

## Archive complete: /media/big/neurovault_effectsizes_masked.tar

## Files archived: 21311

file.remove(file.path(rds_dir, "zenodo_record_id.rds"))

## [1] TRUE

library(zen4R)

# ── 1. Prepare files ──────────────────────────────────────────────────────────
current_rds <- get_monthly_rds_path("curatedMetadataset")
if (!file.exists(current_rds)) stop("Current month RDS not found: ", current_rds)
cat("RDS file:", current_rds, "\n")

## RDS file: /media/disk2/ree/metadataset/rds/curatedMetadataset_2026-05.rds

# ── 2. Create tar archive of brain-masked effect size images ──────────────────
outdir    <- "/media/big/neurovault"
archive   <- "/media/big/neurovault_effectsizes_masked.tar"

img_paths <- images$brain_file
img_paths <- img_paths[!is.na(img_paths) & nzchar(img_paths)]
img_paths <- img_paths[file.exists(img_paths)]

outdir_abs <- normalizePath(outdir)
img_abs    <- normalizePath(img_paths)
inside     <- startsWith(img_abs, paste0(outdir_abs, .Platform$file.sep))
img_abs    <- img_abs[inside]

rel_paths <- sub(
  paste0("^", outdir_abs, .Platform$file.sep),
  "",
  img_abs
)

cat("Files to archive:", length(rel_paths), "\n")

## Files to archive: 21311

if (!file.exists(archive)) {
  old <- setwd(outdir)
  on.exit(setwd(old), add = TRUE)
  utils::tar(
    tarfile     = archive,
    files       = rel_paths,
    compression = "none"
  )
  cat("Archive complete:", archive, "\n")
} else {
  cat("Archive already exists, skipping tar creation:", archive, "\n")
}

## Archive already exists, skipping tar creation: /media/big/neurovault_effectsizes_masked.tar

cat("Archive size:", round(file.size(archive) / 1e9, 2), "GB\n")

## Archive size: 0 GB

# ── 3. Connect to Zenodo ──────────────────────────────────────────────────────
token_path <- file.path(rds_dir, "zenodo_token.rds")
if (!file.exists(token_path)) stop("zenodo_token.rds not found in ", rds_dir)
token  <- readRDS(token_path)
zenodo <- ZenodoManager$new(token = token, logger = "INFO")

## ✔ Successfully connected to Zenodo with user token

## [zen4R][INFO] ZenodoManager - Successfully connected to Zenodo with user token

# ── 4. Fetch or create record ─────────────────────────────────────────────────
record_id_path <- file.path(rds_dir, "zenodo_record_id.rds")

if (!file.exists(record_id_path)) {

  cat("No existing record — creating new\n")
  myrec <- zenodo$createEmptyRecord()

  myrec$setTitle("A Curated Meta-analytic Neuroimaging Dataset from NeuroVault")
  myrec$setDescription("Monthly-updated curated neuroimaging meta-analytic dataset derived from NeuroVault. Contains tabular metadata and brain-masked RESI effect size images.")
  myrec$addCreator(firstname = "Haozheng", lastname = "Xu",
                   affiliation = "Vanderbilt University")
  myrec$addCreator(firstname = "Simon", lastname = "Vandekar",
                   affiliation = "Vanderbilt University")
  myrec$setResourceType("dataset")
  myrec$setPublicationDate(format(Sys.Date(), "%Y-%m-%d"))
  myrec$setPublisher("Vanderbilt University Medical Center")
  myrec$setSubjects(list(
    list(subject = "neuroimaging"),
    list(subject = "meta-analysis"),
    list(subject = "NeuroVault"),
    list(subject = "curated dataset")
  ))

  myrec <- zenodo$depositRecord(myrec, publish = FALSE)

  deposit_id <- as.character(myrec$id)
  if (is.null(deposit_id) || deposit_id == "NULL") {
    self_link  <- myrec$links$self
    deposit_id <- gsub(".*/records/([0-9]+)/.*", "\\1", self_link)
  }

  cat("New record created. Deposit ID:", deposit_id, "\n")
  saveRDS(deposit_id, record_id_path)

} else {

  deposit_id <- readRDS(record_id_path)
  cat("Found saved deposit ID:", deposit_id, "\n")

  myrec <- tryCatch(
    zenodo$getDepositionById(deposit_id),
    error = function(e) NULL
  )

  if (is.null(myrec)) {
    deps  <- zenodo$getDepositions()
    myrec <- Filter(function(x) as.character(x$id) == deposit_id, deps)
    myrec <- if (length(myrec) > 0) myrec[[1]] else NULL
  }

  if (is.null(myrec)) stop("Could not find record with ID: ", deposit_id)
}

## No existing record — creating new

## ✔ Successful record deposition

## [zen4R][INFO] ZenodoManager - Successful record deposition

## ✔ Successful reserved DOI for record 20292763

## [zen4R][INFO] ZenodoManager - Successful reserved DOI for record 20292763

## ✔ Successfully fetched list of affiliations!

## ✔ Successfully fetched list of affiliations!

## ✔ Successfully fetched resourcetype 'dataset'

## ✔ Successful record deposition

## [zen4R][INFO] ZenodoManager - Successful record deposition

## ! Existing DOI (10.5281/zenodo.20292763) for record 20292763. Aborting DOI reservation!

## [zen4R][WARN] ZenodoManager - Existing DOI (10.5281/zenodo.20292763) for record 20292763. Aborting DOI reservation! 
## New record created. Deposit ID: 20292763

# ── 5. Determine record status ────────────────────────────────────────────────
record_status <- if (!is.null(myrec$state)  && length(myrec$state)  > 0) myrec$state  else
                 if (!is.null(myrec$status) && length(myrec$status) > 0) myrec$status else
                 "unsubmitted"

cat("Record status:", record_status, "\n")

## Record status: draft

is_draft <- !record_status %in% c("done", "published")

# ── Helper: upload both files ─────────────────────────────────────────────────
upload_files <- function(rec) {
  # Delete old RDS if present
  existing_files <- tryCatch(zenodo$getFiles(rec$id), error = function(e) list())
  if (length(existing_files) > 0) {
    old_rds <- Filter(function(f) {
      is.list(f) && !is.null(f$filename) &&
        grepl("^curatedMetadataset_[0-9]{4}-[0-9]{2}\\.rds$", f$filename)
    }, existing_files)
    invisible(lapply(old_rds, function(f) {
      zenodo$deleteFile(rec$id, f$id)
      cat("Deleted old file:", f$filename, "\n")
    }))
  }

  # Upload tabular RDS
  zenodo$uploadFile(current_rds, rec)
  cat("Uploaded RDS:", basename(current_rds), "\n")

  # Upload tar archive
  if (file.exists(archive)) {
    zenodo$uploadFile(archive, rec)
    cat("Uploaded archive:", basename(archive), "\n")
  } else {
    warning("Tar archive not found, skipping: ", archive)
  }
}

# ── 6. Upload and publish ─────────────────────────────────────────────────────
if (is_draft) {

  cat("Record is a draft — uploading directly\n")
  upload_files(myrec)

  myrec$setTitle("A Curated Meta-analytic Neuroimaging Dataset from NeuroVault")
  myrec$setDescription("Monthly-updated curated neuroimaging meta-analytic dataset derived from NeuroVault. Contains tabular metadata and brain-masked RESI effect size images.")
  myrec$setResourceType("dataset")
  myrec$setPublicationDate(format(Sys.Date(), "%Y-%m-%d"))
  myrec$setPublisher("Vanderbilt University Medical Center")
  myrec <- zenodo$depositRecord(myrec, publish = FALSE)

  myrec <- zenodo$publishRecord(myrec$id)
  cat("Published! DOI:", myrec$doi, "\n")
  saveRDS(as.character(myrec$id), record_id_path)

} else {

  cat("Record is published — creating new version\n")
  myrec_new <- zenodo$newRecord(myrec$id)
  if (is.null(myrec_new)) stop("Failed to create new version")
  cat("New version draft ID:", myrec_new$id, "\n")

  upload_files(myrec_new)

  myrec_new$setTitle("A Curated Meta-analytic Neuroimaging Dataset from NeuroVault")
  myrec_new$setPublicationDate(format(Sys.Date(), "%Y-%m-%d"))
  myrec_new$setPublisher("Vanderbilt University Medical Center")
  myrec_new <- zenodo$depositRecord(myrec_new, publish = FALSE)

  myrec_new <- zenodo$publishRecord(myrec_new$id)
  cat("New version published! DOI:", myrec_new$doi, "\n")
  saveRDS(as.character(myrec_new$id), record_id_path)
}

## Record is a draft — uploading directly
## [zen4R][INFO] ZenodoRequest - Fetching https://zenodo.org/api/records/20292763/draft/files

## ✔ Successful fetched file(s) for record '20292763'

## [zen4R][INFO] ZenodoManager - Successful fetched file(s) for record '20292763'

## ! No bucket link for record id = 20292763.

## [zen4R][WARN] ZenodoManager - No bucket link for record id = 20292763. 
## [zen4R][INFO] ZenodoManager - Start upload procedure for file '/media/disk2/ree/metadataset/rds/curatedMetadataset_2026-05.rds'

## ✔ Successfully started upload procedure for file '/media/disk2/ree/metadataset/rds/curatedMetadataset_2026-05.rds'

## [zen4R][INFO] ZenodoManager - Successfully started upload procedure for file '/media/disk2/ree/metadataset/rds/curatedMetadataset_2026-05.rds' 
##   |                                                                              |                                                                      |   0%  |                                                                              |=                                                                     |   1%  |                                                                              |==                                                                    |   2%  |                                                                              |==                                                                    |   4%  |                                                                              |===                                                                   |   5%  |                                                                              |====                                                                  |   6%  |                                                                              |=====                                                                 |   7%  |                                                                              |======                                                                |   8%  |                                                                              |=======                                                               |   9%  |                                                                              |=======                                                               |  11%  |                                                                              |========                                                              |  12%  |                                                                              |=========                                                             |  13%  |                                                                              |==========                                                            |  14%  |                                                                              |===========                                                           |  15%  |                                                                              |===========                                                           |  16%  |                                                                              |============                                                          |  18%  |                                                                              |=============                                                         |  19%  |                                                                              |==============                                                        |  20%  |                                                                              |===============                                                       |  21%  |                                                                              |================                                                      |  22%  |                                                                              |================                                                      |  23%  |                                                                              |=================                                                     |  25%  |                                                                              |==================                                                    |  26%  |                                                                              |===================                                                   |  27%  |                                                                              |====================                                                  |  28%  |                                                                              |====================                                                  |  29%  |                                                                              |=====================                                                 |  30%  |                                                                              |======================                                                |  32%  |                                                                              |=======================                                               |  33%  |                                                                              |========================                                              |  34%  |                                                                              |=========================                                             |  35%  |                                                                              |=========================                                             |  36%  |                                                                              |==========================                                            |  37%  |                                                                              |===========================                                           |  39%  |                                                                              |============================                                          |  40%  |                                                                              |=============================                                         |  41%  |                                                                              |=============================                                         |  42%  |                                                                              |==============================                                        |  43%  |                                                                              |===============================                                       |  44%  |                                                                              |================================                                      |  46%  |                                                                              |=================================                                     |  47%  |                                                                              |==================================                                    |  48%  |                                                                              |==================================                                    |  49%  |                                                                              |===================================                                   |  50%  |                                                                              |====================================                                  |  51%  |                                                                              |=====================================                                 |  53%  |                                                                              |======================================                                |  54%  |                                                                              |======================================                                |  55%  |                                                                              |=======================================                               |  56%  |                                                                              |========================================                              |  57%  |                                                                              |=========================================                             |  58%  |                                                                              |==========================================                            |  60%  |                                                                              |===========================================                           |  61%  |                                                                              |===========================================                           |  62%  |                                                                              |============================================                          |  63%  |                                                                              |=============================================                         |  64%  |                                                                              |==============================================                        |  65%  |                                                                              |===============================================                       |  67%  |                                                                              |===============================================                       |  68%  |                                                                              |================================================                      |  69%  |                                                                              |=================================================                     |  70%  |                                                                              |==================================================                    |  71%  |                                                                              |===================================================                   |  72%  |                                                                              |====================================================                  |  74%  |                                                                              |====================================================                  |  75%  |                                                                              |=====================================================                 |  76%  |                                                                              |======================================================                |  77%  |                                                                              |=======================================================               |  78%  |                                                                              |========================================================              |  79%  |                                                                              |========================================================              |  81%  |                                                                              |=========================================================             |  82%  |                                                                              |==========================================================            |  83%  |                                                                              |===========================================================           |  84%  |                                                                              |============================================================          |  85%  |                                                                              |=============================================================         |  86%  |                                                                              |=============================================================         |  88%  |                                                                              |==============================================================        |  89%  |                                                                              |===============================================================       |  90%  |                                                                              |================================================================      |  91%  |                                                                              |=================================================================     |  92%  |                                                                              |=================================================================     |  94%  |                                                                              |==================================================================    |  95%  |                                                                              |===================================================================   |  96%  |                                                                              |====================================================================  |  97%  |                                                                              |===================================================================== |  98%  |                                                                              |======================================================================|  99%  |                                                                              |======================================================================| 100%

## ✖

## ℹ Complete upload procedure for file '/media/disk2/ree/metadataset/rds/curatedMetadataset_2026-05.rds'

## [zen4R][INFO] ZenodoManager - Complete upload procedure for file '/media/disk2/ree/metadataset/rds/curatedMetadataset_2026-05.rds'

## ✔ Successfully completed upload procedure for file '/media/disk2/ree/metadataset/rds/curatedMetadataset_2026-05.rds'

## [zen4R][INFO] ZenodoManager - Successfully completed upload procedure for file '/media/disk2/ree/metadataset/rds/curatedMetadataset_2026-05.rds' 
## Uploaded RDS: curatedMetadataset_2026-05.rds

## ! No bucket link for record id = 20292763.

## [zen4R][WARN] ZenodoManager - No bucket link for record id = 20292763. 
## [zen4R][INFO] ZenodoManager - Start upload procedure for file '/media/big/neurovault_effectsizes_masked.tar'

## ✔ Successfully started upload procedure for file '/media/big/neurovault_effectsizes_masked.tar'

## [zen4R][INFO] ZenodoManager - Successfully started upload procedure for file '/media/big/neurovault_effectsizes_masked.tar' 
##   |                                                                              |                                                                      |   0%  |                                                                              |======================================================================| 100%

## ✖

## ℹ Complete upload procedure for file '/media/big/neurovault_effectsizes_masked.tar'

## [zen4R][INFO] ZenodoManager - Complete upload procedure for file '/media/big/neurovault_effectsizes_masked.tar'

## ✔ Successfully completed upload procedure for file '/media/big/neurovault_effectsizes_masked.tar'

## [zen4R][INFO] ZenodoManager - Successfully completed upload procedure for file '/media/big/neurovault_effectsizes_masked.tar' 
## Uploaded archive: neurovault_effectsizes_masked.tar

## ✔ Successfully fetched resourcetype 'dataset'

## ✔ Successful record deposition

## [zen4R][INFO] ZenodoManager - Successful record deposition

## ! Existing DOI (10.5281/zenodo.20292763) for record 20292763. Aborting DOI reservation!

## [zen4R][WARN] ZenodoManager - Existing DOI (10.5281/zenodo.20292763) for record 20292763. Aborting DOI reservation!

## ✔ Successful published record '20292763'

## [zen4R][INFO] ZenodoManager - Successful published record '20292763' 
## Published! DOI:

8 Appendix

8.1 Collection-level variables

The following variables are available at the collection level in the NeuroVault metadata:

id, url, download_url, owner, contributors, owner_name, number_of_images, name, publication_status, DOI, preprint_DOI, authors, paper_url, journal_name, description, full_dataset_url, private, add_date, modify_date, doi_add_date, type_of_design, number_of_imaging_runs, number_of_experimental_units, length_of_runs, length_of_blocks, length_of_trials, optimization, optimization_method, subject_age_mean, subject_age_min, subject_age_max, handedness, proportion_male_subjects, inclusion_exclusion_criteria, number_of_rejected_subjects, group_comparison, group_description, scanner_make, scanner_model, field_strength, pulse_sequence, parallel_imaging, field_of_view, matrix_size, slice_thickness, skip_distance, acquisition_orientation, order_of_acquisition, repetition_time, echo_time, flip_angle, software_package, software_version, order_of_preprocessing_operations, quality_control, used_b0_unwarping, b0_unwarping_software, used_slice_timing_correction, slice_timing_correction_software, used_motion_correction, motion_correction_software, motion_correction_reference, motion_correction_metric, motion_correction_interpolation, used_motion_susceptibiity_correction, used_intersubject_registration, intersubject_registration_software, intersubject_transformation_type, nonlinear_transform_type, transform_similarity_metric, interpolation_method, object_image_type, functional_coregistered_to_structural, functional_coregistration_method, coordinate_space, target_template_image, target_resolution, used_smoothing, smoothing_type, smoothing_fwhm, resampled_voxel_size, intrasubject_model_type, intrasubject_estimation_type, intrasubject_modeling_software, hemodynamic_response_function, used_temporal_derivatives, used_dispersion_derivatives, used_motion_regressors, used_reaction_time_regressor, used_orthogonalization, orthogonalization_description, used_high_pass_filter, high_pass_filter_method, autocorrelation_model, group_model_type, group_estimation_type, group_modeling_software, group_inference_type, group_model_multilevel, group_repeated_measures, group_repeated_measures_method, nutbrain_hunger_state, nutbrain_food_viewing_conditions, nutbrain_food_choice_type, nutbrain_taste_conditions, nutbrain_odor_conditions, communities

8.2 Image-level variables

The following variables are available at the image level after preprocessing:

add_date, collection, collection_id, description, file, file_size, id, image_type, is_valid, modify_date, subject_species, target_template_image, url, name, brain_coverage, map_type, not_mni, perc_bad_voxels, data_origin, perc_voxels_outside, modality, is_thresholded, cognitive_paradigm_cogatlas, cognitive_paradigm_cogatlas_id, analysis_level, number_of_subjects, thumbnail, surface_left_file, surface_right_file, binarized_group_level, binarized_map_type, unthresholded, mni, nonzero_sample, localfile, download, voxdim, imgdim, nvox, orient, range, bounding_box, localfile_base, duplicated, dim_mm, log10_ratio_xy, log10_ratio_xz, log10_ratio_yz, ratio_x, ratio_y, ratio_z, is_proportional, range_low, range_high, range_diff, range_exclusion, localfileadj, resliced, nvox_new, voxdim_new, imgdim_new, orient_new, range_new, nvox_outside, nvox_gm, nvox_wm, q90_outside, gm_perc, wm_perc, outside_perc, gm_exclude, wm_exclude, nvox_outside_exclude, effect_size_file, brain_file, es_min, es_1qt, es_median, es_3qt, es_max, es_mean, grey_mean, grey_min, grey_max, white_mean, white_min, white_max, es_mean_winsorized, voxel_scaled, se_model, res_std, outlier

8.3 Data Availability and Access

8.3.1 Curated Dataset

The curated NeuroVault dataset produced by this pipeline is publicly available on Zenodo and updated on the first of each month. Each monthly release contains:

Tabular metadata (curatedMetadataset_YYYY-MM.rds) — an R data frame with one row per retained image, containing all image-level metadata, coverage statistics, and effect size summaries computed during preprocessing.
Registered effect size images — brain-registered NIfTI images converted to the robust effect size index (RESI), stored as *_rigid_cost_step1_effect_size_masked_mni.nii.gz.

The dataset can be downloaded directly from Zenodo at: DOI pending publication

8.3.2 Image Naming Convention

Images in this pipeline pass through four processing stages, each appending a suffix to the original filename:

Table A1: Image file naming convention across preprocessing stages.
Stage	Column in dataset	File suffix	Description
Downloaded	localfile	.nii.gz	Original image downloaded from NeuroVault
NaN/epsilon cleaned	localfileadj	_eps.nii.gz	NaN replaced with 0; near-zero voxels (\|v\|
Rigid registration	resliced	_rigid_cost_step1.nii.gz	Registered to MNI 2mm template using FSL FLIRT (DOF=6, mutual information)
Effect size	effect_size_file	_rigid_cost_step1_effect_size.nii.gz	T/Z statistics converted to RESI effect sizes
Brain-masked effect size	brain_file	_rigid_cost_step1_effect_size_masked_mni.nii.gz	Effect size image masked to MNI brain template

All files are stored under /media/big/neurovault/{collection_id}/ on the processing server, mirroring the NeuroVault collection structure. The tar archive uploaded to Zenodo (neurovault_subset.tar) contains the registered images (resliced) for all retained collections.

8.3.3 Reproducing the Pipeline

This document is re-rendered automatically on the first of each month. To reproduce the full pipeline locally:

Clone the repository and set rds_dir and outdir to your local paths
Ensure FSL (flirt) and Convert3D (c3d) are installed and on PATH
Render the Rmd — each chunk checks for an existing monthly RDS before rerunning, so previously completed steps are skipped automatically
The final curated dataset is saved to {rds_dir}/curatedMetadataset_YYYY-MM.rds and uploaded to Zenodo

Table A2: Pipeline steps, outputs, and approximate runtimes on 24 cores.
Section	Step	Output RDS / file	Approx. runtime
1	Metadata download	neurovaultMetadata_YYYY-MM.rds	~30 min
2	Image download	localfile (NIfTI on disk)	~2 hours
3	Image-derived exclusions	imageMetadataWithDimensions_YYYY-MM.rds	~1 hour
4	Image value processing	localfileadj (*_eps.nii.gz)	~30 min
5	Registration	imageMetadataRegistered_YYYY-MM.rds	~4 hours
6	Effect size estimation	imageMetadataEffectSizes_YYYY-MM.rds	~6 hours

8.4 Session information

sessionInfo()

## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.5 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/Chicago
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] grid      parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] zen4R_0.10.4       callr_3.7.6        gridExtra_2.3      sandwich_3.1-1    
##  [5] lmtest_0.9-40      zoo_1.8-12         papayaWidget_0.7.1 kableExtra_1.3.4  
##  [9] knitr_1.48         plotly_4.10.4      nlme_3.1-166       reactable_0.4.4   
## [13] stringr_1.6.0      RESI_1.2.4         pbapply_1.7-2      rlang_1.1.7       
## [17] scales_1.3.0       cowplot_1.1.3      gridGraphics_0.5-1 neurobase_1.32.4  
## [21] oro.nifti_0.11.4   forcats_1.0.0      tidyr_1.3.1        ComplexUpset_1.3.3
## [25] jsonlite_2.0.0     httr_1.4.8         GGally_2.2.1       ggplot2_3.5.1     
## [29] table1_1.4.3       dplyr_1.1.4        abind_1.4-8        pbj_0.1.6         
## [33] RNifti_1.7.0      
## 
## loaded via a namespace (and not attached):
##  [1] bitops_1.0-8         polynom_1.4-1        magrittr_2.0.4      
##  [4] multcomp_1.4-26      matrixStats_1.2.0    compiler_4.4.1      
##  [7] png_0.1-8            systemfonts_1.1.0    vctrs_0.7.1         
## [10] rvest_1.0.5          pkgconfig_2.0.3      fastmap_1.2.0       
## [13] labeling_0.4.3       utf8_1.2.6           rmarkdown_2.28      
## [16] ps_1.8.1             purrr_1.0.2          xfun_0.47           
## [19] cachem_1.1.0         PDQutils_0.1.6       clubSandwich_0.5.11 
## [22] highr_0.11           R6_2.6.1             bslib_0.8.0         
## [25] stringi_1.8.7        RColorBrewer_1.1-3   car_3.1-3           
## [28] boot_1.3-31          jquerylib_0.1.4      estimability_1.5.1  
## [31] Rcpp_1.0.13-1        assertthat_0.2.1     base64enc_0.1-3     
## [34] R.utils_2.12.3       Matrix_1.6-3         splines_4.4.1       
## [37] tidyselect_1.2.1     rstudioapi_0.16.0    yaml_2.3.10         
## [40] codetools_0.2-20     curl_7.0.0           processx_3.8.4      
## [43] lattice_0.22-6       tibble_3.3.1         plyr_1.8.9          
## [46] withr_3.0.2          coda_0.19-4.1        evaluate_0.24.0     
## [49] orthopolynom_1.0-6.1 moments_0.14.1       survival_3.7-0      
## [52] ggstats_0.8.0        xml2_1.5.2           pillar_1.11.1       
## [55] carData_3.0-5        generics_0.1.3       munsell_0.5.1       
## [58] aod_1.3.3            xtable_1.8-4         glue_1.8.0          
## [61] emmeans_1.10.4       lazyeval_0.2.2       tools_4.4.1         
## [64] data.table_1.16.0    webshot_0.5.5        XML_3.99-0.20       
## [67] mvtnorm_1.3-1        crosstalk_1.2.1      colorspace_2.1-1    
## [70] patchwork_1.3.0      pTFCE_0.2.2.1        mmand_1.6.3         
## [73] Formula_1.2-5        cli_3.6.5            viridisLite_0.4.2   
## [76] keyring_1.4.1        svglite_2.1.1        gtable_0.3.6        
## [79] reactR_0.6.1         R.methodsS3_1.8.2    sass_0.4.9          
## [82] digest_0.6.37        TH.data_1.1-2        htmlwidgets_1.6.4   
## [85] farver_2.1.2         htmltools_0.5.8.1    R.oo_1.26.0         
## [88] lifecycle_1.0.5      mime_0.13            MASS_7.3-61

Neurovault Image Preprocessing Pipeline

Haozheng Xu, Simon Vandekar

2026-05-19