NeuroVault, is an open repository for sharing statistical maps, parcellations, and atlases of the human brain. Scientists can upload the brain images that they use for their studies to the online repository, to allow for interactive visualization of images while also contributing data entries for meta-analysis.
The images in Neurovault are highly heterogeneous and susceptible to human reporting errors in manually enter image metadata. This document automatically curates the Neurovault repository to produce a final dataset that is amenable to voxel-level meta-analysis. Here, we develop and run a preprocessing pipeline to ensure the Neurovault repository entries correspond to group-level statistical images and that images are registered to a common template (MNI 2mm). Details of preprocessing are described below. As a final step, we convert the images to effect sizes (the robust effect size index; RESI) and perform statistical analysis to remove outliers.
NeuroVault is living growing database of images. This preprocessing pipeline is automatically run on the first of each month to build a comprehensive preprocessed dataset that is freely available for download. The curated dataset is available on Zenodo. In our preprint, we demonstrate the utility of the curated dataset through unsupervised and supervised meta-analyses. This document and the dataset were last updated on 2026-05-19.
The Neurovault metadata include manually entered information at the study- and image-level describing the type of data that were uploaded, including:
analysis_level: indicates whether the image is a
group-level or individual-level statistical map.not_mni: indicates whether the image is in MNI
space.is_thresholded: indicates whether the image is
thresholded or unthresholded.map_type: indicates the statistical type of the image
(e.g., T, Z, F).modality: indicates the imaging modality (e.g.,
structural, functional).perc_bad_voxels: Proportion of voxels with 0 intensity
or NaN values.perc_voxels_outside: Proportion of voxels outside the
brain.number_of_subjects: Sample size of the studyDOI: The DOI for the publication associated with the
repository entry.paper_url: The URL for the publication associated with
the repository entry.We download the Neurovault study-level and image-level metadata, then apply the following quality exclusionary criteria to the metadata to ensure uniformity of the group-level statistical images.
analysis_level == "group").not_mni == FALSE).is_thresholded == FALSE),
which ensures all statistical values computed in the image are included.
This avoids biased caused by only reporting significant resultsNaN.See Table 1 and Figure 1 for a summary of the intersection of these exclusionary criteria.
## This chunk would take some time, as it creates the file neurovaultMetadata, containing all the study-level images
library(callr)
# Check if data was already downloaded this month
rds_file_path <- get_monthly_rds_path("neurovaultMetadata")
file_exists_this_month <- file.exists(rds_file_path)
if (!file_exists_this_month) {
# Run in background using callr::r - function saves results internally
download_and_save_metadata_fn <- function(rds_dir) {
library(httr)
library(jsonlite)
# Helper function to get date suffix
get_date_suffix <- function() {
format(Sys.Date(), "%Y-%m")
}
# Helper function to get monthly RDS path
get_monthly_rds_path <- function(base_name, dir = rds_dir) {
date_suffix <- get_date_suffix()
current_file <- file.path(dir, paste0(base_name, "_", date_suffix, ".rds"))
pattern <- paste0("^", base_name, "_[0-9]{4}-[0-9]{2}\\.rds$")
all_files <- list.files(dir, pattern = pattern, full.names = TRUE)
old_files <- all_files[all_files != current_file]
if (length(old_files) > 0) {
invisible(lapply(old_files, file.remove))
}
return(current_file)
}
Sys.setenv(CURL_CA_BUNDLE = "/etc/ssl/certs/ca-certificates.crt")
httr::GET("https://api.github.com")
url <- 'https://neurovault.org/api/collections/?format=json'
nv = list()
i = 0
while (!is.null(url)) {
i = i + 1
cat(sprintf("[%s] Downloading collections batch %d\n", Sys.time(), i))
response <- GET(url, config(cainfo="/etc/ssl/certs/ca-certificates.crt"))
data <- content(response, "text")
json_data <- fromJSON(data, flatten = TRUE)
nv[[i]] = json_data$results
url <- json_data[['next']]
}
nv = do.call(rbind, nv)
# subset to studies that have more than 0 images
nv = nv[nv$number_of_images>0,]
# Save directly to versioned RDS file
if (!dir.exists(rds_dir)) dir.create(rds_dir, showWarnings = FALSE, recursive = TRUE)
rds_path <- get_monthly_rds_path("neurovaultMetadata")
saveRDS(nv, file = rds_path)
cat(sprintf("[%s] Neurovault metadata saved to %s\n", Sys.time(), rds_path))
}
# Start background process - DON'T WAIT for completion
logfiles = c('logfiles/neurovault_collection_download.log', 'logfiles/neurovault_collection_download.err')
unlink(logfiles)
bg_process <- callr::r(download_and_save_metadata_fn, args = list(rds_dir), stdout = logfiles[1], stderr=logfiles[2])
message(sprintf("[%s] Download complete. Output saved to %s",
Sys.time(), rds_file_path))
} else {
# Load already downloaded metadata
nv <- load_monthly_rds("neurovaultMetadata")
message("Loaded previously downloaded metadata from ", rds_file_path)
}nv$DOIexists = factor(ifelse(is.na(nv$DOI), "No", "Yes"), levels=c('Yes', 'No'))
nv$paperURLexists = factor(ifelse(is.na(nv$paper_url), "No", "Yes"), levels=c('Yes', 'No'))
label(nv$paperURLexists) = "Paper URL exists"
label(nv$number_of_images) = "Number of images"
label(nv$DOIexists) = "DOI exists"
table1(~number_of_images + DOIexists + paperURLexists, data = nv, caption = "Study-level metadata.")## The following two chunks would take some time, as it extracts and downloads all the image-level data and creates the first image tabular dataset that we work with. More than 2 hours
## May 2026: 10743 collections
library(callr)
api_cores = 4
# Check if image list was already downloaded this month
rds_file_path <- get_monthly_rds_path("imageList")
file_exists_this_month <- file.exists(rds_file_path)
if (!file_exists_this_month) {
# Define the download function to run in background - function saves results internally
download_and_save_image_list_fn <- function(collection_ids, n_cores, rds_dir) {
library(httr)
library(jsonlite)
library(parallel)
# Ensure collection_ids is a proper character vector (fix for list coercion error)
if (is.list(collection_ids)) {
collection_ids <- unlist(collection_ids)
}
collection_ids <- as.character(collection_ids)
cat(sprintf("[%s] [BG] Starting download with %d collection IDs\n", Sys.time(), length(collection_ids)))
# Helper function to get monthly RDS path
get_monthly_rds_path <- function(base_name, dir = rds_dir) {
get_date_suffix <- function() {
format(Sys.Date(), "%Y-%m")
}
date_suffix <- get_date_suffix()
current_file <- file.path(dir, paste0(base_name, "_", date_suffix, ".rds"))
pattern <- paste0("^", base_name, "_[0-9]{4}-[0-9]{2}\\.rds$")
all_files <- list.files(dir, pattern = pattern, full.names = TRUE)
old_files <- all_files[all_files != current_file]
if (length(old_files) > 0) {
invisible(lapply(old_files, file.remove))
}
return(current_file)
}
# Download function with pagination support and error handling
download_collection_safe <- function(collection_id) {
tryCatch({
collection_id <- as.character(collection_id)
cat(sprintf("[%s] [BG] Downloading collection %s\n", Sys.time(), collection_id))
url <- paste0("https://neurovault.org/api/collections/", collection_id, "/images/?format=json")
all_results <- list()
page <- 0
# Paginate through all results
while (!is.null(url)) {
page <- page + 1
response <- GET(url, timeout(30))
if (status_code(response) != 200) {
cat(sprintf("[%s] [BG] Collection %s page %s: HTTP %s\n", Sys.time(), collection_id, page, status_code(response)))
break
}
data <- content(response, "text")
json_data <- fromJSON(data, flatten = TRUE)
# Extract results from this page
if (!is.null(json_data$results) && is.data.frame(json_data$results) && nrow(json_data$results) > 0) {
all_results[[page]] <- json_data$results
}
# Check for next page
url <- json_data[['next']]
}
# Combine all pages
if (length(all_results) == 0) {
cat(sprintf("[%s] [BG] Collection %s: No results\n", Sys.time(), collection_id))
return(data.frame())
}
result <- do.call(rbind, all_results)
# Validate result is a dataframe
if (!is.data.frame(result)) {
cat(sprintf("[%s] [BG] Collection %s: invalid result type (%s)\n", Sys.time(), collection_id, class(result)[1]))
return(data.frame())
}
cat(sprintf("[%s] [BG] Collection %s: OK (%s pages, %s total rows, %s cols)\n", Sys.time(), collection_id, page, nrow(result), ncol(result)))
return(result)
}, error = function(e) {
cat(sprintf("[%s] [BG] Collection %s: ERROR - %s\n", Sys.time(), collection_id, e$message))
return(data.frame())
})
}
# Download all collections
cat(sprintf("[%s] [BG] Starting mclapply", Sys.time(), n_cores))
image_list <- mclapply(collection_ids, download_collection_safe, mc.cores = n_cores)
# Validate we got a list
if (!is.list(image_list)) {
stop(sprintf("ERROR: mclapply did not return a list, got %s", class(image_list)))
}
cat(sprintf("[%s] [BG] mclapply returned %d results\n", Sys.time(), length(image_list)))
# Keep only data frames with >0 rows
cat(sprintf("[%s] [BG] Filtering dataframes with >0 rows\n", Sys.time()))
is_valid <- sapply(image_list, function(x) is.data.frame(x) && nrow(x) > 0)
valid_count <- sum(is_valid)
cat(sprintf("[%s] [BG] Valid dataframes: %d of %d\n", Sys.time(), valid_count, length(image_list)))
if (valid_count == 0) {
stop("ERROR: No valid dataframes downloaded. Check collection_ids and API.")
}
image_list <- image_list[is_valid]
# Save list directly to versioned RDS file
if (!dir.exists(rds_dir)) dir.create(rds_dir, showWarnings = FALSE, recursive = TRUE)
rds_path <- get_monthly_rds_path("imageList")
cat(sprintf("[%s] [BG] Saving image list with %d dataframes to %s\n", Sys.time(), length(image_list), rds_path))
saveRDS(image_list, file = rds_path, compress = "gzip")
cat(sprintf("[%s] [BG] ✓ Image list saved successfully\n", Sys.time()))
}
if (is.null(nv) || nrow(nv) == 0) {
stop("ERROR: neurovaultMetadata is empty. Run getStudyLevelData chunk first.")
}
# Extract and validate collection IDs
collection_ids <- nv$id
if (is.list(collection_ids)) {
collection_ids <- unlist(collection_ids)
}
collection_ids <- as.character(collection_ids)
cat(sprintf("[%s] Extracted %d collection IDs\n", Sys.time(), length(collection_ids)))
# Create logfiles directory if needed
logdir <- "logfiles"
if (!dir.exists(logdir)) dir.create(logdir, showWarnings = FALSE)
logfiles = c('logfiles/neurovault_download.log', 'logfiles/neurovault_download.err')
unlink(logfiles)
# Start background process - DON'T WAIT for completion
bg_process <- callr::r(download_and_save_image_list_fn,
args = list(collection_ids, api_cores, rds_dir),
stdout = logfiles[1],
stderr = logfiles[2])
message(sprintf("[%s] Background process started (PID: %s). Downloading %d collections. Output: %s",
Sys.time(), length(collection_ids), logfiles[1]))
} else {
# Load already downloaded list
image_list <- load_monthly_rds("imageList")
message("Loaded previously downloaded image list from ", rds_file_path)
}# Check if merged image-level data already exists this month
rds_file_path <- get_monthly_rds_path("fullimageMetadata")
file_exists_this_month <- file.exists(rds_file_path)
#res = unlist(by(images$collection_id, images$collection_id, length)); res = data.frame(collection_id = names(res), numberinImages=res)
#test = merge(res, nv[, c('id', 'number_of_images')], by.x='collection_id', by.y='id')
if (!file_exists_this_month) {
if (is.null(image_list) || length(image_list) == 0) {
stop("ERROR: imageList is empty. Run getImageLevelData_section1.1_download chunk first and wait for it to complete.")
}
cat(sprintf("[%s] Loaded image list with %d dataframes\n", Sys.time(), length(image_list)))
# Find common column names across all dataframes (columns in ALL collections)
cat(sprintf("[%s] Analyzing column overlap across %d collections\n", Sys.time(), length(image_list)))
all_colnames <- lapply(image_list, colnames)
# Columns that appear in ALL collections
# SNV: This excludes important columns
# common_cols <- Reduce(intersect, all_colnames)
# cat(sprintf("[%s] Found %d columns in ALL collections\n", Sys.time(), length(common_cols)))
# Find columns that appear in at least 80% of collections
col_freq <- table(unlist(all_colnames))
common_cols <- names(col_freq)[col_freq >= 0.8 * length(image_list)]
cat(sprintf("[%s] Fallback: %d columns in ≥80%% of collections\n", Sys.time(), length(common_cols)))
# Subset each dataframe to common columns only
cat(sprintf("[%s] Subsetting to %d common columns\n", Sys.time(), length(common_cols)))
image_list <- lapply(image_list, function(df) {
# Get columns that exist in this dataframe
cols_exist <- intersect(common_cols, colnames(df))
# Get missing columns
cols_missing <- setdiff(common_cols, colnames(df))
# Start with existing columns
result <- df[, cols_exist, drop = FALSE]
# Add missing columns filled with NA
if (length(cols_missing) > 0) {
result[, cols_missing] <- NA
}
# Reorder to match common_cols order
result[, common_cols, drop = FALSE]
})
# Combine dataframes using rbind (much faster than merge)
cat(sprintf("[%s] Combining %d dataframes via rbind...\n", Sys.time(), length(image_list)))
start_time <- Sys.time()
images <- do.call(rbind, image_list)
elapsed <- difftime(Sys.time(), start_time, units = "secs")
cat(sprintf("[%s] rbind complete in %.1f sec: %d rows × %d columns\n",
Sys.time(), elapsed, nrow(images), ncol(images)))
# Sort all columns by missingness
cat(sprintf("[%s] Computing column completeness for %d columns\n", Sys.time(), ncol(images)))
col_miss <- colMeans(is.na(images))
col_miss = sort(col_miss)
col_keep <- names(col_miss)[col_miss < 0.8]
images <- images[, col_keep]
# Save to versioned RDS file
if (!dir.exists(rds_dir)) dir.create(rds_dir, showWarnings = FALSE, recursive = TRUE)
rds_file_path <- get_monthly_rds_path("fullimageMetadata")
cat(sprintf("[%s] Saving merged data to %s\n", Sys.time(), rds_file_path))
saveRDS(images, file = rds_file_path, compress = "gzip")
cat(sprintf("[%s] ✓ Full image metadata saved successfully\n", Sys.time()))
} else {
# Load already merged metadata
images <- load_monthly_rds("fullimageMetadata")
message("Loaded previously merged image metadata from ", rds_file_path)
}images <- images %>% mutate(
image_type = factor(image_type, levels = names(sort(table(image_type), decreasing = TRUE))),
map_type = factor(map_type, levels = names(sort(table(map_type), decreasing = TRUE))),
analysis_level = factor(analysis_level, levels = names(sort(table(analysis_level), decreasing = TRUE))),
is_thresholded = factor(is_thresholded, levels = names(sort(table(is_thresholded), decreasing = TRUE))),
not_mni = factor(not_mni, levels = names(sort(table(not_mni), decreasing = TRUE)))
)
table1Form = ~ number_of_subjects + modality + image_type + map_type + analysis_level + is_thresholded + not_mni + perc_bad_voxels + perc_voxels_outside
table1(table1Form,
data = images,
caption = "Table 1: Neurovault image-level metadata summary.")| Overall (N=618582) |
|
|---|---|
| number_of_subjects | |
| Mean (SD) | 244 (62800) |
| Median [Min, Max] | 30.0 [-275, 32200000] |
| Missing | 355365 (57.4%) |
| modality | |
| 41801 (6.8%) | |
| Diffusion MRI | 1026 (0.2%) |
| EEG | 43 (0.0%) |
| fMRI-BOLD | 442445 (71.5%) |
| fMRI-CBF | 275 (0.0%) |
| fMRI-CBV | 26 (0.0%) |
| MEG | 84 (0.0%) |
| Other | 126908 (20.5%) |
| PET FDG | 179 (0.0%) |
| PET other | 274 (0.0%) |
| Structural MRI | 4322 (0.7%) |
| Missing | 1199 (0.2%) |
| image_type | |
| statistic_map | 617383 (99.8%) |
| NIDM results statistic map | 1181 (0.2%) |
| atlas | 18 (0.0%) |
| map_type | |
| univariate-beta map | 147640 (23.9%) |
| variance | 137908 (22.3%) |
| ROI/mask | 111407 (18.0%) |
| T map | 75813 (12.3%) |
| other | 69067 (11.2%) |
| Z map | 66055 (10.7%) |
| P map (given null hypothesis) | 6812 (1.1%) |
| parcellation | 1245 (0.2%) |
| F map | 1052 (0.2%) |
| 1-P map ("inverted" probability) | 541 (0.1%) |
| anatomical | 490 (0.1%) |
| multivariate-beta map | 478 (0.1%) |
| Chi squared map | 56 (0.0%) |
| Missing | 18 (0.0%) |
| analysis_level | |
| single-subject | 365261 (59.0%) |
| meta-analysis | 117877 (19.1%) |
| group | 54178 (8.8%) |
| other | 2332 (0.4%) |
| 27 (0.0%) | |
| Missing | 78907 (12.8%) |
| is_thresholded | |
| FALSE | 451736 (73.0%) |
| TRUE | 159548 (25.8%) |
| Missing | 7298 (1.2%) |
| not_mni | |
| FALSE | 615842 (99.6%) |
| TRUE | 2722 (0.4%) |
| Missing | 18 (0.0%) |
| perc_bad_voxels | |
| Mean (SD) | 79.4 (16.7) |
| Median [Min, Max] | 78.1 [0, 100] |
| Missing | 18 (0.0%) |
| perc_voxels_outside | |
| Mean (SD) | 30.4 (38.9) |
| Median [Min, Max] | 10.1 [0, 100] |
| Missing | 909 (0.1%) |
Table 1 summarizes the full NeuroVault repository prior to any
exclusions (N = 618582). Several patterns are worth noting. The
repository is dominated by single-subject images and meta-analytic maps;
group-level images — the target of this pipeline — comprise only 10.0%
of the total. The most common map types are non-statistical maps such as
univariate-beta map and variance; T and Z maps
together account for a minority of all images. A large proportion of
images have no recorded sample size (57.4% missing), reflecting the
manually curated nature of the repository. The high median
perc_bad_voxels reflects expected zero-valued voxels
outside the brain mask rather than data corruption.
# SNV: Warning mapping missing to FALSE in upset plot is suppressed.
library(ComplexUpset)
## UpSet table
images$binarized_group_level <- ifelse(images$analysis_level == "group", TRUE, FALSE)
images$binarized_map_type <- ifelse(images$map_type == "T map" | images$map_type == "Z map", TRUE, FALSE)
images$unthresholded <- ifelse(images$is_thresholded == FALSE, TRUE, FALSE)
images$mni <- ifelse(images$not_mni == FALSE, TRUE, FALSE)
images$nonzero_sample <- ifelse(is.na(images$number_of_subjects) | images$number_of_subjects<=0, FALSE, TRUE)
images_clean <- images %>%
transmute(
"Group-level analysis" = as.logical(binarized_group_level),
"T/Z map" = as.logical(binarized_map_type),
"Unthresholded" = as.logical(unthresholded),
"MNI space" = as.logical(mni),
"Nonzero sample" = as.logical(nonzero_sample)
)
upset(
images_clean,
intersect = c(
"Group-level analysis",
"T/Z map",
"Unthresholded",
"MNI space",
'Nonzero sample'
),
name='Group',
min_size = 50,
base_annotations = list(
"Intersection size" = intersection_size(
text = list(
angle = 90,
vjust = 0.5,
hjust = -0.02
)
)
)
)Figure 1: Intersection of NeuroVault image-level exclusionary criteria.
Figure 1 shows the an UpSet plot showing the intersection of the five exclusionary criteria applied to the raw NeuroVault image metadata. Each column in the upper bar chart represents a unique combination of criteria, with the bar height indicating how many images fall into that combination. The dot matrix below each bar shows which criteria are active — filled dots connected by a vertical line indicate criteria that are simultaneously satisfied. The horizontal bars on the left show the total number of images satisfying each criterion individually (set size), regardless of the others. The final retained images satisfy all five criteria simultaneously. See Table 2 below for summary statistics after exclusion.
Note:
The analysis_level variable has other types of data,
and group-level images make up less than 10% of data.
Including only T and Z maps removes approximately 70% of the data.
0 images have a reported sample size greater than 100,000 (maximum: 32,000). Our approach here is by spot checking the entry with 32,222,222, and it turns out to be a sample size of 32. We have manually fixed it in our dataset. However, for everything else and future additional studies to our pipeline, a study with more than 0 images can reflect metadata entry errors. Thus, implausibly large values will be flagged and removed here (except for the one that we spot checked.
# SNV: I deleted reloading the dataset, here
images_before <- nrow(images)
# subset to group-level analysis
images = images[ which(images$analysis_level=='group'), ]
# subset to unthresholded images
images = images[which(images$is_thresholded==FALSE),]
# Including only T and Z map types
images = images[images$map_type %in% c("T map", "Z map"), ]
# Exclude all not mni
images = images[which(images$not_mni == FALSE), ]
# Exclude sample size 0
images <- images[which(!is.na(images$number_of_subjects) & images$number_of_subjects>0),]
table1(table1Form, data = images, caption = "Table 2: Summary statistics after metadata exclusions.")| Overall (N=25706) |
|
|---|---|
| number_of_subjects | |
| Mean (SD) | 79.7 (226) |
| Median [Min, Max] | 45.0 [1.00, 30000] |
| modality | |
| Diffusion MRI | 39 (0.2%) |
| fMRI-BOLD | 17647 (68.6%) |
| fMRI-CBF | 14 (0.1%) |
| fMRI-CBV | 2 (0.0%) |
| MEG | 39 (0.2%) |
| Other | 7361 (28.6%) |
| PET FDG | 50 (0.2%) |
| PET other | 43 (0.2%) |
| Structural MRI | 511 (2.0%) |
| image_type | |
| statistic_map | 25706 (100%) |
| NIDM results statistic map | 0 (0%) |
| atlas | 0 (0%) |
| map_type | |
| univariate-beta map | 0 (0%) |
| variance | 0 (0%) |
| ROI/mask | 0 (0%) |
| T map | 23143 (90.0%) |
| other | 0 (0%) |
| Z map | 2563 (10.0%) |
| P map (given null hypothesis) | 0 (0%) |
| parcellation | 0 (0%) |
| F map | 0 (0%) |
| 1-P map ("inverted" probability) | 0 (0%) |
| anatomical | 0 (0%) |
| multivariate-beta map | 0 (0%) |
| Chi squared map | 0 (0%) |
| analysis_level | |
| single-subject | 0 (0%) |
| meta-analysis | 0 (0%) |
| group | 25706 (100%) |
| other | 0 (0%) |
| 0 (0%) | |
| is_thresholded | |
| FALSE | 25706 (100%) |
| TRUE | 0 (0%) |
| not_mni | |
| FALSE | 25706 (100%) |
| TRUE | 0 (0%) |
| perc_bad_voxels | |
| Mean (SD) | 74.0 (12.2) |
| Median [Min, Max] | 75.9 [0, 86.1] |
| perc_voxels_outside | |
| Mean (SD) | 13.2 (21.3) |
| Median [Min, Max] | 6.41 [0, 100] |
After applying these exclusions, the dataset is reduced from 618,582
to 25,706 images (Table 2). The retained images are exclusively
group-level, unthresholded T and Z maps in MNI space with a valid
nonzero sample size. As expected, variables used as exclusionary
criteria — analysis_level, is_thresholded,
not_mni, and map_type — show no remaining
variation, confirming the filters were applied correctly. The dataset is
dominated by fMRI-BOLD images (68.6%), with a median sample size of 45
subjects. The mean sample size (80) is notably larger than the median,
suggesting a small number of studies with disproportionately large
sample sizes that will be addressed in the next section.
Here, we download Nifti images from each NeuroVault collection after applying the metadata exclusions. After preparing local file paths, it checks for missing files and downloads them in parallel using 36 cores.
## This chunk would take some time, as it downloads the images to local directories, to allow further work on the images.
library(callr)
# download the images
outdir = '/media/big/neurovault'
images$localfile = gsub('http://neurovault.org/media/images', outdir, images$file)
# creates the directories
invisible(capture.output(lapply(unique(dirname(images$localfile)), dir.create, showWarnings=FALSE, recursive = TRUE)))
# Actually download the images, here, if they aren't already
# downloaded. I am not sure if there are usage limits on
# neurovault. This parallel call would probably be a problem
# if there is. There were issues the first time.
images$download = !file.exists(images$localfile)
# Define the download function
download_images_fn <- function(url_vec, destfile_vec, n_cores) {
library(parallel)
downloadFile = function(url, destfile){
tryCatch(download.file(url=url, destfile=destfile), error=function(e){NA})
}
invisible(mcmapply(downloadFile, url=url_vec, destfile=destfile_vec, mc.cores = n_cores))
}
# Create logfiles directory if needed
logdir <- "logfiles"
if (!dir.exists(logdir)) dir.create(logdir, showWarnings = FALSE)
logfiles = c('logfiles/image_download.log', 'logfiles/image_download.err')
unlink(logfiles)
# Run and wait for completion
cat(sprintf("[%s] Starting image download (%d files)...\n", Sys.time(), sum(images$download==TRUE)))## [2026-05-19 12:39:46.195016] Starting image download (46 files)...
result <- callr::r(download_images_fn,
args = list(images$file[images$download==TRUE],
images$localfile[images$download==TRUE],
n_cores),
stdout = logfiles[1],
stderr = logfiles[2])
cat(sprintf("[%s] ✓ Download completed. Output: %s\n", Sys.time(), logfiles[1]))## [2026-05-19 12:39:48.345042] ✓ Download completed. Output: logfiles/image_download.log
After downloading the images, we compute image dimensions
(imgdim), voxel dimensions (voxdim), number of
nonzero voxels (nvox), image intensity range
(range), and orientation from the Nifti images. The image
and voxel dimensions are used to compute the physical dimensions of each
image in millimeters. We screen for duplicate images based on both the
qualitative and quantitative aspects, including the file’s basename
imgdim, voxeldim, nvox,
range, and bounding_box. It would be rare for
two different images to have exact identical values across all these
metrics, especially bounding_box and range.
After identifying duplicates, we check whether the image dimensions in
mm are within 2.5% of the MNI template image. Finally, we check the
image range; if it is less than 0.01, we exclude the image (Figure 2).
This threshold was chosen because for images where the maximum and
minimum intensity values differ by less than 0.01, they contain
effectively no signal variation, and are therefore unlikely to represent
valid statistical maps. We do not exclude based on large ranges because
the test statistic under the alternative converges to infinity as the
sample size gets large. Further effect size based exclusionary criteria
are applied below. Tables 4 and 5 show the dataset before and after
these exclusionary criteria, respectively.
library(callr)
# Check if image dimensions/voxels metadata was already computed this month
rds_file_path <- get_monthly_rds_path("imageMetadataWithDimensions")
file_exists_this_month <- file.exists(rds_file_path)
if (!file_exists_this_month) {
# Define the function to compute image dimensions - runs in background
compute_and_save_image_dimensions_fn <- function(rds_dir, images, n_cores, c3d_bin) {
library(parallel)
# Helper function to get monthly RDS path
get_monthly_rds_path <- function(base_name, dir = rds_dir) {
get_date_suffix <- function() {
format(Sys.Date(), "%Y-%m")
}
date_suffix <- get_date_suffix()
current_file <- file.path(dir, paste0(base_name, "_", date_suffix, ".rds"))
pattern <- paste0("^", base_name, "_[0-9]{4}-[0-9]{2}\\.rds$")
all_files <- list.files(dir, pattern = pattern, full.names = TRUE)
old_files <- all_files[all_files != current_file]
if (length(old_files) > 0) {
invisible(lapply(old_files, file.remove))
}
return(current_file)
}
# Helper function to extract value by keyword from a string
extract_info_field <- function(lines, keyword) {
val <- sapply(lines, function(x) {
first_line <- x[1]
splt <- strsplit(first_line, split = ";")[[1]]
match <- grep(keyword, splt, value = TRUE)
if (length(match) > 0) {
return(trimws(sub(paste0("^.*", keyword, "\\s*=\\s*"), "", match[1])))
} else {
return(NA)
}
})
return(val)
}
# Get image dimensions and voxel info
cat(sprintf("[%s] [BG] Computing image dimensions with c3d -info...\n", Sys.time()))
cmd = paste(c3d_bin, images$localfile, '-info')
voxdim = mclapply(cmd, system, intern=TRUE, mc.cores=n_cores)
images$voxdim = unlist(lapply(voxdim, function(x){splt = strsplit(x[1], split = ';')[[1]]; splt[grep('vox', splt)] }))
images$imgdim = unlist(lapply(voxdim, function(x){splt = strsplit(x[1], split = ';')[[1]]; splt[grep('dim', splt)] }))
images$voxdim <- trimws(images$voxdim)
images$imgdim <- trimws(images$imgdim)
cat(sprintf("[%s] [BG] Image dimensions computed\n", Sys.time()))
# Count non-zero voxels
cat(sprintf("[%s] [BG] Counting non-zero voxels with c3d -threshold...\n", Sys.time()))
cmd_vox <- paste(c3d_bin, images$localfile, '-threshold 0 0 0 1 -voxel-sum')
nvox <- unlist(mclapply(cmd_vox, system, intern = TRUE, mc.cores = n_cores))
images$nvox <- as.numeric(gsub("[^0-9.]", "", nvox))
cat(sprintf("[%s] [BG] Non-zero voxel count complete\n", Sys.time()))
# Get orientation, range, and bounding box from `-info` output
cat(sprintf("[%s] [BG] Extracting orientation, range, and bounding box...\n", Sys.time()))
cmd_info <- paste(c3d_bin, images$localfile, '-info')
info <- mclapply(cmd_info, system, intern = TRUE, mc.cores = n_cores)
images$orient <- extract_info_field(info, "orient")
images$range <- extract_info_field(info, "range")
images$bounding_box <- extract_info_field(info, "bb")
images$voxdim <- trimws(gsub("vox = ", "", images$voxdim))
images$imgdim <- trimws(gsub("Image #1: dim = ", "", images$imgdim))
# Save to versioned RDS file
if (!dir.exists(rds_dir)) dir.create(rds_dir, showWarnings = FALSE, recursive = TRUE)
rds_path <- get_monthly_rds_path("imageMetadataWithDimensions")
cat(sprintf("[%s] [BG] Saving enhanced metadata to %s\n", Sys.time(), rds_path))
saveRDS(images, file = rds_path, compress = "gzip")
cat(sprintf("[%s] [BG] ✓ Image dimension metadata saved successfully\n", Sys.time()))
}
# Create logfiles directory if needed
logdir <- "logfiles"
if (!dir.exists(logdir)) dir.create(logdir, showWarnings = FALSE)
logfiles = c('logfiles/c3d_dimensions.log', 'logfiles/c3d_dimensions.err')
unlink(logfiles)
# Start background process - DON'T WAIT for completion
bg_process <- callr::r(compute_and_save_image_dimensions_fn,
args = list(rds_dir, images, n_cores, c3d_bin),
stdout = logfiles[1],
stderr = logfiles[2])
message(sprintf("[%s] Background process started (PID: %s). Computing image dimensions. Output: %s",
Sys.time(), logfiles[1]))
} else {
# Load already computed metadata
images <- load_monthly_rds("imageMetadataWithDimensions")
message("Loaded previously computed image dimensions from ", rds_file_path)
}Table 3: Duplicate images identified and removed based on identical intensity range and filename.
images$localfile_base <- basename(as.character(images$localfile))
cand_keys <- c("range", "localfile_base")
keyDF <- as.data.frame(lapply(images[, cand_keys], function(x) {
if (is.list(x)) sapply(x, paste, collapse = ",") else as.character(x)
}))
images$duplicated <- duplicated(keyDF)
n_dup <- sum(images$duplicated)
reactable(
images[images$duplicated == TRUE, c('collection', 'id', 'range', 'localfile_base')],
striped = TRUE,
highlight = TRUE,
bordered = TRUE,
searchable = TRUE,
defaultPageSize = 10,
columns = list(
collection = colDef(name = "Collection", minWidth = 150),
id = colDef(name = "ID", minWidth = 80),
range = colDef(name = "Range", minWidth = 150),
localfile_base = colDef(name = "Filename", minWidth = 200)
)
)We screen for duplicate images based on identical intensity range and filename, as two images sharing both values are almost certainly the same statistical map uploaded to multiple collections. A total of 2066 duplicate images were identified and removed, leaving 23594 images for downstream analysis. The duplicates are displayed in the table above. The table above shows an interactive table of the objects being dropped here. Note: Only the second occurrence of each duplicate pair is shown here; the first occurrence is retained in the dataset and will not appear in this table, so it is normal for no two rows in the table to look identical to each other.
Using a proportionality tolerance of 0.025, we then check whether the physical dimensions of each image are approximately proportional to the MNI template (182x218x182 mm) as shown in Figure 2.
# compute dim_mm by multiplying imgdim with voxdim after converting to numeric values
imgdims = lapply(gsub('\\]', ')', gsub('\\[', 'c(', as.character(images$imgdim))), function(x) eval(parse(text=x)))
voxdims = lapply(gsub('\\]', ')', gsub('\\[', 'c(', as.character(images$voxdim))), function(x) eval(parse(text=x)))
dims_mat = t(mapply(function(x,y){ x*y}, imgdims, voxdims))
colnames(dims_mat) <- c("dim_x", "dim_y", "dim_z")
# Split dim_mm into x, y, z components
# SNV: dim_mm was not defined and only existed because it was loading an old dataset version.
images$dim_mm = apply(dims_mat, 1, function(x) paste(x, collapse = "*"))
# Compute ratios
ratio_xy <- dims_mat[,1] / dims_mat[,2]
ratio_xz <- dims_mat[,1] / dims_mat[,3]
ratio_yz <- dims_mat[,2] / dims_mat[,3]
# Define reference ratios (from MNI: 182x218x182)
ref_xy <- 182 / 218 # ≈ 0.8349
ref_xz <- 1 # since x = z
ref_yz <- 218 / 182 # ≈ 1.1978
images[, paste('log10', c('ratio_xy', 'ratio_xz', 'ratio_yz'), sep = '_')] <- cbind(ratio_xy/ref_xy, ratio_xz/ref_xz, ratio_yz/ref_yz)
images[, c('ratio_x', 'ratio_y', 'ratio_z') ] <- sweep(dims_mat, 2, c(182, 218, 182), FUN = "/")
# Reference thresholds by dimension
ref_thresholds <- list(
ratio_x = c(0.75, 1.25),
ratio_y = c(0.75, 1.25),
ratio_z = c(0.6, 1.2)
)
images$is_proportional = apply(images[, c('ratio_x', 'ratio_y', 'ratio_z')], 1, function(r) {
all(r >= ref_thresholds$ratio_x[1] & r <= ref_thresholds$ratio_x[2]) &
all(r >= ref_thresholds$ratio_y[1] & r <= ref_thresholds$ratio_y[2]) &
all(r >= ref_thresholds$ratio_z[1] & r <= ref_thresholds$ratio_z[2])
})
# Logical mask for proportional dimensions
# Set epsilon
epsilon <- 0.025
# images$is_proportional <- abs(ratio_xy - ref_xy) < epsilon &
# abs(ratio_xz - ref_xz) < epsilon &
# abs(ratio_yz - ref_yz) < epsilon
# SNV: Changed to not repeat code you had above and use the same (more compact) syntax you used above
images <- images %>% mutate(
dim_mm = factor(dim_mm, levels = names(sort(table(dim_mm), decreasing = TRUE)))
)
prop_near_mni <- sum(images$is_proportional)
mni_total <- length(images$is_proportional)
# PLOT
library(GGally)
library(ggplot2)
library(scales)
library(rlang)
# Create ratio dataframe
ratio_df <- images[, c('ratio_x', 'ratio_y', 'ratio_z')]
ratio_df <- na.omit(ratio_df)
# Diagonal function (histograms)
diag_fun_ratio <- function(data, mapping, ...) {
p <- GGally::ggally_barDiag(data, mapping, ...)
var <- as_label(mapping$x)
thrs <- ref_thresholds[[var]]
if (!is.null(thrs)) {
for (thr in thrs) {
p <- p + geom_vline(xintercept = thr, linetype = "dashed", color = "red", alpha = 0.7)
}
}
p
}
# Lower function (scatter plots)
lower_fun_ratio <- function(data, mapping, ...) {
p <- GGally::ggally_points(data, mapping, ...)
xvar <- as_label(mapping$x)
yvar <- as_label(mapping$y)
x_thrs <- ref_thresholds[[xvar]]
y_thrs <- ref_thresholds[[yvar]]
if (!is.null(x_thrs)) {
for (thr in x_thrs) {
p <- p + geom_vline(xintercept = thr, linetype = "dashed", color = "red", alpha = 0.7)
}
}
if (!is.null(y_thrs)) {
for (thr in y_thrs) {
p <- p + geom_hline(yintercept = thr, linetype = "dashed", color = "red", alpha = 0.7)
}
}
p
}
# Upper function (correlations)
upper_fun_ratio <- function(data, mapping, ...) {
GGally::ggally_cor(data, mapping, ...) +
theme_void()
}
# Create the plot
GGally::ggpairs(
ratio_df,
diag = list(continuous = diag_fun_ratio),
lower = list(continuous = lower_fun_ratio),
upper = list(continuous = upper_fun_ratio),
title = "Figure 2: Image dimension ratios (relative to MNI 182×218×182 template)"
)We find that 23191 of 23594 images are approximately proportional to the MNI template.
We convert the image dimensions to mm by multiplying the number of voxels by the pixel dimension. Using a proportionality tolerance of \(\varepsilon = 0.025\), we then check whether the physical dimensions of each image are approximately proportional to the MNI template (182×218×182 mm). We find that 23191 of 23594 images are approximately proportional to the MNI template.
Images with a low intensity range (\(|\max - \min|\)) are unlikely to be valid statistical images, so we exclude all images that have a range smaller than 0.01. A visual illustration is available in Figure 3 below. Everything below the red dashed line are removed due to small range.
library(stringr)
library(ggplot2)
library(scales)
# SNV: removed loading and saving dataset in this chunk
# parse [lo, hi]
hi <- str_match(
images[["range"]],
"\\[\\s*([+-]?(?:\\d*\\.?\\d+(?:[eE][+-]?\\d+)?))\\s*,\\s*([+-]?(?:\\d*\\.?\\d+(?:[eE][+-]?\\d+)?))\\s*\\]"
)
lo <- as.numeric(hi[, 2])
hi <- as.numeric(hi[, 3])
# attach & compute range
images$range_low <- lo
images$range_high <- hi
images$range_diff <- images$range_high - images$range_low
rm(lo, hi)
# histogram of log10(range_diff) (positive values only)
xpos <- images$range_diff
# SNV: commented this because it was already satisfied.
# xpos <- xpos[is.finite(xpos) & xpos > 0]
# xpos already defined above as positive, finite range_diff values
df_sc <- data.frame(
idx = seq_along(xpos),
l10 = log10(xpos)
)
tick_min <- floor(min(df_sc$l10, na.rm = TRUE))
tick_max <- ceiling(max(df_sc$l10, na.rm = TRUE))
ticks <- tick_min:tick_max
# SNV: changed hard coding of horizontal line
# SNV: Change this to -2, since it is thresholding on the log scale
thr <- -2
ggplot(df_sc, aes(idx, l10)) +
geom_point(alpha = 0.5, size = 0.6) +
scale_y_continuous(breaks = ticks, labels = scales::math_format(10^.x)) +
geom_hline(yintercept = ticks, linetype = "dotted", linewidth = 0.3, color = "grey70") +
labs(
title = "Figure 3: Scatterplot of image range (log10 scale) by row index",
x = "Image index",
y = "Image numerical range (log10 scale)",
subtitle = sprintf("n = %d ", length(xpos))
) +
geom_hline(yintercept = thr,linetype = "dashed", color = "red") +
theme_minimal(base_size = 12)rm(df_sc)
# drop rows with log10(range_diff) < 0.1 (guard against non-positive / NA)
images$range_exclusion = log10(images$range_diff) < thrTable 4 below summarizes the dataset after
image-level exclusions, showing the distribution of our dataset after
the three additional criteria derived from the downloaded images:
duplicated (identical range and filename across
collections), is_proportional (physical dimensions within
2.5% of the MNI 182×218×182 template), and range_exclusion
(intensity range below 0.01). The dim_mm column shows the
distribution of image dimensions in millimeters across the dataset.
The remaining images are proportionally sized relative to the MNI
template, contain meaningful intensity variation, and have no detected
duplicates. The dim_mm distribution in Table 4 reflects
only the image dimensions present in the retained dataset, with the most
common dimension being 182218182.
# SNV: perc_of_voxels_outside was not included in table1Form, so I did not include it here.
table2Form = update.formula(table1Form, ~ duplicated + is_proportional + range_exclusion + dim_mm + .)
images <- images[which(images$is_proportional & !images$duplicated & !images$range_exclusion), ]
dimtab = table(images$dim_mm); dimtab = dimtab[dimtab > 0]
images$dim_mm = factor(images$dim_mm, levels = names(sort(dimtab, decreasing = TRUE)))
table1(table2Form, data = images, caption = "Table 4: After image-level exclusionary criteria with prior metadata.")| Overall (N=22833) |
|
|---|---|
| duplicated | |
| Yes | 0 (0%) |
| No | 22833 (100%) |
| is_proportional | |
| Yes | 22833 (100%) |
| No | 0 (0%) |
| range_exclusion | |
| Yes | 0 (0%) |
| No | 22833 (100%) |
| dim_mm | |
| 182*218*182 | 10756 (47.1%) |
| 195*231*196 | 2329 (10.2%) |
| 194*230*194 | 2061 (9.0%) |
| 195*231*195 | 1174 (5.1%) |
| 192*228*192 | 1094 (4.8%) |
| 158*190*158 | 825 (3.6%) |
| 195*232.5*195 | 648 (2.8%) |
| 159*189*138 | 544 (2.4%) |
| 181.5*217.5*181.5 | 465 (2.0%) |
| 195*231*198 | 457 (2.0%) |
| 196.749*230.755*194.4 | 388 (1.7%) |
| 183*219*183 | 296 (1.3%) |
| 159*189*156 | 291 (1.3%) |
| 194*230*194.4 | 288 (1.3%) |
| 150*186*160 | 108 (0.5%) |
| 184.25*220*184 | 77 (0.3%) |
| 157.5*192.5*157.5 | 76 (0.3%) |
| 162*192*150 | 76 (0.3%) |
| 180*216*180 | 63 (0.3%) |
| 194.4*230.4*194.4 | 59 (0.3%) |
| 158*190*156 | 56 (0.2%) |
| 158*190*138 | 45 (0.2%) |
| 194*230*195 | 45 (0.2%) |
| 196.218*231.894*196.42 | 33 (0.1%) |
| 170*210*145 | 30 (0.1%) |
| 197*233*189 | 28 (0.1%) |
| 193.2*229.2*193.2 | 26 (0.1%) |
| 181*217*181 | 24 (0.1%) |
| 180*217.5*180 | 20 (0.1%) |
| 188*220*184 | 19 (0.1%) |
| 158*190*168 | 18 (0.1%) |
| 182*217*182 | 18 (0.1%) |
| 148*184*156 | 16 (0.1%) |
| 196*232*197.6 | 15 (0.1%) |
| 158*190*174 | 14 (0.1%) |
| 162.4*191.4*150.8 | 14 (0.1%) |
| 194.247*230.346*195 | 14 (0.1%) |
| 195.8*231*195.8 | 14 (0.1%) |
| 157.5*190*137.5 | 13 (0.1%) |
| 182.5*220*182.5 | 13 (0.1%) |
| 159*195*156 | 12 (0.1%) |
| 169.5*205.5*169.5 | 12 (0.1%) |
| 194*230*195.8 | 12 (0.1%) |
| 157.5*190.5*157.5 | 11 (0.0%) |
| 192.5*229.25*192.5 | 11 (0.0%) |
| 196.875*231.25*198 | 11 (0.0%) |
| 153.9*189*160.38 | 10 (0.0%) |
| 164*192*140 | 10 (0.0%) |
| 196.552*231.384*195.25 | 10 (0.0%) |
| 152*184*152 | 9 (0.0%) |
| 161*190.75*150.5 | 9 (0.0%) |
| 195.816*231.636*198.75 | 9 (0.0%) |
| 156*186*147 | 8 (0.0%) |
| 157.5*190*157.5 | 8 (0.0%) |
| 158*190*136.5 | 8 (0.0%) |
| 194.4*230.4*196.56 | 8 (0.0%) |
| 153*189*147 | 7 (0.0%) |
| 156.6*189*158.4 | 6 (0.0%) |
| 195*231*197.12 | 6 (0.0%) |
| 157.5*189*136.5 | 5 (0.0%) |
| 159*195*159 | 5 (0.0%) |
| 196.472*232.412*194.4 | 5 (0.0%) |
| 198.75*232.5*197.6 | 5 (0.0%) |
| 198*231*195 | 5 (0.0%) |
| 154*190*164 | 4 (0.0%) |
| 157.5*195*171 | 4 (0.0%) |
| 182*218*183 | 4 (0.0%) |
| 193.5*229.5*195 | 4 (0.0%) |
| 195*231*196.56 | 4 (0.0%) |
| 195*231*198.8 | 4 (0.0%) |
| 139.01492*173.76865*151.65264 | 3 (0.0%) |
| 192.5*227.5*192.5 | 3 (0.0%) |
| 196.735*231.77*197.1 | 3 (0.0%) |
| 144*180*146 | 2 (0.0%) |
| 157*189*156 | 2 (0.0%) |
| 159*192*156 | 2 (0.0%) |
| 160*192*140 | 2 (0.0%) |
| 164*192*160 | 2 (0.0%) |
| 169*205*169 | 2 (0.0%) |
| 171*207*171 | 2 (0.0%) |
| 178.5*214.5*178.5 | 2 (0.0%) |
| 182.5*217.5*182.5 | 2 (0.0%) |
| 183*249*183 | 2 (0.0%) |
| 192.5*230*192.5 | 2 (0.0%) |
| 195.468*230.373*195.5 | 2 (0.0%) |
| 195.966*233.784*197.2 | 2 (0.0%) |
| 195*232.5*198 | 2 (0.0%) |
| 200*235*191.4 | 2 (0.0%) |
| 202*232*192 | 2 (0.0%) |
| 150.5*185*158 | 1 (0.0%) |
| 159*192*138 | 1 (0.0%) |
| 159*192*159 | 1 (0.0%) |
| 161.25*191.25*136.8 | 1 (0.0%) |
| 161.25*191.25*140 | 1 (0.0%) |
| 161*191*151 | 1 (0.0%) |
| 180*219*184 | 1 (0.0%) |
| 183*219*174 | 1 (0.0%) |
| 183*219*183.6 | 1 (0.0%) |
| 184.8*220.8*182.4 | 1 (0.0%) |
| 188*224*184 | 1 (0.0%) |
| 190.9*227.7*190.9 | 1 (0.0%) |
| 192.5*230.3125*192.5 | 1 (0.0%) |
| 193.6*228.8*193.6 | 1 (0.0%) |
| 193*229*193 | 1 (0.0%) |
| 198*234*190 | 1 (0.0%) |
| number_of_subjects | |
| Mean (SD) | 1490 (213000) |
| Median [Min, Max] | 53.0 [1.00, 32200000] |
| modality | |
| Diffusion MRI | 39 (0.2%) |
| fMRI-BOLD | 15000 (65.7%) |
| fMRI-CBF | 14 (0.1%) |
| fMRI-CBV | 1 (0.0%) |
| MEG | 39 (0.2%) |
| Other | 7351 (32.2%) |
| PET FDG | 31 (0.1%) |
| PET other | 42 (0.2%) |
| Structural MRI | 316 (1.4%) |
| image_type | |
| statistic_map | 22833 (100%) |
| NIDM results statistic map | 0 (0%) |
| atlas | 0 (0%) |
| map_type | |
| univariate-beta map | 0 (0%) |
| variance | 0 (0%) |
| ROI/mask | 0 (0%) |
| T map | 20356 (89.2%) |
| other | 0 (0%) |
| Z map | 2477 (10.8%) |
| P map (given null hypothesis) | 0 (0%) |
| parcellation | 0 (0%) |
| F map | 0 (0%) |
| 1-P map ("inverted" probability) | 0 (0%) |
| anatomical | 0 (0%) |
| multivariate-beta map | 0 (0%) |
| Chi squared map | 0 (0%) |
| analysis_level | |
| single-subject | 0 (0%) |
| meta-analysis | 0 (0%) |
| group | 22833 (100%) |
| other | 0 (0%) |
| 0 (0%) | |
| is_thresholded | |
| FALSE | 22833 (100%) |
| TRUE | 0 (0%) |
| not_mni | |
| FALSE | 22833 (100%) |
| TRUE | 0 (0%) |
| perc_bad_voxels | |
| Mean (SD) | 74.7 (10.6) |
| Median [Min, Max] | 77.0 [0, 86.1] |
| perc_voxels_outside | |
| Mean (SD) | 12.3 (21.3) |
| Median [Min, Max] | 3.47 [0, 100] |
Aside from the above three approaches, there are other attributes of certain images that stops registration.
This value was chosen by eroding the non-brain mask, defined below, and looking at a histogram of the 90th quantile of the values outside the brain.
library(callr)
# Define the image processing function
process_images_c3d_fn <- function(images_data, n_cores, eps_val) {
library(parallel)
# Create output filenames
images_data$localfileadj <- sub("\\.nii(\\.gz)?$", "_eps.nii.gz",
images_data$localfile, ignore.case = TRUE)
# Step 1: Convert NaN into zeros
cat(sprintf("[%s] [BG] Converting NaN to zeros...\n", Sys.time()))
cmd1 <- paste('/usr/local/c3d-1.1.0-Linux-x86_64/bin/c3d',
images_data$localfile, '-replace NaN 0.0 -o',
images_data$localfileadj)
invisible(mclapply(cmd1, function(c) system(paste(c, ">/dev/null")),
mc.cores = n_cores))
cat(sprintf("[%s] [BG] NaN conversion complete\n", Sys.time()))
# Step 2: Apply threshold filter
cat(sprintf("[%s] [BG] Applying threshold filter (eps = %g)...\n", Sys.time(), eps_val))
cmd2 <- sprintf("%s %s -dup -threshold -%g %g 0 1 -multiply -o %s",
shQuote("/usr/local/c3d-1.1.0-Linux-x86_64/bin/c3d"),
shQuote(images_data$localfileadj), eps_val, eps_val,
shQuote(images_data$localfileadj))
invisible(mclapply(cmd2, function(c) system(paste(c, ">/dev/null")),
mc.cores = n_cores))
cat(sprintf("[%s] [BG] Threshold filtering complete\n", Sys.time()))
# Return updated filenames
return(images_data$localfileadj)
}
# Create logfiles directory if needed
logdir <- "logfiles"
if (!dir.exists(logdir)) dir.create(logdir, showWarnings = FALSE)
logfiles = c('logfiles/image_processing_c3d.log', 'logfiles/image_processing_c3d.err')
unlink(logfiles)
# Run and wait for completion
# SNV: @copilot, should only run this if not run
eps <- 1e-3
cat(sprintf("[%s] Starting image processing (%d images)...\n", Sys.time(), nrow(images)))## [2026-05-19 12:39:56.208183] Starting image processing (22833 images)...
localfileadj_result <- callr::r(process_images_c3d_fn,
args = list(images, n_cores, eps),
stdout = logfiles[1],
stderr = logfiles[2])
# Update images dataframe with processed filenames
images$localfileadj <- localfileadj_result
cat(sprintf("[%s] ✓ Image processing completed. Output: %s\n", Sys.time(), logfiles[1]))## [2026-05-19 12:43:15.195242] ✓ Image processing completed. Output: logfiles/image_processing_c3d.log
Despite the images being reported as MNI space. After visual inspection, we noticed they were not properly aligned with the MNI template. Examples of some misaligned images from the image dimension categories shown in Table 4 are shown in Figure 3.
Figure 4: Examples of image misalignment.
To resolve image misalignment, we evaluated registration of varying complexity:
For both rigid and linear registrations, we use mutual information as
the cost function (-cost mutualinfo using
FLIRT) to accommodate differences in contrast and intensity
between the template and statistical images. Results are shown for the
seven most common image dimensions covering the majority of the
dataset.
Based on these results, we chose rigid registration, as a balance between flexibility and accuracy across the image dimension categories. With the registered images, we further extracted information from the data:
voxdim_new: indicates the new voxel dimension of the
images, now all images should be 2mm isotropic.imgdim_new: indicates the new image dimension, now all
images should be 91*109*91.range_new: indicates the new range of voxel
intensity.library(RNifti)
library(callr)
# Add resliced output filenames to dataframe (fast operation, no need for background)
cat(sprintf("[%s] Adding resliced output filenames\n", Sys.time()))## [2026-05-19 12:43:15.278141] Adding resliced output filenames
images$resliced <- gsub('.nii.gz', '_rigid_cost_step1.nii.gz', images$localfile)
# Define the registration function to run in background
register_images_fn <- function(images_data, template_path, n_cores, outdir) {
library(parallel)
## FSL Setup - Robust
if (!nzchar(Sys.which("flirt"))) {
if (!nzchar(Sys.getenv("FSLDIR"))) {
fsl_candidates <- c("/usr/local/fsl", "/usr/share/fsl", "/opt/fsl")
existing <- file.exists(fsl_candidates)
if (any(existing)) {
Sys.setenv(FSLDIR = fsl_candidates[which(existing)[1]])
} else {
stop("FSL not found. Please install FSL and set FSLDIR environment variable")
}
}
Sys.setenv(FSLOUTPUTTYPE = "NIFTI_GZ")
fslbin <- file.path(Sys.getenv("FSLDIR"), "bin")
Sys.setenv(PATH = paste(fslbin, Sys.getenv("PATH"), sep = ":"))
}
## Reference image
ref <- file.path(Sys.getenv("FSLDIR"), "data", "standard", "MNI152_T1_2mm_brain.nii.gz")
## Helpers
validate_flirt <- function() {
flirt <- Sys.which("flirt")
if (!nzchar(flirt)) stop("FLIRT not found in PATH")
if (system2(flirt, args = "-version", stdout = FALSE, stderr = FALSE) != 0) {
stop("FLIRT exists but failed to run. Please check FSL installation")
}
flirt
}
flirt <- validate_flirt()
reslice_image <- function(input, output, ref) {
if (!file.exists(output)) {
message(sprintf("[%s] Reslicing: %s", Sys.time(), basename(input)))
cmd <- sprintf("%s -in '%s' -ref '%s' -out '%s' -dof 6 -cost mutualinfo -interp trilinear",
flirt, input, ref, output)
status <- system(cmd, ignore.stdout = TRUE, ignore.stderr = TRUE)
if (status != 0 || !file.exists(output) || file.size(output) == 0) {
warning("FLIRT failed for: ", input)
return(NA_character_)
}
}
return(output)
}
## Reslice all images
cat(sprintf("[%s] [BG] Starting reslicing with %d cores...\n", Sys.time(), n_cores))
results <- mcmapply(
FUN = reslice_image,
input = images_data$localfileadj,
output = images_data$resliced,
MoreArgs = list(ref = ref),
mc.cores = n_cores
)
cat(sprintf("[%s] [BG] Image registration completed\n", Sys.time()))
}
# Create logfiles directory if needed
logdir <- "logfiles"
if (!dir.exists(logdir)) dir.create(logdir, showWarnings = FALSE)
logfiles = c('logfiles/flirt_registration.log', 'logfiles/flirt_registration.err')
unlink(logfiles)
# Start background process - DON'T WAIT for completion
bg_process <- callr::r(register_images_fn,
args = list(images, template, n_cores, outdir),
stdout = logfiles[1],
stderr = logfiles[2])
message(sprintf("[%s] Background process started. Running image registration. Output: %s",
Sys.time(), logfiles[1]))library(stringr)
# Check if post-registration extraction and coverage data already exists this month
rds_file_path <- get_monthly_rds_path("imageExtractionAndCoverage")
file_exists_this_month <- file.exists(rds_file_path)
if (!file_exists_this_month) {
# ============================================================================
# PART 1: FEATURE EXTRACTION FROM REGISTERED IMAGES
# ============================================================================
# Count non-zero voxels with error handling
cmd_vox <- paste(c3d_bin, images$resliced, '-threshold 0 0 0 1 -voxel-sum')
nvox_result <- tryCatch({
unlist(mclapply(cmd_vox, function(cmd) {
tryCatch(system(cmd, intern = TRUE), error = function(e) "")
}, mc.cores = n_cores))
}, error = function(e) {
cat(sprintf("[%s] Warning: c3d voxel count failed\n", Sys.time()))
rep("", length(cmd_vox))
})
images$nvox_new <- as.numeric(gsub("[^0-9.]", "", nvox_result))
images$nvox_new[is.na(images$nvox_new)] <- NA
# Extract new info with c3d -info
cat(sprintf("[%s] Extracting image info (voxdim, imgdim, orientation, range)...\n", Sys.time()))
cmd_info <- paste(c3d_bin, images$resliced, "-info")
info_result <- tryCatch({
mclapply(cmd_info, function(cmd) {
tryCatch(system(cmd, intern = TRUE), error = function(e) "")
}, mc.cores = n_cores)
}, error = function(e) {
cat(sprintf("[%s] Warning: c3d -info failed\n", Sys.time()))
rep(list(""), length(cmd_info))
})
# Enhanced extractField function with error handling
extractField <- function(x, field_regex) {
tryCatch({
if (length(x) == 0 || all(x == "")) return(NA_character_)
line <- x[1]
if (is.na(line) || line == "") return(NA_character_)
match <- regmatches(line, regexpr(field_regex, line))
if (length(match) == 0) return(NA_character_)
return(match)
}, error = function(e) NA_character_)
}
# Extract fields with safe mapping
cat(sprintf("[%s] Parsing extracted fields...\n", Sys.time()))
images$voxdim_new <- sapply(info_result, extractField, "vox = \\[.*?\\]")
images$imgdim_new <- sapply(info_result, extractField, "dim = \\[.*?\\]")
images$orient_new <- sapply(info_result, extractField, "orient = [A-Z]+")
images$range_new <- sapply(info_result, extractField, "range = \\[.*?\\]")
# Count how many extractions were successful
na_count <- sum(is.na(images$voxdim_new))
cat(sprintf("[%s] Extraction complete: %d successful, %d with missing/failed output\n",
Sys.time(), nrow(images) - na_count, na_count))
# ============================================================================
# PART 2: COMPUTE COVERAGE STATISTICS
# ============================================================================
# Get files for coverage computation
files_all <- images$resliced
keep <- !is.na(files_all) & file.exists(files_all)
files <- files_all[keep]
idx <- which(keep)
# Function to compute mask overlap
maskOverlap = function(mask_path, files){
mask = readNifti(mask_path)
unlist(mclapply(files, function(file){
out = sum(readNifti(file)[ mask==1]!=0)
}, mc.cores = n_cores))
}
# Compute coverage for different masks
nnz_outside = maskOverlap(nonbrain_mask, files)
nnz_gm = maskOverlap(gm_mask, files)
nnz_wm = maskOverlap(wm_mask, files)
# Get minimum nonzero value in the brain complement
maskcomp = readNifti(nonbrain_eroded_mask)
q90_outside = unlist(mclapply(files, function(file){
out = abs(readNifti(file)[ maskcomp==1])
quantile(out, probs=0.9)
}, mc.cores = n_cores))
# Write back to images (order-preserving; NAs for missing files)
images$nvox_outside[idx] <- nnz_outside
images$nvox_gm[idx] <- nnz_gm
images$nvox_wm[idx] <- nnz_wm
images$q90_outside[idx] = q90_outside
# Save to versioned RDS file
if (!dir.exists(rds_dir)) dir.create(rds_dir, showWarnings = FALSE, recursive = TRUE)
rds_file_path <- get_monthly_rds_path("imageExtractionAndCoverage")
cat(sprintf("[%s] Saving extracted features and coverage data to %s\n", Sys.time(), rds_file_path))
saveRDS(images, file = rds_file_path, compress = "gzip")
cat(sprintf("[%s] ✓ Extraction and coverage data saved successfully\n", Sys.time()))
} else {
# Load previously computed data
images <- load_monthly_rds("imageExtractionAndCoverage")
message("Loaded previously computed extraction and coverage data from ", rds_file_path)
}## Loaded previously computed extraction and coverage data from /media/disk2/ree/metadataset/rds/imageExtractionAndCoverage_2026-05.rds
After registration, we compute the proportion of nonzero voxels overlapping with three spatial masks derived from the MNI template: gray matter, white matter, and the region outside both (i.e., non-brain voxels). These coverage metrics serve as a proxy for registration quality. A well-registered brain image should have high overlap with gray and white matter and minimal signal outside the brain mask.
Figure 5 shows the distribution of these three coverage proportions across all registered images. Based on these histograms, we selected thresholds that separate the bulk of the distribution from a low-coverage tail, and visually evaluated excluded images to confirm the thresholds were not discarding well-registered data.
# thresholds
GM_MIN <- 100000
WM_MIN <- 20000
OUT_MAX <- 100000
GM_total <- sum(readNifti(gm_mask))
WM_total <- sum(readNifti(wm_mask))
out_total <- 91*109*91 - GM_total - WM_total
images$gm_perc <- images$nvox_gm / GM_total
images$wm_perc <- images$nvox_wm / WM_total
images$outside_perc <- images$nvox_outside / out_total
library(GGally)
library(ggplot2)
library(scales)
library(rlang)
# --- proportions already computed above ---
coverage_df <- images[, c("gm_perc", "wm_perc", "outside_perc")]
coverage_df <- na.omit(coverage_df)
colnames(coverage_df) <- c("Gray Matter", "White Matter", "Outside Mask")
# convert to percent for plotting
coverage_pct <- coverage_df * 100
# convert voxel-count thresholds to percent thresholds
ref_pct <- c(
"Gray Matter" = 100 * GM_MIN / GM_total,
"White Matter" = 100 * WM_MIN / WM_total,
"Outside Mask" = 100 * OUT_MAX / out_total
)
# formatter for axes
pct_lab <- scales::label_percent(scale = 1, accuracy = 1) # since data already in 0-100
diag_fun <- function(data, mapping, ...) {
p <- GGally::ggally_barDiag(data, mapping, ...)
var <- as_label(mapping$x)
thr <- ref_pct[[var]]
if (!is.null(thr)) p <- p + geom_vline(xintercept = thr, linetype = "dashed")
p + scale_x_continuous(labels = pct_lab)
}
lower_fun <- function(data, mapping, ...) {
p <- GGally::ggally_points(data, mapping, ...)
xvar <- as_label(mapping$x)
yvar <- as_label(mapping$y)
# Reference identity line y = x
p <- p + geom_abline(
slope = 1, intercept = 0,
linetype = "solid", color = "black"
)
# Optional reference thresholds
if (!is.null(ref_pct[[xvar]]))
p <- p + geom_vline(xintercept = ref_pct[[xvar]], linetype = "dashed")
if (!is.null(ref_pct[[yvar]]))
p <- p + geom_hline(yintercept = ref_pct[[yvar]], linetype = "dashed")
p +
scale_x_continuous(labels = pct_lab) +
scale_y_continuous(labels = pct_lab)
}
upper_fun <- function(data, mapping, ...) {
GGally::ggally_cor(data, mapping, ...) +
theme_void()
}
GGally::ggpairs(
coverage_pct,
title = "Figure 5: Spatial Coverage Percentage",
upper = list(continuous = "cor"),
diag = list(continuous = diag_fun),
lower = list(continuous = wrap(lower_fun, alpha = 0.35, size = 0.6))
)We apply the following three exclusions:
Gray matter coverage above 55% — images with fewer than this proportion of nonzero voxels inside gray matter are likely missing substantial cortical signal, possibly due to partial brain coverage or failed registration.
White matter coverage above 32% — low white matter coverage similarly suggests the image does not span the full brain volume.
Non-brain signal below 15% — images with a high proportion of nonzero voxels outside the brain mask likely contain registration artifacts or were not properly skull-stripped prior to upload.
Together, these criteria remove images that are unlikely to represent valid whole-brain statistical maps in MNI space, regardless of whether they passed the earlier metadata-based filters.
# keep ONLY rows that pass all three checks (and are finite)
images$gm_exclude = images$nvox_gm <= GM_MIN
images$wm_exclude = images$nvox_wm <= WM_MIN
images$nvox_outside_exclude = (images$nvox_outside >= OUT_MAX)
# SNV: Hoazheng, to be consistent with the prior sections, lets use a table structure to show the exclusionary criteria
table3Form = update.formula(table2Form, ~ gm_exclude + wm_exclude + nvox_outside_exclude + .)
table1(table3Form, data=images,caption = "Table 5: Coverage-based exclusionary criteria prior to removal.")| Overall (N=22954) |
|
|---|---|
| gm_exclude | |
| Yes | 212 (0.9%) |
| No | 22742 (99.1%) |
| wm_exclude | |
| Yes | 77 (0.3%) |
| No | 22877 (99.7%) |
| nvox_outside_exclude | |
| Yes | 787 (3.4%) |
| No | 22167 (96.6%) |
| duplicated | |
| Yes | 0 (0%) |
| No | 22954 (100%) |
| is_proportional | |
| Yes | 22954 (100%) |
| No | 0 (0%) |
| range_exclusion | |
| Yes | 0 (0%) |
| No | 22954 (100%) |
| dim_mm | |
| 182*218*182 | 10757 (46.9%) |
| 195*231*196 | 2350 (10.2%) |
| 194*230*194 | 2083 (9.1%) |
| 195*231*195 | 1213 (5.3%) |
| 192*228*192 | 1094 (4.8%) |
| 158*190*158 | 825 (3.6%) |
| 195*232.5*195 | 648 (2.8%) |
| 159*189*138 | 544 (2.4%) |
| 195*231*198 | 475 (2.1%) |
| 181.5*217.5*181.5 | 465 (2.0%) |
| 196.749*230.755*194.4 | 408 (1.8%) |
| 183*219*183 | 296 (1.3%) |
| 159*189*156 | 291 (1.3%) |
| 194*230*194.4 | 288 (1.3%) |
| 150*186*160 | 108 (0.5%) |
| 184.25*220*184 | 77 (0.3%) |
| 157.5*192.5*157.5 | 76 (0.3%) |
| 162*192*150 | 76 (0.3%) |
| 180*216*180 | 63 (0.3%) |
| 194.4*230.4*194.4 | 59 (0.3%) |
| 158*190*156 | 56 (0.2%) |
| 158*190*138 | 45 (0.2%) |
| 194*230*195 | 45 (0.2%) |
| 196.218*231.894*196.42 | 33 (0.1%) |
| 170*210*145 | 30 (0.1%) |
| 197*233*189 | 28 (0.1%) |
| 193.2*229.2*193.2 | 26 (0.1%) |
| 181*217*181 | 24 (0.1%) |
| 180*217.5*180 | 20 (0.1%) |
| 188*220*184 | 19 (0.1%) |
| 158*190*168 | 18 (0.1%) |
| 182*217*182 | 18 (0.1%) |
| 148*184*156 | 16 (0.1%) |
| 196*232*197.6 | 15 (0.1%) |
| 158*190*174 | 14 (0.1%) |
| 162.4*191.4*150.8 | 14 (0.1%) |
| 194.247*230.346*195 | 14 (0.1%) |
| 195.8*231*195.8 | 14 (0.1%) |
| 157.5*190*137.5 | 13 (0.1%) |
| 182.5*220*182.5 | 13 (0.1%) |
| 169.5*205.5*169.5 | 12 (0.1%) |
| 159*195*156 | 12 (0.1%) |
| 194*230*195.8 | 12 (0.1%) |
| 157.5*190.5*157.5 | 11 (0.0%) |
| 192.5*229.25*192.5 | 11 (0.0%) |
| 196.875*231.25*198 | 11 (0.0%) |
| 153.9*189*160.38 | 10 (0.0%) |
| 164*192*140 | 10 (0.0%) |
| 196.552*231.384*195.25 | 10 (0.0%) |
| 152*184*152 | 9 (0.0%) |
| 161*190.75*150.5 | 9 (0.0%) |
| 195.816*231.636*198.75 | 9 (0.0%) |
| 156*186*147 | 8 (0.0%) |
| 157.5*190*157.5 | 8 (0.0%) |
| 158*190*136.5 | 8 (0.0%) |
| 194.4*230.4*196.56 | 8 (0.0%) |
| 153*189*147 | 7 (0.0%) |
| 156.6*189*158.4 | 6 (0.0%) |
| 195*231*197.12 | 6 (0.0%) |
| 157.5*189*136.5 | 5 (0.0%) |
| 159*195*159 | 5 (0.0%) |
| 196.472*232.412*194.4 | 5 (0.0%) |
| 198.75*232.5*197.6 | 5 (0.0%) |
| 198*231*195 | 5 (0.0%) |
| 154*190*164 | 4 (0.0%) |
| 157.5*195*171 | 4 (0.0%) |
| 182*218*183 | 4 (0.0%) |
| 193.5*229.5*195 | 4 (0.0%) |
| 195*231*196.56 | 4 (0.0%) |
| 195*231*198.8 | 4 (0.0%) |
| 139.01492*173.76865*151.65264 | 3 (0.0%) |
| 192.5*227.5*192.5 | 3 (0.0%) |
| 196.735*231.77*197.1 | 3 (0.0%) |
| 144*180*146 | 2 (0.0%) |
| 157*189*156 | 2 (0.0%) |
| 159*192*156 | 2 (0.0%) |
| 160*192*140 | 2 (0.0%) |
| 164*192*160 | 2 (0.0%) |
| 169*205*169 | 2 (0.0%) |
| 171*207*171 | 2 (0.0%) |
| 178.5*214.5*178.5 | 2 (0.0%) |
| 182.5*217.5*182.5 | 2 (0.0%) |
| 183*249*183 | 2 (0.0%) |
| 192.5*230*192.5 | 2 (0.0%) |
| 195.468*230.373*195.5 | 2 (0.0%) |
| 195.966*233.784*197.2 | 2 (0.0%) |
| 195*232.5*198 | 2 (0.0%) |
| 200*235*191.4 | 2 (0.0%) |
| 202*232*192 | 2 (0.0%) |
| 150.5*185*158 | 1 (0.0%) |
| 159*192*138 | 1 (0.0%) |
| 159*192*159 | 1 (0.0%) |
| 161.25*191.25*136.8 | 1 (0.0%) |
| 161.25*191.25*140 | 1 (0.0%) |
| 161*191*151 | 1 (0.0%) |
| 180*219*184 | 1 (0.0%) |
| 183*219*174 | 1 (0.0%) |
| 183*219*183.6 | 1 (0.0%) |
| 184.8*220.8*182.4 | 1 (0.0%) |
| 188*224*184 | 1 (0.0%) |
| 190.9*227.7*190.9 | 1 (0.0%) |
| 192.5*230.3125*192.5 | 1 (0.0%) |
| 193.6*228.8*193.6 | 1 (0.0%) |
| 193*229*193 | 1 (0.0%) |
| 198*234*190 | 1 (0.0%) |
| number_of_subjects | |
| Mean (SD) | 1490 (213000) |
| Median [Min, Max] | 52.0 [1.00, 32200000] |
| modality | |
| Diffusion MRI | 39 (0.2%) |
| fMRI-BOLD | 15121 (65.9%) |
| fMRI-CBF | 14 (0.1%) |
| fMRI-CBV | 1 (0.0%) |
| MEG | 39 (0.2%) |
| Other | 7351 (32.0%) |
| PET FDG | 31 (0.1%) |
| PET other | 42 (0.2%) |
| Structural MRI | 316 (1.4%) |
| image_type | |
| statistic_map | 22954 (100%) |
| NIDM results statistic map | 0 (0%) |
| atlas | 0 (0%) |
| map_type | |
| univariate-beta map | 0 (0%) |
| variance | 0 (0%) |
| ROI/mask | 0 (0%) |
| T map | 20477 (89.2%) |
| other | 0 (0%) |
| Z map | 2477 (10.8%) |
| P map (given null hypothesis) | 0 (0%) |
| parcellation | 0 (0%) |
| F map | 0 (0%) |
| 1-P map ("inverted" probability) | 0 (0%) |
| anatomical | 0 (0%) |
| multivariate-beta map | 0 (0%) |
| Chi squared map | 0 (0%) |
| analysis_level | |
| single-subject | 0 (0%) |
| meta-analysis | 0 (0%) |
| group | 22954 (100%) |
| other | 0 (0%) |
| 0 (0%) | |
| is_thresholded | |
| FALSE | 22954 (100%) |
| TRUE | 0 (0%) |
| not_mni | |
| FALSE | 22954 (100%) |
| TRUE | 0 (0%) |
| perc_bad_voxels | |
| Mean (SD) | 74.7 (10.6) |
| Median [Min, Max] | 76.8 [0, 86.1] |
| perc_voxels_outside | |
| Mean (SD) | 12.4 (21.3) |
| Median [Min, Max] | 3.65 [0, 100] |
## The following chunk finds out the ones that should be excluded, i.e. low coverage, and visualize them
par(mfrow = c(3, 4), mar = c(1, 1, 3.5, 1), pty = "m")
tempBG <- readNifti(template)
low_gm <- images[which(images$gm_exclude == TRUE), ][seq_len(min(4, sum(images$gm_exclude, na.rm = TRUE))), ]
low_wm <- images[which(images$wm_exclude == TRUE), ][seq_len(min(4, sum(images$wm_exclude, na.rm = TRUE))), ]
high_out <- images[which(images$nvox_outside_exclude == TRUE), ][seq_len(min(4, sum(images$nvox_outside_exclude, na.rm = TRUE))), ]
spot_check <- rbind(low_gm, low_wm, high_out)
spot_check$reason <- c(
rep("Low gray matter", nrow(low_gm)),
rep("Low white matter", nrow(low_wm)),
rep("High outside mask", nrow(high_out))
)
for (i in seq_len(nrow(spot_check))) {
img <- readNifti(spot_check$resliced[i])
pbj:::image.niftiImage(
img, BGimg = tempBG, lo = FALSE, index = 35,
limits = quantile(abs(img), probs = 0.01),
plane = 'sagittal'
)
title(
main = paste0(
spot_check$reason[i], "\n",
"GM: ", round(100 * spot_check$gm_perc[i], 1), "% | ",
"WM: ", round(100 * spot_check$wm_perc[i], 1), "% | ",
"Outside: ", round(100 * spot_check$outside_perc[i], 1), "%"
),
cex.main = 0.65,
line = 0.3
)
}Figure 6: Representative sagittal slices of excluded images grouped by exclusion reason. Top row: low gray matter coverage (images with sparse or misaligned cortical signal). Middle row: low white matter coverage (images with minimal subcortical signal, likely due to partial brain coverage or poor registration). Bottom row: excessive non-brain signal (images where signal extends well beyond the brain mask, indicating registration failure or absent skull-stripping).
n_before <- nrow(images)
images <- images[which(!(images$gm_exclude | images$wm_exclude | images$nvox_outside_exclude)), ]
n_after <- nrow(images)
cat(sprintf("Excluded %d images. Remaining: %d\n", n_before - n_after, n_after))## Excluded 999 images. Remaining: 21955
The following shows what the dataset looks like after excluding.
# SNV: Changed to coverage_pct to be consistent
# --- proportions already computed above ---
coverage_df <- images[, c("gm_perc", "wm_perc", "outside_perc")]
coverage_df <- na.omit(coverage_df)
colnames(coverage_df) <- c("Gray Matter", "White Matter", "Outside Mask")
# convert to percent for plotting
coverage_pct <- coverage_df * 100
GGally::ggpairs(
coverage_pct,
title = "Figure 7: Spatial Coverage Percentage",
upper = list(continuous = "cor"),
diag = list(continuous = diag_fun),
lower = list(continuous = wrap(lower_fun, alpha = 0.35, size = 0.6))
)Here, we convert nonzero values in the T and Z statistical images
into the signed version of the robust effect size index (RESI; using
RESI::t2S for T maps and RESI::z2S for Z
maps), based on each metadata sample size reported with the images. We
save the resulting effect size images and summary statistics within gray
and white matter (mean and 0, 0.25, 0.5, 0.75, 1 quantiles).
library(RNifti)
library(callr)
library(RESI)
# Check if effect size computation was already done this month
rds_file_path <- get_monthly_rds_path("imageMetadataEffectSizes")
file_exists_this_month <- file.exists(rds_file_path)
if (!file_exists_this_month) {
# Define the effect size computation function to run in background - function saves results internally
compute_and_save_effect_sizes_fn <- function(images_data, gm_mask_path, wm_mask_path,
mask_2mm_path, n_cores, rds_dir) {
library(RNifti)
library(parallel)
library(RESI)
# Helper function to get monthly RDS path
get_monthly_rds_path <- function(base_name, dir = rds_dir) {
get_date_suffix <- function() {
format(Sys.Date(), "%Y-%m")
}
date_suffix <- get_date_suffix()
current_file <- file.path(dir, paste0(base_name, "_", date_suffix, ".rds"))
pattern <- paste0("^", base_name, "_[0-9]{4}-[0-9]{2}\\.rds$")
all_files <- list.files(dir, pattern = pattern, full.names = TRUE)
old_files <- all_files[all_files != current_file]
if (length(old_files) > 0) {
invisible(lapply(old_files, file.remove))
}
return(current_file)
}
# Load masks once
gm_mask_nifti <- RNifti::readNifti(gm_mask_path)
wm_mask_nifti <- RNifti::readNifti(wm_mask_path)
brain_mask_nifti <- RNifti::readNifti(mask_2mm_path)
# Validate that required columns exist
cat(sprintf("[%s] [BG] Starting effect size computation for %d images...\n", Sys.time(), nrow(images_data)))
# Get indices of nonzero voxels once for efficiency
gm_idx <- which(gm_mask_nifti != 0)
wm_idx <- which(wm_mask_nifti != 0)
brain_idx <- which(brain_mask_nifti != 0)
# Worker function that takes ALL arguments explicitly (no lexical scoping)
process_single_image <- function(idx, file_path, n_subj, map_type, gm_idx, wm_idx, brain_idx) {
tryCatch({
if (is.na(file_path) || !nzchar(file_path) || !file.exists(file_path)) return(NULL)
if (is.na(n_subj) || is.na(map_type)) return(NULL)
img <- RNifti::readNifti(file_path)
# Only extract nonzero values
nonzero_mask <- (img != 0) & !is.na(img)
valid_idx <- which(nonzero_mask)
if (length(valid_idx) == 0) return(NULL)
valid_vals <- img[valid_idx]
# Convert map_type to character to handle factors
map_type_char <- as.character(map_type)
# Convert effect sizes
if (map_type_char == "T map") {
converted <- RESI::t2S(valid_vals, rdf = n_subj - 1, n = n_subj)
} else if (map_type_char == "Z map") {
converted <- RESI::z2S(valid_vals, n = n_subj, unbiased = TRUE)
} else {
return(NULL)
}
# Build effect size image only with converted values
effect_img <- img
effect_img[valid_idx] <- converted
effect_img[!is.finite(effect_img)] <- 0
output_dir <- dirname(file_path)
if (!dir.exists(output_dir)) dir.create(output_dir, recursive = TRUE)
base_name <- sub("\\.nii\\.gz$", "", basename(file_path))
out_file <- file.path(output_dir, paste0(base_name, "_effect_size.nii.gz"))
RNifti::writeNifti(effect_img, out_file)
# Get stats from converted values only (not from full image)
effect_only <- converted[converted != 0 & is.finite(converted)]
es_stats <- if (length(effect_only) > 0) {
c(
quantile(effect_only, type = 7),
es_mean = mean(effect_only)
)
} else rep(NA_real_, 6)
# Compute masked statistics using indices only (no intermediate arrays)
grey_vals <- effect_img[gm_idx]
white_vals <- effect_img[wm_idx]
# Write brain-masked version
brain_file <- sub("\\.nii\\.gz$", "_masked_mni.nii.gz", out_file)
brain_masked <- effect_img
brain_masked[-brain_idx] <- 0 # Zero out non-brain voxels
RNifti::writeNifti(brain_masked, brain_file)
list(
index = idx,
effect_size_file = out_file,
brain_file = brain_file,
es_min = es_stats[1],
es_1qt = es_stats[2],
es_median = es_stats[3],
es_3qt = es_stats[4],
es_max = es_stats[5],
es_mean = es_stats[6],
grey_mean = if (length(grey_vals) > 0) mean(grey_vals[grey_vals != 0], na.rm = TRUE) else NA_real_,
grey_min = if (length(grey_vals) > 0) min(grey_vals[grey_vals != 0], na.rm = TRUE) else NA_real_,
grey_max = if (length(grey_vals) > 0) max(grey_vals[grey_vals != 0], na.rm = TRUE) else NA_real_,
white_mean = if (length(white_vals) > 0) mean(white_vals[white_vals != 0], na.rm = TRUE) else NA_real_,
white_min = if (length(white_vals) > 0) min(white_vals[white_vals != 0], na.rm = TRUE) else NA_real_,
white_max = if (length(white_vals) > 0) max(white_vals[white_vals != 0], na.rm = TRUE) else NA_real_
)
}, error = function(e) {
message(" [", Sys.time(), "] Skipping index ", idx, ": ", e$message)
NULL
})
}
# Use mcmapply to pass only necessary vectors to workers (not entire environment)
cat(sprintf("[%s] [BG] Starting effect size computation for %d images...\n", Sys.time(), nrow(images_data)))
results <- mcmapply(
process_single_image,
idx = 1:nrow(images_data),
file_path = images_data$resliced,
n_subj = images_data$number_of_subjects,
map_type = images_data$map_type,
MoreArgs = list(gm_idx = gm_idx, wm_idx = wm_idx, brain_idx = brain_idx),
mc.cores = n_cores,
SIMPLIFY = FALSE
)
# Filter and bind
cat(sprintf("[%s] Processing results...\n", Sys.time()))
results <- Filter(Negate(is.null), results)
df_results <- do.call(rbind, lapply(results, as.data.frame))
# Add index column to images_data for matching
images_data$index <- 1:nrow(images_data)
# merge keeping all of images, matched by index
images_data = merge(images_data, df_results, all.x=TRUE)
# Remove the temporary index column (not in original)
images_data$index <- NULL
# Save directly to versioned RDS file
if (!dir.exists(rds_dir)) dir.create(rds_dir, showWarnings = FALSE, recursive = TRUE)
rds_path <- get_monthly_rds_path("imageMetadataEffectSizes")
saveRDS(images_data, file = rds_path)
cat(sprintf("[%s] Effect size computation completed. Results saved to %s\n", Sys.time(), rds_path))
}
logdir <- "logfiles"
if (!dir.exists(logdir)) dir.create(logdir, showWarnings = FALSE)
logfiles = c('logfiles/effectSize_estimation.log', 'logfiles/effectSize_estimation.err')
unlink(logfiles)
# Start background process - DON'T WAIT for completion
bg_process <- callr::r(compute_and_save_effect_sizes_fn,
args = list(images, gm_mask, wm_mask, mask_2mm_file, n_cores, rds_dir),
stdout = logfiles[1],
stderr = logfiles[2])
message(sprintf("[%s] Background process started, Effect size computation will be saved to %s",
Sys.time(), rds_file_path))
} else {
# Load already computed effect sizes
images <- load_monthly_rds("imageMetadataEffectSizes")
message("Loaded previously computed effect sizes from ", rds_file_path)
}To screen for outlier images, we model the expected variability of these mean effect sizes as a function of (i) study sample size and (ii) voxel coverage, and flag observations that fall outside the implied funnel. The mean value of the effect size, \(y_{ij}\), for image \(j\) in collection \(i\), is proportional to \(n_i^{-1} V_i^{-2}\), where \(n_i\) denotes the number of subjects (sample size) and \(V_{ij}\) denotes the number of nonzero voxels in image \(j\) (see preprint for details).
We fit an intercept-only linear mixed-effects model with a collection-specific random intercept:
\[y_{ij} = \mu + u_{i} + \varepsilon_{ij},\]
where \(\mu\) is the grand mean effect size and \(u_{i} \sim \mathcal{N}(0, \tau^2)\) captures between-collection heterogeneity.
To account for heteroskedasticity across images, we model the residual variance using a power-of-covariates structure implemented via the \(\texttt{varPower}\) variance function in \(\texttt{nlme}\), and combine multiple variance components using . Specifically, for study-level covariates \(N_{ij}\) (sample size) and \(V_{ij}\) (scaled voxel coverage), we assume
\[\mathrm{Var}(\varepsilon_{ij}) = \sigma^2 \, |N_{ij}|^{2\delta_1} \, |V_{ij}|^{2\delta_2},\]
so that residual variability decreases as \(N\) and/or \(V\) increase when \(\delta_1 < 0\) and/or \(\delta_2 < 0\).
The total per-observation variance incorporates both residual and between-collection variability:
\[\sigma_{ij}^2 = \tau^2 + \sigma^2 \, |N_{ij}|^{2\delta_1} \, |V_{ij}|^{2\delta_2}.\]
This quantity represents the model-implied variance of observed mean effect sizes as a function of sample size and voxel coverage.
We flag whole images as outliers if their deviation from the fitted grand mean exceeds a multiple of its modeled standard deviation:
\[\bigl| y_{ij} - \hat{\mu} \bigr| > k \, \sigma_{ij},\]
with \(k = 3\).
This criterion standardizes residuals across studies while accounting for heteroskedasticity and within-collection correlation. This outlier identification procedure was applied only to the gray matter mean effect sizes. To mitigate the effect of outliers on the model fit used for outlier identification, we windsorized the data at the 1st and 99th percentiles. We did this after noting that a few extreme image values had a huge effect on the residual variance. The same procedure was applied separately to gray matter and white matter mean effect sizes. These analyses serve both to validate the effect-size transformation and to provide a principled, model-based mechanism for identifying implausible images prior to downstream inference. While alternative filtering strategies were explored, we selected the mixed-effects variance modeling approach as the most statistically grounded and interpretable method for screening artifactual data points. A summary table consolidating key model estimates and variance components is provided below.
# SNV: excludes missing es_mean -- effect size could not be computed?
images = images[ !is.na(images$es_mean),]
winsorize_vec <- function(x, probs = c(0.01, 0.99)) {
qs <- quantile(x, probs = probs, na.rm = TRUE, names = FALSE, type = 7)
x[x < qs[1]] <- qs[1]
x[x > qs[2]] <- qs[2]
x
}
# Load effect size data
images$es_mean_winsorized <- winsorize_vec(images$es_mean)
images$voxel_scaled <- images$nvox_new/1000
m_power_winsorized <- lme(
es_mean_winsorized ~ 1,
random = ~ 1 | collection_id,
weights = varComb(
varPower(form = ~ number_of_subjects),
varPower(form = ~ voxel_scaled)),
data = images,
method = "REML"
)
m_power_original <- lme(
es_mean ~ 1,
random = ~ 1 | collection_id,
weights = varComb(
varPower(form = ~ number_of_subjects),
varPower(form = ~ voxel_scaled)),
data = images,
method = "REML"
)
summary(m_power_winsorized)## Linear mixed-effects model fit by REML
## Data: images
## AIC BIC logLik
## -27068.31 -27028.34 13539.16
##
## Random effects:
## Formula: ~1 | collection_id
## (Intercept) Residual
## StdDev: 0.06573642 116.8415
##
## Combination of variance functions:
## Structure: Power of variance covariate
## Formula: ~number_of_subjects
## Parameter estimates:
## power
## 0.02178354
## Structure: Power of variance covariate
## Formula: ~voxel_scaled
## Parameter estimates:
## power
## -1.270854
## Fixed effects: es_mean_winsorized ~ 1
## Value Std.Error DF t-value p-value
## (Intercept) 0.00204014 0.001561378 18299 1.306627 0.1914
##
## Standardized Within-Group Residuals:
## Min Q1 Med Q3 Max
## -5.44353378 -0.39624050 0.01803653 0.42848458 7.00064001
##
## Number of Observations: 21920
## Number of Groups: 3621
## Linear mixed-effects model fit by REML
## Data: images
## AIC BIC logLik
## 23272.3 23312.28 -11631.15
##
## Random effects:
## Formula: ~1 | collection_id
## (Intercept) Residual
## StdDev: 0.5330694 61200986042
##
## Combination of variance functions:
## Structure: Power of variance covariate
## Formula: ~number_of_subjects
## Parameter estimates:
## power
## -0.8910117
## Structure: Power of variance covariate
## Formula: ~voxel_scaled
## Parameter estimates:
## power
## -4.099169
## Fixed effects: es_mean ~ 1
## Value Std.Error DF t-value p-value
## (Intercept) 0.01813216 0.01029684 18299 1.760944 0.0783
##
## Standardized Within-Group Residuals:
## Min Q1 Med Q3 Max
## -4.858372e+01 -1.252907e-01 7.317224e-04 1.384293e-01 4.238303e+01
##
## Number of Observations: 21920
## Number of Groups: 3621
The following chunk uses the outlier boundaries, as described above, from the winsorized model to make exclusions.
overall_effect <- as.numeric(fixef(m_power))
sigma2 <- sigma(m_power)^2
tau2 <- as.numeric(VarCorr(m_power)[1, "Variance"])
# varPower exponents (delta_n, delta_v)
expParams <- intervals(m_power)$varStruct[, 2]
delta_n <- expParams[1]
delta_v <- expParams[2]
images$se_model <- sqrt(
tau2 + sigma2 *
(images$number_of_subjects ^ (2*delta_n)) *
(images$voxel_scaled ^ (2*delta_v))
)
# standardized residuals
images$res_std <- residuals(m_power, type = "pearson")
k <- 3
images$outlier <- abs(images$es_mean - overall_effect) > k * images$se_model
# SNV: These are all satisfied already in the dataset
# dat_plot <- subset(
# images,
# is.finite(number_of_subjects) & number_of_subjects > 0 &
# is.finite(es_mean) & is.finite(se_model)
# )
n_out <- sum(images$outlier)
n_tot <- nrow(images)
grid <- data.frame(
number_of_subjects = seq(1, 10000, length.out = 800)
)
# choose a constant voxel_scaled for the band (median is a sensible default)
grid$voxel_scaled <- median(images$voxel_scaled, na.rm = TRUE)
grid$se_model <- sqrt(
tau2 + sigma2 *
(grid$number_of_subjects ^ (2*delta_n)) *
(grid$voxel_scaled ^ (2*delta_v))
)
grid$upper <- overall_effect + k * grid$se_model
grid$lower <- overall_effect - k * grid$se_model
grid$log_n <- log10(grid$number_of_subjects)
library(plotly)
# SNV: dat_plot was the same as images, just replace it here
# dat_plot$se_model <- sqrt(
# tau2 + sigma2 *
# (dat_plot$number_of_subjects ^ (2 * delta_n)) *
# (dat_plot$voxel_scaled ^ (2 * delta_v))
# )
# dat_plot$upper_bound <- overall_effect + k * dat_plot$se_model
# dat_plot$lower_bound <- overall_effect - k * dat_plot$se_model
#
# dat_plot$outlier <- with(dat_plot, es_mean > upper_bound | es_mean < lower_bound)
n_seq <- 10^seq(0, 4, length.out = 60)
v_seq <- seq(
quantile(images$voxel_scaled, 0.05, na.rm = TRUE),
quantile(images$voxel_scaled, 0.95, na.rm = TRUE),
length.out = 60
)
grid3d <- expand.grid(number_of_subjects = n_seq, voxel_scaled = v_seq)
grid3d$se_model <- sqrt(
tau2 + sigma2 *
(grid3d$number_of_subjects ^ (2 * delta_n)) *
(grid3d$voxel_scaled ^ (2 * delta_v))
)
grid3d$upper <- overall_effect + k * grid3d$se_model
grid3d$lower <- overall_effect - k * grid3d$se_model
xg <- log10(n_seq)
yg <- v_seq
UP <- t(matrix(grid3d$upper, nrow = length(n_seq), ncol = length(v_seq)))
LO <- t(matrix(grid3d$lower, nrow = length(n_seq), ncol = length(v_seq)))
MU <- matrix(overall_effect, nrow = length(yg), ncol = length(xg))
fig8 <- plot_ly() |>
add_markers(
data = subset(images, !outlier),
x = ~log10(number_of_subjects),
y = ~voxel_scaled,
z = ~es_mean,
marker = list(size = 2, opacity = 0.25, color = "gray"),
name = "Inside funnel"
) |>
add_markers(
data = subset(images, outlier),
x = ~log10(number_of_subjects),
y = ~voxel_scaled,
z = ~es_mean,
marker = list(size = 4, color = "red"),
name = sprintf("Outliers (n = %d / %d)", n_out, n_tot)
) |>
add_surface(
x = xg, y = yg, z = MU,
opacity = 0.35,
colorscale = list(c(0,1), c("royalblue","royalblue")),
showscale = FALSE,
name = "Overall effect (μ)"
) |>
add_surface(
x = xg, y = yg, z = UP,
opacity = 0.20,
colorscale = list(c(0,1), c("orange","orange")),
showscale = FALSE,
name = sprintf("Upper bound (μ + %d·SD)", k)
) |>
add_surface(
x = xg, y = yg, z = LO,
opacity = 0.20,
colorscale = list(c(0,1), c("orange","orange")),
showscale = FALSE,
name = sprintf("Lower bound (μ − %d·SD)", k)
) |>
layout(
title = list(
text = paste0(
"<b>Figure 8: Outlier Detection via Winsorized Variance-Power Meta-Analytic Model</b><br>",
"<sup>Effect sizes flagged as outliers if |ES − μ| > ", k, "·SD(n, v); ",
"SD modeled as a function of sample size and voxel coverage. ",
"Red points (n = ", n_out, " of ", n_tot, " images) excluded from subsequent analyses.</sup>"
),
x = 0.5,
xref = "paper",
font = list(size = 13)
),
scene = list(
xaxis = list(
title = "log\u2081\u2080(Sample size)",
titlefont = list(size = 11)
),
yaxis = list(
title = "Voxel coverage (nonzero voxels / 1e5)",
titlefont = list(size = 11)
),
zaxis = list(
title = "Mean effect size (ES)",
titlefont = list(size = 11)
),
camera = list(
eye = list(x = 1.6, y = -1.6, z = 0.8)
)
),
legend = list(
x = 0.01, y = 0.99,
bgcolor = "rgba(255,255,255,0.7)",
bordercolor = "gray",
borderwidth = 1
),
margin = list(t = 120) # extra top margin for the title
)
fig8Figure \(\ref{fig:fig8}\) presents a three-dimensional visualization of the outlier detection procedure applied to the meta-analytic image database. Each point represents a single neuroimaging study, plotted along three axes: log-transformed sample size (\(\log_{10}(n)\)), voxel coverage scaled by \(1 \times 10^5\) (y-axis), and mean effect size (\(\bar{\theta}\), z-axis).
The blue surface represents the overall pooled effect estimate (\(\hat{\mu}\)) derived from the winsorized variance-power mixed-effects model, which accounts for heteroscedasticity in effect size variance as a joint function of sample size and voxel coverage. The orange surfaces define the upper and lower outlier boundaries, set at \(\hat{\mu} \pm k \cdot \hat{\sigma}(n, v)\) where \(k = 3\), and where \(\hat{\sigma}(n, v)\) is modeled as:
\[ \hat{\sigma}(n, v) = \sqrt{\tau^2 + \sigma^2 \cdot n^{2\delta_n} \cdot v^{2\delta_v}} \]
with \(\hat{\tau}^2 = 0.0043\), \(\hat{\sigma}^2 = 1.3651935\times 10^{4}\), \(\hat{\delta}_n = 0.022\), and \(\hat{\delta}_v = -1.271\) estimated from the data.
Gray points fall within the acceptable range and are retained for subsequent analyses. Red points (\(n = 609\) of \(21920\) total images) fall outside the funnel boundaries and are flagged as outliers, indicating that their effect sizes deviate more than \(k = 3\) standard deviations from the pooled estimate given their sample size and voxel coverage characteristics. These images are excluded from all downstream analyses.
This approach is more principled than applying a fixed effect size threshold, as it explicitly acknowledges that smaller studies and those with lower voxel coverage are expected to show greater variability under the variance-power model, and calibrates the exclusion boundaries accordingly.
With the above exclusion, a new dataset is created with the outliers as defined from the model above excluded. The following chunk refits the model with the new subset without Windsorization. Outliers are identified using the same procedure, but not excluded.
# Summary table for the final curated dataset using table1 for consistency
table1( ~ outlier + number_of_subjects + es_mean + voxel_scaled + gm_perc + wm_perc + outside_perc + grey_mean + white_mean,
data = images,
caption = "Table 6: Summary statistics for the final curated Neurovault dataset before outlier exclusions.")| Overall (N=21920) |
|
|---|---|
| outlier | |
| Yes | 609 (2.8%) |
| No | 21311 (97.2%) |
| number_of_subjects | |
| Mean (SD) | 81.8 (240) |
| Median [Min, Max] | 52.0 [3.00, 30000] |
| es_mean | |
| Mean (SD) | 0.0253 (0.845) |
| Median [Min, Max] | 0.00391 [-52.7, 45.9] |
| voxel_scaled | |
| Mean (SD) | 240 (43.9) |
| Median [Min, Max] | 228 [130, 344] |
| gm_perc | |
| Mean (SD) | 0.885 (0.0991) |
| Median [Min, Max] | 0.902 [0.555, 1.00] |
| wm_perc | |
| Mean (SD) | 0.923 (0.0896) |
| Median [Min, Max] | 0.944 [0.342, 1.00] |
| outside_perc | |
| Mean (SD) | 0.0328 (0.0381) |
| Median [Min, Max] | 0.00999 [0, 0.152] |
| grey_mean | |
| Mean (SD) | 0.0282 (0.971) |
| Median [Min, Max] | 0.00585 [-64.9, 52.4] |
| white_mean | |
| Mean (SD) | 0.0196 (0.763) |
| Median [Min, Max] | 0.00191 [-43.2, 38.2] |
dat = images[ images$outlier==FALSE,]
m_power <- lme(
es_mean ~ 1,
random = ~ 1 | collection_id,
weights = varComb(
varPower(form = ~ number_of_subjects),
varPower(form = ~ voxel_scaled)),
data = dat,
method = "REML"
)
overall_effect <- as.numeric(fixef(m_power))
sigma2 <- sigma(m_power)^2
tau2 <- as.numeric(VarCorr(m_power)[1, "Variance"])
# varPower exponents (delta_n, delta_v)
expParams <- intervals(m_power)$varStruct[, 2]
delta_n <- expParams[1]
delta_v <- expParams[2]
dat$se_model <- sqrt(
tau2 + sigma2 *
(dat$number_of_subjects ^ (2*delta_n)) *
(dat$voxel_scaled ^ (2*delta_v))
)
# standardized residuals (optional panel A)
dat$res_std <- residuals(m_power, type = "pearson")
k <- 3
dat$outlier <- abs(dat$es_mean - overall_effect) > k * dat$se_model
# dat_plot <- subset(
# dat,
# is.finite(number_of_subjects) & number_of_subjects > 0 &
# is.finite(es_mean) & is.finite(se_model)
# )
# dat_sorted <- dat_plot[order(dat_plot$number_of_subjects), ]
# dat_clean <- subset(dat_sorted, !outlier)
# n_out <- sum(dat_plot$outlier)
# n_tot <- nrow(dat_plot)
# dat_plot$se_model <- sqrt(
# tau2 + sigma2 *
# (dat_plot$number_of_subjects ^ (2 * delta_n)) *
# (dat_plot$voxel_scaled ^ (2 * delta_v))
# )
# dat_plot$upper_bound <- overall_effect + k * dat_plot$se_model
# dat_plot$lower_bound <- overall_effect - k * dat_plot$se_model
# dat_plot$outlier <- with(dat_plot, es_mean > upper_bound | es_mean < lower_bound)
grid <- data.frame(
number_of_subjects = seq(1, 10000, length.out = 800)
)
# choose a constant voxel_scaled for the band (median is a sensible default)
grid$voxel_scaled <- median(dat$voxel_scaled, na.rm = TRUE)
grid$se_model <- sqrt(
tau2 + sigma2 *
(grid$number_of_subjects ^ (2*delta_n)) *
(grid$voxel_scaled ^ (2*delta_v))
)
grid$upper <- overall_effect + k * grid$se_model
grid$lower <- overall_effect - k * grid$se_model
grid$log_n <- log10(grid$number_of_subjects)
library(plotly)
n_seq <- 10^seq(0, 4, length.out = 60)
v_seq <- seq(
quantile(dat$voxel_scaled, 0.05, na.rm = TRUE),
quantile(dat$voxel_scaled, 0.95, na.rm = TRUE),
length.out = 60
)
grid3d <- expand.grid(number_of_subjects = n_seq, voxel_scaled = v_seq)
grid3d$se_model <- sqrt(
tau2 + sigma2 *
(grid3d$number_of_subjects ^ (2 * delta_n)) *
(grid3d$voxel_scaled ^ (2 * delta_v))
)
grid3d$upper <- overall_effect + k * grid3d$se_model
grid3d$lower <- overall_effect - k * grid3d$se_model
xg <- log10(n_seq)
yg <- v_seq
UP <- t(matrix(grid3d$upper, nrow = length(n_seq), ncol = length(v_seq)))
LO <- t(matrix(grid3d$lower, nrow = length(n_seq), ncol = length(v_seq)))
MU <- matrix(overall_effect, nrow = length(yg), ncol = length(xg))
n_out_refit <- sum(dat$outlier)
n_tot_refit <- nrow(dat)
fig9 <- plot_ly() |>
add_markers(
data = subset(dat, !outlier),
x = ~log10(number_of_subjects),
y = ~voxel_scaled,
z = ~es_mean,
marker = list(size = 2, opacity = 0.25, color = "gray"),
name = "Inside funnel"
) |>
add_markers(
data = subset(dat, outlier),
x = ~log10(number_of_subjects),
y = ~voxel_scaled,
z = ~es_mean,
marker = list(size = 4, color = "red"),
name = sprintf("Outliers (n = %d / %d)", n_out_refit, n_tot_refit)
) |>
add_surface(
x = xg, y = yg, z = MU,
opacity = 0.35,
colorscale = list(c(0,1), c("royalblue","royalblue")),
showscale = FALSE,
name = "Overall effect (μ)"
) |>
add_surface(
x = xg, y = yg, z = UP,
opacity = 0.20,
colorscale = list(c(0,1), c("orange","orange")),
showscale = FALSE,
name = sprintf("Upper bound (μ + %d·SD)", k)
) |>
add_surface(
x = xg, y = yg, z = LO,
opacity = 0.20,
colorscale = list(c(0,1), c("orange","orange")),
showscale = FALSE,
name = sprintf("Lower bound (μ − %d·SD)", k)
) |>
layout(
title = list(
text = paste0(
"<b>Figure 9: Refitted Variance-Power Model After Outlier Exclusion</b><br>",
"<sup>Model refit on cleaned dataset (n = ", n_tot_refit, " images). ",
"Remaining outliers flagged in red (n = ", n_out_refit, "). ",
"Final parameters: μ = ", round(overall_effect, 4),
", τ² = ", round(tau2, 4),
", σ² = ", round(sigma2, 4),
", δ_n = ", round(delta_n, 3),
", δ_v = ", round(delta_v, 3), ".</sup>"
),
x = 0.5,
xref = "paper",
font = list(size = 13)
),
scene = list(
xaxis = list(title = "log\u2081\u2080(Sample size)", titlefont = list(size = 11)),
yaxis = list(title = "Voxel coverage (nonzero voxels / 1e5)", titlefont = list(size = 11)),
zaxis = list(title = "Mean effect size (ES)", titlefont = list(size = 11))
),
legend = list(
x = 0.01, y = 0.99,
bgcolor = "rgba(255,255,255,0.7)",
bordercolor = "gray",
borderwidth = 1
),
margin = list(t = 120)
)
fig9images <- dat
saveRDS(images, get_monthly_rds_path('curatedMetadataset'))
cat(sprintf("Saved curated dataset: %d images retained\n", nrow(images)))## Saved curated dataset: 21311 images retained
To evaluate the stability of parameter estimates across data curation steps, we fit three variants of the same variance-power linear mixed-effects model:
The model specification is identical across all three fits:
\[ \bar{\theta}_{ij} = \mu + b_i + \varepsilon_{ij}, \quad b_i \sim \mathcal{N}(0, \tau^2), \quad \varepsilon_{ij} \sim \mathcal{N}\!\left(0,\; \sigma^2 \cdot n_{ij}^{2\delta_n} \cdot v_{ij}^{2\delta_v}\right) \]
where \(\mu\) is the overall pooled effect, \(b_i\) is the collection-level random intercept, and the residual variance is modeled as a power function of sample size \(n_{ij}\) and voxel coverage \(v_{ij}\).
Table \(\ref{tab:tablesummary}\) presents the fixed effect intercept (\(\hat{\mu}\)), random effect standard deviation (\(\hat{\tau}\)), residual standard deviation (\(\hat{\sigma}\)), and the two variance-power exponents (\(\hat{\delta}_n\), \(\hat{\delta}_v\)) across all three model fits.
Across the three models, the fixed effect estimate \(\hat{\mu}\) is expected to remain stable if the overall pooled effect is robust to outliers and winsorization. Changes in \(\hat{\tau}\) across models reflect shifts in between-collection heterogeneity as extreme images are progressively removed. Similarly, reductions in \(\hat{\sigma}\) from the original to the cleaned model indicate that outlier exclusion successfully reduced residual noise. The variance-power exponents \(\hat{\delta}_n\) and \(\hat{\delta}_v\) characterize how strongly effect size variability depends on sample size and voxel coverage respectively — values close to zero indicate near-homoscedastic residuals, while larger values indicate stronger heteroscedasticity.
library(dplyr)
library(knitr)
library(kableExtra)
library(nlme)
library(tidyr)
# Extract fixed and random effects parameters from the three models
extract_model_params <- function(fit, model_name) {
fixef_val <- as.numeric(fixef(fit))
tau2 <- as.numeric(VarCorr(fit)[1, "Variance"])
sigma2 <- sigma(fit)^2
expParams <- intervals(fit)$varStruct[, 2]
delta_n <- expParams[1]
delta_v <- expParams[2]
# Convert variances to standard deviations
tau_sd <- sqrt(tau2)
sigma_sd <- sqrt(sigma2)
data.frame(
Model = model_name,
Parameter = c("Fixed effect (intercept)", "Random effect SD (tau)", "Residual SD (sigma)", "delta_n", "delta_v"),
Estimate = c(fixef_val, tau_sd, sigma_sd, delta_n, delta_v),
stringsAsFactors = FALSE
)
}
# Extract parameters from all three models
params_combined <- bind_rows(
extract_model_params(m_power_original, "Original"),
extract_model_params(m_power_winsorized, "Winsorized"),
extract_model_params(m_power, "Cleaned")
)
# Format the table
tab_display <- params_combined %>%
pivot_wider(
names_from = Model,
values_from = Estimate
) %>%
mutate(
across(c(Original, Winsorized, Cleaned), function(x) {
case_when(
grepl("Fixed effect", Parameter) ~ sprintf("%.3f", x),
grepl("Residual SD", Parameter) ~ sprintf("%.2e", x),
grepl("delta", Parameter) ~ sprintf("%.3f", x),
TRUE ~ sprintf("%.4f", x)
)
})
)
kable(
tab_display,
format = "html",
escape = FALSE,
col.names = c("Parameter", "Original", "Winsorized", "Cleaned"),
caption = "Fixed and random effects parameter estimates from the three mixed-effects models."
) %>%
kable_styling(
bootstrap_options = c("striped", "hover"),
full_width = FALSE,
position = "center"
)| Parameter | Original | Winsorized | Cleaned |
|---|---|---|---|
| Fixed effect (intercept) | 0.018 | 0.002 | 0.002 |
| Random effect SD (tau) | 0.5331 | 0.0657 | 0.0413 |
| Residual SD (sigma) | 6.12e+10 | 1.17e+02 | 1.27e+03 |
| delta_n | -0.891 | 0.022 | 0.024 |
| delta_v | -4.099 | -1.271 | -1.737 |
Following a three-stage quality control pipeline, the final curated dataset comprises \(21311\) neuroimaging maps retained for downstream analysis.
The first stage examined tabular image properties available in the NeuroVault metadata, including sample size, voxel dimensions, and image type, removing maps that were duplicated, missing key descriptors, or outside the scope of the analysis (Section 3).
The second stage assessed spatial coverage by loading each image directly and computing gray matter coverage (\(88.5\%\) mean retained), white matter coverage (\(92.3\%\) mean retained), and outside-mask signal (\(3.3\%\) mean retained). Images with insufficient brain coverage or excessive non-brain signal were excluded at this stage (Section 4).
The third stage applied a model-based outlier detection procedure using a variance-power linear mixed-effects model fit on winsorized effect sizes. Images whose effect sizes deviated more than \(k = 3\) standard deviations from the pooled estimate — after accounting for heteroscedasticity as a function of sample size and voxel coverage — were flagged and removed (Section 5).
Table \(\ref{tab:curation_summary}\) summarizes the key image-level characteristics of the curated dataset. The distribution of sample sizes (\(n\)) reflects the typical range seen in neuroimaging meta-analyses, with substantial right skew driven by a small number of large studies. Mean effect size (\(\bar{\theta}\)) and voxel coverage metrics (gray matter percentage, white matter percentage, and outside-mask signal) characterize the spatial quality of the retained images.
Gray matter coverage (\(88.5\%\) mean) and white matter coverage (\(92.3\%\) mean) confirm that retained images provide adequate brain signal. The outside-mask percentage (\(3.3\%\) mean) confirms that non-brain signal is minimal across the curated set, consistent with successful skull-stripping and registration.
# Summary table for the final curated dataset using table1 for consistency
table1( ~ number_of_subjects + es_mean + voxel_scaled + gm_perc + wm_perc + outside_perc + grey_mean + white_mean,
data = images,
caption = "Table 6: Summary statistics for the final curated Neurovault dataset after outlier exclusions.")| Overall (N=21311) |
|
|---|---|
| number_of_subjects | |
| Mean (SD) | 82.5 (243) |
| Median [Min, Max] | 54.0 [3.00, 30000] |
| es_mean | |
| Mean (SD) | 0.0104 (0.120) |
| Median [Min, Max] | 0.00340 [-0.576, 0.551] |
| voxel_scaled | |
| Mean (SD) | 240 (43.9) |
| Median [Min, Max] | 228 [130, 343] |
| gm_perc | |
| Mean (SD) | 0.885 (0.0993) |
| Median [Min, Max] | 0.901 [0.557, 1.00] |
| wm_perc | |
| Mean (SD) | 0.923 (0.0892) |
| Median [Min, Max] | 0.942 [0.342, 1.00] |
| outside_perc | |
| Mean (SD) | 0.0327 (0.0379) |
| Median [Min, Max] | 0.00987 [0, 0.151] |
| grey_mean | |
| Mean (SD) | 0.0124 (0.130) |
| Median [Min, Max] | 0.00512 [-0.683, 0.613] |
| white_mean | |
| Mean (SD) | 0.00544 (0.123) |
| Median [Min, Max] | 0.00152 [-0.823, 0.625] |
## Archive complete: /media/big/neurovault_effectsizes_masked.tar
## Files archived: 21311
## [1] TRUE
library(zen4R)
# ── 1. Prepare files ──────────────────────────────────────────────────────────
current_rds <- get_monthly_rds_path("curatedMetadataset")
if (!file.exists(current_rds)) stop("Current month RDS not found: ", current_rds)
cat("RDS file:", current_rds, "\n")## RDS file: /media/disk2/ree/metadataset/rds/curatedMetadataset_2026-05.rds
# ── 2. Create tar archive of brain-masked effect size images ──────────────────
outdir <- "/media/big/neurovault"
archive <- "/media/big/neurovault_effectsizes_masked.tar"
img_paths <- images$brain_file
img_paths <- img_paths[!is.na(img_paths) & nzchar(img_paths)]
img_paths <- img_paths[file.exists(img_paths)]
outdir_abs <- normalizePath(outdir)
img_abs <- normalizePath(img_paths)
inside <- startsWith(img_abs, paste0(outdir_abs, .Platform$file.sep))
img_abs <- img_abs[inside]
rel_paths <- sub(
paste0("^", outdir_abs, .Platform$file.sep),
"",
img_abs
)
cat("Files to archive:", length(rel_paths), "\n")## Files to archive: 21311
if (!file.exists(archive)) {
old <- setwd(outdir)
on.exit(setwd(old), add = TRUE)
utils::tar(
tarfile = archive,
files = rel_paths,
compression = "none"
)
cat("Archive complete:", archive, "\n")
} else {
cat("Archive already exists, skipping tar creation:", archive, "\n")
}## Archive already exists, skipping tar creation: /media/big/neurovault_effectsizes_masked.tar
## Archive size: 0 GB
# ── 3. Connect to Zenodo ──────────────────────────────────────────────────────
token_path <- file.path(rds_dir, "zenodo_token.rds")
if (!file.exists(token_path)) stop("zenodo_token.rds not found in ", rds_dir)
token <- readRDS(token_path)
zenodo <- ZenodoManager$new(token = token, logger = "INFO")## ✔ Successfully connected to Zenodo with user token
## [zen4R][INFO] ZenodoManager - Successfully connected to Zenodo with user token
# ── 4. Fetch or create record ─────────────────────────────────────────────────
record_id_path <- file.path(rds_dir, "zenodo_record_id.rds")
if (!file.exists(record_id_path)) {
cat("No existing record — creating new\n")
myrec <- zenodo$createEmptyRecord()
myrec$setTitle("A Curated Meta-analytic Neuroimaging Dataset from NeuroVault")
myrec$setDescription("Monthly-updated curated neuroimaging meta-analytic dataset derived from NeuroVault. Contains tabular metadata and brain-masked RESI effect size images.")
myrec$addCreator(firstname = "Haozheng", lastname = "Xu",
affiliation = "Vanderbilt University")
myrec$addCreator(firstname = "Simon", lastname = "Vandekar",
affiliation = "Vanderbilt University")
myrec$setResourceType("dataset")
myrec$setPublicationDate(format(Sys.Date(), "%Y-%m-%d"))
myrec$setPublisher("Vanderbilt University Medical Center")
myrec$setSubjects(list(
list(subject = "neuroimaging"),
list(subject = "meta-analysis"),
list(subject = "NeuroVault"),
list(subject = "curated dataset")
))
myrec <- zenodo$depositRecord(myrec, publish = FALSE)
deposit_id <- as.character(myrec$id)
if (is.null(deposit_id) || deposit_id == "NULL") {
self_link <- myrec$links$self
deposit_id <- gsub(".*/records/([0-9]+)/.*", "\\1", self_link)
}
cat("New record created. Deposit ID:", deposit_id, "\n")
saveRDS(deposit_id, record_id_path)
} else {
deposit_id <- readRDS(record_id_path)
cat("Found saved deposit ID:", deposit_id, "\n")
myrec <- tryCatch(
zenodo$getDepositionById(deposit_id),
error = function(e) NULL
)
if (is.null(myrec)) {
deps <- zenodo$getDepositions()
myrec <- Filter(function(x) as.character(x$id) == deposit_id, deps)
myrec <- if (length(myrec) > 0) myrec[[1]] else NULL
}
if (is.null(myrec)) stop("Could not find record with ID: ", deposit_id)
}## No existing record — creating new
## ✔ Successful record deposition
## [zen4R][INFO] ZenodoManager - Successful record deposition
## ✔ Successful reserved DOI for record 20292763
## [zen4R][INFO] ZenodoManager - Successful reserved DOI for record 20292763
## ✔ Successfully fetched list of affiliations!
## ✔ Successfully fetched list of affiliations!
## ✔ Successfully fetched resourcetype 'dataset'
## ✔ Successful record deposition
## [zen4R][INFO] ZenodoManager - Successful record deposition
## ! Existing DOI (10.5281/zenodo.20292763) for record 20292763. Aborting DOI reservation!
## [zen4R][WARN] ZenodoManager - Existing DOI (10.5281/zenodo.20292763) for record 20292763. Aborting DOI reservation!
## New record created. Deposit ID: 20292763
# ── 5. Determine record status ────────────────────────────────────────────────
record_status <- if (!is.null(myrec$state) && length(myrec$state) > 0) myrec$state else
if (!is.null(myrec$status) && length(myrec$status) > 0) myrec$status else
"unsubmitted"
cat("Record status:", record_status, "\n")## Record status: draft
is_draft <- !record_status %in% c("done", "published")
# ── Helper: upload both files ─────────────────────────────────────────────────
upload_files <- function(rec) {
# Delete old RDS if present
existing_files <- tryCatch(zenodo$getFiles(rec$id), error = function(e) list())
if (length(existing_files) > 0) {
old_rds <- Filter(function(f) {
is.list(f) && !is.null(f$filename) &&
grepl("^curatedMetadataset_[0-9]{4}-[0-9]{2}\\.rds$", f$filename)
}, existing_files)
invisible(lapply(old_rds, function(f) {
zenodo$deleteFile(rec$id, f$id)
cat("Deleted old file:", f$filename, "\n")
}))
}
# Upload tabular RDS
zenodo$uploadFile(current_rds, rec)
cat("Uploaded RDS:", basename(current_rds), "\n")
# Upload tar archive
if (file.exists(archive)) {
zenodo$uploadFile(archive, rec)
cat("Uploaded archive:", basename(archive), "\n")
} else {
warning("Tar archive not found, skipping: ", archive)
}
}
# ── 6. Upload and publish ─────────────────────────────────────────────────────
if (is_draft) {
cat("Record is a draft — uploading directly\n")
upload_files(myrec)
myrec$setTitle("A Curated Meta-analytic Neuroimaging Dataset from NeuroVault")
myrec$setDescription("Monthly-updated curated neuroimaging meta-analytic dataset derived from NeuroVault. Contains tabular metadata and brain-masked RESI effect size images.")
myrec$setResourceType("dataset")
myrec$setPublicationDate(format(Sys.Date(), "%Y-%m-%d"))
myrec$setPublisher("Vanderbilt University Medical Center")
myrec <- zenodo$depositRecord(myrec, publish = FALSE)
myrec <- zenodo$publishRecord(myrec$id)
cat("Published! DOI:", myrec$doi, "\n")
saveRDS(as.character(myrec$id), record_id_path)
} else {
cat("Record is published — creating new version\n")
myrec_new <- zenodo$newRecord(myrec$id)
if (is.null(myrec_new)) stop("Failed to create new version")
cat("New version draft ID:", myrec_new$id, "\n")
upload_files(myrec_new)
myrec_new$setTitle("A Curated Meta-analytic Neuroimaging Dataset from NeuroVault")
myrec_new$setPublicationDate(format(Sys.Date(), "%Y-%m-%d"))
myrec_new$setPublisher("Vanderbilt University Medical Center")
myrec_new <- zenodo$depositRecord(myrec_new, publish = FALSE)
myrec_new <- zenodo$publishRecord(myrec_new$id)
cat("New version published! DOI:", myrec_new$doi, "\n")
saveRDS(as.character(myrec_new$id), record_id_path)
}## Record is a draft — uploading directly
## [zen4R][INFO] ZenodoRequest - Fetching https://zenodo.org/api/records/20292763/draft/files
## ✔ Successful fetched file(s) for record '20292763'
## [zen4R][INFO] ZenodoManager - Successful fetched file(s) for record '20292763'
## ! No bucket link for record id = 20292763.
## [zen4R][WARN] ZenodoManager - No bucket link for record id = 20292763.
## [zen4R][INFO] ZenodoManager - Start upload procedure for file '/media/disk2/ree/metadataset/rds/curatedMetadataset_2026-05.rds'
## ✔ Successfully started upload procedure for file '/media/disk2/ree/metadataset/rds/curatedMetadataset_2026-05.rds'
## [zen4R][INFO] ZenodoManager - Successfully started upload procedure for file '/media/disk2/ree/metadataset/rds/curatedMetadataset_2026-05.rds'
## | | | 0% | |= | 1% | |== | 2% | |== | 4% | |=== | 5% | |==== | 6% | |===== | 7% | |====== | 8% | |======= | 9% | |======= | 11% | |======== | 12% | |========= | 13% | |========== | 14% | |=========== | 15% | |=========== | 16% | |============ | 18% | |============= | 19% | |============== | 20% | |=============== | 21% | |================ | 22% | |================ | 23% | |================= | 25% | |================== | 26% | |=================== | 27% | |==================== | 28% | |==================== | 29% | |===================== | 30% | |====================== | 32% | |======================= | 33% | |======================== | 34% | |========================= | 35% | |========================= | 36% | |========================== | 37% | |=========================== | 39% | |============================ | 40% | |============================= | 41% | |============================= | 42% | |============================== | 43% | |=============================== | 44% | |================================ | 46% | |================================= | 47% | |================================== | 48% | |================================== | 49% | |=================================== | 50% | |==================================== | 51% | |===================================== | 53% | |====================================== | 54% | |====================================== | 55% | |======================================= | 56% | |======================================== | 57% | |========================================= | 58% | |========================================== | 60% | |=========================================== | 61% | |=========================================== | 62% | |============================================ | 63% | |============================================= | 64% | |============================================== | 65% | |=============================================== | 67% | |=============================================== | 68% | |================================================ | 69% | |================================================= | 70% | |================================================== | 71% | |=================================================== | 72% | |==================================================== | 74% | |==================================================== | 75% | |===================================================== | 76% | |====================================================== | 77% | |======================================================= | 78% | |======================================================== | 79% | |======================================================== | 81% | |========================================================= | 82% | |========================================================== | 83% | |=========================================================== | 84% | |============================================================ | 85% | |============================================================= | 86% | |============================================================= | 88% | |============================================================== | 89% | |=============================================================== | 90% | |================================================================ | 91% | |================================================================= | 92% | |================================================================= | 94% | |================================================================== | 95% | |=================================================================== | 96% | |==================================================================== | 97% | |===================================================================== | 98% | |======================================================================| 99% | |======================================================================| 100%
## ✖
## ℹ Complete upload procedure for file '/media/disk2/ree/metadataset/rds/curatedMetadataset_2026-05.rds'
## [zen4R][INFO] ZenodoManager - Complete upload procedure for file '/media/disk2/ree/metadataset/rds/curatedMetadataset_2026-05.rds'
## ✔ Successfully completed upload procedure for file '/media/disk2/ree/metadataset/rds/curatedMetadataset_2026-05.rds'
## [zen4R][INFO] ZenodoManager - Successfully completed upload procedure for file '/media/disk2/ree/metadataset/rds/curatedMetadataset_2026-05.rds'
## Uploaded RDS: curatedMetadataset_2026-05.rds
## ! No bucket link for record id = 20292763.
## [zen4R][WARN] ZenodoManager - No bucket link for record id = 20292763.
## [zen4R][INFO] ZenodoManager - Start upload procedure for file '/media/big/neurovault_effectsizes_masked.tar'
## ✔ Successfully started upload procedure for file '/media/big/neurovault_effectsizes_masked.tar'
## [zen4R][INFO] ZenodoManager - Successfully started upload procedure for file '/media/big/neurovault_effectsizes_masked.tar'
## | | | 0% | |======================================================================| 100%
## ✖
## ℹ Complete upload procedure for file '/media/big/neurovault_effectsizes_masked.tar'
## [zen4R][INFO] ZenodoManager - Complete upload procedure for file '/media/big/neurovault_effectsizes_masked.tar'
## ✔ Successfully completed upload procedure for file '/media/big/neurovault_effectsizes_masked.tar'
## [zen4R][INFO] ZenodoManager - Successfully completed upload procedure for file '/media/big/neurovault_effectsizes_masked.tar'
## Uploaded archive: neurovault_effectsizes_masked.tar
## ✔ Successfully fetched resourcetype 'dataset'
## ✔ Successful record deposition
## [zen4R][INFO] ZenodoManager - Successful record deposition
## ! Existing DOI (10.5281/zenodo.20292763) for record 20292763. Aborting DOI reservation!
## [zen4R][WARN] ZenodoManager - Existing DOI (10.5281/zenodo.20292763) for record 20292763. Aborting DOI reservation!
## ✔ Successful published record '20292763'
## [zen4R][INFO] ZenodoManager - Successful published record '20292763'
## Published! DOI:
The following variables are available at the collection level in the NeuroVault metadata:
id, url, download_url, owner, contributors, owner_name, number_of_images, name, publication_status, DOI, preprint_DOI, authors, paper_url, journal_name, description, full_dataset_url, private, add_date, modify_date, doi_add_date, type_of_design, number_of_imaging_runs, number_of_experimental_units, length_of_runs, length_of_blocks, length_of_trials, optimization, optimization_method, subject_age_mean, subject_age_min, subject_age_max, handedness, proportion_male_subjects, inclusion_exclusion_criteria, number_of_rejected_subjects, group_comparison, group_description, scanner_make, scanner_model, field_strength, pulse_sequence, parallel_imaging, field_of_view, matrix_size, slice_thickness, skip_distance, acquisition_orientation, order_of_acquisition, repetition_time, echo_time, flip_angle, software_package, software_version, order_of_preprocessing_operations, quality_control, used_b0_unwarping, b0_unwarping_software, used_slice_timing_correction, slice_timing_correction_software, used_motion_correction, motion_correction_software, motion_correction_reference, motion_correction_metric, motion_correction_interpolation, used_motion_susceptibiity_correction, used_intersubject_registration, intersubject_registration_software, intersubject_transformation_type, nonlinear_transform_type, transform_similarity_metric, interpolation_method, object_image_type, functional_coregistered_to_structural, functional_coregistration_method, coordinate_space, target_template_image, target_resolution, used_smoothing, smoothing_type, smoothing_fwhm, resampled_voxel_size, intrasubject_model_type, intrasubject_estimation_type, intrasubject_modeling_software, hemodynamic_response_function, used_temporal_derivatives, used_dispersion_derivatives, used_motion_regressors, used_reaction_time_regressor, used_orthogonalization, orthogonalization_description, used_high_pass_filter, high_pass_filter_method, autocorrelation_model, group_model_type, group_estimation_type, group_modeling_software, group_inference_type, group_model_multilevel, group_repeated_measures, group_repeated_measures_method, nutbrain_hunger_state, nutbrain_food_viewing_conditions, nutbrain_food_choice_type, nutbrain_taste_conditions, nutbrain_odor_conditions, communities
The following variables are available at the image level after preprocessing:
add_date, collection, collection_id, description, file, file_size, id, image_type, is_valid, modify_date, subject_species, target_template_image, url, name, brain_coverage, map_type, not_mni, perc_bad_voxels, data_origin, perc_voxels_outside, modality, is_thresholded, cognitive_paradigm_cogatlas, cognitive_paradigm_cogatlas_id, analysis_level, number_of_subjects, thumbnail, surface_left_file, surface_right_file, binarized_group_level, binarized_map_type, unthresholded, mni, nonzero_sample, localfile, download, voxdim, imgdim, nvox, orient, range, bounding_box, localfile_base, duplicated, dim_mm, log10_ratio_xy, log10_ratio_xz, log10_ratio_yz, ratio_x, ratio_y, ratio_z, is_proportional, range_low, range_high, range_diff, range_exclusion, localfileadj, resliced, nvox_new, voxdim_new, imgdim_new, orient_new, range_new, nvox_outside, nvox_gm, nvox_wm, q90_outside, gm_perc, wm_perc, outside_perc, gm_exclude, wm_exclude, nvox_outside_exclude, effect_size_file, brain_file, es_min, es_1qt, es_median, es_3qt, es_max, es_mean, grey_mean, grey_min, grey_max, white_mean, white_min, white_max, es_mean_winsorized, voxel_scaled, se_model, res_std, outlier
The curated NeuroVault dataset produced by this pipeline is publicly available on Zenodo and updated on the first of each month. Each monthly release contains:
Tabular metadata
(curatedMetadataset_YYYY-MM.rds) — an R data frame with one
row per retained image, containing all image-level metadata, coverage
statistics, and effect size summaries computed during
preprocessing.
Registered effect size images — brain-registered
NIfTI images converted to the robust effect size index (RESI), stored as
*_rigid_cost_step1_effect_size_masked_mni.nii.gz.
The dataset can be downloaded directly from Zenodo at: DOI pending publication
Images in this pipeline pass through four processing stages, each appending a suffix to the original filename:
| Stage | Column in dataset | File suffix | Description |
|---|---|---|---|
|
localfile | .nii.gz | Original image downloaded from NeuroVault |
|
localfileadj | _eps.nii.gz | NaN replaced with 0; near-zero voxels (|v| |
|
resliced | _rigid_cost_step1.nii.gz | Registered to MNI 2mm template using FSL FLIRT (DOF=6, mutual information) |
|
effect_size_file | _rigid_cost_step1_effect_size.nii.gz | T/Z statistics converted to RESI effect sizes |
|
brain_file | _rigid_cost_step1_effect_size_masked_mni.nii.gz | Effect size image masked to MNI brain template |
All files are stored under
/media/big/neurovault/{collection_id}/ on the processing
server, mirroring the NeuroVault collection structure. The tar archive
uploaded to Zenodo (neurovault_subset.tar) contains the
registered images (resliced) for all retained
collections.
This document is re-rendered automatically on the first of each month. To reproduce the full pipeline locally:
rds_dir and
outdir to your local pathsflirt) and Convert3D (c3d) are
installed and on PATH{rds_dir}/curatedMetadataset_YYYY-MM.rds and uploaded to
Zenodo| Section | Step | Output RDS / file | Approx. runtime |
|---|---|---|---|
| 1 | Metadata download | neurovaultMetadata_YYYY-MM.rds | ~30 min |
| 2 | Image download | localfile (NIfTI on disk) | ~2 hours |
| 3 | Image-derived exclusions | imageMetadataWithDimensions_YYYY-MM.rds | ~1 hour |
| 4 | Image value processing | localfileadj (*_eps.nii.gz) | ~30 min |
| 5 | Registration | imageMetadataRegistered_YYYY-MM.rds | ~4 hours |
| 6 | Effect size estimation | imageMetadataEffectSizes_YYYY-MM.rds | ~6 hours |
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.5 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/Chicago
## tzcode source: system (glibc)
##
## attached base packages:
## [1] grid parallel stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] zen4R_0.10.4 callr_3.7.6 gridExtra_2.3 sandwich_3.1-1
## [5] lmtest_0.9-40 zoo_1.8-12 papayaWidget_0.7.1 kableExtra_1.3.4
## [9] knitr_1.48 plotly_4.10.4 nlme_3.1-166 reactable_0.4.4
## [13] stringr_1.6.0 RESI_1.2.4 pbapply_1.7-2 rlang_1.1.7
## [17] scales_1.3.0 cowplot_1.1.3 gridGraphics_0.5-1 neurobase_1.32.4
## [21] oro.nifti_0.11.4 forcats_1.0.0 tidyr_1.3.1 ComplexUpset_1.3.3
## [25] jsonlite_2.0.0 httr_1.4.8 GGally_2.2.1 ggplot2_3.5.1
## [29] table1_1.4.3 dplyr_1.1.4 abind_1.4-8 pbj_0.1.6
## [33] RNifti_1.7.0
##
## loaded via a namespace (and not attached):
## [1] bitops_1.0-8 polynom_1.4-1 magrittr_2.0.4
## [4] multcomp_1.4-26 matrixStats_1.2.0 compiler_4.4.1
## [7] png_0.1-8 systemfonts_1.1.0 vctrs_0.7.1
## [10] rvest_1.0.5 pkgconfig_2.0.3 fastmap_1.2.0
## [13] labeling_0.4.3 utf8_1.2.6 rmarkdown_2.28
## [16] ps_1.8.1 purrr_1.0.2 xfun_0.47
## [19] cachem_1.1.0 PDQutils_0.1.6 clubSandwich_0.5.11
## [22] highr_0.11 R6_2.6.1 bslib_0.8.0
## [25] stringi_1.8.7 RColorBrewer_1.1-3 car_3.1-3
## [28] boot_1.3-31 jquerylib_0.1.4 estimability_1.5.1
## [31] Rcpp_1.0.13-1 assertthat_0.2.1 base64enc_0.1-3
## [34] R.utils_2.12.3 Matrix_1.6-3 splines_4.4.1
## [37] tidyselect_1.2.1 rstudioapi_0.16.0 yaml_2.3.10
## [40] codetools_0.2-20 curl_7.0.0 processx_3.8.4
## [43] lattice_0.22-6 tibble_3.3.1 plyr_1.8.9
## [46] withr_3.0.2 coda_0.19-4.1 evaluate_0.24.0
## [49] orthopolynom_1.0-6.1 moments_0.14.1 survival_3.7-0
## [52] ggstats_0.8.0 xml2_1.5.2 pillar_1.11.1
## [55] carData_3.0-5 generics_0.1.3 munsell_0.5.1
## [58] aod_1.3.3 xtable_1.8-4 glue_1.8.0
## [61] emmeans_1.10.4 lazyeval_0.2.2 tools_4.4.1
## [64] data.table_1.16.0 webshot_0.5.5 XML_3.99-0.20
## [67] mvtnorm_1.3-1 crosstalk_1.2.1 colorspace_2.1-1
## [70] patchwork_1.3.0 pTFCE_0.2.2.1 mmand_1.6.3
## [73] Formula_1.2-5 cli_3.6.5 viridisLite_0.4.2
## [76] keyring_1.4.1 svglite_2.1.1 gtable_0.3.6
## [79] reactR_0.6.1 R.methodsS3_1.8.2 sass_0.4.9
## [82] digest_0.6.37 TH.data_1.1-2 htmlwidgets_1.6.4
## [85] farver_2.1.2 htmltools_0.5.8.1 R.oo_1.26.0
## [88] lifecycle_1.0.5 mime_0.13 MASS_7.3-61