Scraping OpenNeuro

23rd Jun 2025
13 min read

Disclaimer: I'm not a neuroscientist, none of this is medical advice. MRI section can be skipped.

Introduction

Many studies applying AI to neuroimaging data, like MindEye, train models on fMRI data. It is often used to measure BOLD - blood oxygen level-dependent - signals. The idea is that when a brain region is active, it consumes more oxygen, and the blood flow to that region increases. We can use this to get a low-temporal-high-spatial-resolution large-scale (whole-brain) view of brain activity.

fMRI can be used to find the brain's functional connectivity - how likely two regions are to be active at the same time - based on a sample of a patient's brain activity. While this gives us some information about the brain's structure, the connectivity we get depends on what the patient is doing at the time of the scan. If we wanted a scan that shows the brain's structure, we would use a different modality.

Structural MRI is used to get a high-resolution image of the brain's structure. We can get an idea of the brain's shape and size as well as an image of the brain's white and gray matter. This information is useful for understanding the anatomy of a person's brain as opposed to its activity. One possible application would be to train a contrastive neural network to find a latent space that characterizes the brain's structure in a way that helps discriminate between different people most effectively. But it's also possible to use a dataset of such images to train any classifier that takes brain structure as input.

OpenNeuro is a repository of images from neuroimaging studies. It contains a lot of structural MRI images. This post is about how I downloaded all structural MRI images from it in Fall 2024 and how I processed them for machine learning.

Some information about MRI

MRI scans are taken in a static magnetic field. The strength of the field is measured in Tesla. The higher the field strength, the better the image quality. It varies from 0.2T to 7T, and the most common field strengths are 1.5T and 3T. Obviously, the image can have varying resolutions. The most common are 1mm isotropic voxels.

The way MRI works is pretty complicated and is frequently misunderstood. I don't fully understand it myself. To my best knowledge, the process is:

The strong magnetic field makes the spins of H atoms in the body more likely to be aligned with the field. This is not for the reason magnets align with the Earth's magnetic field. It's just that spins parallel to the external magnetic field have a slightly lower energy state, and
At this point, atoms will be precessing at their Larmor frequency determined by the type of atom (fixed) and strength of the magnetic field at a point in space. We can apply a radiofrequency pulse at this frequency for a specific amount of time to rotate the spins by 90 degrees.
Now, the net magnetization is rotating in the plane orthogonal to the external magnetic field (the transverse plane). Over time, the net magnetic field gets weaker, as the spins will rotate with slightly different frequencies. The time it takes for the net magnetization to disperse is called the T2 time.
We wait for the spins to realign with the magnetic field. The time this takes is called the T1 time. T1 is greater than T2. The spins will produce a signal that we can measure during realignment that mirrors the frequency and phase of the rotation. This rotation will create waves we can measure. We measure the waves a time TE (echo time) after the pulse.

This doesn't tell us how to capture the signal and process it. We need to do two things to get there: select a 2D slice of 3D space and give slightly different phases and/or frequencies to precessing spins at different points in space. We select the slice by applying a gradient magnetic field in the direction of the slice. We apply this gradient during the pulse. Since we tuned the pulse to a constant external magnetic field, any slice that doesn't match it will rotate by an angle other than 90 degrees and not produce a precessing net magnetization.

We can give slightly different frequencies to precessing spins at different points in space along an axis (that is perpendicular to the slice axis) by applying a gradient in that direction. This is the frequency encoding axis. We may now perform a Fourier transform to get the magnitude of the signal at each frequency. We can repeat the pulse after a time called TR (repetition time). We can get a "line" of magnitudes for each angle at which we can apply the frequency encoding and then combine them with a Radon transform to get a 2D image. Job done?

Yes, but this method will produce bad images. We have two lossy steps in the algorithm: converting each direction's intensity and then applying the Radon transform.

There is a way of sidestepping this issue. It requires changing the phase of rotation along the axis perpendicular to the frequency encoding. We may do this by applying a gradient along that axis and then turning it off before the frequency encoding. Since we know the initial angle of the net magnetization and how much the gradient changed phase and frequency, we have a known gradient of phases and frequencies. We can repeat this scan, not with different frequency encoding angles but with different magnitudes of the phase encoding. We will get a series of signals. We may take the Fourier transform of each to get a series of frequencies and associated phases. The magnitude should be equal, but the phase will be different. The phases will vary linearly, so if we take a Fourier transform over normalized complex outputs of the first Fourier transform along an axis, we will see the distribution of source magnitudes. We can repeat this process with different slices to get the original image.

I highly recommend The Basics of MRI for a more detailed and accurate explanation with visualizations. All of this is based on my reading of the book and may be a horrible misinterpretation that doesn't reflect reality.

What is important is that we can take an MRI image in many ways. Just this brief exposition went over some of the areas where we have choice: how to slice, encode, decode the image and postprocess it. For structural brain MRI, the common type of processing is T1 weighting. T1 tries to emphasize the impact of T1 realignment by using short TE and TR. This is as opposed to T2 weighting, which emphasizes impact of T2 net magnetization loss by waiting for a long time before measuring the signal and repeating the puls. This selects for tissues with a high T2 time, such as cerebrospinal fluid (CSF). We will primarily look for T1-weighted images because they represent grey and white matter better.

Getting the data

OpenNeuro has a GraphQL API. There are many queries we can run that have to do with things like datasets, users, files, comments, etc. We are interested in first retrieving a list of datasets and then downloading files from them.

I did the initial API download in Python using gql[all] and aiohttp:


# List datasets
from tqdm.auto import tqdm
from math import ceil
import json
from gql import gql, Client
from gql.transport.aiohttp import AIOHTTPTransport
from gql.transport.exceptions import TransportQueryError
import asyncio


transport = AIOHTTPTransport(url="https://openneuro.org/crn/graphql")
client = Client(transport=transport, fetch_schema_from_transport=True,
                execute_timeout=60)

timeout_wait = 5
datasets_path = "openneuro-fmri-datasets.json"
try:
    datasets = json.load(open(datasets_path))
except FileNotFoundError:

    query_list = gql("""query($cursor: String) {
    datasets(modality: "MRI", after: $cursor, first: 1) {
        edges {
            node {
                name
                metadata {
                    dxStatus
                    affirmedConsent
                    affirmedDefaced
                }
                id
                public
                stars {
                userId
                }
                followers {
                userId
                }
            }
        }
        pageInfo {
        endCursor
        hasNextPage
        count
        }
    }
    }""")
    cursor = None
    datasets = []
    page = 0
    for _ in (bar := tqdm(iter(int, 1))):
        for _ in range(3):
            try:
                result = (await client.execute_async(query_list,
                                                    variable_values=dict(cursor=cursor)
                                                    ))["datasets"]
                break
            except (TransportQueryError, asyncio.TimeoutError):
                print("Timeout, retrying with cursor", cursor, "in", timeout_wait, "seconds")
                await asyncio.sleep(timeout_wait)
                print("Trying...")
        else:
            raise Exception("Failed to fetch datasets")
        datasets.extend([e["node"] for e in result["edges"]])
        if not result["pageInfo"]["hasNextPage"]:
            break
        cursor = result["pageInfo"]["endCursor"]
        if page == 0:
            bar.reset(ceil(result["pageInfo"]["count"] / len(datasets)))
        page += 1
        bar.set_postfix(datasets=len(datasets))
    json.dump(datasets, open(datasets_path, "w"))

After this, I downloaded information about each dataset and its files:


# List files
from gql.transport.exceptions import TransportServerError
import os


dataset_file_list_path = "openneuro_mri_dataset_file_list.json"
if os.path.exists(dataset_file_list_path):
    dataset_file_list = json.load(open(dataset_file_list_path))
else:
    dataset_file_list = {}

transport = AIOHTTPTransport(url="https://openneuro.org/crn/graphql")
client = Client(transport=transport, fetch_schema_from_transport=True,
                execute_timeout=10)

query_dataset = gql("""query($id: ID!) {
    dataset(id: $id) {
        latestSnapshot {
            files {
                urls
                size
                key
                id
                filename
                directory
                annexed
            }
            downloadFiles {
                urls
                size
                key
                id
                filename
                directory
                annexed
            }
            readme
            hexsha
            id
            issues {
                severity
                reason
                key
                helpUrl
                additionalFileCount
                code
            }
            tag
        }
    }
}""")

for dataset in tqdm(datasets):
    if "node" in dataset:
        dataset = dataset["node"]
    did = dataset["id"]
    if did in dataset_file_list:
        continue
    for _ in range(1):
        try:
            result = (await client.execute_async(query_dataset,
                                                variable_values=dict(id=did)
                                                ))["dataset"]
            dataset_file_list[did] = result
            break
        except (TransportQueryError, asyncio.TimeoutError, TimeoutError, TransportServerError):
            pass
    else:
        print("Skipped dataset", did)
        dataset_file_list[did] = None
    json.dump(dataset_file_list, open(dataset_file_list_path, "w"))

Most datasets have a participants.tsv file that contains information about the participants. An example can be found here. These files contain participant IDs and demographic information. We can download and parse these on the fly:


def iterate_tsvs(dataset_file_list, tsv_dir = "tsvs"):
    os.makedirs(tsv_dir, exist_ok=True)
    for dataset_id, dataset in tqdm(dataset_file_list.items()):
        if dataset is None:
            continue
        # tag = dataset["latestSnapshot"]["tag"]
        file_list = dataset["latestSnapshot"]["files"]
        tsvs = {}
        for f in file_list:
            filename = f["filename"]
            if filename not in ("participants.tsv", "scans.json"):
                continue
            save_path = os.path.join(tsv_dir, dataset_id, filename)
            save_dir = os.path.dirname(save_path)
            os.makedirs(save_dir, exist_ok=True)
            if os.path.exists(save_path):
                with open(save_path) as t:
                    tsvs[filename] = t.read()
                continue
            url = f["urls"][0]
            tsv_contents = requests.get(url).text
            with open(save_path, "w") as t:
                t.write(tsv_contents)
            tsvs[filename] = tsv_contents
        if not tsvs:
            continue
        try:
            tsvs["participants.tsv"] = pd.read_csv(io.StringIO(tsvs["participants.tsv"]), sep="\t")
            if "scans.json" in tsvs:
                tsvs["scans.json"] = pd.read_json(io.StringIO(tsvs["scans.json"]))
        except (pd.errors.ParserError, ValueError) as e:
            import traceback
            traceback.print_exc()
            continue
        yield dataset_id, tsvs

The T1 scans must all be in a predictable place: {SUBJECT_ID}/anat/{SUBJECT_ID}_T1w.nii.gz. But here we hit a problem. We could download the .nii.gz files corresponding to the T1 scans, but the combined size of all the files would be too large to store. What we can do instead is save the files in a temporary directory, then process, compress and store them.

Data processing

I looked at some papers that use anatomical MRI for ML tasks. Many of them followed a pipeline similar to CAT 12, using CAT or ANTs. The processing steps vary, but I could see some commonalities:

Resampling the image to a common spatial resolution, usually of 1-3mm isotropic voxels. N4 bias field correction. MRI images usually have a low-frequency bias field that we need to remove. Denoising.
Tranforming the image to a common space. By default, different regions in an MRI image can be in different places. We need to register the image to a common space like the MNI space. This is done with an affine or elastic transformation. We need to be careful to keep track of the transformation, specifically its Jacobian determinant, so we can find how much volume each area corresponded to in the original image.
Finding a segmentation mask for the brain. Another name for this is brain extraction.¹ This is done with classic machine learning methods like Atropos, or, more recently, with 3D U-Nets.
Segmenting the image into different tissue types. There are many possible ways of segmenting an image, but most commonly we segment into white matter, gray matter and cerebrospinal fluid.
Weighting the image to find the portion of the total intracranial volume it occupies.

I had trouble applying this pipeline initially. I am not a neuroscientists, so I don't know what a good processed T1 image looks like, but I could tell when a step was failing. Beyond that, most of the software used in these papers is no longer maintained or available. ANTs is the only one that is still actively developed. It also has a handy Python API. Some of the steps, such as bias field correction, did work well. However, the segmentation tool always crashed with, fittingly, a segmentation fault.

I ended up using a combination of ANTs and ANTsPyNet The latter uses neural networks to segment the image into different tissue types, sidestepping the issues with the code. The final pipeline looks like this:


import ants
import numpy as np
import antspynet
import ants
import os

def process(image, verbose=False, md_only=False):
    initial_prep = antspynet.preprocess_brain_image(image,
                                                    brain_extraction_modality="t1",
                                                    do_bias_correction=False,
                                                    do_denoising=False,
                                                    verbose=False)
    mask = initial_prep["brain_mask"]
    extra_md = dict(
        tiv=(mask.flatten() > 1e-2).sum() * np.prod(mask.spacing)
    )
    if md_only:
        return None, extra_md
    after_nf4 = ants.n4_bias_field_correction(initial_prep["preprocessed_image"], verbose=verbose, mask=mask)
    smart_preprocessed = antspynet.preprocess_brain_image(after_nf4,
                                    brain_extraction_modality="t1",
                                    # template_transform_type="Affine",
                                    template_transform_type="antsRegistrationSyNQuickRepro[a]",
                                    # https://github.com/ANTsX/ANTsPyNet/blob/57fd2185b9fd9bb680a40190a5dff7fe593510be/antspynet/utilities/get_antsxnet_data.py
                                    # https://github.com/ANTsX/ANTsPyNet/blob/57fd2185b9fd9bb680a40190a5dff7fe593510be/antspynet/utilities/deep_atropos.py#L82C27-L82C40
                                    template="croppedMni152",
                                    do_bias_correction=False,
                                    do_denoising=False,
                                    verbose=verbose)
    brain = smart_preprocessed["preprocessed_image"] * smart_preprocessed["brain_mask"]
    return brain, extra_md

def downsample(brain, factor=3):
    return ants.resample_image(brain, (factor, factor, factor))

def segment(brain, classic=False):
    if classic:
        mask = ants.get_mask(brain)
        # this is the code that crashes
        img_seg = ants.atropos(a=brain, m='[0.2,1x1]', c='[2,0]', 
                       i='kmeans[3]', x=mask, verbose=1)
        gm, wm, csf = img_seg['probabilityimages']
        return dict(
            dgm=gm * 0,
            gm=gm,
            wm=wm,
            csf=csf
        )
    else:
        segmentations = antspynet.deep_atropos(brain, do_preprocessing=False)
        masks = segmentations["probability_images"]
        csf, gm, wm, dgm = masks[1:5]
        return dict(csf=csf, gm=gm, wm=wm, dgm=dgm)

Some rationale for the pipeline:

A pre-segmentation step where we locate the brain in the original image. We use these outputs to find the total intracranial volume.
Bias field correction on just the brain. I'm not sure why I thought this was a good idea, it should be better to see the entire image for this step.
Pass off the preprocessing to ANTsPyNet. I wasn't sure which settings to use, so I just used the defaults for T1 image segmentation.
Segmentation. We skip the preprocessing step and just use the output of the network.

With this pipeline, I processed all of the images from OpenNeuro and saved them to a private Huggingface dataset.

Taking MRI scans of your own brain

This is a bit tangential, but in this section I will describe how you can get (structural) MRI scans of your own brain. Most people (in the US) can do this for free.

Step 0: Make sure you don't have any metal in your body. Well, nobody will let you near an MRI machine if you do.

Step 0.5: Live near a research university. Go on campus and look for study posters.*

* You can also just go on the university's website and see if they're looking for study participants.

Step 1: Find a study that takes MRI scans and volunteer. This is the most important step: many universities have labs run psychology or neuroscience studies that take MRI scans of participants.

You don't need to be a student most of the time. Some factors that might make you a better participant, from what I've seen looking for studies:

English proficiency.
Having good vision when corrected in an MRI-compatible way.
Ability to fit random tasks into your schedule.
Being above 18 and below a varying age cutoff.
Not having a mental illness or neurological disorder.
Having a mental illness or neurological disorder, if the study is specifically looking for people with a certain condition.
Fitting any number of other criteria.

This is a good point to mention that you can participate in psychology studies without a reason like getting an MRI scan. You can just volunteer and get paid at a good rate. MRI studies usually pay more.

Step 2: Meet and sign consent forms. You may meet the principal investigator and have your tasks and compensation explained to you. If you're in an fMRI study with a task, you may get a preview of the task on a computer now.

If you want to get your anatomical scans, this is a good time to ask. The study I participated in actually advertised offering anatomical scans to participants, so I just asked how that worked.

Step 3: Come for the MRI scan. I had two sessions. I was free to schedule both, but the time windows for these are strict because the MRI machine is in use for other patients. This is probably the case for medical MRI scans, but I never had one so I can only share experience from participating in one study.

I prepared, drank water, changed into a gown, and went to the MRI room. I lied down on the table, was strapped down, and then entered the machine. I was wearing ear plugs to block out the noise, but I could still hear the assistant talk. I had a microphone to speak to them and a controller in my hand. The controller was a bit uncomfortable, it was difficult to press all 4 buttons with one finger without readjusting.

The machine was very loud when it was starting up, but the task mostly distracted me from the noise. I think the fMRI sections had a different sound from the initial anatomical scan, but I don't remember well. I do remember needing to be still for the anatomical scan.

Step 4: Get your results. I got a USB stick with JPEG images of one T1 scan. These were pretty low quality, and I couldn't process them using the pipeline I described above. I asked for a DICOM scan by email and received it. It already contained information about voxel spacing, so I didn't need to ask and processed it simply:


import pydicom as dicom
from pprint import pprint
from glob import glob
dicom_files = glob("data_other/T1 DICOM/*.dcm")
dicom_files.sort()
slices = []
for f in dicom_files:
    ds = dicom.dcmread(f)
    slices.append(ds.pixel_array)

import numpy as np
import ants
image = ants.from_numpy(np.asarray(slices).transpose(0, 2, 1)[:, :, ::-1])
image.set_spacing((1, 1, 1))
ants.plot(image)
ants.image_write(image, "data_other/my_dicom.nii.gz")

This scan was compatible with the pipeline I described above, so I processed it the same way.

If you can come up with a joke about this, please let me know so I can include it here.