
Data Ingestion

Before any signal processing happens, the pipeline must answer three questions: What files exist for this session? What format are they in? And do the channel labels match what the rest of the pipeline expects? Data ingestion handles all three—finding the recordings, loading them into MNE-Python’s internal representation, and standardizing the channel names so that downstream stages can assume a consistent montage.

The pipeline supports several EEG file formats, each serving a different role in the clinical workflow:

EEGLAB .set files are the standard format for pre-processed EEG data in the EEGLAB ecosystem. These are MATLAB-compatible files that can contain raw or pre-processed continuous data, epoched data, ICA decomposition results, DIPFIT dipole locations, and ICLabel component classifications. The pipeline reads .set files via mne.io.read_raw_eeglab() and can extract embedded ICA and ICLabel data for component profiling. This format is the primary bridge between existing EEGLAB workflows and the Coherence Workstation—if a clinician has already processed their data in EEGLAB, the pipeline picks up where EEGLAB left off.

EDF (European Data Format) is the international standard for clinical EEG recording, used by most hospital systems and many portable amplifiers. The pipeline reads EDF files via mne.io.read_raw_edf(). EDF support ensures broad compatibility across amplifier vendors and clinical recording systems.

XDF (Extensible Data Format) is the standard format for Lab Streaming Layer (LSL) recordings, commonly used for ERP paradigms where multiple data streams (EEG, event markers, physiological signals) need to be synchronized with sub-millisecond precision. The pipeline reads XDF files for event-related processing, extracting both the continuous EEG and the event marker stream.

The pipeline also supports additional vendor-specific formats for direct amplifier import. Regardless of input format, every recording passes through the same ingestion, standardization, and preprocessing chain.
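The dispatch from file extension to reader can be sketched as follows. `load_raw` is a hypothetical helper name, not the pipeline's actual API, and the reader arguments beyond the filename are illustrative:

```python
from pathlib import Path


def load_raw(path):
    """Dispatch to the matching MNE-Python reader based on file extension.

    Hypothetical helper: the real pipeline may pass additional reader
    options (preload behavior, annotation handling, etc.).
    """
    suffix = Path(path).suffix.lower()
    if suffix == ".set":
        import mne  # imported lazily so the dispatch logic stands alone
        return mne.io.read_raw_eeglab(path, preload=True)
    if suffix == ".edf":
        import mne
        return mne.io.read_raw_edf(path, preload=True)
    if suffix == ".xdf":
        # XDF is not read by MNE directly; the ERP ingest path extracts the
        # EEG and marker streams (e.g. via pyxdf) before building a Raw object.
        raise NotImplementedError("XDF is handled by the ERP ingest path")
    raise ValueError(f"Unsupported EEG format: {suffix!r}")
```

Vendor-specific formats would add further branches; the point is that every branch returns the same MNE `Raw` representation, so everything after ingestion is format-agnostic.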

The pipeline uses pattern-based file discovery to locate session recordings. The default patterns are defined in configs/default.yaml:

```yaml
files:
  resting_eo: "{SUBJ}_EO1.set"
  resting_ec: "{SUBJ}_EC1.set"
  erp_go: "{SUBJ}_ERP_GO.set"
  erp_nogo: "{SUBJ}_ERP_NOGO.set"
  erp_continuous: "{SUBJ}_ERP_CONTINUOUS.set"
```

The {SUBJ} placeholder is replaced with the subject_id from the configuration. The discovery process scans the input directory for files matching these patterns and catalogs which recording types are present. Missing files don’t produce errors—the pipeline simply skips stages that require them. A session with only resting-state recordings will produce spectral, connectivity, and microstate analyses but no ERP or ERSP output.

This pattern-based approach means the pipeline doesn’t require a rigid directory structure. As long as the files follow the naming convention, they can coexist with other files in the same directory.
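The discovery step amounts to substituting the subject ID into each pattern and checking which files exist. A minimal sketch (`discover_session` is a hypothetical name; the patterns mirror `configs/default.yaml`):

```python
from pathlib import Path

# Default patterns, mirroring configs/default.yaml
FILE_PATTERNS = {
    "resting_eo": "{SUBJ}_EO1.set",
    "resting_ec": "{SUBJ}_EC1.set",
    "erp_go": "{SUBJ}_ERP_GO.set",
    "erp_nogo": "{SUBJ}_ERP_NOGO.set",
    "erp_continuous": "{SUBJ}_ERP_CONTINUOUS.set",
}


def discover_session(input_dir, subject_id, patterns=FILE_PATTERNS):
    """Return {recording_type: path} for the files that are present.

    Missing recordings are simply absent from the result, so later
    stages can skip any analysis whose inputs were not found.
    """
    found = {}
    for rec_type, pattern in patterns.items():
        candidate = Path(input_dir) / pattern.format(SUBJ=subject_id)
        if candidate.exists():
            found[rec_type] = candidate
    return found
```

A session directory containing only `S001_EO1.set` and `S001_EC1.set` would yield the two resting-state entries and nothing else, which is exactly the signal the pipeline uses to skip the ERP stages.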

EEG channel naming should be simple—the international 10-20 system defines standard names for 19 electrode positions. In practice, it’s a mess. Different amplifier vendors, different EEGLAB versions, and different clinical conventions produce recordings with inconsistent channel labels for the same physical electrode locations.

The most common source of confusion is the temporal electrodes. The original 10-20 system uses T3, T4, T5, T6 for the temporal sites. The newer 10-10 extension renamed these to T7, T8, P7, P8. Both naming conventions are in active clinical use, and a recording might use either—or a mixture. The pipeline handles this by maintaining an explicit mapping table and normalizing all channel names to the 10-20 convention on load:

| 10-10 Name | 10-20 Name | Location |
| --- | --- | --- |
| T7 | T3 | Left mid-temporal |
| T8 | T4 | Right mid-temporal |
| P7 | T5 | Left posterior temporal |
| P8 | T6 | Right posterior temporal |

This normalization runs automatically during ingestion. The pipeline logs any channel renaming that occurs so you can verify that the mapping was correct for your recording.
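The mapping itself is small enough to show in full. A sketch of the normalization step (`normalize_channel_names` is a hypothetical helper; with MNE-Python the resulting rename dict would be applied via `raw.rename_channels`):

```python
# 10-10 -> 10-20 temporal aliases; all other labels pass through unchanged
TEMPORAL_ALIASES = {"T7": "T3", "T8": "T4", "P7": "T5", "P8": "T6"}


def normalize_channel_names(names):
    """Normalize 10-10 temporal labels to their 10-20 equivalents.

    Returns (normalized_names, renames_applied) so the renames can be
    logged for verification, as described above.
    """
    renames = {n: TEMPORAL_ALIASES[n] for n in names if n in TEMPORAL_ALIASES}
    normalized = [renames.get(n, n) for n in names]
    return normalized, renames
```

Because mixed conventions occur in practice, the mapping is applied per-channel rather than per-file: a recording labeled `T7, T8, T5, T6` still comes out as a consistent `T3, T4, T5, T6`.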

The pipeline’s default montage is the standard 19-channel 10-20 system:

Fp1, Fp2 (frontopolar), F3, F4, F7, F8, Fz (frontal), C3, C4, Cz (central), T3, T4, T5, T6 (temporal), P3, P4, Pz (parietal), O1, O2 (occipital).

These 19 positions provide coverage of the major cortical regions with well-characterized electrode locations. The pipeline assigns standard 10-20 coordinates from mne.channels.make_standard_montage('standard_1020'), which provides the 3D positions needed for topographic interpolation, source localization, and connectivity hub assignment.

For systems with more than 19 channels—including extended 10-10 configurations—the pipeline supports a 37-channel montage that adds: AF3, AF4, FC3, FC4, FC1, FC2, CP3, CP4, CP1, CP2, PO3, PO4, FT7, FT8, TP7, TP8, P1, P2.

The pipeline never assumes a fixed channel count. All downstream processing derives the channel list from the loaded data, so a 19-channel recording and a 37-channel recording follow the same code paths with different input dimensions.

We believe higher channel density is the future of clinical QEEG. The standard 19-channel 10-20 montage was designed in the 1950s for visual EEG reading—it provides adequate coverage for spectral analysis but undersamples the scalp for connectivity and source localization. Thirty-seven channels nearly double the spatial resolution, which matters most for connectivity analysis (where finer electrode spacing reveals coupling patterns that 19 channels cannot resolve) and source localization (where more sensors improve the accuracy of the inverse solution). The 37-channel extended montage is our recommended configuration for new clinical setups.

Not every recording arrives with a clean 19-channel montage. Channels may be missing (a bad electrode during recording), extra (auxiliary channels like EMG, EOG, or ECG), or mislabeled (vendor-specific naming conventions).

The pipeline handles these cases as follows. Extra non-EEG channels (EMG, EOG, ECG, STIM, auxiliary) are identified by name or type and excluded from EEG processing—they may be used for artifact detection or HRV analysis, but they don’t enter the spectral or connectivity pipeline. Missing channels are detected during bad channel detection and can be interpolated from neighboring electrodes if the montage has enough surviving channels. Unrecognized channel names that don’t match any known 10-20 or 10-10 label are logged as warnings and excluded.

The goal is conservative: the pipeline would rather exclude a channel it doesn’t recognize than include one that might contaminate the analysis. Every inclusion and exclusion decision is logged in the stage output.
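The triage logic can be sketched as a simple classifier over channel labels. The label sets and auxiliary-channel prefixes below are illustrative, not the pipeline's exact lists:

```python
# Illustrative label sets: the 19-channel 10-20 montage, the 10-10
# extension channels (including temporal aliases), and common aux prefixes.
EEG_1020 = {
    "Fp1", "Fp2", "F3", "F4", "F7", "F8", "Fz", "C3", "C4", "Cz",
    "T3", "T4", "T5", "T6", "P3", "P4", "Pz", "O1", "O2",
}
EEG_1010_EXTRA = {
    "AF3", "AF4", "FC1", "FC2", "FC3", "FC4", "CP1", "CP2", "CP3", "CP4",
    "PO3", "PO4", "FT7", "FT8", "TP7", "TP8", "P1", "P2",
    "T7", "T8", "P7", "P8",
}
AUX_PREFIXES = ("EOG", "EMG", "ECG", "PPG", "STIM", "TRIG")


def triage_channel(name):
    """Classify a channel label as 'eeg', 'aux', or 'unknown'.

    'aux' channels are kept for artifact detection or HRV but excluded
    from EEG processing; 'unknown' channels are logged and excluded.
    """
    if name in EEG_1020 or name in EEG_1010_EXTRA:
        return "eeg"
    if name.upper().startswith(AUX_PREFIXES):
        return "aux"
    return "unknown"
```

The conservative default is visible in the fall-through: anything not positively identified is excluded rather than guessed at.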

EEG data arrives in different units depending on the file format and amplifier. The pipeline standardizes everything to microvolts (µV) on load—the conventional unit for scalp EEG. MNE-Python internally represents data in volts (V), so a conversion factor is applied during loading. This is invisible to the clinician but matters for reproducibility: amplitude thresholds in the artifact rejection stage (e.g., 75 µV voltage threshold) assume µV-scaled data.
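The conversion is a single scale factor, but getting it wrong silently breaks every amplitude threshold downstream. A sketch of the idea, using the 75 µV rejection threshold mentioned above (`exceeds_threshold` is a hypothetical helper):

```python
# MNE-Python stores signals internally in volts; QEEG thresholds use µV.
V_TO_UV = 1e6


def exceeds_threshold(sample_volts, threshold_uv=75.0):
    """Check an amplitude (in volts, as stored by MNE) against a
    rejection threshold specified in microvolts."""
    return abs(sample_volts) * V_TO_UV > threshold_uv
```

For bulk access, MNE's `raw.get_data(units="uV")` performs the same conversion directly, which avoids scattering `1e6` factors through the code.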

When a photoplethysmography (PPG) channel is present in the recording, the pipeline extracts it before the EEG channels are isolated for processing. This PPG signal feeds the HRV analysis stage, which computes time-domain metrics (SDNN, RMSSD), frequency-domain metrics (VLF/LF/HF power), and nonlinear measures (DFA, sample entropy). The HRV analysis runs in parallel with the EEG pipeline—it doesn’t depend on EEG preprocessing and doesn’t affect it.
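For orientation, the two time-domain metrics named above have simple textbook definitions over RR intervals. A minimal sketch using the population standard deviation for SDNN (some implementations use the sample standard deviation; the function name is hypothetical):

```python
import math


def sdnn_rmssd(rr_ms):
    """Time-domain HRV from RR intervals in milliseconds.

    SDNN: standard deviation of all RR intervals (population SD here).
    RMSSD: root mean square of successive RR-interval differences.
    """
    n = len(rr_ms)
    mean = sum(rr_ms) / n
    sdnn = math.sqrt(sum((r - mean) ** 2 for r in rr_ms) / n)
    diffs = [b - a for a, b in zip(rr_ms, rr_ms[1:])]
    rmssd = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    return sdnn, rmssd
```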