Session Comparison

When you compare two sessions, you’re asking a specific question: did something measurably change? Not “does this look different on a topomap”—that’s eyeballing, and eyeballing is unreliable when differences are subtle or distributed. What you need is a principled statistical framework that can distinguish genuine change from the noise inherent in any biological recording.

This page explains exactly how the comparison works—what it computes, what it tests, and what the results mean. No black boxes.

The comparison pipeline operates on two recordings from the same subject, taken at different times. You designate one as the baseline (before) and the other as the comparison (after). The pipeline compares them across two domains:

Spectral power. For each frequency band (Delta, Theta, Alpha, Beta, Gamma, and sub-bands if present), the pipeline computes absolute band power at every electrode for both sessions, then tests whether the difference at each channel is statistically significant.

Event-related potentials. If both sessions include ERP data (auditory oddball), the pipeline compares AODEMR stage metrics—peak amplitudes, latencies, and GFP recovery times—between sessions.

Connectivity comparison uses a separate methodology (permutation testing on dwPLI matrices) described on the Connectivity page.


The spectral comparison uses a segment-level parametric test. Here’s what that means in plain language, and why each step matters.

The pipeline divides each recording into short, overlapping windows—by default, 2-second segments with 50% overlap. Each segment gets its own power spectral density estimate via Welch’s method.

Why segments? A single recording yields a single power spectrum—one number per channel per band. You can’t run a statistical test on one number. By breaking the recording into segments, each session produces many spectral estimates, giving the test enough samples to distinguish real change from random fluctuation.

The 50% overlap means adjacent windows share half their data, which increases the number of segments without introducing excessive redundancy. This is standard practice in spectral analysis—it improves statistical power while the overlap’s effect on independence is well-characterized and modest.
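The segmentation step can be sketched in a few lines of NumPy/SciPy. This is an illustrative sketch, not the pipeline's actual API: the function name, the sampling rate, and the choice of `nperseg` equal to the full window are assumptions.

```python
import numpy as np
from scipy.signal import welch

def segment_psds(eeg, fs, seg_seconds=2.0, overlap=0.5):
    """Split a (channels, samples) array into overlapping windows and
    compute one Welch PSD per window.

    Returns (freqs, psds) where psds has shape
    (n_segments, n_channels, n_freqs).
    """
    win = int(seg_seconds * fs)
    step = int(win * (1 - overlap))            # 50% overlap -> step = win // 2
    n_seg = (eeg.shape[1] - win) // step + 1
    psds = []
    for i in range(n_seg):
        seg = eeg[:, i * step : i * step + win]
        # nperseg equal to the segment length is an assumption here; the
        # pipeline may subdivide further within each segment.
        freqs, psd = welch(seg, fs=fs, nperseg=win)
        psds.append(psd)
    return freqs, np.stack(psds)
```

With these defaults, a 2-minute recording at an assumed 250 Hz sampling rate steps through the data in 1-second increments, yielding (120 - 2) / 1 + 1 = 119 segments, the count quoted later on this page.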

Raw EEG power spectral density follows a chi-squared distribution—it’s right-skewed, bounded at zero, and its variance scales with its mean. These properties violate the assumptions of the t-test, which expects roughly normally distributed data with stable variance.

The fix is straightforward: take the natural logarithm of each power value before testing. Log-transforming chi-squared data produces values that are approximately normal with stabilized variance. This is not a statistical trick—it’s a standard transformation used throughout the EEG literature precisely because PSD data has this known distributional property.
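A quick simulation illustrates the effect of the transform. The chi-squared draws below are a synthetic stand-in for segment-level band power, not real pipeline output:

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(42)
# Synthetic stand-in for segment-level band power: chi-squared draws are
# positive, right-skewed, and have variance that grows with the mean.
power = rng.chisquare(df=4, size=5000)

raw_skew = skew(power)            # strongly right-skewed
log_skew = skew(np.log(power))    # noticeably closer to symmetric
print(f"skew before: {raw_skew:.2f}, after log: {log_skew:.2f}")
```

The log also stabilizes variance across power levels, which is the other t-test assumption at stake.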

With log-transformed segment-level power values in hand, the pipeline runs an independent-samples Welch’s t-test (unequal variance) comparing the before and after segments at each channel × band combination.

Why Welch’s t-test rather than a paired test? The two recordings typically have different numbers of clean segments (artifact rejection removes variable amounts of data), and segment-level data from different sessions cannot be meaningfully paired. Welch’s variant doesn’t assume equal variance between groups—a safe default when comparing recordings that may differ in noise characteristics.
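A minimal sketch of the per-channel test, assuming log-power arrays shaped (segments, channels) for a single band. The data are synthetic, the unequal segment counts mimic artifact rejection, and the shift injected at channel 3 is contrived for illustration:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(7)
# Hypothetical log-power samples for one frequency band, 19 channels.
before = rng.normal(loc=0.0, scale=1.0, size=(119, 19))
after = rng.normal(loc=0.0, scale=1.2, size=(103, 19))  # fewer clean segments
after[:, 3] += 1.0  # a genuine shift at one channel

# Welch's t-test (equal_var=False) at every channel; unequal n is fine.
t, p = ttest_ind(before, after, axis=0, equal_var=False)
print(p[3])  # tiny at the shifted channel; the rest hover near uniform
```

The resulting p-values, one per channel × band, are what the multiple-comparison correction below operates on.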

For a typical 2-minute recording with 2-second segments at 50% overlap, each session yields roughly 119 segments per channel. That’s more than enough statistical power to detect clinically meaningful changes.

A 19-channel montage with 7 frequency bands produces 133 simultaneous statistical tests. A 37-channel montage with the same bands produces 259. Without correction, at α = 0.05 you’d expect roughly 7–13 false positives by chance alone.

The pipeline applies the Benjamini-Hochberg procedure to control the false discovery rate (FDR). Unlike the more conservative Bonferroni correction—which controls the probability of any false positive and becomes extremely stringent as the number of tests grows—FDR correction controls the proportion of discoveries that are false. At FDR α = 0.05, you expect no more than 5% of the channels flagged as significant to be false positives.
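The Benjamini-Hochberg step is simple enough to show directly. This is a generic textbook implementation of the procedure, not the pipeline's own code:

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """Return a boolean mask of tests that survive BH FDR control.

    Reject the hypotheses with the k smallest p-values, where k is the
    largest rank such that p_(k) <= (k / m) * alpha.
    """
    p = np.asarray(pvals)
    m = p.size
    order = np.argsort(p)
    ranked = p[order]
    below = ranked <= (np.arange(1, m + 1) / m) * alpha
    mask = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])
        mask[order[: k + 1]] = True
    return mask
```

statsmodels' `multipletests(pvals, method="fdr_bh")` implements the same procedure if you prefer a library call.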

This is the right trade-off for clinical EEG comparison. Bonferroni would miss subtle but real changes in the interest of absolute certainty. FDR lets real changes through while keeping the false positive rate bounded and interpretable.

Statistical significance alone isn’t enough. With enough segments, even a trivially small difference will reach significance—the t-test has no concept of “clinically meaningful.” A channel that changed by 0.01 µV² might be statistically significant but practically irrelevant.

To address this, the pipeline applies a second gate: Cohen’s d (standardized effect size). Only channels that pass both the FDR-corrected significance threshold and a minimum effect size are flagged as significant.

The default threshold is |d| ≥ 0.3, which corresponds roughly to a “small-to-medium” effect in Cohen’s conventional classification. This filters out changes that, while statistically detectable, are too small to warrant clinical attention.

The dual gate—statistical significance and practical effect size—means that when a channel is flagged, you can be reasonably confident that (a) the change is unlikely to be noise, and (b) the change is large enough to matter.
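A sketch of the dual gate, using a pooled-standard-deviation Cohen's d. The numbers are synthetic and chosen to show the failure mode the gate prevents: with enough samples, a trivial 0.05-unit shift can become statistically detectable while |d| stays far below 0.3:

```python
import numpy as np
from scipy.stats import ttest_ind

def cohens_d(a, b):
    """Standardized mean difference using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return (b.mean() - a.mean()) / np.sqrt(pooled_var)

rng = np.random.default_rng(1)
before = rng.normal(0.00, 1.0, size=5000)  # contrived: huge sample,
after = rng.normal(0.05, 1.0, size=5000)   # trivially small true shift

_, p = ttest_ind(before, after, equal_var=False)
d = cohens_d(before, after)

# p may dip below 0.05 purely because n is large, but |d| stays near 0.05,
# well under the 0.3 gate, so this channel is not flagged.
flagged = bool(p < 0.05 and abs(d) >= 0.3)
```

In the real pipeline the same logic applies per channel × band: the FDR mask and the effect-size mask are combined with a logical AND.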


The three-column topographic grid shows one row per frequency band:

  • Left column (Before): Absolute power distribution from the baseline session. Color scale runs from cool (low power) to warm (high power), symmetric around the midpoint for that band.
  • Center column (After): Same scale, same colormap—the comparison session. Visual differences between left and center reflect raw power changes.
  • Right column (Significance): Electrode markers on a head outline. Red squares indicate channels where the change is statistically significant (FDR-corrected p < α and |Cohen’s d| ≥ threshold). Black squares indicate channels that did not reach significance.

The key insight: look at the significance column first, then examine the before/after columns for context. The significance map tells you where to look; the power maps tell you what changed.

The heatmap table shows channels (rows) × frequency bands (columns), with each cell displaying the percent change in band power. Red tones indicate increases; blue tones indicate decreases. Statistically significant cells are marked with a ★ and rendered in bold.

This is your summary view—scan for patterns. Are the significant changes clustered in a particular region? A particular band? Distributed or focal? The spatial and spectral pattern of change often matters more than any individual cell value.

The dot map projects significant channels onto the 10-20 scalp layout for a selected frequency band. Red dots mark significant increases, blue dots mark significant decreases, and gray dots mark channels without significant change. Use the band selector to cycle through frequencies and see how the spatial pattern of significance shifts.


| Parameter | Default | Meaning |
| --- | --- | --- |
| segment_seconds | 2.0 | Window length for spectral segmentation |
| significance_alpha | 0.05 | FDR-corrected significance threshold |
| min_effect_size | 0.3 | Minimum Cohen’s d for clinical relevance |

These defaults can be adjusted per-comparison in the comparison dialog. Lowering the alpha makes the test more conservative (fewer false positives, potentially more missed real changes). Raising the effect size threshold focuses on larger changes. The defaults represent a balanced starting point for clinical use.


Statistical comparison answers “did this change exceed what we’d expect from noise?” It does not answer why the change occurred. A significant decrease in frontal theta could reflect successful neurofeedback training, natural circadian variation, a change in medication, better sleep the night before, or a dozen other factors. The comparison provides evidence of change; clinical reasoning provides the interpretation.

The test also operates channel-by-channel. It doesn’t capture network-level reorganization—a coordinated shift in connectivity patterns across multiple regions might not produce large changes at any single electrode. For network-level comparison, use the connectivity comparison panel, which applies permutation testing to hub-level dwPLI matrices.

Finally, the segment-level approach assumes approximate stationarity within each 2-second window. For most resting-state EEG, this is a reasonable assumption. For recordings with large state transitions (falling asleep mid-recording, sudden anxiety onset), the segments from different states will increase variance and reduce sensitivity. Clean, stable recordings produce the most reliable comparisons.