We show that individual differences in visual experience emerge from high-dimensional neural activity patterns during naturalistic movie viewing, with distinct latent dimensions capturing behaviorally relevant aspects of perception. This multidimensional geometry, revealed through spectral decomposition, explains inter-individual variability beyond what conventional methods capture.
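For concreteness, here is a minimal sketch of the kind of spectral decomposition described above: eigendecomposition of group movie-viewing responses, with each subject projected onto the resulting latent dimensions. All data shapes are simulated placeholders, not values from the study.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder data: subjects x timepoints x voxels responses to the
# same naturalistic movie (simulated; real data would come from fMRI).
n_subjects, n_time, n_voxels = 10, 500, 200
data = rng.standard_normal((n_subjects, n_time, n_voxels))

# Define shared latent dimensions from the group-average response.
group_mean = data.mean(axis=0)
group_mean -= group_mean.mean(axis=0, keepdims=True)

# Spectral decomposition: eigenvectors of the group covariance give
# latent dimensions ordered by the variance they explain.
cov = group_mean.T @ group_mean / (n_time - 1)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Project each subject onto the shared dimensions; differences in these
# projections index individual variation along each latent dimension.
subject_scores = np.stack([
    (subj - subj.mean(axis=0)) @ eigvecs for subj in data
])
print(subject_scores.shape)  # (n_subjects, n_time, n_dims)
```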
We examined whether deep neural networks for vision align with brain representations because of shared constraints or because they converge on universal features. Analysis of a diverse set of networks revealed latent dimensions of image representation shared across them. Comparing these dimensions to human fMRI data showed that the brain-aligned representations are precisely those shared universally across networks. This suggests that similarities between artificial and biological vision arise from a core set of universal image representations that networks learn convergently.
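A minimal sketch of one way to recover dimensions shared between two networks, using the SVD of their feature cross-covariance; the feature matrices below are random placeholders, and the analysis in the paper may differ in its details.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder features from two different networks on the same images.
n_images = 1000
feats_a = rng.standard_normal((n_images, 512))
feats_b = rng.standard_normal((n_images, 768))

# Center, then take the SVD of the cross-covariance: the singular
# vectors define latent dimensions shared by both networks.
feats_a -= feats_a.mean(0)
feats_b -= feats_b.mean(0)
u, s, vt = np.linalg.svd(feats_a.T @ feats_b / (n_images - 1),
                         full_matrices=False)

# Shared-dimension scores for each image in each network.
scores_a = feats_a @ u
scores_b = feats_b @ vt.T

# A shared dimension can then be tested for brain alignment by
# correlating its image scores with fMRI responses (not shown).
print(np.corrcoef(scores_a[:, 0], scores_b[:, 0])[0, 1])
```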
We find that untrained convolutional neural networks can produce brain-like visual representations, challenging the view that extensive training is necessary for such similarities. The key factor is network architecture, specifically how networks compress spatial information while expanding feature information. This suggests that the basic structure of convolutional networks mirrors constraints on biological vision, allowing cortex-like representations to emerge even without learning from experience.
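As an illustration, the sketch below builds an untrained network with the two architectural motifs named above: spatial compression via striding and pooling, and feature expansion via growing channel counts. Layer sizes are illustrative, not those of any network from the study.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# A minimal untrained network combining spatial compression
# (stride/pooling) with feature expansion (growing channel counts).
model = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3), nn.ReLU(),
    nn.MaxPool2d(2),                                  # compress space
    nn.Conv2d(64, 256, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),                                  # compress space again
    nn.Conv2d(256, 1024, kernel_size=3, padding=1), nn.ReLU(),  # expand features
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
)

images = torch.randn(8, 3, 224, 224)  # placeholder image batch
with torch.no_grad():
    feats = model(images)             # random-weight features
print(feats.shape)                    # torch.Size([8, 1024])
# These untrained features could then be fit to fMRI responses with a
# standard encoding model (e.g., ridge regression) to assess brain likeness.
```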
Can neural representations be fully understood through dimensionality reduction? Our recent work demonstrates that visual representations in the human brain require interpretation within high-dimensional spaces. Moreover, we show that traditional methods like representational similarity analysis fail to detect this high-dimensional information in cortical activity; a spectral approach is needed instead. This research uncovers a vast expanse of uncharted dimensions that conventional techniques have overlooked but that may be crucial for decoding the cortical code of human vision.
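A minimal sketch of a cross-validated spectral estimate, assuming two independent halves of stimulus-evoked responses: unlike a single summary statistic, it resolves reliable signal variance dimension by dimension, including the low-variance tail. Shapes and noise levels are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder stimulus-by-voxel responses from two independent halves
# of the data (e.g., odd/even presentations of the same images).
n_stim, n_voxels = 800, 300
half1 = rng.standard_normal((n_stim, n_voxels))
half2 = half1 + rng.standard_normal((n_stim, n_voxels))  # shared signal + noise

half1 -= half1.mean(0)
half2 -= half2.mean(0)

# Cross-validated spectral estimate: eigenvectors come from half 1, and
# signal variance along each axis is measured as the covariance of the
# two halves' projections, so noise-only dimensions average to zero.
_, _, vt = np.linalg.svd(half1, full_matrices=False)
proj1 = half1 @ vt.T
proj2 = half2 @ vt.T
cross_spectrum = (proj1 * proj2).mean(0)  # reliable variance per dimension
print(cross_spectrum[:5])
```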
Artificial neural networks resemble ventral-stream activity but still miss aspects of cortical processing. We show that test-time augmentation (TTA)—averaging features from multiple augmented views of the same image—reliably boosts model–brain similarity without changing the model. This holds across CNNs, ResNets, ViTs, and robust models. Surprisingly, averaging features from semantically similar but visually varied augmentations, including text-guided diffusion samples, can outperform the original image. This suggests that semantic gist, not precise visual detail, drives the improved alignment with high-level visual cortex.
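A minimal sketch of TTA feature averaging, assuming a generic feature extractor; the augmentation set is illustrative, and a random-weight stand-in model is used so the example runs offline.

```python
import torch
import torchvision.transforms as T

torch.manual_seed(0)

# Augmentations used to generate multiple views of each image
# (an illustrative set, not the exact one from the study).
augment = T.Compose([
    T.RandomResizedCrop(224, scale=(0.6, 1.0)),
    T.RandomHorizontalFlip(),
    T.ColorJitter(brightness=0.3, contrast=0.3),
])

def tta_features(model, image, n_views=16):
    """Average a model's features over augmented views of one image."""
    views = torch.stack([augment(image) for _ in range(n_views)])
    with torch.no_grad():
        feats = model(views)       # (n_views, feature_dim)
    return feats.mean(dim=0)       # averaged TTA feature vector

# Works with any extractor mapping (N, 3, 224, 224) -> (N, D); this
# random-weight stand-in keeps the sketch self-contained.
model = torch.nn.Sequential(torch.nn.Flatten(),
                            torch.nn.Linear(3 * 224 * 224, 128))
image = torch.rand(3, 256, 256)    # placeholder image tensor in [0, 1]
print(tta_features(model, image).shape)  # torch.Size([128])
```

The averaged vector, rather than the single-image feature, is then compared against brain responses with the usual similarity or encoding analyses.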
Investigating deep neural networks (DNNs) as models of the visual cortex, we found that higher-dimensional representations in these networks better predict cortical responses and support better learning of new stimuli, challenging the idea that lower dimensionality enhances performance. This indicates that high-dimensional geometries may be advantageous for DNN models of visual processing.
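One common summary of representational dimensionality is the participation ratio of the covariance eigenspectrum; the sketch below (with simulated activations) shows how it separates low- and high-dimensional feature geometries. This illustrates the general quantity, not necessarily the paper's exact estimator.

```python
import numpy as np

def effective_dimensionality(features):
    """Participation ratio of the feature covariance eigenspectrum:
    (sum of eigenvalues)^2 / (sum of squared eigenvalues)."""
    x = features - features.mean(axis=0)
    eigvals = np.linalg.eigvalsh(np.cov(x, rowvar=False))
    eigvals = np.clip(eigvals, 0, None)  # guard against tiny negatives
    return eigvals.sum() ** 2 / (eigvals ** 2).sum()

rng = np.random.default_rng(0)
# Simulated layer activations for 500 images; real usage would
# substitute features extracted from a DNN layer.
low_d = rng.standard_normal((500, 5)) @ rng.standard_normal((5, 100))
high_d = rng.standard_normal((500, 100))
print(effective_dimensionality(low_d))   # at most 5 (the embedding rank)
print(effective_dimensionality(high_d))  # close to 100
```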
What factors determine when visual memories will include details that go beyond perceptual experience? Here, seven experiments (N = 1,100 adults) explored whether spatial scale (specifically, perceived viewing distance) drives boundary extension. We created fake miniatures by exploiting tilt shift, a photographic effect that selectively reduces perceived distance while preserving other scene properties (e.g., making a distant railway appear like a model train). We found that visual memory is modulated by the spatial scale at which the environment is viewed.
Human judgments of which objects “go together” rely heavily on context, not just visual appearance or word meaning. By modeling which scenes objects tend to appear in, we built contextual prototypes—the average CNN response to the scenes surrounding each object. These prototypes strongly predicted human similarity judgments across many categories, matching or exceeding models based on object images or word embeddings. This shows that natural scene context is a powerful driver of intuitive object similarity.
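A minimal sketch of the contextual-prototype idea: average the CNN features of the scenes in which each object occurs, then predict object similarity from prototype similarity. The scene features and object-to-scene mappings below are simulated placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder inputs: CNN features for scene images, plus a mapping
# from object categories to the scenes each object appears in.
n_scenes, feat_dim = 2000, 512
scene_feats = rng.standard_normal((n_scenes, feat_dim))
object_to_scenes = {
    "fork": rng.choice(n_scenes, 40, replace=False),
    "plate": rng.choice(n_scenes, 55, replace=False),
    "tent": rng.choice(n_scenes, 30, replace=False),
}

# Contextual prototype: the mean CNN response to the scenes in which
# an object occurs.
prototypes = {
    obj: scene_feats[idx].mean(axis=0)
    for obj, idx in object_to_scenes.items()
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Model prediction for human similarity judgments: cosine similarity
# between contextual prototypes.
print(cosine(prototypes["fork"], prototypes["plate"]))
print(cosine(prototypes["fork"], prototypes["tent"]))
```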
People systematically misremember scene depth: views closer than the typical (“modal”) depth are remembered as farther away (boundary extension), and deeper views are remembered as closer (boundary contraction). These biases align with the statistical distribution of natural scene viewpoints, suggesting memory is pulled toward high-probability views to compensate for noisy or incomplete representations.
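As a toy illustration of this account, remembered depth can be modeled as a noisy observation pulled toward the modal depth of natural views; the weights and values below are arbitrary, not fitted parameters.

```python
# Remembered depth as a weighted combination of the viewed depth and
# the modal (high-probability) depth of natural scene views.
modal_depth = 5.0    # illustrative mode of natural viewing distances
prior_weight = 0.3   # illustrative strength of the pull toward the mode

def remembered_depth(viewed_depth):
    return (1 - prior_weight) * viewed_depth + prior_weight * modal_depth

print(remembered_depth(2.0))  # close view -> remembered farther (extension)
print(remembered_depth(9.0))  # far view   -> remembered closer (contraction)
```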