Xiaoming Zhao, Alexander G. Schwing
2025-03-13
CV
AI
LG
Classifier-free guidance has become a staple for conditional generation with
denoising diffusion models. However, a comprehensive understanding of
classifier-free guidance is still missing. In this work, we carry out an
empirical study to provide a fresh perspective on classifier-free guidance.
Concretely, instead of solely focusing on classifier-free guidance, we trace
back to the root, i.e., classifier guidance, pinpoint the key assumption for
the derivation, and conduct a systematic study to understand the role of the
classifier. We find that both classifier guidance and classifier-free guidance
achieve conditional generation by pushing the denoising diffusion trajectories
away from decision boundaries, i.e., areas where conditional information is
usually entangled and hard to learn. Based on this classifier-centric
understanding, we propose a generic postprocessing step built upon
flow-matching to shrink the gap between the distribution learned by a
pre-trained denoising diffusion model and the real data distribution, primarily
around the decision boundaries. Experiments on various datasets verify the
effectiveness of the proposed approach.
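For context, the standard classifier-guidance and classifier-free-guidance rules
this study revisits (background from the literature, not a contribution of the
paper) can be written as
$\nabla_{x_t} \log p_\gamma(x_t \mid c) = \nabla_{x_t} \log p(x_t) + \gamma\, \nabla_{x_t} \log p(c \mid x_t)$
$\tilde{\epsilon}_\theta(x_t, c) = \epsilon_\theta(x_t, \varnothing) + w\,[\epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing)]$
where the classifier term $\nabla_{x_t} \log p(c \mid x_t)$ is what pushes the
denoising trajectory away from the classifier's decision boundary.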
Ho Kei Cheng, Alexander Schwing
2025-03-13
LG
CV
Minibatch optimal transport coupling straightens paths in unconditional flow
matching. This leads to computationally less demanding inference as fewer
integration steps and less complex numerical solvers can be employed when
numerically solving an ordinary differential equation at test time. However, in
the conditional setting, minibatch optimal transport falls short. This is
because the default optimal transport mapping disregards conditions, resulting
in a conditionally skewed prior distribution during training. In contrast, at
test time, we have no access to the skewed prior, and instead sample from the
full, unbiased prior distribution. This gap between training and testing leads
to subpar performance. To bridge this gap, we propose conditional optimal
transport (C^2OT), which adds a conditional weighting term to the cost matrix when
computing the optimal transport assignment. Experiments demonstrate that this
simple fix works with both discrete and continuous conditions in
8gaussians-to-moons, CIFAR-10, ImageNet-32x32, and ImageNet-256x256. Our method
performs better overall than the existing baselines across different
function evaluation budgets. Code is available at
https://hkchengrex.github.io/C2OT
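A minimal sketch of the stated idea: add a condition-dependent penalty to the
minibatch optimal transport cost before solving the assignment. The function
names, the penalty weight, and the way conditions are drawn for the prior side
are illustrative assumptions, not the released implementation.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def conditional_ot_pairing(noise, data, cond_noise, cond_data, weight=1e3):
    """Pair noise samples with data samples via minibatch OT, penalizing
    pairs whose (discrete) conditions disagree. Names are illustrative."""
    # squared Euclidean transport cost between every noise/data pair
    cost = ((noise[:, None, :] - data[None, :, :]) ** 2).sum(-1)
    # conditional weighting term: large penalty when conditions mismatch
    cost = cost + weight * (cond_noise[:, None] != cond_data[None, :])
    rows, cols = linear_sum_assignment(cost)   # optimal assignment on the batch
    return noise[rows], data[cols]

# usage: pair a Gaussian prior batch with a conditioned data batch
noise = np.random.randn(64, 2)
data = np.random.randn(64, 2) + 3.0
cond_noise = np.random.randint(0, 2, size=64)  # conditions drawn independently, as at test time
cond_data = np.random.randint(0, 2, size=64)
x0, x1 = conditional_ot_pairing(noise, data, cond_noise, cond_data)
```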
Zhaoyi Li, Xiaohan Zhao, Dong-Dong Wu, Jiacheng Cui, Zhiqiang Shen
2025-03-13
CV
AI
LG
Despite promising performance on open-source large vision-language models
(LVLMs), transfer-based targeted attacks often fail against black-box
commercial LVLMs. Analyzing failed adversarial perturbations reveals that the
learned perturbations typically originate from a uniform distribution and lack
clear semantic details, resulting in unintended responses. This critical
absence of semantic information leads commercial LVLMs to either ignore the
perturbation entirely or misinterpret its embedded semantics, thereby causing
the attack to fail. To overcome these issues, we notice that identifying core
semantic objects is a key objective for models trained with various datasets
and methodologies. This insight motivates our approach that refines semantic
clarity by encoding explicit semantic details within local regions, thus
ensuring interoperability and capturing finer-grained features, and by
concentrating modifications on semantically rich areas rather than applying
them uniformly. To achieve this, we propose a simple yet highly effective
solution: at each optimization step, the adversarial image is cropped randomly
by a controlled aspect ratio and scale, resized, and then aligned with the
target image in the embedding space. Experimental results confirm our
hypothesis. Our adversarial examples crafted with local-aggregated
perturbations focused on crucial regions exhibit surprisingly good
transferability to commercial LVLMs, including GPT-4.5, GPT-4o,
Gemini-2.0-flash, Claude-3.5-sonnet, Claude-3.7-sonnet, and even reasoning
models like o1, Claude-3.7-thinking and Gemini-2.0-flash-thinking. Our approach
achieves success rates exceeding 90% on GPT-4.5, 4o, and o1, significantly
outperforming all prior state-of-the-art attack methods. Our adversarial
examples optimized under different configurations, together with the training
code, are available at https://github.com/VILA-Lab/M-Attack.
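A hedged sketch of the crop-resize-align loop described above, written as a
PGD-style optimization against a frozen image encoder; `encoder`, the step
sizes, and the crop parameters are illustrative assumptions rather than the
released M-Attack code.

```python
import torch
import torch.nn.functional as F
from torchvision import transforms

def local_aggregated_attack(encoder, source, target, steps=300, eps=8/255,
                            lr=1/255, scale=(0.5, 1.0), ratio=(0.75, 1.33)):
    """At each step, randomly crop/resize the adversarial image and pull the
    crop's embedding toward the target image's embedding."""
    crop = transforms.RandomResizedCrop(source.shape[-2:], scale=scale, ratio=ratio)
    delta = torch.zeros_like(source, requires_grad=True)
    with torch.no_grad():
        tgt_emb = F.normalize(encoder(target), dim=-1)
    for _ in range(steps):
        patch = crop((source + delta).clamp(0, 1))   # random local region, resized back
        emb = F.normalize(encoder(patch), dim=-1)
        loss = -(emb * tgt_emb).sum()                # maximize cosine similarity to target
        loss.backward()
        with torch.no_grad():
            delta -= lr * delta.grad.sign()          # signed gradient step
            delta.clamp_(-eps, eps)                  # stay inside the perturbation budget
            delta.grad.zero_()
    return (source + delta).clamp(0, 1).detach()
```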
Eliahu Horwitz, Nitzan Kurer, Jonathan Kahana, Liel Amar, Yedid Hoshen
2025-03-13
LG
CL
CV
As there are now millions of publicly available neural networks, searching
and analyzing large model repositories becomes increasingly important.
Navigating so many models requires an atlas, but as most models are poorly
documented, charting such an atlas is challenging. To explore the hidden
potential of model repositories, we chart a preliminary atlas representing the
documented fraction of Hugging Face. It provides stunning visualizations of the
model landscape and evolution. We demonstrate several applications of this
atlas including predicting model attributes (e.g., accuracy), and analyzing
trends in computer vision models. However, as the current atlas remains
incomplete, we propose a method for charting undocumented regions.
Specifically, we identify high-confidence structural priors based on dominant
real-world model training practices. Leveraging these priors, our approach
enables accurate mapping of previously undocumented areas of the atlas. We
publicly release our datasets, code, and interactive atlas.
Subhajit Maity, Killian Hitsman, Xin Li, Aritra Dutta
2025-03-13
LG
CV
68T07
10
Kolmogorov-Arnold networks (KANs) are a remarkable innovation consisting of
learnable activation functions with the potential to capture more complex
relationships from data. Although KANs are useful in finding symbolic
representations and continual learning of one-dimensional functions, their
effectiveness in diverse machine learning (ML) tasks, such as vision, remains
questionable. Presently, KANs are deployed by replacing multilayer perceptrons
(MLPs) in deep network architectures, including advanced architectures such as
vision Transformers (ViTs). In this paper, we are the first to design a general
learnable Kolmogorov-Arnold Attention (KArAt) for vanilla ViTs that can operate
on any choice of basis. However, the computing and memory costs of training
them motivated us to propose a more modular version, for which we designed a
particular learnable attention called Fourier-KArAt. Fourier-KArAt and its variants
either outperform their ViT counterparts or show comparable performance on
CIFAR-10, CIFAR-100, and ImageNet-1K datasets. We dissect these architectures'
performance and generalization capacity by analyzing their loss landscapes,
weight distributions, optimizer path, attention visualization, and spectral
behavior, and contrast them with vanilla ViTs. The goal of this paper is not to
produce parameter- and compute-efficient attention, but to encourage the
community to explore KANs in conjunction with more advanced architectures that
require a careful understanding of learnable activations. Our open-source code
and implementation details are available at https://subhajitmaity.me/KArAt
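A minimal sketch of the kind of Fourier-parameterized learnable unit alluded to
here; the module name, basis size, and initialization are illustrative
assumptions, and the actual KArAt/Fourier-KArAt design lives in the linked code.

```python
import torch
import torch.nn as nn

class FourierLearnableActivation(nn.Module):
    """A learnable 1-D unit parameterized in a Fourier basis:
    phi(x) = sum_k a_k * cos(k*x) + b_k * sin(k*x)."""
    def __init__(self, num_frequencies=8):
        super().__init__()
        # small random init so the unit is not identically zero (illustrative choice)
        self.a = nn.Parameter(0.01 * torch.randn(num_frequencies))
        self.b = nn.Parameter(0.01 * torch.randn(num_frequencies))
        self.register_buffer("k", torch.arange(1, num_frequencies + 1).float())

    def forward(self, x):
        xk = x.unsqueeze(-1) * self.k                               # (..., K)
        return (self.a * torch.cos(xk) + self.b * torch.sin(xk)).sum(-1)
```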
Tianjiao Yu, Vedant Shah, Muntasir Wahed, Kiet A. Nguyen, Adheesh Juvekar, Tal August, Ismini Lourentzou
2025-03-13
AI
LG
Expressing confidence is challenging for embodied agents navigating dynamic
multimodal environments, where uncertainty arises from both perception and
decision-making processes. We present the first work investigating embodied
confidence elicitation in open-ended multimodal environments. We introduce
Elicitation Policies, which structure confidence assessment across inductive,
deductive, and abductive reasoning, along with Execution Policies, which
enhance confidence calibration through scenario reinterpretation, action
sampling, and hypothetical reasoning. Evaluating agents in calibration and
failure prediction tasks within the Minecraft environment, we show that
structured reasoning approaches, such as Chain-of-Thoughts, improve confidence
calibration. However, our findings also reveal persistent challenges in
distinguishing uncertainty, particularly under abductive settings, underscoring
the need for more sophisticated embodied confidence elicitation methods.
Ziyu Guo, Ray Zhang, Hao Chen, Jialin Gao, Dongzhi Jiang, Jiaze Wang, Pheng-Ann Heng
2025-03-13
CV
AI
CL
The rapid advancement of Large Multi-modal Models (LMMs) has enabled their
application in scientific problem-solving, yet their fine-grained capabilities
remain under-explored. In this paper, we introduce SciVerse, a multi-modal
scientific evaluation benchmark to thoroughly assess LMMs across 5,735 test
instances in five distinct versions. We aim to investigate three key dimensions
of LMMs: scientific knowledge comprehension, multi-modal content
interpretation, and Chain-of-Thought (CoT) reasoning. To unveil whether LMMs
possess sufficient scientific expertise, we first transform each problem into
three versions containing different levels of knowledge required for solving,
i.e., Knowledge-free, -lite, and -rich. Then, to explore how LMMs interpret
multi-modal scientific content, we annotate another two versions, i.e.,
Vision-rich and -only, which move more of the question information from text
into diagrams. Comparing the results of different versions, SciVerse systematically
examines the professional knowledge stock and visual perception skills of LMMs
in scientific domains. In addition, to rigorously assess CoT reasoning, we
propose a new scientific CoT evaluation strategy, conducting a step-wise
assessment on knowledge and logical errors in model outputs. Our extensive
evaluation of different LMMs on SciVerse reveals critical limitations in their
scientific proficiency and provides new insights into future developments.
Project page: https://sciverse-cuhk.github.io
Mert Albaba, Chenhao Li, Markos Diomataris, Omid Taheri, Andreas Krause, Michael Black
2025-03-13
CV
AI
LG
RO
Acquiring physically plausible motor skills across diverse and unconventional
morphologies, including humanoid robots, quadrupeds, and animals, is essential
for advancing character simulation and robotics. Traditional methods, such as
reinforcement learning (RL), are task- and body-specific, require extensive
reward function engineering, and do not generalize well. Imitation learning
offers an alternative but relies heavily on high-quality expert demonstrations,
which are difficult to obtain for non-human morphologies. Video diffusion
models, on the other hand, are capable of generating realistic videos of
various morphologies, from humans to ants. Leveraging this capability, we
propose a data-independent approach for skill acquisition that learns 3D motor
skills from 2D-generated videos, with generalization capability to
unconventional and non-human forms. Specifically, we guide the imitation
learning process with vision transformers, comparing videos via the pair-wise
distance between their embeddings. Along with
video-encoding distance, we also use a computed similarity between segmented
video frames as a guidance reward. We validate our method on locomotion tasks
involving unique body configurations. In humanoid robot locomotion tasks, we
demonstrate that 'No-data Imitation Learning' (NIL) outperforms baselines
trained on 3D motion-capture data. Our results highlight the potential of
leveraging generative video models for physically plausible skill learning with
diverse morphologies, effectively replacing data collection with data
generation for imitation learning.
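A hedged sketch of the video-embedding guidance reward described above; `vit`
stands for any frozen vision transformer, and the time-aligned frame pairing is
an assumption.

```python
import torch
import torch.nn.functional as F

def video_similarity_reward(vit, rollout_frames, reference_frames):
    """Mean cosine similarity between time-aligned frame embeddings of the
    agent's rollout and a generated reference video. `vit` is assumed to map
    [T, 3, H, W] frames to [T, D] embeddings."""
    with torch.no_grad():
        z_roll = F.normalize(vit(rollout_frames), dim=-1)
        z_ref = F.normalize(vit(reference_frames), dim=-1)
    return (z_roll * z_ref).sum(-1).mean()
```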
Lingteng Qiu, Xiaodong Gu, Peihao Li, Qi Zuo, Weichao Shen, Junfei Zhang, Kejie Qiu, Weihao Yuan, Guanying Chen, Zilong Dong, Liefeng Bo
2025-03-13
CV
AI
Animatable 3D human reconstruction from a single image is a challenging
problem due to the ambiguity in decoupling geometry, appearance, and
deformation. Recent advances in 3D human reconstruction mainly focus on static
human modeling, and the reliance on synthetic 3D scans for training
limits their generalization ability. Conversely, optimization-based video
methods achieve higher fidelity but demand controlled capture conditions and
computationally intensive refinement processes. Motivated by the emergence of
large reconstruction models for efficient static reconstruction, we propose LHM
(Large Animatable Human Reconstruction Model) to infer high-fidelity avatars
represented as 3D Gaussian splatting in a feed-forward pass. Our model
leverages a multimodal transformer architecture to effectively encode the human
body positional features and image features with an attention mechanism, enabling
detailed preservation of clothing geometry and texture. To further boost the
face identity preservation and fine detail recovery, we propose a head feature
pyramid encoding scheme to aggregate multi-scale features of the head regions.
Extensive experiments demonstrate that our LHM generates plausible animatable
humans in seconds without post-processing for face and hands, outperforming
existing methods in both reconstruction accuracy and generalization ability.
Boqian Li, Haiwen Feng, Zeyu Cai, Michael J. Black, Yuliang Xiu
2025-03-13
CV
AI
GR
Fitting a body to a 3D clothed human point cloud is a common yet challenging
task. Traditional optimization-based approaches use multi-stage pipelines that
are sensitive to pose initialization, while recent learning-based methods often
struggle with generalization across diverse poses and garment types. We propose
Equivariant Tightness Fitting for Clothed Humans, or ETCH, a novel pipeline
that estimates cloth-to-body surface mapping through locally approximate SE(3)
equivariance, encoding tightness as displacement vectors from the cloth surface
to the underlying body. Following this mapping, pose-invariant body features
regress sparse body markers, simplifying clothed human fitting into an
inner-body marker fitting task. Extensive experiments on CAPE and 4D-Dress show
that ETCH significantly outperforms state-of-the-art methods -- both
tightness-agnostic and tightness-aware -- in body fitting accuracy on loose
clothing (16.7% ~ 69.5%) and shape accuracy (average 49.9%). Our equivariant
tightness design can even reduce directional errors by (67.2% ~ 89.8%) in
one-shot (or out-of-distribution) settings. Qualitative results demonstrate
strong generalization of ETCH across challenging poses, unseen shapes,
loose clothing, and non-rigid dynamics. We will release the code and models
soon for research purposes at https://boqian-li.github.io/ETCH/.
Jiachen Zhu, Xinlei Chen, Kaiming He, Yann LeCun, Zhuang Liu
2025-03-13
LG
AI
CL
CV
Normalization layers are ubiquitous in modern neural networks and have long
been considered essential. This work demonstrates that Transformers without
normalization can achieve the same or better performance using a remarkably
simple technique. We introduce Dynamic Tanh (DyT), an element-wise operation
$\mathrm{DyT}(x) = \tanh(\alpha x)$, as a drop-in replacement for normalization
layers in Transformers. DyT is inspired by the observation that layer
normalization in Transformers often produces tanh-like, $S$-shaped input-output
mappings. By incorporating DyT, Transformers without normalization can match or
exceed the performance of their normalized counterparts, mostly without
hyperparameter tuning. We validate the effectiveness of Transformers with DyT
across diverse settings, ranging from recognition to generation, supervised to
self-supervised learning, and computer vision to language models. These
findings challenge the conventional understanding that normalization layers are
indispensable in modern neural networks, and offer new insights into their role
in deep networks.
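A minimal sketch of the operation, assuming the drop-in module also carries the
per-channel affine parameters a LayerNorm replacement would typically have (an
assumption beyond the formula quoted above); the $\alpha$ initialization is
illustrative.

```python
import torch
import torch.nn as nn

class DyT(nn.Module):
    """Element-wise Dynamic Tanh, tanh(alpha * x), used in place of a
    normalization layer."""
    def __init__(self, dim, init_alpha=0.5):
        super().__init__()
        self.alpha = nn.Parameter(torch.full((1,), init_alpha))  # learnable scalar slope
        self.weight = nn.Parameter(torch.ones(dim))              # per-channel scale (assumption)
        self.bias = nn.Parameter(torch.zeros(dim))               # per-channel shift (assumption)

    def forward(self, x):
        # x: [..., dim]; squash with a learnable slope, then apply per-channel affine
        return self.weight * torch.tanh(self.alpha * x) + self.bias
```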
Andy Zhou
2025-03-13
AI
CL
CR
We introduce Siege, a multi-turn adversarial framework that models the
gradual erosion of Large Language Model (LLM) safety through a tree search
perspective. Unlike single-turn jailbreaks that rely on one meticulously
engineered prompt, Siege expands the conversation at each turn in a
breadth-first fashion, branching out multiple adversarial prompts that exploit
partial compliance from previous responses. By tracking these incremental
policy leaks and re-injecting them into subsequent queries, Siege reveals how
minor concessions can accumulate into fully disallowed outputs. Evaluations on
the JailbreakBench dataset show that Siege achieves a 100% success rate on
GPT-3.5-turbo and 97% on GPT-4 in a single multi-turn run, using fewer queries
than baselines such as Crescendo or GOAT. This tree search methodology offers
an in-depth view of how model safeguards degrade over successive dialogue
turns, underscoring the urgency of robust multi-turn testing procedures for
language models.
Andy Zhou
2025-03-13
CL
AI
Adapting large language models to multiple tasks can cause cross-skill
interference, where improvements for one skill degrade another. While methods
such as LoRA impose orthogonality constraints at the weight level, they do not
fully address interference in hidden-state representations. We propose
Compositional Subspace Representation Fine-tuning (CS-ReFT), a novel
representation-based approach that learns multiple orthonormal subspace
transformations, each specializing in a distinct skill, and composes them via a
lightweight router. By isolating these subspace edits in the hidden state,
rather than weight matrices, CS-ReFT prevents cross-task conflicts more
effectively. On the AlpacaEval benchmark, applying CS-ReFT to Llama-2-7B
achieves a 93.94% win rate, surpassing GPT-3.5 Turbo (86.30%) while requiring
only 0.0098% of model parameters. These findings show that specialized
representation edits, composed via a simple router, significantly enhance
multi-task instruction following with minimal overhead.
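A hedged sketch of the idea: several skill-specific low-rank subspace edits on a
hidden state, in the spirit of ReFT-style interventions, composed by a
lightweight router. The exact parameterization and gating used by CS-ReFT may
differ; the rank, gating form, and class names here are assumptions.

```python
import torch
import torch.nn as nn
from torch.nn.utils.parametrizations import orthogonal

class SubspaceEdit(nn.Module):
    """One low-rank subspace intervention on a hidden state:
    h + R^T (W h + b - R h), with row-orthonormal R."""
    def __init__(self, dim, rank):
        super().__init__()
        self.R = orthogonal(nn.Linear(dim, rank, bias=False))  # rows of R.weight kept orthonormal
        self.W = nn.Linear(dim, rank)

    def forward(self, h):
        return h + (self.W(h) - self.R(h)) @ self.R.weight

class CSReFTRouter(nn.Module):
    """Compose several skill-specific subspace edits via a lightweight router."""
    def __init__(self, dim, rank, num_skills):
        super().__init__()
        self.edits = nn.ModuleList(SubspaceEdit(dim, rank) for _ in range(num_skills))
        self.router = nn.Linear(dim, num_skills)

    def forward(self, h):
        gates = torch.sigmoid(self.router(h))                         # [..., S]
        deltas = torch.stack([e(h) - h for e in self.edits], dim=-1)  # [..., D, S]
        return h + (deltas * gates.unsqueeze(-2)).sum(-1)
```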
Jun Yu, Lingsi Zhu, Yanjun Chi, Yunxiang Zhang, Yang Zheng, Yongqi Wang, Xilong Lu
2025-03-13
CV
AI
Emotional Mimicry Intensity (EMI) estimation serves as a critical technology
for understanding human social behavior and enhancing human-computer
interaction experiences, where the core challenge lies in dynamic correlation
modeling and robust fusion of multimodal temporal signals. To address the
limitations of existing methods, namely insufficient exploitation of cross-modal
synergy, sensitivity to noise, and limited fine-grained alignment, this paper
proposes a dual-stage cross-modal alignment framework.
First, we construct vision-text and audio-text contrastive learning networks
based on an improved CLIP architecture, achieving preliminary alignment in the
feature space through modality-decoupled pre-training. Subsequently, we design
a temporal-aware dynamic fusion module that combines Temporal Convolutional
Networks (TCN) and gated bidirectional LSTM to respectively capture the
macro-evolution patterns of facial expressions and local dynamics of acoustic
features. Innovatively, we introduce a quality-guided modality fusion strategy
that enables modality compensation in occluded and noisy scenarios through
differentiable weight allocation. Experimental results on the Hume-Vidmimic2
dataset demonstrate that our method achieves an average Pearson correlation
coefficient of 0.35 across six emotion dimensions, outperforming the best
baseline by 40%. Ablation studies further validate the effectiveness of the
dual-stage training strategy and dynamic fusion mechanism, providing a novel
technical pathway for fine-grained emotion analysis in open environments.
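A small sketch of a quality-guided fusion of the kind described, with
differentiable per-modality weights derived from learned quality scores;
dimensions and the gating form are assumptions.

```python
import torch
import torch.nn as nn

class QualityGuidedFusion(nn.Module):
    """Fuse modality features with differentiable weights from learned quality
    scores, so occluded or noisy modalities are down-weighted."""
    def __init__(self, dim, num_modalities):
        super().__init__()
        self.quality = nn.ModuleList(nn.Linear(dim, 1) for _ in range(num_modalities))

    def forward(self, feats):
        # feats: list of [B, dim] modality features (e.g., vision, audio, text)
        scores = torch.cat([q(f) for q, f in zip(self.quality, feats)], dim=-1)  # [B, M]
        weights = torch.softmax(scores, dim=-1)                                  # [B, M]
        stacked = torch.stack(feats, dim=-1)                                     # [B, dim, M]
        return (stacked * weights.unsqueeze(1)).sum(-1)                          # [B, dim]
```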
Jinhao Duan, Fei Kong, Hao Cheng, James Diffenderfer, Bhavya Kailkhura, Lichao Sun, Xiaofeng Zhu, Xiaoshuang Shi, Kaidi Xu
2025-03-13
CV
AI
CL
Object Hallucination (OH) has been acknowledged as one of the major
trustworthiness challenges in Large Vision-Language Models (LVLMs). Recent
advancements in Large Language Models (LLMs) indicate that internal states,
such as hidden states, encode the "overall truthfulness" of generated
responses. However, it remains under-explored how internal states in LVLMs
function and whether they could serve as "per-token" hallucination indicators,
which is essential for mitigating OH. In this paper, we first conduct an
in-depth exploration of LVLM internal states in relation to OH issues and
discover that (1) LVLM internal states are high-specificity per-token
indicators of hallucination behaviors. Moreover, (2) different LVLMs encode
universal patterns of hallucinations in common latent subspaces, indicating
that there exist "generic truthful directions" shared by various LVLMs. Based
on these discoveries, we propose Truthful-Guided Pre-Intervention (TruthPrInt)
that first learns the truthful direction of LVLM decoding and then applies
truthful-guided inference-time intervention during LVLM decoding. We further
propose ComnHallu to enhance both cross-LVLM and cross-data hallucination
detection transferability by constructing and aligning hallucination latent
subspaces. We evaluate TruthPrInt in extensive experimental settings, including
in-domain and out-of-domain scenarios, over popular LVLMs and OH benchmarks.
Experimental results indicate that TruthPrInt significantly outperforms
state-of-the-art methods. Code will be available at
https://github.com/jinhaoduan/TruthPrInt.
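A generic sketch of a "truthful direction" style intervention on hidden states,
written as a difference-of-means steering vector applied at inference time;
this illustrates the flavor of the approach under assumptions and is not the
paper's exact TruthPrInt procedure.

```python
import torch

def truthful_direction(h_truthful, h_hallucinated):
    """Difference-of-means direction from [N, D] hidden states of tokens labeled
    truthful vs. hallucinated; returned unit-normalized."""
    d = h_truthful.mean(0) - h_hallucinated.mean(0)
    return d / d.norm()

def intervene(hidden, direction, strength=1.0):
    """Nudge decoder hidden states [B, T, D] along the truthful direction."""
    return hidden + strength * direction
```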
Antonia van Betteray, Matthias Rottmann, Karsten Kahl
2025-03-13
LG
CV
The structural analogies between ResNets and Multigrid (MG) methods, such as
common building blocks like convolutions and pooling, were already pointed out
by He et al. in 2016. Multigrid methods are used in the context of scientific
computing for solving large sparse linear systems arising from partial
differential equations. MG methods particularly rely on two main concepts:
smoothing and residual restriction / coarsening. Exploiting these analogies, He
and Xu developed the MgNet framework, which integrates MG schemes into the
design of ResNets. In this work, we introduce a novel neural network building
block inspired by polynomial smoothers from MG theory. Our polynomial block
from an MG perspective naturally extends the MgNet framework to Poly-MgNet and
at the same time reduces the number of weights in MgNet. We present a
comprehensive study of our polynomial block, analyzing the choice of initial
coefficients, the polynomial degree, and the placement of activation functions
and batch normalization layers. Our results demonstrate that constructing
(quadratic) polynomial building blocks based on real and imaginary polynomial
roots enhances Poly-MgNet's capacity in terms of accuracy. Furthermore, our
approach achieves an improved trade-off of model accuracy and number of weights
compared to ResNet as well as compared to specific configurations of MgNet.
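For background, a degree-$k$ polynomial smoother in multigrid updates an
iterate by applying a polynomial of the system matrix to the residual (standard
MG notation, not this paper's specific block design):
$x \leftarrow x + p_k(A)\,(b - A x), \qquad p_k(A) = \sum_{j=0}^{k} c_j A^{j}$
The quadratic building blocks studied here correspond to choices of $p_2$
determined by real or complex-conjugate roots.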
Justin Sahs, Ryan Pyle, Fabio Anselmi, Ankit Patel
2025-03-13
LG
AI
Despite classical statistical theory predicting severe overfitting, modern
massively overparameterized neural networks still generalize well. This
unexpected property is attributed to the network's so-called implicit bias,
which describes its propensity to converge to solutions that generalize
effectively, among the many possible solutions that correctly label the training data.
The aim of our research is to explore this bias from a new perspective,
focusing on how non-linear activation functions contribute to shaping it.
First, we introduce a reparameterization which removes a continuous weight
rescaling symmetry. Second, in the kernel regime, we leverage this
reparameterization to generalize recent findings that relate shallow neural
networks to the Radon transform, deriving an explicit formula for the implicit
bias induced by a broad class of activation functions. Specifically, by
utilizing the connection between the Radon transform and the Fourier transform,
we interpret the kernel regime's inductive bias as minimizing a spectral
seminorm that penalizes high-frequency components, in a manner dependent on the
activation function. Finally, in the adaptive regime, we demonstrate the
existence of local dynamical attractors that facilitate the formation of
clusters of hyperplanes where the input to a neuron's activation function is
zero, yielding alignment between many neurons' response functions. We confirm
these theoretical results with simulations. Altogether, our work provides a
deeper understanding of the mechanisms underlying the generalization
capabilities of overparameterized neural networks and their relation to the
implicit bias, offering potential pathways for designing more efficient and
robust models.
Yiming Jia, Jiachen Li, Xiang Yue, Bo Li, Ping Nie, Kai Zou, Wenhu Chen
2025-03-13
CV
AI
CL
Vision-Language Models have made significant progress on many
perception-focused tasks; however, their progress on reasoning-focused tasks
seems to be limited due to the lack of high-quality and diverse training data.
In this work, we aim to address the scarcity issue of reasoning-focused
multimodal datasets. We propose VisualWebInstruct, a novel approach that
leverages search engines to create a diverse and high-quality dataset spanning
multiple disciplines like math, physics, finance, chemistry, etc. Starting with
30,000 meticulously selected seed images, we employ Google Image search to
identify websites containing similar images. We collect and process the HTML
pages from over 700K unique URLs. Through a pipeline of content extraction,
filtering and synthesis, we build a dataset of approximately 900K
question-answer pairs, with 40% being visual QA pairs and the rest as text QA
pairs. Models fine-tuned on VisualWebInstruct demonstrate significant
performance gains: (1) training from Llava-OV-mid shows 10-20% absolute point
gains across benchmarks, and (2) training from MAmmoTH-VL shows a 5% absolute
gain. Our best model, MAmmoTH-VL2, achieves state-of-the-art performance within the 10B
parameter class on MMMU-Pro-std (40.7%), MathVerse (42.6%), and DynaMath
(55.7%). These remarkable results highlight the effectiveness of our dataset in
enhancing VLMs' reasoning capabilities for complex multimodal tasks.
Ishaq Aden-Ali
2025-03-13
PR
LG
ST
TH
We prove an upper bound on the expected $\ell_p$ injective norm of sums of
subgaussian random tensors. Our proof is simple and does not rely on any
explicit geometric or chaining arguments. Instead, it follows from a simple
application of the PAC-Bayesian lemma, a tool that has proven effective at
controlling the suprema of certain "smooth" empirical processes in recent
years. Our bound strictly improves a very recent result of Bandeira, Gopi,
Jiang, Lucca, and Rothvoss. In the Euclidean case ($p=2$), our bound sharpens a
result of Latała that was central to proving his estimates on the moments of
Gaussian chaoses. As a consequence, we obtain an elementary proof of this
fundamental result.
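For reference, viewing an order-$k$ tensor $T$ as a multilinear form, its
injective norm is
$\|T\|_{\mathrm{inj}} = \sup\{\, |\langle T,\ x^{(1)} \otimes \cdots \otimes x^{(k)} \rangle| \ :\ \|x^{(i)}\| \le 1 \,\}$
where the unit balls are taken in $\ell_p$ or its dual, depending on convention;
for $p = 2$ and $k = 2$ this reduces to the matrix spectral norm. (Background
definition, not taken from the abstract.)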
Nina Vesseron, Louis Béthune, Marco Cuturi
2025-03-13
ML
LG
A common approach to generative modeling is to split model-fitting into two
blocks: define first how to sample noise (e.g. Gaussian) and choose next what
to do with it (e.g. using a single map or flows). We explore in this work an
alternative route that ties sampling and mapping. We find inspiration in moment
measures, a result that states that for any measure $\rho$ supported on a
compact convex set of $\mathbb{R}^d$, there exists a unique convex potential
$u$ such that $\rho=\nabla u\,\sharp\,e^{-u}$. While this does seem to
effectively tie sampling (from the log-concave distribution $e^{-u}$) and action
(pushing particles through $\nabla u$), we observe on simple examples (e.g.,
Gaussians or 1D distributions) that this choice is ill-suited for practical
tasks. We study an alternative factorization, where $\rho$ is factorized as
$\nabla w^*\,\sharp\,e^{-w}$, where $w^*$ is the convex conjugate of $w$. We
call this approach conjugate moment measures, and show far more intuitive
results on these examples. Because $\nabla w^*$ is the Monge map between the
log-concave distribution $e^{-w}$ and $\rho$, we rely on optimal transport
solvers to propose an algorithm to recover $w$ from samples of $\rho$, and
parameterize $w$ as an input-convex neural network.