Computer Vision and Pattern Recognition 5
♻ ☆ MoME: Estimating Psychological Traits from Gait with Multi-Stage Mixture of Movement Experts
Gait encodes rich biometric and behavioural information, yet leveraging the
manner of walking to infer psychological traits remains a challenging and
underexplored problem. We introduce a hierarchical Multi-Stage Mixture of
Movement Experts (MoME) architecture for multi-task prediction of psychological
attributes from gait sequences represented as 2D poses. MoME processes the
walking cycle in four stages of movement complexity, employing lightweight
expert models to extract spatio-temporal features and task-specific gating
modules to adaptively weight experts across traits and stages. Evaluated on the
PsyMo benchmark covering 17 psychological traits, our method outperforms
state-of-the-art gait analysis models, achieving a 37.47% weighted F1 score at
the run level and 44.6% at the subject level. Our experiments show that
integrating auxiliary tasks such as identity recognition, gender prediction,
and BMI estimation further improves psychological trait estimation. Our
findings demonstrate the viability of multi-task gait-based learning for
psychological trait estimation and provide a foundation for future research on
movement-informed psychological inference.
comment: 4 Figures, 4 Tables
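As a rough illustration of the task-specific gating idea described in the abstract above, the minimal PyTorch sketch below mixes a few lightweight experts with a per-task softmax gate. All module names, shapes, and hyperparameters (MovementExpert, TaskGatedMixture, hidden sizes) are assumptions for illustration, not the authors' implementation.

# Illustrative sketch (not the MoME code): task-specific gating over lightweight
# movement experts, in the spirit of the abstract above.
import torch
import torch.nn as nn

class MovementExpert(nn.Module):
    """Hypothetical lightweight spatio-temporal expert over 2D pose sequences."""
    def __init__(self, in_dim, hidden_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU(),
                                 nn.Linear(hidden_dim, hidden_dim))
    def forward(self, x):                  # x: (batch, frames, in_dim)
        return self.net(x).mean(dim=1)     # temporal average pooling -> (batch, hidden_dim)

class TaskGatedMixture(nn.Module):
    """Per-task softmax gate that adaptively weights the experts' outputs."""
    def __init__(self, in_dim, hidden_dim, n_experts, n_tasks):
        super().__init__()
        self.experts = nn.ModuleList([MovementExpert(in_dim, hidden_dim) for _ in range(n_experts)])
        self.gates = nn.ModuleList([nn.Linear(in_dim, n_experts) for _ in range(n_tasks)])
        self.heads = nn.ModuleList([nn.Linear(hidden_dim, 1) for _ in range(n_tasks)])
    def forward(self, x):
        feats = torch.stack([e(x) for e in self.experts], dim=1)    # (batch, n_experts, hidden)
        summary = x.mean(dim=1)                                     # pooled gate input
        outputs = []
        for gate, head in zip(self.gates, self.heads):
            w = torch.softmax(gate(summary), dim=-1).unsqueeze(-1)  # expert weights per task
            outputs.append(head((w * feats).sum(dim=1)))            # weighted expert fusion
        return torch.cat(outputs, dim=-1)                           # one score per task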
♻ ☆ ExGS: Extreme 3D Gaussian Compression with Diffusion Priors
Jiaqi Chen, Xinhao Ji, Yuanyuan Gao, Hao Li, Yuning Gong, Yifei Liu, Dan Xu, Zhihang Zhong, Dingwen Zhang, Xiao Sun
Neural scene representations, such as 3D Gaussian Splatting (3DGS), have
enabled high-quality neural rendering; however, their large storage and
transmission costs hinder deployment in resource-constrained environments.
Existing compression methods either rely on costly optimization, which is slow
and scene-specific, or adopt training-free pruning and quantization, which
degrade rendering quality under high compression ratios. In contrast, recent
data-driven approaches provide a promising direction to overcome this
trade-off, enabling efficient compression while preserving high rendering
quality. We introduce ExGS, a novel feed-forward framework that unifies
Universal Gaussian Compression (UGC) with GaussPainter for Extreme 3DGS
compression. UGC performs re-optimization-free pruning to aggressively reduce
Gaussian primitives while retaining only essential information, whereas
GaussPainter leverages powerful diffusion priors with mask-guided refinement to
restore high-quality renderings from heavily pruned Gaussian scenes. Unlike
conventional inpainting, GaussPainter not only fills in missing regions but
also enhances visible pixels, yielding substantial improvements in degraded
renderings. To ensure practicality, it adopts a lightweight VAE and a one-step
diffusion design, enabling real-time restoration. Our framework can even
achieve over 100X compression (reducing a typical 354.77 MB model to about 3.31
MB) while preserving fidelity and significantly improving image quality under
challenging conditions. These results highlight the central role of diffusion
priors in bridging the gap between extreme compression and high-quality neural
rendering. Our code repository will be released at:
https://github.com/chenttt2001/ExGS
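The sketch below illustrates the kind of re-optimization-free pruning the abstract attributes to UGC: score Gaussian primitives with a simple heuristic and keep only the most important ones. The importance score, the keep_ratio, and the array layout are assumptions for illustration, not the ExGS implementation.

# Illustrative sketch (not the ExGS code): prune Gaussian primitives by a heuristic
# importance score, keeping a small fraction before a restoration stage.
import numpy as np

def prune_gaussians(opacity, scales, features, keep_ratio=0.01):
    """opacity: (N,), scales: (N, 3), features: (N, D); returns the kept subset."""
    # Heuristic: opaque, large Gaussians tend to contribute more to renderings.
    importance = opacity * scales.prod(axis=1)
    k = max(1, int(keep_ratio * len(importance)))
    keep = np.argsort(importance)[-k:]           # indices of the k most important primitives
    return opacity[keep], scales[keep], features[keep]

# Example: pruning to ~1% of primitives, the regime where a diffusion-based
# restorer such as GaussPainter would be needed to repair degraded renderings.
op = np.random.rand(100_000)
sc = np.random.rand(100_000, 3) * 0.05
ft = np.random.randn(100_000, 48).astype(np.float32)
op_p, sc_p, ft_p = prune_gaussians(op, sc, ft, keep_ratio=0.01)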
♻ ☆ Video-in-the-Loop: Span-Grounded Long Video QA with Interleaved Reasoning
Chendong Wang, Donglin Bai, Yifan Yang, Xiao Jin, Anlan Zhang, Rui Wang, Shiqi Jiang, Yuqing Yang, Hao Wu, Qi Dai, Chong Luo, Ting Cao, Lili Qiu, Suman Banerjee
We present Video-in-the-Loop (ViTL), a two-stage long-video QA
framework that preserves a fixed token budget by first localizing
question-relevant interval(s) with a low-fps skim and then answering via
span-aware reallocation of visual tokens at a higher effective frame rate,
emitting an interleaved output with both the spans and the final option for
direct attribution. We also introduce \dataname{}, which converts
description-based event graphs into span-grounded multiple-choice QA by pairing
each question with ground-truth time span(s) and related reasoning. ViTL is
trained end-to-end with an interleaved group-relative objective that couples
temporal IoU for localization with answer correctness, allowing credit to flow
from answers back to spans without increasing compute. Under fixed token
budgets, ViTL attains gains of up to 8.6% while using 50% fewer input frames on
long-video QA and temporal grounding (e.g., Charades-STA, ActivityNet-Captions),
and ablations show that span-aware token reallocation consistently surpasses uniform
sampling. Together, \dataname{} and ViTL provide an interpretable,
compute-efficient recipe for scalable long-video QA.
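As a rough sketch of the span-aware reallocation step described above, the snippet below redistributes a fixed frame budget toward localized spans while keeping a sparse pass over the rest of the video. The low-fps skim/localizer is assumed to exist; the function name, in_span_share, and budget values are illustrative assumptions rather than the authors' method.

# Illustrative sketch (not the ViTL code): reallocate a fixed frame budget toward
# the spans returned by a skim stage, with residual uniform coverage elsewhere.
def reallocate_frames(video_len_s, spans, total_frames=64, in_span_share=0.8):
    """spans: list of (start_s, end_s) intervals from a low-fps localization pass."""
    span_dur = sum(e - s for s, e in spans)
    in_budget = int(total_frames * in_span_share) if spans else 0
    out_budget = total_frames - in_budget
    timestamps = []
    # Dense, uniform sampling inside the localized spans.
    for s, e in spans:
        n = max(1, round(in_budget * (e - s) / max(span_dur, 1e-6)))
        timestamps += [s + (e - s) * (i + 0.5) / n for i in range(n)]
    # Sparse uniform sampling over the full video for residual context.
    timestamps += [video_len_s * (i + 0.5) / out_budget for i in range(out_budget)]
    return sorted(timestamps)[:total_frames]

# e.g. a 600 s video with one relevant span found by the skim stage:
ts = reallocate_frames(600.0, spans=[(120.0, 150.0)], total_frames=64)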
♻ ☆ Keep It on a Leash: Controllable Pseudo-label Generation Towards Realistic Long-Tailed Semi-Supervised Learning NeurIPS 2025
Current long-tailed semi-supervised learning methods assume that labeled data
exhibit a long-tailed distribution, and unlabeled data adhere to a typical
predefined distribution (i.e., long-tailed, uniform, or inverse long-tailed).
However, the distribution of the unlabeled data is generally unknown and may
follow an arbitrary distribution. To tackle this challenge, we propose a
Controllable Pseudo-label Generation (CPG) framework, which expands the labeled
dataset with progressively identified reliable pseudo-labels from the unlabeled
dataset and trains the model on the updated labeled dataset, whose distribution
is known, so that training is unaffected by the distribution of the unlabeled data.
Specifically, CPG operates through a controllable self-reinforcing optimization
cycle: (i) at each training step, our dynamic controllable filtering mechanism
selectively incorporates reliable pseudo-labels from the unlabeled dataset into
the labeled dataset, ensuring that the updated labeled dataset follows a known
distribution; (ii) we then construct a Bayes-optimal classifier using logit
adjustment based on the updated labeled data distribution; (iii) this improved
classifier subsequently helps identify more reliable pseudo-labels in the next
training step. We further theoretically prove that this optimization cycle can
significantly reduce the generalization error under some conditions.
Additionally, we propose a class-aware adaptive augmentation module to further
improve the representation of minority classes, and an auxiliary branch to
maximize data utilization by leveraging all labeled and unlabeled samples.
Comprehensive evaluations on various commonly used benchmark datasets show that
CPG achieves consistent improvements, surpassing state-of-the-art methods by up
to 15.97% in accuracy. The code is available at
https://github.com/yaxinhou/CPG.
comment: The paper is accepted by NeurIPS 2025
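Step (ii) above relies on standard logit adjustment using the class priors of the updated labeled set. The sketch below shows that recipe in isolation; the value of tau and the toy data are assumptions for illustration, not the CPG code.

# Illustrative sketch (not the CPG implementation): a logit-adjusted classifier
# built from the known class frequencies of the updated labeled dataset.
import numpy as np

def class_priors(labels, num_classes):
    counts = np.bincount(labels, minlength=num_classes).astype(np.float64)
    return counts / counts.sum()

def logit_adjusted_predict(logits, priors, tau=1.0):
    """Subtract tau * log(prior) so rare classes are not suppressed at prediction time."""
    adjusted = logits - tau * np.log(priors + 1e-12)
    return adjusted.argmax(axis=-1)

# Example with a long-tailed updated labeled set and raw model logits:
labels = np.array([0] * 900 + [1] * 90 + [2] * 10)   # known (long-tailed) distribution
priors = class_priors(labels, num_classes=3)
logits = np.random.randn(5, 3)
preds = logit_adjusted_predict(logits, priors)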
♻ ☆ Self-Evolving Vision-Language Models for Image Quality Assessment via Voting and Ranking
Wen Wen, Tianwu Zhi, Kanglong Fan, Yang Li, Xinge Peng, Yabin Zhang, Yiting Liao, Junlin Li, Li Zhang
Improving vision-language models (VLMs) in the post-training stage typically
relies on supervised fine-tuning or reinforcement learning, methods that
necessitate costly, human-annotated data. While self-supervised techniques such
as self-consistency have proven effective for enhancing reasoning capabilities,
their application to perceptual domains such as image quality assessment (IQA)
remains largely unexplored. In this work, we introduce EvoQuality, a novel
framework that enables a VLM to autonomously refine its quality perception
capabilities without any ground-truth labels. EvoQuality adapts the principle
of self-consistency to the ranking-based nature of IQA. It generates
pseudo-labels by performing pairwise majority voting on the VLM's own outputs
to establish a consensus on relative quality. These pseudo-rankings are then
formulated into a fidelity reward that guides the model's iterative evolution
through group relative policy optimization (GRPO). By iteratively leveraging
its own predictions, EvoQuality progressively refines the VLM's perceptual
capability. Extensive experiments show that EvoQuality boosts the base VLM's
zero-shot performance by 31.8% on PLCC across diverse IQA benchmarks.
Remarkably, despite being entirely self-supervised, EvoQuality achieves
performance that is competitive with, or even surpasses, state-of-the-art
supervised VLM-based IQA models, outperforming these models on 5 out of 7 IQA
benchmarks.
comment: Technical Report
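The pseudo-labeling step described above can be illustrated with a small sketch of pairwise majority voting: repeated stochastic comparisons of the same image pair are aggregated into a consensus label. Here vlm_compare is a hypothetical stand-in for querying the VLM, and the vote count and confidence measure are assumptions, not the EvoQuality implementation.

# Illustrative sketch (not the EvoQuality code): consensus pseudo-labels from
# repeated pairwise quality judgements of the same VLM.
from collections import Counter
import random

def vlm_compare(image_a, image_b):
    # Placeholder for a real VLM call answering "which image has higher quality?".
    return random.choice(["A", "B"])

def pseudo_label(image_a, image_b, n_votes=9):
    """Majority vote over repeated stochastic comparisons of one image pair."""
    votes = Counter(vlm_compare(image_a, image_b) for _ in range(n_votes))
    winner, count = votes.most_common(1)[0]
    confidence = count / n_votes          # agreement ratio; low-consensus pairs could be dropped
    return winner, confidence

# Pairs with a clear consensus can then serve as ranking targets in a fidelity
# reward for GRPO-style policy optimization, as described in the abstract.
label, conf = pseudo_label("img_001.png", "img_002.png")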