TPAMI: Accelerating Zero-Shot NAS With Feature Map-Based Proxy and Operation Scoring Function

Author: 蒋唐宇

Abstract

Neural Architecture Search (NAS) has been extensively studied for its ability to automate architecture engineering. Existing NAS methods rely heavily on gradients and data labels, and thus either incur immense computational costs or suffer from discretization discrepancy caused by the supernet structure. Moreover, most of them are limited in generating diverse architectures. To alleviate these issues, in this paper we propose a novel zero-cost proxy called MeCo, based on the Pearson correlation matrix of feature maps. Unlike previous work, computing MeCo and its variant MeCo-opt requires only a single random input and one forward pass. Based on the proposed zero-cost proxy, we further craft a new zero-shot NAS scheme called FLASH, which harnesses a new proxy-based operation scoring function and a greedy heuristic. Compared to existing methods, FLASH is highly efficient and can construct diverse model architectures instead of repeated cells. We design comprehensive experiments and extensively evaluate our designs on multiple benchmarks and datasets. The experimental results show that our method is one to six orders of magnitude more efficient than the state-of-the-art baselines while achieving the highest model accuracy.
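As a rough illustration of a feature-map-based zero-cost proxy, the sketch below scores one layer from a single random input and one forward pass. Using the minimum eigenvalue of the channel-wise Pearson correlation matrix is an illustrative assumption, not necessarily MeCo's exact formulation:

```python
import numpy as np

def meco_score(feature_map: np.ndarray) -> float:
    """Pearson-correlation-based proxy for one layer's feature map.

    feature_map: (C, H, W) activations from a single forward pass on
    one random input. The score here (minimum eigenvalue of the
    channel-wise Pearson correlation matrix) is an illustrative choice.
    """
    c = feature_map.shape[0]
    flat = feature_map.reshape(c, -1)        # (C, H*W) per-channel samples
    corr = np.corrcoef(flat)                 # (C, C) Pearson correlation matrix
    corr = np.nan_to_num(corr)               # guard against constant channels
    return float(np.linalg.eigvalsh(corr).min())

rng = np.random.default_rng(0)
fmap = rng.standard_normal((8, 4, 4))        # stand-in for one layer's feature map
score = meco_score(fmap)
```

In practice the per-layer scores would be combined across the network to rank candidate architectures without any training.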

AAAI2026: Prune&Comp: Free Lunch for Layer-Pruned LLMs via Iterative Pruning with Magnitude Compensation

Author: 陈新睿

Abstract

Layer pruning has emerged as a promising technique for compressing large language models (LLMs) while achieving acceleration proportional to the pruning ratio. In this work, we identify that removing any layer induces a significant magnitude gap in hidden states, resulting in substantial performance degradation. To address this issue, we propose Prune&Comp, a novel plug-and-play layer pruning scheme that leverages magnitude compensation to mitigate such gaps in a training-free manner. Specifically, we first estimate the magnitude gap caused by layer removal and then eliminate this gap by rescaling the remaining weights offline, with zero runtime overhead incurred. We further demonstrate the advantages of Prune&Comp through an iterative pruning strategy. When integrated with an iterative prune-and-compensate loop, Prune&Comp consistently enhances existing layer pruning metrics. For instance, when 5 layers of LLaMA-3-8B are pruned using the prevalent block influence metric, Prune&Comp nearly halves the perplexity and retains 93.19% of the original model’s question-answering performance, outperforming the baseline by 4.01%.
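A minimal sketch of the magnitude-compensation idea, under the simplifying assumption that the gap is a single scalar estimated from average hidden-state L2 norms and folded into the next layer's weights offline (the paper's actual estimator may differ):

```python
import numpy as np

def compensate(w_next: np.ndarray, h_kept: np.ndarray, h_removed: np.ndarray) -> np.ndarray:
    """Rescale the weights following a pruned layer so the hidden-state
    magnitude they see matches what they were trained on.

    h_removed: hidden states the next layer used to receive (output of the
    removed layer); h_kept: hidden states it now receives. The scalar-gap
    estimate from mean L2 norms is a simplified stand-in, applied offline,
    so no runtime overhead is incurred.
    """
    gap = np.linalg.norm(h_removed, axis=-1).mean() / np.linalg.norm(h_kept, axis=-1).mean()
    return w_next * gap                      # fold the magnitude gap into the weights

rng = np.random.default_rng(1)
h_removed = rng.standard_normal((32, 16)) * 3.0   # larger-magnitude states before pruning
h_kept = rng.standard_normal((32, 16))            # states arriving after layer removal
w = rng.standard_normal((16, 16))
w_patched = compensate(w, h_kept, h_removed)
```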

AAAI2026: M3Time: LLM-Enhanced Multi-Modal, Multi-Scale, and Multi-Frequency Multivariate Time Series Forecasting

Author: 贾书凝

Abstract

Multivariate Time Series Forecasting (MTSF) aims to capture the dependencies among multiple variables and their temporal dynamics to predict future values. In recent years, Large Language Models (LLMs) have set a new paradigm for MTSF, incorporating external knowledge into the modeling process through textual prompts. However, we observe that current LLM-based methods fail to exploit these priors due to their coarse-grained representation of time series data, which hinders effective alignment of the two modalities. To address this, we propose M3Time, a multi-modal, multi-scale, and multi-frequency framework for multivariate time series forecasting. It enhances the quality of time series representations and facilitates the integration of LLM semantic priors with fine-grained temporal features. Additionally, M3Time further improves training stability and model robustness with an adaptive mixed loss function, which dynamically balances L1 and L2 error terms. Experimental results on seven real-world public datasets show that M3Time consistently outperforms state-of-the-art methods, underscoring its effectiveness.
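The mixed loss can be sketched as a convex combination of L1 and L2 error terms; here the balancing weight `alpha` is a fixed hypothetical parameter, whereas M3Time adapts it dynamically during training:

```python
import numpy as np

def mixed_loss(pred: np.ndarray, target: np.ndarray, alpha: float) -> float:
    """Convex mix of mean absolute error (L1) and mean squared error (L2).

    alpha in [0, 1] balances the two terms; an adaptive scheme would
    adjust it over the course of training.
    """
    err = pred - target
    l1 = np.abs(err).mean()
    l2 = (err ** 2).mean()
    return float(alpha * l1 + (1.0 - alpha) * l2)

pred = np.array([1.0, 2.0])
target = np.zeros(2)
loss = mixed_loss(pred, target, alpha=0.5)   # halfway between L1 and L2
```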

TPAMI: SelaVPR++: Towards Seamless Adaptation of Foundation Models for Efficient Place Recognition

Author: 卢锋

Abstract

Recent studies show that visual place recognition (VPR) methods using pre-trained visual foundation models can achieve promising performance. In our previous work, we proposed a novel method to realize seamless adaptation of foundation models to VPR (SelaVPR). Through a parameter-efficient adaptation approach, this method produces both global and local features that focus on discriminative landmarks to recognize places for two-stage VPR. Although SelaVPR has achieved competitive results, we argue that the previous adaptation is inefficient in training time and GPU memory usage, and the re-ranking paradigm is also costly in retrieval latency and storage usage. In pursuit of higher efficiency and better performance, we propose SelaVPR++, an extension of SelaVPR. Concretely, we first design a parameter-, time-, and memory-efficient adaptation method that uses lightweight multi-scale convolution (MultiConv) adapters to refine intermediate features from the frozen foundation backbone. This adaptation method does not back-propagate gradients through the backbone during training, and the MultiConv adapter facilitates feature interactions along the spatial axes and introduces proper local priors, thus achieving higher efficiency and better performance. Moreover, we propose an innovative re-ranking paradigm for more efficient VPR. Instead of relying on local features for re-ranking, which incurs huge overhead in latency and storage, we employ compact binary features for initial retrieval and robust floating-point (global) features for re-ranking. To obtain such binary features, we propose a similarity-constrained deep hashing method, which can be easily integrated into the VPR pipeline. Finally, we improve our training strategy and unify the training protocol of several common training datasets to merge them for better training of VPR models. Extensive experiments show that SelaVPR++ is highly efficient in training time, GPU memory usage, and retrieval latency (6000× faster than TransVPR), and outperforms the state-of-the-art methods by a large margin (ranking 1st on the MSLS challenge leaderboard). Code and models will be released (and merged with SelaVPR) at https://github.com/Lu-Feng/SelaVPR.

NeurIPS2025: Towards Implicit Aggregation: Robust Image Representation for Place Recognition in the Transformer Era

Author: 卢锋

Abstract

Visual place recognition (VPR) is typically regarded as a specific image retrieval task, whose core lies in representing images as global descriptors. Over the past decade, dominant VPR methods (e.g., NetVLAD) have followed a paradigm that first extracts the patch features/tokens of the input image using a backbone, and then aggregates these patch features into a global descriptor via an aggregator. This backbone-plus-aggregator paradigm has achieved overwhelming dominance in the CNN era and remains extensively used in transformer-based models. In this paper, however, we argue that a dedicated aggregator is not necessary in the transformer era; that is, we can obtain robust global descriptors with the backbone alone. Specifically, we introduce some learnable aggregation tokens, which are prepended to the patch tokens before a particular transformer block. All these tokens are jointly processed and interact globally via the intrinsic self-attention mechanism, implicitly aggregating useful information within the patch tokens into the aggregation tokens. Finally, we take only these aggregation tokens from the last output tokens and concatenate them as the global representation. Although implicit aggregation can provide robust global descriptors in an extremely simple manner, where and how to insert the additional tokens, as well as how to initialize them, remain open issues worthy of further exploration. To this end, we also propose an optimal token insertion strategy and a token initialization method derived from empirical studies. Experiments show that our method outperforms state-of-the-art methods on several VPR datasets with higher efficiency and ranks 1st on the MSLS challenge leaderboard.
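A toy sketch of implicit aggregation, with one identity-projection self-attention step standing in for the transformer block(s): aggregation tokens are prepended, all tokens interact, and only the aggregation tokens are kept and concatenated as the descriptor:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def implicit_aggregate(patch_tokens: np.ndarray, agg_tokens: np.ndarray) -> np.ndarray:
    """Prepend aggregation tokens, let all tokens interact via one
    self-attention step (identity Q/K/V projections for simplicity),
    then keep only the aggregation tokens as the global descriptor."""
    tokens = np.concatenate([agg_tokens, patch_tokens], axis=0)   # (m+n, d)
    d = tokens.shape[1]
    attn = softmax(tokens @ tokens.T / np.sqrt(d))                # global interaction
    out = attn @ tokens
    m = agg_tokens.shape[0]
    return out[:m].reshape(-1)                                    # concatenate m tokens

rng = np.random.default_rng(2)
patches = rng.standard_normal((16, 8))   # n=16 patch tokens, dim 8
agg = rng.standard_normal((4, 8))        # m=4 learnable aggregation tokens
descriptor = implicit_aggregate(patches, agg)
```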

NeurIPS2025: NaDRO: Leveraging Dual-Reward Strategies for LLMs Training on Noisy Data

Author: 钱浩龙

Abstract

Group Relative Policy Optimization (GRPO) fine-tuning has been empirically shown to significantly enhance the reasoning abilities of language models. However, it often relies on large-scale, high-quality labeled data, which is typically difficult to obtain. To address this challenge, we introduce Noise-Aware Dual-Reward Optimization (NaDRO), which effectively enhances LLM training in environments where data is noisy or imperfect. NaDRO operates through two key components: (1) a Preference-based Outcome Reward (POR), which extracts reliable preference signals from noisy data, guiding LLMs towards more effective decisions instead of relying on specific noisy scores; and (2) a Context Perception Reward (CPR) mechanism, which ensures that LLMs conduct the necessary qualitative assessment of the current problem state, rewarding accurate judgments to foster better cognitive understanding before decision-making. In the context of combinatorial optimization problems, where dynamically selecting heuristic algorithms is challenging due to large problem scales and the difficulty of obtaining accurate decision data, we designed experiments to test our approach. Our results indicate that the fine-tuned Qwen 7B and Llama 3-8B models outperform mainstream large language models (LLMs) on this task.
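A toy sketch of the dual-reward idea, with illustrative binary rewards and an assumed equal weighting: POR scores the decision against the preference implied by the noisy candidate scores rather than the raw scores themselves, and CPR scores the qualitative state judgment:

```python
def nadro_reward(noisy_score_a: float, noisy_score_b: float,
                 chose_a: bool, judged_state: str, true_state: str,
                 w: float = 0.5) -> float:
    """Toy dual reward. POR: did the model pick the candidate the noisy
    scores prefer (a relative signal, more robust than the scores)?
    CPR: did the model judge the problem state correctly? The binary
    form and equal weight w are illustrative assumptions."""
    prefers_a = noisy_score_a >= noisy_score_b
    por = 1.0 if chose_a == prefers_a else 0.0      # preference-based outcome reward
    cpr = 1.0 if judged_state == true_state else 0.0  # context perception reward
    return w * por + (1.0 - w) * cpr

# Example: right choice, right state judgment -> full reward.
r = nadro_reward(0.9, 0.4, chose_a=True, judged_state="hard", true_state="hard")
```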

NeurIPS2025: Towards Robust Uncertainty Calibration for Composed Image Retrieval

Author: 王依凡

Abstract

The interactive task of composed image retrieval aims to retrieve the most relevant images given a bi-modal query consisting of a reference image and a modification sentence. Despite significant efforts to bridge the heterogeneous gap within the bi-modal query and leverage contrastive learning to reduce the disparity between positive and negative triplets, prior methods often fail to ensure reliable matching due to aleatoric and epistemic uncertainty. Specifically, the aleatoric uncertainty stems from underlying semantic correlations within candidate instances and annotation noise, and the epistemic uncertainty is usually caused by overconfidence in dominant semantic categories. In this paper, we propose Robust UNcertainty Calibration (RUNC) to quantify the uncertainty and calibrate the imbalanced semantic distribution. To mitigate semantic ambiguity in the similarity distribution between fusion queries and targets, RUNC maximizes the matching evidence by utilizing a high-order conjugate prior distribution to fit the semantic covariances in candidate samples. With the estimated uncertainty coefficient of each candidate, the target distribution is calibrated to encourage balanced semantic alignment. Additionally, we minimize the ambiguity in the fusion evidence when forming the unified query by incorporating orthogonal constraints on explicit textual embeddings and implicit queries, to reduce the representation redundancy. Extensive experiments and ablation analysis on the benchmark datasets FashionIQ and CIRR verify the robustness of RUNC in predicting reliable retrieval results from a large image gallery.

NeurIPS2025: SCOUT: Teaching Pre-trained Language Models to Enhance Reasoning via Flow Chain-of-Thought

Author: 李广昊

Abstract

Chain of Thought (CoT) prompting improves the reasoning performance of large language models (LLMs) by encouraging step-by-step thinking. However, CoT-based methods depend on intermediate reasoning steps, which limits scalability and generalization. Recent work explores recursive reasoning, where LLMs reuse internal layers across iterations to refine latent representations without explicit CoT supervision. While promising, these approaches often require costly pretraining and lack a principled framework for how reasoning should evolve across iterations. We address this gap by introducing Flow Chain of Thought (Flow CoT), a reasoning paradigm that models recursive inference as a progressive trajectory of latent cognitive states. Flow CoT frames each iteration as a distinct cognitive stage that deepens reasoning across iterations without relying on manual supervision. To realize this, we propose SCOUT (Stepwise Cognitive Optimization Using Teachers), a lightweight fine-tuning framework that enables Flow CoT-style reasoning without the need for pretraining. SCOUT uses progressive distillation to align each iteration with a teacher of appropriate capacity, and a cross-attention-based retrospective module that integrates outputs from previous iterations while preserving the model's original computation flow. Experiments across eight reasoning benchmarks show that SCOUT consistently improves both accuracy and explanation quality, achieving up to 1.8% gains under fine-tuning. Qualitative analyses further reveal that SCOUT enables progressively deeper reasoning across iterations, refining both belief formation and explanation granularity. These results not only validate the effectiveness of SCOUT, but also demonstrate the practical viability of Flow CoT as a scalable framework for enhancing reasoning in LLMs.

NeurIPS2025: A Simple Linear Patch Revives Layer-Pruned Large Language Models

Author: 陈新睿

Abstract

Layer pruning has become a popular technique for compressing large language models (LLMs) due to its simplicity. However, existing layer pruning methods often suffer from significant performance drops. We identify that this degradation stems from the mismatch of activation magnitudes across layers and tokens at the pruning interface. To address this, we propose LinearPatch, a simple plug-and-play technique to revive the layer-pruned LLMs. The proposed method adopts the Hadamard transformation to suppress massive outliers in particular tokens, and channel-wise scaling to align the activation magnitudes. These operations can be fused into a single matrix, which functions as a patch to bridge the pruning interface with negligible inference overhead. LinearPatch retains up to 94.15% performance of the original model when pruning 5 layers of LLaMA-3-8B on the question answering benchmark, surpassing existing state-of-the-art methods by 4%. In addition, the patch matrix can be further optimized with memory-efficient offline knowledge distillation. With only 5K samples, the retained performance of LinearPatch can be further boosted to 95.16% within 30 minutes on a single computing card.
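One plausible reading of "fused into a single matrix" is sketched below: an orthonormal Hadamard rotation, channel-wise scaling, and the inverse rotation composed into one patch matrix (the paper's exact operator may differ):

```python
import numpy as np

def hadamard(n: int) -> np.ndarray:
    """Sylvester construction; n must be a power of two."""
    h = np.array([[1.0]])
    while h.shape[0] < n:
        h = np.block([[h, h], [h, -h]])
    return h

def linear_patch(scale: np.ndarray, n: int) -> np.ndarray:
    """Fuse Hadamard rotation, channel-wise scaling, and the inverse
    rotation into one matrix bridging the pruning interface."""
    hn = hadamard(n) / np.sqrt(n)            # orthonormal Hadamard
    return hn.T @ np.diag(scale) @ hn        # single fused patch matrix

n = 8
rng = np.random.default_rng(3)
scale = rng.uniform(0.5, 1.5, size=n)        # channel-wise magnitude alignment
patch = linear_patch(scale, n)

# Applying the fused patch equals rotate -> scale -> rotate back.
x = rng.standard_normal(n)
hn = hadamard(n) / np.sqrt(n)
fused = x @ patch
stepwise = ((x @ hn.T) * scale) @ hn
```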

NeurIPS2025: Safe-Sora: Safe Text-to-Video Generation via Graphical Watermarking

Author: 苏梓瀚

Abstract

The explosive growth of generative video models has amplified the demand for reliable copyright preservation of AI-generated content. Despite its popularity in image synthesis, invisible generative watermarking remains largely underexplored in video generation. To address this gap, we propose Safe-Sora, the first framework to embed graphical watermarks directly into the video generation process. Motivated by the observation that watermarking performance is closely tied to the visual similarity between the watermark and cover content, we introduce a hierarchical coarse-to-fine adaptive matching mechanism. Specifically, the watermark image is divided into patches, each assigned to the most visually similar video frame, and further localized to the optimal spatial region for seamless embedding. To enable spatiotemporal fusion of watermark patches across video frames, we develop a 3D wavelet transform-enhanced Mamba architecture with a novel spatiotemporal local scanning strategy, effectively modeling long-range dependencies during watermark embedding and retrieval. To the best of our knowledge, this is the first attempt to apply state space models to watermarking, opening new avenues for efficient and robust watermark protection. Extensive experiments demonstrate that Safe-Sora achieves state-of-the-art performance in terms of video quality, watermark fidelity, and robustness, which is largely attributable to our proposed designs. We will release our code upon publication.

ICCV2025: VPR-Cloak: A First Look at Privacy Cloak Against Visual Place Recognition

Author: 董姝婷

Abstract

With the rapid advancement of Visual Place Recognition (VPR) systems, their unauthorized use on social media images enables monitoring of individuals' daily movements, posing serious privacy risks. However, privacy protection for addressing these risks in VPR systems remains an underexplored area. While adversarial perturbations have been widely explored for visual privacy protection, existing methods still fail to simultaneously satisfy the black-box constraint, imperceptibility, and real-time performance required in realistic VPR privacy protection scenarios. In this paper, we present the first look at privacy protection in VPR systems and introduce VPR-Cloak, an efficient privacy-preserving network. We introduce a saliency-aware prior to identify decisive regions for place recognition and propose Saliency-Aware Prior Guided Perturbation Optimization (SAP-PO) to selectively optimize perturbation generation in these areas. To enhance imperceptibility, we further optimize perturbations in the frequency domain, meticulously refining high-frequency components of perturbations while preserving low-frequency structures essential for human perception. Extensive experiments on multiple benchmark datasets and on various black-box VPR models verify that our method outperforms existing SOTA methods. Additionally, our method achieves a 15× speedup in runtime compared to SOTA methods. We also validate the effectiveness of our method on commercial APIs, including Google and Microsoft Bing, demonstrating its practical applicability in real-world scenarios. The code will be publicly available.
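The frequency-domain refinement can be illustrated by band-limiting a perturbation so that only high spatial frequencies survive; the centered square mask and cutoff fraction below are illustrative choices, not the paper's exact scheme:

```python
import numpy as np

def band_limit_perturbation(delta: np.ndarray, low_keep: float = 0.25) -> np.ndarray:
    """Confine a perturbation to high spatial frequencies by zeroing a
    centered low-frequency band (covering `low_keep` of each axis), so
    the low-frequency structure human perception relies on is untouched."""
    f = np.fft.fftshift(np.fft.fft2(delta))
    h, w = delta.shape
    ch, cw = h // 2, w // 2
    rh, rw = int(h * low_keep / 2), int(w * low_keep / 2)
    f[ch - rh:ch + rh, cw - rw:cw + rw] = 0   # remove low-frequency content (incl. DC)
    return np.real(np.fft.ifft2(np.fft.ifftshift(f)))

rng = np.random.default_rng(4)
delta = rng.standard_normal((32, 32))         # raw perturbation
hi = band_limit_perturbation(delta)           # high-frequency-only perturbation
```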

ICCV2025: Text-guided Visual Prompt DINO for Generic Segmentation

Author: 关雨晨

Abstract

Recent advancements in multimodal vision models have highlighted limitations in late-stage feature fusion and suboptimal query selection for hybrid-prompt open-world segmentation, alongside constraints from caption-derived vocabularies. To address these challenges, we propose Prompt-DINO, a text-guided visual Prompt DINO framework featuring three key innovations. First, we introduce an early fusion mechanism that unifies text/visual prompts and backbone features at the initial encoding stage, enabling deeper cross-modal interactions to resolve semantic ambiguities. Second, we design order-aligned query selection for DETR-based architectures, explicitly optimizing the structural alignment between text and visual queries during decoding to enhance semantic-spatial consistency. Third, we develop a generative data engine powered by the Recognize Anything via Prompting (RAP) model, which synthesizes 0.5B diverse training instances through a dual-path cross-verification pipeline, reducing label noise by 80.5% compared to conventional approaches. Extensive experiments demonstrate that Prompt-DINO achieves state-of-the-art performance on open-world detection benchmarks while significantly expanding semantic coverage beyond fixed-vocabulary constraints. Our work establishes a new paradigm for scalable multimodal detection and data generation in open-world scenarios. Data & Code will be made available.

ICCV2025: Perceive, Understand and Restore: Real-World Image Super-Resolution with Autoregressive Multimodal Generative Models

Author: 韦弘杨

Abstract

By leveraging the generative priors from pre-trained text-to-image diffusion models, significant progress has been made in real-world image super-resolution (Real-ISR). However, these methods tend to generate inaccurate and unnatural reconstructions in complex and/or heavily degraded scenes, primarily due to their limited perception and understanding capability of the input low-quality image. To address these limitations, we propose, for the first time to our knowledge, to adapt a pre-trained autoregressive multimodal model, such as Lumina-mGPT, into a robust Real-ISR model, namely PURE, which Perceives and Understands the input low-quality image, then REstores its high-quality counterpart. Specifically, we implement instruction tuning on Lumina-mGPT to perceive the image degradation level and the relationships between previously generated image tokens and the next token, understand the image content by generating image semantic descriptions, and consequently restore the image by generating high-quality image tokens autoregressively with the collected information. In addition, we reveal that the image token entropy reflects the image structure and present an entropy-based Top-k sampling strategy to optimize the local structure of the image during inference. Experimental results demonstrate that PURE preserves image content while generating realistic details, especially in complex scenes with multiple objects, showcasing the potential of autoregressive multimodal generative models for robust Real-ISR. The model and code will be available at https://github.com/nonwhy/PURE.
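A minimal sketch of entropy-based Top-k sampling: the entropy of the next-token distribution sets k, so confident (low-entropy, structured) steps sample from fewer candidates. The linear mapping from normalized entropy to k is an illustrative assumption:

```python
import numpy as np

def entropy_topk(logits: np.ndarray, k_min: int = 1, k_max: int = 16):
    """Choose k from the entropy of the next-token distribution:
    low entropy -> small k (commit to structure), high entropy ->
    large k (allow diversity). Returns (k, candidate token ids)."""
    p = np.exp(logits - logits.max())
    p /= p.sum()
    ent = -(p * np.log(p + 1e-12)).sum()
    ent_norm = ent / np.log(len(p))              # normalize to [0, 1]
    k = int(round(k_min + ent_norm * (k_max - k_min)))
    top = np.argsort(p)[::-1][:k]                # top-k candidate token ids
    return k, top

peaked = np.zeros(100)
peaked[0] = 10.0                                 # confident next-token distribution
flat = np.zeros(100)                             # maximally uncertain distribution
k_peaked, _ = entropy_topk(peaked)
k_flat, _ = entropy_topk(flat)
```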

ICCV2025: UniGlyph: Unified Segmentation-Conditioned Diffusion for Precise Visual Text Synthesis

Author: 王渊睿

Abstract

Text-to-image generation has greatly advanced content creation, yet accurately rendering visual text remains a key challenge due to blurred glyphs, semantic drift, and limited style control. Existing methods often rely on pre-rendered glyph images as conditions, but these struggle to retain original font styles and color cues, necessitating complex multi-branch designs that increase model overhead and reduce flexibility. To address these issues, we propose a segmentation-guided framework that uses pixel-level visual text masks – rich in glyph shape, color, and spatial detail – as unified conditional inputs. Our method introduces two core components: (1) a fine-tuned bilingual segmentation model for precise text mask extraction, and (2) a streamlined diffusion model augmented with adaptive glyph conditioning and a region-specific loss to preserve textual fidelity in both content and style. Our approach achieves state-of-the-art performance on the AnyText benchmark, significantly surpassing prior methods in both Chinese and English settings. To enable more rigorous evaluation, we also introduce two new benchmarks: GlyphMM-benchmark for testing layout and glyph consistency in complex typesetting, and MiniText-benchmark for assessing generation quality in small-scale text regions. Experimental results show that our model outperforms existing methods by a large margin in both scenarios, particularly excelling at small text rendering and complex layout preservation, validating its strong generalization and deployment readiness.

ICCV2025: ChartPoint: Guiding MLLMs with Grounding Reflection for Chart Reasoning

Author: 许正卓

Abstract

Multimodal Large Language Models (MLLMs) have emerged as powerful tools for chart comprehension. However, they heavily rely on extracted content via OCR, which leads to numerical hallucinations when chart textual annotations are sparse. While existing methods focus on scaling instructions, they fail to address the fundamental challenge, i.e., reasoning with visual perception. In this paper, we identify a critical observation: MLLMs exhibit weak grounding in chart elements and proportional relationships, as evidenced by their inability to localize key positions to match their reasoning. To bridge this gap, we propose PointCoT, which integrates reflective interaction into chain-of-thought reasoning in charts. By prompting MLLMs to generate bounding boxes and re-render charts based on location annotations, we establish connections between textual reasoning steps and visual grounding regions. We further introduce an automated pipeline to construct ChartPoint-SFT-62k, a dataset featuring 19.2K high-quality chart samples with step-by-step CoT, bounding boxes, and re-rendered visualizations. Leveraging this data, we develop two instruction-tuned models, ChartPointQ2 and ChartPointQ2.5, which outperform state-of-the-art models across several chart benchmarks, e.g., +5.04% on ChartBench.

ICCV2025: SparseFlex: High-Resolution and Arbitrary-Topology 3D Shape Modeling

Author: 何相龙

Abstract

Creating high-fidelity 3D meshes with arbitrary topology, including open surfaces and complex interiors, remains a significant challenge. Existing implicit field methods often require costly and detail-degrading watertight conversion, while other approaches struggle with high resolutions. This paper introduces SparseFlex, a novel sparse-structured isosurface representation that enables differentiable mesh reconstruction at resolutions up to $1024^{3}$ directly from rendering losses. SparseFlex combines the accuracy of Flexicubes with a sparse voxel structure, focusing computation on surface-adjacent regions and efficiently handling open surfaces. Crucially, we introduce a frustum-aware sectional voxel training strategy that activates only relevant voxels during rendering, dramatically reducing memory consumption and enabling high-resolution training. This also allows, for the first time, the reconstruction of mesh interiors using only rendering supervision. Building upon this, we demonstrate a complete shape modeling pipeline by training a variational autoencoder (VAE) and a rectified flow transformer for high-quality 3D shape generation. Our experiments show state-of-the-art reconstruction accuracy, with a ~82% reduction in Chamfer Distance and a ~88% increase in F-score compared to previous methods, and demonstrate the generation of high-resolution, detailed 3D shapes with arbitrary topology. By enabling high-resolution, differentiable mesh reconstruction and generation with rendering losses, SparseFlex significantly advances the state-of-the-art in 3D shape representation and modeling.

ICME2025: Semantic Alignment and Hard Sample Retraining for Visible-Infrared Person Re-Identification

Author: 倪靖宸,吕科宇

Abstract

Visible-Infrared Person Re-Identification (VI-ReID) seeks to match individuals across different modalities. Recent methods focus on discriminative feature extraction and hard sample learning. However, they often suffer from semantic misalignment due to horizontal partitioning in local feature extraction and overlook global hard samples in training. Moreover, the widely used PK Sampler cannot ensure viewpoint balance and diversity. To overcome these limitations, we propose the Semantic Alignment and Hard Sample Retraining (SAHSR) framework. This framework incorporates a Recurrent Semantic Aggregation (RSA) module that progressively aggregates and aligns regional semantics with the help of Modality Alignment loss. Besides, we propose a Confidence-based Hard Sample Retraining (CHSR) strategy that identifies and retrains hard samples to improve the model’s robustness. Additionally, we introduce the Viewpoint-Balanced (VB) Sampler to guarantee a balanced distribution of viewpoints. Extensive experiments on VI-ReID benchmarks demonstrate the significant performance gains of our approach, showing state-of-the-art performance. Code will be available.

ICME2025: EAV-Mamba: Efficient Audio-Visual Representation Learning for Weakly-Supervised Temporal Action Localization

Author: 张权

Abstract

Weakly supervised temporal action localization aims to learn to locate actions in videos from video-level or point-level labels, avoiding the need for costly frame-level annotations. Unlike previous work that relies solely on visual modality information, we propose incorporating audio information into the weakly supervised temporal action localization task. While audio-visual localization tasks combine audio and visual information for video localization, temporal action localization often deals with action categories that have weak audio cues. To address this, we propose EAV-Mamba, the first audio-visual perception modeling method based on Mamba. Leveraging Mamba’s powerful audio-visual perception capabilities, we develop modules such as Audio-Perceptive Flow Enhancement, Audio-Perceptive RGB Enhancement, and Audio Self-Perceptive Enhancement. Extensive experiments on two publicly available temporal action localization datasets demonstrate that EAV-Mamba achieves efficient audio-visual perception modeling and state-of-the-art performance in weakly supervised temporal action localization tasks.

ICIP2025: CLIP-AE: CLIP-assisted Cross-view Audio-Visual Enhancement for Unsupervised Temporal Action Localization

Author: 夏瑞

Abstract

Temporal Action Localization (TAL) has garnered significant attention in information retrieval. Existing supervised or weakly supervised methods heavily rely on labeled temporal boundaries and action categories, which are labor-intensive and time-consuming to obtain. Consequently, unsupervised temporal action localization (UTAL) has gained popularity. However, current methods face two main challenges: 1) classification pre-trained features overly focus on highly discriminative regions; 2) relying solely on visual modality information makes it difficult to determine contextual boundaries. To address these issues, we propose a CLIP-assisted cross-view audio-visual enhanced UTAL method. Specifically, we introduce vision-language pre-training (VLP) and classification pre-training-based collaborative enhancement to avoid excessive focus on highly discriminative regions; we also incorporate audio perception to provide richer contextual boundary information. Finally, we introduce a self-supervised cross-view learning paradigm to achieve multi-view perceptual enhancement without additional annotations. Extensive experiments on two public datasets demonstrate our model’s superiority over several state-of-the-art competitors.

ICIP2025: Overlooked Factors in Continual Zero-Shot Learning: Inflexible Semantic Prototypes, Simplistic Loss Functions, and SGD Noise

Author: 郝清扬

Abstract

With the continuous evolution of deep learning architectures, computer vision technologies have achieved remarkable breakthroughs in image classification. However, when confronted with the recognition requirements of dynamically emerging unknown categories in open environments, traditional supervised learning paradigms face significant challenges due to their heavy reliance on annotated data. In this context, Zero-Shot Learning (ZSL), as an emerging cross-modal inference paradigm, provides an innovative pathway for addressing unseen category recognition by establishing transferable mapping relationships between visual and semantic modalities. The core challenge of this technology lies in bridging the semantic gap between known and unknown categories to achieve robust cross-domain knowledge transfer. This paper focuses on zero-shot image classification tasks in both single-task and continual-task scenarios, investigating from the perspective of semantic representation limitations. Corresponding semantic enhancement strategies are proposed for each scenario, accompanied by comprehensive experimental analyses that validate the effectiveness of the proposed methods. First, for the single-task scenario, we propose a Visual-Semantic Dual Calibration Network (VSDCN) that addresses inherent biases in visual and semantic spaces through a two-stage calibration process. During training-phase calibration, the visual calibration network enhances the visual space by integrating semantic information, while the semantic calibration network refines semantic prototypes using visual information. During test-phase calibration, classification is performed separately in both spaces, with their weighted average serving as the final prediction, thereby mitigating individual space biases. Second, for the continual-task scenario, we introduce a catastrophic forgetting mitigation method based on semantic information updating. Specifically, a semantic refinement mechanism is proposed to progressively update semantic prototypes across tasks. By treating these prototypes similarly to visual features for generation, the model preserves knowledge of old categories through attribute sharing between new and old classes. Additionally, this paper analyzes the impact of Stochastic Gradient Descent (SGD) noise, a previously overlooked factor in Continual Zero-Shot Learning (CZSL). Our investigation enhances understanding of this noise in CZSL contexts, providing novel perspectives for CZSL research. Finally, extensive evaluations on multiple benchmark datasets demonstrate the superior performance of our proposed methods across various metrics in both scenarios. Ablation studies further reveal the critical role of semantic information in zero-shot image classification tasks. The research outcomes not only validate the effectiveness of our approaches but also offer new insights for designing knowledge transfer systems in open environments.

ICML2025: Enhancing Logits Distillation with Plug&Play Kendall’s $\tau$ Ranking Loss

Author: 关雨晨,程润曦

Abstract

Knowledge distillation typically employs the Kullback-Leibler (KL) divergence to constrain the output of the student model to precisely match the soft labels provided by the teacher model. However, the optimization process of KL divergence is challenging for the student and prone to converging to suboptimal points. Also, we demonstrate that the gradients provided by KL divergence depend on channel scale and thus tend to overlook low-probability channels. The mismatch in low-probability channels also results in the neglect of inter-class relationship information, making it difficult for the student to further enhance performance. To address this issue, we propose an auxiliary ranking loss based on Kendall’s τ Coefficient, which can be plugged into any logit-based distillation method, providing inter-class relationship information and balancing the attention to low-probability channels. We show that the proposed ranking loss is less affected by channel scale, and its optimization objective is consistent with that of KL divergence. Extensive experiments on CIFAR-100, ImageNet, and COCO datasets, as well as various CNN and ViT teacher-student architecture combinations, demonstrate that the proposed ranking loss can be plugged into various baselines and enhance their performance.
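As a rough, framework-agnostic illustration of the idea behind such a ranking loss (not the paper's implementation; the function name, the tanh smoothing, and the scale alpha are our assumptions), a differentiable surrogate for Kendall's τ can be built from pairwise channel differences:

```python
import numpy as np

def kendall_tau_loss(student_logits, teacher_logits, alpha=10.0):
    """Soft Kendall's tau between student and teacher channel rankings.

    For every channel pair (i, j) we check whether the student orders the
    pair the same way the teacher does, using tanh as a smooth sign
    function. Concordant pairs contribute ~+1, discordant pairs ~-1.
    """
    # pairwise differences over channels: shape [B, C, C]
    s_diff = student_logits[:, :, None] - student_logits[:, None, :]
    t_diff = teacher_logits[:, :, None] - teacher_logits[:, None, :]
    concord = np.tanh(alpha * s_diff) * np.tanh(alpha * t_diff)
    C = student_logits.shape[-1]
    # diagonal terms are zero; normalize by the number of ordered pairs
    soft_tau = concord.sum(axis=(1, 2)) / (C * (C - 1))
    # loss = 1 - tau: zero when rankings agree, ~2 when fully reversed
    return float(np.mean(1.0 - soft_tau))
```

Because tanh saturates, each pair contributes roughly ±1 regardless of channel magnitude, which is the scale-robustness the abstract emphasizes; in practice this would be written with autograd tensors and added to the KL term with a weighting coefficient.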

ICML2025: Whoever Started the Interference Should End It: Guiding Data-Free Model Merging via Task Vectors

Author: 程润曦

Abstract

Model merging seeks to integrate task-specific expert models into a unified architecture while preserving multi-task generalization capabilities, yet parameter interference between constituent models frequently induces performance degradation. Although prior work has explored many merging strategies, resolving interference without additional data for retraining or test-time computation remains challenging. In this paper, we theoretically demonstrate that the task vectors of the linear layer constitute an approximate linear subspace for its corresponding input. Therefore, we can minimize interference under the guidance of task vectors. Based on this insight, we propose WUDI-Merging (Whoever started the interference shoUld enD It), a simple yet effective model merging method that eliminates interference without any additional data or rescaling coefficients. Comprehensive empirical evaluations across vision and language benchmarks demonstrate our method’s superiority, achieving state-of-the-art performance in data-free model merging scenarios (average 10.9% improvement versus baseline methods) while even outperforming mainstream test-time adaptation approaches by 3.3%, all with only minimal computing resources. The code will be publicly available soon.

ICML2025: FlatQuant: Flatness Matters for LLM Quantization

Author: 刘睿康

Abstract

Recently, quantization has been widely used for the compression and acceleration of large language models (LLMs). Due to the outliers in LLMs, it is crucial to flatten weights and activations to minimize quantization error with the equally spaced quantization points. Prior research explores various pre-quantization transformations to suppress outliers, such as per-channel scaling and Hadamard transformation. However, we observe that these transformed weights and activations can still remain steep and outspread. In this paper, we propose FLATQUANT (Fast and Learnable Affine Transformation), a new post-training quantization approach to enhance the flatness of weights and activations. Our approach identifies optimal affine transformations tailored to each linear layer, calibrated in hours via a lightweight objective. To reduce runtime overhead, we apply Kronecker decomposition to the transformation matrices, and fuse all operations in FLATQUANT into a single kernel. Extensive experiments show that FLATQUANT sets a new state-of-the-art quantization benchmark. For instance, it achieves less than 1% accuracy drop for W4A4 quantization on the LLaMA-3-70B model, surpassing SpinQuant by 7.5%. For inference latency, FLATQUANT reduces the slowdown induced by pre-quantization transformation from 0.26x of QuaRot to merely 0.07x, bringing up to 2.3x speedup for prefill and 1.7x speedup for decoding, respectively. Code is available at: https://github.com/ruikangliu/FlatQuant.
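The Kronecker decomposition mentioned above can be made concrete with a small sketch (our own illustration under assumed shapes, not code from the paper): applying a Kronecker-factored transform (P1 ⊗ P2) to a hidden vector of length n1·n2 only requires two small matrix multiplies on a reshaped view, which is why the runtime and memory overhead stay low.

```python
import numpy as np

def kron_transform(x, P1, P2):
    """Apply (P1 kron P2) to a vector x of length n1*n2 without ever
    materializing the full (n1*n2) x (n1*n2) matrix.

    Identity (row-major flattening): kron(P1, P2) @ vec(X) == vec(P1 @ X @ P2.T)
    """
    n1, n2 = P1.shape[0], P2.shape[0]
    X = x.reshape(n1, n2)          # view the vector as an n1 x n2 matrix
    return (P1 @ X @ P2.T).reshape(-1)
```

For a hidden size of 4096 with n1 = n2 = 64, the full affine matrix would be 4096×4096, while the factored form stores and multiplies only two 64×64 matrices per layer.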

ICML2025: Preference Optimization for Combinatorial Optimization Problems

Author: 林冠权

Abstract

Reinforcement Learning (RL) has emerged as a powerful tool for neural combinatorial optimization, enabling models to learn heuristics that solve complex problems without requiring optimal solutions. Despite significant progress, existing RL approaches face challenges such as diminishing reward signals and inefficient exploration in vast combinatorial action spaces, leading to inefficient learning. In this paper, we propose Preference Optimization (PO), a novel framework that transforms quantitative reward signals into qualitative preference signals via statistical comparison modeling, emphasizing the superiority among generated solutions. Methodologically, by reparameterizing the reward function in terms of policy probabilities and utilizing preference models like Bradley-Terry and Thurstone, we formulate an entropy-regularized optimization objective that aligns the policy directly with preferences while avoiding intractable computations. Furthermore, we integrate heuristic local search techniques into the fine-tuning process to generate high-quality preference pairs, helping the policy escape local optima. Empirical results on standard combinatorial optimization benchmarks, such as the Traveling Salesman Problem (TSP), the Capacitated Vehicle Routing Problem (CVRP), and the Flexible Flow Shop Problem (FFSP), demonstrate that our method outperforms traditional RL algorithms, achieving superior sample efficiency and solution quality. Our work offers a simple yet efficient algorithmic advancement in neural combinatorial optimization.
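As a minimal sketch of how a quantitative reward comparison becomes a qualitative preference signal under the Bradley-Terry model (the function name and the β temperature are our assumptions, not the paper's notation): given the policy log-probabilities of a better and a worse solution, the loss is the negative log-likelihood that the better one wins the comparison.

```python
import numpy as np

def bt_preference_loss(logp_better, logp_worse, beta=1.0):
    """Bradley-Terry preference loss: -log sigmoid(beta * margin).

    logp_better / logp_worse: policy log-probabilities of the preferred
    and dispreferred solutions (e.g., tours). Minimizing the loss pushes
    the policy to assign higher probability to the preferred solution.
    """
    margin = beta * (np.asarray(logp_better) - np.asarray(logp_worse))
    # numerically stable -log(sigmoid(m)) = log(1 + exp(-m))
    return float(np.mean(np.logaddexp(0.0, -margin)))
```

The Thurstone variant would replace the logistic link with a Gaussian CDF; in the abstract's setting, the preference pairs come from comparing sampled solutions against local-search-improved ones.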

ICML2025: Modeling Multi-Task Model Merging as Adaptive Projective Gradient Descent

Author: 韦永贤

Abstract

Merging multiple expert models offers a promising approach for performing multi-task learning without accessing their original data. Existing methods attempt to alleviate task conflicts by sparsifying task vectors or promoting orthogonality among them. However, they overlook the fundamental requirement of model merging: ensuring the merged model performs comparably to task-specific models on respective tasks. We find these methods inevitably discard task-specific information that, while causing conflicts, is crucial for performance. Based on our findings, we frame model merging as a constrained optimization problem (i.e., minimizing the gap between the merged model and individual models, subject to the constraint of retaining shared knowledge) and solve it via adaptive projective gradient descent. Specifically, we align the merged model with individual models by decomposing and reconstituting the loss function, alleviating conflicts through data-free optimization of task vectors. To retain shared knowledge, we optimize this objective by projecting gradients within a shared subspace spanning all tasks. Moreover, we view merging coefficients as adaptive learning rates and propose a task-aware, training-free strategy. Experiments show that our plug-and-play approach consistently outperforms previous methods, achieving state-of-the-art results across diverse architectures and tasks in both vision and NLP domains. Our code is available here.

SIGGRAPH2025: Cobra: Efficient Line Art COlorization with BRoAder References

Author: 庄俊豪

Abstract

The comic production industry requires reference-based line art colorization with high accuracy, efficiency, contextual consistency, and flexible control. A comic page often involves diverse characters, objects, and backgrounds, which complicates the coloring process. Despite advancements in diffusion models for image generation, their application in line art colorization remains limited, facing challenges related to handling extensive reference images, time-consuming inference, and flexible control. We investigate how extensive contextual image guidance affects the quality of line art colorization. To address these challenges, we introduce Cobra, an efficient and versatile method that supports color hints and utilizes over 200 reference images while maintaining low latency. Central to Cobra is a Causal Sparse DiT architecture, which leverages specially designed positional encodings, causal sparse attention, and Key-Value Cache to effectively manage long-context references and ensure color identity consistency. Results demonstrate that Cobra achieves accurate line art colorization through extensive contextual reference, significantly enhancing inference speed and interactivity, thereby meeting critical industrial demands. We release our codes and models on our project page: https://zhuang2002.github.io/Cobra/.

CVPR2025: ComRoPE: Scalable and Robust Rotary Position Embedding Parameterized by Trainable Commuting Angle Matrices

Author: 余豪

Abstract

The Transformer architecture has revolutionized various domains since it was proposed, and its effectiveness largely depends on the ability to encode positional information. Traditional position encoding methods exhibit significant limitations due to their lack of robustness and positional flexibility. Therefore, Rotary Positional Encoding (RoPE), which integrates positional information by rotating the embeddings in the attention mechanism, was proposed to alleviate these issues. However, RoPE requires manually defined rotation matrices with a limited transformation space, constraining the model’s capacity. In this work, we propose ComRoPE, which generalizes RoPE by defining it in terms of trainable commuting angle matrices. Specifically, we demonstrate that pairwise commutativity of these matrices is essential for RoPE to achieve scalability and positional robustness. We formally define the RoPE Equation, an essential condition that ensures consistent performance under position offsets. Based on the theoretical analysis, we present two types of trainable commuting angle matrices as sufficient solutions to the RoPE equation, which significantly improve performance, surpassing the current state-of-the-art method by 1.6% at training resolution and 2.9% at higher resolution on the ImageNet-1K dataset. Furthermore, our framework shows versatility in generalizing to existing RoPE formulations and offering new insights for future positional encoding research.
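The role of commutativity can be illustrated with the simplest commuting family, the 2×2 block-diagonal rotations of original RoPE (a sketch of this special case, not the paper's trainable matrices): because all blocks share the same eigenstructure, R(p)·R(q) = R(p+q), so the attention score between a query at position p and a key at position q depends only on the relative offset p − q.

```python
import numpy as np

def rope_rotation(pos, thetas):
    """Block-diagonal RoPE rotation: one 2x2 rotation by pos*theta_i per block.

    All matrices of this form commute with each other, giving the
    relative-position property R(p) @ R(q) == R(p + q).
    """
    d = 2 * len(thetas)
    R = np.zeros((d, d))
    for i, th in enumerate(thetas):
        a = pos * th
        c, s = np.cos(a), np.sin(a)
        R[2 * i:2 * i + 2, 2 * i:2 * i + 2] = [[c, -s], [s, c]]
    return R
```

ComRoPE's trainable angle matrices generalize these fixed 2×2 blocks while preserving exactly this pairwise-commutativity condition.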

CVPR2025: Weakly Supervised Temporal Action Localization via Dual-Prior Collaborative Learning Guided by Multimodal Large Language Models

Author: 张权

Abstract

Recent breakthroughs in Multimodal Large Language Models (MLLMs) have gained significant recognition within the deep learning community, where the fusion of Video Foundation Models (VFMs) and Large Language Models (LLMs) has proven instrumental in constructing robust video understanding systems, effectively surmounting constraints associated with predefined visual tasks. These sophisticated MLLMs exhibit remarkable proficiency in comprehending videos, swiftly attaining unprecedented performance levels across diverse benchmarks. However, their operation demands substantial memory and computational resources, underscoring the continued importance of traditional models in video comprehension tasks. In this paper, we introduce a novel learning paradigm termed MLLM4WTAL. This paradigm harnesses the potential of MLLMs to offer temporal action key semantics and complete semantic priors for conventional Weakly-supervised Temporal Action Localization (WTAL) methods. MLLM4WTAL facilitates the enhancement of WTAL by leveraging MLLM guidance. It achieves this by integrating two distinct modules: Key Semantic Matching (KSM) and Complete Semantic Reconstruction (CSR). These modules work in tandem to effectively address prevalent issues like incomplete and over-complete outcomes common in WTAL methods. Rigorous experiments are conducted to validate the efficacy of our proposed approach in augmenting the performance of various heterogeneous WTAL models.

CVPR2025: Unlocking Tuning-Free Few-Shot Adaptability in Visual Foundation Models by Recycling Pre-Tuned LoRAs

Author: 胡梓轩

Abstract

Large Language Models (LLMs) such as ChatGPT demonstrate strong few-shot adaptability without requiring fine-tuning, making them ideal for data-limited and real-time applications. However, this adaptability has not yet been replicated in current Visual Foundation Models (VFMs), which require explicit fine-tuning with sufficient tuning data. Moreover, the pretraining-finetuning paradigm has led to a surge of numerous task-specific modular components, such as Low-Rank Adaptation (LoRA). For the first time, we explore the potential of reusing diverse pre-tuned LoRAs without accessing their original training data, to achieve tuning-free few-shot adaptation in VFMs. Our framework, LoRA Recycle, distills a meta-LoRA from diverse pre-tuned LoRAs with a meta-learning objective, using surrogate data generated inversely from the pre-tuned LoRAs themselves. The VFM, once equipped with the meta-LoRA, is empowered to solve new few-shot tasks in a single forward pass, akin to the in-context learning of LLMs. Additionally, we incorporate a double-efficient mechanism tailored to our framework, significantly accelerating the meta-training process while maintaining or even improving performance. Extensive experiments on various few-shot classification benchmarks across both in- and cross-domain scenarios demonstrate the superiority of our framework.

ICLR2025: ChartMoE: Mixture of Diversely Aligned Expert Connector for Advanced Chart Understanding

Author: 许正卓

Abstract

Automatic chart understanding is crucial for content comprehension and document parsing. Multimodal large language models (MLLMs) have demonstrated remarkable capabilities in chart understanding through domain-specific alignment and fine-tuning. However, the application of alignment training within the chart domain is still underexplored. To address this, we propose ChartMoE, which employs the mixture-of-experts (MoE) architecture to replace the traditional linear projector to bridge the modality gap. Specifically, we train multiple linear connectors through distinct alignment tasks, which are utilized as the foundational initialization parameters for different experts. Additionally, we introduce ChartMoE-Align, a dataset with over 900K chart-table-JSON-code quadruples to conduct three alignment tasks (chart-table/JSON/code). Combined with the vanilla connector, we initialize different experts in four distinct ways and adopt high-quality knowledge learning to further refine the MoE connector and LLM parameters. Extensive experiments demonstrate the effectiveness of the MoE connector and our initialization strategy, e.g., ChartMoE improves the accuracy of the previous state-of-the-art from 80.48% to 84.64% on the ChartQA benchmark.
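The MoE connector pattern can be sketched as follows (an illustrative top-k router over expert linear projectors following the generic MoE recipe; all names, shapes, and the routing scheme are our assumptions, not ChartMoE's exact design — in the paper, each expert would be initialized from a different alignment task such as chart-table, chart-JSON, or chart-code):

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def moe_connector(tokens, experts, router_w, top_k=2):
    """Route each visual token through a sparse mixture of expert projectors.

    tokens: [N, D_in] visual tokens; experts: list of [D_in, D_out] linear
    projectors; router_w: [D_in, E] router weights. Each token keeps its
    top_k experts, renormalizes their gates, and mixes their projections.
    """
    gates = softmax(tokens @ router_w)                  # [N, E]
    idx = np.argsort(-gates, axis=1)[:, :top_k]         # top-k experts per token
    mask = np.zeros_like(gates)
    np.put_along_axis(mask, idx, 1.0, axis=1)
    gates = gates * mask
    gates /= gates.sum(axis=1, keepdims=True)           # renormalize kept gates
    out = np.zeros((tokens.shape[0], experts[0].shape[1]))
    for e, W in enumerate(experts):
        out += gates[:, e:e + 1] * (tokens @ W)
    return out
```

Replacing a single linear projector with this structure lets differently-aligned connectors specialize per token while keeping the output dimension unchanged.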

ICLR2025: IMDPrompter: Adapting SAM to Image Manipulation Detection by Cross-View Automated Prompt Learning

Author: 张权

Abstract

Using extensive training data from SA-1B, the Segment Anything Model (SAM) has demonstrated exceptional generalization and zero-shot capabilities, attracting widespread attention in areas such as medical image segmentation and remote sensing image segmentation. However, its performance in the field of image manipulation detection remains largely unexplored and unconfirmed. There are two main challenges in applying SAM to image manipulation detection: a) reliance on manual prompts, and b) the difficulty of single-view information in supporting cross-dataset generalization. To address these challenges, we develop a cross-view prompt learning paradigm called IMDPrompter based on SAM. Benefiting from the design of automated prompts, IMDPrompter no longer relies on manual guidance, enabling automated detection and localization. Additionally, we propose components such as Cross-view Feature Perception, Optimal Prompt Selection, and Cross-View Prompt Consistency, which facilitate cross-view perceptual learning and guide SAM to generate accurate masks. Extensive experimental results from five datasets (CASIA, Columbia, Coverage, IMD2020, and NIST16) validate the effectiveness of our proposed method.

ICLR2025: Open-Vocabulary Customization from CLIP via Data-Free Knowledge Distillation

Author: 韦永贤

Abstract

Vision-language models such as CLIP have demonstrated strong zero-shot performance, but their considerable size and inefficient inference limit customizable deployment for users. While knowledge distillation is a solution, it still requires the original data, which is not always available due to copyrights and privacy concerns. For many users seeking open-vocabulary customization, Data-Free Knowledge Distillation (DFKD) emerges as a promising direction. Upon rethinking DFKD, we find that existing methods fail on CLIP due to their heavy reliance on BatchNorm layers, which are unexpectedly unusable in CLIP. Based on our findings, we adopt image-text matching to achieve DFKD for CLIP, enabling customization based on arbitrary class texts. This involves (i) inverting a surrogate dataset from CLIP based on text prompts; and (ii) distilling a student model from CLIP using the surrogate dataset. Specifically, we introduce style dictionary diversification to enhance the diversity of synthetic images. To prevent uncontrollable semantics introduced by diversification, we propose a class consistency maintaining strategy to ensure the consistency of synthetic images. Based on synthetic images with various styles, we further propose meta knowledge distillation to train the student model with good generalization ability. Moreover, we introduce a simple yet effective method to enable customization based on few example images. Comprehensive experiments showcase the superiority of our approach across twelve customized tasks, achieving a 9.33% improvement compared to existing DFKD methods.

ICASSP2025: TextureDiffusion: Target Prompt Disentangled Editing for Various Texture Transfer

Author: 苏梓瀚

Abstract

Recently, text-guided image editing has achieved significant success. However, existing methods can only apply simple textures like wood or gold when changing the texture of an object. Complex textures such as cloud or fire pose a challenge. This limitation stems from the fact that the target prompt must contain both the input image content and <texture>, restricting the texture representation. In this paper, we propose TextureDiffusion, a tuning-free image editing method for various texture transfer. Initially, the target prompt is directly set to “<texture>”, making the texture disentangled from the input image content to enhance texture representation. Subsequently, query features in self-attention and features in residual blocks are utilized to preserve the structure of the input image. Finally, to maintain the background, we introduce an edit localization technique which blends the self-attention results and the intermediate latents. Comprehensive experiments demonstrate that TextureDiffusion can harmoniously transfer various textures with excellent structure and background preservation.

AAAI2025: Aligning Composed Query with Image via Discriminative Perception from Negative Correspondences

Author: 王依凡

Abstract

The task of composed image retrieval aims to match the multi-modal query composed of a reference image and a modification sentence with the target image. Most current approaches narrow the distances between the composed queries and targets by investigating matched correspondences in positive triplets. Nevertheless, they are inclined to exhibit heavy reliance on partial correlations. As the negative correspondences are underestimated, semantic clues that distinguish the target from mismatched candidates are obscured by incomplete associations. Moreover, the correlations between the modification textual features and the visual variations from the reference to candidates are imperative to further strengthen the semantic discriminations. In this paper, we propose DIscriminative Perception from NEgative Correspondences (DIPNEC) to address the aforementioned issues. To encourage awareness of the differences between matched and mismatched correspondences, DIPNEC introduces optimal transport with semantic preservation for reassignments on hard negative triplets. Besides, Difference Quantization Alignments (DQA) and Composed Word-level Alignments (CWA) jointly determine the matching scores between multi-modal queries and candidates. Specifically, DQA concentrates on the correlations of textual features with source-to-target visual differences, and CWA further emphasizes the differentiated semantics. Extensive experimental results and ablation studies on the widely-used FashionIQ and CIRR datasets demonstrate the competitive performance of DIPNEC.
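The optimal-transport reassignment can be grounded with a minimal entropy-regularized Sinkhorn sketch (generic Sinkhorn with uniform marginals; the cost matrix, the marginals, and how the resulting plan weights hard negatives are our assumptions, not DIPNEC's exact formulation):

```python
import numpy as np

def sinkhorn_plan(cost, eps=0.1, n_iters=1000):
    """Entropy-regularized optimal transport via Sinkhorn iterations.

    cost: [n, m] matrix, e.g., distances between queries (rows) and
    hard-negative candidates (columns). Returns a transport plan whose
    entries can serve as soft reassignment weights, with rows and
    columns converging to the uniform marginals.
    """
    n, m = cost.shape
    r = np.full(n, 1.0 / n)          # uniform source marginal
    c = np.full(m, 1.0 / m)          # uniform target marginal
    K = np.exp(-cost / eps)          # Gibbs kernel
    v = np.ones(m)
    for _ in range(n_iters):
        u = r / (K @ v)              # scale rows toward marginal r
        v = c / (K.T @ u)            # scale columns toward marginal c
    return u[:, None] * K * v[None, :]
```

Lower-cost (more confusable) negatives receive more transport mass, so the plan naturally focuses the reassignment on the hardest triplets.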

AAAI2025: Rethinking Pseudo-Label Guided Learning for Weakly-Supervised Temporal Action Localization from the Perspective of Noise Correction

Author: 张权

Abstract

Pseudo-label learning methods have been widely applied in weakly-supervised temporal action localization. Existing works directly utilize a weakly-supervised base model to generate instance-level pseudo-labels for training the fully-supervised detection head. We argue that the noise in pseudo-labels would interfere with the learning of the fully-supervised detection head, leading to significant performance degradation. Issues with noisy labels include: (1) inaccurate boundary localization; (2) undetected short action clips; (3) multiple adjacent segments incorrectly detected as one segment. To target these issues, we introduce a two-stage noisy label learning strategy to harness every potentially useful signal in noisy labels. First, we propose a frame-level pseudo-label generation model with a context-aware denoising algorithm to refine the boundaries. Second, we introduce an online-revised teacher-student framework with a missing instance compensation module and an ambiguous instance correction module to solve the short-action-missing and many-to-one problems. Besides, we apply a high-quality pseudo-label mining loss in our online-revised teacher-student framework to assign different weights to the noisy labels for more effective training. Our model greatly outperforms the previous state-of-the-art method in detection accuracy and inference speed on the THUMOS14 and ActivityNet v1.2 benchmarks.

NeurIPS2024: SuperVLAD: Compact and Robust Image Descriptors for Visual Place Recognition

Author: 卢锋

Abstract

Visual place recognition (VPR) is an essential task for multiple applications such as augmented reality and robot localization. Over the past decade, mainstream methods in the VPR area have used feature representations based on global aggregation, as exemplified by NetVLAD. These features are suitable for large-scale VPR and robust against viewpoint changes. However, the VLAD-based aggregation methods usually learn a large number of (e.g., 64) clusters and their corresponding cluster centers, which directly leads to a high dimension of the yielded global features. More importantly, when there is a domain gap between the data in training and inference, the cluster centers determined on the training set are usually improper for inference, resulting in a performance drop. To this end, we first attempt to improve NetVLAD by removing the cluster centers and setting only a small number of (e.g., only 4) clusters. The proposed method not only simplifies NetVLAD but also enhances the generalizability across different domains. We name this method SuperVLAD. In addition, by introducing ghost clusters that will not be retained in the final output, we further propose a very low-dimensional 1-Cluster VLAD descriptor, which has the same dimension as the output of GeM pooling but performs notably better. Experimental results suggest that, when paired with a transformer-based backbone, our SuperVLAD shows better domain generalization performance than NetVLAD with significantly fewer parameters. The proposed method also surpasses state-of-the-art methods with lower feature dimensions on several benchmark datasets. The code is available at https://github.com/lu-feng/SuperVLAD.
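A center-free VLAD-style aggregation with ghost clusters can be sketched as follows (an illustration of the idea with random assignment weights and assumed names; the real method learns the assignment end-to-end on top of a transformer backbone):

```python
import numpy as np

def supervlad_aggregate(features, assign_w, n_ghost=0):
    """Center-free VLAD-style aggregation with optional ghost clusters.

    features: [N, D] local descriptors; assign_w: [D, K + n_ghost]
    soft-assignment weights. Each feature is softly assigned to clusters
    and summed directly; no cluster centers are subtracted, which is the
    simplification that helps cross-domain generalization. Ghost clusters
    absorb assignment mass but are dropped from the output.
    """
    logits = features @ assign_w                     # [N, K + n_ghost]
    logits -= logits.max(axis=1, keepdims=True)      # stable softmax
    a = np.exp(logits)
    a /= a.sum(axis=1, keepdims=True)
    vlad = a.T @ features                            # [K + n_ghost, D]
    vlad = vlad[: assign_w.shape[1] - n_ghost]       # keep only real clusters
    vlad /= np.linalg.norm(vlad) + 1e-12             # global L2 normalization
    return vlad.reshape(-1)                          # K * D global descriptor
```

With K = 1 plus ghost clusters, the output dimension collapses to D, matching the GeM-sized 1-Cluster VLAD descriptor described in the abstract.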

ICPR2024: Benchmarking AI in Mental Health: A Critical Examination of LLMs Across Key Performance and Ethical Metrics

Author: 袁睿

Abstract

The rapid advancement of artificial intelligence (AI) has led to an increasing application of Large Language Models (LLMs) in psychological counseling. This study focuses on a comprehensive evaluation of LLMs in this domain, moving beyond traditional case-based reasoning. We introduce a novel multi-agent LLM framework that enhances the analysis of psychological case interactions. Our approach involves expanding the Emotional First Aid dataset with diverse client backgrounds, enhancing its applicability and generalizability. A sophisticated user profile model, incorporating eight critical dimensions, is developed and applied within a multi-agent system to examine counseling scenarios. The system’s performance is extensively evaluated based on accuracy, robustness, consistency, and fairness. The findings reveal significant differences among LLMs in these areas, highlighting their strengths and limitations in psychological interventions. This research underscores the need for ongoing refinement in LLM applications to ensure equitable and reliable support in psychological counseling. The detailed results and methodologies are available on the GitHub platform for further academic scrutiny and development.

ICPR2024: AMC-OA: Adaptive Multi-Scale Convolutional Networks with Optimized Attention for Temporal Action Localization

Author: 袁睿

Abstract

Temporal Action Localization (TAL) is crucial in video understanding, focusing on identifying and timestamping actions within raw video footage. A critical challenge in TAL is processing the rich spatiotemporal details inherent in videos, traditionally addressed through methods adapted from image processing. The Vision Transformer (ViT) model marked a significant evolution, using a self-attention mechanism for enhanced temporal information blending. Despite these advancements, two key issues remain: insufficient extraction of spatial semantic information at lower levels of feature pyramids and inadequate capture of temporal semantic information at higher levels. To address these challenges, we introduce Adaptive Multi-Scale Convolutional Networks with Optimized Attention (AMC-OA). AMC-OA enhances lower-level features within the pyramid using multi-scale convolutional kernels, enriching spatial contextual semantics. Simultaneously, upper-level features are refined with a temporally-focused contextual enhancement network utilizing residual structures for better temporal understanding. To further improve the model’s capability in handling extensive temporal spans, we integrate an advanced multi-head attention mechanism. Empirical results on benchmarks like THUMOS14 and ActivityNet1.3 demonstrate AMC-OA’s superiority in TAL tasks, significantly improving both spatial and temporal information extraction compared to state-of-the-art models.

ACM MM2024: Semantic Distillation from Neighborhood for Composed Image Retrieval

Author: 王依凡

Abstract

The challenging task of composed image retrieval aims to identify the matched image from a multi-modal query with a reference image and a textual modifier. Most existing methods are devoted to composing unified query representations from the query images and texts, yet the distribution gaps between the hybrid-modal query representations and visual target representations are neglected. However, directly incorporating target features on the query may cause ambiguous rankings and poor robustness due to the insufficient exploration of the distinctions and overfitting issues. To address the above concerns, we propose a novel framework termed SemAntic Distillation from Neighborhood (SADN) for composed image retrieval. For mitigating the distribution divergences, we construct neighborhood sampling from the target domain for each query and aggregate neighborhood features with adaptive weights to restructure the query representations. Specifically, the adaptive weights are determined by the collaboration of two individual modules, namely correspondence-induced adaption and divergence-based correction. Correspondence-induced adaption accounts for capturing the correlation alignments from neighbor features under the guidance of the positive representations, and the divergence-based correction regulates the weights based on the embedding distances between hard negatives and the query in the latent space. Extensive results and ablation studies on CIRR and FashionIQ validate that the proposed semantic distillation from neighborhood significantly outperforms baseline methods.

ACM MM2024: CustomNet: Object Customization with Variable-Viewpoints in Text-to-Image Diffusion Models

Author: 袁梓洋

Abstract

Incorporating a customized object into image generation presents an attractive feature in text-to-image (T2I) generation. Some methods finetune T2I models for each object individually at test time, which tends to overfit and is time-consuming. Others train an extra encoder to extract object visual information for customization efficiently but struggle to preserve the object’s identity. To address these limitations, we present CustomNet, a unified encoder-based object customization framework that explicitly incorporates 3D novel view synthesis capabilities into the customization process. This integration facilitates the adjustment of spatial positions and viewpoints, producing diverse outputs while effectively preserving the object’s identity. To train our model effectively, we propose a dataset construction pipeline to better handle real-world objects and complex backgrounds. Additionally, we introduce delicate designs that enable location control and flexible background control through textual descriptions or user-defined backgrounds. Our method allows for object customization without the need for test-time optimization, providing simultaneous control over viewpoints, location, and text. Experimental results show that our method outperforms other customization methods regarding identity preservation, diversity, and harmony. Codes are available at https://github.com/TencentARC/CustomNet.

ECCV2024: GVGEN: Text-to-3D Generation with Volumetric Representation

Author: 何相龙

Abstract

In recent years, 3D Gaussian splatting has emerged as a powerful technique for 3D reconstruction and generation, known for its fast and high-quality rendering capabilities. Nevertheless, these methods often come with limitations, either lacking the ability to produce diverse samples or requiring prolonged inference times. To address these shortcomings, this paper introduces a novel diffusion-based framework, GVGEN, designed to efficiently generate 3D Gaussian representations from text input. We propose two innovative techniques: (1) Structured Volumetric Representation. We first arrange disorganized 3D Gaussian points as a structured form GaussianVolume. This transformation allows the capture of intricate texture details within a volume composed of a fixed number of Gaussians. To better optimize the representation of these details, we propose a unique pruning and densifying method named the Candidate Pool Strategy, enhancing detail fidelity through selective optimization. (2) Coarse-to-fine Generation Pipeline. To simplify the generation of GaussianVolume and empower the model to generate instances with detailed 3D geometry, we propose a coarse-to-fine pipeline. It initially constructs a basic geometric structure, followed by the prediction of complete Gaussian attributes. Our framework, GVGEN, demonstrates superior performance in qualitative and quantitative assessments compared to existing 3D generation methods. Simultaneously, it maintains a fast generation speed (∼7 s), effectively striking a balance between quality and efficiency. Our project page is https://gvgen.github.io/.

ECCV2024: A Task is Worth One Word: Learning with Task Prompts for High-Quality Versatile Image Inpainting

Author: 庄俊豪

Abstract

Advancing image inpainting is challenging as it requires filling user-specified regions for various intents, such as background filling and object synthesis. Existing approaches focus on either context-aware filling or object synthesis using text descriptions. However, achieving both tasks simultaneously is challenging due to differing training strategies. To overcome this challenge, we introduce PowerPaint, the first high-quality and versatile inpainting model that excels in multiple inpainting tasks. First, we introduce learnable task prompts along with tailored fine-tuning strategies to guide the model’s focus on different inpainting targets explicitly. This enables PowerPaint to accomplish various inpainting tasks by utilizing different task prompts, resulting in state-of-the-art performance. Second, we demonstrate the versatility of the task prompt in PowerPaint by showcasing its effectiveness as a negative prompt for object removal. Moreover, we leverage prompt interpolation techniques to enable controllable shape-guided object inpainting, enhancing the model’s applicability in shape-guided applications. Finally, we conduct extensive experiments and applications to verify the effectiveness of PowerPaint. We release our codes and models on our project page: https://powerpaint.github.io/.

ECCV2024: DreamDiffusion: High-Quality EEG-to-Image Generation with Temporal Masked Signal Modeling and CLIP Alignment

Author: 白云鹏

Abstract

This paper introduces DreamDiffusion, a novel method for generating high-quality images directly from brain electroencephalogram (EEG) signals, without the need to translate thoughts into text. DreamDiffusion leverages pre-trained text-to-image models and employs temporal masked signal modeling to pre-train the EEG encoder for effective and robust EEG representations. Additionally, the method further leverages the CLIP image encoder to provide extra supervision to better align EEG, text, and image embeddings with limited EEG-image pairs. Overall, the proposed method overcomes the challenges of using EEG signals for image generation, such as noise, limited information, and individual differences, and achieves promising results. Quantitative and qualitative results demonstrate the effectiveness of the proposed method as a significant step towards portable and low-cost “thoughts-to-image”, with potential applications in neuroscience and computer vision.

ECCV2024: MirrorGaussian: Reflecting 3D Gaussians for Reconstructing Mirror Reflections

Author: 刘佳月

Abstract

3D Gaussian Splatting showcases notable advancements in photo-realistic and real-time novel view synthesis. However, it faces challenges in modeling mirror reflections, which exhibit substantial appearance variations from different viewpoints. To tackle this problem, we present MirrorGaussian, the first method for mirror scene reconstruction with real-time rendering based on 3D Gaussian Splatting. The key insight is grounded on the mirror symmetry between the real-world space and the virtual mirror space. We introduce an intuitive dual-rendering strategy that enables differentiable rasterization of both the real-world 3D Gaussians and the mirrored counterpart obtained by reflecting the former about the mirror plane. All 3D Gaussians are jointly optimized with the mirror plane in an end-to-end framework. MirrorGaussian achieves high-quality and real-time rendering in scenes with mirrors, empowering scene editing like adding new mirrors and objects. Comprehensive experiments on multiple datasets demonstrate that our approach significantly outperforms existing methods, achieving state-of-the-art results. Project page: https://mirror-gaussian.github.io/.
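The dual-rendering strategy hinges on a closed-form operation: reflecting each 3D Gaussian center about the estimated mirror plane before rasterizing the mirrored copy. A minimal NumPy sketch of that reflection step (an illustrative helper, not the authors' released code; in the full method the Gaussians' orientations and the plane itself are also optimized):

```python
import numpy as np

def reflect_about_plane(points, plane_point, plane_normal):
    """Reflect 3D points about the mirror plane given by a point on it and its normal."""
    n = plane_normal / np.linalg.norm(plane_normal)
    # Signed distance of each point to the plane, then mirror across it.
    d = (points - plane_point) @ n
    return points - 2.0 * d[:, None] * n

# A Gaussian center 1 unit above the z=0 plane maps to 1 unit below it.
pts = np.array([[0.0, 0.0, 1.0]])
mirrored = reflect_about_plane(pts, np.zeros(3), np.array([0.0, 0.0, 1.0]))
```

Because the reflection is a simple linear map, it is differentiable, which is what allows the plane parameters and all Gaussians to be optimized jointly end to end.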

IEEE TCSVT: Meta-Learning without Data via Unconditional Diffusion Models

Author: 韦永贤

Abstract

Although few-shot learning aims to address data scarcity, it still requires large, annotated datasets for training, which are often unavailable due to cost and privacy concerns. Previous studies have utilized pre-trained diffusion models, either to synthesize auxiliary data besides limited labeled samples, or to employ diffusion models as zero-shot classifiers. However, they are limited to conditional diffusion models needing class prior information (e.g., carefully crafted text prompts) about unseen tasks. To overcome this, we leverage unconditional diffusion models without the need for class information to train a meta-model capable of generalizing to unseen tasks. The framework contains (1) a meta-learning without data approach that uses synthetic data during training; and (2) a diffusion model-based data augmentation to calibrate the distribution shift during testing. During meta-training, we implement a self-taught class-learner to gradually capture class concepts, guiding unconditional diffusion models to generate a labeled pseudo dataset. This pseudo dataset is then used to jointly train the class-learner and the meta-model, allowing for iterative refinement and clear differentiation between classes. During meta-testing, we introduce a data augmentation that employs the diffusion models used in meta-training, to narrow the gap between the meta-training and meta-testing task distributions. This enables the meta-model trained on synthetic images to effectively classify real images in unseen tasks. Comprehensive experiments showcase the superiority and adaptability of our approach in four real-world scenarios. Code available at https://github.com/WalkerWorldPeace/MLWDUDM .

ICML2024: CrossGET: Cross-Guided Ensemble of Tokens for Accelerating Vision-Language Transformers

Author: 石大川

Abstract

Recent vision-language models have achieved tremendous advances. However, their computational costs are also escalating dramatically, making model acceleration exceedingly critical. To pursue more efficient vision-language Transformers, this paper introduces Cross-Guided Ensemble of Tokens (CrossGET), a general acceleration framework for vision-language Transformers. This framework adaptively combines tokens in real-time during inference, significantly reducing computational costs while maintaining high performance. CrossGET features two primary innovations: 1) Cross-Guided Matching and Ensemble. CrossGET leverages cross-modal guided token matching and ensemble to effectively utilize cross-modal information, achieving wider applicability across both modality-independent models, e.g., CLIP, and modality-dependent ones, e.g., BLIP2. 2) Complete-Graph Soft Matching. CrossGET introduces an algorithm for the token-matching mechanism, ensuring reliable matching results while facilitating parallelizability and high efficiency. Extensive experiments have been conducted on various vision-language tasks, such as image-text retrieval, visual reasoning, image captioning, and visual question answering. The performance on both classic multimodal architectures and emerging multimodal LLMs demonstrates the framework’s effectiveness and versatility. The code is available at https://github.com/sdc17/CrossGET .

ICML2024: Sparse Model Inversion: Efficient Inversion of Vision Transformers with Less Hallucination

Author: 胡梓轩

Abstract

Model inversion, which aims to reconstruct the original training data from pre-trained discriminative models, is especially useful when the original training data is unavailable due to privacy, usage rights, or size constraints. However, existing dense inversion methods attempt to reconstruct the entire image area, making them extremely inefficient when inverting high-resolution images from large-scale Vision Transformers (ViTs). We further identify two underlying causes of this inefficiency: the redundant inversion of noisy backgrounds and the unintended inversion of spurious correlations, a phenomenon we term “hallucination” in model inversion. To address these limitations, we propose a novel sparse model inversion strategy as a plug-and-play extension to speed up existing dense inversion methods with no need to modify their original loss functions. Specifically, we selectively invert semantic foregrounds while stopping the inversion of noisy backgrounds and potential spurious correlations. Through both theoretical and empirical studies, we validate the efficacy of our approach in achieving significant inversion acceleration (up to ×3.79) while maintaining comparable or even enhanced downstream performance in data-free model quantization and data-free knowledge transfer. Code is available at https://github.com/Egg-Hu/SMI.
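The sparse strategy can be pictured as a token mask: patch tokens judged to be semantic foreground keep receiving inversion gradients, while background tokens are frozen. A schematic sketch (the scoring rule and names here are assumptions for illustration, not the released implementation):

```python
import numpy as np

def foreground_mask(attn_to_cls, keep_ratio=0.25):
    """Keep the patch tokens most attended by the [CLS] token; stop inverting the rest.

    attn_to_cls: (num_patches,) attention weights from [CLS] to each patch token.
    Returns a boolean mask over patches; True means inversion continues.
    """
    k = max(1, int(keep_ratio * attn_to_cls.size))
    thresh = np.sort(attn_to_cls)[-k]          # k-th largest score as cutoff
    return attn_to_cls >= thresh

attn = np.array([0.05, 0.40, 0.12, 0.30, 0.06, 0.08])
mask = foreground_mask(attn, keep_ratio=0.5)   # keeps the 3 highest-scoring patches
```

Since gradients for masked-out tokens are never computed, the cost per inversion step shrinks roughly in proportion to the fraction of patches kept.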

ICML2024: Task Groupings Regularization: Data-Free Meta-Learning with Heterogeneous Pre-trained Models

Author: 韦永贤

Abstract

Data-Free Meta-Learning (DFML) aims to derive knowledge from a collection of pre-trained models without accessing their original data, enabling the rapid adaptation to new unseen tasks. Current methods often overlook the heterogeneity among pre-trained models, which leads to performance degradation due to task conflicts. In this paper, we empirically and theoretically identify and analyze the model heterogeneity in DFML. We find that model heterogeneity introduces a heterogeneity-homogeneity trade-off, where homogeneous models reduce task conflicts but also increase the overfitting risk. Balancing this trade-off is crucial for learning shared representations across tasks. Based on our findings, we propose Task Groupings Regularization that benefits from model heterogeneity by grouping and aligning conflicting tasks. Specifically, we embed pre-trained models into a task space to compute dissimilarity, and group heterogeneous models together based on this measure. Then, we introduce implicit gradient regularization within each group to mitigate potential conflicts. By encouraging a gradient direction suitable for all tasks, the meta-model captures shared representations that generalize across tasks. Comprehensive experiments showcase the superiority of our approach in multiple benchmarks, effectively tackling the model heterogeneity in challenging multi-domain and multi-architecture scenarios.

ICML2024: DFD: Distilling the Feature Disparity Differently for Detectors

Author: 刘康

Abstract

Knowledge distillation is a widely adopted model compression technique that has been successfully applied to object detection. In feature distillation, it is common practice for the student model to imitate the feature responses of the teacher model, with the underlying objective of improving its own abilities by reducing the disparity with the teacher. However, it is crucial to recognize that the disparities between the student and teacher are inconsistent, highlighting their varying abilities. In this paper, we explore the inconsistency in the disparity between teacher and student feature maps and analyze their impact on the efficiency of the distillation. We find that regions with varying degrees of difference should be treated separately, with different distillation constraints applied accordingly. We introduce our distillation method called Disparity Feature Distillation (DFD). The core idea behind DFD is to apply different treatments to regions with varying learning difficulties, simultaneously incorporating leniency and strictness. It enables the student to better assimilate the teacher’s knowledge. Through extensive experiments, we demonstrate the effectiveness of our proposed DFD in achieving significant improvements. For instance, when applied to detectors based on ResNet50 such as RetinaNet, FasterRCNN, and RepPoints, our method enhances their mAP from 37.4%, 38.4%, 38.6% to 41.7%, 42.4%, 42.7%, respectively. Our approach also demonstrates substantial improvements on YOLO and ViT-based models. The code is available at https://github.com/luckin99/DFD.

IJCNN2024: Integrating Local & Global Features for Estimating Shortest-Path Distance in Large-Scale Graphs

Author: 王浩宇

Abstract

We propose an effective hybrid approach jointly leveraging local and global features for shortest-path (SP) distance estimation in domain-agnostic large-scale graphs. Previous works struggle to make estimations either from node-wise local embeddings or by compressing a global SP distance matrix, causing insufficient learning at some distances and loss of accuracy. Unlike them, we find a way to better preserve local distance in node embeddings, and then integrate them with a global process for accurate estimation at every distance. First, we propose a distance-consistent embedding method that better preserves the distance between each node and its local neighbors by resampling node occurrences on random walks. Second, we train a feed-forward network with boosting techniques (FFN-BT) to estimate SP distance from these embeddings plus existing global features. Experimental results show that our approach yields, on average, 10% higher accuracy and 20% lower runtime compared to existing methods on a broad class of graphs.

IJCNN2024: MMR: Multi-Scale Motion Retargeting Between Skeleton-Agnostic Characters

Author: 王浩宇

Abstract

We present a simple yet effective method for skeleton-agnostic motion retargeting. Previous methods transfer motion between high-resolution meshes, failing to preserve the inherent local-part motions in the mesh. Addressing this issue, our proposed method learns the correspondence in a coarse-to-fine fashion by disentangling the retargeting process within multi-scale meshes. First, we propose a mesh-pooling module that pools the mesh representations for better motion transfer. This module improves the ability to handle small-part motion and preserves the local motion interdependence between neighboring mesh vertices. Furthermore, we leverage a multi-scale refinement procedure to complement missing mesh details by gradually refining the low-resolution mesh output with a higher-resolution one. We evaluate our method on several well-known 3D character datasets, and it yields an average improvement of 25% on point-wise mesh Euclidean distance (PMD) against the state-of-the-art method. Qualitative results show that our method is significantly helpful in preserving the moving consistency of different body parts on the target character due to disentangling body-part structures and mesh details in a multi-scale way.

IJCNN2024: Error Bound Based Noise Schedule Design in Diffusion Models

Author: 刘力源

Abstract

Diffusion-based generative models currently serve as a mainstream generative method. The noise schedule has a significant impact on the training process of diffusion models, as it affects both the distribution of the noisy training set and the weights of the objective function at each noise level. In this paper, we design the noise schedule with the aim of reducing the final error upper bound of the reverse denoising process. By examining Monte Carlo training from a theoretical perspective, we establish an association between the noise schedule and the upper bound of the network output error. Furthermore, we derive the connection between the network output and the final error through the reverse process. We design our noise schedule with the goal of reducing the upper bound of the error, combined with a correlation analysis of the network output. Experimental results demonstrate that our noise schedule enhances perceptual quality on CIFAR-10, FFHQ-64x64 and AFHQv2-64x64. Our noise schedule achieves a state-of-the-art FID score of 1.70 on the CIFAR-10 unconditional generation task using the discriminator guidance method. On FFHQ/AFHQv2, using our noise schedule to retrain the pre-trained model can improve the sample quality at little training cost.

IJCNN2024: Noise Weighting Phased Prompt Image Editing

Author: 徐国炜

Abstract

The remarkable performance of large-scale Text-to-Image generation (TI) models is evident in their ability to produce high-quality and diverse images. However, despite advancements, the field of image editing still faces challenges. Current methods struggle to strike a balance between fidelity and powerful editing capabilities. Moreover, approaches that do not involve fine-tuning fail to produce diverse editing results. We introduce Noise Weighting Phased Prompt Image Editing (NWPP), a method that excels in powerful editing, high fidelity, and diverse results without fine-tuning. Our approach involves a two-phase generation process. The first phase employs the original prompt to guide initial image editing, ensuring a layout resembling the original image. In the second phase, a noise-weighting technique based on the Cross-Attention map minimizes the impact of the target text on non-editing regions. Further enhancement is achieved through the integration of the KV injection module, expanding the editing capabilities and enabling diverse result generation. Experimental evaluations, conducted on both generated images and the COCO dataset, affirm the efficacy of our method.

CVPR2024: Distilling Semantic Priors from SAM to Efficient Image Restoration Models

Author: 张权

Abstract

In image restoration (IR), leveraging semantic priors from segmentation models has been a common approach to improve performance. The recent segment anything model (SAM) has emerged as a powerful tool for extracting advanced semantic priors to enhance IR tasks. However, the computational cost of SAM is prohibitive for IR compared to existing smaller IR models. The incorporation of SAM for extracting semantic priors considerably hampers the model inference efficiency. To address this issue, we propose a general framework to distill SAM’s semantic knowledge to boost existing IR models without interfering with their inference process. Specifically, our proposed framework consists of the semantic priors fusion (SPF) scheme and the semantic priors distillation (SPD) scheme. SPF fuses two kinds of information between the restored image predicted by the original IR model and the semantic mask predicted by SAM for the refined restored image. SPD leverages a self-distillation manner to distill the fused semantic priors to boost the performance of original IR models. Additionally, we design a semantic-guided relation (SGR) module for SPD, which ensures semantic feature representation space consistency to fully distill the priors. We demonstrate the effectiveness of our framework across multiple IR models and tasks, including deraining, deblurring, and denoising.

CVPR2024: CricaVPR: Cross-image Correlation-aware Representation Learning for Visual Place Recognition

Author: 卢锋

Abstract

Over the past decade, most methods in visual place recognition (VPR) have used neural networks to produce feature representations. These networks typically produce a global representation of a place image using only this image itself, and neglect the cross-image variations (e.g., viewpoint and illumination), which limits their robustness in challenging scenes. In this paper, we propose a robust global representation method with cross-image correlation awareness for VPR, named CricaVPR. Our method uses the attention mechanism to correlate multiple images within a batch. These images can be taken in the same place with different conditions or viewpoints, or even captured from different places. Therefore, our method can utilize the cross-image variations as a cue to guide the representation learning, which ensures that more robust features are produced. To further facilitate the robustness, we propose a multi-scale convolution-enhanced adaptation method to adapt pre-trained visual foundation models to the VPR task, which introduces multi-scale local information to further enhance the cross-image correlation-aware representation. Experimental results show that our method outperforms state-of-the-art methods by a large margin with significantly less training time. The code is released at https://github.com/Lu-Feng/CricaVPR.

CVPR2024: A Free Lunch for Faster and Better Data-Free Meta-Learning

Author: 韦永贤

Abstract

Data-Free Meta-Learning (DFML) aims to extract knowledge from a collection of pre-trained models without requiring the original data, presenting practical benefits in contexts constrained by data privacy concerns. Current DFML methods primarily focus on the data recovery from these pre-trained models. However, they suffer from slow recovery speed and overlook gaps inherent in heterogeneous pre-trained models. In response to these challenges, we introduce the Faster and Better Data-Free Meta-Learning (FREE) framework, which contains: (i) a meta-generator for rapidly recovering training tasks from pre-trained models; and (ii) a meta-learner for generalizing to new unseen tasks. Specifically, within the module Faster Inversion via Meta-Generator, each pre-trained model is perceived as a distinct task. The meta-generator can rapidly adapt to a specific task in just five steps, significantly accelerating the data recovery. Furthermore, we propose Better Generalization via Meta-Learner and introduce an implicit gradient alignment algorithm to optimize the meta-learner. This is achieved as aligned gradient directions alleviate potential conflicts among tasks from heterogeneous pre-trained models. Empirical experiments on multiple benchmarks affirm the superiority of our approach, marking a notable speed-up (20x) and performance enhancement (1.42%-4.78%) in comparison to the state-of-the-art.

IEEE T-MM: Negative-Sensitive Framework with Semantic Enhancement for Composed Image Retrieval

Author: 王依凡

Abstract

Composed image retrieval is a challenging task in the field of multi-modal learning, aiming at measuring the similarities between target images and query images with modification sentences. Most previous methods either construct feature composition for the query image and modification text or concentrate on extracting cross-modal alignments. However, these methods are prone to neglect the negative impacts of the mismatched correspondences between the hybrid-modal query and target, which could be discriminative when comparing similar instances. Besides, localized textual representations are not fully explored when learning similarities between the query and the target. To overcome the above issues, we propose a Negative-Sensitive Framework with Semantic Enhancement (NSFSE) for mining the adaptive boundaries between matched and mismatched samples with comprehensive consideration of positive and negative correspondences. It can optimize the threshold dynamically based on distributions to explore the intrinsic characteristics of positive and negative correlations, which could further facilitate accurate similarity learning. A text-guided attention mechanism after infusing cross-modal affinities on localized word features is exploited in NSFSE to explore latent semantic-related visual similarity and cross-modal similarity simultaneously. Extensive experiments and comprehensive analysis on three representative datasets, CIRR, FashionIQ, and Fashion200K, demonstrate the effectiveness of negative mining of similarity with semantic enhancement in the proposed NSFSE.

ICLR2024: Convolution Meets LoRA: Parameter Efficient Finetuning for Segment Anything Model

Author: 钟子涵

Abstract

The Segment Anything Model (SAM) stands as a foundational framework for image segmentation. While it exhibits remarkable zero-shot generalization in typical scenarios, its advantage diminishes when applied to specialized domains like medical imagery and remote sensing. To address this limitation, this paper introduces Conv-LoRA, a simple yet effective parameter-efficient fine-tuning approach. By integrating ultra-lightweight convolutional parameters into Low-Rank Adaptation (LoRA), Conv-LoRA can inject image-related inductive biases into the plain ViT encoder, further reinforcing SAM’s local prior assumption. Notably, Conv-LoRA not only preserves SAM’s extensive segmentation knowledge but also revives its capacity of learning high-level image semantics, which is constrained by SAM’s foreground-background segmentation pretraining. Comprehensive experimentation across diverse benchmarks spanning multiple domains underscores Conv-LoRA’s superiority in adapting SAM to real-world semantic segmentation tasks.
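The basic composition is easy to sketch: a frozen weight W is augmented with a LoRA bottleneck, and a very small convolution is applied to the low-rank activations after laying the patch tokens back on their 2D grid, injecting a local inductive bias. The NumPy sketch below is shape-level and illustrative only (a single shared 3x3 filter stands in for the paper's full design, whose details, e.g. any expert routing, are omitted):

```python
import numpy as np

def conv_lora_forward(x, W, A, B, kernel, grid):
    """x: (N, d_in) patch tokens; W: frozen (d_in, d_out); A: (d_in, r); B: (r, d_out).
    kernel: (3, 3) filter shared over the r low-rank channels (a simplification).
    grid: (H, W) patch grid with H * W == N."""
    h = x @ A                       # (N, r) low-rank activations
    H, Wg = grid
    hmap = h.reshape(H, Wg, -1)     # lay tokens back on the 2D patch grid
    out = np.zeros_like(hmap)
    pad = np.pad(hmap, ((1, 1), (1, 1), (0, 0)))
    for i in range(3):              # tiny 3x3 convolution injecting local bias
        for j in range(3):
            out += kernel[i, j] * pad[i:i + H, j:j + Wg]
    h = out.reshape(H * Wg, -1)
    return x @ W + h @ B            # frozen path + convolved low-rank update

# With the LoRA factors initialized to zero, the adapter is an exact no-op.
x = np.random.randn(16, 8)
y = conv_lora_forward(x, np.eye(8), np.zeros((8, 2)), np.zeros((2, 8)),
                      np.ones((3, 3)) / 9.0, (4, 4))
```

The zero-initialization check mirrors standard LoRA practice: at the start of fine-tuning the adapted model behaves exactly like frozen SAM, and only the tiny A, B, and kernel parameters are trained.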

ICLR2024: Towards Seamless Adaptation of Pre-trained Models for Visual Place Recognition

Author: 卢锋

Abstract

Recent studies show that vision models pre-trained in generic visual learning tasks with large-scale data can provide useful feature representations for a wide range of visual perception problems. However, few attempts have been made to exploit pre-trained foundation models in visual place recognition (VPR). Due to the inherent difference in training objectives and data between the tasks of model pre-training and VPR, how to bridge the gap and fully unleash the capability of pre-trained models for VPR is still a key issue to address. To this end, we propose a novel method to realize seamless adaptation of pre-trained models for VPR. Specifically, to obtain both global and local features that focus on salient landmarks for discriminating places, we design a hybrid adaptation method to achieve both global and local adaptation efficiently, in which only lightweight adapters are tuned without adjusting the pre-trained model. Besides, to guide effective adaptation, we propose a mutual nearest neighbor local feature loss, which ensures proper dense local features are produced for local matching and avoids time-consuming spatial verification in re-ranking. Experimental results show that our method outperforms the state-of-the-art methods with less training data and training time, and uses only about 3% of the retrieval runtime of the two-stage VPR methods with RANSAC-based spatial verification. It ranks 1st on the MSLS challenge leaderboard (at the time of submission). The code is released at https://github.com/Lu-Feng/SelaVPR.

AAAI2024: Efficient Conditional Diffusion Model with Probability Flow Sampling for Image Super-resolution

Author: 袁宇韬

Abstract

Image super-resolution is a fundamentally ill-posed problem because multiple valid high-resolution images exist for one low-resolution image. Super-resolution methods based on diffusion probabilistic models can deal with the ill-posed nature by learning the distribution of high-resolution images conditioned on low-resolution images, avoiding the problem of blurry images in PSNR-oriented methods. However, existing diffusion-based super-resolution methods have high time consumption with the use of iterative sampling, while the quality and consistency of generated images are less than ideal due to problems like color shifting. In this paper, we propose Efficient Conditional Diffusion Model with Probability Flow Sampling (ECDP) for image super-resolution. To reduce the time consumption, we design a continuous-time conditional diffusion model for image super-resolution, which enables the use of probability flow sampling for efficient generation. Additionally, to improve the consistency of generated images, we propose a hybrid parametrization for the denoiser network, which interpolates between the data-predicting parametrization and the noise-predicting parametrization for different noise scales. Moreover, we design an image quality loss as a complement to the score matching loss of diffusion models, further improving the consistency and quality of super-resolution. Extensive experiments on DIV2K, ImageNet, and CelebA demonstrate that our method achieves higher super-resolution quality than existing diffusion-based image super-resolution methods while having lower time consumption. Our code is available at https://github.com/Yuan-Yutao/ECDP.
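The hybrid parametrization can be read as a noise-level-dependent blend of the two standard readings of the network output: data prediction at small noise scales and noise prediction at large ones. A toy sketch with an assumed smooth blending schedule (the paper's exact weights may differ; `sigma_mid` is a hypothetical crossover scale):

```python
import numpy as np

def denoise(x_t, f_out, sigma, sigma_mid=1.0):
    """Combine data- and noise-predicting readings of the network output f_out.

    alpha -> 1 (data prediction) as sigma -> 0, and alpha -> 0
    (noise prediction) as sigma grows; sigma_mid sets the crossover scale.
    """
    alpha = 1.0 / (1.0 + (sigma / sigma_mid) ** 2)
    x0_from_data = f_out                   # output read as the clean image
    x0_from_noise = x_t - sigma * f_out    # output read as the added noise
    return alpha * x0_from_data + (1.0 - alpha) * x0_from_noise

# At sigma = 0 the blend reduces to pure data prediction.
x0 = denoise(x_t=np.array([2.0]), f_out=np.array([1.0]), sigma=0.0)
```

The motivation for such a blend is numerical: noise prediction is ill-conditioned near sigma = 0 (the x_t - sigma * f_out term amplifies network error), while data prediction degrades at large sigma, so interpolating between them keeps the denoiser well-behaved across the whole noise range.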

AAAI2024: Blind Face Restoration under Extreme Conditions: Leveraging 3D-2D Prior Fusion for Superior Structural and Texture Recovery

Author: 袁梓洋

Abstract

Blind face restoration under extreme conditions involves reconstructing high-quality face images from severely degraded inputs. These input images are often in poor quality and have extreme facial poses, leading to errors in facial structure and unnatural artifacts within the restored images. In this paper, we show that utilizing 3D priors effectively compensates for structure knowledge deficiencies in 2D priors while preserving the texture details. Based on this, we introduce FREx (Face Restoration under Extreme conditions) that combines structure-accurate 3D priors and texture-rich 2D priors in pretrained generative networks for blind face restoration under extreme conditions. To fuse the different information in 3D and 2D priors, we introduce an adaptive weight module that adjusts the importance of features based on the input image’s condition. With this approach, our model can restore structure-accurate and natural-looking faces even when the images have lost a lot of information due to degradation and extreme pose. Extensive experimental results on synthetic and real-world datasets validate the effectiveness of our methods.

AAAI2024: Mean Teacher DETR with Masked Feature Alignment: A Robust Domain Adaptive Detection Transformer Framework

Author: 翁玮熙

Abstract

Unsupervised domain adaptation object detection (UDAOD) research on the Detection Transformer (DETR) mainly focuses on feature alignment, and existing methods can be divided into two kinds, each of which has its unresolved issues. One-stage feature alignment methods can easily lead to performance fluctuation and training stagnation. Two-stage feature alignment methods based on the mean teacher comprise a pretraining stage followed by a self-training stage, each facing problems in obtaining a reliable pretrained model and achieving consistent performance gains. The methods mentioned above have not yet explored how to utilize a third related domain, such as a target-like domain, to assist adaptation. To address these issues, we propose a two-stage framework named MTM, i.e., Mean Teacher-DETR with Masked Feature Alignment. In the pretraining stage, we utilize labeled target-like images produced by image style transfer to avoid performance fluctuation. In the self-training stage, we leverage unlabeled target images by pseudo labels based on the mean teacher and propose a module called Object Queries Knowledge Transfer (OQKT) to ensure consistent performance gains of the student model. Most importantly, we propose masked feature alignment methods including Masked Domain Query-based Feature Alignment (MDQFA) and Masked Token-wise Feature Alignment (MTWFA) to alleviate domain shift in a more robust way, which not only prevents training stagnation and leads to a robust pretrained model in the pretraining stage, but also enhances the model’s target performance in the self-training stage. Experiments on three challenging scenarios and a theoretical analysis verify the effectiveness of MTM.

AAAI2024: Deep Homography Estimation for Visual Place Recognition

Author: 卢锋

Abstract

Visual place recognition (VPR) is a fundamental task for many applications such as robot localization and augmented reality. Recently, the hierarchical VPR methods have received considerable attention due to the trade-off between accuracy and efficiency. They usually first use global features to retrieve the candidate images, then verify the spatial consistency of matched local features for re-ranking. However, the latter typically relies on the RANSAC algorithm for fitting homography, which is time-consuming and non-differentiable. This makes existing methods compromise to train the network only in global feature extraction. Here, we propose a transformer-based deep homography estimation (DHE) network that takes the dense feature map extracted by a backbone network as input and fits homography for fast and learnable geometric verification. Moreover, we design a re-projection error of inliers loss to train the DHE network without additional homography labels, which can also be jointly trained with the backbone network to help it extract the features that are more suitable for local matching. Extensive experiments on benchmark datasets show that our method can outperform several state-of-the-art methods. It is also more than one order of magnitude faster than the mainstream hierarchical VPR methods using RANSAC. The code is released at https://github.com/Lu-Feng/DHE-VPR.
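The inlier re-projection loss has a simple geometric form: warp query keypoints with the predicted homography and penalize their distance to the matched keypoints, restricted to consistent (inlier) matches. A minimal NumPy sketch with a hypothetical interface (in training, the hard inlier test would need a differentiable relaxation):

```python
import numpy as np

def reprojection_error(H, pts_src, pts_dst, inlier_thresh=3.0):
    """Mean re-projection error over inlier correspondences.

    H: (3, 3) homography; pts_src, pts_dst: (N, 2) matched keypoints."""
    ones = np.ones((pts_src.shape[0], 1))
    proj = np.hstack([pts_src, ones]) @ H.T
    proj = proj[:, :2] / proj[:, 2:3]          # perspective divide
    err = np.linalg.norm(proj - pts_dst, axis=1)
    inliers = err < inlier_thresh              # only consistent matches contribute
    return err[inliers].mean() if inliers.any() else err.mean()

# An identity homography on perfectly matched points gives zero error.
pts = np.array([[0.0, 0.0], [10.0, 5.0], [3.0, 7.0]])
e = reprojection_error(np.eye(3), pts, pts)
```

Because every step here is a dense tensor operation, the loss avoids RANSAC's iterative sampling, which is what makes the geometric verification both fast and end-to-end trainable with the backbone.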

IEEE T-MM: Towards Effective Collaborative Learning in Long-Tailed Recognition

Author: 许正卓

Abstract

Real-world data usually suffers from severe class imbalance and long-tailed distributions, where minority classes are significantly underrepresented compared to the majority ones. Recent research prefers to utilize multi-expert architectures to mitigate model uncertainty on the minority classes, where collaborative learning is employed to aggregate the knowledge of experts, i.e., online distillation. In this article, we observe that the knowledge transfer between experts is imbalanced in terms of class distribution, which results in limited performance improvement on the minority classes. To address this, we propose a re-weighted distillation loss that compares the predictions of two classifiers supervised by online distillation and by label annotations, respectively. We also emphasize that feature-level distillation significantly improves model performance and increases feature robustness. Finally, we propose an Effective Collaborative Learning (ECL) framework that integrates a contrastive proxy task branch to further improve feature quality. Quantitative and qualitative experiments on four standard datasets demonstrate that ECL achieves state-of-the-art performance, and detailed ablation studies manifest the effectiveness of each component in ECL.
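
One plausible way to realize the re-weighting (an illustrative sketch only; the weighting rule and function names are assumptions, not the paper's exact formulation) is to weight the per-sample distillation term by the disagreement between the two classifiers:

```python
import numpy as np

def softmax(z, t=1.0):
    z = z / t
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def reweighted_distill_loss(logits_kd, logits_ce, teacher_logits, t=2.0):
    """Sketch: weight KL-based distillation per sample by how much the
    distillation-supervised and label-supervised classifiers disagree,
    so transfer is not dominated by already-agreed (majority) samples."""
    p_kd, p_ce = softmax(logits_kd), softmax(logits_ce)
    p_t = softmax(teacher_logits, t)
    # per-sample weight: total-variation-style disagreement in [0, 1]
    w = 0.5 * np.abs(p_kd - p_ce).sum(axis=1)
    kl = (p_t * (np.log(p_t + 1e-12) - np.log(p_kd + 1e-12))).sum(axis=1)
    return (w * kl).mean()
```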

ACL2023: LET: Leveraging Error Type Information for Grammatical Error Correction

Author: 李泓嘉

Abstract

Grammatical error correction (GEC) aims to correct errors in given sentences and is significant to many downstream natural language understanding tasks. Recent work introduces the idea of grammatical error detection (GED) to improve GEC performance. However, such explicit multi-stage pipelines propagate and amplify the misclassification errors of the GED module. To introduce more convincing error type information, we propose an end-to-end framework in this paper, which Leverages Error Type (LET) information in the generation process. First, the input text is fed into a classification module to obtain the error type corresponding to each token. Then, we introduce the category information into the decoder's input and cross-attention module in two ways, respectively. Experiments on various datasets show that our proposed method outperforms existing methods by a clear margin.
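
The first injection route, adding category information to the decoder's input, can be sketched as an additive embedding lookup (names and shapes are illustrative assumptions, not from the paper):

```python
import numpy as np

def add_error_type_embedding(token_emb, error_types, type_table):
    """Sketch: add a learned error-type embedding to each decoder
    input token. token_emb: (seq, dim); error_types: (seq,) int ids;
    type_table: (n_types, dim) learned embedding table."""
    return token_emb + type_table[error_types]
```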

ACL2023: Tailoring Instructions to Student’s Learning Levels Boosts Knowledge Distillation

Author: 任昱鑫

Abstract

It has been commonly observed that a teacher model with superior performance does not necessarily result in a stronger student, highlighting a discrepancy between current teacher training practices and effective knowledge transfer. In order to enhance the guidance of the teacher training process, we introduce the concept of distillation influence to determine the impact of distillation from each training sample on the student’s generalization ability. In this paper, we propose Learning Good Teacher Matters (LGTM), an efficient training technique for incorporating distillation influence into the teacher’s learning process. By prioritizing samples that are likely to enhance the student’s generalization ability, our LGTM outperforms 10 common knowledge distillation baselines on 6 text classification tasks in the GLUE benchmark.

IEEE T-MM: HHF: Hashing-guided Hinge Function for Deep Hashing Retrieval

Author: 徐呈寅

Abstract

Deep hashing has shown promising performance in large-scale image retrieval. The hashing process utilizes Deep Neural Networks (DNNs) to embed images into compact continuous latent codes, then maps them into binary codes by a hashing function for efficient retrieval. Recent approaches apply a metric loss and a quantization loss to supervise the two procedures, which cluster samples of the same categories and alleviate semantic information loss after binarization in an end-to-end training framework. However, we observe an incompatibility: the optimal cluster positions are not identical to the ideal hash positions because of the different objectives of the two loss terms, which leads to severe ambiguity and erroneous hashing after the binarization process. To address the problem, we borrow the theory of minimum-distance bounds for binary linear codes to design an inflection point that depends on the hash bit length and the number of categories, and thereby propose the Hashing-guided Hinge Function (HHF), which explicitly terminates the metric loss to prevent negative pairs from being pushed apart without limit. This modification is proven effective and essential for training, simultaneously contributing to proper intra- and inter-class distances for clusters and to better hash positions for accurate image retrieval. Extensive experiments on CIFAR-10, CIFAR-100, ImageNet, and MS-COCO justify that HHF consistently outperforms existing techniques and can be flexibly transplanted into other methods. Code is available at https://github.com/JerryXu0129/HHF .
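
The hinge idea can be sketched as follows (the inflection-point formula here is a placeholder to illustrate the mechanism; the paper derives it from minimum-distance bounds, which we do not reproduce):

```python
import numpy as np

def hashing_guided_hinge(dist_neg, n_bits=32, n_classes=10):
    """Sketch: stop penalizing negative pairs once they are farther
    apart than an inflection point d_star that depends on the hash bit
    length and the number of categories (formula here is illustrative)."""
    d_star = n_bits / 2.0 * (1.0 - 1.0 / n_classes)  # assumed inflection point
    # hinge: only negatives closer than d_star contribute loss
    return np.maximum(0.0, d_star - dist_neg)
```

Negatives already beyond `d_star` incur zero loss, so the metric term no longer fights the quantization term by pushing clusters past their ideal hash positions.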

ACM MM2023: Enhanced Image Deblurring: An Efficient Frequency Exploitation and Preservation Network

Author: 董姝婷

Abstract

Most existing frequency-based deblurring methods have two major limitations: (1) insufficient exploitation of frequency information, and (2) inadequate preservation of frequency information. In this paper, we propose a novel Efficient Frequency Exploitation and Preservation Network (EFEP) to address these limitations. Firstly, we propose a novel Frequency-Balanced Exploitation Encoder (FBE-Encoder) to sufficiently exploit frequency information. We insert a novel Frequency-Balanced Navigator (FBN) module into the encoder, which establishes a dynamic balance that adaptively explores and integrates the correlations between frequency features and the other features present in the network. It can also highlight the most important regions in the frequency features. Secondly, considering that frequency information is inevitably lost in deep network architectures, we present an Enhanced Selective Frequency Decoder (ESF-Decoder) that not only effectively reduces spatial information redundancy, but also fully exploits the differing importance of various frequency information to supplement valid spatial information and suppress invalid information. Thirdly, each encoder/decoder block of EFEP consists of multiple Contrastive Residual Blocks (CRBs), which are designed to explicitly compute and incorporate feature distinctions. Powered by the above designs, our EFEP outperforms state-of-the-art models on both quantitative and qualitative evaluations.

IJCAI2023: DFVSR: Directional Frequency Video Super-Resolution via Asymmetric and Enhancement Alignment Network

Author: 董姝婷

Abstract

Recently, frequency-based techniques have gained significant attention, as they exhibit exceptional restoration capabilities for detail and structure in video super-resolution tasks. However, most of these frequency-based methods have three major limitations: 1) insufficient exploration of object motion information, 2) inadequate enhancement of high-fidelity regions, and 3) loss of spatial information during convolution. In this paper, we propose a novel network, Directional Frequency Video Super-Resolution (DFVSR), to address these limitations. Specifically, we reconsider object motion from a new perspective and propose Directional Frequency Representation (DFR), which not only inherits the ability of frequency representations to capture detail and structure information but also encodes the direction of object motion, which is extremely significant in videos. Based on this representation, we propose Directional Frequency-Enhanced Alignment (DFEA), which applies double enhancement of task-related information to ensure the retention of high-fidelity frequency regions and generate high-quality alignment features. Furthermore, we design a novel asymmetrical U-shaped network architecture to progressively fuse these alignment features and produce the final output. This architecture enables intercommunication between same-resolution levels of the encoder and decoder to supplement spatial information. Powered by the above designs, our method achieves superior performance over state-of-the-art models on both quantitative and qualitative evaluations.

ACM MM2023: Adaptive Contrastive Learning for Learning Robust Representations under Label Noise

Author: 王子浩

Abstract

Deep Neural Networks suffer significant performance degeneration when noisy labels corrupt latent data representations. Previous work has attempted to alleviate this problem by exploiting contrastive learning, for which pair building is critical. However, existing methods either conduct sample-level selection and then use the resultant subset to construct pairs, or directly perform pair-level selection using a fixed threshold, both leading to sub-optimal pairing and subsequent representation learning. To address this issue, we propose a novel adaptive contrastive learning method (ACL) that works at the pair level to select contrastive pairs adaptively. Specifically, we adjust the confidence threshold in a self-adaptive manner according to the model's learning status instead of fixing it. Then, to counter the ineffectiveness of thresholding on unconfident pairs, we automatically apply an instance-specific temperature to boost the confidence of accurately-predicted samples and their pairs. We further introduce temporal cross-ensembling to handle the impact of noisy labels on model predictions. As a result, diverse pairs are correctly selected for contrastive learning to induce discriminative representations robust to various types of label noise. Extensive experimental results on several standard benchmarks and real-world datasets indicate the superiority of ACL, especially in extremely noisy scenarios.
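
The two adaptive ingredients can be sketched as follows (a minimal illustration under assumed rules; the actual schedules and names in the paper may differ):

```python
import numpy as np

def adaptive_threshold(confidences, base=0.5):
    """Sketch: raise the pair-selection threshold as the model's mean
    confidence grows, one plausible proxy for its 'learning status'."""
    return base + (1.0 - base) * confidences.mean()

def sharpen(probs, conf, t_min=0.5):
    """Sketch of instance-specific temperature: sharpen predictions of
    above-average-confidence samples (temperature < 1) so they and
    their pairs clear the selection threshold."""
    t = np.where(conf > conf.mean(), t_min, 1.0)[:, None]
    p = probs ** (1.0 / t)                       # temperature scaling
    return p / p.sum(axis=1, keepdims=True)      # renormalize rows
```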

ICCV2023 Best Paper Candidate: When Noisy Labels Meet Long Tail Dilemmas: A Representation Calibration Method

Author: 张曼怡

Abstract

Real-world large-scale datasets are both noisily labeled and class-imbalanced. These issues seriously hurt the generalization of trained models. It is hence significant to address incorrect labeling and class imbalance simultaneously, i.e., the problem of learning with noisy labels on long-tailed data. Previous works have developed several methods for this problem, but they rely on strong assumptions that are invalid or hard to verify in practice. In this paper, to handle the problem and address the limitations of prior works, we propose a representation calibration method, RCAL. Specifically, RCAL works with the representations extracted by unsupervised contrastive learning. We assume that, absent incorrect labeling and class imbalance, the representations of instances in each class conform to a multivariate Gaussian distribution, a much milder assumption that is easier to verify. Based on this assumption, we recover the underlying representation distributions from the polluted ones resulting from mislabeled and class-imbalanced data. Additional data points are then sampled from the recovered distributions to help generalization. Moreover, during classifier training, representation learning takes advantage of the representation robustness brought by contrastive learning, which further improves classifier performance. We derive theoretical results to discuss the effectiveness of our representation calibration. Experiments on multiple benchmarks justify our claims and confirm the superiority of the proposed method.
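
The sampling step can be sketched directly from the Gaussian assumption (a minimal illustration; the paper's recovery of distributions from polluted data is more involved and is not shown here):

```python
import numpy as np

def calibrate_and_sample(feats, n_new, seed=0):
    """Sketch: fit a multivariate Gaussian to one class's (possibly
    scarce) representations, then draw extra points to aid
    generalization. feats: (N, D) features of a single class."""
    rng = np.random.default_rng(seed)
    mu = feats.mean(axis=0)
    # small diagonal jitter keeps the covariance well-conditioned
    cov = np.cov(feats, rowvar=False) + 1e-6 * np.eye(feats.shape[1])
    return rng.multivariate_normal(mu, cov, size=n_new)
```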

ICCV2023: HiFace: High-Fidelity 3D Face Reconstruction by Learning Static and Dynamic Details

Author: 柴增豪

Abstract

3D Morphable Models (3DMMs) demonstrate great potential for reconstructing faithful and animatable 3D facial surfaces from a single image. The facial surface is influenced by the coarse shape, as well as the static detail (e.g., person-specific appearance) and dynamic detail (e.g., expression-driven wrinkles). Previous work struggles to decouple the static and dynamic details through image-level supervision, leading to reconstructions that are not realistic. In this paper, we aim at high-fidelity 3D face reconstruction and propose HiFace to explicitly model the static and dynamic details. Specifically, the static detail is modeled as the linear combination of a displacement basis, while the dynamic detail is modeled as the linear interpolation of two displacement maps with polarized expressions. We exploit several loss functions to jointly learn the coarse shape and fine details with both synthetic and real-world datasets, which enables HiFace to reconstruct high-fidelity 3D shapes with animatable details. Extensive quantitative and qualitative experiments demonstrate that HiFace presents state-of-the-art reconstruction quality and faithfully recovers both the static and dynamic details.

ICCV2023: Accurate 3D Face Reconstruction with Facial Component Tokens

Author: 章天珂

Abstract

Accurately reconstructing 3D faces from monocular images and videos is crucial for various applications, such as digital avatar creation. However, the current deep learning-based methods face significant challenges in achieving accurate reconstruction with disentangled facial parameters and ensuring temporal stability in single-frame methods for 3D face tracking on video data. In this paper, we propose TokenFace, a transformer-based monocular 3D face reconstruction model. TokenFace uses separate tokens for different facial components to capture information about different facial parameters and employs temporal transformers to capture temporal information from video data. This design can naturally disentangle different facial components and is flexible to both 2D and 3D training data. Trained on hybrid 2D and 3D data, our model shows its power in accurately reconstructing faces from images and producing stable results for video data. Experimental results on popular benchmarks NoW and Stirling demonstrate that TokenFace achieves state-of-the-art performance, outperforming existing methods on all metrics by a large margin.

ICCVW2023: Effective Whole-body Pose Estimation with Two-stages Distillation

Author: 杨震东

Abstract

Whole-body pose estimation localizes the human body, hand, face, and foot keypoints in an image. This task is challenging due to multi-scale body parts, fine-grained localization for low-resolution regions, and data scarcity. Meanwhile, there is an urgent need for a highly efficient and accurate pose estimator applicable to a wide range of human-centric understanding and generation tasks. In this work, we present a two-stage pose distillation method for whole-body pose estimators, named DWPose, to improve their effectiveness and efficiency. The first-stage distillation designs a weight-decay strategy while utilizing a teacher's intermediate features and final logits, with both visible and invisible keypoints, to supervise the student from scratch. The second stage distills the student model itself to further improve performance. Different from previous self-knowledge distillation, this stage finetunes the student's head with only 20% of the training time as a plug-and-play training strategy. To address data limitations, we explore the UBody dataset, which contains diverse facial expressions and hand gestures for real-life applications. Comprehensive experiments show the superiority of our simple yet effective methods. We achieve new state-of-the-art performance on COCO-WholeBody, significantly boosting the whole-body AP of RTMPose-l from 64.8% to 66.5%, even surpassing the RTMPose-x teacher with 65.3% AP. We release a series of models of different sizes, from tiny to large, to satisfy various downstream tasks. Our code and models are available at https://github.com/IDEA-Research/DWPose.

ICCV2023: From Knowledge Distillation to Self-Knowledge Distillation: A Unified Approach with Normalized Loss and Customized Soft Labels

Author: 杨震东

Abstract

Knowledge Distillation (KD) uses the teacher's prediction logits as soft labels to guide the student, while self-KD obtains soft labels without a real teacher. This work unifies the formulations of the two tasks by decomposing and reorganizing the generic KD loss into a Normalized KD (NKD) loss and customized soft labels for both the target class (the image's category) and non-target classes, named Universal Self-Knowledge Distillation (USKD). We decompose the KD loss and find that its non-target part forces the student's non-target logits to match the teacher's; however, the sums of the two sets of non-target logits differ, preventing them from being identical. NKD normalizes the non-target logits to equalize their sums. It can be generally used in KD and self-KD to better exploit the soft labels in the distillation loss. USKD generates customized soft labels for both target and non-target classes without a teacher. It smooths the target logit of the student as the soft target label and uses the rank of the intermediate features to generate the soft non-target labels with Zipf's law. For KD with teachers, our NKD achieves state-of-the-art performance on the CIFAR-100 and ImageNet datasets, boosting the ImageNet Top-1 accuracy of ResNet18 from 69.90% to 71.96% with a ResNet-34 teacher. For self-KD without teachers, USKD is the first self-KD method that can be effectively applied to both CNN and ViT models with negligible additional time and memory cost, yielding new state-of-the-art results, such as 1.17% and 0.55% accuracy gains on ImageNet for MobileNet and DeiT-Tiny, respectively. Code is available at https://github.com/yzd-v/cls_KD.
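
The non-target normalization can be sketched as follows (a minimal NumPy illustration of the idea; temperature scaling and the target-class term are omitted, and details may differ from the paper):

```python
import numpy as np

def nkd_nontarget_loss(student_logits, teacher_logits, target):
    """Sketch: renormalize student and teacher non-target probabilities
    to sum to one, then match them with a cross-entropy-style loss, so
    differing non-target sums no longer prevent an exact match."""
    def probs(z):
        e = np.exp(z - z.max(axis=1, keepdims=True))
        return e / e.sum(axis=1, keepdims=True)
    ps, pt = probs(student_logits), probs(teacher_logits)
    rows = np.arange(len(target))
    mask = np.ones_like(ps, dtype=bool)
    mask[rows, target] = False                    # drop the target class
    ps_n = ps[mask].reshape(len(target), -1)
    pt_n = pt[mask].reshape(len(target), -1)
    ps_n /= ps_n.sum(axis=1, keepdims=True)       # normalize: sums now equal
    pt_n /= pt_n.sum(axis=1, keepdims=True)
    return -(pt_n * np.log(ps_n + 1e-12)).sum(axis=1).mean()
```

After normalization, the loss is minimized exactly when the student's non-target distribution equals the teacher's.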

ICCV2023: Make Encoder Great Again in 3D GAN Inversion through Geometry and Occlusion-Aware Encoding

Author: 袁梓洋

Abstract

3D GAN inversion aims to achieve high reconstruction fidelity and reasonable 3D geometry simultaneously from a single image input. However, existing 3D GAN inversion methods rely on time-consuming optimization for each individual case. In this work, we introduce a novel encoder-based inversion framework based on EG3D, one of the most widely-used 3D GAN models. We leverage the inherent properties of EG3D’s latent space to design a discriminator and a background depth regularization. This enables us to train a geometry-aware encoder capable of converting the input image into corresponding latent code. Additionally, we explore the feature space of EG3D and develop an adaptive refinement stage that improves the representation ability of features in EG3D to enhance the recovery of fine-grained textural details. Finally, we propose an occlusion-aware fusion operation to prevent distortion in unobserved regions. Our method achieves impressive results comparable to optimization-based methods while operating up to 500 times faster. Our framework is well-suited for applications such as semantic editing.

ICML2023: UPop: Unified and Progressive Pruning for Compressing Vision-Language Transformers

Author: 石大川

Abstract

Real-world data contains a vast amount of multimodal information, among which vision and language are the two most representative modalities. Moreover, increasingly heavy models, e.g., Transformers, have drawn researchers' attention to model compression. However, how to compress multimodal models, especially vision-language Transformers, is still under-explored. This paper proposes Unified and Progressive Pruning (UPop) as a universal vision-language Transformer compression framework, which incorporates 1) unifiedly searching multimodal subnets in a continuous optimization space from the original model, enabling automatic assignment of pruning ratios among compressible modalities and structures; and 2) progressively searching and retraining the subnet, which maintains convergence between search and retraining to attain higher compression ratios. Experiments on various tasks, datasets, and model architectures demonstrate the effectiveness and versatility of the proposed UPop framework. The code is available at https://github.com/sdc17/UPop.

ICML2023: Learning to Learn from APIs: Black-Box Data-Free Meta-Learning

Author: 胡梓轩

Abstract

Data-free meta-learning (DFML) aims to enable efficient learning of new tasks by meta-learning from a collection of pre-trained models without access to the training data. Existing DFML work can only meta-learn from (i) white-box and (ii) small-scale pre-trained models (iii) with the same architecture, neglecting the more practical setting where users only have inference access to APIs with arbitrary model architectures and scales inside. To solve this issue, we propose a Bi-level Data-free Meta Knowledge Distillation (BiDf-MKD) framework to transfer more general meta knowledge from a collection of black-box APIs to a single meta model. Specifically, by just querying APIs, we invert each API to recover its training data via a zero-order gradient estimator and then perform meta-learning via a novel bi-level meta knowledge distillation structure, in which we design a boundary query set recovery technique to recover a more informative query set near the decision boundary. In addition, to encourage better generalization under limited API budgets, we propose task memory replay to diversify the underlying task distribution by covering more interpolated tasks. Extensive experiments in various real-world scenarios show the superior performance of our BiDf-MKD framework.

ICRA2023: AANet: Aggregation and Alignment Network with Semi-hard Positive Sample Mining for Hierarchical Place Recognition

Author: 卢锋

Abstract

Visual place recognition (VPR) is one of the research hotspots in robotics, which uses visual information to localize robots. Recently, hierarchical two-stage VPR methods have become popular in this field due to their trade-off between accuracy and efficiency. These methods retrieve the top-k candidate images using global features in the first stage, then re-rank the candidates by matching local features in the second stage. However, they usually require additional algorithms (e.g., RANSAC) for geometric consistency verification in re-ranking, which is time-consuming. Here we propose a Dynamically Aligning Local Features (DALF) algorithm to align the local features under spatial constraints. It is significantly more efficient than methods that need geometric consistency verification. We present a unified network capable of extracting global features for retrieving candidates via an aggregation module and aligning local features for re-ranking via the DALF alignment module. We call this network AANet. Meanwhile, many works use only the easiest positive samples in triplets for weakly supervised training, which limits the network's ability to recognize harder positive pairs. To address this issue, we propose a Semi-hard Positive Sample Mining (ShPSM) strategy to select appropriate hard positive images for training more robust VPR networks. Extensive experiments on four benchmark VPR datasets show that the proposed AANet outperforms several state-of-the-art methods with less time consumption. The code is released at https://github.com/Lu-Feng/AANet.

CVPR2023: High-fidelity Facial Avatar Reconstruction from Monocular Video with Generative Priors

Author: 白云鹏

Abstract

High-fidelity facial avatar reconstruction from a monocular video is a significant research problem in computer graphics and computer vision. Recently, Neural Radiance Fields (NeRF) have shown impressive novel view rendering results and have been considered for facial avatar reconstruction. However, the complex facial dynamics and missing 3D information in monocular videos raise significant challenges for faithful facial reconstruction. In this work, we propose a new method for NeRF-based facial avatar reconstruction that utilizes a 3D-aware generative prior. Different from existing works that depend on a conditional deformation field for dynamic modeling, we propose to learn a personalized generative prior, which is formulated as a local and low-dimensional subspace in the latent space of a 3D-GAN. We propose an efficient method to construct the personalized generative prior based on a small set of facial images of a given individual. After learning, it allows photo-realistic rendering from novel views, and face reenactment can be realized by navigating the latent space. Our proposed method is applicable to different driving signals, including RGB images, 3DMM coefficients, and audio. Compared with existing works, we obtain superior novel view synthesis results and faithful face reenactment performance. The code is available at https://github.com/bbaaii/HFA-GP.

CVPR2023: Learning Imbalanced Data with Vision Transformers

Author: 许正卓

Abstract

Real-world data tends to be heavily imbalanced and severely skews data-driven deep neural networks, which makes Long-Tailed Recognition (LTR) a highly challenging task. Existing LTR methods seldom train Vision Transformers (ViTs) with Long-Tailed (LT) data, while the off-the-shelf pretrained weights of ViTs always lead to unfair comparisons. In this paper, we systematically investigate the performance of ViTs in LTR and propose LiVT to train ViTs from scratch using only LT data. With the observation that ViTs suffer from more severe LTR problems, we conduct Masked Generative Pretraining (MGP) to learn generalized features. With ample and solid evidence, we show that MGP is more robust than supervised manners. Although the Binary Cross Entropy (BCE) loss performs well with ViTs, it struggles on LTR tasks. We further propose a balanced BCE to ameliorate it with strong theoretical groundings. Specifically, we derive the unbiased extension of the Sigmoid and compensate with extra logit margins to deploy it. Our Bal-BCE contributes to the quick convergence of ViTs within just a few epochs. Extensive experiments demonstrate that with MGP and Bal-BCE, LiVT successfully trains ViTs without any additional data and significantly outperforms comparable state-of-the-art methods, e.g., our ViT-B achieves 81.0% Top-1 accuracy on iNaturalist 2018 without bells and whistles. Code is available at https://github.com/XuZhengzhuo/LiVT.
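
The margin compensation can be sketched as a prior shift on the logits before the sigmoid (an illustrative sketch under the assumption that the margin is the log class prior; the paper's exact derivation may differ):

```python
import numpy as np

def bal_bce_loss(logits, labels_onehot, class_counts):
    """Sketch of a balanced BCE: shift each class logit by its log
    prior before the sigmoid, giving rare classes a compensating
    margin. logits/labels_onehot: (N, C); class_counts: (C,)."""
    prior = class_counts / class_counts.sum()
    z = logits + np.log(prior)[None, :]          # prior-compensated logits
    p = 1.0 / (1.0 + np.exp(-z))                 # per-class sigmoid
    eps = 1e-12
    return -(labels_onehot * np.log(p + eps)
             + (1 - labels_onehot) * np.log(1 - p + eps)).mean()
```

With balanced counts the shift is uniform across classes, so the loss reduces to a plain (shifted) BCE; the compensation only matters under imbalance.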

IEEE TCSVT: Task-adaptive Feature Disentanglement and Hallucination for Few-shot Classification

Author: 胡梓轩

Abstract

Few-shot classification is a challenging task in computer vision and is critical in data-sparse scenarios such as rare disease diagnosis. Feature augmentation is a straightforward way to alleviate the data-sparsity issue in few-shot classification. However, mimicking the original feature distribution from a small amount of data is challenging. Existing augmentation-based methods are task-agnostic: the augmented features do not achieve optimal intra-class diversity and inter-class discriminability for a given task. To address this drawback, we propose a novel Task-adaptive Feature Disentanglement and Hallucination framework, dubbed TaFDH. Concretely, we first perceive the task information to disentangle the original feature into two components: class-irrelevant and class-specific features. Then more class-irrelevant features are decoded from a learned variational distribution and fused with the class-specific feature to obtain the augmented features. Finally, a generalized prior distribution over a quadratic classifier is meta-learned, which can be quickly adapted to the class-specific posterior, further alleviating the inadequacy and uncertainty of feature hallucination through the nature of Bayesian inference. In this way, we construct a more discriminable embedding space with reasonable intra-class diversity instead of simply restoring the original embedding space, which leads to a more precise decision boundary. We obtain augmented features with enhanced inter-class discriminability by highlighting the most discriminable part, while boosting intra-class diversity by fusing with the diverse generated class-irrelevant parts. Experiments on five multi-grained few-shot classification datasets demonstrate the superiority of our method.

CVPR2023: Architecture, Dataset and Model-Scale Agnostic Data-free Meta-Learning

Author: 胡梓轩

Abstract

The goal of data-free meta-learning is to learn useful prior knowledge from a collection of pre-trained models without accessing their training data. However, existing works only solve the problem in parameter space, which (i) ignores the fruitful data knowledge contained in the pre-trained models; (ii) cannot scale to large-scale pre-trained models; and (iii) can only meta-learn pre-trained models with the same network architecture. To address those issues, we propose a unified framework, dubbed PURER, which contains: (1) ePisode cUrriculum inveRsion (ECI) during data-free meta training; and (2) invErsion calibRation following inner loop (ICFIL) during meta testing. During meta training, we propose ECI to perform pseudo episode training for learning to adapt quickly to new unseen tasks. Specifically, we progressively synthesize a sequence of pseudo episodes by distilling the training data from each pre-trained model. ECI adaptively increases the difficulty level of pseudo episodes according to the real-time feedback of the meta model. We formulate the optimization process of meta training with ECI as an adversarial form in an end-to-end manner. During meta testing, we further propose a simple plug-and-play supplement, ICFIL, used only during meta testing to narrow the gap between the meta training and meta testing task distributions. Extensive experiments in various real-world scenarios show the superior performance of our method.

CVPR2023: Uni-Perceiver v2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks

Author: 江晓湖

Abstract

Despite the remarkable success of foundation models, their task-specific fine-tuning paradigm makes them inconsistent with the goal of general perception modeling. The key to eliminating this inconsistency is to use generalist models for general task modeling. However, existing attempts at generalist models are inadequate in both versatility and performance. In this paper, we propose Uni-Perceiver v2, the first generalist model capable of handling major large-scale vision and vision-language tasks with competitive performance. Specifically, images are encoded as general region proposals, while texts are encoded via a Transformer-based language model. The encoded representations are transformed by a task-agnostic decoder. Different tasks are formulated as a unified maximum likelihood estimation problem. We further propose an effective optimization technique named Task-Balanced Gradient Normalization to ensure stable multi-task learning with an unmixed sampling strategy, which is helpful for tasks requiring large batch-size training. After being jointly trained on various tasks, Uni-Perceiver v2 is capable of directly handling downstream tasks without any task-specific adaptation. Results show that Uni-Perceiver v2 outperforms all existing generalist models in both versatility and performance. Meanwhile, compared with commonly-recognized strong baselines that require task-specific fine-tuning, Uni-Perceiver v2 achieves competitive performance on a broad range of vision and vision-language tasks.

ICASSP2023: FREQUENCY RECIPROCAL ACTION AND FUSION FOR SINGLE IMAGE SUPER-RESOLUTION

Author: 董姝婷

Abstract

Frequency-based methods have recently received much attention due to their impressive restoration of detail and structure in single image super-resolution (SISR). However, most of these methods mainly use frequency information as an auxiliary means while ignoring the correlations and pixel distribution differences among various frequencies. To address these limitations, we propose a novel Frequency Reciprocal Action and Fusion Network (FRAF) that explores various frequency correlations and differences. Specifically, we design a Frequency Reciprocal Action (FRA) module, which safely enhances valid spatial information and decreases unnecessary repetition through reciprocal action among various spatial frequencies, to generate refined high- and low-frequency features. These refined frequency features then progressively guide the recovery of details and structure, respectively. Furthermore, we develop a Detail and Structure Fusion (DSF) module to adaptively select, enhance, and fuse the features to output the final HR image. This ensures the final image is of high quality, with rich details and a clear structure. Experimental results demonstrate that our method achieves superior performance over state-of-the-art (SOTA) approaches on both quantitative and qualitative evaluations.

ICASSP2023: Rethink Long-Tailed Recognition With Vision Transformers

Author: 许正卓

Abstract

In the real world, data tends to follow long-tailed distributions w.r.t. class or attribute, motivating the challenging Long-Tailed Recognition (LTR) problem. In this paper, we revisit recent LTR methods with promising Vision Transformers (ViT). We find that 1) ViT is hard to train with long-tailed data, and 2) ViT learns generalized features in an unsupervised manner, such as masked generative training, on either long-tailed or balanced datasets. Hence, we propose to adopt unsupervised learning to utilize long-tailed data. Furthermore, we propose Predictive Distribution Calibration (PDC) as a novel metric for LTR that quantifies a model's tendency to simply classify inputs into common classes, measuring the calibration of its predictive preferences. On this basis, we find that many LTR approaches alleviate this bias only slightly, despite their accuracy improvements. Extensive experiments on benchmark datasets validate that PDC precisely reflects a model's predictive preference, consistent with the visualizations.
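
The abstract does not give PDC's formula. One rough stand-in for measuring a model's preference toward common classes is the KL divergence between its mean predicted distribution and the true class prior; the function below is an illustrative assumption, not the paper's exact metric:

```python
import numpy as np

def predictive_bias(probs: np.ndarray, class_prior: np.ndarray) -> float:
    """KL(mean predicted distribution || class prior). Zero means the
    model's average prediction matches the data distribution; larger
    values indicate a preference toward some (typically head) classes."""
    mean_pred = probs.mean(axis=0)
    mean_pred = mean_pred / mean_pred.sum()   # renormalize for safety
    eps = 1e-12                               # avoid log(0)
    return float(np.sum(mean_pred * np.log((mean_pred + eps) /
                                           (class_prior + eps))))
```

A perfectly calibrated predictor over a balanced test set would score near zero, while a head-class-biased predictor scores strictly above it.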

IEEE TNNLS: PatchNet: Maximize the Exploration of Congeneric Semantics for Weakly Supervised Semantic Segmentation

Author: 张可

Abstract

With the growing volume of image data and the scarcity of corresponding labels, weakly supervised learning has recently drawn much attention in computer vision, especially for the fine-grained semantic segmentation problem. To spare human effort on expensive pixel-by-pixel annotations, our method focuses on weakly supervised semantic segmentation (WSSS) with image-level labels, which are much easier to obtain. Since a considerable gap exists between pixel-level segmentation and image-level labels, how to propagate image-level semantic information to each pixel is an important question. To explore congeneric semantic regions of the same class to the maximum extent, we construct the patch-level semantic augmentation network (PatchNet) based on self-detected patches from different images that contain the same class labels. Patches frame the objects as tightly as possible while including as little background as possible. The patch-level semantic augmentation network, established with patches as the nodes, maximizes the mutual learning of similar objects. We regard the embedding vectors of patches as nodes and use a transformer-based complementary learning module to construct weighted edges according to the embedding similarity between different nodes. Moreover, to better supplement semantic information, we propose soft-complementary loss functions matched with the whole network structure. We conduct experiments on the popular PASCAL VOC 2012 and MS COCO 2014 benchmarks, and our model yields state-of-the-art performance.
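
The graph construction described above, with patch embeddings as nodes and edges weighted by embedding similarity, can be sketched with a plain cosine-similarity adjacency matrix. PatchNet's transformer-based complementary learning module is more elaborate; this is a minimal stand-in with an assumed function name:

```python
import numpy as np

def similarity_edges(embeddings: np.ndarray) -> np.ndarray:
    """Weighted adjacency over patch embeddings (one row per patch):
    cosine similarity between nodes, diagonal zeroed so a patch does
    not reinforce itself."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)  # unit-normalize rows
    adj = unit @ unit.T                              # pairwise cosine sims
    np.fill_diagonal(adj, 0.0)
    return adj
```

Patches of the same object class would yield high off-diagonal weights, letting their semantics be aggregated across images.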

AAAI2023: Truncate-Split-Contrast: A Framework for Learning from Mislabeled Videos

Author: 王子啸

Abstract

Learning with noisy labels is a classic problem that has been extensively studied for image tasks, but much less so for videos. A straightforward migration from images to videos that ignores temporal semantics and computational cost is not a sound choice. In this paper, we propose two new strategies for video analysis with noisy labels: 1) a lightweight channel selection method dubbed Channel Truncation for feature-based label-noise detection, which selects the most discriminative channels to split clean and noisy instances in each category; 2) a novel contrastive strategy dubbed Noise Contrastive Learning, which constructs the relationship between clean and noisy instances to regularize model training. Experiments on three well-known benchmark datasets for video classification show that our proposed truNcatE-split-contrAsT (NEAT) significantly outperforms the existing baselines. By reducing the feature dimension to 10% of its original size, our method achieves a noise-detection F1-score above 0.4 and a 5% classification accuracy improvement on the Mini-Kinetics dataset under severe noise (symmetric-80%). Thanks to Noise Contrastive Learning, the average classification accuracy improvement on Mini-Kinetics and Sth-Sth-V1 exceeds 1.6%.
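
One simple reading of Channel Truncation, keeping only the most discriminative feature channels within a category, is to rank channels by some per-channel statistic and retain the top k. The variance-based score below is an assumption for illustration, not the paper's exact criterion:

```python
import numpy as np

def truncate_channels(feats: np.ndarray, k: int) -> np.ndarray:
    """Given an (instances x channels) feature matrix for one category,
    keep the k channels with the highest cross-instance variance, a
    simple proxy for channels that separate clean from noisy samples."""
    scores = feats.var(axis=0)                 # per-channel variance
    keep = np.argsort(scores)[::-1][:k]        # indices of top-k channels
    return np.sort(keep)                       # sorted for stable indexing
```

Truncating to roughly 10% of the channels, as the abstract reports, would mean calling this with k equal to a tenth of the feature dimension before running noise detection in the reduced space.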

AAAI2023: Darwinian Model Upgrades: Model Evolving with Selective Compatibility

Author: 张斌杰

Abstract

The traditional model upgrading paradigm for retrieval requires recomputing all gallery embeddings before deploying the new model (dubbed "backfilling"), which is expensive and time-consuming given the billions of instances in industrial applications. BCT presents the first step towards backward-compatible model upgrades that get rid of backfilling. It is workable but leaves the new model in a dilemma between new-feature discriminativeness and new-to-old compatibility due to its undifferentiated compatibility constraints. In this work, we propose Darwinian Model Upgrades (DMU), which disentangles inheritance and variation in model evolution through selective backward compatibility and forward adaptation, respectively. The old-to-new heritable knowledge is measured by old-feature discriminativeness, and the gallery features, especially those of poor quality, are evolved in a lightweight manner to become more adaptive in the new latent space. We demonstrate the superiority of DMU through comprehensive experiments on large-scale landmark retrieval and face recognition benchmarks. DMU effectively alleviates new-to-new degradation while improving new-to-old compatibility, rendering a more proper model upgrading paradigm for large-scale retrieval systems. Code: https://github.com/TencentARC/OpenCompatible.
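
Backward-compatible training in this line of work is commonly enforced with a triplet-style constraint that keeps a new-model query close to its matching old-model gallery feature and away from non-matches, so old embeddings remain searchable without backfilling. The hinge loss below sketches that general idea under illustrative names and a made-up margin; it is not DMU's actual selective objective:

```python
import numpy as np

def compat_triplet_loss(new_q: np.ndarray, old_pos: np.ndarray,
                        old_neg: np.ndarray, margin: float = 0.2) -> float:
    """Hinge loss on cosine similarity: pull the new-model query toward
    its old-model positive and push it from an old-model negative."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(0.0, margin - cos(new_q, old_pos) + cos(new_q, old_neg))
```

DMU's "selective" compatibility would apply such constraints only where the old features are worth inheriting, rather than uniformly as in BCT.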