19 44 19

Xiangtai Li

LXT

https://lxtgh.github.io/

AI & ML interests

Computer Vision, Multi-Modal Understanding, Generative AI

Recent Activity

upvoted a paper 4 days ago

Does Hearing Help Seeing? Investigating Audio-Video Joint Denoising for Video Generation

upvoted a paper 18 days ago

Can World Simulators Reason? Gen-ViRe: A Generative Visual Reasoning Benchmark

upvoted a paper 20 days ago

MMaDA-Parallel: Multimodal Large Diffusion Language Models for Thinking-Aware Editing and Generation

View all activity

Organizations

upvoted a paper 4 days ago

Does Hearing Help Seeing? Investigating Audio-Video Joint Denoising for Video Generation

Paper • 2512.02457 • Published 5 days ago • 12

upvoted a paper 18 days ago

Can World Simulators Reason? Gen-ViRe: A Generative Visual Reasoning Benchmark

Paper • 2511.13853 • Published 20 days ago • 34

upvoted a paper 20 days ago

MMaDA-Parallel: Multimodal Large Diffusion Language Models for Thinking-Aware Editing and Generation

Paper • 2511.09611 • Published 25 days ago • 68

upvoted a paper 26 days ago

Visual Spatial Tuning

Paper • 2511.05491 • Published 30 days ago • 49

upvoted 5 papers about 1 month ago

PairUni: Pairwise Training for Unified Multimodal Language Models

Paper • 2510.25682 • Published Oct 29 • 13

Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark

Paper • 2510.26802 • Published Oct 30 • 33

upvoted a paper about 2 months ago

Grasp Any Region: Towards Precise, Contextual Pixel Understanding for Multimodal LLMs

Paper • 2510.18876 • Published Oct 21 • 36

upvoted a collection about 2 months ago

Sa2VA Model Zoo

Collection

Huggingace Model Zoo For Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos By Bytedance Seed CV Research • 12 items • Updated 10 days ago • 44

upvoted 3 papers about 2 months ago

Detect Anything via Next Point Prediction

Paper • 2510.12798 • Published Oct 14 • 46

Diffusion Transformers with Representation Autoencoders

Paper • 2510.11690 • Published Oct 13 • 165

DiT360: High-Fidelity Panoramic Image Generation via Hybrid Training

Paper • 2510.11712 • Published Oct 13 • 30

upvoted a paper 3 months ago

SpatialVID: A Large-Scale Video Dataset with Spatial Annotations

Paper • 2509.09676 • Published Sep 11 • 32

upvoted 5 papers 5 months ago

Traceable Evidence Enhanced Visual Grounded Reasoning: Evaluation and Methodology

Paper • 2507.07999 • Published Jul 10 • 49

SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning via Multi-Agent Multi-Turn Reinforcement Learning

Paper • 2506.24119 • Published Jun 30 • 50

Ovis-U1 Technical Report

Paper • 2506.23044 • Published Jun 29 • 62

GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning

Paper • 2507.01006 • Published Jul 1 • 240

VMoBA: Mixture-of-Block Attention for Video Diffusion Models

Paper • 2506.23858 • Published Jun 30 • 32

Xiangtai Li

AI & ML interests

Recent Activity

Organizations

LXT's activity