Sa2VA Model Zoo Collection • Hugging Face model zoo for Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos • By Bytedance Seed CV Research • 12 items • Updated 12 days ago • 44
Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark Paper • 2510.26802 • Published Oct 30 • 33
Kimi Linear: An Expressive, Efficient Attention Architecture Paper • 2510.26692 • Published Oct 30 • 116
The Scalability of Simplicity: Empirical Analysis of Vision-Language Learning with a Single Transformer Paper • 2504.10462 • Published Apr 14 • 15
MERIT: Multilingual Semantic Retrieval with Interleaved Multi-Condition Query Paper • 2506.03144 • Published Jun 3 • 7
CyberV: Cybernetics for Test-time Scaling in Video Understanding Paper • 2506.07971 • Published Jun 9 • 5
Open-o3 Video: Grounded Video Reasoning with Explicit Spatio-Temporal Evidence Paper • 2510.20579 • Published Oct 23 • 55
Grasp Any Region: Towards Precise, Contextual Pixel Understanding for Multimodal LLMs Paper • 2510.18876 • Published Oct 21 • 36