Date: Sunday, April 27, 2025 | 🕐 1:00 PM – 6:00 PM
Location: LT16, COM2, National University of Singapore (NUS) (see "How to Go" below)
Thank you for your interest in the Open Multimodal Gathering workshop! This event, hosted by Show Lab at the National University of Singapore (NUS), will take place during ICLR 2025, providing an excellent opportunity for ICLR attendees to join us.
This event brings together researchers, students, and practitioners to explore cutting-edge topics in vision-language, video understanding, intelligent agents, and embodied AI.
The workshop will feature talks by confirmed speakers, with opportunities for additional attendees to present. A campus tour will follow the presentations, offering a chance to experience the NUS research environment firsthand.
Whether you’d like to attend as an audience member or present your work as a speaker, we warmly welcome you to join this exciting gathering and connect with the broader multimodal research community!
Start | End | Speaker | Institution | Topic | Presentation Title |
---|---|---|---|---|---|
13:00 | 13:05 | Opening remarks | | | |
13:05 | 13:35 | Liang Zheng | ANU | Generative Models & Diffusion | Features of Generative Models |
13:35 | 13:45 | Min Zhao | Tsinghua University | | RIFLEx: A Free Lunch for Length Extrapolation in Video Diffusion Transformers |
13:45 | 13:55 | Kangfu Mei | Google Research | | The Power of Context: How Multimodality Improves Image Super-Resolution |
13:55 | 14:05 | Zaixiang Zheng | ByteDance Research | | Towards Large-Scale Multimodal Generative Foundation Models for Protein Modeling & Design |
14:05 | 14:35 | Xiao-Ming Wu | PolyU | AIGC Applications | Language Model Adaptation and Multimodal Applications |
14:35 | 14:45 | Yeying Jin | Tencent | | AI × Gaming: Creative Applications of AIGC |
14:45 | 14:55 | Cong Wei | U. of Waterloo | | MoCha: Towards Movie-Grade Talking Character Synthesis |
14:55 | 15:05 | Shanyan Guan | VIVO & SJTU | | Next-Gen Camera Systems: Instilling Intuitive Imagination in Smartphones |
15:05 | 15:20 | Break | | | |
15:20 | 15:50 | Mohit Bansal | UNC | Multimodal Understanding, Unified Models | Multimodal Generative Models: Planning Agents, Skill Learning, and Composable Generalization |
15:50 | 16:00 | Kevin Qinghong Lin | NUS | | Learning Human-Centric Multimodal Assistants from Videos |
16:00 | 16:10 | Jinheng Xie | NUS | | Show-o: One Single Transformer to Unify Multimodal Understanding and Generation |
16:10 | 16:20 | Zhaokai Wang | SJTU & Shanghai AI Lab | | Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training |
16:20 | 16:35 | Kun Shao | Huawei UK | Agents | Towards Generalist GUI Agents: Model and Optimization |
16:35 | 16:45 | Zhaorun Chen | University of Chicago | | Agent Safety Purple-Teaming |
16:45 | 16:55 | Yatai Ji | HKU | VLMs, Reasoning, Evaluation | IDA-VLM: Towards Movie Understanding via ID-Aware Large Vision-Language Model |
16:55 | 17:10 | Freda Shi | U. of Waterloo & Vector Institute | | Do Vision-Language Models Represent Space and How? Evaluating Spatial Frame of Reference Under Ambiguities |
17:10 | 17:20 | Ziyao Shangguan | Yale | | TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models |
17:20 | 17:30 | Mehdi Ataei | Autodesk Research | | A Modular Framework for Physical Reasoning in Vision-Language Models |
17:30 | 17:40 | Boshen Xu | Renmin University of China | Embodied AI | Learning to Perceive Egocentric Hand-Object Interactions |
17:40 | 17:55 | Yunzhu Li | Columbia University | | Learning Structured World Models From and For Physical Interactions |
18:00 | | Campus tour | | | |
How to Go
1. Walk from Kent Ridge MRT to Opp Kent Ridge Stn Exit A (School Bus Stop).
2. Take school bus D2 to COM3.
3. Walk from COM3 to LT16, following the walking guidance below.
Walking guidance from COM3 to LT16: