Open Multimodal Gathering Workshop


Date: Sunday, April 27, 2025 | 🕐 1:00 PM – 6:00 PM
Location: LT16, COM2, National University of Singapore (NUS) (see Directions below)

Introduction

Thank you for your interest in the Open Multimodal Gathering Workshop! This event, hosted by Show Lab at the National University of Singapore (NUS), will take place during ICLR 2025, providing an excellent opportunity for ICLR attendees to join us.

This event brings together researchers, students, and practitioners to explore cutting-edge topics in vision-language, video understanding, intelligent agents, and embodied AI.

The workshop will feature talks by confirmed speakers, with opportunities for additional attendees to present. A campus tour will follow the presentations, offering a chance to experience the NUS research environment firsthand.

Whether you’d like to attend as an audience member or present your work as a speaker, we warmly welcome you to join this exciting gathering and connect with the broader multimodal research community!

Schedule

The program features keynote talks across diverse topics, interleaved with lightning presentations from volunteers, including Ph.D. students, to encourage broader involvement.
| Start | End | Speaker | Institution | Topic | Presentation Title |
|-------|-----|---------|-------------|-------|--------------------|
| 13:00 | 13:05 | | | | Opening remarks |
| 13:05 | 13:35 | Liang Zheng | ANU | Generative Model & Diffusion | Features of Generative Models |
| 13:35 | 13:45 | Min Zhao | Tsinghua University | | RIFLEx: A Free Lunch for Length Extrapolation in Video Diffusion Transformers |
| 13:45 | 13:55 | Kangfu Mei | Google Research | | The Power of Context: How Multimodality Improves Image Super-Resolution |
| 13:55 | 14:05 | Zaixiang Zheng | ByteDance Research | | Towards Large-Scale Multimodal Generative Foundation Models for Protein Modeling & Design |
| 14:05 | 14:35 | Xiao-Ming Wu | PolyU | AIGC Application | Language Model Adaptation and Multimodal Applications |
| 14:35 | 14:45 | Yeying Jin | Tencent | | AI × Gaming: Creative Applications of AIGC |
| 14:45 | 14:55 | Cong Wei | U. of Waterloo | | MoCha: Towards Movie-Grade Talking Character Synthesis |
| 14:55 | 15:05 | Shanyan Guan | VIVO & SJTU | | Next-Gen Camera Systems: Instilling Intuitive Imagination in Smartphones |
| 15:05 | 15:20 | | | | Break |
| 15:20 | 15:50 | Mohit Bansal | UNC | Multi-modal Understanding, Unified Model | Multimodal Generative Models: Planning Agents, Skill Learning, and Composable Generalization |
| 15:50 | 16:00 | Kevin Qinghong Lin | NUS | | Learning Human-centric Multimodal Assistants from Videos |
| 16:00 | 16:10 | Jinheng Xie | NUS | | Show-o: One Single Transformer to Unify Multimodal Understanding and Generation |
| 16:10 | 16:20 | Zhaokai Wang | SJTU & Shanghai AI Lab | | Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training |
| 16:20 | 16:35 | Kun Shao | Huawei UK | Agent | Towards Generalist GUI Agent: Model and Optimization |
| 16:35 | 16:45 | Zhaorun Chen | University of Chicago | Agent Safety | Purple-Teaming |
| 16:45 | 16:55 | Yatai Ji | HKU | VLM, Reasoning, Evaluation | IDA-VLM: Towards Movie Understanding via ID-Aware Large Vision-Language Model |
| 16:55 | 17:10 | Freda Shi | U. of Waterloo & Vector Institute | | Do Vision-Language Models Represent Space and How? Evaluating Spatial Frame of Reference Under Ambiguities |
| 17:10 | 17:20 | Ziyao Shangguan | Yale | | TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models |
| 17:20 | 17:30 | Mehdi Ataei | Autodesk Research | | A Modular Framework for Physical Reasoning in Vision-Language Models |
| 17:30 | 17:40 | Boshen Xu | Renmin University of China | Embodied AI | Learning to Perceive Egocentric Hand-Object Interactions |
| 17:40 | 17:55 | Yunzhu Li | Columbia University | | Learning Structured World Models From and For Physical Interactions |
| 18:00 | | | | | Campus tour |

Gallery

This event has attracted 230+ registrations. We sincerely appreciate everyone's interest and support.

Directions

Overview

1. Walk from Kent Ridge MRT station to the "Opp Kent Ridge Stn Exit A" bus stop (NUS shuttle bus stop).

2. Take NUS shuttle bus D2 to COM3.

3. Walk from COM3 to LT16 (see the map below).

Map: walking directions from COM3 to LT16.

Sponsor

Tencent