AI learns how vision and sound are connected, without human intervention

This new machine-learning model can match corresponding audio and visual data, which could someday help robots interact in the real world.

“We are building AI systems that can process the world like humans do, in terms of having both audio and visual information coming in at once and being able to seamlessly process both modalities. Looking forward, if we can integrate this audio-visual technology into some of the tools we use on a daily basis, like large language models, it could open up a lot of new applications,” says Andrew Rouditchenko, an MIT graduate student and co-author of a paper on this research.

A new approach developed by researchers from MIT and elsewhere improves an AI model’s ability to learn in this same fashion, matching corresponding audio and visual data without human labels. This could be useful in applications such as journalism and film production, where the model could help curate multimodal content through automatic video and audio retrieval.
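To illustrate what audio-to-video retrieval can look like once a model has learned a shared audio-visual representation, here is a minimal sketch. It assumes embeddings for both modalities already exist (random placeholders stand in for them here); the array names, dimensions, and cosine-similarity ranking are illustrative assumptions, not the researchers' actual pipeline.

```python
# Illustrative sketch only: retrieve the video clip whose embedding best
# matches a query audio clip's embedding. Embeddings are random placeholders;
# in a real system they would come from a trained audio-visual model.
import numpy as np

rng = np.random.default_rng(0)

def normalize(x):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Hypothetical embeddings: 5 candidate video clips and 1 audio query, 128-dim each.
video_embeddings = normalize(rng.normal(size=(5, 128)))
audio_query = normalize(rng.normal(size=(1, 128)))

# Cosine similarity between the audio query and every video clip.
similarities = audio_query @ video_embeddings.T  # shape (1, 5)

# The best-matching clip is the one with the highest similarity score.
best_match = int(np.argmax(similarities))
print(f"Retrieved video clip index: {best_match}")
print(f"Similarity scores: {np.round(similarities[0], 3)}")
```

In practice, the quality of such retrieval depends entirely on how well the learned embeddings place corresponding audio and visual clips near each other, which is the alignment problem the research addresses.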

https://news.mit.edu/2025/ai-learns-how-vision-and-sound-are-connected-without-human-intervention-0522