Monday, June 24, 2024

AI Finds Specific Actions In Videos

MIT researchers have developed a streamlined method for spatiotemporal grounding, leveraging videos and automated transcripts for enhanced efficiency.

Researchers from MIT developed a technique that teaches machine-learning models to identify specific actions in long videos.
Credits: Image: MIT News; iStock

The internet offers instructional videos on a wide range of tasks, but finding a specific action in a long video is tedious. Scientists aim to teach AI to locate described actions automatically, though this usually requires costly, hand-labelled data.

Researchers at MIT and the MIT-IBM Watson AI Lab have developed an efficient spatiotemporal grounding approach that trains on videos and their automatic transcripts. Their model analyses both small details and overall sequences to accurately identify actions in longer videos with multiple activities. Training on spatial and temporal information simultaneously improves performance. The technique could enhance online learning and virtual training, and could aid health care by quickly identifying key moments in diagnostic videos.

Global and local learning

Researchers typically teach models to perform spatiotemporal grounding using annotated videos, but such data is expensive to generate and difficult to label precisely. Instead, these researchers use unlabeled instructional videos and text transcripts from sources like YouTube, which require no special preparation. They split the training into two parts: teaching the model to understand the overall timing of actions (the global view) and teaching it to focus on the specific regions where actions occur (the local view).
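The global and local split can be illustrated with a toy lookup. The sketch below is not the researchers' model or training procedure. It only shows, at inference time and under simplifying assumptions, how a temporal step (pick the frame that best matches the text) can be followed by a spatial step (pick the region within that frame). The function names `cosine` and `ground`, and the idea of representing text, frames, and regions as small embedding vectors, are hypothetical choices made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def ground(text_vec, frame_vecs, region_vecs):
    """Return (frame_index, region_index) that best match text_vec.

    frame_vecs:  one whole-frame embedding per frame (assumed given).
    region_vecs: for each frame, a list of region embeddings (assumed given).
    """
    # Global (temporal) step: which frame best matches the described action?
    t = max(range(len(frame_vecs)),
            key=lambda i: cosine(text_vec, frame_vecs[i]))
    # Local (spatial) step: which region inside that frame matches best?
    r = max(range(len(region_vecs[t])),
            key=lambda j: cosine(text_vec, region_vecs[t][j]))
    return t, r


# Toy data: three frames, with three candidate regions in the middle frame.
text = [1.0, 0.0]
frames = [[0.0, 1.0], [1.0, 0.1], [0.5, 0.5]]
regions = [[[1.0, 0.0]],
           [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]],
           [[1.0, 0.0]]]
print(ground(text, frames, regions))
```

In a real system the embeddings would come from learned video and text encoders; here they are hand-written numbers so the two-step selection is easy to follow.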

An additional component addresses misalignments between narration and video. Their approach uses uncut, several-minute-long videos for a more realistic solution, unlike most AI techniques that use short, trimmed clips.

A new benchmark

When evaluating their approach, the researchers found no effective benchmark for testing models on longer, uncut videos, so they created one. They also developed a new annotation technique for identifying multi-step actions, in which users mark the point where objects intersect, such as where a knife edge cuts a tomato, rather than drawing a box around important objects. Having multiple people make point annotations on the same video better captures actions that unfold over time, such as the flow of milk being poured. On this benchmark, their approach was more accurate at pinpointing actions and at focusing on human-object interactions than other AI techniques.
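To make point annotations concrete, here is one way a predicted location could be scored against points marked by several annotators. The article does not say how the benchmark computes accuracy, so the rule used here (a prediction counts as a hit if it lies within a pixel tolerance of any annotator's point) is an assumption, as are the names `point_hit` and `pointing_accuracy` and the default tolerance.

```python
import math


def point_hit(pred, annotations, tolerance):
    """True if pred (x, y) is within `tolerance` pixels of any annotated point."""
    return any(math.dist(pred, p) <= tolerance for p in annotations)


def pointing_accuracy(preds, annots, tolerance=10.0):
    """Fraction of frames whose predicted point hits at least one annotation.

    preds:  one predicted (x, y) point per frame.
    annots: for each frame, the list of (x, y) points marked by annotators.
    """
    if not preds:
        return 0.0
    hits = sum(point_hit(p, a, tolerance) for p, a in zip(preds, annots))
    return hits / len(preds)


# Two frames: the first prediction lands near an annotation, the second does not.
predictions = [(10, 10), (50, 50)]
annotations = [[(12, 11), (30, 30)], [(80, 80)]]
print(pointing_accuracy(predictions, annotations, tolerance=5.0))
```

Accepting a match against any annotator's point is one simple way to tolerate actions that spread over an area, such as a stream of poured milk, where different people reasonably click in different places.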

The researchers plan to enhance their approach so models can automatically detect misalignment between text and narration, switching focus as needed, and to extend their framework to include audio data, given the strong correlations between actions and sounds.

Nidhi Agarwal
Nidhi Agarwal is a journalist at EFY. She is an Electronics and Communication Engineer with over five years of academic experience. Her expertise lies in working with development boards and IoT cloud. She enjoys writing because it lets her share her knowledge of and insights into electronics with like-minded techies.
