Monday, June 17, 2024

Can Unlabeled Audio-Visual Learning Enhance Speech Recognition Models?

MIT researchers have developed a novel technique for analyzing unlabeled audio and visual data that could improve machine learning models for speech recognition and object detection.

Humans acquire much of their knowledge in a self-supervised way, because explicit supervision signals are often unavailable. In machine learning, self-supervised learning plays a similar role: it builds an initial model from unlabeled data, and that model can then be fine-tuned for specific tasks with supervised learning or reinforcement learning.

Researchers at MIT and the MIT-IBM Watson AI Lab have developed a new method to analyze unlabeled audio and visual data, improving machine learning models for speech recognition and object detection. The work merges two self-supervised learning architectures, contrastive learning and masked data modeling. It aims to scale machine-learning tasks, such as event classification, across various data formats without the need for annotation, an approach that mimics how humans understand and perceive the world. The technique, called the contrastive audio-visual masked autoencoder (CAV-MAE), is a neural network that learns latent representations from acoustic and visual data.

A joint and coordinated approach

CAV-MAE combines “learning by prediction” and “learning by comparison.” In masked data modeling, the prediction part, a portion of the audio and visual inputs is masked. The unmasked data is processed by separate audio and visual encoders and then passed to a joint encoder/decoder, which reconstructs the missing data. The model is trained on the difference between the original and the reconstructed data. On its own, this approach may not fully capture the association between a video and its audio. Contrastive learning, the comparison part, complements it by leveraging exactly that association, although it may discard some modality-unique details, such as the background of a video.
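The two training signals described above can be illustrated with a toy sketch in plain Python. This is not the researchers' implementation: the function names, the scalar "tokens", the mask ratio, the temperature, and the loss weight are all assumptions chosen for illustration, and a real model would use learned encoders and a decoder in place of the stand-in values.

```python
import math
import random


def mask_tokens(tokens, mask_ratio, rng):
    """Pick which token positions to hide; returns the sorted masked indices."""
    n_masked = int(len(tokens) * mask_ratio)
    return sorted(rng.sample(range(len(tokens)), n_masked))


def reconstruction_loss(original, reconstructed, masked_idx):
    """Learning by prediction: mean squared error over the masked positions only."""
    if not masked_idx:
        return 0.0
    return sum((original[i] - reconstructed[i]) ** 2 for i in masked_idx) / len(masked_idx)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v)) or 1.0  # avoid division by zero
    return [x / norm for x in v]


def contrastive_loss(audio_embs, video_embs, temperature=0.05):
    """Learning by comparison: an InfoNCE-style loss in which the i-th audio
    embedding should be closer to the i-th video embedding than to any other."""
    audio = [normalize(a) for a in audio_embs]
    video = [normalize(v) for v in video_embs]
    total = 0.0
    for i, a in enumerate(audio):
        logits = [dot(a, v) / temperature for v in video]
        max_logit = max(logits)  # subtract the max for numerical stability
        log_sum = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
        total += log_sum - logits[i]
    return total / len(audio)


def combined_loss(recon, contrast, contrast_weight=0.01):
    """Weighted sum of both objectives; the weight is an assumed hyperparameter."""
    return recon + contrast_weight * contrast


# Toy usage with made-up numbers.
rng = random.Random(0)
audio_tokens = [0.2, 0.9, 0.4, 0.7]
masked = mask_tokens(audio_tokens, mask_ratio=0.75, rng=rng)
reconstructed = [0.0] * len(audio_tokens)  # stand-in for a decoder's output
recon = reconstruction_loss(audio_tokens, reconstructed, masked)
contrast = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.9, 0.1], [0.2, 0.8]])
loss = combined_loss(recon, contrast)
```

The reconstruction term rewards the model for filling in what was hidden, while the contrastive term rewards it for placing matching audio and video clips close together, which is why the two signals can complement each other.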

The researchers evaluated the full CAV-MAE model, variants of their method without the contrastive loss or without the masked autoencoder, and other existing methods on standard datasets. The tasks were audio-visual retrieval and audio-visual event classification. Retrieval involved finding the missing audio or visual component of a query pair, while event classification identified actions or sounds in the data. The results show that contrastive learning and masked data modeling complement each other. CAV-MAE outperforms previous techniques by about 2 percent on event classification and keeps pace with models trained with industry-level computational resources. It ranks similarly to models trained with only a contrastive loss. Incorporating multi-modal data in CAV-MAE also improves single-modality representations and audio-only event classification. The multi-modal information acts as a “soft label” boost, helping with tasks such as distinguishing between electric and acoustic guitars.
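As a rough illustration of the retrieval task, the sketch below ranks candidate video embeddings by cosine similarity to an audio query. Cosine similarity is a common choice for this kind of lookup, but the article does not say which measure the researchers used; the function names and the embedding values are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve(query_emb, candidate_embs):
    """Return the index of the candidate most similar to the query."""
    scores = [cosine_similarity(query_emb, c) for c in candidate_embs]
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical embeddings: one audio query and a small library of video clips.
audio_query = [0.9, 0.1, 0.0]
video_library = [
    [0.0, 1.0, 0.0],
    [1.0, 0.2, 0.1],
    [0.0, 0.0, 1.0],
]
best = retrieve(audio_query, video_library)  # index of the best-matching clip
```

The same function works in the other direction, with a video embedding as the query and a library of audio embeddings, provided both modalities are mapped into a shared representation space.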

Bringing self-supervised audio-visual learning into our world

The researchers consider CAV-MAE a significant advancement for applications that are moving toward multi-modality and audio-visual fusion. They envision its future use in action recognition for sports, education, entertainment, motor vehicles, and public safety. Although the method is currently limited to audio-visual data, the team is targeting multimodal learning more broadly, with the goal of mimicking human abilities in AI and extending the approach to other modalities.

Nidhi Agarwal
Nidhi Agarwal is a journalist at EFY. She is an Electronics and Communication Engineer with over five years of academic experience. Her expertise lies in working with development boards and IoT cloud. She enjoys writing because it enables her to share her knowledge and insights about electronics with like-minded techies.

