A Technology That Doubles Vision-Language Model Speed

“SparseVLM” makes Vision-Language Models faster by looking only at parts of an image or video that match the question, without losing accuracy.

Panasonic R&D Company of America (PRDCA) and Panasonic Holdings Co., Ltd. (Panasonic HD), working with researchers from Peking University, Fudan University, the University of California, Berkeley, and Shanghai Jiao Tong University, have developed SparseVLM—a method that increases processing speed in Vision-Language Models (VLMs) while keeping accuracy in visual question answering. SparseVLM does this by considering the input prompt—something earlier methods did not.

When tested across eight visual question answering benchmarks, SparseVLM showed a 48.3% drop in latency and a 71.9% drop in computational load (FLOPs), while keeping an average accuracy of 89.3%. It outperformed other techniques and can be used in systems that need quick recognition and description of a user’s state and surroundings based on visual input.
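To see how the reported reductions relate to the "doubles" in the headline, a percentage drop can be converted into a speedup factor. The short sketch below does that arithmetic; the function name `speedup_from_reduction` is only illustrative, and the calculation assumes the percentages are measured against the unmodified model.

```python
def speedup_from_reduction(reduction: float) -> float:
    """Convert a fractional cost reduction (0.483 means 48.3%) into a speedup factor."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in the range [0, 1)")
    # If the new cost is (1 - reduction) of the old cost, the speedup is its inverse.
    return 1.0 / (1.0 - reduction)


latency_speedup = speedup_from_reduction(0.483)  # 48.3% lower latency
flops_factor = speedup_from_reduction(0.719)     # 71.9% fewer FLOPs

print(f"Latency speedup: {latency_speedup:.2f}x")  # Latency speedup: 1.93x
print(f"FLOPs reduction: {flops_factor:.2f}x")     # FLOPs reduction: 3.56x
```

A 48.3% drop in latency therefore works out to responses arriving roughly 1.9 times faster, which is close to double.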

VLMs are AI models made to handle both visual data (like images and videos) and text. These models can answer questions about visual input, but high-resolution images and long videos slow down response time and raise computational needs. SparseVLM helps by processing only the visual parts linked to the input prompt, cutting down both response time and load while keeping accuracy.

Many existing methods try to reduce the visual load by filtering visual tokens. But they usually filter based only on image content and do not check whether the tokens are relevant to the text prompt. As a result, they still process tokens that the question does not need, which limits the speed gains.

SparseVLM avoids this by using a filtering method that selects tokens based on their link to the prompt. It finds key words in the prompt that match the image or video, then processes only the tokens connected to those words. For example, if the prompt is “What is written on this blue sign?”, it focuses only on the sign area and skips other parts.
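The idea of prompt-guided filtering can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the researchers' implementation: the function `select_visual_tokens`, the scaled dot-product similarity, the "above-average relevance" rule for picking key words, and the toy embeddings are all hypothetical choices made for this example.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def select_visual_tokens(visual, text, keep):
    """Return sorted indices of the `keep` visual tokens most relevant to the prompt.

    visual: (num_visual, dim) array of visual token embeddings
    text:   (num_text, dim) array of prompt token embeddings
    """
    # Similarity between every visual token and every prompt token.
    sim = visual @ text.T / np.sqrt(visual.shape[1])  # (num_visual, num_text)

    # Step 1 (assumed rule): a prompt word is a "key word" if, averaged over
    # all visual tokens, it attracts at least the mean share of attention.
    word_relevance = softmax(sim, axis=1).mean(axis=0)  # (num_text,)
    key_words = word_relevance >= word_relevance.mean()

    # Step 2: score each visual token by how strongly the key words attend to it.
    token_score = softmax(sim[:, key_words], axis=0).mean(axis=1)  # (num_visual,)

    # Step 3: keep only the highest-scoring visual tokens.
    return np.sort(np.argsort(token_score)[-keep:])


# Toy example: six image patches, of which patches 2 and 3 show the sign.
background = [1.0, 0.0, 0.0, 0.0]
sign = [0.0, 0.0, 4.0, 0.0]
visual = np.array([background, background, sign, sign, background, background])

# Two prompt tokens: one aligned with the sign, one filler word.
text = np.array([[0.0, 0.0, 3.0, 0.0],
                 [0.0, 0.1, 0.0, 0.0]])

kept = select_visual_tokens(visual, text, keep=2)
print(kept.tolist())  # [2, 3]
```

In this toy case only the two sign patches survive, so a model would process two visual tokens instead of six. A filter that ignored the prompt would have no basis for preferring those patches over the background.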

SparseVLM does not need added training or outside datasets to find useful tokens. It works with current VLM systems without changes.

Panasonic HD plans to continue its AI work. The company wants to apply AI in daily life and in workplaces to deliver more value to users.
