Apple researchers develop AI that can ‘see’ and understand screen context


Apple researchers have developed a new artificial intelligence system that can understand ambiguous references to on-screen entities as well as conversational and background context, enabling more natural interactions with voice assistants, according to a paper published on Friday.

The system, called ReALM (Reference Resolution As Language Modeling), leverages large language models to convert the complex task of reference resolution — including understanding references to visual elements on a screen — into a pure language modeling problem. This allows ReALM to achieve substantial performance gains compared to existing methods.
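
The paper does not spell out a concrete prompt format, but the core idea of casting reference resolution as language modeling can be illustrated in a few lines: serialize the candidate entities and the user's request into a single text prompt, then ask a fine-tuned model to emit the indices of the referenced entities. Everything in the sketch below, from the entity labels to the prompt wording, is an assumption for illustration, not the paper's exact encoding.

```python
# Minimal sketch: reference resolution framed as a language modeling task.
# The entity labels and prompt format are illustrative assumptions, not
# the exact encoding used by ReALM.

def build_prompt(entities: list[str], query: str) -> str:
    """Serialize candidate entities and the user query into one text
    prompt, so a fine-tuned LLM can complete it with the index (or
    indices) of the entities the query refers to."""
    lines = [f"{i}. {e}" for i, e in enumerate(entities)]
    return (
        "Entities on screen:\n"
        + "\n".join(lines)
        + f"\nUser request: {query}\n"
        + "Relevant entity numbers:"
    )

prompt = build_prompt(
    [
        "260 Sample Sale (listing)",
        "Call 415-555-0132 (phone number)",
        "Directions (button)",
    ],
    "call the number on this page",
)
# A model fine-tuned for reference resolution would complete this
# prompt with, e.g., "1".
print(prompt)
```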

“Being able to understand context, including references, is essential for a conversational assistant,” wrote the team of Apple researchers. “Enabling the user to issue queries about what they see on their screen is a crucial step in ensuring a true hands-free experience in voice assistants.”

Enhancing conversational assistants

A key innovation of ReALM is how it tackles screen-based references: it reconstructs the screen using parsed on-screen entities and their locations to generate a textual representation that captures the visual layout. The researchers demonstrated that this approach, combined with fine-tuning language models specifically for reference resolution, could outperform GPT-4 on the task.
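
The paper's exact serialization algorithm is not reproduced here, but the idea of flattening parsed entities into layout-preserving text can be sketched as follows. The Entity shape, the row-bucketing tolerance, and the tab-joining are illustrative assumptions rather than ReALM's actual procedure.

```python
# Rough sketch: turning parsed on-screen entities (text plus bounding
# box) into a layout-preserving textual representation. The heuristic
# (bucket by vertical center, then sort left-to-right) is an assumption
# for illustration, not the paper's exact algorithm.

from dataclasses import dataclass

@dataclass
class Entity:
    text: str
    x: float  # left edge, pixels
    y: float  # top edge, pixels
    h: float  # height, pixels

def screen_to_text(entities: list[Entity], row_tol: float = 10.0) -> str:
    """Group entities whose vertical centers lie within `row_tol` pixels
    into one line, order lines top-to-bottom and entities left-to-right,
    and join same-line entities with tabs."""
    rows: list[tuple[float, list[Entity]]] = []
    for e in sorted(entities, key=lambda e: e.y + e.h / 2):
        center = e.y + e.h / 2
        if rows and abs(center - rows[-1][0]) <= row_tol:
            rows[-1][1].append(e)  # same visual row
        else:
            rows.append((center, [e]))  # start a new row
    return "\n".join(
        "\t".join(e.text for e in sorted(row, key=lambda e: e.x))
        for _, row in rows
    )

print(screen_to_text([
    Entity("260 Sample Sale", 10, 5, 20),
    Entity("Apr 10", 200, 7, 20),
    Entity("Request invite", 10, 40, 20),
]))
```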


Apple’s AI system, ReALM, can understand references to on-screen entities like the “260 Sample Sale” listing shown in this mockup, enabling more natural interactions with voice assistants. (Image Credit: arxiv.org)

“We demonstrate large improvements over an existing system with similar functionality across different types of references, with our smallest model obtaining absolute gains of over 5% for on-screen references,” the researchers wrote. “Our larger models substantially outperform GPT-4.”

Practical applications and limitations

The work highlights the potential for focused language models to handle tasks like reference resolution in production systems where using massive end-to-end models is infeasible due to latency or compute constraints. By publishing the research, Apple is signaling its continuing investments in making Siri and other products more conversant and context-aware.

Still, the researchers caution that relying on automated parsing of screens has limitations. Handling more complex visual references, like distinguishing between multiple images, would likely require incorporating computer vision and multi-modal techniques.

Apple races to close AI gap as rivals soar

Apple is quietly making significant strides in artificial intelligence research, even as it trails tech rivals in the race to dominate the fast-moving AI landscape.

From multimodal models that blend vision and language, to AI-powered animation tools, to techniques for building high-performing specialized AI on a budget, a steady drumbeat of breakthroughs from the company’s research labs suggests its AI ambitions are rapidly escalating.

But the famously secretive tech giant faces stiff competition from the likes of Google, Microsoft, Amazon and OpenAI, who have aggressively productized generative AI in search, office software, cloud services and more.

Apple, long a fast follower rather than a first mover, now confronts a market being transformed at breakneck speed by artificial intelligence. At its closely watched Worldwide Developers Conference in June, the company is expected to unveil a new large language model framework, an “Apple GPT” chatbot, and other AI-powered features across its ecosystem.

“We’re excited to share details of our ongoing work in AI later this year,” CEO Tim Cook recently hinted on an earnings call. Despite its characteristic opacity, it’s clear Apple’s AI efforts are sweeping in scope.

Yet as the battle for AI supremacy heats up, the iPhone maker’s lateness to the party has put it in an uncharacteristic position of weakness. Deep coffers, brand loyalty, elite engineering and a tightly integrated product portfolio give it a puncher’s chance — but there are no guarantees in this high-stakes contest.

A new age of ubiquitous, truly intelligent computing is on the horizon. Come June, we’ll see if Apple has done enough to ensure it has a hand in shaping it.



Source: https://venturebeat.com/ai/apple-researchers-develop-ai-that-can-see-and-understand-screen-context/