If you want a snapshot of how competitive the artificial intelligence (AI) space is, these past few days do the job. Persistent rumours about Apple’s inevitable AI push reached a crescendo. Just a day ahead of Google’s I/O 2024, OpenAI announced the iterative GPT-4o model, which is faster and has improved multimodal capabilities across text, vision and audio. Microsoft, too, should be able to draw benefits from that soon. Then there is the Google I/O keynote itself, which takes things forward with Gemini, in more ways than one.


Here’s a second snapshot, of all things Google: updates across Gemini models including Nano and Pro, a new and lightweight Gemini 1.5 Flash, a conversational AI functionality called Gemini Live, customisable versions of Gemini called Gems, generative AI in Search, next generations of the generative AI tools Veo and Imagen, as well as an eye on the future with the Gemma models.


“While we’ve made incredible progress developing AI systems that can understand multimodal information, getting response time down to something conversational is a difficult engineering challenge. Over the past few years, we’ve been working to improve how our models perceive, reason and converse to make the pace and quality of interaction feel more natural,” Demis Hassabis, chief executive officer (CEO) of Google DeepMind, said, summarising the journey thus far and challenges that lie ahead.

Gemini Live joins the family

Google already has a response ready for OpenAI’s GPT-4o and its improved conversational and situational-context skills, of which real-time translation is a key feature. It is called Gemini Live, and the promise is an interaction that mirrors how humans converse: natural-sounding voices, conversational flow and even mid-sentence interruptions. Gemini will also figure prominently in Google’s Messages app, with the added context of your chats.

That’s not all with Gemini Live. “Later this year you’ll be able to use your camera when you go Live, opening up conversations about what you see around you,” said Sissie Hsiao, vice president and general manager for Gemini Experiences and Google Assistant.

The Gemini model family, three strong with Nano, Pro and Ultra (differing in size and deployment utility), gets a fourth member. Gemini 1.5 Flash, though lighter than Gemini 1.5 Pro, receives the same training for multimodal reasoning and knowledge as the Pro model. Because of these capabilities, 1.5 Flash’s relevance spans not just Android devices at some point in the future, but most of Google’s services, including Workspace.

“The 1.5 Flash is highly capable of multimodal reasoning across vast amounts of information and delivers impressive quality for its size and excels at summarisation, chat applications, image and video captioning, data extraction from long documents and tables,” Hassabis told HT.
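That summarisation use case is something developers can already try for themselves. Here is a minimal sketch, not Google’s own example, of calling 1.5 Flash through the google-generativeai Python SDK; the API key and the input file are placeholder assumptions.

```python
# Minimal sketch: summarisation with Gemini 1.5 Flash via the
# google-generativeai Python SDK. The API key and document are
# placeholders, not part of Google's announcement.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # assumption: a key from Google AI Studio

model = genai.GenerativeModel("gemini-1.5-flash")

with open("long_report.txt") as f:  # placeholder document
    document = f.read()

response = model.generate_content(
    "Summarise the key points of this document:\n\n" + document
)
print(response.text)
```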

Gemini Nano, which is crucial to Google’s AI vision for Android and the importance of on-device AI computation, is being updated with multimodality, meaning it can process visuals and spoken language in addition to text. This gets its first outing on Google’s Pixel phones.

Alongside, the Gemini 1.5 Pro model is now available to Gemini Advanced subscribers (around ₹1,950 per month). Until now, it was available only to developers and enterprise customers. This means it is available in more than 150 countries and in over 35 languages. The improvements include an ability to follow complex and nuanced instructions, including ones about format and style. One result is that it is now capable of trip planning.

“Gemini takes into account your flight timing, meal preferences and information about local museums, while also understanding where each stop is located and how long it will take to travel,” explained Hsiao.

“Gemini is natively multimodal, and 1.5 Pro brings big improvements to image understanding. For example, you can snap a photo of a dish at your favourite restaurant and ask for a recipe or take a picture of a math problem and get step-by-step instructions on how to solve it, all from a single image,” she said, detailing its capabilities.
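The same multimodal behaviour is mirrored in the developer-facing Gemini API. Below is an illustrative sketch of the “photo of a dish to recipe” flow, assuming the google-generativeai Python SDK, a valid API key and a local photo, none of which come from Google’s announcement.

```python
# Illustrative sketch: asking Gemini 1.5 Pro about an image.
# The image file name is a placeholder.
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")

model = genai.GenerativeModel("gemini-1.5-pro")
dish_photo = Image.open("dish.jpg")  # a snap taken at the restaurant

# A list mixing an image and text forms a single multimodal prompt.
response = model.generate_content(
    [dish_photo, "What dish is this? Suggest a recipe to recreate it."]
)
print(response.text)
```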

Gemini Advanced can make sense of 1,500-page documents or summarise 100 emails. Hence the new ability to upload files, from Google Drive or directly from a user’s device, into Gemini Advanced. “Gemini keeps your files private to you, and they’re not used to train our models,” she added.
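The consumer app handles these uploads itself, but a developer can exercise the same long-context strength through the Gemini Files API. A rough, hypothetical sketch, with the file name and key as placeholders:

```python
# Rough sketch: long-document summarisation with Gemini 1.5 Pro using
# the developer-side Files API (the consumer Gemini Advanced app does
# its own uploads). File name and key are placeholders.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

# Upload once, then reference the file in a prompt.
uploaded = genai.upload_file("1500_page_report.pdf")

model = genai.GenerativeModel("gemini-1.5-pro")
response = model.generate_content(
    [uploaded, "Summarise this document in ten bullet points."]
)
print(response.text)
```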

Coming soon, for Gemini Advanced subscribers, will be the ability to create customised versions of Gemini specific to the tasks at hand. Called Gems, these would be a response to OpenAI’s custom GPTs, which were made available late last year. Google is also continuing to expand the scope of extensions for Gemini. The newest is a YouTube Music extension that’s rolling out now, which means users can search for their favourite music, even if they don’t know the song title, by mentioning a verse or a featured artist. More extensions are incoming for Calendar, Tasks and Keep.

Last but not least, Gemma. These are Google’s next-generation open models, which learn from the same research as the Gemini family of models. They will now get architectural changes for improved performance and flexibility across different model sizes, a template that Gemini Nano, Flash, Pro and Ultra already make clear.

AI enhancing Google Search

Google’s experimentation with generative AI within Search gets a sense of finality with AI Overviews in Search results. After a period of opt-in testing, it will be rolling out to users in the US this week, for all search results. More countries, including India, are expected to get similar functionality in the coming months. There will also be flexibility to customise the information in an AI Overview, and how it is presented.

“We’re making AI Overviews available to everyone in the U.S., with more countries coming soon. That means that this week, hundreds of millions of users will have access to AI Overviews, and we expect to bring them to over a billion people by the end of the year,” said Liz Reid, vice president and head of Google Search.

Generative AI’s latest tryst with realism

Google’s generative AI tools, the text-to-video model Veo and the text-to-image model Imagen 3, promise a step forward in terms of realism. “Over the past year, we’ve made incredible progress in enhancing the quality of our generative media technologies. We’ve been working closely with the creative community to explore how generative AI can best support the creative process, and to make sure our AI tools are as useful as possible at each stage,” said Eli Collins, vice president, Product Management at Google.

The tech giant wants filmmakers and creators to experiment with Veo, which follows a long lineage of generative video models, including the Generative Query Network (GQN), DVD-GAN, Imagen-Video, Phenaki, WALT, VideoPoet and Lumiere. Google previewed work with filmmaker Donald Glover and his creative studio, Gilga, which experimented with Veo for a film project.

“With Veo, we’ve improved techniques for how the model learns to understand what’s in a video, renders high-definition images, simulates the physics of our world and more. These learnings will fuel advances across our AI research and enable us to build even more useful products that help people interact and communicate in new ways,” said Collins. For now, Veo remains available to select creators as part of a private preview, and the tech giant confirmed it intends to bring some of Veo’s capabilities to YouTube Shorts at some point down the line.

Imagen 3, Google insists, is its highest-quality text-to-image model to date. Not only can this generative model produce images in a range of styles from a prompt, it is also supposed to better understand natural language and the intent behind a prompt, and can incorporate small details from longer prompts.

“Imagen 3 is our highest quality text-to-image model. It generates an incredible level of detail, producing photorealistic, lifelike images, with far fewer distracting visual artefacts than our prior models,” said Doug Eck, senior research director at Google. For now, Imagen 3 is also available only to select creators as part of a limited preview, a safeguard against misuse and problematic outputs.


