
Google’s New Gemini AI Will Understand Your Photos and Videos, Not Just Text


Google has begun bringing a native understanding of video, audio and photos to its Bard AI chatbot with a new model called Gemini.

The first incarnations of the new technology arrived Wednesday in dozens of countries, but only in English, providing text-based chat abilities that Google says improve the AI’s performance in complex tasks like summarizing documents, reasoning and writing programming code. The bigger change, multimedia abilities such as understanding the data underlying a graph or figuring out the result of a child’s dot-to-dot drawing puzzle, will arrive “soon,” Google said.

The new version represents a dramatic departure for AI. Text-based chat is important, but humans must process much richer information as we inhabit our three-dimensional, ever-changing world. And we respond with complex communication abilities, like speech and imagery, not just written words. Gemini is an attempt to come closer to our own fuller understanding of the world.

Gemini comes in three versions tailored for different levels of computing power, Google said:

  • Gemini Nano runs on mobile phones and comes in two varieties built for different amounts of available memory. It’ll power new features on Google’s Pixel 8 phones, like summarizing conversations in its Recorder app or suggesting message replies in WhatsApp typed with Google’s Gboard.
  • Gemini Pro, tuned for fast responses, runs in Google’s data centers and will power a new version of Bard, starting Wednesday.
  • Gemini Ultra, limited to a test group for now, will be available in a new Bard Advanced chatbot due in early 2024. Google declined to reveal pricing details, but expect to pay a premium for this top capability.

The new version spotlights the breakneck pace of advancement in the new generative AI field, where chatbots create their own responses to prompts that we write in plain language rather than arcane programming instructions. Google’s top competitor, OpenAI, stole a march with the launch of ChatGPT a year ago, but already Google is on its third major AI model revision and expects to deliver that technology through products that billions of us use, like search, Chrome, Google Docs and Gmail.

“For a long time we wanted to build a new generation of AI models inspired by the way people understand and interact with the world — an AI that feels more like a helpful collaborator and less like a smart piece of software,” said Eli Collins, a product vice president at Google’s DeepMind division. “Gemini brings us a step closer to that vision.”

AI is getting smarter, but it’s not perfect

Multimedia likely will be a big change compared with text when it arrives. But what hasn’t changed are the fundamental problems of AI models trained by recognizing patterns in vast quantities of real-world data. They can turn increasingly complex prompts into increasingly sophisticated responses, but you still can’t trust that they didn’t just provide an answer that was plausible instead of actually correct. As Google’s chatbot warns when you use it, “Bard may display inaccurate info, including about people, so double-check its responses.”

Gemini is the next generation of Google’s large language model, a sequel to the PaLM and PaLM 2 models that have been the foundation of Bard so far. But because Gemini was trained simultaneously on text, programming code, images, audio and video, it can cope with multimedia input more efficiently than an approach that relies on separate but interlinked AI models for each mode of input.

Examples of Gemini’s abilities, according to a Google research paper, are diverse.

Looking at a series of shapes consisting of a triangle, square and pentagon, it can correctly guess the next shape in the series is a hexagon. Presented with photos of the moon and a hand holding a golf ball and asked to find the link, it correctly points out that Apollo astronauts hit two golf balls on the moon in 1971. It converted four bar charts showing country-by-country waste disposal techniques into a labeled table and spotted an outlying data point, namely that the US throws a lot more plastic in the dump than other regions.

The company also showed Gemini processing a handwritten physics problem involving a simple sketch, figuring out where a student’s error lay, and explaining a correction. A more involved demo video showed Gemini recognizing a blue duck, hand puppets, sleight-of-hand tricks and other subjects. None of the demos were live, however, and it’s not clear how often Gemini fumbles such challenges.

Gemini Ultra awaits further testing before appearing next year.

“Red teaming,” in which a product-maker enlists people to find security vulnerabilities and other problems, is underway for Gemini Ultra. Such tests are more complicated with multimedia input data. For example, a text message and a photo could each be innocuous on their own, but when paired could convey a dramatically different meaning.

“We’re approaching this work boldly and responsibly,” Google CEO Sundar Pichai said in a blog post. That means combining ambitious research that has big potential payoffs with added safeguards and collaboration with governments and others “to address risks as AI becomes more capable.”

Editors’ note: CNET is using an AI engine to help create some stories. For more, see this post.


