ChatGPT has learned to talk.
OpenAI, the San Francisco artificial intelligence start-up, released a version of its popular chatbot on Monday that can interact with people using spoken words. As with Amazon’s Alexa, Apple’s Siri, and other digital assistants, users can talk to ChatGPT and it will talk back.
For the first time, ChatGPT can also respond to images. People can, for example, upload a photo of the inside of their refrigerator, and the chatbot can give them a list of dishes they could cook with the ingredients they have.
“We’re looking to make ChatGPT easier to use — and more helpful,” said Peter Deng, OpenAI’s vice president of consumer and enterprise product.
OpenAI has accelerated the release of its A.I. tools in recent weeks. This month, it unveiled a version of its DALL-E image generator and folded the tool into ChatGPT.
ChatGPT attracted hundreds of millions of users after it was introduced in November, and several other companies soon released similar services. With the new version of the bot, OpenAI is pushing beyond rival chatbots like Google Bard, while also competing with older technologies like Alexa and Siri.
Alexa and Siri have long provided ways of interacting with smartphones, laptops and other devices through spoken words. But chatbots like ChatGPT and Google Bard have more powerful language skills and are able to instantly write emails, poetry and term papers, and riff on almost any topic tossed their way.
OpenAI has essentially combined the two communication methods.
The company sees talking as a more natural way of interacting with its chatbot. It argues that ChatGPT’s synthetic voices, which come in five options including male and female voices, are more convincing than others used with popular digital assistants.
Over the next two weeks, the company said, the new version of the chatbot would start rolling out to everyone who subscribes to ChatGPT Plus, a service that costs $20 a month. But the bot can respond with voice only when used on iPhones, iPads and Android devices.
The bot’s synthetic voices are more natural than many others on the market, though they can still sound robotic. Like other digital assistants, it can struggle with homophones. When The New York Times asked the new ChatGPT how to spell “gym,” it said: “J-I-M.”
But one of the advantages of a chatbot like ChatGPT is that it can correct itself. When told “No, the other kind of gym,” the bot replied: “Ah, I see what you’re referring to now. The place where people exercise and work out is spelled G-Y-M.”
Though ChatGPT’s voice interface is reminiscent of earlier assistants, the underlying technology is fundamentally different. ChatGPT is driven primarily by a large language model, or L.L.M., which has learned to generate language on the fly by analyzing huge amounts of text culled from across the internet.
Older digital assistants, like Alexa and Siri, acted like command-and-control centers that could perform a set number of tasks or give answers to a finite list of questions programmed into their databases, such as “Alexa, turn on the lights” or “What’s the weather in Cupertino?” Adding new commands to the older assistants could take weeks. ChatGPT can respond authoritatively to virtually any question thrown at it in seconds — though it is not always correct.
As OpenAI is transforming ChatGPT into something more like Alexa or Siri, companies like Amazon and Apple are transforming their digital assistants into something more like ChatGPT.
Last week, Amazon previewed an updated system for Alexa that aims for more fluid conversation about “any topic.” It is driven in part by a new L.L.M. and has other upgrades to pacing and intonation to make it sound more natural, the company said.
Apple, which has not publicly shared its plans for how it will compete with ChatGPT, has been testing a prototype of its large language model for future products, according to two people briefed on the project.
When used via the web as well as on iPhone, iPad and Android devices, the new ChatGPT can also respond to images. Given a photograph, chart or diagram, it can provide a detailed description of the image and answer questions about its contents. This could be a useful tool for people who are visually impaired.
OpenAI first demonstrated the image tool in the spring, but the company said it would not be shared with the public until researchers better understood how the technology could be misused. Among other concerns, they worried the tool could become a de facto face recognition service used to quickly identify people in photos.
Microsoft introduced this kind of visual search tool, based on OpenAI’s technology, in its Bing chatbot over the summer.
Sandhini Agarwal, an OpenAI researcher who focuses on safety and policy, said the new version of the bot would now refuse efforts to identify faces. But it is designed to provide enormously detailed descriptions of other photos. Given an image from the Hubble Space Telescope, for example, it can respond with paragraphs detailing the contents of the photo.
The bot can also be a tool for students. Given an image of a high school math problem that includes words, numbers and diagrams, the bot can instantly read the problem and solve it. It could be an effective way to learn — or cheat.