OpenAI’s Whisper Transcription Tool Faces Scrutiny Over Fabricated Outputs
OpenAI’s Whisper, an AI-driven transcription tool, has garnered attention for its capabilities, but it has a significant drawback: it frequently generates erroneous text, including entire sentences that were never spoken. This phenomenon, known in the tech world as “hallucinations,” has been documented by numerous software engineers, developers, and academic researchers who have reported encountering these fabrications in their work.
Experts are concerned about the content Whisper can invent, which ranges from benign filler to troubling fabrications, including racial slurs, violent language, and fictitious medical treatments. The tool is nonetheless gaining traction across industries, including healthcare, where medical facilities are rushing to adopt Whisper-based tools to transcribe doctor-patient consultations. They are doing so despite OpenAI’s warning that Whisper should not be used in “high-risk domains” where accuracy is critical.
The extent of Whisper’s hallucinations is alarming. A University of Michigan researcher studying public meeting transcripts found invented text in eight out of ten audio transcriptions he examined. A machine learning engineer discovered hallucinations in about half of the more than 100 hours of Whisper transcriptions he analyzed, and a third developer reported similar issues in nearly all of the 26,000 transcripts he generated. A recent study of more than 13,000 clear audio snippets identified 187 hallucinations, a rate that would translate into tens of thousands of faulty transcripts across millions of recordings.
The ramifications of these errors are particularly concerning in healthcare contexts. Alondra Nelson, a former head of the White House Office of Science and Technology Policy, emphasized the serious implications of such inaccuracies, noting that “nobody wants a misdiagnosis.” The tool is also employed for creating closed captions for the Deaf and hard of hearing, a demographic particularly vulnerable to the risks of faulty transcriptions. Christian Vogler, director of Gallaudet University’s Technology Access Program, pointed out that those who rely on accurate captions have no way of identifying inaccuracies hidden within the text.
In light of these issues, experts, advocates, and even former OpenAI employees are calling for government regulation of AI technologies. At a minimum, they argue, OpenAI must address the flaw itself. William Saunders, a research engineer who previously worked at OpenAI, said, “This seems solvable if the company is willing to prioritize it,” adding that it is problematic when users become overconfident about what the tool can do.
Despite these challenges, OpenAI says it is actively working to reduce hallucinations and incorporates feedback into model updates. Still, developers say they have not encountered another AI-powered transcription tool that hallucinates as much as Whisper. The tool is widely integrated into various platforms, including some versions of OpenAI’s flagship chatbot ChatGPT, as well as cloud computing services provided by Oracle and Microsoft.
In the past month alone, one version of Whisper was downloaded more than 4.2 million times from the open-source AI platform Hugging Face, where it is the most popular open-source speech recognition model, used in everything from customer service call centers to voice assistants.
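For context on how widely accessible the open-source model is, the sketch below shows one common way developers call it through the Hugging Face transformers pipeline. The checkpoint name and audio file are illustrative assumptions, not a specific deployment described in this article.

```python
# Minimal sketch: transcribing a clip with an open-source Whisper checkpoint
# via the Hugging Face transformers pipeline. The checkpoint and file name
# are illustrative assumptions for demonstration only.
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")
result = asr("call_center_clip.wav")  # decoding non-WAV formats requires ffmpeg
print(result["text"])                 # transcript text; may include hallucinated sentences
```

Because the output is plain text, a fabricated sentence looks no different from genuinely spoken words, which is part of why hidden errors are so hard for readers of a transcript to spot.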
Research by professors Allison Koenecke of Cornell University and Mona Sloane of the University of Virginia found that nearly 40% of the hallucinations they identified were harmful or misleading. In one transcription they cited, the software added a violent narrative to a benign statement, implying the speaker had committed acts of violence with a “terror knife.” In another, it inserted racial identifiers into a speaker’s comment, and in a third, it invented a fictional medication called “hyperactivated antibiotics.”
Experts are still trying to determine why Whisper produces these fabrications; software developers note that hallucinations tend to occur amid pauses, background noise, or music. OpenAI has advised against using Whisper in contexts where accuracy is paramount, such as medical decision-making.
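As an illustration of the kind of guardrail developers describe, the sketch below flags near-silent clips before they are transcribed, since that is where hallucinations are reported to cluster. The energy threshold and file name are arbitrary assumptions for demonstration; this is not a documented OpenAI or vendor safeguard.

```python
# Illustrative sketch: a crude energy check to flag near-silent audio before
# transcription, because hallucinations reportedly cluster in pauses and noise.
import numpy as np
import soundfile as sf

def is_mostly_silence(path: str, rms_threshold: float = 0.01) -> bool:
    """Return True if the clip's overall RMS energy falls below a rough threshold."""
    samples, _sample_rate = sf.read(path)
    if samples.ndim > 1:              # mix stereo down to mono
        samples = samples.mean(axis=1)
    rms = float(np.sqrt(np.mean(np.square(samples))))
    return rms < rms_threshold

if is_mostly_silence("patient_visit_clip.wav"):
    print("Near-silent clip; any transcript would need extra scrutiny")
else:
    print("Clip contains audible signal; transcribing as usual")
```

A check like this can only reduce exposure; it does nothing to catch hallucinations that appear in clips containing ordinary speech.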
Despite this caution, many healthcare providers are using Whisper-based tools to transcribe patient visits, with the aim of freeing clinicians to spend more time on patient care and less on documentation. More than 30,000 clinicians and 40 health systems have adopted such a tool developed by Nabla, a company that specializes in medical transcription technology.
Nabla’s chief technology officer, Martin Raison, acknowledged that Whisper can hallucinate and said the company is actively working to mitigate the problem. However, Nabla’s tool erases the original audio recordings to protect patient data, which could make it difficult to verify a transcript against what was actually said. Saunders cautioned that without access to the original recordings, errors in the transcriptions are hard to identify and correct.
As concerns grow over the impact of AI-generated transcripts on patient care, lawmakers are taking notice. California Assemblymember Rebecca Bauer-Kahan recently declined to allow a health network to share her child’s medical consultation audio with vendors, including those affiliated with OpenAI, citing privacy concerns. “I was like ‘absolutely not,’” she said, expressing her hesitation about sharing sensitive medical information with for-profit tech companies.
The implications of Whisper’s flaws extend beyond technical shortcomings; they raise fundamental questions about the ethics of deploying AI in sensitive fields such as healthcare, where the stakes are especially high.