Artificial intelligence has been around in some form for decades, but recent advances with generative AI models like OpenAI’s ChatGPT and DALL-E represent a significant leap in capability. The ability to automatically generate content that is virtually indistinguishable from content created by a human has significant implications—both potential benefits and possible concerns. The use of AI and deep learning to generate deepfake audio, however, has particularly insidious potential.
Microsoft VALL-E
Microsoft recently shared details of a new AI voice generator called VALL-E that can simulate a person's voice from an audio sample as short as three seconds. This technology has the potential to revolutionize the way we interact with AI, but it also raises important questions about security and privacy.
VALL-E is a text-to-speech model that uses machine learning to create synthetic speech that mimics real people. The technology is based on a neural network that, according to Microsoft's researchers, was trained on about 60,000 hours of English speech from thousands of speakers. Given a short sample of a particular voice, it can generate new audio that sounds like the original speaker. In Microsoft's published samples the results are strikingly close to the original, matching pitch, tone, and cadence.
Technology like VALL-E has tremendous potential value. It could be used to create personalized voice assistants that sound like the user, making the experience more natural and intuitive. It could also be used to create more realistic-sounding virtual assistants for customer service, or to generate synthetic voices for podcasts, videos, and other media.
Social Engineering on Steroids
VALL-E also raises important security and privacy concerns. Because it can generate realistic-sounding synthetic voices, it could be used to impersonate real people with fake audio that listeners cannot tell from the real thing. That opens the door to spreading misinformation and committing fraud.
Business Email Compromise (BEC) and phishing scams are a significant threat for organizations. A European subsidiary of Toyota lost $37 million through a BEC scam in 2019, and the FBI estimates that BEC scams cost businesses more than $2 billion per year. Now, imagine how effective a BVC (Business Voice Compromise) scam will be. What happens when someone in accounting receives a phone call or voicemail from the CFO, or CEO—or at least a voice claiming to be one of those individuals that is indistinguishable from the actual voice—directing them to share credentials or transfer funds?
Preventing Nefarious Audio Deepfakes
Microsoft didn't make the actual tool public. It published a research paper and audio samples, but it has not released the VALL-E source code or the trained model. In the paper, the researchers acknowledge that the technology could be misused to spoof voice identification or impersonate a specific speaker. They suggest that a detection model could be built to tell whether an audio clip was synthesized by VALL-E, and they say Microsoft's AI principles will guide further development.
Even so, it is important to keep monitoring how this technology is developed and used to ensure that it is used responsibly. Governments and other organizations could implement regulations to protect against the misuse of AI-generated synthetic voices, but the problem with laws in general is that only law-abiding people follow them.
In the future it will be harder, perhaps impossible, to determine what is real. When you receive a phone call or voice message from your boss, a parent, or a significant other, how can you be sure it is really them and not a deepfake? Individuals should be aware of the risks and take steps to protect themselves, such as being skeptical of audio from unknown sources, of offers that sound too good to be true, and of unusual requests, even when they appear to come from people they know.
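For organizations that want to turn that skepticism into a written rule, the idea can be sketched in a few lines of Python. This is only an illustrative sketch, not an established standard or anything Microsoft has published: the `VoiceRequest` fields, the `SENSITIVE_ACTIONS` set, and the function name are all hypothetical, and the rule itself (confirm through a separate channel whenever the source is unknown, the request is unusual, or it involves credentials or money) is an assumption drawn from the scenarios described above.

```python
from dataclasses import dataclass

# Hypothetical list of high-risk requests, taken from the article's examples
# (sharing credentials, transferring funds). A real policy would define its own.
SENSITIVE_ACTIONS = {"share_credentials", "transfer_funds"}


@dataclass
class VoiceRequest:
    claimed_sender: str      # who the voice claims to be
    from_known_number: bool  # whether the call came from a number already on file
    action: str              # what the caller is asking for
    is_unusual: bool         # whether the request departs from normal routine


def needs_independent_verification(request: VoiceRequest) -> bool:
    """Return True when the request should be confirmed through a separate channel."""
    if not request.from_known_number:
        return True  # audio from an unknown source
    if request.action in SENSITIVE_ACTIONS:
        return True  # credentials or money are involved
    return request.is_unusual  # unusual request, even from a familiar voice


# Example: a voicemail that sounds like the CFO, from a known number, asking for a transfer.
call = VoiceRequest("CFO", True, "transfer_funds", False)
print(needs_independent_verification(call))  # True
```

The point of the sketch is that the decision never depends on how convincing the voice sounds, because that is exactly the signal a deepfake is built to defeat.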
Generative AI has a variety of legitimate uses and potential benefits. A tool like Microsoft's VALL-E has the potential to fundamentally change the way we interact with AI and to revolutionize human-computer interaction. However, the security and privacy implications of the technology cannot be overlooked.
The onus is on the stakeholders to ensure the responsible use of the technology, and to make sure that it is used for the benefit of society as a whole.
