OpenAI's Whisper is a Precursor to GPT-4
Stable Diffusion and Whisper are just the beginning.
Hey Everyone,
We know OpenAI’s GPT-4 is coming soon. In the meantime, OpenAI has released Whisper, a general-purpose speech recognition model. It is trained on a large dataset of diverse audio and is a multi-task model that can perform multilingual speech recognition as well as speech translation and language identification.
Check out the demo on Hugging Face.
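If you would rather run the model yourself than use the hosted demo, here is a minimal sketch of the transcription, translation and language identification tasks described above. It assumes the open-source `openai-whisper` Python package and ffmpeg are installed. The helper names `build_options` and `run_whisper`, the "base" checkpoint and the file name in the comment are my own placeholders, not anything from OpenAI's post.

```python
# Minimal sketch, assuming `pip install openai-whisper` and ffmpeg on the PATH.
# build_options and run_whisper are hypothetical helper names for illustration.

VALID_TASKS = {"transcribe", "translate"}


def build_options(task="transcribe", language=None):
    """Build the keyword arguments passed to Whisper's transcribe() call."""
    if task not in VALID_TASKS:
        raise ValueError(f"task must be one of {sorted(VALID_TASKS)}, got {task!r}")
    options = {"task": task}
    if language is not None:
        # When no language is given, Whisper tries to detect it from the audio.
        options["language"] = language
    return options


def run_whisper(audio_path, model_name="base", task="transcribe", language=None):
    """Return (language, text) for an audio file."""
    import whisper  # imported lazily so the helper above works without the package

    model = whisper.load_model(model_name)  # downloads the checkpoint on first use
    result = model.transcribe(audio_path, **build_options(task, language))
    return result["language"], result["text"]


# Example usage (the file name is a placeholder):
#   language, text = run_whisper("interview.mp3")                    # same-language transcript
#   language, text = run_whisper("interview.mp3", task="translate")  # English translation
```

The "translate" task always targets English, so one call can turn speech in another language into an English transcript.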
A Well-Trained General ASR
With Whisper, OpenAI has created an ASR model that is much more accurate while being able to transcribe a wide variety of languages.
If you try the demonstrations, you’ll see that talking fast or with a lovely accent doesn’t seem to affect the results. The post mentions it was trained on 680,000 hours of supervised data. If you were to talk that much to an AI, it would take you 77 years without sleep!
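The 77-year figure is easy to check. Here is a quick back-of-the-envelope sketch, assuming nonstop talking around the clock and an average year of 365.25 days (both are my assumptions, not numbers from OpenAI's post):

```python
# Back-of-the-envelope check of the "77 years without sleep" figure.
HOURS_OF_AUDIO = 680_000   # training data size quoted in the post
HOURS_PER_DAY = 24         # assumption: talking around the clock
DAYS_PER_YEAR = 365.25     # assumption: average year length including leap years


def hours_to_years(hours):
    """Convert a number of hours into years of continuous time."""
    return hours / HOURS_PER_DAY / DAYS_PER_YEAR


years = hours_to_years(HOURS_OF_AUDIO)
print(f"{years:.1f} years of nonstop talking")  # prints "77.6 years of nonstop talking"
```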
Whisper is an open-sourced neural net that approaches human-level robustness and accuracy on English speech recognition. I personally think the release of GPT-4 is near, likely in the next three to six months, writing as of September 2022.
We are approaching a time when universal real-time translation and speech-to-text transcription will become free for all and ubiquitous in all aspects of society. This will change how we experience the Metaverse, even as we flee a synthetic “old internet” full of deepfakes, spam and pointless content. In reality we’ve all been leaving the likes of Facebook, LinkedIn and Twitter for years.
It’s time for the next iteration of the internet. Already creators not doing video are at a disadvantage as the mobile layer of the internet dominates, but that won’t be the last version of the internet as we look to more immersive realities.
While we can debate whether Whisper is a major achievement, there’s a Stable Diffusion or Whisper-like product nearly every month now. You can read the paper to see how the generalized training underperforms some specifically trained models on standard benchmarks, though the authors believe Whisper does better on varied speech beyond those particular benchmarks.
Some actually believe the Whisper open source model may become a building block in future speech-to-text apps. But the same could be said for how DALL-E 2 was improved upon with more open-source access and utility. We create advertisements and paywalls around certain aspects of internet usage, but in truth internet access is a human right, as is access to utilities like apps.
As A.I. evolves, what we once considered expensive, exclusive access will, a few years later, become basic and available to all.
As interest grows around large AI models, particularly large language models (LLMs) like OpenAI’s GPT-3, companies like Nvidia will power them even better, and it may indeed be the end of Moore’s law. With chip supply chains a bit broken, it may be that chips "going... down in price is indeed a story of the past."
However, apps and tools that humans use to augment themselves with A.I. are just beginning. They will change the sounds, art, visual identity and communication of the future iterations of the internet and our own global culture as well. And you could argue that in 2022 they are doing so at a pretty impressive rate.
Stable Diffusion, Midjourney and DALL-E 2 have exploded in popularity and mass adoption. In the same spirit, Whisper, which OpenAI trained on 680,000 hours of audio data and matching transcripts in 98 languages collected from the web, feels like an inclusive harmony of large AI models at the service of humanity. Companies like Hugging Face have rallied feasts of collaboration and a more decentralized volunteer society around machine learning than perhaps ever before.
A.I. is now directly shaping the culture of the internet, even as younger people are using TikTok instead of Google for search. Curiously, the short-video internet is far less accurate and more prone to misinformation. We can debate whether a Stable Diffusion image or painting is really so noble, even as it copies the style of artists both alive and dead.
GPT stands for "Generative Pre-trained Transformer," an approach OpenAI introduced in a 2018 paper on generative pre-training. On February 14, 2019, OpenAI announced GPT-2, which became famous within the machine learning community for producing surprisingly coherent written text samples. It used 1.5 billion parameters.
On May 28, 2020, OpenAI released GPT-3, a 175 billion parameter model widely regarded as having impressive language generation abilities. Suffice it to say that we are really due for GPT-4. Hopefully it’s coming soon!
Work it
Make it
Do it
Makes us
Harder
Better
Faster
Stronger ~ Daft Punk
Onwards Better?
It’s no secret today’s young people prefer searching for recommendations on video apps over text-based search engines. The synthetic creations we make aren’t necessarily all improving upon our humanity, though they do feel faster, more convenient and somehow more automated, augmented and immediate. They feed our desire for instant gratification and speed.
OpenAI describes Whisper as an encoder-decoder transformer, a type of neural network that can use context gleaned from input data to learn associations that can then be translated into the model's output.
By open-sourcing Whisper, OpenAI hopes to introduce a new foundation model that others can build on in the future to improve speech processing and accessibility tools. OpenAI has a significant track record on this front. In January 2021, OpenAI released CLIP, an open source computer vision model that arguably ignited the recent era of rapidly progressing image synthesis technology such as DALL-E 2 and Stable Diffusion.
Whatever OpenAI does or Microsoft does with GitHub Copilot, this is just the beginning and I’m sure we can improve on it in a less commercial way. The movement towards open-access (not the same as open-source) is strong in our A.I. ethics veins now. I am a bit changed after my chat with Nathan Lambert, an interview I will post on A.I. Supremacy Newsletter soon.
Facebook has been telling us since 2004 that it was “building a more connected world.” Fast forward nearly 20 years, and its apps are now literally “husks of a bygone dying internet.” Some will call the A.I. apps progress, but they will also make us subservient to the new rituals, systems and lazy formulas of human convenience with technology. What in the end will be left of our humanity after all is said and done? Even as the Neanderthals went extinct, they still live on in us. So one wonders, as we embed A.I. in our society, what will happen to our own old rituals and preferences?
Neanderthals became extinct around 40,000 years ago. Human extinction is also somewhat inevitable the way we are going. It’s possible our immortality may actually lie in building artificial intelligence built to endure. Built to pollinate our descendants through the stars, and awaken and guide them.
Building more addictive feeds, apps, and bots capable of replicating what we do better than even we can - I’m not sure that’s a productive use of A.I. as it learns to get better in the end. The profit motive is skewing how we use the tool and how we shape the tool of A.I. As TikTok has an immense power to modify human behavior at scale, and narrow algorithms can create ideological indoctrination tools, we have to finally build rules to guide their principles and their use.
The commercialization of deep fakes means a whole new era of cyber crime, deception, misinformation and an internet monetized on the weaponization of A.I. We cannot be as proud and smug as the PR of OpenAI or DeepMind would suggest. That’s not even the reality the majority of us live in. It’s what humanity does with these new utilities, conveniences and augmentations that matters.
Even the A.I. writers now write like bots replicating the PR of corporations and their minions. But the reality is these are just the many baby steps of infatuation with technology. Meanwhile humanity hasn’t even evolved the basic rule of law, ethical frameworks and corporate guard rails for what A.I. itself will become.
Our so-called “foundational models” will also lead to a new kind of rot that will need to be disrupted. We don’t have time to be proud of our achievements, because the future demands that we be more mature as we build A.I. Machine learning builds its own momentum through our systems, driven by the greed-to-power imperative of A.I. Supremacy embedded in this global movement of surveillance capitalism to which we have become attuned.