Gato as a Precursor to Sub-Human Artificial General Intelligence
Gato is a playful poke at how close, and how far, we are from AGI
What if Transformers at scale really do prove to be a breakthrough for sub-human-level AI, or SHLAI?
Gato, as the agent is known, is DeepMind's generalist AI that can perform many different tasks that humans can do, without carving a niche for itself as an expert on any one task.
There must be some who think that DeepMind’s Gato is a game changing announcement. Maybe they are right?
Gato performs over 450 of its 604 tasks at or above the 50%-expert-score threshold.
Maybe the hype is real? Scalable Transformers will be able to do some “new” things. Perhaps it’s not a huge step towards AGI, but it could certainly build a case for a sub-human-level AI that’s good at many things at the same time.
What hyperparameters, up to and including overall architecture, might we push on and make progress on in the coming decade? How far can our current advances in transformer architectures take us?
Gato can perform more than 600 different tasks, such as playing video games, captioning images, and moving real-world robotic arms. Gato is a:
Multi-modal
Multi-task
Multi-embodiment generalist policy.
This cat sounds really good on paper!
Gato is compact; Gato can jump higher with less.
Gato is small, parameters-wise. At 1.2 billion parameters, it's roughly 1/150th the size of the largest GPT-3 model and about 1/450th the size of PaLM.
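A quick sanity check on those size ratios, using the published parameter counts from the respective papers (the fractions quoted in commentary are rough approximations):

```python
# Published parameter counts: Gato 1.2B, GPT-3 (largest) 175B, PaLM 540B.
gato, gpt3, palm = 1.2e9, 175e9, 540e9

print(f"GPT-3 / Gato ~ {gpt3 / gato:.0f}x")  # ~146x
print(f"PaLM / Gato  ~ {palm / gato:.0f}x")  # ~450x
```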
DeepMind is one of the most well-known AI companies dedicated to the advancement of artificial intelligence. With several programs, it aims to offer new ideas and improvements in machine learning, engineering, simulation, and computer infrastructure. You could make a good argument that DeepMind is the best assembly of A.I. talent in the world as of 2022.
The Little Cat that Could
Gato is a miniaturized version of models to come in the near future.
Gato is a pet; it's not a serious attempt at AGI.
Looking across the field of AI research today, there are two common categories of problems scientists are focused on: prediction and control.
Prediction models try to learn about a domain (such as weather patterns) and understand how it might evolve, while control models prompt agents to take actions in that environment.
Building a successful path to AGI requires understanding and developing algorithms in both spaces, accounting for all the variations that our natural and social environments throw at us, from how viruses mutate or how language may evolve in use and meaning over time to how to help produce energy from fusion power.
DeepMind, I'd argue, is better at prediction than at control.
As robotics and neuromorphic computing mature, control will get easier; however, this will likely take decades, not years.
Think about it: Gato is not just a transformer but also an agent. You can think of it as a transformer combined with an RL agent for multi-task reinforcement learning, able to perform multiple tasks, which is why some observers think it's game-changing. Game over, game-changing? Maybe; we will see. And meow, I see you, Gato!
Without expert collaboration, AI researchers cannot make significant progress in real-world domains. Once Tesla's robotics efforts come up to speed, they will have to work with people at DeepMind and Microsoft Research and try to accomplish some of these things before China does. I think this is partially how it will play out.
Robotics and data become as important in the end as transformer models and training. Size matters, but so does experience.
Identifying the right paths forward in these fields requires partnerships across disciplines, leveraging a common scientific approach to develop and use AI to navigate complex questions at the heart of society's most urgent needs. DeepMind (Google) should not be given too much credit; many of these AI labs don't even share their source code in an open-source way! This makes it difficult for human knowledge to really scale at a decent pace.
DeepMind and OpenAI researchers in 2022 have developed a habit of placing a lot of things that are certainly not AGI within artificial general intelligence's domain. This is pretty unfortunate. But consider that they may mean sub-human-level AI (or SHLAI). They are trying to be positive and commercialize their AI lab's advantage, whether real or self-prophesied for profit.
I like Cats and I like A.I. but I don’t necessarily feel Gato is a breakthrough.
The guiding design principle of Gato is to train on the widest variety of relevant data possible, including diverse modalities such as images, text, proprioception, joint torques, button presses, and other discrete and continuous observations and actions.
Could Gato’s Tokenization model be a game-changer?
After converting data into tokens, they use the following canonical sequence ordering:
Text tokens in the same order as the raw input text.
Image patch tokens in raster order.
Tensors in row-major order.
Nested structures in lexicographical order by key.
Agent timesteps as observation tokens followed by a separator, then action tokens.
Agent episodes as timesteps in time order (p. 3).
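The ordering rules above can be sketched as a small serialization routine. This is an illustrative reconstruction, not DeepMind's code; the function names, the separator value, and the flat integer tokens are all assumptions for the sketch.

```python
# Illustrative sketch of Gato-style canonical sequence ordering
# (not DeepMind's code; names and token values are hypothetical).

def order_timestep(observation: dict, action: list, separator: int = -1) -> list:
    """Serialize one agent timestep: observation tokens, separator, action tokens.

    Nested structures are flattened in lexicographical order by key,
    mirroring the paper's canonical ordering.
    """
    tokens = []
    for key in sorted(observation):          # lexicographical order by key
        value = observation[key]
        if isinstance(value, dict):          # nested structure: recurse by key
            for sub_key in sorted(value):
                tokens.extend(value[sub_key])
        else:                                # tensor tokens, already row-major
            tokens.extend(value)
    tokens.append(separator)                 # separator between obs and actions
    tokens.extend(action)                    # action tokens last
    return tokens

def order_episode(timesteps: list) -> list:
    """An episode is just its timesteps concatenated in time order."""
    episode = []
    for obs, act in timesteps:
        episode.extend(order_timestep(obs, act))
    return episode
```

For example, a timestep with image-patch tokens and proprioception tokens serializes with keys in alphabetical order, then the separator, then the action.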
DeepMind says that Gato is trained on a large number of datasets comprising agent experience in both simulated and real-world environments, in addition to a variety of natural language and image datasets.
Clearly, how they developed Gato is of great interest. It's salient, but to be game-changing it would need to be replicable, and a less miniature example would be needed. I'm sure one is coming soon, likely around the same time OpenAI launches GPT-4, whatever and whenever that will be.
Gato is a large transformer with a single sense-modality: it receives diverse kinds of inputs pressed into a sequence of tokens, and outputs more tokens. This is just applying the current successful language-model architecture to varied-domain problem solving in the naïve way.
The original paper from DeepMind
This video is of some interest.
Gato could be a way to train the future robots of this world.
From the paper: "During training, for 25% of the sequences in each batch, a prompt sequence is prepended, coming from an episode generated by the same source agent on the same task. Half of the prompt sequences are from the end of the episode, acting as a form of goal conditioning for many domains; and the other half are uniformly sampled from the episode. During evaluation, the agent can be prompted using a successful demonstration of the desired task, which we do by default in all control results that we present here."
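The sampling scheme described above (25% of sequences get a prompt; half of those prompts come from the episode's end, half from a uniform window) can be sketched as follows. This is a hedged reconstruction with hypothetical names, not DeepMind's implementation.

```python
import random

# Illustrative sketch of Gato's prompt conditioning (names are hypothetical).
def maybe_prepend_prompt(sequence, episode, prompt_len, rng=random):
    """With 25% probability, prepend a prompt drawn from an episode
    generated by the same source agent on the same task.

    Half of prompts come from the end of the episode (goal conditioning);
    the other half are uniformly sampled windows from the episode.
    """
    if rng.random() >= 0.25:              # 75% of sequences: no prompt
        return sequence
    if rng.random() < 0.5:                # goal conditioning: episode's end
        prompt = episode[-prompt_len:]
    else:                                 # uniform window from the episode
        start = rng.randrange(0, max(1, len(episode) - prompt_len + 1))
        prompt = episode[start:start + prompt_len]
    return prompt + sequence
```

At evaluation time the same mechanism lets you prepend a successful demonstration of the desired task as the prompt.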
If you think about it, Gato's architecture isn't that different from many of the AI systems in use today. In the sense that it's a Transformer, it's similar to OpenAI's GPT-3.
The Transformer has been the architecture of choice for complicated reasoning tasks, displaying abilities in summarizing texts, producing music, categorizing objects in photos, and analyzing protein sequences. Gato is a good transformer-cat. Nobody should be skeptical of how this feels; it's a good performer.
DeepMind's unabashed optimism is like a Google edict; they have to say this stuff. Many other categories of AGI problems are yet to be solved, from causality to efficient learning and transfer, and as algorithms become more general, more real-world problems will be solved, gradually contributing to a system that one day will help solve everything else, too.
Gato Foreshadows How SHLAI Can be Trained
A major part of the guts of the model is its use of internal prompt programming. Context-length limits don't prevent training Gato to high performance, but they do prevent us from fully testing the out-of-the-box model's few-shot generalization abilities. Gato, as a sub-human-level AI pet, demonstrates a lot; it's a good teacher for future projects.
DeepMind is certainly a trailblazer in how things should be done.
Gato is trained to do RL-style tasks by supervised learning on token sequences generated from state-of-the-art RL model performance. These tasks take place in both virtual and real-world robot-arm environments. With the rise of synthetic AI, how we train robots and A.I. agents in the Metaverse will certainly be quite different.
Gato is also a cool cat for other obvious reasons. Gato has a parameter count that is orders of magnitude lower than single-task systems, including GPT-3. Parameters are system components learnt from training data that fundamentally describe the system's ability to solve a problem, such as text generation. GPT-3 has 175 billion, while Gato has only 1.2 billion.
I think we can say that OpenAI and DeepMind are flirting with sub-human-level AI in various ways, which is a good teaser for the future.
So they aren't necessarily solving the bigger AGI challenges: learning human-centric capabilities like sensory perception, motor skills, problem-solving, and human-level creativity, along with the lack of a working protocol, reduced universality, business alignment, and AGI direction.
Gato is a cat, not a person. Gato, after a small amount of fine-tuning, catches up with SOTA RL expert models. It does this even with real-world colored-block stacking tasks; Gato is capable of interfacing with the physical world, albeit in a controlled environment.
Gato’s meow will be heard though, and this is because it’s symbolic. Gato was inspired by works such as GPT-3 (Brown et al., 2020) and Gopher (Rae et al., 2021), pushing the limits of generalist language models; and more recently the Flamingo (Alayrac et al., 2022) generalist visual language model. The real story is what’s next?
Do you want access to exclusive content? Join 21 other paying subscribers for more. I have yet to really up the cadence in this Newsletter but I will be doing so.