Cicero
Diplomacy is a strategy game that resembles a simplified Sid Meier's Civilization with human negotiation at the heart of the gameplay. Players constantly talk to each other to form and break alliances and to make, keep, or break strategic promises.
Cicero is an agent created by Meta to outperform humans in this game. The AI analyses the current game situation, predicts what other players will do, constructs a strategy, then holds meaningful conversations with human participants to carry out its strategy and win.
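To make that loop concrete, here is a heavily simplified toy sketch of a turn. Every name and all of the logic below are hypothetical placeholders standing in for Cicero's learned components, not Meta's actual code:

```python
# Toy sketch of a Cicero-style turn: predict others' moves, plan
# against them, then talk in support of the plan. All names and logic
# are illustrative placeholders, not Meta's implementation.
import random

PLAYERS = ["Austria", "England", "France"]

def predict_policy(state, player):
    # Placeholder: the real system conditions a learned model on the
    # board state and dialogue history; here we just guess.
    return random.choice(["attack", "defend", "support"])

def plan_orders(state, predicted):
    # Placeholder planner: defend against attackers, otherwise expand.
    attackers = [p for p, move in predicted.items() if move == "attack"]
    return "defend" if attackers else "expand"

def generate_message(plan, player):
    # Placeholder for the dialogue model: the key idea is that messages
    # are grounded in the chosen plan, not free-floating chat.
    return f"{player}, let's coordinate: I intend to {plan} this turn."

def play_turn(state):
    predicted = {p: predict_policy(state, p) for p in PLAYERS}
    plan = plan_orders(state, predicted)
    for p in PLAYERS:
        print(generate_message(plan, p))
    return plan

play_turn(state={})
```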
According to the abstract of the research paper,
"Across 40 games of an anonymous online Diplomacy league, Cicero achieved more than double the average score of the human players and ranked in the top 10% of participants who played more than one game."
Inter-temporal modeling
Venturing deeper into components for more general AI, DeepMind builds agents that can interpret and execute human commands in an open-ended virtual environment.
The authors of the paper combined two techniques, behavioral cloning (BC) and reinforcement learning (RL), to significantly improve the agents' scores.
While behavioral cloning is employed to teach agents to mimic human actions in response to requests, reinforcement learning lets agents take note of human feedback and iterate on it.
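Here is a minimal toy sketch of that two-stage recipe in PyTorch. A simple stand-in function plays the role of the learned human-preference reward model; none of this is DeepMind's actual code:

```python
# Stage 1: behavioral cloning on human demonstrations.
# Stage 2: REINFORCE-style fine-tuning on a (simulated) feedback reward.
import torch
import torch.nn as nn

obs_dim, n_actions = 16, 4
policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

# Behavioral cloning: supervised learning on (observation, action)
# pairs taken from human demonstrations (random stand-ins here).
demo_obs = torch.randn(256, obs_dim)
demo_act = torch.randint(0, n_actions, (256,))
for _ in range(100):
    loss = nn.functional.cross_entropy(policy(demo_obs), demo_act)
    opt.zero_grad(); loss.backward(); opt.step()

def human_feedback_reward(obs, act):
    # Stand-in for a reward model trained on human preference judgements.
    return (act == obs.argmax(dim=-1) % n_actions).float()

# RL fine-tuning: sample actions, score them, reinforce good ones.
for _ in range(100):
    obs = torch.randn(64, obs_dim)
    dist = torch.distributions.Categorical(logits=policy(obs))
    act = dist.sample()
    reward = human_feedback_reward(obs, act)
    loss = -(dist.log_prob(act) * (reward - reward.mean())).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```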
You can gain a deeper understanding of these subjects by reading the paper. Watch the video in the official article and try imagining a robot performing these actions in your home. RIP dishes.
HyperTree Proof Search
"HTPS" for short, this algorithm presented by Meta can solve Math Olympiad problems, providing scientific notation proof trees to theorems.
The paper works with several "proving environments": limited sets of objects and rules used to prove theorems. HTPS is trained on examples of proofs written within these environments.
One of these proving environments is "Metamath": within it, a successful proof tree replaces the theorem's statement with logically equivalent expressions until everything is broken down into axioms.
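For intuition, here is a toy best-first proof search over a hand-made rewrite table. The real HTPS guides its search with learned policy and value networks over hypertrees; this sketch replaces them with a trivial "fewer open goals is better" heuristic:

```python
# Toy best-first proof search: each rule rewrites a goal into subgoals
# that must all be proved; axioms are goals with no subgoals left.
import heapq

RULES = {
    "a+b=b+a": [["a+b=c", "b+a=c"]],   # reduce the theorem to two lemmas
    "a+b=c":   [[]],                    # an axiom in this toy system
    "b+a=c":   [["a+b=c"]],             # follows from the commuted lemma
}

def prove(goal):
    # Frontier of (cost, list-of-open-goals); lowest cost expands first.
    frontier = [(0, [goal])]
    while frontier:
        cost, goals = heapq.heappop(frontier)
        if not goals:
            return True                 # every goal reduced to an axiom
        first, rest = goals[0], goals[1:]
        for subgoals in RULES.get(first, []):
            # Heuristic: prefer states with fewer open goals. A learned
            # value network plays this role in HTPS.
            new_goals = subgoals + rest
            heapq.heappush(frontier, (cost + len(new_goals), new_goals))
    return False

print(prove("a+b=b+a"))  # True
```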
This technology is essential to software verification, especially in cryptography and aerospace. We can only wonder when we will see genuinely open mathematical problems solved by AI.
ESMFold
One of the most significant AI-enabled advances in biology is protein structure prediction. First presented by DeepMind in AlphaFold and then improved upon in AlphaFold 2, the technique has already effectively shortened certain studies by years and opened fantastic research possibilities, making it one of the biggest breakthroughs in the entire history of biological research.
This month, Meta releases its approach to the problem, presenting a new protein structure prediction model: ESMFold. The release is paired with an interactive "Metagenomic Atlas", allowing exploration of some of the 617 million predicted protein structures.
Several versions of the ESMFold model are open-source, up to a 15B-parameter one. This research shows that game-changing AI can be built upon and further improved, with competition arising even in industry-specific fields of research.
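Since the models are open-source, folding a sequence takes only a few lines. The sketch below follows the usage shown in the facebookresearch/esm README (package and function names per that repo; a GPU with plenty of memory is assumed):

```python
# Predict a protein structure with ESMFold (pip install "fair-esm[esmfold]").
import torch
import esm

model = esm.pretrained.esmfold_v1()
model = model.eval().cuda()

# Any amino-acid sequence works; this short one is just an example.
sequence = "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG"
with torch.no_grad():
    pdb_string = model.infer_pdb(sequence)

with open("prediction.pdb", "w") as f:
    f.write(pdb_string)  # standard PDB file, viewable in any viewer
```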
SegCLR
How about some brain-mapping action? Google proposes an algorithm that can identify single cells, their shapes, and internal structures, and produce representations of the data that can be utilized in a variety of tasks.
SegCLR can be used to distinguish cell types even from small fragments and analyze patterns of brain connectivity. The work enables deeper research of newly released, publicly available high-resolution brain maps, including human samples and the "cubic millimeter mouse cortex dataset", which is such a fun and geeky thing to repeat to yourself while doing household chores.
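The core training signal is contrastive: two views of the same cell fragment should embed close together, while fragments of different cells land far apart. Below is a generic SimCLR-style (NT-Xent) loss sketch illustrating that idea, not Google's actual SegCLR code:

```python
# Generic contrastive (NT-Xent) loss over paired embeddings.
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.1):
    """z1, z2: (N, D) embeddings of two views of the same N fragments."""
    z = F.normalize(torch.cat([z1, z2]), dim=1)   # (2N, D), unit norm
    sim = z @ z.T / temperature                   # cosine similarities
    sim.fill_diagonal_(float("-inf"))             # exclude self-pairs
    n = z1.size(0)
    # The positive for row i is its other view: i+n (or i-n).
    targets = torch.cat([torch.arange(n) + n, torch.arange(n)])
    return F.cross_entropy(sim, targets)

z1, z2 = torch.randn(8, 32), torch.randn(8, 32)   # stand-in embeddings
print(nt_xent(z1, z2))
```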
MinD-Vis
Isn't this insane? Under the right conditions, we can extract visual concepts straight from the brain. While a person looks at an image, their brain activity is recorded with fMRI, and MinD-Vis uses a diffusion model to reconstruct the image from that data.
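At a high level, an encoder compresses the fMRI recording into a conditioning signal that steers the diffusion model. The toy sketch below illustrates only that conditioning pattern; the actual MinD-Vis pretrains a masked-autoencoder fMRI encoder and injects it into a latent diffusion model via cross-attention:

```python
# Toy fMRI-conditioned denoiser; everything here is illustrative.
import torch
import torch.nn as nn

fmri_dim, cond_dim, img_dim = 512, 64, 3 * 32 * 32

# 1) Encode the fMRI recording into a compact conditioning vector.
fmri_encoder = nn.Sequential(
    nn.Linear(fmri_dim, 256), nn.ReLU(), nn.Linear(256, cond_dim))

# 2) A denoiser that predicts the noise in a corrupted image, given the
#    timestep and the fMRI condition (concatenated here; the real model
#    injects the condition through cross-attention instead).
denoiser = nn.Sequential(
    nn.Linear(img_dim + cond_dim + 1, 1024), nn.ReLU(), nn.Linear(1024, img_dim))

def predict_noise(noisy_image, t, fmri):
    cond = fmri_encoder(fmri)
    return denoiser(torch.cat([noisy_image, cond, t[:, None]], dim=1))

# Training pairs (image shown to the subject, fMRI recorded) would drive
# a standard diffusion loss on predict_noise; sampling then reconstructs
# an image from fMRI alone.
x, fmri, t = torch.randn(4, img_dim), torch.randn(4, fmri_dim), torch.rand(4)
print(predict_noise(x, t, fmri).shape)  # torch.Size([4, 3072])
```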
In recent months, a couple of papers have shown that it is possible to decode linguistic concepts from brain activity. This model opens a first doorway toward extracting high-quality visual imagery, unlocking new potential for human creativity. If these technologies mature and reach consumer markets, our world will be transformed by BCI-enabled telepathy and telekinesis: mind-controlled devices and the instant visualization of ideas outline the next paradigm shift in human experience.
Wav2Vec and the Brain
In this research, the authors presented the same audiobooks to the self-supervised speech model Wav2Vec 2.0 and to several groups of people with different native languages. They collected fMRI data from the people listening, compared the speech-processing patterns that emerged in the human brain and in Wav2Vec 2.0, and found striking similarities!
The results are so difficult to paraphrase that it makes sense to simply quote a paragraph from the paper.
"Our results provide four main contributions. First, self-supervised learning leads Wav2Vec 2.0 to learn latent representations of the speech waveform similar to those of the human brain. Second, the functional hierarchy of its transformer layers aligns with the cortical hierarchy of speech in the brain, and reveals the whole-brain organisation of speech processing with an unprecedented clarity. Third, the auditory-, speech-, and language-specific representations learned by the model converge to those of the human brain. Fourth, behavioral comparisons to 386 supplementary participants’ results on a speech sound discrimination task confirm this common language specialization."
Diffusion Distillation
The v2 of this paper came out on November 30th, delivering a fantastic improvement to diffusion algorithms that speeds up image generation by more than tenfold.
The improvements apply to Stable Diffusion models, and the research is supported by Stability AI. We can expect to try out this advanced tech in the coming months.
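This line of work builds on progressive distillation: a student network is trained so that one of its denoising steps reproduces two steps of a frozen teacher, halving the number of sampling steps per round. A toy vector-space sketch of that idea (real implementations operate on images with the proper DDIM math):

```python
# Toy progressive distillation: one student step matches two teacher steps.
import torch
import torch.nn as nn

dim = 32
teacher = nn.Sequential(nn.Linear(dim + 1, 128), nn.ReLU(), nn.Linear(128, dim))
student = nn.Sequential(nn.Linear(dim + 1, 128), nn.ReLU(), nn.Linear(128, dim))
student.load_state_dict(teacher.state_dict())  # warm-start from the teacher
opt = torch.optim.Adam(student.parameters(), lr=1e-4)

def step(model, x, t, dt):
    """One toy denoising update from time t to t - dt."""
    t_in = torch.full((x.size(0), 1), t)
    return x - dt * model(torch.cat([x, t_in], dim=1))

for _ in range(200):
    x = torch.randn(64, dim)
    t = float(torch.rand(()))
    with torch.no_grad():                    # two small teacher steps...
        mid = step(teacher, x, t, 0.05)
        target = step(teacher, mid, t - 0.05, 0.05)
    pred = step(student, x, t, 0.10)         # ...matched by one student step
    loss = nn.functional.mse_loss(pred, target)
    opt.zero_grad(); loss.backward(); opt.step()
```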
eDiff-I
It's Nvidia's turn to present its generative AI for static imagery. With recent community efforts around Stable Diffusion and especially the Midjourney v4 update, can eDiff-I offer something unique? Yes, it can!
eDiff-I is the first to support a generation workflow called "Paint with words", which gives an insight into the future of design. When generating an image from a prompt and a sketch, eDiff-I lets users label areas of the sketch with specific words, providing a powerful new way to control the composition.
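One way to implement such region control, sketched on toy tensors below, is to bias the cross-attention scores between a chosen prompt token and the pixels inside the painted region; eDiff-I's exact formulation may differ:

```python
# Toy "paint with words": boost one token's attention inside a region.
import torch

pixels, tokens, d = 64, 4, 8        # 8x8 canvas flattened, 4 prompt tokens
q = torch.randn(pixels, d)          # pixel queries
k = torch.randn(tokens, d)          # token keys

# The user paints token 2 ("moon", say) onto the top-left quadrant.
region = torch.zeros(8, 8)
region[:4, :4] = 1.0
mask = region.flatten()             # (pixels,)

scores = q @ k.T / d**0.5           # (pixels, tokens)
boost = 2.0                         # strength of the user's painting
scores[:, 2] += boost * mask        # bias attention inside the region
attn = scores.softmax(dim=-1)       # token 2 now dominates that quadrant
print(attn[:, 2].reshape(8, 8).round(decimals=2))
```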
Moreover, eDiff-I offers several other features, including style transfer and richer text embeddings that drastically improve the generation of words inside images, akin to Google's Imagen.
Magic3D
More technical power plays from Nvidia. Here is a model that can generate a 3D mesh from a text prompt.
A 3D mesh is a structure consisting of vertices (points) connected into polygons to form an object. It is one of the standard formats for 3D models and has immensely broad use: in gaming, virtual reality, manufacturing, and architecture, to name a few.
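To see how simple the format is at its core, here is a minimal Python sketch writing a single triangle to a Wavefront OBJ file, one common mesh interchange format (file name is arbitrary):

```python
# A mesh is just vertices plus faces connecting them.
vertices = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
faces = [(1, 2, 3)]  # OBJ face indices are 1-based

with open("triangle.obj", "w") as f:
    for x, y, z in vertices:
        f.write(f"v {x} {y} {z}\n")   # vertex positions
    for a, b, c in faces:
        f.write(f"f {a} {b} {c}\n")   # a triangle over those vertices
```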
The algorithm works in two stages: first, it generates a coarse model via prompt-based diffusion; then, it improves resolution and optimizes the model, with the entire process still guided by the initial prompt.
Being able to produce an instantly usable mesh from any prompt is a powerful capability that would both cut the workload on serious projects and lower the entry barriers to several industries.
InstructPix2Pix
More generative bliss! In this paper, the AI was taught to alter images based on instructions rather than descriptions of the final result.
While the paper presents examples of failure cases, this is the first time a generative AI can interpret action words to manipulate visual data.
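For a taste of the instruction-driven workflow, here is a sketch using the Hugging Face diffusers pipeline for the released checkpoint (assuming a recent diffusers version; the checkpoint name follows the authors' release):

```python
# Edit an image with a natural-language instruction via InstructPix2Pix.
import torch
from diffusers import StableDiffusionInstructPix2PixPipeline
from PIL import Image

pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
).to("cuda")

image = Image.open("photo.png").convert("RGB")
edited = pipe(
    "make it look like a snowy winter evening",  # an instruction, not a description
    image=image,
    num_inference_steps=20,
    image_guidance_scale=1.5,  # how strongly to stay close to the input image
).images[0]
edited.save("edited.png")
```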
Such a concept of human-AI interaction feels more natural and fits into existing workflows with AI assistants. This approach could later lower the entry barrier to generative AI, as action-prompting is widely represented in movies and popular culture.
VectorFusion and SinFusion
Research into generative AI is broadening, with new works focusing on specific and niche objectives.
VectorFusion produces SVG files based on text prompts. SVG (Scalable Vector Graphics) is an image format that describes the image with paths and fills. The main advantage of SVGs is that these images can be infinitely scaled; additionally, SVG files usually have smaller sizes. They are used in nearly every UI and on this website as well.
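As a quick illustration, an SVG file is just XML describing shapes by paths and fills, and plain Python can generate one (the shapes and file name here are arbitrary examples):

```python
# A minimal SVG: shapes as mathematical descriptions, not pixels.
svg = """<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 100 100">
  <circle cx="50" cy="50" r="40" fill="#e33" />
  <path d="M 20 50 L 80 50 L 50 90 Z" fill="#338" />
</svg>"""

with open("example.svg", "w") as f:
    f.write(svg)
# Because the file stores descriptions rather than pixels, it renders
# crisply at any size.
```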
SinFusion extends what a diffusion model can accomplish when trained on only a single image or video. This project shows how much research remains to be done on data efficiency in AI training, and how newer approaches could enable tasks previously impossible due to data constraints.
Other News
There is a bunch of other hot news this month! First, a game-changing update in image generation: Midjourney v4. Despite common claims that Midjourney doesn't give users full control over generation and additionally enhances the results, this version pushes quality a step forward, beating the best Stable Diffusion community efforts.
The new version produces results with an insane level of correctness in minor details, further lifting the shade of "untruthfulness" that has always accompanied AI-generated pictures. Moreover, Midjourney introduced "remix mode", which instantly became a popular tool for fusing famous memes. You can find examples of v4 in action in the official community feed, subreddit, and Discord server.
New web services have appeared, offering collections of various Stable Diffusion models to download. Enthusiasts have even trained some models on Midjourney v4 generations.
Colossal-AI open-sourced a way to accelerate diffusion model pretraining and fine-tuning by almost 7x, reducing both the time and computation required. Take note: this is different from diffusion distillation. 7x faster training, 10x faster generation. Now, these are the margins we like to see.
A new competitor joins Meta and Google in the video-generation club. The new model from ByteDance, called MagicVideo, is capable of generating videos from prompts but seems to be heavily infected with nasty Shutterstock watermarks.
OpenAI opened access to the ChatGPT chatbot and published an improved GPT-3 version, "text-davinci-003".
Lastly, the online prediction platform Metaculus hosts an ongoing question named "Date Weakly General AI is Publicly Known". We have no way of knowing this prediction's accuracy, but the predicted date has been getting sooner and sooner. This month, the date dipped into the year 2027 for the first time.
The most significant change happened in spring, when the prediction dropped from 2042 to 2028 in less than two months. That likely followed the release of Gato, one of the most impressive generalist agents to date.
This trend likely illustrates either a change in sentiment due to recent breakthroughs, or an influx of new users who believe we are on the verge of making our greatest invention. In either scenario, the graph adds to the evidence of the rapid expansion of AI research and industry.
Closing words
Thank you for reading this. Here is your minimalistic cubic millimeter mouse cortex dataset badge:
Have a nice month, and see you next time. Rephrase the news for 5-year-olds, then tell your fire inspector, your pet spider, and your vacuum cleaner. Increase AI awareness. Spread the word.