What Text GenAI Can Learn from Vision GenAI
An analysis of vision vs. text models: why does this disparity exist, and how can we change it?
In today’s state of AI, there appears to be a noticeable gap between the depth captured by the output of vision-based models and the often straightforward or “bland” output of text-based models.
In other words, while an AI-generated image might evoke wonder, joy, or nostalgia, AI-generated text can sometimes feel mechanical or lacking in human touch.
Image Source: YandexART: Yet Another ART Rendering Technology
Looking at results like these, one might get the feeling that vision-based GenAI is years ahead of text-based GenAI.
In this article, we explore the reasons behind this disparity and discuss the lessons text-based generative AI can learn from the successes of vision-based models without diving into the technical details.
Understanding this gap is crucial since language is how humans express thoughts and emotions. If AI can generate text that genuinely resonates emotionally, it could transform how we interact with technology, from more engaging virtual assistants to AI-generated literature that moves us.
Let’s begin!
Vision vs. text models
As you may have seen (and as is also evident from the images above), generative AI models for visual content have achieved astonishing levels of sophistication, producing outputs that often feel remarkably authentic and impactful.
So why does text-based generative AI lag behind in emotional depth and human touch?
The answer lies in the fundamental architectural differences between how vision-based and text-based AI models generate content.
Visual generative AI models often rely on a technique called diffusion.
Think of these models as artists who start with rough sketches and gradually add details to create a masterpiece. Here’s a simplified breakdown:
Image source: Medium
The process begins with a canvas filled with random noise—imagine a static-filled television screen.
The model repeatedly refines this noise, gradually reducing randomness and enhancing structure.
The model considers the entire image at each step, allowing it to adjust and improve any part of the picture as needed.
After numerous iterations, the noise transforms into a coherent, detailed image conveying complex scenes and emotions.
This iterative and holistic approach enables the model to capture intricate patterns and subtle nuances, resulting in images that resonate emotionally with viewers.
To put it more technically, diffusion models generate images by repeatedly predicting the noise that should be removed from a noisy image, gradually recovering the desired content.
This process is depicted below:
The process begins with an image that’s entirely random noise.
The model applies a series of transformations to the noisy image. It predicts a less noisy version at each step by considering the patterns that make up meaningful images.
During each iteration, the model can adjust any pixel, capturing complex relationships and structures within the image.
After many iterations, the noise is sufficiently reduced, and a coherent, detailed image emerges that can convey intricate scenes and evoke emotions.
This method allows the model to maintain a global perspective of the image throughout the generation process. By continuously refining the entire image, the model ensures that all elements work together harmoniously to produce a visually and emotionally compelling result.
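To make this concrete, below is a minimal sketch of a DDPM-style denoising loop in Python/PyTorch. The `noise_predictor` network, the noise schedule, and the image size are placeholders assumed purely for illustration; real diffusion pipelines are far more elaborate.

```python
import torch

# Hypothetical noise-prediction network: given a noisy image and a timestep,
# it predicts the noise that was added (a stand-in for a trained U-Net).
noise_predictor = lambda x, t: torch.zeros_like(x)

T = 1000                                   # number of diffusion steps
betas = torch.linspace(1e-4, 0.02, T)      # simple linear noise schedule (assumed)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)  # cumulative products used by DDPM

# Start from pure Gaussian noise: a 3-channel 64x64 "static-filled screen".
x = torch.randn(1, 3, 64, 64)

# Reverse process: at every step, remove a little of the predicted noise
# while looking at the whole image at once.
for t in reversed(range(T)):
    predicted_noise = noise_predictor(x, t)
    # Standard DDPM mean update: estimate a slightly less noisy image.
    x = (x - betas[t] / torch.sqrt(1 - alpha_bars[t]) * predicted_noise) \
        / torch.sqrt(alphas[t])
    if t > 0:
        # Inject a small amount of fresh noise, except at the final step.
        x = x + torch.sqrt(betas[t]) * torch.randn_like(x)

# After T iterations, x would be a coherent image if noise_predictor were trained.
```

Note how every iteration touches every pixel: nothing generated earlier is ever frozen, which is exactly the "global perspective" described above.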
In contrast, text-based generative AI models operate differently.
Most text models today use a transformer architecture, which excels at understanding and generating sequences of data, like sentences or paragraphs. At their core, these are just "next-word (or next-token) prediction" models.
Source: Jay Alammar’s blog on GPT-2
Here’s a breakdown of how they work:
The model generates text one word (or token) at a time, moving from the beginning to the end of the text.
At each step, the model considers the words it has generated to predict the next word. This means it has a good understanding of the immediate context.
Once a word is generated, it remains unchanged. The model doesn't go back and revise previous words, even if a different choice might improve the overall coherence or emotional impact.
The focus is on choosing the best next word rather than optimizing the entire piece of text as a whole.
This sequential, one-way approach is limiting since the model never gets to revisit and refine earlier parts of the text.
Thus, a single poor choice midway can derail the rest of the generated text.
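For contrast, here is a minimal sketch of that autoregressive loop. The `language_model`, the vocabulary size, and the prompt token IDs are placeholders assumed for illustration; real transformer decoders add attention caching, sampling strategies, and much more.

```python
import torch

vocab_size = 50_000
# Hypothetical language model: given the token IDs generated so far,
# it returns a score (logit) for every possible next token.
language_model = lambda token_ids: torch.randn(vocab_size)

prompt_ids = [101, 2023, 2003]        # imagine these encode a short prompt
generated = list(prompt_ids)

for _ in range(20):                   # generate 20 tokens, one at a time
    logits = language_model(torch.tensor(generated))
    next_token = int(torch.argmax(logits))   # greedy: pick the single best next token
    generated.append(next_token)
    # Earlier tokens are never revisited: once a token is appended, the model
    # cannot go back and revise it, no matter what comes later.
```

Unlike the diffusion loop above, only one new position is decided per step, and every past decision is locked in.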
Moreover, think about how we, as humans, write and communicate. When crafting an email, a story, or even a social media post, we often return to earlier sentences to make edits, add details, or adjust the tone. This iterative process helps us ensure that our message is clear and coherent and that it conveys the intended emotion.
However, when using a text-based generative AI model, the prompt is usually a single piece of text, and the model continues generating its output linearly from the last word or sentence in the prompt.
This means that in complex scenarios where we have a larger piece of text with prompts embedded at various places within it, our “next-word predictor” model will struggle to incorporate those prompts effectively.
The discussion so far highlights some key limitations of current transformer models used predominantly in text generation.
However, text-based AI can overcome these challenges by learning from the successes of vision-based generative AI models.
Let’s look at what those lessons are.
Conclusion and learnings for text-based models
While diffusion models have been dominant in vision tasks, researchers understand their potential and have explored them for text tasks as well.
Most attempts, however, have been unsuccessful, and here is why:
In images, adding noise means slightly altering the color values of pixels, which are continuous and can be adjusted incrementally. This gradual addition and removal of noise allow diffusion models to refine images step by step, enhancing details and textures with each iteration.
However, text is made up of discrete units—words or tokens—that don’t have a smooth, continuous representation. You can’t “slightly” change a word; replacing or altering a word often results in a completely different word, which can drastically change the meaning of a sentence. This makes the direct application of diffusion models to text generation challenging.
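A quick illustration of the difference, using made-up values: slightly perturbing pixel intensities still yields a valid, nearly identical image, whereas perturbing a token ID lands on an arbitrary, unrelated word.

```python
import torch

# Continuous case: pixels tolerate small perturbations.
pixels = torch.tensor([0.20, 0.55, 0.80])       # normalized intensities
noisy_pixels = pixels + 0.01 * torch.randn(3)   # still a valid, very similar patch

# Discrete case: token IDs do not.
vocab = ["the", "cat", "sat", "on", "mat", "rocket"]
token_id = 1                                    # "cat"
noisy_id = (token_id + 4) % len(vocab)          # even a small shift lands on "rocket"
print(vocab[token_id], "->", vocab[noisy_id])   # the meaning changes completely
```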
Despite these hurdles, researchers have been exploring ways to adapt diffusion models for text tasks.
We won’t get into the technical details, but this recent paper has successfully leveraged diffusion models for text generation tasks: Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution.
While this will require more research, the proposed methods outperform traditional models on several tasks, which suggests that in 2024 we can expect a surge in diffusion models being explored and refined for text generation.
This could be a hot take, but we believe the future of text generation will not be "next-token predictors" but far more advanced, parallelized architectures inspired by the iterative refinement used in diffusion models.
Time will tell if this turns out to be true.
Thanks for reading!