There are not many times in our lives when we get to witness the growth and evolution of a powerful technology. Many would say that one of those times is right now, and that the technology is AI. Although artificial intelligence has long powered things like basic customer-service chatbots and voice assistants such as Siri and Alexa, nothing has matched the power of the current chatbots based on large language models. These include, but are not limited to, ChatGPT by OpenAI, Copilot by Microsoft, and Gemini by Google. What makes these large language models so powerful is their ability to handle a wider variety of requests than anything we have seen before: they provide on-demand responses to all kinds of prompts, whether recommending restaurants based on your tastes, writing code from basic instructions, or proofreading online texts. And these examples cover only a small slice of what these models can do. What makes them even more exciting is what's in store for their future.

Most of these models also offer additional features such as image generation. ChatGPT's version, DALL·E, requires a paid account because of the extra computing resources such a model demands. DALL·E can generate images from nothing but the user's text input, and Copilot offers a similar feature, also based on the DALL·E model. This is made possible by training the model on very large datasets, which lets it recognize objects and place them in, or remove them from, the images people request. These models work well for the most part, but as with any technology, they have their shortcomings. For example, a model may place an object the user never asked for and then struggle to remove it properly when asked.
But this is not all that these models are capable of.
The next big step in content generation for these large language models is video. OpenAI recently announced its newest model, named Sora. Just as DALL·E can generate images from text, Sora seeks to create video scenes from text. OpenAI has already posted some very impressive examples of what Sora can do on its website; one of them is attached below with its prompt. Note that these prompts can be as detailed as the user wants (within reason) so the final product meets their expectations. The examples OpenAI used ranged from a full paragraph down to a single short sentence, and all of them produced accurate results. But that does not mean every result from Sora is accurate or what was expected. OpenAI added a section to its Sora showcase covering what the model still struggled with at the time of the announcement. Some of these weaknesses include misapplied physics on solid objects, implausible motion, and multiple objects spontaneously appearing. At this stage in development, weaknesses are expected, and although I'm sure it will never be perfect, the released model should be more polished than its current state.
All of this is exciting because we get to watch this technology evolve. It is still very new to us, so we have not yet captured its full utility. That applies to all of AI, not only text or video generation. It will certainly be interesting to see how image and video generation techniques continue to improve over time, and perhaps we will eventually see some kind of fully AI-generated short film. Beyond the uses themselves, the implications are also worth keeping an eye on, as there has been no shortage of legal concerns around many aspects of AI. Many of these questions cannot be settled until they arise in practice, so I imagine there will be many more before they ever slow down.
OpenAI Sora-generated video with the prompt: "Historical footage of California during the gold rush."