In the second edition of his 1999 magnum opus, the renowned film editor Walter Murch posed a hypothetical: “Imagine a technological pinnacle in the midst of the twenty-first century, where it becomes feasible for an individual to create an entire feature film, complete with virtual actors.”
With the introduction of Sora, OpenAI has brought that conjecture a step closer to reality.
Sora, a text-to-video model that OpenAI has been developing over the past year, can generate up to a minute of high-definition, 1080p video from a text prompt.
It is the latest addition to the fast-expanding field of generative AI. OpenAI states, “Sora can generate complete videos in one go or extend existing videos to enhance their length.” The model can also turn a static image into a video.
OpenAI unveiled 48 sample videos generated by Sora, asserting that none of them had been altered or enhanced.
OpenAI also published a technical report describing how Sora works, and Lior S. of the AlphaSignal newsletter has written a thorough breakdown of the model’s technical details.
Walter Murch once envisioned a technology that could translate an individual’s thoughts directly into a visual cinematic experience through a “black box.” While such mind-reading capabilities remain a futuristic concept, Sora’s text-to-video transformation serves as a precursor to Murch’s theoretical black box input.
Sora represents videos as patches, which play the same role as the “tokens” used by GPT-style language models. The model first compresses a video into a lower-dimensional latent space and then decomposes that representation into spacetime patches, capturing both the spatial and temporal structure of the footage.
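To make the idea concrete, here is a minimal sketch of the patching step in Python with PyTorch. The tensor layout, patch sizes, and the absence of an actual video compressor are illustrative assumptions; OpenAI has not published Sora’s real hyperparameters.

```python
import torch

def spacetime_patches(latent, t_patch=2, h_patch=4, w_patch=4):
    """Split a latent video tensor into flattened spacetime patches.

    latent: (T, C, H, W) tensor, assumed to come from a video compressor
    (VAE-style encoder). The patch sizes are illustrative, not Sora's.
    Returns: (num_patches, t_patch * h_patch * w_patch * C) tensor.
    """
    T, C, H, W = latent.shape
    assert T % t_patch == 0 and H % h_patch == 0 and W % w_patch == 0
    # Carve the latent video into non-overlapping blocks along time, height, width.
    x = latent.reshape(T // t_patch, t_patch,
                       C,
                       H // h_patch, h_patch,
                       W // w_patch, w_patch)
    # Group the block indices first, then flatten each block into one vector.
    x = x.permute(0, 3, 5, 1, 4, 6, 2)  # (nT, nH, nW, t, h, w, C)
    return x.reshape(-1, t_patch * h_patch * w_patch * C)

# Example: a 16-frame latent at 32x32 with 4 channels -> 512 patch "tokens".
latent = torch.randn(16, 4, 32, 32)
tokens = spacetime_patches(latent)
print(tokens.shape)  # torch.Size([512, 128])
```

Each resulting row behaves like a token, so a transformer can attend across space and time in the same way a language model attends across words.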
Sora also borrows the re-captioning technique introduced with DALL-E 3: a captioner model is trained to write highly descriptive captions for the training videos, which improves text fidelity and overall video quality. At inference time, GPT is used to expand short user prompts into longer, more detailed ones.
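Prompt expansion of this kind can be sketched with a single chat-completion call using the OpenAI Python SDK. The system instruction and model name below are assumptions for illustration, not the pipeline OpenAI actually runs in front of Sora.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def expand_prompt(short_prompt: str) -> str:
    """Turn a terse user prompt into a detailed video description.

    The instruction and model name are illustrative; OpenAI has not
    disclosed the exact prompt-expansion setup used by Sora.
    """
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": "Rewrite the user's idea as one richly detailed "
                        "paragraph describing a video scene: subjects, "
                        "setting, lighting, camera movement, and mood."},
            {"role": "user", "content": short_prompt},
        ],
    )
    return response.choices[0].message.content

print(expand_prompt("a corgi surfing at sunset"))
```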
OpenAI has not disclosed what data Sora was trained on, but Jim Fan, a senior research scientist at NVIDIA, speculates that it may include synthetic data rendered with Unreal Engine 5.
Despite the excitement around its capabilities, OpenAI acknowledges that Sora still struggles with complex scenes: it can confuse cause and effect and mix up spatial directions such as left and right. The company has outlined five examples where the model fails on intricate scenarios.
Nevertheless, OpenAI frames Sora as a step toward Artificial General Intelligence (AGI). Tim Brooks, a scientist at OpenAI, said that building models that can understand video, and the complex interactions of our world, is an important step for all future AI systems.