
Generate Impressive Videos with Text Instructions: A Review of OpenAI Sora, Stable Diffusion, Lumiere and Comparable Models
  • Enis Karaarslan,
  • Ömer Aydın
Enis Karaarslan

Corresponding Author: [email protected]


Generating videos from textual input is a significant computational challenge in computer science. Nonetheless, recent text-to-video artificial intelligence (AI) technologies have shown notable progress. Anticipated advances in realistic video generation and data-driven physics simulation are expected to move the field further forward. Text-to-video AI has transformative potential across many creative domains, including filmmaking, advertising, graphic design, and game development, as well as in sectors such as social media, influencer marketing, and educational technology. This study reviews generative AI methodologies for text-to-video synthesis, with an emphasis on large language models and AI architectures. To this end, it combines a literature review, a technical evaluation, and a solution proposal. Prominent models such as OpenAI Sora, Stable Diffusion, and Lumiere are evaluated for their efficacy and architectural details. However, the pursuit of Artificial General Intelligence (AGI) is accompanied by many challenges, including the need to safeguard human rights, prevent potential misuse, and protect intellectual property rights. Ensuring the accuracy and integrity of the generated content is paramount. Because transformer models are computationally intensive, they consume substantial electricity and water, so strategies to reduce environmental and computational costs are needed for long-term sustainability. This study explores potential ways to address these challenges and proposes solutions to improve environmental and computational efficiency in text-to-video AI.
13 Mar 2024: Submitted to TechRxiv
19 Mar 2024: Published in TechRxiv