Abstract
This study presents a methodological approach to transforming static images with text overlays into dynamic videos. Using Python libraries such as OpenCV and MoviePy, the proposed method integrates text overlay, image processing, and video creation to produce engaging visual content. The effectiveness of the method is evaluated in terms of processing time, video quality, and user feedback. Experimental results demonstrate that the approach achieves high-quality visual output while ensuring efficient processing, making it suitable for applications in digital marketing, educational content, and social media. Traditional approaches typically yielded limited frame quality, with Structural Similarity Index (SSIM) scores around 0.75, Peak Signal-to-Noise Ratio (PSNR) values near 28 dB, and Mean Squared Error (MSE) for frame predictions at 0.04, leaving significant room for improvement. We present an enhanced framework that achieves higher accuracy and quality by combining transformer-based embedding fusion (Xformer), Hugging Face models for rich text embeddings, and OpenCV for refined image preprocessing. By optimizing the GAN architecture and introducing a temporal LSTM network, we improved frame quality and coherence across generated videos. Our model raises SSIM to 0.85 (an absolute gain of 0.10) and PSNR to 30.2 dB while reducing MSE to 0.02, providing smoother transitions and sharper visuals. Training accuracy for the LSTM model reached 90% over 50 epochs, with a testing accuracy of 88%, reflecting strong generalization compared to previous methods. Furthermore, GAN training loss decreased from 1.0 to 0.35 over 100 epochs, with testing loss stabilizing at 0.40, indicating reduced overfitting and consistent performance.
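As a rough illustration of the frame-quality metrics cited above, MSE and PSNR can be computed directly with NumPy; SSIM involves local luminance, contrast, and structure terms and is typically taken from a library such as scikit-image rather than hand-rolled. The toy frames and noise level below are illustrative assumptions, not the study's data:

```python
import numpy as np

def mse(a, b):
    """Mean Squared Error between two images scaled to [0, 1]; lower is better."""
    return float(np.mean((a - b) ** 2))

def psnr(a, b, max_val=1.0):
    """Peak Signal-to-Noise Ratio in dB; higher means less distortion."""
    err = mse(a, b)
    if err == 0:
        return float("inf")  # identical images
    return float(10 * np.log10(max_val ** 2 / err))

# Toy data: a reference frame and a noisy "generated" frame (illustrative only).
rng = np.random.default_rng(0)
ref = rng.random((64, 64))
gen = np.clip(ref + rng.normal(0.0, 0.05, ref.shape), 0.0, 1.0)

print(round(mse(ref, gen), 4))
print(round(psnr(ref, gen), 1))
```

With Gaussian noise of standard deviation 0.05, the MSE lands near its variance (about 0.0025) and the PSNR in the mid-20s dB range, which is why reported gains of a few dB (28 to 30.2 dB here) correspond to visibly sharper frames.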
Our improvements establish new benchmarks in dynamic video generation from static images and text, increasing model accuracy, visual quality, and temporal stability beyond the limitations of earlier approaches. The primary objectives of this study are to develop a seamless process for adding text to images, create
dynamic transitions between these images, and compile them into an engaging video format. The paper concludes with a discussion of the findings, implications for various fields, and potential areas for future research, such as the incorporation of more advanced animations and real-time processing capabilities.
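The transition objective above can be sketched as a simple linear crossfade between consecutive frames, the kind of blend that OpenCV or MoviePy would then encode into a video. The function name and toy images here are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def crossfade(img_a, img_b, n_frames):
    """Generate n_frames blending img_a into img_b via linear interpolation."""
    frames = []
    for t in np.linspace(0.0, 1.0, n_frames):
        # Convex combination: t=0 gives img_a, t=1 gives img_b.
        frames.append((1.0 - t) * img_a + t * img_b)
    return frames

# Toy frames: an all-black and an all-white 4x4 image.
a = np.zeros((4, 4))
b = np.ones((4, 4))
seq = crossfade(a, b, 5)
print(len(seq))               # 5
print(float(seq[0].mean()))   # 0.0
print(float(seq[-1].mean()))  # 1.0
```

In a full pipeline, each blended frame would be written out (e.g. with OpenCV's `VideoWriter` or assembled via MoviePy) at the target frame rate; more advanced transitions replace the linear blend with easing curves or motion warps.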