AI video generation is no longer a futuristic concept; it's here. While building VidGen, I explored how combining powerful LLMs like Gemini with specialized media models can create a unified content creation pipeline.
The Pipeline
Our system uses a multi-stage process:
- Scripting: Gemini generates the narrative.
- Audio: Deepgram and specialized TTS models handle voiceovers.
- Visuals: Hugging Face models generate or process video frames.
- Assembly: Inngest handles the background orchestration of these heavy tasks.
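To make the flow concrete, here is a minimal sketch of the four stages as a sequential pipeline. It is not VidGen's actual code. The `PipelineState` shape, the `runPipeline` function, and the `stub://` URLs are all hypothetical, and each stage body is a stub standing in for the real service call. In a real deployment each stage would presumably run as its own orchestrated background step rather than in a single loop.

```typescript
// Hypothetical sketch: stage names mirror the list above; every stage body is a stub.
interface PipelineState {
  prompt: string;
  script?: string;
  audioUrl?: string;
  frameUrls?: string[];
  videoUrl?: string;
}

interface Stage {
  name: string;
  // Each stage reads the accumulated state and returns only the fields it produced.
  run: (state: PipelineState) => Promise<Partial<PipelineState>>;
}

const stages: Stage[] = [
  {
    name: "scripting", // stand-in for the LLM call that writes the narrative
    run: async (s) => ({ script: `Script for: ${s.prompt}` }),
  },
  {
    name: "audio", // stand-in for the TTS call that produces the voiceover
    run: async (s) => {
      if (!s.script) throw new Error("audio stage needs a script");
      return { audioUrl: "stub://voiceover.mp3" };
    },
  },
  {
    name: "visuals", // stand-in for the model that generates or processes frames
    run: async (s) => {
      if (!s.script) throw new Error("visuals stage needs a script");
      return { frameUrls: ["stub://frame-0.png", "stub://frame-1.png"] };
    },
  },
  {
    name: "assembly", // stand-in for muxing audio and frames into one video
    run: async (s) => {
      if (!s.audioUrl || !s.frameUrls) throw new Error("assembly needs audio and frames");
      return { videoUrl: "stub://final.mp4" };
    },
  },
];

// Runs the stages in order, merging each stage's output into the shared state.
async function runPipeline(
  prompt: string,
  onStageDone: (name: string, state: PipelineState) => void = () => {},
): Promise<PipelineState> {
  let state: PipelineState = { prompt };
  for (const stage of stages) {
    const patch = await stage.run(state);
    state = { ...state, ...patch };
    onStageDone(stage.name, state);
  }
  return state;
}
```

Calling `runPipeline("a 30-second product teaser", (name) => console.log(name))` would log each stage name as it completes. Keeping every stage behind the same `run` signature means a stub can later be swapped for a real API call without touching the loop.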
Lessons Learned
Long-running media tasks need robust state persistence: a job can outlast the request that started it, so its progress has to live somewhere durable. Using Convex let us keep a real-time, reactive UI while the heavy processing ran in the background.
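The idea can be sketched with a small job store that records progress after every stage and notifies subscribers on each change. This is an illustration under assumptions, not the Convex API: `JobStore`, `runJob`, and the `Job` fields are hypothetical names, and the store is an in-memory stand-in for a real reactive database.

```typescript
// Hypothetical sketch: an in-memory stand-in for a persistent, reactive job table.
type JobStatus = "queued" | "running" | "done" | "failed";

interface Job {
  id: string;
  status: JobStatus;
  completedStages: string[];
  error: string | null;
}

type Listener = (job: Job) => void;

class JobStore {
  private jobs = new Map<string, Job>();
  private listeners = new Map<string, Set<Listener>>();

  create(id: string): Job {
    const job: Job = { id, status: "queued", completedStages: [], error: null };
    this.jobs.set(id, job);
    return job;
  }

  get(id: string): Job | undefined {
    return this.jobs.get(id);
  }

  // Stores the patched record, then pushes it to every subscriber (the "reactive" part).
  update(id: string, patch: Partial<Omit<Job, "id">>): Job {
    const current = this.jobs.get(id);
    if (!current) throw new Error(`unknown job: ${id}`);
    const next: Job = { ...current, ...patch };
    this.jobs.set(id, next);
    this.listeners.get(id)?.forEach((fn) => fn(next));
    return next;
  }

  // Returns an unsubscribe function, as a UI component would call on unmount.
  subscribe(id: string, fn: Listener): () => void {
    const set = this.listeners.get(id) ?? new Set<Listener>();
    set.add(fn);
    this.listeners.set(id, set);
    return () => {
      set.delete(fn);
    };
  }
}

// Runs the stages in order, persisting progress after each one.
// A rerun of the same job id skips stages already recorded as complete.
function runJob(
  store: JobStore,
  id: string,
  stageNames: string[],
  work: (stage: string) => void,
): Job {
  if (!store.get(id)) store.create(id);
  store.update(id, { status: "running", error: null });
  for (const stage of stageNames) {
    const done = store.get(id)!.completedStages;
    if (done.includes(stage)) continue;
    try {
      work(stage);
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      return store.update(id, { status: "failed", error: message });
    }
    store.update(id, { completedStages: [...done, stage] });
  }
  return store.update(id, { status: "done" });
}
```

A UI would call `store.subscribe(jobId, render)` once and then redraw whenever the background worker calls `update`. Because progress is written after each stage, retrying a failed job repeats only the stages that had not yet been recorded as complete.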
The future of content creation is collaborative: the engineer builds the "director" that empowers users to create.