Deep Dive: Creating Accurate Math Videos Using LLMs

Aditya Advani
10 min read · May 15, 2024


Generating Math Explainer Videos from Text Prompts: How We Built MathMatrixMovies Using Google’s Gemini Pro 1.5 and Manim, 3Blue1Brown’s amazing math animation library, with some help from Meta’s Llama3.

Hello World,

We are super excited to share a really cool project our team built at the recent Llama3 hackathon at Shack15 in San Francisco on 11th and 12th May 2024. We created MathMatrixMovies, a tool that generates engaging, animated math explainer videos from simple text prompts, powered by Gemini Pro 1.5 and Manim.

The Problem: Making Engaging Math Education Videos is Hard

Making math education accessible, engaging, and personalized is a challenge faced by students and educators alike. We wanted to create a tool that could break down complex mathematical concepts into easy-to-understand videos, tailored to different age groups and learning levels.

The Solution: Math Problem Prompt 2 Video using LLMs + Manim

We leveraged the power of Google’s Gemini Pro 1.5, a state-of-the-art large language model (LLM), to generate object-oriented Manim code from text prompts. Users simply input a prompt, select the target age range (3–18, undergraduate, or graduate), and choose the language (English, Hindi, Tamil, Spanish, etc.).

Behind the scenes, Gemini Pro 1.5 writes the Manim code, debugs it, and renders the video. The generated video is then split into 1-second frames, which are processed by Gemini Pro 1.5 for a second pass to refine the content. The result is a polished, engaging math explainer video that can be published directly to our MathMatrixMovies YouTube channel with a single click.

Three Examples:

1. Generated YouTube video explaining 5+3=8 to a 3-year-old

Explainer video in English explaining 5+3=8 to a 3-year-old

2. Elliptic Curve Cryptography for 17-year-olds

3. Integral Calculus for 14-year-olds in the Tamil language

Find more on our YouTube channel! https://www.youtube.com/channel/UC7v3S0YBMNHfBJw8_3GhKpg

Explainer Video: How the System Works

Explanation of how the MathMatrixMovies system works

Workflow: How a Math Question is Turned into an Educational Video, then published to YouTube

User workflow: Prompt 2 Video 2 YouTube

Step 1: User submits a math problem and target audience to MathMatrixMovies and clicks Generate

Step 2: Gemini Pro 1.5 is called with the prompt below

We pass the following arguments:

  • Math prompt: “5+3=8”
  • Audience: “3 year old”
  • Language: “English”
  • Voice: “en-US-AriaNeural”
Can you explain 5+3=8 to a 3 year old? Please be visual and interesting.
Consider using a meme if the audience is younger.

Please create python code for a manim video for the same. Please do not use any external dependencies like mp3s or svgs or graphics. Do not create any sound effects. If you need to draw something, do so using exclusively manim. Always add a title and an outro. Narrate the title and outro. Please try to visually center or attractively lay out all content. Please also keep the margins in consideration. If a sentence is long please wrap it by splitting it into multiple lines. Please add actual numbers and formulae wherever appropriate as we want our audience of 3 year old to learn math. Do use voiceovers to narrate the video. The following is an example of how to do that:

```
from manim import *
from manim_voiceover import VoiceoverScene
from manim_voiceover.services.azure import AzureService

class AzureExample(VoiceoverScene):
    def construct(self):
        self.set_speech_service(
            AzureService(
                voice="en-US-AriaNeural",
                style="newscast-casual",
                global_speed=1.15
            )
        )

        circle = Circle()
        square = Square().shift(2 * RIGHT)

        with self.voiceover(text="This circle is drawn as I speak.") as tracker:
            self.play(Create(circle), run_time=tracker.duration)

        with self.voiceover(text="Let's shift it to the left 2 units.") as tracker:
            self.play(circle.animate.shift(2 * LEFT),
                      run_time=tracker.duration)

        with self.voiceover(text="Now, let's transform it into a square.") as tracker:
            self.play(Transform(circle, square), run_time=tracker.duration)

        with self.voiceover(
            text="You can also change the pitch of my voice like this.",
            prosody={"pitch": "+40Hz"},
        ) as tracker:
            pass

        with self.voiceover(text="Thank you for watching."):
            self.play(Uncreate(circle))
        self.wait()
```

The voice for the "English" is "en-US-AriaNeural". Please use this voice for the narration.

Please do not use any external dependencies like svgs since they are not available. First write the script explicitly and refine the contents and then write the code. Describe illustrations explicitly and put them near the concepts. Please draw and animate things, using the whole canvas. Use color in a restrained but elegant way, for educational purposes.

Please use only manim for the video. Please write ALL the code needed since it will be extracted directly and run from your response.

We ask the model to write Manim code. Manim is installed on the system along with the Manim Voiceover plugin that is configured to use the Azure Text To Speech (TTS) service.
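For concreteness, here is a minimal sketch of how the user's arguments might be interpolated into that prompt and sent to Gemini using the google-generativeai Python SDK. The PROMPT_TEMPLATE and build_prompt helper are illustrative stand-ins, not our exact production code:

```python
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-pro-latest")

# Illustrative stand-in: the real prompt is the full text shown above.
PROMPT_TEMPLATE = (
    "Can you explain {math_prompt} to a {audience}? "
    "Please be visual and interesting. ... "
    'The voice for the "{language}" is "{voice}". '
    "Please use this voice for the narration."
)

def build_prompt(math_prompt: str, audience: str, language: str, voice: str) -> str:
    """Interpolate the user's arguments into the prompt template."""
    return PROMPT_TEMPLATE.format(
        math_prompt=math_prompt, audience=audience, language=language, voice=voice
    )

prompt = build_prompt("5+3=8", "3 year old", "English", "en-US-AriaNeural")
response = model.generate_content(prompt)
manim_code = response.text  # includes the script and a fenced Python block
```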

Step 3 (Optional): If the written Manim code fails to compile, the LLM is re-prompted with the error text so that it can re-attempt. Up to 8 retries are permitted at this time.
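A minimal sketch of such a compile-retry loop, assuming a hypothetical generate_manim_code wrapper around the Gemini call above and the try_render helper sketched later in this post:

```python
MAX_RETRIES = 8

def generate_with_retries(prompt: str) -> str:
    """Ask the LLM for Manim code, re-prompting with the error text on failure."""
    code = generate_manim_code(prompt)  # hypothetical LLM call (see sketch above)
    for _ in range(MAX_RETRIES):
        ok, error_text = try_render(code)  # hypothetical: runs manim, captures stderr
        if ok:
            return code
        # Feed the error back so the model can repair its own code.
        code = generate_manim_code(
            f"{prompt}\n\nYour previous code failed with this error:\n{error_text}\n"
            "Please return corrected code."
        )
    raise RuntimeError(f"Manim code still failing after {MAX_RETRIES} retries")
```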

Step 4: The LLM has written Manim code that successfully compiles. In this case, an example would be:

## Script

**Title Card** (Text: "Let's Learn Addition!")

**(Narrator):** Let's learn addition!

**(Scene 1)**
* Five red apples appear on the left side of the screen.
* Three green apples appear on the right side of the screen.

**(Narrator):** We have five red apples... and three green apples.

**(Scene 2)**

* All the apples are gathered in the center of the screen.

**(Narrator):** Let's put all the apples together.

**(Scene 3)**

* The apples are counted one by one, with a number appearing above each apple as it is counted.

**(Narrator):** One, two, three, four, five, six, seven, eight!

**(Scene 4)**

* The equation "5 + 3 = 8" appears below the apples.

**(Narrator):** Five plus three equals eight! We learned how to add!

**(Outro Card)** (Text: "The End")

**(Narrator):** The End!

## Manim Code

```python
from manim import *
from manim_voiceover import VoiceoverScene
from manim_voiceover.services.azure import AzureService

class AdditionExample(VoiceoverScene):
    def construct(self):
        self.set_speech_service(
            AzureService(
                voice="en-US-AriaNeural",
                style="newscast-casual",
            )
        )

        # Title Card
        title = Text("Let's Learn Addition!").scale(1.5)
        self.play(Write(title))
        self.wait(0.5)
        with self.voiceover(text="Let's learn addition!") as tracker:
            self.wait(tracker.duration)
        self.play(FadeOut(title))

        # Scene 1
        red_apples = VGroup(
            *[Circle(radius=0.3, color=RED, fill_opacity=1) for _ in range(5)]
        ).arrange(RIGHT, buff=0.5).to_edge(LEFT, buff=1)
        green_apples = VGroup(
            *[Circle(radius=0.3, color=GREEN, fill_opacity=1) for _ in range(3)]
        ).arrange(RIGHT, buff=0.5).to_edge(RIGHT, buff=1)
        self.play(Create(red_apples))
        self.play(Create(green_apples))
        with self.voiceover(text="We have five red apples... and three green apples.") as tracker:
            self.wait(tracker.duration)

        # Scene 2
        all_apples = VGroup(red_apples, green_apples).arrange(RIGHT, buff=0.5).center()
        self.play(Transform(red_apples, all_apples[0:5]))
        self.play(Transform(green_apples, all_apples[5:8]))
        with self.voiceover(text="Let's put all the apples together.") as tracker:
            self.wait(tracker.duration)

        # Scene 3
        for i, apple in enumerate(all_apples):
            self.play(Flash(apple, color=YELLOW, flash_radius=0.5))
            number = Text(str(i + 1)).scale(0.8).next_to(apple, UP)
            self.play(FadeIn(number))
            with self.voiceover(text=str(i + 1)) as tracker:
                self.wait(tracker.duration)

        # Scene 4
        equation = MathTex("5 + 3 = 8").scale(1.5).next_to(all_apples, DOWN, buff=1)
        self.play(Write(equation))
        with self.voiceover(text="Five plus three equals eight! We learned how to add!") as tracker:
            self.wait(tracker.duration)

        # Outro Card
        self.play(FadeOut(all_apples), FadeOut(equation))
        outro = Text("The End").scale(1.5)
        self.play(Write(outro))
        with self.voiceover(text="The End!") as tracker:
            self.wait(tracker.duration)
        self.play(FadeOut(outro))

        self.wait()
```

This code uses Manim to create an animated video explaining 5 + 3 = 8 to a 3-year-old. The script is followed closely, with visuals matching the narration.

That first code had some minor bugs. After iterating with the generation errors in context, the code compiled two passes later and produced the initial movie:

```python
from manim import *
from manim_voiceover import VoiceoverScene
from manim_voiceover.services.azure import AzureService

class AdditionExample(VoiceoverScene):
    def construct(self):
        self.set_speech_service(
            AzureService(
                voice="en-US-AriaNeural",
                style="newscast-casual",
            )
        )

        # Title Card
        title = Text("Let's Learn Addition!").scale(1.5)
        self.play(Write(title))
        self.wait(0.5)
        with self.voiceover(text="Let's learn addition!") as tracker:
            self.wait(tracker.duration)
        self.play(FadeOut(title))

        # Scene 1
        red_apples = VGroup(
            *[Circle(radius=0.3, color=RED, fill_opacity=1) for _ in range(5)]
        ).arrange(RIGHT, buff=0.5).to_edge(LEFT, buff=1)
        green_apples = VGroup(
            *[Circle(radius=0.3, color=GREEN, fill_opacity=1) for _ in range(3)]
        ).arrange(RIGHT, buff=0.5).to_edge(RIGHT, buff=1)

        self.play(Create(red_apples))
        self.play(Create(green_apples))

        with self.voiceover(text="We have five red apples... and three green apples.") as tracker:
            self.wait(tracker.duration)

        # Scene 2
        all_apples = VGroup(red_apples, green_apples).arrange(RIGHT, buff=0.5).center()
        self.play(Transform(red_apples, all_apples[0:5]))
        self.play(Transform(green_apples, all_apples[5:8]))

        with self.voiceover(text="Let's put all the apples together.") as tracker:
            self.wait(tracker.duration)

        # Scene 3
        for i, apple in enumerate(all_apples):
            self.play(Flash(apple, color=YELLOW, flash_radius=0.5))
            number = Text(str(i + 1)).scale(0.8).next_to(apple, UP)
            self.play(FadeIn(number))
            with self.voiceover(text=str(i + 1)) as tracker:
                self.wait(tracker.duration)

        # Scene 4
        equation = MathTex("5 + 3 = 8").scale(1.5).next_to(all_apples, DOWN, buff=1)
        self.play(Write(equation))
        with self.voiceover(text="Five plus three equals eight! We learned how to add!") as tracker:
            self.wait(tracker.duration)

        # Outro Card
        self.play(FadeOut(all_apples), FadeOut(equation))
        outro = Text("The End").scale(1.5)
        self.play(Write(outro))
        with self.voiceover(text="The End!") as tracker:
            self.wait(tracker.duration)
        self.play(FadeOut(outro))

        self.wait()
```

The command to render the Manim code into a video is:

```
manim -pql MathMovie_9c5b68a7.py --disable_caching
```

This creates the video at ./media/videos/MathMovie_9c5b68a7/480p15/AdditionExample.mp4, along with a matching subtitle file, AdditionExample.srt.
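A sketch of how this render step can be driven from Python and wired into the retry loop from step 3. The try_render helper and the random script naming are illustrative:

```python
import subprocess
import uuid
from pathlib import Path

def try_render(code: str, scene: str = "AdditionExample") -> tuple[bool, str]:
    """Write the generated code to a script and render it; return (ok, stderr)."""
    script = Path(f"MathMovie_{uuid.uuid4().hex[:8]}.py")
    script.write_text(code)
    result = subprocess.run(
        ["manim", "-pql", str(script), "--disable_caching"],
        capture_output=True, text=True,
    )
    # -pql renders at 480p/15fps, hence the 480p15 output directory.
    video = Path("media/videos") / script.stem / "480p15" / f"{scene}.mp4"
    return result.returncode == 0 and video.exists(), result.stderr
```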

The initial video is presented to the user in the frontend, while the workflow continues.

Initial Video presentation

Step 5: We split the video into 30 key frames and ask Gemini to iterate on the video to improve quality
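A sketch of that frame-sampling pass, assuming OpenCV for frame extraction and the same Gemini model object as in the earlier sketch; the refinement prompt wording is illustrative:

```python
import cv2
from PIL import Image

def sample_frames(video_path: str, n_frames: int = 30) -> list:
    """Grab n_frames evenly spaced frames from the rendered video."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for i in range(n_frames):
        cap.set(cv2.CAP_PROP_POS_FRAMES, i * total // n_frames)
        ok, frame = cap.read()
        if ok:
            # OpenCV gives BGR; convert to RGB for the multimodal API.
            frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
    cap.release()
    return frames

frames = sample_frames("media/videos/MathMovie_9c5b68a7/480p15/AdditionExample.mp4")
# Gemini Pro 1.5 accepts interleaved text and images in a single request.
critique = model.generate_content(
    ["Here are frames from the rendered video. Suggest changes to the Manim "
     "code to fix layout, overlap, or pacing problems.", *frames]
).text
```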

Step 6 (Optional): If we like the video, we click Publish to YouTube, which

  1. calls out to Llama3 on Groq to generate a compelling title, description, tags, category, and metadata, and (TBD) uploads the video to YouTube
  2. (TBD) uploads the input params, the generated output code, and the generated video file to the Manim fine-tuning dataset on Airtable

We didn’t finish the Publish to YouTube step during the hackathon due to the complexity of getting Google OAuth approval to do so. We did publish test videos.
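For the metadata generation in item 1, a minimal sketch using Groq's Python client; the exact model name and prompt are illustrative:

```python
from groq import Groq

client = Groq(api_key="YOUR_GROQ_API_KEY")

completion = client.chat.completions.create(
    model="llama3-70b-8192",  # Llama3 as hosted on Groq
    messages=[{
        "role": "user",
        "content": (
            "Write a compelling YouTube title, description, tags, and category "
            "for a video explaining 5+3=8 to a 3 year old. Respond as JSON."
        ),
    }],
)
metadata = completion.choices[0].message.content
```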

Fine-Tuning Llama3 to Do the CodeGen:

We decided to evaluate the output of a fine-tuned Llama3 against Gemini Pro 1.5.

As a first step we decided to fine-tune meta-llama-3-8b-instruct using OpenPipe, and host the fine-tuned model using OctoAI. We were wary of fine-tuning a model due to our inexperience, but were blown away by the ease and speed of doing so with this particular pipeline. Shoutout to the squad at OctoAI for creating a notebook & process that was wonderfully planned out. The entire fine-tuning and model-hosting process took us less than an hour! And the Llama3 output with just 63 examples was already significantly better.

We’re curating a second article on just this topic with a slightly refined approach and a larger dataset. Keep your eyes open for that in a few weeks!

If you want to try your hand at fine-tuning a model to output decent Manim math videos in the interim, we are gathering a publicly available dataset (JSONL format) here.
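For reference, a record in that JSONL file looks roughly like the OpenAI-style chat pair below: our prompt paired with the accepted Manim code. This is a sketch; field names in the published dataset may differ:

```python
import json

record = {
    "messages": [
        {"role": "user", "content": "Can you explain 5+3=8 to a 3 year old? ..."},
        {"role": "assistant", "content": "from manim import *\n..."},
    ]
}

# JSONL: one JSON object per line, appended per accepted generation.
with open("manim_finetune.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")
```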

Technical Details of the Project:

  • We used Gemini Pro 1.5 to generate object-oriented Manim code and debug it until it compiles.
  • The rendered video is split into 1-second frames, which are sent back to Gemini Pro 1.5 for a second pass to refine the content.
  • We built the frontend platform using Streamlit (a minimal sketch follows this list).
  • The platform is hosted on an Azure Virtual Machine.
  • External LLMs & APIs are Gemini Pro 1.5 and Llama3 on Groq.
  • The fine-tuning dataset is being collected here as JSONL; we will release an article on our success with fine-tuning in the coming weeks.
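A minimal sketch of the Streamlit flow referenced in the list above; generate_video is a hypothetical wrapper around the prompt-render pipeline described earlier:

```python
import streamlit as st

st.title("MathMatrixMovies")

math_prompt = st.text_input("Math problem", "5+3=8")
audience = st.selectbox(
    "Audience", ["3 year old", "10 year old", "17 year old", "undergraduate", "graduate"]
)
language = st.selectbox("Language", ["English", "Hindi", "Tamil", "Spanish"])

if st.button("Generate"):
    with st.spinner("Writing and rendering your math movie..."):
        # Hypothetical wrapper: prompt -> Gemini -> Manim render -> mp4 path
        video_path = generate_video(math_prompt, audience, language)
    st.video(video_path)
```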

Challenges and Lessons Learned:

One of the main challenges we faced was ensuring the generated Manim code was syntactically correct and produced the desired video output. We overcame this with a debugging loop that feeds compile errors back to the model and iteratively refines the code until it compiles successfully.

Another challenge was working around the rate limits of Google Gemini Pro 1.5. At this time the model's usage is not tracked well and the billing is somewhat opaque. In the end it turned out that usage is free until May 14, 2024 but is limited to 2 RPM (update: now free until May 30, 2024, after Google I/O). We estimate this flow consumes up to 40K tokens, which under their pay-as-you-go pricing would be $0.42 per video (since slashed by 50%, so $0.21 now). For a production-quality video we might iterate 3x as much, so around $1.30 (now $0.65) per final video.

We also learned the importance of user testing and feedback in shaping the platform's development. We are actively seeking input from AI engineers so we can refine the user experience and ensure the generated videos are truly effective at explaining mathematical concepts.

Suggested improvements:

  • See how well this works with GPT-4o, Claude Opus, and Reka Core
  • Provide the model with the ability to look up the exact Manim spec & examples
  • Provide the model with a library of svgs for cool stickers, and jpegs or pngs for memes / background
  • Prompt engineering: chain smaller chat prompts instead of one-shotting the Manim output. This approach might suit Gemini better than other models

Next Steps:

We're actively collecting user math video requests and the system's Manim code generations to build a fine-tuning dataset that will further enhance the system's capabilities and make it more open by running on an open-weight model like Llama3. We are making this data publicly available on Airtable to foster transparency and collaboration within the AI and education communities. Our codebase and methodology are fully public and FOSS as well.

Try It Out: You can try MathMatrixMovies for yourself at https://math.auto.movie. We’d love to hear your feedback and suggestions for improvement!

If you have any questions or would like to contribute to the project, feel free to reach out to us — our individual bios are below.

Happy math video generating!

About us

Aditya Advani

is an AI engineer and serial entrepreneur in SF. He is working on integrating MathMatrixMovies into his larger autonomous video editing platform, ELDO, which aims to leverage AI to create engaging, informative content across a wide range of subjects.

You can find him on X at @aditya_advani
Visit his blog on Medium at https://medium.com/@aditya_advani

Baladhurgesh Balagurusamy Paramasivan

is a Machine Learning Engineer at Lucid Motors, specializing in deep learning and computer vision. He is very passionate about developing innovative solutions for real-world challenges and actively experiments with new ideas.

You can find him on X at @baladhurgesh97

Lily

Machine Learning Engineer | Interested in Machine Learning Systems, Optimization and HPC

You can find her on X at @excelsiorpred
Visit her blog at https://medium.com/@lilysu

Justin B Strong

Talks about ML Papers and AI software engineering | AI Founder in SF.

You can find him on X at @gptjustin
