
Commit e1f0932

Merge pull request #2 from AIDASLab/html
project page
2 parents 6f445a9 + 821cfdb

24 files changed

Lines changed: 3418 additions & 146 deletions

README.md

Lines changed: 5 additions & 146 deletions
@@ -1,148 +1,7 @@
-# MathSpeech: Leveraging Small LMs for Accurate Conversion in Mathematical Speech-to-Formula
+# AIDAS Lab Project Page Template
 
-## Abstract
-In various academic and professional settings, such as mathematics lectures or research presentations, it is often necessary to convey mathematical expressions orally. However, reading mathematical expressions aloud without accompanying visuals can significantly hinder comprehension, especially for those who are hearing-impaired or rely on subtitles due to language barriers. For instance, when a presenter reads Euler's Formula, current Automatic Speech Recognition (ASR) models often produce a verbose and error-prone textual description (e.g., e to the power of i x equals cosine of x plus i $\textit{side}$ of x) instead of the concise LaTeX format (i.e., $e^{ix} = \cos(x) + i\sin(x)$), which hampers clear understanding and communication. To address this issue, we introduce MathSpeech, a novel pipeline that integrates ASR models with small Language Models (sLMs) to correct errors in mathematical expressions and accurately convert spoken expressions into structured LaTeX representations. Evaluated on a new dataset derived from lecture recordings, MathSpeech demonstrates LaTeX generation capabilities comparable to leading commercial Large Language Models (LLMs), while leveraging fine-tuned small language models of only 120M parameters.
-Specifically, MathSpeech significantly outperformed GPT-4o on CER, BLEU, and ROUGE scores for LaTeX translation: CER decreased from 0.390 to 0.298, with correspondingly higher ROUGE/BLEU scores.
+This repository hosts the source code for the AIDAS Lab Project Page.
+The template was originally derived from [Nerfies](https://nerfies.github.io).
 
-### This study has been accepted to the AAAI-25 Main Technical Track.
-
-Here you can find the benchmark dataset, experimental code, and fine-tuned model checkpoints developed for the MathSpeech research.
-
-For detailed information about the dataset used in this study, or additional experimental results such as the latency measurements in the appendix, please refer to the version uploaded on arXiv.
-
----
-
-## Benchmark Dataset
-The MathSpeech benchmark dataset is available on [huggingface🤗](https://huggingface.co/datasets/AAAI2025/MathSpeech) or through the following [link](https://drive.google.com/drive/folders/1M8_IVcesO2EwNcl9zwxY6UgqAmSODzgq?usp=sharing).
-
-- [MathSpeech in huggingface🤗 dataset](https://huggingface.co/datasets/AAAI2025/MathSpeech)
-- [Google Drive link for dataset](https://drive.google.com/drive/folders/1M8_IVcesO2EwNcl9zwxY6UgqAmSODzgq?usp=sharing)
-
-#### Dataset statistics
-<table border="1" style="border-collapse: collapse; width: 50%;">
-  <tbody>
-    <tr>
-      <th style="text-align: left;">Number of files</th>
-      <td>1,101</td>
-    </tr>
-    <tr>
-      <th style="text-align: left;">Total duration</th>
-      <td>5,583.2 seconds</td>
-    </tr>
-    <tr>
-      <th style="text-align: left;">Average duration per file</th>
-      <td>5.07 seconds</td>
-    </tr>
-    <tr>
-      <th style="text-align: left;">Number of speakers</th>
-      <td>10</td>
-    </tr>
-    <tr>
-      <th style="text-align: left;">Number of male speakers</th>
-      <td>8</td>
-    </tr>
-    <tr>
-      <th style="text-align: left;">Number of female speakers</th>
-      <td>2</td>
-    </tr>
-    <tr>
-      <th style="text-align: left;">Source</th>
-      <td><a href="https://www.youtube.com/@mitocw" target="_blank">MIT OpenCourseWare</a></td>
-    </tr>
-  </tbody>
-</table>
-
-#### WERs of various ASR models on the MathSpeech benchmark
-<table style="width:100%; border-collapse: collapse;">
-  <thead>
-    <tr>
-      <th></th>
-      <th>Models</th>
-      <th>Params</th>
-      <th>WER (%) (Leaderboard)</th>
-      <th>WER (%) (Formula)</th>
-    </tr>
-  </thead>
-  <tbody>
-    <tr>
-      <td rowspan="4">OpenAI</td>
-      <td>Whisper-base</td>
-      <td>74M</td>
-      <td>10.3</td>
-      <td>34.7</td>
-    </tr>
-    <tr>
-      <td>Whisper-small</td>
-      <td>244M</td>
-      <td>8.59</td>
-      <td>29.5</td>
-    </tr>
-    <tr>
-      <td>Whisper-largeV2</td>
-      <td>1550M</td>
-      <td>7.83</td>
-      <td>31.0</td>
-    </tr>
-    <tr>
-      <td>Whisper-largeV3</td>
-      <td>1550M</td>
-      <td>7.44</td>
-      <td>33.3</td>
-    </tr>
-    <tr>
-      <td>NVIDIA</td>
-      <td>Canary-1B</td>
-      <td>1B</td>
-      <td>6.5</td>
-      <td>35.2</td>
-    </tr>
-  </tbody>
-</table>
-
-##### The Leaderboard WER values are from the [HuggingFace Open ASR Leaderboard](https://huggingface.co/spaces/hf-audio/open_asr_leaderboard) (as of 2024-08-16), while the Formula WER values were measured on our MathSpeech benchmark.
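The CER reported in the abstract and the WER values above are both normalized edit distances, computed over characters and words respectively. The following is a minimal illustrative sketch of these standard metrics, not the authors' evaluation code:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (rolling 1-D DP)."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (r != h))  # substitution
    return dp[-1]

def wer(ref, hyp):
    """Word error rate: word-level edit distance over reference word count."""
    return edit_distance(ref.split(), hyp.split()) / len(ref.split())

def cer(ref, hyp):
    """Character error rate: char-level edit distance over reference length."""
    return edit_distance(list(ref), list(hyp)) / len(ref)
```

For example, an ASR hypothesis that turns "sine" into "side" contributes one word error but only two character edits, which is why the character-level CER gives a finer-grained picture of formula transcription quality.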
-
----
-## MathSpeech Checkpoint
-You can download the MathSpeech checkpoint from the following [link](https://drive.google.com/file/d/1m0cCpDDkOb7FltjLPVlg4ZCZSSSWZgS2/view?usp=sharing).
-
-### Experiment code
-
-You can find the MathSpeech evaluation code and the prompts used for the LLMs in the experiments at the following [link](https://github.com/hyeonsieun/MathSpeech/tree/main/Experiments).
-
-### Ablation Study code
-
-You can find the code used in our Ablation Study at the following [link](https://github.com/hyeonsieun/MathSpeech/tree/main/Ablation_Study).
-
----
-## How to Use
-1. Clone this repository using the web URL.
-```bash
-git clone https://github.com/hyeonsieun/MathSpeech.git
-```
-2. To build the environment, run:
-```bash
-pip install -r requirements.txt
-```
-3. Place [the audio dataset and the transcription Excel file](https://drive.google.com/drive/folders/1M8_IVcesO2EwNcl9zwxY6UgqAmSODzgq?usp=sharing) inside the ASR folder.
-4. Run:
-```bash
-python ASR.py
-```
-5. Go to the Experiments folder.
-6. Move 'MathSpeech_checkpoint.pth' from the following [link](https://drive.google.com/file/d/1m0cCpDDkOb7FltjLPVlg4ZCZSSSWZgS2/view?usp=sharing) into the Experiments folder.
-7. Run:
-```bash
-python MathSpeech_eval.py
-```
-8. If you want to run LLMs such as GPT-4o or Gemini, you will need to configure environment settings such as the API key and endpoint.
-9. You can also run the Ablation Study code from the Ablation_Study folder.
-
-**Notes:** Example code for performing ASR with whisper-base and whisper-small is provided here. If you want to use a different ASR model, you can modify that part of the code and still use MathSpeech.
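For step 8, one common pattern is to read the API key and endpoint from environment variables before running the LLM baselines. This is a minimal sketch only: the variable names `OPENAI_API_KEY` and `OPENAI_API_BASE` are illustrative assumptions, not necessarily the names the repository's scripts expect.

```python
import os

def load_llm_config(env=None):
    """Read LLM API settings from environment variables.

    Note: the variable names used here are illustrative; check the
    repository's experiment scripts for the actual configuration keys.
    """
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY")
    if not key:
        raise RuntimeError("Set OPENAI_API_KEY before running the LLM baselines.")
    return {
        "api_key": key,
        # Fall back to the public endpoint when no custom one is set.
        "endpoint": env.get("OPENAI_API_BASE", "https://api.openai.com/v1"),
    }
```

Keeping credentials out of the source tree this way also avoids accidentally committing API keys alongside the experiment code.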
+# Website License
+<a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-sa/4.0/88x31.png" /></a><br />This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>.
