LecturePipeline
Transform unreadable English PPT slides into well-formatted Chinese PDF documents with properly rendered math formulas.
By Tianji Cui, Tongji University
Published on June 14th, 2026 (cc) BY-NC-SA
English PPT slides are notoriously difficult to work with:
- Unstructured text: Bullet points scattered across slides with no clear hierarchy
- Math formulas as images: Equations like $R^{n \times 1}$ are rendered as bitmaps, making them impossible to extract or edit
- Mixed layouts: Text, images, tables, and formulas all crammed together
- No semantic structure: LLMs can't understand slide boundaries or topic transitions
When you try to feed raw PPT content to an LLM, you get garbled text, broken formulas, and lost context. The result is unusable for learning, reference, or further processing.
Supplies
You need SoMark (web interface or CLI) and an LLM. The full project is on GitHub at tjcty20051110/quickInformationSort4FinalExiamination.
Parse With SoMark (OCR + Structure Extraction)
The first step is SoMark, a high-precision document parsing service that converts PDFs and images into structured Markdown/JSON. Unlike traditional OCR, SoMark preserves:
- Heading hierarchy — so LLMs understand document sections
- Tables — fully reconstructed instead of flattened text
- Formulas — converted to LaTeX (e.g., $\mathbf{v} = \begin{bmatrix}v_1 \\ v_2 \\ \vdots \\ v_n\end{bmatrix}$)
- Multi-column layouts — reading order preserved
- Cross-page elements — tables and text spanning multiple pages are merged
How to Use SoMark
Option A: Web Interface (for small batches)
- Go to https://somark.tech/
- Upload your PPT/PDF file
- Wait for parsing (100 pages ≈ 5 seconds)
- Download the exported ZIP package containing main.md + images/
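Once the ZIP is downloaded, the rest of the pipeline only needs the path to `main.md` and the files under `images/`. A minimal sketch of that unpacking step, assuming the archive layout described above; the function name `unpack_export` and the directory names are hypothetical, not part of SoMark:

```python
import zipfile
from pathlib import Path


def unpack_export(zip_path, out_dir):
    """Extract an export ZIP and return (path to main.md, list of image paths)."""
    out = Path(out_dir)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(out)
    # Search recursively in case the archive wraps everything in a top-level folder.
    md_files = sorted(out.rglob("main.md"))
    if not md_files:
        raise FileNotFoundError(f"main.md not found in {zip_path}")
    main_md = md_files[0]
    images_dir = main_md.parent / "images"
    images = []
    if images_dir.is_dir():
        images = sorted(p for p in images_dir.iterdir() if p.is_file())
    return main_md, images
```

Calling `unpack_export("lecture.zip", "lecture_out")` would then give you the Markdown path to feed into the later steps.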
Option B: API (for automation)
```python
import somark

client = somark.Client(api_key="sk-your-api-key")
result = client.extract_document(
    file_path="lecture.pptx",
    output_formats=["markdown"],
    element_formats={
        "image": "url",
        "formula": "latex",  # Key: formulas as LaTeX!
        "table": "html",
    },
    feature_config={
        "enable_text_cross_page": True,
        "enable_table_cross_page": True,
        "enable_title_level_recognition": True,
    },
)
```
The full process is documented at SoMark 文档智能 (SoMark Document Intelligence).
Option C: MCP Server (for AI Agent workflows)
```json
{
  "mcpServers": {
    "somark": {
      "command": "npx",
      "args": ["-y", "somark-mcp"],
      "env": { "SOMARK_API_KEY": "sk-your-api-key" }
    }
  }
}
```
Translate to Chinese (Bilingual Format) and Fix LaTeX Formula Glitches
SoMark's OCR is excellent, but math formulas often need cleanup. Here are the 6 most common issues and fixes:
### Issue 1: Spaces around subscripts/superscripts
```
Before: x _ {k} After: x_{k}
Before: x ^ {2} After: x^{2}
Before: v ^ {T} After: v^{T}
```
**Regex fix:** `_\s+\{` → `_{`, `\^\s+\{` → `^{`
### Issue 2: Spaces between variable and _/^
```
Before: v _1 After: v_1
Before: R ^{n} After: R^{n}
Before: } _{k} After: }_{k}
```
**Regex fix:** `([a-zA-Z\d\)\}])\s+_` → `\1_`, `([a-zA-Z\d\)\}])\s+\^` → `\1^`
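To see how the spacing rules from Issues 1 and 2 combine, here is a minimal sketch that applies them in order to the "Before" examples. The `^` form of the Issue 2 pattern is written by analogy with the `_` form, and the names `SPACING_RULES` and `tighten_scripts` are illustrative only:

```python
import re

# Spacing rules from Issues 1 and 2, applied in this order.
SPACING_RULES = [
    (r'_\s+\{', '_{'),                    # "_ {k}"  -> "_{k}"
    (r'\^\s+\{', '^{'),                   # "^ {2}"  -> "^{2}"
    (r'([a-zA-Z\d\)\}])\s+_', r'\1_'),    # "v _1"   -> "v_1"
    (r'([a-zA-Z\d\)\}])\s+\^', r'\1^'),   # "R ^{n}" -> "R^{n}"
]


def tighten_scripts(formula):
    """Remove stray spaces around subscript and superscript markers."""
    for pattern, repl in SPACING_RULES:
        formula = re.sub(pattern, repl, formula)
    return formula


for before in ['x _ {k}', 'x ^ {2}', 'v _1', 'R ^{n}', '} _{k}']:
    print(f'{before!r:12} -> {tighten_scripts(before)!r}')
```

The order matters for inputs like `x _ {k}`: the first rule closes the gap after `_`, and the third closes the gap before it.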
### Issue 3: OCR artifacts
```
Before: \#1\#matrix A After: A
Before: \operatorname*{d e t} After: \det
Before: \mathrm{m i n} After: \min
Before: \mathrm{m a x} After: \max
Before: \mathsf{x} After: x
```
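Issue 3 lists no regex, so the following is one possible sketch rather than the project's own fix. It collapses letter-spaced operator names in general instead of hard-coding each spelling. The operator whitelist and the function names are assumptions; extend the set with whatever appears in your own slides:

```python
import re

# Operators that have a dedicated LaTeX command (assumed list, extend as needed).
KNOWN_OPERATORS = {'det', 'min', 'max'}

# Matches \mathrm{m i n}, \operatorname{d e t}, \operatorname*{m a x}, ...
SPACED_NAME = re.compile(r'\\(?:mathrm|operatorname\*?)\{((?:[a-z] )+[a-z])\}')


def collapse_spaced_operator(match):
    """Turn a letter-spaced name into \\name if it is a known operator."""
    name = match.group(1).replace(' ', '')
    if name in KNOWN_OPERATORS:
        return '\\' + name
    return match.group(0)  # unknown name: leave the original text untouched


def fix_ocr_artifacts(formula):
    formula = SPACED_NAME.sub(collapse_spaced_operator, formula)
    formula = re.sub(r'\\mathsf\{([A-Za-z])\}', r'\1', formula)  # \mathsf{x} -> x
    formula = re.sub(r'\\#1\\#matrix\s*', '', formula)           # stray \#1\#matrix marker
    return formula
```

Leaving unknown names untouched keeps something like `\mathrm{k g}` from being rewritten into a command that does not exist.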
### Issue 4: Escaped ^ and _
```
Before: \^{T} After: ^{T}
Before: \_ After: _
```
### Issue 5: $$ must be on its own line
```
Before: $$\mathbf{A}x = b$$
After:
$$
\mathbf{A}x = b
$$
```
### Issue 6: Extra spaces
```
Before: R^{n   \times   1}    After: R^{n \times 1}
```
**Full Python fix script** (save as `fix_formulas.py` and run):
```python
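import glob
import os
import re


def fix_formula_inner(formula):
    """Clean up one formula body (without its $ delimiters)."""
    # Issue 1: spaces between _ or ^ and the opening brace
    formula = re.sub(r'_\s+\{', '_{', formula)
    formula = re.sub(r'\^\s+\{', '^{', formula)
    # Issue 2: spaces between a symbol (or closing brace) and _ or ^
    formula = re.sub(r'([a-zA-Z\d\)])\s+_', r'\1_', formula)
    formula = re.sub(r'([a-zA-Z\d\)])\s+\^', r'\1^', formula)
    formula = re.sub(r'\}\s+_', r'}_', formula)
    formula = re.sub(r'\}\s+\^', r'}^', formula)
    # Issue 3: OCR artifacts
    formula = re.sub(r'\\operatorname\*\{d e t\}', r'\\det', formula)
    formula = re.sub(r'\\mathrm\{m i n\}', r'\\min', formula)
    formula = re.sub(r'\\mathrm\{m a x\}', r'\\max', formula)
    formula = re.sub(r'\\mathsf\{([A-Za-z])\}', r'\1', formula)
    # The marker appears as \#1\#matrix in the Markdown, so match the
    # backslashes too (with or without them, to be safe).
    formula = re.sub(r'\\?#1\\?#matrix\s*', '', formula)
    # Issue 4: escaped ^ and _
    formula = re.sub(r'\\\^', '^', formula)
    formula = re.sub(r'\\_', '_', formula)
    return formula


def fix_content(content):
    """Fix every display and inline formula in a Markdown string."""
    def fix_display_math(m):
        formula = fix_formula_inner(m.group(1).strip())
        # Issue 5: put each $$ on its own line
        return f'\n\n$$\n{formula}\n$$\n\n'

    content = re.sub(r'\$\$\s*(.+?)\s*\$\$', fix_display_math, content, flags=re.DOTALL)

    def fix_inline_math(m):
        formula = fix_formula_inner(m.group(1))
        return f'${formula}$'

    # A single $ that is not part of $$, with the whole formula on one line
    content = re.sub(r'(?<![\$])\$(?!\$)([^\$\n]+?)\$(?!\$)', fix_inline_math, content)
    return content


# Rewrite every Markdown file under the current directory in place.
for f in glob.glob(r'**/*.md', recursive=True):
    if os.path.basename(f) == 'main.md':
        continue  # leave files named main.md untouched
    with open(f, 'r', encoding='utf-8') as fh:
        content = fh.read()
    with open(f, 'w', encoding='utf-8') as fh:
        fh.write(fix_content(content))
    print(f"Fixed: {os.path.basename(f)}")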
```
---
Generate Self-Contained HTML
Convert the fixed Markdown into a **self-contained HTML file** with:
- **MathJax 3** for formula rendering (loaded from CDN)
- **Base64-embedded images** so the HTML works offline
- **Chinese typography** (Microsoft YaHei, A4-width layout)
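
The three requirements above can be sketched as a page template. This is an illustration under stated assumptions, not the project's actual template: it expects the Markdown to have been converted to an HTML fragment already (that conversion is not shown), the names `HTML_TEMPLATE` and `build_page` are hypothetical, and the padding and line-height values are arbitrary. The MathJax configuration enables `$...$` as an inline delimiter, which MathJax 3 does not do by default.

```python
import html

# Doubled braces are literal braces once str.format() has run.
HTML_TEMPLATE = """<!DOCTYPE html>
<html lang="zh-CN">
<head>
<meta charset="utf-8">
<title>{title}</title>
<script>
  window.MathJax = {{ tex: {{ inlineMath: [['$', '$']] }} }};
</script>
<script async src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js"></script>
<style>
  body {{
    font-family: "Microsoft YaHei", sans-serif;
    max-width: 210mm;  /* A4 width */
    margin: 0 auto;
    padding: 15mm;
    line-height: 1.7;
  }}
  img {{ max-width: 100%; }}
</style>
</head>
<body>
{body}
</body>
</html>
"""


def build_page(body_html, title="Lecture notes"):
    """Wrap an already-converted HTML fragment in the page template."""
    return HTML_TEMPLATE.format(title=html.escape(title), body=body_html)
```

Because MathJax is loaded from a CDN, formulas still need a network connection to render in the browser.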
**Why base64?** If the HTML references images by relative path (for example, files in the exported `images/` folder), the links break when you move the file or convert it to PDF. Base64 embeds the image data directly in the HTML.
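
A minimal sketch of the embedding step, assuming the HTML uses plain `<img src="...">` tags with double-quoted, non-URL-encoded local paths. The function name `embed_images` is hypothetical, and remote or already embedded images are left alone:

```python
import base64
import mimetypes
import re
from pathlib import Path

# Captures: (everything up to the opening quote of src)(the path)(closing quote)
IMG_SRC = re.compile(r'(<img\b[^>]*?\bsrc=")([^"]+)(")', re.IGNORECASE)


def embed_images(html_text, base_dir):
    """Replace local <img src="..."> paths with base64 data URIs."""
    base = Path(base_dir)

    def to_data_uri(match):
        src = match.group(2)
        if src.startswith(('data:', 'http://', 'https://')):
            return match.group(0)  # already embedded or remote
        path = base / src
        if not path.is_file():
            return match.group(0)  # missing file: leave the tag as it is
        mime = mimetypes.guess_type(path.name)[0] or 'application/octet-stream'
        payload = base64.b64encode(path.read_bytes()).decode('ascii')
        return f'{match.group(1)}data:{mime};base64,{payload}{match.group(3)}'

    return IMG_SRC.sub(to_data_uri, html_text)
```

Base64 makes the data roughly a third larger than the original files, so a slide deck with many large figures produces a correspondingly large HTML file.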