LecturePipeline

Transform hard-to-read English PPT slides into well-formatted Chinese PDF documents with properly rendered math formulas.

By Tianji Cui, Tongji University

Published on June 14th, 2026 (cc) BY-NC-SA

English PPT slides are notoriously difficult to work with:

- Unstructured text: Bullet points scattered across slides with no clear hierarchy
- Math formulas as images: Equations like $R^{n \times 1}$ are rendered as bitmaps, making them impossible to extract or edit
- Mixed layouts: Text, images, tables, and formulas all crammed together
- No semantic structure: LLMs can't understand slide boundaries or topic transitions

When you try to feed raw PPT content to an LLM, you get garbled text, broken formulas, and lost context. The result is unusable for learning, reference, or further processing.

Supplies

You will need SoMark (web interface or CLI) and an LLM. The full project is on GitHub at tjcty20051110/quickInformationSort4FinalExiamination.

Parse With SoMark (OCR + Structure Extraction)

The first step is SoMark, a high-precision document parsing service that converts PDFs and images into structured Markdown/JSON. Unlike traditional OCR, SoMark preserves:

- Heading hierarchy: so LLMs understand document sections
- Tables: fully reconstructed instead of flattened text
- Formulas: converted to LaTeX (e.g., $\mathbf{v} = \begin{bmatrix}v_1 \\ v_2 \\ \vdots \\ v_n\end{bmatrix}$)
- Multi-column layouts: reading order preserved
- Cross-page elements: tables and text spanning multiple pages are merged

How to Use SoMark

Option A: Web Interface (for small batches)

1. Go to https://somark.tech/
2. Upload your PPT/PDF file
3. Wait for parsing (100 pages ≈ 5 seconds)
4. Download the exported ZIP package containing main.md + images/
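Once the ZIP is downloaded, unpacking it can be scripted. The following is a minimal sketch using only the Python standard library; it assumes the archive holds `main.md` and an `images/` folder as described above, and `unpack_export` is a hypothetical helper name, not part of SoMark.

```python
import zipfile
from pathlib import Path


def unpack_export(zip_path, out_dir):
    """Extract an export ZIP and return (path to main.md or None, list of image paths)."""
    out = Path(out_dir)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(out)
    # Search recursively in case the archive nests its contents in a subfolder.
    md_files = sorted(out.rglob("main.md"))
    images = sorted(
        p for p in out.rglob("*") if p.is_file() and p.parent.name == "images"
    )
    return (md_files[0] if md_files else None), images
```

Calling `unpack_export("lecture.zip", "lecture")` (file names are placeholders) returns the Markdown path and the image paths so the later steps can locate them.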

Option B: API (for automation)

import somark

client = somark.Client(api_key="sk-your-api-key")

result = client.extract_document(
    file_path="lecture.pptx",
    output_formats=["markdown"],
    element_formats={
        "image": "url",
        "formula": "latex",  # Key: formulas as LaTeX!
        "table": "html",
    },
    feature_config={
        "enable_text_cross_page": True,
        "enable_table_cross_page": True,
        "enable_title_level_recognition": True,
    },
)

The full process is described on the SoMark site (SoMark 文档智能, "SoMark Document Intelligence").

Option C: MCP Server (for AI Agent workflows)

{
  "mcpServers": {
    "somark": {
      "command": "npx",
      "args": ["-y", "somark-mcp"],
      "env": { "SOMARK_API_KEY": "sk-your-api-key" }
    }
  }
}

Translate to Chinese (Bilingual Format) and Fix LaTeX Formula Glitches

SoMark's OCR is excellent, but math formulas often need cleanup. Here are the 6 most common issues and fixes:

### Issue 1: Spaces around subscripts/superscripts

```
Before: x _ {k}    After: x_{k}
Before: x ^ {2}    After: x^{2}
Before: v ^ {T}    After: v^{T}
```

**Regex fix:** `_\s+\{` → `_{`, `\^\s+\{` → `^{`
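To see the rule in action, here is a minimal sketch of the two substitutions as a standalone function (`tighten_script_braces` is a name chosen for illustration). Note that it only removes the space *after* `_` or `^`; the space *before* them is handled by Issue 2.

```python
import re


def tighten_script_braces(formula: str) -> str:
    """Remove whitespace between _ or ^ and the opening brace that follows it."""
    formula = re.sub(r'_\s+\{', '_{', formula)
    formula = re.sub(r'\^\s+\{', '^{', formula)
    return formula


print(tighten_script_braces("x _ {k}"))  # x _{k}
```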

### Issue 2: Spaces between variable and _/^

```
Before: v _1      After: v_1
Before: R ^{n}    After: R^{n}
Before: } _{k}    After: }_{k}
```

**Regex fix:** `([a-zA-Z\d\)\}])\s+_` → `\1_`, and likewise `([a-zA-Z\d\)\}])\s+\^` → `\1^`
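A minimal sketch of this rule, handling `_` and `^` in one pattern (`attach_scripts` is a hypothetical name). It pulls the script operator back onto a preceding letter, digit, `)` or `}`:

```python
import re


def attach_scripts(formula: str) -> str:
    """Remove whitespace between a symbol (or closing bracket) and a following _ or ^."""
    return re.sub(r'([a-zA-Z\d)}])\s+([_^])', r'\1\2', formula)


print(attach_scripts("v _1"))    # v_1
print(attach_scripts("R ^{n}"))  # R^{n}
```

Because the pattern only looks at what comes before `_` or `^`, it combines with the Issue 1 rule rather than replacing it.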

### Issue 3: OCR artifacts

```
Before: \#1\#matrix A            After: A
Before: \operatorname*{d e t}    After: \det
Before: \mathrm{m i n}           After: \min
Before: \mathrm{m a x}           After: \max
Before: \mathsf{x}               After: x
```
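These artifacts can be cleaned with a few targeted substitutions. The sketch below is one possible approach, not the project's exact code: `KNOWN_OPERATORS`, `clean_artifacts`, and the single-character `\mathsf` rule are assumptions made for illustration. It rejoins spaced-out letters only when the result is a known operator name, so something like `\mathrm{d}` is left alone.

```python
import re

# Operator names to restore; extend this set as new cases turn up (assumed list).
KNOWN_OPERATORS = {"det", "min", "max"}


def _rejoin_operator(m):
    name = m.group(1).replace(" ", "")
    # Rewrite only known operators; return the original text otherwise.
    return "\\" + name if name in KNOWN_OPERATORS else m.group(0)


def clean_artifacts(formula: str) -> str:
    # \operatorname*{d e t} or \mathrm{m i n}  ->  \det, \min
    formula = re.sub(
        r'\\(?:operatorname\*?|mathrm)\{([a-z ]+)\}', _rejoin_operator, formula
    )
    # \mathsf{x} -> x (single character only)
    formula = re.sub(r'\\mathsf\{(\w)\}', r'\1', formula)
    # Remove the #1#matrix marker, with or without backslashes before each '#'
    formula = re.sub(r'\\?#1\\?#matrix\s*', '', formula)
    return formula


print(clean_artifacts(r"\operatorname*{d e t} A"))  # \det A
```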

### Issue 4: Escaped ^ and _

```
Before: \^{T}    After: ^{T}
Before: \_       After: _
```
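A minimal sketch of the unescaping step (`unescape_scripts` is a hypothetical name). As a small variation on a plain find-and-replace, it uses a negative lookbehind so that a LaTeX line break `\\` followed by `_` or `^` is not touched. It should only be applied to text inside math delimiters, since `\_` is a legitimate literal underscore in ordinary LaTeX text.

```python
import re


def unescape_scripts(formula: str) -> str:
    """Turn \\^ and \\_ into ^ and _ unless the backslash is itself escaped."""
    return re.sub(r'(?<!\\)\\([_^])', r'\1', formula)


print(unescape_scripts(r"v\^{T}"))  # v^{T}
```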

### Issue 5: $$ must be on its own line

```
Before: $$\mathbf{A}x = b$$

After:
$$
\mathbf{A}x = b
$$
```
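One way to enforce this layout is a single substitution over the whole document. The sketch below (with the hypothetical name `put_display_math_on_own_lines`) moves each `$$ ... $$` pair onto its own lines and then collapses runs of blank lines, so running it twice gives the same result as running it once.

```python
import re


def put_display_math_on_own_lines(text: str) -> str:
    def repl(m):
        # Surround the delimiters with blank lines and put the formula between them.
        return "\n\n$$\n" + m.group(1).strip() + "\n$$\n\n"

    text = re.sub(r'\$\$(.+?)\$\$', repl, text, flags=re.DOTALL)
    # Collapse three or more consecutive newlines so repeated runs stay stable.
    return re.sub(r'\n{3,}', '\n\n', text)
```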

### Issue 6: Extra spaces

```
Before: R^{n  \times  1}    After: R^{n \times 1}
```
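The full script does not include a rule for this issue. A minimal sketch, assuming "extra spaces" means runs of two or more consecutive spaces or tabs inside a formula (`collapse_spaces` is a hypothetical name):

```python
import re


def collapse_spaces(formula: str) -> str:
    """Replace runs of spaces or tabs with a single space and trim the ends."""
    return re.sub(r'[ \t]{2,}', ' ', formula).strip()
```

LaTeX ignores extra whitespace in math mode, so this changes only how the source reads, not how the formula renders.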

**Full Python fix script** (save as `fix_formulas.py` and run):

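```python
import glob
import os
import re


def fix_formula_inner(formula):
    """Clean up one formula body (the text between the $ delimiters)."""
    # Issue 1: drop spaces between _ or ^ and the opening brace
    formula = re.sub(r'_\s+\{', '_{', formula)
    formula = re.sub(r'\^\s+\{', '^{', formula)
    # Issue 2: drop spaces between a symbol, ')' or '}' and the following _ or ^
    formula = re.sub(r'([a-zA-Z\d\)])\s+_', r'\1_', formula)
    formula = re.sub(r'([a-zA-Z\d\)])\s+\^', r'\1^', formula)
    formula = re.sub(r'\}\s+_', r'}_', formula)
    formula = re.sub(r'\}\s+\^', r'}^', formula)
    # Issue 3: rejoin operator names that OCR split into letters
    formula = re.sub(r'\\operatorname\*\{d e t\}', r'\\det', formula)
    formula = re.sub(r'\\mathrm\{m i n\}', r'\\min', formula)
    formula = re.sub(r'\\mathrm\{m a x\}', r'\\max', formula)
    # Issue 4: unescape \^ and \_
    formula = re.sub(r'\\\^', '^', formula)
    formula = re.sub(r'\\_', '_', formula)
    # Issue 3: remove the #1#matrix marker, with or without a backslash before each '#'
    formula = re.sub(r'\\?#1\\?#matrix\s*', '', formula)
    return formula


def fix_content(content):
    """Fix every display and inline formula in a Markdown string."""
    def fix_display_math(m):
        formula = fix_formula_inner(m.group(1).strip())
        # Issue 5: each $$ goes on its own line
        return f'\n\n$$\n{formula}\n$$\n\n'

    content = re.sub(r'\$\$\s*(.+?)\s*\$\$', fix_display_math, content, flags=re.DOTALL)

    def fix_inline_math(m):
        formula = fix_formula_inner(m.group(1))
        return f'${formula}$'

    # Single-$ formulas on one line; the lookarounds skip the $$ delimiters
    content = re.sub(r'(?<![\$])\$(?!\$)([^\$\n]+?)\$(?!\$)', fix_inline_math, content)
    return content


if __name__ == '__main__':
    # Rewrite every Markdown file under the current directory in place
    for f in glob.glob(r'**/*.md', recursive=True):
        # main.md is skipped; only the other Markdown files are rewritten
        if os.path.basename(f) == 'main.md':
            continue
        with open(f, 'r', encoding='utf-8') as fh:
            content = fh.read()
        with open(f, 'w', encoding='utf-8') as fh:
            fh.write(fix_content(content))
        print(f"Fixed: {os.path.basename(f)}")
```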
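Because the script overwrites files in place, it can help to preview what would change first. The following is a minimal sketch using the standard library's `difflib`; `preview_changes` is a hypothetical helper and is not part of the script above.

```python
import difflib


def preview_changes(original: str, fixed: str, name: str = "file.md") -> str:
    """Return a unified diff between the original and fixed text ('' if identical)."""
    diff = difflib.unified_diff(
        original.splitlines(keepends=True),
        fixed.splitlines(keepends=True),
        fromfile=f"{name} (before)",
        tofile=f"{name} (after)",
    )
    return "".join(diff)
```

Inside the file loop, `print(preview_changes(content, fix_content(content), f))` would show the edits for each file before anything is written.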

---

Generate Self-Contained HTML

Convert the fixed Markdown into a **self-contained HTML file** with:

- **MathJax 3** for formula rendering (loaded from CDN)

- **Base64-embedded images** so the HTML works offline

- **Chinese typography** (Microsoft YaHei, A4-width layout)
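The three requirements above can be captured in a small page template. This is a minimal sketch, not the project's actual template: `build_page`, the CSS values, and the choice of the jsDelivr CDN URL for MathJax 3 are assumptions, and the body is expected to be HTML already converted from Markdown by a tool of your choice.

```python
import html

# MathJax 3 combined component served from jsDelivr (assumed CDN choice).
MATHJAX_CDN = "https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js"

TEMPLATE = """<!DOCTYPE html>
<html lang="zh-CN">
<head>
<meta charset="utf-8">
<title>{title}</title>
<script>
window.MathJax = {{
  tex: {{ inlineMath: [['$', '$']], displayMath: [['$$', '$$']] }}
}};
</script>
<script async src="{mathjax}"></script>
<style>
body {{ font-family: "Microsoft YaHei", sans-serif; max-width: 210mm;
       margin: 0 auto; padding: 15mm; line-height: 1.7; }}
img {{ max-width: 100%; }}
</style>
</head>
<body>
{body}
</body>
</html>
"""


def build_page(body_html: str, title: str = "Lecture notes") -> str:
    """Wrap converted HTML in a page with MathJax 3 and A4-width Chinese typography."""
    return TEMPLATE.format(
        title=html.escape(title), mathjax=MATHJAX_CDN, body=body_html
    )
```

The `max-width: 210mm` rule approximates A4 width on screen. Since MathJax is loaded from a CDN, formulas need a network connection at render time even though the images do not.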

**Why base64?** If you use relative paths like `![](./images/xxx.jpg)`, the images will break when you move the file or convert to PDF. Base64 embeds the image data directly in the HTML.
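The embedding itself can be sketched in a few lines of standard-library Python. This is an illustrative version under stated assumptions: `embed_images` is a hypothetical name, images are referenced with standard Markdown `![alt](path)` syntax, and paths are relative to `base_dir`. Remote URLs and missing files are left unchanged.

```python
import base64
import mimetypes
import re
from pathlib import Path

# Markdown image syntax: ![alt](path), with no spaces in the path
IMG_PATTERN = re.compile(r'!\[([^\]]*)\]\(([^)\s]+)\)')


def embed_images(markdown: str, base_dir: str) -> str:
    """Replace local image paths with base64 data URIs."""
    def repl(m):
        alt, src = m.group(1), m.group(2)
        if src.startswith(("http://", "https://", "data:")):
            return m.group(0)  # already remote or embedded
        path = Path(base_dir) / src
        if not path.is_file():
            return m.group(0)  # leave broken references untouched
        mime = mimetypes.guess_type(path.name)[0] or "application/octet-stream"
        data = base64.b64encode(path.read_bytes()).decode("ascii")
        return f"![{alt}](data:{mime};base64,{data})"

    return IMG_PATTERN.sub(repl, markdown)
```

Base64 encoding makes each image roughly a third larger, so the resulting HTML can grow quickly for image-heavy lectures.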