LecturePipeline

by ClumsyCalendar in Teachers > University+



former_picture.png

Transform unreadable English PPT slides into well-formatted, Chinese PDF documents with properly rendered math formulas.

By Tianji Cui, Tongji University

Published on June 14th, 2026 (cc) BY-NC-SA

English PPT slides are notoriously difficult to work with:

  1. Unstructured text: Bullet points scattered across slides with no clear hierarchy
  2. Math formulas as images: Equations like $R^{n \times 1}$ are rendered as bitmaps, making them impossible to extract or edit
  3. Mixed layouts: Text, images, tables, and formulas all crammed together
  4. No semantic structure: LLMs can't understand slide boundaries or topic transitions

When you try to feed raw PPT content to an LLM, you get garbled text, broken formulas, and lost context. The result is unusable for learning, reference, or further processing.

Supplies

github_repo.png
somark_demo.png

You need SoMark (web interface or CLI) and an LLM. The full project is on GitHub at tjcty20051110/quickInformationSort4FinalExiamination.

Parse With SoMark (OCR + Structure Extraction)

FW7FUNCMQA4XQVT.png

The first step is SoMark, a high-precision document parsing service that converts PDFs and images into structured Markdown/JSON. Unlike traditional OCR, SoMark preserves:

  1. Heading hierarchy — so LLMs understand document sections
  2. Tables — fully reconstructed instead of flattened text
  3. Formulas — converted to LaTeX (e.g., $\mathbf{v} = \begin{bmatrix}v_1 \\ v_2 \\ \vdots \\ v_n\end{bmatrix}$)
  4. Multi-column layouts — reading order preserved
  5. Cross-page elements — tables and text spanning multiple pages are merged

How to Use SoMark

Option A: Web Interface (for small batches)

  1. Go to https://somark.tech/
  2. Upload your PPT/PDF file
  3. Wait for parsing (100 pages ≈ 5 seconds)
  4. Download the exported ZIP package containing main.md + images/
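Once the ZIP is downloaded, the later steps need `main.md` and the `images/` folder on disk. A minimal sketch of unpacking the export, assuming the layout described above (`main.md` next to `images/`, possibly inside one wrapper folder); the function name `unpack_export` is hypothetical and not part of SoMark:

```python
import zipfile
from pathlib import Path


def unpack_export(zip_path, out_dir):
    """Extract an export ZIP and return (markdown_text, image_paths)."""
    out = Path(out_dir)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(out)
    # Assumption: main.md sits at the top level or inside a wrapper folder.
    candidates = sorted(out.rglob("main.md"))
    if not candidates:
        raise FileNotFoundError(f"no main.md found in {zip_path}")
    main_md = candidates[0]
    images_dir = main_md.parent / "images"
    images = []
    if images_dir.is_dir():
        images = sorted(p for p in images_dir.iterdir() if p.is_file())
    return main_md.read_text(encoding="utf-8"), images
```

Calling `unpack_export("lecture.zip", "lecture_out")` would return the Markdown text plus a sorted list of image paths; both file names here are placeholders.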

Option B: API (for automation)

    import somark

    client = somark.Client(api_key="sk-your-api-key")

    result = client.extract_document(
        file_path="lecture.pptx",
        output_formats=["markdown"],
        element_formats={
            "image": "url",
            "formula": "latex",  # Key: formulas as LaTeX!
            "table": "html",
        },
        feature_config={
            "enable_text_cross_page": True,
            "enable_table_cross_page": True,
            "enable_title_level_recognition": True,
        }
    )

The whole process is documented at SoMark 文档智能 (SoMark Document Intelligence).

Option C: MCP Server (for AI Agent workflows)

    {
      "mcpServers": {
        "somark": {
          "command": "npx",
          "args": ["-y", "somark-mcp"],
          "env": { "SOMARK_API_KEY": "sk-your-api-key" }
        }
      }
    }

Translate to Chinese (Bilingual Format) and Fix LaTeX Formula Glitches

bad_demo.png


SoMark's OCR is excellent, but math formulas often need cleanup. Here are the 6 most common issues and fixes:


### Issue 1: Spaces around subscripts/superscripts


```
Before: x _ {k}    After: x_{k}
Before: x ^ {2}    After: x^{2}
Before: v ^ {T}    After: v^{T}
```


**Regex fix:** `_\s+\{` → `_{`, `\^\s+\{` → `^{`


### Issue 2: Spaces between variable and _/^


```
Before: v _1      After: v_1
Before: R ^{n}    After: R^{n}
Before: } _{k}    After: }_{k}
```


**Regex fix:** `([a-zA-Z\d\)\}])\s+_` → `\1_`, and likewise `([a-zA-Z\d\)\}])\s+\^` → `\1^`
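The two spacing issues above can be tried out in isolation before running anything on real files. This is a small sketch that folds the `_` and `^` cases into one pattern each; the helper name `tighten_scripts` is hypothetical:

```python
import re


def tighten_scripts(formula: str) -> str:
    """Remove stray spaces around _ and ^ inside one LaTeX formula."""
    # Issue 1: space between _ or ^ and the opening brace ("x _ {k}").
    formula = re.sub(r'([_^])\s+\{', r'\1{', formula)
    # Issue 2: space between the base (letter, digit, ")" or "}") and _ or ^.
    formula = re.sub(r'([A-Za-z\d)}])\s+([_^])', r'\1\2', formula)
    return formula


for sample in ["x _ {k}", "v ^ {T}", "v _1", "R ^{n}", "} _{k}"]:
    print(f"{sample!r:12} -> {tighten_scripts(sample)!r}")
```

Ordinary spacing such as `a + b` is left alone because neither pattern matches without a `_` or `^`.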


### Issue 3: OCR artifacts


```
Before: \#1\#matrix A            After: A
Before: \operatorname*{d e t}    After: \det
Before: \mathrm{m i n}           After: \min
Before: \mathrm{m a x}           After: \max
Before: \mathsf{x}               After: x
```
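The spaced-out operator names (`d e t`, `m i n`, `m a x`) share one shape, so a single pattern can cover them instead of one rule per name. The sketch below is an assumption about how to generalize the table above, not the project's own code. It only rewrites names listed in `KNOWN_OPERATORS`, which holds just the three operators shown here:

```python
import re

# Only the operators from the table above; extend the set at your own risk.
KNOWN_OPERATORS = {"det", "min", "max"}

# Matches \operatorname{...}, \operatorname*{...} or \mathrm{...} whose
# argument is single letters separated by single spaces, e.g. "d e t".
_SPACED = re.compile(r'\\(?:operatorname\*?|mathrm)\{((?:[A-Za-z] )+[A-Za-z])\}')


def _collapse(m):
    name = m.group(1).replace(" ", "")
    if name in KNOWN_OPERATORS:
        return "\\" + name
    return m.group(0)  # unknown name: leave the original text untouched


def fix_spaced_operators(formula: str) -> str:
    return _SPACED.sub(_collapse, formula)
```

Leaving unknown names untouched keeps something like `\mathrm{k g}` from being turned into an undefined macro.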


### Issue 4: Escaped ^ and _


```
Before: \^{T}    After: ^{T}
Before: \_       After: _
```


### Issue 5: $$ must be on its own line


```
Before: $$\mathbf{A}x = b$$

After:
$$
\mathbf{A}x = b
$$
```
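A short sketch of this rule on its own, assuming display math is always delimited by `$$ ... $$`. The helper name `isolate_display_math` is hypothetical. It also collapses the runs of blank lines that the replacement creates, so running it twice gives the same result as running it once:

```python
import re


def isolate_display_math(text: str) -> str:
    """Put every $$ delimiter on its own line."""
    def repl(m):
        return "\n\n$$\n" + m.group(1).strip() + "\n$$\n\n"

    text = re.sub(r'\$\$(.+?)\$\$', repl, text, flags=re.DOTALL)
    # Note: this collapses blank-line runs everywhere in the text,
    # not only around formulas.
    return re.sub(r'\n{3,}', '\n\n', text)


print(isolate_display_math(r"Solve $$\mathbf{A}x = b$$ for x."))
```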


### Issue 6: Extra spaces


```
Before: R^{ n  \times  1 }    After: R^{n \times 1}
```


**Full Python fix script** (save as `fix_formulas.py` and run):


```python
import re
import glob
import os


def fix_formula_inner(formula):
    """Clean up one LaTeX formula (the text between the $ delimiters)."""
    # Issue 1: no space between _ or ^ and the opening brace
    formula = re.sub(r'_\s+\{', '_{', formula)
    formula = re.sub(r'\^\s+\{', '^{', formula)
    # Issue 2: no space between the base and _ or ^
    formula = re.sub(r'([a-zA-Z\d\)])\s+_', r'\1_', formula)
    formula = re.sub(r'([a-zA-Z\d\)])\s+\^', r'\1^', formula)
    formula = re.sub(r'\}\s+_', r'}_', formula)
    formula = re.sub(r'\}\s+\^', r'}^', formula)
    # Issue 3: OCR artifacts
    formula = re.sub(r'\\operatorname\*\{d e t\}', r'\\det', formula)
    formula = re.sub(r'\\mathrm\{m i n\}', r'\\min', formula)
    formula = re.sub(r'\\mathrm\{m a x\}', r'\\max', formula)
    formula = re.sub(r'\\mathsf\{([A-Za-z])\}', r'\1', formula)
    # Issue 4: unescape \^ and \_
    formula = re.sub(r'\\\^', '^', formula)
    formula = re.sub(r'\\_', '_', formula)
    # Issue 3 again: drop "\#1\#matrix", with or without the backslashes
    formula = re.sub(r'\\?#1\\?#matrix\s*', '', formula)
    return formula


def fix_content(content):
    """Fix every display and inline formula in a Markdown string."""

    def fix_display_math(m):
        formula = fix_formula_inner(m.group(1).strip())
        # Issue 5: each $$ goes on its own line
        return f'\n\n$$\n{formula}\n$$\n\n'

    content = re.sub(r'\$\$\s*(.+?)\s*\$\$', fix_display_math, content, flags=re.DOTALL)

    def fix_inline_math(m):
        formula = fix_formula_inner(m.group(1))
        return f'${formula}$'

    # Single-$ formulas only; the lookarounds skip $$ delimiters
    content = re.sub(r'(?<![\$])\$(?!\$)([^\$\n]+?)\$(?!\$)', fix_inline_math, content)
    return content


# Rewrite every Markdown file under the current directory in place
for f in glob.glob(r'**/*.md', recursive=True):
    if os.path.basename(f) == 'main.md':
        continue  # main.md itself is left untouched
    with open(f, 'r', encoding='utf-8') as fh:
        content = fh.read()
    with open(f, 'w', encoding='utf-8') as fh:
        fh.write(fix_content(content))
    print(f"Fixed: {os.path.basename(f)}")
```
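Because the script overwrites each file in place, it can help to look at the changes before committing to them. The sketch below is an optional addition, not part of the original project. It uses the standard `difflib` module, and the helper name `preview_changes` is hypothetical:

```python
import difflib


def preview_changes(original: str, fixed: str, name: str = "file.md") -> str:
    """Return a unified diff of what a fixer would change, without writing."""
    diff = difflib.unified_diff(
        original.splitlines(keepends=True),
        fixed.splitlines(keepends=True),
        fromfile=f"{name} (before)",
        tofile=f"{name} (after)",
    )
    return "".join(diff)  # empty string when nothing would change
```

Inside the loop, `print(preview_changes(content, fix_content(content), f))` could stand in for the write step during a dry run.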


---

Generate Self-Contained HTML

good_result.png

Convert the fixed Markdown into a **self-contained HTML file** with:


- **MathJax 3** for formula rendering (loaded from CDN)

- **Base64-embedded images** so the HTML works offline

- **Chinese typography** (Microsoft YaHei, A4-width layout)
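The three requirements above can be sketched as a page template. This is an illustration under assumptions, not the project's actual template. It expects the Markdown to have been converted to an HTML fragment already, since the converter is not specified here. The CSS values and the name `wrap_html` are hypothetical. MathJax 3 does not treat `$...$` as inline math by default, so the configuration enables it explicitly:

```python
import html

PAGE_TEMPLATE = """<!DOCTYPE html>
<html lang="zh-CN">
<head>
<meta charset="utf-8">
<title>{title}</title>
<style>
  /* A4 is 210mm wide; the padding value is an arbitrary choice */
  body {{ font-family: "Microsoft YaHei", sans-serif; max-width: 210mm;
         margin: 0 auto; padding: 15mm; line-height: 1.7; }}
  img {{ max-width: 100%; }}
</style>
<script>
  window.MathJax = {{ tex: {{ inlineMath: [['$', '$'], ['\\\\(', '\\\\)']] }} }};
</script>
<script async src="https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js"></script>
</head>
<body>
{body}
</body>
</html>
"""


def wrap_html(body_html: str, title: str = "Lecture notes") -> str:
    """Wrap an already-converted HTML fragment in the page template."""
    return PAGE_TEMPLATE.format(title=html.escape(title), body=body_html)
```

Since MathJax is loaded from the CDN, formulas still need a network connection to render, even when the images are embedded.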


**Why base64?** If you use relative paths like `![](./images/xxx.jpg)`, the images will break when you move the file or convert to PDF. Base64 embeds the image data directly in the HTML.
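A minimal sketch of the embedding step, assuming images are referenced with standard Markdown syntax and live relative to the Markdown file. The name `embed_images` is hypothetical. Remote URLs, existing data URIs, and missing files are left unchanged:

```python
import base64
import mimetypes
import re
from pathlib import Path

# ![alt](path) with no title part and no spaces in the path
IMG_PATTERN = re.compile(r'!\[([^\]]*)\]\(([^)\s]+)\)')


def embed_images(markdown_text: str, base_dir) -> str:
    """Replace local image paths with base64 data URIs."""
    base = Path(base_dir)

    def repl(m):
        alt, src = m.group(1), m.group(2)
        if src.startswith(("http://", "https://", "data:")):
            return m.group(0)
        path = base / src
        if not path.is_file():
            return m.group(0)  # keep the reference so the gap stays visible
        mime = mimetypes.guess_type(path.name)[0] or "application/octet-stream"
        data = base64.b64encode(path.read_bytes()).decode("ascii")
        return f"![{alt}](data:{mime};base64,{data})"

    return IMG_PATTERN.sub(repl, markdown_text)
```

Base64 encoding makes the data roughly a third larger than the original image files, so slide decks with many large images produce a correspondingly large HTML file.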