
# ESP32S3 Desktop Robot: From Hardware Assembly to Vision SLAM & VLN Simulation

> A step-by-step Instructables-style guide to building a palm-sized desktop robot with voice AI, vision SLAM, and LLM-based navigation, from copper-wire frame to autonomous simulation. All the source code is available on GitHub: https://github.com/tjcty20051110.

---

Authors: Tianyu Cui, Jiayi Li, Nixiao Wang, Yiran Zhang

Contributions:

- Tianyu Cui (40%): VLN simulation, circuit assembly, hardware selection, project management

- Jiayi Li (20%): Vision SLAM simulation, software development, hardware assembly

- Nixiao Wang (20%): Hardware assembly, project management

- Yiran Zhang (20%): Documentation, project management, community engagement

## Supplies

### Bill of Materials (BOM)

| Component | Spec | Quantity | Notes |
|-----------|-------|----------|-------|
| 8 Ω 2 W Cavity Speaker | 28 mm | 1 | Audio output |

## Copper Frame Assembly

The chassis is entirely hand-soldered from **1.5 mm copper wire**.

- Follow the dimensions in the drawing below to bend and solder the frame.

- The frame must support the ESP32-S3 board, OLED screen, motor driver, and batteries.

- The copper wire serves as both **structural support** and **electrical conductor**, eliminating the need for extra Dupont jumper wires.

## Wiring

Connect all modules according to the wiring diagram. The ESP32-S3 GPIO allocation is as follows:

| Function | GPIO | Connected Module |
|----------|------|------------------|
| I2C SDA | GPIO 8 | OLED, INMP441, MAX98357 |
| I2C SCL | GPIO 9 | OLED, INMP441, MAX98357 |
| Motor PWM (Left) | GPIO 6 | L298N ENA |
| Motor PWM (Right) | GPIO 7 | L298N ENB |
| Motor Direction A | GPIO 4 | L298N IN1 / IN3 |
| Motor Direction B | GPIO 5 | L298N IN2 / IN4 |

> **Tip:** Because the copper frame itself carries current, keep signal wires (I2S, I2C) short and away from motor PWM lines to reduce EMI.
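The motor rows of the table imply one constraint worth making explicit: IN1/IN3 and IN2/IN4 share a GPIO, so both motors always spin in the same direction and the robot steers only through unequal PWM duty. The sketch below models that mapping in plain Python for illustration; it is not the firmware. The `[-1, 1]` command range, the 8-bit duty scale, and the choice of which direction line means "forward" are assumptions, and `motor_outputs` is a hypothetical helper name.

```python
from dataclasses import dataclass

# GPIO numbers copied from the wiring table.
PIN_DIR_A, PIN_DIR_B = 4, 5   # shared direction lines (IN1/IN3, IN2/IN4)
PIN_ENA, PIN_ENB = 6, 7       # PWM enable: left, right


@dataclass
class MotorOutputs:
    dir_a: int       # logic level for GPIO 4
    dir_b: int       # logic level for GPIO 5
    duty_left: int   # PWM duty for GPIO 6, 0..255 (8-bit scale is an assumption)
    duty_right: int  # PWM duty for GPIO 7, 0..255


def motor_outputs(left: float, right: float) -> MotorOutputs:
    """Map wheel commands in [-1, 1] to pin states.

    The direction pins are shared, so the two commands must not have
    opposite signs; turning comes from unequal duty cycles.
    """
    if left * right < 0:
        raise ValueError("shared direction pins: wheels cannot counter-rotate")
    if left == 0 and right == 0:
        return MotorOutputs(0, 0, 0, 0)  # both direction lines low, no drive

    def duty(x: float) -> int:
        return min(255, int(abs(x) * 255))

    forward = (left + right) > 0  # polarity depends on how the motors are wired
    return MotorOutputs(int(forward), int(not forward), duty(left), duty(right))
```

For example, `motor_outputs(1.0, 0.5)` drives both wheels forward with the left wheel at full duty, which curves the robot to the right.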

## Wheel Option

You have two choices for the wheels:

- **Option A: 3D print.** Print TPU or PLA tires from the model files in `wheel.zip`. The design is based on this MakerWorld model: https://makerworld.com.cn/zh/models/2618748-n20-ju-an-zhi-yao-kong-lun-tai-mo-ju?from=search#profileId-3020184.

- **Option B: buy off-the-shelf.** Pre-made 64 T steel-gear wheels are surprisingly cheap and more durable. We recommend this route for cost and reliability.

## XiaoZhi AI Firmware: Giving the Robot a Voice

**XiaoZhi** is an open-source AI voice-assistant project targeting the ESP32 family. It brings natural-language conversation to microcontrollers with minimal hardware. The source code is in the `tjcty20051110/ESP32S3-Xiaozhi-Robot` repository.

### What XiaoZhi Can Do

- **Voice Wake-Up:** Say the wake word to activate the assistant.

- **Speech Recognition:** Converts your voice to text via cloud or local ASR.

- **LLM Dialogue:** Sends the text to a large language model for intelligent replies.

- **Text-to-Speech (TTS):** Streams the reply back as natural-sounding audio.

### Deployment Steps

1. **Extract** `xiaozhi-esp32-main.zip` to your project folder.

2. **Compile** with ESP-IDF or Arduino IDE (ESP32-S3 target).

3. **Configure** Wi-Fi credentials and API keys in `config.h`.

4. **Flash** the firmware to the ESP32-S3 via USB.

### On-Robot Integration

- **INMP441** (I2S digital microphone) captures your voice.

- **MAX98357** (I2S Class-D amplifier) drives the 8 Ω speaker to play XiaoZhi's replies.

- The OLED can display a simple "listening / thinking / speaking" animation as the robot's "eyes."

> **Pipeline:** Voice Wake-Up → ASR → LLM → TTS → Speaker Output
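To make the pipeline and the OLED "listening / thinking / speaking" states concrete, here is a minimal Python sketch of one conversation turn. It is illustrative only: the real XiaoZhi firmware runs on the ESP32-S3 and is not written this way. The names `run_turn`, `State`, and the callback parameters `asr`, `llm`, `tts`, and `on_state` are all hypothetical.

```python
from enum import Enum
from typing import Callable


class State(Enum):
    IDLE = "idle"
    LISTENING = "listening"
    THINKING = "thinking"
    SPEAKING = "speaking"


def run_turn(audio: bytes,
             asr: Callable[[bytes], str],
             llm: Callable[[str], str],
             tts: Callable[[str], bytes],
             on_state: Callable[[State], None]) -> bytes:
    """Run one wake-up -> ASR -> LLM -> TTS turn.

    on_state is called before each stage so a display (for example the
    OLED "eyes") can switch animation.
    """
    on_state(State.LISTENING)
    text = asr(audio)        # speech to text
    on_state(State.THINKING)
    reply = llm(text)        # text to reply text
    on_state(State.SPEAKING)
    speech = tts(reply)      # reply text to audio for the speaker
    on_state(State.IDLE)
    return speech


if __name__ == "__main__":
    # Stub stages stand in for the cloud or local services.
    out = run_turn(b"pcm-bytes",
                   asr=lambda a: "hello",
                   llm=lambda t: t.upper(),
                   tts=lambda r: r.encode(),
                   on_state=lambda s: print("state:", s.value))
    print(out)
```

Passing the stages in as callables keeps the turn logic independent of whether ASR runs in the cloud or locally.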

## ESP32S3-Cam Vision Module: Eyes for the Robot

So far the robot can *hear* and *speak*. To let it *see*, we add the **ESP32S3-Cam Vision Module**.

The source code is at tjcty20051110/ESP32S3-Cam-demo-Arduino.

### Module Overview

- **MCU:** ESP32-S3 dual-core 240 MHz + 8 MB PSRAM

- **Camera:** OV2640 / OV3660 image sensor

- **Interfaces:** I2C, UART, SPI, Wi-Fi, USB

- **Use cases:** Video streaming, face detection, color tracking, gesture recognition, and even on-device SLAM

### Existing Examples (from `ESP32S3CAM3/`)

| Example | What It Does | Key Technique |
|---------|--------------|---------------|
| **ImageTransmit** | HTTP MJPEG video streaming | Wi-Fi soft-AP / STA |
| **FaceDetection** | Real-time face detection | ESP-WHO framework + I2C output |
| **ColorDetection** | HSV threshold detection | 5-color tracking + I2C output |
| **GestureRecognition** | Hand-gesture classification | Skin-tone segmentation + contour analysis |
| **HandwrittenDigitRecognition** | MNIST digit recognition | MLP with int8 quantization (97.5 % accuracy) |
| **HandwrittenDigitRecognition_Template** | Template-matching digits | No training required |
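As an illustration of the HSV threshold technique listed for **ColorDetection**, the sketch below classifies a single RGB pixel by hue window. It is a desktop Python approximation, not the module's code. The five color names, the hue windows, and the saturation and value cut-offs are all assumptions, since this guide does not list the example's actual thresholds.

```python
import colorsys
from typing import Optional

# Hypothetical hue windows in degrees. The example's real thresholds and
# its five tracked colors are not given in this guide.
HUE_RANGES = {
    "red": [(0, 15), (345, 360)],   # red wraps around 0 degrees
    "yellow": [(45, 70)],
    "green": [(90, 150)],
    "blue": [(200, 260)],
    "purple": [(270, 320)],
}


def classify_pixel(r: int, g: int, b: int,
                   min_sat: float = 0.4,
                   min_val: float = 0.3) -> Optional[str]:
    """Return a color name for an 8-bit RGB pixel, or None if no window matches."""
    h, s, v = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
    if s < min_sat or v < min_val:
        return None  # too gray or too dark to trust the hue
    hue = h * 360
    for name, ranges in HUE_RANGES.items():
        for lo, hi in ranges:
            if lo <= hue < hi:
                return name
    return None
```

Thresholding in HSV rather than RGB means a change in brightness mostly moves the value channel, so the hue windows can stay fixed under moderate lighting changes.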

---

### Deep Dive: MonoVisualSLAM

The crown jewel of the vision module is **MonoVisualSLAM** — a lightweight single-camera SLAM running entirely on the ESP32-S3.

#### Algorithm

1. **Shi-Tomasi Corner Detection** extracts salient feature points on a 240×240 grayscale frame.

2. **Lucas-Kanade Optical Flow** tracks those corners across consecutive frames.

3. **Motion Estimation** computes the camera ego-motion (dx, dy) from the tracked flow vectors.

4. **Sparse Map Maintenance** keeps up to 50 map points in a lightweight point-cloud structure.
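Steps 3 and 4 can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not the on-device implementation: it takes the median of the flow vectors as the motion estimate, negates it because a static scene appears to move opposite to the camera, and evicts the oldest map point once the 50-point cap is reached. The median, the sign convention, the eviction policy, and the names `estimate_motion` and `SparseMap` are all choices made for this sketch.

```python
from collections import deque
from statistics import median
from typing import Deque, List, Tuple

Point = Tuple[float, float]
MAX_MAP_POINTS = 50  # cap stated in step 4


def estimate_motion(prev_pts: List[Point], curr_pts: List[Point]) -> Point:
    """Estimate camera ego-motion (dx, dy) in pixels from tracked corner pairs."""
    if not prev_pts or len(prev_pts) != len(curr_pts):
        return (0.0, 0.0)
    # Median flow tolerates a few badly tracked corners.
    fx = median(c[0] - p[0] for p, c in zip(prev_pts, curr_pts))
    fy = median(c[1] - p[1] for p, c in zip(prev_pts, curr_pts))
    # A static scene moves opposite to the camera, hence the negation.
    return (-fx, -fy)


class SparseMap:
    """Accumulated camera position plus a bounded point cloud."""

    def __init__(self) -> None:
        self.pose: Point = (0.0, 0.0)
        self.points: Deque[Point] = deque(maxlen=MAX_MAP_POINTS)

    def update(self, motion: Point, new_corners: List[Point]) -> None:
        # Integrate the per-frame motion into the running position.
        self.pose = (self.pose[0] + motion[0], self.pose[1] + motion[1])
        # Store corners offset by the current position; the deque drops
        # the oldest point automatically when the cap is exceeded.
        for x, y in new_corners:
            self.points.append((x + self.pose[0], y + self.pose[1]))
```

Because the estimate is in pixels and comes from a single camera, it carries no metric scale; converting it to meters would need extra information such as wheel odometry or a known object size.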

## VLN-XZ Simulation: Future Research Directions

To validate the full **Vision-Language Navigation** pipeline before deploying it on the physical robot, we built **VLN-XZ** — a PyBullet-based simulation environment.

### VLN-XZ: Vision-Language Navigation Simulation

VLN-XZ integrates:

- **Monocular Visual SLAM** (ORB features + optical flow)

- **Topological Semantic Mapping** (auto-built from scene objects)

- **LLM Natural-Language Planning** (Kimi K2 API)

- **Differential-Drive Robot Control**

It closes the loop from *"understanding human speech"* to *"walking to the target."*
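The differential-drive control item rests on standard two-wheel kinematics, sketched here for reference. This is the textbook model rather than code taken from VLN-XZ; the function names and the `track_width` parameter (distance between the two wheels, in meters) are assumptions of this sketch.

```python
import math
from typing import Tuple


def wheel_speeds(v: float, omega: float, track_width: float) -> Tuple[float, float]:
    """Convert body velocity (v in m/s, omega in rad/s) to (left, right) wheel speeds."""
    v_left = v - omega * track_width / 2
    v_right = v + omega * track_width / 2
    return v_left, v_right


def step_pose(x: float, y: float, theta: float,
              v_left: float, v_right: float,
              track_width: float, dt: float) -> Tuple[float, float, float]:
    """Advance the pose (x, y, heading) by one Euler step of length dt seconds."""
    v = (v_left + v_right) / 2
    omega = (v_right - v_left) / track_width
    x += v * math.cos(theta) * dt
    y += v * math.sin(theta) * dt
    theta += omega * dt
    return x, y, theta
```

Equal wheel speeds give a straight line, and any difference between them turns the robot toward the slower wheel.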

### System Architecture

```
     User Natural Language Instruction
                    ↓
         LLM Planner (Kimi K2 API)
                    ↓
              VLN Navigator
               ↙         ↘
     Semantic Map       Robot Diff-Drive
       A* Path              Camera
                    ↓
     Visual SLAM (ORB + Optical Flow)
```

### Key Components

| Component | Description | Performance |
|-----------|-------------|-------------|
| **Visual SLAM** | ORB feature points + optical flow tracking | Position error < 0.1 m |
| **Semantic Map** | Auto-built from scene objects; supports A* path planning | Real-time update |
| **LLM Planner** | Parses natural language (e.g., *"go to the green box then the orange box"*) | Cloud API (Kimi K2) |
| **Diff-Drive Control** | Unified differential-drive control strategy | Smooth trajectory |
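The A* path planning mentioned for the semantic map can be sketched on a small topological graph as follows. This is a generic A* with a straight-line heuristic, not the VLN-XZ implementation; the graph layout (`nodes` as name to coordinates, `edges` as name to neighbor names) and the node names in the usage example are hypothetical.

```python
import heapq
import math
from typing import Dict, List, Optional, Tuple

Coord = Tuple[float, float]


def a_star(nodes: Dict[str, Coord],
           edges: Dict[str, List[str]],
           start: str, goal: str) -> Optional[List[str]]:
    """Return the shortest node path from start to goal, or None if unreachable."""

    def dist(a: str, b: str) -> float:
        return math.dist(nodes[a], nodes[b])

    # Heap entries: (f = g + heuristic, g = cost so far, node, path so far).
    open_heap = [(dist(start, goal), 0.0, start, [start])]
    best_cost = {start: 0.0}
    while open_heap:
        _, g, node, path = heapq.heappop(open_heap)
        if node == goal:
            return path
        if g > best_cost.get(node, math.inf):
            continue  # stale entry; a cheaper route to this node was found
        for nb in edges.get(node, []):
            g_new = g + dist(node, nb)
            if g_new < best_cost.get(nb, math.inf):
                best_cost[nb] = g_new
                heapq.heappush(open_heap,
                               (g_new + dist(nb, goal), g_new, nb, path + [nb]))
    return None


if __name__ == "__main__":
    nodes = {"start": (0, 0), "green_box": (2, 0), "hall": (1, 1), "orange_box": (2, 2)}
    edges = {"start": ["green_box", "hall"], "green_box": ["start", "orange_box"],
             "hall": ["start", "orange_box"], "orange_box": ["green_box", "hall"]}
    print(a_star(nodes, edges, "start", "orange_box"))
```

Straight-line distance never overestimates the remaining path length along straight edges, so the first time the goal is popped the returned path is the shortest one.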

### Demo Scripts

| Script | Purpose |
|--------|---------|
| `demo_slam_vln.py` | SLAM mapping → semantic map → VLN navigation |
| `demo_llm_vln.py` | Adds LLM natural-language instruction parsing |
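The instruction-parsing step added by `demo_llm_vln.py` could look roughly like the sketch below, which builds a prompt and validates the model's reply against the objects in the semantic map. It makes no network call and does not reproduce the script's actual code or the Kimi K2 API. The prompt wording, the JSON-array reply format, and the helper names `build_prompt` and `parse_plan` are all assumptions.

```python
import json
from typing import List


def build_prompt(instruction: str, known_objects: List[str]) -> str:
    """Compose a prompt asking the LLM for an ordered list of targets."""
    return (
        "You plan routes for a small robot.\n"
        "Known objects: " + ", ".join(known_objects) + "\n"
        "Reply with only a JSON array of object names to visit, in order.\n"
        "Instruction: " + instruction
    )


def parse_plan(reply: str, known_objects: List[str]) -> List[str]:
    """Extract the JSON array from the reply and check every target exists."""
    start, end = reply.find("["), reply.rfind("]")
    if start == -1 or end <= start:
        raise ValueError("no JSON array found in reply")
    targets = json.loads(reply[start:end + 1])
    unknown = [t for t in targets if t not in known_objects]
    if unknown:
        raise ValueError(f"targets not in the semantic map: {unknown}")
    return targets


if __name__ == "__main__":
    objects = ["green box", "orange box", "red ball"]
    print(build_prompt("go to the green box then the orange box", objects))
    # A reply such as this would come back from the LLM:
    print(parse_plan('["green box", "orange box"]', objects))
```

Validating targets against the map before navigation starts keeps a hallucinated object name from reaching the path planner.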

### Research Significance

This simulation system validates the complete algorithmic pipeline:

> **SLAM → Semantic Mapping → LLM Planning → Differential Control**

Once proven in simulation, the same pipeline will be ported to the **physical ESP32S3 desktop robot** equipped with the **ESP32S3-Cam vision module**, turning the desktop toy into a true voice-commanded, vision-guided autonomous agent.

## Downloads

- `demo_forward.mp4`

- `demo_s_curve.mp4`

- `demo_slam_vln.mp4`

- `demo_square.mp4`

- `demo_turn.mp4`