VLN_xz

by ClumsyCalendar in Workshop > Electric Vehicles


成品.jpg (finished robot)

# ESP32S3 Desktop Robot: From Hardware Assembly to Vision SLAM & VLN Simulation


> A step-by-step, Instructables-style guide to building a palm-sized desktop robot with voice AI, vision SLAM, and LLM-based navigation, from copper-wire frame to autonomous simulation. All source code is available on GitHub: https://github.com/tjcty20051110.


---

Authors: Tianyu Cui, Jiayi Li, Nixiao Wang, Yiran Zhang


Contributions:


Tianyu Cui: VLN simulation, circuit assembly, hardware selection, project management (40%)


Jiayi Li: Vision SLAM simulation, software development, hardware assembly (20%)


Nixiao Wang: Hardware assembly, project management (20%)


Yiran Zhang: Documentation, project management, community engagement (20%)

Supplies

### Bill of Materials (BOM)


| Component | Specs | Quantity | Notes |
|-----------|-------|----------|-------|
| 0.96" OLED Display | 4-pin I2C, 128x64 | 1 | Robot eyes / status display |
| ESP32-S3 N16R8 Dev Board | 16 MB Flash, 8 MB PSRAM | 1 | Main controller |
| L298N Motor Driver | Dual H-Bridge | 1 | Drives 2x N20 motors |
| INMP441 Microphone | I2S digital mic | 1 | Voice input for XiaoZhi AI |
| MAX98357 Audio Amplifier | I2S Class-D amp | 1 | Drives the 8 Ω 2 W speaker |
| 5 V Charge / Discharge Module | TP4056 + boost | 1 | Battery management |
| Toggle Switch | SK12D07VG4, 4 mm | 1 | Power switch |
| 8 Ω 2 W Cavity Speaker | 28 mm | 1 | Audio output |
| N20 Gear Motor | 6 V, 100–300 RPM | 2 | Differential drive |
| Wheels | 64 T steel gear / 25 mm diameter | 2 | See the Wheel Options step |
| 16340 Li-ion Battery | 3.7 V, 700 mAh | 2 | Power source |
| 1.5 mm Copper Wire | ~2 m length | As needed | Frame structure and wire routing |
| Copper Wire | Various gauges | As needed | Circuit connections |

Copper Frame Assembly

铜架尺寸.png (copper frame dimensions)
完整框架示例.jpg (complete frame example)
FJ7LZXLMQA4XRG2.jpg

The chassis is entirely hand-soldered from **1.5 mm copper wire**.


- Follow the dimensions in the frame drawing above (铜架尺寸.png) to bend and solder the frame.

- The frame must support the ESP32-S3 board, OLED screen, motor driver, and batteries.

- The copper wire serves as both **structural support** and **electrical conductor**, eliminating the need for extra DuPont wires.

Wiring

电路连线图.jpg (wiring diagram)

Connect all modules according to the wiring diagram. The ESP32-S3 GPIO allocation is as follows:


| Function | GPIO | Connected Module |
|----------|------|------------------|
| I2C SDA | GPIO 8 | OLED |
| I2C SCL | GPIO 9 | OLED |
| Motor PWM (Left) | GPIO 6 | L298N ENA |
| Motor PWM (Right) | GPIO 7 | L298N ENB |
| Motor Direction A | GPIO 4 | L298N IN1 / IN3 |
| Motor Direction B | GPIO 5 | L298N IN2 / IN4 |

The INMP441 and MAX98357 are I2S devices, so they do not share the I2C bus with the OLED. Take their I2S connections (bit clock, word select, data) from the wiring diagram.


> **Tip:** Because the copper frame itself carries current, keep signal wires (I2S, I2C) short and away from motor PWM lines to reduce EMI.
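
The GPIO table ties IN1/IN3 to one pin and IN2/IN4 to another, which suggests that both motors always turn in the same direction and that steering comes from the difference between the two PWM duties. The Python sketch below illustrates that mixing logic only; it is not the project's firmware. The function name `mix_drive`, the 8-bit duty range, and the choice of which pin state means "forward" are assumptions for illustration.

```python
def mix_drive(throttle: float, turn: float, max_duty: int = 255):
    """Map throttle/turn in [-1, 1] to L298N pin states (hypothetical helper).

    Returns (in_a, in_b, duty_left, duty_right), where in_a/in_b are the two
    shared direction pins and the duties go to ENA/ENB.
    """
    throttle = max(-1.0, min(1.0, throttle))
    turn = max(-1.0, min(1.0, turn))

    # Shared direction pins: both motors follow the sign of the throttle.
    # Which state means "forward" depends on how the motors are wired.
    in_a, in_b = (1, 0) if throttle >= 0 else (0, 1)

    speed = abs(throttle)
    # Positive turn steers right by slowing the right wheel, and vice versa.
    left = speed * (1.0 - max(0.0, -turn))
    right = speed * (1.0 - max(0.0, turn))

    return in_a, in_b, round(left * max_duty), round(right * max_duty)


print(mix_drive(1.0, 0.0))    # straight ahead at full duty
print(mix_drive(-1.0, -1.0))  # reverse while steering fully left
```

With this wiring the robot cannot spin in place, because that would require the two motors to turn in opposite directions.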

Wheel Options

3D_wheel.png
轮胎购买链接.jpg (tire purchase link)

You have two choices for the wheels:


- **Option A (3D Print):** Use the model files inside `wheel.zip` and print the tires in TPU or PLA. Model reference: https://makerworld.com.cn/zh/models/2618748-n20-ju-an-zhi-yao-kong-lun-tai-mo-ju?from=search#profileId-3020184.

- **Option B (Buy Off-the-Shelf):** Pre-made 64 T steel-gear wheels are surprisingly cheap and more durable. We recommend this route for cost and reliability.

XiaoZhi AI Firmware — Giving the Robot a Voice

**XiaoZhi** is an open-source AI voice-assistant project targeting the ESP32 family. It brings natural-language conversation to microcontrollers with minimal hardware. The source code is in the GitHub repository tjcty20051110/ESP32S3-Xiaozhi-Robot.


### What XiaoZhi Can Do


- **Voice Wake-Up:** Say the wake word to activate the assistant.

- **Speech Recognition:** Converts your voice to text via cloud or local ASR.

- **LLM Dialogue:** Sends the text to a large language model for intelligent replies.

- **Text-to-Speech (TTS):** Streams the reply back as natural-sounding audio.


### Deployment Steps


1. **Extract** `xiaozhi-esp32-main.zip` to your project folder.

2. **Compile** with ESP-IDF or Arduino IDE (ESP32-S3 target).

3. **Configure** Wi-Fi credentials and API keys in `config.h`.

4. **Flash** the firmware to the ESP32-S3 via USB.


### On-Robot Integration


- **INMP441** (I2S digital microphone) captures your voice.

- **MAX98357** (I2S Class-D amplifier) drives the 8 Ω speaker to play XiaoZhi's replies.

- The OLED can display a simple "listening / thinking / speaking" animation as the robot's "eyes."


> **Pipeline:** Voice Wake-Up → ASR → LLM → TTS → Speaker Output
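
One way to picture the pipeline, and the "listening / thinking / speaking" animation on the OLED, is as a small state machine. The sketch below is an illustration in Python rather than XiaoZhi's actual implementation; the stage names and event strings (`wake_word`, `asr_done`, `llm_reply`, `tts_done`) are hypothetical.

```python
from enum import Enum, auto


class Stage(Enum):
    IDLE = auto()
    LISTENING = auto()   # after wake-up, while ASR captures speech
    THINKING = auto()    # waiting for the LLM reply
    SPEAKING = auto()    # TTS audio is playing


# Allowed transitions, keyed by (current stage, event). Event names are assumed.
NEXT_STAGE = {
    (Stage.IDLE, "wake_word"): Stage.LISTENING,
    (Stage.LISTENING, "asr_done"): Stage.THINKING,
    (Stage.THINKING, "llm_reply"): Stage.SPEAKING,
    (Stage.SPEAKING, "tts_done"): Stage.IDLE,
}


def step(stage: Stage, event: str) -> Stage:
    """Return the next stage; events that do not apply leave the stage unchanged."""
    return NEXT_STAGE.get((stage, event), stage)


stage = Stage.IDLE
for event in ["wake_word", "asr_done", "llm_reply", "tts_done"]:
    stage = step(stage, event)
    print(event, "->", stage.name)
```

The display code can then pick an animation from the current stage instead of tracking audio events directly.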

ESP32S3-Cam Vision Module — Eyes for the Robot

So far the robot can *hear* and *speak*. To let it *see*, we add the **ESP32S3-Cam Vision Module**.

The source code is at tjcty20051110/ESP32S3-Cam-demo-Arduino.

### Module Overview


- **MCU:** ESP32-S3 dual-core 240 MHz + 8 MB PSRAM

- **Camera:** OV2640 / OV3660 image sensor

- **Interfaces:** I2C, UART, SPI, Wi-Fi, USB

- **Use cases:** Video streaming, face detection, color tracking, gesture recognition, and even on-device SLAM


### Existing Examples (from `ESP32S3CAM3/`)


| Example | What It Does | Key Technique |
|---------|--------------|---------------|
| **ImageTransmit** | HTTP MJPEG video streaming | Wi-Fi soft-AP / STA |
| **FaceDetection** | Real-time face detection | ESP-WHO framework + I2C output |
| **ColorDetection** | HSV threshold detection | 5-color tracking + I2C output |
| **GestureRecognition** | Hand-gesture classification | Skin-tone segmentation + contour analysis |
| **HandwrittenDigitRecognition** | MNIST digit recognition | MLP with int8 quantization (97.5% accuracy) |
| **HandwrittenDigitRecognition_Template** | Template-matching digits | No training required |
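
To make the HSV thresholding idea behind **ColorDetection** concrete, here is a minimal Python sketch that classifies a single RGB pixel by hue band. It is not the module's code: the five color names, the hue ranges, and the saturation/value cut-offs are assumed values that would need tuning for a real camera and lighting.

```python
import colorsys

# Hypothetical hue bands in degrees (low, high); red wraps around 0.
HUE_BANDS = {
    "red": (345, 15),
    "yellow": (45, 70),
    "green": (90, 150),
    "blue": (200, 250),
    "purple": (265, 300),
}


def classify_pixel(r: int, g: int, b: int, min_sat: float = 0.4, min_val: float = 0.3):
    """Return the name of the matching hue band, or None for dull/dark pixels."""
    h, s, v = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
    if s < min_sat or v < min_val:
        return None  # too grey or too dark to trust the hue
    hue = h * 360
    for name, (lo, hi) in HUE_BANDS.items():
        # A band with lo > hi wraps through 0 degrees.
        in_band = lo <= hue <= hi if lo <= hi else (hue >= lo or hue <= hi)
        if in_band:
            return name
    return None


print(classify_pixel(255, 0, 0))
print(classify_pixel(128, 128, 128))
```

Thresholding in HSV rather than RGB keeps the hue test largely independent of brightness, which is why the saturation and value checks are applied separately.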


---


### Deep Dive: MonoVisualSLAM


The crown jewel of the vision module is **MonoVisualSLAM**, a lightweight monocular (single-camera) SLAM pipeline that runs entirely on the ESP32-S3.


#### Algorithm


1. **Shi-Tomasi Corner Detection** extracts salient feature points on a 240×240 grayscale frame.

2. **Lucas-Kanade Optical Flow** tracks those corners across consecutive frames.

3. **Motion Estimation** computes the camera ego-motion (dx, dy) from the tracked flow vectors.

4. **Sparse Map Maintenance** keeps up to 50 map points in a lightweight point-cloud structure.
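
Steps 3 and 4 can be sketched in a few lines of Python. This is an illustration, not the on-device code: it assumes corner detection and optical-flow tracking (steps 1 and 2) have already produced matched point lists, uses the median flow as a simple outlier-resistant estimate, takes camera motion to be the negative of the image flow, and evicts the oldest map point once the 50-point cap is reached. All function names are hypothetical.

```python
from collections import deque
from statistics import median

MAX_MAP_POINTS = 50  # cap stated in the algorithm description


def estimate_motion(prev_pts, curr_pts):
    """Estimate camera ego-motion (dx, dy) in pixels from tracked corners.

    Uses the median flow so a few bad tracks do not dominate. The sign flip
    (camera motion = -image flow) is an assumed convention.
    """
    if not prev_pts or not curr_pts:
        return 0.0, 0.0
    flow_x = median(c[0] - p[0] for p, c in zip(prev_pts, curr_pts))
    flow_y = median(c[1] - p[1] for p, c in zip(prev_pts, curr_pts))
    return -flow_x, -flow_y


def add_map_points(sparse_map, pose, points):
    """Store points offset by the current pose; the deque drops the oldest."""
    x, y = pose
    for px, py in points:
        sparse_map.append((px + x, py + y))


sparse_map = deque(maxlen=MAX_MAP_POINTS)
pose = (0.0, 0.0)
prev = [(10, 10), (20, 20), (30, 30)]
curr = [(12, 11), (22, 21), (80, 90)]  # the last track is an outlier

dx, dy = estimate_motion(prev, curr)
pose = (pose[0] + dx, pose[1] + dy)
add_map_points(sparse_map, pose, curr)
print(pose, len(sparse_map))
```

Because only a 2D image-plane shift is estimated, the result behaves like visual odometry in pixel units; converting it to metric motion would need additional scale information.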

VLN-XZ Simulation — Future Research Directions

demo_forward_traj.png
demo_s_curve_traj.png
demo_slam_vln_traj.png
demo_square_traj.png
demo_turn_traj.png

To validate the full **Vision-Language Navigation** pipeline before deploying it on the physical robot, we built **VLN-XZ** — a PyBullet-based simulation environment.


### VLN-XZ: Vision-Language Navigation Simulation


VLN-XZ integrates:

- **Monocular Visual SLAM** (ORB features + optical flow)

- **Topological Semantic Mapping** (auto-built from scene objects)

- **LLM Natural-Language Planning** (Kimi K2 API)

- **Differential-Drive Robot Control**


It closes the loop from *"understanding human speech"* to *"walking to the target."*


### System Architecture


```
 User Natural Language Instruction
                 ↓
     LLM Planner (Kimi K2 API)
                 ↓
           VLN Navigator
            ↙         ↘
  Semantic Map     Robot Diff-Drive
       ↓                  ↓
    A* Path             Camera
                          ↓
        Visual SLAM (ORB + Optical Flow)
```
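
The "Semantic Map → A* Path" branch can be illustrated with a short A* search over a topological graph. This Python sketch is not taken from VLN-XZ: the node names, coordinates, and edge list are made-up examples, and straight-line distance is assumed as both edge cost and heuristic.

```python
import heapq
import math


def a_star(nodes, edges, start, goal):
    """A* over a topological map.

    nodes: name -> (x, y) position in metres
    edges: name -> list of neighbouring node names
    Returns (path, cost), or (None, inf) if the goal is unreachable.
    """
    def dist(a, b):
        return math.dist(nodes[a], nodes[b])

    # Heap entries: (estimated total cost, cost so far, node, path so far)
    open_heap = [(dist(start, goal), 0.0, start, [start])]
    best_cost = {start: 0.0}
    while open_heap:
        _, cost, node, path = heapq.heappop(open_heap)
        if node == goal:
            return path, cost
        if cost > best_cost.get(node, math.inf):
            continue  # stale entry; a cheaper route to this node was found
        for nxt in edges.get(node, []):
            new_cost = cost + dist(node, nxt)
            if new_cost < best_cost.get(nxt, math.inf):
                best_cost[nxt] = new_cost
                heapq.heappush(
                    open_heap,
                    (new_cost + dist(nxt, goal), new_cost, nxt, path + [nxt]),
                )
    return None, math.inf


# Hypothetical scene: two boxes and one junction.
nodes = {"start": (0, 0), "green_box": (2, 0), "junction": (1, 1), "orange_box": (2, 2)}
edges = {
    "start": ["green_box", "junction"],
    "green_box": ["start", "orange_box"],
    "junction": ["start", "orange_box"],
    "orange_box": ["green_box", "junction"],
}
path, cost = a_star(nodes, edges, "start", "orange_box")
print(path, round(cost, 3))
```

A multi-target instruction can be handled by running the search once per leg and concatenating the resulting paths.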


### Key Components


| Component | Description | Performance |
|-----------|-------------|-------------|
| **Visual SLAM** | ORB feature points + optical flow tracking | Position error < 0.1 m |
| **Semantic Map** | Auto-built from scene objects; supports A* path planning | Real-time update |
| **LLM Planner** | Parses natural language (e.g., *"go to the green box then the orange box"*) | Cloud API (Kimi K2) |
| **Diff-Drive Control** | Unified differential-drive control strategy | Smooth trajectory |
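
For readers new to differential drive, the standard kinematics behind the **Diff-Drive Control** row can be sketched as follows. This is a generic textbook model in Python, not the VLN-XZ controller. The wheel radius assumes the 25 mm figure in the BOM is the wheel diameter, and the track width of 0.06 m is a hypothetical value.

```python
import math

WHEEL_RADIUS = 0.0125  # m, assuming 25 mm diameter wheels
TRACK_WIDTH = 0.06     # m, hypothetical distance between the two wheels


def wheel_speeds(v: float, w: float):
    """Body velocity (v in m/s, w in rad/s) -> (left, right) wheel speeds in rad/s."""
    v_left = v - w * TRACK_WIDTH / 2
    v_right = v + w * TRACK_WIDTH / 2
    return v_left / WHEEL_RADIUS, v_right / WHEEL_RADIUS


def integrate_pose(x, y, theta, v, w, dt):
    """Forward-Euler pose update with heading wrapped to [-pi, pi)."""
    x += v * math.cos(theta) * dt
    y += v * math.sin(theta) * dt
    theta = (theta + w * dt + math.pi) % (2 * math.pi) - math.pi
    return x, y, theta


print(wheel_speeds(0.1, 0.0))                         # equal speeds: straight line
print(integrate_pose(0.0, 0.0, 0.0, 0.1, 0.0, 1.0))   # pose after 1 s straight ahead
```

A positive `w` makes the right wheel faster than the left, which turns the robot counter-clockwise under the usual convention.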


### Demo Scripts


| Script | Purpose |
|--------|---------|
| `demo_slam_vln.py` | SLAM mapping → semantic map → VLN navigation |
| `demo_llm_vln.py` | Adds LLM natural-language instruction parsing |
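
The instruction-parsing step can be pictured as turning the LLM's reply into an ordered list of targets that exist in the semantic map. The sketch below is only an illustration of that validation step, not the code in `demo_llm_vln.py`: it makes no API call, and the JSON reply format (`{"targets": [...]}`) and the function name `parse_plan` are assumptions.

```python
import json


def parse_plan(llm_reply: str, known_objects):
    """Return the ordered targets from an assumed JSON reply, dropping unknown names."""
    try:
        data = json.loads(llm_reply)
    except json.JSONDecodeError:
        return []  # the model did not return valid JSON
    if not isinstance(data, dict):
        return []
    targets = data.get("targets", [])
    if not isinstance(targets, list):
        return []
    known = {name.lower() for name in known_objects}
    return [t.lower() for t in targets if isinstance(t, str) and t.lower() in known]


# Hypothetical reply for "go to the green box then the orange box".
reply = '{"targets": ["Green Box", "orange box", "sofa"]}'
print(parse_plan(reply, ["green box", "orange box"]))
```

Filtering against the map's known objects keeps a hallucinated target, such as the "sofa" above, from reaching the path planner.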


### Research Significance


This simulation system validates the complete algorithmic pipeline:


> **SLAM → Semantic Mapping → LLM Planning → Differential Control**


Once proven in simulation, the same pipeline will be ported to the **physical ESP32S3 desktop robot** equipped with the **ESP32S3-Cam vision module**, turning the desktop toy into a true voice-commanded, vision-guided autonomous agent.