Minimal Path to Vision-Only Autonomous Driving

Generated by AI, 2026.06

Question: What’s the minimal viable path to explore end-to-end autonomous driving like Tesla/openpilot — vision-only, no LiDAR, runs on accessible hardware?

Answer:

You’re thinking exactly right. Let me break this down.

openpilot’s Architecture (is it really end-to-end?)

openpilot is partially end-to-end. Its pipeline has two stages, and only the first is a neural net:

Camera frames → [Supercombo model] → driving trajectory + lane lines + lead car
trajectory → [lateral/longitudinal controllers] → steering/throttle/brake

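The Supercombo model pairs a vision backbone with a temporal module. Earlier releases used a convolutional backbone, and newer ones have reportedly moved toward transformer components, so check the openpilot release notes for the architecture of a given version. The control layer is still classical: PID/MPC controllers convert the trajectory into actuator commands.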
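To make the "classical control" half concrete, here is a minimal PID sketch that turns a lateral error (offset from the planned trajectory) into a normalized steering command. It is not openpilot's controller: the `PID` class, the gains, and the clamp limit are illustrative assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class PID:
    """Textbook PID controller; gains and limit are illustrative, not tuned."""
    kp: float
    ki: float
    kd: float
    limit: float = 1.0                                # output clamped to [-limit, +limit]
    _integral: float = field(default=0.0, init=False)
    _prev_error: float = field(default=0.0, init=False)

    def update(self, error: float, dt: float) -> float:
        """Return the clamped control output for the current error."""
        self._integral += error * dt
        derivative = (error - self._prev_error) / dt
        self._prev_error = error
        out = self.kp * error + self.ki * self._integral + self.kd * derivative
        return max(-self.limit, min(self.limit, out))


# Lateral error in metres (assumed convention: positive = path is to the left).
steer = PID(kp=0.5, ki=0.05, kd=0.01)
command = steer.update(error=0.4, dt=0.05)
```

A real lateral controller would also account for vehicle speed and steering dynamics, which is where MPC comes in.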

So perception and trajectory planning are learned, while control is classical. That is not fully E2E in the sense claimed for Tesla's recent FSD v12, which is described as producing control outputs directly from video.

Tesla FSD v12 (2024) is the real E2E shift: video in, steering/throttle out, from one large neural network, according to Tesla's public descriptions. Wayve pursues a similar approach, as does NVIDIA's end-to-end driving research.

Xpeng/Li Auto: yes, both shifted toward transformer-based E2E stacks around 2023-2024, similar in spirit to the UniAD/VAD architectures from academia.

Your Minimal Viable Exploration Path

You don’t need a car. Here’s the ladder:

Level 1: Pure software — replicate openpilot’s perception

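# Dataset: comma.ai's comma2k19 (~33 hours of driving)
# https://github.com/commaai/comma2k19

# Load the supercombo ONNX model and inspect what it expects
import onnxruntime as ort
import numpy as np

session = ort.InferenceSession("supercombo.onnx")
# Print the actual input names and shapes; they differ between openpilot releases
for inp in session.get_inputs():
    print(inp.name, inp.shape, inp.type)
# Older releases take an image input of [batch, 12, 128, 256]:
# 2 consecutive frames, each packed as 6 YUV420 channels
# Output: trajectory, lane lines, lead car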

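comma.ai publishes the model weights with openpilot. You can run inference on other dashcam footage, but the frames first need to be cropped and warped to the camera geometry the model was trained on, otherwise the outputs will be unreliable.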
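The 12-channel input mentioned in the code comment comes from packing two YUV420 frames into 6 half-resolution channels each. The sketch below shows one way to build such a tensor with NumPy. The channel order (four subsampled Y planes, then U, then V) follows my recollection of openpilot's older public model code and should be treated as an assumption, as should the helper names `rgb_to_yuv420_planes` and `pack_frame`. It also skips the camera warp entirely.

```python
import numpy as np


def rgb_to_yuv420_planes(rgb: np.ndarray):
    """Convert an HxWx3 uint8 RGB image to full-res Y and half-res U, V (BT.601)."""
    r, g, b = [rgb[..., i].astype(np.float32) for i in range(3)]
    y = 0.299 * r + 0.587 * g + 0.114 * b
    u = -0.169 * r - 0.331 * g + 0.5 * b + 128.0
    v = 0.5 * r - 0.419 * g - 0.081 * b + 128.0
    h, w = y.shape
    # 4:2:0 chroma subsampling: average each 2x2 block.
    u = u.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
    v = v.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
    return y, u, v


def pack_frame(rgb: np.ndarray) -> np.ndarray:
    """Pack one 256x512 RGB frame into 6 channels of 128x256 (assumed layout)."""
    y, u, v = rgb_to_yuv420_planes(rgb)
    planes = [y[0::2, 0::2], y[1::2, 0::2], y[0::2, 1::2], y[1::2, 1::2], u, v]
    return np.clip(np.stack(planes), 0, 255).astype(np.float32)


# Two consecutive frames (random stand-ins here) -> [1, 12, 128, 256].
prev_frame = np.random.randint(0, 256, (256, 512, 3), dtype=np.uint8)
curr_frame = np.random.randint(0, 256, (256, 512, 3), dtype=np.uint8)
model_input = np.concatenate([pack_frame(prev_frame), pack_frame(curr_frame)])[None]
```

Before feeding this to a session, compare `model_input.shape` and dtype against what `session.get_inputs()` reports for the model file you actually have.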

Level 2: Phone as sensor (your idea — excellent)

Phone camera → frame capture → MobileNet/YOLO → lane detection + depth
           → simple path planning → display overlay

This is what the openpilot-on-phone experiments do, and there are also repos for running openpilot on a laptop. A phone gives you:

30fps camera
IMU (gyro/accel) — critical for motion estimation
GPS
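The "simple path planning" step in the Level 2 pipeline can be very small: take the midpoint between the detected left and right lane lines, pick a point some distance ahead, and compute the curvature needed to reach it (pure pursuit). The sketch assumes lane points are already in vehicle coordinates (x forward, y left, in metres). `lane_center` and `pure_pursuit_curvature` are hypothetical names, and pure pursuit is a stand-in technique, not what openpilot uses.

```python
import math


def lane_center(left, right):
    """Midpoints between paired (x, y) left and right lane points."""
    return [((lx + rx) / 2, (ly + ry) / 2) for (lx, ly), (rx, ry) in zip(left, right)]


def pure_pursuit_curvature(path, lookahead: float) -> float:
    """Curvature (1/m) of the arc from the origin to the first path point
    at least `lookahead` metres away; positive curves left."""
    for x, y in path:
        dist = math.hypot(x, y)
        if dist >= lookahead:
            return 2.0 * y / (dist * dist)
    raise ValueError("path is shorter than the lookahead distance")


left = [(5.0, 1.8), (10.0, 2.0), (15.0, 2.4)]
right = [(5.0, -1.8), (10.0, -1.6), (15.0, -1.2)]
path = lane_center(left, right)
curvature = pure_pursuit_curvature(path, lookahead=8.0)
```

For an overlay-only phone demo, the curvature can simply be drawn as a predicted arc on top of the camera image.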

Level 3: The Zhihui Jun route — self-driving bicycle/cart

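Zhihui Jun's self-driving bicycle combined:

An embedded AI compute board (his build reportedly used a Huawei Ascend module; a Raspberry Pi or Jetson Nano fills the same role in a hobby build)
Camera (his build reportedly added depth sensing and LiDAR as well)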
Servo for steering

Minimum hardware:

Jetson Orin Nano ($250)
+ USB camera or Pi camera
+ RC car chassis (~$50)
+ PWM servo controller

This is actually the best learning path — you get real actuation feedback.
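On the actuation side, the model's steering output has to become a servo pulse. Hobby servos commonly expect a 50 Hz signal with pulses of roughly 1.0 to 2.0 ms and 1.5 ms as neutral, but the exact range varies by servo, so the defaults below are assumptions to calibrate against your hardware. The function names are hypothetical and no hardware library is called.

```python
def steering_to_pulse_us(steering: float,
                         center_us: float = 1500.0,
                         span_us: float = 500.0) -> float:
    """Map normalized steering in [-1, 1] to a servo pulse width in microseconds."""
    steering = max(-1.0, min(1.0, steering))   # clamp out-of-range commands
    return center_us + steering * span_us


def pulse_to_duty_cycle(pulse_us: float, frequency_hz: float = 50.0) -> float:
    """Fraction of each PWM period during which the signal is high."""
    period_us = 1_000_000.0 / frequency_hz
    return pulse_us / period_us


pulse = steering_to_pulse_us(0.5)      # 1750.0 us
duty = pulse_to_duty_cycle(pulse)      # 0.0875 at 50 Hz
```

Clamping in software is worth keeping even after calibration, since a model can emit out-of-range values and a servo driven past its mechanical stop can stall.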

Level 4: True E2E (what you should ultimately build toward)

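# Input: sequence of camera frames (like Tesla)
# Output: [steering_angle, throttle, brake]

# Architecture: per-frame encoder plus a sequence transformer
# (in the spirit of Gato/RT-2)
import torch.nn as nn

class E2EDriver(nn.Module):
    def __init__(self, backbone, temporal, dim):
        super().__init__()             # required before registering submodules
        self.backbone = backbone       # frame encoder, e.g. a ViT: [N, C, H, W] -> [N, dim]
        self.temporal = temporal       # sequence model: [B, T, dim] -> [B, T, dim]
        self.head = nn.Sequential(     # MLP head: steer, throttle, brake
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 3))

    def forward(self, frames):  # [B, T, C, H, W]
        b, t = frames.shape[:2]
        # Fold time into the batch so the backbone sees ordinary images
        tokens = self.backbone(frames.flatten(0, 1)).reshape(b, t, -1)
        context = self.temporal(tokens)
        return self.head(context[:, -1])  # predict from the last time step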

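Train with imitation learning (behavior cloning, BC) first, then add DAgger or RL-based fine-tuning (such as RLHF). Pure BC only sees states the expert visited, so small errors compound once the model drifts off that distribution; the second stage is what corrects for this.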
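To show how behavior cloning and DAgger differ in procedure, here is a toy sketch on a 1-D lane-keeping problem. The state is the lateral offset, the action is a steering correction, the "expert" is a proportional controller, and the learner is a single gain fit by least squares. Everything here (the dynamics, `expert`, `fit_gain`, `rollout`) is an illustrative assumption. A real setup would substitute a network for the gain and a human driver or planner for the expert.

```python
import random


def expert(offset: float) -> float:
    """Stand-in expert: steer proportionally back toward the lane centre."""
    return -0.5 * offset


def fit_gain(data) -> float:
    """Least-squares fit of action = gain * offset (the behavior cloning step)."""
    num = sum(x * a for x, a in data)
    den = sum(x * x for x, _ in data)
    return num / den


def rollout(policy, steps: int, rng: random.Random):
    """Drive with `policy`; return the lateral offsets that were visited."""
    offset, visited = rng.uniform(-1.0, 1.0), []
    for _ in range(steps):
        visited.append(offset)
        offset += policy(offset) + rng.gauss(0.0, 0.05)   # toy dynamics
    return visited


rng = random.Random(0)

# 1) Behavior cloning: states come from the expert's own driving.
dataset = [(x, expert(x)) for x in rollout(expert, 50, rng)]
gain = fit_gain(dataset)

# 2) DAgger: drive with the learner, ask the expert to label the states
#    the learner actually reached, aggregate, and refit.
for _ in range(3):
    states = rollout(lambda x: gain * x, 50, rng)
    dataset += [(x, expert(x)) for x in states]
    gain = fit_gain(dataset)
```

Because both expert and learner are linear here, BC alone already recovers the gain and the DAgger rounds change nothing. The loop structure is the point: with an imperfect learner, the relabeled states are exactly the ones the expert data never covered.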

Data

The key bottleneck is data, not model architecture:

comma2k19 — free, 33h, highway US
nuScenes — multi-sensor, good for research
Waymo Open Dataset — high quality
Your own dashcam — most relevant for your use case
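For your own dashcam data, the first practical step is pairing each video frame with the control signal recorded closest to it in time. A sketch using nearest-timestamp matching, assuming both streams carry timestamps in seconds on the same clock and that the log is sorted. `align_nearest` and the sample values are hypothetical.

```python
from bisect import bisect_left


def align_nearest(frame_times, log_times, log_values, max_gap=0.05):
    """For each frame time, pick the log value with the nearest timestamp.
    Frames with no log sample within `max_gap` seconds get None."""
    aligned = []
    for t in frame_times:
        i = bisect_left(log_times, t)
        # Candidates: the sample just before and just after t.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(log_times)]
        best = min(candidates, key=lambda j: abs(log_times[j] - t))
        gap = abs(log_times[best] - t)
        aligned.append(log_values[best] if gap <= max_gap else None)
    return aligned


frame_times = [0.000, 0.033, 0.067, 0.500]      # ~30 fps, then a dropout
steer_times = [0.00, 0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07]
steer_values = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]
labels = align_nearest(frame_times, steer_times, steer_values)
```

Frames labeled `None` are better dropped than interpolated, since a wrong steering label teaches the model the wrong action for that image.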

Recommended Stack for Your Exploration

Phase 1 (this week):
  - Download comma2k19
  - Run supercombo.onnx on sample frames
  - Visualize trajectory outputs

Phase 2 (next month):
  - Build phone → frame streamer → laptop inference pipeline
  - Implement lane detection with ONNX or torch

Phase 3 (hardware):
  - RC car + Jetson Orin Nano
  - Train small imitation learning model on your own driving data
  - Close the loop: camera → model → servo
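The phone-to-laptop frame streamer in Phase 2 needs a way to mark frame boundaries on a byte stream. One common choice is a length prefix before each encoded frame. The sketch below shows that framing independent of the transport. `write_frame`, `read_exact`, and `read_frame` are hypothetical helpers, and the payloads stand in for JPEG bytes.

```python
import io
import struct

HEADER = struct.Struct("!I")   # 4-byte big-endian payload length


def write_frame(stream, payload: bytes) -> None:
    """Write one length-prefixed frame to a binary stream."""
    stream.write(HEADER.pack(len(payload)))
    stream.write(payload)


def read_exact(stream, n: int) -> bytes:
    """Read exactly n bytes, or raise EOFError if the stream ends early."""
    buf = b""
    while len(buf) < n:
        chunk = stream.read(n - len(buf))
        if not chunk:
            raise EOFError("stream closed mid-frame")
        buf += chunk
    return buf


def read_frame(stream) -> bytes:
    """Read one length-prefixed frame written by write_frame."""
    (length,) = HEADER.unpack(read_exact(stream, HEADER.size))
    return read_exact(stream, length)


# In-memory demo; with a TCP socket, sock.makefile("rwb") gives a similar stream.
buffer = io.BytesIO()
write_frame(buffer, b"frame-0")
write_frame(buffer, b"frame-1-longer")
buffer.seek(0)
frames = [read_frame(buffer), read_frame(buffer)]
```

The `read_exact` loop matters on real sockets, where a single read can return fewer bytes than requested.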

Key Papers to Read

UniAD (2023): “Planning-oriented Autonomous Driving”, CVPR 2023 best paper
VAD: vectorized scene representation for E2E driving
Wayve GAIA-1: generative world model for driving
Tesla FSD v12: Tesla's public material on E2E actuator-level control

Your instinct is right: vision-only E2E on a cheap platform follows the Tesla philosophy, and it’s the most learnable path. The RC car or self-driving bicycle is genuinely the best MVE (minimum viable experiment) for someone at your level. You’ll learn more from closing the perception→action loop on a toy than from reading 10 papers.
