Real-Time Object Detection on SBCs: Raspberry Pi vs Edge AI

Why does YOLO run at 3 FPS on a Raspberry Pi but 30+ FPS on edge AI boards? We dissect the CPU bottleneck and thermal throttling, and show how dedicated NPUs enable production-grade computer vision.

Real-time object detection on single-board computers sits at a frustrating intersection: the algorithms are mature enough to work anywhere, but the compute requirements create a hard wall between "functional demo" and "production deployment." You can run YOLO on a Raspberry Pi. You can also watch it process frames at 3 FPS while the CPU thermal-throttles into oblivion. This gap—between academic possibility and robotic practicality—is what we're going to dissect.

TL;DR

flowchart TD
 A[SBC Hardware Setup Board & Camera Assembly] --> B[Environment Configuration OS & Dependencies Installation]
 B --> C[Model Selection Lightweight Architecture MobileNet/YOLO-Nano]
 C --> D[Model Optimization INT8 Quantization & Pruning]
 D --> E[Inference Engine Setup TensorFlow Lite/OpenCV Integration]
 E --> F[Pipeline Development Video Capture & Preprocessing]
 F --> G[Real-time Deployment Edge Inference & Monitoring]

graph TD
 A[Camera Module] -->|Raw Frames| B[Input Buffer]
 B --> C[Image Preprocessing]
 C --> D[SBC Processor CPU/GPU/NPU]
 D --> E[Inference Engine TFLite/ONNX]
 E --> F[Quantized Model YOLO/SSD]
 F --> G[Post-Processing NMS]
 G --> H[Detection Output Display/Network]
- The Problem: Running modern CNNs (YOLOv8, etc.) on a Raspberry Pi forces everything through the CPU, yielding 3-5 FPS with high latency and thermal instability.
- The Bottleneck: Convolutions are memory-bandwidth intensive; ARM Cortex cores lack the systolic arrays or tensor units needed for efficient inference.
- The Solution: Dedicated NPUs (Neural Processing Units) on edge AI boards like Axon provide 6+ TOPS of INT8 compute, enabling 30-45 FPS object detection at lower power consumption.
- The Trade-off: The Pi offers ecosystem maturity and familiarity; Axon requires model conversion (ONNX → RKNN) but delivers the latency consistency required for robotics.
- Code provided: Complete Python implementations for both platforms using YOLOv8n.

Prerequisites

Hardware:
- Raspberry Pi 4 (4GB+) or Pi 5 with active cooling
- Axon board (RK3588-based with 6 TOPS NPU)
- USB webcam (Pi) or MIPI CSI-2 camera (Axon)
- Adequate power supply (5V/3A for Pi 4, 5V/5A for Pi 5; 12V/2A or 5V/5A for Axon under load)

Software:
- Raspberry Pi OS 64-bit or Ubuntu 22.04 (both boards)
- Python 3.9+, OpenCV 4.8+
- OpenCV DNN module (Pi) or RKNN Toolkit (Axon)
- YOLOv8n ONNX model (download: wget https://github.com/ultralytics/assets/releases/download/v8.3.0/yolov8n.onnx)

Knowledge:
- Basic OpenCV workflows (capturing frames, drawing rectangles)
- Understanding of neural network inference (input normalization, NMS)
- Comfort with Linux and Python virtual environments
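
Before writing any detection code, it's worth a quick sanity check that the software versions above are what your environment actually resolves. A minimal sketch (the rknnlite import is expected to fail on the Pi; it only matters on the Axon side):

```python
import sys
import cv2

# Confirm the interpreter and OpenCV meet the minimums listed above
print(f"Python: {sys.version.split()[0]}")
print(f"OpenCV: {cv2.__version__}")
assert sys.version_info >= (3, 9), "Python 3.9+ required"
assert tuple(map(int, cv2.__version__.split(".")[:2])) >= (4, 8), "OpenCV 4.8+ required"

# The RKNN Lite runtime is only needed (and only installable) on the Axon board
try:
    from rknnlite.api import RKNNLite  # noqa: F401
    print("RKNN Lite runtime: available")
except ImportError:
    print("RKNN Lite runtime: not installed (fine on the Pi)")
```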

The Raspberry Pi Way

Let's build the baseline. We'll use OpenCV's DNN module with the CPU backend. This is the "standard" approach because it requires no vendor-specific SDKs—just pip-installable packages.

The Architecture

Real-time detection is a pipeline: Capture → Preprocess → Inference → Postprocess → Render. On the Pi, all five stages compete for four ARM cores.

import cv2
import numpy as np
import time

# Initialize camera with error handling
cap = cv2.VideoCapture(0)
if not cap.isOpened():
    raise RuntimeError("Failed to open camera. Check index and permissions.")
cap.set(cv2.CAP_PROP_FRAME_WIDTH, 640)
cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 480)
cap.set(cv2.CAP_PROP_FPS, 30)

# Load YOLOv8n model using OpenCV DNN
model_path = "yolov8n.onnx"
net = cv2.dnn.readNetFromONNX(model_path)
net.setPreferableBackend(cv2.dnn.DNN_BACKEND_OPENCV)
net.setPreferableTarget(cv2.dnn.DNN_TARGET_CPU)

# COCO class names (80 classes)
CLASSES = [
    "person", "bicycle", "car", "motorcycle", "airplane", "bus", "train", "truck", "boat",
    "traffic light", "fire hydrant", "stop sign", "parking meter", "bench", "bird", "cat",
    "dog", "horse", "sheep", "cow", "elephant", "bear", "zebra", "giraffe", "backpack",
    "umbrella", "handbag", "tie", "suitcase", "frisbee", "skis", "snowboard", "sports ball",
    "kite", "baseball bat", "baseball glove", "skateboard", "surfboard", "tennis racket",
    "bottle", "wine glass", "cup", "fork", "knife", "spoon", "bowl", "banana", "apple",
    "sandwich", "orange", "broccoli", "carrot", "hot dog", "pizza", "donut", "cake", "chair",
    "couch", "potted plant", "bed", "dining table", "toilet", "tv", "laptop", "mouse",
    "remote", "keyboard", "cell phone", "microwave", "oven", "toaster", "sink", "refrigerator",
    "book", "clock", "vase", "scissors", "teddy bear", "hair drier", "toothbrush"
]

def preprocess(frame):
    """Convert BGR -> RGB, resize to 640x640, normalize 0-1"""
    blob = cv2.dnn.blobFromImage(
        frame, scalefactor=1/255.0, size=(640, 640), 
        swapRB=True, crop=False
    )
    return blob

def postprocess(outputs, frame_shape, conf_thresh=0.25, iou_thresh=0.45):
    """Parse YOLOv8 output with NMS: [batch, 84, 8400] -> boxes, scores, classes"""
    # Handle different output shapes based on ONNX export version
    outputs = np.squeeze(outputs[0])
    if outputs.shape[0] == 84:
        outputs = outputs.T  # Transpose to [8400, 84]

    # Split into boxes, scores, classes
    boxes = outputs[:, :4]  # cx, cy, w, h
    scores = np.max(outputs[:, 4:], axis=1)
    class_ids = np.argmax(outputs[:, 4:], axis=1)

    # Filter by confidence
    mask = scores > conf_thresh
    boxes = boxes[mask]
    scores = scores[mask]
    class_ids = class_ids[mask]

    if len(boxes) == 0:
        return []

    # Convert cx,cy,w,h to x1,y1,x2,y2
    x_gain = frame_shape[1] / 640
    y_gain = frame_shape[0] / 640

    boxes_xyxy = np.zeros_like(boxes)
    boxes_xyxy[:, 0] = (boxes[:, 0] - boxes[:, 2]/2) * x_gain  # x1
    boxes_xyxy[:, 1] = (boxes[:, 1] - boxes[:, 3]/2) * y_gain  # y1
    boxes_xyxy[:, 2] = (boxes[:, 0] + boxes[:, 2]/2) * x_gain  # x2
    boxes_xyxy[:, 3] = (boxes[:, 1] + boxes[:, 3]/2) * y_gain  # y2

    # Apply Non-Maximum Suppression
    # cv2.dnn.NMSBoxes expects boxes as [x, y, w, h], so convert from x1,y1,x2,y2
    boxes_xywh = boxes_xyxy.copy()
    boxes_xywh[:, 2] = boxes_xyxy[:, 2] - boxes_xyxy[:, 0]  # width
    boxes_xywh[:, 3] = boxes_xyxy[:, 3] - boxes_xyxy[:, 1]  # height
    indices = cv2.dnn.NMSBoxes(boxes_xywh.tolist(), scores.tolist(), conf_thresh, iou_thresh)

    detections = []
    if len(indices) > 0:
        for i in indices.flatten():
            x1, y1, x2, y2 = boxes_xyxy[i].astype(int)
            detections.append((x1, y1, x2, y2, float(scores[i]), int(class_ids[i])))

    return detections

# Main loop
frame_times = []
while True:
    t0 = time.perf_counter()

    ret, frame = cap.read()
    if not ret:
        break

    # Preprocess
    blob = preprocess(frame)

    # Inference
    net.setInput(blob)
    outputs = net.forward()

    # Postprocess with NMS
    dets = postprocess(outputs, frame.shape)

    # Render
    for x1, y1, x2, y2, score, cid in dets:
        label = f"{CLASSES[cid]}: {score:.2f}"
        cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)
        cv2.putText(frame, label, (x1, max(y1-10, 20)), 
                   cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 2)

    # Calculate FPS
    t1 = time.perf_counter()
    frame_times.append(t1 - t0)
    if len(frame_times) > 30:
        frame_times.pop(0)
    avg_fps = len(frame_times) / sum(frame_times) if sum(frame_times) > 0 else 0

    cv2.putText(frame, f"FPS: {avg_fps:.1f}", (10, 30),
                cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 0, 255), 2)
    cv2.imshow("Pi Detection", frame)

    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

cap.release()
cv2.destroyAllWindows()

Why This Is Slow

Run this on a Pi 4. You'll observe two things immediately: the frame rate hovers around 3.2 FPS, and htop shows all four CPU cores pegged at 100%.

The issue isn't the code quality; it's architectural. YOLOv8n requires roughly 8.7 billion floating-point operations (about 4.3 billion multiply-accumulates) per 640x640 frame. A Cortex-A72 core sustains only ~4-5 GFLOPS of single-precision throughput, and at these sizes convolution is memory-bound, not compute-bound: the CPU spends most cycles waiting for weights to stream from RAM through the cache hierarchy. There's no INT8 acceleration, no Winograd convolution optimization, and no dedicated SRAM for feature maps.
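
To confirm that the forward pass, not capture or drawing, is what dominates, here is a minimal profiling sketch under the same assumptions as the code above (yolov8n.onnx in the working directory, camera at index 0):

```python
import time
import cv2

# Same model and backend as the main loop above
net = cv2.dnn.readNetFromONNX("yolov8n.onnx")
net.setPreferableBackend(cv2.dnn.DNN_BACKEND_OPENCV)
net.setPreferableTarget(cv2.dnn.DNN_TARGET_CPU)
cap = cv2.VideoCapture(0)

capture_ms, preprocess_ms, inference_ms = [], [], []
for _ in range(50):
    t0 = time.perf_counter()
    ret, frame = cap.read()
    if not ret:
        break
    t1 = time.perf_counter()
    blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, (640, 640), swapRB=True, crop=False)
    net.setInput(blob)
    t2 = time.perf_counter()
    net.forward()  # this is where the Pi's CPU spends nearly all of its time
    t3 = time.perf_counter()
    capture_ms.append((t1 - t0) * 1000)
    preprocess_ms.append((t2 - t1) * 1000)
    inference_ms.append((t3 - t2) * 1000)

cap.release()
for name, vals in [("capture", capture_ms), ("preprocess", preprocess_ms), ("inference", inference_ms)]:
    if vals:
        print(f"{name:>10}: {sum(vals) / len(vals):.1f} ms avg")
```

On a Pi 4, expect the inference row to dwarf the other two.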

Hitting the Wall

The Pi implementation hits three physical limits simultaneously:

1. Thermal Throttling
Without active cooling, the BCM2711 SoC hits 85°C within 60 seconds of inference. The clock drops from 1.8GHz to 1.2GHz, and your 3 FPS becomes 1.8 FPS. You're not just slow; you're unpredictably slow.

2. USB Bandwidth Contention
A USB 2.0 webcam is capped by the 480 Mbps bus, which it shares with any other USB 2.0 peripherals (on the Pi 3 and earlier, Ethernet hangs off this bus as well). An uncompressed 640x480 YUYV stream at 30 FPS alone consumes roughly 150 Mbps (640 x 480 x 2 bytes x 30 ≈ 18.4 MB/s). Add inference latency jitter, and you get torn frames or dropped buffers.

3. Latency Variance
The standard deviation of per-frame processing time on the Pi is around 40 ms (at ~3 FPS). For robotics, this is catastrophic: a drone moving at 5 m/s travels 20 cm during that 40 ms of jitter alone, on top of the ~330 ms frame period. Uncertainty compounds. (The monitoring sketch below shows how to measure this jitter alongside SoC temperature and clock.)

The conclusion is architectural: we need dedicated hardware. A CPU is a generalist; object detection is a specialist task requiring massive parallelism over regular data structures (tensors).
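
If you want numbers for the throttling and jitter claims on your own board, here is a minimal monitoring sketch. It assumes Raspberry Pi OS, where the vcgencmd utility reports SoC temperature, ARM clock, and throttle flags; replace the placeholder sleep with one iteration of the detection loop above:

```python
import statistics
import subprocess
import time

def vcgencmd(*args):
    """Query Raspberry Pi OS's vcgencmd utility and return its output string."""
    return subprocess.run(["vcgencmd", *args], capture_output=True, text=True).stdout.strip()

frame_times = []
t_prev = time.perf_counter()
for i in range(300):
    time.sleep(0.3)  # placeholder: substitute one capture + inference + render iteration
    t_now = time.perf_counter()
    frame_times.append(t_now - t_prev)
    t_prev = t_now

    if i % 30 == 0:
        temp = vcgencmd("measure_temp")            # e.g. temp=84.9'C
        clock = vcgencmd("measure_clock", "arm")   # current ARM clock in Hz
        throttled = vcgencmd("get_throttled")      # non-zero flags mean throttling occurred
        print(f"{temp}  {clock}  {throttled}")

print(f"frame time: {statistics.mean(frame_times) * 1000:.1f} ms mean, "
      f"{statistics.stdev(frame_times) * 1000:.1f} ms std dev")
```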

Enter Axon

Axon represents the shift from "computer that can do AI" to "computer designed for AI." The key difference is the NPU: a 6 TOPS (Tera Operations Per Second) accelerator capable of INT8 inference.

Why this matters:
- Systolic Arrays: The NPU contains dedicated matrix multiplication units arranged in a grid, allowing data to flow through compute elements without repeated memory accesses.
- Quantization: We convert the model from FP32 (32-bit floats) to INT8 (8-bit integers). This cuts weight size and memory bandwidth by 4x and maps directly onto the NPU's fixed-point units (a small worked example follows this list).
- Zero-Copy Camera: MIPI CSI-2 cameras write directly to memory accessible by the NPU, bypassing CPU entirely for the capture phase.
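
To make the INT8 step concrete, here is a small worked example of affine (scale/zero-point) quantization, the same basic mapping the RKNN toolchain applies per tensor during the build step below. The weight values are purely illustrative:

```python
import numpy as np

# Illustrative FP32 weights (in a real model these come from the trained network)
w_fp32 = np.array([-0.42, -0.07, 0.0, 0.13, 0.58], dtype=np.float32)

# Affine quantization: map the observed float range onto the int8 range [-128, 127]
scale = float(w_fp32.max() - w_fp32.min()) / 255.0
zero_point = int(round(-128 - float(w_fp32.min()) / scale))

w_int8 = np.clip(np.round(w_fp32 / scale) + zero_point, -128, 127).astype(np.int8)
w_dequant = (w_int8.astype(np.float32) - zero_point) * scale

print("int8 values         :", w_int8)
print("max round-trip error:", float(np.abs(w_fp32 - w_dequant).max()))  # roughly scale/2
print("storage             :", w_fp32.nbytes, "bytes ->", w_int8.nbytes, "bytes (4x smaller)")
```

The NPU then runs the convolutions directly on these 8-bit integers, which is where both the bandwidth savings and the fixed-point throughput come from.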

Specs comparison:

| Metric | Raspberry Pi 5 | Axon (RK3588 NPU) |
| --- | --- | --- |
| AI compute | ~15 GFLOPS (CPU) | 6 TOPS (NPU, INT8) |
| CPU | 4x Cortex-A76 @ 2.4 GHz | 4x A76 @ 2.4 GHz + 4x A55 @ 1.8 GHz |
| Memory bandwidth | ~4 GB/s shared | 8 GB/s with NPU priority |
| PCIe | 2.0 x1 (500 MB/s) | 3.0 x4 (4000 MB/s) |
| Camera interface | USB 2.0/3.0 | MIPI CSI-2 (1.5 Gbps/lane) |
| Thermal @ load | 85°C (throttling) | 55°C (passive cooling) |
| Power draw | 8 W | 5-7 W (NPU more efficient) |

The Axon Way

The Axon implementation requires model conversion. We can't feed raw ONNX to the NPU; we must use the RKNN Toolkit to compile a quantized graph. The workflow is: ONNX → Quantize → RKNN Graph → Inference.

Step 1: Model Conversion

# Using RKNN-Toolkit2 (model conversion, typically done on a development machine)
from rknn.api import RKNN

# Initialize
rknn = RKNN(verbose=True)

# Configure preprocessing and target platform before loading the model
# (mean 0 / std 255 reproduces the 1/255.0 normalization used in the Pi pipeline)
rknn.config(
    mean_values=[[0, 0, 0]],
    std_values=[[255, 255, 255]],
    target_platform="rk3588",
)

# Load FP32 ONNX model
rknn.load_onnx(model="yolov8n.onnx", inputs=["images"], input_size_list=[[1, 3, 640, 640]])

# Build with INT8 quantization
ret = rknn.build(
    do_quantization=True,
    dataset="calibration.txt"  # newline-separated list of 100+ representative images
)
if ret != 0:
    raise RuntimeError("RKNN build failed")

# Export the compiled NPU graph
rknn.export_rknn("yolov8n.rknn")
rknn.release()

This produces a 3.4MB file (vs 12MB FP32) optimized for the NPU's instruction set.
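
The calibration.txt referenced in the build step is just a plain text file with one image path per line, drawn from frames that look like your deployment scene. A minimal sketch to generate it, assuming you've saved sample frames into a hypothetical ./calib_images/ directory:

```python
from pathlib import Path

# Hypothetical directory holding 100+ representative frames grabbed from your camera
calib_dir = Path("calib_images")
image_paths = sorted(
    p for p in calib_dir.iterdir() if p.suffix.lower() in {".jpg", ".jpeg", ".png"}
)

if len(image_paths) < 100:
    print(f"Warning: only {len(image_paths)} images; quantization accuracy may suffer")

# RKNN's dataset file format: one image path per line
Path("calibration.txt").write_text("\n".join(str(p) for p in image_paths) + "\n")
print(f"Wrote calibration.txt with {len(image_paths)} entries")
```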

Step 2: Runtime Code

Notice the pipeline structure remains identical (capture, preprocess, inference, postprocess), but the heavy lifting moves to the NPU call, which executes in roughly 18 ms instead of the ~280 ms the forward pass takes on the Pi's CPU.

```python
import cv2
import numpy as np
import time
from rknnlite.api import RKNNLite

# Initialize MIPI camera
cap = cv2.VideoCapture(0, cv2.CAP_V4L2)
if not cap.isOpened():
    raise RuntimeError("Failed to open MIPI camera")
cap.set(cv2.CAP_PROP_FRAME_WIDTH, 640)
cap.set(cv2.CAP_PROP