Real-Time Computer Vision on Edge Devices: A Technical Deep Dive
Optimizing YOLO-based pipelines for sub-10ms inference on embedded hardware.
By Dr. Elena Voss · May 15, 2026
Real-time computer vision on edge devices is one of the most demanding AI workloads. You need to process 30+ frames per second, which leaves roughly 33 ms per frame for the whole pipeline, on a device whose power budget might be less than 1 W.
At AiSpaceRiver, we've deployed vision pipelines on devices ranging from ARM Cortex-M4 microcontrollers to NVIDIA Jetson Orin modules. Here's what we've learned about achieving reliable sub-10ms inference.
Model Selection: Beyond YOLO
While YOLOv11 is the default choice for many teams, it's rarely the optimal choice for edge deployment. Consider these alternatives:
- *YOLO-NAS*: Better accuracy-efficiency trade-off than standard YOLO
- *EfficientDet-Lite*: Excellent for mobile and embedded targets
- *NanoDet Plus*: Under 2MB model size with competitive accuracy
- *Custom architectures*: For specialized use cases, a purpose-built model can be 5x smaller than a general-purpose one
The key insight: Don't start with a large model and try to compress it. Start with the smallest viable architecture and scale up only as needed.
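One way to make "smallest viable" concrete is to measure each candidate on the target device and pick the smallest one that clears both an accuracy floor and a latency budget. The sketch below assumes you already have those measurements; the `Candidate` struct, its fields, and `smallest_viable` are hypothetical names used only for illustration.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical record of one candidate model, measured on the target device.
struct Candidate {
    std::string name;
    double size_mb;     // model file size
    double latency_ms;  // measured on-device inference latency
    double accuracy;    // validation metric, e.g. mAP in [0, 1]
};

// Returns the index of the smallest candidate that meets both constraints,
// or -1 if no candidate is viable.
int smallest_viable(const std::vector<Candidate>& candidates,
                    double min_accuracy,
                    double max_latency_ms) {
    int best = -1;
    for (std::size_t i = 0; i < candidates.size(); ++i) {
        const Candidate& c = candidates[i];
        if (c.accuracy < min_accuracy || c.latency_ms > max_latency_ms) {
            continue;  // not viable: too inaccurate or too slow
        }
        if (best < 0 || c.size_mb < candidates[best].size_mb) {
            best = static_cast<int>(i);
        }
    }
    return best;
}
```

If the function returns -1, that is the signal to scale up: add a larger architecture to the candidate list and measure again.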
Pipeline Optimization
The inference model is only part of the pipeline. Here's the full chain:
Camera → Preprocessing → Inference → Post-processing → Output
Each stage is a potential bottleneck:
*Preprocessing*: Use *hardware acceleration* where available. Many edge SoCs have dedicated image signal processors (ISPs) that can resize, normalize, and color-convert at near-zero CPU cost.
*Inference*: Use *TensorRT* or *ONNX Runtime* with the device-specific execution provider. For ARM CPUs, the *XNNPACK* backend is significantly faster than the default CPU kernels.
*Post-processing*: NMS (non-maximum suppression) is often the hidden bottleneck. We use a *custom C++ implementation* that runs 3x faster than the standard PyTorch version.
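```cpp
// Optimized NMS for edge deployment
#include <algorithm>
#include <cstddef>
#include <vector>

// Assumed box layout (not shown in the original): corners plus a score.
struct Box {
    float x1, y1, x2, y2;
    float confidence;
};

// Assumed helper: intersection-over-union of two axis-aligned boxes.
float iou(const Box& a, const Box& b) {
    const float w = std::max(0.0f, std::min(a.x2, b.x2) - std::max(a.x1, b.x1));
    const float h = std::max(0.0f, std::min(a.y2, b.y2) - std::max(a.y1, b.y1));
    const float inter = w * h;
    const float uni = (a.x2 - a.x1) * (a.y2 - a.y1) + (b.x2 - b.x1) * (b.y2 - b.y1) - inter;
    return uni > 0.0f ? inter / uni : 0.0f;
}

std::vector<Box> fast_nms(
    const std::vector<Box>& boxes,
    float iou_threshold
) {
    // Sort by confidence score, highest first
    auto sorted = boxes;
    std::sort(sorted.begin(), sorted.end(),
        [](const Box& a, const Box& b) {
            return a.confidence > b.confidence;
        });
    // Track suppression separately instead of overwriting confidence scores
    std::vector<bool> suppressed(sorted.size(), false);
    std::vector<Box> result;
    for (std::size_t i = 0; i < sorted.size(); i++) {
        if (suppressed[i]) continue;
        result.push_back(sorted[i]);
        for (std::size_t j = i + 1; j < sorted.size(); j++) {
            if (!suppressed[j] && iou(sorted[i], sorted[j]) > iou_threshold) {
                suppressed[j] = true;
            }
        }
    }
    return result;
}
```
Quantization-Aware Training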
Post-training quantization is easy to apply, but quantization-aware training (QAT) usually preserves more accuracy. The difference is especially pronounced for:
- *Small models* (<5M parameters)
- *Models with skip connections* (common in vision architectures)
- *Models operating near the accuracy floor*
Implement QAT using PyTorch's built-in torch.ao.quantization module. The training overhead is modest (about 10-20% slower), and the accuracy improvement over post-training quantization can be 2-5%.
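QAT works by simulating quantization during training: the forward pass rounds values to the integer grid and converts them straight back to floats, so the network learns to tolerate the resulting error. The following is a framework-independent sketch of that quantize-dequantize step, not PyTorch's implementation. It assumes per-tensor affine int8 quantization, and `QuantParams`, `choose_params`, and `fake_quantize` are illustrative names. It covers the forward pass only; during training, frameworks typically pass gradients through the rounding unchanged (the straight-through estimator).

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Affine int8 quantization parameters (illustrative names).
struct QuantParams {
    float scale;              // real-value step between adjacent integer levels
    std::int32_t zero_point;  // integer level that represents real 0.0
};

// Derive parameters that map the observed range [min_val, max_val] onto int8.
QuantParams choose_params(float min_val, float max_val) {
    // Widen the range to include zero so that 0.0 is exactly representable.
    min_val = std::min(min_val, 0.0f);
    max_val = std::max(max_val, 0.0f);
    const float qmin = -128.0f;
    const float qmax = 127.0f;
    float scale = (max_val - min_val) / (qmax - qmin);
    if (scale <= 0.0f) {
        scale = 1.0f;  // degenerate all-zero range
    }
    const float zp = std::round(qmin - min_val / scale);
    const float zp_clamped = std::min(qmax, std::max(qmin, zp));
    return {scale, static_cast<std::int32_t>(zp_clamped)};
}

// Quantize, then immediately dequantize. The result is a float that carries
// the rounding and clamping error that int8 inference would introduce.
float fake_quantize(float x, const QuantParams& p) {
    const float zp = static_cast<float>(p.zero_point);
    const float q = std::round(x / p.scale) + zp;
    const float clamped = std::min(127.0f, std::max(-128.0f, q));
    return (clamped - zp) * p.scale;
}
```

Values inside the calibrated range come back with an error of at most about half a step (`scale / 2`), while values outside it saturate at the range boundary. This is one reason small models, which have little redundancy to absorb that error, tend to benefit most from seeing it during training.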
Benchmarking Methodology
When benchmarking edge vision pipelines, measure these metrics:
1. *End-to-end latency* (camera to output)
2. *Power consumption* under load
3. *Thermal throttling* behavior after 30 minutes of sustained load
4. *Frame drop rate* under varying lighting conditions
5. *Memory fragmentation* over extended operation
We've seen devices that pass all unit tests but fail after 4 hours of continuous operation due to memory fragmentation. Always run long-duration tests.
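Several of the metrics above can be derived from a simple log of per-frame latencies. The sketch below is one possible way to summarize such a log. All names (`time_ms`, `LatencySummary`, `summarize`) are hypothetical, and the choice of comparing the first and last 10% of frames as a throttling indicator is an assumption, not an established threshold.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <numeric>
#include <vector>

// Time one call of `fn` in milliseconds using a monotonic clock.
template <typename Fn>
double time_ms(Fn&& fn) {
    const auto start = std::chrono::steady_clock::now();
    fn();
    const auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(end - start).count();
}

struct LatencySummary {
    double p50_ms;
    double p99_ms;
    double drift;             // mean of last 10% of frames / mean of first 10%
    std::size_t over_budget;  // frames slower than the per-frame budget
};

// Nearest-rank percentile over a sorted, non-empty vector (pct in 1..100).
double percentile(const std::vector<double>& sorted, std::size_t pct) {
    std::size_t rank = (pct * sorted.size() + 99) / 100;  // ceiling division
    rank = std::max<std::size_t>(rank, 1);
    return sorted[std::min(rank, sorted.size()) - 1];
}

// `samples` holds per-frame latencies in capture order.
LatencySummary summarize(const std::vector<double>& samples, double budget_ms) {
    LatencySummary out = {0.0, 0.0, 0.0, 0};
    if (samples.empty()) {
        return out;
    }
    std::vector<double> sorted = samples;
    std::sort(sorted.begin(), sorted.end());
    out.p50_ms = percentile(sorted, 50);
    out.p99_ms = percentile(sorted, 99);

    // Compare early and late frames; a ratio well above 1 suggests the
    // device slowed down over the run (for example, thermal throttling).
    const std::size_t window = std::max<std::size_t>(samples.size() / 10, 1);
    const double head =
        std::accumulate(samples.begin(), samples.begin() + window, 0.0) / window;
    const double tail =
        std::accumulate(samples.end() - window, samples.end(), 0.0) / window;
    out.drift = head > 0.0 ? tail / head : 0.0;

    out.over_budget = static_cast<std::size_t>(std::count_if(
        samples.begin(), samples.end(),
        [budget_ms](double v) { return v > budget_ms; }));
    return out;
}
```

In use, you would wrap the full camera-to-output path in `time_ms` for each frame, append the result to a vector, and call `summarize` with the frame budget (about 33 ms at 30 FPS). Reporting p99 alongside p50 matters because a pipeline with a good median can still drop frames on its slowest iterations.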
Conclusion
Real-time edge computer vision is achievable with the right approach: choose the smallest viable model, optimize every stage of the pipeline, use quantization-aware training, and benchmark thoroughly under real-world conditions. The difference between a prototype and a production system is in the details.