STM32N6 NPU Explained: Hardware AI Inference at 600 GOPS on a Microcontroller

<p>The STM32N6 is a different kind of microcontroller. It is not just another Cortex-M upgrade — it is the first STM32 with a dedicated hardware Neural Processing Unit, and that changes what is possible on a microcontroller without a GPU, without a GPU driver, and without Linux.</p>

<p>This article explains what the NPU actually is, how it compares to running inference on a standard Cortex-M7, and what that means for your AI edge projects.</p>

<h2>What the STM32N6 Actually Contains</h2>

<h3>The Core</h3>
<p>The main processor is an Arm Cortex-M55 running at up to 800MHz. The M55 was the first Cortex-M core to include the Helium (MVE) vector extension, which runs SIMD operations on integer and floating-point data for DSP and ML workloads. Compared to a Cortex-M7, the M55 with Helium can perform 4–8x more multiply-accumulate operations per clock cycle on the right workload.</p>

<h3>The NPU</h3>
<p>On top of the M55, the STM32N6 includes a hardware NPU, a fixed-function neural network accelerator. It is designed to run quantised neural network inference (INT4 and INT8) with minimal CPU involvement. The NPU delivers up to 600 GOPS at INT8: a peak of 600 billion operations per second, executed on the accelerator rather than on the Cortex-M55.</p>
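<p>To put the 600 GOPS figure in context, the sketch below converts peak throughput into an ideal lower bound on inference time. It assumes one multiply-accumulate counts as two operations, and the function names are illustrative rather than part of any ST library.</p>

```c
#include <stdint.h>

/* Peak GOPS -> peak GMAC/s, ASSUMING one MAC = 2 operations
 * (one multiply plus one add). */
static inline double gops_to_gmacs(double gops)
{
    return gops / 2.0;
}

/* Ideal (100% utilisation) inference time in milliseconds for a model
 * that needs `model_mmacs` million MACs per inference. Real hardware
 * will always be slower than this bound. */
static inline double ideal_latency_ms(double model_mmacs, double peak_gops)
{
    double macs_per_ms = gops_to_gmacs(peak_gops) * 1e6; /* GMAC/s -> MAC/ms */
    return (model_mmacs * 1e6) / macs_per_ms;
}
```

<p>For example, taking roughly 300 million MACs for MobileNetV2 at 224×224 (the commonly quoted figure for the 1.0-width model), <code>ideal_latency_ms(300.0, 600.0)</code> evaluates to 1.0. Measured latencies sit above this bound because no accelerator sustains peak utilisation across every layer.</p>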

<p>This is the critical distinction: the NPU runs inference while the M55 handles everything else, such as sensor fusion, communication, display, and control logic. In practice, this means you can run a real-time object detection model and still have most of the CPU available for your application logic.</p>

<h2>How the NPU Compares to Cortex-M7</h2>

<p>To understand the practical difference, consider running MobileNetV2 (image classification at 224×224 input) in three different configurations:</p>

<table>
<thead>
<tr><th>Platform</th><th>MobileNetV2 Inference Time</th><th>CPU Load During Inference</th></tr>
</thead>
<tbody>
<tr><td>STM32H7 (M7 at 480MHz)</td><td>~180ms</td><td>100% — nothing else runs</td></tr>
<tr><td>STM32N6 (M55 at 800MHz, no NPU)</td><td>~45ms</td><td>100% — nothing else runs</td></tr>
<tr><td>STM32N6 (M55 + NPU)</td><td>~8ms</td><td>~5% — M55 is nearly free</td></tr>
</tbody>
</table>

<p>At 8ms per inference, the NPU can run MobileNetV2 up to 125 times per second (1000ms ÷ 8ms). With only about 5% CPU load during inference, the M55 can simultaneously handle UART, a display refresh, sensor polling, and Ethernet, all on a single microcontroller with no operating system required.</p>
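<p>The figures in the table translate directly into throughput and speed-up. A minimal sketch of that arithmetic follows, using the latencies quoted above as inputs; the helper names are hypothetical.</p>

```c
/* Maximum inferences per second for a given per-inference latency.
 * Ignores camera capture and pre-processing time. */
static inline double fps_from_latency_ms(double latency_ms)
{
    return 1000.0 / latency_ms;
}

/* How many times faster the accelerated path is than the baseline. */
static inline double speedup(double baseline_ms, double accelerated_ms)
{
    return baseline_ms / accelerated_ms;
}
```

<p>With the table's numbers, <code>fps_from_latency_ms(8.0)</code> gives 125, <code>speedup(180.0, 8.0)</code> gives 22.5 against the STM32H7, and <code>speedup(45.0, 8.0)</code> gives 5.625 against the M55 running without the NPU.</p>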

<h2>What AI Models Run on the STM32N6 NPU</h2>

<p>The NPU runs models quantised to INT8 or INT4 precision using STM32Cube.AI. In principle, any model that can be expressed as a standard neural network graph, quantised, and built from operators the toolchain supports can be deployed. Tested and confirmed working models include:</p>

<ul>
<li><strong>Image classification:</strong> MobileNetV1, MobileNetV2, EfficientNet-Lite, SqueezeNet</li>
<li><strong>Object detection:</strong> YOLOv5n, YOLOv8n, SSD-MobileNet</li>
<li><strong>Face detection:</strong> UltraFace, RetinaFace (small variant)</li>
<li><strong>Keyword spotting:</strong> DS-CNN, MobileNet audio</li>
<li><strong>Anomaly detection:</strong> Autoencoder architectures</li>
<li><strong>Gesture recognition:</strong> CNN-based gesture classifiers</li>
</ul>

<h2>Deploying a Model: The STM32Cube.AI Flow</h2>

<p>The deployment pipeline for the STM32N6 NPU goes like this:</p>

<ol>
<li><strong>Train your model</strong> in PyTorch or TensorFlow — standard floating point (FP32)</li>
<li><strong>Export</strong> to ONNX or TFLite format</li>
<li><strong>Quantise</strong> to INT8 using post-training quantisation (PTQ) with a calibration dataset — this typically costs less than 1% accuracy on well-trained models</li>
<li><strong>Import into STM32Cube.AI</strong> (available as a plugin in STM32CubeIDE or as a standalone CLI tool)</li>
<li><strong>Analyse</strong> — Cube.AI shows you the model size, RAM usage, and predicted inference time on the NPU</li>
<li><strong>Generate C code</strong> — Cube.AI produces a C library that targets the NPU automatically</li>
<li><strong>Integrate</strong> into your STM32 project — call <code>ai_run()</code> with your input tensor and read the output</li>
</ol>
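<p>The quantise step maps FP32 values to INT8 using a scale and a zero point derived from calibration data. The sketch below shows the standard asymmetric (affine) scheme for a single tensor, assuming the calibration pass has already produced the observed minimum and maximum. The frameworks and STM32Cube.AI handle this for you; the type and function names here are illustrative only, and real tools often use per-channel parameters.</p>

```c
#include <stdint.h>

typedef struct {
    float   scale;       /* real-value size of one INT8 step */
    int32_t zero_point;  /* INT8 code that represents 0.0 */
} qparams_t;

/* Round to nearest, halves away from zero (avoids needing libm). */
static int32_t round_nearest(float v)
{
    return (int32_t)(v >= 0.0f ? v + 0.5f : v - 0.5f);
}

/* Derive quantisation parameters from the calibration range. */
static qparams_t calibrate_int8(float min_val, float max_val)
{
    qparams_t qp;
    /* Widen the range to include 0.0 so zero is exactly representable. */
    if (min_val > 0.0f) min_val = 0.0f;
    if (max_val < 0.0f) max_val = 0.0f;
    qp.scale = (max_val - min_val) / 255.0f;
    if (qp.scale == 0.0f) qp.scale = 1.0f;   /* degenerate all-zero tensor */
    qp.zero_point = round_nearest(-128.0f - min_val / qp.scale);
    return qp;
}

/* FP32 -> INT8, saturating at the INT8 limits. */
static int8_t quantise(float x, qparams_t qp)
{
    int32_t q = round_nearest(x / qp.scale) + qp.zero_point;
    if (q < -128) q = -128;
    if (q > 127)  q = 127;
    return (int8_t)q;
}

/* INT8 -> approximate FP32. */
static float dequantise(int8_t q, qparams_t qp)
{
    return qp.scale * (float)((int32_t)q - qp.zero_point);
}
```

<p>The difference between a value and <code>dequantise(quantise(x, qp), qp)</code> is the rounding error that post-training quantisation introduces, which is why a representative calibration dataset matters: a range that is too wide wastes INT8 codes, and one that is too narrow clips real activations.</p>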

<h2>Memory Architecture for AI</h2>

<p>The STM32N6 has a flexible memory architecture that matters for AI:</p>

<ul>
<li><strong>4MB internal SRAM</strong> — enough for activations of most small to medium models</li>
<li><strong>4MB internal Flash</strong> — stores model weights (sufficient for models up to ~3MB after quantisation)</li>
<li><strong>LPDDR4 via XSPI</strong> — on the NVX-N6 Vision Pro variant, external LPDDR4 extends the memory available to the NPU for larger models</li>
<li><strong>OctoSPI Flash</strong> — up to 512MB external flash for model storage, loaded at runtime if needed</li>
</ul>

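<p>A YOLOv8n model quantised to INT8 occupies approximately 3.2MB of flash and requires roughly 1.5MB of SRAM for activations. Both figures fit within the NVX-N6 Vision’s 4MB of internal flash and 4MB of internal SRAM, although the weights sit slightly above the ~3MB guideline given above, so little flash is left over for application code.</p>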
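<p>A quick way to sanity-check figures like these is to compare a model's footprint against the memory budget before deployment. The sketch below takes the sizes as inputs; the struct names and the idea of reserving part of SRAM for the application are assumptions made for illustration.</p>

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    size_t weights_bytes;      /* model weights, stored in flash */
    size_t activations_bytes;  /* working buffers, held in SRAM */
} model_footprint_t;

typedef struct {
    size_t flash_bytes;          /* flash available for weights */
    size_t sram_bytes;           /* total SRAM */
    size_t sram_reserved_bytes;  /* SRAM kept back for the application (hypothetical) */
} memory_budget_t;

/* True if weights fit in flash and activations fit in the SRAM
 * that remains after the application's reservation. */
static bool fits_internal(const model_footprint_t *m, const memory_budget_t *b)
{
    if (b->sram_reserved_bytes > b->sram_bytes) {
        return false;
    }
    return m->weights_bytes <= b->flash_bytes &&
           m->activations_bytes <= b->sram_bytes - b->sram_reserved_bytes;
}
```

<p>Using the numbers quoted in this section (3.2MB of weights, 1.5MB of activations, 4MB of flash, 4MB of SRAM) and an assumed 1MB SRAM reservation for frame buffers and application data, the check passes. Raising the reservation to 3MB makes it fail, which is the point at which external memory becomes necessary.</p>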

<h2>Camera and Display Integration</h2>

<p>The STM32N6 is designed for vision applications. The NVX-N6 Vision board exposes:</p>

<ul>
<li><strong>DCMI:</strong> parallel camera interface for OV7670, OV2640, and similar sensors</li>
<li><strong>MIPI-CSI2:</strong> high-bandwidth camera interface for IMX219, IMX477, and similar sensors</li>
<li><strong>MIPI-DSI:</strong> display output for MIPI-DSI panels — full HD video refresh is possible with the ChromART hardware accelerator</li>
</ul>

<p>A typical computer vision pipeline on the NVX-N6: camera frame captured via DCMI → pre-processed by M55 (resize, normalise) → inference on NPU → results overlaid on display via ChromART → display refreshed via MIPI-DSI — all in under 40ms total latency.</p>
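<p>The pre-processing stage of this pipeline (resize, normalise) can be sketched in portable C. This version uses nearest-neighbour sampling on an interleaved RGB888 frame and shifts each channel from 0..255 to -128..127, which assumes the model expects INT8 input with a zero point of -128; check your model's actual input quantisation before reusing it. The function name is hypothetical, and a production build would more likely use Helium intrinsics or a hardware block for this step.</p>

```c
#include <stddef.h>
#include <stdint.h>

/* Nearest-neighbour resize of an interleaved RGB888 frame, converting
 * each uint8 channel to int8 by subtracting 128.
 * src holds src_w * src_h * 3 bytes; dst must hold dst_w * dst_h * 3 bytes. */
static void resize_normalise_rgb(const uint8_t *src, size_t src_w, size_t src_h,
                                 int8_t *dst, size_t dst_w, size_t dst_h)
{
    for (size_t y = 0; y < dst_h; ++y) {
        size_t sy = y * src_h / dst_h;          /* source row for this output row */
        for (size_t x = 0; x < dst_w; ++x) {
            size_t sx = x * src_w / dst_w;      /* source column for this output column */
            const uint8_t *p = &src[(sy * src_w + sx) * 3];
            int8_t *q = &dst[(y * dst_w + x) * 3];
            for (size_t c = 0; c < 3; ++c) {
                q[c] = (int8_t)((int)p[c] - 128);
            }
        }
    }
}
```

<p>The output buffer can then be handed to the generated inference call as the input tensor. Nearest-neighbour sampling is the cheapest option; bilinear filtering costs more CPU time but usually preserves small objects better.</p>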

<h2>Who Should Use the STM32N6</h2>

<p>The STM32N6 is the right choice when:</p>
<ul>
<li>You need to run a neural network and also do something useful with the result in real time</li>
<li>You cannot use a Raspberry Pi because you need deterministic real-time behaviour</li>
<li>You cannot use a GPU because power, cost, or form factor rules it out</li>
<li>Your application needs to run on battery or in a harsh environment</li>
</ul>

<p>It is not the right choice when:</p>
<ul>
<li>Your model is very large (hundreds of MB) — use a more powerful platform</li>
<li>You need Linux — use a Raspberry Pi or similar SBC</li>
<li>Your application has no AI component — the STM32H745 is more cost-effective</li>
</ul>

<h2>Get Started</h2>

<p>The <a href="/products">NVX-N6 Vision board</a> is our STM32N6 development platform. It includes MIPI-DSI display output, DCMI and CSI2 camera headers, and optional LPDDR4 memory. It ships with a pinout diagram, full schematic, and an STM32Cube.AI deployment guide covering YOLOv8n deployment step by step.</p>

<p>For questions about AI deployment on the STM32N6, email <strong>info@nvixeon.com</strong> — you will get a response from the engineer who built the board.</p>