Source, LinkedIn: EDGE AI FOUNDATION
Can you really run generative AI on the edge without blowing up memory, power, or latency?
At EDGE AI FOUNDATION, we sat down with Tinoosh Mohsenin of Johns Hopkins University to break down what it actually takes to make transformers practical on Jetson-class devices and beyond.
This isn’t theory. It’s engineering.
In this session, Tinoosh walks through two powerful strategies for lean, deployable generative edge AI, pruning and quantization:
• Pruning where it matters most: the feedforward layers
• Structured sparsity that works with hardware, not against it
• Quantization that preserves accuracy while slashing memory traffic
• A 2-bit Vision Transformer designed for medical AI at the edge
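To make the ideas above concrete, here is a minimal NumPy sketch of structured pruning on a transformer feedforward block followed by 2-bit weight quantization. It is a toy illustration, not the method presented in the talk: the weight-norm importance score, the min-max uniform quantizer, the layer sizes, and every function name are assumptions chosen for simplicity.

```python
import numpy as np

def prune_ffn_hidden_units(w1, b1, w2, keep_ratio):
    """Structured pruning: drop whole hidden units of a feedforward block.

    w1: (d_model, d_hidden), b1: (d_hidden,), w2: (d_hidden, d_model).
    Assumed importance score: product of a unit's input and output weight norms.
    """
    d_hidden = w1.shape[1]
    n_keep = max(1, int(round(d_hidden * keep_ratio)))
    importance = np.linalg.norm(w1, axis=0) * np.linalg.norm(w2, axis=1)
    keep = np.sort(np.argsort(importance)[-n_keep:])
    # Removing entire columns/rows keeps the matrices dense, just smaller.
    return w1[:, keep], b1[keep], w2[keep, :]

def quantize_2bit(w):
    """Uniform min-max quantization: each weight maps to one of 4 levels."""
    lo, hi = float(w.min()), float(w.max())
    scale = (hi - lo) / 3.0 if hi > lo else 1.0
    codes = np.clip(np.round((w - lo) / scale), 0, 3).astype(np.uint8)
    return codes, scale, lo

def dequantize(codes, scale, lo):
    """Recover approximate float weights from 2-bit codes."""
    return codes.astype(np.float32) * scale + lo

def pack_2bit(codes):
    """Pack four 2-bit codes into each byte for storage."""
    flat = codes.ravel()
    pad = (-flat.size) % 4
    flat = np.concatenate([flat, np.zeros(pad, dtype=np.uint8)])
    q = flat.reshape(-1, 4)
    return (q[:, 0] | (q[:, 1] << 2) | (q[:, 2] << 4) | (q[:, 3] << 6)).astype(np.uint8)

# Toy feedforward block with hypothetical sizes.
rng = np.random.default_rng(0)
d_model, d_hidden = 64, 256
w1 = rng.standard_normal((d_model, d_hidden)).astype(np.float32)
b1 = np.zeros(d_hidden, dtype=np.float32)
w2 = rng.standard_normal((d_hidden, d_model)).astype(np.float32)

# Keep half of the hidden units, then quantize the remaining weights.
w1p, b1p, w2p = prune_ffn_hidden_units(w1, b1, w2, keep_ratio=0.5)
codes1, scale1, lo1 = quantize_2bit(w1p)
codes2, scale2, lo2 = quantize_2bit(w2p)

fp32_bytes = w1.nbytes + w2.nbytes
packed_bytes = pack_2bit(codes1).nbytes + pack_2bit(codes2).nbytes
ratio = fp32_bytes / packed_bytes  # weight matrices only; biases and scales ignored
print(f"weight storage: {fp32_bytes} B -> {packed_bytes} B ({ratio:.0f}x smaller)")
```

Because whole hidden units are removed, the pruned matrices stay dense and need no special sparse kernels, which is one sense in which structured sparsity cooperates with hardware. The storage ratio printed here is just arithmetic on this toy layer and should not be read as a result from the talk.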
The results speak for themselves:
• Multi-fold reductions in model size and energy
• Up to 43x model compression
• 22x latency improvements
• Accuracy that stays remarkably close to baseline
The big takeaway? FLOPs aren’t the real bottleneck. Memory, data movement, and energy are. And if you focus your compression strategy there, edge AI becomes not just possible, but production-ready.
If you’re building robotics, clinical imaging tools, industrial systems, or any application where intelligence must live near the sensor, this one is worth your time.
Watch the full talk here: