UNC - HuggingFace transformer compiler for optimised inference

by Adib Mohsin
Compiles HuggingFace transformer models into optimised native Metal inference binaries. No runtime framework, no Python: just a compiled binary that runs your model at near-hardware-limit speed on Apple Silicon. Compared with mlx-lm, UNC is 1.35x faster while using 25% less GPU power, for 1.7x better energy efficiency overall. It also executes 8.4x fewer CPU instructions than Apple's MLX, which means less heat, less power draw, and more headroom for the GPU.
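To make the "no runtime" claim concrete, here is a minimal sketch of what consuming such a compiled binary could look like from a Swift host process. The artifact name ("model.unc") and the "--prompt" flag are assumptions for illustration, not UNC's actual CLI; the point is that nothing else sits in the loop, no interpreter and no framework, just one native child process driving Metal directly.

```swift
import Foundation

// Hypothetical usage sketch: the binary name and flag below are assumptions,
// not UNC's documented interface.
let process = Process()
process.executableURL = URL(fileURLWithPath: "./model.unc") // assumed artifact name
process.arguments = ["--prompt", "Hello, world"]            // assumed flag

let pipe = Pipe()
process.standardOutput = pipe

do {
    try process.run()
    process.waitUntilExit()
    // Read whatever the compiled model wrote to stdout.
    let output = pipe.fileHandleForReading.readDataToEndOfFile()
    print(String(decoding: output, as: UTF8.self))
} catch {
    print("failed to launch inference binary: \(error)")
}
```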


Replies

Adib Mohsin (Maker):
I built an LLM compiler for Hugging Face models that produces the fastest and most energy-efficient on-device inference binaries for Apple Silicon, because I want to extract the most token throughput per unit of power consumed. On my MacBook M1 Pro: UNC hits 152 tok/s at 11.3W with 5.3B CPU instructions, while mlx-lm manages 113 tok/s at 14.1W with 31.4B CPU instructions. That works out to 25% less GPU power and 1.7x better energy efficiency than Apple's MLX. This is similar to what ggml or Ollama do, but UNC is designed around JIT/AOT compilation to produce the smallest possible CPU footprint, which is what drives the lower power consumption.
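For anyone who wants to check the headline ratios, they fall straight out of the numbers quoted above: throughput divided by power gives tokens per joule, and the 1.7x figure is just the 1.35x speedup compounded with the lower power draw. A quick sketch:

```swift
// Sanity check of the headline ratios, using the figures quoted above
// (decode throughput and average power on an M1 Pro, as reported).
let uncTokPerSec = 152.0, uncWatts = 11.3   // UNC
let mlxTokPerSec = 113.0, mlxWatts = 14.1   // mlx-lm

let speedup = uncTokPerSec / mlxTokPerSec              // ≈ 1.35x faster
let uncTokPerJoule = uncTokPerSec / uncWatts           // ≈ 13.5 tok/J
let mlxTokPerJoule = mlxTokPerSec / mlxWatts           // ≈ 8.0 tok/J
let efficiencyGain = uncTokPerJoule / mlxTokPerJoule   // ≈ 1.68x, i.e. ~1.7x

print(speedup, efficiencyGain)
```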