UNC

HuggingFace transformer compiler for optimised inference

1 follower


Compiles HuggingFace transformer models into optimised native Metal inference binaries. No runtime framework, no Python: just a compiled binary that runs your model at near-hardware-limit speed on Apple Silicon. Compared with mlx-lm, UNC is 1.35x faster while using 25% less GPU power, resulting in 1.7x better energy efficiency. It also executes 8.4x fewer CPU instructions, which means less heat, less power, and more headroom for the GPU.

Free

Launch tags: Software Engineering • Artificial Intelligence • GitHub


Maker


I built an LLM compiler for Hugging Face models that produces the fastest and most energy-efficient on-device inference binaries for Apple Silicon, because I want to extract the most token processing per watt. On my MacBook M1 Pro, UNC runs at 152 tok/s, 11.3 W, and 5.3B CPU instructions; mlx-lm runs at 113 tok/s, 14.1 W, and 31.4B CPU instructions. That is 25% less GPU power and 1.7x better energy efficiency than Apple's MLX. This is similar to what ggml or Ollama does, but it is designed around JIT/AOT compilation to produce the smallest CPU footprint, which results in lower power consumption.
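The headline ratios can be checked from the quoted measurements with a little arithmetic. The sketch below is only an illustration: it assumes the wattages are average power draw during generation, and the dictionary keys and function name are made up for this example rather than taken from UNC or mlx-lm.

```python
# Figures quoted above for a MacBook M1 Pro.
# Assumption: "watts" is average power draw while generating tokens.
UNC = {"tok_per_s": 152.0, "watts": 11.3, "cpu_instr_billions": 5.3}
MLX_LM = {"tok_per_s": 113.0, "watts": 14.1, "cpu_instr_billions": 31.4}


def tokens_per_joule(run):
    """Energy efficiency: 1 W = 1 J/s, so (tok/s) / W gives tokens per joule."""
    return run["tok_per_s"] / run["watts"]


# Throughput ratio (how many times faster UNC generates tokens).
speedup = UNC["tok_per_s"] / MLX_LM["tok_per_s"]

# Ratio of tokens per joule (the "energy efficiency" comparison).
efficiency_gain = tokens_per_joule(UNC) / tokens_per_joule(MLX_LM)

# Fraction by which UNC's power draw is lower than mlx-lm's.
power_reduction = 1.0 - UNC["watts"] / MLX_LM["watts"]

# How many times more CPU instructions mlx-lm executed in these runs.
instr_ratio = MLX_LM["cpu_instr_billions"] / UNC["cpu_instr_billions"]

print(f"speedup:          {speedup:.2f}x")
print(f"efficiency gain:  {efficiency_gain:.2f}x")
print(f"power reduction:  {power_reduction:.1%}")
print(f"instruction ratio: {instr_ratio:.1f}x")
```

With these inputs the speedup comes out at about 1.35x and the efficiency gain at about 1.68x, which matches the 1.35x and 1.7x figures. The two wattages alone give roughly 20% lower draw for UNC (equivalently, mlx-lm draws about 25% more), and the two instruction counts give a ratio of about 5.9x, so the 25% and 8.4x figures presumably come from a different measurement, such as GPU-only power or a different run.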


2mo ago
