EMA-Gated Temporal Sequence Compression in Vision Transformers - No fine-tuning required

Vision Transformers waste 90% of their compute recalculating stationary asphalt. NeuroFlow tracks semantic surprise in embedding space, physically eliminating background tokens before the encoder.

Result: 55.8x wall-clock speedup for ViTs on high-res video (1792p) with 97% fidelity. No fine-tuning required.

NeuroFlow is a dynamic routing framework for Vision Transformer video inference. It exploits temporal redundancy by comparing each patch's embedding against an Exponential Moving Average (EMA) of patch-level embeddings and applying a per-patch distance threshold, which addresses the architectural mismatch between O(N²) self-attention and highly redundant natural video streams.
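To make the gating idea concrete, here is a minimal sketch of a per-patch EMA gate. It is an illustration only, not the code from the repository: the cosine distance metric, the `decay` and `threshold` values, the `EmaGate` name, and the choice to update the memory for every patch are all assumptions.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 1.0  # treat a zero vector as maximally surprising
    return 1.0 - dot / (na * nb)

class EmaGate:
    """Hypothetical per-patch gate: keep a patch when it drifts from its EMA."""

    def __init__(self, decay=0.9, threshold=0.05):
        self.decay = decay          # assumed value, not taken from the paper
        self.threshold = threshold  # assumed value, not taken from the paper
        self.memory = None          # one EMA vector per patch position

    def step(self, patches):
        """patches: list of embedding vectors for one frame. Returns a keep mask."""
        if self.memory is None:
            # First frame: nothing to compare against, so keep every patch.
            self.memory = [list(p) for p in patches]
            return [True] * len(patches)
        keep = []
        for i, p in enumerate(patches):
            keep.append(cosine_distance(p, self.memory[i]) > self.threshold)
            # Assumption: the EMA is updated for every patch, kept or not.
            self.memory[i] = [self.decay * m + (1.0 - self.decay) * x
                              for m, x in zip(self.memory[i], p)]
        return keep

gate = EmaGate()
print(gate.step([[1.0, 0.0], [0.0, 1.0]]))  # [True, True]  (first frame)
print(gate.step([[1.0, 0.0], [1.0, 0.0]]))  # [False, True] (only patch 1 changed)
```

Only the patches flagged `True` would be passed on to the encoder; the rest are treated as stationary background.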

Key Contributions:

Architecture C (Dual-Memory Reconstruction): A completely training-free inference engine that combines a Layer 0 Gate with a Layer 12 Cache. It achieves 71.55% zero-shot top-1 accuracy at 84.0% token sparsity on SigLIP, retaining 92.4% of dense accuracy without modifying any weights.
Architecture B (Extreme Wall-Clock Speedup): Physically eliminates stationary tokens before the encoder. With sparse manifold distillation, it reduces 1792p SigLIP 2 inference from 678 ms to 11.9 ms—a 55.80× wall-clock speedup at 97.37% embedding fidelity.
LLM Ablation: Characterises the architectural boundaries of applying similarity-gated bypass to autoregressive language models (Phi-3-mini), demonstrating 0% token drift in syntactically constrained generation.
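One way to read the gate-plus-cache combination in Architecture C is sketched below: tokens selected by an input-side gate are re-encoded, and every other token reuses its last cached encoder output. This is a toy illustration under that reading, not the repository's implementation; `sparse_forward` and the `encode` callable are hypothetical stand-ins.

```python
def sparse_forward(tokens, keep, cache, encode):
    """Run `encode` only on kept tokens and reuse cached outputs for the rest.

    tokens: list of per-patch input embeddings for the current frame
    keep:   list of bools from the input-side gate (True = recompute)
    cache:  list of last known encoder outputs per patch, updated in place
    encode: hypothetical stand-in for the encoder; maps a list of tokens
            to a list of output vectors of the same length
    """
    active = [i for i, k in enumerate(keep) if k]
    if active:
        fresh = encode([tokens[i] for i in active])
        for i, out in zip(active, fresh):
            cache[i] = out  # refresh only the recomputed positions
    # Return the reconstructed full-length output and the number of tokens encoded.
    return list(cache), len(active)

# Toy encoder that doubles every value, just to show the data flow.
toy_encode = lambda batch: [[2.0 * x for x in t] for t in batch]
cache = [[0.0], [0.0], [0.0]]
out, n_encoded = sparse_forward([[1.0], [2.0], [3.0]], [False, True, False],
                                cache, toy_encode)
print(out, n_encoded)  # [[0.0], [4.0], [0.0]] 1
```

The encoder sees a sequence whose length equals the number of kept tokens, which is where the compute saving comes from; the cache fills in the positions that were skipped.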

Code and paper: https://github.com/ynnk-research/-NeuroFlow

Replies

Any feedback is highly appreciated.

Honestly the "no fine-tuning required" part is what caught my eye. That alone makes it practical for people who can't afford to retrain large models. The technical details are dense but the core idea is solid. What kind of compression ratios are you seeing in practice?

@antwon_randolph2 Thank you. I have been experimenting with different sparsity settings; the ratio varies from 50% on dynamic FPV drone footage up to 98% on a static traffic stream.
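For a rough sense of what such sparsity levels mean for the quadratic part of self-attention, here is a back-of-the-envelope sketch. It counts only token-pair interactions; MLP and projection layers scale linearly with token count, so this is not a prediction of end-to-end speedup. The function name is made up for illustration.

```python
def attention_pair_fraction(sparsity):
    """Fraction of token-pair interactions left after dropping `sparsity` of tokens."""
    kept = 1.0 - sparsity
    return kept * kept  # attention cost grows with the square of sequence length

for s in (0.50, 0.84, 0.98):
    frac = attention_pair_fraction(s)
    print(f"sparsity {s:.0%}: {frac:.4%} of dense attention pairs")
```

At 50% sparsity about a quarter of the pairwise work remains, at 84% about 2.6%, and at 98% about 0.04%.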

The actual numbers also depend on what you are trying to achieve, as a higher or lower EMA decay has practical implications. I don't have any real use cases though, so this part is more speculative. I am mostly into the underlying science and a mediocre engineer at best.
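To illustrate the decay trade-off, the sketch below estimates how many frames an EMA needs to absorb a sudden, persistent change in a patch. It assumes the common update m = decay * m + (1 - decay) * x, where `decay` weights the old memory; if the project defines decay the other way round, the numbers flip. `frames_to_absorb` is a hypothetical helper.

```python
def frames_to_absorb(decay, fraction=0.9):
    """Frames until an EMA has moved `fraction` of the way to a new constant input.

    Assumes m = decay * m + (1 - decay) * x, so the remaining gap to the new
    value shrinks by a factor of `decay` every frame.
    """
    if not 0.0 < decay < 1.0:
        raise ValueError("decay must be strictly between 0 and 1")
    target = 1.0 - fraction
    gap, n = 1.0, 0
    while gap > target:
        gap *= decay
        n += 1
    return n

for d in (0.5, 0.9, 0.99):
    print(d, frames_to_absorb(d))  # 0.5 -> 4, 0.9 -> 22, 0.99 -> 230
```

Under this convention a high decay gives a slow-moving memory: a changed patch stays far from its EMA for many frames and keeps being recomputed, which lowers sparsity but tracks gradual scene changes. A low decay lets the memory catch up within a few frames, which raises sparsity but can let slow drift slip under the threshold.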