Like the Olympic spirit, generating high-quality images and videos faster has always been a core pursuit of the generative modeling community. Early efforts mainly took a numerical-analysis perspective: researchers built more efficient ODE solvers, such as DDIM [1], DPM-Solver [2], and iPNDM [3], to approximate the target distribution with fewer sampling steps. Distillation is another effective way to accelerate sampling, although it usually requires additional training and more compute.
In the era of large models such as FLUX and Wan, perhaps also inspired by inference acceleration ideas from LLMs, cache mechanisms have quickly become popular in academia due to their training-free and plug-and-play nature. Of course, quantization and distributed inference are also developing in parallel. But in the end, the goal is still the same: finding a better balance between efficiency and quality.
This post focuses on cache mechanisms. Based on my own research experience, I want to discuss two questions: how cache methods should be evaluated, and how this technical line has roughly evolved.
When opening a cache-related paper, the first metrics we usually see fall into two categories: visual quality and efficiency. For quality, common metrics include CLIP Score, ImageReward, and VBench. Because a cached run can be compared directly with the uncached output, reconstruction metrics such as PSNR and LPIPS are also often reported. In contrast, what “efficiency” means is often left ambiguous in papers.
The most common efficiency metrics are latency and speedup. Latency is intuitive and is itself a reasonable metric; it is just highly hardware-dependent. The same algorithm can show very different latency on different GPUs, under different workloads, or with different memory states. Speedup also looks intuitive, but it can be more misleading because it is fundamentally a relative metric and is heavily affected by the baseline setting. Here, the baseline is usually the original number of inference steps, or what we often treat as the “Ground Truth” configuration.
Consider a simple arithmetic example. Suppose our goal is to generate an image of acceptable quality:
- Case A: the original number of inference steps is 50. With Cache, the model only fully computes 10 steps and skips 40 steps. The speedup is 50 / 10 = 5.0×.
- Case B: the original number of inference steps is 30. For models like FLUX, 30 steps are usually already enough. With Cache, the model also fully computes only 10 steps and skips 20 steps. The speedup is 30 / 10 = 3.0×.
At first glance, Case A with 5.0× speedup seems much stronger than Case B with 3.0×, almost like a bigger breakthrough. But once we remove the effect of the baseline step count, we can see that both cases fully run the network only 10 times, so their actual inference latency is likely very close.
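The arithmetic above can be checked in a few lines of Python. This is only an illustrative sketch: the helper name `speedup` is made up here, and it assumes that a skipped step costs nothing, which real cache lookups only approximate.

```python
def speedup(baseline_steps: int, full_steps: int) -> float:
    """Step-count speedup relative to the uncached baseline."""
    return baseline_steps / full_steps


# Step counts taken from Case A and Case B in the text.
cases = {
    "A": {"baseline_steps": 50, "full_steps": 10},
    "B": {"baseline_steps": 30, "full_steps": 10},
}

for name, case in cases.items():
    ratio = speedup(case["baseline_steps"], case["full_steps"])
    # The speedup differs, but the number of full forward passes does not.
    print(f"Case {name}: speedup = {ratio:.1f}x, full passes = {case['full_steps']}")
```

Both cases report 10 full passes, even though the headline speedups are 5.0× and 3.0×.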
In ODE Solver research, NFE, or Number of Function Evaluations, is widely used to measure how many times the denoising network is fully evaluated. It is a relatively “harder” metric: it does not directly depend on hardware and is less affected by the original step setting. For Step-Level Cache methods, NFE is especially suitable because what these methods truly reduce is the number of full network forward passes.
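One simple way to measure NFE is to wrap the denoiser and count its calls. The sketch below is a toy under stated assumptions: `NFECounter`, `toy_denoiser`, and `euler_sample` are hypothetical names, and the "network" is a scalar function rather than a real diffusion model.

```python
from typing import Callable


class NFECounter:
    """Wraps a denoiser and counts how many times it is fully evaluated."""

    def __init__(self, denoiser: Callable[[float, float], float]) -> None:
        self.denoiser = denoiser
        self.nfe = 0

    def __call__(self, x: float, t: float) -> float:
        self.nfe += 1
        return self.denoiser(x, t)


def toy_denoiser(x: float, t: float) -> float:
    """Hypothetical stand-in for a real network; returns a velocity."""
    return -x


def euler_sample(model: Callable[[float, float], float], x: float, num_steps: int) -> float:
    """Plain Euler sampler: one full model call per step."""
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = 1.0 - i * dt
        x = x + dt * model(x, t)
    return x


model = NFECounter(toy_denoiser)
euler_sample(model, x=1.0, num_steps=30)
print(model.nfe)  # 30: one full evaluation per step, on any hardware
```

The count depends only on how often the network is called, not on the GPU or the workload, which is what makes it a "harder" metric than latency.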
After clarifying the evaluation metric, we can briefly review the evolution of cache techniques. The generation process of diffusion models relies on iterative denoising, and redundancy exists at multiple granularities. Early explorations mainly focused on redundancy inside the model architecture. DeepCache [4] for UNet architectures and $\Delta$-DiT [5] for DiT architectures observed a similar phenomenon: deep semantic features change slowly between adjacent timesteps, while shallow features change more quickly. Based on this observation, we can skip the computation of some deep blocks and directly reuse their feature maps from the previous timestep, making the network “thinner” within a single inference step.
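A minimal sketch of this layer-level reuse, assuming a toy network with one shallow stage and one deep stage. The class `ToyLayerCachedNet` and the fixed refresh interval are illustrative assumptions, not the DeepCache or $\Delta$-DiT implementation.

```python
from typing import Optional


class ToyLayerCachedNet:
    """Toy two-stage network that refreshes its deep stage only every few steps."""

    def __init__(self, refresh_every: int) -> None:
        self.refresh_every = refresh_every
        self.cached_deep: Optional[float] = None
        self.shallow_calls = 0
        self.deep_calls = 0

    def shallow(self, x: float, t: float) -> float:
        self.shallow_calls += 1
        return 0.9 * x + 0.1 * t

    def deep(self, h: float) -> float:
        self.deep_calls += 1
        return 0.5 * h

    def forward(self, x: float, t: float, step: int) -> float:
        h = self.shallow(x, t)  # shallow features are always recomputed
        if self.cached_deep is None or step % self.refresh_every == 0:
            self.cached_deep = self.deep(h)  # refresh the deep features
        return h + self.cached_deep  # fresh or reused deep features


net = ToyLayerCachedNet(refresh_every=3)
x, num_steps = 1.0, 12
for step in range(num_steps):
    t = 1.0 - step / num_steps
    x = net.forward(x, t, step)

print(net.shallow_calls, net.deep_calls)  # 12 4
```

The network is still entered on all 12 steps; only the deep stage runs less often (4 times here), so each step gets cheaper while the step count stays the same.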
As DiT architectures became more common, researchers moved to a finer Token-Level granularity, with representative methods such as ToCa [6] and DuCa [7]. These methods identify redundant tokens and reduce computation in modules such as Attention through pruning or merging. It is worth noting that both Layer-Level and Token-Level methods reduce FLOPs within a single network forward pass; the total number of forward passes does not change. Therefore, NFE is not the best metric for evaluating these methods.
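To make the token-level idea concrete, here is a toy sketch in which only a fraction of tokens is recomputed and the rest reuse cached outputs. The selection rule (largest input drift since the token was last computed) and the names `token_fn` and `cached_layer` are assumptions for illustration, not the criterion used by ToCa or DuCa.

```python
from typing import List, Tuple


def token_fn(v: float) -> float:
    """Stand-in for the per-token work of an expensive layer."""
    return 2.0 * v + 1.0


def cached_layer(
    tokens: List[float],
    cache_in: List[float],
    cache_out: List[float],
    recompute_ratio: float,
) -> Tuple[List[float], int]:
    """Recompute only the tokens whose input drifted most; reuse the rest."""
    n = len(tokens)
    k = max(1, int(n * recompute_ratio))
    drift = [abs(tokens[i] - cache_in[i]) for i in range(n)]
    chosen = sorted(range(n), key=lambda i: drift[i], reverse=True)[:k]
    for i in chosen:
        cache_in[i] = tokens[i]
        cache_out[i] = token_fn(tokens[i])
    return list(cache_out), k


# Previous step: every token is computed and cached.
tokens_prev = [0.0, 1.0, 2.0, 3.0]
cache_in = list(tokens_prev)
cache_out = [token_fn(v) for v in tokens_prev]

# Current step: only one of four tokens is recomputed.
tokens_now = [0.0, 1.0, 2.5, 3.01]
out, computed = cached_layer(tokens_now, cache_in, cache_out, recompute_ratio=0.25)
print(out, computed)  # [1.0, 3.0, 6.0, 7.0] 1
```

The layer is still called at this step, so NFE is unchanged; only the per-call work drops from four token computations to one.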
From late 2024 to 2025, Step-Level Cache methods began to appear in quick succession, including TeaCache, MagCache, LeMiCa, and MeanCache. This marks a shift from granularity inside the model to timestep granularity.
The logic of Step-Level Cache is more direct: if the overall output changes very little between adjacent timesteps, why not skip the entire forward pass? By exploiting this “step-level redundancy,” the inference process becomes genuinely shorter. More importantly, the key question at this stage shifts from “what to cache” to “when to cache.”
Early cache attempts often used a simple Uniform Strategy, mechanically computing once every $K$ steps, with the main focus still on what to cache: tokens, layers, or blocks. The key idea of Step-Level Cache is to introduce a more reasonable Heuristic Strategy that dynamically decides whether to skip the current step based on a threshold. This transition from Static Uniform to Dynamic Heuristic lets the allocation of NFE better match the non-uniform nature of the generation process, and it is usually also easier to implement and more stable in practice.
In this context, NFE is a more reasonable standard for measuring the efficiency of Step-Level Cache.
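The difference between a uniform and a threshold-based schedule can be sketched as follows. Everything here is an assumption for illustration: the per-step `changes` values are invented, and the accumulate-and-reset rule is one simple way to turn a threshold into skip decisions, not the exact rule of any specific method. In practice the change indicator has to be estimated cheaply, without running the full network.

```python
from typing import List


def uniform_schedule(num_steps: int, k: int) -> List[bool]:
    """True = run the full network, False = reuse the cached output."""
    return [i % k == 0 for i in range(num_steps)]


def threshold_schedule(changes: List[float], threshold: float) -> List[bool]:
    """Skip steps until the accumulated change indicator reaches the threshold."""
    schedule = []
    accumulated = 0.0
    for i, change in enumerate(changes):
        accumulated += change
        if i == 0 or accumulated >= threshold:
            schedule.append(True)   # full forward pass, then reset the accumulator
            accumulated = 0.0
        else:
            schedule.append(False)  # reuse the cache
    return schedule


# Hypothetical indicator: the output changes quickly at both ends, slowly in the middle.
changes = [0.30, 0.20, 0.05, 0.04, 0.03, 0.03, 0.04, 0.05, 0.20, 0.30]

uniform = uniform_schedule(len(changes), k=2)
dynamic = threshold_schedule(changes, threshold=0.13)

print("uniform:", [i for i, full in enumerate(uniform) if full], "NFE =", sum(uniform))
print("dynamic:", [i for i, full in enumerate(dynamic) if full], "NFE =", sum(dynamic))
```

Both schedules spend 5 NFE, but the threshold rule places them at steps 0, 1, 5, 8, and 9, where the assumed indicator is large, while the uniform rule spreads them evenly.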

Figure 1. Differences between cache strategies in two common diffusion model architectures. UNet architectures usually cache the Feature itself, while DiT architectures more often cache the change/residual (Chen et al.). Why this difference appears in DiT architectures will be discussed later.
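The two reuse rules named in the caption can be written side by side. This is a toy sketch with a scalar "block" of the form output = input + update; the function `block` and the numbers are hypothetical, and the example only shows the mechanics of each rule.

```python
def block(x: float) -> float:
    """Hypothetical residual-style block: output = input + update."""
    return x + 0.1 * x * x


x_prev, x_now = 1.0, 0.9

# Full computation at the previous timestep.
y_prev = block(x_prev)

# Option 1: cache the feature itself and reuse it unchanged.
feature_reuse = y_prev

# Option 2: cache the residual (output minus input) and add it to the new input.
cached_residual = y_prev - x_prev
residual_reuse = x_now + cached_residual

print(feature_reuse, residual_reuse, block(x_now))
```

Which option works better in a given architecture is the question the figure raises; this toy does not settle it.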
Of course, heuristic strategies are not perfect. They are essentially greedy and short-sighted. On the one hand, the error introduced by caching will continue to propagate through the generation process and may gradually accumulate or even amplify, which means that a better global strategy may exist. On the other hand, threshold-based decisions introduce clear uncertainty. The mapping between the threshold and the actual NFE is often an opaque black box: if we set the threshold to 0.24 or 0.36, how many NFEs will that actually correspond to? For different prompts, will the same threshold lead to unstable computation cost? These questions are all important in engineering deployment.
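The opacity of the threshold-to-NFE mapping is easy to reproduce in a toy setting. The sketch below assumes a simple accumulate-and-reset skip rule and two invented change profiles standing in for two prompts; `nfe_for_threshold` and the profile values are hypothetical.

```python
from typing import Dict, List


def nfe_for_threshold(changes: List[float], threshold: float) -> int:
    """Count full forward passes under an accumulate-and-reset skip rule."""
    nfe = 0
    accumulated = 0.0
    for i, change in enumerate(changes):
        accumulated += change
        if i == 0 or accumulated >= threshold:
            nfe += 1
            accumulated = 0.0
    return nfe


# Two invented per-step change profiles, standing in for two different prompts.
profiles: Dict[str, List[float]] = {
    "prompt_a": [0.30, 0.20, 0.05, 0.04, 0.03, 0.03, 0.04, 0.05, 0.20, 0.30],
    "prompt_b": [0.30, 0.25, 0.20, 0.15, 0.10, 0.10, 0.15, 0.20, 0.25, 0.30],
}

for threshold in (0.24, 0.36):
    counts = {name: nfe_for_threshold(c, threshold) for name, c in profiles.items()}
    print(f"threshold = {threshold}: {counts}")
```

In this toy, a threshold of 0.24 costs 4 NFE for one profile and 6 for the other, and 0.36 costs 3 and 4. The same threshold therefore does not pin down the compute budget, which is the deployment concern raised above.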
From a more idealized perspective, we may need to ask further questions: where is the upper bound of Cache? Under a fixed NFE budget, what is the best quality a cache strategy can achieve? Can we further improve this upper bound through other approaches? For example, under the current DiT architecture, is caching based on $\Delta$ already the most reasonable choice? These questions are still worth studying.
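The "best quality under a fixed NFE budget" question can at least be posed exactly in a toy setting. The sketch below brute-forces every schedule with a given number of full steps on a scalar ODE and compares the best one with a uniform schedule. The `velocity` function, the scalar error measure, and the exhaustive search are all assumptions for illustration. This is not a published method, and exhaustive search does not scale to real models or perceptual metrics.

```python
from itertools import combinations
from typing import List, Tuple


def velocity(x: float, t: float) -> float:
    """Hypothetical stand-in for the denoiser output."""
    return -x * (1.0 + 4.0 * t * t)


def sample(schedule: List[bool], x0: float = 1.0) -> float:
    """Euler sampling; on skipped steps the last computed velocity is reused."""
    num_steps = len(schedule)
    dt = 1.0 / num_steps
    x, v = x0, 0.0
    for i, compute in enumerate(schedule):
        t = 1.0 - i * dt
        if compute:
            v = velocity(x, t)  # full evaluation, costs one NFE
        x = x + dt * v
    return x


def final_error(computed: Tuple[int, ...], num_steps: int) -> float:
    """Distance between the cached result and the all-steps reference."""
    reference = sample([True] * num_steps)
    schedule = [i in computed for i in range(num_steps)]
    return abs(sample(schedule) - reference)


def best_schedule(num_steps: int, budget: int) -> Tuple[float, Tuple[int, ...]]:
    """Exhaustively search all schedules with exactly `budget` full steps."""
    # Step 0 is always computed so that a velocity exists to reuse.
    candidates = ((0, *rest) for rest in combinations(range(1, num_steps), budget - 1))
    return min((final_error(c, num_steps), c) for c in candidates)


num_steps, budget = 10, 5
best_err, best_steps = best_schedule(num_steps, budget)
uniform_err = final_error((0, 2, 4, 6, 8), num_steps)
print(best_steps, best_err, uniform_err)
```

Because the uniform schedule is one of the candidates, the searched schedule can never be worse than it at the same budget; how large the gap is in real models is exactly the open question.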
To be continued…
[1] Song, Jiaming, Chenlin Meng, and Stefano Ermon. “Denoising diffusion implicit models.” arXiv preprint arXiv:2010.02502. 2020.
[2] Lu, Cheng, et al. “DPM-Solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps.” Advances in Neural Information Processing Systems 35. 2022: 5775-5787.
[3] Zhou, Zhenyu, et al. “Fast ODE-based sampling for diffusion models in around 5 steps.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.
[4] Ma, Xinyin, Gongfan Fang, and Xinchao Wang. “DeepCache: Accelerating diffusion models for free.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.
[5] Chen, Pengtao, et al. “$\Delta$-DiT: A Training-Free Acceleration Method Tailored for Diffusion Transformers.” arXiv preprint arXiv:2406.01125. 2024.
[6] Zou, Chang, et al. “Accelerating diffusion transformers with token-wise feature caching.” arXiv preprint arXiv:2410.05317. 2024.
[7] Zou, Chang, et al. “Accelerating diffusion transformers with dual feature caching.” arXiv preprint arXiv:2412.18911. 2024.