Rethinking I/O Caching for Large Language Model Inference on Resource-Constrained Mobile Platforms
Inference for large language models (LLMs) has traditionally been relegated to remote servers, leaving mobile devices as thin clients. Recent advances in mobile GPUs and NPUs have made on-device inference increasingly feasible, particularly for privacy-sensitive and personalized applications. However, executing LLMs directly on resource-constrained devices exposes severe I/O bottlenecks, as repeated accesses to large weight files can overwhelm limited memory and storage bandwidth. Prior studies have focused on inference-internal mechanisms such as KV caching, while the role of the host OS buffer cache remains underexplored. This paper closes that gap with a file-level trace analysis of real-world mobile LLM applications and identifies three characteristic access patterns: (1) one-time sequential scans during initialization, (2) persistent hot sets (e.g., tokenizers, metadata, indices), and (3) recurring loop accesses to model weight files. Guided by these observations, we propose LLM-aware buffer cache strategies and derive cache-sizing guidelines that relate loop size, hot-set coverage, and storage bandwidth. We further compare smartwatch-class and smartphone-class platforms to clarify feasible model sizes and practical hardware prerequisites for local inference. Our results provide system-level guidance for I/O subsystem design that enables practical on-device LLM inference in future mobile and IoT devices.
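
The abstract says the cache-sizing guidelines relate loop size, hot-set coverage, and storage bandwidth but does not state the relation itself. The sketch below is a minimal back-of-the-envelope interpretation of how those three quantities might interact, assuming (as a common rule of thumb, not a claim from the paper) that an LRU-managed cache smaller than a cyclically scanned working set yields almost no hits; the function name per_iteration_io_seconds and all numeric values are illustrative assumptions.

```python
# Hypothetical illustration (not from the paper): estimate the extra storage
# I/O incurred per inference loop iteration when the buffer cache cannot hold
# the looped weight files in addition to the persistently hot files.

def per_iteration_io_seconds(loop_size_gb: float,
                             hot_set_gb: float,
                             cache_gb: float,
                             storage_bw_gbps: float) -> float:
    """Estimate storage-read time per inference loop iteration.

    loop_size_gb:    total size of weight files touched by one loop iteration
    hot_set_gb:      persistently hot files (tokenizer, metadata, indices)
    cache_gb:        buffer cache available to the LLM workload
    storage_bw_gbps: sustained sequential read bandwidth of the storage device
    """
    # Pin the hot set first; whatever cache remains can hold weight data.
    cache_for_weights = max(cache_gb - hot_set_gb, 0.0)

    if cache_for_weights >= loop_size_gb:
        # The whole loop fits: after the first pass, reads hit in memory.
        uncached_gb = 0.0
    else:
        # The loop does not fit: under LRU-style eviction a cyclic scan keeps
        # evicting blocks just before they are reused, so assume
        # (pessimistically) the entire loop is re-read each iteration.
        uncached_gb = loop_size_gb

    return uncached_gb / storage_bw_gbps


if __name__ == "__main__":
    # Example numbers (assumptions, not measurements from the paper):
    # a 3.5 GB quantized model, 0.2 GB hot set, 2 GB of usable buffer cache,
    # and ~1.5 GB/s sustained read bandwidth for smartphone-class storage.
    t = per_iteration_io_seconds(loop_size_gb=3.5, hot_set_gb=0.2,
                                 cache_gb=2.0, storage_bw_gbps=1.5)
    print(f"estimated extra I/O per iteration: {t:.2f} s")
```

Under this reading, the practical guideline would be to size the cache to cover the hot set plus the full weight loop, or otherwise to replace LRU-style eviction with a loop-aware policy (e.g., MRU-style eviction or hot-set pinning) so cyclic weight scans do not evict files they are about to reuse; the specific strategies the authors propose may differ.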