Rethinking I/O Caching for Large Language Model Inference on Resource-Constrained Mobile Platforms
Inference for large language models (LLMs) has traditionally been relegated to remote servers, leaving mobile devices as thin clients. Recent advances in mobile GPUs and NPUs have made on-device inference increasingly feasible, particularly for privacy-sensitive and personalized applications. However, executing LLMs directly on resource-constrained devices exposes severe I/O bottlenecks, as repeated accesses to large weight files can overwhelm limited memory and storage bandwidth. Prior studies have focused on inference-internal mechanisms such as KV caching, while the role of the host OS buffer cache remains underexplored. This paper closes that gap with a file-level trace analysis of real-world mobile LLM applications and identifies three characteristic access patterns: (1) one-time sequential scans during initialization, (2) persistent hot sets (e.g., tokenizers, metadata, indices), and (3) recurring loop accesses to model weight files. Guided by these observations, we propose LLM-aware buffer cache strategies and derive cache-sizing guidelines that relate loop size, hot-set coverage, and storage bandwidth. We further compare smartwatch-class and smartphone-class platforms to clarify feasible model sizes and practical hardware prerequisites for local inference. Our results provide system-level guidance for I/O subsystem design that enables practical on-device LLM inference in future mobile and IoT devices.
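
The abstract says the cache-sizing guidelines relate loop size, hot-set coverage, and storage bandwidth but does not state the relation itself. The sketch below is a minimal back-of-the-envelope interpretation of how those three quantities might interact, assuming (as a common rule of thumb, not a claim from the paper) that an LRU-managed cache smaller than a cyclically scanned working set yields almost no hits; the function name per_iteration_io_seconds and all numeric values are illustrative assumptions.

```python
# Hypothetical illustration (not from the paper): estimate the extra storage
# I/O incurred per inference loop iteration when the buffer cache cannot hold
# the looped weight files in addition to the persistently hot files.

def per_iteration_io_seconds(loop_size_gb: float,
                             hot_set_gb: float,
                             cache_gb: float,
                             storage_bw_gbps: float) -> float:
    """Estimate storage-read time per inference loop iteration.

    loop_size_gb:    total size of weight files touched by one loop iteration
    hot_set_gb:      persistently hot files (tokenizer, metadata, indices)
    cache_gb:        buffer cache available to the LLM workload
    storage_bw_gbps: sustained sequential read bandwidth of the storage device
    """
    # Pin the hot set first; whatever cache remains can hold weight data.
    cache_for_weights = max(cache_gb - hot_set_gb, 0.0)

    if cache_for_weights >= loop_size_gb:
        # The whole loop fits: after the first pass, reads hit in memory.
        uncached_gb = 0.0
    else:
        # The loop does not fit: under LRU-style eviction a cyclic scan keeps
        # evicting blocks just before they are reused, so assume
        # (pessimistically) the entire loop is re-read each iteration.
        uncached_gb = loop_size_gb

    return uncached_gb / storage_bw_gbps


if __name__ == "__main__":
    # Example numbers (assumptions, not measurements from the paper):
    # a 3.5 GB quantized model, 0.2 GB hot set, 2 GB of usable buffer cache,
    # and ~1.5 GB/s sustained read bandwidth for smartphone-class storage.
    t = per_iteration_io_seconds(loop_size_gb=3.5, hot_set_gb=0.2,
                                 cache_gb=2.0, storage_bw_gbps=1.5)
    print(f"estimated extra I/O per iteration: {t:.2f} s")
```

Under this reading, the practical guideline would be to size the cache to cover the hot set plus the full weight loop, or otherwise to replace LRU-style eviction with a loop-aware policy (e.g., MRU-style eviction or hot-set pinning) so cyclic weight scans do not evict files they are about to reuse; the specific strategies the authors propose may differ.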