Title:
Data-Juicer 2.0: Cloud-Scale Adaptive Data Processing for and with Foundation Models
Publication Year:
2024
Document Type:
Report Working Paper
Accession Number:
edsarx.2501.14755
Database:
arXiv

Further information

Foundation models demand advanced data processing for their vast, multimodal datasets. However, traditional frameworks struggle with the unique complexities of multimodal data. In response, we present Data-Juicer 2.0, a data processing system backed by 100+ data processing operators spanning text, image, video, and audio modalities, supporting more critical tasks including data analysis, synthesis, annotation, and foundation model post-training. With seamless compatibility and dedicated optimization for popular dataset hubs like Hugging Face and computing engines like Ray, it improves upon its predecessor in terms of usability, efficiency, and programmability. It features an easily accessible user interface layer that supports decoupled Python interactions, RESTful APIs, and conversational commands. Its new runtime layer offers adaptive execution across diverse scales and environments, abstracting away system complexities. Extensive empirical evaluations demonstrate Data-Juicer 2.0's remarkable performance and scalability, highlighting its capability to efficiently process TB-level data with 10k+ CPU cores. The system is publicly available and has been widely adopted in diverse research fields and real-world products such as Alibaba Cloud PAI. We actively maintain the system and share practical insights to foster research and applications of next-generation foundation models.
Accepted by NeurIPS 2025 (Spotlight). 43 pages, 16 figures, 4 tables
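
The abstract describes a system built from composable data processing operators. As a rough illustration of that general idea only, the following minimal sketch chains a transforming operator and a filtering operator over text samples. It is not Data-Juicer's API: the names `Mapper`, `Filter`, `run_pipeline`, `normalize_whitespace`, and `long_enough` are hypothetical, and the sketch runs in plain Python on a single machine rather than on Ray or a dataset hub.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, List

# One record of the dataset; here only a "text" field is assumed.
Sample = dict


@dataclass
class Mapper:
    """Hypothetical operator that transforms every sample."""
    fn: Callable[[Sample], Sample]

    def run(self, samples: Iterable[Sample]) -> Iterator[Sample]:
        return (self.fn(s) for s in samples)


@dataclass
class Filter:
    """Hypothetical operator that keeps only samples passing a predicate."""
    keep: Callable[[Sample], bool]

    def run(self, samples: Iterable[Sample]) -> Iterator[Sample]:
        return (s for s in samples if self.keep(s))


def run_pipeline(samples: Iterable[Sample], operators: list) -> List[Sample]:
    """Apply the operators in order; evaluation is lazy until the final list()."""
    for op in operators:
        samples = op.run(samples)
    return list(samples)


def normalize_whitespace(sample: Sample) -> Sample:
    # Collapse runs of whitespace and trim both ends.
    return {**sample, "text": " ".join(sample["text"].split())}


def long_enough(sample: Sample, min_words: int = 3) -> bool:
    # Keep samples with at least `min_words` whitespace-separated words.
    return len(sample["text"].split()) >= min_words


pipeline = [Mapper(normalize_whitespace), Filter(long_enough)]
data = [
    {"text": "  hello   world "},
    {"text": "a  much longer   sentence here"},
]
result = run_pipeline(data, pipeline)
print(result)
```

In this sketch the first sample is dropped because it has only two words after normalization, and the second is kept with its whitespace collapsed. A production system would additionally need batching, multimodal fields, and distributed execution, none of which are modeled here.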