Oasis: Data Curation and Assessment System for Pretraining of Large Language Models

Tong Zhou, Yubo Chen, Pengfei Cao, Kang Liu, Shengping Liu, Jun Zhao

Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence
Demo Track. Pages 8855-8859. https://doi.org/10.24963/ijcai.2024/1048

Data is one of the most critical elements in building a large language model. However, existing systems either fail to support a customizable corpus curation pipeline or neglect to leverage comprehensive corpus assessment for iterative optimization of the curation process. To this end, we present Oasis, a pretraining corpus curation and assessment platform: a one-stop system for data quality improvement and quantification with user-friendly interactive interfaces. Specifically, the interactive modular rule filter module devises customized rules according to explicit feedback. The debiased neural filter module builds its quality classification dataset in a negative-centric manner to remove undesired bias. The adaptive document deduplication module executes large-scale deduplication with limited memory resources. These three parts constitute the customized data curation module. In the holistic data assessment module, a corpus can be assessed from both local and global views, using three evaluation methods: human evaluation, GPT-4, and heuristic metrics. We exhibit a complete process of using Oasis to curate and assess pretraining data. In addition, an 800GB bilingual corpus curated by Oasis is publicly released.
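To illustrate the kind of memory-bounded, large-scale deduplication the abstract refers to, below is a minimal sketch of MinHash-based near-duplicate removal. This is a common technique for this problem, not necessarily Oasis's actual implementation; all function names and parameters here are illustrative assumptions. Only fixed-size signatures, rather than full documents, are retained in memory during comparison.

```python
import hashlib

def shingles(text, k=5):
    """Set of character k-grams of a document."""
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}

def minhash(doc, num_perm=64):
    """MinHash signature: for each of num_perm seeded hash functions,
    keep the minimum hash value over the document's shingles."""
    grams = shingles(doc)
    sig = []
    for seed in range(num_perm):
        sig.append(min(
            int.from_bytes(hashlib.md5(f"{seed}:{g}".encode()).digest()[:8], "big")
            for g in grams
        ))
    return sig

def est_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def dedup(docs, threshold=0.8):
    """Keep a document only if its signature is not a near-duplicate of any
    already-kept document; memory scales with signatures, not corpus text."""
    kept, sigs = [], []
    for doc in docs:
        sig = minhash(doc)
        if all(est_jaccard(sig, s) < threshold for s in sigs):
            kept.append(doc)
            sigs.append(sig)
    return kept
```

In production-scale pipelines the pairwise comparison step is typically replaced by locality-sensitive hashing over signature bands, so that candidate pairs are found in roughly linear time.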
Keywords:
Natural Language Processing: NLP: Tools
Natural Language Processing: NLP: Applications
Natural Language Processing: NLP: Language models