WanJuanSiLu: A High-Quality Open-Source Webtext Dataset for Low-Resource
Languages
This paper introduces the open-source dataset WanJuanSiLu, designed to provide high-quality training corpora for low-resource languages, thereby advancing the research and development of multilingual models. To achieve this, we have developed a systematic data processing framework tailored for low-resource languages. This framework encompasses key stages such as data extraction, corpus …