WanJuanSiLu: A High-Quality Open-Source Webtext Dataset for Low-Resource Languages

This paper introduces WanJuanSiLu, an open-source dataset designed to provide high-quality training corpora for low-resource languages and thereby advance the research and development of multilingual models. To this end, we developed a systematic data processing framework tailored to low-resource languages. The framework encompasses key stages such as data extraction, corpus …