Extending the Pre-Training of BLOOM for Improved Support of Traditional Chinese: Models, Methods and Results

Bibliographic Details
Title: Extending the Pre-Training of BLOOM for Improved Support of Traditional Chinese: Models, Methods and Results
Authors: Ennen, Philipp, Hsu, Po-Chun, Hsu, Chan-Jan, Liu, Chang-Le, Wu, Yen-Chen, Liao, Yin-Hsiang, Lin, Chin-Tung, Shiu, Da-Shan, Ma, Wei-Yun
Publication Year: 2023
Collection: Computer Science
Subject Terms: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Description: In this paper we present the multilingual language model BLOOM-zh, which features enhanced support for Traditional Chinese. BLOOM-zh has its origins in the open-source BLOOM models released by BigScience in 2022. Starting from the released models, we extended the pre-training of BLOOM with an additional 7.4 billion tokens in Traditional Chinese and English, covering a variety of domains such as news articles, books, encyclopedias, educational materials, and spoken language. To demonstrate the properties of BLOOM-zh, we evaluate its performance on both existing and newly created benchmarks. BLOOM-zh outperforms its predecessor on most Traditional Chinese benchmarks while maintaining its English capability. We release all our models to the research community.
Document Type: Working Paper
Open Access: http://arxiv.org/abs/2303.04715
Accession Number: edsarx.2303.04715
Database: arXiv