Pioneering Local AI: TAIDE’s Quest for a Trustworthy Taiwanese Language Model

Richard Tzong-Han Tsai 蔡宗翰

Professor,

Department of Computer Science and Information Engineering,

National Central University

Abstract

The mission of the TAIDE team in the National Science and Technology Council's Innovation Center has been clear from the outset: to develop a large language model tailored to Taiwan’s linguistic nuances while upholding high standards of data integrity and security. In a field dominated by major tech players, our objective has been to secure a digital voice for Taiwan, protecting our sovereignty in the digital space and safeguarding sensitive information.

Although we work with only a fraction of the budget available to the global giants behind systems such as ChatGPT, our resolve has remained strong. We set out to create a reliable, trustworthy model that authentically speaks and writes in the Taiwanese vernacular. This required assembling a comprehensive Taiwanese corpus, developing dependable evaluation metrics, and coordinating the efforts of diverse institutions.

Our path was fraught with challenges. We had to convince content producers to share their data, and we had to ensure that every source we used complied with strict legal standards so that unreliable or biased information did not enter the corpus. Despite these hurdles, our team's dedication yielded significant accomplishments: we launched the widely praised commercial 7B model, and we used Llama 3 to train a new preview model shortly after Llama 3 was released, demonstrating our technical capabilities.

This development process showcased our technical expertise and reinforced our commitment to advancing Taiwan's AI capabilities responsibly. As we move forward, our focus remains on nurturing an ecosystem in which technology supports local industry growth and also embodies and promotes societal values, positioning Taiwan as a formidable player in the global AI landscape.