Saving Endangered Languages with a Novel Three-Way Cycle Cross-Lingual Zero-Shot Sentence Alignment

Authors

Eugene Hwang, Blue Core Labs, USA

Abstract

Sentence classification, including sentiment analysis, hate speech detection, tagging, and urgency detection, is one of the most promising and important subjects in the field of Natural Language Processing. With the advent of artificial neural networks, researchers typically rely on models well suited to processing natural language, including RNNs, LSTMs, and BERT. However, these models require a huge amount of corpus data to attain satisfactory accuracy. This is rarely a problem for researchers working with major languages such as English and Chinese, for which a myriad of other researchers and datasets are available on the Internet. Other languages, such as Korean, suffer from a scarcity of corpus data, and there are even more unnoticed languages in the world. One could attempt transfer learning for these languages, but using a model trained on an English corpus without any modification can be sub-optimal for other languages. This paper presents a way to align cross-lingual sentence embeddings in a shared embedding space using an additional projection layer and bilingual parallel data, which means this layer can be reused for other sentence classification tasks without further fine-tuning. To validate the power of the method, a further experiment was conducted on one of the endangered languages, the Jeju language. To the best of my knowledge, this is the first attempt to apply zero-shot inference not just to a minor language, but to an endangered one.
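The general idea of the projection-layer alignment described above can be sketched as follows. This is a minimal illustration, not the paper's actual three-way cycle architecture: it assumes a frozen sentence encoder has already produced fixed-size embeddings for bilingual parallel sentence pairs, and trains only a linear projection (a hypothetical `ProjectionAligner`) that pulls the source-language embedding toward its target-language counterpart with a cosine-similarity loss. The layer sizes, loss choice, and training schedule are all assumptions for illustration.

```python
import torch
import torch.nn as nn

class ProjectionAligner(nn.Module):
    """Hypothetical sketch: a linear projection that maps sentence
    embeddings of a low-resource language into the embedding space
    of a high-resource language. The underlying encoder stays frozen,
    so this layer can be reused by any downstream classification head."""
    def __init__(self, dim: int = 768):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(x)

# Random stand-ins for encoder outputs of bilingual parallel pairs;
# in practice these would come from a frozen BERT-style encoder.
torch.manual_seed(0)
src = torch.randn(32, 768)  # low-resource-language sentence embeddings
tgt = torch.randn(32, 768)  # parallel high-resource-language embeddings

model = ProjectionAligner()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

losses = []
for _ in range(100):
    opt.zero_grad()
    # Cosine-similarity loss: push projected source embeddings
    # toward their parallel target embeddings.
    loss = 1 - nn.functional.cosine_similarity(model(src), tgt).mean()
    loss.backward()
    opt.step()
    losses.append(loss.item())
```

Because only the small projection layer is trained, the approach needs far less data than fine-tuning the full encoder, which is what makes it attractive for low-resource and endangered languages.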

Keywords

Natural Language Processing, Large Language Models, Transfer Learning, Cross-lingual Zero-shot, Embedding Alignment, BERT, Endangered Languages, Low-resourced Languages.

Full Text  Volume 13, Number 19