| | TranSentCut - transformer based Thai sentence segmentation |
---|---|
DOI | |
Creator | 1. Sumeth Yuenyong 2. Virach Sornlertlamvanich |
Title | TranSentCut - transformer based Thai sentence segmentation |
Publisher | Research and Development Office, Prince of Songkla University |
Publication Year | 2022 (B.E. 2565) |
Journal Title | Songklanakarin Journal of Science and Technology (SJST) |
Journal Vol. | 44 |
Journal No. | 3 |
Page no. | 852-860 |
Keyword | sentence segmentation, natural language processing, neural network, transformer model |
URL Website | https://rdo.psu.ac.th/sjst/index.php |
ISSN | 0125-3395 |
Abstract | We propose TranSentCut, a sentence segmentation model for Thai based on the transformer architecture. Sentence segmentation for Thai is difficult because, unlike in other languages, there is no end-of-sentence marker. Existing methods make use of POS tags, which are not easy to label and must be provided for every word in the data. This limits the applicability and performance of sentence segmentation on open-domain text, because the only high-quality Thai corpus that has sentence boundary and POS labels was constructed mostly from academic articles. Our approach uses only raw text for training, and the only labelling required is to put each sentence on its own line in a text file. This makes new datasets much easier to construct. Comparison with existing methods shows that our proposed model is competitive with the most recent state of the art when evaluated on in-domain texts, and improves significantly over existing publicly available libraries when applied to out-of-domain input texts. |
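
The one-sentence-per-line labelling described in the abstract can be illustrated with a small sketch. This is not the authors' preprocessing code: the function name `build_boundary_labels` and the choice to join sentences with a single space are assumptions made for illustration only. It shows how such a file could be turned into running text plus the character offsets of the spaces that mark sentence boundaries, while spaces inside a sentence stay unlabelled.

```python
from typing import List, Tuple


def build_boundary_labels(lines: List[str]) -> Tuple[str, List[int]]:
    """Join one-sentence-per-line input into running text.

    Returns the joined text and the character offsets of the spaces
    that separate sentences. Hypothetical helper, not from the paper.
    """
    # Drop blank lines and surrounding whitespace.
    sentences = [ln.strip() for ln in lines if ln.strip()]

    parts: List[str] = []
    boundaries: List[int] = []
    pos = 0
    for i, sentence in enumerate(sentences):
        if i > 0:
            boundaries.append(pos)  # offset of the separating space
            parts.append(" ")
            pos += 1
        parts.append(sentence)
        pos += len(sentence)
    return "".join(parts), boundaries


# Example with placeholder ASCII "sentences"; a space inside a sentence
# (offset 2 here) is not recorded as a boundary.
text, boundaries = build_boundary_labels(["ab cd", "ef", "  ", "g"])
```

Every space in `text` that is absent from `boundaries` is a negative example, which is why no word-level annotation such as POS tags is needed to build a dataset in this format.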