| Thai Text Compression Algorithm Employing Word-Formation Creation | |
|---|---|
| DOI | |
| Creator | Chouvalit Khancome |
| Title | Thai Text Compression Algorithm Employing Word-Formation Creation |
| Contributor | Prayat Le-wan |
| Publisher | The Association of Council of IT Deans (CITT) |
| Publication Year | 2567 BE (2024 CE) |
| Journal Title | Journal of Information Science and Technology |
| Journal Vol. | 14 |
| Journal No. | 1 |
| Page no. | 9-21 |
| Keyword | Text Compression, Bit-Level Compression, Dictionary-Based Compression, Thai Vowel Patterns, Thai Word Formation |
| URL Website | https://tci-thaijo.org/index.php/JIST |
| Website title | Journal of Information Science and Technology |
| ISSN | 2651-1053 |
| Abstract | Lossless text compression is a fundamental topic in computer science and is crucial for minimizing the storage required for large datasets. It has been developed continuously and has consistently attracted researchers' interest. This article presents a new, highly efficient text compression method designed specifically for Thai text. The method builds a dictionary-like structure, termed the "Pre-Processing Section", from the word-formation patterns of the Thai language; this structure is used to look up terms during both compression and decompression. Compressed data are stored in a binary file by the newly developed Word-Formation Thai Text Compression Algorithm (WFTTCA). In theory, the method achieves compression rates of 37.50% to 79.17% for ASCII-TIS620 encoding, with a maximum average of 63.75%; 68.75% to 89.58% for Unicode encoding, with a maximum average of 81.88%; and 79.17% to 93.06% for UTF-8 encoding, with a maximum average of 87.92%. These rates correspond to compression ratios of 3.51 to 10.50 relative to the original data size. In experiments with a program implementing the new method, run on real Thai text randomly sampled from news websites in sizes from 1 KB to 100 KB, the program compressed ASCII-TIS620 data by 78.09% to 84.55%, Unicode data by 81.50% to 86.62%, and UTF-8 data by 88.09% to 91.11%. Compared with popular current compression software, the program achieves significantly higher compression, in terms of both percentage compression and compression ratio. |
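
The theoretical ranges quoted in the abstract can be checked with a little arithmetic. The sketch below is not taken from the article: it assumes that a Thai character occupies 8 bits in TIS-620, 16 bits in "Unicode" (read here as a two-byte encoding such as UTF-16), and 24 bits in UTF-8, and it uses two hypothetical compressed costs, 5 bits per character and 5 bits per three characters, chosen only because they happen to reproduce the quoted bounds. The abstract does not state these per-character costs, so they should be read as an inference rather than as the WFTTCA design.

```python
def savings_percent(original_bits: float, compressed_bits: float) -> float:
    """Percentage of space saved: 100 * (1 - compressed / original)."""
    return 100.0 * (1.0 - compressed_bits / original_bits)


def ratio_from_percent(percent_saved: float) -> float:
    """Compression ratio (original size / compressed size) for a given saving."""
    return 100.0 / (100.0 - percent_saved)


# Assumed storage cost of one Thai character in each encoding, in bits.
BITS_PER_CHAR = {"TIS-620": 8, "UTF-16": 16, "UTF-8": 24}

# Hypothetical compressed costs per character (not stated in the source);
# they are picked because they match the abstract's theoretical bounds.
WORST_CASE_BITS = 5.0        # 5 bits for one character
BEST_CASE_BITS = 5.0 / 3.0   # 5 bits shared by three characters

for name, bits in BITS_PER_CHAR.items():
    low = savings_percent(bits, WORST_CASE_BITS)
    high = savings_percent(bits, BEST_CASE_BITS)
    print(f"{name}: {low:.2f}% to {high:.2f}%")
```

Under these assumptions the loop prints 37.50% to 79.17% for TIS-620, 68.75% to 89.58% for UTF-16, and 79.17% to 93.06% for UTF-8, matching the theoretical figures in the abstract. `ratio_from_percent` converts a saving into an original-to-compressed size ratio; for example, a 50% saving is a ratio of 2.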