Malay Part of Speech Tagging Using Ruled-Based Approach

Nur Ashikin Halid, Nazlia Omar

Abstract


Part-of-speech (POS) tagging has been widely studied using a variety of approaches, particularly for European languages. It is more challenging for Asian languages, especially Malay, which has absorbed elements from other languages such as English and Arabic. Common issues in POS tagging include ambiguous words and unknown words. In addition, the lack of rules in existing work is a major problem in Malay POS tagging. This research therefore aims to develop new rules for Malay POS tagging and to compare the performance of the resulting tagger with the existing gold standard. The process begins with the collection and selection of a corpus from secondary data, obtained from online daily news covering several domains. Next, the raw article text is pre-processed with sentence splitting and tokenization to generate an unlabeled corpus. A POS tag dictionary is also constructed to form a lexicon consisting only of root words. Rule development involves specifying suitable rules for each type of POS tag and finding the best rule ordering for each type. A total of 30 rules including affixation rules and 16 word type relations were developed in this process. The evaluation tests the precision of the developed POS tagger and identifies the best rule ordering. The tagging result is compared with the existing gold standard. Overall, the test showed good results, with an accuracy of 93.06% compared to the gold standard performance of 77.17%. Hence, this research achieved better accuracy than the gold standard, and it shows that the addition of new rules and rule ordering are among the factors that contributed to higher precision in tagging the Malay corpus.
For future studies, compound words should be taken into account because they appear frequently in most news sources. In addition, a corpus from social media sources could be used, because information on social media spreads quickly and is up to date, even though the language in such sources is mostly informal and raises noisy data issues.
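
The pipeline outlined in the abstract (sentence splitting, tokenization, root-word lexicon lookup, ordered affixation rules, word type relation rules, and accuracy against a gold standard) can be illustrated with a minimal Python sketch. This is not the authors' tagger: the lexicon entries, the tag labels (PRON, NOUN, VERB, PREP, PUNCT, UNK), the three affix rules, and the single relation rule are all assumptions chosen for illustration.

```python
import re

# Hypothetical root-word lexicon; entries and tag labels are illustrative only.
LEXICON = {
    "saya": "PRON", "buku": "NOUN", "baca": "VERB",
    "makan": "VERB", "di": "PREP", "rumah": "NOUN",
}

# Hypothetical ordered affixation rules: (kind, affix, tag).
# The first matching rule wins, so the ordering affects the result.
AFFIX_RULES = [
    ("prefix", "mem", "VERB"),
    ("prefix", "ber", "VERB"),
    ("suffix", "an", "NOUN"),
]


def split_sentences(text):
    """Split raw text after sentence-final punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Lowercase and split into word tokens and punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence.lower())


def tag_token(token, rules=AFFIX_RULES):
    """Tag one token: lexicon lookup first, then affix rules in order."""
    if token in LEXICON:
        return LEXICON[token]
    if not token.isalnum():
        return "PUNCT"
    for kind, affix, tag in rules:
        if len(token) <= len(affix):
            continue
        if kind == "prefix" and token.startswith(affix):
            return tag
        if kind == "suffix" and token.endswith(affix):
            return tag
    return "UNK"


def apply_relation_rules(tagged):
    """Hypothetical word type relation: an unknown word after a preposition is a noun."""
    result = list(tagged)
    for i in range(1, len(result)):
        word, tag = result[i]
        if tag == "UNK" and result[i - 1][1] == "PREP":
            result[i] = (word, "NOUN")
    return result


def tag_text(text, rules=AFFIX_RULES):
    """Run the full pipeline and return one list of (token, tag) pairs per sentence."""
    sentences = []
    for sentence in split_sentences(text):
        tagged = [(tok, tag_token(tok, rules)) for tok in tokenize(sentence)]
        sentences.append(apply_relation_rules(tagged))
    return sentences


def accuracy(predicted, gold):
    """Fraction of predicted tags that match the gold-standard tags."""
    pairs = list(zip(predicted, gold))
    if not pairs:
        return 0.0
    return sum(1 for p, g in pairs if p == g) / len(pairs)
```

Calling `tag_text("Saya membaca buku. Saya makan makanan di sekolah.")` exercises each stage: `membaca` and `makanan` are not in the lexicon and fall through to the affix rules, while `sekolah` is resolved only by the relation rule. Because the first matching affix rule wins, reordering `AFFIX_RULES` can change the output, which is one simple way to see why rule ordering matters in a tagger of this kind.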


Keywords


Malay Part of Speech Tagging, Ruled-Based Approach, Part of Speech Rule, Word Type Relations Rule.



e-ISSN : 2289-2192

For any inquiry regarding our journal please contact our editorial board by email apjitm@ukm.edu.my