Domain-specific Stop Words in Malaysian Parliamentary Debates 1959 – 2018

Anis Nadiah Che Abdul Rahman, Imran Ho Abdullah, Intan Safinaz Zainudin, Sabrina Tiun, Azhar Jaludin

Abstract


Stop word removal is an essential step in Natural Language Processing and text analysis. Existing work on Malay stop words is based on standard Malay and on Malay translations of Quranic/Arabic texts. Consequently, no domain-specific stop word list is available, and the existing lists are ill-suited to processing Malay parliamentary discourse. In this paper, we propose a semantic approach to identifying and removing function words in Malay, in conventional Malay spelling, and in English when analysing a time-series corpus, the Malaysian Hansard Corpus (MHC), in order to extract a domain-specific Malay stop word list. The study combined three methods: the Z-method of most frequently occurring words, words that appear only once, and the classic method. The dataset evaluated covered Parliament 1 (1959) to Parliament 13 (2018). The stop word list was then categorised according to domain-specific related words. The resulting list comprises 587 stop words. New stop words that emerged from the MHC include parliamentary words such as ‘Berhormat’ (salutation to Members of Parliament), ‘Pertua’ (salutation to the Speaker of the House), ‘ketawa’ (laugh) and ‘tepuk’ (clap). Besides typical English stop words such as ‘and’ and ‘the’, the list contains words such as ‘hon’ble’ (short for ‘Honourable’) and ‘honourable’. It also includes stop words in conventional Malay spelling, such as ‘untok’ (for), ‘lebeh’ (more) and ‘kapada’ (to). The proposed set of stop words can be used to support further natural language processing and text analysis.
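The combination of frequency-based, single-occurrence and list-based selection described above can be sketched roughly as follows. This is only an illustration, not the authors' implementation: the function name `candidate_stop_words`, the `top_k` cut-off, the tokenisation rule and the reading of the "classic method" as a pre-existing seed list are all assumptions made for the example, and the toy sentences are not taken from the MHC.

```python
import re
from collections import Counter

# Hypothetical tokeniser: runs of letters, optionally joined by an apostrophe
# (so that forms like "hon'ble" stay together). This is an assumption.
TOKEN_RE = re.compile(r"[^\W\d_]+(?:['’][^\W\d_]+)*")


def candidate_stop_words(documents, top_k=2, seed_list=()):
    """Collect stop word candidates from three sources (illustrative only).

    most_frequent: the top_k most frequent tokens in the corpus
    hapaxes:       tokens that occur exactly once in the corpus
    seed:          entries of an existing list that occur in the corpus
    """
    counts = Counter()
    for text in documents:
        counts.update(TOKEN_RE.findall(text.lower()))

    most_frequent = {word for word, _ in counts.most_common(top_k)}
    hapaxes = {word for word, n in counts.items() if n == 1}
    seed = {word.lower() for word in seed_list} & set(counts)

    return {
        "most_frequent": most_frequent,
        "hapaxes": hapaxes,
        "seed": seed,
        "all": most_frequent | hapaxes | seed,
    }


if __name__ == "__main__":
    toy_corpus = ["yang berhormat yang pertua", "yang berhormat ketawa"]
    result = candidate_stop_words(toy_corpus, top_k=2, seed_list=["dan", "yang"])
    for source, words in result.items():
        print(source, sorted(words))
```

In practice the candidates from each source would still need manual review and categorisation, since frequent or rare words are not automatically uninformative for a given domain.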

 


Keywords


stop word removal; text filtration; Malaysian Hansard Corpus; Malay stop word; parliamentary corpus processing



DOI: http://dx.doi.org/10.17576/gema-2021-2102-01


eISSN : 2550-2131

ISSN : 1675-8021