Penjanaan Ringkasan Isi Utama Berita Bahasa Melayu berdasarkan Ciri Kata (Generation of News Headline for Malay Language based on Term Features)

Shahrul Azman Mohd Noah; Nazlena Mohamad Ali; Mohd Sabri Hasan

doi:10.17576/gema-2018-1804-04

Abstract

Teknik ringkasan isi utama merupakan satu proses penyulingan maklumat penting daripada wacana untuk menghasilkan satu ayat tunggal yang mewakili isi utama penulisan. Dalam konteks wacana Bahasa Melayu, kajian bidang ini terlalu sedikit dan tertumpu kepada kaedah penterjemahan mesin. Kajian ini dibahagikan kepada tiga fasa iaitu analisis korpus wacana berita, pembangunan teknik ringkasan isi utama dan penilaian kualiti hasil ringkasan. Kajian bertujuan untuk membangunkan teknik ringkasan isi utama dengan menggabungkan kaedah statistik dan linguistik. Kaedah statistik digunakan untuk menentukan kata signifikan dan ayat terpenting berdasarkan konsep pemberat. Kaedah linguistik pula digunakan untuk meningkatkan ketepatannya. Korpus wacana berita Bahasa Melayu terdiri daripada 140 wacana berita berserta ringkasan rujukan tunggal. Hasil analisis korpus wacana berita mendapati isi utama penulisan berita dapat ditentukan berdasarkan empat ciri iaitu lokasi kedudukan kata dalam ayat, kedudukan dua ayat pertama wacana berita, kata berjenis akronim dan kata mewakili nama individu. Kata signifikan dengan isi utama penulisan teks ditentukan berdasarkan nilai pemberat kata. Nilai ditentukan dengan menggabungkan nilai frekuensi kata dalam dokumen dan kedudukan kata dalam ayat. Dua ayat pertama dalam dokumen berita Bahasa Melayu dikenalpasti sebagai calon ayat terbaik bagi pengecaman ayat terpenting. Hasil penilaian menunjukkan peratus min ketepatan pengecaman ayat terpenting adalah 82.9% dan min kualiti ringkasan isi utama yang dijanakan masing-masing ialah kejituan (0.3194), dapatan semula (0.5656), skor-F (0.4012), ROUGE-N (0.5656), ROUGE-L (0.3392), ROUGE-W (0.1186) dan ROUGE-S (0.1232). Kesimpulannya pertimbangan faktor bahasa dalam pembangunan teknik ringkasan isi utama mampu menghasilkan ringkasan yang berkualiti daripada aspek bahasa dan darjah ketepatan yang lebih baik.

Kata Kunci: peringkasan teks; kaedah tanpa seliaan; ciri kata; berita Bahasa Melayu

ABSTRACT

Headline generation is an information extraction process that produces a single sentence representing the main content of a text. For Malay, research in this area is scarce and concentrated on machine translation approaches. This study is divided into three phases: analysis of a news corpus, development of a headline generation technique, and evaluation of the quality of the generated headlines. The study aims to develop a headline generation technique that combines statistical and linguistic methods. The statistical method identifies significant words and the most important sentence using a term weighting approach. The linguistic method is used to improve its accuracy. The corpus consists of 140 Malay news articles, each paired with a single reference headline. Analysis of the corpus shows that the main idea of a news text can be identified from four features: word position in a sentence, sentence position in the text, acronyms, and words that represent a person's name. Words significant to the main idea are determined by their term weights, which combine word frequency in the document with word position in the sentence. The first two sentences of a Malay news article are the best candidates for recognising the most important sentence. The evaluation shows a mean accuracy of 82.9% for important-sentence recognition, and mean headline quality scores of 0.3194 (precision), 0.5656 (recall), 0.4012 (F-measure), 0.5656 (ROUGE-N), 0.3392 (ROUGE-L), 0.1186 (ROUGE-W) and 0.1232 (ROUGE-S). In conclusion, taking language factors into account in a headline generation technique produces headlines of good linguistic quality with a higher degree of accuracy than the benchmarks used for comparison.

Keywords: text summarisation; unsupervised approach; term features; Malay news article
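
The abstract describes the approach only in outline: term weights that combine word frequency with word position, the first two sentences as candidates for the most important sentence, and overlap-based evaluation against a reference headline. The Python sketch below illustrates that outline under loud assumptions and is not the authors' implementation. The position bonus 1 / (1 + earliest position), the simple regex tokeniser, the function names, and the example sentences are all hypothetical, and the acronym and person-name features are omitted.

```python
import re
from collections import Counter

def tokenize(sentence):
    """Lowercase word tokens; a simple stand-in for real Malay tokenisation."""
    return re.findall(r"[A-Za-z0-9]+", sentence.lower())

def term_weights(sentences):
    """Weight each word by its document frequency times a position bonus.

    ASSUMPTION: the bonus 1 / (1 + earliest position in any sentence) is
    illustrative only; the abstract does not state the actual formula.
    """
    tokenized = [tokenize(s) for s in sentences]
    freq = Counter(w for toks in tokenized for w in toks)
    best_pos = {}
    for toks in tokenized:
        for i, w in enumerate(toks):
            best_pos[w] = min(best_pos.get(w, i), i)
    return {w: freq[w] / (1 + best_pos[w]) for w in freq}

def pick_key_sentence(sentences):
    """Return the index of the higher-scoring of the first two sentences."""
    weights = term_weights(sentences)
    candidates = sentences[:2]
    scores = [sum(weights[w] for w in set(tokenize(s))) for s in candidates]
    return max(range(len(candidates)), key=lambda i: scores[i])

def unigram_prf(generated, reference):
    """Unigram-overlap precision, recall, and F-score with clipped counts."""
    gen = Counter(tokenize(generated))
    ref = Counter(tokenize(reference))
    overlap = sum((gen & ref).values())
    p = overlap / sum(gen.values()) if gen else 0.0
    r = overlap / sum(ref.values()) if ref else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# Hypothetical three-sentence article and headline pair.
article = [
    "PM umum bajet baharu di Parlimen",
    "Bajet baharu itu fokus kepada pendidikan",
    "Rakyat sambut baik",
]
key_index = pick_key_sentence(article)
precision, recall, f_score = unigram_prf("bajet baharu diumum", "pm umum bajet baharu")
```

The recall computed here corresponds to unigram ROUGE-N, which may be why the abstract reports the same value (0.5656) for recall and ROUGE-N. ROUGE-L, ROUGE-W and ROUGE-S would each need their own matching procedure and are not sketched.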
