Penentuan Fitur bagi Pengekstrakan Tajuk Berita Akhbar Bahasa Melayu (Determining Features of News Headline in Malay News Document)

Shahrul Azman Mohd Noah, Nazlena Mohamad Ali, Mohd Sabri Hasan

Abstract


Headline summarization is an automatic text summarization technique that can reduce information overload in retrieval systems. It can lessen the user's cognitive load when examining and selecting relevant documents from a large collection. The technique's performance is influenced by the features of the natural language system that represent the information in a document. This study discusses the process of determining the features of the Malay language system in news documents. The methodology begins with an analysis of a corpus of Malay news documents. The corpus contains 140 core news documents selected from two mainstream news databases in Malaysia, Berita Harian and Utusan Malaysia. The selection criteria were the core news category, a length of 50 to 250 words, publication years from 2007 to 2012, and the genres economy, crime, education and sports. Three Malay linguistics experts manually produced one headline summary for each news document. The three experts had to comply with three conditions: summarization by extraction, the select-word-inorder word selection technique, and morphological changes to words. The experimental results identified three features: first, the first two sentences are suitable candidates for the most important sentence; second, a sentence containing an acronym definition is a potential most important sentence; and third, the ideal headline summary size is six words. Taking these features into account allows headline summaries to be generated automatically in a way that more closely resembles those produced by humans.

 

Keywords: main point; natural language processing; Malay news; text summarization; Malay corpus

 

ABSTRACT

 

Headline summarization is an automated text summarization technique that can reduce information overload in retrieval systems and lessen the user's cognitive burden when searching for and selecting relevant documents from large collections. This study discusses the process of determining the features of the Malay language system in news documents. The methodology starts with an analysis of a corpus of Malay news documents. The corpus contains 140 core news items selected from two mainstream news databases in Malaysia, Berita Harian and Utusan Malaysia. The selection criteria were the core news category, a length of 50 to 250 words, publication years from 2007 to 2012, and the genres economy, crime, education and sports. Three Malay linguistics experts manually produced a headline summary for each news document. The experts had to comply with three conditions: summarization by extraction, the select-word-inorder word selection technique, and morphological changes to words. The experimental results identified three features: first, the first two sentences are suitable candidates for the most important sentence; second, a sentence that contains an acronym definition is a potential most important sentence; and third, the ideal headline summary size is six words. Taking these features into account allows headline summaries to be generated automatically in a way that more closely resembles those produced by humans.
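The three features reported above can be read as simple heuristics. The following is a minimal sketch of that reading, not the authors' implementation: the sentence splitter, the acronym pattern (an all-caps token in parentheses), and every function name are assumptions made for illustration, and the in-order check ignores the morphological changes that the experts were allowed to make.

```python
import re

# Assumed pattern for an acronym definition such as
# "Bank Negara Malaysia (BNM)": an all-caps token in parentheses.
ACRONYM_DEF = re.compile(r"\(([A-Z]{2,})\)")

# Ideal headline size reported in the abstract.
IDEAL_HEADLINE_WORDS = 6


def split_sentences(text):
    """Naive splitter on '.', '!' or '?' followed by whitespace (an assumption)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p.strip() for p in parts if p.strip()]


def candidate_sentences(text):
    """Return indices of candidate most-important sentences.

    Feature 1: the first two sentences are candidates.
    Feature 2: any sentence containing an acronym definition is a candidate.
    """
    sentences = split_sentences(text)
    picked = set(range(min(2, len(sentences))))
    for i, sentence in enumerate(sentences):
        if ACRONYM_DEF.search(sentence):
            picked.add(i)
    return sorted(picked)


def is_in_order_selection(headline_words, source_words):
    """True if headline_words occur in source_words in the same relative order."""
    remaining = iter(source_words)
    # 'word in remaining' consumes the iterator up to the match,
    # so later words must be found after earlier ones.
    return all(word in remaining for word in headline_words)


def within_ideal_size(headline_words):
    """Feature 3: check the headline against the six-word ideal size."""
    return len(headline_words) <= IDEAL_HEADLINE_WORDS


# Invented example text, not taken from the corpus.
example = (
    "Oil prices rose yesterday. Markets stayed calm. "
    "Investors awaited a report. "
    "The rise was announced by Bank Negara Malaysia (BNM)."
)
print(candidate_sentences(example))
```

On this invented text, the first two sentences are picked by position and the fourth is picked because it defines an acronym. A real system would also need Malay-aware tokenization and stemming before the in-order check could accept morphologically altered words.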

 

Keywords: headline; natural language processing; Malay news; text summarization; Malay corpus






DOI: http://dx.doi.org/10.17576/gema-2018-1802-11


eISSN : 2550-2131

ISSN : 1675-8021