Model Peringkasan Teks Ekstraktif Dwibahasa menggunakan Fitur Kekangan Corak Tekstual (Bilingual Extractive Text Summarization Model using Textual Pattern Constraints)

Suraya Alias, Mohd Shamrie Sainin, Siti Khaotijah Mohammad

Abstract


Di dalam era pencarian maklumat digital, sebuah ringkasan yang dijana secara automatik dapat membantu pembaca mendapatkan maklumat penting dan relevan dengan lebih mudah. Sebahagian besar kajian dan set data penanda aras dalam bidang peringkasan teks secara automatik adalah dalam bahasa Inggeris. Justeru itu, terdapat keperluan kajian dalam bahasa Melayu agar potensi dalam bidang ini lebih kompetitif. Kajian ini juga menyoroti masalah dalam mengenal pasti dan menjana maklumat penting dalam penyediaan ringkasan ekstraktif. Ini kerana model perwakilan teks yang sedia ada seperti BOW mempunyai kelemahan dalam perwakilan semantik yang kurang tepat dan model N-gram pula mempunyai isu penghasilan dimensi vektor kata yang sangat tinggi. Dalam kajian ini, sebuah model peringkasan teks dwibahasa dinamakan MYTextSumBASIC telah dibangunkan untuk menghasilkan ringkasan ekstraktif secara automatik dalam versi bahasa Melayu dan bahasa Inggeris. Model MYTextSumBASIC ini menggunakan model perwakilan teks dikenali sebagai FASP yang telah diimprovisasi dengan menggunakan tiga Fitur Kekangan Corak Tekstual iaitu kekangan item kata, kekangan kata urutan bersebelahan dan kekangan saiz urutan. Terdapat tiga fasa utama dalam rangka kerja model MYTextSumBASIC iaitu pembangunan korpus ringkasan bahasa Melayu, pembangunan model MYTextSumBASIC menggunakan perwakilan FASP dan penilaian ringkasan. Dalam fasa penilaian, dengan menggunakan 100 wacana berita bahasa Melayu, prestasi ringkasan yang dihasilkan secara automatik oleh MYTextSumBASIC telah mengatasi ringkasan dari model Baseline (Lead) dan OTS dengan nilai purata tertinggi bagi dapatan semula (R) ialah 0.5849, kejituan (P) ialah 0.5736 dan skor-F (Fm) ialah 0.5772. Bagi penilaian secara manual oleh pakar bahasa, kaedah MYTextSumBASIC telah menghasilkan skor kebolehbacaan sebanyak 4.1 dan 3.87 untuk skor isi kandungan ringkasan yang dihasilkan menggunakan set data rawak. 
Eksperimen selanjutnya menggunakan set data tanda aras bahasa Inggeris DUC 2002 sebanyak 102 wacana berita juga telah menunjukkan model MYTextSumBASIC telah mengatasi sistem terbaik dan tercorot dalam perbandingan tersebut dengan nilai purata dapatan semula ROUGE-1 (0.43896) dan ROUGE-2 (0.19918). Kesimpulan dari penilaian ringkasan dapat merumuskan bahawa kaedah perwakilan teks FASP yang digunakan sebagai fitur oleh MYTextSumBASIC boleh diaplikasi untuk teks dwibahasa dengan prestasi kompetitif melalui perbandingan dengan model peringkasan teks bahasa Inggeris yang sedia ada.

 

Kata Kunci: Fitur Kekangan Corak Tekstual; Peringkasan Teks; Pertumbuhan Corak-Tersusun; Bahasa Melayu

 

ABSTRACT

 

In the era of digital information, an automatically generated summary can help readers find important and relevant information more easily. Most studies and benchmark data sets in the field of automatic text summarization are in English. Hence, there is a need to study the potential of the Malay language in this field. This study also addresses the problem of identifying and generating important information when preparing extractive summaries. Existing text representation models fall short here: the bag-of-words (BOW) model represents semantics imprecisely, while the N-gram model produces word vectors of very high dimension. In this study, a bilingual text summarization model named MYTextSumBASIC was developed to generate extractive summaries automatically in Malay and English. MYTextSumBASIC uses a text representation model known as FASP, improved with three Textual Pattern Constraints: word item constraints, adjacent word constraints and sequence size constraints. The MYTextSumBASIC framework has three main phases: development of a Malay summary corpus, development of the MYTextSumBASIC model using the FASP representation, and summary evaluation. In the evaluation phase, on a data set of 100 Malay news articles, the summaries produced by MYTextSumBASIC outperformed those of the Baseline (Lead) and OTS summarizers, achieving the highest average recall (R) of 0.5849, precision (P) of 0.5736 and F-score (Fm) of 0.5772. In manual evaluation by language experts, summaries generated by MYTextSumBASIC on a random data set scored 4.1 for readability and 3.87 for content. Further experiments on 102 news articles from the English DUC 2002 benchmark data set showed that MYTextSumBASIC also outperformed both the best and the worst systems in that comparison, with average recall of 0.43896 for ROUGE-1 and 0.19918 for ROUGE-2.
These findings indicate that the FASP text representation, together with the textual pattern constraints used as features by MYTextSumBASIC, can be applied to bilingual text with performance competitive with existing English text summarization models.
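The abstract names the three Textual Pattern Constraints but does not define them, so the following Python sketch shows only one plausible reading and is not the authors' FASP implementation. It assumes that the word item constraint filters out stop words, that the adjacency constraint restricts patterns to contiguous word sequences, and that the size constraint bounds pattern length. The stop-word list, the function names, the `min_support` threshold and the length-weighted sentence score are all hypothetical choices made for illustration.

```python
from collections import Counter

# Hypothetical stop-word list standing in for the word item constraint.
STOP_WORDS = {"the", "a", "of", "in", "and", "to", "is"}


def tokenize(sentence):
    """Lowercase, strip punctuation and drop stop words (item constraint)."""
    words = (w.strip(".,;:!?\"'()").lower() for w in sentence.split())
    return [w for w in words if w and w not in STOP_WORDS]


def adjacent_patterns(tokens, min_len=1, max_len=3):
    """Contiguous word sequences (adjacency) of bounded length (size)."""
    found = set()
    for n in range(min_len, max_len + 1):
        for i in range(len(tokens) - n + 1):
            found.add(tuple(tokens[i:i + n]))
    return found


def frequent_patterns(sentences, min_support=2, min_len=1, max_len=3):
    """Keep patterns that occur in at least min_support sentences."""
    support = Counter()
    for sentence in sentences:
        support.update(adjacent_patterns(tokenize(sentence), min_len, max_len))
    return {p: c for p, c in support.items() if c >= min_support}


def summarize(sentences, k=2, min_support=2, min_len=1, max_len=3):
    """Return the k sentences covering the most frequent patterns, in source order."""
    frequent = frequent_patterns(sentences, min_support, min_len, max_len)

    def score(sentence):
        # Assumed scoring: longer frequent patterns contribute more.
        patterns = adjacent_patterns(tokenize(sentence), min_len, max_len)
        return sum(len(p) for p in patterns if p in frequent)

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    return [sentences[i] for i in sorted(ranked[:k])]
```

For example, given the sentences "Flood waters rose in the city.", "The flood waters damaged homes in the city." and "Officials opened a shelter.", the pattern `("flood", "waters")` has support 2 and the first two sentences are selected. Restricting patterns to adjacent words keeps the pattern space far smaller than unconstrained subsequence mining, which is presumably part of the motivation for the constraints.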

 

Keywords: Textual Pattern Constraint; Text Summarization; Sequential Pattern-Growth; Malay language
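The recall, precision and F-score figures quoted in the abstract are n-gram overlap measures between a system summary and a reference summary. As a rough illustration, ROUGE-N can be sketched in Python as below. This is a simplified version that assumes whitespace tokenization and a single reference, with no stemming or stop-word removal; it is not the official ROUGE toolkit, and the abstract does not state which configuration the authors used. The function names are illustrative.

```python
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=1):
    """Return (recall, precision, f_score) from clipped n-gram overlap."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    # Counter intersection takes the minimum count of each shared n-gram.
    overlap = sum((cand & ref).values())
    recall = overlap / sum(ref.values()) if ref else 0.0
    precision = overlap / sum(cand.values()) if cand else 0.0
    if overlap == 0:
        return recall, precision, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return recall, precision, f_score
```

For the candidate "the cat sat on the mat" and the reference "the cat lay on the mat", five of six unigrams and three of five bigrams match, giving ROUGE-1 recall of 5/6 and ROUGE-2 recall of 0.6. Note that the harmonic mean of the reported average P (0.5736) and R (0.5849) is about 0.5792 rather than the reported Fm of 0.5772. This is what one would expect if the F-score was computed per document and then averaged, although the abstract does not say so.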







DOI: http://dx.doi.org/10.17576/gema-2020-2003-05



 

 

 

eISSN : 2550-2131

ISSN : 1675-8021