Can ChatGPT Translate Like a Pro? A Pilot Benchmarking Study of English–Malay Translation Quality
Abstract
Artificial intelligence (AI) tools such as ChatGPT have significantly advanced machine translation, yet their performance in low-resource language pairs, particularly English–Malay, lags behind. While existing studies have compared AI and human translation quality, most have relied on academic assessment frameworks, leaving a gap in evaluating AI translation through professional certification standards. From a professional standpoint, translation competence is most reliably assessed through formal certification frameworks that combine analytic rubrics, performance descriptors, and expert judgment. To determine whether AI systems can perform at a professional standard, they must be evaluated using the same criteria applied to human translators. This pilot study addresses that gap by benchmarking ChatGPT’s English–Malay translation performance against a novice and a professional translator using the National Accreditation Authority for Translators and Interpreters (NAATI) Certified Translator examination framework. Thirteen professional raters from the Malaysian Translators Association assessed the translations based on Meaning Transfer, Textual Norms and Conventions, and Language Proficiency. Findings revealed a clear performance hierarchy—Professional Translator > ChatGPT > Novice Translator—indicating that while ChatGPT achieved near-professional competence in fluency and meaning accuracy, it remained limited in idiomatic precision and cultural adaptation. The study highlights ChatGPT’s potential as an assistive tool for translation and training, while reaffirming the need for human oversight. It also validates the NAATI framework as a robust benchmark for evaluating AI translation quality. As AI models continue to evolve, future research involving larger translator samples and a wider range of language pairs is essential to evaluate ongoing progress and ensure the responsible integration of AI translation into professional practice.
Keywords: AI translation; English–Malay translation; ChatGPT; NAATI; professional translator assessment
eISSN: 2550-2247
ISSN: 0128-5157