Assessing Legal Translations Generated by GPT-4 Turbo Using MQM: A Comparative Study

Lama Abdullah Aldosari, Nasrin Altuwairesh

Abstract


Advances in artificial intelligence (AI), particularly through GPT models, have significantly enhanced machine translation (MT), offering more accurate and nuanced translations. This study investigates how different translation prompts affect the quality of legal translations generated by Generative Pre-trained Transformer 4 (GPT-4) Turbo. Applying the Multidimensional Quality Metrics (MQM) framework, the research evaluates the types and frequency of errors in translations produced by existing prompts and by newly developed ones. Specifically, it compares translation commission prompts, designed to elicit context-appropriate target texts, with existing prompts in terms of translation quality. A dataset of 14 Saudi laws drawn from official sources serves as the basis for analysis, with reference translations used as benchmarks. The findings reveal that the newly developed prompts, tailored specifically for legal translation, produced markedly lower error rates (10-12%) than existing prompts, whose error rates ranged from 34% to 45%. These results underscore the potential of tailored prompt engineering to achieve high-quality legal translations by reducing errors in terminology, accuracy, and style. By categorising translation errors and ranking them by severity, the research highlights the impact of prompt engineering on legal translation performance. These findings contribute to the development of more effective MT systems, offering practical insights for refining machine translation in the legal field and beyond.

 

Keywords: GPT-4 Turbo; Legal Translation; Machine Translation; Multidimensional Quality Metrics; Translation Commission

 

DOI: http://doi.org/10.17576/3L-2026-3201-11








eISSN: 2550-2247

ISSN: 0128-5157