Tagging L2 Writing: Learner Errors and the Performance of an Automated Part-of-Speech Tagger

Roslina Abdul Aziz, Zuraidah Mohd Don


This paper is concerned with the application of technologies developed in other disciplines, in particular the use of text processing techniques to investigate the problems of second language learner writing in English. The question addressed is whether texts produced by L1-Malay learners at the University of Malaya can usefully be processed using the Constituent Likelihood Automatic Word-tagging System (CLAWS), a part-of-speech (POS) tagger developed for and trained on texts written by native speakers of the language. The study adopts the procedure employed by van Rooy and Schäfer (2002). CLAWS was used to automatically POS tag a subset of the Malaysian Corpus of Learner English (MACLE), and the texts were then analyzed for tagging accuracy. CLAWS was found to perform less well on learner texts than on native speaker texts, but still achieved an accuracy rate of over 90%. The sources of error are traced, and spelling errors are found to be the most common source. Closer inspection indicates that successful tagging is likely to lead to problems downstream in later processing, which suggests that to optimize performance, some modifications will be required in tagger design.
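
As a rough illustration of what "analyzed for tagging accuracy" can involve, the sketch below compares automatically assigned tags against manually corrected ones and tallies the mistagged tokens by error source. It is not the authors' procedure or any CLAWS interface: the function names, the record layout, the source labels, and the toy sentence with its tags (written in the style of the CLAWS C7 tagset) are all assumptions made for illustration.

```python
from collections import Counter


def tagging_accuracy(gold_tags, auto_tags):
    """Return the proportion of tokens whose automatic tag matches the gold tag."""
    if len(gold_tags) != len(auto_tags):
        raise ValueError("tag sequences must have the same length")
    if not gold_tags:
        raise ValueError("tag sequences must not be empty")
    correct = sum(g == a for g, a in zip(gold_tags, auto_tags))
    return correct / len(gold_tags)


def error_sources(records):
    """Count mistagged tokens by their manually assigned source label."""
    return Counter(r["source"] for r in records if r["gold"] != r["auto"])


# Hypothetical toy data: the "auto" tags are invented, not real CLAWS output,
# and the "source" labels stand in for a manual classification of each error.
records = [
    {"token": "The",     "gold": "AT",  "auto": "AT",  "source": None},
    {"token": "studens", "gold": "NN2", "auto": "NN1", "source": "spelling"},
    {"token": "recieve", "gold": "VV0", "auto": "NN1", "source": "spelling"},
    {"token": "good",    "gold": "JJ",  "auto": "JJ",  "source": None},
    {"token": "marks",   "gold": "NN2", "auto": "VVZ", "source": "ambiguity"},
]

gold = [r["gold"] for r in records]
auto = [r["auto"] for r in records]

print(f"accuracy: {tagging_accuracy(gold, auto):.2f}")
print(error_sources(records).most_common())
```

Accuracy here is computed per token, so a single misspelled word counts as one error however many later steps it affects.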



Keywords: Learner Corpora; Learner Errors; Part-of-Speech Tagging; Tagging Accuracy




Aarts, J., van Halteren, H. & Oostdijk, N. (1998). The linguistic annotation of corpora: The TOSCA analysis system. International Journal of Corpus Linguistics. 3(2), 189-210. doi: 10.1075/ijcl.3.2.02aar

Aijmer, K. (2002). Modality in advanced Swedish learners’ written interlanguage. In S. Granger, J. Hung & S. Petch-Tyson (Eds.), Computer learner corpora, second language acquisition and foreign language teaching (pp. 55-76). Amsterdam: John Benjamins.

Brill, E. (1999). Tagging unknown words. In H. van Halteren (Ed.), Syntactic wordclass tagging (pp. 207-216). Dordrecht: Kluwer.

de Haan, P. (2000). Tagging non-native English with the TOSCA-ICLE tagger. In C. Mair & M. Hundt (Eds.), Corpus linguistics and linguistic theory (pp. 69-79). Amsterdam: Rodopi.

Díaz-Negrillo, A. & Fernández-Domínguez, J. (2006). Error tagging systems for learner corpora. RESLA. 19, 83-102.

Díaz-Negrillo, A. & García-Cumbreras, M. A. (2007). A tagging tool for error analysis on learner corpora. ICAME Journal Computers in English Linguistics. 31, 197-203.

Díaz-Negrillo, A., Meurers, D., Valera, S. & Wunsch, H. (2010). Towards interlanguage POS annotation for effective learner corpora in SLA and FLT. Language Forum. 36(1-2), 139-154. http://www.sfs.uni-


Díaz-Negrillo, A. & Thompson, P. (2013). Learner corpora: Looking towards the future. In A. Díaz-Negrillo, N. Ballier & P. Thompson (Eds.), Automatic Treatment and Analysis of Learner Corpus Data (pp. 9-28). Amsterdam: John Benjamins.

Garside, R. & Smith, N. (1997). A hybrid-grammatical tagger: CLAWS4. In R. Garside, G. Leech & A. McEnery (Eds.), Corpus Annotation: Linguistic Information From Computer Text Corpora (pp. 102-121). London: Longman.

Granger, S. (1993). The International Corpus of Learner English. In J. Aarts, P. de Haan & N. Oostdijk (Eds.), English Language Corpora: Design, Analysis and Exploitation (pp. 57-69). Amsterdam: Rodopi.

Granger, S. (2002). A bird’s eye view of learner corpus research. In S. Granger, J. Hung & S. Petch-Tyson (Eds.), Computer Learner Corpora, Second Language Acquisition and Foreign Language Teaching (pp. 3-33). Amsterdam: John Benjamins.

Granger, S. (2003). Error-tagged learner corpora and CALL: A promising synergy. CALICO Journal. 20(3), 465-480. http://www.jstor.org/stable/24157525

Granger, S. (2005). Computer learner corpus research: Current status and future prospects. In U. Connor & T. Upton (Eds.), Computer Learner Corpora, Second Language Acquisition and Foreign Language Teaching (pp. 123-145). Amsterdam: Rodopi.

Granger, S. (2008). Learner corpora in foreign language education. In N. van Deusen-Scholl & N. H. Hornberger (Eds.), Encyclopedia of Language and Education. Volume 4. Second and foreign language education (pp. 337-351). Springer. doi: 10.1007/978-0-387-30424-3_109

Ionin, T. & Wexler, K. (2001). L1 Russian children learning English: Tense and overgeneration of "be". In X. Bonch-Bruevich, W. J. Crawford, J. Hellermann, C. Higgins & H. Nguyen (Eds.), The Past, Present, and Future of Second Language Research: Second Language Research Forum (pp. 76-94). Somerville, MA: Cascadilla Press.

Ionin, T. & Wexler, K. (2002). Why is 'is' easier than '-s'?: Acquisition of tense/agreement morphology by child second language learners of English. Second Language Research. 18(2), 95-136. doi: 10.1191/0267658302sr195oa

Izumi, E., Uchimoto, K. & Isahara, H. (2005). Error annotation for corpus of Japanese learner English. Proceedings from IWLIC 2005: The 6th International Workshop on Linguistically Interpreted Corpora. Jeju Island: Korea. http://www.aclweb.org/website/old_anthology/I/I05/I05-6009.pdf

Knowles, G. & Zuraidah Mohd Don. (2004). Introducing MACLE: The Malaysian Corpus of Learner English. Proceedings from NSCLFLE: The 1st National Symposium of Corpus Linguistics and Foreign Language Education. Guangzhou: China.

Knowles, G., Zuraidah Mohd Don, Jariah Mohd Jan, Rajeswary Sargunam, Janet Yong, Sathia Devi, Asha Doshi & Su'ad Awab. (2006). The Malaysian Corpus of Learner English: A bridge from linguistics to ELT. In Azirah H. & Norizah H. (Eds.), Varieties of English in Southeast Asia and beyond. Kuala Lumpur: University of Malaya Press.

Leech, G. (1997). Introducing corpus annotation. In R. Garside, G. Leech & T. McEnery (Eds.), Corpus Annotation: Linguistic Information From Computer Text Corpora (pp. 1-18). Harlow, England: Addison Wesley Longman Limited.

Meunier, F. & de Mönnink, I. (2001). Assessing the success rate of EFL learner corpus tagging. ICAME Conference. Louvain-la-Neuve: Belgium.

Nesselhauf, N. (2009). Co-selection phenomenon across new Englishes: Parallels (and differences) of foreign learner varieties. English World-Wide. 30(1), 1-26.

Roslina Abdul Aziz & Zuraidah Mohd Don. (2014). The overgeneration of be+verb in the writing of L1-Malay ESL learners in Malaysia. Research in Corpus Linguistics. 2, 35-44.


van Rooy, B. & Schäfer, L. (2002). The effect of learner errors on POS tag errors during automatic POS tagging. Southern African Linguistics and Applied Language Studies. 20, 325-335. doi: 10.2989/16073610209486319

DOI: http://dx.doi.org/10.17576/gema-2019-1903-09






eISSN : 2550-2131

ISSN : 1675-8021