Developing AI Tools For A Writing Assistant: Automatic Detection of dt-mistakes In Dutch
Wouter Mercelis
Pages - 9 - 23     |    Revised - 30-04-2021     |    Published - 01-06-2021
Volume - 12   Issue - 2    |    Publication Date - June 2021
KEYWORDS
NLP, Dutch, AI, Spelling Correction, Transfer Learning.
ABSTRACT
This paper describes a lightweight, scalable model that predicts whether a Dutch verb ends in -d, -t or -dt. The confusion of these three endings is a common Dutch spelling mistake. If the predicted ending is different from the ending as written by the author, the system will signal the dt-mistake. This paper explores various data sources to use in this classification task, such as the Europarl Corpus, the Dutch Parallel Corpus and a Dutch Wikipedia corpus. Different architectures are tested for the model training, focused on a transfer learning approach with ULMFiT. The trained model can predict the right ending with 99.4% accuracy, and this result is comparable to the current state-of-the-art performance. Adjustments to the training data and the use of other part-of-speech taggers may further improve this performance. As discussed in this paper, the main advantages of the approach are the short training time and the potential to use the same technique with other disambiguation tasks in Dutch or in other languages.
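The detection step described in the abstract (compare the predicted ending with the ending the author wrote, and signal a mismatch) can be illustrated with a minimal Python sketch. This is not the paper's implementation: the function names `written_ending` and `detect_dt_mistake`, the `predict_ending` callback, and the idea of returning a suggested spelling are all assumptions made for illustration. In the paper the prediction comes from a trained ULMFiT-based classifier, which is replaced here by a placeholder callable.

```python
from typing import Callable, Optional, Sequence

# Check "dt" before "t" so that "wordt" is read as ending in -dt, not -t.
ENDINGS = ("dt", "d", "t")

# Hypothetical predictor signature: (tokens, index of the verb) -> "d", "t" or "dt".
Predictor = Callable[[Sequence[str], int], str]


def written_ending(verb: str) -> Optional[str]:
    """Return the -d/-t/-dt ending as written, or None if the verb has none."""
    lowered = verb.lower()
    for ending in ENDINGS:
        if lowered.endswith(ending):
            return ending
    return None


def detect_dt_mistake(
    tokens: Sequence[str], verb_index: int, predict_ending: Predictor
) -> Optional[str]:
    """Return a suggested spelling if the predicted ending differs, else None."""
    verb = tokens[verb_index]
    written = written_ending(verb)
    if written is None:
        return None  # not a candidate for a dt-mistake
    predicted = predict_ending(tokens, verb_index)
    if predicted == written:
        return None  # the written ending agrees with the prediction
    # Swap the written ending for the predicted one.
    return verb[: -len(written)] + predicted


def always_dt(tokens: Sequence[str], verb_index: int) -> str:
    """Placeholder standing in for the trained classifier (assumption)."""
    return "dt"


if __name__ == "__main__":
    print(detect_dt_mistake(["hij", "word", "ziek"], 1, always_dt))
    print(detect_dt_mistake(["hij", "wordt", "ziek"], 1, always_dt))
```

Keeping the predictor behind a plain callable means the comparison logic stays independent of the model, so a different classifier or part-of-speech tagger could be swapped in without touching the detection step.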
Mr. Wouter Mercelis
Faculteit Letteren/Onderzoekseenheid Taalkunde/Onderzoeksgroep Kwantitatieve Lexicologie en Variatielinguïstiek (QLVL), KU Leuven, Leuven, 3000 - Belgium
mercelisw@gmail.com

