ZAEBUC design and annotation: Guidelines, processes, and insights
Source of Publication
Bilingual Writers and Corpus Analysis
In this chapter, we present the ZAEBUC corpus annotations used by the remaining chapters in this book. In addition to the rich metadata for all the texts in ZAEBUC, we discuss the guidelines and pipeline processes we followed to create the annotations and quality-check them. The annotations include spelling and grammar correction, morphological tokenization, part-of-speech tagging, lemmatization, and Common European Framework of Reference (CEFR) ratings. All of the annotations are done on both Arabic and English texts, using consistent guidelines as much as possible. We also track the alignments among the different annotations and with the original raw texts. For all annotations, we use existing automatic annotation tools followed by manual correction, except for the CEFR ratings, which are assigned manually only. We also present various measurements and correlations, with preliminary insights drawn from the data and annotations. The ZAEBUC corpus annotations are intended as stepping stones for additional annotations: some of the book chapters use the annotations directly, and some extend them through additional manual and automatic annotations.
Education | Linguistics
Habash, Nizar and Palfreyman, David M., "ZAEBUC design and annotation: Guidelines, processes, and insights" (2022). All Works. 5591.
Indexed in Scopus