ZAEBUC design and annotation: Guidelines, processes, and insights
Document Type
Book Chapter
Source of Publication
Bilingual Writers and Corpus Analysis
Publication Date
12-23-2022
Abstract
In this chapter, we present the ZAEBUC corpus annotations used by the remaining chapters in this book. In addition to describing the rich metadata available for all texts in ZAEBUC, we discuss the guidelines and pipeline processes we followed to create the annotations and check their quality. The annotations include spelling and grammar correction, morphological tokenization, part-of-speech tagging, lemmatization, and Common European Framework of Reference (CEFR) ratings. All annotations are applied to both the Arabic and the English texts, using guidelines that are as consistent as possible across the two languages. We also track the alignments among the different annotation layers and between each layer and the original raw texts. For all annotations except CEFR ratings, we use existing automatic annotation tools followed by manual correction; CEFR ratings are assigned manually only. We also present various measurements and correlations, along with preliminary insights drawn from the data and annotations. The ZAEBUC corpus annotations are intended as stepping stones for further annotation: some chapters in this book use them directly, while others extend them through additional manual and automatic annotations.
ISBN
9781000782660, 9781003183921
Publisher
Routledge
First Page
28
Last Page
51
Disciplines
Education | Linguistics
Recommended Citation
Habash, Nizar and Palfreyman, David M., "ZAEBUC design and annotation: Guidelines, processes, and insights" (2022). All Works. 5591.
https://zuscholars.zu.ac.ae/works/5591
Indexed in Scopus
yes
Open Access
no