ZAEBUC design and annotation: Guidelines, processes, and insights

Authors

Nizar Habash, NYU Abu Dhabi
David M. Palfreyman, Zayed University

Document Type

Book Chapter

Source of Publication

Bilingual Writers and Corpus Analysis

Publication Date

12-23-2022

Abstract

In this chapter, we present the ZAEBUC corpus annotations used by the remaining chapters of this book. In addition to the rich metadata for all the texts in ZAEBUC, we discuss the guidelines and pipeline processes we followed to create the annotations and quality-check them. The annotations include spelling and grammar correction, morphological tokenization, part-of-speech tagging, lemmatization, and Common European Framework of Reference (CEFR) ratings. All annotations are applied to both the Arabic and the English texts, using guidelines that are as consistent as possible. We also tracked the alignments among the different annotations and between the annotations and the original raw texts. For all annotations, we use existing automatic annotation tools followed by manual correction, except for the CEFR ratings, which are assigned manually only. We also present various measurements and correlations, with preliminary insights drawn from the data and annotations. The ZAEBUC corpus annotations are intended as stepping stones for additional annotations: some of the book chapters use the annotations directly, and some extend them through additional manual and automatic annotations.

ISBN

9781000782660, 9781003183921

Publisher

Routledge

First Page

28

Last Page

51

Disciplines

Education | Linguistics

Scopus ID

85143670981

Indexed in Scopus

yes

Open Access

no
