All Works

A comprehensive dataset for Arabic word sense disambiguation

Document Type

Article

Source of Publication

Data in Brief

Publication Date

8-1-2024

Abstract

This data paper introduces a dataset for word sense disambiguation covering one hundred polysemous words frequently used in Modern Standard Arabic. Each word has between 3 and 8 senses, for a total of 367 unique senses. Each sense is accompanied by ten example sentences that use the polysemous word in context, giving a dataset of 3670 samples. The dataset is in Arabic, a language known for its rich morphology, complex syntax, and extensive polysemy. The data was collected from various web sources spanning news, medicine, finance, and other domains, between 1 April 2023 and 1 May 2023. This breadth supports the dataset's use across diverse fields and positions it as a resource for Arabic Natural Language Processing (NLP) applications. The dataset includes all senses of each frequently used polysemous word, even rare senses that seldom occur in real-world contexts, which helps mitigate bias in model learning. Collection began with web scraping, followed by manual sorting to distinguish word senses and thorough searches by a human expert to fill in missing contextual sentences. Where online data for a rare sense was lacking or insufficient, synthetic sentences were generated with GPT3.5-turbo. Beyond its primary use in word sense disambiguation, the dataset is of value to researchers in other domains, including sentiment analysis applications.
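The counts in the abstract can be cross-checked with a little arithmetic. The following is a minimal sketch, not the authors' tooling: the constants come from the abstract, while the function names and the `(word, sense_id, sentence)` record layout are assumptions made purely for illustration.

```python
from collections import Counter

# Figures stated in the abstract.
NUM_WORDS = 100
NUM_SENSES = 367
SENTENCES_PER_SENSE = 10
MIN_SENSES, MAX_SENSES = 3, 8


def expected_samples(num_senses=NUM_SENSES, per_sense=SENTENCES_PER_SENSE):
    """Total samples if every sense has exactly `per_sense` sentences."""
    return num_senses * per_sense


def sense_total_is_feasible(num_words=NUM_WORDS, num_senses=NUM_SENSES):
    """Check that the sense total is compatible with 3 to 8 senses per word."""
    return num_words * MIN_SENSES <= num_senses <= num_words * MAX_SENSES


def unbalanced_senses(records, per_sense=SENTENCES_PER_SENSE):
    """Return {(word, sense_id): count} for senses whose count differs from `per_sense`.

    `records` is an iterable of (word, sense_id, sentence) tuples.
    This layout is hypothetical; the published files may be organised differently.
    """
    counts = Counter((word, sense_id) for word, sense_id, _ in records)
    return {key: n for key, n in counts.items() if n != per_sense}
```

Under these assumptions, `expected_samples()` reproduces the 3670 samples reported above, and `unbalanced_senses` could be run on a loaded copy of the data to confirm that every sense really has ten sentences.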

DOI Link

10.1016/j.dib.2024.110591

ISSN

2352-3409

Publisher

Elsevier BV

Disciplines

Computer Sciences

Keywords

Arabic language, Deep learning, GPT3.5, Labelled data, Machine learning, Natural language processing, Word sense disambiguation

Scopus ID

85195605044

Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License

Recommended Citation

Kaddoura, Sanaa and Nassar, Reem, "A comprehensive dataset for Arabic word sense disambiguation" (2024). All Works. 6631.
https://zuscholars.zu.ac.ae/works/6631

Indexed in Scopus

yes

Open Access

yes

Open Access Type

Gold: This publication is openly available in an open access journal/series

Included in

Computer Sciences Commons
