All Works

Large scale image dataset construction using distributed crawling with hadoop YARN

Rahman Ali, University of Peshawar
Asmat Ali, University of Peshawar
Asad Masood Khatak, Zayed University
Muhammad Saqlain Aslam, National Central University Taiwan

Document Type

Conference Proceeding

Source of Publication

Proceedings - 2018 Joint 10th International Conference on Soft Computing and Intelligent Systems and 19th International Symposium on Advanced Intelligent Systems, SCIS-ISIS 2018

Publication Date

7-2-2018

Abstract

© 2018 IEEE. With the rapid growth of the internet, we are now living in the era of big data. Image data on the web has the potential to support the development of sophisticated and robust models and algorithms for working with images and multimedia data. Image datasets are widely used in image processing tasks and analyses, and in fields including artificial intelligence, data extraction and collection, computer vision, research, and education. In this work, we propose a system that systematically crawls the web using the Hadoop MapReduce technique to collect images from millions of web pages. With celebrity images as just one use case, the system can search for and retrieve any image tagged with specific terms. The system uses simple techniques to reduce noisy images such as thumbnails and icons. The proposed system is based on Apache Hadoop and Apache Nutch, an open source web crawler. A customized crawl is run through Apache Nutch on a Hadoop cluster; it searches the web for images in one or more categories and retrieves their links. Next, HIPI (Hadoop Image Processing Interface) is used to download the images and create a dataset for an individual category or a dataset spanning multiple categories.
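
The abstract mentions "simple techniques" for reducing noisy images such as thumbnails and icons but does not say what they are. The following is a minimal sketch of one plausible approach, not the authors' method: it filters crawled image links by file extension, by noise-suggesting keywords in the URL path, and by a minimum side length when dimensions are known. The keyword list, the accepted extensions, the 100-pixel threshold, and the names `is_candidate_image` and `filter_links` are all assumptions made for illustration.

```python
from typing import Iterable, List, Optional, Tuple
from urllib.parse import urlparse

# Hypothetical values chosen for illustration; the abstract states none of them.
NOISE_KEYWORDS = ("thumb", "icon", "sprite", "logo", "avatar")
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png")
MIN_SIDE_PX = 100


def is_candidate_image(url: str,
                       width: Optional[int] = None,
                       height: Optional[int] = None) -> bool:
    """Return True if the link looks like a full-size photo rather than noise."""
    path = urlparse(url).path.lower()
    # Keep only links whose path ends in an accepted image extension.
    if not path.endswith(IMAGE_EXTENSIONS):
        return False
    # Drop links whose path hints at a thumbnail, icon, or similar asset.
    if any(keyword in path for keyword in NOISE_KEYWORDS):
        return False
    # When dimensions are known, drop images that are too small.
    if width is not None and height is not None:
        if min(width, height) < MIN_SIDE_PX:
            return False
    return True


def filter_links(records: Iterable[Tuple[str, Optional[int], Optional[int]]]) -> List[str]:
    """Filter (url, width, height) records down to candidate image URLs."""
    return [url for url, w, h in records if is_candidate_image(url, w, h)]


links = [
    ("http://example.com/img/actor_thumb.jpg", 640, 480),
    ("http://example.com/img/actor.jpg", 640, 480),
    ("http://example.com/favicon.ico", 16, 16),
    ("http://example.com/img/small.png", 32, 32),
    ("http://example.com/img/unknown.jpeg", None, None),
]
kept = filter_links(links)
```

In a distributed setting, a check of this kind could run inside a map task over the links retrieved by the crawl, so that only candidate links are passed on to the download step.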

DOI Link

10.1109/scis-isis.2018.00075

ISBN

9781538626337

Publisher

Institute of Electrical and Electronics Engineers Inc.

First Page

394

Last Page

399

Disciplines

Computer Sciences

Keywords

Apache Nutch, Distributed Computing, Hadoop YARN, Image Search, Web Crawling, Web Scraping

Scopus ID

85067125855

Recommended Citation

Ali, Rahman; Ali, Asmat; Khatak, Asad Masood; and Aslam, Muhammad Saqlain, "Large scale image dataset construction using distributed crawling with hadoop YARN" (2018). All Works. 2216.
https://zuscholars.zu.ac.ae/works/2216

Indexed in Scopus

yes

Open Access
