Large scale image dataset construction using distributed crawling with Hadoop YARN

Document Type

Conference Proceeding

Source of Publication

Proceedings - 2018 Joint 10th International Conference on Soft Computing and Intelligent Systems and 19th International Symposium on Advanced Intelligent Systems, SCIS-ISIS 2018

Publication Date



© 2018 IEEE. With the rapid advancement of the internet, we are now living in the era of big data. Image data over the web has the potential to assist in the development of sophisticated and robust models and algorithms for interacting with images and multimedia data. Image datasets are widely used in image processing tasks and analyses, across fields including artificial intelligence, data extraction and collection, computer vision, research, and education. In this work, we propose a system that crawls the web in a systematic manner using the Hadoop MapReduce technique to collect images from millions of web pages. With celebrity images as just one use case, the system can search for and retrieve any image tagged with specific terms. It applies simple techniques to reduce noisy images such as thumbnails and icons. The proposed system is based on Apache Hadoop and Apache Nutch, an open source web crawler. A customized crawl is run through Apache Nutch in a Hadoop cluster, which searches the web for images in one or more categories and retrieves their links. Next, HIPI, the Hadoop Image Processing Interface, is used to download the images and create a dataset for an individual category or a combined dataset of multiple categories.
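The abstract mentions "simple techniques to reduce noisy images like thumbnails and icons" but does not specify them. A minimal sketch of one plausible such filter, based on image dimensions, might look like the following; the threshold values and function names here are illustrative assumptions, not details from the paper.

```python
# Hypothetical dimension-based noise filter of the kind the abstract
# describes: very small images (icons, thumbnails) and extremely
# elongated ones (banners) are discarded before dataset construction.
# All thresholds are assumptions for illustration only.

MIN_WIDTH = 200       # assumed minimum width in pixels
MIN_HEIGHT = 200      # assumed minimum height in pixels
MAX_ASPECT_RATIO = 3.0  # assumed cap on elongation


def is_noisy_image(width: int, height: int) -> bool:
    """Return True if an image is likely a thumbnail, icon, or banner."""
    if width < MIN_WIDTH or height < MIN_HEIGHT:
        return True  # too small: probably an icon or thumbnail
    ratio = max(width, height) / min(width, height)
    return ratio > MAX_ASPECT_RATIO  # very elongated: probably a banner


def filter_image_links(links_with_sizes):
    """Keep only (url, width, height) entries that pass the noise filter."""
    return [url for url, w, h in links_with_sizes if not is_noisy_image(w, h)]
```

In a crawl pipeline like the one described, such a check could run either on metadata gathered during the Nutch crawl or after download, before the images are bundled by HIPI.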




Institute of Electrical and Electronics Engineers Inc.

First Page


Last Page



Computer Sciences


Apache Nutch, Distributed Computing, Hadoop YARN, Image Search, Web Crawling, Web Scraping

Scopus ID


Indexed in Scopus


Open Access