Large scale image dataset construction using distributed crawling with Hadoop YARN
Source of Publication
Proceedings - 2018 Joint 10th International Conference on Soft Computing and Intelligent Systems and 19th International Symposium on Advanced Intelligent Systems, SCIS-ISIS 2018
© 2018 IEEE. With the rapid advancement of the internet, we are now living in the era of big data. Image data on the web has the potential to support the development of sophisticated and robust models and algorithms for working with images and multimedia data. Image datasets are widely used in image processing tasks and analyses, in fields including artificial intelligence, data extraction and collection, computer vision, research, and education. In this work, we propose a system that crawls the web systematically using the Hadoop MapReduce technique to collect images from millions of web pages. With celebrity images as just one use case, the system can search for and retrieve any image tagged with specific terms. The system uses simple techniques to reduce noisy images such as thumbnails and icons. The proposed system is based on Apache Hadoop and Apache Nutch, an open-source web crawler. A customized crawl is run through Apache Nutch in a Hadoop cluster; it searches the web for images in one or more categories and retrieves their links. Next, HIPI, the Hadoop Image Processing Interface, is used to download the images and create a dataset for an individual category or for multiple categories.
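The abstract mentions "simple techniques to reduce noisy images like thumbnails and icons" but does not say what they are. The following is a minimal Python sketch of one plausible heuristic, not the authors' method: it rejects candidate links by file extension, by filename keywords, and by declared dimensions. The thresholds, the keyword list, and the function names `is_probable_noise` and `filter_image_links` are all assumptions introduced for illustration.

```python
from urllib.parse import urlparse

# Hypothetical thresholds and keywords; the paper does not specify its rules.
MIN_WIDTH = 100
MIN_HEIGHT = 100
NOISE_KEYWORDS = ("thumb", "icon", "sprite", "logo")
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png")


def is_probable_noise(url, width=None, height=None):
    """Return True if the link looks like a thumbnail, icon, or non-photo file."""
    path = urlparse(url).path.lower()
    # Reject anything that is not one of the accepted image file types.
    if not path.endswith(IMAGE_EXTENSIONS):
        return True
    # Reject filenames that contain a typical thumbnail/icon keyword.
    filename = path.rsplit("/", 1)[-1]
    if any(keyword in filename for keyword in NOISE_KEYWORDS):
        return True
    # Reject images whose declared size is below the thresholds.
    # Unknown dimensions (None) are not treated as noise.
    if width is not None and width < MIN_WIDTH:
        return True
    if height is not None and height < MIN_HEIGHT:
        return True
    return False


def filter_image_links(candidates):
    """Keep the URLs of (url, width, height) candidates that pass the check."""
    return [url for url, w, h in candidates if not is_probable_noise(url, w, h)]
```

In a crawl like the one described, a check of this kind would presumably run on the links retrieved by the crawler before the images are downloaded; where exactly the paper applies its filtering is not stated in the abstract.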
Institute of Electrical and Electronics Engineers Inc.
Apache Nutch, Distributed Computing, Hadoop YARN, Image Search, Web Crawling, Web Scrapping
Ali, Rahman; Ali, Asmat; Khatak, Asad Masood; and Aslam, Muhammad Saqlain, "Large scale image dataset construction using distributed crawling with Hadoop YARN" (2018). All Works. 2216.
Indexed in Scopus