A Brief Survey into the Field of Automatic Image Dataset Generation through Web Scraping and Query Expansion

  • Bart Dikmans (Dept. of Computer Science and Engineering, Seoul National University of Science and Technology)
  • Dongwann Kang (Dept. of Computer Science and Engineering, Seoul National University of Science and Technology)
  • Received : 2022.04.22
  • Accepted : 2023.01.01
  • Published : 2023.10.31

Abstract

High-quality image datasets are in high demand for various applications. Although many online sources provide manually collected datasets, fully automating the dataset collection process remains a persistent challenge. In this study, we survey the field of automatic image dataset generation by analyzing a collection of existing studies. We also examine fields that are closely related to automated dataset generation, such as query expansion, web scraping, and dataset quality. We assess how both noise and regional search engine differences can be addressed by automated search query expansion focused on hypernyms, while still allowing user-specific manual query expansion. Combining these aspects yields an outline of how a modern web scraping application can produce large-scale image datasets.
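
As a rough illustration of hypernym-focused query expansion with optional user-supplied terms, the following minimal Python sketch may help. It is not the method of any surveyed study: the function name `expand_query`, the `HYPERNYMS` table, and its entries are hypothetical and hand-written for this example, whereas a real system would more likely draw hypernyms from a lexical database such as WordNet.

```python
from typing import Dict, List, Optional

# Hypothetical, hand-written hypernym table used only for illustration.
# A real system might look these up in a lexical database instead.
HYPERNYMS: Dict[str, List[str]] = {
    "labrador": ["retriever", "dog"],
    "jaguar": ["big cat", "animal"],
}


def expand_query(
    term: str,
    hypernyms: Dict[str, List[str]] = HYPERNYMS,
    user_terms: Optional[List[str]] = None,
) -> List[str]:
    """Return the original query plus variants with one extra term appended.

    Automatically found hypernyms come first, followed by any terms the
    user adds manually. Duplicate queries are skipped.
    """
    key = term.strip().lower()
    queries = [key]
    for extra in hypernyms.get(key, []) + list(user_terms or []):
        candidate = f"{key} {extra.strip().lower()}"
        if candidate not in queries:
            queries.append(candidate)
    return queries


if __name__ == "__main__":
    # "labrador" alone is ambiguous (dog breed or region); appending a
    # hypernym such as "dog" narrows what an image search is likely to return.
    for query in expand_query("Labrador", user_terms=["puppy"]):
        print(query)
```

Each expanded query could then be submitted to an image search engine separately, so that results matching an unintended sense of the original term become less likely to enter the dataset.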

Acknowledgement

This study was supported by the Research Program funded by the Seoul National University of Science and Technology (SeoulTech).
