Acknowledgement
This study was supported by the Research Program funded by the Seoul National University of Science and Technology (SeoulTech).
References
- A. Rosebrock, "How to create a deep learning dataset using google images," 2017 [Online]. Available: https://pyimagesearch.com/2017/12/04/how-to-create-a-deep-learning-dataset-using-google-images/.
- D. M. Thomas and S. Mathur, "Data analysis by web scraping using Python," in Proceedings of 2019 3rd International Conference on Electronics, Communication and Aerospace Technology (ICECA), Coimbatore, India, 2019, pp. 450-454. https://doi.org/10.1109/ICECA.2019.8822022
- D. Glez-Pena, A. Lourenco, H. Lopez-Fernandez, M. Reboiro-Jato, and F. Fdez-Riverola, "Web scraping technologies in an API world," Briefings in Bioinformatics, vol. 15, no. 5, pp. 788-797, 2014. https://doi.org/10.1093/bib/bbt026
- F. Schroff, A. Criminisi, and A. Zisserman, "Harvesting image databases from the web," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 4, pp. 754-766, 2011. https://doi.org/10.1109/TPAMI.2010.133
- Y. Yao, J. Zhang, F. Shen, X. Hua, J. Xu, and Z. Tang, "Automatic image dataset construction with multiple textual metadata," in Proceedings of 2016 IEEE International Conference on Multimedia and Expo (ICME), Seattle, WA, 2016, pp. 1-6. https://doi.org/10.1109/ICME.2016.7552988
- Y. Lin, J. B. Michel, E. A. Lieberman, J. Orwant, W. Brockman, and S. Petrov, "Syntactic annotations for the Google Books Ngram Corpus," in Proceedings of the ACL 2012 System Demonstrations, Jeju, South Korea, 2012, pp. 169-174.
- J. M. Zink, "Automated dataset generation for image recognition using the example of taxonomy," 2018 [Online]. Available: https://arxiv.org/abs/1802.02207.
- G. A. Miller, "WordNet: a lexical database for English," Communications of the ACM, vol. 38, no. 11, pp. 39-41, 1995. https://doi.org/10.1145/219717.219748
- Google, "Word2Vec documentation," 2013 [Online]. Available: https://code.google.com/archive/p/word2vec/.
- A. Handler, "An empirical study of semantic similarity in WordNet and Word2Vec," Master's thesis, University of New Orleans, New Orleans, LA, USA, 2014 [Online]. Available: https://scholarworks.uno.edu/td/1922/.
- F. Yu, A. Seff, Y. Zhang, S. Song, T. Funkhouser, and J. Xiao, "LSUN: construction of a large-scale image dataset using deep learning with humans in the loop," 2015 [Online]. Available: https://arxiv.org/abs/1506.03365.
- D. Schwab and M. Lafourcade, "Hardening of acception links through vectorized lexical functions," 2002 [Online]. Available: https://www.researchgate.net/publication/2543982_Hardening_of_Acception_Links_Through_Vectorized_Lexical_Functions.
- B. Zhao, "Web scraping," in Encyclopedia of Big Data. Cham, Switzerland: Springer, 2017, pp. 1-3. https://doi.org/10.1007/978-3-319-32001-4_483-1
- D. S. Sirisuriya, "A comparative study on web scraping," 2015 [Online]. Available: http://ir.kdu.ac.lk/handle/345/1051.
- S. Upadhyay, V. Pant, S. Bhasin, and M. K. Pattanshetti, "Articulating the construction of a web scraper for massive data extraction," in Proceedings of 2017 2nd International Conference on Electrical, Computer and Communication Technologies (ICECCT), Coimbatore, India, 2017, pp. 1-4. https://doi.org/10.1109/ICECCT.2017.8117827
- L. L. Pipino, Y. W. Lee, and R. Y. Wang, "Data quality assessment," Communications of the ACM, vol. 45, no. 4, pp. 211-218, 2002. https://doi.org/10.1145/505248.506010
- S. Shen, "7 steps to ensure and sustain data quality," 2019 [Online]. Available: https://towardsdatascience.com/7-steps-to-ensure-and-sustain-data-quality-3c0040591366.
- HeavyAI, "Data quality FAQ," 2022 [Online]. Available: https://www.heavy.ai/technical-glossary/data-quality.
- R. L. Sarfin, "5 characteristics of data quality," 2022 [Online]. Available: https://www.precisely.com/blog/data-quality/5-characteristics-of-data-quality.
- C. Stedman and J. Vaughan, "Data quality," 2022 [Online]. Available: https://www.techtarget.com/searchdatamanagement/definition/data-quality.
- S. Dodge and L. Karam, "Understanding how image quality affects deep neural networks," in Proceedings of 2016 8th International Conference on Quality of Multimedia Experience (QoMEX), 2016, pp. 1-6. https://doi.org/10.1109/QoMEX.2016.7498955
- Z. Chen, W. Lin, S. Wang, L. Xu, and L. Li, "Image quality assessment guided deep neural networks training," 2017 [Online]. Available: https://arxiv.org/abs/1708.03880.
- Z. Chen, W. Lin, S. Wang, L. Xu, and L. Li, "Image quality assessment based label smoothing in deep neural network learning," in Proceedings of 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, Canada, 2018, pp. 6742-6746. https://doi.org/10.1109/ICASSP.2018.8461630
- Picsellia, "How to ensure image dataset quality for image classification," 2023 [Online]. Available: https://www.picsellia.com/post/image-data-quality-for-image-classification.
- A. Mikhailiuk, "Deep image quality assessment," 2021 [Online]. Available: https://towardsdatascience.com/deep-image-quality-assessment-30ad71641fac.
- W. Zhang, K. Ma, J. Yan, D. Deng, and Z. Wang, "Blind image quality assessment using a deep bilinear convolutional neural network," IEEE Transactions on Circuits and Systems for Video Technology, vol. 30, no. 1, pp. 36-47, 2020. https://doi.org/10.1109/TCSVT.2018.2886771
- A. Rosebrock, "Detect and remove duplicate images from a dataset for deep learning," 2020 [Online]. Available: https://pyimagesearch.com/2020/04/20/detect-and-remove-duplicate-images-from-a-dataset-for-deep-learning/.
- E. Hofesmann, "Find and remove duplicate images in your dataset," 2021 [Online]. Available: https://towardsdatascience.com/find-and-remove-duplicate-images-in-your-dataset-3e3ec818b978.
- L. Zhang and Y. Rui, "Image search: from thousands to billions in 20 years," ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), vol. 9, no. 1s, article no. 36, 2013. https://doi.org/10.1145/2490823
- Google Search Help, "Q&A about image search (reply: Bluequoll)," 2020 [Online]. Available: https://support.google.com/websearch/thread/32492691/image-search-returns-almost-nothing-compared-to-what-it-used-to-on-about-any-subject-what-gives?hl=en.
- X. Zhu and X. Wu, "Class noise vs. attribute noise: a quantitative study," Artificial Intelligence Review, vol. 22, pp. 177-210, 2004. https://doi.org/10.1007/s10462-004-0751-8
- S. Gupta and A. Gupta, "Dealing with noise problem in machine learning data-sets: a systematic review," Procedia Computer Science, vol. 161, pp. 466-474, 2019. https://doi.org/10.1016/j.procs.2019.11.146
- Tableau, "Guide to data cleaning: definition, benefits, components, and how to clean your data," 2023 [Online]. Available: https://www.tableau.com/learn/articles/what-is-data-cleaning.
- L. Vaughan and M. Thelwall, "Search engine coverage bias: evidence and possible causes," Information Processing & Management, vol. 40, no. 4, pp. 693-707, 2004. https://doi.org/10.1016/S0306-4573(03)00063-3
- A. Mowshowitz and A. Kawaguchi, "Measuring search engine bias," Information Processing & Management, vol. 41, no. 5, pp. 1193-1205, 2005. https://doi.org/10.1016/j.ipm.2004.05.005
- University of West Florida Libraries, "Google Guide: basic search tips," 2023 [Online]. Available: https://libguides.uwf.edu/c.php?g=215353&p=1420921.
- YourDictionary, "Labrador synonyms," 2023 [Online]. Available: https://thesaurus.yourdictionary.com/labrador.
- D. A. Rappaport, P. I. Altman, and K. Handschumacher, "To scrape or not to scrape: the potential legal implications of using web scraping for market research," Hedge Fund Law Report, 2021 [Online]. Available: https://www.akingump.com/a/web/soxXRQ6Nw48FehNvwpdjJ1/2jiuhx/hflr-reprint-to-scrape-or-not-to-scrape-rappaport-altman-handschumacher-4819-0662-7801-v1.pdf.
- T. Paul, "Is web scraping legal? A guide to understanding legality on web scraping," 2020 [Online]. Available: https://www.blog.datahut.co/post/is-web-scraping-legal.
- Google, "Removing content on Google," 2023 [Online]. Available: https://support.google.com/legal/troubleshooter/1114905?hl=en.