http://dx.doi.org/10.9717/kmms.2019.22.3.374

Design and Implementation of Web Crawler utilizing Unstructured data  

Tanvir, Ahmed Md. (Dept. of Computer Engineering, Pukyong National University)
Chung, Mokdong (Dept. of Computer Engineering, Pukyong National University)
Abstract
A web crawler is a program that search engines commonly use to discover new content on the internet, and crawlers have made the web easier for users to navigate. In this paper, we structure unstructured data in order to collect data from web pages. Our system can select the words that appear near a given keyword across multiple documents, even when those documents are unstructured. These neighboring words are collected for the keyword using word2vec. The system aims to filter data at the acquisition level and to support a large taxonomy. The main problem in text taxonomy is how to improve classification accuracy. To improve accuracy, we propose a new weighting method for TF-IDF, modifying the TF algorithm to calculate the accuracy of unstructured data. Finally, we propose a competent web page search crawling algorithm, derived from TF-IDF and an RL web search algorithm, that improves the efficiency of searching for relevant information. We also examine how crawlers and crawling algorithms work in search engines for efficient information retrieval.
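As background for the weighting discussed above, here is a minimal sketch of standard (textbook) TF-IDF. It is not the modified weighting proposed in the paper, whose details the abstract does not give. The function name `tf_idf`, the whitespace tokenizer, and the sample documents are illustrative assumptions.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document (textbook TF-IDF).

    Assumption: tokens are lowercase whitespace-separated words; this is
    a baseline sketch, not the paper's modified weighting scheme.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: how many documents contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        # TF = relative frequency in the document; IDF = log(N / df).
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical sample documents, for illustration only.
docs = [
    "web crawler collects web pages",
    "search engine ranks pages",
    "crawler follows hyperlink",
]
scores = tf_idf(docs)
```

In this sketch, a term that is frequent in one document but rare across the collection (such as "web" in the first sample document) receives a higher weight than a term shared by several documents (such as "pages").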
Keywords
Web Crawling; TF-IDF; Unstructured data; Hyperlink