• Title/Summary/Keyword: web crawling

176 search results

Database metadata standardization processing model using web dictionary crawling (웹 사전 크롤링을 이용한 데이터베이스 메타데이터 표준화 처리 모델)

  • Jeong, Hana; Park, Koo-Rack; Chung, Young-suk
    • Journal of Digital Convergence / v.19 no.9 / pp.209-215 / 2021
  • Data quality management is an important issue these days, and consistent metadata improves data quality. This study presents algorithms that facilitate standard word dictionary management for consistent metadata management: synonym management of database metadata is automated through web dictionary crawling. The approach also improves data accuracy by resolving homonym-disambiguation issues that may arise during the web dictionary crawling process. Compared with existing manual management, the proposed algorithm increases the reliability of metadata quality and reduces the time spent on registering and managing synonym data. Further research on partially automating data standardization will be needed, based on a detailed understanding of which tasks in data standardization activities can be automated.
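
A minimal sketch of the web-dictionary lookup that the proposed synonym-management algorithm automates; the dictionary URL, page markup, and the ".synonym" selector are hypothetical placeholders rather than the service used in the paper.

```python
# Sketch: fetch candidate synonyms for a term from a web dictionary page.
# NOTE: the URL and the ".synonym" selector are hypothetical; a real crawler
# must match the markup of the dictionary site actually used.
import requests
from bs4 import BeautifulSoup

def fetch_synonyms(term: str) -> list[str]:
    html = requests.get(
        "https://dictionary.example.com/search",  # placeholder dictionary URL
        params={"q": term},
        timeout=10,
    ).text
    soup = BeautifulSoup(html, "html.parser")
    # Collect synonym entries; duplicates are dropped to keep the standard
    # word dictionary consistent.
    synonyms = {el.get_text(strip=True) for el in soup.select(".synonym")}
    synonyms.discard(term)
    return sorted(synonyms)

if __name__ == "__main__":
    print(fetch_synonyms("customer"))
```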

Crawling algorithm design and experiment for automatic deep web document collection (심층 웹 문서 자동 수집을 위한 크롤링 알고리즘 설계 및 실험)

  • Yun-Jeong, Kang; Min-Hye, Lee; Dong-Hyun, Won
    • Journal of the Korea Institute of Information and Communication Engineering / v.27 no.1 / pp.1-7 / 2023
  • Deep web collection means entering a query into a search form and collecting the response results. The deep web is estimated to hold about 450 to 550 times more information than the statically constructed surface web. A static page does not show changed information until it is refreshed, whereas a dynamic page updates the necessary information in real time without reloading; however, a crawler has difficulty accessing such updated information. There is therefore a need for a way to collect information from the deep web automatically with a crawler. This paper proposes a method of treating client scripts as ordinary links, and designs and experiments with an algorithm that can follow client scripts in the same way as regular URLs. The proposed algorithm focuses on collecting web information by menu navigation and script execution instead of the usual method of entering data into search forms.
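
A minimal sketch of the script-as-link idea using a headless browser: elements carrying client-side scripts are executed like ordinary links and the revealed documents are harvested. The entry URL and the onclick selector are illustrative assumptions, not the paper's implementation.

```python
# Sketch: treat client-side scripts (menu onclick handlers) like ordinary links,
# executing them in a headless browser and collecting the documents they reveal.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless=new")
driver = webdriver.Chrome(options=options)

driver.get("https://deepweb.example.com")  # placeholder entry page
menu_count = len(driver.find_elements(By.CSS_SELECTOR, "a[onclick], button[onclick]"))

collected = []
for i in range(menu_count):
    # Re-query on every pass because the DOM may have been replaced by navigation.
    items = driver.find_elements(By.CSS_SELECTOR, "a[onclick], button[onclick]")
    if i >= len(items):
        break
    driver.execute_script("arguments[0].click();", items[i])  # run the script like a link
    collected.append(driver.page_source)                      # harvest the revealed document
    driver.back()

driver.quit()
print(f"collected {len(collected)} dynamically generated documents")
```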

Improving the quality of Search engine by using the Intelligent agent technology

  • Nguyen, Ha-Nam; Choi, Gyoo-Seok; Park, Jong-Jin; Chi, Sung-Do
    • Journal of the Korea Computer Industry Society / v.4 no.12 / pp.1093-1102 / 2003
  • The dynamic nature of the World Wide Web challenges search engines to find relevant and recent pages. Obtaining important pages rapidly can be very useful when a crawler cannot visit the entire Web in a reasonable amount of time. In this paper we study the order in which spiders should visit URLs so as to obtain more “important” pages first. We define and apply several metrics and a ranking formula for improving crawling results. The comparison between our results and the Breadth-First Search (BFS) method shows the efficiency of our experimental system.
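
A minimal sketch of the contrast the abstract describes: a crawl frontier ordered by an importance score versus plain breadth-first order. The score() function below (an in-link count) is only a stand-in, not the paper's actual metrics or ranking formula.

```python
# Sketch: importance-first crawling vs. breadth-first crawling.
# score() is a placeholder for the paper's ranking formula; here it simply
# prefers URLs that many already-seen pages link to.
import heapq
from collections import defaultdict, deque

inlinks = defaultdict(int)  # URL -> number of discovered in-links

def score(url: str) -> float:
    return inlinks[url]

def crawl_importance_first(seed_urls, fetch_links, budget=100):
    """Visit up to `budget` pages, always expanding the highest-scored URL."""
    frontier = [(-score(u), u) for u in seed_urls]
    heapq.heapify(frontier)
    visited = set()
    while frontier and len(visited) < budget:
        _, url = heapq.heappop(frontier)
        if url in visited:
            continue
        visited.add(url)
        for out in fetch_links(url):  # fetch_links downloads and parses the page
            inlinks[out] += 1
            heapq.heappush(frontier, (-score(out), out))
    return visited

def crawl_bfs(seed_urls, fetch_links, budget=100):
    """Baseline: plain FIFO breadth-first order."""
    frontier, visited = deque(seed_urls), set()
    while frontier and len(visited) < budget:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        frontier.extend(fetch_links(url))
    return visited
```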

News Abusing Inference Model Using Web Crawling (웹크롤링을 활용한 뉴스 어뷰징 추론 모델)

  • Chung, Kyoung-Rock; Park, Koo-Rack; Chung, Young-Suk; Nam, Ki-Bok
    • Proceedings of the Korean Society of Computer Information Conference / 2018.07a / pp.175-176 / 2018
  • As more people now read news online and on mobile devices rather than through traditional newspapers or TV, competition to gain more exposure than other outlets' articles in portal-site news sections has intensified, and news abusing has emerged as a serious social problem. This paper proposes a model that identifies news abusing, which wastes users' time and makes it hard to find quality information among the vast amount of news created and distributed online. The proposed model crawls the title and body of news articles and then determines whether an article is abusing through a similarity check based on artificial intelligence, so that high-quality news information can be provided to users.
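
The abstract does not specify the similarity technique, so the sketch below uses TF-IDF cosine similarity purely as an illustrative stand-in for the AI-based similarity check that flags near-duplicate (abusing) articles.

```python
# Sketch: flag suspected abusing (near-duplicate) articles by pairwise text
# similarity. TF-IDF cosine similarity stands in for the paper's unspecified
# AI-based similarity check.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def find_abusing_pairs(articles, threshold=0.8):
    """articles: list of (title, body) tuples already collected by the crawler."""
    texts = [f"{title} {body}" for title, body in articles]
    matrix = TfidfVectorizer().fit_transform(texts)
    sims = cosine_similarity(matrix)
    pairs = []
    for i in range(len(texts)):
        for j in range(i + 1, len(texts)):
            if sims[i, j] >= threshold:  # near-duplicate -> suspected abusing
                pairs.append((i, j, float(sims[i, j])))
    return pairs
```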

Information-providing Application Based on Web Crawling (웹 크롤링을 통한 개인 맞춤형 정보제공 애플리케이션)

  • Ju-Hyeon Kim; Jeong-Eun Choi; U-Gyeong Shin; Min-Jun Piao; Tae-Kook Kim
    • Proceedings of the Korea Information Processing Society Conference / 2023.11a / pp.295-296 / 2023
  • This paper studies a personalized information-providing application based on web crawling. The service crawls web data with Java's Jsoup library and stores it in MySQL. The stored data is then filtered by user-specified keywords and delivered to the user. For example, when a notice related to a user-specified keyword is updated, it can be checked within the implemented app, and the service also pushes the updated information to the user in real time through KakaoTalk notification messages.
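
A minimal sketch of the crawl, store, keyword-filter, notify flow described above. The paper uses Java's Jsoup and MySQL with KakaoTalk notifications; requests/BeautifulSoup, SQLite, and a print-based notify() are used here only as lightweight stand-ins, and the board URL and CSS selector are hypothetical.

```python
# Sketch: crawl a notice board, keep only notices not seen before, and notify
# the user when a new notice matches one of their keywords.
import sqlite3
import requests
from bs4 import BeautifulSoup

def notify(user: str, message: str) -> None:
    # Placeholder for the KakaoTalk notification step described in the paper.
    print(f"[notify {user}] {message}")

def crawl_and_filter(board_url: str, keywords: list[str], user: str) -> None:
    db = sqlite3.connect("notices.db")
    db.execute("CREATE TABLE IF NOT EXISTS notice (title TEXT PRIMARY KEY)")
    soup = BeautifulSoup(requests.get(board_url, timeout=10).text, "html.parser")
    for link in soup.select(".notice a"):  # placeholder selector
        title = link.get_text(strip=True)
        # Skip notices that were already stored on a previous crawl.
        if db.execute("SELECT 1 FROM notice WHERE title = ?", (title,)).fetchone():
            continue
        db.execute("INSERT INTO notice VALUES (?)", (title,))
        if any(kw in title for kw in keywords):  # user-defined keyword filter
            notify(user, title)
    db.commit()
    db.close()
```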

Identifying Reader's Internal Needs and Characteristics Using Keywords from Korean Web Novels (웹소설 키워드를 통한 이용 독자 내적 욕구 및 특성 파악)

  • Jo, Suyeon; Oh, Hayoung
    • Journal of the Korea Institute of Information and Communication Engineering / v.24 no.2 / pp.158-165 / 2020
  • Web novels consumed on mobile devices are characterized by capturing one aspect of our society. The purpose of this study was to collect keywords from web novels, identify trends in web novels, and further analyze the covert needs and characteristics of readers in connection with existing research. The analysis showed that novels with modern backgrounds and adult novels were popular, which relates to the easily readable and accessible mobile environment. Male characters tend to be depicted as ideals in web novels, while among female characters, those with inner scars were popular. Although this study could not conduct an in-depth analysis of adult novels due to the limitations of web crawling, it is meaningful in that it analyzed modern readers' inner needs and characteristics using paratext such as keywords, whereas previous web novel studies lacked quantitative analysis.

Web-Anti-MalWare Malware Detection System (악성코드 탐지 시스템 Web-Anti-Malware)

  • Jung, Seung-il; Kim, Hyun-Woo
    • Proceedings of the Korean Society of Computer Information Conference / 2014.07a / pp.365-367 / 2014
  • Web services and malicious code have recently been increasing so rapidly that their numbers are hard to estimate. Financial profit has become the main motive behind the malware that grows every year, and public institutions and security companies are actively conducting research to detect it. This paper proposes a malware detection system that can analyze packets in real time through filtering and automatically detect malicious code down to domains and sub-URLs through web crawling.
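
A minimal sketch of the URL-collection part only; the real-time packet filtering and the actual malware scanning described in the paper are out of scope here, and scan_for_malware() below is just a placeholder.

```python
# Sketch: enumerate a domain's sub-URLs by crawling, handing each discovered
# page to a (placeholder) malware scanner.
from urllib.parse import urljoin, urlparse
import requests
from bs4 import BeautifulSoup

def scan_for_malware(url: str) -> None:
    print(f"scanning {url}")  # placeholder for the real detection engine

def crawl_domain(start_url: str, limit: int = 50) -> set[str]:
    domain = urlparse(start_url).netloc
    queue, seen = [start_url], set()
    while queue and len(seen) < limit:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        scan_for_malware(url)
        try:
            soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
        except requests.RequestException:
            continue
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"])
            if urlparse(link).netloc == domain:  # stay within the target domain
                queue.append(link)
    return seen
```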

Optimization Model on the World Wide Web Organization with respect to Content Centric Measures (월드와이드웹의 내용기반 구조최적화)

  • Lee Wookey; Kim Seung; Kim Hando; Kang Sukho
    • Journal of the Korean Operations Research and Management Science Society / v.30 no.1 / pp.187-198 / 2005
  • The structure of a Web site can keep search robots or crawling agents from getting confused in the midst of the huge forest of Web pages. We formalize a view of the World Wide Web and generalize it as a hierarchy of Web objects: the Web as a set of Web sites, and a Web site as a directed graph with Web nodes and Web edges. Our approach yields the optimal hierarchical structure that maximizes the weight tf-idf (term frequency and inverse document frequency), one of the most widely accepted content-centric measures in the information retrieval community, so that the measure can be used to embody the semantics of a search query. The experimental results show that the optimization model is an effective alternative to conventional heuristic approaches in the dynamically changing Web environment.
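
As a reference point for the content-centric measure mentioned above, a minimal sketch of standard tf-idf weighting (term frequency multiplied by log(N/df)); the paper's exact normalization may differ.

```python
# Sketch: standard tf-idf weighting for the terms of Web pages, following the
# common tf * log(N/df) form. The paper's exact normalization may differ.
import math
from collections import Counter

def tf_idf(docs: list[list[str]]) -> list[dict[str, float]]:
    """docs: tokenized Web pages; returns per-page term weights."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # document frequency of each term
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights

pages = [["web", "site", "structure"], ["web", "search", "query"], ["query", "ranking"]]
print(tf_idf(pages))
```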

Understanding the Food Hygiene of Cruise through the Big Data Analytics using the Web Crawling and Text Mining

  • Shuting, Tao; Kang, Byongnam; Kim, Hak-Seon
    • Culinary Science and Hospitality Research / v.24 no.2 / pp.34-43 / 2018
  • The objective of this study was to acquire a general, text-based awareness and recognition of cruise food hygiene through big data analytics. For this purpose, the study collected data by searching the keywords "food hygiene, cruise" on Google web pages and news from October 1st, 2015 to October 1st, 2017 (two years). The data were collected and processed with SCTM, a data collecting and processing program, and eventually about 899 kB of text, approximately 20,000 words, was gathered. For the analysis, UCINET 6.0 packaged with the visualization tool NetDraw was utilized. The analysis showed that words such as "jobs" and "news" had high frequency, while the centrality results (Freeman's degree centrality and eigenvector centrality) and proximity results produced rankings distinct from the frequency ranking. The CONCOR analysis yielded four segments: a "food hygiene group", a "person group", a "location-related group" and a "brand group". This big-data-based diagnosis of food hygiene in the cruise industry is expected to provide instrumental implications for both academic research and empirical application.
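
A rough sketch of the keyword frequency and co-occurrence counting that underlies this kind of network (centrality and CONCOR) analysis; the study itself used SCTM and UCINET 6.0 with NetDraw, so this Python stand-in is illustrative only.

```python
# Sketch: word frequency and co-occurrence counts over crawled texts, the raw
# inputs for a keyword-network analysis such as the one described above.
from collections import Counter
from itertools import combinations

def frequency_and_cooccurrence(documents: list[str]):
    freq, cooc = Counter(), Counter()
    for doc in documents:
        tokens = set(doc.lower().split())
        freq.update(tokens)
        cooc.update(combinations(sorted(tokens), 2))  # word pairs appearing together
    return freq, cooc

docs = ["cruise food hygiene news", "cruise jobs news", "food hygiene jobs"]
freq, cooc = frequency_and_cooccurrence(docs)
print(freq.most_common(3))
print(cooc.most_common(3))
```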

Automated Training Database Development through Image Web Crawling for Construction Site Monitoring (건설현장 영상 분석을 위한 웹 크롤링 기반 학습 데이터베이스 구축 자동화)

  • Hwang, Jeongbin; Kim, Jinwoo; Chi, Seokho; Seo, JoonOh
    • KSCE Journal of Civil and Environmental Engineering Research / v.39 no.6 / pp.887-892 / 2019
  • Many researchers have developed a series of vision-based technologies to monitor construction sites automatically. To achieve high performance of vision-based technologies, it is essential to build a large and high-quality training image database (DB). To do so, researchers usually visit construction sites, install cameras at the jobsites, and collect images for the training DB. However, such a human- and site-dependent approach requires a huge amount of time and cost, and it is difficult to represent the range of characteristics of different construction sites and resources. To address these problems, this paper proposes a framework that automatically constructs a training image DB using web crawling techniques. For validation, the authors conducted two different experiments with the automatically generated DB: construction work type classification and equipment classification. The results showed that the method could successfully build the training image DB for the two classification problems, and the findings of this study can be used to reduce the time and effort required to develop vision-based technologies for construction sites.
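
A minimal sketch of the image-collection step: downloading images returned for a construction-related keyword into a class-labelled folder. The search URL pattern and the generic <img> scraping below are assumptions for illustration, not the authors' actual crawler or data source.

```python
# Sketch: download images found for a keyword and save them into a folder named
# after the class label, forming the raw training DB for later classification.
import os
from urllib.parse import urljoin
import requests
from bs4 import BeautifulSoup

def collect_images(keyword: str, out_dir: str, limit: int = 20) -> int:
    os.makedirs(out_dir, exist_ok=True)
    url = f"https://images.example.com/search?q={keyword}"  # placeholder search URL
    soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
    saved = 0
    for img in soup.find_all("img", src=True):
        if saved >= limit:
            break
        try:
            data = requests.get(urljoin(url, img["src"]), timeout=10).content
        except requests.RequestException:
            continue
        with open(os.path.join(out_dir, f"{keyword}_{saved}.jpg"), "wb") as f:
            f.write(data)
        saved += 1
    return saved

# One folder per class label, e.g. excavator vs. dump truck, for later training.
for label in ["excavator", "dump truck"]:
    collect_images(label, out_dir=f"train_db/{label.replace(' ', '_')}")
```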