• Title/Summary/Keyword: Crawler

202 search results

Development of MFL Testing System for the Inspection of Storage Tank Floor (저장탱크 바닥면 검사를 위한 누설자속 탐상 시스템 개발)

  • Won, Soon-Ho; Cho, Kyung-Shik; Lee, Jong-O; Chang, Hong-Keun; Joo, Gwang-Tae
    • Journal of the Korean Society for Nondestructive Testing / v.22 no.1 / pp.38-44 / 2002
  • The MFL method is a qualitative inspection tool and a reliable, fast, and economical NDT method, and its application to the inspection of storage tank floor plates has been shown to be viable. Examination of tank floors previously depended primarily upon ultrasonic test methods that required slow and painstaking application, so most ultrasonic inspection of storage tanks has been limited to spot testing only. Our NDE group has developed a magnetic flux leakage system to overcome the limitations of ultrasonic testing. The developed system consists of a magnetic yoke, an array sensor, a crawler, and software. The system was proven able to detect an artificial flaw of 3.2 mm diameter and 1.2 mm depth in a 6 mm thick steel plate.
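
The abstract does not give the detection logic, but as a rough illustration of screening an array-sensor scan for flaw indications, here is a minimal thresholding sketch; the data layout, channel count, and 5-sigma threshold are assumptions for illustration, not details from the paper.

```python
import numpy as np

def flag_flaw_indications(scan, k=5.0):
    """Flag samples whose leakage-flux amplitude deviates from the
    per-channel baseline by more than k standard deviations.
    scan: 2-D array (n_channels, n_samples), one row per Hall sensor
    in the array (hypothetical data layout)."""
    baseline = np.median(scan, axis=1, keepdims=True)  # per-channel baseline
    sigma = np.std(scan, axis=1, keepdims=True)
    return np.abs(scan - baseline) > k * sigma         # boolean indication map

# Synthetic 16-channel scan with one injected anomaly
scan = np.random.normal(0.0, 0.01, size=(16, 500))
scan[7, 250] += 0.2                                    # simulated flaw signal
print(np.argwhere(flag_flaw_indications(scan)))        # flags channel 7, sample 250
```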

HTML Text Extraction Using Frequency Analysis (빈도 분석을 이용한 HTML 텍스트 추출)

  • Kim, Jin-Hwan; Kim, Eun-Gyung
    • Journal of the Korea Institute of Information and Communication Engineering / v.25 no.9 / pp.1135-1143 / 2021
  • Recently, text collection with web crawlers for big data analysis has become common. However, to collect only the necessary text from a web page composed of numerous tags and texts, the crawler must be told which HTML tags and style attributes contain the required text, which is cumbersome. In this paper, we propose a method of extracting text using the frequency with which text appears across web pages, without specifying HTML tags or style attributes. In the proposed method, text is extracted from the DOM tree of every collected web page, its frequency of appearance is analyzed, and the main text is obtained by excluding high-frequency text. The superiority of the proposed method was verified experimentally.
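
A minimal sketch of this frequency-based filtering, assuming Python with BeautifulSoup over a batch of already-downloaded pages; the function names and the 0.5 document-frequency cutoff are illustrative, not taken from the paper:

```python
from collections import Counter
from bs4 import BeautifulSoup

def text_nodes(html):
    """Return the visible text fragments of one page's DOM tree."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style"]):
        tag.decompose()                          # drop non-visible content
    return list(soup.stripped_strings)

def extract_main_text(pages, max_ratio=0.5):
    """Keep fragments that appear in fewer than max_ratio of all pages;
    menus, headers, and footers repeat across pages and are excluded
    by their high document frequency."""
    page_texts = [text_nodes(html) for html in pages]
    doc_freq = Counter()
    for texts in page_texts:
        doc_freq.update(set(texts))              # count each page once
    limit = max_ratio * len(pages)
    return [[t for t in texts if doc_freq[t] <= limit] for texts in page_texts]
```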

WCTT: Web Crawling System based on HTML Document Formalization (WCTT: HTML 문서 정형화 기반 웹 크롤링 시스템)

  • Kim, Jin-Hwan; Kim, Eun-Gyung
    • Journal of the Korea Institute of Information and Communication Engineering / v.26 no.4 / pp.495-502 / 2022
  • Web crawlers, the main tool for collecting text on the web today, are difficult to maintain and extend because researchers must implement separate collection logic for each collection channel after analyzing the tags and styles of its HTML documents. To solve this problem, a web crawler should collect text by formalizing HTML documents into a common structure. In this paper, we designed and implemented WCTT (Web Crawling system based on Tag path and Text appearance frequency), a web crawling system that collects text with a single collection logic by formalizing HTML documents based on tag paths and text appearance frequency. Because WCTT collects text with the same logic for every collection channel, channels are easy to maintain and add. It also provides preprocessing that removes stopwords and extracts only nouns, for keyword network analysis and similar uses.
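
A minimal sketch of the tag-path formalization with BeautifulSoup; the 'html/body/div/p' path format and the helper name are illustrative assumptions, not WCTT's actual implementation:

```python
from bs4 import BeautifulSoup

def tag_paths(html):
    """Map every visible text node to its root-to-node tag path
    (e.g. 'html/body/div/p'), giving pages from any collection
    channel the same formal structure."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for node in soup.find_all(string=True):
        text = node.strip()
        if not text or node.find_parent(["script", "style"]):
            continue
        names = [p.name for p in node.parents if p.name != "[document]"]
        pairs.append(("/".join(reversed(names)), text))
    return pairs

# Example: both fragments share the path 'html/body/div/p'
print(tag_paths("<html><body><div><p>a</p><p>b</p></div></body></html>"))
```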

Crawling algorithm design and experiment for automatic deep web document collection (심층 웹 문서 자동 수집을 위한 크롤링 알고리즘 설계 및 실험)

  • Yun-Jeong, Kang; Min-Hye, Lee; Dong-Hyun, Won
    • Journal of the Korea Institute of Information and Communication Engineering / v.27 no.1 / pp.1-7 / 2023
  • Deep web collection means entering a query into a search form and collecting the response results. The deep web is estimated to hold about 450 to 550 times more information than the statically constructed surface web. A static page does not show changed information until it is refreshed, whereas a dynamic page updates the necessary information in real time without reloading; a crawler, however, has difficulty accessing that updated information. A way to collect deep web information automatically with a crawler is therefore needed. This paper proposes treating scripts as ordinary links: an algorithm that can follow client-side scripts like regular URLs is proposed and tested. The proposed algorithm focuses on collecting web information by menu navigation and script execution instead of the usual method of entering data into search forms.
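
A minimal sketch of the menu-navigation-and-script-execution idea, using Selenium as an assumed stand-in for the paper's crawler; the CSS selector and the revisit strategy are placeholders, not the authors' algorithm:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

def crawl_menu(url, selector="nav a, [onclick]"):
    """Follow each menu item or scripted element by clicking it and
    snapshotting the DOM that the client-side script builds.
    The CSS selector is a placeholder; a real site needs its own."""
    driver = webdriver.Chrome()
    pages = []
    try:
        driver.get(url)
        count = len(driver.find_elements(By.CSS_SELECTOR, selector))
        for i in range(count):
            # Re-locate on every pass: executing a script or navigating
            # invalidates previously found element references.
            items = driver.find_elements(By.CSS_SELECTOR, selector)
            if i >= len(items):
                break
            items[i].click()                  # run the script "like a URL"
            pages.append(driver.page_source)  # collect the updated DOM
            driver.get(url)                   # return to the menu page
    finally:
        driver.quit()
    return pages
```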

Analysis of Behavior Patterns from Human and Web Crawler Events Log on ScienceON (ScienceON 웹 로그에 대한 인간 및 웹 크롤러 행위 패턴 분석)

  • Poositaporn, Athiruj; Jung, Hanmin; Park, Jung Hoon
    • Proceedings of the Korean Institute of Information and Communication Sciences Conference / 2022.05a / pp.6-8 / 2022
  • Web log analysis is one of the essential procedures for service improvement. ScienceON is a representative information service that provides various S&T literature and information, and we analyze its logs for continuous improvement. This study analyzes ScienceON web logs recorded in May 2020 and May 2021, dividing them into humans and web crawlers and performing an in-depth analysis. First, only web logs of the S (search), V (detail view), and D (download) types are extracted and normalized, yielding 658,407 and 8,727,042 records for the two periods. Second, the logs are classified into humans and web crawlers using the Python 'user_agents' library, and third, with the session window set to 60 seconds, each session is analyzed. We found that web crawlers, unlike humans, show relatively long average behavior patterns per session, consisting mainly of V actions. In the future, the service will be improved to quickly detect and respond to web crawlers and to the behavioral patterns of human users.
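
The human/crawler split relies on the Python 'user_agents' package named in the abstract; the sessionization below (60-second window over S/V/D events) is an illustrative reconstruction, not the authors' code:

```python
from user_agents import parse

def is_crawler(ua_string):
    """Classify one log line's user-agent string as bot or human."""
    return parse(ua_string).is_bot

def sessionize(events, gap=60):
    """Split one client's (timestamp, action) events, with actions
    'S', 'V', or 'D', into sessions at gaps longer than 60 seconds."""
    sessions, current = [], []
    for ts, action in sorted(events):
        if current and ts - current[-1][0] > gap:
            sessions.append(current)
            current = []
        current.append((ts, action))
    if current:
        sessions.append(current)
    return sessions

# A 200-second gap splits these events into an 'SV' and a 'VD' session
print(sessionize([(0, "S"), (10, "V"), (210, "V"), (220, "D")]))
```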

A Heuristic Method of In-situ Drought Using Mass Media Information

  • Lee, Jiwan; Kim, Seong-Joon
    • Proceedings of the Korea Water Resources Association Conference / 2020.06a / pp.168-168 / 2020
  • This study evaluates the characteristics of drought-related big data published in South Korea using a crawler we developed. Drought-related articles posted over 5 years (2013-2017) were collected from the Korean internet search engine 'NAVER', which covers 13 national and 81 local daily newspapers. Over this period, a total of 40,219 news articles containing the word 'drought' were found with the crawler. To filter out homonyms often used in South Korea, such as a goal drought in sports, a money drought in economics, or a policy drought in politics, quality control was applied and 47.8 % of the articles were removed. The remaining 20,999 (52.2 %) drought news articles were classified into four categories, water deficit (WD), water security and support (WSS), economic damage and impact (EDI), and environmental and sanitation impact (ESI), using 27, 15, 13, and 18 drought-related keywords per category. WD, WSS, EDI, and ESI accounted for 41.4 %, 34.5 %, 14.8 %, and 9.3 %, respectively. The articles were posted mostly in June 2015 and June 2017, with 22.7 % (15,097) and 15.9 % (10,619), respectively. The drought news articles were compared spatiotemporally with the calculated SPI (Standardized Precipitation Index) and RDI (Reservoir Drought Index). They were grouped by the administrative boundaries of 8 major cities and 9 provinces in South Korea, because drought response is organized by local government unit. Space-time clustering between the news articles (WD, WSS, EDI, and ESI) and the indices (SPI and RDI) was examined to see how strongly they correlate, with spatiotemporal cluster detection performed in the SaTScan software (Kulldorff, 2015). Retrospective and prospective cluster analyses were conducted for past and present periods to understand how intensively they cluster. The WD, WSS, and EDI articles formed strong clusters in the provinces, and ESI in the cities.
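
A minimal sketch of the keyword-based categorization, with placeholder English keywords, since the paper's 27/15/13/18 Korean keyword lists are not reproduced here:

```python
# Placeholder keywords; the paper uses 27, 15, 13, and 18 Korean
# keywords per category, which are not given in the abstract.
CATEGORIES = {
    "WD":  ["water shortage", "restricted water supply", "dried reservoir"],
    "WSS": ["emergency water supply", "water support", "water truck"],
    "EDI": ["crop damage", "price surge", "harvest loss"],
    "ESI": ["water quality", "algal bloom", "sanitation"],
}

def classify(article_text):
    """Return every category whose keywords occur in the article."""
    text = article_text.lower()
    return [cat for cat, words in CATEGORIES.items()
            if any(w in text for w in words)]

print(classify("Emergency water supply trucks were sent to the dried reservoir."))
# -> ['WD', 'WSS']
```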

Development and Performance Analysis of Self-Propelled Crawler and Gathering Type Potato Harvester (크롤러 타입 자주식 수집형 감자 수확기 개발 및 성능분석)

  • Won-Kyung Kim; Sang Hee Lee; Deok Gyu Choi; Seok Ho Park; Youn Koo Kang; Seok Pyo Moon; Chang Uk Cheon; Young Joo Kim; Sung Hyuk Jang
    • Journal of Drive and Control / v.21 no.2 / pp.23-29 / 2024
  • Potatoes are one of the world's four major crops, and domestic consumption is currently increasing in Korea. However, the mechanization rate of potato production is very low, and harvesting in particular is the most labor-intensive task. In Korea, potato collection depends on manual labor, so it is necessary to develop a gathering-type harvester that covers the process from digging to harvesting. Therefore, in this study, a self-propelled potato harvester was developed and its performance analyzed to mechanize harvesting. The harvester has a crawler-type driving part with a 60 hp diesel engine and consists of a digging part that lifts potatoes from the ground, a vertical transporting part that raises the dug potatoes to the height of the collection bag, a separating part that removes debris such as stones and soil, and a collecting part that loads the collection box. A field test was conducted, and performance was evaluated by the damage, loss, and debris mixing proportions, which were 2.5%, 2.8%, and 2.6%, respectively. The working capacity was 1.2 h/10 a. An economic analysis showed that the cost of harvesting could be reduced by 12.7% compared to manual harvesting.

A Mobile Robot for Remote Inspection of Radioactive Waste (방사선폐기물 원격감시용 이동로봇)

  • 서용칠; 김창회; 조재완; 최영수; 김승호
    • Proceedings of the Korean Radioactive Waste Society Conference / 2004.06a / pp.430-432 / 2004
  • Tele-operation and remote monitoring techniques are essential technologies for the inspection and maintenance of radioactive waste. A mobile robot has been developed for remote monitoring and inspection of nuclear facilities, where human access is limited by the high-level radiation environment. The mobile robot was designed with reconfigurable crawler-type wheels attached at the front and rear so that it can pass through ditches. The extendable mast mounted on the robot can be raised up to 8 m vertically. A radiation-robust controller was designed with a focus on the electronic components, to prevent abnormal operation in highly radioactive areas during reactor operation. This robot system will enhance the reliability of nuclear power facilities and help cope with unexpected radiation accidents.

Performance Analysis of Web-Crawler in Multi-thread Environment (다중 쓰레드 환경에서 웹 크롤러의 성능 분석)

  • Park, Jung-Woo; Kim, Jun-Ho; Lee, Won-Joo; Jeon, Chang-Ho
    • Proceedings of the Korean Society of Computer Information Conference / 2009.01a / pp.473-476 / 2009
  • In this paper, we implement a web crawler that operates in a multi-threaded environment and analyze its performance. The distinguishing feature of this crawler is that, to shorten search time, its crawling, parsing/PageRank, and DB storage modules are implemented to carry out their tasks independently of one another. The crawling module collects data from the web. The parsing and PageRank module parses the collected data and assigns each web page a PageRank by computing its relative importance as a numeric score. The DB module stores the PageRank obtained by the PageRank module in the database. In the performance evaluation, we measure search time in the multi-threaded environment while varying the number of threads and the number of web pages, and compare the results.
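
A minimal sketch of the three modules running independently and connected by queues, with trivial stand-ins for fetching, scoring, and storing, since the abstract does not show the actual implementation:

```python
import queue
import threading

def fetch(url):          # stand-in for an HTTP GET in the crawling module
    return "<html>%s</html>" % url

def score(html):         # stand-in for parsing + PageRank computation
    return len(html)

def save(url, rank):     # stand-in for the DB storage module
    print(url, rank)

url_q, page_q, rank_q = queue.Queue(), queue.Queue(), queue.Queue()

def crawler():
    """Crawling module: fetch pages and hand them to the parser."""
    while True:
        url = url_q.get()
        page_q.put((url, fetch(url)))
        url_q.task_done()

def parser_ranker():
    """Parsing/PageRank module: score each page's relative importance."""
    while True:
        url, html = page_q.get()
        rank_q.put((url, score(html)))
        page_q.task_done()

def db_writer():
    """DB module: persist the computed ranks."""
    while True:
        url, rank = rank_q.get()
        save(url, rank)
        rank_q.task_done()

for worker in (crawler, parser_ranker, db_writer):
    threading.Thread(target=worker, daemon=True).start()

for u in ("http://a.example", "http://b.example"):
    url_q.put(u)
url_q.join(); page_q.join(); rank_q.join()   # wait until all stages drain
```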

User-Centered Information Retrieving Method in Blogs (사용자 중심의 블로그 정보 검색 기법)

  • Kim, Seung-Jong
    • Journal of the Korea Academia-Industrial cooperation Society / v.11 no.9 / pp.3458-3464 / 2010
  • With the recent tremendous growth of internet information, RSS syndication technology gives internet users a user-friendly way to search for information. RSS delivers newly updated content automatically, so users do not need to visit web sites repeatedly to obtain new information. This paper proposes a way of managing a web crawler that collects the sites publishing RSS documents and helps users make efficient use of them, and it also suggests a way of ranking the RSS documents by their popularity among users. With the proposed retrieval methods, users can efficiently find the documents they need.
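
A minimal sketch of feed collection and popularity-based ranking, assuming the feedparser package; the click-count popularity signal is a placeholder for the paper's user-popularity measure:

```python
import feedparser

def collect_feeds(feed_urls):
    """Fetch each registered RSS feed and return (title, link) pairs,
    so users get updates without revisiting every site."""
    entries = []
    for url in feed_urls:
        feed = feedparser.parse(url)
        entries.extend((e.get("title", ""), e.get("link", ""))
                       for e in feed.entries)
    return entries

def rank_by_popularity(entries, clicks):
    """Order entries by a per-link click count, a stand-in for the
    paper's ranking of RSS documents by users' popularity."""
    return sorted(entries, key=lambda e: clicks.get(e[1], 0), reverse=True)
```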