• Title/Summary/Keyword: Web Crawling

Analyzing Coverage and Coverage Overlap of Korean Web Directories (국내 웹 디렉토리들의 커버리지 및 커버리지 중복성 분석)

  • 배희진;이진숙;이준호;박소연
    • Journal of the Korean Society for Information Management / v.21 no.1 / pp.173-186 / 2004
  • This study examines the coverage and coverage overlap of the three major Korean web directories: Naver, Yahoo Korea, and Empas. It also suggests a methodology for collecting and processing the web sites provided by these directories. A method for mapping main categories was developed. Each directory provided registered web pages in a slightly different way. Reference links had a significant influence on the coverage of each web directory. The overlap of pages among the three directories was quite low. It is expected that this study could contribute to the field of web research by providing insights into how directories provide web pages and by suggesting a methodology for the analysis of directory coverage.
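
As a hedged illustration of the overlap analysis this abstract describes, coverage and pairwise overlap can be computed once each directory's registered sites have been collected into a set of normalized URLs; the directory contents below are placeholder data, not the paper's:

```python
# Coverage and pairwise overlap of web directories, given sets of
# normalized site URLs collected from each directory (placeholder data).
from itertools import combinations

directories = {
    "Naver":       {"a.com", "b.com", "c.com", "d.com"},
    "Yahoo Korea": {"b.com", "c.com", "e.com"},
    "Empas":       {"c.com", "f.com"},
}

# Coverage: share of the union of all registered sites each directory holds.
union = set().union(*directories.values())
for name, sites in directories.items():
    print(f"{name}: coverage {len(sites) / len(union):.2%}")

# Pairwise overlap: intersection relative to the smaller directory.
for (n1, s1), (n2, s2) in combinations(directories.items(), 2):
    overlap = len(s1 & s2) / min(len(s1), len(s2))
    print(f"{n1} / {n2}: overlap {overlap:.2%}")
```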

A Methodology for Performance Evaluation of Web Robots (웹 로봇의 성능 평가를 위한 방법론)

  • Kim, Kwang-Hyun;Lee, Joon-Ho
    • The KIPS Transactions: Part D / v.11D no.3 / pp.563-570 / 2004
  • As the use of the Internet becomes more popular, a huge amount of information is published on the Web, and users can access that information effectively through Web search services. Since Web search services retrieve relevant documents from those collected by Web robots, the crawling quality of Web robots needs to be improved. In this paper, we suggest evaluation criteria for Web robots, such as efficiency, continuity, freshness, coverage, silence, uniqueness, and safety, and present various functions to improve the performance of Web robots. We also investigate the functions implemented in conventional Web robots such as those of NAVER, Google, and AltaVista. It is expected that this study could contribute to the development of more effective Web robots.
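
The abstract names the evaluation criteria but does not formalize them; as a minimal sketch, two of them (freshness and uniqueness) might be computed over a crawl snapshot as follows, where the record fields and values are assumptions:

```python
# Hypothetical crawl records: URL, content hash, and whether the stored
# copy still matches the live page at evaluation time.
from dataclasses import dataclass

@dataclass
class CrawlRecord:
    url: str
    content_hash: str
    matches_live: bool  # True if the crawled copy is still current

records = [
    CrawlRecord("http://a.com", "h1", True),
    CrawlRecord("http://b.com", "h2", False),
    CrawlRecord("http://c.com", "h2", True),   # duplicate content of b.com
]

# Freshness: fraction of crawled pages that still match the live web.
freshness = sum(r.matches_live for r in records) / len(records)

# Uniqueness: fraction of pages whose content is not a duplicate.
uniqueness = len({r.content_hash for r in records}) / len(records)

print(f"freshness={freshness:.2f}, uniqueness={uniqueness:.2f}")
```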

Development of chatting program using social issue keyword information (사회적 핵심 이슈 키워드 정보를 활용한 채팅 프로그램 개발)

  • Yoon, Kyung-Suob;Jeong, Won-Hyeok
    • Proceedings of the Korean Society of Computer Information Conference / 2020.07a / pp.307-310 / 2020
  • In this paper, text mining techniques are used to extract issue keywords. To extract socially prominent issue keywords, crawling is performed on sites that serve as keyword-collection sources, and morphological analysis is then applied to obtain meaningful morpheme-level words. The Python KoNLPy package is used for Korean morphological analysis. Issue keywords are extracted from frequency statistics over the segmented words. To analyze related words that support each issue keyword, word embedding is performed: the Skip-Gram method of the Word2Vec model is applied to find related words. The extracted issue keywords and their related words are displayed at the top of a chat program that communicates over Web Sockets.
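
A minimal sketch of the pipeline the abstract describes, using the KoNLPy and gensim packages it names (sg=1 selects Skip-Gram); the input sentences and model parameters are placeholders, not the paper's corpus or settings:

```python
# Issue-keyword extraction and related-word lookup: KoNLPy for Korean
# morphological analysis, gensim Word2Vec with sg=1 (Skip-Gram).
from collections import Counter
from konlpy.tag import Okt
from gensim.models import Word2Vec

raw_texts = ["코로나 백신 접종이 시작되었다", "백신 접종 예약이 급증했다"]  # placeholder crawl output

okt = Okt()
tokenized = [okt.nouns(text) for text in raw_texts]  # keep nouns only

# Issue keywords: the most frequent nouns across the crawled texts.
freq = Counter(word for sent in tokenized for word in sent)
issue_keywords = [w for w, _ in freq.most_common(5)]

# Related words via Skip-Gram embeddings (tiny corpus, so min_count=1).
model = Word2Vec(sentences=tokenized, vector_size=100, window=2,
                 min_count=1, sg=1)
related = model.wv.most_similar(issue_keywords[0], topn=3)
print(issue_keywords, related)
```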

Multi-threaded Web Crawling Design using Queues (큐를 이용한 다중스레드 방식의 웹 크롤링 설계)

  • Kim, Hyo-Jong;Lee, Jun-Yun;Shin, Seung-Soo
    • Journal of Convergence for Information Technology / v.7 no.2 / pp.43-51 / 2017
  • Background/Objectives: The purpose of this study is to design and implement a multi-threaded web crawler using queues that, by utilizing multiple bots connected over a wide-area network, solves the time delays of single-process crawling, the cost increases of parallel processing, and the waste of manpower. Methods/Statistical analysis: This study designs and analyzes applications that run on independent systems, based on a multi-threaded configuration using queues. Findings: We propose a multi-threaded web crawler design using queues. The throughput of web documents can be analyzed per client and per thread according to the given formula, and the efficiency and the optimal number of clients can be confirmed by checking the efficiency of each thread. The proposed system is based on distributed processing: clients in independent environments deliver web documents quickly and reliably using queues and threads. Application/Improvements: Rather than a web crawler designed for a particular site, a general-purpose web crawler that applies queues and multiple threads is needed to navigate and collect various web sites quickly and efficiently.
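
A minimal sketch of the queue-plus-threads design the abstract outlines, using Python's standard queue and threading modules; the URLs, thread count, and sentinel-based shutdown are assumptions, not the paper's implementation:

```python
# Multi-threaded crawler skeleton: a shared queue feeds worker threads.
import queue
import threading
import urllib.request

url_queue = queue.Queue()
results = []
results_lock = threading.Lock()

def worker():
    while True:
        url = url_queue.get()
        if url is None:            # sentinel: no more work
            url_queue.task_done()
            break
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                body = resp.read()
            with results_lock:
                results.append((url, len(body)))
        except Exception as exc:
            print(f"failed: {url}: {exc}")
        finally:
            url_queue.task_done()

NUM_THREADS = 4
threads = [threading.Thread(target=worker) for _ in range(NUM_THREADS)]
for t in threads:
    t.start()

for url in ["https://example.com", "https://example.org"]:
    url_queue.put(url)
for _ in threads:                  # one sentinel per worker
    url_queue.put(None)

url_queue.join()
for t in threads:
    t.join()
print(results)
```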

Spring Boot-based Web Application Development for providing information on Security Vulnerabilities and Patches for Open Source Software (Spring Boot 기반의 오픈소스 소프트웨어 보안 취약점 및 패치 정보 제공 웹 어플리케이션 개발)

  • Sim, Wan;Choi, WoongChul
    • Journal of Korea Society of Digital Industry and Information Management / v.17 no.4 / pp.77-83 / 2021
  • As Open Source Software (OSS) has recently gained momentum, many companies actively use OSS in their business software. Against this backdrop, our web application was developed to support the safe use of OSS: it continuously updates information on new vulnerabilities and patches by crawling the web pages of the relevant OSS home pages and of the organizations that manage the vulnerabilities. By providing this updated information, our application helps OSS users and developers stay aware of security issues and lets them work in an environment safer from security risks. In addition, it can serve as a security platform that helps prevent potential security incidents, not only for companies but also for individual developers.
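
The application itself is built on Spring Boot; purely as a language-neutral illustration of the crawling step, the sketch below pulls recent CVE entries from the public NVD 2.0 REST API. The keyword and the result handling are assumptions, not the paper's data sources:

```python
# Illustrative fetch of vulnerability data from the public NVD 2.0 API;
# the keyword is a placeholder OSS name, not the paper's target list.
import json
import urllib.parse
import urllib.request

keyword = "log4j"  # hypothetical OSS of interest
url = ("https://services.nvd.nist.gov/rest/json/cves/2.0?"
       + urllib.parse.urlencode({"keywordSearch": keyword,
                                 "resultsPerPage": 5}))

with urllib.request.urlopen(url, timeout=10) as resp:
    data = json.load(resp)

for item in data.get("vulnerabilities", []):
    cve = item["cve"]
    summary = next((d["value"] for d in cve["descriptions"]
                    if d["lang"] == "en"), "")
    print(cve["id"], "-", summary[:80])
```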

Distributed Parallel Crawler Design and Implementation (분산형 병렬 크롤러 설계 및 구현)

  • Jang, Hyun Ho;Jeon, Kyung-Sik;Lee, HooKi
    • Convergence Security Journal / v.19 no.3 / pp.21-28 / 2019
  • As the number of websites managed by organizations grows, so does the number of web application servers and containers. When checking the status of the web services running on those servers and containers, it is very difficult for a person to do so by logging into each physical server at a remote site through a terminal or other access software. Prior crawler research also rarely addresses how the crawled data is processed: data loss occurs when the crawler accesses the database and stores the data. In this paper, we propose a method to store inspection data for crawler-based web application server management without losing data.
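
The abstract does not give the storage design; one common way to avoid the write-time data loss it mentions is to funnel all results through a single writer thread that owns the database connection. The SQLite schema and fields below are assumptions:

```python
# Single-writer pattern: crawler threads only enqueue results; one thread
# owns the database connection, so concurrent writes cannot collide.
import queue
import sqlite3
import threading

result_queue = queue.Queue()

def writer():
    conn = sqlite3.connect("inspection.db")
    conn.execute("CREATE TABLE IF NOT EXISTS checks (url TEXT, status INTEGER)")
    while True:
        item = result_queue.get()
        if item is None:            # sentinel: shut down
            break
        conn.execute("INSERT INTO checks VALUES (?, ?)", item)
        conn.commit()               # commit per item so nothing is lost on crash
    conn.close()

writer_thread = threading.Thread(target=writer)
writer_thread.start()

# Crawler threads would call result_queue.put((url, status)); simulated here.
result_queue.put(("https://example.com/health", 200))
result_queue.put(("https://example.org/health", 503))
result_queue.put(None)
writer_thread.join()
```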

Development of a method for urban flooding detection using unstructured data and deep learning (비정형 데이터와 딥러닝을 활용한 내수침수 탐지기술 개발)

  • Lee, Haneul;Kim, Hung Soo;Kim, Soojun;Kim, Donghyun;Kim, Jongsung
    • Journal of Korea Water Resources Association / v.54 no.12 / pp.1233-1242 / 2021
  • In this study, a model was developed to determine whether flooding has occurred from image data, a type of unstructured data. CNN-based VGG16 and VGG19 were used to develop the flood classification model. To build the model, images of flooded and non-flooded scenes were collected by web crawling. Since data collected this way contains noise, images irrelevant to this study were first deleted, and the remaining images were then resized to 224×224 for model input. In addition, image augmentation was performed by varying the image angle to diversify the data. Finally, training was performed on 2,500 flooding images and 2,500 non-flooding images. In the model evaluation, the average classification performance was 97%. If the model developed in this study is deployed in a CCTV control center system, it is expected that the response to flood damage can be made quickly.
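
A minimal transfer-learning sketch in the spirit of the abstract, using Keras' pretrained VGG16 at the stated 224×224 input size plus an angle-augmentation layer; the directory layout, layer sizes, batch size, and epoch count are assumptions, as the paper's exact training setup is not given:

```python
# Flood / non-flood binary classifier on top of pretrained VGG16,
# matching the 224x224 input size mentioned in the abstract.
import tensorflow as tf
from tensorflow.keras.applications import VGG16

base = VGG16(weights="imagenet", include_top=False,
             input_shape=(224, 224, 3))
base.trainable = False                      # keep pretrained features frozen

model = tf.keras.Sequential([
    tf.keras.layers.RandomRotation(0.05),   # angle augmentation, per abstract
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # flood vs. non-flood
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])

# Assumed layout: data/flood/*.jpg and data/non_flood/*.jpg
train_ds = tf.keras.utils.image_dataset_from_directory(
    "data", image_size=(224, 224), batch_size=32)
model.fit(train_ds, epochs=5)
```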

Study of Travel Demand and Air Route Strategy : Web Crawling-based Analysis Technology (여행 수용 파악 및 항공 노선 전략 연구 : 웹 크롤링 기반 분석 기법)

  • Cho, Chang-Hyeon;Yu, Heonchang
    • Proceedings of the Korea Information Processing Society Conference / 2020.05a / pp.378-381 / 2020
  • Air and travel products are more vulnerable to uncertainty than those of other industries, and because of their absolute dependence on time, their value converges to zero when demand cannot be accurately identified and forecast. This paper therefore proposes a web crawling-based technique for identifying latent travel demand and for predicting and analyzing the air routes and destinations expected to grow in the future.

Graph Structure and Evolution of the Korea web (한국 웹 그래프와 진화에 대한 연구)

  • Han, In-Kyu;Lee, Sang-Ho
    • The KIPS Transactions: Part D / v.14D no.3 s.113 / pp.293-302 / 2007
  • The study of the web graph yields valuable insight into web algorithms for crawling, searching, and community discovery, and into the sociological phenomena that characterize its evolution; it is also useful for understanding how the web graph evolves and for predicting the scale of the Web. In this paper, we report experimental results on properties of the Korea web graph, with over 116 million pages and 2.7 billion links. We study properties of the Korea web such as the power-law phenomenon, and then analyze the similarities and differences between the global and Korea web graphs. Our analysis reveals that the Korea web graph has different properties from the global web graph, from its structure to its evolution. Finally, a number of measurements of the evolution of the Korea web graph are presented.
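
As an illustration of the power-law check the abstract refers to, the in-degree distribution of a link graph can be tabulated and inspected on a log-log scale, where a power law appears as a roughly straight line; the edge list here is a toy placeholder, the paper's graph having billions of links:

```python
# In-degree distribution of a web graph; a power law shows up as an
# approximately linear relation between log(degree) and log(count).
import math
from collections import Counter

edges = [("a", "b"), ("c", "b"), ("d", "b"),
         ("a", "c"), ("d", "c"), ("b", "a")]   # (source, destination) links

in_degree = Counter(dst for _, dst in edges)
degree_counts = Counter(in_degree.values())    # degree -> number of pages

for degree, count in sorted(degree_counts.items()):
    print(f"in-degree {degree}: {count} pages "
          f"(log-log point: {math.log(degree):.2f}, {math.log(count):.2f})")
```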

Security Check Scheduling for Detecting Malicious Web Sites (악성사이트 검출을 위한 안전진단 스케줄링)

  • Choi, Jae Yeong;Kim, Sung Ki;Min, Byoung Joon
    • KIPS Transactions on Computer and Communication Systems / v.2 no.9 / pp.405-412 / 2013
  • The current web has evolved into a mashup format in line with changes in implementation and usage patterns. Web services and user experiences have improved; however, security threats have also increased as unverified web contents are combined. To mitigate the threats incurred as an adverse effect of this development, security checks on the combined web contents are needed. In this paper, we propose a scheduling method for detecting malicious web pages, not only inside a web site but also outside it through extended links, for the secure operation of the site. The scheduling method considers several aspects of each page, including connection popularity, suspiciousness, and elapsed time since the last check, to decide the order of security checks over the numerous web pages connected by links. We verified the effectiveness of security checks that follow the scheduling method using the priority given to each page.
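
The abstract names the scheduling inputs (connection popularity, suspiciousness, and elapsed time since the last check) but not the scoring formula; the sketch below uses a heap-based priority queue with an invented linear score to show the idea:

```python
# Priority scheduling of security checks: pages with high popularity,
# high suspiciousness, or a long time since the last check come first.
# The weights and the score function are illustrative, not the paper's.
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Page:
    priority: float
    url: str = field(compare=False)

def score(popularity: float, suspiciousness: float,
          hours_since_check: float) -> float:
    return 0.4 * popularity + 0.4 * suspiciousness + 0.2 * hours_since_check

pages = [
    Page(-score(0.9, 0.2, 5.0), "https://example.com/a"),
    Page(-score(0.1, 0.8, 1.0), "https://example.com/b"),
    Page(-score(0.5, 0.5, 24.0), "https://example.com/c"),
]
heapq.heapify(pages)               # negated score because heapq is a min-heap

while pages:
    page = heapq.heappop(pages)
    print(f"check {page.url} (priority {-page.priority:.2f})")
```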