• Title/Summary/Keyword: web crawler

Search results: 102

RSS Channel Recommendation System using Focused Crawler (주제 중심 수집기를 이용한 RSS 채널 추천 시스템)

  • Lee, Young-Seok;Cho, Jung-Woo;Kim, Jun-Il;Choi, Byung-Uk
    • Journal of the Institute of Electronics Engineers of Korea CI
    • /
    • v.43 no.6 s.312
    • /
    • pp.52-59
    • /
    • 2006
  • Recently, the internet has grown tremendously, with a wealth of information driven by increasingly specialized personal interests and the popularization of the personal cyberspace known as the blog. Many of today's blogs provide internet users with RSS, also known as syndication technology. It enables blog users to receive updates automatically by registering an RSS channel address with an RSS aggregator; in other words, it saves internet users from wasting time repeatedly revisiting web sites to check for updates. This paper proposes ways to manage an RSS channel searching crawler and the collected RSS channels so that internet users can find a specific RSS channel they want without obstacles. It also proposes RSS channel ranking based on user popularity. We therefore focus on the idea of indexing information and web updates so that users receive information appropriate to their profiles.
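The aggregator mechanism the abstract describes can be sketched in Python: an RSS 2.0 channel is parsed for its items, and channels are ordered by a user-popularity score. The sample XML, function names, and subscriber-count ranking below are illustrative assumptions, not the paper's implementation.

```python
import xml.etree.ElementTree as ET

# Hypothetical minimal RSS 2.0 channel document (contents are illustrative).
RSS_XML = """<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <item><title>Post A</title><link>http://example.com/a</link></item>
    <item><title>Post B</title><link>http://example.com/b</link></item>
  </channel>
</rss>"""

def parse_channel(xml_text):
    """Extract the channel title and its item (title, link) pairs."""
    root = ET.fromstring(xml_text)
    channel = root.find("channel")
    items = [(i.findtext("title"), i.findtext("link"))
             for i in channel.findall("item")]
    return channel.findtext("title"), items

def rank_by_popularity(channels, subscriber_counts):
    """Order channels by a user-popularity score, echoing the paper's ranking idea."""
    return sorted(channels, key=lambda c: subscriber_counts.get(c, 0),
                  reverse=True)

title, items = parse_channel(RSS_XML)
```

An aggregator built this way would re-fetch each registered channel periodically and surface only items newer than the last poll.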

A Study on Design and Development of Web Information Collection System Based Compare and Merge Method (웹 페이지 비교통합 기반의 정보 수집 시스템 설계 및 개발에 대한 연구)

  • Jang, Jin-Wook
    • Journal of Information Technology Services
    • /
    • v.13 no.1
    • /
    • pp.147-159
    • /
    • 2014
  • Recently, the quantity of information accessible from the Internet has increased dramatically, and searching the Web for useful information has therefore become increasingly difficult. Thus, much research has been done on web robots that filter internet information based on user interest. When a web site that a user wants to visit is found, its content is searched by following the search list or the site's links in order. This search takes longer as the number of pages or sites grows, so its performance needs to be improved. To minimize unnecessary searches by web robots, this paper proposes an efficient information collection system based on a compare-and-merge method. In the proposed system, a web robot initially collects information from web sites that users register. On subsequent visits, the web robot compares what it collected with what the web sites currently contain; if they differ, it updates its copy. Only updated web page information is classified by subject and provided to users, so users can access updated information quickly.
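A minimal sketch of the compare-and-merge idea, assuming a content hash as the comparison key (the abstract does not specify how pages are compared, so the hashing choice and class names here are assumptions):

```python
import hashlib

def fingerprint(html):
    """Cheap comparison key for a page's content."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()

class CompareMergeStore:
    """Skip re-processing pages whose content is unchanged since the last visit."""
    def __init__(self):
        self.pages = {}  # url -> (hash, html)

    def visit(self, url, html):
        h = fingerprint(html)
        old = self.pages.get(url)
        if old and old[0] == h:
            return False             # unchanged: nothing to classify or deliver
        self.pages[url] = (h, html)  # merge in the updated content
        return True                  # changed: classify by subject, notify users
```

Only visits that return `True` would proceed to the classification stage, which is where the time saving over a naive re-crawl comes from.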

Advanced Manufacturing Technologies on the World Wide Web: Methodologies and Application Techniques (World Wide Web 상의 첨단 생산 기술: 방법론과 응용기술)

  • Kim, Seong-Jip;Kim, Nak-Hyun;Yang, Tae-Kon
    • IE interfaces
    • /
    • v.9 no.3
    • /
    • pp.306-316
    • /
    • 1996
  • The ease of use of the WWW and Web browsers on the Internet makes the whole world our stage. But when we search for the information and resources we want, the results supplied by search engines (e.g., Yahoo, Lycos, WebCrawler, Alta Vista) are inadequate for acquiring the necessary, related information for research. This paper surveys AMT (Advanced Manufacturing Technology), a recent research topic on the WWW (World Wide Web), and provides search methods and sources for academic research, technical reports, proceedings, software, etc. It also briefly surveys the WWW-VL (Virtual Library) and reviews three major technologies, CALS (Commerce At Light Speed), AMS (Agile Manufacturing System), and CE (Concurrent Engineering), which are recent focuses of industrial engineering research.


Implementation of a Web Robot and Statistics on the Korean Web (웹 로봇 구현 및 한국 웹 통계보고)

  • Kim, Sung-Jin;Lee, Sang-Ho
    • The KIPS Transactions:PartC
    • /
    • v.10C no.4
    • /
    • pp.509-518
    • /
    • 2003
  • A web robot is a program that downloads and stores web pages. Implementation issues in developing web robots have been studied widely, and various web statistics are reported in the literature. First, this paper describes the overall architecture of our robot and our implementation decisions on several important issues. Second, we present empirical statistics on approximately 74 million Korean web pages. Third, we monitored 1,424 Korean web sites to observe changes in their web pages. We identify which factors of web pages could affect these changes; these factors may be used to select web pages for incremental updating.
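The download-and-store loop of such a robot is essentially a breadth-first traversal with a frontier queue and a visited set. The sketch below uses a toy in-memory link graph as a stand-in for real HTTP fetching and link extraction; the URLs and the `LINKS` table are illustrative only.

```python
from collections import deque

# Toy link graph standing in for fetched pages (URLs are illustrative).
LINKS = {
    "http://a.kr/":  ["http://a.kr/1", "http://b.kr/"],
    "http://a.kr/1": ["http://a.kr/"],
    "http://b.kr/":  ["http://b.kr/x"],
    "http://b.kr/x": [],
}

def crawl(seed, max_pages=100):
    """Breadth-first robot: frontier queue plus a visited set for deduplication."""
    frontier, seen, order = deque([seed]), {seed}, []
    while frontier and len(order) < max_pages:
        url = frontier.popleft()
        order.append(url)  # a real robot would download and store the page here
        for link in LINKS.get(url, []):
            if link not in seen:   # never enqueue the same URL twice
                seen.add(link)
                frontier.append(link)
    return order
```

The `max_pages` bound and the visited set are the two controls that keep such a robot from re-downloading pages or running without limit.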

A Design of Web History Archive System Using RCS in Large Scale Web (대용량 웹에서 RCS를 이용한 웹 히스토리 저장 시스템 설계)

  • 이무훈;이민희;조성훈;장창복;김동혁;최의인
    • Proceedings of the Korean Information Science Society Conference
    • /
    • 2004.04b
    • /
    • pp.211-213
    • /
    • 2004
  • With the rapid growth of the web, web information is widely used without temporal or spatial constraints. However, if information that was once useful is deleted at some point, it can no longer be accessed. To solve this problem, research on web archive systems and techniques for storing deleted web information more efficiently have been proposed. Existing techniques, however, focus simply on storing web information and do not consider storage-space efficiency or its constraints at all. Therefore, this paper designs a web history archive system, based on WebBase, that can efficiently store and retrieve web information as it is updated in the repository. The proposed technique reduces the overhead of web information collection by using WebBase instead of a separate crawler, and stores deleted web information systematically and efficiently through RCS, so that important web information can be shared.
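The space saving of RCS-style storage comes from keeping one full text per page and recording revisions as deltas. The sketch below approximates this with `difflib`; note that real RCS stores reverse deltas (newest revision whole, older ones reconstructed backwards), and the class and field names here are assumptions for illustration.

```python
import difflib

class HistoryArchive:
    """Keep each page's newest text whole and record changes as line diffs.

    A rough stand-in for RCS-style delta storage, not the paper's system.
    """
    def __init__(self):
        self.latest = {}   # url -> current list of lines
        self.deltas = {}   # url -> list of unified diffs between revisions

    def store(self, url, lines):
        prev = self.latest.get(url)
        if prev is not None and prev != lines:
            # Record only the change, not another full copy of the page.
            delta = list(difflib.unified_diff(prev, lines, lineterm=""))
            self.deltas.setdefault(url, []).append(delta)
        self.latest[url] = lines
```

For pages that change by a few lines between crawls, the stored deltas are far smaller than repeated full snapshots, which is the efficiency argument the paper makes.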


The Development of Automatic Collection Method to Collect Information Resources for Web Archiving: With Focus on Disaster Safety Information (웹 아카이빙을 위한 정보자원의 자동수집방법 개발 - 재난안전정보를 중심으로 -)

  • Lee, Su Jin;Han, Hui Lyeong;Sim, Min Jeong;Won, Dong Hyun;Kim, Yong
    • Journal of Korean Society of Archives and Records Management
    • /
    • v.17 no.4
    • /
    • pp.1-26
    • /
    • 2017
  • This study aims to provide an efficient method for sharing and utilizing disaster information scattered across institutions, and to develop an automated collection algorithm that uses a web crawler for disaster information held in deep web accounts. To achieve these goals, this study analyzes the logical structure of the deep web and develops algorithms to collect the information. With the proposed automatic algorithm, it is expected that sharing and utilizing disaster safety information will aid disaster management.
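Deep-web pages are typically reachable only through a search form, not through hyperlinks, so a crawler must iterate the form's own parameters. The sketch below shows that pattern with a paging parameter; the fake result table and function names are assumptions standing in for real HTTP form submissions (which would use an HTTP client against the site's search endpoint).

```python
# Fake paged search results standing in for a deep-web endpoint (illustrative).
FAKE_DB = {1: ["flood report", "quake report"], 2: ["fire report"], 3: []}

def fetch_results(keyword, page):
    """Stand-in for submitting the site's search form with a page parameter."""
    return FAKE_DB.get(page, [])

def deep_crawl(keyword):
    """Iterate the form's page parameter until an empty result page appears."""
    page, collected = 1, []
    while True:
        batch = fetch_results(keyword, page)
        if not batch:
            break          # past the last page of results
        collected.extend(batch)
        page += 1
    return collected
```

Analyzing the site's "logical structure", as the study describes, amounts to discovering which form fields and paging parameters drive `fetch_results` for each target institution.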

A Study on Political Attitude Estimation of Korean OSN Users (온라인 소셜네트워크를 통한 한국인의 정치성향 예측 기법의 연구)

  • Wijaya, Muhammad Eka;Ahn, Heejune
    • Journal of Korea Society of Industrial Information Systems
    • /
    • v.21 no.4
    • /
    • pp.1-11
    • /
    • 2016
  • Recently, numerous studies have been conducted to estimate human personality from online social activities. This paper develops a comprehensive model for political attitude estimation that leverages users' Facebook Like information. We designed a Facebook crawler that efficiently collects data, overcoming the difficulties of crawling Ajax-enabled Facebook pages. We show that category-level selection can reduce data-analysis complexity by exploiting the sparsity of the huge like-attitude matrix. For Korean Facebook users, only 28 criteria (3% of the total) can estimate a user's political polarity with high accuracy (AUC of 0.82).
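The AUC of 0.82 reported above is the probability that a randomly chosen positive example scores above a randomly chosen negative one. As a reference point (not the paper's evaluation code), the metric can be computed directly from scores and labels:

```python
def auc(scores, labels):
    """AUC as the fraction of positive/negative pairs ranked correctly.

    Ties count as half a win. O(P*N) pairwise form, fine for small samples.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

An AUC of 0.5 is chance level, so 0.82 from only 28 of roughly a thousand Like categories indicates those categories carry most of the political signal.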

Modern Concurrent Programming for Multicore Environment (동시성으로 작성하는 파이썬 크롤러)

  • Kim, Nam-gue;Kang, Young-Jin;Lee, HoonJae
    • Proceedings of the Korean Institute of Information and Communication Sciences Conference
    • /
    • 2017.05a
    • /
    • pp.430-433
    • /
    • 2017
  • Programming that ensures concurrency is essential for developers; without it, program speed is unlikely to improve unless the hardware itself advances. Programming languages with good support for concurrent code include Go, Elixir, and Scala. Python, which supports a number of useful libraries, also supports concurrent programming through asyncio and coroutines. This paper defines the concepts of concurrency and parallelism and explains what to watch for when writing concurrent programs in Python. A crawler that collects web data is written as concurrent code and compared with programs written as sequential and multithreaded code.
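The speed-up such a concurrent crawler gets comes from overlapping I/O waits, not from extra CPU cores. A minimal asyncio sketch, with `asyncio.sleep` standing in for network latency so it runs without a network (the URLs are illustrative):

```python
import asyncio
import time

async def fetch(url):
    """Simulated download: asyncio.sleep stands in for network latency."""
    await asyncio.sleep(0.1)
    return url, "<html>...</html>"

async def crawl(urls):
    # gather schedules every fetch coroutine concurrently on one thread
    return await asyncio.gather(*(fetch(u) for u in urls))

urls = [f"http://example.com/{i}" for i in range(10)]
start = time.perf_counter()
pages = asyncio.run(crawl(urls))
elapsed = time.perf_counter() - start  # near 0.1 s; sequential would take ~1 s
```

Ten simulated downloads complete in roughly the time of one, because all the coroutines wait on their "network" delay at the same time; a sequential loop would pay the delay ten times over.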


Abusive Detection Using Bidirectional Long Short-Term Memory Networks (양방향 장단기 메모리 신경망을 이용한 욕설 검출)

  • Na, In-Seop;Lee, Sin-Woo;Lee, Jae-Hak;Koh, Jin-Gwang
    • The Journal of Bigdata
    • /
    • v.4 no.2
    • /
    • pp.35-45
    • /
    • 2019
  • Recently, the damage and social cost of malicious comments have been increasing, including news of celebrities committing suicide under the influence of malicious comments. The damage from malicious comments containing abusive language and slang is growing and spreading throughout society in various types and forms. In this paper, we propose a technique for detecting abusive language using a bidirectional long short-term memory (LSTM) neural network model. We collected comments from the web with a web crawler and removed stopwords such as English letters and special characters. For the preprocessed comments, a bidirectional LSTM model, which considers the words both before and after each position in a sentence, was used to detect abusive language. To train the network, the collected comments were morphologically analyzed and vectorized, and each word was labeled as abusive or not. Experimental results showed a performance of 88.79% on a total of 9,288 collected and screened comments.
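The preprocessing steps the abstract lists (dropping English letters and special characters, then mapping words to integer ids for the network) can be sketched in plain Python. The regular expression and vocabulary handling below are assumptions about those steps, not the paper's actual pipeline, and the BiLSTM model itself is omitted.

```python
import re

# Drop English letters and anything that is not a digit, Hangul, or whitespace.
STOPWORD_PATTERN = re.compile(r"[A-Za-z]+|[^0-9가-힣\s]")

def preprocess(comment):
    """Remove unused characters, then collapse the leftover whitespace."""
    cleaned = STOPWORD_PATTERN.sub(" ", comment)
    return re.sub(r"\s+", " ", cleaned).strip()

def vectorize(tokens, vocab):
    """Map words to integer ids; unseen words get 0 (out-of-vocabulary)."""
    return [vocab.get(t, 0) for t in tokens]
```

The resulting id sequences, padded to a fixed length and paired with per-word abusive/non-abusive labels, are what a bidirectional LSTM layer would consume.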


Performance Analysis of Web-Crawler in Multi-thread Environment (다중 쓰레드 환경에서 웹 크롤러의 성능 분석)

  • Park, Jung-Woo;Kim, Jun-Ho;Lee, Won-Joo;Jeon, Chang-Ho
    • Proceedings of the Korean Society of Computer Information Conference
    • /
    • 2009.01a
    • /
    • pp.473-476
    • /
    • 2009
  • This paper implements a web crawler that operates in a multithreaded environment and analyzes its performance. To shorten search time, the crawler's distinguishing feature is that its crawling, parsing/PageRank, and DB-storage modules are implemented to perform their work independently of one another. The crawling module collects data from the web. The parsing and PageRank module parses the collected data and computes a numeric PageRank expressing each web page's relative importance. The DB module stores the PageRank obtained by the PageRank module in the database. In the performance evaluation, we measure search time in the multithreaded environment as the number of threads and the number of web pages vary, and compare the results.
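The module decoupling described above is a classic producer-consumer pipeline: each stage runs in its own thread and hands work to the next through a queue, so crawling, parsing, and DB writes overlap. A minimal sketch (stage wiring and `None` sentinels are illustrative assumptions, not the paper's design):

```python
import queue
import threading

def pipeline(urls, fetch, parse, store):
    """Run crawl, parse/rank, and DB-store stages decoupled by queues."""
    q1, q2 = queue.Queue(), queue.Queue()
    results = []

    def crawler():                     # stage 1: collect pages from the web
        for u in urls:
            q1.put(fetch(u))
        q1.put(None)                   # sentinel: no more pages

    def parser():                      # stage 2: parse and rank pages
        while (page := q1.get()) is not None:
            q2.put(parse(page))
        q2.put(None)

    def db_writer():                   # stage 3: persist the ranked records
        while (rec := q2.get()) is not None:
            results.append(store(rec))

    threads = [threading.Thread(target=t) for t in (crawler, parser, db_writer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Because `queue.Queue` is thread-safe, each stage can also be replicated across several worker threads, which is the axis the paper's evaluation varies.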
