• Title/Summary/Keyword: web crawler

System Design for Collecting Real-Time Product Information Using RSS (RSS를 이용한 실시간 상품정보 수집시스템의 설계)

  • Chuluun, Munkhzaya;Ko, Sun-Woo
    • Journal of Korean Society of Industrial and Systems Engineering / v.35 no.1 / pp.1-9 / 2012
  • It is well known that internet shoppers are very sensitive to sale prices. They visit various shopping malls and collect product information, including purchase conditions, to support their purchase decisions. The need for information support has recently grown because both the amount of required information and the complexity of the purchase decision-making process have increased. Comparison shopping agent systems provide price comparison information collected from various shopping malls to satisfy shoppers' demand for information. However, frequent price changes caused by keen price competition have become the primary cause of declining information quality on price comparison sites. RSS, a family of web feed formats used to publish frequently updated content, is now used by online shopping malls as well. This paper develops an RSS-based product information collection system to obtain real-time product information. The proposed system consists of (1) a web crawler module that automatically searches for shopping malls offering RSS feeds, (2) an RSS reader module that parses product information from RSS feed files, (3) a product DB, and (4) a product search module. Measured as the volume of product information collected per unit time, the proposed system outperforms comparison shopping agent systems.
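
As a rough illustration of what an RSS reader module of the kind described above might do, the following sketch parses product items from an RSS 2.0 feed using only the Python standard library. The feed content, the field names in the returned dictionaries, and the function name `parse_products` are assumptions made for this example; the abstract does not describe the paper's actual implementation.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 feed from a shopping mall. The items and URLs are
# invented for illustration and do not come from the paper.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example Mall</title>
    <item>
      <title>USB cable 1m</title>
      <link>http://mall.example/p/1</link>
      <description>price: 3500</description>
      <pubDate>Mon, 02 Jan 2012 09:00:00 +0900</pubDate>
    </item>
    <item>
      <title>Wireless mouse</title>
      <link>http://mall.example/p/2</link>
      <description>price: 12000</description>
      <pubDate>Mon, 02 Jan 2012 09:05:00 +0900</pubDate>
    </item>
  </channel>
</rss>"""


def parse_products(feed_xml):
    """Return one dict per <item> element found in an RSS 2.0 document."""
    root = ET.fromstring(feed_xml)
    products = []
    for item in root.iter("item"):
        products.append({
            # findtext returns the default when the child element is missing
            "title": item.findtext("title", default="").strip(),
            "link": item.findtext("link", default="").strip(),
            "description": item.findtext("description", default="").strip(),
            "published": item.findtext("pubDate", default="").strip(),
        })
    return products


products = parse_products(SAMPLE_FEED)
for product in products:
    print(product["title"], product["link"])
```

In a real collector the feed text would be fetched over HTTP and the parsed records written to the product DB; both steps are left out here to keep the sketch self-contained.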

Semantic Network Analysis about Comments on Internet Articles about Nurse Workplace Bullying (간호사 괴롭힘 관련 인터넷 포털 기사에 대한 댓글의 의미연결망 분석)

  • Kim, Chang Hee;Moon, Seong Mi
    • Journal of Korean Clinical Nursing Research / v.25 no.3 / pp.209-220 / 2019
  • Purpose: A significant amount of public opinion about nurse bullying is expressed on the internet. The purpose of this study was to analyze the linkage structures among words extracted from comments on internet articles related to nurse workplace bullying using semantic network analysis. Methods: From February 2018 to April 2019, comments made on news articles posted to the Daum and Naver web portals containing keywords such as "nurse", "Taeum", and "bullying" were collected using a web crawler written in Python. A morphological analysis performed with Open Korean Text in KoNLPy generated 54 major nodes. The frequencies, eigenvector centralities, and betweenness centralities of the 54 nodes were calculated, and semantic networks were visualized using the UCINET and NetDraw programs. Convergence of iterated correlations (CONCOR) analysis was performed to identify structural equivalence. Results: This paper presents results for March 2018 and January 2019 because these months had the highest number of articles. Of the 54 major nodes, "nurse", "hospital", "patient", and "physician" were the most frequent and had the highest eigenvector and betweenness centralities. The CONCOR analysis identified work environment, nurse, gender, and military clusters. Conclusion: This study structurally explored public opinion about nurse bullying through semantic network analysis. It is suggested that various studies on nursing phenomena be conducted using social network analysis.

A Study on the Hyperlink Network Analysis of Library Web Sites (도서관 웹사이트의 하이퍼링크 네트워크 분석)

  • Roh, Yoon-Ju;Kim, Seong-Hee
    • Journal of the Korean BIBLIA Society for library and Information Science / v.28 no.2 / pp.99-117 / 2017
  • The present study empirically analyzed the hyperlinks of 32 web sites in order to examine the hyperlink network structure of web sites for each type of domestic library. After collecting the hyperlink data using a crawler, we analyzed the overall characteristics of the web sites in the network based on the characteristics of each library. The results are as follows. 1) Among all analyzed libraries, Yonsei scored the highest in degree centrality, betweenness centrality, closeness centrality, and eigenvector centrality. 2) By library type, Sejong among national libraries, Seoul among public libraries, and Yonsei among college libraries appeared relatively influential. These results can serve as basic data for establishing an operation strategy that improves the efficiency and effectiveness of library web sites.
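
To make the idea of a hyperlink network concrete, the sketch below extracts links from a few pages, builds directed site-to-site edges, and computes normalized in-degree and out-degree centrality. The three sites and their HTML are hypothetical, and the abstract does not say which tools the authors used, so this is only one plausible way to compute one of the four centrality measures mentioned.

```python
from collections import defaultdict
from html.parser import HTMLParser
from urllib.parse import urlparse


class LinkExtractor(HTMLParser):
    """Collect the host part of every absolute <a href="..."> link."""

    def __init__(self):
        super().__init__()
        self.hosts = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                host = urlparse(value).netloc
                if host:  # relative links have no host and are skipped
                    self.hosts.append(host)


# Hypothetical crawled pages, keyed by site host (illustration only).
PAGES = {
    "lib-a.example": '<a href="http://lib-b.example/">B</a> <a href="http://lib-c.example/x">C</a>',
    "lib-b.example": '<a href="http://lib-a.example/">A</a>',
    "lib-c.example": '<a href="http://lib-a.example/">A</a> <a href="/local">self</a>',
}


def build_edges(pages):
    """Return the set of directed (source, target) links between known sites."""
    edges = set()
    for source, html in pages.items():
        parser = LinkExtractor()
        parser.feed(html)
        for target in parser.hosts:
            if target != source and target in pages:
                edges.add((source, target))
    return edges


def degree_centrality(pages, edges):
    """Normalized in/out degree: the degree divided by (number of sites - 1)."""
    in_deg, out_deg = defaultdict(int), defaultdict(int)
    for source, target in edges:
        out_deg[source] += 1
        in_deg[target] += 1
    denom = max(len(pages) - 1, 1)  # guard against a single-site network
    return {
        site: {"in": in_deg[site] / denom, "out": out_deg[site] / denom}
        for site in pages
    }


edges = build_edges(PAGES)
centrality = degree_centrality(PAGES, edges)
for site, scores in centrality.items():
    print(site, scores)
```

Betweenness, closeness, and eigenvector centrality need shortest-path or iterative computations and are usually taken from a network analysis package instead of being written by hand.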

Database metadata standardization processing model using web dictionary crawling (웹 사전 크롤링을 이용한 데이터베이스 메타데이터 표준화 처리 모델)

  • Jeong, Hana;Park, Koo-Rack;Chung, Young-suk
    • Journal of Digital Convergence / v.19 no.9 / pp.209-215 / 2021
  • Data quality management is an important issue these days, and data quality can be improved by providing consistent metadata. This study presents algorithms that facilitate standard word dictionary management for consistent metadata management. The algorithms automate synonym management for database metadata through web dictionary crawling. They also improve the accuracy of the data by resolving homonym distinction issues that may arise during the web dictionary crawling process. The proposed algorithm increases the reliability of metadata quality compared with existing manual management. It can also reduce the time spent on registering and managing synonym data. Further research on the new partially automated data standardization model is needed, based on a detailed understanding of which tasks in future data standardization activities can be automated.

Design and Analysis of Technical Management System of Personal Information Security using Web Crawler (웹 크롤러를 이용한 개인정보보호의 기술적 관리 체계 설계와 해석)

  • Park, In-pyo;Jeon, Sang-june;Kim, Jeong-ho
    • Journal of Platform Technology / v.6 no.4 / pp.69-77 / 2018
  • Awareness of personal information protection is insufficient in end-point areas such as personal computers, smart terminals, and personal storage devices, where files containing personal information are kept. In this study, we use the Diffie-Hellman method to securely retrieve personal information files collected by a web crawler. We designed SEED and ARIA encryption using hybrid slicing to protect personal information files against attack. The encryption performance for personal information files collected by web crawling is compared in terms of the encryption/decryption rate according to key generation and the encryption/decryption sharing according to user key level. The simulation was performed on personal information files in the process of transmission to an external agency. Compared with existing methods, the detection rate improved by a factor of 4.64 and the information protection rate improved by 18.3%.
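
The Diffie-Hellman method mentioned above lets two parties derive the same secret key over an open channel. The sketch below shows only the basic exchange with toy parameters; the modulus, base, function names, and the SHA-256 key derivation step are assumptions for illustration, and the abstract does not state which parameters or key derivation the authors used. SEED and ARIA are not available in the Python standard library, so the encryption step is omitted.

```python
import hashlib
import secrets

# Toy parameters, assumed for illustration only. 2**64 - 59 is prime but far
# too small for real use; deployed systems use standardized groups of
# 2048 bits or more.
P = 2**64 - 59
G = 5


def make_keypair():
    """Pick a random private exponent and compute the matching public value."""
    private = secrets.randbelow(P - 3) + 2  # uniform in [2, P - 2]
    public = pow(G, private, P)
    return private, public


def shared_key(own_private, peer_public):
    """Derive a 32-byte key from the shared Diffie-Hellman secret."""
    secret = pow(peer_public, own_private, P)
    return hashlib.sha256(secret.to_bytes(8, "big")).digest()


# The crawler side and the collecting server each generate a key pair and
# exchange only the public values.
crawler_private, crawler_public = make_keypair()
server_private, server_public = make_keypair()

crawler_key = shared_key(crawler_private, server_public)
server_key = shared_key(server_private, crawler_public)
print(crawler_key == server_key)
```

Both sides end up with the same key because (G^a)^b and (G^b)^a are equal modulo P; that key could then drive a block cipher such as SEED or ARIA for the file transfer.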

Sentence Filtering Dataset Construction Method about Web Corpus (웹 말뭉치에 대한 문장 필터링 데이터 셋 구축 방법)

  • Nam, Chung-Hyeon;Jang, Kyung-Sik
    • Journal of the Korea Institute of Information and Communication Engineering / v.25 no.11 / pp.1505-1511 / 2021
  • Pretrained models that perform well on various natural language processing tasks have the advantage of learning the linguistic patterns of sentences from a large corpus during training, allowing each token in an input sentence to be represented with an appropriate feature vector. One way to construct the corpus required for training a pretrained model is to collect it with a web crawler. However, because sentences on the web follow various patterns, some or all of a sentence may consist of unnecessary words. In this paper, we propose a method for constructing a dataset for filtering sentences that contain unnecessary words, using neural network models, for corpora collected from the web. As a result, we construct a dataset containing a total of 2,330 sentences. We also evaluated the performance of neural network models on the constructed dataset, and the BERT model showed the highest performance with an accuracy of 93.75%.

Design and Implementation for Local Newsletter Using Mobile Web crawler and GPS (모바일 웹 크롤링과 GPS를 이용한 지역 뉴스레이터 설계 및 구현)

  • Jaung, Dongyou;Kim, Yongtae;Park, Geunyong;Shin, Jaesik;Park, Eunju;Lim, Hankyu
    • Proceedings of the Korea Information Processing Society Conference / 2017.04a / pp.152-155 / 2017
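  • This paper designs and builds a system through which users with a strong interest in their local area can receive news in real time in the form of mobile web pages. Users receive, in real time, news aggregated for the area in which they are currently located, delivered as mobile web page objects. This work is expected to raise interest in local areas, promote regional development, and stimulate feedback on tourist facilities.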

Cross-cultural Study on Knowledge Sharing in Open Collaboration: Collectivism vs. Individualism (문화에 따른 개방형 협업 지식공유 활동 비교 연구: 집단주의 문화와 개인주의 문화를 중심으로)

  • Baek, Hyunmi;Lee, Saerom
    • Knowledge Management Research / v.19 no.2 / pp.133-150 / 2018
  • To cope with rapid changes in the corporate environment, the creation of innovative output through various forms of collaboration has been discussed. In open collaboration, contributors distributed across various countries and cultures are able to share knowledge via the internet without physical rewards or responsibilities. In this study, we focused on open source software projects, a representative form of open collaboration. We investigated the factors that affect the knowledge contribution of developers from various countries within an open collaboration platform. Specifically, we investigated the open collaborative nature of developers from different cultures by dividing cultures into collectivist and individualist. We collected data on 26,604 developers using a Python-based web crawler for GitHub, an open source software development platform, and conducted a cross-cultural study. This paper contributes to the field of knowledge management by showing how antecedents such as hireability and information exposure affect knowledge sharing differently according to culture.

Design and Implementation of a Globus-based Distributed Web Crawler Manager on Grid Environment (글로버스 기반 그리드 환경에서의 분산 웹 크롤러 매니저 설계 및 구현)

  • Kim, Hyuk-Ho;Lee, Seung-Ha;Park, Chan-Ho;Kim, Yang-Woo;Lee, Phil-Woo
    • Proceedings of the Korea Information Processing Society Conference / 2005.05a / pp.945-948 / 2005
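  • The grid information retrieval system recognizes the problems and limitations of conventional information retrieval systems and, by building the retrieval system on the distributed processing environment of the grid, provides a more efficient and flexibly scalable information retrieval service. In this paper, we configured a virtual organization (VO) for information retrieval using the Globus Toolkit, one of the grid middleware packages, to suit the grid system environment. As a preliminary step toward grid information retrieval, we designed and implemented, as a grid service, a crawler manager that manages the P2P-based distributed crawlers collecting various information from the web, so that it can be used together with the other services in the grid information retrieval system.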

Design and Implementation of Distributed Web Crawler Using Globus Environment (글로버스를 이용한 분산 웹 크롤러의 설계 및 구현)

  • 이지선;김양우;이필우
    • Proceedings of the Korean Information Science Society Conference / 2004.04a / pp.712-714 / 2004
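  • Most web search engines and many specialized search tools rely on web crawlers to collect large numbers of web pages as a preprocessing step for indexing and analyzing them. A typical web crawler collects web page information by interacting with millions of hosts over a period of weeks or months. In this paper, to improve the performance and execution efficiency of such crawlers, we propose a distributed crawler built with the Globus Toolkit, a grid middleware. Execution of the crawler is divided into three stages: connecting the host servers that share the work via Globus, authenticating them, and assigning tasks; running the crawler program to collect data; and finally returning the collected web page information to the system that issued the original command. As a result, the collection work could be distributed further, the performance of a high-cost, high-specification server could be obtained from several low-cost systems, and an easily extensible, robust crawler program and system environment could be built.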
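
The three stages described above (assign work, collect, return results) can be sketched on a single machine, with a thread pool standing in for the Globus-connected hosts. Everything here is an assumption for illustration: the function names, the host-based partitioning rule, and the placeholder `fake_fetch` that replaces real HTTP requests. Globus authentication and remote job submission are not modeled.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlparse


def assign(urls, n_workers):
    """Stage 1: partition URLs so that all URLs of one host go to one worker."""
    buckets = [[] for _ in range(n_workers)]
    host_to_worker = {}
    for url in urls:
        host = urlparse(url).netloc
        if host not in host_to_worker:
            # Hosts are dealt out round-robin in first-seen order.
            host_to_worker[host] = len(host_to_worker) % n_workers
        buckets[host_to_worker[host]].append(url)
    return buckets


def fake_fetch(url):
    """Placeholder for an HTTP request; returns dummy page text."""
    return f"<html>{url}</html>"


def crawl(bucket, fetch=fake_fetch):
    """Stage 2: one worker collects the pages for its share of the URLs."""
    return {url: fetch(url) for url in bucket}


def run(urls, n_workers=3):
    """Stage 3: gather every worker's results back at the coordinator."""
    buckets = assign(urls, n_workers)
    merged = {}
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        for result in pool.map(crawl, buckets):
            merged.update(result)
    return merged


URLS = [
    "http://a.example/1",
    "http://a.example/2",
    "http://b.example/1",
    "http://c.example/1",
    "http://d.example/1",
]
pages = run(URLS)
print(len(pages), "pages collected")
```

Keeping each host with a single worker is a common design choice because it makes per-host politeness limits easy to enforce locally; the abstract does not say how the original system divided the work.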
