• Title/Summary/Keyword: Web-Crawling


Intelligent Web Crawler for Supporting Big Data Analysis Services (빅데이터 분석 서비스 지원을 위한 지능형 웹 크롤러)

  • Seo, Dongmin;Jung, Hanmin
    • The Journal of the Korea Contents Association, v.13 no.12, pp.575-584, 2013
  • Big-data analysis draws on a wide variety of data types, such as news, blogs, SNS, papers, patents, and sensor data. In particular, the use of web documents that offer reliable data in real time is steadily increasing. Web crawlers that collect web documents automatically have grown in importance because big data is used in many different fields and web data grows exponentially every year. However, existing web crawlers cannot collect all the web documents in a web site, because they follow only the URLs contained in documents they have already collected. In addition, existing web crawlers may re-collect documents already collected by other crawlers, because information about collected documents is not managed efficiently across crawlers. This paper therefore proposes a distributed web crawler. To resolve these problems, the proposed crawler collects web documents through each web site's RSS feed and the Google search API. It provides fast crawling performance through a client-server model based on RMI and NIO that minimizes network traffic. Furthermore, it extracts the core content of a web document by comparing keyword similarity across the tags in the document. Finally, to verify the superiority of the proposed crawler, we compare it with existing web crawlers in various experiments.
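
The duplicate-collection problem described in this abstract can be illustrated with a small sketch. It is not the paper's implementation (which is based on RMI and NIO); it only assumes a hypothetical central registry, here called `SeenRegistry`, that every crawler client asks before fetching a URL. The class name, the normalization rules, and the sample URLs are all assumptions made for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Canonicalize a URL so trivially different spellings compare equal."""
    parts = urlsplit(url.strip())
    path = parts.path or "/"
    # Lowercase scheme and host, drop the fragment, keep path and query as-is.
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


class SeenRegistry:
    """Hypothetical server-side registry shared by all crawler clients."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    def claim(self, url: str) -> bool:
        """Return True only for the first caller that claims this URL."""
        key = normalize_url(url)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True


registry = SeenRegistry()
feed_links = [
    "http://Example.com/news/1",
    "http://example.com/news/1#comments",  # same document, different spelling
    "http://example.com/news/2",
]
# Only URLs that no crawler has claimed yet are scheduled for download.
to_fetch = [u for u in feed_links if registry.claim(u)]
```

In a distributed setting the registry would sit on the server side of the client-server model, so that each client checks a URL once instead of exchanging full document lists.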

Smart Synthetic Path Search System for Prevention of Hazardous Chemical Accidents and Analysis of Reaction Risk (반응 위험성분석 및 사고방지를 위한 스마트 합성경로 탐색시스템)

  • Jeong, Joonsoo;Kim, Chang Won;Kwak, Dongho;Shin, Dongil
    • Korean Chemical Engineering Research, v.57 no.6, pp.781-789, 2019
  • Chemical accidents occur frequently during laboratory experiments and during pilot plant and reactor operations. To prevent accidents, relevant information must be found and understood before synthesis experiments begin. In the process design stage, reaction information is also needed to prevent runaway reactions. Although various sources of synthesis information are available, including the Internet, searching them takes a long time, and choosing the right path is difficult because each synthesis method uses different substances. To solve these problems, we propose an intelligent synthetic path search system that helps researchers shorten the search time for synthetic paths and identify hazardous intermediates that may exist on those paths. The proposed system automatically updates its database by collecting information from the Internet through web scraping and crawling using Selenium, a Python package. The path search uses depth-first search starting from the target substance, distinguishes hazardous chemical grades, yields, and similar attributes, and suggests all synthetic paths within a defined limit on the number of path steps. Research institutions can register their private data and expand the database according to the format type. The system is being released as open source for free use. By letting researchers consult the suggested paths, the system is expected to help find safer routes and prevent accidents.
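
A depth-first path search with a step limit, as described in this abstract, might look like the following minimal sketch. The data model is an assumption: each substance maps to a list of single precursors, whereas real reactions usually combine several reactants. The substance names `A` to `D`, the hazard set, and the function names are hypothetical placeholders, not data from the paper.

```python
from typing import Dict, List, Set


def find_paths(
    target: str,
    precursors: Dict[str, List[str]],
    starting: Set[str],
    max_steps: int,
) -> List[List[str]]:
    """List every path from the target back to a starting material
    that uses at most max_steps reaction steps."""
    paths: List[List[str]] = []

    def dfs(node: str, path: List[str]) -> None:
        if node in starting:
            paths.append(list(path))  # reached an available starting material
            return
        if len(path) - 1 >= max_steps:
            return  # step limit reached, abandon this branch
        for prev in precursors.get(node, []):
            if prev in path:
                continue  # skip cycles
            path.append(prev)
            dfs(prev, path)
            path.pop()

    dfs(target, [target])
    return paths


def hazardous_intermediates(path: List[str], hazardous: Set[str]) -> List[str]:
    """Return the intermediates (excluding both endpoints) flagged as hazardous."""
    return [s for s in path[1:-1] if s in hazardous]


# Hypothetical data: D can be made from C or B, C from B, and B from A.
precursors = {"D": ["C", "B"], "C": ["B"], "B": ["A"]}
starting = {"A"}
hazardous = {"C"}

for path in find_paths("D", precursors, starting, max_steps=3):
    print(" <- ".join(path), "| hazardous:", hazardous_intermediates(path, hazardous))
```

Lowering `max_steps` to 2 drops the three-step route through `C`, which shows how the step limit prunes the search.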

Annotation Technique Development based on Apparel Attributes for Visual Apparel Search Technology (비주얼 의류 검색기술을 위한 의류 속성 기반 Annotation 기법 개발)

  • Lee, Eun-Kyung;Kim, Yang-Weon;Kim, Seon-Sook
    • Fashion & Textile Research Journal, v.17 no.5, pp.731-740, 2015
  • Mobile (smartphone) search engine marketing is increasingly important. Accordingly, visual apparel search technology that gives easier and faster access to visual information in the apparel field is urgently needed. This study helps establish a proper classification system for apparel search, following an analysis of the search techniques used by apparel search applications and by existing domestic and overseas apparel sites. An annotation technique is developed in accordance with visual attributes and apparel categories, based on data obtained by web crawling and apparel image collection. The categorical composition of apparel is divided into wearing, image, and style. The web evaluation site traces the correlations between the apparel category and apparel factors as dependent upon visual attributes. An appraisal team of 10 individuals evaluated 2,860 merchandise images. Data analysis consisted of correlations between apparel, sleeve length, and apparel category (based on an average analysis), and the correlation between fastener and apparel category (based on an average analysis). The study results can be regarded as a groundbreaking mobile apparel search system that can enhance consumer convenience, since it enables an effective search by type, price, distributor, and apparel image from a mobile photograph of a garment being worn.

System Design for Collecting Real-Time Product Information Using RSS (RSS를 이용한 실시간 상품정보 수집시스템의 설계)

  • Chuluun, Munkhzaya;Ko, Sun-Woo
    • Journal of Korean Society of Industrial and Systems Engineering, v.35 no.1, pp.1-9, 2012
  • It is well known that Internet shoppers are very sensitive to sale prices. They visit various shopping malls and collect product information, including purchase conditions, to support their purchase decisions. The need for information support has recently increased because the amount of information required has grown and the purchase decision-making process has become more complex. Comparison shopping agent systems provide price comparison information collected from various shopping malls to satisfy Internet shoppers' demand for information. However, frequent price changes caused by keen price competition are becoming the primary cause of declining information quality among price comparison sites. RSS, a family of web feed formats used to publish frequently updated content, is now used even by online shopping malls. This paper develops an RSS product information collection system to obtain real-time product information. The proposed system consists of (1) a web crawler module that automatically searches for shopping malls offering RSS feeds, (2) an RSS reader module that parses product information from RSS feed files, (3) a product DB, and (4) a product search module. When performance is defined as the volume of product information collected per unit time, the proposed system outperforms comparison shopping agent systems.
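
The RSS reader module (2) can be sketched with Python's standard library. This is a minimal illustration under stated assumptions, not the paper's implementation: the feed below is an invented RSS 2.0 sample, the mall name and URLs are placeholders, and only the standard `title`, `link`, and `pubDate` item elements are read, since the abstract does not specify how product fields are encoded.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# Invented RSS 2.0 feed standing in for a shopping mall's product feed.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example Mall</title>
    <item>
      <title>USB cable</title>
      <link>http://mall.example/item/1</link>
      <pubDate>Mon, 02 Jan 2012 09:00:00 +0900</pubDate>
    </item>
    <item>
      <title>Desk lamp</title>
      <link>http://mall.example/item/2</link>
      <pubDate>Mon, 02 Jan 2012 09:30:00 +0900</pubDate>
    </item>
  </channel>
</rss>"""


def parse_products(feed_xml: str) -> list:
    """Extract one record per <item> element of an RSS 2.0 feed."""
    root = ET.fromstring(feed_xml)
    products = []
    for item in root.iter("item"):
        pub_date = item.findtext("pubDate")
        products.append(
            {
                "title": item.findtext("title", default=""),
                "link": item.findtext("link", default=""),
                # RSS 2.0 dates use the RFC 822 format, which email.utils can parse.
                "published": parsedate_to_datetime(pub_date) if pub_date else None,
            }
        )
    return products


products = parse_products(SAMPLE_FEED)
for product in products:
    print(product["published"], product["title"], product["link"])
```

Because each item carries its own publication time, a collector can poll the feed and store only the items newer than the last one it saw, which is what makes feed-based collection suitable for frequently changing prices.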

A Study on Artificial Intelligence Education Design for Business Major Students

  • PARK, So-Hyun;SUH, Eung-Kyo
    • The Journal of Industrial Distribution & Business, v.12 no.8, pp.21-32, 2021
  • Purpose: With the advent of the 4th industrial revolution, described as a new technological revolution, the need to foster future talent equipped with AI utilization capabilities is emerging. However, research on AI education design and competency-based curricula for business majors is lacking. The purpose of this study is to design AI education that cultivates competency-oriented AI literacy for business majors in universities. Research design, data and methodology: To design basic AI education for business majors, three expert Delphi surveys were conducted, a demand analysis and specialization strategy were established, and the reliability of the derived design contents was verified by reflecting the results. Results: The main competencies for cultivating AI literacy were data literacy and AI understanding and utilization. The main detailed areas derived from these were data structure understanding and processing, visualization, web scraping, web crawling, public data utilization, and the concept and application of machine learning. Conclusions: The educational design content derived in this study is expected to help set the direction of competency-centered AI education and to increase the necessity and value of AI education by applying it within the major field.

An empirical study on factors influencing the admission competition rate for the department of dental hygiene (치위생학과의 입학경쟁률에 영향을 미치는 요인에 관한 실증적 연구)

  • Kyu-Seok Kim;Hye-Young Mun;Min-Ji Jo;Ha-Young Kim;Jung-Yun Kang
    • Journal of Korean society of Dental Hygiene, v.23 no.4, pp.303-309, 2023
  • Objectives: According to the Korea Education Development Institute, the college admission quota is expected to exceed the number of high school graduates, and the gap between them is expected to widen. This paper aims to conduct an empirical analysis of previously studied variables, with a specific focus on the admission competition rate for the department of dental hygiene. Methods: The research methodology is multiple linear regression analysis. The research data comprise structured data from academy information and web-based unstructured data collected over the past 3 years. Results: The analysis newly found that a university's online recognition and its location in the metropolitan area were statistically significant factors influencing the admission competition rate for the department of dental hygiene. Conclusions: The findings of this study are expected to help universities formulate admission strategies to attract new students and identify the factors that influence student attraction.

Design and Analysis of Technical Management System of Personal Information Security using Web Crawer (웹 크롤러를 이용한 개인정보보호의 기술적 관리 체계 설계와 해석)

  • Park, In-pyo;Jeon, Sang-june;Kim, Jeong-ho
    • Journal of Platform Technology, v.6 no.4, pp.69-77, 2018
  • For files containing personal information, awareness of personal information protection is insufficient in end-point areas such as personal computers, smart terminals, and personal storage devices. In this study, we use the Diffie-Hellman method to securely retrieve personal information files generated by a web crawler. We designed SEED and ARIA using hybrid slicing to protect personal information files against attack. The encryption performance for personal information files collected by web crawling is compared in terms of the encryption and decryption rate according to key generation and the encryption and decryption sharing according to the user key level. The simulation was performed on personal information files delivered during transmission to an external agency. Compared with existing methods, the detection rate improved by a factor of 4.64 and the information protection rate improved by 18.3%.
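
The Diffie-Hellman step mentioned in this abstract can be illustrated with a toy key agreement. This sketch is not the paper's design: the group parameters `P` and `G`, the function names, and the idea of hashing the shared secret down to 16 bytes (the key size of a 128-bit block cipher such as SEED, or ARIA in its 128-bit mode) are assumptions for illustration. The group is far too small for real use, and the sketch does not implement SEED or ARIA.

```python
import hashlib
import secrets

# Toy group, for illustration only (assumption): real systems use
# standardized groups of 2048 bits or more.
P = 2**127 - 1  # a Mersenne prime
G = 3


def make_keypair() -> tuple:
    """Generate a (private, public) Diffie-Hellman key pair."""
    private = secrets.randbelow(P - 3) + 2  # value in [2, P - 2]
    public = pow(G, private, P)
    return private, public


def shared_key(own_private: int, peer_public: int) -> bytes:
    """Derive a 16-byte symmetric key from the Diffie-Hellman shared secret."""
    secret = pow(peer_public, own_private, P)
    return hashlib.sha256(secret.to_bytes(16, "big")).digest()[:16]


# The crawler and the storage server each keep their private value secret
# and exchange only the public values.
crawler_private, crawler_public = make_keypair()
server_private, server_public = make_keypair()

crawler_key = shared_key(crawler_private, server_public)
server_key = shared_key(server_private, crawler_public)
assert crawler_key == server_key  # both sides now hold the same key
```

Both sides arrive at the same key without ever sending it over the network, so the collected files can then be encrypted with a symmetric cipher under that key.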

A Study on the Document Topic Extraction System for LDA-based User Sentiment Analysis (LDA 기반 사용자 감정분석을 위한 문서 토픽 추출 시스템에 대한 연구)

  • An, Yoon-Bin;Kim, Hak-Young;Moon, Yong-Hyun;Hwang, Seung-Yeon;Kim, Jeong-Joon
    • The Journal of the Institute of Internet, Broadcasting and Communication, v.21 no.2, pp.195-203, 2021
  • Recently, big data, a major technology in the IT field, has been expanding into various industrial sectors, and research on how to utilize it is actively underway. In most Internet industries, user reviews help users make decisions about purchasing products. However, screening positive, negative, and helpful reviews from a vast number of product reviews takes considerable time when deciding on a purchase. Therefore, this paper designs and implements a system that analyzes and aggregates keywords using LDA, a big data analysis technology, to provide meaningful information to users. To extract document topics, this study crawls data from the domestic book industry as the target domain and conducts big data analysis on it. This helps buyers by providing comprehensive information on products based on user review topics and appraisal words, and the product's outlook can also be identified through analysis of the review status.

A Study on Sentiment Analysis of Media and SNS response to National Policy: focusing on policy of Child allowance, Childbirth grant (국가 정책에 대한 언론과 SNS 반응의 감성 분석 연구 -아동 수당, 출산 장려금 정책을 중심으로-)

  • Yun, Hye Min;Choi, Eun Jung
    • Journal of Digital Convergence, v.17 no.2, pp.195-200, 2019
  • As the use of computers and of mobile communication devices such as smartphones and tablets expands, data on the Internet is accumulating exponentially. In addition, with the development of SNS, users can freely communicate with each other and share information in various fields, so diverse opinions accumulate in the form of big data. Accordingly, big data analysis techniques are being used to find differences between the response of the general public and the response of the media. In this paper, we analyzed the public response on SNS to the child allowance and childbirth grant policies and analyzed the response of the media. To do so, we gathered posts and comments that users posted on Twitter over a certain period, crawled news articles, and applied sentiment analysis. Using these data, we compared the opinions of the public posted on SNS with the response of the media expressed in news articles. As a result, we found that the public and the media respond differently to some national policies.

Industrial Technology Leak Detection System on the Dark Web (다크웹 환경에서 산업기술 유출 탐지 시스템)

  • Young Jae, Kong;Hang Bae, Chang
    • Smart Media Journal, v.11 no.10, pp.46-53, 2022
  • Today, due to the 4th industrial revolution and extensive R&D funding, domestic companies have begun to possess world-class industrial technologies, which have grown into important assets. The national government has designated them as "national core technology" in order to protect companies' critical industrial technologies. In particular, technology leaks in the shipbuilding, display, and semiconductor industries can result in a significant loss of competitiveness not only at the company level but also at the national level. Every year there are more insider leaks, ransomware attacks, and attempts to steal industrial technology through industrial espionage. The stolen industrial technology is then traded covertly on the dark web. In this paper, we propose a system for detecting industrial technology leaks in the dark web environment. The proposed model first builds a database through dark web crawling, using information collected from the OSINT environment. Keywords for industrial technology leakage are then extracted using the KeyBERT model, and signs of industrial technology leakage in the dark web environment are expressed as quantitative figures. Finally, based on the identified industrial technology leakage sites in the dark web environment, the possibility of secondary leakage is detected through the PageRank algorithm. The proposed method enabled the collection of 27,317 unique dark web domains and the extraction of 15,028 nuclear energy-related keywords from 100 nuclear power patents. Twelve dark web sites were identified by detecting secondary leaks starting from the dark web sites with the highest nuclear leakage figures.
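
The PageRank step named in this abstract can be sketched as a plain power iteration over a link graph. How the paper builds its graph and combines the scores with the KeyBERT keyword figures is not stated in the abstract, so the tiny graph below, the site labels `a` to `d`, and the parameter defaults are hypothetical.

```python
from typing import Dict, List


def pagerank(
    links: Dict[str, List[str]],
    damping: float = 0.85,
    iterations: int = 50,
) -> Dict[str, float]:
    """Compute PageRank scores by power iteration.

    links maps each site to the sites it links to.
    """
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node receives the teleportation share first.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A site with no outgoing links spreads its rank evenly.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical link graph between crawled sites.
links = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}
scores = pagerank(links)
for site, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{site}: {score:.3f}")
```

Sites that many other sites link to rank highest, which is one plausible reason to use such scores for prioritizing where leaked material is likely to spread next.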