• Title/Summary/Keyword: web crawling

Determinants of Shortening Job-hunting Period in Platform Labor Market: Analysis by using Web Crawling and Survival Model (플랫폼 노동시장의 구직기간 단축 결정요인: 웹크롤링과 생존모형을 이용한 분석)

  • Lee, Jongho
    • Journal of Digital Convergence / v.19 no.5 / pp.1-13 / 2021
  • The purpose of this research is to analyze how the wage level of new job seekers in the platform labor market affects the time it takes to get a first job. Platforms have recently drawn attention as one alternative for addressing rising unemployment. To create quality jobs, it is important to build trust between employers and employees on the platform. Previous studies showed that feedback from previous employers is important for solving the information asymmetry problem between the two parties. However, no such feedback exists for new job seekers who have not yet obtained a first job. We therefore focus on the fact that, on the platform, wages are proposed by job seekers rather than employers, and examine whether low wages posted by new job seekers shorten the job-hunting period. To this end, we use data on 3,704 job seekers from Freelancer.com. Survival analysis shows that low wages for new job seekers have a significant effect on shortening the job-hunting period.

A Scientific Quantitative Analysis on Vegetables of Joseon Dynasty using the Joseonwangjoshilrok based Data (조선왕조실록 과학계량적 분석을 통한 채소류의 통시적 고찰)

  • Kim, Mi-Hye
    • Journal of the Korean Society of Food Culture / v.36 no.2 / pp.143-157 / 2021
  • This study analyzed the prevalence of vegetables by period during the Joseon era, using the JoseonWangjoSilrok as a reference. The JoseonWangjoSilrok articles were collected from the Guksapyeonchanwewonhwe site, using web-crawling techniques to extract the relevant information. Out of 384,582 search results, 9,560 articles with vegetable-related keywords were found. The annual average number of vegetable records under each king's reign showed two peaks, in the 15th and 18th centuries. The counts by century were 2,750 in the 18th century, 2,529 in the 15th century, 1,424 in the 16th century, and 1,018 in the 19th century. A Variable Interest Index was designed to ascertain the interest in vegetables of the 27 Joseon kings. The king most interested in vegetables was the 19th king, Sookjong, followed by Youngjo. There were 5,105 vegetable-related findings within the JoseonWangjoSilrok that referred to specific species and categories of vegetables. Among them, 1,194 were stem-leaves vegetables (23.39%), 1,017 were root vegetables (19.92%), 1,148 were flower-fruit vegetables (22.49%), 1,144 were spice vegetables (22.41%), 95 were mushrooms (1.86%), and 507 were seaweeds (9.93%). Statistical analysis using ANOVA revealed the chronological factors that affected the vegetables' prevalence index.
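
The category percentages quoted in this abstract can be reproduced from the reported counts. The following is a minimal sketch of that arithmetic only; the function name `category_shares` is hypothetical and the study's own tooling is not described in the abstract.

```python
# Counts by vegetable category, as reported in the abstract.
counts = {
    "stem-leaves vegetables": 1194,
    "root vegetables": 1017,
    "flower-fruit vegetables": 1148,
    "spice vegetables": 1144,
    "mushrooms": 95,
    "seaweeds": 507,
}


def category_shares(counts):
    """Return each category's share of the total as a percentage (2 decimals)."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 2) for name, n in counts.items()}


total = sum(counts.values())  # should match the 5,105 findings in the abstract
shares = category_shares(counts)
for name, pct in shares.items():
    print(f"{name}: {counts[name]} ({pct:.2f}%)")
```

The six counts sum to the 5,105 findings stated in the abstract, and each share agrees with the quoted percentage to two decimal places.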

Analysis of Current Status of Marine Products and Characteristics of Processed Products Seafood in Joseon - via the Veritable Records of the Joseon Dynasty based data - (『조선왕조실록(朝鮮王朝實錄)』 속 수산물 현황과 가공식품 특성 분석)

  • Kim, Mi-Hye
    • Journal of the Korean Society of Food Culture / v.37 no.1 / pp.26-38 / 2022
  • This study used a big data method to analyze the chronological frequency and variety of seafood mentioned in the Veritable Records of the Joseon Dynasty. The findings will serve as a basis for research on the food culture of the Joseon period. Web crawling was used to scrape the Veritable Records for the reigns of Joseon's first through twenty-seventh kings. Of the 384,582 articles, 9,536 cases mentioned seafood. "Seafood" as a collective noun appeared 107 times (1.12%), 27 types of fish 8,372 times (87.79%), 3 types of mollusca (1.28%), 18 types of shellfish 213 times (2.23%), 6 types of crustacean 188 times (1.97%), and 9 types of seaweed 534 times (5.60%). Fish appeared most frequently of all the recorded seafood, and sea fish appeared more frequently than freshwater fish. The kings with the highest Strong Interest Inventory (SII) were, in order, Sungjong (15th century), Sehjo (15th), Youngjo (18th), Sehjong (15th), and Jungjo (18th). The kings of Joseon were most interested in seafood in the 15th and 18th centuries.

Proposal of Brand Evaluation Map through Big Data : Focus on The Hyundai Motor's Product Evaluation (빅데이터를 통한 브랜드 평가 맵 제안 : 현대자동차 제품 평가 중심으로)

  • Youn, Dae Myung;Lee, Yong Hyuck;Lee, Bong Gyou
    • Journal of Information Technology Services / v.19 no.4 / pp.1-11 / 2020
  • Through text mining, sentiment analysis, and semiotic analysis, this study aims to reinterpret the meaning of users' emotional words and related words in order to derive strategic elements of brand and design. We select a local car manufacturer whose brand is a clear topic of user opinion and web-crawl the comments that users themselves posted online about the manufacturer's cars. We then analyze the extracted morphemes and their associated words and map them onto the marketing mix theory. Through this process, we propose a methodology for supplementing and improving brand elements that evoke negative sentiment, retaining elements that evoke positive sentiment, and managing brands rationally. In particular, the Map presented in this study is expected to be usable as information for overall brand management.

A Method of Link Extraction on Non-standard Links in Web Crawling (웹크롤러의 비표준 링크에 관한 링크 추출 방안)

  • Jeong, Jun-Yeong;Jang, Mun-Su;Gang, Seon-Mi
    • Proceedings of the Korean Institute of Intelligent Systems Conference / 2008.04a / pp.79-82 / 2008
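  • A web crawler collects other documents by following the URL links in web pages. A considerable number of Korean websites connect their web documents with linking methods that do not conform to web standards. General-purpose web crawlers do not assume non-standard use of links, so they cannot collect such documents. Non-standard links are possible because, once JavaScript functionality was added to HTML, a markup language that is robust to user mistakes, irregular use of JavaScript came to be permitted. In this paper, we survey about 230 websites to identify link extraction problems that existing web crawlers have not solved, and propose an algorithm for collecting such links. In addition, we propose an efficient document collector model that, instead of a heavyweight JavaScript engine, uses a module consisting only of the functions needed to handle the JavaScript problem.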
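
To illustrate the kind of non-standard link this abstract refers to, the sketch below pulls link candidates out of `javascript:` hrefs and `onclick` handlers without running a JavaScript engine. It is not the authors' algorithm, which the abstract does not detail. The regular-expression heuristic, the class name `LinkExtractor`, the sample markup, and the `example.com` URLs are all assumptions made for illustration.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

# Quoted string literals inside a JavaScript snippet, e.g. 'view.jsp?id=3'.
_JS_STRING = re.compile(r"""["']([^"']+)["']""")
# Heuristic (an assumption): a literal is a link candidate if it looks like
# a URL or path, or ends in a common page extension.
_LOOKS_LIKE_URL = re.compile(
    r"^(https?://|/|\.{1,2}/)|\.(html?|jsp|php|asp)(\?|$)", re.I
)


class LinkExtractor(HTMLParser):
    """Collects standard hrefs plus URL-like literals found in JavaScript."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if not value:
                continue
            if name == "href" and not value.lower().startswith(("javascript:", "#")):
                self._add(value)  # ordinary, standard link
            elif name in ("href", "onclick"):
                # Non-standard link: scan the script text for URL-like literals.
                for literal in _JS_STRING.findall(value):
                    if _LOOKS_LIKE_URL.search(literal):
                        self._add(literal)

    def _add(self, url):
        absolute = urljoin(self.base_url, url)
        if absolute not in self.links:
            self.links.append(absolute)


def extract_links(html, base_url):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


sample = (
    '<a href="/notice/list.html">standard</a>'
    "<a href=\"javascript:goPage('board/view.jsp?id=3')\">script href</a>"
    "<td onclick=\"location.href='detail.php?no=7'\">clickable cell</td>"
    "<a href=\"#\" onclick=\"alert('hello')\">not a link</a>"
)
for link in extract_links(sample, "http://example.com/main/index.html"):
    print(link)
```

A pattern-based scan like this misses URLs that are assembled at run time from several variables; handling those cases is where a fuller JavaScript-aware module becomes necessary.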

Data Analysis Web Application Based on Text Mining (텍스트 마이닝 기반의 데이터 분석 웹 애플리케이션)

  • Gil, Wan-Je;Kim, Jae-Woong;Park, Koo-Rack;Lee, Yun-Yeol
    • Proceedings of the Korean Society of Computer Information Conference / 2021.07a / pp.103-104 / 2021
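  • This paper proposes a topic-modeling web application model based on text mining. The goal is to design and implement a web application that uses web crawling so that entering a keyword saves summarized paper information to a file, and that makes research trends easy to check through keyword frequency analysis and topic modeling. With the proposed web application, users can try collecting and storing papers and analyzing text even if they lack knowledge of programming languages and data analysis techniques. Furthermore, because the web system was developed with Python libraries rather than relying on conventional languages such as HTML, CSS, and JavaScript, it is expected to be adoptable as a project-based course curriculum when teaching data analysis and machine learning in Python.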
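
The keyword frequency analysis mentioned in this abstract can be sketched with the Python standard library alone. This is only an illustration of the general technique. The stopword list, the function name `keyword_frequencies`, and the sample titles are assumptions, and the abstract does not say which libraries the authors used.

```python
import re
from collections import Counter

# A tiny illustrative stopword list (an assumption, not from the paper).
STOPWORDS = {"the", "of", "and", "a", "in", "to", "for", "on", "using", "based"}


def keyword_frequencies(documents, top_n=5):
    """Count how often each keyword appears across a list of texts."""
    counts = Counter()
    for text in documents:
        tokens = re.findall(r"[a-z]+", text.lower())
        counts.update(t for t in tokens if t not in STOPWORDS and len(t) > 2)
    return counts.most_common(top_n)


# Hypothetical collected paper titles.
titles = [
    "Web crawling for text mining",
    "Topic modeling and text mining",
    "Text mining web application",
]
for word, n in keyword_frequencies(titles, top_n=3):
    print(word, n)
```

Korean text would need a morphological analyzer in place of the simple regular-expression tokenizer used here.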

Responsive web based Virus Information Sytem using Crawling (크롤링을 통한 반응형웹 기반의 바이러스 정보 시스템)

  • Hur, Tai-Sung;Baek, Jae-Won
    • Proceedings of the Korean Society of Computer Information Conference / 2020.07a / pp.269-270 / 2020
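  • Even after the COVID-19 crisis, numerous viruses will continue to spread around the world. What people need in the face of these diseases is information, yet to obtain it they spend time visiting many sites and searching, and still cannot find what they want quickly. To solve this problem, we developed a responsive web system that crawls and processes data from various sites and presents the necessary information, so that users can see at a glance, and in a short time, virus information such as the status of currently prevalent diseases, the status by city and province, the locations and stock of mask retailers, and the visit records of infected persons.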

Determinants of Wage for Web-based Platform Workers: In perspective of evaluation by previous employers (웹 기반형(Web-based) 플랫폼 노동자의 임금 결정요인: 이전 고용주에 의한 평가의 관점에서)

  • Lim, Jisun
    • Journal of Digital Convergence / v.20 no.4 / pp.1-14 / 2022
  • The purpose of this study was to find the wage determinants of web-based platform workers. To this end, we used information on a total of 3,575 web-based platform workers collected in September 2018 from Freelancer.com, a global platform labor market, and tested with OLS and QR methods whether newly available indicators, such as evaluations by previous employers, had a significant effect on platform workers' wages. According to the OLS estimates, the number of reviews, as well as education and experience, affects the wages of platform workers. According to the QR estimates, however, experience rather than education, and recommendations rather than reviews, have a more significant effect on the wages of web-based platform workers as the wage level rises.

HTML Text Extraction Using Frequency Analysis (빈도 분석을 이용한 HTML 텍스트 추출)

  • Kim, Jin-Hwan;Kim, Eun-Gyung
    • Journal of the Korea Institute of Information and Communication Engineering / v.25 no.9 / pp.1135-1143 / 2021
  • Text is now frequently collected with web crawlers for big data analysis. However, to collect only the necessary text from a web page made up of numerous tags and text fragments, the crawler must be told which HTML tags and style attributes contain the required text, which is cumbersome. In this paper, we propose a method of extracting text using the frequency with which text appears across web pages, without specifying HTML tags or style attributes. In the proposed method, text is extracted from the DOM tree of every collected web page, the frequency of appearance of each text is analyzed, and the main text is obtained by excluding text that appears with high frequency. The study verified the superiority of the proposed method.
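
The idea summarized in this abstract (text that recurs across many collected pages is probably navigation or footer material, not main content) can be sketched as follows. This is a simplified illustration and not the authors' implementation. The `max_share` threshold, the function names, and the sample pages are assumptions.

```python
from collections import Counter
from html.parser import HTMLParser


class TextNodes(HTMLParser):
    """Collects the visible text nodes of one page, skipping script and style."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.texts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.texts.append(text)


def text_nodes(html):
    parser = TextNodes()
    parser.feed(html)
    parser.close()
    return parser.texts


def extract_main_text(pages, max_share=0.5):
    """Keep only text that appears in at most max_share of the pages."""
    per_page = [text_nodes(html) for html in pages]
    page_count = Counter()
    for texts in per_page:
        page_count.update(set(texts))  # count each text once per page
    limit = max_share * len(pages)
    return [[t for t in texts if page_count[t] <= limit] for texts in per_page]


# Hypothetical pages sharing the same menu and footer.
pages = [
    "<html><body><ul><li>Home</li><li>Contact</li></ul>"
    "<p>Crawler A collects news.</p><div>Copyright</div></body></html>",
    "<html><body><ul><li>Home</li><li>Contact</li></ul>"
    "<p>Crawler B collects blogs.</p><div>Copyright</div></body></html>",
    "<html><body><ul><li>Home</li><li>Contact</li></ul>"
    "<p>Crawler C collects forums.</p><div>Copyright</div></body></html>",
]
for main in extract_main_text(pages):
    print(main)
```

No tag or style selector is configured anywhere in the sketch; the menu and footer drop out purely because they repeat on every page.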

HTML Text Extraction Using Tag Path and Text Appearance Frequency (태그 경로 및 텍스트 출현 빈도를 이용한 HTML 본문 추출)

  • Kim, Jin-Hwan;Kim, Eun-Gyung
    • Journal of the Korea Institute of Information and Communication Engineering / v.25 no.12 / pp.1709-1715 / 2021
  • To extract the necessary text from a web page accurately, one approach is to tell the web crawler which tags and style attributes hold the main content, but this has the problem that the extraction logic must be modified whenever the page layout changes. The method proposed in the previous study to solve this problem, which extracts text by analyzing its frequency of appearance, had the limitation that performance varied widely depending on the collection channel of the web page. Therefore, in this paper, we propose a method of extracting text with high accuracy from various collection channels by analyzing not only the frequency of appearance of text but also the parent tag paths of the text nodes extracted from the DOM tree of web pages.
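
One way to combine text frequency with parent tag paths is sketched below: each text node is recorded together with the path of its ancestor tags, and a path is treated as boilerplate when most of the text under it repeats across pages. The abstract does not state how the authors combine the two signals, so this scoring rule, the `max_repeat_ratio` threshold, and all names and sample pages are assumptions for illustration only.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never have a closing tag, so they are not pushed on the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"}


class PathTextNodes(HTMLParser):
    """Collects (parent tag path, text) pairs for one page."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.nodes = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching open tag.
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and not {"script", "style"} & set(self.stack):
            self.nodes.append(("/".join(self.stack), text))


def path_nodes(html):
    parser = PathTextNodes()
    parser.feed(html)
    parser.close()
    return parser.nodes


def extract_main_text_by_path(pages, max_repeat_ratio=0.5):
    """Drop every tag path whose text nodes mostly repeat across pages."""
    per_page = [path_nodes(html) for html in pages]
    page_count = Counter()
    for nodes in per_page:
        page_count.update({text for _, text in nodes})
    total, repeated = Counter(), Counter()
    for nodes in per_page:
        for path, text in nodes:
            total[path] += 1
            if page_count[text] > 1:
                repeated[path] += 1
    boilerplate = {p for p in total if repeated[p] / total[p] > max_repeat_ratio}
    return [[t for p, t in nodes if p not in boilerplate] for nodes in per_page]


# Hypothetical pages: the menu holds one item that differs on each page.
pages = [
    "<html><body><ul><li>Home</li><li>About</li><li>Hello, Kim</li></ul>"
    "<p>Story one.</p></body></html>",
    "<html><body><ul><li>Home</li><li>About</li><li>Hello, Lee</li></ul>"
    "<p>Story two.</p></body></html>",
]
for main in extract_main_text_by_path(pages):
    print(main)
```

In this sample the greeting in the menu is unique to each page, so a frequency-only filter would keep it. Scoring by tag path removes it because its sibling menu items repeat.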