• Title/Summary/Keyword: 웹크롤링기술

Search Result 33, Processing Time 0.034 seconds

A Comparative Analysis of Content-based Music Retrieval Systems (내용기반 음악검색 시스템의 비교 분석)

  • Ro, Jung-Soon
    • Journal of the Korean Society for information Management
    • /
    • v.30 no.3
    • /
    • pp.23-48
    • /
    • 2013
  • This study compared and analyzed 15 CBMR (Content-based Music Retrieval) systems accessible on the web in terms of DB size and type, query type, access point, input and output type, and search functions, with reviewing features of music information and techniques used for transforming or transcribing of music sources, extracting and segmenting melodies, extracting and indexing features of music, and matching algorithms for CBMR systems. Application of text information retrieval techniques such as inverted indexing, N-gram indexing, Boolean search, truncation, keyword and phrase search, normalization, filtering, browsing, exact matching, similarity measure using edit distance, sorting, etc. to enhancing the CBMR; effort for increasing DB size and usability; and problems in extracting melodies, deleting stop notes in queries, and using solfege as pitch information were found as the results of analysis.

A Study on Detecting Fake Reviews Using Machine Learning: Focusing on User Behavior Analysis (머신러닝을 활용한 가짜리뷰 탐지 연구: 사용자 행동 분석을 중심으로)

  • Lee, Min Cheol;Yoon, Hyun Shik
    • Knowledge Management Research
    • /
    • v.21 no.3
    • /
    • pp.177-195
    • /
    • 2020
  • The social consciousness on fake reviews has triggered researchers to suggest ways to cope with them by analyzing contents of fake reviews or finding ways to discover them by means of structural characteristics of them. This research tried to collect data from blog posts in Naver and detect habitual patterns users use unconsciously by variables extracted from blogs and blog posts by a machine learning model and wanted to use the technique in predicting fake reviews. Data analysis showed that there was a very high relationship between the number of all the posts registered in the blog of the writer of the related writing and the date when it was registered. And, it was found that, as model to detect advertising reviews, Random Forest is the most suitable. If a review is predicted to be an advertising one by the model suggested in this research, it is very likely that it is fake review, and that it violates the guidelines on investigation into markings and advertising regarding recommendation and guarantee in the Law of Marking and Advertising. The fact that, instead of using analysis of morphemes in contents of writings, this research adopts behavior analysis of the writer, and, based on such an approach, collects characteristic data of blogs and blog posts not by manual works, but by automated system, and discerns whether a certain writing is advertising or not is expected to have positive effects on improving efficiency and effectiveness in detecting fake reviews.

Design and Analysis of Technical Management System of Personal Information Security using Web Crawer (웹 크롤러를 이용한 개인정보보호의 기술적 관리 체계 설계와 해석)

  • Park, In-pyo;Jeon, Sang-june;Kim, Jeong-ho
    • Journal of Platform Technology
    • /
    • v.6 no.4
    • /
    • pp.69-77
    • /
    • 2018
  • In the case of personal information files containing personal information, there is insufficient awareness of personal information protection in end-point areas such as personal computers, smart terminals, and personal storage devices. In this study, we use Diffie-Hellman method to securely retrieve personal information files generated by web crawler. We designed SEED and ARIA using hybrid slicing to protect against attack on personal information file. The encryption performance of the personal information file collected by the Web crawling method is compared with the encryption decryption rate according to the key generation and the encryption decryption sharing according to the user key level. The simulation was performed on the personal information file delivered to the external agency transmission process. As a result, we compared the performance of existing methods and found that the detection rate is improved by 4.64 times and the information protection rate is improved by 18.3%.

Media-based Analysis of Gasoline Inventory with Korean Text Summarization (한국어 문서 요약 기법을 활용한 휘발유 재고량에 대한 미디어 분석)

  • Sungyeon Yoon;Minseo Park
    • The Journal of the Convergence on Culture Technology
    • /
    • v.9 no.5
    • /
    • pp.509-515
    • /
    • 2023
  • Despite the continued development of alternative energies, fuel consumption is increasing. In particular, the price of gasoline fluctuates greatly according to fluctuations in international oil prices. Gas stations adjust their gasoline inventory to respond to gasoline price fluctuations. In this study, news datasets is used to analyze the gasoline consumption patterns through fluctuations of the gasoline inventory. First, collecting news datasets with web crawling. Second, summarizing news datasets using KoBART, which summarizes the Korean text datasets. Finally, preprocessing and deriving the fluctuations factors through N-Gram Language Model and TF-IDF. Through this study, it is possible to analyze and predict gasoline consumption patterns.

Web crawling process of each social network service for recognizing water quality accidents in the water supply networks (물공급네트워크 수질사고인지를 위한 소셜네트워크 서비스 별 웹크롤링 방법론 개발)

  • Yoo, Do Guen;Hong, Seunghyeok;Moon, Gihoon
    • Proceedings of the Korea Water Resources Association Conference
    • /
    • 2022.05a
    • /
    • pp.398-398
    • /
    • 2022
  • 최근 수돗물 공급과정에 있어 적수, 유충 발생 등 지역 단위의 수질문제로 국민의 직간접적인 피해가 발생된 바 있다. 수질문제 발생 시, 소셜네트워크서비스(SNS)에 게시되는 피해 관련 의견은 시공간적으로 빠르게 확산되며, 궁극적으로는 물공급과정 전체의 부정적 인식증가와 신뢰도 저하를 초래한다. 따라서, 물공급시스템에서의 수질사고 발생을 빠르게 인지하는 다양한 방법론의 적용을 통한 피해 최소화를 위한 노력이 반드시 필요하다. 일반적으로 수질사고는 다양한 항목의 실시간 계측기에서 획득되는 시계열자료의 변화양상을 통해 판단할 수 있으나, 이와 같은 방법론의 효율적 적용을 위해서는 선진계측인프라의 도입이 선행되어야 한다. 본 연구에서는 국내의 발달된 정보통신기술환경을 활용하여, 물공급네트워크 내 수질사고인지를 위한 SNS 별 웹크롤링 방법론을 제안하고, 적용결과를 분석하였다. 방법론의 구현에 앞서, 각종 SNS 별(트위터, 인스타그램, 블로그, 네이버 카페 등) 프로그래밍을 통한 웹크롤링 가능여부, 정보획득 기간 등을 확인하였으며, 과거 유사 수질사고 발생 시 영향력과 관련 게시글이 크게 나타난 네이버 카페와 트위터를 중심으로 웹 크롤링 절차를 제시하였다. 네이버 카페의 경우 대상급수구역 내의 시민들이 다수 참여하는 카페를 목록화하고, 지자체명과 핵심 키워드(수돗물, 유충, 적수) 조합을 활용한 웹크롤링을 수행하여, 관련 게시물 건수와 의미를 실시간으로 분석하는 절차를 마련하였다. 개발된 SNS 별 웹크롤링 방법론에 따라 과거 수질사고가 발생된 바 있는 2개 이상의 지자체에 대한 분석을 실시하였으며, SNS 별 결과에 있어 차이점을 확인하여 제시하였다. 향후 제안된 방법을 적용하여 시공간적 수질사고 정보의 전파 및 확산양상을 추가적으로 분석할수 있을 것으로 기대된다.

  • PDF

Korea National College of Agriculture and Fisheries in Naver News by Web Crolling : Based on Keyword Analysis and Semantic Network Analysis (웹 크롤링에 의한 네이버 뉴스에서의 한국농수산대학 - 키워드 분석과 의미연결망분석 -)

  • Joo, J.S.;Lee, S.Y.;Kim, S.H.;Park, N.B.
    • Journal of Practical Agriculture & Fisheries Research
    • /
    • v.23 no.2
    • /
    • pp.71-86
    • /
    • 2021
  • This study was conducted to find information on the university's image from words related to 'Korea National College of Agriculture and Fisheries (KNCAF)' in Naver News. For this purpose, word frequency analysis, TF-IDF evaluation and semantic network analysis were performed using web crawling technology. In word frequency analysis, 'agriculture', 'education', 'support', 'farmer', 'youth', 'university', 'business', 'rural', 'CEO' were important words. In the TF-IDF evaluation, the key words were 'farmer', 'dron', 'agricultural and livestock food department', 'Jeonbuk', 'young farmer', 'agriculture', 'Chonju', 'university', 'device', 'spreading'. In the semantic network analysis, the Bigrams showed high correlations in the order of 'youth' - 'farmer', 'digital' - 'agriculture', 'farming' - 'settlement', 'agriculture' - 'rural', 'digital' - 'turnover'. As a result of evaluating the importance of keywords as five central index, 'agriculture' ranked first. And the keywords in the second place of the centrality index were 'farmers' (Cc, Cb), 'education' (Cd, Cp) and 'future' (Ce). The sperman's rank correlation coefficient by centrality index showed the most similar rank between Degree centrality and Pagerank centrality. The KNCAF articles of Naver News were used as important words such as 'agriculture', 'education', 'support', 'farmer', 'youth' in terms of word frequency. However, in the evaluation including document frequency, the words such as 'farmer', 'dron', 'Ministry of Agriculture, Food and Rural Affairs', 'Jeonbuk', and 'young farmers' were found to be key words. The centrality analysis considering the network connectivity between words was suitable for evaluation by Cd and Cp. And the words with strong centrality were 'agriculture', 'education', 'future', 'farmer', 'digital', 'support', 'utilization'.

Detecting Weak Signals for Carbon Neutrality Technology using Text Mining of Web News (탄소중립 기술의 미래신호 탐색연구: 국내 뉴스 기사 텍스트데이터를 중심으로)

  • Jisong Jeong;Seungkook Roh
    • Journal of Industrial Convergence
    • /
    • v.21 no.5
    • /
    • pp.1-13
    • /
    • 2023
  • Carbon neutrality is the concept of reducing greenhouse gases emitted by human activities and making actual emissions zero through removal of remaining gases. It is also called "Net-Zero" and "carbon zero". Korea has declared a "2050 Carbon Neutrality policy" to cope with the climate change crisis. Various carbon reduction legislative processes are underway. Since carbon neutrality requires changes in industrial technology, it is important to prepare a system for carbon zero. This paper aims to understand the status and trends of global carbon neutrality technology. Therefore, ROK's web platform "www.naver.com." was selected as the data collection scope. Korean online articles related to carbon neutrality were collected. Carbon neutrality technology trends were analyzed by future signal methodology and Word2Vec algorithm which is a neural network deep learning technology. As a result, technology advancement in the steel and petrochemical sectors, which are carbon over-release industries, was required. Investment feasibility in the electric vehicle sector and technology advancement were on the rise. It seems that the government's support for carbon neutrality and the creation of global technology infrastructure should be supported. In addition, it is urgent to cultivate human resources, and possible to confirm the need to prepare support policies for carbon neutrality.

Sensitivity Identification Method for New Words of Social Media based on Naive Bayes Classification (나이브 베이즈 기반 소셜 미디어 상의 신조어 감성 판별 기법)

  • Kim, Jeong In;Park, Sang Jin;Kim, Hyoung Ju;Choi, Jun Ho;Kim, Han Il;Kim, Pan Koo
    • Smart Media Journal
    • /
    • v.9 no.1
    • /
    • pp.51-59
    • /
    • 2020
  • From PC communication to the development of the internet, a new term has been coined on the social media, and the social media culture has been formed due to the spread of smart phones, and the newly coined word is becoming a culture. With the advent of social networking sites and smart phones serving as a bridge, the number of data has increased in real time. The use of new words can have many advantages, including the use of short sentences to solve the problems of various letter-limited messengers and reduce data. However, new words do not have a dictionary meaning and there are limitations and degradation of algorithms such as data mining. Therefore, in this paper, the opinion of the document is confirmed by collecting data through web crawling and extracting new words contained within the text data and establishing an emotional classification. The progress of the experiment is divided into three categories. First, a word collected by collecting a new word on the social media is subjected to learned of affirmative and negative. Next, to derive and verify emotional values using standard documents, TF-IDF is used to score noun sensibilities to enter the emotional values of the data. As with the new words, the classified emotional values are applied to verify that the emotions are classified in standard language documents. Finally, a combination of the newly coined words and standard emotional values is used to perform a comparative analysis of the technology of the instrument.

Prototype Design and Development of Online Recruitment System Based on Social Media and Video Interview Analysis (소셜미디어 및 면접 영상 분석 기반 온라인 채용지원시스템 프로토타입 설계 및 구현)

  • Cho, Jinhyung;Kang, Hwansoo;Yoo, Woochang;Park, Kyutae
    • Journal of Digital Convergence
    • /
    • v.19 no.3
    • /
    • pp.203-209
    • /
    • 2021
  • In this study, a prototype design model was proposed for developing an online recruitment system through multi-dimensional data crawling and social media analysis, and validates text information and video interview in job application process. This study includes a comparative analysis process through text mining to verify the authenticity of job application paperwork and to effectively hire and allocate workers based on the potential job capability. Based on the prototype system, we conducted performance tests and analyzed the result for key performance indicators such as text mining accuracy and interview STT(speech to text) function recognition rate. If commercialized based on design specifications and prototype development results derived from this study, it may be expected to be utilized as the intelligent online recruitment system technology required in the public and private recruitment markets in the future.

A Study on the Factors of Well-aging through Big Data Analysis : Focusing on Newspaper Articles (빅데이터 분석을 활용한 웰에이징 요인에 관한 연구 : 신문기사를 중심으로)

  • Lee, Chong Hyung;Kang, Kyung Hee;Kim, Yong Ha;Lim, Hyo Nam;Ku, Jin Hee;Kim, Kwang Hwan
    • Journal of the Korea Academia-Industrial cooperation Society
    • /
    • v.22 no.5
    • /
    • pp.354-360
    • /
    • 2021
  • People hope to live a healthy and happy life achieving satisfaction by striking a good work-life balance. Therefore, there is a growing interest in well-aging which means living happily to a healthy old age without worry. This study identified important factors related to well-aging by analyzing news articles published in Korea. Using Python-based web crawling, 1,199 articles were collected on the news service of portal site Daum till November 2020, and 374 articles were selected which matched the subject of the study. The frequency analysis results of text mining showed keywords such as 'elderly', 'health', 'skin', 'well-aging', 'product', 'person', 'aging', 'female', 'domestic' and 'retirement' as important keywords. Besides, a social network analysis with 45 important keywords revealed strong connections in the order of 'skin-wrinkle', 'skin-aging' and 'old-health'. The result of the CONCOR analysis showed that 45 main keywords were composed of eight clusters of 'life and happiness', 'disease and death', 'nutrition and exercise', 'healing', 'health', and 'elderly services'.