• Title/Summary/Keyword: text frequency analysis (텍스트 빈도 분석)

WCTT: Web Crawling System based on HTML Document Formalization (WCTT: HTML 문서 정형화 기반 웹 크롤링 시스템)

  • Kim, Jin-Hwan; Kim, Eun-Gyung
    • Journal of the Korea Institute of Information and Communication Engineering, v.26 no.4, pp.495-502, 2022
  • Web crawlers, which are widely used today to collect text from the web, are difficult to maintain and extend because researchers must analyze the tags and styles of HTML documents and then implement different collection logic for each collection channel. To solve this problem, a web crawler should formalize HTML documents into a common structure before collecting text. In this paper, we design and implement WCTT (Web Crawling system based on Tag path and Text appearance frequency), a web crawling system that collects text with a single collection logic by formalizing HTML documents based on tag path and text appearance frequency. Because WCTT applies the same logic to every collection channel, channels are easy to maintain and to add. It also provides a preprocessing function that removes stopwords and extracts only nouns, for uses such as keyword network analysis.

Analyzing Architectural History Terminologies by Text Mining and Association Analysis (텍스트 마이닝과 연관 관계 분석을 이용한 건축역사 용어 분석)

  • Kim, Min-Jeong; Kim, Chul-Joo
    • Journal of Digital Convergence, v.15 no.1, pp.443-452, 2017
  • Architectural history traces changes in architecture across traditions, regions, overarching stylistic trends, and dates. This study uses text mining and association analysis to identify terminology in architectural history in terms of proximity and frequency. We examined articles published in the "Journal of Architectural History", the sole journal for architectural history studies. First, frequently occurring key terms were extracted from papers that had titles, keywords, and abstracts. Next, we analyzed typical key terms that appear frequently overall, as well as specific key terms that appear only in particular research areas. Finally, association analysis was used to find frequent patterns among the key terms. The results can serve as fundamental data for understanding issues and trends in architectural history.
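
The association analysis step mentioned in this abstract can be illustrated with a minimal sketch. It assumes each paper is reduced to a set of key terms and scores term pairs by support and confidence; the function name `pair_rules`, the threshold, and the sample terms are hypothetical and are not taken from the paper.

```python
from collections import Counter
from itertools import combinations

def pair_rules(transactions, min_support=0.5):
    """Return (antecedent, consequent, support, confidence) for frequent term pairs."""
    n = len(transactions)
    term_sets = [set(t) for t in transactions]
    # Number of papers containing each term, and each unordered pair of terms.
    item_count = Counter(term for s in term_sets for term in s)
    pair_count = Counter(p for s in term_sets for p in combinations(sorted(s), 2))
    rules = []
    for (a, b), count in pair_count.items():
        support = count / n  # share of papers that contain both terms
        if support >= min_support:
            # confidence(a -> b) = P(b | a); emit the rule in both directions.
            rules.append((a, b, support, count / item_count[a]))
            rules.append((b, a, support, count / item_count[b]))
    return rules

# Hypothetical key-term sets, one per paper.
papers = [
    ["temple", "pagoda"],
    ["temple", "pagoda", "palace"],
    ["temple", "palace"],
    ["pagoda"],
]
rules = pair_rules(papers)
```

Here every paper that mentions "palace" also mentions "temple", so the rule palace -> temple has confidence 1.0 at support 0.5. A full analysis would also consider itemsets larger than pairs, for example with the Apriori algorithm.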

An Exploratory Study of VR Technology using Patents and News Articles (특허와 뉴스 기사를 이용한 가상현실 기술에 관한 탐색적 연구)

  • Kim, Sungbum
    • Journal of Digital Convergence, v.16 no.11, pp.185-199, 2018
  • The purpose of this study is to derive the core technologies of VR using patent analysis and to explore the direction of social and public interest in VR using news analysis. In Study 1, we derived keywords from word frequencies in patent texts and compared them by company, year, and technical classification. Netminer, a network analysis program, was used to analyze the IPC codes of the patents. In Study 2, we analyzed news articles using the T-LAB program. TF-IDF was used as the keyword selection method, and chi-square and association index algorithms were used to extract the words most relevant to VR. Through this study, we confirmed that VR is a fusion technology that includes optics, head mounted display (HMD), data analysis, and electric and electronic technology, and found that optical technology is central among the technologies currently being developed. In addition, the news articles showed that society and the public are interested in the formation and growth of VR suppliers and markets, and that VR should be developed on the basis of user experience.
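
As a rough illustration of TF-IDF keyword selection, the sketch below uses the textbook weighting tf × ln(N/df) on whitespace-tokenized text. The exact weighting applied by T-LAB is not described in the abstract, so this formula, the function name `tf_idf`, and the sample documents are assumptions.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {word: score} dict per document, using tf * ln(N / df)."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each word.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        scores.append({
            word: (count / len(tokens)) * math.log(n_docs / df[word])
            for word, count in tf.items()
        })
    return scores

# Hypothetical miniature corpus standing in for news articles.
docs = ["vr headset display", "vr market growth", "vr optics display"]
scores = tf_idf(docs)
```

A word that appears in every document, such as "vr" here, scores zero, while a word unique to one document scores highest. That is why TF-IDF tends to favour distinctive terms over merely frequent ones.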

A Study on Unstructured text data Post-processing Methodology using Stopword Thesaurus (불용어 시소러스를 이용한 비정형 텍스트 데이터 후처리 방법론에 관한 연구)

  • Lee, Won-Jo
    • The Journal of the Convergence on Culture Technology, v.9 no.6, pp.935-940, 2023
  • Most text data collected through web scraping for artificial intelligence and big data analysis is large and unstructured, so a refining process is required before analysis. The data becomes structured, analyzable data through a heuristic pre-processing refining step and a machine post-processing refining step. In this study, the machine post-processing step uses a Korean dictionary and a stopword dictionary to extract vocabulary for the frequency analysis that underlies word cloud analysis. In this process, "user-defined stopwords" are used to efficiently remove stopwords that were not otherwise removed. We propose a methodology that applies a "user-defined stopword thesaurus" to complement the problems of the existing "stopword dictionary" method, and examine its pros and cons through a case analysis using R's word cloud technique. We present a comparative verification and suggest the effectiveness of applying the proposed methodology in practice.

An Exploratory Study of Technology Planning and Hype Cycle Using Content Analysis (뉴스 내용분석을 활용한 하이프 사이클 적용의 탐색적 연구: 클라우드 컴퓨팅 기술을 중심으로)

  • Suh, Yoonkyo; Kim, Si jeoung
    • Proceedings of the Korea Technology Innovation Society Conference, 2015.11a, pp.927-945, 2015
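  • This study explores whether news content analysis, a methodology widely used in science communication, fits the hype cycle model. Its significance lies in showing that content analysis of science and technology news can serve as a useful complementary method for technology planning, one that captures in concrete terms the social visibility described by the hype cycle model. To this end, we conducted a news content analysis of cloud computing, a representative promising technology. The analysis focused on the frequency of news about cloud computing technology, the tone of coverage (positive, neutral, negative), and five news frames, and examined whether news coverage trends follow the course of the hype cycle. In the content analysis of general dailies, business dailies, and IT trade papers, news frequency, coverage tone, and news frames all followed the hype cycle. The trend after 2014 in particular suggests that the technology has passed the point of collapsed expectations on the hype cycle and is evolving toward the point of realistic awareness. We expect these results, combined with recently spreading techniques such as text mining and automatic sentiment word identification, to contribute as a complementary method of technology planning analysis for understanding social context.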

A Study on Data Cleansing Techniques for Word Cloud Analysis of Text Data (텍스트 데이터 워드클라우드 분석을 위한 데이터 정제기법에 관한 연구)

  • Lee, Won-Jo
    • The Journal of the Convergence on Culture Technology, v.7 no.4, pp.745-750, 2021
  • In big data visualization analysis of unstructured text, the raw data is usually large, and analysis techniques cannot be applied until it has been cleansed. Therefore, unnecessary data is removed from the collected raw data in a first, heuristic cleansing step, and stopwords are removed in a second, machine cleansing step. The frequency of each vocabulary item is then calculated and visualized with the word cloud technique, key issues are extracted and turned into information, and the results are analyzed. In this study, we propose a new stopword cleansing technique that uses an external stopword set (DB) with a Python word cloud, and identify the problems and effectiveness of this technique through a practical case analysis. Based on this verification, we present the practical utility of word cloud analysis that applies the proposed cleansing technique.
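
The machine cleansing and frequency-counting steps described in this abstract can be sketched roughly as follows. The stopword set, the tokenizer, and the function name `word_frequencies` are hypothetical stand-ins; the paper's external stopword DB and its word cloud code are not shown in the abstract.

```python
import re
from collections import Counter

# Hypothetical external stopword set; in practice it would be loaded from a file or DB.
STOPWORDS = {"the", "a", "of", "and", "is", "in", "to"}

def word_frequencies(text, stopwords=STOPWORDS):
    """Lowercase the text, tokenize on runs of letters, drop stopwords, count the rest."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(token for token in tokens if token not in stopwords)

freqs = word_frequencies("The cloud of words is a cloud of data and the data is text")
top_terms = freqs.most_common(2)
```

The resulting counts are the kind of input a word cloud renderer expects (a mapping from term to frequency). Korean text would need a morphological analyzer in place of the regular-expression tokenizer used here.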

A Corpus Analysis to the Engineering Academic English (공학학술영어에 대한 코퍼스 분석)

  • Ha, Myung-Jeong; Rhee, Eugene
    • Proceedings of the Korea Contents Association Conference, 2017.05a, pp.139-140, 2017
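  • This study discusses the usefulness of a corpus-based approach to English for Specific Purposes (ESP), the major-specific English that engineering students learn. We built a corpus from the major-subject texts used in a college of engineering, present the results of computer-based analysis, and examine the characteristics of the engineering English corpus, with the ultimate aim of supporting data-driven learning for engineering students who take English-medium courses. The target corpus was built from engineering courses that apply in common regardless of sub-major, and the British National Corpus was used as the reference corpus for comparison. The engineering English corpus consists of 1.8 million words in total and about 16,000 word types, and frequency analysis and keyword analysis were performed with the corpus analysis tool AntConc 3.4.4. Analysis of high-frequency vocabulary showed that the most frequent words in both the target and reference corpora are function words rather than content words, and analysis of content words alone showed that the engineering English corpus contains more science-domain variant words than the reference corpus. Keyword analysis further showed that the keyword verbs of the engineering English corpus include relatively more non-technical academic vocabulary than technical vocabulary. This points to the need, in ESP education, to raise awareness of general academic English alongside major-related technical English.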

Keyword trends analysis related to the aviation industry during the Covid-19 period using text mining (텍스트마이닝을 활용한 Covid-19 기간 동안의 항공산업 관련 키워드 트렌드 분석)

  • Choi, Donghyun; Song, Bomi; Park, Dahyeon; Lee, Sungwoo
    • Journal of Korea Society of Industrial Information Systems, v.27 no.2, pp.115-128, 2022
  • The purpose of this study is to conduct a keyword trend analysis of article data on the impact of Covid-19 on the aviation industry. Related articles were extracted around the keyword "Airline" for the 6 months before and the 6 months after the outbreak of Covid-19, and topic modeling (LDA) was then performed. The main topics that arise during an epidemic such as Covid-19 were extracted in this way, and the results are expected to serve as primary data for predicting the impact on the aviation industry when a similar event occurs.

Sentence Similarity Analysis using Ontology Based on Cosine Similarity (코사인 유사도를 기반의 온톨로지를 이용한 문장유사도 분석)

  • Hwang, Chi-gon; Yoon, Chang-Pyo; Yun, Dai Yeol
    • Proceedings of the Korean Institute of Information and Communication Sciences Conference, 2021.05a, pp.441-443, 2021
  • Sentence or text similarity is a measure of the degree of similarity between two sentences. Techniques for measuring text similarity include Jaccard similarity, cosine similarity, Euclidean similarity, and Manhattan similarity. Cosine similarity is currently the most widely used, but because it depends only on the occurrence or frequency of words in a sentence, it does not adequately capture semantic relationships. We therefore try to improve sentence similarity analysis by using an ontology to define relations between words and by including semantic similarity when extracting the words that two sentences have in common.
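
The limitation described in this abstract can be seen in a minimal sketch of the two frequency-based measures it names, computed over whitespace-tokenized bags of words. The function names and sample sentences are hypothetical, and the paper's ontology-based extension is not reproduced here.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine of the angle between the word-count vectors of two sentences."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

def jaccard_similarity(a, b):
    """Size of the shared vocabulary divided by the size of the combined vocabulary."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0

s1 = "the car is fast"
s2 = "the automobile is fast"
cos = cosine_similarity(s1, s2)
jac = jaccard_similarity(s1, s2)
```

Although the two sentences mean the same thing, they score 0.75 (cosine) and 0.6 (Jaccard) because "car" and "automobile" are treated as unrelated tokens. Mapping both words to a shared concept before counting is one way an ontology could close that gap.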

A Case Study on Characteristics of Gender and Major in Career Preparation of University Students from Low-income Families: Application of Text Frequency Analysis and Association Rules (저소득층 대학생들의 진로준비과정에서의 성별·전공별 특성에 대한 사례연구: 텍스트 빈도분석과 연관분석의 적용)

  • Lee, Jihye; Lee, Shinhye
    • Journal of Digital Convergence, v.16 no.12, pp.61-69, 2018
  • This study aims to understand the career preparation experiences of low-income university students, and to draw implications from them, in the context of a high youth unemployment rate and the polarization of social classes. For this purpose, we selected 13 university students who received scholarships from the S scholarship foundation and analyzed six rounds of interviews with them using text mining techniques. According to the results, home environment and income level appear to influence university students when they recall previous academic experience or plan their careers during the interviews. These differences also showed different characteristics according to gender and major. This study is meaningful in that it analyzes qualitative research data by applying text mining techniques in a convergent way. As a result, the college life and career preparation of low-income university students were explored through the frequency of words and the relations between them.