• Title/Abstract/Keywords: Text frequency analysis

Search results: 453 items; processing time: 0.028 s

Text Analysis on the Research Trends of Nature Restoration in Korea

  • 이길상;정예림;송영근;이상혁;손승우
    • Journal of the Korean Society of Environmental Restoration Technology / Vol. 27, No. 2 / pp. 29-42 / 2024
  • As a global response to climate and biodiversity challenges, there is an emphasis on the conservation and restoration of ecosystems that can simultaneously reduce carbon emissions and enhance biodiversity. This study comprised a text analysis and keyword extraction of 1,100 research papers addressing nature restoration in Korea, aiming to provide a quantitative and systematic evaluation of domestic research trends in this field. To discern the major research topics of these papers, topic modeling was applied and correlations were established through network analysis. Research on nature restoration exhibited a mainly upward trend in 2002-2022 but with a slight recent decline. The most common keywords were "species," "forest," and "water." Research topics were broadly classified into (1) predictions of habitat size and species distribution, (2) the conservation and utilization of natural resources in urban areas, (3) ecosystem and landscape management in protected areas, (4) the planting and growth of vegetation, and (5) habitat formation methods. The number of studies on nature restoration is increasing across various domains in Korea, with each domain experiencing professional development.

Perceived Characteristics of Grains during the Choseon Dynasty - A Study Applying Text Frequency Analysis Using the Choseonwangjoshilrok Data -

  • 김미혜
    • Journal of the Korean Society of Food Culture / Vol. 38, No. 1 / pp. 26-37 / 2023
  • This study applied text frequency analysis to the crops recorded in the Choseonwangjoshilrok (the Annals of the Choseon Dynasty) and categorized the results by king. Contemporary perceptions of grains were observed by examining the staple crop types, which were analyzed using word clouds and semantic network analysis. In total, 101,842 instances of crop consumption were recorded in the Choseonwangjoshilrok. Of these, 51,337 (50.4%) were grains, 50,407 (49.5%) were beans, and 98 (0.1%) were seeds. Rice was the most frequently consumed grain (37.1%), followed by pii (11.9%), millet (11.3%), barley (4.5%), proso (0.8%), wheat (0.6%), buckwheat (0.1%), and adlay (0.05%). By century, grain records in the Choseon dynasty numbered 15,520 cases in the 15th century (30.2%), 11,201 cases in the 18th century (21.8%), 9,421 cases in the 17th century (18.4%), 9,113 cases in the 16th century (17.8%), and 6,082 cases in the 19th century (11.8%). Interest in grain among the 27 kings of Choseon was evaluated based on the frequency of records. The 15th-century King Sejong showed the greatest interest with 13,363 cases (13.1%), followed by King Jungjo (8,501 cases in the 18th century; 8.4%) and King Sungjong (7,776 cases in the 15th century; 7.6%).

Text Mining Analysis on the Research Field of the Coastal and Ocean Engineering Based on the SCOPUS Bibliographic Information

  • 이기섭;조홍연;한재림
    • Journal of Korean Society of Coastal and Ocean Engineers / Vol. 30, No. 1 / pp. 19-28 / 2018
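  • Advances in bibliometrics and computerization have led to the accumulation of a vast number of research papers. As a result, it has become practically difficult to review every paper published worldwide in a given field, and equally difficult to set and pursue a research direction. Advances in natural language processing, however, have made it easier to analyze trends in published research. Here, a text mining analysis of SCOPUS database bibliographic information for the coastal and ocean engineering field was performed using the R language. As expected, the term 'wave' was overwhelmingly dominant, and the terms 'numerical model', 'numerical simulation', and 'experimental study' confirmed that numerical analysis and hydraulic experiments still predominate. In addition, use of the term 'wave energy', which relates to ocean energy, has recently become prominent. The frequencies of, and links between, research-topic terms in coastal and ocean engineering quantitatively confirmed the dominance of 'wave -> height, energy', and the study suggests that higher-resolution analyses by subfield and by period are possible in the future.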

Can Similarities in Medical Thought be Quantified? - Focusing on Donguibogam, Uihagibmun and Gyeongagjeonseo -

  • 오준호
    • Journal of Korean Medical Classics / Vol. 31, No. 2 / pp. 71-82 / 2018
  • Objectives: The purpose of this study is to compare the similarities among Donguibogam (DO), Uihagibmun (UI), and Gyeongagjeonseo (GY) in order to examine whether the medical thoughts embedded in the texts can be compared in a quantitative way. Methods: Under an empirical assumption that medical thoughts can be reduced to the frequency of major key words within the text, we selected as key words fourteen words in four categories that are commonly used to describe physiology and pathology in Korean medicine. The frequency of these key words was then measured and compared across the three important medical texts in Korea. Results: As a result of quantitative analysis based on the $\chi^2$ statistic, the key words in the books were distributed most heterogeneously in DO and most homogeneously in UI. In comparison of the similarity analyzed by the same method, DO and UI were significantly more similar than those of DO and UI. The results of the word frequency pattern and the similarities of the book contents (CBDF) show that DO is influenced by UI, and the differences between standardized residuals and homogeneity tell us that the internal contexts of the two books are constructed differently. Conclusions: These results support the results of traditional research by experts. With the above, we were able to confirm that medical thoughts can be reduced to the frequency of major key words within the text and compared through the frequency of such key words.
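
As an illustration of the kind of $\chi^2$ comparison this abstract describes, the sketch below computes a Pearson chi-square statistic and per-cell Pearson residuals for a table of key word counts (rows are texts, columns are key words). The function name, the table layout, and the toy counts are assumptions made for illustration; the abstract does not give the authors' exact procedure or data, and the sketch assumes every row and column total is non-zero.

```python
def chi_square_homogeneity(table):
    """Return (chi2, residuals) for a contingency table of raw counts.

    Rows are texts, columns are key words (an assumed layout).
    Residuals are Pearson residuals: (observed - expected) / sqrt(expected).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    chi2 = 0.0
    residuals = []
    for i, row in enumerate(table):
        res_row = []
        for j, observed in enumerate(row):
            # Expected count if key words were distributed identically in all texts.
            expected = row_totals[i] * col_totals[j] / grand_total
            res = (observed - expected) / expected ** 0.5
            res_row.append(res)
            chi2 += res ** 2
        residuals.append(res_row)
    return chi2, residuals


# Toy counts only, not data from the study.
chi2, residuals = chi_square_homogeneity([[10, 20], [20, 10]])
print(round(chi2, 3))
```

A larger statistic indicates that key word frequencies differ more between the texts; a table whose rows are proportional to each other yields zero.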

Exploring Information Ethics Issues based on Text Mining using Big Data from Web of Science

  • 김한성
    • The Journal of Korean Association of Computer Education / Vol. 22, No. 3 / pp. 67-78 / 2019
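  • The purpose of this study is to explore information ethics issues using the academic big data provided by Web of Science (WoS) and to offer implications for future information ethics education within the informatics subject. To this end, 318 papers on information ethics among the academic papers provided by WoS were text mined. Specifically, R was used to analyze the frequency of major keywords (TF, DF, TF-IDF), to analyze information ethics issues based on topic modeling, and to analyze the yearly frequency of appearance of each issue, in order to explore trends in information ethics research. The main results are as follows. First, TF-IDF identified words such as 'digital', 'student', 'software', and 'privacy' as major keywords. Second, topic modeling yielded a total of eight issues, including 'Professional value', 'Cyber-bullying', and 'AI and Social Impact', of which 'Professional value' and 'Cyber-bullying' accounted for relatively high proportions. Based on these results, the study discusses implications for information ethics education in Korea.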
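
The three frequency measures named in this abstract (TF, DF, TF-IDF) can be sketched in a few lines. The study itself used R; the Python version below is only an illustration. It assumes whitespace tokenization and the common weighting tf × ln(N / df), since the abstract does not state which TF-IDF variant was used, and the function name and sample documents are hypothetical.

```python
import math
from collections import Counter


def tf_df_tfidf(docs):
    """Return per-document TF, corpus-wide DF, and per-document TF-IDF.

    Assumes whitespace tokenization and tf * ln(N / df) weighting.
    """
    tokenized = [doc.lower().split() for doc in docs]
    # TF: raw count of each term within each document.
    tf = [Counter(tokens) for tokens in tokenized]
    # DF: number of documents in which each term appears at least once.
    df = Counter()
    for counts in tf:
        df.update(counts.keys())
    n_docs = len(docs)
    tfidf = [
        {term: count * math.log(n_docs / df[term]) for term, count in counts.items()}
        for counts in tf
    ]
    return tf, df, tfidf


# Made-up documents for illustration only.
tf, df, tfidf = tf_df_tfidf(["privacy software privacy", "student software"])
print(tf[0]["privacy"], df["software"], tfidf[0]["software"])
```

Under this weighting a term that occurs in every document scores zero, which is why TF-IDF tends to surface distinctive keywords rather than merely frequent ones.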

Feature-selection algorithm based on genetic algorithms using unstructured data for attack mail identification

  • 홍성삼;김동욱;한명묵
    • Journal of Internet Computing and Services / Vol. 20, No. 1 / pp. 1-10 / 2019
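  • In big data, text mining extracts many features from a large amount of data, so the computational complexity of clustering and classification is high and the reliability of the analysis results can decrease. In particular, the term-document matrix obtained through text mining represents the features relating terms and documents, but it takes the form of a sparse matrix. In this paper, we designed a feature extraction method for a detection model that uses an improved GA (genetic algorithm) in text mining. TF-IDF is used in feature extraction to reflect the relationship between documents and terms. A predetermined number of features is selected through an iterative process. We also used a sparsity score to improve the performance of the detection model. When the sparsity of the spam mail set is high, the performance of the detection model decreases, making it difficult to find an optimized detection model. We used s(F) in the fitness function to find a detection model with low sparsity and a high TF-IDF score. We also verified the performance of the proposed algorithm by applying it to a text classification experiment. As a result, the proposed algorithm showed good performance (speed and accuracy) in attack mail classification.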
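
The abstract does not define s(F), so the sketch below is not the paper's score. It only illustrates the general idea of measuring the sparsity of a term-document matrix restricted to a chosen feature subset, using the plain fraction of zero cells. The function name, the matrix layout (rows are documents, columns are terms), and the toy values are assumptions.

```python
def sparsity(matrix, selected_columns):
    """Fraction of zero cells in the selected columns of a term-document matrix.

    This is a generic definition used for illustration; it is not the s(F)
    of the paper, which the abstract does not specify.
    """
    cells = [row[j] for row in matrix for j in selected_columns]
    if not cells:
        return 0.0
    return sum(1 for value in cells if value == 0) / len(cells)


# Toy matrix: 2 documents x 3 terms.
matrix = [[1, 0, 0],
          [0, 0, 2]]
print(sparsity(matrix, [0, 1, 2]))  # all features
print(sparsity(matrix, [0, 2]))     # a candidate feature subset
```

A fitness function in the spirit of the abstract would presumably favor subsets with a lower value of such a score together with higher TF-IDF weights, though the exact combination is not given.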

Investigations on Techniques and Applications of Text Analytics

  • 김남규;이동훈;최호창
    • The Journal of Korean Institute of Communications and Information Sciences / Vol. 42, No. 2 / pp. 471-492 / 2017
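  • Demand for and interest in big data analysis, in which the sheer volume of data becomes part of the problem to be solved, have surged recently. Big data is used as a concept that covers not only conventional structured data but also various forms of unstructured data such as images, videos, and logs. Among these data types, research on the analysis of text, a representative means of expressing and conveying information, is particularly active. Text analysis is generally performed in the order of document collection, parsing and filtering, structuring, frequency analysis, and similarity analysis, and its results appear in forms such as word clouds, word networks, topic modeling, document classification, and sentiment analysis. In particular, as demand grows for identifying the major topics in the text data rapidly accumulating through various social media, research on and applications of topic modeling, which extracts major topics from large volumes of unstructured text documents and groups the documents belonging to each topic, are emerging in many fields. Accordingly, this paper reviews the major technologies and research trends in text analysis and introduces research cases that used topic modeling to solve problems in various fields.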
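
The processing order listed in this abstract (parsing and filtering, structuring, frequency analysis, similarity analysis) can be sketched end to end on toy input. Everything below is an illustrative assumption rather than a method taken from the paper: the regular-expression tokenizer, the tiny stopword list, the bag-of-words structuring, and cosine similarity as the similarity measure.

```python
import math
import re
from collections import Counter

# Illustrative stopword list; real analyses use much larger ones.
STOPWORDS = {"the", "a", "of", "and", "in", "to", "is"}


def parse_and_filter(text):
    """Parsing: split into lowercase alphabetic tokens. Filtering: drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def structure(texts):
    """Structuring: represent each document as a bag-of-words count vector."""
    return [Counter(parse_and_filter(t)) for t in texts]


def cosine_similarity(a, b):
    """Similarity analysis: cosine of the angle between two count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up documents for illustration only.
docs = ["Topic modeling of social media text",
        "Sentiment analysis of social media posts"]
vectors = structure(docs)
# Frequency analysis: the most common terms across the whole collection.
print(sum(vectors, Counter()).most_common(2))
print(cosine_similarity(vectors[0], vectors[1]))
```

The resulting count vectors are the usual input to the outputs the abstract mentions, such as word clouds, word networks, and topic models.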

Analysis of key words published with the Korea Society of Emergency Medical Services journal using text mining

  • 권찬양;양현모
    • The Korean Journal of Emergency Medical Services / Vol. 24, No. 1 / pp. 85-92 / 2020
  • Purpose: The purpose of this study was to analyze the English abstract key words found within the Korea Society of Emergency Medical Services journal using text mining techniques to determine the adherence of these terms to Medical Subject Headings (MeSH) and identify key word trends. Methods: We analyzed 212 papers that were published from 2012 to 2019. Web scraping and frequency analysis of key words were conducted in R using its basic and text mining packages. Additionally, the Word Clouds package was used for visualization. Results: The average number of key words used per study was 3.9. Word cloud visualization revealed that CPR was most prominent in the first half of the period and emergency medical technician was most frequently used during the second half. A total of 542 (64.9%) key words exactly matched MeSH-listed words, and a total of 293 (35%) key words did not. Conclusion: Researchers should obey submission rules. Further, journals should update their respective submission rules. MeSH key words that are frequently cited should be suggested for use.

Study on CEO New Year's Address: Using Text Mining Method

  • 김유경;조대곤
    • Journal of Information Technology Services / Vol. 22, No. 2 / pp. 93-127 / 2023
  • This study analyzed the CEO New Year's addresses of major Korean companies, extracting key topics for employees via text mining techniques. An intended contribution of this study is to assist reporters, analysts, and researchers in gaining a better understanding of the New Year's addresses by elucidating the implicit and implied features of the messages within. To this end, this study collected and analyzed 545 New Year's addresses published between 2012 and 2021 by the top 66 Korean companies in terms of market capitalization. Research methodologies applied include text clustering, word embedding of keywords, frequency analysis, and topic modeling. Our main findings suggest that the messages in the New Year's addresses fell into nine topics: organizational culture, global advancement, substantial management, business reorganization, capacity building, market leadership, management innovation, sustainable management, and technology development. Next, this study further analyzed the managerial significance of each topic and discussed their characteristics from the perspectives of time, industry, and corporate groups. Companies were typically found to emphasize sound management, market leadership, and business reorganization during economic downturns while stressing capacity building and organizational culture during market transition periods. Also, companies belonging to corporate groups tended to emphasize founding philosophy and corporate culture.

Analysis of Symptoms-Herbs Relationships in Shanghanlun Using Text Mining Approach

  • 장동엽;하윤수;이충열;김창업
    • Journal of Physiology & Pathology in Korean Medicine / Vol. 34, No. 4 / pp. 159-169 / 2020
  • Shanghanlun (Treatise on Cold Damage Diseases) is the oldest document in the literature on clinical records of Traditional Asian medicine (TAM), on which TAM theories about symptoms-herbs relationships are based. In this study, we aim to quantitatively explore the relationships between symptoms and herbs in Shanghanlun. The text in Shanghanlun was converted into structured data. Using the structured data, Term Frequency - Inverse Document Frequency (TF-IDF) scores of symptoms and herbs were calculated for each chapter to derive the major symptoms and herbs in each chapter. To understand the structure of the entire document, principal component analysis (PCA) was performed for the 6-dimensional chapter space. Bipartite network analysis was conducted focusing on Jaccard scores between symptoms and herbs and eigenvector centralities of nodes. TF-IDF scores showed the characteristics of each chapter through major symptoms and herbs. Principal components drawn by PCA suggested the entire structure of Shanghanlun. The network analysis revealed a 'multi herbs - multi symptoms' relationship. Common symptoms and herbs were drawn from high eigenvector centralities of their nodes, while specific symptoms and herbs were drawn from low centralities. The symptoms expected to be treated by each herb were also derived. Using measurable metrics, we conducted a computational study on patterns of Shanghanlun. Quantitative research on TAM theories will contribute to improving the clarity of TAM theories.
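
One plausible reading of the Jaccard scores mentioned in this abstract is the overlap between the set of records in which a symptom appears and the set in which a herb appears. The sketch below follows that reading as an assumption; the abstract does not describe how the structured data were laid out, and the record format, function names, and placeholder symptom and herb names are hypothetical rather than drawn from Shanghanlun.

```python
def jaccard(set_a, set_b):
    """Jaccard score: size of the intersection divided by size of the union."""
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)


def symptom_herb_scores(records):
    """Score every symptom-herb pair by the overlap of the records they occur in.

    `records` is an assumed format: a list of (symptoms, herbs) pairs,
    one pair per structured record, each element a set of names.
    """
    symptom_ids = {}
    herb_ids = {}
    for idx, (symptoms, herbs) in enumerate(records):
        for symptom in symptoms:
            symptom_ids.setdefault(symptom, set()).add(idx)
        for herb in herbs:
            herb_ids.setdefault(herb, set()).add(idx)
    return {
        (symptom, herb): jaccard(s_ids, h_ids)
        for symptom, s_ids in symptom_ids.items()
        for herb, h_ids in herb_ids.items()
    }


# Placeholder records for illustration only.
records = [
    ({"fever"}, {"herb_a"}),
    ({"fever", "cough"}, {"herb_a", "herb_b"}),
    ({"cough"}, {"herb_b"}),
]
scores = symptom_herb_scores(records)
print(scores[("fever", "herb_a")], scores[("fever", "herb_b")])
```

Scores of this kind can serve as edge weights in a bipartite symptom-herb network, on which node centralities are then computed.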