• Title/Abstract/Keywords: Term Frequency and Inverse Document Frequency

Search results: 88 items (processing time: 0.022 seconds)

A Study on the Archival Information Services of Economic Policy Using Text Mining Methods: Focusing on Economic Policy Directions

  • 연지현;김성원
    • 한국기록관리학회지 / Vol. 22, No. 2 / pp.117-133 / 2022
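  • When archival content is organized arbitrarily, users consume records without understanding the relevant period and context, and therefore have difficulty accessing major economic policy records efficiently. This study seeks ways to improve the current archival service. Taking the Economic Policy Directions issued over the 30 years from 1991 to 2021 as its subject, the study applied text mining techniques to economic policy records to derive the economic keywords emphasized by each administration and how they changed over time. The policy background, main content, and body text were collected and preprocessed, and then term frequency analysis, TF-IDF, network analysis, and time series analysis were performed. The analysis showed that 'jobs', 'competitiveness', and 'restructuring' had the highest frequencies, in that order. The main keywords of each administration could be seen at a glance, and the yearly relative proportions of 'jobs', 'real estate', and 'companies' were analyzed as time series. Based on these results, the study suggests implications for developing economic policy archival services and broadening their reach.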
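The abstract above lists both term frequency analysis and TF-IDF. As a rough illustration of how the two differ, here is a minimal sketch that assumes the common textbook weighting (relative term frequency multiplied by log(N/df)); the study does not state which variant it used, and the toy documents and the `tf_idf` helper are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: score} dict per tokenized document.

    tf  = count of the term in the document / document length
    idf = log(N / df), where df is the number of documents containing the term
    (an assumed textbook variant, not necessarily the one used in the study)
    """
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each term once per document
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


# Hypothetical, already tokenized toy documents (not data from the study).
docs = [
    ["jobs", "growth", "jobs", "policy"],
    ["real_estate", "policy", "tax"],
    ["jobs", "policy", "restructuring"],
]

raw_frequency = Counter(term for doc in docs for term in doc)
scores = tf_idf(docs)
```

In this toy corpus "policy" has a high raw frequency but appears in every document, so its TF-IDF score is zero, while a term confined to one document keeps a positive weight.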

Influencer Attribute Analysis based Recommendation System

  • 박정련;박지원;김민우;오하영
    • 한국정보통신학회논문지 / Vol. 23, No. 11 / pp.1321-1329 / 2019
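  • With the development of social networks, marketing methods are changing in various ways. Unlike previously successful marketing approaches based on celebrities and financial sponsorship, influencer-based YouTube marketing has recently become a major trend. This paper is the first to combine quantitative YouTube information with multi-angle qualitative analysis based on comment analysis, extracting influencer characteristics from more than 54 YouTube channels and modeling their representative topics. The aim is to maximize satisfaction with personalized videos and to provide an aid for companies marketing a new item, so that they can refer to existing influencer characteristics when producing and distributing videos for that item and achieve successful promotion. The comments on the various videos of each YouTube channel are treated as documents, and applying the TF-IDF and LDA algorithms to them showed improved performance.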

Automatic Classification of Documents Using Word Correlation

  • 신진섭;이창훈
    • 한국정보처리학회논문지 / Vol. 6, No. 9 / pp.2422-2430 / 1999
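  • This paper proposes the following method for automatically classifying documents into as many classes as the user has areas of interest, using the correlation between words. First, the TF*IDF algorithm is used to find the words that can represent each document, and the probabilistic model for correlation calculation proposed in the paper is used to compute how strongly the representative words of each document are correlated with one another across the whole document collection. Second, the words closely connected to the two most strongly correlated words are grouped into one set, and that set is used to generate one class and its profile. The same process is repeated around the pair of words with the next-highest correlation until a value below the threshold is obtained, which yields as many profiles as the user has areas of interest. The paper also evaluates how much influence each generated profile has on each document in order to classify the documents, and demonstrates the validity of the proposed method by comparing it with existing automatic document classification methods.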

Clustering Meta Information of K-Pop Girl Groups Using Term Frequency-inverse Document Frequency Vectorization

  • 현준서;조재혁
    • Journal of Platform Technology / Vol. 11, No. 3 / pp.12-23 / 2023
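  • In the K-Pop market of the 2020s, girl groups drew more attention than boy groups, and the fourth generation more than the third. This paper presents a method and results for lyric clustering in order to examine whether a generational shift among girl groups has begun. Meta information was collected for 1,469 songs by 47 groups released from 2013 to 2022, divided into lyric information and non-lyric meta information, and each part was converted to numeric form. Following prior work, the lyric information was vectorized with term frequency-inverse document frequency, and only the top vector values were retained as a preprocessing step. The non-lyric meta information was preprocessed with One-Hot Encoding and added in order to reduce the bias of using lyric information alone and to produce better clustering results. On the preprocessed data, the Silhouette Coefficient and Calinski-Harabasz Score of Spherical K-Means were 129% and 45% higher, respectively, than those of Hierarchical Clustering. The study is expected to contribute to the history of Korean popular music and to research on analyzing and clustering girl group lyrics.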
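The pipeline described above (TF-IDF lyric features, one-hot encoded meta information, Spherical K-Means) can be sketched roughly as follows. This is a simplified illustration, not the authors' implementation: the feature values, the `genres` categories, and the naive "first k points" initialization are all assumptions, and the evaluation metrics named in the abstract are not computed here.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def one_hot(value, categories):
    """Encode a categorical value as a 0/1 vector over the given categories."""
    return [1.0 if value == c else 0.0 for c in categories]


def cosine(a, b):
    """Dot product; equals cosine similarity when both inputs are unit vectors."""
    return sum(x * y for x, y in zip(a, b))


def spherical_kmeans(vectors, k, iterations=10):
    """Cluster by cosine similarity with unit-length points and centroids."""
    points = [l2_normalize(v) for v in vectors]
    centroids = [points[i] for i in range(k)]  # naive init: first k points (assumption)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to the centroid with the highest cosine similarity.
        labels = [max(range(k), key=lambda j: cosine(p, centroids[j])) for p in points]
        # Recompute each centroid as the re-normalized mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                mean = [sum(col) / len(members) for col in zip(*members)]
                centroids[j] = l2_normalize(mean)
    return labels


# Hypothetical rows: (TF-IDF lyric features, genre). Not data from the paper.
genres = ["dance", "ballad"]
songs = [
    ([0.9, 0.1, 0.0], "dance"),
    ([0.0, 0.2, 0.8], "ballad"),
    ([0.8, 0.0, 0.1], "dance"),
    ([0.1, 0.1, 0.9], "ballad"),
]
features = [tfidf + one_hot(genre, genres) for tfidf, genre in songs]
labels = spherical_kmeans(features, k=2)
```

Normalizing both points and centroids to unit length is what distinguishes the spherical variant from ordinary K-Means: similarity depends on the direction of the vectors rather than their magnitude, which tends to suit sparse TF-IDF features.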

Analysis of online parenting community posts on expanded newborn screening for metabolic disorders using topic modeling: a quantitative content analysis

  • 이명선;정현숙;김진선
    • 여성건강간호학회지 / Vol. 29, No. 1 / pp.20-31 / 2023
  • Purpose: As more newborns have received expanded newborn screening (NBS) for metabolic disorders, the overall number of false-positive results has increased. The purpose of this study was to explore the psychological impacts experienced by mothers related to the NBS process. Methods: An online parenting community in Korea was selected, and questions regarding NBS were collected using web crawling for the period from October 2018 to August 2021. In total, 634 posts were analyzed. The collected unstructured text data were preprocessed, and keyword analysis, topic modeling, and visualization were performed. Results: Of 1,057 words extracted from posts, the top keyword based on 'term frequency-inverse document frequency' values was "hypothyroidism," followed by "discharge," "close examination," "thyroid-stimulating hormone levels," and "jaundice." The top keyword based on the simple frequency of appearance was "XXX hospital," followed by "close examination," "discharge," "breastfeeding," "hypothyroidism," and "professor." As a result of LDA topic modeling, posts related to inborn errors of metabolism (IEMs) were classified into four main themes: "confirmatory tests of IEMs," "mother and newborn with thyroid function problems," "retests of IEMs," and "feeding related to IEMs." Mothers experienced substantial frustration, stress, and anxiety when they received positive NBS results. Conclusion: The online parenting community played an important role in acquiring and sharing information, as well as psychological support related to NBS in newborn mothers. Nurses can use this study's findings to develop timely and evidence-based information for parents whose children receive positive NBS results to reduce the negative psychological impact.

Analysis of Unstructured Data on Detecting of New Drug Indication of Atorvastatin

  • 정휘수;강길원;최웅;박종혁;신광수;서영성
    • Journal of health informatics and statistics / Vol. 43, No. 4 / pp.329-335 / 2018
  • Objectives: In recent years, there has been an increased need for a way to extract desired information from multiple medical publications at once. This study was conducted to confirm the usefulness of unstructured data analysis of previously published medical literature in searching for new indications. Methods: New indications were searched through text mining, network analysis, and topic modeling analysis using 5,057 articles on atorvastatin, a treatment for hyperlipidemia, published from 1990 to 2017. Results: A total of 273 keywords were extracted. In the frequency analysis from text mining and in the network analysis, the existing indications of atorvastatin ranked at the top. The novel indications identified by Term Frequency-Inverse Document Frequency (TF-IDF) were atrial fibrillation, heart failure, breast cancer, rheumatoid arthritis, combined hyperlipidemia, arrhythmias, multiple sclerosis, non-alcoholic fatty liver disease, contrast-induced acute kidney injury, and prostate cancer. Conclusions: Unstructured data analysis for discovering new indications from massive medical literature is expected to be used in the drug repositioning industry.

A study on the current status of DIY clothing products related to fabric using text mining

  • 이은혜;이하은;최정욱
    • 한국의상디자인학회지 / Vol. 25, No. 2 / pp.111-122 / 2023
  • This study aims to collect Big Data related to DIY clothing, analyze the results on a year-by-year basis, and understand consumers' perceptions and the status and reality of DIY clothing. The reference period for the evaluation of DIY clothing trends was set from 2012 to 2022. The data in this study were collected and analyzed using Textom, a Big Data solution program certified as Good Software by the Telecommunications Technology Association (TTA). For the analysis of fabric-related DIY products, the keyword was set to "DIY clothing", and the "Espresso K" module was employed for data cleansing after collection. Because data were collected on a year-by-year basis, a total of 11 lists were generated and the collected data were analyzed by period. The findings of the data collection on DIY clothing are as follows. The total number of keywords collected from the search engines "Naver" and "Google" between January 1, 2012 and December 31, 2022 was 16,315, and the data by period indicate a continuous upward trend. In addition, a keyword analysis was conducted using TF-IDF (Term Frequency-Inverse Document Frequency), a statistical measure that reflects the importance of a word within the data, together with N-gram analysis, which examines the relationships between words. Using these results, it was possible to evaluate the popularity and growth of DIY clothing products in conjunction with the evolving social environment, as well as consumers' desire to explore DIY trends. This study is therefore valuable in that it provides preliminary data for DIY clothing research by analyzing the status and reality of DIY products, and it contributes to the development and production of DIY clothing.
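The N-gram analysis mentioned above can be illustrated with a simple bigram count over tokenized text. This is a generic sketch, not a description of how Textom computes N-grams; the `ngrams` helper and the toy posts are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n=2):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical, already tokenized posts (not data from the study).
posts = [
    ["diy", "clothing", "fabric", "kit"],
    ["diy", "clothing", "pattern"],
    ["fabric", "kit", "sewing"],
]

# Count how often each pair of adjacent words occurs across all posts.
bigram_counts = Counter(bigram for post in posts for bigram in ngrams(post))
```

Pairs that recur across posts, such as ("diy", "clothing") in this toy data, indicate words that tend to be used together.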

Incorporating Time Constraints into a Recommender System for Museum Visitors

  • Kovavisaruch, La-or;Sanpechuda, Taweesak;Chinda, Krisada;Wongsatho, Thitipong;Wisadsud, Sodsai;Chaiwongyen, Anuwat
    • Journal of information and communication convergence engineering / Vol. 18, No. 2 / pp.123-131 / 2020
  • After observing that most tourists plan to complete their visits to multiple cultural heritage sites within one day, we surmised that, for many museum visitors, the foremost concerns are how much time to spend at each location and how to maximize their enjoyment of a site while still keeping to their travel itinerary. Recommendation systems in e-commerce are built on knowledge of users' previous purchasing history; recommendation systems for museums, on the other hand, do not have an equivalent data source available. Recent solutions have incorporated advanced technologies such as algorithms that rely on social filtering, which builds recommendations from the nearest identified similar user. Our paper proposes a different approach that provides dynamic recommendations using social filtering as well as content-based filtering based on term frequency-inverse document frequency. The main challenge is to overcome the cold start, whereby no information is available on new users entering the system, and thus there is no strong background information for generating a recommendation. In these cases, our solution deploys statistical methods to create a recommendation, which can then be used to gather data for future iterations. We are currently running a pilot test at Chao Samphraya national museum and have received positive feedback to date on the implementation.
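The content-based filtering step described above can be sketched as ranking items by the cosine similarity of their TF-IDF vectors. This is an illustrative sketch only: the exhibit names, their keyword descriptions, and the `recommend` helper are hypothetical, and the social filtering and time-constraint parts of the paper's system are not modeled.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Map each item name to a sparse {term: tf-idf weight} vector."""
    n = len(docs)
    df = Counter(term for doc in docs.values() for term in set(doc))
    return {
        name: {t: c / len(doc) * math.log(n / df[t]) for t, c in Counter(doc).items()}
        for name, doc in docs.items()
    }


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recommend(liked, vectors, top_n=2):
    """Rank the other items by similarity to the item the visitor liked."""
    profile = vectors[liked]
    ranked = sorted(
        ((cosine(profile, vec), name) for name, vec in vectors.items() if name != liked),
        reverse=True,
    )
    return [name for _, name in ranked[:top_n]]


# Hypothetical exhibit descriptions (not data from the paper).
exhibits = {
    "bronze_buddha": ["bronze", "buddha", "statue", "temple"],
    "stone_buddha": ["stone", "buddha", "statue"],
    "royal_sword": ["sword", "royal", "bronze", "weapon"],
    "ceramic_bowl": ["ceramic", "bowl", "kiln"],
}
vectors = tfidf_vectors(exhibits)
suggestions = recommend("bronze_buddha", vectors)
```

Because this ranking needs only item descriptions and a single liked item, it can produce suggestions without a history of other visitors, which is why content-based filtering is often paired with social filtering when user data is sparse.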

Properties of the Twenty-seven Pulses in DongUiBoGam Based on the Eight Important Pulses

  • 이태형;정원모;고병호;박히준;김남일;채윤병
    • Korean Journal of Acupuncture / Vol. 32, No. 4 / pp.151-159 / 2015
  • Objectives : Pulse diagnosis is considered particularly important among the several methods of diagnosis in DongUiBoGam. In spite of its importance, the large number and variety of pulse descriptions have made pulse diagnosis difficult to learn and practice. In this article, we analyzed the properties of the twenty-seven pulses using pulse diagnosis cases from DongUiBoGam to enable a practical understanding of pulse diagnosis. Methods : We constructed four axes according to the eight important pulses, and analyzed the properties of the twenty-seven pulses through the relationship between the four pairs of important pulses and the twenty-seven pulses. To quantify the relevance of the important pulses to the twenty-seven pulses, we used the term frequency-inverse document frequency (TF-IDF) method. Results : We could elicit the properties of the twenty-seven pulses along the four axes. We also used the analysis results to reexamine DongUiBoGam's categorization of the seven exterior pulses, the eight interior pulses, and the similar pulses. Conclusions : The eight important pulses allowed us to understand the properties of the twenty-seven pulses more specifically, and to see the relationships among the twenty-seven pulses on each axis. However, the number of pulse diagnosis cases in this research was insufficient, so further research is required with more sources, such as other traditional medical records or present-day clinical records.

A Structural Analysis of Acupuncture & Moxibustion Points in the NaeGyeong Chapter of DongUiBoGam Using Text Mining

  • 이태형;정원모;이인선;이혜정;김남일;채윤병
    • Korean Journal of Acupuncture / Vol. 30, No. 4 / pp.230-242 / 2013
  • Objectives : DongUiBoGam is a representative work of medical literature in Korea. This research aims to grasp structurally how DongUiBoGam understands the human body and to review the methods of acupuncture and moxibustion in its NaeGyeong chapter using text mining. Methods : The structure of DongUiBoGam was analyzed using the specific parts of the book that describe its contents, the major premises for understanding the human body, and the processes of treatment. We analyzed the characteristics of each acupoint in relation to the causes of diseases & symptoms in the NaeGyeong chapter using Term Frequency - Inverse Document Frequency (TFIDF). Results : Three different categories of pattern identification (PI) were formed after structural analysis of DongUiBoGam. All causes of diseases & symptoms were transformed according to the three categories of PI. After analyzing the relationship between acupoints and causes of diseases & symptoms, 114 acupoints were visualized with the TFIDF values of the three PI categories. Conclusions : The selection of acupoints in the NaeGyeong chapter of DongUiBoGam was linked to causes of diseases & symptoms based on the three PI categories. Through visualization of the bipartite relationships between acupoints and causes of diseases & symptoms, we could easily understand the characteristics of each acupoint.