• Title/Abstract/Keywords: Text Analysis

Detecting Spam Data for Securing the Reliability of Text Analysis

  • 현윤진;김남규
    • 한국통신학회논문지 / Vol. 42, No. 2 / pp. 493-504 / 2017
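  • Recently, vast amounts of unstructured text data have been pouring out through news, blogs, social media, and other channels. Because such data reflects rich information and opinions in near real time, it is highly useful, and demand for analyzing it is growing in industry as well as in academia. However, as text data becomes more useful, attempts to distort it to achieve particular goals are also increasing. The growth of such spam text not only makes it harder to find the needed information within the vast volume available, but also undermines trust in the information itself and in the media that provide it. Efforts to identify and remove spam from the source data are therefore essential for improving the reliability of information and the quality of analysis results. To this end, spam identification has been studied very actively in areas such as opinion spam detection, spam email detection, and web spam detection. This study reviews existing research trends in spam identification in detail and, as one way to improve the reliability of blog information, proposes a method for identifying spam tags in blogs.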

WCTT: Web Crawling System based on HTML Document Formalization

  • 김진환;김은경
    • 한국정보통신학회논문지 / Vol. 26, No. 4 / pp. 495-502 / 2022
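  • Web crawlers widely used today to collect body text from the web are hard to maintain and extend, because researchers must analyze the tags and styles of HTML documents by hand and then implement different collection logic for each collection channel. To solve this problem, a web crawler should be able to formalize HTML documents of differing structure into a common structure and collect the body text from it. This paper therefore designs and implements WCTT (Web Crawling system based on Tag path and Text appearance frequency), a web crawling system that formalizes HTML documents based on tag paths and text appearance frequency so that body text can be collected with a single collection logic. Because WCTT collects body text with the same logic across all collection channels, it is easy to maintain and to extend to new channels. It also provides preprocessing functions that remove stopwords and extract only nouns, for uses such as keyword network analysis.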
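
The abstract does not give WCTT's actual algorithm, so the following is only a minimal sketch of the general idea of selecting body text by tag path and text frequency. It assumes that the tag path carrying the most visible text is the body; the class `TagPathTextCounter` and the function `extract_body` are hypothetical names introduced for illustration, and only Python's standard `html.parser` module is used.

```python
from collections import defaultdict
from html.parser import HTMLParser


class TagPathTextCounter(HTMLParser):
    """Collects visible text per tag path, e.g. 'html/body/div/p'."""

    SKIP = {"script", "style"}                           # not visible text
    VOID = {"br", "img", "hr", "meta", "link", "input"}  # tags with no end tag

    def __init__(self):
        super().__init__()
        self.stack = []                  # currently open tags
        self.texts = defaultdict(list)   # tag path -> text fragments

    def handle_starttag(self, tag, attrs):
        if tag not in self.VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and not (self.SKIP & set(self.stack)):
            self.texts["/".join(self.stack)].append(text)


def extract_body(html):
    """Return (tag_path, text) for the tag path holding the most text.

    Assumption for this sketch: the path with the largest total text
    length is the article body. A real system would need more signals.
    """
    parser = TagPathTextCounter()
    parser.feed(html)
    parser.close()
    if not parser.texts:
        return "", ""
    best = max(parser.texts, key=lambda p: sum(len(t) for t in parser.texts[p]))
    return best, " ".join(parser.texts[best])
```

Because the selection depends only on path statistics and not on site-specific class names or ids, the same function can in principle be applied to pages from different channels, which is the maintainability argument the abstract makes.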

Developing a Test-Bed Toolkit for Scientific Document Analysis

  • 최성필;송사광;정한민
    • 한국콘텐츠학회논문지 / Vol. 12, No. 8 / pp. 13-19 / 2012
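  • This paper introduces a test-bed tool for effectively monitoring and optimizing the performance of the text analysis engines needed to extract technical knowledge from diverse scientific and technical documents such as papers, patents, and research reports. The tool consists of a test bed for a technical entity recognition engine, which automatically recognizes technical terms in science and technology as well as names of people, places, and organizations, and a test bed for relation extraction between technical entities, which automatically extracts semantic relations among the recognized entities. With it, users and developers can efficiently carry out error analysis as well as monitor the execution of document analysis engines.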

Research of Patent Technology Trends in Textile Materials: Text Mining Methodology Using DETM & STM

  • 이현상;조보근;오세환;하성호
    • 한국정보시스템학회지:정보시스템연구 / Vol. 30, No. 3 / pp. 201-216 / 2021
  • Purpose: This study analyzes patent technology trends in textile materials using a text mining methodology based on the Dynamic Embedded Topic Model (DETM) and the Structural Topic Model (STM). By identifying technology trends, it is expected to have a positive impact on revitalizing and developing the textile materials industry. Design/methodology/approach: The data used in this study are 866 domestic patent texts in textile materials from 1974 to 2020. To analyze technology trends from various aspects, DETM and STM were used. The word embedding technique used in DETM is GloVe. For stable learning of the topic model, amortized variational inference was performed based on a recurrent neural network. Findings: 'Manufacture' topics had the largest share among the six topics. Keyword trend analysis found that natural and nanotechnology have recently been attracting attention. The metadata analysis showed that manufacture technologies could have a high probability of patent registration over the entire time series, whereas the results for recent years showed an increasing trend in elasticity and safety technology.

The Impact of Transforming Unstructured Data into Structured Data on a Churn Prediction Model for Loan Customers

  • Jung, Hoon;Lee, Bong Gyou
    • KSII Transactions on Internet and Information Systems (TIIS) / Vol. 14, No. 12 / pp. 4706-4724 / 2020
  • In this study, the voice of customer (VOC), which is text data containing contact history and counseling details, was analyzed together with various structured data such as company size, loan balance, and savings accounts. To analyze the unstructured data, term frequency-inverse document frequency (TF-IDF) analysis, semantic network analysis, sentiment analysis, and a convolutional neural network (CNN) were implemented. A performance comparison revealed that the predictive model using the CNN had the best predictive power, followed by the model using TF-IDF and then the model using semantic network analysis. In particular, a character-level CNN and a word-level CNN were developed separately, and the character-level CNN performed better in an analysis of Korean-language text. Moreover, a systematic selection model for optimal text mining techniques was proposed, suggesting which analytical technique is appropriate for analyzing text data depending on the context. This study also provides evidence that the finding of previous studies that individual customers leave when their loyalty and switching costs are low also applies to corporate customers, and it suggests that VOC data indicating customers' needs are very effective for predicting their behavior.
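
As a rough illustration of the TF-IDF step named in the abstract, here is a minimal sketch. The paper does not state which TF-IDF variant or tokenizer was used, so the plain form tf × log(N / df) and pre-tokenized input are assumptions, and `tf_idf` is a hypothetical helper name.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Compute TF-IDF weights for a list of tokenized documents.

    docs: list of token lists. Returns one {term: weight} dict per document.
    Weight = (term count / document length) * log(N / document frequency).
    """
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each term once per document

    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n / df[term])
            for term, count in counts.items()
        })
    return weights
```

With this weighting, a term that occurs in every document (for example "loan" in a loan-counseling corpus) gets weight 0, while terms concentrated in a few documents are weighted up. The resulting per-document vectors are the kind of structured features that could feed a churn model.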

Perspectives on Fashion Technology during the Pandemic Era - A Mixed Methods Approach Using Text Mining and Content Analysis -

  • 김미경;임은혁
    • 한국의류산업학회지 / Vol. 24, No. 5 / pp. 545-556 / 2022
  • To overcome the pandemic, a new strategy for innovation is in demand throughout the value chains of the fashion industry, which emphasizes the importance of fashion technology. Accordingly, as various viewpoints and fields of debate are unfolding to consider the direction of change led by fashion technology, it is necessary to make an active value judgment precedent by understanding the differences between various opinions. This study aims to derive keywords related to fashion technology used during the pandemic and to infer and understand the characteristics of each type of perspective. To do so, it combines text mining analysis and content analysis. Text mining analysis is used to find statistical patterns by collecting keywords from big data from online media, and content analysis is used to interpret the data qualitatively. The analysis yields the following observations. First, the perspective of positive acceptance seeks to maximize the perception and sensory action of fashion through technology; this amplifies experience, an opportunity for innovation and efficiency. Second, the perspective of critical vigilance highlights the side effects of radical changes in fashion technology, characterized by concerns about capital-centered polarization, threats to human rights, and infringement of creative thinking. Lastly, the perspective of gradual adoption favors the gradual convergence of technologies, characterized by the pursuit of an appropriate balance.

Music Lyrics Summarization Method using TextRank Algorithm

  • 손지영;신용태
    • 한국멀티미디어학회논문지 / Vol. 21, No. 1 / pp. 45-50 / 2018
  • This paper describes how to summarize music lyrics using the TextRank algorithm. The method reduces a song's lyrics to its most important lines. As a result, music can be recommended more effectively than by approaches that recommend music based on word counts.
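
The abstract gives no implementation details, so the following is only a sketch of how TextRank-style extractive summarization could be applied to lyric lines. It assumes whitespace tokenization, treats each distinct lyric line as a graph node, and uses the log-normalized word-overlap similarity of the original TextRank formulation; `similarity`, `textrank`, and `summarize` are hypothetical names.

```python
import math


def similarity(a, b):
    """Word overlap between two token lists, normalized by log lengths."""
    if not a or not b:
        return 0.0
    denom = math.log(len(a)) + math.log(len(b))
    if denom <= 0:  # both lines have a single word
        return 0.0
    return len(set(a) & set(b)) / denom


def textrank(lines, damping=0.85, iterations=50):
    """Score each line with weighted PageRank over the similarity graph."""
    tokens = [line.lower().split() for line in lines]
    n = len(tokens)
    w = [[similarity(tokens[i], tokens[j]) if i != j else 0.0
          for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in w]
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) + damping * sum(
                w[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0
            )
            for i in range(n)
        ]
    return scores


def summarize(lines, k=2):
    """Return the k highest-scoring distinct lines in their original order."""
    unique = list(dict.fromkeys(lines))  # drop repeated chorus lines
    scores = textrank(unique)
    top = sorted(range(len(unique)), key=lambda i: scores[i], reverse=True)[:k]
    return [unique[i] for i in sorted(top)]
```

Deduplicating repeated lines before ranking is a design choice made for this sketch: a chorus repeated many times would otherwise reinforce itself in the graph. Whether the paper handles repetition this way is not stated.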

Study of Analyzing Outcome of Building and Introducing System for Preserving Full-Text of e-Journal

  • Kim, Kwang-Young;Kim, Soon-Young;Kim, Hwan-Min
    • International Journal of Knowledge Content Development & Technology
    • /
    • Vol. 2, No. 2
    • /
    • pp.5-16
    • /
    • 2012
  • Today, most researchers conduct their studies through the full-text of e-journals. Therefore, an important base for domestic development of science and technology is to obtain the full-text of quality e-journals by overseas researchers and to provide it to Korea's researchers. This study aims to build a system based on the National Archiving Center for the full-text of e-journals and to make a service system for providing them to the public by acquiring the full-text of quality overseas e-journals. To do this, an analysis was made of the outcome of introducing such a system for full-text of e-journals in comparison with the investment. As a result, 112 more institutions, that is, from 47 institutions to 159 institutions, have introduced the system as of 2012, and the number of downloaded full-texts increased at least 2.17 times.

Performance Analysis of Text Entry with Preferred One Hand using Smart Phone Touch-keyboard

  • 류태범
    • 대한인간공학회지 / Vol. 30, No. 1 / pp. 259-264 / 2011
  • Does the preferred hand perform better than the non-preferred hand in one-handed smart phone text entry? Among right-handed users, do those who prefer the left hand for smart phone text entry perform worse than those who prefer the right hand? This study addressed these two questions. Thirty young male undergraduate students typed a text twice on a smart phone with a touch-based QWERTY keyboard, once with the right hand and once with the left hand. Completion time and errors were measured in the text entry tasks. All participants were right-handed, but half of them preferred the right hand when they had to use one hand for smart phone text entry and the other half preferred the left hand. The percentage of participants whose preferred hand performed better than the non-preferred hand in one-handed text entry was less than 90% for those who preferred the right hand and less than 70% for those who preferred the left hand. The performance of students who preferred the left hand was not worse than that of students who preferred the right hand in one-handed smart phone text entry.

Arabic Text Clustering Methods and Suggested Solutions for Theme-Based Quran Clustering: Analysis of Literature

  • Bsoul, Qusay;Abdul Salam, Rosalina;Atwan, Jaffar;Jawarneh, Malik
    • Journal of Information Science Theory and Practice / Vol. 9, No. 4 / pp. 15-34 / 2021
  • Text clustering is one of the most commonly used methods for detecting themes or types of documents. It is used in many fields, but it is still not effective enough for understanding Arabic text, especially with respect to term extraction, unsupervised feature selection, and clustering algorithms. In most cases, term extraction focuses on nouns. Clustering simplifies the understanding of an Arabic text such as the Quran, which is important not only for Muslims but for all people who want to know more about Islam. This paper discusses the complexity and limitations of theme-based clustering of Arabic text in the Quran. Unsupervised feature selection does not consider the relationships between the selected features. One weakness of clustering algorithms is that the selection of the optimal initial centroids still depends on chance and manual settings. Consequently, this paper reviews the literature on the three major stages of Arabic clustering: term extraction, unsupervised feature selection, and clustering. Six experiments were conducted to demonstrate previously undiscussed problems related to the metrics used for feature selection and clustering. Suggestions to improve theme-based clustering of the Quran are presented and discussed.
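
The point about initial centroids can be illustrated with a toy example. The sketch below is not taken from the paper: it assumes plain k-means on one-dimensional numbers rather than text features, and `kmeans_1d` is a hypothetical helper. It only shows that the same data can converge to very different clusterings depending on the starting centroids.

```python
def kmeans_1d(points, centroids, max_iter=100):
    """Run k-means on 1-D points from the given initial centroids.

    Returns (clusters, sse), where sse is the sum of squared distances
    from each point to its cluster centroid.
    """
    centroids = list(centroids)
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        updated = [sum(c) / len(c) if c else centroids[i]
                   for i, c in enumerate(clusters)]
        if updated == centroids:
            break  # converged
        centroids = updated
    sse = sum((p - centroids[i]) ** 2
              for i, c in enumerate(clusters) for p in c)
    return clusters, sse


points = [1, 2, 3, 11, 12, 13, 21, 22, 23]
good_clusters, good_sse = kmeans_1d(points, [2, 12, 22])  # one seed per group
bad_clusters, bad_sse = kmeans_1d(points, [1, 2, 3])      # all seeds in one group
```

With the well-spread seeds the three natural groups are recovered, while the poorly placed seeds converge to a much worse local optimum that merges the two upper groups. This is why initialization by chance or manual setting is a weakness.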