• Title/Summary/Keyword: Text clustering


Question and Answering System through Search Result Summarization of Q&A Documents (Q&A 문서의 검색 결과 요약을 활용한 질의응답 시스템)

  • Yoo, Dong Hyun;Lee, Hyun Ah
    • KIPS Transactions on Software and Data Engineering / v.3 no.4 / pp.149-154 / 2014
  • When using a user-participation question answering community such as Knowledge-iN, users must pick out relevant answers themselves from many search results. Automatically providing refined answers would improve the usability of such communities. This paper divides questions in Q&A documents into four types (word, list, graph, and text) and proposes a summarization method for each question type based on document statistics. Summarized answers for the word, list, and text types are obtained by clustering questions and scoring words by the frequency, proximity, and confidence of answers. Answers for the graph type are presented by extracting user opinions from the answers.

Minimally Supervised Relation Identification from Wikipedia Articles

  • Oh, Heung-Seon;Jung, Yuchul
    • Journal of Information Science Theory and Practice / v.6 no.4 / pp.28-38 / 2018
  • Wikipedia is composed of millions of articles, each of which explains a particular real-world entity in various languages. Since the articles are contributed and edited by a large population of diverse experts with no specific authority, Wikipedia can be seen as a naturally occurring body of human knowledge. In this paper, we propose a method to automatically identify key entities and relations in Wikipedia articles, which can be used for automatic ontology construction. Compared to previous approaches to entity and relation extraction and/or identification from text, our goal is to capture naturally occurring entities and relations from Wikipedia while minimizing the artificiality often introduced when constructing training and testing data. The titles of the articles and the anchored phrases in their text are regarded as entities, and their types are automatically classified with minimal training. We attempt to automatically detect and identify possible relations among the entities based on clustering without training data, as opposed to the relation extraction approach, which focuses on improving accuracy in selecting one of several target relations for a given pair of entities. While the relation extraction approach with supervised learning requires a significant amount of annotation effort for a predefined set of relations, our approach attempts to discover relations as they occur naturally. Unlike other unsupervised relation identification work, where automatically identified relations are evaluated against correct relations determined a priori by human judges, we attempted to evaluate the appropriateness of the naturally occurring clusters of relations involving person-artifact and person-organization entities and their relation names.

A Quantitative Approach to a Similarity Analysis on the Culinary Manuscripts in the Chosun Periods (계량적 접근에 의한 조선시대 필사본 조리서의 유사성 분석)

  • Lee, Ki-Hwang;Lee, Jae-Yun;Paek, Doo-Hyun
    • Language and Information / v.14 no.2 / pp.131-157 / 2010
  • This article reports an attempt to perform a similarity analysis on a collection of 25 culinary manuscripts from the Chosun period using a set of quantitative text analysis methods. Historical culinary texts are valuable resources for linguistic, historical, and cultural studies. We define the similarity of two texts as the distributional similarity of their functional components. In the case of culinary texts, text elements such as food names, cooking methods, and ingredients are regarded as functional components. We derive the similarity information from the distributional characteristics of the two key functional components, cooking methods and ingredients. The results are also quantified and visualized to achieve a better understanding of the properties of the individual texts and of the collection as a whole.
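
The idea of comparing texts by the distribution of their functional components can be sketched as follows. This is only an illustration: the abstract does not name a similarity measure, so cosine similarity is an assumption here, and the manuscript labels and cooking-method counts are hypothetical toy data.

```python
import math
from collections import Counter


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_matrix(texts):
    """Pairwise similarity for every ordered pair of text labels."""
    names = sorted(texts)
    return {(x, y): cosine_similarity(texts[x], texts[y])
            for x in names for y in names}


# Hypothetical toy data: how often each cooking method occurs per manuscript.
manuscripts = {
    "A": Counter({"boil": 4, "steam": 2, "ferment": 1}),
    "B": Counter({"boil": 3, "steam": 2, "grill": 1}),
    "C": Counter({"ferment": 5, "pickle": 3}),
}

sims = similarity_matrix(manuscripts)
```

In this toy setting, manuscripts A and B share most of their cooking methods and therefore score as more similar to each other than either does to C.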


EDGE: An Enticing Deceptive-content GEnerator as Defensive Deception

  • Li, Huanruo;Guo, Yunfei;Huo, Shumin;Ding, Yuehang
    • KSII Transactions on Internet and Information Systems (TIIS) / v.15 no.5 / pp.1891-1908 / 2021
  • Cyber deception defense mitigates Advanced Persistent Threats (APTs) by deploying deceptive entities such as the Honeyfile. The Honeyfile distracts attackers from valuable digital documents and attracts unauthorized access by deliberately exposing fake content. Its effectiveness as a distraction and a trap depends on how enticing the fake content is. However, existing studies on the Honeyfile pay little attention to this aspect. In this work, we seek to improve the enticement of fake text content by enhancing its readability, indistinguishability, and believability. To this end, an enticing deceptive-content generator, EDGE, is presented. EDGE is constructed in three steps: extracting key concepts with a semantics-aware K-means clustering algorithm, searching for candidate deceptive concepts within the Word2Vec model, and generating deceptive text content under the Integrated Readability Index (IR). Furthermore, readability and believability performance analyses are undertaken. The experimental results show that EDGE generates indistinguishable deceptive text content without decreasing readability. Overall, EDGE proves effective at generating enticing deceptive text content as a deception defense against APTs.

Clustering Analysis of Films on Box Office Performance : Based on Web Crawling (영화 흥행과 관련된 영화별 특성에 대한 군집분석 : 웹 크롤링 활용)

  • Lee, Jai-Ill;Chun, Young-Ho;Ha, Chunghun
    • Journal of Korean Society of Industrial and Systems Engineering / v.39 no.3 / pp.90-99 / 2016
  • Forecasting box office performance after a film's release is very important for increasing profitability by reducing production and marketing costs. Analysis of psychological factors such as word-of-mouth and expert assessment is essential, but hard to perform because the data are difficult to collect. Information technologies such as web crawling and text mining can help overcome this difficulty. Effective text mining requires categorization of objects. From this perspective, the objective of this study is to provide a framework for classifying films according to their characteristics. Data including psychological factors are collected from Web sites using web crawling. A clustering analysis is conducted to classify films, and a series of one-way ANOVA analyses is conducted to statistically verify the differences in characteristics among groups. The result of the cluster analysis based on reviews and revenues shows that the films can be categorized into four distinct groups and that the differences in characteristics are statistically significant. The first group has high box office sales, and its number of clicks on reviews is higher than that of the other groups. The second group is similar to the first, but its reviews are longer and its box office sales are poor. The third group's audiences prefer documentaries and animations, and its numbers of comments and interests are significantly lower than those of the other groups. The last group's audiences prefer the crime, thriller, and suspense genres. Correspondence analysis is also conducted to match the groups with intrinsic characteristics of films such as genre, movie rating, and nation.

Analyzing data-related policy programs in Korea using text mining and network cluster analysis (텍스트 마이닝과 네트워크 군집 분석을 활용한 한국의 데이터 관련 정책사업 분석)

  • Sungjun Choi;Kiyoon Shin;Yoonhwan Oh
    • Journal of Korea Society of Industrial Information Systems / v.28 no.6 / pp.63-81 / 2023
  • This study endeavors to classify and categorize similar policy programs through network clustering analysis, using textual information from data-related policy programs in Korea. To achieve this, descriptions of data-related budgetary programs in South Korea in 2022 were collected, and keywords were extracted from the program contents. Subsequently, the similarity between programs was derived using TF-IDF, and a policy program network was constructed accordingly. Following this, the structural characteristics of the network were analyzed, and similar policy programs were clustered and categorized through network clustering. Upon analyzing a total of 97 programs, 7 major clusters were identified, signifying that programs with analogous themes or objectives were categorized based on application area or services utilizing data. The findings of this research illuminate the current status of data-related policy programs in Korea, providing policy implications for a strategic approach to planning future national data strategies and programs, and contributing to the establishment of evidence-based policies.
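
The pipeline described in this abstract (keywords, TF-IDF similarity, network construction, clustering) can be sketched roughly as follows. The abstract does not specify the similarity measure, the edge threshold, or the network clustering algorithm, so cosine similarity, the `threshold` parameter, and connected components as a simple stand-in for network clustering are all assumptions. The keyword lists are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of keyword lists. Returns one {term: weight} dict per doc."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (c / len(doc)) * math.log(n / df[t])
                        for t, c in tf.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse weight dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def cluster_by_threshold(docs, threshold=0.1):
    """Link docs whose similarity reaches the threshold, then return the
    connected components of the resulting network (an assumed stand-in
    for the network clustering step)."""
    vecs = tfidf_vectors(docs)
    n = len(docs)
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if cosine(vecs[i], vecs[j]) >= threshold:
                adj[i].add(j)
                adj[j].add(i)
    seen, clusters = set(), []
    for start in range(n):
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbor in adj[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        clusters.append(sorted(component))
    return clusters


# Hypothetical keyword lists for four policy programs.
programs = [
    ["health", "data", "platform"],
    ["health", "data", "hospital"],
    ["traffic", "data", "sensor"],
    ["traffic", "data", "road"],
]
clusters = cluster_by_threshold(programs)
```

Because "data" occurs in every program, its TF-IDF weight is zero and it creates no links; the programs group by their more specific shared keywords instead.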

Examining the Intellectual Structure of Records Management & Archival Science in Korea with Text Mining (텍스트 마이닝을 이용한 국내 기록관리학 분야 지적구조 분석)

  • Lee, Jae-Yun;Moon, Ju-Young;Kim, Hee-Jung
    • Journal of the Korean Society for Library and Information Science / v.41 no.1 / pp.345-372 / 2007
  • In this study, the intellectual structure of Records Management & Archival Science in Korea was analyzed using document clustering, a widely used method of text mining, and document similarity network analysis. The data used in this study were 145 articles on the subject of Records Management & Archival Science, selected from five major representative journals in the field of Library & Information Science in Korea and published from 2001 to 2006. The results of cluster analysis show that the core subject areas are "electronic records management and digital preservation," "records management policy and institution," "records description and catalogues," and "records management domain and education." The results of document analysis, which is more detailed than cluster analysis, show that "digital archiving," a specialized subject in digital preservation, plays a central role. The results of serial analysis, which proceeds according to a timeline, show the emergence of "archival services" as a new subject area.

Analysis of the abstracts of research articles in food related to climate change using a text-mining algorithm (텍스트 마이닝 기법을 활용한 기후변화관련 식품분야 논문초록 분석)

  • Bae, Kyu Yong;Park, Ju-Hyun;Kim, Jeong Seon;Lee, Yung-Seop
    • Journal of the Korean Data and Information Science Society / v.24 no.6 / pp.1429-1437 / 2013
  • Research articles in the food field related to climate change were analyzed with a text-mining algorithm, one of the unstructured data analysis tools used in big data analysis, with a focus on the frequencies of terms appearing in the abstracts. As a first step, a term-document matrix was established. A hierarchical clustering algorithm was then applied, based on dissimilarities among the selected terms and on expertise in the field, to classify the documents under consideration into a few labeled groups. Through this research, we were able to identify important topics appearing in the field of food related to climate change and their trends over the past years. It is expected that the results of the article can be utilized in future research on systematic responses and adaptation to climate change.
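
The two steps named in this abstract, building a term-document matrix and hierarchically clustering terms by dissimilarity, can be sketched as below. The abstract gives neither the dissimilarity measure nor the linkage rule, so cosine dissimilarity and average linkage are assumptions, and the tokenized abstracts are hypothetical toy data. The expert-guided labeling step is not modeled.

```python
import math
from collections import Counter


def term_document_matrix(docs):
    """Rows are terms, columns are documents; entries are raw term counts."""
    terms = sorted({t for doc in docs for t in doc})
    counts = [Counter(doc) for doc in docs]
    return {term: [c[term] for c in counts] for term in terms}


def cosine_dissimilarity(u, v):
    """1 - cosine similarity between two dense count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv) if nu and nv else 1.0


def agglomerative(matrix, k):
    """Average-linkage agglomerative clustering of terms until k clusters remain."""
    clusters = [[term] for term in matrix]

    def linkage(c1, c2):
        pairs = [(a, b) for a in c1 for b in c2]
        total = sum(cosine_dissimilarity(matrix[a], matrix[b]) for a, b in pairs)
        return total / len(pairs)

    while len(clusters) > k:
        candidates = ((i, j) for i in range(len(clusters))
                      for j in range(i + 1, len(clusters)))
        i, j = min(candidates, key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        clusters[i] = clusters[i] + clusters[j]  # merge the closest pair
        del clusters[j]
    return [sorted(c) for c in clusters]


# Hypothetical tokenized abstracts.
abstracts = [
    ["rice", "yield", "temperature"],
    ["rice", "yield", "drought"],
    ["fish", "ocean", "temperature"],
    ["fish", "ocean", "acidification"],
]
matrix = term_document_matrix(abstracts)
term_clusters = agglomerative(matrix, k=3)
```

Terms that co-occur in the same abstracts merge first, so the toy data separates into a crop-related group, a marine group, and the term shared by both.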

Text Detection and Recognition in Outdoor Korean Signboards for Mobile System Applications (모바일 시스템 응용을 위한 실외 한국어 간판 영상에서 텍스트 검출 및 인식)

  • Park, J.H.;Lee, G.S.;Kim, S.H.;Lee, M.H.;Toan, N.D.
    • Journal of the Institute of Electronics Engineers of Korea CI / v.46 no.2 / pp.44-51 / 2009
  • Text understanding in natural images has become an active research field in the past few decades. In this paper, we present an automatic recognition system for Korean signboards with complex backgrounds. The proposed algorithm includes detection, binarization, and extraction of text for the recognition of shop names. First, we use an elaborate detection algorithm to detect possible text regions based on edge histograms in the vertical and horizontal directions. The detected text region is then segmented by a clustering method. Second, the text is divided into individual characters based on connected components whose centers of mass lie below the center line, and the characters are recognized using a minimum distance classifier. A shape-based statistical feature, which is well suited to Korean character recognition, is adopted. The system has been implemented on a mobile phone and is demonstrated to show acceptable performance.

Trend Analysis of Thyroid Cancer Research in Korea with Text Mining Techniques

  • Lee, Tae-Gyeong;Heo, Seong-Min;Shin, Seung-Hyeok;Yang, Ji-Yeon
    • Journal of the Korea Society of Computer and Information / v.23 no.12 / pp.153-161 / 2018
  • In this paper, we propose a text-centered approach to identify the research trend of thyroid cancer in Korea. We combine statistical analysis, text mining, and machine learning techniques with our clinical insights to find associations between terms and to discover informative clusters of literature. The incidence of thyroid cancer in Korea increased rapidly in the 2000s, which fueled the debate regarding overdiagnosis, but recently the number of patients undergoing surgery has decreased significantly due to conscious reform efforts from various circles. We analyzed the abstracts and keywords of related research papers from DBpia. It was found that most papers in the 1980s were case reports, and some papers in the 1990s discussed the early detection of thyroid cancer by mass screening. While many papers in the 2000s focused on different diagnostic techniques and the detection of small cancers, many in the 2010s placed more emphasis on the quality of life of patients. There was an apparent change in the topics of thyroid cancer research over the past decades. The results of this study can serve as a reference guide for current and future research directions.