• Title/Summary/Keyword: Text Mining


PubMine: An Ontology-Based Text Mining System for Deducing Relationships among Biological Entities

  • Kim, Tae-Kyung;Oh, Jeong-Su;Ko, Gun-Hwan;Cho, Wan-Sup;Hou, Bo-Kyeng;Lee, Sang-Hyuk
    • Interdisciplinary Bio Central / v.3 no.2 / pp.7.1-7.6 / 2011
  • Background: Published manuscripts are the main source of biological knowledge. Since manual examination is almost impossible due to the huge volume of literature data (approximately 19 million abstracts in PubMed), intelligent text mining systems are of great utility for knowledge discovery. However, most current text mining tools have limited applicability because they i) provide abstract-based rather than sentence-based search, ii) use ontology terms improperly or not at all, iii) are designed for specific subjects, or iv) respond too slowly for web services and real-time applications. Results: We introduce an advanced text mining system called PubMine that supports intelligent knowledge discovery based on diverse bio-ontologies. PubMine improves query accuracy and flexibility with advanced search capabilities: fuzzy search, wildcard search, proximity search, range search, and Boolean combinations of these. Furthermore, PubMine allows users to extract multi-dimensional relationships between genes, diseases, and chemical compounds by using OLAP (On-Line Analytical Processing) techniques. The HUGO gene symbols and the MeSH ontology for diseases, chemical compounds, and anatomy have been included in the current version of PubMine, which is freely available at http://pubmine.kobic.re.kr. Conclusions: PubMine is a unique bio-text mining system that provides flexible searches and analysis of biological entity relationships. We believe that PubMine will serve as a key bioinformatics utility because its rapid response enables web services for the community and its flexibility accommodates general ontologies.

Business Model Mining: Analyzing a Firm's Business Model with Text Mining of Annual Report

  • Lee, Jihwan;Hong, Yoo S.
    • Industrial Engineering and Management Systems / v.13 no.4 / pp.432-441 / 2014
  • As the business model is receiving considerable attention these days, the ability to collect business model related information has become an essential requirement for a company. The annual report is one of the most important external documents containing crucial information about a company's business model. By investigating the business descriptions and future strategies within the annual report, we can analyze a company's business model. However, given the sheer volume of the data, which is usually over a hundred pages, it is not practical to depend only on manual extraction. The purpose of this study is to complement the manual extraction process by using text mining techniques. In this study, text mining is applied to business model concept extraction and business model evolution analysis. By concept, we mean the overview of a company's business model within a specific year, and, by evolution, we mean temporal changes in the business model concept over time. The efficiency and effectiveness of our methodology are illustrated by a case example of three companies in the US video rental industry.

R&D Perspective Social Issue Packaging using Text Analysis

  • Wong, William Xiu Shun;Kim, Namgyu
    • Journal of Information Technology Services / v.15 no.3 / pp.71-95 / 2016
  • In recent years, text mining has been used to extract meaningful insights from the large volume of unstructured text data sets of various domains. As one of the most representative text mining applications, topic modeling has been widely used to extract main topics in the form of a set of keywords extracted from a large collection of documents. In general, topic modeling is performed according to the weighted frequency of words in a document corpus. However, general topic modeling cannot discover the relation between documents if the documents share only a few terms, although the documents are in fact strongly related from a particular perspective. For instance, a document about "sexual offense" and another document about "silver industry for aged persons" might not be classified into the same topic because they may not share many key terms. However, these two documents can be strongly related from the R&D perspective because some technologies, such as "RF Tag," "CCTV," and "Heart Rate Sensor," are core components of both "sexual offense" and "silver industry." Thus, in this study, we attempted to discover the differences between the results of general topic modeling and R&D perspective topic modeling. Furthermore, we package social issues from the R&D perspective and present a prototype system, which provides a package of news articles for each R&D issue. Finally, we analyze the quality of R&D perspective topic modeling and provide the results of inter- and intra-topic analysis.

Development of Text Mining-Based Accounting Terminology Analyzer for Financial Information Utilization (재정정보 활용을 위한 텍스트 마이닝 기반 회계용어 형태소 분석기 구축)

  • Jung, Geon-Yong;Yoon, Seung-Sik;Kang, Ju-Young
    • The Journal of Information Systems / v.28 no.4 / pp.155-174 / 2019
  • Purpose: Social interest in financial statement notes has recently increased. However, despite this keen interest, there is no morphological analyzer for accounting terms, which makes research in this area considerably difficult. In this study, we build a morphological analyzer for accounting-related text mining. This morphological analyzer can handle accounting terms such as those found in financial statements, and we expect it to serve as a springboard for growth in the text mining research field. Design/methodology/approach: In this study, we build a customized Korean morphological analyzer to extract proper accounting terms. First, we collect companies' financial statement notes, financial information data published by KPFIS (Korea Public Finance Information Service), and K-IFRS accounting terms data. Second, we clean and tokenize the text and remove stopwords. Third, we customize the morphological analyzer using an n-gram methodology. Findings: Existing morphological analyzers cannot extract accounting terms because they split each accounting term into several nouns. The new customized morphological analyzer detects more appropriate accounting terms than the existing morphological analyzers. We found that accounting words that were not detected by the existing morphological analyzers were detected by the new customized one.
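
The n-gram customization step described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that an existing analyzer has already split the text into nouns, and that adjacent noun pairs seen at least `min_count` times in the corpus are compound terms worth merging. The function names, the threshold, and the English sample tokens are all hypothetical.

```python
from collections import Counter

def frequent_bigrams(token_lists, min_count=2):
    """Collect adjacent token pairs that occur at least min_count times."""
    counts = Counter()
    for tokens in token_lists:
        # zip(tokens, tokens[1:]) yields each pair of neighbouring tokens.
        counts.update(zip(tokens, tokens[1:]))
    return {pair for pair, count in counts.items() if count >= min_count}

def merge_terms(tokens, compounds):
    """Greedily merge neighbouring tokens that form a known compound."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in compounds:
            merged.append(tokens[i] + " " + tokens[i + 1])
            i += 2  # both tokens were consumed by the compound
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# Hypothetical output of a general-purpose analyzer (made-up sample data).
corpus = [
    ["deferred", "tax", "asset", "increased"],
    ["deferred", "tax", "liability"],
    ["tax", "asset", "decreased"],
]
compounds = frequent_bigrams(corpus, min_count=2)
print(merge_terms(["deferred", "tax", "liability"], compounds))
```

A real dictionary for a Korean analyzer would more likely be built from longer n-grams and validated against a term list such as the K-IFRS data mentioned above; the greedy left-to-right merge here is only the simplest possible choice.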

Study on Effective Extraction of New Coined Vocabulary from Political Domain Article and News Comment (정치 도메인에서 신조어휘의 효과적인 추출 및 의미 분석에 대한 연구)

  • Lee, Jihyun;Kim, Jaehong;Cho, Yesung;Lee, Mingu;Choi, Hyebong
    • The Journal of the Convergence on Culture Technology / v.7 no.2 / pp.149-156 / 2021
  • Text mining is a useful tool for discovering public opinion and perception regarding political issues from big data. Social media users very commonly express their opinions with newly coined words such as slang and emoji. However, those new words are not effectively captured by traditional text mining methods that process text data using a language dictionary. In this study, we propose effective methods to extract newly coined words that connote the political stance and opinion of users. Using various text mining techniques, we attempt to discover the context and the political meaning of the new words.

Development of Online Fashion Thesaurus and Taxonomy for Text Mining (텍스트마이닝을 위한 패션 속성 분류체계 및 말뭉치 웹사전 구축)

  • Seyoon Jang;Ha Youn Kim;Songmee Kim;Woojin Choi;Jin Jeong;Yuri Lee
    • Journal of the Korean Society of Clothing and Textiles / v.46 no.6 / pp.1142-1160 / 2022
  • Text data plays a significant role in understanding and analyzing trends in consumer, business, and social sectors. For text analysis, there must be a corpus that reflects specific domain knowledge. However, in the field of fashion, the professional corpus is insufficient. This study aims to develop a taxonomy and thesaurus that considers the specialty of fashion products. To this end, about 100,000 fashion vocabulary terms were collected by crawling text data from WSGN, Pantone, and online platforms; text subsequently was extracted through preprocessing with Python. The taxonomy was composed of items, silhouettes, details, styles, colors, textiles, and patterns/prints, which are seven attributes of clothes. The corpus was completed through processing synonyms of terms from fashion books such as dictionaries. Finally, 10,294 vocabulary words, including 1,956 standard Korean words, were classified in the taxonomy. All data was then developed into a web dictionary system. Quantitative and qualitative performance tests of the results were conducted through expert reviews. The performance of the thesaurus also was verified by comparing the results of text mining analysis through the previously developed corpus. This study contributes to achieving a text data standard and enables meaningful results of text mining analysis in the fashion field.

An Investigation on the Periodical Transition of News related to North Korea using Text Mining (텍스트마이닝을 활용한 북한 관련 뉴스의 기간별 변화과정 고찰)

  • Park, Chul-Soo
    • Journal of Intelligence and Information Systems / v.25 no.3 / pp.63-88 / 2019
  • The goal of this paper is to investigate changes in North Korea's domestic and foreign policies through automated text analysis of how North Korea is represented in South Korean mass media. Based on that data, we then analyze the status of text mining research, using a text mining technique to find the topics, methods, and trends of text mining research. We also investigate the characteristics and analysis methods of the text mining techniques, confirmed by analysis of the data. In this study, R was used to apply the text mining technique. R is free software for statistical computing and graphics. Text mining methods also make it possible to highlight the most frequently used keywords in a body of text. One can create a word cloud, also referred to as a text cloud or tag cloud. This study proposes a procedure to find meaningful tendencies based on a combination of word clouds and co-occurrence networks. This study aims to explore more objectively the images of North Korea represented in South Korean newspapers by quantitatively reviewing the patterns of language use related to North Korea in newspaper big data from November 1, 2016 to May 23, 2019. We divided the data into three periods based on recent inter-Korean relations. The period before January 1, 2018 was set as the Before Phase of Peace Building. The period from January 1, 2018 to February 24, 2019 was set as the Peace Building Phase. The New Year's message of Kim Jong-un and the PyeongChang Olympics formed an atmosphere of peace on the Korean peninsula. The third period, after the Hanoi peace summit, was marked by silence in the relationship between North Korea and the United States. It was therefore called the Depression Phase of Peace Building. This study analyzes news articles related to North Korea from the Korea Press Foundation database (www.bigkinds.or.kr) through text mining, to investigate characteristics of the Kim Jong-un regime's South Korea policy and unification discourse.
The main results of this study show that trends in the North Korean national policy agenda can be discovered based on clustering and visualization algorithms. In particular, the study examines changes in international circumstances, domestic conflicts, the living conditions of North Korea, the South's aid projects for the North, the conflicts between the two Koreas, the North Korean nuclear issue, and the North Korean refugee problem through co-occurrence word analysis. It also offers an analysis of the South Korean mentality toward North Korea in terms of semantic prosody. In the Before Phase of Peace Building, the analysis yielded the order 'Missiles', 'North Korea Nuclear', 'Diplomacy', 'Unification', and 'South-North Korean'. The results for the Peace Building Phase yielded the order 'Panmunjom', 'Unification', 'North Korea Nuclear', 'Diplomacy', and 'Military'. The results for the Depression Phase of Peace Building yielded the order 'North Korea Nuclear', 'North and South Korea', 'Missile', 'State Department', and 'International'. There are 16 words adopted in all three periods. The order is as follows: 'missile', 'North Korea Nuclear', 'Diplomacy', 'Unification', 'North and South Korea', 'Military', 'Kaesong Industrial Complex', 'Defense', 'Sanctions', 'Denuclearization', 'Peace', 'Exchange and Cooperation', and 'South Korea'. We expect that the results of this study will contribute to analyzing the trends of news content about North Korea associated with North Korea's provocations. Future research on North Korean trends will be conducted based on the results of this study. We will continue to study the development of a model for North Korea risk measurement that can anticipate and respond to North Korea's behavior in advance. We expect that text mining analysis methods and scientific data analysis techniques will be applied to the North Korea and unification research field.
Through these academic studies, I hope to see a lot of studies that make important contributions to the nation.
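
The co-occurrence word analysis mentioned in this abstract can be sketched as follows. The study itself used R; this Python version is only an illustration, and it assumes that each article has already been reduced to a list of keywords and that two keywords co-occur when they appear in the same article. The function name and the sample keywords are hypothetical.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_edges(articles, min_count=1):
    """Count how many articles each unordered keyword pair shares."""
    counts = Counter()
    for keywords in articles:
        # Deduplicate within an article and sort so that each pair
        # is always stored in the same (alphabetical) order.
        for pair in combinations(sorted(set(keywords)), 2):
            counts[pair] += 1
    return {pair: count for pair, count in counts.items() if count >= min_count}

# Made-up keyword lists standing in for three news articles.
articles = [
    ["missile", "sanctions", "missile"],
    ["missile", "sanctions", "peace"],
    ["peace", "summit"],
]
edges = cooccurrence_edges(articles, min_count=2)
print(edges)
```

Each returned pair can be read as a weighted edge of a co-occurrence network; running the function separately on the articles of each period would give one network per phase for comparison.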

Rating and Comments Mining Using TF-IDF and SO-PMI for Improved Priority Ratings

  • Kim, Jinah;Moon, Nammee
    • KSII Transactions on Internet and Information Systems (TIIS) / v.13 no.11 / pp.5321-5334 / 2019
  • Data mining technology is frequently used in identifying the intention of users over a variety of information contexts. Since relevant terms are mainly hidden in text data, it is necessary to extract them. Quantification is required in order to interpret user preference in association with other structured data. This paper proposes rating and comments mining to identify user priority and obtain improved ratings. Structured data (location and rating) and unstructured data (comments) are collected and priority is derived by analyzing statistics and employing TF-IDF. In addition, the improved ratings are generated by applying priority categories based on materialized ratings through Sentiment-Oriented Point-wise Mutual Information (SO-PMI)-based emotion analysis. In this paper, an experiment was carried out by collecting ratings and comments on "place" and by applying them. We confirmed that the proposed mining method is 1.2 times better than the conventional methods that do not reflect priorities and that the performance is improved to almost 2 times when the number to be predicted is small.
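
The two measures named in this abstract can be sketched in a few lines. This is a minimal illustration, not the paper's pipeline: it assumes tokenized comments, document-level co-occurrence for the PMI counts, a small smoothing constant `eps` to avoid taking the logarithm of zero, and hypothetical seed words for positive and negative sentiment.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights

def pmi(word, seed, docs, eps=0.01):
    """Smoothed point-wise mutual information from document co-occurrence."""
    n = len(docs)
    c_word = sum(1 for d in docs if word in d)
    c_seed = sum(1 for d in docs if seed in d)
    c_both = sum(1 for d in docs if word in d and seed in d)
    if c_word == 0 or c_seed == 0:
        return 0.0  # no evidence either way
    return math.log2(((c_both + eps) * n) / (c_word * c_seed))

def so_pmi(word, docs, pos_seeds, neg_seeds, eps=0.01):
    """Association with positive seeds minus association with negative seeds."""
    positive = sum(pmi(word, s, docs, eps) for s in pos_seeds)
    negative = sum(pmi(word, s, docs, eps) for s in neg_seeds)
    return positive - negative

# Made-up comments about a place, already tokenized.
comments = [
    ["great", "food", "good"],
    ["bad", "service", "slow"],
    ["good", "food", "fast"],
    ["bad", "food", "slow"],
]
print(so_pmi("fast", comments, ["good"], ["bad"]))
print(so_pmi("slow", comments, ["good"], ["bad"]))
```

A positive SO-PMI score suggests that a word leans toward the positive seeds and a negative score toward the negative ones. How such scores are combined with TF-IDF priorities to adjust ratings is specific to the paper and is not reproduced here.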

Keyword Analysis of Two SCI Journals on Rock Engineering by using Text Mining (텍스트 마이닝을 이용한 암반공학분야 SCI논문의 주제어 분석)

  • Jung, Yong-Bok;Park, Eui-Seob
    • Tunnel and Underground Space / v.25 no.4 / pp.303-319 / 2015
  • Text mining is a branch of data mining used to find meaningful information in large amounts of text. In this study, we analyzed the titles and keywords of two SCI journals on rock engineering by using text mining to find major research areas, trends, and associations among research fields. Visualization of the results was also included for intuitive understanding. The two journals showed similar research fields but different patterns in the associations among research fields. IJRMMS showed a simple network, that is, one big group based on the keyword 'rock' with a few small groups. On the other hand, RMRE showed a complex network among various medium-sized groups. Trend analysis by clustering and linear regression of the keyword-year frequency matrix showed that most of the keywords increased in frequency over time, except for a few declining keywords.
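
The trend analysis described in this abstract fits a regression line to each row of a keyword-year frequency matrix. The sketch below shows only that regression step, under the assumption that every keyword has one count per year; the function name, keywords, and counts are made up for illustration and do not come from the paper.

```python
def trend_slopes(freq_by_keyword, years):
    """Least-squares slope of yearly frequency for each keyword."""
    n = len(years)
    mean_x = sum(years) / n
    # Sum of squared deviations of the years, shared by all keywords.
    sxx = sum((x - mean_x) ** 2 for x in years)
    slopes = {}
    for keyword, counts in freq_by_keyword.items():
        mean_y = sum(counts) / n
        sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, counts))
        slopes[keyword] = sxy / sxx
    return slopes

# Made-up keyword-year frequency matrix (rows: keywords, columns: years).
years = [2010, 2011, 2012]
matrix = {"rock": [1, 2, 3], "blasting": [5, 3, 1]}
slopes = trend_slopes(matrix, years)
print(slopes)
```

A positive slope marks a keyword whose yearly count rises over the period and a negative slope marks a declining one, which is the distinction the abstract draws between increasing and descending keywords.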

Table based Matching Algorithm for Soft Categorization of News Articles in Reuter 21578

  • Jo, Tae-Ho
    • Journal of Korea Multimedia Society / v.11 no.6 / pp.875-882 / 2008
  • This research proposes an alternative to machine learning based approaches for text categorization. To use machine learning based approaches for any text mining task, documents must be encoded into numerical vectors, which causes two problems: huge dimensionality and sparse distribution. Although there are various text mining tasks, such as text categorization, text clustering, and text summarization, the scope of this research is restricted to text categorization. The idea of this research is to avoid the two problems by encoding a document or documents into a table instead of numerical vectors. The goal of this research is therefore to improve the performance of text categorization by proposing approaches that are free from the two problems.
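
The idea of encoding documents as tables rather than numerical vectors can be illustrated with a short sketch. The abstract does not specify the table layout or the matching rule, so everything below is an assumption: a table is taken to be a word-to-frequency mapping, the matching score is the share of a document's weight carried by words that also appear in a category's table, and soft categorization assigns every category whose score reaches a threshold. All names and sample data are hypothetical.

```python
from collections import Counter

def encode_table(tokens, max_entries=50):
    """Encode tokens as a table of the most frequent (word, frequency) entries."""
    return dict(Counter(tokens).most_common(max_entries))

def match_score(doc_table, cat_table):
    """Share of the document's weight on words found in the category table."""
    total = sum(doc_table.values())
    if total == 0:
        return 0.0
    shared = sum(w for word, w in doc_table.items() if word in cat_table)
    return shared / total

def soft_categorize(doc_tokens, category_tables, threshold=0.3):
    """Return every category whose matching score reaches the threshold."""
    doc_table = encode_table(doc_tokens)
    return sorted(
        name for name, table in category_tables.items()
        if match_score(doc_table, table) >= threshold
    )

# Made-up category tables built from tiny training texts.
category_tables = {
    "grain": encode_table("wheat corn harvest wheat export".split()),
    "trade": encode_table("export tariff trade export".split()),
}
print(soft_categorize("wheat export rises".split(), category_tables))
```

Because a table stores only the words that actually occur, it sidesteps the fixed high-dimensional and mostly empty vectors the abstract identifies as the two problems. Returning several categories for one document is what makes the categorization soft.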
