• Title/Summary/Keyword: Text-Mining

Comparative Analysis of Korean and Japanese Textbooks on World Geography: Focused on the Contents of Global Education (한.일 고등학교 세계지리 교과서 내용 비교 분석 -국제이해교육의 관련 내용을 중심으로-)

  • Yang, Won-Taek
    • Journal of the Korean association of regional geographers / v.2 no.2 / pp.75-92 / 1996
  • Geography education is one of the best ways to improve the understanding of other countries. By analyzing Korean and Japanese textbooks on world geography, I tried to find out how well they explain the other country and to set forth guiding principles for geography education. To achieve these aims, weight analysis was used. The major findings of this study can be summarised as follows. The contents of the Korean and Japanese geography textbooks were analyzed by dividing them into 2 major topics, 6 minor topics, and 20 key concepts. (1) In the Korean geography textbook of the 5th curriculum, the weight percentages given to each minor topic were as follows: resource problem (57.7%), human rights problem (21.4%), population problem (9.0%), mutual dependence (6.0%), environmental problem (3.3%), international competition (2.6%). (2) In the Korean geography textbook of the 6th curriculum, the weight percentages given to each minor topic were as follows: resource problem (42.7%), human rights problem (21.7%), mutual dependence (20.9%), environmental problem (7.7%), population problem (4.6%), international competition (2.4%). (3) In the Japanese geography textbook of the 5th curriculum amendment, the weight percentages given to each minor topic were as follows: resource problem (49.9%), human rights problem (21.7%), mutual dependence (15.5%), population problem (7.1%), international competition (6.2%), environmental problem (3.8%). (4) In the Japanese geography textbook of the 6th curriculum amendment, the weight percentages given to each minor topic were as follows: human rights problem (31.6%), mutual dependence (22.8%), resource problem (20.7%), population problem (12.7%), environmental problem (8.6%), international competition (3.6%).
In the field of dependence, Korea and Japan place similar weight, but in the field of common problems their weights differ considerably. This can be attributed to the difference in curricula: Korea used the systematic method as well as the topical method on a unit basis, whereas Japan used only the topical method. Therefore the Korean geography textbook introduces agriculture, forestry, fishery, mining, and manufacturing, whereas the Japanese textbook gives a detailed account of residents' lives in specific areas. For that reason, resources are stressed in the Korean textbook, while culture is stressed in the Japanese textbook.

A Method for Evaluating News Value based on Supply and Demand of Information Using Text Analysis (텍스트 분석을 활용한 정보의 수요 공급 기반 뉴스 가치 평가 방안)

  • Lee, Donghoon;Choi, Hochang;Kim, Namgyu
    • Journal of Intelligence and Information Systems / v.22 no.4 / pp.45-67 / 2016
  • Given the recent development of smart devices, users are producing, sharing, and acquiring a variety of information via the Internet and social network services (SNSs). Because users tend to use multiple media simultaneously according to their goals and preferences, domestic SNS users use around 2.09 media concurrently on average. Since the information provided by such media is usually textually represented, recent studies have been actively conducting textual analysis in order to understand users more deeply. Earlier studies using textual analysis focused on analyzing a document's contents without substantive consideration of the diverse characteristics of the source medium. However, current studies argue that analytical and interpretive approaches should be applied differently according to the characteristics of a document's source. Documents can be classified into the following types: informative documents for delivering information, expressive documents for expressing emotions and aesthetics, operational documents for inducing the recipient's behavior, and audiovisual media documents for supplementing the above three functions through images and music. Further, documents can be classified according to their contents, which comprise facts, concepts, procedures, principles, rules, stories, opinions, and descriptions. Documents have unique characteristics according to the source media by which they are distributed. In terms of newspapers, only highly trained people tend to write articles for public dissemination. In contrast, with SNSs, various types of users can freely write any message and such messages are distributed in an unpredictable way. Again, in the case of newspapers, each article exists independently and does not tend to have any relation to other articles. However, messages (original tweets) on Twitter, for example, are highly organized and regularly duplicated and repeated through replies and retweets. 
There have been many studies focusing on the different characteristics between newspapers and SNSs. However, it is difficult to find a study that focuses on the difference between the two media from the perspective of supply and demand. We can regard the articles of newspapers as a kind of information supply, whereas messages on various SNSs represent a demand for information. By investigating traditional newspapers and SNSs from the perspective of supply and demand of information, we can explore and explain the information dilemma more clearly. For example, there may be superfluous issues that are heavily reported in newspaper articles despite the fact that users seldom have much interest in these issues. Such overproduced information is not only a waste of media resources but also makes it difficult to find valuable, in-demand information. Further, some issues that are covered by only a few newspapers may be of high interest to SNS users. To alleviate the deleterious effects of information asymmetries, it is necessary to analyze the supply and demand of each information source and, accordingly, provide information flexibly. Such an approach would allow the value of information to be explored and approximated on the basis of the supply-demand balance. Conceptually, this is very similar to the price of goods or services being determined by the supply-demand relationship. Adopting this concept, media companies could focus on the production of highly in-demand issues that are in short supply. In this study, we selected Internet news sites and Twitter as representative media for investigating information supply and demand, respectively. We present the notion of News Value Index (NVI), which evaluates the value of news information in terms of the magnitude of Twitter messages associated with it. In addition, we visualize the change of information value over time using the NVI. We conducted an analysis using 387,014 news articles and 31,674,795 Twitter messages. 
The analysis results revealed interesting patterns: most issues show a lower NVI than the average over all issues, whereas a few issues show a steadily higher NVI than the average.
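
The supply-and-demand reading of news value described above can be illustrated with a minimal sketch. The abstract does not state the NVI formula, so the ratio used here (tweets per article for each issue) is only an assumption, and the function name `news_value_index`, the issue names, and all counts are hypothetical.

```python
def news_value_index(article_counts, tweet_counts):
    """Return a demand/supply ratio per issue (assumed form, not the paper's formula)."""
    nvi = {}
    for issue, supply in article_counts.items():
        demand = tweet_counts.get(issue, 0)
        # Skip issues with no articles: the ratio is undefined without supply.
        if supply > 0:
            nvi[issue] = demand / supply
    return nvi


# Hypothetical counts of news articles (supply) and tweets (demand) per issue.
articles = {"election": 400, "weather": 50, "sports": 200}
tweets = {"election": 800, "weather": 900, "sports": 100}

scores = news_value_index(articles, tweets)
average = sum(scores.values()) / len(scores)
# Issues whose ratio exceeds the average are in demand but in short supply.
undersupplied = sorted(issue for issue, s in scores.items() if s > average)
```

Under this assumed definition, an issue with few articles but many tweets gets a high score, which matches the abstract's notion of in-demand information that is in short supply.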

Mapping Categories of Heterogeneous Sources Using Text Analytics (텍스트 분석을 통한 이종 매체 카테고리 다중 매핑 방법론)

  • Kim, Dasom;Kim, Namgyu
    • Journal of Intelligence and Information Systems / v.22 no.4 / pp.193-215 / 2016
  • In recent years, the proliferation of diverse social networking services has led users to use many mediums simultaneously depending on their individual purpose and taste. Besides, while collecting information about particular themes, they usually employ various mediums such as social networking services, Internet news, and blogs. However, in terms of management, each document circulated through diverse mediums is placed in different categories on the basis of each source's policy and standards, hindering any attempt to conduct research on a specific category across different kinds of sources. For example, documents containing content on "Application for a foreign travel" can be classified into "Information Technology," "Travel," or "Life and Culture" according to the peculiar standard of each source. Likewise, with different viewpoints of definition and levels of specification for each source, similar categories can be named and structured differently in accordance with each source. To overcome these limitations, this study proposes a plan for conducting category mapping between different sources with various mediums while maintaining the existing category system of the medium as it is. Specifically, by re-classifying individual documents from the viewpoint of diverse sources and storing the result of such a classification as extra attributes, this study proposes a logical layer by which users can search for a specific document from multiple heterogeneous sources with different category names as if they belong to the same source. Besides, by collecting 6,000 articles of news from two Internet news portals, experiments were conducted to compare accuracy among sources, supervised learning and semi-supervised learning, and homogeneous and heterogeneous learning data. 
It is particularly interesting that in some categories, classifying accuracy of semi-supervised learning using heterogeneous learning data proved to be higher than that of supervised learning and semi-supervised learning, which used homogeneous learning data. This study has the following significances. First, it proposes a logical plan for establishing a system to integrate and manage all the heterogeneous mediums in different classifying systems while maintaining the existing physical classifying system as it is. This study's results particularly exhibit very different classifying accuracies in accordance with the heterogeneity of learning data; this is expected to spur further studies for enhancing the performance of the proposed methodology through the analysis of characteristics by category. In addition, with an increasing demand for search, collection, and analysis of documents from diverse mediums, the scope of the Internet search is not restricted to one medium. However, since each medium has a different categorical structure and name, it is actually very difficult to search for a specific category insofar as encompassing heterogeneous mediums. The proposed methodology is also significant for presenting a plan that enquires into all the documents regarding the standards of the relevant sites' categorical classification when the users select the desired site, while maintaining the existing site's characteristics and structure as it is. This study's proposed methodology needs to be further complemented in the following aspects. First, though only an indirect comparison and evaluation was made on the performance of this proposed methodology, future studies would need to conduct more direct tests on its accuracy. That is, after re-classifying documents of the object source on the basis of the categorical system of the existing source, the extent to which the classification was accurate needs to be verified through evaluation by actual users. 
In addition, the classification accuracy needs to be increased by making the methodology more sophisticated. Furthermore, the characteristics of the categories in which heterogeneous semi-supervised learning achieved higher classification accuracy than supervised learning need to be understood, since they may help in obtaining heterogeneous documents from diverse mediums and in devising ways to use those documents to improve the accuracy of document classification.

Twitter Issue Tracking System by Topic Modeling Techniques (토픽 모델링을 이용한 트위터 이슈 트래킹 시스템)

  • Bae, Jung-Hwan;Han, Nam-Gi;Song, Min
    • Journal of Intelligence and Information Systems / v.20 no.2 / pp.109-122 / 2014
  • People nowadays create a tremendous amount of data on Social Network Services (SNS). In particular, the incorporation of SNS into mobile devices has resulted in massive data generation, thereby greatly influencing society. This phenomenon is unmatched in history, and we now live in the Age of Big Data. SNS data qualifies as Big Data because it satisfies the conditions of data amount (volume), data input and output speed (velocity), and variety of data types (variety). Discovering the trend of an issue in SNS Big Data can serve as an important new source for creating value, because this information covers the whole of society. In this study, a Twitter Issue Tracking System (TITS) is designed and built to meet the need for analyzing SNS Big Data. TITS extracts issues from Twitter texts and visualizes them on the web. The proposed system provides the following four functions: (1) it provides the topic keyword set that corresponds to the daily ranking; (2) it visualizes the daily time-series graph of a topic over a month; (3) it shows the importance of a topic through a treemap based on the score system and frequency; (4) it visualizes the daily time-series graph of a keyword found by keyword search. The present study analyzes the Big Data generated by SNS in real time. SNS Big Data analysis requires various natural language processing techniques, including stop-word removal and noun extraction, to process unrefined, unstructured data. In addition, such analysis requires recent big data technology, such as the Hadoop distributed system or NoSQL, an alternative to relational databases, to rapidly process large amounts of real-time data. We built TITS on Hadoop to optimize the processing of big data because Hadoop is designed to scale from a single node to thousands of machines.
Furthermore, we use MongoDB, which is classified as a NoSQL database. MongoDB is an open-source, document-oriented database that provides high performance, high availability, and automatic scaling. Unlike existing relational databases, MongoDB has no schemas or tables, and its most important goals are data accessibility and data processing performance. In the Age of Big Data, visualization is attractive to the Big Data community because it helps analysts examine such data easily and clearly. Therefore, TITS uses the d3.js library as a visualization tool. This library is designed for creating Data-Driven Documents that bind arbitrary data to the Document Object Model (DOM); interaction with the data is easy, which is useful for managing real-time data streams with smooth animation. In addition, TITS uses Bootstrap, a set of pre-configured plug-in style sheets and JavaScript libraries, to build the web system. The TITS Graphical User Interface (GUI) is designed using these libraries, and it is capable of detecting issues on Twitter in an easy and intuitive manner. The proposed work demonstrates the superiority of our issue detection techniques by matching detected issues with corresponding online news articles. The contributions of the present study are threefold. First, we suggest an alternative approach to real-time big data analysis, which has become an extremely important issue. Second, we apply a topic modeling technique that is used in various research areas, including Library and Information Science (LIS). Based on this, we can confirm the utility of storytelling and time series analysis. Third, we develop a web-based system and make it available for the real-time discovery of topics. The present study conducted experiments with nearly 150 million tweets in Korea during March 2013.
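
The preprocessing and daily keyword ranking mentioned in the abstract can be sketched roughly as follows. This is not the TITS implementation: it swaps topic modeling for plain frequency counting, replaces Korean noun extraction with whitespace tokenization, and uses a hypothetical stop-word list, function name (`daily_keyword_ranking`), and sample tweets.

```python
from collections import Counter, defaultdict

# Tiny illustrative stop-word list (an assumption, not the system's list).
STOP_WORDS = {"the", "a", "is", "in", "of", "and", "to"}


def daily_keyword_ranking(tweets, top_n=2):
    """Map each day to its top_n most frequent non-stop-word tokens."""
    counts = defaultdict(Counter)
    for day, text in tweets:
        # Whitespace tokenization stands in for noun extraction here.
        tokens = [t for t in text.lower().split() if t not in STOP_WORDS]
        counts[day].update(tokens)
    return {day: [w for w, _ in c.most_common(top_n)] for day, c in counts.items()}


# Hypothetical (day, text) pairs.
tweets = [
    ("2013-03-01", "the election is in the news"),
    ("2013-03-01", "election debate and election polls"),
    ("2013-03-02", "spring weather in seoul"),
    ("2013-03-02", "weather is warm"),
]
ranking = daily_keyword_ranking(tweets, top_n=1)
```

The per-day counters produced this way could also feed a daily time-series view of a single keyword, which is the kind of output the four listed functions describe.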

Influence analysis of Internet buzz to corporate performance : Individual stock price prediction using sentiment analysis of online news (온라인 언급이 기업 성과에 미치는 영향 분석 : 뉴스 감성분석을 통한 기업별 주가 예측)

  • Jeong, Ji Seon;Kim, Dong Sung;Kim, Jong Woo
    • Journal of Intelligence and Information Systems / v.21 no.4 / pp.37-51 / 2015
  • Due to the development of Internet technology and the rapid increase of Internet data, various studies are actively being conducted on how to use and analyze Internet data for various purposes. In recent years in particular, a number of studies have applied text mining techniques in order to overcome the limitations of current applications based on structured data. Among them are various studies on sentiment analysis, which scores opinions based on the distribution of polarity, such as the positivity or negativity of words or sentences in documents. As part of this line of work, this study tries to predict the ups and downs of companies' stock prices by performing sentiment analysis on online news about each company. A variety of news on companies is produced online by different economic agents, and it is diffused quickly and accessed easily on the Internet. So, based on the inefficient market hypothesis, we can expect that news about an individual company can be used to predict fluctuations in the company's stock price if proper data analysis techniques are applied. However, because companies operate in different areas, machine-learning-based analysis of text data needs to consider the characteristics of each company. In addition, since news containing positive or negative information on certain companies has various impacts on other companies or industry fields, the stock price needs to be predicted company by company. Therefore, this study attempted to predict changes in the stock prices of individual companies by applying sentiment analysis to online news data. Accordingly, this study chose top companies in the KOSPI 200 as the subjects of the analysis, and collected and analyzed online news data on each company produced over two years on Naver, a representative domestic search portal service.
In addition, considering that the meaning of a word differs across economic subjects, it aims to improve performance by building a lexicon for each individual company and applying it to the analysis. The analysis showed that prediction accuracy differs by company, and the prediction accuracy was 56% on average. Comparing prediction accuracy across industry sectors, 'energy/chemical', 'consumer goods for living' and 'consumer discretionary' showed relatively higher accuracy than other industries, while sectors such as 'information technology' and 'shipbuilding/transportation' showed lower accuracy. Only five representative companies were collected for each industry, so it is somewhat difficult to generalize, but a difference in prediction accuracy across industry sectors could be confirmed. In addition, at the individual company level, companies such as 'Kangwon Land', 'KT & G' and 'SK Innovation' showed relatively higher prediction accuracy than other companies, while companies such as 'Young Poong', 'LG', 'Samsung Life Insurance', and 'Doosan' had a low prediction accuracy of less than 50%. In this paper, we analyzed stock price prediction for individual companies using pre-built, company-specific vocabularies in order to take advantage of online news information, with the aim of improving stock price prediction performance. Based on this, future work may increase prediction accuracy by addressing the problem of unnecessary words being added to the sentiment dictionary.
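
A company-specific lexicon of the kind described above can be applied in a very simple way, sketched below. The abstract does not specify the scoring or prediction rule, so summing word polarities and predicting "up" for a positive total is an assumption; the lexicon entries, headlines, and the name `predict_direction` are hypothetical.

```python
def predict_direction(news_texts, lexicon):
    """Return ('up' or 'down', score) from the summed polarity of lexicon words."""
    score = 0
    for text in news_texts:
        for token in text.lower().split():
            # Words missing from the company's lexicon contribute nothing.
            score += lexicon.get(token, 0)
    # A zero score is treated as 'down' here; that tie-break is arbitrary.
    return ("up" if score > 0 else "down"), score


# Hypothetical lexicon for one company: word -> polarity (+1 or -1).
lexicon_a = {"record": 1, "growth": 1, "lawsuit": -1, "recall": -1}
news_a = [
    "Record quarterly growth reported",
    "Analysts expect growth despite lawsuit",
]
direction, score = predict_direction(news_a, lexicon_a)
```

Keeping one dictionary per company lets the same word carry different polarity for different firms, which is the motivation the abstract gives for building individual lexicons.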

Multi-Dimensional Analysis Method of Product Reviews for Market Insight (마켓 인사이트를 위한 상품 리뷰의 다차원 분석 방안)

  • Park, Jeong Hyun;Lee, Seo Ho;Lim, Gyu Jin;Yeo, Un Yeong;Kim, Jong Woo
    • Journal of Intelligence and Information Systems / v.26 no.2 / pp.57-78 / 2020
  • With the development of the Internet, consumers can easily check product information through E-Commerce. Product reviews used in the purchasing process are based on user experience, allowing consumers to act as producers of information as well as to consult it. From the consumer's perspective, this can make purchasing decisions more efficient, and from the seller's point of view, it can help develop products and strengthen competitiveness. However, it takes a lot of time and effort for consumers to grasp the overall assessment, and the assessment dimensions they consider important, by reading the vast number of product reviews that E-Commerce offers for the products they want to compare. This is because product reviews are unstructured information, and it is difficult to read the sentiment and assessment dimension of reviews at a glance. For example, consumers who want to purchase a laptop would like to check how the products being compared are assessed on each dimension, such as performance, weight, delivery, speed, and design. Therefore, in this paper, we propose a method to automatically generate multi-dimensional product assessment scores from the reviews of the products to be compared. The method presented in this study consists of two phases: a pre-preparation phase and an individual product scoring phase. In the pre-preparation phase, a dimensioned classification model and a sentiment analysis model are created from the reviews of a large category product group. By combining word embedding and association analysis, the dimensioned classification model overcomes a limitation of the word embedding methods used in existing studies to find the relevance between dimensions and words, namely that they consider only the distance between words within sentences.
The sentiment analysis model is a CNN trained on learning data tagged as positive or negative at the phrase level for accurate polarity detection. The individual product scoring phase then applies these pre-prepared models to reviews split into phrases. Multi-dimensional assessment scores are obtained by grouping the phrases judged to describe each dimension and aggregating the results by assessment dimension according to their proportion. In the experiment of this paper, approximately 260,000 reviews of the large category product group were collected to build the dimensioned classification model and the sentiment analysis model. In addition, reviews of laptops from companies S and L sold through E-Commerce were collected and used as experimental data. The dimensioned classification model classified individual product reviews, broken down into phrases, into six assessment dimensions, combining the existing word embedding method with an association analysis indicating frequency between words and dimensions. Combining word embedding and association analysis increased the accuracy of the model by 13.7%. The sentiment analysis model analyzed assessments more closely when trained on phrases rather than on sentences; its accuracy was 29.4% higher than that of the sentence-based model. Through this study, both sellers and consumers can expect more efficient decision making in purchasing and product development, since they can make multi-dimensional comparisons of products. In addition, text reviews, which are unstructured data, were transformed into objective values such as frequency and morphemes and analysed together using word embedding and association analysis, improving the objectivity and precision of the multi-dimensional analysis.
This will be an attractive analysis model, not only because it enables more effective service deployment in an evolving E-Commerce market with fierce competition, but also because it can satisfy both sellers and consumers.
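
The individual product scoring phase can be illustrated with a minimal sketch that aggregates phrase-level results by assessment dimension. It assumes the two models have already labeled each phrase with a dimension and a polarity, and it assumes the score is the share of positive phrases per dimension; the abstract does not give the exact aggregation formula, and `dimension_scores` and the sample labels are hypothetical.

```python
from collections import defaultdict


def dimension_scores(labeled_phrases):
    """Return the share of positive phrases for each assessment dimension."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for dimension, polarity in labeled_phrases:
        totals[dimension] += 1
        if polarity == "positive":
            positives[dimension] += 1
    return {d: positives[d] / totals[d] for d in totals}


# Hypothetical model output: one (dimension, polarity) pair per review phrase.
phrases = [
    ("performance", "positive"),
    ("performance", "positive"),
    ("performance", "negative"),
    ("weight", "negative"),
    ("design", "positive"),
    ("weight", "positive"),
]
scores = dimension_scores(phrases)
```

Computing the same dictionary for two products gives a dimension-by-dimension comparison of the kind the laptop example in the abstract calls for.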

The Empirical Study on the Effect of Technology Exchanges in the Fourth Industrial Revolution between Korea and China: Focused on the Firm Social Network Analysis (한중 4차산업혁명 기술교류 및 효과에 대한 실증연구: 기업 소셜 네트워크 분석 중심으로)

  • Zhou, Zhenxin;Sohn, Kwonsang;Hwang, Yoon Min;Kwon, Ohbyung
    • The Journal of Society for e-Business Studies / v.25 no.3 / pp.41-61 / 2020
  • China's rapid development and commercialization of high technologies in the fourth industrial revolution has made effective technology exchanges between Korean and Chinese firms more important to Korea's mid-term and long-term industrial development. However, there is still a lack of empirical research on how technology exchanges between Korean and Chinese firms proceed and how effective they are. In response, this study conducted a social network analysis on text mining data from news articles on Korea-China business technology exchange and cooperation published from 2018 to March 2020, in order to examine the current status and effects of Korea-China technology exchanges related to the fourth industrial revolution, and conducted a regression analysis of how network centrality affects firm performance. According to the results, most major Korean electronics firms are actively networking with Chinese firms and institutions and show high centrality. Korean telecommunication firms showed high betweenness centrality and subgraph centrality, and Korean Internet service providers and broadcasting content firms showed high eigenvector centrality. In addition, Chinese firms showed higher betweenness centrality than Korean firms, and Chinese service firms showed higher closeness centrality than manufacturing firms. The regression analysis showed that network centrality had a positive effect on firm performance. To the best of our knowledge, this is the first study to analyze the impact of technical cooperation between Korean and Chinese firms in the context of the fourth industrial revolution. This study has theoretical implications in that it suggests a direction for social-network-analysis-based empirical research on global firm cooperation. It also has practical implications in that it offers guidelines for using network analysis when firms or governments set the direction of technical cooperation between Korea and China.
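
Network centrality of firms can be illustrated with a minimal sketch. The study reports betweenness, subgraph, eigenvector, and closeness centrality; the example below computes only degree centrality, the simplest related measure, on an undirected network without self-loops. The firm labels, the edge list, and the name `degree_centrality` are hypothetical.

```python
from collections import defaultdict


def degree_centrality(edges):
    """Degree centrality: number of neighbours divided by (number of nodes - 1)."""
    neighbours = defaultdict(set)
    for a, b in edges:
        # Undirected tie: each firm becomes a neighbour of the other.
        neighbours[a].add(b)
        neighbours[b].add(a)
    n = len(neighbours)
    return {node: len(adj) / (n - 1) for node, adj in neighbours.items()}


# Hypothetical firm pairs mentioned together in technology-exchange news.
edges = [("K1", "C1"), ("K1", "C2"), ("K1", "C3"), ("K2", "C1")]
centrality = degree_centrality(edges)
```

Values computed this way per firm could then serve as the explanatory variable in a regression on firm performance, which is the second step the abstract describes.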

A Study on the Research Trends in Library & Information Science in Korea using Topic Modeling (토픽모델링을 활용한 국내 문헌정보학 연구동향 분석)

  • Park, Ja-Hyun;Song, Min
    • Journal of the Korean Society for information Management / v.30 no.1 / pp.7-32 / 2013
  • The goal of the present study is to identify topic trends in the field of library and information science in Korea. To this end, we collected titles and abstracts of the papers published between 1970 and 2012 in four major journals: Journal of the Korean Society for information Management, Journal of the Korean Society for Library and Information Science, Journal of Korean Library and Information Science Society, and Journal of the Korean BIBLIA Society for library and Information Science. We then applied a well-received topic modeling technique, Latent Dirichlet Allocation (LDA), to the collected data sets. The research findings of the study are as follows. First, comparison of the topics extracted by LDA with the subject headings of library and information science shows that there are several distinct sub-research domains strongly tied with the field. Those include library and society in the domain of "introduction to library and information science," professionalism, library and information policy in the domain of "library system," library evaluation in the domain of "library management," collection development and management, information service in the domain of "library service," services by library type, user training/information literacy, service evaluation, classification/cataloging/meta-data in the domain of "document organization," bibliometrics/digital libraries/user study/internet/expert system/information retrieval/information system in the domain of "information science," antique documents in the domain of "bibliography," books/publications in the domain of "publication," and archival study. The results indicate that, among these sub-domains, information science and library services are the two most focused domains.
Second, we observe a growing trend in research topics such as service and evaluation by library type, internet, and meta-data, whereas research topics such as book, classification, and cataloging show a declining trend. Third, analysis by journal shows that in Journal of the Korean Society for information Management, information science related topics appear more frequently than library science related topics, whereas library science related topics are more popular in the other three journals studied in this paper.

Measuring the Economic Impact of Item Descriptions on Sales Performance (온라인 상품 판매 성과에 영향을 미치는 상품 소개글 효과 측정 기법)

  • Lee, Dongwon;Park, Sung-Hyuk;Moon, Songchun
    • Journal of Intelligence and Information Systems / v.18 no.4 / pp.1-17 / 2012
  • Personalized smart devices such as smartphones and smart pads are widely used. Unlike traditional feature phones, these smart devices allow users to choose a variety of functions, which support not only daily experiences but also business operations. In fact, a huge number of applications are accessible to smart device users in online and mobile application markets. Users can choose apps that fit their own tastes and needs, which is impossible for conventional phone users. With the increase in app demand, the tastes and needs of app users are becoming more diverse. To meet these requirements, numerous apps with diverse functions are being released on the market, which leads to fierce competition. Unlike offline markets, online markets have a limitation in that purchasing decisions must be made without experiencing the items. Therefore, online customers rely more on the item-related information shown on the item page, where online markets commonly provide details about each item. Customers can feel confident about the quality of an item through the online information and decide whether to purchase it. The same is true of online app markets. To win the sales competition against other apps that perform similar functions, app developers need to focus on writing app descriptions that attract the attention of customers. If we can measure the effect of app descriptions on sales without regard to the app's price and quality, app descriptions that facilitate the sale of apps can be identified. This study intends to provide such a quantitative result for app developers who want to promote the sales of their apps. For this purpose, we collected app details, including the descriptions written in Korean, from one of the largest app markets in Korea, and then extracted keywords from the descriptions. Next, the impact of the keywords on sales performance was measured through our econometric model.
Through this analysis, we were able to analyze the impact of each keyword itself, apart from that of the design or quality. The keywords, which comprise the attributes and evaluations of each app, are extracted by a morpheme analyzer. Our model, with the keywords as its input variables, was established to analyze their impact on sales performance. A regression analysis was conducted for each category in which apps are included. This was required because we found that the keywords emphasized in app descriptions differ by category. The analysis, conducted for both free and paid apps, showed which keywords have more impact on sales performance for each type of app. In the analysis of paid apps in the education category, keywords such as 'search+easy' and 'words+abundant' showed higher effectiveness. In the same category, free apps whose keywords emphasize the quality of apps showed higher sales performance. One interesting fact is that keywords describing not only the app but also the need for the app have a significant impact. Language learning apps, regardless of whether they are free or paid, showed higher sales performance when they included the keywords 'foreign language study+important'. This result shows that motivation for the purchase affected sales. While item reviews are widely researched in online markets, item descriptions are not very actively studied. In the case of the mobile app markets, newly introduced apps may not have many item reviews because of the low quantity sold. In such cases, item descriptions can be regarded as more important when customers decide whether to purchase items. This study is the first attempt to quantitatively analyze the relationship between an item description and its impact on sales performance. The results show that our research framework successfully provides a list of the most effective sales key terms with estimates of their effectiveness.
Although this study is performed for a specified type of item (i.e., mobile apps), our model can be applied to almost all of the items traded in online markets.
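
The keyword-based regression described in this abstract can be illustrated with a minimal sketch. It assumes each app is reduced to 0/1 indicators for the presence of each keyword and that a log-sales figure is regressed on those indicators by ordinary least squares within a single category. The abstract does not give the actual econometric specification, so the function names (`fit_ols`, `keyword_effects`), the field names, and the toy numbers below are all hypothetical.

```python
def fit_ols(rows, targets):
    """Ordinary least squares with an intercept, solved via the normal equations."""
    X = [[1.0] + [float(v) for v in row] for row in rows]
    k = len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [
        [sum(x[i] * x[j] for x in X) for j in range(k)]
        + [sum(x[i] * y for x, y in zip(X, targets))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("design matrix is singular (collinear keywords?)")
        A[col], A[pivot] = A[pivot], A[col]
        pivot_value = A[col][col]
        A[col] = [v / pivot_value for v in A[col]]
        for r in range(k):
            if r != col:
                factor = A[r][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] for i in range(k)]


def keyword_effects(apps, keywords):
    """Regress log sales on keyword-presence dummies for one app category."""
    rows = [[1 if kw in app["keywords"] else 0 for kw in keywords] for app in apps]
    targets = [app["log_sales"] for app in apps]
    coefficients = fit_ols(rows, targets)
    # coefficients[0] is the intercept; the rest align with `keywords`
    return dict(zip(keywords, coefficients[1:]))


# Toy data with made-up numbers (hypothetical; not taken from the study).
TOY_APPS = [
    {"keywords": set(), "log_sales": 1.0},
    {"keywords": {"search+easy"}, "log_sales": 3.0},
    {"keywords": {"words+abundant"}, "log_sales": 1.5},
    {"keywords": {"search+easy", "words+abundant"}, "log_sales": 3.5},
]

effects = keyword_effects(TOY_APPS, ["search+easy", "words+abundant"])
for keyword, estimate in effects.items():
    print(keyword, round(estimate, 3))
```

In a setting like the one described, such a fit would be repeated per category and separately for free and paid apps, and controls such as price would be added as further columns.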

A Study on Automatic Classification Model of Documents Based on Korean Standard Industrial Classification (한국표준산업분류를 기준으로 한 문서의 자동 분류 모델에 관한 연구)

  • Lee, Jae-Seong;Jun, Seung-Pyo;Yoo, Hyoung Sun
    • Journal of Intelligence and Information Systems
    • /
    • v.24 no.3
    • /
    • pp.221-241
    • /
    • 2018
  • As we enter the knowledge society, the importance of information as a new form of capital is being emphasized. Information classification is also becoming more important for the efficient management of exponentially growing digital information. In this study, we tried to automatically classify and provide tailored information that can help companies make decisions on technology commercialization. To that end, we propose a method to classify information based on the Korean Standard Industrial Classification (KSIC), which indicates the business characteristics of enterprises. The classification of information or documents has largely been based on machine learning, but there is not enough training data categorized on the basis of KSIC. Therefore, this study applied a method of calculating similarity between documents. Specifically, we propose a method and a model that present the most appropriate KSIC code by collecting the explanatory texts of each KSIC code and calculating their similarity to the document to be classified using the vector space model. IPC data were collected and classified by KSIC, and the methodology was then verified by comparing the results with the KSIC-IPC concordance table provided by the Korean Intellectual Property Office. In the verification, the highest agreement was obtained when the LT method, a variant of the TF-IDF calculation formula, was applied. With this method, the top-ranked KSIC code matched in 53% of cases, and the cumulative match rate within the top five ranks was 76%. This confirms that the technology, industry, and market information that SMEs need can be classified by KSIC more quantitatively and objectively. In addition, the methods and results of this study can serve as basic data to support the qualitative judgment of experts in creating a linkage table between heterogeneous classification systems.
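
The similarity-based classification described in this abstract can be sketched roughly as follows. The sketch assumes that "LT" means log-scaled term frequency multiplied by inverse document frequency; the abstract does not spell out the formula, so this reading is an assumption. The `+ 1.0` smoothing on the IDF, the whitespace tokenizer, the code labels, and the description texts are also hypothetical stand-ins, not real KSIC entries.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer; a Korean corpus would need a morpheme analyzer.
    return text.lower().split()


def build_idf(token_lists):
    """Inverse document frequency over the code descriptions (smoothed with +1)."""
    n_docs = len(token_lists)
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))
    return {term: math.log(n_docs / df) + 1.0 for term, df in doc_freq.items()}


def weight_vector(tokens, idf):
    """Assumed 'LT' weighting: (1 + log tf) * idf; out-of-vocabulary terms are dropped."""
    counts = Counter(t for t in tokens if t in idf)
    return {t: (1.0 + math.log(tf)) * idf[t] for t, tf in counts.items()}


def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_codes(document, code_descriptions, top_k=5):
    """Return up to top_k (code, similarity) pairs, most similar first."""
    tokenized = {code: tokenize(text) for code, text in code_descriptions.items()}
    idf = build_idf(list(tokenized.values()))
    doc_vec = weight_vector(tokenize(document), idf)
    scored = [
        (code, cosine(doc_vec, weight_vector(tokens, idf)))
        for code, tokens in tokenized.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical code labels and descriptions (not real KSIC explanatory texts).
CODE_DESCRIPTIONS = {
    "code_software": "software development and data processing services",
    "code_food": "manufacture of food products and beverages",
    "code_metal": "manufacture of basic metal products",
}

ranked = rank_codes("software for data processing systems", CODE_DESCRIPTIONS)
for code, score in ranked:
    print(code, round(score, 3))
```

Returning a ranked list rather than a single code mirrors the way the abstract reports agreement both for the first rank and cumulatively through the fifth rank.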