• Title/Summary/Keyword: Text Mining Analysis

Search Result 1,217, Processing Time 0.028 seconds

Improving Performance of Recommendation Systems Using Topic Modeling (사용자 관심 이슈 분석을 통한 추천시스템 성능 향상 방안)

  • Choi, Seongi;Hyun, Yoonjin;Kim, Namgyu
    • Journal of Intelligence and Information Systems
    • /
    • v.21 no.3
    • /
    • pp.101-116
    • /
    • 2015
  • Recently, due to the development of smart devices and social media, vast amounts of information with the various forms were accumulated. Particularly, considerable research efforts are being directed towards analyzing unstructured big data to resolve various social problems. Accordingly, focus of data-driven decision-making is being moved from structured data analysis to unstructured one. Also, in the field of recommendation system, which is the typical area of data-driven decision-making, the need of using unstructured data has been steadily increased to improve system performance. Approaches to improve the performance of recommendation systems can be found in two aspects- improving algorithms and acquiring useful data with high quality. Traditionally, most efforts to improve the performance of recommendation system were made by the former approach, while the latter approach has not attracted much attention relatively. In this sense, efforts to utilize unstructured data from variable sources are very timely and necessary. Particularly, as the interests of users are directly connected with their needs, identifying the interests of the user through unstructured big data analysis can be a crew for improving performance of recommendation systems. In this sense, this study proposes the methodology of improving recommendation system by measuring interests of the user. Specially, this study proposes the method to quantify interests of the user by analyzing user's internet usage patterns, and to predict user's repurchase based upon the discovered preferences. There are two important modules in this study. The first module predicts repurchase probability of each category through analyzing users' purchase history. We include the first module to our research scope for comparing the accuracy of traditional purchase-based prediction model to our new model presented in the second module. This procedure extracts purchase history of users. The core part of our methodology is in the second module. This module extracts users' interests by analyzing news articles the users have read. The second module constructs a correspondence matrix between topics and news articles by performing topic modeling on real world news articles. And then, the module analyzes users' news access patterns and then constructs a correspondence matrix between articles and users. After that, by merging the results of the previous processes in the second module, we can obtain a correspondence matrix between users and topics. This matrix describes users' interests in a structured manner. Finally, by using the matrix, the second module builds a model for predicting repurchase probability of each category. In this paper, we also provide experimental results of our performance evaluation. The outline of data used our experiments is as follows. We acquired web transaction data of 5,000 panels from a company that is specialized to analyzing ranks of internet sites. At first we extracted 15,000 URLs of news articles published from July 2012 to June 2013 from the original data and we crawled main contents of the news articles. After that we selected 2,615 users who have read at least one of the extracted news articles. Among the 2,615 users, we discovered that the number of target users who purchase at least one items from our target shopping mall 'G' is 359. In the experiments, we analyzed purchase history and news access records of the 359 internet users. From the performance evaluation, we found that our prediction model using both users' interests and purchase history outperforms a prediction model using only users' purchase history from a view point of misclassification ratio. In detail, our model outperformed the traditional one in appliance, beauty, computer, culture, digital, fashion, and sports categories when artificial neural network based models were used. Similarly, our model outperformed the traditional one in beauty, computer, digital, fashion, food, and furniture categories when decision tree based models were used although the improvement is very small.

Influence analysis of Internet buzz to corporate performance : Individual stock price prediction using sentiment analysis of online news (온라인 언급이 기업 성과에 미치는 영향 분석 : 뉴스 감성분석을 통한 기업별 주가 예측)

  • Jeong, Ji Seon;Kim, Dong Sung;Kim, Jong Woo
    • Journal of Intelligence and Information Systems
    • /
    • v.21 no.4
    • /
    • pp.37-51
    • /
    • 2015
  • Due to the development of internet technology and the rapid increase of internet data, various studies are actively conducted on how to use and analyze internet data for various purposes. In particular, in recent years, a number of studies have been performed on the applications of text mining techniques in order to overcome the limitations of the current application of structured data. Especially, there are various studies on sentimental analysis to score opinions based on the distribution of polarity such as positivity or negativity of vocabularies or sentences of the texts in documents. As a part of such studies, this study tries to predict ups and downs of stock prices of companies by performing sentimental analysis on news contexts of the particular companies in the Internet. A variety of news on companies is produced online by different economic agents, and it is diffused quickly and accessed easily in the Internet. So, based on inefficient market hypothesis, we can expect that news information of an individual company can be used to predict the fluctuations of stock prices of the company if we apply proper data analysis techniques. However, as the areas of corporate management activity are different, an analysis considering characteristics of each company is required in the analysis of text data based on machine-learning. In addition, since the news including positive or negative information on certain companies have various impacts on other companies or industry fields, an analysis for the prediction of the stock price of each company is necessary. Therefore, this study attempted to predict changes in the stock prices of the individual companies that applied a sentimental analysis of the online news data. Accordingly, this study chose top company in KOSPI 200 as the subjects of the analysis, and collected and analyzed online news data by each company produced for two years on a representative domestic search portal service, Naver. In addition, considering the differences in the meanings of vocabularies for each of the certain economic subjects, it aims to improve performance by building up a lexicon for each individual company and applying that to an analysis. As a result of the analysis, the accuracy of the prediction by each company are different, and the prediction accurate rate turned out to be 56% on average. Comparing the accuracy of the prediction of stock prices on industry sectors, 'energy/chemical', 'consumer goods for living' and 'consumer discretionary' showed a relatively higher accuracy of the prediction of stock prices than other industries, while it was found that the sectors such as 'information technology' and 'shipbuilding/transportation' industry had lower accuracy of prediction. The number of the representative companies in each industry collected was five each, so it is somewhat difficult to generalize, but it could be confirmed that there was a difference in the accuracy of the prediction of stock prices depending on industry sectors. In addition, at the individual company level, the companies such as 'Kangwon Land', 'KT & G' and 'SK Innovation' showed a relatively higher prediction accuracy as compared to other companies, while it showed that the companies such as 'Young Poong', 'LG', 'Samsung Life Insurance', and 'Doosan' had a low prediction accuracy of less than 50%. In this paper, we performed an analysis of the share price performance relative to the prediction of individual companies through the vocabulary of pre-built company to take advantage of the online news information. In this paper, we aim to improve performance of the stock prices prediction, applying online news information, through the stock price prediction of individual companies. Based on this, in the future, it will be possible to find ways to increase the stock price prediction accuracy by complementing the problem of unnecessary words that are added to the sentiment dictionary.

A Study on Analyzing Sentiments on Movie Reviews by Multi-Level Sentiment Classifier (영화 리뷰 감성분석을 위한 텍스트 마이닝 기반 감성 분류기 구축)

  • Kim, Yuyoung;Song, Min
    • Journal of Intelligence and Information Systems
    • /
    • v.22 no.3
    • /
    • pp.71-89
    • /
    • 2016
  • Sentiment analysis is used for identifying emotions or sentiments embedded in the user generated data such as customer reviews from blogs, social network services, and so on. Various research fields such as computer science and business management can take advantage of this feature to analyze customer-generated opinions. In previous studies, the star rating of a review is regarded as the same as sentiment embedded in the text. However, it does not always correspond to the sentiment polarity. Due to this supposition, previous studies have some limitations in their accuracy. To solve this issue, the present study uses a supervised sentiment classification model to measure a more accurate sentiment polarity. This study aims to propose an advanced sentiment classifier and to discover the correlation between movie reviews and box-office success. The advanced sentiment classifier is based on two supervised machine learning techniques, the Support Vector Machines (SVM) and Feedforward Neural Network (FNN). The sentiment scores of the movie reviews are measured by the sentiment classifier and are analyzed by statistical correlations between movie reviews and box-office success. Movie reviews are collected along with a star-rate. The dataset used in this study consists of 1,258,538 reviews from 175 films gathered from Naver Movie website (movie.naver.com). The results show that the proposed sentiment classifier outperforms Naive Bayes (NB) classifier as its accuracy is about 6% higher than NB. Furthermore, the results indicate that there are positive correlations between the star-rate and the number of audiences, which can be regarded as the box-office success of a movie. The study also shows that there is the mild, positive correlation between the sentiment scores estimated by the classifier and the number of audiences. To verify the applicability of the sentiment scores, an independent sample t-test was conducted. For this, the movies were divided into two groups using the average of sentiment scores. The two groups are significantly different in terms of the star-rated scores.

Comparative Analysis of Korean and Japanese Textbooks on World Geography: Focused on the Contents of Global Education (한.일 고등학교 세계지리 교과서 내용 비교 분석 -국제이해교육의 관련 내용을 중심으로-)

  • Yang, Won-Taek
    • Journal of the Korean association of regional geographers
    • /
    • v.2 no.2
    • /
    • pp.75-92
    • /
    • 1996
  • Geography education is one of the best ways to improve the understanding of other countries. By analyzing Korean and Japanese textbooks on world geography, I tried to find out how well they explain the other country and to set forth guiding principles for geography education. To achieve these aims, weight analysis are used. The major findings in this study can be summarised as follow. The contents of Korean and Japanese geography textbooks were analyzed deviding into 2 major topics, 6 minor topics, and 20 key concepts. (1) By analyzing Korean geography textbook of the 5th curriculum the weight percentages which had been given to each minor topics were found. They are as follow: resource problem(57.7%), human right problem(21.4%), population problem (9.0%), mutual dependence(6.0%), environmental problem(3.3%), international competition(2.6%). (2) By analyzing Korean geography text-book of the 6th curriculum the weight percentages which had been give to each minor topics were found. They are as follow: resource problem(42.7%), human right problem(21.7%), mutual dependence (20.9%), environmental problem(7.7%), population problem(4.6%), international competition(2.4%) (3) By analyzing Japanise geography text-book of 5th curriculum ammendment the weight percentages which had been give to each minor topics were found. They are as follows: resource problem(49.9%) human right problem(21.7%), mutual dependence(15.5%), population problem (7.1%), international competition(6.2%), environmental problem(3.8%) (4) By analyzing Japanise geography textbook of 6th curriculum ammendment the weight percentages which had been give to each minor topics were found. They are as follows human right problem (31.6%), mutual dependence(22.8%), resource problem(20.7%), population problem(12.7%), environmental problem(8.6%), international competition(3.6%). We can see that in the field of dependence Korea and Japan put the similar weight but in the field of common problem they put the fairly different weight. It can be viewed as the difference of curriculum. That is to say Korea used both the systematic method on the basis of unit but Japan used only topical method on the basis of unit. Therefore Korean geography textbook introduce agriculture, forestry, fishery, mining industry and manufacturing industry. Japanese textbook, however gives a detailed account about residents' lives in specific area. For that reason in Korean textbook, resource was stressed, while in Japanese textbook, culture was stressed.

  • PDF

Multi-Dimensional Analysis Method of Product Reviews for Market Insight (마켓 인사이트를 위한 상품 리뷰의 다차원 분석 방안)

  • Park, Jeong Hyun;Lee, Seo Ho;Lim, Gyu Jin;Yeo, Un Yeong;Kim, Jong Woo
    • Journal of Intelligence and Information Systems
    • /
    • v.26 no.2
    • /
    • pp.57-78
    • /
    • 2020
  • With the development of the Internet, consumers have had an opportunity to check product information easily through E-Commerce. Product reviews used in the process of purchasing goods are based on user experience, allowing consumers to engage as producers of information as well as refer to information. This can be a way to increase the efficiency of purchasing decisions from the perspective of consumers, and from the seller's point of view, it can help develop products and strengthen their competitiveness. However, it takes a lot of time and effort to understand the overall assessment and assessment dimensions of the products that I think are important in reading the vast amount of product reviews offered by E-Commerce for the products consumers want to compare. This is because product reviews are unstructured information and it is difficult to read sentiment of reviews and assessment dimension immediately. For example, consumers who want to purchase a laptop would like to check the assessment of comparative products at each dimension, such as performance, weight, delivery, speed, and design. Therefore, in this paper, we would like to propose a method to automatically generate multi-dimensional product assessment scores in product reviews that we would like to compare. The methods presented in this study consist largely of two phases. One is the pre-preparation phase and the second is the individual product scoring phase. In the pre-preparation phase, a dimensioned classification model and a sentiment analysis model are created based on a review of the large category product group review. By combining word embedding and association analysis, the dimensioned classification model complements the limitation that word embedding methods for finding relevance between dimensions and words in existing studies see only the distance of words in sentences. Sentiment analysis models generate CNN models by organizing learning data tagged with positives and negatives on a phrase unit for accurate polarity detection. Through this, the individual product scoring phase applies the models pre-prepared for the phrase unit review. Multi-dimensional assessment scores can be obtained by aggregating them by assessment dimension according to the proportion of reviews organized like this, which are grouped among those that are judged to describe a specific dimension for each phrase. In the experiment of this paper, approximately 260,000 reviews of the large category product group are collected to form a dimensioned classification model and a sentiment analysis model. In addition, reviews of the laptops of S and L companies selling at E-Commerce are collected and used as experimental data, respectively. The dimensioned classification model classified individual product reviews broken down into phrases into six assessment dimensions and combined the existing word embedding method with an association analysis indicating frequency between words and dimensions. As a result of combining word embedding and association analysis, the accuracy of the model increased by 13.7%. The sentiment analysis models could be seen to closely analyze the assessment when they were taught in a phrase unit rather than in sentences. As a result, it was confirmed that the accuracy was 29.4% higher than the sentence-based model. Through this study, both sellers and consumers can expect efficient decision making in purchasing and product development, given that they can make multi-dimensional comparisons of products. In addition, text reviews, which are unstructured data, were transformed into objective values such as frequency and morpheme, and they were analysed together using word embedding and association analysis to improve the objectivity aspects of more precise multi-dimensional analysis and research. This will be an attractive analysis model in terms of not only enabling more effective service deployment during the evolving E-Commerce market and fierce competition, but also satisfying both customers.

Mapping Categories of Heterogeneous Sources Using Text Analytics (텍스트 분석을 통한 이종 매체 카테고리 다중 매핑 방법론)

  • Kim, Dasom;Kim, Namgyu
    • Journal of Intelligence and Information Systems
    • /
    • v.22 no.4
    • /
    • pp.193-215
    • /
    • 2016
  • In recent years, the proliferation of diverse social networking services has led users to use many mediums simultaneously depending on their individual purpose and taste. Besides, while collecting information about particular themes, they usually employ various mediums such as social networking services, Internet news, and blogs. However, in terms of management, each document circulated through diverse mediums is placed in different categories on the basis of each source's policy and standards, hindering any attempt to conduct research on a specific category across different kinds of sources. For example, documents containing content on "Application for a foreign travel" can be classified into "Information Technology," "Travel," or "Life and Culture" according to the peculiar standard of each source. Likewise, with different viewpoints of definition and levels of specification for each source, similar categories can be named and structured differently in accordance with each source. To overcome these limitations, this study proposes a plan for conducting category mapping between different sources with various mediums while maintaining the existing category system of the medium as it is. Specifically, by re-classifying individual documents from the viewpoint of diverse sources and storing the result of such a classification as extra attributes, this study proposes a logical layer by which users can search for a specific document from multiple heterogeneous sources with different category names as if they belong to the same source. Besides, by collecting 6,000 articles of news from two Internet news portals, experiments were conducted to compare accuracy among sources, supervised learning and semi-supervised learning, and homogeneous and heterogeneous learning data. It is particularly interesting that in some categories, classifying accuracy of semi-supervised learning using heterogeneous learning data proved to be higher than that of supervised learning and semi-supervised learning, which used homogeneous learning data. This study has the following significances. First, it proposes a logical plan for establishing a system to integrate and manage all the heterogeneous mediums in different classifying systems while maintaining the existing physical classifying system as it is. This study's results particularly exhibit very different classifying accuracies in accordance with the heterogeneity of learning data; this is expected to spur further studies for enhancing the performance of the proposed methodology through the analysis of characteristics by category. In addition, with an increasing demand for search, collection, and analysis of documents from diverse mediums, the scope of the Internet search is not restricted to one medium. However, since each medium has a different categorical structure and name, it is actually very difficult to search for a specific category insofar as encompassing heterogeneous mediums. The proposed methodology is also significant for presenting a plan that enquires into all the documents regarding the standards of the relevant sites' categorical classification when the users select the desired site, while maintaining the existing site's characteristics and structure as it is. This study's proposed methodology needs to be further complemented in the following aspects. First, though only an indirect comparison and evaluation was made on the performance of this proposed methodology, future studies would need to conduct more direct tests on its accuracy. That is, after re-classifying documents of the object source on the basis of the categorical system of the existing source, the extent to which the classification was accurate needs to be verified through evaluation by actual users. In addition, the accuracy in classification needs to be increased by making the methodology more sophisticated. Furthermore, an understanding is required that the characteristics of some categories that showed a rather higher classifying accuracy of heterogeneous semi-supervised learning than that of supervised learning might assist in obtaining heterogeneous documents from diverse mediums and seeking plans that enhance the accuracy of document classification through its usage.

Content Analysis of Food and Nutrition unit in Middle School Textbooks of Home Economics - Focus on the National Curriculums from 1st to 2009 revised (중학교 가정(기술·가정)교과 식생활 영역의 핵심 교육내용 분석 - 제1차 교육과정부터 2009개정 교육과정의 교과서 내용을 중심으로 -)

  • Jang, Yoon-Mi;Kim, Yoo Kyeong
    • Journal of Korean Home Economics Education Association
    • /
    • v.30 no.4
    • /
    • pp.93-112
    • /
    • 2018
  • We analysed the textbooks of Home Economics in middle school from 1st to 2009 curriculums to investigate the contents and the portion of Food and Nutrition section. The key words were generated by word cloud technique using text-mining, and the portion of Food and Nutrition section was presented as a ratio of the pages. The core key words of Food and Nutrition section through the curriculums were 'raw food'·'food'·'diet'. In 1st and 2nd curriculums, the main key words were related to food materials, condiments and nutrients such as 'vitamin'·'protein'. The words such as 'nutrition'·'eating'·'requirement' were newly appeared in 3rd, 'portion' in 6th, and 'diet'·'adolescence' in 7th curriculum. The mean ratio of Food and Nutrition section in Home Economics was 24.3%. While the portion was as high as 31.8% in 7th it was strikingly reduced to 15.2% in 2009th. curriculum. Besides, Food and Nutrition section was composed of 10 units of middle level category during the 2nd and 3rd curriculums, and was reduced to 2 small units with none of middle level category in 2009th curriculum. Although the contents of Food and Nutrition section has been developed and adapted to the needs of the society through the curriculums, the portion of Food and Nutrition section in Home Economics has been reduced especially in 2009th curriculum, which could raise concerns on the health of individuals and communities.

Consumers Perceptions on Sodium Saccharin in Social Media (소셜미디어 분석을 통한 삭카린나트륨 소비자 인식 조사)

  • Lee, Sooyeon;Lee, Wonsung;Moon, Il-Chul;Kwon, Hoonjeong
    • Journal of Food Hygiene and Safety
    • /
    • v.30 no.4
    • /
    • pp.329-342
    • /
    • 2015
  • The purpose of this study was to investigate consumers' perceptions of sodium saccharin in social media. Data was collected from Naver blogs and Naver web communities (Korean representative portal web-site), and media reports including comment sections on a Yonhap news website (Korean largest news agency). The results from Naver blogs and Naver web communities showed that it was primarily mentioned 'sodium saccharin-no added' products, properties of sodium saccharin, and methods of reducing sodium saccharin in food. When media reported the expansion of food categories permitted to use sodium saccharin, search volume for sodium saccharin has increased in both PC and mobile search engines. Also, it was mainly commented about distrust of government, criticism of food product price, and distrust of food companies below the news on the news site. The label of sodium saccharin-no added products in market emphasized "no added-sodium saccharin". These results suggest that consumers are interested in sodium saccharin and especially when media reported the expansion of food categories permitted to use it. Consumers were able to search various information on sodium saccharin except safety or acceptable daily intake through social media. Therefore media or competent authority should report item on sodium saccharin with information including safety or acceptable daily intake based on scientific background and reference or experts' interview for consumers to get reliable information.

A Study on the Research Trends in Fintech using Topic Modeling (토픽 모델링을 이용한 핀테크 기술 동향 분석)

  • Kim, TaeKyung;Choi, HoeRyeon;Lee, HongChul
    • Journal of the Korea Academia-Industrial cooperation Society
    • /
    • v.17 no.11
    • /
    • pp.670-681
    • /
    • 2016
  • Recently, based on Internet and mobile environments, the Fintech industry that fuses finance and IT together has been rapidly growing and Fintech services armed with simplicity and convenience have been leading the conversion of all financial services into online and mobile services. However, despite the rapid growth of the Fintech industry, few studies have classified Fintech technologies into detailed technologies, analyzed the technology development trends of major market countries, and supported technology planning. In this respect, using Fintech technological data in the form of unstructured data, the present study extracts and defines detailed Fintech technologies through the topic modeling technique. Thereafter, hot and cold topics of the derived detailed Fintech technologies are identified to determine the trend of Fintech technologies. In addition, the trends of technology development in the USA, South Korea, and China, which are major market countries for major Fintech industrial technologies, are analyzed. Finally, through the analyses of networks between detailed Fintech technologies, linkages between the technologies are examined. The trends of Fintech industrial technologies identified in the present study are expected to be effectively utilized for the establishment of policies in the area of the Fintech industry and Fintech related enterprises' establishment of technology strategies.

Web Site Keyword Selection Method by Considering Semantic Similarity Based on Word2Vec (Word2Vec 기반의 의미적 유사도를 고려한 웹사이트 키워드 선택 기법)

  • Lee, Donghun;Kim, Kwanho
    • The Journal of Society for e-Business Studies
    • /
    • v.23 no.2
    • /
    • pp.83-96
    • /
    • 2018
  • Extracting keywords representing documents is very important because it can be used for automated services such as document search, classification, recommendation system as well as quickly transmitting document information. However, when extracting keywords based on the frequency of words appearing in a web site documents and graph algorithms based on the co-occurrence of words, the problem of containing various words that are not related to the topic potentially in the web page structure, There is a difficulty in extracting the semantic keyword due to the limit of the performance of the Korean tokenizer. In this paper, we propose a method to select candidate keywords based on semantic similarity, and solve the problem that semantic keyword can not be extracted and the accuracy of Korean tokenizer analysis is poor. Finally, we use the technique of extracting final semantic keywords through filtering process to remove inconsistent keywords. Experimental results through real web pages of small business show that the performance of the proposed method is improved by 34.52% over the statistical similarity based keyword selection technique. Therefore, it is confirmed that the performance of extracting keywords from documents is improved by considering semantic similarity between words and removing inconsistent keywords.