• Title/Summary/Keyword: Word-Prediction

Search Result 115, Processing Time 0.022 seconds

A Study on Knowledge Entity Extraction Method for Individual Stocks Based on Neural Tensor Network (뉴럴 텐서 네트워크 기반 주식 개별종목 지식개체명 추출 방법에 관한 연구)

  • Yang, Yunseok;Lee, Hyun Jun;Oh, Kyong Joo
    • Journal of Intelligence and Information Systems
    • /
    • v.25 no.2
    • /
    • pp.25-38
    • /
    • 2019
  • Selecting high-quality information that meets the interests and needs of users among the overflowing contents is becoming more important as the generation continues. In the flood of information, efforts to reflect the intention of the user in the search result better are being tried, rather than recognizing the information request as a simple string. Also, large IT companies such as Google and Microsoft focus on developing knowledge-based technologies including search engines which provide users with satisfaction and convenience. Especially, the finance is one of the fields expected to have the usefulness and potential of text data analysis because it's constantly generating new information, and the earlier the information is, the more valuable it is. Automatic knowledge extraction can be effective in areas where information flow is vast, such as financial sector, and new information continues to emerge. However, there are several practical difficulties faced by automatic knowledge extraction. First, there are difficulties in making corpus from different fields with same algorithm, and it is difficult to extract good quality triple. Second, it becomes more difficult to produce labeled text data by people if the extent and scope of knowledge increases and patterns are constantly updated. Third, performance evaluation is difficult due to the characteristics of unsupervised learning. Finally, problem definition for automatic knowledge extraction is not easy because of ambiguous conceptual characteristics of knowledge. So, in order to overcome limits described above and improve the semantic performance of stock-related information searching, this study attempts to extract the knowledge entity by using neural tensor network and evaluate the performance of them. Different from other references, the purpose of this study is to extract knowledge entity which is related to individual stock items. Various but relatively simple data processing methods are applied in the presented model to solve the problems of previous researches and to enhance the effectiveness of the model. From these processes, this study has the following three significances. First, A practical and simple automatic knowledge extraction method that can be applied. Second, the possibility of performance evaluation is presented through simple problem definition. Finally, the expressiveness of the knowledge increased by generating input data on a sentence basis without complex morphological analysis. The results of the empirical analysis and objective performance evaluation method are also presented. The empirical study to confirm the usefulness of the presented model, experts' reports about individual 30 stocks which are top 30 items based on frequency of publication from May 30, 2017 to May 21, 2018 are used. the total number of reports are 5,600, and 3,074 reports, which accounts about 55% of the total, is designated as a training set, and other 45% of reports are designated as a testing set. Before constructing the model, all reports of a training set are classified by stocks, and their entities are extracted using named entity recognition tool which is the KKMA. for each stocks, top 100 entities based on appearance frequency are selected, and become vectorized using one-hot encoding. After that, by using neural tensor network, the same number of score functions as stocks are trained. Thus, if a new entity from a testing set appears, we can try to calculate the score by putting it into every single score function, and the stock of the function with the highest score is predicted as the related item with the entity. To evaluate presented models, we confirm prediction power and determining whether the score functions are well constructed by calculating hit ratio for all reports of testing set. As a result of the empirical study, the presented model shows 69.3% hit accuracy for testing set which consists of 2,526 reports. this hit ratio is meaningfully high despite of some constraints for conducting research. Looking at the prediction performance of the model for each stocks, only 3 stocks, which are LG ELECTRONICS, KiaMtr, and Mando, show extremely low performance than average. this result maybe due to the interference effect with other similar items and generation of new knowledge. In this paper, we propose a methodology to find out key entities or their combinations which are necessary to search related information in accordance with the user's investment intention. Graph data is generated by using only the named entity recognition tool and applied to the neural tensor network without learning corpus or word vectors for the field. From the empirical test, we confirm the effectiveness of the presented model as described above. However, there also exist some limits and things to complement. Representatively, the phenomenon that the model performance is especially bad for only some stocks shows the need for further researches. Finally, through the empirical study, we confirmed that the learning method presented in this study can be used for the purpose of matching the new text information semantically with the related stocks.

The Recognition Characteristics of Science Gifted Students on the Earth System based on their Thinking Style (과학 영재 학생들의 사고양식에 따른 지구시스템에 대한 인지 특성)

  • Lee, Hyonyong;Kim, Seung-Hwan
    • Journal of Science Education
    • /
    • v.33 no.1
    • /
    • pp.12-30
    • /
    • 2009
  • The purpose of this study was to analyze recognition characteristics of science gifted students on the earth system based on their thinking style. The subjects were 24 science gifted students at the Science Institute for Gifted Students of a university located in metropolitan city in Korea. The students' thinking styles were firstly examined on the basis of the Sternberg's theory of mental self-government. And then, the students were divided into two groups: Type I group(legislative, judicial, global, liberal) and Type II group(executive, local, conservative) based on Sternberg's theory. Data was collected from three different type of questionnaires(A, B, C types), interview, word association method, drawing analyses, concept map, hidden dimension inventory, and in-depth interviews. The findings of analysis indicated that their thinking styles were characterized by 'Legislative', 'Executive', 'Anarchic', 'Global', 'External', 'Liberal' styles. Their preference were conducting new projects and using creative problem solving processes. The results of students' recognition characteristics on earth system were as follows: First, though the two groups' quantitative value on 'System Understanding' was very similar, there were considerable distinctions in details. Second, 'Understanding the Relationship in the System' was closely connected to thinking styles. Type I group was more advantageous with multiple, dynamic, and recursive approach. Third, in the relation to 'System Generalization' both of the groups had similar simple interpretational ability of the system, but Type I group was better on generalization when 'hidden dimension inventory' factor was added. On the system prediction factor, however, students' ability was weak regardless of the type. Consequently, more specific development strategies on various objects are needed for the development and application of the system learning program. Furthermore, it is expected that this study could be practically and effectively used on various fields related to system recognition.

  • PDF

Visualizing the Results of Opinion Mining from Social Media Contents: Case Study of a Noodle Company (소셜미디어 콘텐츠의 오피니언 마이닝결과 시각화: N라면 사례 분석 연구)

  • Kim, Yoosin;Kwon, Do Young;Jeong, Seung Ryul
    • Journal of Intelligence and Information Systems
    • /
    • v.20 no.4
    • /
    • pp.89-105
    • /
    • 2014
  • After emergence of Internet, social media with highly interactive Web 2.0 applications has provided very user friendly means for consumers and companies to communicate with each other. Users have routinely published contents involving their opinions and interests in social media such as blogs, forums, chatting rooms, and discussion boards, and the contents are released real-time in the Internet. For that reason, many researchers and marketers regard social media contents as the source of information for business analytics to develop business insights, and many studies have reported results on mining business intelligence from Social media content. In particular, opinion mining and sentiment analysis, as a technique to extract, classify, understand, and assess the opinions implicit in text contents, are frequently applied into social media content analysis because it emphasizes determining sentiment polarity and extracting authors' opinions. A number of frameworks, methods, techniques and tools have been presented by these researchers. However, we have found some weaknesses from their methods which are often technically complicated and are not sufficiently user-friendly for helping business decisions and planning. In this study, we attempted to formulate a more comprehensive and practical approach to conduct opinion mining with visual deliverables. First, we described the entire cycle of practical opinion mining using Social media content from the initial data gathering stage to the final presentation session. Our proposed approach to opinion mining consists of four phases: collecting, qualifying, analyzing, and visualizing. In the first phase, analysts have to choose target social media. Each target media requires different ways for analysts to gain access. There are open-API, searching tools, DB2DB interface, purchasing contents, and so son. Second phase is pre-processing to generate useful materials for meaningful analysis. If we do not remove garbage data, results of social media analysis will not provide meaningful and useful business insights. To clean social media data, natural language processing techniques should be applied. The next step is the opinion mining phase where the cleansed social media content set is to be analyzed. The qualified data set includes not only user-generated contents but also content identification information such as creation date, author name, user id, content id, hit counts, review or reply, favorite, etc. Depending on the purpose of the analysis, researchers or data analysts can select a suitable mining tool. Topic extraction and buzz analysis are usually related to market trends analysis, while sentiment analysis is utilized to conduct reputation analysis. There are also various applications, such as stock prediction, product recommendation, sales forecasting, and so on. The last phase is visualization and presentation of analysis results. The major focus and purpose of this phase are to explain results of analysis and help users to comprehend its meaning. Therefore, to the extent possible, deliverables from this phase should be made simple, clear and easy to understand, rather than complex and flashy. To illustrate our approach, we conducted a case study on a leading Korean instant noodle company. We targeted the leading company, NS Food, with 66.5% of market share; the firm has kept No. 1 position in the Korean "Ramen" business for several decades. We collected a total of 11,869 pieces of contents including blogs, forum contents and news articles. After collecting social media content data, we generated instant noodle business specific language resources for data manipulation and analysis using natural language processing. In addition, we tried to classify contents in more detail categories such as marketing features, environment, reputation, etc. In those phase, we used free ware software programs such as TM, KoNLP, ggplot2 and plyr packages in R project. As the result, we presented several useful visualization outputs like domain specific lexicons, volume and sentiment graphs, topic word cloud, heat maps, valence tree map, and other visualized images to provide vivid, full-colored examples using open library software packages of the R project. Business actors can quickly detect areas by a swift glance that are weak, strong, positive, negative, quiet or loud. Heat map is able to explain movement of sentiment or volume in categories and time matrix which shows density of color on time periods. Valence tree map, one of the most comprehensive and holistic visualization models, should be very helpful for analysts and decision makers to quickly understand the "big picture" business situation with a hierarchical structure since tree-map can present buzz volume and sentiment with a visualized result in a certain period. This case study offers real-world business insights from market sensing which would demonstrate to practical-minded business users how they can use these types of results for timely decision making in response to on-going changes in the market. We believe our approach can provide practical and reliable guide to opinion mining with visualized results that are immediately useful, not just in food industry but in other industries as well.

A Study on Recent Research Trend in Management of Technology Using Keywords Network Analysis (키워드 네트워크 분석을 통해 살펴본 기술경영의 최근 연구동향)

  • Kho, Jaechang;Cho, Kuentae;Cho, Yoonho
    • Journal of Intelligence and Information Systems
    • /
    • v.19 no.2
    • /
    • pp.101-123
    • /
    • 2013
  • Recently due to the advancements of science and information technology, the socio-economic business areas are changing from the industrial economy to a knowledge economy. Furthermore, companies need to do creation of new value through continuous innovation, development of core competencies and technologies, and technological convergence. Therefore, the identification of major trends in technology research and the interdisciplinary knowledge-based prediction of integrated technologies and promising techniques are required for firms to gain and sustain competitive advantage and future growth engines. The aim of this paper is to understand the recent research trend in management of technology (MOT) and to foresee promising technologies with deep knowledge for both technology and business. Furthermore, this study intends to give a clear way to find new technical value for constant innovation and to capture core technology and technology convergence. Bibliometrics is a metrical analysis to understand literature's characteristics. Traditional bibliometrics has its limitation not to understand relationship between trend in technology management and technology itself, since it focuses on quantitative indices such as quotation frequency. To overcome this issue, the network focused bibliometrics has been used instead of traditional one. The network focused bibliometrics mainly uses "Co-citation" and "Co-word" analysis. In this study, a keywords network analysis, one of social network analysis, is performed to analyze recent research trend in MOT. For the analysis, we collected keywords from research papers published in international journals related MOT between 2002 and 2011, constructed a keyword network, and then conducted the keywords network analysis. Over the past 40 years, the studies in social network have attempted to understand the social interactions through the network structure represented by connection patterns. In other words, social network analysis has been used to explain the structures and behaviors of various social formations such as teams, organizations, and industries. In general, the social network analysis uses data as a form of matrix. In our context, the matrix depicts the relations between rows as papers and columns as keywords, where the relations are represented as binary. Even though there are no direct relations between papers who have been published, the relations between papers can be derived artificially as in the paper-keyword matrix, in which each cell has 1 for including or 0 for not including. For example, a keywords network can be configured in a way to connect the papers which have included one or more same keywords. After constructing a keywords network, we analyzed frequency of keywords, structural characteristics of keywords network, preferential attachment and growth of new keywords, component, and centrality. The results of this study are as follows. First, a paper has 4.574 keywords on the average. 90% of keywords were used three or less times for past 10 years and about 75% of keywords appeared only one time. Second, the keyword network in MOT is a small world network and a scale free network in which a small number of keywords have a tendency to become a monopoly. Third, the gap between the rich (with more edges) and the poor (with fewer edges) in the network is getting bigger as time goes on. Fourth, most of newly entering keywords become poor nodes within about 2~3 years. Finally, keywords with high degree centrality, betweenness centrality, and closeness centrality are "Innovation," "R&D," "Patent," "Forecast," "Technology transfer," "Technology," and "SME". The results of analysis will help researchers identify major trends in MOT research and then seek a new research topic. We hope that the result of the analysis will help researchers of MOT identify major trends in technology research, and utilize as useful reference information when they seek consilience with other fields of study and select a new research topic.

Sentiment Analysis of Movie Review Using Integrated CNN-LSTM Mode (CNN-LSTM 조합모델을 이용한 영화리뷰 감성분석)

  • Park, Ho-yeon;Kim, Kyoung-jae
    • Journal of Intelligence and Information Systems
    • /
    • v.25 no.4
    • /
    • pp.141-154
    • /
    • 2019
  • Rapid growth of internet technology and social media is progressing. Data mining technology has evolved to enable unstructured document representations in a variety of applications. Sentiment analysis is an important technology that can distinguish poor or high-quality content through text data of products, and it has proliferated during text mining. Sentiment analysis mainly analyzes people's opinions in text data by assigning predefined data categories as positive and negative. This has been studied in various directions in terms of accuracy from simple rule-based to dictionary-based approaches using predefined labels. In fact, sentiment analysis is one of the most active researches in natural language processing and is widely studied in text mining. When real online reviews aren't available for others, it's not only easy to openly collect information, but it also affects your business. In marketing, real-world information from customers is gathered on websites, not surveys. Depending on whether the website's posts are positive or negative, the customer response is reflected in the sales and tries to identify the information. However, many reviews on a website are not always good, and difficult to identify. The earlier studies in this research area used the reviews data of the Amazon.com shopping mal, but the research data used in the recent studies uses the data for stock market trends, blogs, news articles, weather forecasts, IMDB, and facebook etc. However, the lack of accuracy is recognized because sentiment calculations are changed according to the subject, paragraph, sentiment lexicon direction, and sentence strength. This study aims to classify the polarity analysis of sentiment analysis into positive and negative categories and increase the prediction accuracy of the polarity analysis using the pretrained IMDB review data set. First, the text classification algorithm related to sentiment analysis adopts the popular machine learning algorithms such as NB (naive bayes), SVM (support vector machines), XGboost, RF (random forests), and Gradient Boost as comparative models. Second, deep learning has demonstrated discriminative features that can extract complex features of data. Representative algorithms are CNN (convolution neural networks), RNN (recurrent neural networks), LSTM (long-short term memory). CNN can be used similarly to BoW when processing a sentence in vector format, but does not consider sequential data attributes. RNN can handle well in order because it takes into account the time information of the data, but there is a long-term dependency on memory. To solve the problem of long-term dependence, LSTM is used. For the comparison, CNN and LSTM were chosen as simple deep learning models. In addition to classical machine learning algorithms, CNN, LSTM, and the integrated models were analyzed. Although there are many parameters for the algorithms, we examined the relationship between numerical value and precision to find the optimal combination. And, we tried to figure out how the models work well for sentiment analysis and how these models work. This study proposes integrated CNN and LSTM algorithms to extract the positive and negative features of text analysis. The reasons for mixing these two algorithms are as follows. CNN can extract features for the classification automatically by applying convolution layer and massively parallel processing. LSTM is not capable of highly parallel processing. Like faucets, the LSTM has input, output, and forget gates that can be moved and controlled at a desired time. These gates have the advantage of placing memory blocks on hidden nodes. The memory block of the LSTM may not store all the data, but it can solve the CNN's long-term dependency problem. Furthermore, when LSTM is used in CNN's pooling layer, it has an end-to-end structure, so that spatial and temporal features can be designed simultaneously. In combination with CNN-LSTM, 90.33% accuracy was measured. This is slower than CNN, but faster than LSTM. The presented model was more accurate than other models. In addition, each word embedding layer can be improved when training the kernel step by step. CNN-LSTM can improve the weakness of each model, and there is an advantage of improving the learning by layer using the end-to-end structure of LSTM. Based on these reasons, this study tries to enhance the classification accuracy of movie reviews using the integrated CNN-LSTM model.