• Title/Abstract/Keywords: Data dictionary

346 search results (processing time: 0.023 s)

Stock-Index Invest Model Using News Big Data Opinion Mining (뉴스와 주가 : 빅데이터 감성분석을 통한 지능형 투자의사결정모형)

  • Kim, Yoo-Sin;Kim, Nam-Gyu;Jeong, Seung-Ryul
    • Journal of Intelligence and Information Systems / Vol. 18, No. 2 / pp. 143-156 / 2012
  • People readily believe that news and the stock index are closely related. They assume that obtaining news before anyone else helps them forecast stock prices and earn large profits, or at least capture an investment opportunity. However, it is not easy to determine how closely the two are related, to reach an investment decision based on news, or to verify that such investment information is valid. If the significance of news and its impact on the stock market can be analyzed, it becomes possible to extract information that supports investment decisions. In reality, however, the world is inundated with a massive stream of news in real time, and news is unstructured text. This study proposes a stock-index investment model based on "News Big Data" opinion mining that systematically collects, categorizes, and analyzes news and generates investment information. To verify the validity of the model, the relationship between the results of news opinion mining and the stock index was analyzed empirically using statistical methods. The mining process that converts news into information for investment decision making consists of the following steps. First, news is received from a provider that collects it in real time, and its information is indexed. Not only the content of each article but also metadata such as the media outlet, time, and news type are collected, classified, and then reworked into variables from which an investment decision can be inferred. Second, the text of each article is split into morphemes to derive words whose polarity can be judged, and each word is tagged as positive or negative by comparing it against a sentiment dictionary. Third, the positive/negative polarity of each article is judged using the indexed classification information and a scoring rule, and the final investment decision information is derived according to daily scoring criteria. 
For this study, the KOSPI index and its fluctuation range were collected from the Korea Exchange for the 63 trading days in the three months from July to September 2011, and news data were collected by parsing 766 web articles from economic news outlet "M" among the articles listed under Stock Information > News > Main News on the portal site Naver.com. Over the three months, the stock price index rose on 33 days and fell on 30 days, and the news comprised 197 articles published before the market opened, 385 during the session, and 184 after the market closed. Mining the collected news and comparing the results with stock prices showed that the positive/negative opinion of news content had a significant relationship with the stock price, and that changes in the stock price index were better explained when news opinion was expressed as a positive/negative ratio rather than as a simple binary positive-or-negative judgment. To check whether news affected stock price fluctuations, or at least preceded them, stock price changes were also compared only with news published before the market opened, and this relationship was statistically significant as well. In addition, because news covers various types and topics, such as social, economic, and overseas news, corporate earnings, industry conditions, market outlook, and current market conditions, its influence on the stock market and the significance of the relationship were expected to differ by news type. Each type of news was therefore compared with stock price fluctuations, and the results showed that market conditions, outlook, and overseas news were the most useful for explaining stock price fluctuations. 
In contrast, news about individual companies was not statistically significant, and its opinion mining values tended to move opposite to the stock price; a possible reason is the appearance of promotional and planned news intended to keep the stock price from falling. Finally, multiple regression analysis and logistic regression analysis were carried out to derive an investment decision function based on the relationship between the positive/negative opinion of news and the stock price. The results showed that the regression equation using the pre-opening market conditions, outlook, and overseas news variables was statistically significant, and the classification accuracy of the logistic regression was 70.0% for rises in the stock price, 78.8% for falls, and 74.6% on average. This study first analyzed the relationship between news and stock prices by analyzing and quantifying the sentiment of unstructured news content using opinion mining, one of the big data analysis techniques, and then proposed and verified an intelligent investment decision-making model that can systematically carry out opinion mining and derive and support investment information. This shows that news can be used as a variable for predicting the stock price index for investment, and the model is expected to be usable as a real investment support system once it is implemented as a system and verified in the future.
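The dictionary-based polarity step and the positive/negative ratio described in this abstract can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration: the toy English `SENTIMENT_DICT`, the pre-tokenized input (standing in for Korean morpheme analysis), the function names, and the 0.5 threshold are not taken from the study, whose dictionary and scoring rule are not given in the abstract.

```python
from collections import Counter

# Hypothetical toy sentiment dictionary (+1 positive, -1 negative).
# The study's actual dictionary is not given in the abstract.
SENTIMENT_DICT = {"surge": 1, "gain": 1, "rally": 1,
                  "drop": -1, "loss": -1, "fear": -1}


def article_polarity(tokens):
    """Count positive and negative dictionary hits in one tokenized article."""
    counts = Counter(SENTIMENT_DICT[t] for t in tokens if t in SENTIMENT_DICT)
    return counts[1], counts[-1]


def daily_positive_ratio(articles):
    """Share of positive hits among all polar hits across a day's articles."""
    pos = neg = 0
    for tokens in articles:
        p, n = article_polarity(tokens)
        pos += p
        neg += n
    total = pos + neg
    # Treat a day without any polar words as neutral (an assumption).
    return pos / total if total else 0.5


def daily_signal(articles, threshold=0.5):
    """Map the daily ratio to a coarse up/down call; threshold is assumed."""
    return "up" if daily_positive_ratio(articles) > threshold else "down"
```

Using a ratio instead of a per-article binary label keeps the strength of the day's sentiment, which is the distinction the abstract reports as explaining index changes better.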

RGB Channel Selection Technique for Efficient Image Segmentation (효율적인 이미지 분할을 위한 RGB 채널 선택 기법)

  • 김현종;박영배
    • Journal of KIISE:Software and Applications / Vol. 31, No. 10 / pp. 1332-1344 / 2004
  • With the development of the information superhighway and multimedia-related technologies in recent years, more efficient technologies for transmitting, storing, and retrieving multimedia data are required. Among such technologies, semantic-based image retrieval commonly annotates images separately in order to attach meaning to the image data and to low-level property information such as color, texture, and shape. Although semantic-based retrieval uses a vocabulary dictionary of the given keywords, it has not yet escaped the limits of existing keyword-based text information retrieval; this is the first problem. The second problem is that content-based image retrieval systems show decreased retrieval performance: it is difficult to separate an object from an image with a complex background, and difficult to extract a region because regions are divided excessively. Further, it is difficult to separate objects from an image that contains multiple objects in a complex scene. To solve these problems, this paper establishes a content-based retrieval system that proceeds in five steps. The most critical of the five steps is extracting, from among the RGB images, the one with the largest background and the one with the smallest background. In particular, I propose a method that extracts the subject as well as the background by using the image with the largest background. Also, to solve the second problem, I propose a method in which multiple objects are separated using an RGB channel selection technique that combines object separation based on RGB channel separation with Watermerge's threshold value to optimize the excessive division of regions. 
The tests showed that the proposed methods were superior to the existing methods in retrieval performance, to the extent that they could replace the methods developed for retrieving complex objects that have been difficult to retrieve until now.

Product Evaluation Criteria Extraction through Online Review Analysis: Using LDA and k-Nearest Neighbor Approach (온라인 리뷰 분석을 통한 상품 평가 기준 추출: LDA 및 k-최근접 이웃 접근법을 활용하여)

  • Lee, Ji Hyeon;Jung, Sang Hyung;Kim, Jun Ho;Min, Eun Joo;Yeo, Un Yeong;Kim, Jong Woo
    • Journal of Intelligence and Information Systems / Vol. 26, No. 1 / pp. 97-117 / 2020
  • Product evaluation criteria are indicators describing the attributes or values of products, which enable users or manufacturers to measure and understand the products. When companies analyze their products or compare them with competitors, appropriate criteria must be selected for objective evaluation. The criteria should reflect the product features that consumers considered when they purchased, used, and evaluated the products. However, current evaluation criteria do not reflect how consumer opinions differ from product to product. Previous studies tried to use online reviews from e-commerce sites, which reflect consumer opinions, to extract the features and topics of products and use them as evaluation criteria. However, these approaches still produce criteria that are irrelevant to the products because extracted or improper words are not refined. To overcome this limitation, this research proposes the LDA-k-NN model, which extracts candidate criteria words from online reviews using LDA and refines them with k-nearest neighbor. The proposed approach starts with a preparation phase consisting of six steps. First, it collects review data from e-commerce websites. Most e-commerce websites classify their items into high-level, middle-level, and low-level categories. Review data for the preparation phase are gathered from each middle-level category and later merged to represent a single high-level category. Next, nouns, adjectives, adverbs, and verbs are extracted from the reviews using part-of-speech information from a morpheme analysis module. After preprocessing, LDA yields the words for each topic in the reviews, and only the nouns among the topic words are chosen as potential criteria words. Then, the words are tagged according to whether they can serve as criteria for each middle-level category. Next, every tagged word is vectorized by a pre-trained word embedding model. Finally, a k-nearest neighbor case-based approach is used to classify each word with tags. 
After the preparation phase is set up, the criteria extraction phase is conducted on low-level categories. This phase starts with crawling reviews in the corresponding low-level category. The same preprocessing as in the preparation phase is conducted using the morpheme analysis module and LDA. Candidate criteria words are extracted by taking nouns from the data and are vectorized by the pre-trained word embedding model. Finally, evaluation criteria are extracted by refining the candidate criteria words using the k-nearest neighbor approach and the reference proportion of each word in the word set. To evaluate the performance of the proposed model, an experiment was conducted with reviews from '11st', one of the biggest e-commerce companies in Korea. The review data came from the 'Electronics/Digital' section, one of the high-level categories in 11st. The proposed model was compared with three alternatives: the actual criteria of 11st, a model that extracts nouns with the morpheme analysis module and refines them by word frequency, and a model that extracts nouns from LDA topics and refines them by word frequency. The evaluation task was to predict the evaluation criteria of 10 low-level categories with the proposed model and the three alternatives above. The criteria words extracted by each model were combined into a single word set, which was used for survey questionnaires. In the survey, respondents chose every item they considered an appropriate criterion for each category. Each model scored a point when a chosen word had been extracted by that model. The proposed model had higher scores than the other models in 8 out of 10 low-level categories. Paired t-tests on the scores of each model confirmed that the proposed model performed better in 26 tests out of 30. In addition, the proposed model was the best model in terms of accuracy. 
This research proposes an evaluation criteria extraction method that combines topic extraction using LDA with refinement by the k-nearest neighbor approach. This method overcomes the limits of previous dictionary-based models and frequency-based refinement models. The study can contribute to improving review analysis for deriving business insights in the e-commerce market.
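The refinement step, in which vectorized candidate words are classified by their nearest tagged neighbors, might look roughly like the Python sketch below. It is an illustration under stated assumptions rather than the authors' implementation: the two-dimensional toy vectors stand in for pre-trained word embeddings, and the tags "criterion" and "other", the cosine similarity measure, and `k=3` are all hypothetical choices not given in the abstract.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def knn_label(query, labeled, k=3):
    """Majority tag among the k tagged vectors most similar to the query.

    `labeled` is a list of (vector, tag) pairs, e.g. tagged words from the
    preparation phase after embedding (toy data in this sketch).
    """
    nearest = sorted(labeled, key=lambda item: cosine(query, item[0]),
                     reverse=True)[:k]
    votes = Counter(tag for _, tag in nearest)
    return votes.most_common(1)[0][0]


# Toy tagged vectors; real inputs would be embedding vectors of tagged nouns.
TAGGED = [
    ([1.0, 0.1], "criterion"),
    ([0.9, 0.2], "criterion"),
    ([0.8, 0.3], "criterion"),
    ([0.1, 1.0], "other"),
    ([0.2, 0.9], "other"),
]
```

With this toy data, `knn_label([0.95, 0.15], TAGGED)` votes among three "criterion" neighbors, while `knn_label([0.15, 0.95], TAGGED)` is decided two to one in favor of "other".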

Korean Sentence Generation Using Phoneme-Level LSTM Language Model (한국어 음소 단위 LSTM 언어모델을 이용한 문장 생성)

  • Ahn, SungMahn;Chung, Yeojin;Lee, Jaejoon;Yang, Jiheon
    • Journal of Intelligence and Information Systems / Vol. 23, No. 2 / pp. 71-88 / 2017
  • Language models were originally developed for speech recognition and language processing. Using a set of example sentences, a language model predicts the next word or character based on sequential input data. N-gram models have been widely used, but they cannot model the correlation between input units efficiently because they are probabilistic models based on the frequency of each unit in the training set. Recently, with the development of deep learning algorithms, recurrent neural network (RNN) models and long short-term memory (LSTM) models have been widely used for neural language models (Ahn, 2016; Kim et al., 2016; Lee et al., 2016). These models can reflect dependencies between the objects that are entered sequentially into the model (Gers and Schmidhuber, 2001; Mikolov et al., 2010; Sundermeyer et al., 2012). To train a neural language model, texts need to be decomposed into words or morphemes. However, since a training set of sentences generally includes a huge number of words or morphemes, the dictionary becomes very large, which increases model complexity. In addition, word-level or morpheme-level models can generate only the vocabulary contained in the training set. Furthermore, for morphologically rich languages such as Turkish, Hungarian, Russian, Finnish, or Korean, morpheme analyzers are more likely to make errors in the decomposition process (Lankinen et al., 2016). Therefore, this paper proposes a phoneme-level language model for Korean based on LSTM models. A phoneme, such as a vowel or a consonant, is the smallest unit that makes up Korean text. We construct the language model using three or four LSTM layers. Each model was trained using the stochastic gradient algorithm and more advanced optimization algorithms such as Adagrad, RMSprop, Adadelta, Adam, Adamax, and Nadam. The simulation study was done with Old Testament texts using the deep learning package Keras, based on Theano. 
After pre-processing the texts, the dataset included 74 unique characters, including vowels, consonants, and punctuation marks. We then constructed each input vector from 20 consecutive characters, with the following 21st character as the output. In total, 1,023,411 input-output pairs were included in the dataset, and we divided them into training, validation, and test sets in the proportion 70:15:15. All the simulations were conducted on a system equipped with an Intel Xeon CPU (16 cores) and an NVIDIA GeForce GTX 1080 GPU. We compared the loss function evaluated on the validation set, the perplexity evaluated on the test set, and the time taken to train each model. As a result, all the optimization algorithms except the stochastic gradient algorithm showed similar validation loss and perplexity, which were clearly superior to those of the stochastic gradient algorithm. The stochastic gradient algorithm also took the longest time to train for both the 3- and 4-LSTM-layer models. On average, the 4-LSTM-layer model took 69% longer to train than the 3-LSTM-layer model. However, its validation loss and perplexity were not improved significantly, and even became worse under specific conditions. On the other hand, when comparing the automatically generated sentences, the 4-LSTM-layer model tended to generate sentences that were closer to natural language than those of the 3-LSTM-layer model. Although there were slight differences in the completeness of the generated sentences between the models, the sentence generation performance was quite satisfactory under all simulation conditions: the models generated only legitimate Korean letters, and the use of postpositions and the conjugation of verbs were almost perfect grammatically. The results of this study are expected to be widely used for processing Korean in the fields of language processing and speech recognition, which are the basis of artificial intelligence systems.
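The phoneme-level input described in this abstract can be sketched in Python as follows. The decomposition uses the standard Unicode arithmetic for precomposed Hangul syllables (base 0xAC00, with 19 initial consonants, 21 vowels, and 28 final slots including "none"); whether the authors decomposed text this way is an assumption, and the function names `decompose` and `make_windows` are hypothetical. The window length of 20 follows the abstract.

```python
HANGUL_BASE = 0xAC00  # first precomposed Hangul syllable
CHOSEONG = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"            # 19 initial consonants
JUNGSEONG = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"        # 21 vowels
JONGSEONG = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")  # 28 finals


def decompose(text):
    """Split each Hangul syllable into its phonemes; pass other characters through."""
    out = []
    for ch in text:
        offset = ord(ch) - HANGUL_BASE
        if 0 <= offset < 19 * 21 * 28:
            lead, rest = divmod(offset, 21 * 28)
            vowel, tail = divmod(rest, 28)
            out.append(CHOSEONG[lead] + JUNGSEONG[vowel] + JONGSEONG[tail])
        else:
            out.append(ch)  # punctuation, spaces, Latin letters, etc.
    return "".join(out)


def make_windows(seq, window=20):
    """Pair every run of `window` characters with the character that follows it."""
    return [(seq[i:i + window], seq[i + window])
            for i in range(len(seq) - window)]
```

For example, the syllable 한 (U+D55C) decomposes into ㅎ, ㅏ, and ㄴ. A phoneme sequence of length n yields n - 20 input-output pairs with the default window, which can then be split into training, validation, and test sets.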

A Study of Competency for R&D Engineer on Semiconductor Company (반도체 기술 R&D 연구인력의 역량연구 -H사 기업부설연구소를 중심으로)

  • Yun, Hye-Lim;Yoon, Gwan-Sik;Jeon, Hwa-Ick
    • 대한공업교육학회지 / Vol. 38, No. 2 / pp. 267-286 / 2013
  • Recently, advanced companies have been sparing no effort to improve the core knowledge and technology needed to achieve outstanding work performance. In this rapidly changing knowledge-based society, companies face the task of creating high value-added knowledge. The role of the R&D workforce, which matches the characteristics and role of knowledge workers, is becoming more significant. As the life cycle of technical knowledge and skills shortens, they have become essential elements of successful business in every industry. It is difficult to improve a company's competitiveness without enhancing the competency of individuals and the organization. As competency development, a part of corporate human resource management, is spreading, it is necessary to focus on research that determines the required competencies and to analyze the competencies of a core organization within the research institute. 'H' is a semiconductor manufacturing company that has an affiliated research institute with its own R&D engineers. Based on focus group interviews and job analysis data, the vision and the necessary competencies were confirmed. To confirm whether the required competencies differ by job, the analysis divided members into those in charge of circuit design and design before process development and those in charge of process actualization and process development. This research also covered members' awareness of the importance of the determined competencies. The interview and job analysis results were arranged by group and content, integrated, and analyzed, and the analyzed results were then reorganized after a comparative analysis with the competency dictionary of Spencer & Spencer and competency models developed in previous research. 
The derived main competencies are: challenge, responsibility, prediction/responsiveness, planning a new business, achievement orientation, training, cooperation, self-development, analytic thinking, scheduling, motivation, communication, commercialization of technology, information gathering, professionalism on the job, and professionalism outside of work. The most highly required competency for both jobs was 'Professionalism'. 'Attitude', 'Performance Management', and 'Teamwork' were recognized as required competencies for workers in charge of circuit design, and 'Challenge', 'Training', 'Professionalism on the job', and 'Communication' for those in charge of process actualization and process development. With these results, this research has determined the competencies that the affiliated research institute of company 'H' needs and found the differences in required competencies by job. It has also suggested more active education methods and various kinds of education by confirming the perceived importance of each competency and individuals' level of awareness of it.

Evaluation of Preference by Bukhansan Dulegil Course Using Sentiment Analysis of Blog Data (블로그 데이터 감성분석을 통한 북한산둘레길 구간별 선호도 평가)

  • Lee, Sung-Hee;Son, Yong-Hoon
    • Journal of the Korean Institute of Landscape Architecture / Vol. 49, No. 3 / pp. 1-10 / 2021
  • This study aimed to evaluate preferences for Bukhansan dulegil using sentiment analysis, a natural language processing technique, and to derive preferred and non-preferred factors. To this end, we collected blog articles written in 2019 and produced sentiment scores for 21 dulegil courses by deriving the positive and negative words in the texts. Then, content analysis was conducted to determine which factors led visitors to prefer or dislike each course. In blogs written about Bukhansan dulegil, positive words appeared in approximately 73% of the content, and the percentage of positive documents was significantly higher than that of negative documents for each course. This indicates that visitors generally had positive sentiments toward Bukhansan dulegil. Nevertheless, according to the sentiment score analysis, all 21 dulegil courses belonged to both the preferred and non-preferred courses. Among courses, visitors preferred less difficult courses, in which they could walk without a burden, and in which various landscape elements (visual, auditory, olfactory, etc.) were harmonious yet distinct. Furthermore, they preferred courses with various landscapes and landscape sequences. Additionally, visitors appreciated the presence of viewpoints, such as observation decks, as a significant factor and preferred courses with excellent accessibility and information provisions, such as information boards. Conversely, the dissatisfaction with the dulegil courses was due to noise caused by adjacent roads, excessive urban areas, and the inequality or difficulty of the course, which was primarily attributed to insufficient information on the landscape or section of the course. The results of this study can serve not only as a guide in national parks but also in the management of nearby forest green areas to formulate a plan to repair and improve dulegil. 
Further, the sentiment analysis used in this study is meaningful in that it can continuously monitor actual users' responses towards natural areas. However, since it was evaluated based on a predefined sentiment dictionary, continuous updates are needed. Additionally, since there is a tendency to share positive content rather than negative views due to the nature of social media, it is necessary to compare and review the results of analysis, such as with on-site surveys.