• Title/Summary/Keyword: Top-K mining

Search Result 94, Processing Time 0.024 seconds

Stock Price Prediction by Utilizing Category Neutral Terms: Text Mining Approach (카테고리 중립 단어 활용을 통한 주가 예측 방안: 텍스트 마이닝 활용)

  • Lee, Minsik;Lee, Hong Joo
    • Journal of Intelligence and Information Systems
    • /
    • v.23 no.2
    • /
    • pp.123-138
    • /
    • 2017
  • Since the stock market is driven by the expectation of traders, studies have been conducted to predict stock price movements through analysis of various sources of text data. In order to predict stock price movements, research has been conducted not only on the relationship between text data and fluctuations in stock prices, but also on the trading stocks based on news articles and social media responses. Studies that predict the movements of stock prices have also applied classification algorithms with constructing term-document matrix in the same way as other text mining approaches. Because the document contains a lot of words, it is better to select words that contribute more for building a term-document matrix. Based on the frequency of words, words that show too little frequency or importance are removed. It also selects words according to their contribution by measuring the degree to which a word contributes to correctly classifying a document. The basic idea of constructing a term-document matrix was to collect all the documents to be analyzed and to select and use the words that have an influence on the classification. In this study, we analyze the documents for each individual item and select the words that are irrelevant for all categories as neutral words. We extract the words around the selected neutral word and use it to generate the term-document matrix. The neutral word itself starts with the idea that the stock movement is less related to the existence of the neutral words, and that the surrounding words of the neutral word are more likely to affect the stock price movements. And apply it to the algorithm that classifies the stock price fluctuations with the generated term-document matrix. In this study, we firstly removed stop words and selected neutral words for each stock. And we used a method to exclude words that are included in news articles for other stocks among the selected words. Through the online news portal, we collected four months of news articles on the top 10 market cap stocks. We split the news articles into 3 month news data as training data and apply the remaining one month news articles to the model to predict the stock price movements of the next day. We used SVM, Boosting and Random Forest for building models and predicting the movements of stock prices. The stock market opened for four months (2016/02/01 ~ 2016/05/31) for a total of 80 days, using the initial 60 days as a training set and the remaining 20 days as a test set. The proposed word - based algorithm in this study showed better classification performance than the word selection method based on sparsity. This study predicted stock price volatility by collecting and analyzing news articles of the top 10 stocks in market cap. We used the term - document matrix based classification model to estimate the stock price fluctuations and compared the performance of the existing sparse - based word extraction method and the suggested method of removing words from the term - document matrix. The suggested method differs from the word extraction method in that it uses not only the news articles for the corresponding stock but also other news items to determine the words to extract. In other words, it removed not only the words that appeared in all the increase and decrease but also the words that appeared common in the news for other stocks. When the prediction accuracy was compared, the suggested method showed higher accuracy. The limitation of this study is that the stock price prediction was set up to classify the rise and fall, and the experiment was conducted only for the top ten stocks. The 10 stocks used in the experiment do not represent the entire stock market. In addition, it is difficult to show the investment performance because stock price fluctuation and profit rate may be different. Therefore, it is necessary to study the research using more stocks and the yield prediction through trading simulation.

Bankruptcy prediction using ensemble SVM model (앙상블 SVM 모형을 이용한 기업 부도 예측)

  • Choi, Ha Na;Lim, Dong Hoon
    • Journal of the Korean Data and Information Science Society
    • /
    • v.24 no.6
    • /
    • pp.1113-1125
    • /
    • 2013
  • Corporate bankruptcy prediction has been an important topic in the accounting and finance field for a long time. Several data mining techniques have been used for bankruptcy prediction. However, there are many limits for application to real classification problem with a single model. This study proposes ensemble SVM (support vector machine) model which assembles different SVM models with each different kernel functions. Our ensemble model is made and evaluated by v-fold cross-validation approach. The k top performing models are recruited into the ensemble. The classification is then carried out using the majority voting opinion of the ensemble. In this paper, we investigate the performance of ensemble SVM classifier in terms of accuracy, error rate, sensitivity, specificity, ROC curve, and AUC to compare with single SVM classifiers based on financial ratios dataset and simulation dataset. The results confirmed the advantages of our method: It is robust while providing good performance.

A domain-specific sentiment lexicon construction method for stock index directionality (주가지수 방향성 예측을 위한 도메인 맞춤형 감성사전 구축방안)

  • Kim, Jae-Bong;Kim, Hyoung-Joong
    • Journal of Digital Contents Society
    • /
    • v.18 no.3
    • /
    • pp.585-592
    • /
    • 2017
  • As development of personal devices have made everyday use of internet much easier than before, it is getting generalized to find information and share it through the social media. In particular, communities specialized in each field have become so powerful that they can significantly influence our society. Finally, businesses and governments pay attentions to reflecting their opinions in their strategies. The stock market fluctuates with various factors of society. In order to consider social trends, many studies have tried making use of bigdata analysis on stock market researches as well as traditional approaches using buzz amount. In the example at the top, the studies using text data such as newspaper articles are being published. In this paper, we analyzed the post of 'Paxnet', a securities specialists' site, to supplement the limitation of the news. Based on this, we help researchers analyze the sentiment of investors by generating a domain-specific sentiment lexicon for the stock market.

An experimental study on fracture coalescence characteristics of brittle sandstone specimens combined various flaws

  • Yang, Sheng-Qi
    • Geomechanics and Engineering
    • /
    • v.8 no.4
    • /
    • pp.541-557
    • /
    • 2015
  • This research aims to analyze the fracture coalescence characteristics of brittle sandstone specimen ($80{\times}160{\times}30mm$ in size) containing various flaws (a single fissure, double squares and combined flaws). Using a rock mechanics servo-controlled testing system, the strength and deformation behaviours of sandstone specimen containing various flaws are experimentally investigated. The results show that the crack initiation stress, uniaxial compressive strength and peak axial strain of specimen containing a single fissure are all higher than those containing double squares, while which are higher than those containing combined flaws. For sandstone specimen containing combined flaws, the uniaxial compressive strength of sandstone increase as fissure angle (${\alpha}$) increases from $30^{\circ}$ to $90^{\circ}$, which indicates that the specimens with steeper fissure angles can support higher axial capacity for ${\alpha}$ greater than $30^{\circ}$. In the entire deformation process of flawed sandstone specimen, crack evolution process is discussed detailed using photographic monitoring technique. For the specimen containing a single fissure, tensile wing cracks are first initiated at the upper and under tips of fissure, and anti-tensile cracks and far-field cracks are also observed in the deformation process; moreover anti-tensile cracks usually accompanies with tensile wing cracks. For the specimen containing double squares, tensile cracks are usually initiated from the top and bottom edge of two squares along the direction of axial stress, and in the process of final unstable failure, more vertical splitting failures are observed in the ligament region. When a single fissure and double squares are formed together into combined flaws, the crack coalescence between the fissure tips and double squares plays a significant role for ultimate failure of the specimen containing combined flaws.

Maritime Atmospheric Boundary Layer Observed By L-band Doppler radar (도플러 레이더를 이용한 해안지역의 대기경계층 분석 연구)

  • Kwon, Byung-Hyuk;Yoon, Hong-Joo
    • Journal of the Korea Institute of Information and Communication Engineering
    • /
    • v.4 no.5
    • /
    • pp.977-984
    • /
    • 2000
  • Atmospheric boundary layer over equatorial maritime continent was analyzed with Doppler radar. An L-band (1357.5 MHz) boundary layer radar (BLR) has been in continuous successful operation in Selpong, Indonesia(6.45, 106.7E), since November 1992. The performance of the BLR with respect to the observation height range and the wind measurement reliability has been examined on the basis of simultaneous meteorological observations. In the dry season (10-12 October 1993), we have found two types of strong echo structures appearing systematically in the equatorial planetary boundary layer with diurnal variations on clear days. The first type is the striking appearance of a strong echo layer ascending from below 300 m (in the morning) to above 3-5 km (in the afternoon), which is identified with a diurnal variation of the top of the mixing planetary boundary layer. As expected, it is higher in the Indonesian equatorial region than in midlatitudes. Another type is a layered echo appearing at 2-3 km heights from nighttime to morning, which seem to be coincident with humidity gaps. In the rainy season (20-21 February 1994), the height of the atmospheric mining was lower than that in the dry season.

  • PDF

Development of Filtering System ADDAVICHI for Fake Reviews using Big Data Analysis (빅데이터 분석을 활용한 가짜 리뷰 필터링 시스템 ADDAVICHI)

  • Jeong, Davichi;Rho, Young-J.
    • The Journal of the Institute of Internet, Broadcasting and Communication
    • /
    • v.19 no.6
    • /
    • pp.1-8
    • /
    • 2019
  • Recently, consumer distrust has deepened due to blog posts focusing only on public relations due to 'viral marketing'. In addition, marketing projects such as false writing or exaggerated use of the latter phase are one of the most popular programs in 2016 as they are cheaper and more effective than newspaper and TV ads, and the size of advertising costs is set to be a major means of advertising at '3 trillion 394.1 billion won. From this 'viral marketing,' it has become an Internet environment that needs tools to filter information. The fake review filtering application ADDAVICHI presented in this paper extracts, analyzes, and presents blog keywords, total number of searches, reliability and satisfaction when users search for content such as "event" and "taste restaurant." Reliability shows the number of ad posts on a blog, the total number of posts, and satisfaction shows a clean post with confidence divided into positive and negative posts. Finally, the keyword shows a list of the top three words in the review from a positive post. In this way, it helps users interpret information away from advertising.

A Literature Review and Classification of Recommender Systems on Academic Journals (추천시스템관련 학술논문 분석 및 분류)

  • Park, Deuk-Hee;Kim, Hyea-Kyeong;Choi, Il-Young;Kim, Jae-Kyeong
    • Journal of Intelligence and Information Systems
    • /
    • v.17 no.1
    • /
    • pp.139-152
    • /
    • 2011
  • Recommender systems have become an important research field since the emergence of the first paper on collaborative filtering in the mid-1990s. In general, recommender systems are defined as the supporting systems which help users to find information, products, or services (such as books, movies, music, digital products, web sites, and TV programs) by aggregating and analyzing suggestions from other users, which mean reviews from various authorities, and user attributes. However, as academic researches on recommender systems have increased significantly over the last ten years, more researches are required to be applicable in the real world situation. Because research field on recommender systems is still wide and less mature than other research fields. Accordingly, the existing articles on recommender systems need to be reviewed toward the next generation of recommender systems. However, it would be not easy to confine the recommender system researches to specific disciplines, considering the nature of the recommender system researches. So, we reviewed all articles on recommender systems from 37 journals which were published from 2001 to 2010. The 37 journals are selected from top 125 journals of the MIS Journal Rankings. Also, the literature search was based on the descriptors "Recommender system", "Recommendation system", "Personalization system", "Collaborative filtering" and "Contents filtering". The full text of each article was reviewed to eliminate the article that was not actually related to recommender systems. Many of articles were excluded because the articles such as Conference papers, master's and doctoral dissertations, textbook, unpublished working papers, non-English publication papers and news were unfit for our research. We classified articles by year of publication, journals, recommendation fields, and data mining techniques. The recommendation fields and data mining techniques of 187 articles are reviewed and classified into eight recommendation fields (book, document, image, movie, music, shopping, TV program, and others) and eight data mining techniques (association rule, clustering, decision tree, k-nearest neighbor, link analysis, neural network, regression, and other heuristic methods). The results represented in this paper have several significant implications. First, based on previous publication rates, the interest in the recommender system related research will grow significantly in the future. Second, 49 articles are related to movie recommendation whereas image and TV program recommendation are identified in only 6 articles. This result has been caused by the easy use of MovieLens data set. So, it is necessary to prepare data set of other fields. Third, recently social network analysis has been used in the various applications. However studies on recommender systems using social network analysis are deficient. Henceforth, we expect that new recommendation approaches using social network analysis will be developed in the recommender systems. So, it will be an interesting and further research area to evaluate the recommendation system researches using social method analysis. This result provides trend of recommender system researches by examining the published literature, and provides practitioners and researchers with insight and future direction on recommender systems. We hope that this research helps anyone who is interested in recommender systems research to gain insight for future research.

A Topic Modeling-based Recommender System Considering Changes in User Preferences (고객 선호 변화를 고려한 토픽 모델링 기반 추천 시스템)

  • Kang, So Young;Kim, Jae Kyeong;Choi, Il Young;Kang, Chang Dong
    • Journal of Intelligence and Information Systems
    • /
    • v.26 no.2
    • /
    • pp.43-56
    • /
    • 2020
  • Recommender systems help users make the best choice among various options. Especially, recommender systems play important roles in internet sites as digital information is generated innumerable every second. Many studies on recommender systems have focused on an accurate recommendation. However, there are some problems to overcome in order for the recommendation system to be commercially successful. First, there is a lack of transparency in the recommender system. That is, users cannot know why products are recommended. Second, the recommender system cannot immediately reflect changes in user preferences. That is, although the preference of the user's product changes over time, the recommender system must rebuild the model to reflect the user's preference. Therefore, in this study, we proposed a recommendation methodology using topic modeling and sequential association rule mining to solve these problems from review data. Product reviews provide useful information for recommendations because product reviews include not only rating of the product but also various contents such as user experiences and emotional state. So, reviews imply user preference for the product. So, topic modeling is useful for explaining why items are recommended to users. In addition, sequential association rule mining is useful for identifying changes in user preferences. The proposed methodology is largely divided into two phases. The first phase is to create user profile based on topic modeling. After extracting topics from user reviews on products, user profile on topics is created. The second phase is to recommend products using sequential rules that appear in buying behaviors of users as time passes. The buying behaviors are derived from a change in the topic of each user. A collaborative filtering-based recommendation system was developed as a benchmark system, and we compared the performance of the proposed methodology with that of the collaborative filtering-based recommendation system using Amazon's review dataset. As evaluation metrics, accuracy, recall, precision, and F1 were used. For topic modeling, collapsed Gibbs sampling was conducted. And we extracted 15 topics. Looking at the main topics, topic 1, top 3, topic 4, topic 7, topic 9, topic 13, topic 14 are related to "comedy shows", "high-teen drama series", "crime investigation drama", "horror theme", "British drama", "medical drama", "science fiction drama", respectively. As a result of comparative analysis, the proposed methodology outperformed the collaborative filtering-based recommendation system. From the results, we found that the time just prior to the recommendation was very important for inferring changes in user preference. Therefore, the proposed methodology not only can secure the transparency of the recommender system but also can reflect the user's preferences that change over time. However, the proposed methodology has some limitations. The proposed methodology cannot recommend product elaborately if the number of products included in the topic is large. In addition, the number of sequential patterns is small because the number of topics is too small. Therefore, future research needs to consider these limitations.

A CF-based Health Functional Recommender System using Extended User Similarity Measure (확장된 사용자 유사도를 이용한 CF-기반 건강기능식품 추천 시스템)

  • Sein Hong;Euiju Jeong;Jaekyeong Kim
    • Journal of Intelligence and Information Systems
    • /
    • v.29 no.3
    • /
    • pp.1-17
    • /
    • 2023
  • With the recent rapid development of ICT(Information and Communication Technology) and the popularization of digital devices, the size of the online market continues to grow. As a result, we live in a flood of information. Thus, customers are facing information overload problems that require a lot of time and money to select products. Therefore, a personalized recommender system has become an essential methodology to address such issues. Collaborative Filtering(CF) is the most widely used recommender system. Traditional recommender systems mainly utilize quantitative data such as rating values, resulting in poor recommendation accuracy. Quantitative data cannot fully reflect the user's preference. To solve such a problem, studies that reflect qualitative data, such as review contents, are being actively conducted these days. To quantify user review contents, text mining was used in this study. The general CF consists of the following three steps: user-item matrix generation, Top-N neighborhood group search, and Top-K recommendation list generation. In this study, we propose a recommendation algorithm that applies an extended similarity measure, which utilize quantified review contents in addition to user rating values. After calculating review similarity by applying TF-IDF, Word2Vec, and Doc2Vec techniques to review content, extended similarity is created by combining user rating similarity and quantified review contents. To verify this, we used user ratings and review data from the e-commerce site Amazon's "Health and Personal Care". The proposed recommendation model using extended similarity measure showed superior performance to the traditional recommendation model using only user rating value-based similarity measure. In addition, among the various text mining techniques, the similarity obtained using the TF-IDF technique showed the best performance when used in the neighbor group search and recommendation list generation step.

Increasing Accuracy of Classifying Useful Reviews by Removing Neutral Terms (중립도 기반 선택적 단어 제거를 통한 유용 리뷰 분류 정확도 향상 방안)

  • Lee, Minsik;Lee, Hong Joo
    • Journal of Intelligence and Information Systems
    • /
    • v.22 no.3
    • /
    • pp.129-142
    • /
    • 2016
  • Customer product reviews have become one of the important factors for purchase decision makings. Customers believe that reviews written by others who have already had an experience with the product offer more reliable information than that provided by sellers. However, there are too many products and reviews, the advantage of e-commerce can be overwhelmed by increasing search costs. Reading all of the reviews to find out the pros and cons of a certain product can be exhausting. To help users find the most useful information about products without much difficulty, e-commerce companies try to provide various ways for customers to write and rate product reviews. To assist potential customers, online stores have devised various ways to provide useful customer reviews. Different methods have been developed to classify and recommend useful reviews to customers, primarily using feedback provided by customers about the helpfulness of reviews. Most shopping websites provide customer reviews and offer the following information: the average preference of a product, the number of customers who have participated in preference voting, and preference distribution. Most information on the helpfulness of product reviews is collected through a voting system. Amazon.com asks customers whether a review on a certain product is helpful, and it places the most helpful favorable and the most helpful critical review at the top of the list of product reviews. Some companies also predict the usefulness of a review based on certain attributes including length, author(s), and the words used, publishing only reviews that are likely to be useful. Text mining approaches have been used for classifying useful reviews in advance. To apply a text mining approach based on all reviews for a product, we need to build a term-document matrix. We have to extract all words from reviews and build a matrix with the number of occurrences of a term in a review. Since there are many reviews, the size of term-document matrix is so large. It caused difficulties to apply text mining algorithms with the large term-document matrix. Thus, researchers need to delete some terms in terms of sparsity since sparse words have little effects on classifications or predictions. The purpose of this study is to suggest a better way of building term-document matrix by deleting useless terms for review classification. In this study, we propose neutrality index to select words to be deleted. Many words still appear in both classifications - useful and not useful - and these words have little or negative effects on classification performances. Thus, we defined these words as neutral terms and deleted neutral terms which are appeared in both classifications similarly. After deleting sparse words, we selected words to be deleted in terms of neutrality. We tested our approach with Amazon.com's review data from five different product categories: Cellphones & Accessories, Movies & TV program, Automotive, CDs & Vinyl, Clothing, Shoes & Jewelry. We used reviews which got greater than four votes by users and 60% of the ratio of useful votes among total votes is the threshold to classify useful and not-useful reviews. We randomly selected 1,500 useful reviews and 1,500 not-useful reviews for each product category. And then we applied Information Gain and Support Vector Machine algorithms to classify the reviews and compared the classification performances in terms of precision, recall, and F-measure. Though the performances vary according to product categories and data sets, deleting terms with sparsity and neutrality showed the best performances in terms of F-measure for the two classification algorithms. However, deleting terms with sparsity only showed the best performances in terms of Recall for Information Gain and using all terms showed the best performances in terms of precision for SVM. Thus, it needs to be careful for selecting term deleting methods and classification algorithms based on data sets.