• Title/Summary/Keyword: Frequency of Words


Stock Price Prediction by Utilizing Category Neutral Terms: Text Mining Approach (카테고리 중립 단어 활용을 통한 주가 예측 방안: 텍스트 마이닝 활용)

  • Lee, Minsik;Lee, Hong Joo
    • Journal of Intelligence and Information Systems / v.23 no.2 / pp.123-138 / 2017
  • Since the stock market is driven by traders' expectations, studies have analyzed various sources of text data to predict stock price movements. Such research has examined not only the relationship between text data and stock price fluctuations, but also stock trading based on news articles and social media responses. Studies that predict stock price movements typically construct a term-document matrix and apply classification algorithms to it, as in other text mining approaches. Because documents contain many words, it is better to select the words that contribute most when building the term-document matrix: words with very low frequency or importance are removed, and words are also selected according to how much each contributes to correctly classifying a document. The conventional approach collects all the documents to be analyzed and selects the words that influence classification. In this study, we instead analyze the documents for each individual stock and select the words that are irrelevant to all categories as neutral words. We then extract the words surrounding each selected neutral word and use them to generate the term-document matrix. The idea is that stock movements are only weakly related to the presence of the neutral words themselves, while the words surrounding a neutral word are more likely to affect stock price movements. The generated term-document matrix is then fed to an algorithm that classifies stock price fluctuations. We first removed stop words and selected neutral words for each stock, and additionally excluded any selected words that also appeared in news articles about other stocks. Through an online news portal, we collected four months of news articles on the top 10 stocks by market capitalization, used the first three months as training data, and applied the remaining month of articles to the model to predict the next day's stock price movements. We built models with SVM, Boosting, and Random Forest. The stock market was open for a total of 80 days over the four months (2016/02/01 ~ 2016/05/31); the first 60 days served as the training set and the remaining 20 days as the test set. The neutral-word-based algorithm proposed in this study showed better classification performance than the word selection method based on sparsity. In summary, we estimated stock price fluctuations with a term-document-matrix-based classification model and compared the existing sparsity-based word extraction method with the suggested method of removing words from the term-document matrix. The suggested method differs in that it uses not only the news articles for the corresponding stock but also news about other stocks to determine which words to remove: it removed both the words that appeared across all rises and falls and the words that appeared commonly in news about other stocks. When prediction accuracy was compared, the suggested method showed higher accuracy.
The limitation of this study is that the prediction task was set up as classifying rises and falls, and the experiment covered only the top ten stocks, which do not represent the entire stock market. In addition, it is difficult to demonstrate investment performance, because stock price fluctuations and profit rates may differ. Further research is therefore needed that uses more stocks and predicts yields through trading simulation.
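To make the comparison concrete, here is a minimal Python sketch contrasting the sparsity-based baseline with the neutral-word idea. The toy articles and labels, the min_df threshold, the chosen neutral word, and the one-word context window are all illustrative assumptions, not the paper's actual data or parameters.

```python
# Sketch: sparsity-based word selection vs. the neutral-word idea.
# All data and thresholds below are toy assumptions.
from sklearn.feature_extraction.text import CountVectorizer

articles = [
    "earnings beat forecast shares rally strongly",
    "regulator probe weighs on shares amid uncertainty",
    "earnings miss forecast shares slide on weak demand",
]
labels = [1, 0, 0]  # 1 = next-day rise, 0 = fall (toy labels)

# Baseline: sparsity-based selection keeps words that appear in at
# least min_df documents, discarding rare terms.
sparse_vec = CountVectorizer(min_df=2)
X_sparse = sparse_vec.fit_transform(articles)
print(sorted(sparse_vec.vocabulary_))  # ['earnings', 'forecast', 'on', 'shares']

# Neutral-word approach (simplified): suppose 'shares' was found to be
# neutral (it appears regardless of rise or fall). Keep the words
# *around* each neutral occurrence as features instead.
NEUTRAL = {"shares"}
WINDOW = 1  # words kept on each side of a neutral word (illustrative)

def context_words(text, neutral=NEUTRAL, window=WINDOW):
    tokens = text.split()
    kept = set()
    for i, tok in enumerate(tokens):
        if tok in neutral:
            kept.update(tokens[max(0, i - window):i])
            kept.update(tokens[i + 1:i + 1 + window])
    return " ".join(kept)

context_docs = [context_words(a) for a in articles]
context_vec = CountVectorizer()
X_context = context_vec.fit_transform(context_docs)
print(sorted(context_vec.vocabulary_))  # words surrounding the neutral term
# Either matrix, together with the labels, can now be fed to SVM,
# Boosting, or Random Forest classifiers, as in the study.
```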

Disease-Related Vocabulary and Its Translingual Practice in the Late 19th to Early 20th Century (19세기 말 20세기 초 질병 어휘와 언어횡단적 실천)

  • Lee, Eunryoung
    • Journal of Sasang Constitutional Medicine / v.31 no.1 / pp.65-78 / 2019
  • Objectives: This study aims to investigate how Korean disease-related vocabulary was established or changed when translated into French or English. Through this, we examine changes in the meaning of diseases and in the ecosystem of disease-related vocabulary during the transition period from the 19th to the 20th century. Methods: Korean disease-related vocabulary was extracted from a total of 148,000 Korean headwords included in our corpus of three bilingual dictionaries. The scope of analysis was limited to the group of vocabulary items containing the high-frequency words disease (病) and symptom (症). Results: The first type of change is the emergence of neologisms; in this case, coexistence of existing vocabulary and new words is observed. The second is the appearance of loanwords written in Hangul. The third is the case where the interpretation of meaning changed while the word form was maintained. Finally, the fourth is that orthographic variants appeared while the meaning of the existing vocabulary was maintained. Discussion: Disease-related vocabulary increased greatly between 1897 and 1931. The factors behind this increase were the emergence of coined words and compound words and the influx of foreign words. Korean and the Western languages created new lexical forms in order to introduce previously unknown concepts to Koreans. We also confirmed that English words expanded their semantic fields by modifying the way the meanings of Korean disease-related vocabulary were represented.
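The headword-filtering step in the Methods can be illustrated in a few lines of Python; the toy headword list below stands in for the roughly 148,000 dictionary entries, and 病/症 are the two high-frequency characters named in the abstract.

```python
# A minimal sketch of the headword-filtering step: keep only entries
# whose headword contains the high-frequency characters 病 (disease)
# or 症 (symptom). The list is a toy stand-in for ~148,000 entries.
headwords = ["감기", "열병(熱病)", "곽란", "통증(痛症)", "학질"]

TARGETS = ("病", "症")
disease_vocab = [w for w in headwords if any(c in w for c in TARGETS)]
print(disease_vocab)  # ['열병(熱病)', '통증(痛症)']
```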

Study on Domestic Awareness of Korean Medicine Treatment for Dysmenorrhea using Big Data (빅데이터를 활용한 월경통에 대한 국내 한방 치료 인식 조사)

  • Ha-Young Jeon;Deok-Sang Hwang;Jin-Moo Lee;Chang-Hoon Lee;Jun-Bock Jang
    • The Journal of Korean Obstetrics and Gynecology / v.37 no.3 / pp.20-32 / 2024
  • Objectives: The purpose of this study is to investigate domestic awareness of Korean medicine treatment for dysmenorrhea. Methods: We conducted word frequency analysis, 2-gram analysis, degree centrality analysis, and CONCOR analysis on big data retrieved with the main key words '월경통' (dysmenorrhea), '생리통' (menstrual pain), and '한방' (Korean medicine). The search period was set from 2019 to 2023. Results: The number of original text documents retrieved through the main key words was on the rise. Specific words related to Korean medicine treatment appeared in the top 100, and words related to herbal medicine ranked particularly high. The top 100 words were divided into 4 clusters (A: body parts or body substances related to dysmenorrhea; B: symptoms that accompany dysmenorrhea; C: physiological or pathological situations related to dysmenorrhea; D: treatment-related words), and some clusters showed close relationships with one another (B-C, B-D, and C-D). Conclusions: Our results suggest that interest in Korean medicine treatment for dysmenorrhea is increasing. Since public understanding is estimated to be high, specific information such as the method, process, and mechanism of Korean medicine treatment appears to be needed.
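The first two analyses named in the Methods can be sketched with Python's standard library; the tokenized posts below are toy stand-ins for the crawled texts, and the degree centrality and CONCOR steps would additionally require building a co-occurrence network (e.g., with networkx), which this sketch omits.

```python
# A minimal sketch of word-frequency and 2-gram counting on toy
# tokenized posts standing in for the actual crawled SNS data.
from collections import Counter

posts = [
    ["dysmenorrhea", "herbal", "medicine", "pain", "relief"],
    ["menstrual", "pain", "herbal", "medicine", "acupuncture"],
]

word_freq = Counter(tok for post in posts for tok in post)
bigrams = Counter(
    (post[i], post[i + 1]) for post in posts for i in range(len(post) - 1)
)
print(word_freq.most_common(3))  # top single words
print(bigrams.most_common(2))    # top 2-grams, e.g. ('herbal', 'medicine')
```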

A Comparative Study on the Frequency of Allophones, Phonemes and Letters in Korean (국어의 이음.음소와 자모의 출현빈도수 조사 대비 및 분석)

  • Lee, Sang-Oak
    • Speech Sciences / v.8 no.3 / pp.51-73 / 2001
  • This study begins with an investigation of the frequency of allophones in narrowly transcribed data from (1) the 2,000 most frequently used words and (2) several passages of standard Seoul Korean. From this, the frequency of phonemes is obtained by summing the counts of their allophones. These two investigations are conducted for the first time in the study of Korean phonology. Previous studies reporting the 'frequency of phonemes' are in fact studies of the 'frequency of letters', and the critical difference between these two types of study has yet to be clarified accurately. This paper also reveals the proportional distribution of natural classes among Korean phonemes and letters.
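The counting procedure described here reduces to tallying allophone symbols and then merging each phoneme's allophones; a minimal sketch follows, in which the toy transcriptions and the allophone-to-phoneme mapping are illustrative assumptions only.

```python
# Tally allophone frequencies from narrow transcriptions, then obtain
# phoneme frequencies by summing over each phoneme's allophones.
from collections import Counter

# Toy narrow transcriptions as lists of allophone symbols; [ɡ] is
# treated as the intervocalic allophone of Korean /k/.
transcriptions = [["k", "a", "ɡ", "a"], ["kʰ", "a", "k", "i"]]
ALLOPHONE_TO_PHONEME = {"k": "k", "ɡ": "k", "kʰ": "kʰ", "a": "a", "i": "i"}

allophone_freq = Counter(sym for t in transcriptions for sym in t)
phoneme_freq = Counter()
for sym, n in allophone_freq.items():
    phoneme_freq[ALLOPHONE_TO_PHONEME[sym]] += n

print(allophone_freq)  # counts per allophone
print(phoneme_freq)    # counts per phoneme, allophones merged
```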


Inverse Document Frequency-Based Word Embedding of Unseen Words for Question Answering Systems (질의응답 시스템에서 처음 보는 단어의 역문헌빈도 기반 단어 임베딩 기법)

  • Lee, Wooin;Song, Gwangho;Shim, Kyuseok
    • Journal of KIISE / v.43 no.8 / pp.902-909 / 2016
  • A question answering system (QA system) finds an actual answer to the question posed by a user, whereas a typical search engine only finds links to the relevant documents. Recent work on open-domain QA systems is receiving much attention in the fields of natural language processing, artificial intelligence, and data mining. However, prior QA systems simply replace all words that are not in the training data with a single token, even though such unseen words are likely to play crucial roles in differentiating candidate answers from actual answers. In this paper, we propose a method to compute vectors for such unseen words by taking into account the contexts in which they occur. We also propose a model that utilizes inverse document frequency (IDF) to efficiently process unseen words by expanding the system's vocabulary. Finally, we validate through experiments that the proposed method and model improve the performance of a QA system.
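One plausible reading of this idea (not necessarily the authors' exact formulation) is to assign an unseen word the IDF-weighted average of its context words' vectors; the tiny corpus, random toy embeddings, and context words below are illustrative assumptions.

```python
# Sketch: embed an unseen word as the IDF-weighted average of the
# embeddings of its context words. All data here are toy assumptions.
import math
import numpy as np

corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "ran"],
    ["a", "cat", "ran"],
]
N = len(corpus)

def idf(word):
    """Smoothed inverse document frequency over the toy corpus."""
    df = sum(word in doc for doc in corpus)
    return math.log(N / (1 + df)) + 1.0

# Toy pretrained embeddings for the known vocabulary only.
rng = np.random.default_rng(0)
emb = {w: rng.normal(size=4) for doc in corpus for w in doc}

def unseen_vector(context):
    """IDF-weighted average of the known context words' vectors."""
    vecs = [emb[w] for w in context if w in emb]
    weights = [idf(w) for w in context if w in emb]
    return np.average(vecs, axis=0, weights=weights)

# The unseen word (say, 'zorb') gets a vector from its context.
print(unseen_vector(["the", "cat", "ran"]))
```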

SNS Big-data Analysis and Implication of the Marine and Fisheries Sector (해양수산 SNS 빅데이터 분석 결과 및 시사점)

  • Park, Kwangseo;Lee, Jeongmin;Lee, Sunryang
    • Journal of the Korean Society for Marine Environment & Energy / v.20 no.2 / pp.117-125 / 2017
  • SNS big data analysis aims to find latent value in the big data produced by social media. In this paper, SNS big data was analyzed to identify Korean public concerns using 24 key words from the marine and fisheries sector. Among the 24 key words, seafood, shipping, and Dokdo Island were mentioned most often, while key words attracting less concern, such as ocean policies and marine security, were mentioned less. Key words driven by government are mostly mentioned by news media, whereas key words driven by the private sector and closely related to people's lives are mostly mentioned on blogs and Twitter. Therefore, reflecting the national concerns revealed by SNS big data analysis, and especially resolving negative factors, is the most significant part of policy establishment. Differentiated promotion methods also need to be prepared, because the frequency with which key words are mentioned differs across media types.
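The per-media breakdown reported here can be tabulated in a few lines; the records below are toy stand-ins for the crawled SNS posts, and pandas is an assumed choice of tooling.

```python
# A minimal sketch: mention counts of key words broken down by media
# type. The records are toy stand-ins for the actual crawled data.
import pandas as pd

records = pd.DataFrame([
    {"keyword": "seafood",        "media": "news"},
    {"keyword": "seafood",        "media": "blog"},
    {"keyword": "shipping",       "media": "news"},
    {"keyword": "Dokdo Island",   "media": "twitter"},
    {"keyword": "ocean policies", "media": "news"},
])

counts = records.groupby(["keyword", "media"]).size().unstack(fill_value=0)
print(counts)  # rows: key words; columns: media types; values: mentions
```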

An Analysis of the Vowel Formants of the Young Males in the Buckeye Corpus (벅아이 코퍼스에서의 젊은 성인 남성의 모음 포먼트 분석)

  • Yoon, Kyu-Chul;Noh, Hye-Uk
    • Phonetics and Speech Sciences / v.4 no.2 / pp.41-49 / 2012
  • The purpose of this paper is to extract the vowel formants of ten young male speakers from the Buckeye Corpus of Conversational Speech [1] and to analyze them, in comparison with earlier works, in terms of various phonetic factors that are expected to affect the realization of the formant distribution. The first two formant frequency values were automatically extracted with a Praat script, along with such factors as the place of articulation, content versus function word status, syllabic stress, the location in a word, the location in an utterance, the speech rate over three consecutive words, and the word frequency in the corpus. The results indicated that, although the overall pattern was similar, the formant patterns from the corpus were very different from those of earlier works, and that these factors were strongly responsible for the realization of the two formants.
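The Praat-scripted extraction can be approximated in Python with the praat-parselmouth package; the wav file name and the measurement time below are illustrative assumptions, and the Burg settings are parselmouth's defaults rather than the paper's.

```python
# Sketch of automatic F1/F2 extraction in the spirit of the Praat
# script described above, via the praat-parselmouth package.
import parselmouth

snd = parselmouth.Sound("speaker01_utterance.wav")  # hypothetical file
formant = snd.to_formant_burg()  # Burg method with default settings

t = 0.150  # seconds; in practice, the vowel's temporal midpoint
f1 = formant.get_value_at_time(1, t)  # first formant (Hz)
f2 = formant.get_value_at_time(2, t)  # second formant (Hz)
print(f"F1 = {f1:.0f} Hz, F2 = {f2:.0f} Hz")
```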

Effects of Name Agreement and Word Frequency on the English-Korean Word Translation Task (영어-한국어 단어번역과제에서 이름-일치도와 단어빈도의 효과)

  • Koo, Min-Mo;Nam, Ki-Chun
    • MALSORI / no.61 / pp.31-48 / 2007
  • This study investigated the roles of name agreement and word frequency in an English-Korean word translation task. Using low-frequency homonyms with low name agreement as stimuli, Experiment 1 revealed that the name agreement of the materials is a determinant that can modulate the time needed to translate English words into their Korean equivalents. In contrast, Experiment 2, using low-frequency homonyms with high name agreement as stimuli, showed that name agreement does not play a decisive role in the translation task. In Experiment 3, we confirmed that the frequency effects observed in the previous two experiments indeed arise during lexical access. Our findings suggest that the word frequency of the materials has a strong influence on English-Korean word translation times, and that homonyms are represented independently of each other at the lexeme level.


Term Frequency-Inverse Document Frequency (TF-IDF) Technique Using Principal Component Analysis (PCA) with Naive Bayes Classification

  • J.Uma;K.Prabha
    • International Journal of Computer Science & Network Security / v.24 no.4 / pp.113-118 / 2024
  • Sentiment analysis on Twitter is difficult to perform well, even though it is widely used for review mining, because tweets are extremely short and largely consist of slang, emoticons, and hashtags mixed with ordinary words. Feature extraction is the technique of building a feature vector from an individual tweet; each element of the vector is a value that contributes to assigning a sentiment class to the tweet. The purpose of feature extraction is to retain exactly the right attributes and thereby improve the accuracy of the classification models. In this manuscript we propose a Term Frequency-Inverse Document Frequency (TF-IDF) method combined with Principal Component Analysis (PCA) and a Naïve Bayes classifier. In the classification process, the proposed approach can derive distinct aspects from the most highly weighted features extracted from a Twitter dataset.
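A pipeline of the kind named in the title can be assembled with scikit-learn; the toy tweets, labels, and component count below are assumptions, and GaussianNB is chosen here because PCA outputs dense, signed values that a multinomial Naive Bayes cannot accept.

```python
# Minimal sketch of a TF-IDF -> PCA -> Naive Bayes pipeline on toy data.
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

tweets = [
    "love this phone #happy", "worst service ever :(",
    "great camera so good", "terrible battery life",
]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative (toy labels)

pipe = make_pipeline(
    TfidfVectorizer(),
    # PCA needs a dense array, so densify the sparse TF-IDF matrix.
    FunctionTransformer(lambda X: X.toarray(), accept_sparse=True),
    PCA(n_components=3),
    GaussianNB(),  # handles PCA's signed, continuous features
)
pipe.fit(tweets, labels)
print(pipe.predict(["battery is terrible"]))  # expected: [0]
```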

Word Recognition Using VQ and Fuzzy Theory (VQ와 Fuzzy 이론을 이용한 단어인식)

  • Kim, Ja-Ryong;Choi, Kap-Seok
    • The Journal of the Acoustical Society of Korea
    • /
    • v.10 no.4
    • /
    • pp.38-47
    • /
    • 1991
  • Frequency variation among speakers is one of the problems in speech recognition. This paper applies fuzzy theory to this variation in frequency features. Reference patterns are expressed as fuzzified patterns produced from the peak frequency and peak energy extracted from codebooks, which are generated from training words uttered by several speakers so that they capture features common to the speech signals. Words are recognized by fuzzy inference using the certainty factor between the reference patterns and test fuzzified patterns, the latter produced from the peak frequency and peak energy extracted from the power spectrum of the input speech signal. In computing the certainty factor, to reduce memory capacity and computation requirements, we propose a new equation that calculates an improved certainty factor using only the difference between two fuzzy values. Experiments on this fuzzy-inference word recognition method with Korean digits show that the method using the new equation can cope with the variation in frequency features and that memory and computation requirements are reduced.
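The abstract does not reproduce the improved certainty-factor equation itself, so the sketch below only illustrates the VQ step with a k-means codebook and uses a generic difference-based similarity between fuzzy membership vectors as a stand-in for it.

```python
# Generic sketch: k-means VQ codebook plus a difference-based
# comparison of fuzzy membership vectors. The similarity at the end is
# an illustrative placeholder, not the paper's certainty-factor equation.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
train_frames = rng.normal(size=(200, 2))  # toy (peak freq, peak energy)

# VQ: build a codebook from the training frames.
km = KMeans(n_clusters=8, n_init=10, random_state=0).fit(train_frames)

def fuzzify(frames):
    """Soft membership of each frame to each codeword (rows sum to 1)."""
    d = np.linalg.norm(frames[:, None, :] - km.cluster_centers_, axis=2)
    inv = 1.0 / (d + 1e-9)
    return inv / inv.sum(axis=1, keepdims=True)

reference = fuzzify(train_frames[:10]).mean(axis=0)    # a reference pattern
test = fuzzify(rng.normal(size=(10, 2))).mean(axis=0)  # a test pattern

# Placeholder certainty factor: 1 minus the mean absolute difference
# between the two fuzzy membership vectors (higher = more similar).
certainty = 1.0 - np.abs(reference - test).mean()
print(f"certainty factor ~ {certainty:.3f}")
```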
