• Title/Summary/Keyword: Frequency based Text Analysis


A Study on Perceptions of Virtual Influencers through YouTube Comments -Focusing on Positive and Negative Emotional Responses Toward Character Design- (유튜브 댓글을 통해 살펴본 버추얼 인플루언서에 대한 인식 연구 -캐릭터 디자인에 대한 긍부정 감성 반응을 중심으로-)

  • Hyosun An;Jiyoung Kim
    • Journal of the Korean Society of Clothing and Textiles / v.47 no.5 / pp.873-890 / 2023
  • This study analyzed users' emotional responses to VI character design through YouTube comments. The researchers applied text-mining to analyze 116,375 comments, focusing on terms related to character design and characteristics of VI. Using the BERT model in sentiment analysis, we classified comments into extremely negative, negative, neutral, positive, or extremely positive sentiments. Next, we conducted a co-occurrence frequency analysis on comments with extremely negative and extremely positive responses to examine the semantic relationships between character design and emotional characteristic terms. We also performed a content analysis of comments about Miquela and Shudu to analyze the perception differences regarding the two character designs. The results indicate that form elements (e.g., voice, face, and skin) and behavioral elements (e.g., speaking, interviewing, and reacting) are vital in eliciting users' emotional responses. Notably, in the negative responses, users focused on the humanization aspect of voice and the authenticity aspect of behavior in speaking, interviewing, and reacting. Furthermore, we found differences in the character design elements and characteristics that users expect based on the VI's field of activity. As a result, this study suggests applications to character design to accommodate these variations.
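
The co-occurrence frequency step described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the comments, the term list, and the function name `cooccurrence_counts` are hypothetical, and tokenization is a plain whitespace split rather than the morphological analysis a real study would likely need.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(comments, vocabulary):
    """Count, for each pair of vocabulary terms, the comments containing both."""
    counts = Counter()
    for comment in comments:
        # Use a set so a term repeated inside one comment is counted once.
        present = sorted(set(comment.lower().split()) & vocabulary)
        counts.update(combinations(present, 2))
    return counts

# Hypothetical comments and term list, invented for illustration only.
comments = [
    "her voice sounds so fake",
    "the face looks real but the voice is creepy",
    "love her face and skin",
    "fake voice fake smile",
]
vocabulary = {"voice", "face", "skin", "fake", "creepy", "real"}
pairs = cooccurrence_counts(comments, vocabulary)
print(pairs.most_common(3))
```

Pairs are stored as alphabetically sorted tuples, so ("fake", "voice") and ("voice", "fake") are the same key.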

Analyzing the Effect of Characteristics of Dictionary on the Accuracy of Document Classifiers (용어 사전의 특성이 문서 분류 정확도에 미치는 영향 연구)

  • Jung, Haegang;Kim, Namgyu
    • Management & Information Systems Review / v.37 no.4 / pp.41-62 / 2018
  • As the volume of unstructured data from social media, Internet news articles, and blogs grows, text analysis and research on it are becoming more important. Because text analysis is mostly performed on a specific domain or topic, constructing and applying a domain-specific dictionary has also become more important. The quality of the dictionary directly affects the results of unstructured data analysis, and it matters all the more because the dictionary presents the perspective of the analysis. In the literature, most studies on text analysis have emphasized the importance of dictionaries for obtaining clean, high-quality results. However, the effect of dictionaries has not been rigorously verified, even though the dictionary is widely regarded as the most essential factor in text analysis. In this paper, we generate three dictionaries in different ways from 39,800 news articles, and we analyze and verify the effect of each dictionary on the accuracy of document classification by defining the concept of Intrinsic Rate. The three construction methods are: 1) a batch method that builds a dictionary based on the frequency of terms across all documents; 2) a method that extracts terms by category and integrates them; and 3) a method that extracts features for each category and integrates them. We compared the accuracy of three artificial neural network-based document classifiers to evaluate the quality of the dictionaries. In the experiment, accuracy tended to increase when the Intrinsic Rate was high, which suggests that document classification accuracy can be improved by increasing the Intrinsic Rate of the dictionary.

Analysis of the Yearbook from the Korea Meteorological Administration using a text-mining algorithm (텍스트 마이닝 알고리즘을 이용한 기상청 기상연감 자료 분석)

  • Sun, Hyunseok;Lim, Changwon;Lee, YungSeop
    • The Korean Journal of Applied Statistics / v.30 no.4 / pp.603-613 / 2017
  • Many people now post about their personal interests on social media. The development of the Internet and computer technology has made it possible to store documents in digital form, which has led to an explosion in the amount of textual data generated and, in turn, to increased demand for technology that extracts valuable information from large numbers of documents. Text mining is often used because text data is mostly unstructured and therefore not suitable for direct application of statistical analysis or data mining techniques. This study analyzed the Meteorological Yearbook data of the Korea Meteorological Administration (KMA) with a text mining technique. First, a term dictionary was constructed through preprocessing and a term-document matrix was generated. This term dictionary was then used to calculate the annual frequency of each term and to observe the change in relative frequency for frequently appearing words. We also used regression analysis to identify terms with increasing and decreasing trends. We analyzed trends in the Meteorological Yearbook of the KMA, including trends in weather-related news, weather status, and the work that the KMA focused on. This study aims to provide useful information that can help analyze and improve meteorological services and reflect meteorological policy.
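
The annual relative-frequency and trend steps in this abstract can be illustrated with a small sketch. The abstract does not specify the regression model, so this assumes ordinary least squares of relative frequency on year. The token lists and the names `relative_frequency` and `ols_slope` are hypothetical.

```python
from collections import Counter

def relative_frequency(tokens_by_year, term):
    """Share of each year's tokens that are `term`."""
    return {
        year: Counter(tokens)[term] / len(tokens)
        for year, tokens in tokens_by_year.items()
    }

def ols_slope(xs, ys):
    """Slope of the least-squares line fitted to the points (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

# Hypothetical preprocessed tokens per yearbook, for illustration only.
tokens_by_year = {
    2001: ["typhoon", "rain", "rain", "snow"],
    2002: ["typhoon", "typhoon", "rain", "snow"],
    2003: ["typhoon", "typhoon", "typhoon", "rain"],
}
years = sorted(tokens_by_year)

def trend(term):
    freq = relative_frequency(tokens_by_year, term)
    return ols_slope(years, [freq[y] for y in years])

typhoon_slope = trend("typhoon")  # positive slope: increasing trend
snow_slope = trend("snow")        # negative slope: decreasing trend
print(typhoon_slope, snow_slope)
```

Relative rather than raw frequency is used so that yearbooks of different lengths remain comparable.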

The prediction of the stock price movement after IPO using machine learning and text analysis based on TF-IDF (증권신고서의 TF-IDF 텍스트 분석과 기계학습을 이용한 공모주의 상장 이후 주가 등락 예측)

  • Yang, Suyeon;Lee, Chaerok;Won, Jonggwan;Hong, Taeho
    • Journal of Intelligence and Information Systems / v.28 no.2 / pp.237-262 / 2022
  • There has been a growing interest in IPOs (Initial Public Offerings) due to the profitable returns that IPO stocks can offer to investors. However, IPOs can be speculative investments that may involve substantial risk as well because shares tend to be volatile, and the supply of IPO shares is often highly limited. Therefore, it is crucially important that IPO investors are well informed of the issuing firms and the market before deciding whether to invest or not. Unlike institutional investors, individual investors are at a disadvantage since there are few opportunities for individuals to obtain information on the IPOs. In this regard, the purpose of this study is to provide individual investors with the information they may consider when making an IPO investment decision. This study presents a model that uses machine learning and text analysis to predict whether an IPO stock price would move up or down after the first 5 trading days. Our sample includes 691 Korean IPOs from June 2009 to December 2020. The input variables for the prediction are three tone variables created from IPO prospectuses and quantitative variables that are either firm-specific, issue-specific, or market-specific. The three prospectus tone variables indicate the percentage of positive, neutral, and negative sentences in a prospectus, respectively. We considered only the sentences in the Risk Factors section of a prospectus for the tone analysis in this study. All sentences were classified into 'positive', 'neutral', and 'negative' via text analysis using TF-IDF (Term Frequency - Inverse Document Frequency). Measuring the tone of each sentence was conducted by machine learning instead of a lexicon-based approach due to the lack of sentiment dictionaries suitable for Korean text analysis in the context of finance. 
For this reason, the training set was created by randomly selecting 10% of the sentences from each prospectus, and each sentence in the training set was labeled manually after being read in person. Then, based on the training set, a Support Vector Machine model was used to predict the tone of the sentences in the test set. Finally, the machine learning model calculated the percentages of positive, neutral, and negative sentences in each prospectus. To predict the price movement of an IPO stock, four different machine learning techniques were applied: Logistic Regression, Random Forest, Support Vector Machine, and Artificial Neural Network. According to the results, models that use quantitative variables using technical analysis together with prospectus tone variables show higher accuracy than models that use only quantitative variables. More specifically, prediction accuracy improved by 1.45 percentage points in the Random Forest model, 4.34 percentage points in the Artificial Neural Network model, and 5.07 percentage points in the Support Vector Machine model. Among the machine learning techniques tested, the Artificial Neural Network model using both quantitative variables and prospectus tone variables had the highest prediction accuracy, at 61.59%. The results indicate that the tone of a prospectus is a significant factor in predicting the price movement of an IPO stock. In addition, the McNemar test was used to verify whether the difference between the models was statistically significant. The model using only quantitative variables was compared with the model using both quantitative variables and prospectus tone variables, and the predictive performance was confirmed to improve significantly at the 1% significance level.
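
The McNemar comparison mentioned at the end of this abstract can be sketched as below. The abstract does not say which variant of the test was used, so this assumes the continuity-corrected chi-square form with one degree of freedom. The correctness vectors are invented and do not reproduce the paper's results.

```python
import math

def mcnemar(correct_a, correct_b):
    """Continuity-corrected McNemar test on paired correctness flags.

    Returns (statistic, p_value). Only discordant pairs, where exactly one
    of the two models is correct, influence the result.
    """
    b = sum(1 for a, other in zip(correct_a, correct_b) if a and not other)
    c = sum(1 for a, other in zip(correct_a, correct_b) if not a and other)
    if b + c == 0:
        return 0.0, 1.0
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of chi-square with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Hypothetical outcomes on the same 138 test cases (not the paper's data):
# 100 both correct, 5 only model A correct, 20 only model B correct, 13 both wrong.
correct_quant_only = [True] * 100 + [True] * 5 + [False] * 20 + [False] * 13
correct_with_tone = [True] * 100 + [False] * 5 + [True] * 20 + [False] * 13

stat, p_value = mcnemar(correct_quant_only, correct_with_tone)
print(f"statistic={stat:.2f}, p={p_value:.4f}")
```

For small discordant counts an exact binomial version of the test is usually preferred over this approximation.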

Analysis of Plant Species in Elementary School Textbooks in South Korea

  • Kwon, Min Hyeong
    • Journal of People, Plants, and Environment / v.24 no.5 / pp.485-498 / 2021
  • Background and objective: This study was conducted to determine the status of plant utilization in current textbooks by analyzing the plants by grade and subject in the national textbooks for all elementary school grades under the 2015 revised curriculum in Korea. Methods: Microsoft Office Excel was used to obtain the frequency and ratio of the collected plant data, SPSS for Windows 26.0 was used to determine learning content areas by grade, and the R program was used to visualize the learning content areas. Results: A total of 232 plant species were presented 1,047 times in the national textbooks. By grade, the number of species increased through the lower grades and then tended to decrease in the fifth and sixth grades, the upper grades of elementary school. By subject, Korean Language had the highest number and frequency of plant species. Plants were presented in textbooks mainly as text, followed by illustrations and photos, which were used largely in first grade textbooks. As for the learning content areas in which plants were used, plants appeared in the linguistic domain in the lower grades and in the botanical and environmental domains of the natural sciences in the upper grades. Herbaceous plants were presented more often than woody plants, and in the analysis by crop classification, horticultural crops were presented the most, followed by food crops. Among horticultural crops, flowering plants showed the greatest diversity, with 63 species, but the plants that appeared most frequently were fruit trees commonly encountered in real life. Conclusion: Various plant species were included in elementary school textbooks, but most of them were horticultural crops encountered in real life depending on their use. Nevertheless, the plant species with high frequency have followed a frequency trend similar to that of previous curriculums. Therefore, in the next curriculum, plant learning materials should reflect social changes and students' preferences for plants.

Measuring the Confidence of Human Disaster Risk Case based on Text Mining (텍스트마이닝 기반의 인적재난사고사례 신뢰도 측정연구)

  • Lee, Young-Jai;Lee, Sung-Soo
    • The Journal of Information Systems / v.20 no.3 / pp.63-79 / 2011
  • Deducing the risk level of infrastructure and buildings from past human disaster risk cases and implementing preventive measures are important activities for disaster prevention. The objective of this study is to measure confidence so that various disaster risk cases can be analyzed quantitatively through text mining methodology. By examining the confidence calculation process and method, this study also suggests a basic quantitative framework. The framework for measuring confidence consists of four stages. In the first stage, correlations are described by categorizing basic elements based on a human disaster ontology. In the second stage, a Term-Document Matrix of terms and cases is created, the frequency of cases and terms is quantified, and correlation values are substituted for missing values. In the third stage, association rules are created according to the basic elements of human disaster risk cases. In the last stage, the confidence value of disaster risk cases is measured through the association rules. This confidence value becomes a key element when deciding the risk level of a new disaster risk and the preventive measures that follow. Using a collection of human disaster risk cases related to road infrastructure, this study demonstrates a case in which the four stages of the quantitative framework and process were actually applied for verification.
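
The confidence measure in the last stage can be illustrated with the standard association-rule definition, where the confidence of a rule A → B is the share of cases containing A that also contain B. Whether the paper uses exactly this definition is an assumption. The cases and the function name `confidence` are hypothetical.

```python
def confidence(cases, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent over a list of cases.

    Each case is a set of terms. Returns 0.0 when no case contains the
    antecedent, so the rule is never applicable.
    """
    antecedent = set(antecedent)
    both = antecedent | set(consequent)
    n_antecedent = sum(1 for case in cases if antecedent <= case)
    n_both = sum(1 for case in cases if both <= case)
    return n_both / n_antecedent if n_antecedent else 0.0

# Hypothetical road-infrastructure cases, invented for illustration only.
cases = [
    {"road", "rain", "collapse"},
    {"road", "rain"},
    {"road", "rain", "collapse"},
    {"bridge", "rain", "collapse"},
]
rule_conf = confidence(cases, {"road", "rain"}, {"collapse"})
print(round(rule_conf, 3))
```

Three cases contain both "road" and "rain", and two of those also contain "collapse", so the rule's confidence is 2/3.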

A Study of Perception of Golfwear Using Big Data Analysis (빅데이터를 활용한 골프웨어에 관한 인식 연구)

  • Lee, Areum;Lee, Jin Hwa
    • Fashion & Textile Research Journal / v.20 no.5 / pp.533-547 / 2018
  • The objective of this study is to examine the perception of golfwear and related trends based on major keywords and associated words related to golfwear, utilizing big data. For this study, data were collected from blogs, Jisikin and Tips, news articles, and web cafés on two of the most commonly used search engines (Naver & Daum) using the keywords 'Golfwear' and 'Golf clothes'. Frequency and matrix data were extracted through Textom for the period from January 1, 2016 to December 31, 2017. From the matrix created by Textom, degree centrality, closeness centrality, betweenness centrality, and eigenvector centrality were calculated and analyzed using Netminer 4.0. The analysis found that the keyword 'brand' ranked highest in web visibility, followed by 'woman', 'size', 'man', 'fashion', 'sports', 'price', 'store', 'discount', and 'equipment' in the top 10 frequency rankings. For the centrality calculations, only the top 30 keywords were included because the density was extremely high due to the high frequency of co-occurring keywords. The centrality calculations showed that the top-ranked keywords were similar to those in the raw frequency data. When the frequency was adjusted by subtracting 100 and 500 words, the results differed: keywords that ranked low in the frequency analysis, such as J. Lindberg, ranked high, and the rankings changed in all centrality calculations. These findings provide a basis for marketing strategies and for ways to increase awareness and web visibility of golfwear brands.
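
Of the four centrality measures named in this abstract, degree centrality is the simplest to illustrate. The sketch below assumes an unweighted, undirected keyword network and the common normalization by n - 1. Netminer's exact computation may differ, and the edge list is invented.

```python
from collections import defaultdict

def degree_centrality(edges):
    """Normalized degree centrality for an undirected, unweighted network.

    Each node's score is its number of distinct neighbors divided by n - 1,
    where n is the number of nodes that appear in at least one edge.
    """
    neighbors = defaultdict(set)
    for a, b in edges:
        if a != b:  # ignore self-loops
            neighbors[a].add(b)
            neighbors[b].add(a)
    n = len(neighbors)
    if n < 2:
        return {node: 0.0 for node in neighbors}
    return {node: len(nbrs) / (n - 1) for node, nbrs in neighbors.items()}

# Hypothetical keyword co-occurrence edges, for illustration only.
edges = [
    ("brand", "woman"),
    ("brand", "size"),
    ("brand", "man"),
    ("woman", "size"),
]
centrality = degree_centrality(edges)
for keyword, score in sorted(centrality.items(), key=lambda kv: -kv[1]):
    print(keyword, round(score, 2))
```

Here 'brand' is linked to every other keyword and therefore scores 1.0, the maximum possible value.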

A Suggestion for Spatiotemporal Analysis Model of Complaints on Officially Assessed Land Price by Big Data Mining (빅데이터 마이닝에 의한 공시지가 민원의 시공간적 분석모델 제시)

  • Cho, Tae In;Choi, Byoung Gil;Na, Young Woo;Moon, Young Seob;Kim, Se Hun
    • Journal of Cadastre & Land InformatiX / v.48 no.2 / pp.79-98 / 2018
  • The purpose of this study is to suggest a model for analysing the spatio-temporal characteristics of civil complaints about the officially assessed land price based on big data mining. Specifically, the underlying reasons for the civil complaints were sought from a spatio-temporal perspective rather than in institutional factors, and a model was suggested for monitoring trends in the occurrence of such complaints. The official documents of 6,481 civil complaints about the officially assessed land price in the Jung-gu district of Incheon Metropolitan City from 2006 to 2015, along with their temporal and spatial properties, were collected and used for the analysis. Frequencies of major key words were examined using a text mining method. Correlations among major key words were studied through social network analysis. By calculating term frequency (TF) and term frequency-inverse document frequency (TF-IDF), which serve as weights for key words, we identified the major key words associated with the occurrence of civil complaints about the officially assessed land price. Then the spatio-temporal characteristics of the civil complaints were examined through hot spot analysis based on the Getis-Ord $Gi^*$ statistic. It was found that the characteristics of civil complaints about the officially assessed land price were changing, forming clusters that are linked spatio-temporally. Using text mining and social network analysis, we found that the reasons for civil complaints about the officially assessed land price can be identified quantitatively from natural language. TF and TF-IDF, as key word weights, can be used as main explanatory variables for analysing the spatio-temporal characteristics of these complaints, since these statistics differ over time and across regions.
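
The TF and TF-IDF weights referred to in this abstract can be sketched as follows. Several TF-IDF variants exist and the abstract does not state which was used, so this assumes TF as the term's share of the document and IDF as log(N / df). The documents and the function name `tf_idf` are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document.

    TF is the term's share of the document's tokens. IDF is log(N / df),
    where df is the number of documents containing the term.
    """
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical tokenized complaint documents, for illustration only.
docs = [
    ["complaint", "land", "price"],
    ["complaint", "land", "tax"],
    ["complaint", "tax", "tax"],
]
weights = tf_idf(docs)
print(weights[0])
```

A term that appears in every document, such as "complaint" here, receives a weight of zero, which is why TF-IDF highlights terms that distinguish one group of documents from another.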

An Analysis of Scientific Concepts Pre-service Elementary School Teachers Have through Semantic Network Analysis (의미 네트워크 분석법을 활용한 초등 예비교사들이 생각하는 과학에 대한 의미 분석)

  • Kim, Dong-Ryeul
    • Journal of Korean Elementary Science Education / v.32 no.3 / pp.327-345 / 2013
  • This study aims to investigate how pre-service elementary school teachers understand 'something scientific', 'being scientific', 'scientific events' and 'scientific questions' through semantic network analysis. To this end, the study carried out a central analysis of the frequency and density of words and the degree of connection between key words, a concentric analysis, a click analysis and a common network analysis through text semantic network analysis using the NetMiner 4.0 program. Based on the results of these analyses, the study reached the following conclusions. Firstly, in perceiving 'something scientific', pre-service elementary school teachers recognized 'verification', 'objective' and 'experiment' as the most important words. In other words, they perceived that the main grounds for something scientific should be clear facts that can be verified and that are accompanied by an exact and logical theoretical system. In regard to 'being scientific', they perceived 'explanation', 'objective' and 'verification' as the most important words, holding the traditional view that science is something that can be explained objectively. Secondly, when the term 'observation' was contained in 'scientific events', a high proportion of them understood the event as scientific. As reasons for judging an event scientific, they mentioned 'observation' most frequently, and as reasons for judging it unscientific, they mentioned 'behavior' most frequently. In perceiving 'scientific questions', they most frequently judged bacteria-related questions to be scientific. As the reason for judging a question scientific, they mentioned 'observation' most frequently, as with 'scientific events', while they mentioned 'value judgement' most frequently as the reason for judging a question unscientific. 
From the results of integrated network analysis, this study found out that words pre-service teachers commonly used in stating scientific events or scientific questions were overlapped with words they mentioned for scientific events or scientific questions. As a result, it was found there were many pre-service teachers having interpreted scientific words without clearly distinguishing scientific events or scientific questions.

A Study of Secondary Mathematics Materials at a Gifted Education Center in Science Attached to a University Using Network Text Analysis (네트워크 텍스트 분석을 활용한 대학부설 과학영재교육원의 중등수학 강의교재 분석)

  • Kim, Sungyeun;Lee, Seonyoung;Shin, Jongho;Choi, Won
    • Communications of Mathematical Education / v.29 no.3 / pp.465-489 / 2015
  • The purpose of this study is to suggest implications for the development and revision of future teaching materials for mathematically gifted students by using network text analysis of secondary mathematics materials. The subjects of the analysis were the learning goals of 110 teaching materials used from 2002 to 2014 in a gifted education center in science attached to a university. Key words were selected by analysing the frequency of the texts that appeared in the learning goals. A co-occurrence matrix of the key words was built, and basic network information, centrality, centralization, components, and k-cores were derived. For the analysis, the KrKwic, KrTitle, and NetMiner4.0 programs were used, respectively. The results of this study were as follows. First, in the whole network from 2002 to 2014, a pivot was formed with core hubs including 'diversity', 'understanding', 'concept', 'method', 'application', 'connection', 'problem solving', 'basic', 'real life', and 'thinking ability'. In addition, the centralization analysis showed that knowledge aspects were well reflected in the teaching materials. Second, network text analysis was conducted for each of the three periods of the Master Plan for the promotion of gifted education. As a result, a network was built around 'understanding', and there were strong ties among 'question', 'answer', and 'problem solving' regardless of the period. In contrast, the centrality analysis showed that 'communication', 'discovery', and 'proof' appeared only in the first, second, and third periods of the Master Plan, respectively. Therefore, the results of this study suggest that, in developing and revising teaching materials, affective aspects and activities involving high-level cognitive processes should be included, and mannerism and ahistoricism in learning goals should be prevented.
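
The k-core step mentioned in this abstract can be illustrated with a short peeling procedure: nodes with fewer than k neighbors are removed repeatedly until every remaining node has at least k. This is a generic sketch on an invented keyword network, not a reproduction of the NetMiner computation, and the function name `k_core` is hypothetical.

```python
from collections import defaultdict

def k_core(edges, k):
    """Return the node set of the k-core of an undirected network."""
    adj = defaultdict(set)
    for a, b in edges:
        if a != b:  # ignore self-loops
            adj[a].add(b)
            adj[b].add(a)
    adj = dict(adj)
    changed = True
    while changed:
        changed = False
        # Snapshot the low-degree nodes first, then peel them off.
        for node in [n for n, nbrs in adj.items() if len(nbrs) < k]:
            for nbr in adj.pop(node):
                if nbr in adj:
                    adj[nbr].discard(node)
            changed = True
    return set(adj)

# Hypothetical keyword co-occurrence edges, for illustration only.
edges = [
    ("understanding", "concept"),
    ("understanding", "method"),
    ("concept", "method"),
    ("method", "application"),
]
core2 = k_core(edges, 2)
print(sorted(core2))
```

In this toy network 'application' has a single tie and is peeled away, leaving the triangle of 'understanding', 'concept', and 'method' as the 2-core.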