• Title/Summary/Keyword: language models

The Audience Behavior-based Emotion Prediction Model for Personalized Service (고객 맞춤형 서비스를 위한 관객 행동 기반 감정예측모형)

  • Ryoo, Eun Chung;Ahn, Hyunchul;Kim, Jae Kyeong
    • Journal of Intelligence and Information Systems / v.19 no.2 / pp.73-85 / 2013
  • In today's information society, knowledge services that use information to create value are becoming steadily more important. Advances in IT have made information easy to collect and use, and companies in many industries actively use customer information for marketing. Since the turn of the 21st century, companies have also used culture and the arts to manage their corporate image and to support marketing tied to their commercial interests, because technology alone rarely attracts or holds consumers' interest. Cultural activities have therefore become a common tool for differentiation, and many firms build new marketing strategies around customer experience in order to respond to competitive markets. As a result, demand is growing rapidly for personalized services that offer people new experiences based on personal profile information such as language, symbols, behavior, and emotions. Such information makes it possible to assess the interaction between people and content and to maximize customer experience and satisfaction. Among the various lines of work on customer-centered services, emotion recognition has recently attracted particular attention. Existing studies have mostly recognized emotion from bio-signals, or from the voice and the face, where emotional changes are pronounced. However, limitations of equipment and service environments make it difficult to predict people's emotions in this way. In this paper, we therefore develop an emotion prediction model based on a vision-based interface to overcome these limitations. Several researchers have already studied emotion recognition based on gesture and posture.
This paper develops a model that recognizes people's emotional states from body gesture and posture using the difference image method, and identifies the best-performing model for predicting each of four emotions (sadness, surprise, joy, and disgust). To build the model, we installed an event booth in the KOCCA lobby and showed visitors video clips chosen to stimulate each emotion, recording the resulting changes in body gesture and posture. We then extracted body movements using the difference image method and preprocessed the data for training a neural network. Models were built for three time-frame sets (20, 30, and 40 frames), and the model with the best performance was adopted. Before the three models were built, the full data set of 97 samples was divided into training, test, and validation sets. The models were artificial neural networks trained with the back-propagation algorithm, with the learning rate and the momentum rate both set to 10% and the sigmoid function used as the transfer function. Each network was a three-layer perceptron with one hidden layer and four output nodes. Based on the test data set, training was stopped at 50,000 iterations after the minimum error had been reached, in order to explore the best stopping point. Finally, we computed the accuracy of each model and identified the best model for each emotion. The 20-frame model gave prediction accuracies of 100% for sadness and 96% for joy, and the 30-frame model gave 88% for surprise and 98% for disgust. These findings are expected to provide an effective algorithm for personalized services in industries such as advertising, exhibitions, and performances.
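
The difference image method mentioned in this abstract can be illustrated with a minimal Python sketch. It assumes grayscale frames stored as nested lists; the threshold value and the `motion_amount` feature are hypothetical choices for illustration, not details taken from the paper.

```python
def difference_image(prev_frame, curr_frame, threshold=30):
    """Return a binary motion mask: 1 where the grayscale change exceeds threshold.

    The threshold of 30 is an assumed value, not one reported in the abstract.
    """
    return [
        [1 if abs(c - p) > threshold else 0 for p, c in zip(prev_row, curr_row)]
        for prev_row, curr_row in zip(prev_frame, curr_frame)
    ]


def motion_amount(mask):
    """Fraction of pixels flagged as moving (a hypothetical per-frame feature)."""
    total = sum(len(row) for row in mask)
    return sum(sum(row) for row in mask) / total


# Two tiny 3x3 grayscale frames (values 0-255), made up for illustration.
frame_a = [[10, 10, 10], [10, 10, 10], [10, 10, 10]]
frame_b = [[10, 200, 10], [10, 200, 10], [10, 10, 10]]

mask = difference_image(frame_a, frame_b)
print(mask)                 # [[0, 1, 0], [0, 1, 0], [0, 0, 0]]
print(motion_amount(mask))  # 2 of the 9 pixels changed
```

A sequence of such per-frame values over 20, 30, or 40 frames is one plausible way to form the input vector for a neural network, though the abstract does not say how the features were actually encoded.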

The Analysis on the Relationship between Firms' Exposures to SNS and Stock Prices in Korea (기업의 SNS 노출과 주식 수익률간의 관계 분석)

  • Kim, Taehwan;Jung, Woo-Jin;Lee, Sang-Yong Tom
    • Asia pacific journal of information systems / v.24 no.2 / pp.233-253 / 2014
  • Can the stock market really be predicted? Stock market prediction has attracted attention from many fields, including business, economics, statistics, and mathematics. Early research on stock market prediction was based on random walk theory (RWT) and the efficient market hypothesis (EMH). According to the EMH, the stock market is driven largely by new information rather than by present and past prices; since new information is unpredictable, the stock market follows a random walk. Despite these theories, Schumaker [2010] asserted that people keep trying to predict the stock market using artificial intelligence, statistical estimates, and mathematical models. Mathematical approaches include percolation methods, log-periodic oscillations, and wavelet transforms to model future prices. Artificial intelligence approaches, which deal with optimization and machine learning, include genetic algorithms, support vector machines (SVM), and neural networks. Statistical approaches typically predict the future from past stock market data. Recently, financial engineers have started to predict stock price movement patterns using SNS data. SNS is a place where people's opinions and ideas flow freely and affect others' beliefs. Through word of mouth on SNS, people share product usage experiences, subjective feelings, and the accompanying sentiment or mood with others. A growing number of empirical analyses of sentiment and mood are based on textual collections of public user-generated data on the web. Opinion mining is the area of data mining that extracts public opinions expressed on SNS. There have been many studies on opinion mining from web sources such as product reviews, forum posts, and blogs. Building on this literature, we try to understand the effects of firms' SNS exposure on stock prices in Korea.
Similarly to Bollen et al. [2011], we empirically analyze the impact of SNS exposure on stock returns. We use Social Metrics by Daum Soft, an SNS big data analysis company in Korea. Social Metrics reports trends and public opinion on Twitter and blogs using natural language processing and analysis tools. It collects sentences circulating on Twitter in real time, breaks them down into word units, and extracts keywords. In this study, we classify firms' SNS exposures into two groups: positive and negative. To test correlation and causation between SNS exposure and stock returns, we first collect the stock prices of 252 firms and the KRX100 index on the Korea Stock Exchange (KRX) from May 25, 2012 to September 1, 2012. We also gather public attitudes (positive, negative) toward these firms from Social Metrics over the same period. We run a regression analysis between stock prices and the number of SNS exposures. Having checked the correlation between the two variables, we perform a Granger causality test to determine the direction of causation. The results show that the total number of SNS exposures is positively related to stock market returns. The number of positive mentions is also positively related to stock market returns. In contrast, the number of negative mentions is negatively related to stock market returns, but this relationship is not statistically significant. This means that the impact of positive mentions is statistically larger than that of negative mentions. We also investigate whether the effects are moderated by industry type and firm size, and find that SNS exposure has a larger effect for IT firms than for non-IT firms, and for small firms than for large firms. The Granger causality test shows that changes in stock returns are caused by SNS exposure, while causation in the opposite direction is not significant.
Therefore, the relationship between SNS exposure and stock prices shows unidirectional causality. The more a firm is exposed on SNS, the more likely its stock price is to rise, while stock price changes may not lead to more SNS mentions.
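
The idea of checking the direction of a relationship can be illustrated with a lagged correlation, sketched below in Python. This is deliberately simpler than the Granger causality test used in the study, which compares regression models with and without lagged predictors; the series and function names here are made up for illustration.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def lagged_correlation(leader, follower, lag=1):
    """Correlate leader[t] with follower[t + lag]."""
    if lag == 0:
        return pearson(leader, follower)
    return pearson(leader[:-lag], follower[lag:])


# Hypothetical daily data: each day's return equals the previous day's
# mention count divided by 10, so mentions "lead" returns by construction.
mentions = [5, 9, 4, 12, 7, 15, 6, 10]
returns = [0.0, 0.5, 0.9, 0.4, 1.2, 0.7, 1.5, 0.6]

forward = lagged_correlation(mentions, returns, lag=1)  # mentions -> next-day returns
reverse = lagged_correlation(returns, mentions, lag=1)  # returns -> next-day mentions
print(forward, reverse)
```

In this constructed example the forward correlation is essentially 1 while the reverse one is weaker, mirroring the one-directional pattern the abstract reports. A real analysis would use a proper Granger test with significance levels rather than raw correlations.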

The research for the yachting development of Korean Marina operation plans (요트 발전을 위한 한국형 마리나 운영방안에 관한 연구)

  • Jeong Jong-Seok;Hugh Ihl
    • Journal of Navigation and Port Research / v.28 no.10 s.96 / pp.899-908 / 2004
  • Rising incomes and the introduction of the five-day working week have given Korean people more opportunities to enjoy leisure time, and many are interested in marine sports such as yachting and in marine leisure equipment. With the popularization and development of this equipment, the scope of marine activities in Korea has been expanding, as it has in advanced maritime countries. However, current conditions for these sports in Korea are not advanced and are even worse than in underdeveloped countries. To develop Korea's underdeveloped marina resources, the marina models of advanced nations need to be adapted to Korea's specific needs and circumstances. We therefore carried out a comparative analysis of how Australia, New Zealand, Singapore, Japan, and Malaysia operate their marinas, and reached the following conclusions. First, to protect personal property rights and preserve the environment, marinas must operate membership and non-membership schemes, and profit and non-profit schemes, separately, but without regulating dress for entering or leaving the clubhouse. Second, to generate greater added value, new sporting events should be hosted each year. Volunteers should be used actively, greater interest in yacht tourism should be generated, CIQ procedures for foreign yachts should be simplified, and language services should be provided. Third, a permanent yacht school should be established, with classes taught by qualified instructors. Beginner, intermediate, and advanced classes should be managed separately, with special emphasis on a dinghy program for children. Fourth, arrival at and departure from the moorings must be self-regulated, and there must be systematic measures that allow the marina, once usage fees have been paid, to compensate in part for loss of and damage to equipment and to provide security and surveillance.
Fifth, marine safety personnel suited to Korea's current circumstances must be recruited from civilian organizations so that they can be actively used in benchmarking, rescue operations, and searches at sea in times of maritime disaster.

The PRISM-based Rainfall Mapping at an Enhanced Grid Cell Resolution in Complex Terrain (복잡지형 고해상도 격자망에서의 PRISM 기반 강수추정법)

  • Chung, U-Ran;Yun, Kyung-Dahm;Cho, Kyung-Sook;Yi, Jae-Hyun;Yun, Jin-I.
    • Korean Journal of Agricultural and Forest Meteorology / v.11 no.2 / pp.72-78 / 2009
  • The demand for rainfall data in gridded digital formats has increased in recent years due to the close linkage between hydrological models and decision support systems that use geographic information systems. One of the most widely used tools for digital rainfall mapping is PRISM (parameter-elevation regressions on independent slopes model), which uses point data (rain gauge stations), a digital elevation model (DEM), and other spatial datasets to generate repeatable estimates of monthly and annual precipitation. In PRISM, rain gauge stations are assigned weights that account for climatically important factors other than elevation, and aspect and topographic exposure are simulated by dividing the terrain into topographic facets. The facet size, or grid cell resolution, is determined by the density of rain gauge stations, and a $5 \times 5$ km grid cell is considered the lower limit under the conditions in Korea. The PRISM algorithms were implemented in a scripting language environment (Python) using a 270m DEM for South Korea, and the relevant weights for each 270m grid cell were derived from monthly data from 432 official rain gauge stations. For each grid cell, weighted monthly precipitation data from at least 5 nearby stations were regressed on elevation, and the selected linear regression equations were applied to the 270m DEM to generate a digital precipitation map of South Korea at 270m resolution. Among the 1.25 million grid cells, precipitation estimates were extracted at 166 cells where measurements were made by the Korea Water Corporation rain gauge network, and the monthly estimation errors were evaluated. For months with more than 100mm of precipitation, the root mean square error (RMSE) was reduced by an average of 10% compared with the RMSE of the original 5km PRISM estimates.
This modified PRISM may be used for rainfall mapping in the rainy season (May to September) at a much higher spatial resolution than the original PRISM without loss of accuracy.
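
The per-cell regression step described above (weighted precipitation from at least 5 nearby stations regressed on elevation, then applied to the cell's DEM elevation) can be sketched in Python as follows. The station values and weights are invented for illustration; how PRISM actually derives its weights (distance, facet, and so on) is not reproduced here.

```python
def weighted_linear_fit(xs, ys, ws):
    """Weighted least squares for y = slope * x + intercept."""
    w_sum = sum(ws)
    x_bar = sum(w * x for w, x in zip(ws, xs)) / w_sum
    y_bar = sum(w * y for w, y in zip(ws, ys)) / w_sum
    sxx = sum(w * (x - x_bar) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - x_bar) * (y - y_bar) for w, x, y in zip(ws, xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Five hypothetical nearby stations for one grid cell.
elevation_m = [50, 120, 300, 450, 800]
precip_mm = [105, 112, 130, 145, 180]  # constructed as 100 + 0.1 * elevation
weights = [1.0, 0.8, 0.6, 0.5, 0.2]    # assumed station weights

slope, intercept = weighted_linear_fit(elevation_m, precip_mm, weights)

# Apply the fitted line to the grid cell's own DEM elevation (assumed 600 m).
cell_elevation_m = 600
estimate_mm = slope * cell_elevation_m + intercept
print(round(slope, 3), round(intercept, 1), round(estimate_mm, 1))  # 0.1 100.0 160.0
```

Because the toy data lie exactly on a line, the weights do not change the fit here; with real gauge data they would pull the regression toward the stations judged most representative of the cell.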

The Construction of QoS Integration Platform for Real-time Negotiation and Adaptation Stream Service in Distributed Object Computing Environments (분산 객체 컴퓨팅 환경에서 실시간 협약 및 적응 스트림 서비스를 위한 QoS 통합 플랫폼의 구축)

  • Jun, Byung-Taek;Kim, Myung-Hee;Joo, Su-Chong
    • The Transactions of the Korea Information Processing Society / v.7 no.11S / pp.3651-3667 / 2000
  • Recently, in Internet-based distributed multimedia environments, most researchers have focused on two rapidly growing technologies, streaming and distributed objects, and studies that integrate streaming services with distributed object technology have been progressing. These technologies are applied to various stream service management models and protocols. However, the stream service management models proposed in existing research are insufficient for supporting the QoS of stream services. In addition, existing models cannot support extensibility and reusability, because their QoS-related functions are developed as sub-modules tailored to specific-purpose application services. To solve these problems, this paper proposes a QoS integration platform that can be extended and reused by means of distributed object technologies and that guarantees the QoS of stream services. The proposed platform consists of three components: the User Control Module (UCM), the QoS Management Module (QoSM), and the Stream Object. The Stream Object provides Send/Receive operations for transmitting RTP packets over TCP/IP. The UCM controls Stream Objects via CORBA service objects. The QoSM maintains the QoS of the stream service between the UCMs in the client and the server. QoS control is carried out through resource monitoring, negotiation, and resource adaptation procedures, which are executed via interactions among these components. To construct the platform, we first implemented the modules independently and then used IDL to define the interfaces among them, so that the platform supports platform independence, interoperability, and portability based on CORBA.
The platform was built using OrbixWeb 3.1c, which follows the CORBA specification, on Solaris 2.5/2.7, with the Java language, Java Media Framework API 2.0, Mini-SQL 1.0.16, and multimedia equipment. To verify the platform functionally, we present the execution results of each module and the numerical data obtained from the QoS control procedures on the client and server GUIs while a stream service runs on the platform.

Transfer and Validation of NIRS Calibration Models for Evaluating Forage Quality in Italian Ryegrass Silages (이탈리안 라이그라스 사일리지의 품질평가를 위한 근적외선분광 (NIRS) 검량식의 이설 및 검증)

  • Cho, Kyu Chae;Park, Hyung Soo;Lee, Sang Hoon;Choi, Jin Hyeok;Seo, Sung;Choi, Gi Jun
    • Journal of Animal Environmental Science / v.18 no.sup / pp.81-90 / 2012
  • This study compared a high-end, research-grade near-infrared spectrophotometer (NIRS) with low-end, field-grade NIRS instruments for rapid on-site analysis of forage quality, using 241 samples of Italian ryegrass silage collected nationwide over 3 years, in order to evaluate accuracy and precision between instruments. A database was first built on the research-grade NIRS, a Unity Scientific Model 2500X (650 nm~2,500 nm); the spectra were then trimmed and fitted to the field-grade NIRS, a Unity Scientific Model 1400 (1,400 nm~2,400 nm), and calibrations were built and transferred with a dedicated transfer algorithm. The difference between instruments was 0.000%~0.343%. Chemical constituents (NDF, ADF, crude protein, and crude ash), fermentation parameters (moisture, pH, and lactic acid), and forage quality parameters (TDN, DMI, and RFV) could be analyzed on site within 5 minutes, with results equivalent to laboratory data. Nevertheless, the samples collected over the 3 years to build the calibration were organic materials that vary by region and by year. This strongly suggests that a population evaluation technique is needed and that calibrations must be updated and maintained constantly: for long-term on-site use of NIRS in forage analysis, a control center is needed that properly handles the accumulating database and distributes calibration updates backed by knowledgeable control-laboratory analysis. Agricultural products such as forage change continuously, so such changes must be easy to detect and calibrations must be updated routinely; otherwise NIRS will become worthless in the near future. Much NIRS research has been short-term rather than long-term, which has led to NIRS being poorly used, so the system needs a simple, immediate check that is supported in the local language and uses the Global Distance (GD) and Neighbour Distance (ND) algorithms.
Finally, field-grade instruments should give the same results not only as research-grade instruments but also as one another, which requires easy calibration transfer and maintenance between instruments via Internet networking.

Installation Art In Indonesian Contemporary Art; A Quest For Medium and Social Spaces (인도네시아 현대미술에 있어서의 설치미술 - 미디엄과 사회적 공간을 위한 탐색)

  • Kusmara, A. Rikrik
    • The Journal of Art Theory & Practice / no.5 / pp.217-229 / 2007
  • Much historical research on modern art in Indonesia forms the background to contemporary Indonesian art. The Indonesian art critic Sanento Yuliman states that modern art has developed rapidly in Indonesia since Indonesian independence in 1945. Modern art is part of the superculture of metropolitan Indonesia and is closely related to the contact between Indonesian and Western cultures. Its birth was part of the nationalist project, when the Indonesian people, made up of various ethnic groups, were determined to become a new nation, the Indonesian nation, and wished for a new culture and therefore a new art. The 1960s, which saw the emergence and growth of painters and painters' associations, were the first stage in the development of modern art in Indonesia. The second stage showed the important role of higher education institutes for art. These institutes have developed since the 1950s, and in the 1970s they were the main educational institutions for painters and other artists. Artists' awareness of the medium, of form, and of the organization of shapes was encouraged more intensely, which fostered an exploratory and experimental attitude. Meanwhile, information about the world's modern art, particularly Western art, spread widely and rapidly. The 1960s and 1970s were marked by the development of various forms of abstraction and abstract art and by extensive exploration of new media, such as experiments with collage, assemblage, and mixed media. The works of the New Art Movement group in the second half of the 1970s and in the 1980s show environmental art and installations, influenced by elements of popular art from the commercial world and the mass media, as well as the involvement of art in social and environmental affairs.
Environmental issues, frequently raised by intellectuals during the period of economic development that began in the 1970s, echoed among artists and spread through social, artistic, and cultural circles. Indonesian economic development following the important change in the 1970s changed the life of middle- and upper-class society, as did the changes in various aspects of the big cities, particularly Jakarta. A new genre that marks contemporary art in Indonesia emerged in 1975, when a group of young artists organized a movement that became widely known as the Indonesian New Art Movement. This movement criticized the international style, universalism, and the long-standing debate on an East-West dichotomy. As far as the actual practice of the arts was concerned, the movement criticized the domination of painting and saw it as a sign of stagnation in the development of Indonesian art. Based on this criticism, the movement introduced ready-mades and installations (Jim Supangkat). It took almost two decades for the New Art Movement activists to establish the Indonesian installation art genre as a contemporary paradigm. They influenced the 1980s generation, including FX Harsono, Dadang Christanto, Arahmaiani, Tisna Sanjaya, Diyanto, and Andarmanik, who entered the 1990s as a "rebellion period", rejecting the established aesthetic mainstream (painting, sculpture, and graphic art) as insufficient to express a "new language" and artistic needs, especially in mediating the social, political, and cultural situation. Installation art, with its open possibilities for creation, became a vehicle in the 1990s for rejecting the aesthetic establishment and for expressing social and political stagnation. Installation art accommodates two major aspects. The first is the rejection of the aesthetic establishment, which leads to an artists' quest for medium: deconstruction of models and crossing of disciplines into multimedia and intermedia, such as performance, music, and video.
The second is artists' social and political intention for change. Together, these characterize Indonesian installation art and have established freedom of expression in contemporary Indonesian art to this day.

Knowledge Extraction Methodology and Framework from Wikipedia Articles for Construction of Knowledge-Base (지식베이스 구축을 위한 한국어 위키피디아의 학습 기반 지식추출 방법론 및 플랫폼 연구)

  • Kim, JaeHun;Lee, Myungjin
    • Journal of Intelligence and Information Systems / v.25 no.1 / pp.43-61 / 2019
  • The development of artificial intelligence technologies has accelerated with the Fourth Industrial Revolution, and AI research has been actively conducted in fields such as autonomous vehicles, natural language processing, and robotics. Since the 1950s, this research has focused on solving cognitive problems related to human intelligence, such as learning and problem solving. Owing to recent interest in the technology and research on various algorithms, the field has achieved more technological advances than ever. The knowledge-based system is a sub-domain of artificial intelligence that aims to enable AI agents to make decisions using machine-readable and processable knowledge constructed from complex and informal human knowledge and rules in various fields. A knowledge base is used to optimize information collection, organization, and retrieval, and recently it has been used together with statistical artificial intelligence such as machine learning. More recently, knowledge bases have aimed to express, publish, and share knowledge on the web by describing and connecting web resources such as pages and data. These knowledge bases are used for intelligent processing in various fields of artificial intelligence, such as the question answering systems of smart speakers. However, building a useful knowledge base is time-consuming and still requires a great deal of expert effort. In recent years, much knowledge-based AI research and technology has used DBpedia, one of the largest knowledge bases, which aims to extract structured content from Wikipedia. DBpedia contains various information extracted from Wikipedia, such as titles, categories, and links, but its most useful knowledge comes from Wikipedia infoboxes, which are user-created summaries of the key aspects of an article.
This knowledge is created by mapping rules between infobox structures and the DBpedia ontology schema, defined in the DBpedia Extraction Framework. Because it generates knowledge from semi-structured infobox data created by users, DBpedia can expect high reliability in terms of accuracy. However, since only about 50% of all wiki pages in Korean Wikipedia contain an infobox, DBpedia is limited in terms of knowledge scalability. This paper proposes a method to extract knowledge from text documents according to the ontology schema using machine learning. To demonstrate the appropriateness of this method, we describe a knowledge extraction model that follows the DBpedia ontology schema and learns from Wikipedia infoboxes. Our knowledge extraction model consists of three steps: classifying documents into ontology classes, classifying the sentences suitable for extracting triples, and selecting values and transforming them into RDF triple structure. The structure of Wikipedia infoboxes is defined by infobox templates that provide standardized information across related articles, and the DBpedia ontology schema can be mapped to these infobox templates. Based on these mappings, we classify the input document according to infobox categories, which correspond to ontology classes. After determining the class of the input document, we classify the appropriate sentences according to the attributes belonging to that class. Finally, we extract knowledge from the sentences classified as appropriate and convert it into triples. To train the models, we generated a training data set from a Wikipedia dump by adding BIO tags to sentences, and trained on about 200 classes and about 2,500 relations. Furthermore, we ran comparative experiments with CRF and Bi-LSTM-CRF for the knowledge extraction process.
Through the proposed process, structured knowledge can be obtained by extracting knowledge from text documents according to the ontology schema. In addition, this methodology can significantly reduce the expert effort needed to construct instances according to the ontology schema.
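
The BIO tagging step used to build training data can be illustrated with a small Python sketch. It assumes whitespace-tokenized sentences and an exact match between an infobox value and a token span; the function name, the example sentence, and the `name` label are hypothetical and not taken from the paper.

```python
def bio_tag(tokens, value_tokens, label):
    """Tag the first occurrence of value_tokens in tokens with B-/I- labels.

    Tokens outside the matched span are tagged "O". Exact token matching is
    an assumption; real data would need fuzzier alignment.
    """
    tags = ["O"] * len(tokens)
    n = len(value_tokens)
    if n == 0:
        return tags
    for start in range(len(tokens) - n + 1):
        if tokens[start:start + n] == value_tokens:
            tags[start] = f"B-{label}"
            for i in range(start + 1, start + n):
                tags[i] = f"I-{label}"
            break
    return tags


# Hypothetical sentence and infobox value for an attribute called "name".
tokens = "Marie Curie was born in Warsaw".split()
tags = bio_tag(tokens, ["Marie", "Curie"], "name")
print(list(zip(tokens, tags)))
# [('Marie', 'B-name'), ('Curie', 'I-name'), ('was', 'O'), ('born', 'O'), ('in', 'O'), ('Warsaw', 'O')]
```

Sequences tagged this way are the usual input format for sequence labelers such as the CRF and Bi-LSTM-CRF models the abstract compares.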

Performance of Investment Strategy using Investor-specific Transaction Information and Machine Learning (투자자별 거래정보와 머신러닝을 활용한 투자전략의 성과)

  • Kim, Kyung Mock;Kim, Sun Woong;Choi, Heung Sik
    • Journal of Intelligence and Information Systems / v.27 no.1 / pp.65-82 / 2021
  • Stock market investors are generally divided into foreign investors, institutional investors, and individual investors. Compared with individual investors, professional investor groups such as foreign investors have an advantage in information and financial power, and foreign investors are consequently known to show good investment performance among market participants. The purpose of this study is to propose an investment strategy that combines investor-specific transaction information with machine learning, and to analyze the portfolio performance of the proposed model using actual stock price and investor-specific transaction data. The Korea Exchange provides securities firms with daily information on the purchase and sale volume of each investor group. We developed a data collection program in the C# programming language using an API provided by Daishin Securities Cybosplus, and collected daily opening prices, closing prices, and investor-specific net purchase data for 151 of the 200 KOSPI stocks from January 2, 2007 to July 31, 2017. The self-organizing map model is an artificial neural network that performs clustering by unsupervised learning and was introduced by Teuvo Kohonen in 1984. It implements competition among the artificial neurons within a layer, and all connections are non-recurrent, running from bottom to top. It can also be expanded to multiple layers, although a single layer is commonly used. Linear functions are used as the activation functions of the artificial neurons, and the learning rules use the Instar rule as well as general competitive learning. The backpropagation model is an artificial neural network that performs classification by supervised learning. We grouped and transformed the investor-specific transaction volume data with the self-organizing map model and used the result to train the backpropagation models.
Based on the estimates for the verification data obtained after training, the portfolios were rebalanced monthly. For performance analysis, a passive portfolio was designated, and KOSPI 200 and KOSPI index returns were obtained as proxies for market returns. Performance was analyzed using the equally weighted portfolio return, compound return, annual return, maximum drawdown (MDD), standard deviation, and Sharpe ratio. The buy-and-hold return of the top 10 stocks by market capitalization was designated as the benchmark; buy-and-hold is the best strategy under the efficient market hypothesis. The prediction rate of the backpropagation model was very high on the training data, at 96.61%, and relatively high on the verification data, at 57.1%. The performance of the self-organizing map grouping can be judged from the results of the backpropagation model, because if the grouping had been poor, the learning results of the backpropagation model would also have been poor. On this basis, the machine learning is judged to have learned better than in previous studies. Our portfolio achieved twice the return of the benchmark and outperformed the market returns of the KOSPI and KOSPI 200 indexes. The portfolio risk indicators, MDD and standard deviation, were also better than those of the benchmark. The Sharpe ratio was higher than those of the benchmark and the stock market indexes. We thereby presented a direction for portfolio composition programs that use machine learning and investor-specific transaction information, and showed that the approach can be used to develop programs for real stock investment. The reported return is the result of composing the portfolio monthly and rebalancing assets to equal proportions.
Better outcomes are expected if, when the monthly portfolio is formed, suggested stocks that are already held are kept rather than sold and re-bought. The strategy therefore appears applicable to real trading.
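
Two of the performance measures named in this abstract, maximum drawdown and the Sharpe ratio, can be computed as in the Python sketch below. The monthly returns are invented, and the Sharpe ratio is left per-period (not annualized) with an assumed risk-free rate of zero; the abstract does not state which conventions the authors used.

```python
from statistics import mean, stdev


def max_drawdown(returns):
    """Largest peak-to-trough decline of cumulative wealth, as a positive fraction."""
    wealth, peak, worst = 1.0, 1.0, 0.0
    for r in returns:
        wealth *= 1 + r
        peak = max(peak, wealth)
        worst = max(worst, (peak - wealth) / peak)
    return worst


def sharpe_ratio(returns, risk_free=0.0):
    """Mean excess return divided by its sample standard deviation (per period)."""
    excess = [r - risk_free for r in returns]
    return mean(excess) / stdev(excess)


# Hypothetical monthly portfolio returns.
monthly_returns = [0.10, -0.50, 0.30, 0.20]

mdd = max_drawdown(monthly_returns)
sharpe = sharpe_ratio(monthly_returns)
print(round(mdd, 2))  # 0.5: wealth falls from 1.10 to 0.55
print(sharpe)
```

A lower MDD and a higher Sharpe ratio than the benchmark are what the abstract reports for the proposed portfolio.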

Topic Modeling Insomnia Social Media Corpus using BERTopic and Building Automatic Deep Learning Classification Model (BERTopic을 활용한 불면증 소셜 데이터 토픽 모델링 및 불면증 경향 문헌 딥러닝 자동분류 모델 구축)

  • Ko, Young Soo;Lee, Soobin;Cha, Minjung;Kim, Seongdeok;Lee, Juhee;Han, Ji Yeong;Song, Min
    • Journal of the Korean Society for information Management / v.39 no.2 / pp.111-129 / 2022
  • Insomnia is a chronic disease in modern society, with the number of new patients increasing by more than 20% in the last 5 years. It is a serious disease that requires diagnosis and treatment, because the individual and social problems caused by lack of sleep are serious and the triggers of insomnia are complex. This study collected 5,699 documents from 'insomnia', a community on Reddit, a social media platform where opinions are freely expressed. Based on the International Classification of Sleep Disorders (ICSD-3) standard and on guidelines developed with the help of experts, an insomnia corpus was constructed by tagging the documents as insomnia-tendency or non-insomnia-tendency documents. Five deep learning language models (BERT, RoBERTa, ALBERT, ELECTRA, XLNet) were trained on the constructed corpus. In the performance evaluation, RoBERTa showed the highest performance, with an accuracy of 81.33%. For an in-depth analysis of the insomnia social data, topic modeling was performed using BERTopic, a recent method that addresses the weaknesses of the widely used LDA. The analysis identified 8 topic groups ('Negative emotions', 'Advice and help and gratitude', 'Insomnia-related diseases', 'Sleeping pills', 'Exercise and eating habits', 'Physical characteristics', 'Activity characteristics', 'Environmental characteristics'). Users expressed negative emotions and sought help and advice from the Reddit insomnia community. In addition, they mentioned diseases related to insomnia, discussed the use of sleeping pills, and expressed interest in exercise and eating habits. As insomnia-related characteristics, we found physical characteristics such as breathing, pregnancy, and heart; activity characteristics such as zombies, hypnic jerk, and groggy; and environmental characteristics such as sunlight, blankets, temperature, and naps.