• Title/Summary/Keyword: Predictive

The prediction of the stock price movement after IPO using machine learning and text analysis based on TF-IDF (증권신고서의 TF-IDF 텍스트 분석과 기계학습을 이용한 공모주의 상장 이후 주가 등락 예측)

  • Yang, Suyeon;Lee, Chaerok;Won, Jonggwan;Hong, Taeho
    • Journal of Intelligence and Information Systems / v.28 no.2 / pp.237-262 / 2022
  • There has been a growing interest in IPOs (Initial Public Offerings) due to the profitable returns that IPO stocks can offer to investors. However, IPOs can be speculative investments that may involve substantial risk as well because shares tend to be volatile, and the supply of IPO shares is often highly limited. Therefore, it is crucially important that IPO investors are well informed of the issuing firms and the market before deciding whether to invest or not. Unlike institutional investors, individual investors are at a disadvantage since there are few opportunities for individuals to obtain information on the IPOs. In this regard, the purpose of this study is to provide individual investors with the information they may consider when making an IPO investment decision. This study presents a model that uses machine learning and text analysis to predict whether an IPO stock price would move up or down after the first 5 trading days. Our sample includes 691 Korean IPOs from June 2009 to December 2020. The input variables for the prediction are three tone variables created from IPO prospectuses and quantitative variables that are either firm-specific, issue-specific, or market-specific. The three prospectus tone variables indicate the percentage of positive, neutral, and negative sentences in a prospectus, respectively. We considered only the sentences in the Risk Factors section of a prospectus for the tone analysis in this study. All sentences were classified into 'positive', 'neutral', and 'negative' via text analysis using TF-IDF (Term Frequency - Inverse Document Frequency). Measuring the tone of each sentence was conducted by machine learning instead of a lexicon-based approach due to the lack of sentiment dictionaries suitable for Korean text analysis in the context of finance. 
For this reason, the training set was created by randomly selecting 10% of the sentences from each prospectus, and each sentence in the training set was read and labeled manually. Then, based on the training set, a Support Vector Machine model was used to predict the tone of the sentences in the test set. Finally, the machine learning model calculated the percentages of positive, neutral, and negative sentences in each prospectus. To predict the price movement of an IPO stock, four machine learning techniques were applied: Logistic Regression, Random Forest, Support Vector Machine, and Artificial Neural Network. According to the results, models that use quantitative variables based on technical analysis together with the prospectus tone variables show higher accuracy than models that use only quantitative variables. More specifically, the prediction accuracy improved by 1.45 percentage points in the Random Forest model, 4.34 percentage points in the Artificial Neural Network model, and 5.07 percentage points in the Support Vector Machine model. Among the techniques tested, the Artificial Neural Network model using both quantitative variables and prospectus tone variables had the highest prediction accuracy, at 61.59%. The results indicate that the tone of a prospectus is a significant factor in predicting the price movement of an IPO stock. In addition, the McNemar test was used to verify whether the difference between the models was statistically significant. The model using only quantitative variables was compared with the model using both the quantitative variables and the prospectus tone variables, and the improvement in predictive performance was confirmed to be significant at the 1% significance level.
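
The McNemar comparison mentioned in the abstract above can be illustrated with a minimal sketch. It assumes the continuity-corrected chi-square form of the test, which the abstract does not specify, and the discordant counts `b` and `c` are hypothetical values chosen for illustration, not figures from the paper.

```python
import math


def mcnemar_test(b: int, c: int) -> tuple[float, float]:
    """McNemar chi-square test with continuity correction (an assumed variant).

    b: cases the first model classified correctly and the second incorrectly
    c: cases the first model classified incorrectly and the second correctly
    """
    if b + c == 0:
        raise ValueError("no discordant pairs; the test is undefined")
    statistic = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of a chi-square variable with 1 degree of freedom
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Hypothetical discordant counts, not taken from the paper
stat, p = mcnemar_test(b=5, c=15)
print(f"chi2 = {stat:.3f}, p = {p:.4f}")
```

Only the cases where the two models disagree enter the statistic, which is why the test suits a paired comparison of two classifiers on the same test set.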

Comparative assessment and uncertainty analysis of ensemble-based hydrologic data assimilation using airGRdatassim (airGRdatassim을 이용한 앙상블 기반 수문자료동화 기법의 비교 및 불확실성 평가)

  • Lee, Garim;Lee, Songhee;Kim, Bomi;Woo, Dong Kook;Noh, Seong Jin
    • Journal of Korea Water Resources Association / v.55 no.10 / pp.761-774 / 2022
  • Accurate hydrologic prediction is essential for analyzing the effects of drought, flood, and climate change on flow rates, water quality, and ecosystems. Disentangling the uncertainty of hydrological models is one of the important issues in hydrology and water resources research. Hydrologic data assimilation (DA), a technique that updates the states or parameters of a hydrological model to produce the most likely estimates of the model's initial conditions, is one way to reduce uncertainty in hydrological simulations and improve predictive accuracy. In this study, two ensemble-based sequential DA techniques, the ensemble Kalman filter and the particle filter, are compared for daily discharge simulation at the Yongdam catchment using airGRdatassim. The results showed that the Kling-Gupta efficiency (KGE) improved from 0.799 in the open-loop simulation to 0.826 with the ensemble Kalman filter and to 0.933 with the particle filter. In addition, we analyzed the effects of hyper-parameters of the data assimilation methods, such as the precipitation and potential evaporation forcing error parameters and the selection of perturbed and updated states. Under the forcing error conditions tested, the particle filter was superior to the ensemble Kalman filter in terms of the KGE index. The optimal forcing noise was smaller for the particle filter than for the ensemble Kalman filter. In addition, data assimilation performance improved as more state variables were included in the updating step, implying that the selection of updating states can itself be treated as a hyper-parameter. The simulation experiments in this study imply that DA hyper-parameters need to be carefully optimized to exploit the potential of DA methods.
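
For readers unfamiliar with the KGE metric reported above, the following is a minimal sketch assuming the commonly used form KGE = 1 - sqrt((r-1)^2 + (alpha-1)^2 + (beta-1)^2), where r is the linear correlation, alpha the ratio of standard deviations, and beta the ratio of means of simulated to observed discharge. The abstract does not state which KGE variant the authors used, and the sample series are hypothetical.

```python
import math
from statistics import mean, pstdev


def kling_gupta_efficiency(sim, obs):
    """KGE in its commonly used three-component form (assumed variant)."""
    if len(sim) != len(obs) or len(obs) < 2:
        raise ValueError("sim and obs must have the same length (at least 2)")
    mu_s, mu_o = mean(sim), mean(obs)
    sd_s, sd_o = pstdev(sim), pstdev(obs)
    cov = sum((s - mu_s) * (o - mu_o) for s, o in zip(sim, obs)) / len(obs)
    r = cov / (sd_s * sd_o)   # linear correlation
    alpha = sd_s / sd_o       # variability ratio
    beta = mu_s / mu_o        # bias ratio
    return 1 - math.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)


# Hypothetical daily discharge values, for illustration only
observed = [1.0, 2.0, 3.0, 4.0]
simulated = [1.1, 1.9, 3.2, 3.8]
print(round(kling_gupta_efficiency(simulated, observed), 3))
```

A value of 1 indicates a perfect match, so the reported rise from 0.799 to 0.933 means the simulated discharge moved closer to the observations in correlation, variability, and bias taken together.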

Development of Deep-Learning-Based Models for Predicting Groundwater Levels in the Middle-Jeju Watershed, Jeju Island (딥러닝 기법을 이용한 제주도 중제주수역 지하수위 예측 모델개발)

  • Park, Jaesung;Jeong, Jiho;Jeong, Jina;Kim, Ki-Hong;Shin, Jaehyeon;Lee, Dongyeop;Jeong, Saebom
    • The Journal of Engineering Geology / v.32 no.4 / pp.697-723 / 2022
  • Data-driven models to predict groundwater levels 30 days in advance were developed for 12 groundwater monitoring stations in the middle-Jeju watershed, Jeju Island. Stacked long short-term memory (stacked-LSTM), a deep learning technique suitable for time series forecasting, was used for model development. Daily time series data from 2001 to 2022 for precipitation, groundwater usage amount, and groundwater level were considered. Various models were proposed that used different combinations of the input data types and varying lengths of previous time series data for each input variable. A general procedure for deep-learning-based model development is suggested based on consideration of the comparative validation results of the tested models. A model using precipitation, groundwater usage amount, and previous groundwater level data as input variables outperformed any model neglecting one or more of these data categories. Using extended sequences of these past data improved the predictions, possibly owing to the long delay time between precipitation and groundwater recharge, which results from the deep groundwater level in Jeju Island. However, limiting the range of considered groundwater usage data that significantly affected the groundwater level fluctuation (rather than using all the groundwater usage data) improved the performance of the predictive model. The developed models can predict the future groundwater level based on the current amount of precipitation and groundwater use. Therefore, the models provide information on the soundness of the aquifer system, which will help to prepare management plans to maintain appropriate groundwater quantities.
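
The idea of feeding "varying lengths of previous time series data" to a model that predicts 30 days ahead can be sketched as a sliding-window transformation. This is only an illustration of how such training pairs might be built: the function name `make_windows`, its parameters, and the toy series are hypothetical, and the paper's actual input layout is not given in the abstract.

```python
def make_windows(series, lookback, horizon):
    """Pair each window of `lookback` past values with the value `horizon` steps ahead.

    Hypothetical helper: with daily data, horizon=30 would correspond to the
    30-day-ahead target described in the abstract.
    """
    samples = []
    last_start = len(series) - lookback - horizon
    for start in range(last_start + 1):
        window = series[start:start + lookback]
        target = series[start + lookback - 1 + horizon]
        samples.append((window, target))
    return samples


# Toy series standing in for a daily groundwater-level record
toy_levels = list(range(10))
for window, target in make_windows(toy_levels, lookback=3, horizon=2):
    print(window, "->", target)
```

Lengthening `lookback` gives the model a longer history per sample, which is the knob the abstract reports as improving predictions.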

Classification Algorithm-based Prediction Performance of Order Imbalance Information on Short-Term Stock Price (분류 알고리즘 기반 주문 불균형 정보의 단기 주가 예측 성과)

  • Kim, S.W.
    • Journal of Intelligence and Information Systems / v.28 no.4 / pp.157-177 / 2022
  • Investors trade stocks while keeping a close watch, in real time, on the orders submitted by domestic and foreign investors through the Limit Order Book, the so-called "current price" information provided by securities firms. Is the order information released in the Limit Order Book useful for stock price prediction? This study analyzes whether order imbalance, which appears when investors' buy and sell orders are concentrated on one side during intraday trading, is a significant predictor of whether the stock price will rise or fall. Using classification algorithms, this study improved the accuracy with which order imbalance information predicts the short-term price trend, that is, whether the day's closing price rises or falls. Day trading strategies are proposed using the price trends predicted by the classification algorithms, and their trading performance is analyzed empirically. The 5-minute KOSPI200 Index Futures data were analyzed for 4,564 days from January 19, 2004 to June 30, 2022. The results of the empirical analysis are as follows. First, order imbalance information has a significant impact on current stock prices. Second, the order imbalance information observed in the early morning has significant forecasting power for the price trend from the early morning to the market close. Third, the Support Vector Machines algorithm showed the highest accuracy, 54.1%, in predicting the day's closing price trend from the order imbalance information. Fourth, order imbalance information measured early in the day had higher prediction accuracy than order imbalance information measured later in the day. Fifth, the day trading strategies that used the classification algorithms' predictions of price rises and falls outperformed the benchmark trading strategy. 
Sixth, except for the K-Nearest Neighbor algorithm, all strategies using the classification algorithms showed higher average total profits than the benchmark strategy. Seventh, the strategies using the predictions of the Logistic Regression, Random Forest, Support Vector Machines, and XGBoost algorithms outperformed the benchmark strategy on the Sharpe Ratio, which evaluates both profitability and risk. This study differs academically from existing studies in that it documents the economic value of the total buy and sell order volume information within the Limit Order Book information. The empirical results of this study are also valuable to market participants from a trading perspective. Future studies need to improve the performance of the trading strategy with more accurate price predictions by extending the approach to deep learning models, which are being actively studied for stock price prediction.
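
The Sharpe Ratio used above to weigh profitability against risk can be sketched as the mean excess return divided by its sample standard deviation. The abstract does not say how the authors annualized returns or which risk-free rate they used, so both are left as assumptions here, and the daily returns are hypothetical.

```python
from statistics import mean, stdev


def sharpe_ratio(returns, risk_free=0.0):
    """Mean excess return per unit of return volatility (no annualization)."""
    excess = [r - risk_free for r in returns]
    spread = stdev(excess)  # sample standard deviation
    if spread == 0:
        raise ValueError("returns have zero variance; the ratio is undefined")
    return mean(excess) / spread


# Hypothetical daily strategy returns, not taken from the paper
strategy_returns = [0.012, -0.004, 0.007, 0.001, -0.002]
print(round(sharpe_ratio(strategy_returns), 3))
```

Two strategies with the same total profit can differ sharply on this measure if one earns it with much larger swings, which is why the abstract reports it separately from total profit.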

A Longitudinal Validation Study of the Korean Version of PCL-5(Post-traumatic Stress Disorder Checklist for DSM-5) (PCL-5(DSM-5 기준 외상 후 스트레스 장애 체크리스트) 한국판 종단 타당화 연구)

  • Lee, DongHun;Lee, DeokHee;Kim, SungHyun;Jung, DaSong
    • Korean Journal of Culture and Social Issue / v.28 no.2 / pp.187-217 / 2022
  • The aim of this study is to examine the psychometric properties of the Korean version of the Post-traumatic Stress Disorder Checklist for DSM-5 (PCL-5). For this purpose, online surveys were conducted twice at a one-year interval, with data from 1,077 Korean adults at time 1 and 563 Korean adults at time 2. First, a confirmatory factor analysis comparing the fit of the 1-, 4-, 6-, and 7-factor models showed that the 4-, 6-, and 7-factor models had an acceptable fit, with the 7-factor model fitting best, followed by the 6-factor and 4-factor models. Second, the internal consistency, omega coefficient, construct validity, average variance extracted, and test-retest reliability results were all satisfactory. Third, a correlation analysis with the K-PC-PTSD-5 and the sub-factors of the BSI-18 was conducted to check the validity of the Korean version of the PCL-5. As a result, a positive correlation was found with both the K-PC-PTSD-5 and the BSI-18. Fourth, a hierarchical multiple regression was performed to examine whether the Korean version of the PCL-5 predicts future PTSD, depression, anxiety, and somatization. As a result, the Korean version of the PCL-5 measured at time 1 significantly predicted PTSD, depression, anxiety, and somatization symptoms at time 2. Fifth, ROC curve analysis confirmed the discriminant power of the PCL-5 for screening PTSD symptom groups, and the best cut-off score was suggested. The longitudinal validation showed that the Korean version of the PCL-5 is a reliable and valid measure for Korean adults. The examination of predictive validity showed that the Korean version of the PCL-5 can predict not only PTSD symptoms but also PTSD-related symptoms such as depression, anxiety, and somatization. This study also differs from previous validation studies measuring PTSD symptoms in that it suggests a cut-off score to help differentiate PTSD symptom groups.
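
One common way to pick a cut-off score from an ROC analysis is to maximize Youden's J (sensitivity + specificity - 1). The abstract does not say which criterion the authors used, so the sketch below is an assumption for illustration, and the scores and labels are hypothetical rather than PCL-5 data.

```python
def best_cutoff(scores, labels):
    """Return (cutoff, J) maximizing Youden's J; a score >= cutoff screens positive."""
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present")
    best = None
    for cutoff in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= cutoff)
        tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < cutoff)
        j = tp / positives + tn / negatives - 1
        if best is None or j > best[1]:
            best = (cutoff, j)
    return best


# Hypothetical checklist totals and diagnostic labels (1 = symptom group)
totals = [10, 15, 20, 25, 30, 35, 40, 45, 50, 55]
groups = [0, 0, 0, 0, 0, 1, 0, 1, 1, 1]
cutoff, j = best_cutoff(totals, groups)
print(cutoff, round(j, 3))
```

Other criteria, such as fixing a minimum sensitivity for screening, would yield a different cut-off from the same ROC curve.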

Factors Affecting Physicians who will be Vaccinated Every Year after Receiving the COVID-19 Vaccine in Healthcare Workers (의료종사자의 COVID-19 예방 백신 접종받은 후 향후 매년 예방접종 의향에 미치는 요인)

  • Hyeun-Woo Choi;Sung-Hwa Park;Eun-Kyung Cho;Chang-hyun Han;Jong-Min Lee
    • Journal of the Korean Society of Radiology / v.17 no.2 / pp.257-265 / 2023
  • The purpose of this study was to understand the factors affecting healthcare workers' intention to be vaccinated every year after receiving the COVID-19 vaccine, by examining whether that intention differs according to general characteristics, vaccination experience, and knowledge of and attitudes toward vaccination, and by identifying the factors behind negative responses to annual vaccination. For the statistical analysis, frequencies and percentages were calculated for the general characteristics, the variables based on vaccination experience, and the knowledge and attitudes related to vaccination. A chi-square test (χ²-test) was performed, and if the chi-square test was significant but the expected frequency was less than 5 in 25% or more of the cells, the difference in proportions was tested with Fisher's exact test. Through multiple logistic regression analysis using the variables that were significant in the simple analysis, a predictive model for future vaccination and the effect size of each independent variable were estimated. SAS 9.4 (SAS Institute Inc., Cary, NC, USA) was used as the statistical analysis software. Because the sample size was not large, the significance level was set at 10%, and a p-value less than 0.10 was interpreted as statistically significant. In the simple logistic regression analysis, respondents who gave 'to prevent infection of family and hospital visitors' rather than 'to prevent my own infection' as the reason for vaccination were 11.0 times more likely to answer that they would not be vaccinated every year, and those who gave 'to build herd immunity in the local community and the country' were 3.67 times more likely. 
Regarding the adverse reactions experienced after the first and second doses, the likelihood of not being vaccinated every year was 8.42 times higher in those who experienced pain at the injection site than in those who did not, 4.00 times higher in those who experienced swelling or redness, and 5.69 times higher in those who experienced joint pain. There was a 5.57 times higher rate of absenteeism annually than those who did not. In addition, the more anxious respondents felt about vaccination, the more likely they were not to be vaccinated every year, by a factor of 2.94.
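
The rule described above for switching from the chi-square test to Fisher's exact test can be sketched by computing expected cell frequencies. The sketch covers only the expected-frequency check (it does not run either test or the "significant chi-square" precondition), and the 2x2 tables are hypothetical counts, not data from the study.

```python
def expected_counts(table):
    """Expected cell frequencies under independence: row total * column total / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def choose_test(table, min_expected=5.0, max_share=0.25):
    """Pick Fisher's exact test when 25% or more of cells have expected count < 5."""
    cells = [e for row in expected_counts(table) for e in row]
    share_small = sum(1 for e in cells if e < min_expected) / len(cells)
    return "fisher_exact" if share_small >= max_share else "chi_square"


# Hypothetical contingency tables (rows: group, columns: intends / does not intend)
print(choose_test([[10, 2], [3, 1]]))
print(choose_test([[20, 30], [25, 25]]))
```

The thresholds are exposed as parameters because the 5 and 25% values come from the abstract's description, while other studies apply stricter or looser variants of the same rule.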

Development and Testing of a RIVPACS-type Model to Assess the Ecosystem Health in Korean Streams: A Preliminary Study (저서성 대형무척추동물을 이용한 RIVPACS 유형의 하천생태계 건강성 평가법 국내 하천 적용성)

  • Da-Yeong Lee;Dae-Seong Lee;Joong-Hyuk Min;Young-Seuk Park
    • Korean Journal of Ecology and Environment / v.56 no.1 / pp.45-56 / 2023
  • In stream ecosystem assessment, RIVPACS, which makes a simple but clear evaluation based on the macroinvertebrate community, is widely used. This preliminary study was conducted to develop a RIVPACS-type model suitable for Korean streams nationwide. Reference streams were classified into two types (upstream and downstream), and a prediction model for macroinvertebrates was developed for each family. The data for the upstream model were split 7:3 into training and test sets, and the downstream model was built using a leave-one-out method. Variables for the models were selected by non-metric multidimensional scaling, and seven variables were chosen: elevation, slope, annual average temperature, stream width, forest ratio in land use, riffle ratio in hydrological characteristics, and boulder ratio in substrate composition. A total of 3,224 sites were classified as upstream or downstream by stream order, and the community composition of each site was predicted. The prediction was conducted for 30 macroinvertebrate families. The expected (E) and observed (O) fauna were compared using the ASPT biotic index, which is computed by dividing the BMWPK score by the number of families in a community. EQR values (i.e., O/E) for ASPT were used to assess stream condition. Lastly, we compared EQR to BMI, an index that is commonly used in the assessment. The average observed ASPT was 4.82 (±2.04 SD) and the average expected ASPT was 6.30 (±0.79 SD), so the expected ASPT was higher than the observed one. In the comparison between EQR and BMI, EQR generally showed a higher value than BMI.
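
The ASPT and EQR calculations described above reduce to two small ratios, sketched here. The family names and tolerance scores in the example are hypothetical placeholders (actual BMWPK scores are not given in the abstract); the final line only plugs in the two reported averages, and a ratio of averages is not necessarily the same as the average of site-level EQR values.

```python
def aspt(family_scores):
    """Average Score Per Taxon: total tolerance score divided by number of families."""
    if not family_scores:
        raise ValueError("at least one family is required")
    return sum(family_scores.values()) / len(family_scores)


def eqr(observed, expected):
    """Ecological quality ratio O/E."""
    if expected <= 0:
        raise ValueError("expected value must be positive")
    return observed / expected


# Hypothetical community with made-up tolerance scores
site = {"family_a": 10, "family_b": 6, "family_c": 2}
print(aspt(site))

# Ratio of the average observed and expected ASPT reported in the abstract
print(round(eqr(4.82, 6.30), 3))
```

An EQR below 1 means the observed community scores lower than the reference-based expectation, which is how a RIVPACS-type model flags degraded condition.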

A study on solar radiation prediction using medium-range weather forecasts (중기예보를 이용한 태양광 일사량 예측 연구)

  • Sujin Park;Hyojeoung Kim;Sahm Kim
    • The Korean Journal of Applied Statistics / v.36 no.1 / pp.49-62 / 2023
  • Solar energy, whose share is rapidly increasing, continues to attract development and investment. As the Green New Deal policy for new and renewable energy advances and installations of home solar panels increase, the supply of solar energy in Korea is gradually expanding, and research on accurate prediction of power generation demand is actively underway. Solar radiation prediction is important because solar radiation is the factor that most influences power generation demand prediction. The main difference between this study and previous work is that it attempts to predict solar radiation using medium-range forecast weather data, which previous studies did not use. In this paper, we combined multiple linear regression, KNN, random forest, and SVR models with the K-means clustering technique to predict hourly solar radiation by calculating the probability density function for each cluster. Before using the medium-range forecast data, mean absolute error (MAE) and root mean squared error (RMSE) were used as indicators to compare the models' prediction results. The data, covering March 1, 2017 to February 28, 2022, were converted into daily data to match the medium-range forecast data format. In the comparison of predictive performance, the best method predicted daily solar radiation with random forest, grouped dates with similar climate factors, and calculated the probability density function of solar radiation for each cluster. In addition, when the model was fitted to the medium-range forecast data using this methodology, the prediction error was found to increase by date. This seems to be due to prediction error in the medium-range forecast weather data itself. 
Future studies should add exogenous variables, such as precipitation, from among the weather factors available in the medium-range forecast data, or apply time series clustering techniques.
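
The two error indicators named above, MAE and RMSE, can be sketched as follows. The observed and predicted values are hypothetical numbers for illustration, not results from the paper.

```python
import math


def mean_absolute_error(observed, predicted):
    """Average absolute difference between observed and predicted values."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def root_mean_squared_error(observed, predicted):
    """Square root of the average squared difference; penalizes large errors more."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return math.sqrt(squared / len(observed))


# Hypothetical daily solar radiation values
observed = [12.1, 15.4, 9.8, 18.0]
predicted = [11.5, 16.0, 11.0, 17.2]
print(round(mean_absolute_error(observed, predicted), 3))
print(round(root_mean_squared_error(observed, predicted), 3))
```

Reporting both is useful because RMSE grows faster than MAE when a few days are badly mispredicted, so a gap between the two hints at occasional large errors.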

A stratified random sampling design for paddy fields: Optimized stratification and sample allocation for effective spatial modeling and mapping of the impact of climate changes on agricultural system in Korea (농지 공간격자 자료의 층화랜덤샘플링: 농업시스템 기후변화 영향 공간모델링을 위한 국내 농지 최적 층화 및 샘플 수 최적화 연구)

  • Minyoung Lee;Yongeun Kim;Jinsol Hong;Kijong Cho
    • Korean Journal of Environmental Biology / v.39 no.4 / pp.526-535 / 2021
  • Spatial sampling design plays an important role in GIS-based modeling studies because it increases modeling efficiency while reducing the cost of sampling. In the field of agricultural systems, research demand for modeling based on high-resolution spatial data to predict and evaluate climate change impacts is growing rapidly. Accordingly, the need for and importance of spatial sampling design are increasing. The purpose of this study was to design a spatial sampling scheme for paddy fields (11,386 grids at 1 km spatial resolution) in Korea for use in agricultural spatial modeling. A stratified random sampling design was developed and applied for the 2030s, 2050s, and 2080s under two RCP scenarios, 4.5 and 8.5. Twenty-five weather characteristics and four soil characteristics were used as stratification variables. Stratification and sample allocation were optimized to ensure the minimum sample size under given precision constraints for 16 target variables such as crop yield, greenhouse gas emission, and pest distribution. The precision and accuracy of the sampling were evaluated through sampling simulations based on the coefficient of variation (CV) and relative bias, respectively. As a result, the paddy field sampling design was optimized in the range of 5 to 21 strata and 46 to 69 samples. The evaluation showed that the target variables were within the precision constraints (CV<0.05 except for crop yield) with low bias values (below 3%). These results can contribute to reducing sampling cost and computation time while retaining high predictive power. The design is expected to be widely used as a representative sample grid in various agricultural spatial modeling studies.
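
The two evaluation measures named above can be sketched for a set of repeated sampling simulations. The definitions used here (CV as the standard deviation of the simulated estimates divided by their mean, relative bias as the deviation of that mean from the population value) are common conventions assumed for illustration; the abstract does not spell out the authors' exact formulas, and the numbers are hypothetical.

```python
from statistics import mean, stdev


def cv_and_relative_bias(estimates, true_value):
    """Precision (CV) and accuracy (relative bias) of repeated sample estimates."""
    if true_value == 0:
        raise ValueError("relative bias is undefined for a zero true value")
    avg = mean(estimates)
    cv = stdev(estimates) / avg
    relative_bias = (avg - true_value) / true_value
    return cv, relative_bias


# Hypothetical estimates of a target variable from five simulated samples
simulated_means = [9.8, 10.2, 10.0, 10.4, 9.6]
cv, bias = cv_and_relative_bias(simulated_means, true_value=10.0)
print(f"CV = {cv:.4f}, relative bias = {bias:.4%}")
```

Under these conventions, the abstract's constraint of CV<0.05 would mean the spread of simulated estimates stays within 5% of their mean.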

Success Factor in the K-Pop Music Industry: focusing on the mediated effect of Internet Memes (대중음악 흥행 요인에 대한 연구: 인터넷 밈(Internet Meme)의 매개효과를 중심으로)

  • YuJeong Sim;Minsoo Shin
    • Journal of Service Research and Studies / v.13 no.1 / pp.48-62 / 2023
  • As seen in the recent K-pop craze, the size and influence of the Korean music industry are growing. At least 6,000 songs are released each year in the Korean music market, but few of them can be called successful. Many studies and attempts are being made to identify the factors that make a song a hit. Commercial factors such as media exposure and promotion, as well as the quality of the music, play an important role in the commercial success of music. Recently, there have been many marketing campaigns using Internet memes in the pop music industry. Internet memes are activities or trends that spread among people as cultural units in various forms, such as images and videos. Depending on the Internet environment and the characteristics of digital communication, content is expanded and reproduced in the form of various memes, which draws a greater response from consumers. Previously, Internet memes arose naturally, but artists who are aware of their marketing effects have recently begun to use them as an element of marketing. In this paper, the mediated effect of Internet memes in relation to the success factors of popular music was analyzed, and a prediction model reflecting them was proposed. The analysis showed that the factors with a mediated effect were the same for the 'cover effect' and the 'challenge effect'. Among the internal success factors, there were mediated effects for "Singer Recognition" and the genres "POP, Dance, Ballad, Trot and Electronica," and among the external success factors, there were mediated effects for "Planning Company Capacity," "The Number of Music Broadcasting Programs," and "The Number of News Articles." Predictive models reflecting cover effects and challenge effects showed F1-scores of 0.6889 and 0.7692, respectively. 
This study is meaningful in that it has collected and analyzed actual chart data and presented commercial directions that can be used in practice, and found that there are many success factors of popular music and the mediating effects of Internet memes.
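
The F1-score reported above is the harmonic mean of precision and recall. The sketch below shows the calculation from confusion-matrix counts; the counts are hypothetical and are not the paper's results.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall from true/false positive and false negative counts."""
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for a hit / non-hit classifier
print(round(f1_score(tp=8, fp=2, fn=4), 4))
```

Because the harmonic mean is pulled toward the smaller of the two values, a model cannot reach a high F1-score by maximizing only precision or only recall.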