• Title/Summary/Keyword: statistical prediction

Interval prediction on the sum of binary random variables indexed by a graph

  • Park, Seongoh;Hahn, Kyu S.;Lim, Johan;Son, Won
    • Communications for Statistical Applications and Methods
    • /
    • v.26 no.3
    • /
    • pp.261-272
    • /
    • 2019
  • In this paper, we propose a procedure for building a prediction interval for the weighted sum of dependent binary random variables indexed by a graph, accounting for the dependence among the variables. The problem is motivated by the prediction of election outcomes, including the Korean National Assembly and US presidential elections. Traditional and popular approaches to constructing a prediction interval for the number of seats won by major parties are the normal approximation based on the CLT and the Monte Carlo method, which generates many Bernoulli random variables under the assumption that they are independent and that their success probabilities are known constants. In practice, however, survey results (including exit polls) are random and hardly independent of each other; more often they are spatially correlated. To take this into account, we suggest a spatial auto-regressive (AR) model for the surveyed success probabilities and propose a residual-based bootstrap procedure to construct the prediction interval for the sum of the binary outcomes. Finally, we apply the procedure to build prediction intervals for the number of legislative seats won by each party from the exit poll data in the $19^{th}$ and $20^{th}$ Korean National Assembly elections.
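
The traditional Monte Carlo baseline that this abstract contrasts against can be sketched as follows. This is a minimal illustration, not the authors' spatial AR bootstrap: it assumes independent Bernoulli outcomes with known success probabilities, and the function name `monte_carlo_interval`, its parameters, and the example probabilities are all hypothetical.

```python
import random


def monte_carlo_interval(probs, weights=None, n_sims=10_000, level=0.95, seed=0):
    """Percentile prediction interval for a weighted sum of independent
    Bernoulli outcomes (the independence assumption is the baseline's,
    and is exactly what the abstract argues is unrealistic)."""
    rng = random.Random(seed)
    if weights is None:
        weights = [1.0] * len(probs)

    totals = []
    for _ in range(n_sims):
        # Draw each binary outcome independently and add its weight on success.
        total = sum(w for p, w in zip(probs, weights) if rng.random() < p)
        totals.append(total)

    # Take empirical quantiles of the simulated sums.
    totals.sort()
    alpha = 1.0 - level
    lo_idx = int((alpha / 2) * (n_sims - 1))
    hi_idx = int((1 - alpha / 2) * (n_sims - 1))
    return totals[lo_idx], totals[hi_idx]


# Hypothetical win probabilities for five districts (illustrative values only).
probs = [0.9, 0.6, 0.5, 0.4, 0.1]
low, high = monte_carlo_interval(probs)
print(f"95% prediction interval for seats won: [{low}, {high}]")
```

Because the draws are independent, this interval tends to be too narrow when the true outcomes are positively correlated, which is the shortcoming the paper's procedure is meant to address.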

Application of data mining and statistical measurement of agricultural high-quality development

  • Yan Zhou
    • Advances in nano research
    • /
    • v.14 no.3
    • /
    • pp.225-234
    • /
    • 2023
  • In this study, we aim to use big data resources and statistical analysis to obtain reliable guidance for achieving high-quality, high-yield agricultural production. To this end, soil type data, rainfall and temperature data, and yearly wheat production are collected for a specific region. Using statistical methodology, the acquired data were cleaned to remove incomplete and defective records. Afterwards, using several machine learning classification methods, we tried to distinguish between different factors and their influence on the final crop yields. The efficacy of the machine learning methods is discussed by comparing the models' predictions using two statistical quantities, the correlation factor and the mean squared error between predicted and actual crop yields. The results of the analysis show high accuracy of machine learning methods in the prediction of crop yields. Moreover, the random forest (RF) classification approach provides the best results among the classification methods used in this study.

Statistical Analysis for Risk Factors and Prediction of Hypertension based on Health Behavior Information (건강행위정보기반 고혈압 위험인자 및 예측을 위한 통계분석)

  • Heo, Byeong Mun;Kim, Sang Yeob;Ryu, Keun Ho
    • Journal of Digital Contents Society
    • /
    • v.19 no.4
    • /
    • pp.685-692
    • /
    • 2018
  • The purpose of this study is to develop a prediction model of hypertension in middle-aged adults using statistical analysis. Statistical analysis and prediction models were developed using the National Health and Nutrition Survey (2013-2016). Binary logistic regression analysis identified statistically significant risk factors for hypertension, and a predictive model was developed using logistic regression and the Naive Bayes algorithm with a wrapper approach. In the statistical analysis, WHtR (p<0.0001, OR = 2.0242) in men and age (p<0.0001, OR = 3.9185) in women were the factors most strongly related to hypertension. In the performance evaluation of the prediction model, the logistic regression model showed the best predictive power in men (AUC = 0.782) and women (AUC = 0.858). Our findings provide important information for developing large-scale screening tools for hypertension and can be used as a basis for hypertension research.

Risk Prediction Using Genome-Wide Association Studies on Type 2 Diabetes

  • Choi, Sungkyoung;Bae, Sunghwan;Park, Taesung
    • Genomics & Informatics
    • /
    • v.14 no.4
    • /
    • pp.138-148
    • /
    • 2016
  • The success of genome-wide association studies (GWASs) has enabled us to improve risk assessment and provide novel genetic variants for diagnosis, prevention, and treatment. However, most variants discovered by GWASs have been reported to have very small effect sizes on complex human diseases, which has been a big hurdle in building risk prediction models. Recently, many statistical approaches based on penalized regression have been developed to solve the "large p and small n" problem. In this report, we evaluated the performance of several statistical methods for predicting a binary trait: stepwise logistic regression (SLR), least absolute shrinkage and selection operator (LASSO), and Elastic-Net (EN). We first built a prediction model by combining variable selection and prediction methods for type 2 diabetes using Affymetrix Genome-Wide Human SNP Array 5.0 from the Korean Association Resource project. We assessed the risk prediction performance using area under the receiver operating characteristic curve (AUC) for the internal and external validation datasets. In the internal validation, SLR-LASSO and SLR-EN tended to yield more accurate predictions than other combinations. During the external validation, the SLR-SLR and SLR-EN combinations achieved the highest AUC of 0.726. We propose these combinations as a potentially powerful risk prediction model for type 2 diabetes.
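
The AUC used here to compare models can be computed directly from predicted scores. The following is a minimal sketch, assuming binary labels coded 0/1; the function name `roc_auc` and the toy data are hypothetical and are not taken from the study.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve, computed as the probability that a randomly
    chosen positive case scores higher than a randomly chosen negative case
    (ties count as one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example with made-up labels and predicted risks.
labels = [0, 0, 1, 1]
scores = [0.10, 0.40, 0.35, 0.80]
auc = roc_auc(labels, scores)
print(auc)
```

The pairwise loop is quadratic in the sample size, which is fine for illustration; a rank-based formulation would be preferable for large validation sets.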

Prediction of the Major Factors for the Analysis of the Erosion Effect on Atomic Oxygen in LEO Satellite Using a Machine Learning Method (LSTM)

  • Kim, You Gwang;Park, Eung Sik;Kim, Byung Chun;Lee, Suk Hoon;Lee, Seo Hyun
    • Journal of Aerospace System Engineering
    • /
    • v.14 no.2
    • /
    • pp.50-56
    • /
    • 2020
  • In this study, we investigated whether long short-term memory (LSTM) can be used to predict future F10.7 index data; the F10.7 index is a space environment factor affecting atomic oxygen erosion. To this end, we compared the prediction performances of LSTM, the autoregressive integrated moving average (ARIMA) model (a traditional statistical prediction model), and the similar pattern searching method used for long-term prediction. The LSTM model yielded superior results compared to the other techniques in the prediction period starting from the max/min points, but inferior results in the prediction period including the inflection points. It was found that efficient learning was not achieved, owing to the lack of currently available learning data in the prediction period including the maximum points. To overcome this, we proposed a method to increase the size of the learning samples using sunspot data and to upgrade the LSTM model.

A Study on Model of Regional Logistics Requirements Prediction

  • Lu, Bo;Park, Nam-Kyu
    • Journal of Navigation and Port Research
    • /
    • v.36 no.7
    • /
    • pp.553-559
    • /
    • 2012
  • It is extremely important to predict logistics requirements in a scientific and rational way. In recent years, however, prediction methods have not improved significantly, and the traditional statistical prediction method suffers from low precision and poor interpretability: it can neither guarantee the generalization ability of the prediction model theoretically nor explain the model effectively. Therefore, combining the theories of spatial economics, industrial economics, and neo-classical economics, and taking the city of Erdos as the research object, the study identifies the leading industry that can produce a large volume of cargo and further predicts the static logistics generation of Erdos and its hinterlands. By integrating various factors that can affect regional logistics requirements, this study establishes a logistics requirements potential model based on spatial economic principles, and extends logistics requirements prediction from purely statistical principles to a new area of spatial and regional economics.

Learning fair prediction models with an imputed sensitive variable: Empirical studies

  • Kim, Yongdai;Jeong, Hwichang
    • Communications for Statistical Applications and Methods
    • /
    • v.29 no.2
    • /
    • pp.251-261
    • /
    • 2022
  • As AI has a wide range of influence on human social life, issues of transparency and ethics of AI are emerging. In particular, it is widely known that, because data often contain historical bias that runs against ethics or regulatory frameworks for fairness, AI models trained on such biased data can also impose bias or unfairness against a certain sensitive group (e.g., non-white, women). Demographic disparities due to AI, which refer to socially unacceptable bias whereby an AI model favors certain groups (e.g., white, men) over other groups (e.g., black, women), have been observed frequently in many applications of AI, and many recent studies have developed AI algorithms that remove or alleviate such demographic disparities in trained AI models. In this paper, we consider the problem of using the information in the sensitive variable for fair prediction when using the sensitive variable as an input variable is prohibited by laws or regulations to avoid unfairness. As a way of reflecting the information in the sensitive variable in prediction, we consider a two-stage procedure. First, the sensitive variable is fully included in the learning phase to obtain a prediction model that depends on the sensitive variable; then an imputed sensitive variable is used in the prediction phase. The aim of this paper is to evaluate this procedure by analyzing several benchmark datasets. We illustrate that using an imputed sensitive variable helps improve prediction accuracy without hampering the degree of fairness much.

Statistical Life Prediction of Fatigue Crack Growth for SiC Whisker Reinforced Aluminium Composite (SiC 휘스커 보강 Al6061 복합재료의 통계학적 피로균열진전 수명예측)

  • 권재도;안정주;김상태
    • Transactions of the Korean Society of Mechanical Engineers
    • /
    • v.19 no.2
    • /
    • pp.475-485
    • /
    • 1995
  • In this study, fatigue data obtained from 24 fatigue cracks in each material were analyzed statistically for SiC whisker reinforced aluminium 6061 composite alloy (SiC$_{w}$/Al6061) and aluminium 6061 alloy. The SiC volume fraction in the composite alloy is 25%. The analysis shows that the stress intensity factor range and the 0.1 mm fatigue crack initiation life for the SiC$_{w}$/Al6061 composite and the Al6061 matrix follow log-normal distributions. Regression analyses using linear, exponential, and multiplicative models were performed to find the relationship between fatigue crack growth rate ($da/dN$) and stress intensity factor range ($\Delta K$) in the SiC$_{w}$/Al6061 composite and to examine the applicability of Paris' equation to the composite. Computer simulation was also performed for fatigue life prediction of the SiC$_{w}$/Al6061 composite using the statistical results of this study.
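
For reference, Paris' equation in its standard textbook form relates the crack growth rate to the stress intensity factor range as shown below. The material constants $C$ and $m$ are generic symbols; the abstract does not report their values.

$$
\frac{da}{dN} = C\,(\Delta K)^{m}
\quad\Longleftrightarrow\quad
\log\frac{da}{dN} = \log C + m \log \Delta K
$$

The logarithmic form is linear in $\log \Delta K$, which is why a power-law (multiplicative) regression model of the kind mentioned in the abstract is a natural way to test whether Paris' equation applies.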

Multivariate Statistical Analysis and Prediction for the Flash Points of Binary Systems Using Physical Properties of Pure Substances (순수 성분의 물성 자료를 이용한 2성분계 혼합물의 인화점에 대한 다변량 통계 분석 및 예측)

  • Lee, Bom-Sock;Kim, Sung-Young
    • Journal of the Korean Institute of Gas
    • /
    • v.11 no.3
    • /
    • pp.13-18
    • /
    • 2007
  • Multivariate statistical analysis using multiple linear regression (MLR) has been applied to analyze and predict the flash points of binary systems. Predicting the flash points of flammable substances is important for examining fire and explosion hazards in chemical process design. In this paper, the flash points are predicted by MLR based on the physical properties of pure substances and experimental flash point data. The regression and prediction results from MLR are compared with the values calculated by Raoult's law and the van Laar equation.

Development and Application of Statistical Programs Based on Data and Artificial Intelligence Prediction Model to Improve Statistical Literacy of Elementary School Students (초등학생의 통계적 소양 신장을 위한 데이터와 인공지능 예측모델 기반의 통계프로그램 개발 및 적용)

  • Kim, Yunha;Chang, Hyewon
    • Communications of Mathematical Education
    • /
    • v.37 no.4
    • /
    • pp.717-736
    • /
    • 2023
  • The purpose of this study is to develop a statistical program using data and artificial intelligence prediction models and to apply it to one sixth-grade elementary school class to see whether it is effective in improving students' statistical literacy. Based on an analysis of problems in today's elementary school statistical education, a 15-session program was developed to encourage elementary students to experience the entire process of statistical problem solving and to make correct predictions by incorporating data, the core of the era of the Fourth Industrial Revolution, into AI education. The main features of this program are its recognition of the importance of data, the key element of artificial intelligence education, and its collection and analysis activities, which take context into account using real-life data provided by public data platforms. In addition, because it consists of activities in which students predict the future from data using engineering tools such as entry and easy statistics and create an artificial intelligence prediction model, the program focuses on developing communication skills, information processing capabilities, and critical thinking skills. Applying this program positively affected the statistical literacy of elementary school students, and we also observed students' interest, critical inquiry, and mathematical communication throughout the process of statistical problem solving.