• Title/Summary/Keyword: random data analysis

Search Result 1,689, Processing Time 0.035 seconds

Rank Tests for Multivariate Linear Models in the Presence of Missing Data

  • Lee, Jae-Won;David M. Reboussin
    • Journal of the Korean Statistical Society
    • /
    • v.26 no.3
    • /
    • pp.319-332
    • /
    • 1997
  • The application of multivariate linear rank statistics to data with item nonresponse is considered. Only a modest extension of the complete data techniques is required when the missing data may be thought of as a random sample, and an appropriate modification of the covariances is derived. A proof of the asymptotic multivariate normality is given. A review of some related results in the literature is presented and applications including longitudinal and repeated measures designs are discussed.

  • PDF

Analysis of Degradation Data Using Robust Experimental Design (강건 실험계획법을 이용한 열화자료의 분석)

  • 서순근;하천수
    • Journal of Korean Society for Quality Management
    • /
    • v.32 no.1
    • /
    • pp.113-129
    • /
    • 2004
  • The reliability of the product can be improved by making the product less sensitive to noises. Especially, it Is important to make products robust against various noise factors encountered in production and field environments. In this paper, the phenomenon of degradation assumes a simple random coefficient degradation model to present analysis procedures of degradation data for robust experimental design. To alleviate weak points of previous studies, such as Taguchi's, Wasserman's, and pseudo failure time methods, novel techniques for analysis of degradation data using the cross array that regards amount of degradation as a dynamic characteristic for time are proposed. Analysis approach for degradation data using robust experimental design are classified by assumptions on parametric or nonparametric degradation rate(or slope). Also, a simulation study demonstrates the superiority of proposed methods over some previous works.

Factors Influencing Sexual Experiences in Adolescents Using a Random Forest Model: Secondary Data Analysis of the 2019~2021 Korea Youth Risk Behavior Web-based Survey Data (랜덤 포레스트 모델을 활용한 국내 청소년 성경험 영향요인 분석 연구: 2019~2021년 청소년건강행태조사 데이터)

  • Yang, Yoonseok;Kwon, Ju Won;Yang, Youngran
    • Journal of Korean Academy of Nursing
    • /
    • v.54 no.2
    • /
    • pp.193-210
    • /
    • 2024
  • Purpose: The objective of this study was to develop a predictive model for the sexual experiences of adolescents using the random forest method and to identify the "variable importance." Methods: The study utilized data from the 2019 to 2021 Korea Youth Risk Behavior Web-based Survey, which included 86,595 man and 80,504 woman participants. The number of independent variables stood at 44. SPSS was used to conduct Rao-Scott χ2 tests and complex sample t-tests. Modeling was performed using the random forest algorithm in Python. Performance evaluation of each model included assessments of precision, recall, F1-score, receiver operating characteristics curve, and area under the curve calculations derived from the confusion matrix. Results: The prevalence of sexual experiences initially decreased during the COVID-19 pandemic, but later increased. "Variable importance" for predicting sexual experiences, ranked in the top six, included week and weekday sedentary time and internet usage time, followed by ease of cigarette purchase, age at first alcohol consumption, smoking initiation, breakfast consumption, and difficulty purchasing alcohol. Conclusion: Education and support programs for promoting adolescent sexual health, based on the top-ranking important variables, should be integrated with health behavior intervention programs addressing internet usage, smoking, and alcohol consumption. We recommend active utilization of the random forest analysis method to develop high-performance predictive models for effective disease prevention, treatment, and nursing care.

Prediction of spatio-temporal AQI data

  • KyeongEun Kim;MiRu Ma;KyeongWon Lee
    • Communications for Statistical Applications and Methods
    • /
    • v.30 no.2
    • /
    • pp.119-133
    • /
    • 2023
  • With the rapid growth of the economy and fossil fuel consumption, the concentration of air pollutants has increased significantly and the air pollution problem is no longer limited to small areas. We conduct statistical analysis with the actual data related to air quality that covers the entire of South Korea using R and Python. Some factors such as SO2, CO, O3, NO2, PM10, precipitation, wind speed, wind direction, vapor pressure, local pressure, sea level pressure, temperature, humidity, and others are used as covariates. The main goal of this paper is to predict air quality index (AQI) spatio-temporal data. The observations of spatio-temporal big datasets like AQI data are correlated both spatially and temporally, and computation of the prediction or forecasting with dependence structure is often infeasible. As such, the likelihood function based on the spatio-temporal model may be complicated and some special modelings are useful for statistically reliable predictions. In this paper, we propose several methods for this big spatio-temporal AQI data. First, random effects with spatio-temporal basis functions model, a classical statistical analysis, is proposed. Next, neural networks model, a deep learning method based on artificial neural networks, is applied. Finally, random forest model, a machine learning method that is closer to computational science, will be introduced. Then we compare the forecasting performance of each other in terms of predictive diagnostics. As a result of the analysis, all three methods predicted the normal level of PM2.5 well, but the performance seems to be poor at the extreme value.

Projection analysis for split-plot data (분할구자료의 사영분석)

  • Choi, Jaesung
    • The Korean Journal of Applied Statistics
    • /
    • v.30 no.3
    • /
    • pp.335-344
    • /
    • 2017
  • This paper discusses a method of analyzing data from split-plot experiments by projections. The assumed model for data has two experimental errors due to two different experimental sizes and some random components in treatment effects. Residual random models are constructed to obtain sums of squares due to random effects. Expectations of sums of squares are obtained by Hartley's synthesis. Estimable functions of fixed effects are discussed.

Applying the Nash Equilibrium to Constructing Covert Channel in IoT

  • Ho, Jun-Won
    • International Journal of Internet, Broadcasting and Communication
    • /
    • v.13 no.1
    • /
    • pp.243-248
    • /
    • 2021
  • Although many different types of covert channels have been suggested in the literature, there are little work in directly applying game theory to building up covert channel. This is because researchers have mainly focused on tailoring game theory for covert channel analysis, identification, and covert channel problem solving. Unlike typical adaptation of game theory to covert channel, we show that game theory can be utilized to establish a new type of covert channel in IoT devices. More specifically, we propose a covert channel that can be constructed by utilizing the Nash Equilibrium with sensor data collected from IoT devices. For covert channel construction, we set random seed to the value of sensor data and make payoff from random number created by running pseudo random number generator with the configured random seed. We generate I × J (I ≥ 2, J ≥ 2) matrix game with these generated payoffs and attempt to obtain the Nash Equilibrium. Covert channel construction method is distinctly determined in accordance with whether or not to acquire the Nash Equilibrium.

A Clustering Approach for Feature Selection in Microarray Data Classification Using Random Forest

  • Aydadenta, Husna;Adiwijaya, Adiwijaya
    • Journal of Information Processing Systems
    • /
    • v.14 no.5
    • /
    • pp.1167-1175
    • /
    • 2018
  • Microarray data plays an essential role in diagnosing and detecting cancer. Microarray analysis allows the examination of levels of gene expression in specific cell samples, where thousands of genes can be analyzed simultaneously. However, microarray data have very little sample data and high data dimensionality. Therefore, to classify microarray data, a dimensional reduction process is required. Dimensional reduction can eliminate redundancy of data; thus, features used in classification are features that only have a high correlation with their class. There are two types of dimensional reduction, namely feature selection and feature extraction. In this paper, we used k-means algorithm as the clustering approach for feature selection. The proposed approach can be used to categorize features that have the same characteristics in one cluster, so that redundancy in microarray data is removed. The result of clustering is ranked using the Relief algorithm such that the best scoring element for each cluster is obtained. All best elements of each cluster are selected and used as features in the classification process. Next, the Random Forest algorithm is used. Based on the simulation, the accuracy of the proposed approach for each dataset, namely Colon, Lung Cancer, and Prostate Tumor, achieved 85.87%, 98.9%, and 89% accuracy, respectively. The accuracy of the proposed approach is therefore higher than the approach using Random Forest without clustering.

Efficient hardware implementation and analysis of true random-number generator based on beta source

  • Park, Seongmo;Choi, Byoung Gun;Kang, Taewook;Park, Kyunghwan;Kwon, Youngsu;Kim, Jongbum
    • ETRI Journal
    • /
    • v.42 no.4
    • /
    • pp.518-526
    • /
    • 2020
  • This paper presents an efficient hardware random-number generator based on a beta source. The proposed generator counts the values of "0" and "1" and provides a method to distinguish between pseudo-random and true random numbers by comparing them using simple cumulative operations. The random-number generator produces labeled data indicating whether the count value is a pseudo- or true random number according to its bit value based on the generated labeling data. The proposed method is verified using a system based on Verilog RTL coding and LabVIEW for hardware implementation. The generated random numbers were tested according to the NIST SP 800-22 and SP 800-90B standards, and they satisfied the test items specified in the standard. Furthermore, the hardware is efficient and can be used for security, artificial intelligence, and Internet of Things applications in real time.

Informal Quality Data Analysis via Sentimental analysis and Word2vec method (감성분석과 Word2vec을 이용한 비정형 품질 데이터 분석)

  • Lee, Chinuk;Yoo, Kook Hyun;Mun, Byeong Min;Bae, Suk Joo
    • Journal of Korean Society for Quality Management
    • /
    • v.45 no.1
    • /
    • pp.117-128
    • /
    • 2017
  • Purpose: This study analyzes automobile quality review data to develop alternative analytical method of informal data. Existing methods to analyze informal data are based mainly on the frequency of informal data, however, this research tries to use correlation information of each informal data. Method: After sentimental analysis to acquire the user information for automobile products, three classification methods, that is, $na{\ddot{i}}ve$ Bayes, random forest, and support vector machine, were employed to accurately classify the informal user opinions with respect to automobile qualities. Additionally, Word2vec was applied to discover correlated information about informal data. Result: As applicative results of three classification methods, random forest method shows most effective results compared to the other classification methods. Word2vec method manages to discover closest relevant data with automobile components. Conclusion: The proposed method shows its effectiveness in terms of accuracy and sensitivity on the analysis of informal quality data, however, only two sentiments (positive or negative) can be categorized due to human errors. Further studies are required to derive more sentiments to accurately classify informal quality data. Word2vec method also shows comparative results to discover the relevance of components precisely.

Restricted maximum likelihood estimation of a censored random effects panel regression model

  • Lee, Minah;Lee, Seung-Chun
    • Communications for Statistical Applications and Methods
    • /
    • v.26 no.4
    • /
    • pp.371-383
    • /
    • 2019
  • Panel data sets have been developed in various areas, and many recent studies have analyzed panel, or longitudinal data sets. Maximum likelihood (ML) may be the most common statistical method for analyzing panel data models; however, the inference based on the ML estimate will have an inflated Type I error because the ML method tends to give a downwardly biased estimate of variance components when the sample size is small. The under estimation could be severe when data is incomplete. This paper proposes the restricted maximum likelihood (REML) method for a random effects panel data model with a censored dependent variable. Note that the likelihood function of the model is complex in that it includes a multidimensional integral. Many authors proposed to use integral approximation methods for the computation of likelihood function; however, it is well known that integral approximation methods are inadequate for high dimensional integrals in practice. This paper introduces to use the moments of truncated multivariate normal random vector for the calculation of multidimensional integral. In addition, a proper asymptotic standard error of REML estimate is given.