• Title/Summary/Keyword: Outlier Data


An Outlier Detection Algorithm and Data Integration Technique for Prediction of Hypertension (고혈압 예측을 위한 이상치 탐지 알고리즘 및 데이터 통합 기법)

  • Khongorzul Dashdondov;Mi-Hye Kim;Mi-Hwa Song
    • Annual Conference of KIPS / 2023.05a / pp.417-419 / 2023
  • Hypertension is one of the leading causes of mortality worldwide. In recent years, its incidence has increased dramatically, not only among the elderly but also among young people, and machine-learning methods are increasingly used to diagnose its causes. In this study, we improved hypertension prediction by applying Mahalanobis distance-based multivariate outlier removal to the KNHANES database from the Korean national health data and the COVID-19 dataset from Kaggle. The study was divided into two modules. The first, a data preprocessing step, merged the datasets and applied decision-tree classifier-based feature selection. The second, a predictive analysis step, removed multivariate outliers from the experimental dataset using the Mahalanobis distance and then predicted hypertension. We compared the accuracy of each classification model; the proposed MAH_RF algorithm gave the best result, with an accuracy of 82.66%. The proposed method can be used not only for hypertension but also for the detection of other diseases such as stroke and cardiovascular disease.
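
As an illustration of the outlier-removal step described in this abstract, the following is a minimal sketch of Mahalanobis distance-based filtering for two-dimensional data. It is not the authors' MAH_RF pipeline: the function names, the restriction to two features, and the chi-square cutoff (5.991, the 95% quantile with 2 degrees of freedom) are assumptions made here for illustration.

```python
def mahalanobis_sq_2d(points):
    """Squared Mahalanobis distance of each 2-D point from the sample mean."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    # Entries of the sample covariance matrix (denominator n - 1).
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    if det == 0:
        raise ValueError("covariance matrix is singular")
    distances = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # d^2 = [dx dy] S^{-1} [dx dy]^T, with the 2x2 inverse written out.
        distances.append((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)
    return distances


def remove_outliers(points, threshold=5.991):
    """Keep points whose squared distance is at most the cutoff (assumed value)."""
    d2 = mahalanobis_sq_2d(points)
    return [p for p, d in zip(points, d2) if d <= threshold]
```

In a workflow like the one summarized above, the filtered rows would then be passed to a classifier; that stage is omitted here because the abstract does not give its settings.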

Building the Outlier Candidate Discrimination Training Data based on Inventory for Automatic Classification of Transferred Records (이관 기록물 분류 자동화를 위한 목록 기반 이상치 판별 학습데이터 구축)

  • Jeong, Ji-Hye;Lee, Gemma;Wang, Hosung;Oh, Hyo-Jung
    • Journal of Korean Society of Archives and Records Management / v.22 no.1 / pp.43-59 / 2022
  • Electronic public records are classified at the time of production and assigned a preservation period; after a certain period, they are transferred to an archive and preserved. This study seeks a way to improve efficiency and maintain consistent standards in classifying transferred records. To this end, the current record classification work process of the National Archives of Korea was analyzed and its problems were identified. To minimize the manual work of record classification while incorporating the required improvements, a process for identifying outlier candidates from a list of the classification information of transferred records was proposed and systematized. The proposed outlier discrimination process was then applied to actual records transferred to the National Archives of Korea. The results were standardized and built into a training data format that can be used for machine learning in the future.

Diagnosis of Observations after Fit of Multivariate Skew t-Distribution: Identification of Outliers and Edge Observations from Asymmetric Data

  • Kim, Seung-Gu
    • The Korean Journal of Applied Statistics / v.25 no.6 / pp.1019-1026 / 2012
  • This paper presents a method for identifying "edge observations," located on a boundary area constructed by a truncation variable, as well as outliers, after fitting a multivariate skew $t$-distribution (MST) to asymmetric data. Detecting edge observations is important in data analysis because it provides information on a certain critical area of the observation space. The proposed method is applied to the Australian Institute of Sport (AIS) dataset, which is well known for asymmetry in its data space.

Sequence-based 5-mers highly correlated to epigenetic modifications in genes interactions

  • Salimi, Dariush;Moeini, Ali;Masoudi-Nejad, Ali
    • Genes and Genomics / v.40 no.12 / pp.1363-1371 / 2018
  • One of the main concerns in biology, and one that has received a great deal of attention from researchers, is extracting sophisticated features from DNA sequences to determine gene interactions. Epigenetic modifications and their patterns are widely recognized as dominant features affecting gene expression. However, the study of sequence-based features highly correlated with this key element has remained limited. The main objective of this research was to propose a new feature, highly correlated with epigenetic modifications, that is capable of classifying genes. In this paper, 34 genes in the PPAR signaling pathway associated with muscle fat tissue in humans were classified. Using different statistical outlier detection methods, we propose that 5-mers highly correlated with epigenetic modifications can correctly categorize genes involved in the same biological pathway or process. The 34 genes were classified using the proposed feature, 5-mers strongly associated with 17 different epigenetic modifications, and diverse statistical outlier detection methods were applied to identify the group of highly correlated genes. The results indicated that these 5-mers can appropriately identify correlated genes. In addition, our results agreed with GeneMania interaction information, which supports the suggested method. These findings imply that not only epigenetic modifications but also their highly correlated 5-mers can be used as supplementary data for reconstructing gene regulatory networks, as well as for other applications such as physical interaction and gene prioritization, indicating a form of data fusion in this analysis.
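
For readers unfamiliar with the term, a 5-mer is a length-5 substring of a DNA sequence. The sketch below only counts overlapping k-mers, the basic step behind any k-mer feature; the function name is hypothetical, and the correlation with epigenetic modifications and the outlier detection described in the abstract are not reproduced.

```python
from collections import Counter


def kmer_counts(sequence, k=5):
    """Count every overlapping k-mer in a DNA sequence (case-insensitive)."""
    sequence = sequence.upper()
    # Slide a window of width k one base at a time.
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
```

For example, the sequence "ACGTACGTA" contains five overlapping 5-mers, of which "ACGTA" occurs twice.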

Skew Normal Boxplot and Outliers

  • Huh, Myung-Hoe;Lee, Yong-Goo
    • Communications for Statistical Applications and Methods / v.19 no.4 / pp.591-595 / 2012
  • We frequently use Tukey's boxplot to identify outliers in a batch of observations of a continuous variable. In doing so, we implicitly assume that the underlying distribution belongs to the family of normal distributions. This practice is often superficial and improper, since in reality many variables exhibit skewness. In this short paper, we build a modified boxplot and set up an outlier identification procedure by assuming that the observations are generated from the skew normal distribution (Azzalini, 1985), which is an extension of the normal distribution. The statistical performance of the proposed procedure is examined with simulated datasets.
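
For reference, the standard Tukey rule that this paper takes as its starting point can be sketched as follows. This is the conventional boxplot rule, not the skew-normal modification the authors propose; the function names and the choice of the "inclusive" quartile method are assumptions made here.

```python
import statistics


def tukey_fences(values, k=1.5):
    """Return the lower and upper fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def tukey_outliers(values, k=1.5):
    """Return the observations that fall outside the fences."""
    low, high = tukey_fences(values, k)
    return [v for v in values if v < low or v > high]
```

Because the fences are symmetric around the quartiles, skewed data tend to produce many flagged points in the long tail, which is the problem the abstract raises.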

Weight Reduction Method for Outlier in Survey Sampling

  • Kim Jin
    • Communications for Statistical Applications and Methods / v.13 no.1 / pp.19-27 / 2006
  • Outliers are a perennial problem for applied survey statisticians estimating a population total or mean. Their influence grows when they carry large weights in survey sampling. Many techniques have been studied to lower the impact of outliers on sample survey estimates. Outliers can be downweighted by winsorization or by reducing their weights. Weight reduction is more reasonable than replacing an outlier with a non-outlier value, because the outlier then still represents at least one unit. In this paper, we suggest the square root transformation of the weight as the weight reduction method. We show with real data that this method is efficient and also easy to apply in practice.
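
A rough sketch of the square-root weight reduction mentioned in this abstract is given below. It assumes that outliers have already been flagged, that weights are at least 1, and that the remaining weights are left unchanged; the abstract does not state whether the paper rescales them, and the function names are hypothetical.

```python
import math


def weighted_total(values, weights):
    """Weighted estimate of the population total: sum of w_i * y_i."""
    return sum(w * y for y, w in zip(values, weights))


def reduce_outlier_weights(weights, is_outlier):
    """Replace the weight of each flagged unit by its square root."""
    return [math.sqrt(w) if flag else w for w, flag in zip(weights, is_outlier)]
```

With values 10, 12, and 500, each carrying weight 100, and the last unit flagged, the estimated total drops from 52200 to 7200, while the flagged unit keeps a weight of 10 rather than being discarded.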

A procedure for simultaneous variable selection, variable transformation and outlier identification in linear regression (선형회귀에서 변수선택, 변수변환과 이상치 탐지의 동시적 수행을 위한 절차)

  • Seo, Han Son;Yoon, Min
    • The Korean Journal of Applied Statistics / v.33 no.1 / pp.1-10 / 2020
  • We propose a unified approach to variable selection, variable transformation, and outlier identification in the linear model. The procedure includes a sequential method for outlier detection and a least trimmed squares estimator for variable transformation. It uses all possible subsets regressions for model selection. Real data analyses and simulation results are provided to show the efficiency of the method in terms of correct variable selection and the fit of the estimated model.

Development of Healthcare Data Quality Control Algorithm Using Interactive Decision Tree: Focusing on Hypertension in Diabetes Mellitus Patients (대화식 의사결정나무를 이용한 보건의료 데이터 질 관리 알고리즘 개발: 당뇨환자의 고혈압 동반을 중심으로)

  • Hwang, Kyu-Yeon;Lee, Eun-Sook;Kim, Go-Won;Hong, Seong-Ok;Park, Jung-Sun;Kwak, Mi-Sook;Lee, Ye-Jin;Lim, Chae-Hyeok;Park, Tae-Hyun;Park, Jong-Ho;Kang, Sung-Hong
    • The Korean Journal of Health Service Management / v.10 no.3 / pp.63-74 / 2016
  • Objectives: There is a need to develop a data quality management algorithm to improve the quality of healthcare data using a data quality management system. In this study, we developed a data quality control algorithm associated with hypertension-related diseases in patients with diabetes mellitus. Methods: To build the data quality algorithm, we extracted the 2011 and 2012 discharge damage survey data for diabetes mellitus patients. Derived variables were created using the primary diagnosis, diagnostic unit, primary surgery and treatment, and minor surgery and treatment items. Results: Significant factors in diabetes mellitus patients with hypertension were sex, age, ischemic heart disease, and diagnostic ultrasound of the heart. Based on the decision tree results, we found four groups with extreme values among diabetes patients with accompanying hypertension. Conclusions: The actual data contained in the outlier (extreme value) groups need to be checked to improve the quality of the data.

A ROBUST ESTIMATOR FOR INTERPOLATING REGIONALIZED VARIABLES

  • SUNGKWON KANG
    • Journal of applied mathematics & informatics / v.4 no.2 / pp.419-432 / 1997
  • A robust estimator for interpolating spatially distributed regionalized variables is introduced. It reduces outlier effects on obtaining the correlation between spatial lags and the corresponding semi-variances and produces a significantly improved semivariogram compared with those of conventional estimators. This estimator is applied to a field experimental data set.
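
As background, the conventional (classical) semivariogram estimator that such robust estimators are compared against averages squared differences between observations a fixed lag apart. The sketch below assumes regularly spaced one-dimensional data and a hypothetical function name; the robust estimator introduced in the paper is not reproduced because the abstract does not specify it.

```python
def classical_semivariance(z, lag):
    """Classical estimator: sum of (z[i+lag] - z[i])^2 over pairs, divided by 2*N(lag)."""
    pairs = [(z[i], z[i + lag]) for i in range(len(z) - lag)]
    if not pairs:
        raise ValueError("lag is too large for the data")
    return sum((b - a) ** 2 for a, b in pairs) / (2 * len(pairs))
```

Because the differences are squared, a single extreme observation can dominate the estimate at every lag it takes part in, which is the sensitivity a robust estimator aims to reduce.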

SOURCES OF HIGH LEVERAGE IN LINEAR REGRESSION MODEL

  • Kim, Myung-Geun
    • Journal of applied mathematics & informatics / v.16 no.1_2 / pp.509-513 / 2004
  • Some reasons for high leverage are analytically investigated by decomposing leverage into meaningful components. The results in this work can be used for remedial action as a next step of data analysis.
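
For context, in simple linear regression with an intercept the leverage of observation i is h_ii = 1/n + (x_i - mean(x))^2 / S_xx. The sketch below computes these values and flags points above the common 2p/n rule of thumb. The function names and the cutoff are assumptions made here, and this textbook formula is not claimed to be the decomposition derived in the paper.

```python
def leverages(x):
    """Diagonal of the hat matrix for a regression of y on an intercept and x."""
    n = len(x)
    mean = sum(x) / n
    sxx = sum((xi - mean) ** 2 for xi in x)
    return [1 / n + (xi - mean) ** 2 / sxx for xi in x]


def high_leverage_points(x, p=2):
    """Flag x values whose leverage exceeds 2p/n (rule-of-thumb cutoff)."""
    cutoff = 2 * p / len(x)
    return [xi for xi, h in zip(x, leverages(x)) if h > cutoff]
```

For x = 1, 2, 3, 4, 10 the leverages sum to 2, the number of fitted coefficients, and only the point at x = 10 exceeds the cutoff of 0.8.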