• Title/Summary/Keyword: Outlier diagnostic

Outlier Detection Diagnostic based on Interpolation Method in Autoregressive Models

  • Cho, Sin-Sup;Ryu, Gui-Yeol;Park, Byeong-Uk;Lee, Jae-June
    • Journal of the Korean Statistical Society / v.22 no.2 / pp.283-306 / 1993
  • An outlier detection diagnostic for detecting k consecutive atypical observations is considered. The proposed diagnostic is based on an estimate of the innovational variance that uses both the interpolated and the predicted residuals. The interpolation method is used to construct the diagnostic by replacing atypical observations. The performance of the proposed diagnostic is investigated by simulation, and a real example is presented.
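
The interpolation idea can be sketched for the simplest case. The snippet below assumes a zero-mean AR(1) process with a known coefficient `phi` and a single suspect point (k = 1); the abstract covers k consecutive observations in general autoregressive models, so this is only an illustration, and the function names are hypothetical.

```python
def ar1_interpolate(series, t, phi):
    """Best linear interpolator of series[t] from its two neighbours,
    assuming a zero-mean AR(1) process with known coefficient phi."""
    if not 0 < t < len(series) - 1:
        raise ValueError("t must be an interior index")
    return phi * (series[t - 1] + series[t + 1]) / (1.0 + phi ** 2)


def interpolated_residual(series, t, phi):
    """Observed value minus its interpolated value; a large magnitude
    suggests an atypical observation at position t."""
    return series[t] - ar1_interpolate(series, t, phi)


# Made-up series with a spike at index 2.
x = [0.2, 0.1, 5.0, 0.3, 0.1]
print(interpolated_residual(x, 2, phi=0.5))
```

Replacing the suspect value with its interpolated value, as the abstract describes, would then amount to `x[2] = ar1_interpolate(x, 2, phi)` in this simplified setting.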

Influence in Testing the Equality of Two Covariance Matrices (두개의 공분산 행렬의 동질성 검정에서의 영향치 분석)

  • Myung Geun Kim
    • The Korean Journal of Applied Statistics / v.7 no.2 / pp.213-224 / 1994
  • A diagnostic method for detecting outliers when testing the equality of two covariance matrices is developed using the influence curve approach. The method generalizes easily to more than two covariance matrices. A sample version of the influence measure for detecting outliers is considered, based on the empirical distribution functions. Its component terms include the well-known test statistic introduced by Wilks for detecting one outlier at a time and its generalization to the two-group case.

Multiple Outlier Detection Using a Genetic Algorithm (유전자 알고리듬을 이용한 다중이상치 탐색)

  • Go Yeong-Hyeon;Lee Hye-Seon;Jeon Chi-Hyeok
    • Proceedings of the Korean Statistical Society Conference / 2000.11a / pp.173-179 / 2000
  • A genetic algorithm (GA) is applied to the detection of multiple outliers. GA is a heuristic optimization tool that searches for near-optimal solutions. We compare the performance of GA with that of other diagnostic measures commonly used for detecting outliers in regression models. The results suggest that GA performs better than the other measures in detecting multiple outliers.

A Study on Applications of Regression Diagnostic Method to Technometrics, and the Statistical Quality Control

  • Kim, Soon-Kwi
    • Journal of Korean Society for Quality Management / v.21 no.1 / pp.55-64 / 1993
  • This article is concerned with procedures for detecting one or more outliers or influential observations in a linear regression model. A test procedure based on recursive residuals is proposed and developed. The power of the test procedure to identify one or more outliers, and how that power depends on the number and configuration of the outliers, is investigated through simulation.
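
For readers unfamiliar with recursive residuals, here is a minimal sketch for simple linear regression (one predictor plus an intercept). It shows only the standard definition, not the test procedure proposed in the article; the function name and the data are made up for illustration.

```python
import math


def recursive_residuals(x, y):
    """Recursive residuals for the model y = a + b*x.

    For each point from the third onward, fit least squares on all
    earlier points, predict the new point, and divide the prediction
    error by sqrt(1 + 1/m + (x_new - xbar)^2 / Sxx), where m is the
    number of earlier points.
    """
    out = []
    for r in range(2, len(x)):            # r indexes the point being predicted
        xs, ys = x[:r], y[:r]
        m = len(xs)
        xbar = sum(xs) / m
        ybar = sum(ys) / m
        sxx = sum((v - xbar) ** 2 for v in xs)
        if sxx == 0:
            raise ValueError("earlier points must not all share one x value")
        b = sum((u - xbar) * (v - ybar) for u, v in zip(xs, ys)) / sxx
        a = ybar - b * xbar
        scale = math.sqrt(1.0 + 1.0 / m + (x[r] - xbar) ** 2 / sxx)
        out.append((y[r] - (a + b * x[r])) / scale)
    return out


# Made-up data: the last response departs from the line y = 1 + 2x.
print(recursive_residuals([1, 2, 3, 4, 5], [3, 5, 7, 9, 20]))
```

Under the usual normal-error assumptions these residuals are uncorrelated with common variance, which is what makes them convenient for sequential outlier tests.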

Outlier Detection and Treatment for the Conversion of Chemical Oxygen Demand to Total Organic Carbon (화학적산소요구량의 총유기탄소 변환을 위한 이상자료의 탐지와 처리)

  • Cho, Beom Jun;Cho, Hong Yeon;Kim, Sung
    • Journal of Korean Society of Coastal and Ocean Engineers / v.26 no.4 / pp.207-216 / 2014
  • Total organic carbon (TOC) is an important indicator used as a direct biological index in marine carbon cycle research. Because far fewer TOC data are available than Chemical Oxygen Demand (COD) data, COD data can be used to produce sufficient TOC estimates. Outlier detection and treatment (removal) must be carried out reasonably and objectively, because the COD-TOC conversion equation directly affects the TOC estimates. This study aims to suggest the optimal regression model using the available salinity, COD, and TOC data observed in the Korean coastal zone. The optimal model is selected by comparing, across diverse methods for detecting outliers and influential observations, the change in the number of data points before and after removal, the coefficients of variation, and the root mean square (RMS) error. The results show that a diagnostic combining the SIQR (Semi-Inter-Quartile Range) boxplot with Cook's distance is the most suitable for outlier detection. The optimal regression function is estimated as $\mathrm{TOC} = 0.44 \cdot \mathrm{COD} + 1.53$ (both in mg/L), with a coefficient of determination of 0.47 and an RMS error of 0.85 mg/L. The RMS error and the coefficient of variation of the leverage values are reduced to 31% and 80%, respectively, of their values before outlier removal. The suggested method can provide a more appropriate regression curve because it removes the excessive influence of the outliers frequently present in COD and TOC monitoring data.
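
A minimal sketch of the Cook's distance part of such a diagnostic, for a one-predictor regression of TOC on COD. The data are invented, the 4/n cut-off is a common rule of thumb rather than the paper's criterion, and the SIQR boxplot step is not reproduced here.

```python
def cooks_distance(x, y):
    """Cook's distance of each observation in the fit y = a + b*x."""
    n = len(x)
    p = 2                                  # parameters: intercept and slope
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((v - xbar) ** 2 for v in x)
    b = sum((u - xbar) * (v - ybar) for u, v in zip(x, y)) / sxx
    a = ybar - b * xbar
    resid = [v - (a + b * u) for u, v in zip(x, y)]
    s2 = sum(e * e for e in resid) / (n - p)            # residual variance
    lev = [1.0 / n + (u - xbar) ** 2 / sxx for u in x]  # leverage values
    return [e * e / (p * s2) * h / (1.0 - h) ** 2 for e, h in zip(resid, lev)]


# Invented COD and TOC values (mg/L); the last TOC value is suspicious.
cod = [1, 2, 3, 4, 5, 6, 7, 8]
toc = [2.0, 2.4, 2.9, 3.3, 3.7, 4.2, 4.6, 12.0]
d = cooks_distance(cod, toc)
flagged = [i for i, di in enumerate(d) if di > 4.0 / len(cod)]  # assumed cut-off
print(flagged)
```

After removing flagged points the regression would be refitted, and quantities such as the RMS error compared before and after removal, as in the abstract.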

Development of Healthcare Data Quality Control Algorithm Using Interactive Decision Tree: Focusing on Hypertension in Diabetes Mellitus Patients (대화식 의사결정나무를 이용한 보건의료 데이터 질 관리 알고리즘 개발: 당뇨환자의 고혈압 동반을 중심으로)

  • Hwang, Kyu-Yeon;Lee, Eun-Sook;Kim, Go-Won;Hong, Seong-Ok;Park, Jung-Sun;Kwak, Mi-Sook;Lee, Ye-Jin;Lim, Chae-Hyeok;Park, Tae-Hyun;Park, Jong-Ho;Kang, Sung-Hong
    • The Korean Journal of Health Service Management / v.10 no.3 / pp.63-74 / 2016
  • Objectives: A data quality management algorithm is needed to improve the quality of healthcare data within a data quality management system. In this study, we developed a data quality control algorithm for hypertension-related diseases in patients with diabetes mellitus. Methods: To build the algorithm, we extracted the 2011 and 2012 discharge damage survey data for diabetes mellitus patients. Derived variables were created from the primary diagnosis, diagnostic unit, primary surgery and treatment, and minor surgery and treatment items. Results: Significant factors for hypertension in diabetes mellitus patients were sex, age, ischemic heart disease, and diagnostic ultrasound of the heart. Based on the decision tree results, we found four groups with extreme values among diabetes patients with accompanying hypertension. Conclusions: The actual data in the outlier (extreme value) groups need to be checked to improve the quality of the data.

Value of Contrast-Enhanced Ultrasonography in the Differential Diagnosis of Enlarged Lymph Nodes: a Meta-Analysis of Diagnostic Accuracy Studies

  • Jin, Ya;He, Yu-Shuang;Zhang, Ming-Ming;Parajuly, Shyam Sundar;Chen, Shuang;Zhao, Hai-Na;Peng, Yu-Lan
    • Asian Pacific Journal of Cancer Prevention / v.16 no.6 / pp.2361-2368 / 2015
  • Objective: To evaluate the diagnostic accuracy of contrast-enhanced ultrasonography (CEUS) in differentiating between benign and malignant enlarged lymph nodes using meta-analysis. Materials and Methods: PubMed, Embase, SCI and Cochrane databases were searched for studies (up to September 1, 2014) reporting the diagnostic performance of CEUS in discriminating between benign and malignant lymph nodes. Inclusion criteria were: prospective study; histopathology as the reference standard; and sufficient data to construct $2 \times 2$ contingency tables. Methodological quality was assessed using Quality Assessment of Diagnostic Accuracy Studies-2 (QUADAS-2). Patient clinical characteristics, sensitivity and specificity were extracted. The summary receiver operating characteristic curve was used to examine the accuracy of CEUS. A meta-analysis was performed to evaluate the clinical utility in identification of benign and malignant lymph nodes. Sensitivity analysis was performed after omitting outliers identified in a bivariate boxplot, and publication bias was assessed with Egger's test. Results: The pooled sensitivity, specificity and AUROC were 0.92 (95% CI, 0.85-0.96), 0.91 (95% CI, 0.82-0.95) and 0.97 (95% CI, 0.95-0.98), respectively. After omitting 3 outlier studies, heterogeneity decreased. Sensitivity analysis demonstrated no disproportionate influence of individual studies. Publication bias was not significant. Conclusions: CEUS is a promising diagnostic modality in differentiating between benign and malignant lymph nodes and can potentially reduce unnecessary fine-needle aspiration biopsies of benign nodes.

Temperature Effect-free Impedance-based Local Damage Detection (온도변화에 자유로운 임피던스 기반 국부 손상검색)

  • Koo, Ki-Young;Park, Seung-Hee;Lee, Jong-Jae;Yun, Chung-Bang
    • Proceedings of the Computational Structural Engineering Institute Conference / 2007.04a / pp.21-26 / 2007
  • This paper presents an impedance-based structural health monitoring (SHM) technique that accounts for temperature effects. Temperature variation causes significant impedance variation, in particular both horizontal and vertical shifts in the frequency domain, which may lead to erroneous diagnostic results for real structures. A new damage detection strategy is proposed based on the correlation coefficient (CC) between the reference impedance data and concurrent impedance data after an effective frequency shift, defined as the shift that yields the maximum correlation. The proposed technique was applied to a lab-sized steel truss bridge member in a temperature-varying environment. The experimental study demonstrated that a narrow cut inflicted artificially on the steel structure was successfully detected using the proposed SHM strategy.
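
The effective frequency shift described above can be sketched as a search over integer sample shifts for the one that maximizes the Pearson correlation. The signals, the integer-shift grid, and the function names are assumptions made for illustration; the paper's actual processing may differ.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    if saa == 0 or sbb == 0:
        return 0.0                      # constant segment: CC undefined, use 0
    return sab / math.sqrt(saa * sbb)


def effective_shift_cc(reference, current, max_shift):
    """Return (shift, cc): the integer shift s in [-max_shift, max_shift]
    maximizing the CC between reference[i + s] and current[i] on the overlap."""
    best_cc, best_shift = -2.0, 0
    for s in range(-max_shift, max_shift + 1):
        if s >= 0:
            ref, cur = reference[s:], current[:len(current) - s]
        else:
            ref, cur = reference[:len(reference) + s], current[-s:]
        if len(ref) < 2:
            continue                    # not enough overlap to correlate
        cc = pearson(ref, cur)
        if cc > best_cc:
            best_cc, best_shift = cc, s
    return best_shift, best_cc


# Synthetic resonance peak, moved 3 samples and offset vertically.
reference = [math.exp(-((i - 20) / 3.0) ** 2) for i in range(50)]
current = [0.1 + math.exp(-((i - 17) / 3.0) ** 2) for i in range(50)]
print(effective_shift_cc(reference, current, max_shift=5))
```

Because the Pearson coefficient is unchanged by adding a constant to either signal, a purely vertical shift does not lower the CC; only the horizontal shift needs to be searched.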

On Feasibility of Ambulatory KDRGs for the Classification of Health Insurance Claims (KDRG를 이용한 건강보험 외래 진료비 분류 타당성)

  • 박하영;박기동;신영수
    • Health Policy and Management / v.13 no.1 / pp.98-115 / 2003
  • Concerns about growing health insurance expenditures became a national issue in 2001, when the National Health Insurance went into deficit. Increases in spending on ambulatory care accounted for the largest share of the problem. Methods and systems to control the spending should be developed, and a system to measure the case mix of providers is one of the core components of such a control system. The objective of this article is to examine the feasibility of applying Korean Diagnosis Related Groups (KDRGs) to classify health insurance claims for ambulatory care and to identify problem areas of the classification. A database of 11,586,270 claims for ambulatory care delivered during January 2002 was obtained for the study; after KDRG numbers were assigned and records with an error KDRG were excluded, 8,319,494 claims were analyzed. The unit of analysis was a claim, and resource use was measured by the sum of charges incurred during a month at a department of a hospital or at a clinic. Within-group variance was assessed by the coefficient of variation (CV), and classification accuracy was evaluated by the variance reduction achieved by the KDRG classification. The analyses were performed on both all data and non-outlier data, and on a subset of the database to examine the validity of the study results. Data were assigned to 787 of the 1,244 KDRGs defined in the classification system. For non-outlier data, 77.4% of KDRGs had a CV of charges below 100% for tertiary care hospitals, and 95.43% of KDRGs did so for clinics. The variance reduction achieved by the KDRG classification was 40.80% for non-outlier claims from tertiary care hospitals, 51.98% for general hospitals, 40.89% for hospitals, and 54.99% for clinics. Similar results were obtained from the analyses performed on a subset of the study database. The study results indicated that KDRGs, developed for the classification of inpatient care, could be used for ambulatory care, although there were areas where the classification should be refined. Their power to predict resource utilization showed potential for measuring the case mix of providers in order to monitor and manage the delivery of ambulatory care. The quality of the diagnostic information contained in insurance claims remains to be improved, and future studies of other classification systems based on visits or episodes are warranted.
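
The two evaluation measures named in the abstract can be sketched as follows, assuming the usual definitions: CV as the sample standard deviation over the mean, and variance reduction as the share of the total sum of squares removed by grouping. The group labels and charges are invented, and the study's exact formulas may differ.

```python
import math


def coefficient_of_variation(values):
    """CV in percent: sample standard deviation divided by the mean."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    return 100.0 * math.sqrt(var) / mean


def variance_reduction(groups):
    """Percent of the total sum of squares explained by the grouping:
    100 * (1 - within-group SS / total SS)."""
    all_values = [v for g in groups.values() for v in g]
    grand = sum(all_values) / len(all_values)
    total_ss = sum((v - grand) ** 2 for v in all_values)
    within_ss = 0.0
    for g in groups.values():
        m = sum(g) / len(g)
        within_ss += sum((v - m) ** 2 for v in g)
    return 100.0 * (1.0 - within_ss / total_ss)


# Invented charges for two hypothetical groups.
charges = {"group_A": [10, 12, 14], "group_B": [30, 32, 34]}
for name, values in charges.items():
    print(name, coefficient_of_variation(values))
print(variance_reduction(charges))
```

A grouping with low within-group CVs and a high variance reduction separates claims into groups with homogeneous resource use, which is the property the study examines.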

Effect of outliers on the variable selection by the regularized regression

  • Jeong, Junho;Kim, Choongrak
    • Communications for Statistical Applications and Methods / v.25 no.2 / pp.235-243 / 2018
  • Many studies exist on the influence of one or a few observations on estimators in a variety of statistical models under the "large n, small p" setup; however, diagnostic issues in regression models have rarely been studied in a high-dimensional setup. In high-dimensional data, the influence of observations is more serious because the sample size n is much smaller than the number of variables p. Here, we investigate the influence of observations on the least absolute shrinkage and selection operator (LASSO) estimates, suggested by Tibshirani (Journal of the Royal Statistical Society, Series B, 58, 267-288, 1996), and the influence of observations on the variables selected by the LASSO in the high-dimensional setup. We also derive an analytic expression for the influence of the k-th observation on the LASSO estimates in simple linear regression. Numerical studies based on artificial and real data are presented for illustration. The numerical results show that the influence of observations on the LASSO estimates and on the variables selected by the LASSO is more severe in the high-dimensional setup than in the usual "large n, small p" setup.
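
To make the simple linear regression case concrete, the sketch below uses the closed-form LASSO slope for a no-intercept model (soft thresholding) and measures influence by deleting one observation and refitting. This case-deletion measure is an assumption chosen for illustration and is not the analytic expression derived in the paper; the function names and data are hypothetical.

```python
def soft_threshold(z, lam):
    """Soft-thresholding operator S(z, lam)."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_slope(x, y, lam):
    """Minimizer of 0.5 * sum((y_i - b * x_i)^2) + lam * |b| (no intercept).
    Assumes x is a list that is not identically zero."""
    sxy = sum(u * v for u, v in zip(x, y))
    sxx = sum(u * u for u in x)
    return soft_threshold(sxy, lam) / sxx


def deletion_influence(x, y, lam):
    """Change in the LASSO slope when each observation is deleted in turn."""
    full = lasso_slope(x, y, lam)
    return [
        full - lasso_slope(x[:k] + x[k + 1:], y[:k] + y[k + 1:], lam)
        for k in range(len(x))
    ]


# Made-up data lying on y = x, with a fixed penalty.
x = [1.0, 2.0, 3.0]
y = [1.0, 2.0, 3.0]
print(lasso_slope(x, y, lam=7.0))
print(deletion_influence(x, y, lam=7.0))
```

In this toy case, deleting the observation with the largest `x` pushes the slope to zero, so a single observation can decide whether the variable is selected at all.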