Search | Korea Science

A comparison of imputation methods using machine learning models

Heajung Suh;Jongwoo Song
- Communications for Statistical Applications and Methods / v.30 no.3 / pp.331-341 / 2023
Handling missing values in data analysis is essential in constructing a good prediction model. The easiest way to handle missing values is to use complete case data, but this can lead to information loss within the data and invalid conclusions in data analysis. Imputation is a technique that replaces missing data with alternative values obtained from information in a dataset. Conventional imputation methods include K-nearest-neighbor imputation and multiple imputation. Recent methods include missForest, missRanger, and mixgb, all of which use machine learning algorithms. This paper compares imputation techniques for datasets with mixed data types under various conditions, such as data size, missing ratio, and missing mechanism. To evaluate the performance of each method on mixed datasets, we propose a new imputation performance measure (IPM), a unified measure applicable to both numerical and categorical variables. We believe this metric can help find the best imputation method. Finally, we summarize the comparison results in terms of imputation performance and computational time.
https://doi.org/10.29220/CSAM.2023.30.3.331
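The kind of comparison this abstract describes can be illustrated with a minimal sketch: hide known values, impute them, and score numerical and categorical cells separately. Everything here is an assumption made for illustration. The function names are hypothetical, the imputer is a plain mean/mode baseline rather than any of the methods compared in the paper, and the two scores are ordinary RMSE and error rate, not the paper's IPM, which the abstract does not define.

```python
import math
from collections import Counter


def mask_mcar(rows, ratio, rng):
    """Copy rows, setting each cell to None with probability `ratio` (MCAR)."""
    return [[None if rng.random() < ratio else v for v in row] for row in rows]


def impute_simple(rows):
    """Baseline imputer: column mean for numeric columns, column mode otherwise."""
    filled = [list(r) for r in rows]
    for j in range(len(rows[0])):
        observed = [r[j] for r in rows if r[j] is not None]
        if not observed:
            continue  # nothing to learn from; leave the column as-is
        if all(isinstance(v, (int, float)) for v in observed):
            fill = sum(observed) / len(observed)
        else:
            fill = Counter(observed).most_common(1)[0][0]
        for r in filled:
            if r[j] is None:
                r[j] = fill
    return filled


def score(truth, masked, imputed):
    """Return (RMSE on masked numeric cells, error rate on masked categorical cells)."""
    sq, n_num, wrong, n_cat = 0.0, 0, 0, 0
    for t_row, m_row, f_row in zip(truth, masked, imputed):
        for tv, mv, fv in zip(t_row, m_row, f_row):
            if mv is not None or fv is None:
                continue  # only score cells that were hidden and then filled
            if isinstance(tv, (int, float)):
                sq += (tv - fv) ** 2
                n_num += 1
            else:
                wrong += tv != fv
                n_cat += 1
    rmse = math.sqrt(sq / n_num) if n_num else float("nan")
    err = wrong / n_cat if n_cat else float("nan")
    return rmse, err
```

Calling `mask_mcar(rows, 0.2, random.Random(0))` would hide roughly 20% of the cells completely at random; other missing mechanisms would need a different masking function.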

Comparing the Results of Big-Data with Questionnaire Survey (빅데이터 분석결과와 실증조사 결과의 비교)

Kim, Do-Goan;Shin, Seong-Yoon
- Journal of the Korea Institute of Information and Communication Engineering / v.20 no.11 / pp.2027-2032 / 2016
The rapid diffusion of smartphones and the development of data storage and analysis technology have made big data a promising industry for the future. In marketing, big-data analysis of social data can serve as an effective and efficient tool for understanding consumer needs. Before the age of big data, companies relied on traditional methods such as questionnaire surveys and marketing tests in which a small number of consumers participated. These traditional methods are still in use. Although both big-data analysis and traditional methods are useful for understanding consumers, it is necessary to check whether their results carry similar implications. To that end, this study compares the results of big-data analysis with those of a questionnaire survey on several cosmetics brands. The results show that big-data analysis and the questionnaire survey yield similar implications.
https://doi.org/10.6109/jkiice.2016.20.11.2027

Informal Quality Data Analysis via Sentimental analysis and Word2vec method (감성분석과 Word2vec을 이용한 비정형 품질 데이터 분석)

Lee, Chinuk;Yoo, Kook Hyun;Mun, Byeong Min;Bae, Suk Joo
- Journal of Korean Society for Quality Management / v.45 no.1 / pp.117-128 / 2017
Purpose: This study analyzes automobile quality review data to develop an alternative method for analyzing informal data. Existing methods for analyzing informal data are based mainly on the frequency of the data, whereas this research uses the correlation information among the data. Method: After sentiment analysis to acquire user information on automobile products, three classification methods (naïve Bayes, random forest, and support vector machine) were employed to accurately classify informal user opinions with respect to automobile quality. Additionally, Word2vec was applied to discover correlated information in the informal data. Result: Of the three classification methods, random forest gave the most effective results. The Word2vec method succeeded in discovering the data most closely related to automobile components. Conclusion: The proposed method is effective in terms of accuracy and sensitivity in the analysis of informal quality data. However, only two sentiments (positive or negative) can be categorized, due to human errors. Further studies are required to derive more sentiments so that informal quality data can be classified accurately. The Word2vec method also shows comparative results in precisely discovering the relevance of components.
https://doi.org/10.7469/JKSQM.2017.45.1.117

The development of statistical methods for retrieving MODIS missing data: Mean bias, regressions analysis and local variation method (MODIS 손실 자료 복원을 위한 통계적 방법 개발: 평균 편차 방법, 회귀 분석 방법과 지역 변동 방법)

Kim, Min Wook;Yi, Jonghyuk;Park, Yeon Gu;Song, Junghyun
- Journal of Satellite, Information and Communications / v.11 no.4 / pp.94-101 / 2016
Satellite data for remote sensing have limitations: especially with visible-range sensors, clouds and other environmental factors cause missing data. In this study, using land surface temperature data from the MODerate resolution Imaging Spectro-radiometer (MODIS), we developed three methods for retrieving missing satellite data: the mean bias, regression analysis, and local variation methods. These methods use the previous day's data as reference data. To validate them, we selected a specific measurement ratio using artificial missing data from 2014 to 2015. The local variation method showed low accuracy, with a root mean square error (RMSE) of more than 2 K in some cases, whereas the regression analysis method showed reliable results in most cases, with small RMSE values of approximately 1.13 K. The RMSE of the mean bias method, approximately 1.32 K, was similar to that of the regression analysis method.
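As a rough illustration of the idea of filling gaps from the previous day's data and validating with RMSE, the following sketch shows one plausible reading of a mean-bias correction. The abstract does not give the formula, so the definition of the bias (the average difference between today and the previous day over pixels observed on both days) and the function names are assumptions, not the authors' implementation.

```python
import math


def mean_bias_fill(today, previous):
    """Fill None entries of `today` with the previous-day value plus the mean bias.

    The bias is the average of (today - previous) over pixels observed on both days.
    """
    pairs = [(t, p) for t, p in zip(today, previous)
             if t is not None and p is not None]
    if not pairs:
        raise ValueError("no pixels observed on both days")
    bias = sum(t - p for t, p in pairs) / len(pairs)
    return [
        t if t is not None else (p + bias if p is not None else None)
        for t, p in zip(today, previous)
    ]


def rmse(estimates, truth):
    """Root mean square error between two equal-length sequences."""
    squared = [(e - t) ** 2 for e, t in zip(estimates, truth)]
    return math.sqrt(sum(squared) / len(squared))
```

Validation with artificial missing data then amounts to hiding known pixels, filling them, and computing `rmse` only over the hidden positions.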

Recent deep learning methods for tabular data

Yejin Hwang;Jongwoo Song
- Communications for Statistical Applications and Methods / v.30 no.2 / pp.215-226 / 2023
Deep learning has made great strides in the field of unstructured data such as text, images, and audio. However, in tabular data analysis, machine learning algorithms such as ensemble methods still outperform deep learning. To match the performance of machine learning algorithms with good predictive power, several deep learning methods for tabular data have been proposed recently. In this paper, we review the latest deep learning models for tabular data and compare their performance on several datasets. We also compare the latest boosting methods with these deep learning methods and suggest guidelines for users who analyze tabular datasets. In regression, machine learning methods are better than deep learning methods. In classification problems, however, deep learning methods perform better than machine learning methods in some cases.
https://doi.org/10.29220/CSAM.2023.30.2.215

Results of Discriminant Analysis with Respect to Cluster Analyses Under Dimensional Reduction

Chae, Seong-San
- Communications for Statistical Applications and Methods / v.9 no.2 / pp.543-553 / 2002
Principal component analysis is applied to reduce p dimensions to q dimensions ($q \leq p$). Any partition of a collection of data points with p and q variables generated by the application of six hierarchical clustering methods is re-classified by discriminant analysis. From the application of discriminant analysis through each hierarchical clustering method, correct classification ratios are obtained. The results illustrate which method is more reasonable in exploratory data analysis.
https://doi.org/10.5351/CKSS.2002.9.2.543

A System for Medical Statistical Analysis Using Guide Maps and Interactive Visualization (가이드 맵과 인터랙티브 시각화를 이용한 의료 통계분석 시스템)

Lee Don-Soo;Choi Soo-Mi
- Journal of Korea Multimedia Society / v.8 no.7 / pp.1000-1011 / 2005
This paper presents a system for medical statistical analysis that helps medical professionals analyze clinical data more easily and accurately. It is able to recommend proper methods according to the distribution of sample data, and provides guide maps composed of icons for the understanding of the process of analysis. Besides general statistical analysis, it includes commonly-used statistical methods for medical fields, such as survival analysis and methods for repetitive measurements. The results of analysis are interactively displayed by 3D glyph-based visualization with uncertainty.

Comparison of missing data methods in clustered survival data using Bayesian adaptive B-Spline estimation

Yoo, Hanna;Lee, Jae Won
- Communications for Statistical Applications and Methods / v.25 no.2 / pp.159-172 / 2018
In many epidemiological studies, missing values in the outcome arise due to censoring. Such censoring is what makes survival analysis special and distinguishes it from other analytical methods. There are many methods that deal with censored data in survival analysis. However, few studies have dealt with missing covariates in survival data, and such studies are even rarer when the data are clustered. In this paper, we conducted a simulation study to compare the results of several missing data methods when the data had a clustered, multi-structured form with missing covariates. We modeled the unknown baseline hazard and frailty with a Bayesian B-spline to obtain smoother and more accurate estimates, and we used prior information to achieve more accurate results. We assumed the missing mechanism to be MAR. We compared the performance of five different missing data techniques through simulation studies. We also presented results from a multi-center study of Korean IBD patients with Crohn's disease (Lee et al., Journal of the Korean Society of Coloproctology, 28, 188-194, 2012).
https://doi.org/10.29220/CSAM.2018.25.2.159

A Study on the Construction of Database, Online Management System, and Analysis Instrument for Biological Diversity Data (생물다양성 자료의 데이터베이스화와 온라인 관리시스템 및 분석도구 구축에 관한 연구)

Bec Kee-Yul;Jung Jong-Chul;Park Seon-Joo;Lee Jong-Wook
- Journal of Environmental Science International / v.14 no.12 / pp.1119-1127 / 2005
The management of data on biological diversity is presently complex and confusing. This study was initiated to construct a database so that such data could be stored in a data management and analysis instrument, correcting the problems inherent in the current incoherent storage methods. MySQL was used as the DBMS (database management system), and the program was written mainly in Java. The program was also designed to adapt to constantly changing requirements, which we hope to have achieved by using programming techniques and patterns that are quick and easy to modify. To this end, an effective and flexible database schema was devised to store and analyze diversity databases. Even users with no knowledge of databases should be able to access this management instrument and easily manage the database through the World Wide Web. Databases stored in this manner could then be routinely analyzed with the analysis instrument supplied on the World Wide Web. Because the derived results are presented in simple tables and charts, researchers could easily adapt these methods to various data analyses. Because the diversity data are stored in a database rather than in ordinary files, precise, error-free, and high-quality storage in a consistent manner becomes possible. The methods proposed here should also minimize the errors that might appear in each data search, data movement, or data conversion by supplying management instrumentation on the Web. This study also aimed to derive results at the required level and to carry out comparative analysis without the lengthy time otherwise needed to obtain similar results from other methods of analysis. The results of this research may be summarized as follows: 1) This study suggests methods of storage that give consistency to diversity data. 2) This study prepared a suggested foundation for comparative analysis of various data. 3) It may suggest further research, which could lead to more and better standardization of diversity data and to better methods for predicting changes in species diversity.
https://doi.org/10.5322/JES.2005.14.12.1119

Big data platform for health monitoring systems of multiple bridges

Wang, Manya;Ding, Youliang;Wan, Chunfeng;Zhao, Hanwei
- Structural Monitoring and Maintenance / v.7 no.4 / pp.345-365 / 2020
At present, many machine learning and data mining methods are used for analyzing and predicting structural response characteristics. However, a platform that combines big data analysis methods with online and offline analysis modules has not yet been used in actual projects. This work is dedicated to developing a multifunctional Hadoop-Spark big data platform for bridges that monitors and evaluates serviceability based on a structural health monitoring system. It realizes rapid processing, analysis, and storage of collected health monitoring data. The platform contains offline computing and online analysis modules in a Hadoop-Spark environment. Hadoop provides the overall framework and storage subsystem for the big data platform, while Spark is used for online computing. Finally, the computational performance of the Hadoop-Spark big data platform is verified through several actual analysis tasks. Experiments show that the platform has good fault tolerance, scalability, and online analysis performance. It can meet the daily analysis requirements of 5s/time for one bridge and 40s/time for 100 bridges.
https://doi.org/10.12989/smm.2020.7.4.345
