• Title/Summary/Keyword: Missing data theory


Robust Speech Recognition Using Missing Data Theory (손실 데이터 이론을 이용한 강인한 음성 인식)

  • 김락용;조훈영;오영환
    • The Journal of the Acoustical Society of Korea / v.20 no.3 / pp.56-62 / 2001
  • In this paper, we apply missing data theory to speech recognition in order to maintain high recognizer performance when portions of the data are missing. The hidden Markov model (HMM) is generally used as the stochastic classifier for speech recognition, and in the continuous density HMM (CDHMM) acoustic events are represented by continuous probability density functions. Missing data theory has the advantage of being easily applicable to the CDHMM. We use the marginalization method to process missing data, because it has low complexity and is easy to apply to automatic speech recognition (ASR), and spectral subtraction to detect missing data: if the difference between the speech energy and the background-noise energy falls below a given threshold, the data are declared missing. We further propose a new method that examines the reliability of the detected missing data using the voicing probability, which identifies voiced frames; it is used to process missing data in voiced regions, which carry more redundant information than consonants. The experimental results showed that our method outperforms a baseline system that uses spectral subtraction alone. In a 452-word isolated word recognition experiment, the proposed method using the voicing probability reduced the average word error rate by 12% in a typical noise situation.

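As a concrete illustration of the two ingredients this abstract describes, the sketch below (not the authors' code; the threshold and feature shapes are illustrative) marks spectral bins as reliable via spectral subtraction and computes a diagonal-Gaussian state likelihood marginalized over the unreliable dimensions:

```python
import numpy as np

def detect_reliable(noisy_energy, noise_energy, threshold):
    """Spectral-subtraction detection: a spectral bin is reliable when the
    noisy energy exceeds the noise estimate by more than the threshold."""
    return (noisy_energy - noise_energy) > threshold   # True = reliable

def marginal_log_likelihood(x, mean, var, reliable):
    """Diagonal-Gaussian state log-likelihood with the unreliable (missing)
    dimensions integrated out, i.e. the marginalization method."""
    d = x[reliable] - mean[reliable]
    return -0.5 * np.sum(np.log(2.0 * np.pi * var[reliable]) + d**2 / var[reliable])

# Illustrative use on one frame of a filter-bank feature vector:
rng = np.random.default_rng(0)
frame, noise = rng.random(24) + 0.5, np.full(24, 0.4)
mask = detect_reliable(frame, noise, threshold=0.2)
print(marginal_log_likelihood(frame, np.full(24, 1.0), np.full(24, 0.1), mask))
```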

A Sparse Data Preprocessing Using Support Vector Regression (Support Vector Regression을 이용한 희소 데이터의 전처리)

  • Jun, Sung-Hae;Park, Jung-Eun;Oh, Kyung-Whan
    • Journal of the Korean Institute of Intelligent Systems / v.14 no.6 / pp.789-792 / 2004
  • In various fields such as web mining, bioinformatics, and statistical data analysis, missing values of many kinds are found, and they make training data sparse. Most often, missing values are replaced by values predicted from the mean or mode; more advanced imputation methods, such as the conditional mean, tree-based methods, and the Markov chain Monte Carlo algorithm, can also be used. However, general imputation models share the property that their predictive accuracy decreases as the ratio of missing values in the training data increases, and the number of usable imputations shrinks as the missing ratio grows. To address this problem, we propose a preprocessing method for missing values based on statistical learning theory, namely Vapnik's support vector regression. The proposed method can be applied to sparse training data. We verified the performance of our model using data sets from the UCI Machine Learning Repository.
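
A minimal sketch of SVR-based imputation in the spirit of the abstract, using scikit-learn's SVR; the mean pre-fill of the predictor columns is an assumption of this sketch, not a detail from the paper:

```python
import numpy as np
from sklearn.svm import SVR

def svr_impute(X):
    """Impute each column's missing entries by regressing that column on
    the others with support vector regression."""
    X = X.astype(float)
    filled = np.where(np.isnan(X), np.nanmean(X, axis=0), X)  # crude pre-fill
    for j in range(X.shape[1]):
        miss = np.isnan(X[:, j])
        if not miss.any() or miss.all():
            continue
        others = np.delete(filled, j, axis=1)                 # predictor columns
        model = SVR(kernel="rbf").fit(others[~miss], X[~miss, j])
        X[miss, j] = model.predict(others[miss])
    return X
```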

A Modified Grey-Based k-NN Approach for Treatment of Missing Value

  • Chun, Young-M.;Lee, Joon-W.;Chung, Sung-S.
    • Journal of the Korean Data and Information Science Society / v.17 no.2 / pp.421-436 / 2006
  • In 2004, Huang proposed a grey-based nearest neighbor approach for accurately predicting missing attribute values. Our study proposes a way to decide the number of nearest neighbors using not only Deng's grey relational grade (GRG) but also Wen's, and it uses a weighted mean rather than an arithmetic (unweighted) one, with the GRG serving as the weight when imputing missing values. This yields four methods: DU, DW, WU, and WW. The WW method (Wen's GRG with a weighted mean) performs best of the four. Huang had shown his method to be much better than mean imputation and multiple imputation; the performance of our method is in turn far superior to Huang's.

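The grey-based k-NN idea can be sketched as follows, assuming Deng's grey relational grade with the customary distinguishing coefficient zeta = 0.5; the function names and the choice k = 5 are illustrative, not taken from the paper:

```python
import numpy as np

def deng_grg(ref, candidates, zeta=0.5):
    """Deng's grey relational grade of each candidate row against `ref`."""
    delta = np.abs(candidates - ref)             # |x0(k) - xi(k)|
    dmin, dmax = delta.min(), delta.max()
    coeff = (dmin + zeta * dmax) / (delta + zeta * dmax)
    return coeff.mean(axis=1)                    # one grade per candidate

def grey_knn_impute(record, complete_rows, target_col, k=5):
    """Impute record[target_col] as the GRG-weighted mean over the k
    most grey-related complete rows (the weighted-mean variant)."""
    obs = [c for c in range(complete_rows.shape[1])
           if c != target_col and not np.isnan(record[c])]
    grg = deng_grg(record[obs], complete_rows[:, obs])
    top = np.argsort(grg)[-k:]                   # k most related neighbors
    w = grg[top] / grg[top].sum()
    return float(np.dot(w, complete_rows[top, target_col]))
```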

Compressive sensing-based two-dimensional scattering-center extraction for incomplete RCS data

  • Bae, Ji-Hoon;Kim, Kyung-Tae
    • ETRI Journal / v.42 no.6 / pp.815-826 / 2020
  • We propose a two-dimensional (2D) scattering-center-extraction (SCE) method using sparse recovery based on compressive-sensing theory, applicable even when data are missing from the received radar cross-section (RCS) dataset. First, the proposed method generates a 2D grid via adaptive discretization that is considerably smaller than a fully sampled fine grid. Subsequently, a coarse estimation of the 2D scattering centers is performed using the method of iteratively reweighted least squares together with a general peak-finding algorithm. Finally, a fine estimation of the 2D scattering centers is performed using the orthogonal matching pursuit (OMP) procedure over an adaptively sampled Fourier dictionary. Measured RCS data, as well as simulation data from the point-scatterer model, are used to evaluate the 2D SCE accuracy of the proposed method. The results indicate that the proposed method achieves higher SCE accuracy on an incomplete RCS dataset with missing data than the conventional OMP, basis pursuit, smoothed L0, and existing discrete spectral estimation techniques.
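
For reference, a generic orthogonal matching pursuit routine is sketched below; the paper's adaptive Fourier dictionary and two-stage coarse/fine grids are not reproduced here:

```python
import numpy as np

def omp(A, y, sparsity):
    """Recover a `sparsity`-sparse x with y ~ A @ x
    (columns of the dictionary A assumed unit-norm)."""
    residual, support = y.copy(), []
    for _ in range(sparsity):
        corr = np.abs(A.conj().T @ residual)    # match atoms to the residual
        support.append(int(np.argmax(corr)))
        sub = A[:, support]
        coef, *_ = np.linalg.lstsq(sub, y, rcond=None)  # LS on the support
        residual = y - sub @ coef
    x = np.zeros(A.shape[1], dtype=complex)
    x[support] = coef
    return x
```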

A Study on the Treatment of Missing Value using Grey Relational Grade and k-NN Approach

  • Chun, Young-Min;Chung, Sung-Suk
    • Proceedings of the Korean Data and Information Science Society Conference / 2006.04a / pp.55-62 / 2006
  • In 2004, Huang proposed a grey-based nearest neighbor approach for accurately predicting missing attribute values. Our study proposes a way to decide the number of nearest neighbors using not only Deng's grey relational grade (GRG) but also Wen's, and it uses a weighted mean rather than an arithmetic (unweighted) one, with the GRG serving as the weight when imputing missing values. This yields four methods: DU, DW, WU, and WW. The WW method (Wen's GRG with a weighted mean) performs best of the four. Huang had shown his method to be much better than mean imputation and multiple imputation; the performance of our method is in turn far superior to Huang's.

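The DU/DW/WU/WW naming can be made concrete with a small sketch of the final imputation step, where `grades` may come from either Deng's or Wen's GRG (both assumed precomputed; see the GRG sketch above):

```python
import numpy as np

def impute_from_neighbors(values, grades, weighted=True):
    """Combine the k neighbors' attribute values: GRG-weighted mean for
    the "...W" variants (DW, WW), plain mean for "...U" (DU, WU)."""
    if weighted:
        w = grades / grades.sum()
        return float(np.dot(w, values))
    return float(values.mean())
```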

A Clustering Algorithm for Handling Missing Data (손실 데이터를 처리하기 위한 집락분석 알고리즘)

  • Lee, Jong Chan
    • Journal of the Korea Convergence Society / v.8 no.11 / pp.103-108 / 2017
  • In the ubiquitous environment, transmitting data from many sensors over long distances poses problems: when integrating data arriving from different locations, records may have inconsistent attribute values or partial losses, and such data must still be processed. This paper presents a method for analyzing such data. Its core is to define an objective function suited to the problem and to develop an algorithm that can optimize it. The objective function is obtained by modifying the OCS function. Mean field annealing (MFA), which could previously handle only binary data, is extended to fields with continuous values; the extension, called CMFA, is used as the optimization algorithm.
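
The paper's CMFA optimizer is not reproduced here; as a simpler, hypothetical stand-in, the sketch below shows the common "partial distance" device for clustering records with missing values, which rescales distances computed over the observed dimensions only:

```python
import numpy as np

def partial_distance(x, center):
    """Squared distance over observed dimensions, rescaled to the full
    dimensionality so records with more missing values are comparable."""
    obs = ~np.isnan(x)
    d2 = np.sum((x[obs] - center[obs]) ** 2)
    return d2 * x.size / max(obs.sum(), 1)

def assign_clusters(X, centers):
    """Assign each (possibly incomplete) row to its nearest center."""
    return np.array([np.argmin([partial_distance(x, c) for c in centers])
                     for x in X])
```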

A Real Time Traffic Flow Model Based on Deep Learning

  • Zhang, Shuai;Pei, Cai Y.;Liu, Wen Y.
    • KSII Transactions on Internet and Information Systems (TIIS) / v.16 no.8 / pp.2473-2489 / 2022
  • Urban development has brought increasing saturation of urban traffic demand, and congestion has become the primary problem in transportation: roads are in a state of queueing or outright congestion, which seriously reduces people's willingness to travel and the efficiency of their trips. This paper studies a discrete-domain path planning method based on flow data. Taking traffic flow data over a highway network structure as the research object, it uses deep learning to determine path weights, optimizes the path planning algorithm, realizes a vehicle path planning application for expressways, and has been deployed at a highway company. A path topology is constructed to transform actual road information into an abstract space the machine can understand: an appropriate data structure is used for storage, and a path topology based on the expressway modeling background realizes the mapping between the two. Experiments show that the proposed method further reduces the interpolation error, and that the error under random missingness is smaller than under the other two missing modes. To improve the real-time performance of vehicle path planning, associated features are selected, path weights are computed comprehensively, and the structure of the traditional path planning algorithm is optimized. This is of great significance for the sustainable development of cities.
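
Once the deep model has produced edge weights, the planning step reduces to a shortest-path search; a minimal Dijkstra sketch over a weighted road topology follows (the adjacency-map encoding and the toy graph are assumptions of this sketch):

```python
import heapq

def shortest_path(graph, src, dst):
    """Dijkstra over an adjacency map {node: [(neighbor, weight), ...]}."""
    dist, prev = {src: 0.0}, {}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            break
        if d > dist.get(u, float("inf")):
            continue                        # stale heap entry
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    if dst not in dist:
        return None, float("inf")           # unreachable
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1], dist[dst]

# Illustrative topology; weights stand in for learned travel costs.
road = {"A": [("B", 2.0), ("C", 5.0)], "B": [("C", 1.5)], "C": []}
print(shortest_path(road, "A", "C"))        # (['A', 'B', 'C'], 3.5)
```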

Sample size calculation for comparing time-averaged responses in K-group repeated binary outcomes

  • Wang, Jijia;Zhang, Song;Ahn, Chul
    • Communications for Statistical Applications and Methods / v.25 no.3 / pp.321-328 / 2018
  • In clinical trials with repeated measurements, the time-averaged difference (TAD) may provide a more powerful evaluation of treatment efficacy than the rate of change over time when the treatment effect has rapid onset and repeated measurements continue across an extended period after the maximum effect is achieved (Overall and Doyle, Controlled Clinical Trials, 15, 100-123, 1994). The sample size formula for evaluating the TAD in two treatment groups has been investigated by many researchers. For the evaluation of the TAD in multi-arm trials, Zhang and Ahn (Computational Statistics & Data Analysis, 58, 283-291, 2013) and Lou et al. (Communications in Statistics-Theory and Methods, 46, 11204-11213, 2017b) developed sample size formulas for continuous outcomes and count outcomes, respectively. In this paper, we derive a sample size formula to evaluate the TAD of repeated binary outcomes in multi-arm trials using the generalized estimating equation approach. The proposed formula accounts for the various correlation structures and missing patterns (including a mixture of independent missing and monotone missing patterns) that practitioners frequently encounter in clinical trials. We conduct simulation studies to assess the performance of the proposed sample size formula under a wide range of design parameters. The results show that the empirical powers and the empirical Type I errors are close to nominal levels. We illustrate our proposed method using a clinical trial example.
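
For orientation, the classical two-group, continuous-outcome analogue of such formulas under compound-symmetry correlation is shown below; it is not the paper's K-group binary-outcome result, only the general shape these GEE-based sample-size formulas take:

```latex
% n subjects per group to detect a time-averaged difference d over m repeated
% measures, with outcome variance sigma^2, within-subject correlation rho,
% two-sided level alpha, and power 1 - beta:
n \;=\; \frac{2\,(z_{1-\alpha/2} + z_{1-\beta})^{2}\,\sigma^{2}\,
        \bigl[\,1 + (m-1)\rho\,\bigr]}{m\,d^{2}}
```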

Data Cleaning and Integration of Multi-year Dietary Survey in the Korea National Health and Nutrition Examination Survey (KNHANES) using Database Normalization Theory (데이터베이스 정규화 이론을 이용한 국민건강영양조사 중 다년도 식이조사 자료 정제 및 통합)

  • Kwon, Namji;Suh, Jihye;Lee, Hunjoo
    • Journal of Environmental Health Sciences / v.43 no.4 / pp.298-306 / 2017
  • Objectives: Since 1998, the Korea National Health and Nutrition Examination Survey (KNHANES) has been conducted to investigate the health and nutritional status of Koreans. The individual food intake data in KNHANES are also used as a source dataset for risk assessment of chemicals ingested via food. To improve the reliability of intake estimation and to avoid missing data for less frequently reported foods, a well-structured integration of the long-running datasets is essential. However, merging the multi-year survey datasets is difficult because there is no effective cleaning process for handling the extensive code lists for each food item, which also change as dietary habits shift over time. This study therefore aims at 1) cleaning abnormal data, 2) generating an integrated multi-year raw dataset, and 3) contributing to the production of consistent dietary exposure factors. Methods: The codebooks, the guideline book, and raw intake data from KNHANES V and VI were used for analysis. The codebook was tested for violations of the primary key constraint, and the structure of the raw data was tested against the first through third normal forms of relational database theory. Afterwards, the cleaning process was executed on the raw data using the resulting integrated codes. Results: Duplicated key records and abnormalities in the table structures were observed. After adjusting them by the method above, the codes were corrected and new integrated codes were created; we were then able to clean the raw data provided by KNHANES respondents. Conclusion: The results of this study will contribute to the integration of the multi-year datasets and help improve the data production system by clarifying, testing, and verifying the primary key, code integrity, and primitive data structure according to database normalization theory in national health data.
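
The primary-key testing the Methods describe can be illustrated with a short pandas check; the column names below are hypothetical, not the survey's actual variable names:

```python
import pandas as pd

def pk_violations(df, key_cols):
    """Rows violating a candidate primary key: NULL in a key column,
    or duplicated key values (a first screening toward normalization)."""
    bad = (df[key_cols].isna().any(axis=1)
           | df.duplicated(subset=key_cols, keep=False))
    return df[bad]

# Hypothetical food-code table in the KNHANES style:
codebook = pd.DataFrame({"food_code":  ["A001", "A001", "B002", None],
                         "survey_year": [2013, 2013, 2013, 2014],
                         "food_name":  ["rice", "rice", "kimchi", "apple"]})
print(pk_violations(codebook, ["food_code", "survey_year"]))
```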

A Study on the Incomplete Information Processing System(INiPS) Using Rough Set

  • Jeong, Gu-Beom;Chung, Hwan-Mook;Kim, Guk-Boh;Park, Kyung-Ok
    • Proceedings of the Korean Institute of Intelligent Systems Conference / 2000.11a / pp.243-251 / 2000
  • In general, rough set theory is used for classification, inference, and decision analysis of incomplete data by applying approximation-space concepts to an information system. An information system can include quantitative attribute values with interval characteristics, or incomplete data such as multiple or unknown (missing) values. Such incomplete data cause inconsistency in the information system and reduce the classification ability of systems that use rough sets. In this paper, we survey the various types of incomplete data that may occur in an information system and propose the INcomplete information Processing System (INiPS), which converts an incomplete information system into a complete one using rough sets.

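The rough-set machinery the abstract relies on can be sketched with the common tolerance-relation treatment of unknown values (an unknown matches anything); the INiPS conversion itself is not reproduced here:

```python
def tolerant(a, b):
    """Two objects are indiscernible if every attribute agrees or is unknown."""
    return all(x is None or y is None or x == y for x, y in zip(a, b))

def approximations(objects, target):
    """Lower/upper approximation (as index sets) of the object set `target`."""
    lower, upper = set(), set()
    for i, obj in enumerate(objects):
        cls = {j for j, other in enumerate(objects) if tolerant(obj, other)}
        if cls <= target:        # tolerance class entirely inside the target
            lower.add(i)
        if cls & target:         # tolerance class overlaps the target
            upper.add(i)
    return lower, upper

# Tiny incomplete information system: None marks a missing attribute value.
objs = [("red", 1), ("red", None), ("blue", 2)]
print(approximations(objs, {0, 1}))
```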