• Title/Summary/Keyword: Data Heterogeneity

Text Classification with Heterogeneous Data Using Multiple Self-Training Classifiers

  • William Xiu Shun Wong;Donghoon Lee;Namgyu Kim
    • Asia pacific journal of information systems / v.29 no.4 / pp.789-816 / 2019
  • Text classification is a challenging task, especially when dealing with a huge amount of text data. The performance of a classification model can vary depending on the types of words contained in the document corpus and the types of features generated for classification. Rather than proposing a modified version of an existing algorithm or creating a new one, we modify how the data are used. Classifier performance is usually affected by the quality of the training data, because the classifier is built from these data. We assume that data from different domains might have different noise characteristics, which can be exploited when learning the classifier. Therefore, we attempt to make the classifier more robust, and thereby improve classification accuracy, by artificially injecting heterogeneous data into the learning process. A semi-supervised approach was applied to use the heterogeneous data in learning the document classifier. However, unlabeled data might degrade the performance of the document classifier. Therefore, we further propose an algorithm that extracts only the documents that contribute to improving the classifier's accuracy.
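
The semi-supervised idea in this abstract can be illustrated with a generic self-training loop. The sketch below is not the authors' document-selection algorithm: it assumes a small multinomial naive Bayes model and a plain confidence threshold, and the names `train_nb`, `predict_proba`, `self_train`, and `threshold` are hypothetical.

```python
import math
from collections import Counter


def train_nb(docs, labels):
    """Fit a multinomial naive Bayes model (word counts per class)."""
    classes = sorted(set(labels))
    word_counts = {c: Counter() for c in classes}
    class_counts = Counter(labels)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())
    vocab = {w for c in classes for w in word_counts[c]}
    return classes, word_counts, class_counts, vocab


def predict_proba(model, doc):
    """Return class probabilities for one document (add-one smoothing)."""
    classes, word_counts, class_counts, vocab = model
    total_docs = sum(class_counts.values())
    log_scores = {}
    for c in classes:
        total_words = sum(word_counts[c].values())
        score = math.log(class_counts[c] / total_docs)
        for w in doc.split():
            if w in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[c][w] + 1) / (total_words + len(vocab)))
        log_scores[c] = score
    top = max(log_scores.values())
    weights = {c: math.exp(s - top) for c, s in log_scores.items()}
    z = sum(weights.values())
    return {c: w / z for c, w in weights.items()}


def self_train(labeled_docs, labels, unlabeled_docs, threshold=0.8, max_rounds=5):
    """Repeatedly pseudo-label confident unlabeled documents and retrain."""
    docs, labs = list(labeled_docs), list(labels)
    pool = list(unlabeled_docs)
    model = train_nb(docs, labs)
    for _ in range(max_rounds):
        confident, remaining = [], []
        for doc in pool:
            proba = predict_proba(model, doc)
            best = max(proba, key=proba.get)
            if proba[best] >= threshold:
                confident.append((doc, best))
            else:
                remaining.append(doc)  # too uncertain: keep it out of training
        if not confident:
            break
        for doc, label in confident:
            docs.append(doc)
            labs.append(label)
        pool = remaining
        model = train_nb(docs, labs)
    return model, pool
```

Documents that never reach the threshold stay in the returned pool, which is one simple way to keep unhelpful unlabeled data from degrading the classifier.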

Behavior recognition system based fog cloud computing

  • Lee, Seok-Woo;Lee, Jong-Yong;Jung, Kye-Dong
    • International journal of advanced smart convergence / v.6 no.3 / pp.29-37 / 2017
  • Current behavior recognition systems do not reconcile the data formats of sensor data measured by users' sensor modules or devices. Data processing, sharing, and collaboration services between users and the behavior recognition system are therefore needed to process large volumes of sensor data in differing formats. Real-time interaction between users and the behavior recognition system is also necessary. To solve this problem, we propose a fog cloud based behavior recognition system for processing human body sensor data. The fog cloud based behavior recognition system standardizes data formats in a DbaaS (Database as a System) cloud, served through the fog cloud, to resolve the heterogeneity of sensor data measured by users' sensor modules or devices. In addition, placing the fog cloud between users and the cloud increases the proximity between users and servers, allowing real-time interaction. On this basis, we propose a behavior recognition system that recognizes users' behavior and provides services to observers in a collaborative environment. The proposed system solves the problems of server overload caused by large sensor data and of the lack of real-time interaction caused by the distance between users and servers. We show the process of delivering behavior recognition services that are consistent and capable of real-time interaction.

A Study of Data Collection Method for Efficient Sharing in IoT Environment (사물인터넷(IoT) 환경에서 효율적 공유를 위한 데이터 수집 기법에 대한 연구)

  • Hwang, Chi-Gon;Yoon, Chang-Pyo
    • Proceedings of the Korean Institute of Information and Communication Sciences Conference / 2015.10a / pp.268-269 / 2015
  • The current Internet environment is accessed not only through computers but is also shifting to the IoT (Internet of Things), and the resulting data become large. If the data are provided to an application without any adjustment, the application can hardly deliver its intended performance. In this paper, we propose a method that filters and refines the collected data using MapReduce, a big data processing technique. We address the heterogeneity of sensor-generated data by adding a knowledge identification step to MapReduce, and we use XMDR for this purpose.
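
A map step with an added identification stage might look roughly like the sketch below. It is an in-memory illustration only, not the authors' implementation: the `REGISTRY` dictionary is a hypothetical stand-in for an XMDR metadata registry, and the field names and unit conversions are invented for the example.

```python
from collections import defaultdict

# Hypothetical registry standing in for XMDR: maps each device-specific
# field name to a canonical name and a conversion into the canonical unit.
REGISTRY = {
    "temp_c": ("temperature_c", lambda v: v),
    "tempF": ("temperature_c", lambda v: (v - 32) * 5 / 9),
    "hum": ("humidity_pct", lambda v: v),
    "humidity_ratio": ("humidity_pct", lambda v: v * 100),
}


def map_record(record):
    """Map step with identification: emit (canonical_field, value) pairs."""
    for field, value in record.items():
        if field in REGISTRY:  # fields unknown to the registry are filtered out
            name, convert = REGISTRY[field]
            yield name, convert(value)


def reduce_values(pairs):
    """Reduce step: average the values collected for each canonical field."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: sum(vals) / len(vals) for key, vals in groups.items()}


def run(records):
    """Run the map and reduce steps over a list of sensor records."""
    pairs = [pair for record in records for pair in map_record(record)]
    return reduce_values(pairs)
```

Resolving field names and units inside the map step means the reducer only ever sees one representation per quantity, whichever device produced the reading.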

Bayesian Conway-Maxwell-Poisson (CMP) regression for longitudinal count data

  • Morshed Alam;Yeongjin Gwon;Jane Meza
    • Communications for Statistical Applications and Methods / v.30 no.3 / pp.291-309 / 2023
  • Longitudinal count data have been widely collected in biomedical research, public health, and clinical trials. Analyses of these repeated measurements on the same subjects need to account for the dependency among them. The Poisson regression model is the first choice for modeling the expected count of interest; however, it may not be appropriate when the data exhibit over-dispersion or under-dispersion. Recently, the Conway-Maxwell-Poisson (CMP) distribution has become popular because it offers the flexibility to capture a wide range of dispersion in the data. In this article, we propose a Bayesian CMP regression model to accommodate over- and under-dispersion in modeling longitudinal count data. Specifically, we develop a regression model with a random intercept and slope to capture subject heterogeneity and to allow covariate effects to differ across subjects. We implement Bayesian computation via a Hamiltonian MCMC (HMCMC) algorithm for posterior sampling. We then compute Bayesian model assessment measures for model comparison. Simulation studies are conducted to assess the accuracy and effectiveness of our methodology. The usefulness of the proposed methodology is demonstrated with a well-known epilepsy data set.
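
The dispersion behavior of the CMP distribution can be checked numerically. The sketch below assumes the standard parameterization, in which P(Y = y) is proportional to lam**y / (y!)**nu, and truncates the infinite normalizing sum at a hypothetical `max_count`. It covers only the distribution itself, not the random effects or the Bayesian sampler described in the abstract.

```python
import math


def cmp_pmf(lam, nu, max_count=200):
    """CMP probabilities for y = 0..max_count (normalizing sum truncated)."""
    # Work on the log scale: y*log(lam) - nu*log(y!) avoids overflow.
    log_w = [y * math.log(lam) - nu * math.lgamma(y + 1) for y in range(max_count + 1)]
    top = max(log_w)
    w = [math.exp(v - top) for v in log_w]
    z = sum(w)
    return [v / z for v in w]


def mean_and_variance(pmf):
    """Mean and variance of a distribution given as a list of probabilities."""
    mean = sum(y * p for y, p in enumerate(pmf))
    var = sum((y - mean) ** 2 * p for y, p in enumerate(pmf))
    return mean, var


# nu = 1 recovers the Poisson distribution; nu > 1 gives under-dispersion
# and nu < 1 gives over-dispersion relative to the mean.
for nu in (0.5, 1.0, 2.0):
    mean, var = mean_and_variance(cmp_pmf(3.0, nu))
    print(f"nu={nu}: mean={mean:.3f}, variance={var:.3f}")
```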

Child Care Service Quality Management Through the Evaluation of Efficiency at Child Care Centers: An Evaluation with Data Envelopment Analysis

  • Song, Seung-Min
    • International Journal of Quality Innovation / v.9 no.2 / pp.1-9 / 2008
  • This paper proposes a scheme to estimate the technical efficiency of child care centers for infants under three years old using Data Envelopment Analysis (DEA), and to manage the quality of care service by implementing a flexible and efficient government subsidy system. The technical efficiency estimates show that there is heterogeneity in technical efficiency and a substantial opportunity for improvement in technical efficiency across child care centers. This result implies that the government may foster competition by differentiating subsidies according to efficiency, and may redirect money that has been used inefficiently to other purposes. Both can improve the quality of child care service.

On the Negative Estimates of Direct and Maternal Genetic Correlation - A Review

  • Lee, C.
    • Asian-Australasian Journal of Animal Sciences / v.15 no.8 / pp.1222-1226 / 2002
  • Estimates of the genetic correlation between direct and maternal effects for weaning weight of beef cattle are often negative in field data. Whether this genetic antagonism exists biologically has been a point of contention. Some researchers consider such negative estimates to be an artifact of poor modeling. This article reviews recent studies on sources affecting the genetic correlation estimates. They focus on heterogeneity of the correlation by sex, selection bias caused by selective reporting, selection bias caused by splitting data by sex, sire by year interaction variance, and sire misidentification and inbreeding depression as factors contributing to sire by year interaction variance. A biological justification of the genetic antagonism is also discussed. It is proposed that the direct-maternal genetic covariance be included in the analytical models.

Sr-Nd-Pb Isotopic Study of the Ogcheon Amphibolites (옥천 각섬암의 Sr-Nd-Pb 동위원소 연구)

  • Lee, Kwang-Sik;Chang, Ho-Wan
    • Economic and Environmental Geology / v.29 no.1 / pp.35-43 / 1996
  • Sr-Nd-Pb isotopic results are reported for the Ogcheon amphibolites from the central part of the Ogcheon Belt. The Rb-Sr and Pb-Pb whole rock isotope data are widely scattered in the isochron diagrams because of later alteration or metamorphism, whereas the Sm-Nd whole rock isotope data define a linear array with an age of $1270 \pm 220$ Ma ($1\sigma$). Considering several geochemical features of the amphibolites, the 1270 Ma linear array may not be a true isochron but an apparent mixing isochron due to source heterogeneity.

Meta Analysis of Usability Experimental Research Using New Bi-Clustering Algorithm

  • Kim, Kyung-A;Hwang, Won-Il
    • The Korean Journal of Applied Statistics / v.21 no.6 / pp.1007-1014 / 2008
  • Usability evaluation (UE) experiments are conducted to provide UE practitioners with guidelines for better outcomes. In UE research, significant quantities of empirical results have accumulated over the past decades. Although these results were expected to be integrated into generalized guidelines, traditional meta-analysis is limited in its ability to combine UE empirical results, which often show considerable heterogeneity. In this study, a new data mining method called weighted bi-clustering (WBC) is proposed to partition heterogeneous studies into homogeneous subsets. We applied WBC to UE empirical results and identified two homogeneous subsets, each of which can be meta-analyzed. In addition, interactions between experimental conditions and UE methods were hypothesized based on the resulting partition, and some interactions were confirmed via statistical tests.

Autoregressive Cholesky Factor Modeling for Marginalized Random Effects Models

  • Lee, Keunbaik;Sung, Sunah
    • Communications for Statistical Applications and Methods / v.21 no.2 / pp.169-181 / 2014
  • Marginalized random effects models (MREM) are commonly used to analyze longitudinal categorical data when the population-averaged effects are of interest. In these models, random effects are used to explain both subject and time variations. Estimating the random effects covariance matrix is not simple in MREM because of its high dimension and the positive definiteness constraint. A relatively simple correlation structure, such as a homogeneous AR(1) structure, is often assumed; however, this is too strong an assumption, and as a consequence the estimates of the fixed effects can be biased. To avoid this problem, we introduce an approach that models a heterogeneous random effects covariance matrix using a modified Cholesky decomposition. The approach yields parameters that can be easily modeled without concern that the resulting estimator will fail to be positive definite. The interpretation of the parameters is sensible. We analyze metabolic syndrome data from a Korean Genomic Epidemiology Study using this method.
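
The positive-definiteness property mentioned in this abstract can be seen in a small sketch of the modified Cholesky decomposition, under which a covariance matrix is written as Sigma = L D L^T with L unit lower triangular and D diagonal. The code below is a generic illustration, not the authors' model: `phi` holds unconstrained autoregressive coefficients and `innovation_var` holds positive innovation variances, and both names are assumptions made for the example.

```python
def covariance_from_cholesky(phi, innovation_var):
    """Build Sigma = L D L^T from autoregressive coefficients and variances.

    phi[t][k] (k < t) is the coefficient of y_k when y_t is regressed on
    its predecessors; innovation_var[t] > 0 is the residual variance.
    """
    n = len(innovation_var)
    # L is the inverse of (I - Phi); it is unit lower triangular and is
    # filled row by row using forward substitution.
    L = [[0.0] * n for _ in range(n)]
    for t in range(n):
        L[t][t] = 1.0
        for j in range(t):
            L[t][j] = sum(phi[t][k] * L[k][j] for k in range(j, t))
    # Sigma = L D L^T, with D = diag(innovation_var).
    return [
        [sum(L[i][k] * innovation_var[k] * L[j][k] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


# AR(1)-type dependence with heterogeneous innovation variances.
phi = [[], [0.5], [0.0, 0.5]]
sigma = covariance_from_cholesky(phi, [1.0, 2.0, 4.0])
for row in sigma:
    print(row)
```

Because any real-valued coefficients combined with any positive variances give a valid covariance matrix, the two sets of parameters can be modeled separately without an explicit positive-definiteness constraint.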

Determinants of Technological Innovation and Spillover Effects: Using Count Data Model (국내 제조업 기업의 기술혁신 요인 및 기술파급효과 분석: 가산자료 모형을 이용하여)

  • Jang, Jeong-In;Yu, Seung-Hun;Gwak, Seung-Jun
    • Journal of Technology Innovation / v.14 no.3 / pp.23-42 / 2006
  • This study investigates the determinants of the output of a manufacturing firm's innovative activity (the number of patent applications) and spillover effects using a count data model. A negative binomial model is used in order to take unobserved heterogeneity into account. The results suggest that firm size, R&D intensity, technical network activity, and online business performance have significantly positive effects in the Korean manufacturing industry. Moreover, there are significantly positive spillover effects within the same industry sector.
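
Why a negative binomial model suits counts with unobserved heterogeneity can be shown with a short comparison against the Poisson model. The sketch assumes the common NB2 parameterization, in which the variance is mu + alpha * mu**2. The toy counts and the value of `alpha` are made up for illustration and are not the study's patent data.

```python
import math


def poisson_log_pmf(y, mu):
    """Log-probability of count y under a Poisson model with mean mu."""
    return y * math.log(mu) - mu - math.lgamma(y + 1)


def negbin_log_pmf(y, mu, alpha):
    """Log-probability of count y under NB2 with mean mu and dispersion alpha."""
    r = 1.0 / alpha  # shape of the gamma mixing distribution
    return (
        math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
        + r * math.log(r / (r + mu))
        + y * math.log(mu / (r + mu))
    )


def log_likelihood(counts, mu, alpha=None):
    """Total log-likelihood; Poisson if alpha is None, NB2 otherwise."""
    if alpha is None:
        return sum(poisson_log_pmf(y, mu) for y in counts)
    return sum(negbin_log_pmf(y, mu, alpha) for y in counts)


# Toy over-dispersed counts: many zeros and a few large values.
counts = [0, 0, 0, 1, 1, 2, 3, 8, 15]
mu = sum(counts) / len(counts)
print("Poisson log-likelihood:", log_likelihood(counts, mu))
print("NB2 log-likelihood:    ", log_likelihood(counts, mu, alpha=1.5))
```

The extra dispersion parameter lets the variance exceed the mean, which a Poisson model cannot accommodate.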
