• Title/Summary/Keyword: Same data

Exploratory Data Analysis for microarray experiments with replicates

  • Lee, Eun-Kyung;Yi, Sung-Gon;Park, Tae-Sung
    • Proceedings of the Korean Statistical Society Conference / 2005.11a / pp.37-41 / 2005
  • Exploratory data analysis (EDA) is the initial stage of data analysis and provides a useful overview of the whole microarray experiment. If the experiments are replicated, the analyst should check the quality and reliability of the microarray data within the same experimental condition before deeper statistical analysis. We present EDA methods focusing on the quality and reproducibility of replicates.

A Modelling of Multi-derived Data and Its Retrieval Scheme (복합생성 자료검색의 모형화)

  • Lee, Chun-Yeol
    • Asia Pacific Journal of Information Systems / v.4 no.1 / pp.115-138 / 1994
  • Current database systems are based on the assumption that a datum always denotes the same meaning; in reality, however, violations of this assumption are not unusual. Some data are created in such a way that they represent different sets of attribute values. This research formulates the phenomenon as dissimilarities of derivation rules and defines multi-derived data as data derived by multiple rules. For multi-derived data, it proposes a new retrieval scheme and analyzes its implications for data retrieval.

The diagnosis of Plasma Through RGB Data Using Rough Set Theory

  • Lim, Woo-Yup;Park, Soo-Kyong;Hong, Sang-Jeen
    • Proceedings of the Korean Vacuum Society Conference / 2010.02a / pp.413-413 / 2010
  • In semiconductor manufacturing, all equipment carries various sensors to diagnose the state of processes. To increase diagnostic accuracy, hundreds of sensors are employed. Because the sensors produce millions of data points, diagnosing the process directly from them is unrealistic. Moreover, in some cases data recorded under the same conditions lead to different results. We want to extract information and knowledge from such data. Fault detection and classification (FDC) has recently drawn attention as a way to increase yield. Certain faults and non-faults can be classified by various FDC tools. Uncertainty in semiconductor manufacturing, where non-faulty cases appear among faulty ones and faulty cases among non-faulty ones, has decreased productivity. Given this uncertainty, rough set theory is a viable approach for extracting meaningful knowledge and making predictions. Its reduction of data sets, discovery of hidden data patterns, and generation of decision rules contrast with other approaches such as regression analysis and neural networks. In this research, an RGB sensor was used to diagnose plasma instead of optical emission spectroscopy (OES). RGB data have just three variables (red, green, and blue), while OES data have thousands of variables. RGB data, however, are difficult to analyze by eye: the same values of a variable can correspond to different outcomes. In other words, RGB data contain uncertainty. In this research, decision rules were generated using rough set theory, and from these rules we could find the hidden data patterns within the uncertainty. With rough set theory, the RGB sensor can diagnose changes in plasma condition with over 90% accuracy. Although we present only a preliminary result in this paper, we will continue to develop data mining algorithms that address uncertainty for semiconductor process diagnosis.

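The rough set idea in the abstract above (identical sensor readings that lead to different outcomes) can be illustrated with lower and upper approximations. This is a minimal sketch, not the authors' method: the discretized RGB levels, the decision labels, and the function name `approximations` are all hypothetical.

```python
from collections import defaultdict

def approximations(rows, target):
    """Return (lower, upper) index sets for the rows whose decision is `target`.

    rows: list of (condition_tuple, decision) pairs. Rows with identical
    condition tuples are indiscernible and form one equivalence class.
    """
    classes = defaultdict(list)
    for i, (cond, _) in enumerate(rows):
        classes[cond].append(i)

    lower, upper = set(), set()
    for members in classes.values():
        decisions = {rows[i][1] for i in members}
        if decisions == {target}:   # class lies entirely inside the target set
            lower.update(members)
        if target in decisions:     # class overlaps the target set
            upper.update(members)
    return lower, upper

# Hypothetical discretized (R, G, B) readings with a plasma-state label.
rows = [
    (("high", "low", "low"), "fault"),
    (("high", "low", "low"), "normal"),   # same reading, different outcome
    (("low", "high", "low"), "normal"),
    (("high", "high", "low"), "fault"),
]
lower, upper = approximations(rows, "fault")
boundary = upper - lower  # rows whose class cannot be decided with certainty
```

Rows in the lower approximation support certain decision rules; rows in the boundary region carry the uncertainty the abstract describes.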

Comparison and analysis of peak flow by Areal Reduction Factor (면적감소계수에 따른 첨두유량의 비교연구)

  • Baek, Hyo-Sun;Lee, De-Young;Kang, Young-Buk;Choi, Han-Kuy
    • Proceedings of the Korea Water Resources Association Conference / 2007.05a / pp.1798-1802 / 2007
  • In practice, flood discharge is estimated from the rainfall-runoff relation, for which observation data are easy to collect. The important factors in analyzing the rainfall-runoff relation are rainfall, the runoff coefficient, and the drainage area. Practitioners usually use a probability rainfall computed as a weighted average after each observation station has estimated its own non-same-time probability rainfall. This carries more error than same-time probability rainfall and can lead to overestimation, because it cannot account for the spatial distribution of rainfall. The results of this study showed a pattern similar to existing ARFs, but the range of the coefficient became smaller. Peak flows computed with the ARF and with same-time probability rainfall (group A) did not differ from each other, whereas the peak flow computed with non-same-time probability rainfall (group B) was 25% larger. The difference between group A and group B increased with the lapse of time.

Comparison and analysis of peak flow by Areal Reduction Factor (면적감소계수에 따른 첨두유량의 비교 분석)

  • Lee, Dae-Young;Choi, Han-Kuy
    • Journal of Industrial Technology / v.27 no.A / pp.95-102 / 2007
  • In practice, flood discharge is estimated from the rainfall-runoff relation, for which observation data are easy to collect. The important factors in analyzing the rainfall-runoff relation are rainfall, the runoff coefficient, and the drainage area. Practitioners usually use a probability rainfall computed as a weighted average after each observation station has estimated its own non-same-time probability rainfall. This carries more error than same-time probability rainfall and can lead to overestimation, because it cannot account for the spatial distribution of rainfall. The results of this study showed a pattern similar to existing ARFs, but the range of the coefficient became smaller. Peak flows computed with the ARF and with same-time probability rainfall (group A) did not differ from each other, whereas the peak flow computed with non-same-time probability rainfall (group B) was 25% larger. The difference between group A and group B increased with the lapse of time.

A Clinical Study on a 5 Decades Tuberculosis Screening Program Based on Chest Radiography(CXR) (흉부방사선영상(CXR)에 의한 폐결핵검진사업 50년의 임상적 고찰)

  • Kim, Ham-Gyum
    • Journal of Radiological Science and Technology / v.32 no.2 / pp.141-146 / 2009
  • This study analyzed decade-based statistical data collected from the annual reports of the radiographic pulmonary tuberculosis screening program initiated by the Korean National Tuberculosis Association (KNTA) over the last 5 decades (1956 to 2005). To preserve the characteristics of the statistical data and the content of the original reports, we analyzed only the content of the annual statistical reports, focusing on tuberculosis cases and excluding age and sex. The results of the disease-based analysis of tuberculosis cases among the cumulative chest radiography (CXR) subjects from 1956 to 2005 are summarized as follows. 1. The cumulative number of subjects examined by annual chest radiography over the last 5 decades totaled 54,938,875 persons. 2. The cumulative number of pulmonary tuberculosis cases during the same period totaled 958,251 persons (1.74%). 3. The cumulative number of subjects treated during the same period totaled 465,082 persons (0.85%). 4. The cumulative number of mild pulmonary tuberculosis cases during the same period totaled 229,615 persons (0.42%). 5. The cumulative number of moderate pulmonary tuberculosis cases during the same period totaled 144,247 persons (0.26%). 6. The cumulative number of severe pulmonary tuberculosis cases during the same period totaled 74,066 persons (0.13%). 7. The cumulative number of exudative pleurisy cases during the same period totaled 17,154 persons (0.03%). 8. The cumulative number of subjects under monitoring during the same period totaled 493,169 persons (0.90%). 9. The cumulative number of cases of uncertain activity during the same period totaled 78,214 persons (0.14%). 10. The cumulative number of pseudo-pulmonary tuberculosis cases during the same period totaled 272,349 persons (0.50%).

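The percentages quoted in the abstract above can be re-derived from the stated counts. The sketch below does only that arithmetic: it assumes each percentage is the count divided by the 54,938,875 screened subjects, and the dictionary keys are shorthand labels introduced here, not terms from the report.

```python
SCREENED = 54_938_875  # cumulative CXR subjects, 1956 to 2005

# label: (count, percentage as quoted in the abstract)
reported = {
    "pulmonary_tb": (958_251, 1.74),
    "treated": (465_082, 0.85),
    "mild": (229_615, 0.42),
    "moderate": (144_247, 0.26),
    "severe": (74_066, 0.13),
    "exudative_pleurisy": (17_154, 0.03),
    "under_monitoring": (493_169, 0.90),
    "uncertain_activity": (78_214, 0.14),
    "pseudo_tb": (272_349, 0.50),
}

# Recompute every percentage against the screened total.
recomputed = {
    name: round(100 * count / SCREENED, 2)
    for name, (count, _) in reported.items()
}
mismatches = {
    name: (recomputed[name], quoted)
    for name, (_, quoted) in reported.items()
    if abs(recomputed[name] - quoted) > 1e-9
}

# Two sums that follow from the quoted counts themselves.
severity_total = sum(
    reported[k][0] for k in ("mild", "moderate", "severe", "exudative_pleurisy")
)
treated_plus_monitoring = reported["treated"][0] + reported["under_monitoring"][0]
```

With the quoted figures, `mismatches` is empty, the four severity categories sum to the treated count, and treated plus monitored subjects sum to the total number of pulmonary tuberculosis cases.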

A Study of the virtue terms in herbal medicine (본초 효능 용어에 관한 연구)

  • Oh, Yong-Taek;Lee, Byung-Wook;Kim, Eun-Ha
    • Journal of Korean Medical Classics / v.23 no.5 / pp.35-50 / 2010
  • By newly grouping the virtue terms used in herbal medicine, we aim to establish the position coordinates of the concepts and to raise the level of future research on herbal virtues. Because the terms related to herbal virtue are used with virtue terms mingled with terms for the chief treatable diseases, it is hard to use the herbal virtue data alone. And although the virtue terms imply many kinds of data, such as medical act data or medical operation data, these cannot be fully used. We sort the terms related to herbal virtue into virtue terms and chief treatable disease terms, extract data such as medical act data or medical operation data, and group the data by the same attribute. During classification we establish sorting standards inductively and put the relations between the attributes in order; from the result we grasp the actual state of the virtue terms now in use and present useful data for future herbal virtue research. We obtained the chief treatable disease terms from the terms related to herbal virtue, extracted a large amount of data from the virtue terms, and grouped the data by the same attribute. We established a proper standard inductively during classification, put the relations between the attributes in order, grasped the actual state of the virtue terms currently in use from the classification result, and presented data applicable to future herbal virtue research.

A Benchmark Test of Spatial Big Data Processing Tools and a MapReduce Application

  • Nguyen, Minh Hieu;Ju, Sungha;Ma, Jong Won;Heo, Joon
    • Journal of the Korean Society of Surveying, Geodesy, Photogrammetry and Cartography / v.35 no.5 / pp.405-414 / 2017
  • Spatial data processing often poses challenges due to the unique characteristics of spatial data, and this becomes more complex in spatial big data processing. Some tools have been developed and provided to users; however, they are not familiar to regular users. This paper presents a benchmark test of two notable tools for spatial big data processing: GIS Tools for Hadoop and SpatialHadoop. In addition, a MapReduce application is introduced as a baseline to evaluate the effectiveness of the two tools and to derive the impact of the number of maps/reduces on performance. Using these tools and New York taxi trajectory data, we perform spatial data processing that filters the drop-off locations within the Manhattan area. The performance of these tools is then observed with respect to increasing data size and a changing number of worker nodes. The results of this study are as follows: 1) GIS Tools for Hadoop automatically creates a Quadtree index in each spatial processing run. Therefore, the performance is improved significantly. However, users should be familiar with Java to handle this tool conveniently. 2) SpatialHadoop does not automatically create a spatial index for the data. As a result, its performance is much lower than that of GIS Tools for Hadoop on the same spatial processing task. However, SpatialHadoop achieved the best result in terms of performing a range query. 3) The performance of our MapReduce application increased four times after changing the number of reduces from 1 to 12.
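
The filtering task described in the abstract above can be pictured as a map step that tags each drop-off point and a reduce step that aggregates per tag. This is a single-process sketch of that pattern only, not the paper's application and not Hadoop code: the bounding box is a hypothetical, crude stand-in for a real Manhattan polygon, and `mapper`, `reducer`, and `run` are names introduced here.

```python
from collections import defaultdict

# Hypothetical (min_lon, min_lat, max_lon, max_lat) box; an assumption,
# since a real job would test against the Manhattan boundary polygon.
BBOX = (-74.02, 40.70, -73.91, 40.88)

def mapper(record):
    """Emit ('inside' | 'outside', 1) for one (lon, lat) drop-off point."""
    lon, lat = record
    inside = BBOX[0] <= lon <= BBOX[2] and BBOX[1] <= lat <= BBOX[3]
    yield ("inside" if inside else "outside", 1)

def reducer(key, values):
    """Sum the counts emitted for one key."""
    return key, sum(values)

def run(records):
    """Shuffle mapper output by key, then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in groups.items())

# Made-up sample points, for illustration only.
dropoffs = [(-73.98, 40.75), (-73.95, 40.65), (-73.97, 40.78)]
counts = run(dropoffs)
```

In a real cluster the shuffle in `run` is done by the framework, and the number of reducers controls how many key groups are aggregated in parallel.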

A Distributed Privacy-Utility Tradeoff Method Using Distributed Lossy Source Coding with Side Information

  • Gu, Yonghao;Wang, Yongfei;Yang, Zhen;Gao, Yimu
    • KSII Transactions on Internet and Information Systems (TIIS) / v.11 no.5 / pp.2778-2791 / 2017
  • In the age of big data, distributed data providers need to ensure privacy, while data analysts need to mine the value of data. Therefore, how to find the privacy-utility tradeoff has become a research hotspot. In addition, the adversary may have background knowledge of the data source. It is therefore important to solve the privacy-utility tradeoff problem in a distributed environment with side information. This paper proposes a distributed privacy-utility tradeoff method using distributed lossy source coding with side information, and quantitatively gives the privacy-utility tradeoff region and the Rate-Distortion-Leakage region. Four results are shown in the simulation analysis. The first result is that both the source rate and the privacy leakage decrease as the source distortion increases. The second result is that the finer the relevance between the public data and the private data of the source, the finer the perturbation of the source needed to get the same privacy protection. The third result is that the greater the variance of the data source, the slighter the distortion chosen to ensure more data utility. The fourth result is that, under the same privacy restriction, the slighter the variance of the side information, the less distortion of the data source is chosen to ensure more data utility. Finally, the proposed method is compared with current ones from five aspects to show its advantage.

On Reliability and UMVUE of Right-Tail Probability in a Half-Normal Variable

  • Woo, Jung-Soo
    • Journal of the Korean Data and Information Science Society / v.18 no.1 / pp.259-267 / 2007
  • We consider parametric estimation in a half-normal variable and a UMVUE of its right-tail probability. We also consider estimation of reliability in two independent half-normal variables, and derive the k-th moment of the ratio of these two variables.

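For orientation, the two quantities named in the abstract above have simple closed forms when the scale parameters are known. The sketch below evaluates those true values only; it does not reproduce the paper's estimators or its UMVUE. It assumes the parameterization X = sigma * |Z| with Z standard normal, which may differ from the paper's, and the function names are introduced here.

```python
import math

def right_tail(t, sigma):
    """P(X > t) for a half-normal X = sigma * |Z|, with t >= 0.

    Equals 2 * (1 - Phi(t / sigma)), written via the complementary
    error function.
    """
    return math.erfc(t / (sigma * math.sqrt(2.0)))

def reliability(sigma_x, sigma_y):
    """P(Y < X) for independent half-normals with scales sigma_x, sigma_y.

    |Z2| / |Z1| follows a standard half-Cauchy distribution, so the
    probability is (2 / pi) * arctan(sigma_x / sigma_y).
    """
    return (2.0 / math.pi) * math.atan(sigma_x / sigma_y)

tail_at_one = right_tail(1.0, 1.0)   # same as the two-sided normal tail at 1
equal_scales = reliability(1.0, 1.0)  # symmetric case
```

With equal scales the reliability is one half by symmetry, and it increases toward 1 as `sigma_x` grows relative to `sigma_y`.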