• Title/Summary/Keyword: Statistical data

Big Data Smoothing and Outlier Removal for Patent Big Data Analysis

  • Choi, JunHyeog;Jun, Sunghae
    • Journal of the Korea Society of Computer and Information / v.21 no.8 / pp.77-84 / 2016
  • General statistical analysis usually relies on the assumption of normality. When this assumption is not satisfied, the results of the analysis cannot be trusted. Most statistical methods for handling outliers and noise also require this assumption, but big data rarely satisfies it because of its large volume and heterogeneity. We therefore propose a methodology based on box plots and data smoothing for controlling outliers and noise in big data analysis. The proposed methodology does not depend on the normality assumption. We select patent documents as the target domain because patent big data analysis is an important issue in the management of technology. Patent documents are analyzed with big data learning methods for technology analysis: patent data collected from patent databases around the world are preprocessed and analyzed by text mining and statistics. However, most research on patent big data analysis has not considered the outlier and noise problem, which decreases prediction accuracy and increases the variance of parameter estimates. In this paper, we check for the existence of outliers and noise in patent big data using box plots and smoothing visualization. We use patent documents related to three-dimensional printing technology to illustrate how the proposed methodology can be used to detect noise in retrieved patent big data.
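
The box-plot and smoothing ideas named in this abstract can be sketched as follows. This is a minimal illustration, not the authors' procedure: it assumes Tukey's 1.5 × IQR fences for the box-plot rule and a centered moving average as the smoother, and the `counts` series is hypothetical data invented for the example. Neither step assumes normality, since both rely only on quartiles and local means.

```python
from statistics import quantiles


def boxplot_outliers(values, whisker=1.5):
    """Return the values lying outside Tukey's box-plot fences."""
    # Quartiles computed with the 'inclusive' method (data treated as the population).
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - whisker * iqr, q3 + whisker * iqr
    return [v for v in values if v < low or v > high]


def moving_average(values, window=3):
    """Centered moving average; the window shrinks at both ends of the series."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i + half + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


# Hypothetical yearly keyword counts (illustrative only, not taken from the paper).
counts = [12, 15, 14, 13, 90, 16, 15, 17]
print(boxplot_outliers(counts))   # values flagged by the box-plot rule
print(moving_average(counts))     # smoothed series for visual inspection
```

Plotting the raw and smoothed series side by side is one way to see where a flagged value distorts the trend.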

An Analysis on Statistical Units of Elementary School Mathematics Textbook (통계적 문제해결 과정 관점에 따른 초등 수학교과서 통계 지도 방식 분석)

  • Bae, Hye Jin;Lee, Dong Hwan
    • Journal of Elementary Mathematics Education in Korea / v.20 no.1 / pp.55-69 / 2016
  • The purpose of this study is to investigate the statistical units of elementary school mathematics textbooks from the perspective of the statistical problem-solving process, in order to provide useful information for the qualitative improvement of curriculum and teaching material development. The study analyzed the statistical units in the textbooks for grades 1 to 6 under the 2009 revised national curriculum. The analysis framework is based on the four phases of the statistical problem-solving process: formulating questions, planning and collecting data, presenting and analyzing data, and interpreting data.

Development of a Dynamic Geometry Environment to Collect Learning History Data

  • Mun, Kill-Sung;Han, Beom-Soo;Han, Kyung-Soo;Ahn, Jeong-Yong
    • Journal of the Korean Data and Information Science Society / v.18 no.2 / pp.375-384 / 2007
  • As teaching with ICT becomes more popular, many studies on dynamic geometry environments (DGE) are under way. An important factor emphasized in these studies is the practical use of learners' learning activities. In this study, we first define the learning history data in a DGE. Second, we develop a prototype DGE that can collect and analyze the learning history data automatically. The environment makes it possible not only to grasp the learning history but also to create and manage new learning objects.

A Statistical Matching Method with k-NN and Regression

  • Chung, Sung-S.;Kim, Soon-Y.;Lee, Seung-S.;Lee, Ki-H.
    • Journal of the Korean Data and Information Science Society / v.18 no.4 / pp.879-890 / 2007
  • Statistical matching is a method of data integration for data sources that do not share the same units. It can rapidly produce a large amount of new information at low cost and reduce the response burden that affects data quality. This paper proposes a statistical matching technique that combines k-NN (k-nearest neighbor) and regression methods. We select the k records in a donor file whose values are most similar to a specific observation of the common variable in a recipient file, and estimate an imputation value for the recipient file using regression modeling in the donor file. An empirical comparison study is conducted to show the properties of the proposed method.
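
One possible reading of the k-NN plus regression idea in this abstract is sketched below. It is an assumption-laden illustration rather than the paper's algorithm: it assumes a single numeric common variable, absolute difference as the similarity measure, and a simple linear regression fitted only on the k selected donor records. The function names and the `donors` data are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        # All x values are identical: fall back to the mean of y.
        return 0.0, mean_y
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def knn_regression_impute(donors, recipient_x, k=3):
    """Impute the missing variable for one recipient record.

    donors      -- list of (common_variable, target_variable) pairs
    recipient_x -- value of the common variable in the recipient record
    """
    # Step 1: pick the k donors closest to the recipient on the common variable.
    nearest = sorted(donors, key=lambda d: abs(d[0] - recipient_x))[:k]
    # Step 2: fit a regression on those donors and predict at the recipient's value.
    slope, intercept = fit_line([d[0] for d in nearest], [d[1] for d in nearest])
    return slope * recipient_x + intercept


# Hypothetical donor file (illustrative only).
donors = [(1, 2), (2, 4), (3, 6), (10, 50)]
print(knn_regression_impute(donors, 2.5, k=3))
```

Restricting the fit to the nearest donors keeps a distant record such as `(10, 50)` from influencing the imputed value.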

Land Use Classification Using GIS based Statistical Unit data (GIS기반의 통계정보를 이용한 토지이용 분류)

  • 민숙주;김계현;박태옥;전방진
    • Proceedings of the Korean Society of Surveying, Geodesy, Photogrammetry, and Cartography Conference / 2004.11a / pp.343-347 / 2004
  • Land use information serves as base data for land use planning and for urban and environmental management, and demand for it is rising because of ecological considerations in urban areas. However, existing methods that extract land use information from aerial photographs or satellite images have difficulty describing urban land uses in sufficient detail. Land use information also needs to be linked with statistical data, because statistical data are used together with land use in decision making for urban planning and management. This study therefore examines a land use classification method that uses statistical unit data and 1:1,000 digital topographic data. The method was applied to a part of metropolitan Seoul, and the results show a total accuracy of 95%. In the future, the method is expected to be applicable to city maintenance.

A Study on the Data Fusion Method using Decision Rule for Data Enrichment (의사결정 규칙을 이용한 데이터 통합에 관한 연구)

  • Kim S.Y.;Chung S.S.
    • The Korean Journal of Applied Statistics / v.19 no.2 / pp.291-303 / 2006
  • Data mining is the work of extracting information from existing data files, so one of the most important factors in the data mining process is the quality of the data used. In this paper, we propose a data fusion technique that uses decision rules for data enrichment, which is one phase of improving data quality in the KDD process. Simulations were performed to compare the proposed data fusion technique with existing techniques. The results show that the proposed technique yields a low MSE or misclassification rate in the fusion variables.

An Analysis on Classifying and Representing Data as Statistical Literacy: Focusing on Elementary Mathematics Curriculum for 1st and 2nd Grades (통계적 소양으로서 자료의 분류 및 표현 활동의 의의 분석: 초등학교 1~2학년군 수학과 교육과정을 중심으로)

  • Tak, Byungjoo
    • Journal of Elementary Mathematics Education in Korea / v.22 no.3 / pp.221-240 / 2018
  • In this study, we focus on classifying and representing data in the elementary mathematics curriculum for 1st and 2nd grades, which has rarely been addressed in previous studies. We analyze the significance of classifying and representing data in terms of statistical problem solving and variability as the core of statistical literacy. We find that classifying data is important for students to recognize the variability that is ubiquitous in data, and that representing data is important for constructing the distribution of data. Both are reflected in the 2015 revised mathematics curriculum as statistical literacy for addressing data. We suggest some implications for teaching the classification and representation of data as the practice of statistical literacy education in statistics classes for 1st and 2nd grades.

Quantitative Linguistic Analysis on Literary Works

  • Choi, Kyung-Ho
    • Journal of the Korean Data and Information Science Society / v.18 no.4 / pp.1057-1064 / 2007
  • From the viewpoint of natural language processing, quantitative linguistic analysis is a linguistic study that relies on statistical methods; it is a form of mathematical linguistics that attempts to discover various linguistic characteristics by interpreting linguistic facts quantitatively. In this study, I introduce a quantitative linguistic analysis method for literary works that uses a computer and statistical methods. I also introduce the use of SynKDP, a synthesized Korean data processor, and show the relation between the distribution of linguistic unit elements used by the hero of the novel "Sassinamjunggi" and theme analysis of literary works.

Robust Regression and Stratified Residuals for Left-Truncated and Right-Censored Data

  • Kim, Chul-Ki
    • Journal of the Korean Statistical Society / v.26 no.3 / pp.333-354 / 1997
  • Computational algorithms to calculate M-estimators and rank estimators of regression parameters from left-truncated and right-censored data are developed herein. In the case of M-estimators, new statistical methods are also introduced to incorporate leverage assessments and concomitant scale estimation in the presence of left truncation and right censoring on the observed response. Furthermore, graphical methods to examine the residuals from these data are presented. Two real data sets are used for illustration.

Association Rule Mining by Environmental Data Fusion

  • Cho, Kwang-Hyun;Park, Hee-Chang
    • Journal of the Korean Data and Information Science Society / v.18 no.2 / pp.279-287 / 2007
  • Data fusion is the process of combining multiple data sets to produce information of tactical value to the user. It is generally defined as the use of techniques that combine data from multiple sources and gather that information in order to make inferences. Data fusion is also called data combination or data matching, and it is divided into five types: exact matching, judgmental matching, probability matching, statistical matching, and data linking. In this paper, we develop a macro program for statistical matching, one of the five types of data fusion, and then apply data fusion and association rule techniques to environmental data.
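
The association rule step mentioned in this abstract can be illustrated with a small sketch. It is not the authors' macro program: it assumes the fused records have already been reduced to sets of categorical items, considers only rules between single items, and uses the standard support and confidence measures. The item labels, thresholds, and function name are hypothetical.

```python
from collections import Counter
from itertools import combinations


def pair_rules(transactions, min_support=0.5, min_confidence=0.6):
    """Return (lhs, rhs, support, confidence) for single-item rules lhs -> rhs."""
    n = len(transactions)
    item_counts = Counter()
    pair_counts = Counter()
    for record in transactions:
        items = sorted(set(record))
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n              # share of records containing both items
        if support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_counts[lhs]   # estimate of P(rhs | lhs)
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return rules


# Hypothetical fused environmental records (illustrative only).
records = [
    {"high_NO2", "high_PM10"},
    {"high_NO2", "high_PM10"},
    {"high_NO2"},
    {"high_PM10", "rain"},
]
for lhs, rhs, support, confidence in pair_rules(records):
    print(f"{lhs} -> {rhs}: support={support:.2f}, confidence={confidence:.2f}")
```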
