http://dx.doi.org/10.9708/jksci.2016.21.8.077

Big Data Smoothing and Outlier Removal for Patent Big Data Analysis  

Choi, JunHyeog (Dept. of Secretarial Management, Kimpo College)
Jun, Sunghae (Dept. of Statistics, Cheongju University)
Abstract
In general statistical analysis, we need to assume that the data follow a normal distribution. If this assumption is not satisfied, we cannot expect good results from the analysis. Most statistical methods for handling outliers and noise also rely on this assumption. However, the assumption is often not satisfied in big data because of its large volume and heterogeneity. We therefore propose a methodology based on the box-plot and data smoothing for controlling outliers and noise in big data analysis. The proposed methodology does not depend on the normality assumption. We select patent documents as the target domain of big data because patent big data analysis is an important issue in the management of technology. We analyze patent documents using big data learning methods for technology analysis. The patent data collected from patent databases worldwide are preprocessed and analyzed by text mining and statistics. However, most studies of patent big data analysis have not considered the outlier and noise problem. This problem decreases the accuracy of prediction and increases the variance of parameter estimation. In this paper, we check for the existence of outliers and noise in patent big data. To determine whether outliers are present in the patent big data, we use box-plot and smoothing visualization. We use patent documents related to three-dimensional printing technology to illustrate how the proposed methodology can be used to detect the existence of noise in the retrieved patent big data.
Keywords
Patent big data; Smoothing; Box-plot; Noise; Outlier; Statistical analysis;
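
The abstract names two distribution-free tools, the box-plot and data smoothing, without giving their details. The following is a minimal Python sketch of what such checks could look like, under stated assumptions: the box-plot rule is taken to be the conventional Tukey fences (1.5 times the interquartile range), the smoother is a plain centered moving average standing in for whatever smoother the paper actually uses, and the yearly keyword counts are made-up illustrative values, not data from the paper. The function names are hypothetical.

```python
from statistics import mean, quantiles


def tukey_fences(values, k=1.5):
    """Return the (lower, upper) box-plot fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def flag_outliers(values, k=1.5):
    """Return the values that fall outside the box-plot fences."""
    lower, upper = tukey_fences(values, k)
    return [v for v in values if v < lower or v > upper]


def moving_average(values, window=3):
    """Centered moving average; the window is truncated at both ends."""
    half = window // 2
    return [
        mean(values[max(0, i - half): i + half + 1])
        for i in range(len(values))
    ]


# Hypothetical yearly occurrence counts of one patent keyword (illustrative only).
counts = [4, 5, 6, 5, 7, 6, 40, 7, 8, 9]

print(flag_outliers(counts))   # → [40]
print(moving_average(counts))  # smoothed series, same length as counts
```

Neither step assumes normality: the fences depend only on sample quartiles, and the moving average is a local mean. Plotting the raw series against the smoothed one is one way to make a spike such as the value 40 visible.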