A Sliding Window-based Multivariate Stream Data Classification

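  • Sungbo Seo (Dept. of Computer Science, Chungbuk National University)
  • Jaewoo Kang (Dept. of Computer Science, North Carolina State University)
  • Kwang Woo Nam (Dept. of Computer Information Science, Kunsan National University)
  • Keun Ho Ryu (Dept. of Computer Science, Chungbuk National University)
  • Published: 2006.04.01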

Abstract

In a distributed wireless sensor network, transmitting and analyzing every reading of a high-volume data stream is difficult and undesirable given limited network bandwidth, power, and processing capacity. A classification technique is therefore needed that classifies continuously arriving data in advance, so that the data can be processed selectively according to its characteristics. This paper proposes a technique that classifies stream data collected periodically from multidimensional sensors one sliding window at a time. The technique consists of a preprocessing step and a classification step. The preprocessing step takes each sliding window of multivariate stream data as input and discretizes it into sets of symbol strings that characterize how the signal changes. The classification step applies standard text classification algorithms to the symbol strings generated for each window. We compared and evaluated supervised learning algorithms (Bayesian classifier, SVM) and unsupervised algorithms (Jaccard, TFIDF, Jaro, Jaro-Winkler). In our experiments, SVM and TFIDF outperformed the other methods, and accuracy was particularly high when the correlation between attributes was considered together with n-gram tokens formed by concatenating adjacent symbols.
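The two-step idea in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration rather than the authors' encoding: the three-symbol alphabet (U, D, S), the threshold `eps`, the attribute-prefixed tokens, and all function names. Jaccard similarity with a nearest-labeled-window rule stands in for the classifiers evaluated in the paper, and the correlation between attributes is not modeled.

```python
def discretize(values, eps=0.5):
    """Encode each consecutive change as 'U' (up), 'D' (down), or 'S' (steady).

    The alphabet and the threshold eps are illustrative assumptions.
    """
    symbols = []
    for prev, cur in zip(values, values[1:]):
        delta = cur - prev
        if delta > eps:
            symbols.append("U")
        elif delta < -eps:
            symbols.append("D")
        else:
            symbols.append("S")
    return "".join(symbols)


def ngrams(symbols, n=2):
    """Return the list of n-grams formed by adjacent symbols."""
    return [symbols[i:i + n] for i in range(len(symbols) - n + 1)]


def window_tokens(window, n=2, eps=0.5):
    """Turn one multivariate window into a set of tokens.

    window maps an attribute name to its readings inside the window.
    Each token is prefixed with the attribute name so that the same
    shape in two different attributes yields two different tokens.
    """
    tokens = set()
    for name, values in window.items():
        for gram in ngrams(discretize(values, eps), n):
            tokens.add(f"{name}:{gram}")
    return tokens


def jaccard(a, b):
    """Jaccard similarity between two token sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def classify(window, labeled, n=2, eps=0.5):
    """Return the label of the most similar labeled window.

    labeled is a list of (window, label) pairs; ties go to the first pair.
    """
    query = window_tokens(window, n, eps)
    best = max(labeled,
               key=lambda pair: jaccard(query, window_tokens(pair[0], n, eps)))
    return best[1]


# Hypothetical labeled windows with a single attribute named "temp".
labeled = [
    ({"temp": [1, 2, 3, 4, 5]}, "rising"),
    ({"temp": [5, 4, 3, 2, 1]}, "falling"),
]
print(classify({"temp": [10, 11.2, 12.1, 13.5, 15]}, labeled))
```

Because tokens are plain strings, the same token sets could be passed to any text classifier in place of the nearest-window rule used here.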
