
Hidden Markov Model-based Extraction of Internet Information  

Park, Dong-Chul (Dept. of Information Eng., Myong Ji University)
Abstract
This paper proposes an information extraction method based on a Hidden Markov Model (HMM) and applies it to the extraction of product prices. The input to the proposed system, IESHMM, is a set of URLs of a search engine's interface that contain the names of the product types. The output is a list of extracted slots for each product: name, price, image, and URL. The HMM is trained on the observation data set with the Maximum Likelihood and Baum-Welch algorithms. The Viterbi algorithm is then applied to find the state sequence of maximal probability that matches the observation block sequence. When applied to practical problems, the proposed HMM-based system shows improved results over a conventional method, PEWEB, in terms of recall ratio and accuracy.
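The decoding step named in the abstract can be illustrated with a minimal Viterbi sketch for a discrete HMM. Everything specific here is an assumption made for illustration and is not taken from the paper: the two states (`name`, `price`), the two observation symbols (`text`, `currency`), and all probabilities are hypothetical, and the function assumes a non-empty observation sequence. It shows only how the most probable state sequence is recovered once a model has been trained.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (best_path, best_log_prob) for a discrete HMM.

    Works in log space to avoid underflow on long sequences.
    Assumes `observations` is non-empty.
    """
    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    # scores[t][s]: best log-probability of any path that ends in state s at step t
    scores = [{s: log(start_p[s]) + log(emit_p[s][observations[0]]) for s in states}]
    # back[t][s]: predecessor of state s on that best path
    back = [{}]

    for t in range(1, len(observations)):
        scores.append({})
        back.append({})
        for s in states:
            prev, best = max(
                ((r, scores[t - 1][r] + log(trans_p[r][s])) for r in states),
                key=lambda pair: pair[1],
            )
            scores[t][s] = best + log(emit_p[s][observations[t]])
            back[t][s] = prev

    # Trace back from the best final state.
    last = max(states, key=lambda s: scores[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, scores[-1][last]


# Hypothetical toy model: slot states and block-level observation symbols.
states = ["name", "price"]
start_p = {"name": 0.8, "price": 0.2}
trans_p = {
    "name": {"name": 0.3, "price": 0.7},
    "price": {"name": 0.6, "price": 0.4},
}
emit_p = {
    "name": {"text": 0.9, "currency": 0.1},
    "price": {"text": 0.2, "currency": 0.8},
}

path, log_prob = viterbi(["text", "currency", "text"], states, start_p, trans_p, emit_p)
print(path, math.exp(log_prob))
```

For this toy model the decoded sequence is `name`, `price`, `name`. In a real system the transition and emission tables would come from the training step (for example Maximum Likelihood counts refined by Baum-Welch) rather than being written by hand.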
Keywords
HMM; internet; data extraction; Baum-Welch algorithm; Viterbi algorithm
Citations & Related Records
Times cited by KSCI: 4
1 X. H. Phan, S. Horiguchi, and T. Ho, 'PEWEB: Product Extraction from the Web Based on Entropy Estimation', Proc. of the 2004 IEEE/WIC/ACM International Conference on Web Intelligence, pp. 590-593, 2004
2 석현택, 곽경섭, 'A recognition method using HMM for stripe waveforms projected onto the human body', Journal of the Institute of Electronics Engineers of Korea, Vol. 42-CI, No. 1, pp. 51-58, 2005
3 양욱일, 손광훈, '3D face recognition using radial basis function neural networks', Journal of the Institute of Electronics Engineers of Korea, Vol. 44-SP, No. 2, pp. 82-92, 2007
4 노수호, 박병준, 'A Web page recommendation technique using a stochastic process model', Journal of the Institute of Electronics Engineers of Korea, Vol. 42-CI, No. 6, pp. 37-46, 2005
5 D. Gusfield, Algorithms on Strings, Trees, and Sequences, 1997
6 C. Chang, C. Hsu, and S. Lui, 'Automatic information extraction from semi-structured Web pages by pattern discovery', Decision Support Systems, Vol. 35, No. 1, pp. 129-147, 2004
7 http://www.jaist.ac.jp/~hieuxuan/softwares/peweb/
8 K. Lerman, S. Minton, and C. Knoblock, 'Wrapper Maintenance: A machine learning approach', J. of Artificial Intelligence Research, Vol. 18, pp. 149-181, 2003
9 D. Embley, Y. Jiang, and Y. Ng, 'Record-boundary discovery in Web documents', Proc. of SIGMOD-99, 1999
10 L. R. Rabiner, 'A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition', Proc. of the IEEE, Vol. 77, No. 2, pp. 257-286, 1989
11 D. Buttler, S. Liu, and C. Pu, 'A Fully Automated Extraction System for the World Wide Web', Proc. of IEEE ICDCS, pp. 361-370, 2001.
12 http://omini.sourceforge.net/
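13 박창현, 송명선, 'Development and performance analysis of a channel set manager for cognitive radio systems', Journal of the Institute of Electronics Engineers of Korea, Vol. 45-CI, No. 5, pp. 8-14, 2008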
14 B. Liu, R. Grossman, and Y. Zhai, 'Mining Data Records in Web Pages', IEEE Intelligent Systems, Vol. 19, No. 6, pp. 49-55, 2004
15 D.-C. Park, et al., 'Information Extraction System Based on Hidden Markov Model', Proc. of ISNN 2009, (accepted for presentation).