http://dx.doi.org/10.5392/JKCA.2012.12.10.001

Optimization and Performance Analysis of Distributed Parallel Processing Platform for Terminology Recognition System  

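Choi, Yun-Soo (Software Research Laboratory, Korea Institute of Science and Technology Information)
Lee, Won-Goo (Software Research Laboratory, Korea Institute of Science and Technology Information)
Lee, Min-Ho (Software Research Laboratory, Korea Institute of Science and Technology Information)
Choi, Dong-Hoon (Software Research Laboratory, Korea Institute of Science and Technology Information)
Yoon, Hwa-Mook (Software Research Laboratory, Korea Institute of Science and Technology Information)
Song, Sa-kwang (Software Research Laboratory, Korea Institute of Science and Technology Information)
Jung, Han-Min (Software Research Laboratory, Korea Institute of Science and Technology Information)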
Abstract
Many statistical methods have been applied to terminology recognition to improve its accuracy. However, because previous studies ran on a single core or a single machine, they have difficulty analyzing a rapidly growing volume of documents in real time. In this study, we identify two bottleneck tasks in terminology recognition: linguistic processing during 'candidate terminology extraction' and the collection of statistical information during 'terminology weight assignment'. We implemented a terminology recognition system that handles each task with MapReduce-based distributed parallel processing and evaluated it in two experiments. The first experiment showed that distributed parallel processing on 12 nodes is 11.27 times faster than processing on a single machine. The second experiment compared four configurations: 1) the default environment, 2) multiple reducers, 3) a combiner, and 4) the combination of 2) and 3); configuration 3), the combiner alone, performed best. Our terminology recognition system helps speed up knowledge extraction from large-scale science and technology documents.
Keywords
Terminology Recognition; Distributed Parallel Processing; Hadoop; MapReduce
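
The four configurations compared in the second experiment can be illustrated with a small in-process simulation. This is only a sketch under stated assumptions, not the authors' Hadoop implementation: the function names (`map_phase`, `combine`, `partition`, `run_job`) are hypothetical, the adjacent-word-pair extractor is a toy stand-in for the linguistic processing step, and the sample documents are made up. It shows why a combiner can reduce the number of records sent from mappers to reducers while leaving the final counts unchanged; it does not reproduce the timing results reported in the abstract.

```python
from collections import Counter, defaultdict


def map_phase(doc):
    """Emit (candidate_term, 1) per adjacent word pair (toy candidate extractor)."""
    words = doc.lower().split()
    return [(" ".join(words[i:i + 2]), 1) for i in range(len(words) - 1)]


def combine(pairs):
    """Combiner: pre-aggregate counts on the mapper side before the shuffle."""
    local = Counter()
    for term, n in pairs:
        local[term] += n
    return list(local.items())


def partition(term, num_reducers):
    """Deterministic partitioner: every key is routed to exactly one reducer."""
    return sum(term.encode("utf-8")) % num_reducers


def run_job(docs, num_reducers=1, use_combiner=False):
    """Simulate map -> (optional combine) -> shuffle -> reduce.

    Returns the final term counts and the number of records shuffled.
    """
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    shuffled = 0
    for doc in docs:  # each document plays the role of one map task
        pairs = map_phase(doc)
        if use_combiner:
            pairs = combine(pairs)
        shuffled += len(pairs)
        for term, n in pairs:
            buckets[partition(term, num_reducers)][term].append(n)
    counts = {}
    for bucket in buckets:  # each bucket plays the role of one reducer
        for term, values in bucket.items():
            counts[term] = sum(values)
    return counts, shuffled


# Hypothetical sample input, for illustration only.
docs = ["map reduce map reduce map reduce", "term weight term weight"]

configurations = {
    "1) default": dict(num_reducers=1, use_combiner=False),
    "2) multiple reducers": dict(num_reducers=3, use_combiner=False),
    "3) combiner": dict(num_reducers=1, use_combiner=True),
    "4) multiple reducers + combiner": dict(num_reducers=3, use_combiner=True),
}
for name, cfg in configurations.items():
    counts, shuffled = run_job(docs, **cfg)
    print(name, "shuffled records:", shuffled, "distinct terms:", len(counts))
```

In this toy setting the combiner halves the shuffled records because repeated candidate terms within a document are merged before the shuffle. On a real cluster, whether a combiner, extra reducers, or both help depends on the data distribution and job overhead, which is what the second experiment measures.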