http://dx.doi.org/10.3745/JIPS.2012.8.2.301

The Use of MSVM and HMM for Sentence Alignment  

Fattah, Mohamed Abdel (Dept. of Electronics Technology, Helwan University)
Publication Information
Journal of Information Processing Systems / v.8, no.2, 2012, pp. 301-314
Abstract
This paper presents two new approaches to aligning English-Arabic sentences in bilingual parallel corpora, based on Multi-Class Support Vector Machine (MSVM) and Hidden Markov Model (HMM) classifiers. A feature vector is extracted from the text pair under consideration. This vector contains text features such as length, punctuation score, and cognate score values. One set of manually prepared data was used to train the MSVM and HMM classifiers, and another set was used for testing. Both the MSVM and HMM approaches outperform the length-based approach. Moreover, these approaches are valid for any language pair and are quite flexible, since the feature vector may contain fewer, more, or different features than the ones used in the current research, such as a lexical matching feature or Hanzi characters in Japanese-Chinese texts.
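The feature vector described above can be illustrated with a minimal sketch. The abstract names the features (length, punctuation score, cognate score) but does not give their formulas, so every definition below is an assumption made for illustration and not the paper's method: the function names, the min/max ratios, the small set of Arabic punctuation marks, and the treatment of digit-bearing tokens as cognates are all hypothetical.

```python
import string

# Assumption: ASCII punctuation plus a few common Arabic marks.
PUNCTUATION = set(string.punctuation) | {"،", "؛", "؟"}
STRIP_CHARS = "".join(PUNCTUATION)


def length_ratio(src: str, tgt: str) -> float:
    """Shorter character length divided by the longer one (1.0 means equal)."""
    longer = max(len(src), len(tgt))
    return min(len(src), len(tgt)) / longer if longer else 1.0


def punctuation_score(src: str, tgt: str) -> float:
    """Smaller punctuation count divided by the larger one."""
    n_src = sum(ch in PUNCTUATION for ch in src)
    n_tgt = sum(ch in PUNCTUATION for ch in tgt)
    larger = max(n_src, n_tgt)
    return min(n_src, n_tgt) / larger if larger else 1.0


def _numeric_tokens(text: str) -> set:
    """Tokens that contain a digit, with surrounding punctuation stripped."""
    return {
        tok.strip(STRIP_CHARS)
        for tok in text.split()
        if any(ch.isdigit() for ch in tok)
    }


def cognate_score(src: str, tgt: str) -> float:
    """Share of digit-bearing tokens that appear verbatim on both sides.

    Assumption: across different scripts, numbers are the simplest cognates.
    """
    src_tokens, tgt_tokens = _numeric_tokens(src), _numeric_tokens(tgt)
    larger = max(len(src_tokens), len(tgt_tokens))
    return len(src_tokens & tgt_tokens) / larger if larger else 0.0


def feature_vector(src: str, tgt: str) -> list:
    """Bundle the three illustrative features for one candidate sentence pair."""
    return [
        length_ratio(src, tgt),
        punctuation_score(src, tgt),
        cognate_score(src, tgt),
    ]


if __name__ == "__main__":
    english = "The meeting is on 12 May, 2012."
    arabic = "الاجتماع في 12 مايو، 2012."
    print(feature_vector(english, arabic))
```

A vector like this would be computed for each candidate sentence pair and passed to a classifier. Because the features are simply entries in a list, adding or removing one (for example, a lexical matching score) changes only `feature_vector`.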
Keywords
Sentence Alignment; English/Arabic Parallel Corpus; Parallel Corpora; Machine Translation; Multi-Class Support Vector Machine; Hidden Markov Model