An Innovative Approach of Bangla Text Summarization by Introducing Pronoun Replacement and Improved Sentence Ranking

  • Haque, Md. Majharul (Computer Science & Engineering, University of Dhaka) ;
  • Pervin, Suraiya (Computer Science & Engineering, University of Dhaka) ;
  • Begum, Zerina (Institute of Information Technology, University of Dhaka)
  • Received : 2016.12.22
  • Accepted : 2017.05.30
  • Published : 2017.08.31

Abstract

This paper proposes an automatic method for summarizing Bangla news documents. In the proposed approach, pronoun replacement is introduced for the first time to minimize dangling pronouns in the summary. After pronouns are replaced, sentences are ranked using term frequency, sentence frequency, numerical figures, and title words. If two sentences have at least 60% cosine similarity, the frequency of the larger sentence is increased and the smaller sentence is removed to eliminate redundancy. Moreover, the first sentence is always included in the summary if it contains any title word. In Bangla text, numerical figures can be written in words or in digits, in a variety of forms; all of these forms are identified to assess the importance of sentences. The approach uses a rule-based system together with a hidden Markov model and a Markov chain model. To derive the rules, we analyzed 3,000 Bangla news documents and studied several Bangla grammar books. A series of experiments was performed on 200 Bangla news documents and 600 summaries (3 summaries for each document). The evaluation results demonstrate the effectiveness of the proposed technique compared with the four latest methods.
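
The redundancy step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes whitespace tokenization, raw term counts as vector weights, sentence size measured in tokens, and a sentence frequency that starts at 1 and absorbs the count of each removed sentence. The names `cosine_similarity` and `merge_redundant` are hypothetical; only the 60% threshold comes from the abstract.

```python
from collections import Counter
from math import sqrt


def cosine_similarity(a, b):
    """Cosine similarity between two token lists, using raw term counts."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm_a = sqrt(sum(v * v for v in ca.values()))
    norm_b = sqrt(sum(v * v for v in cb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def merge_redundant(sentences, threshold=0.6):
    """Drop the smaller sentence of each similar pair; credit the larger one.

    Returns (kept_indices, frequency_by_index). Whitespace tokenization and
    the additive frequency update are assumptions made for illustration.
    """
    tokens = [s.split() for s in sentences]
    freq = [1] * len(sentences)
    removed = set()
    for i in range(len(sentences)):
        if i in removed:
            continue
        for j in range(i + 1, len(sentences)):
            if j in removed:
                continue
            if cosine_similarity(tokens[i], tokens[j]) >= threshold:
                # Keep the sentence with more tokens; ties keep the earlier one.
                if len(tokens[j]) > len(tokens[i]):
                    keep, drop = j, i
                else:
                    keep, drop = i, j
                freq[keep] += freq[drop]
                removed.add(drop)
                if drop == i:
                    break  # sentence i is gone; stop comparing it
    kept = [k for k in range(len(sentences)) if k not in removed]
    return kept, {k: freq[k] for k in kept}
```

In a fuller pipeline, the resulting sentence frequency would be one input to the ranking score alongside term frequency, numerical figures, and title words; how those features are weighted is not stated in the abstract and is not modeled here.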
