Automatic extraction of similar poetry for study of literary texts: An experiment on Hindi poetry

  • Prakash, Amit (Department of Computer Science and Engineering, Birla Institute of Technology)
  • Singh, Niraj Kumar (Department of Computer Science and Engineering, Birla Institute of Technology)
  • Saha, Sujan Kumar (Department of Computer Science and Engineering, Birla Institute of Technology)
  • Received: 2019.08.20
  • Accepted: 2022.03.15
  • Published: 2022.06.10

Abstract

The study of literary texts is one of the earliest disciplines practiced around the globe. Poetry is artistic writing in which words are carefully chosen and arranged for their meaning, sound, and rhythm. It often carries a broad and profound sense, which makes it difficult even for humans to interpret. The essence of poetry is Rasa, which signifies mood or emotion. In this paper, we propose a poetry classification-based approach to automatically extract similar poems from a repository. Specifically, we perform a novel Rasa-based classification of Hindi poetry. For this task, we primarily used lexical features in a bag-of-words model trained with a support vector machine classifier. In the model, we employed Hindi WordNet, Latent Semantic Indexing, and Word2Vec-based neural word embedding. To extract rich feature vectors, we prepared a repository containing 37,717 poems collected from various sources. We evaluated the performance of the system on a manually constructed dataset containing 945 Hindi poems. Experimental results demonstrated that the proposed model attained satisfactory performance.
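
The classification-based retrieval idea described above can be illustrated with a minimal sketch. It is not the authors' system: a simple nearest-centroid classifier over raw term counts stands in for the support vector machine, no Hindi WordNet, Latent Semantic Indexing, or Word2Vec features are used, and the English toy poems, the labels "heroic" and "sorrow", and all function names are hypothetical placeholders chosen only for illustration.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Map a text to term counts (lowercased, whitespace-tokenized)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def train(labelled_poems):
    """Build one summed bag-of-words vector (a centroid) per label.

    Assumption: a nearest-centroid model replaces the SVM from the abstract.
    """
    model = {}
    for text, label in labelled_poems:
        model.setdefault(label, Counter()).update(bag_of_words(text))
    return model


def classify(model, text):
    """Return the label whose centroid is most similar to the text."""
    vector = bag_of_words(text)
    return max(model, key=lambda label: cosine(vector, model[label]))


def similar_poems(model, repository, query, top_k=3):
    """Keep repository poems predicted to share the query's label,
    then rank them by cosine similarity to the query."""
    target = classify(model, query)
    query_vector = bag_of_words(query)
    candidates = [p for p in repository if classify(model, p) == target]
    candidates.sort(key=lambda p: cosine(query_vector, bag_of_words(p)),
                    reverse=True)
    return candidates[:top_k]


# Hypothetical toy data; the real study uses Hindi poems and Rasa labels.
TRAINING = [
    ("the brave warrior rides to battle with sword", "heroic"),
    ("soldiers march to battle with courage", "heroic"),
    ("tears fall as the mother weeps in grief", "sorrow"),
    ("the lonely heart weeps with grief and tears", "sorrow"),
]
REPOSITORY = [
    "the warrior lifts his sword in battle",
    "she weeps alone and tears fall",
    "courage fills the soldiers before battle",
]

if __name__ == "__main__":
    model = train(TRAINING)
    for poem in similar_poems(model, REPOSITORY, "a brave warrior goes to battle"):
        print(poem)
```

Filtering by predicted class before ranking is what makes the retrieval classification-based: a poem that shares surface vocabulary with the query but falls in a different class is never returned.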

Acknowledgement

The authors would like to thank the associate editor and the anonymous reviewers for their valuable comments and suggestions, which improved the quality of this paper.
