http://dx.doi.org/10.6109/jicce.2022.20.2.113

Grammatical Structure Oriented Automated Approach for Surface Knowledge Extraction from Open Domain Unstructured Text

Tissera, Muditha (Department of Software Engineering, University of Kelaniya)
Weerasinghe, Ruvan (School of Computing, University of Colombo)

Publication Information

Journal of Information and Communication Convergence Engineering / v.20, no.2, 2022, pp. 113-124

Abstract

News published on the web generates increasingly large volumes of information as unstructured text. Because only humans can currently understand the meaning of such news, this growth causes information overload and hinders the effective use of the knowledge embedded in the text. Automatic Knowledge Extraction (AKE) has therefore become an integral part of the Semantic Web and Natural Language Processing (NLP). Although recent literature shows that AKE has progressed, its results still fall short of expectations. This study proposes a method to automatically extract surface knowledge from English news into a machine-interpretable semantic format (triples). The proposed technique is based on the grammatical structure of the sentence, and 11 original rules were discovered. The initial experiment extracted triples from a Sri Lankan news corpus, of which 83.5% were meaningful. The experiment was then extended to the British Broadcasting Corporation (BBC) news dataset to demonstrate the generic nature of the approach, and it yielded a higher meaningful-triple extraction rate of 92.6%. These results were validated using the inter-rater agreement method, which indicated high reliability.
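The idea of mapping a sentence's grammatical structure to a triple can be illustrated with a minimal sketch. The abstract does not list the 11 rules, so the single pattern below (noun phrase, verb group, noun phrase) is a hypothetical stand-in and not one of the authors' rules. The names `take_run` and `extract_triple`, the hand-tagged example sentence, and the use of Penn Treebank part-of-speech tags are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple

Tagged = List[Tuple[str, str]]   # (token, Penn Treebank POS tag), assumed pre-tagged
Triple = Tuple[str, str, str]    # (subject, predicate, object)

# Tag sets for the illustrative pattern; these are assumptions, not the paper's rules.
NOUN_PHRASE_TAGS = {"DT", "JJ", "CD", "NN", "NNS", "NNP", "NNPS"}
VERB_TAGS = {"MD", "VB", "VBD", "VBG", "VBN", "VBP", "VBZ"}


def take_run(tagged: Tagged, start: int, allowed: set) -> Tuple[List[str], int]:
    """Collect consecutive tokens from `start` whose tag is in `allowed`."""
    end = start
    while end < len(tagged) and tagged[end][1] in allowed:
        end += 1
    return [token for token, _ in tagged[start:end]], end


def extract_triple(tagged: Tagged) -> Optional[Triple]:
    """Apply one hypothetical rule: noun phrase, verb group, noun phrase."""
    subject, i = take_run(tagged, 0, NOUN_PHRASE_TAGS)
    predicate, i = take_run(tagged, i, VERB_TAGS)
    obj, _ = take_run(tagged, i, NOUN_PHRASE_TAGS)
    if not (subject and predicate and obj):
        return None  # the sentence does not match this single pattern
    return " ".join(subject), " ".join(predicate), " ".join(obj)


# Hand-tagged example sentence (invented for this sketch, not from either corpus).
sentence = [
    ("The", "DT"), ("government", "NN"), ("has", "VBZ"), ("approved", "VBN"),
    ("the", "DT"), ("new", "JJ"), ("budget", "NN"), (".", "."),
]
print(extract_triple(sentence))
# prints ('The government', 'has approved', 'the new budget')
```

A sentence that does not fit the pattern returns `None`, which is why a practical system would presumably need several such rules and a real tagger or parser rather than hand-tagged input.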

Keywords

Automatic Knowledge Extraction; Relation extraction; Natural Language Processing; Semantic Web; Triples Extraction

Citations & Related Records

Reference

1	Y. Ouyang, W. Li, and R. Zhang, "273. Task 5. keyphrase extraction based on core word identification and word expansion," in Proceedings of the 5th international workshop on semantic evaluation, Uppsala, Sweden, pp. 142-145, 2010.
2	D. Mahata, J. Kuriakose, R. R. Shah, and R. Zimmermann, "Key2Vec: Automatic ranked keyphrase extraction from scientific articles using phrase embeddings," in Proceedings of NAACL-HLT 2018, New Orleans: LA, USA, vol. 2, pp. 634-639, 2018. DOI: 10.18653/v1/N18-2100.
3	G. Rabby, S. Azad, M. Mahmud, K. Z. Zamli, and M. M. Rahman, "A flexible keyphrase extraction technique for academic literature," in Procedia Computer Science, Tangerang, Indonesia, vol. 135, pp. 553-563, 2018. DOI: 10.1016/j.procs.2018.08.208.
4	O. Etzioni, M. Cafarella, D. Downey, S. Kok, A. -M. Popescu, T. Shaked, S. Soderland, D. S. Weld, and A. Yates, "Web-scale information extraction in KnowItAll: (preliminary results)," in Proceedings of the 13th international conference on World Wide Web, New York: NY, USA, pp. 100-110, May 2004. DOI: 10.1145/988672.988687.
5	O. Etzioni, M. Cafarella, D. Downey, A. -M. Popescu, T. Shaked, S. Soderland, D. S. Weld, and A. Yates, "Unsupervised named-entity extraction from the web: An experimental study," Artificial Intelligence, vol. 165, no. 1, pp. 91-134, Jun. 2005. DOI: 10.1016/j.artint.2005.03.001.
6	D. Q. Nguyen and K. Verspoor, "Convolutional neural networks for chemical-disease relation extraction are improved with character-based word embeddings," in Proceedings of the BioNLP 2018 workshop, Melbourne, Australia, pp. 129-136, May 2018. DOI: 10.18653/v1/W18-2314.
7	S. Pawar, G. K. Palshikar, and P. Bhattacharyya, "Relation extraction: A survey," arXiv:1712.05191 [cs], Dec. 2017. DOI: 10.1007/978-981-10-7359-5_6.
8	G. Bordea, E. Lefever, and P. Buitelaar, "Semeval-2016 task 13: Taxonomy extraction evaluation (texeval-2)," in SemEval-2016, San Diego: CA, USA, pp. 1081-1091, 2016. DOI: 10.18653/v1/S16-1168.
9	A. Yates, M. Banko, M. Broadhead, M. Cafarella, O. Etzioni, and S. Soderland, "TextRunner: open information extraction on the web," in Proceedings of Human Language Technologies: The Annual Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations on XX - NAACL '07, Rochester: NY, USA, pp. 25-26, 2007.
10	A. Panchenko, S. Faralli, E. Ruppert, S. Remus, H. Naets, C. Fairon, S. P. Ponzetto, and C. Biemann, "TAXI at SemEval-2016 Task 13: A taxonomy induction method based on lexico-syntactic patterns, substrings and focused crawling," in Proceedings of SemEval, San Diego: CA, USA, pp. 1320-1327, 2016. DOI: 10.18653/v1/S16-1206.
11	O. Etzioni, A. Fader, J. Christensen, S. Soderland, and M. Mausam, "Open information extraction: The second generation," in IJCAI, 2011, vol. 11, pp. 3-10. Accessed: Jul. 04, 2017. DOI: 10.5591/978-1-57735-516-8/IJCAI11-012.
12	H. M. M. Hasan, F. Sanyal, D. Chaki, and M. H. Ali, "An empirical study of important keyword extraction techniques from documents," in 2017 1st International Conference on Intelligent Systems and Information Management (ICISIM), Aurangabad, India, pp. 91-94, Oct. 2017. DOI: 10.1109/ICISIM.2017.8122154.
13	D. Bennet and A. Bennet, "The depth of knowledge: surface, shallow or deep?," VINE, vol. 38, no. 4, pp. 405-420, Oct. 2008. DOI: 10.1108/03055720810917679.
14	J. Fan, A. Kalyanpur, D. C. Gondek, and D. A. Ferrucci, "Automatic knowledge extraction from documents," IBM Journal of Research and Development, vol. 56, no. 3.4, pp. 5:1-5:10, May 2012. DOI: 10.1147/JRD.2012.2186519.
15	T. M. Mitchell, W. Cohen, E. Hruschka, P. Talukdar, B. Yang, J. Betteridge, A. Carlson, B. Dalvi, M. Gardner, B. Kisiel, J. Krishnamurthy, N. Lao, K. Mazaitis, T. Mohamed, N. Nakashole, E. Platanios, A. Ritter, M. Samadi, B. Settles, R. Wang, D. Wijaya, A. Gupta, X. Chen, A. Saparov, M. Greaves, and J. Welling, "Never-ending learning," Communications of the ACM, vol. 61, no. 5, pp. 103-115, May 2018. DOI: 10.1145/3191513.
16	M. Mintz, S. Bills, R. Snow, and D. Jurafsky, "Distant supervision for relation extraction without labeled data," in Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP: Volume 2-Volume 2, Suntec, Singapore, vol. 2, pp. 1003-1011, 2009. DOI: 10.3115/1690219.1690287.
17	Y. Matsuo and M. Ishizuka, "Keyword extraction from a single document using word cooccurrence statistical information," International Journal on Artificial Intelligence Tools, vol. 13, no. 1, pp. 157-169, Mar. 2004. DOI: 10.1142/S0218213004001466.
18	P. K. Shah, C. Perez-Iratxeta, P. Bork, and M. A. Andrade, "Information extraction from full text scientific articles: where are the keywords?," BMC bioinformatics, vol. 4, no. 1, p. 20, May 2003. DOI: 10.1186/1471-2105-4-20.
19	A. Tixier, F. Malliaros, and M. Vazirgiannis, "A graph degeneracy-based approach to keyword extraction," in Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin: TX, USA, pp. 1860-1870, 2016. DOI: 10.18653/v1/D16-1191.
20	Z. Liu, P. Li, Y. Zheng, and M. Sun, "Clustering to find exemplar terms for keyphrase extraction," in Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1-Volume 1, Singapore, pp. 257-266, Aug. 2009. DOI: 10.3115/1699510.1699544.
21	A. Ritter, S. Clark, Mausam, and O. Etzioni, "Named entity recognition in tweets: An experimental study," in Proceedings of the Conference on Empirical Methods in Natural Language Processing, Edinburgh, U.K., pp. 1524-1534, Jul. 2011.
22	K. Gábor, D. Buscaldi, A. -K. Schumann, B. QasemiZadeh, H. Zargayouna, and T. Charnois, "SemEval-2018 Task 7: Semantic relation extraction and classification in scientific papers," in Proceedings of the 12th International Workshop on Semantic Evaluation (SemEval-2018), New Orleans: LA, USA, pp. 679-688, 2018. DOI: 10.18653/v1/S18-1111.
23	S. Beliga, A. Mestrovic, and S. Martincic-Ipsic, "An overview of graph-based key words extraction methods and approaches," Journal of information and organizational sciences and JIOS, vol. 39, no. 1, pp. 1-20, Jul. 2015.
24	S. K. Bharti, K. S. Babu, and A. Pradhan, "Automatic keyword extraction for text summarization in multi-document e-newspapers articles," European Journal of Advances in Engineering and Technology, vol. 4, no. 6, pp. 410-427, 2017.
25	S. N. Kim, O. Medelyan, M. -Y. Kan, and T. Baldwin, "Automatic keyphrase extraction from scientific articles," Language Resources and Evaluation, vol. 47, pp. 723-742, Dec. 2013. DOI: 10.1007/s10579-012-9210-3.
26	K. Bennani-Smires, C. Musat, A. Hossmann, M. Baeriswyl, and M. Jaggi, "Simple unsupervised keyphrase extraction using sentence embeddings," in Proceedings of the 22nd Conference on Computational Natural Language Learning, Brussels, Belgium, pp. 221-229, Jan. 2018. DOI: 10.18653/v1/K18-1022.
27	P. Maitra and D. Das, "JUNLP at SemEval-2016 Task 13: A language independent approach for hypernym identification," in Proceedings of SemEval, San Diego: CA, USA, pp. 1310-1314, 2016. DOI: 10.18653/v1/S16-1204.
28	F. Wu and D. S. Weld, "Open information extraction using Wikipedia," in Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, Uppsala, Sweden, pp. 118-127, 2010.
29	S. Soderland, B. Roof, B. Qin, S. Xu, Mausam, and O. Etzioni, "Adapting open information extraction to domain-specific relations," AI Magazine, vol. 31, pp. 93-102, Jul. 2010. DOI: 10.1609/aimag.v31i3.2305.
30	"BBC News Summary", Kaggle [Online]. Available: https://www.kaggle.com/pariza/bbc-news-summary (accessed May 20, 2020)