Comprehensive Knowledge Archive Network harvester improvement for efficient open-data collection and management

  • Kim, Dasol (Department of Computer Science and Engineering, Interdisciplinary Graduate Program in Medical Bigdata Convergence, Kangwon National University) ;
  • Gil, Myeong-Seon (Department of Computer Science and Engineering, Interdisciplinary Graduate Program in Medical Bigdata Convergence, Kangwon National University) ;
  • Nguyen, Minh Chau (CybreBrain Section, Future and Basic Technology Research Division, Electronics and Telecommunications Research Institute) ;
  • Won, Heesun (CybreBrain Section, Future and Basic Technology Research Division, Electronics and Telecommunications Research Institute) ;
  • Moon, Yang-Sae (Department of Computer Science and Engineering, Interdisciplinary Graduate Program in Medical Bigdata Convergence, Kangwon National University)
  • Received : 2020.08.04
  • Accepted : 2021.01.22
  • Published : 2021.10.01

Abstract

As more data are opened to the public, the Comprehensive Knowledge Archive Network (CKAN), an open-source data distribution platform, is drawing considerable attention. CKAN is used together with extensions such as Datastore and Datapusher for data management and Harvest and DCAT for data collection. This study identifies problems in CKAN itself and in the Harvest Extension. First, data deletion in CKAN causes two problems: data inconsistency and wasted storage space. Second, the Harvest Extension causes three additional problems: source deletion, in which only the source is deleted while the data themselves remain; job stop, in which a job cannot be deleted while data collection is in progress; and service interruption, in which service cannot be provided even though the data exist. Based on these observations, we propose an improved CKAN that provides a new deletion function to resolve the data inconsistency and storage waste problems. In addition, we present an improved Harvest Extension that resolves the three problems of the legacy Harvest Extension. We verify the correctness and usefulness of the improved CKAN and Harvest Extension functions through an actual implementation and extensive experiments.
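The deletion problems named in the abstract can be illustrated with a small in-memory model. The sketch below is not CKAN or Harvest Extension code, and the abstract does not spell out the underlying mechanism; it assumes, for illustration only, that a legacy delete changes only a dataset's state or removes only the source record, while an improved delete removes the metadata and the stored rows together. The class `ToyCatalog` and all of its method names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ToyCatalog:
    """Hypothetical in-memory stand-in for a data catalog (not the CKAN API)."""
    sources: dict = field(default_factory=dict)    # source_id -> set of dataset ids
    datasets: dict = field(default_factory=dict)   # dataset_id -> "active" or "deleted"
    datastore: dict = field(default_factory=dict)  # dataset_id -> list of stored rows

    def add_dataset(self, source_id, dataset_id, rows):
        self.sources.setdefault(source_id, set()).add(dataset_id)
        self.datasets[dataset_id] = "active"
        self.datastore[dataset_id] = list(rows)

    def legacy_delete_dataset(self, dataset_id):
        # Assumed legacy behavior: only the state changes, stored rows stay.
        self.datasets[dataset_id] = "deleted"

    def legacy_delete_source(self, source_id):
        # Assumed legacy behavior: the source record goes, its datasets remain.
        self.sources.pop(source_id, None)

    def improved_delete_dataset(self, dataset_id):
        # Remove metadata, stored rows, and source membership together.
        self.datasets.pop(dataset_id, None)
        self.datastore.pop(dataset_id, None)
        for members in self.sources.values():
            members.discard(dataset_id)

    def improved_delete_source(self, source_id):
        # Delete every dataset collected from the source, then the source itself.
        for dataset_id in list(self.sources.get(source_id, ())):
            self.improved_delete_dataset(dataset_id)
        self.sources.pop(source_id, None)

    def orphaned_rows(self):
        # Count stored rows whose dataset is missing or no longer active.
        return sum(
            len(rows)
            for dataset_id, rows in self.datastore.items()
            if self.datasets.get(dataset_id) != "active"
        )


# Legacy path: rows survive the delete, and the dataset outlives its source.
legacy = ToyCatalog()
legacy.add_dataset("src", "d1", [1, 2, 3])
legacy.legacy_delete_dataset("d1")
legacy.legacy_delete_source("src")

# Improved path: deleting the source leaves nothing behind.
improved = ToyCatalog()
improved.add_dataset("src", "d1", [1, 2, 3])
improved.add_dataset("src", "d2", [4])
improved.improved_delete_source("src")
```

In this toy model, `legacy.orphaned_rows()` stays above zero after the legacy deletes, which mirrors the inconsistency and wasted storage described above, whereas the improved path leaves every table empty.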

Acknowledgement

This work was partly supported by an Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2020-0-00077, Core Technology Development for Intelligently Searching and Utilizing Big Data based on DataMap) and a National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. 2019R1A2C1085311).
