http://dx.doi.org/10.4218/etrij.2020-0298

Comprehensive Knowledge Archive Network harvester improvement for efficient open-data collection and management  

Kim, Dasol (Department of Computer Science and Engineering, Interdisciplinary Graduate Program in Medical Bigdata Convergence, Kangwon National University)
Gil, Myeong-Seon (Department of Computer Science and Engineering, Interdisciplinary Graduate Program in Medical Bigdata Convergence, Kangwon National University)
Nguyen, Minh Chau (CybreBrain Section, Future and Basic Technology Research Division, Electronics and Telecommunications Research Institute)
Won, Heesun (CybreBrain Section, Future and Basic Technology Research Division, Electronics and Telecommunications Research Institute)
Moon, Yang-Sae (Department of Computer Science and Engineering, Interdisciplinary Graduate Program in Medical Bigdata Convergence, Kangwon National University)
Publication Information
ETRI Journal / v.43, no.5, 2021, pp. 835-855
Abstract
With the recent increase in data disclosure, the Comprehensive Knowledge Archive Network (CKAN), an open-source data distribution platform, is drawing much attention. CKAN is used together with extensions such as Datastore and Datapusher for data management and Harvest and DCAT for data collection. This study identifies problems in CKAN itself and in the Harvest Extension. First, data deletion in CKAN causes two problems: data inconsistency and wasted storage space. Second, the Harvest Extension causes three additional problems: source deletion, which deletes only the source without deleting the data themselves; job stop, in which a job cannot be deleted during data collection; and service interruption, in which service cannot be provided even if data exist. Based on these observations, we propose an improved CKAN that provides a new deletion function solving the data inconsistency and storage space waste problems. In addition, we present an improved Harvest Extension that solves the three problems of the legacy Harvest Extension. We verify the correctness and usefulness of the improved CKAN and Harvest Extension functions through actual implementation and extensive experiments.
Keywords
CKAN; CKAN harvester; DCAT; harvest extension; open data