A Study on Duplicate Detection Algorithm in Union Catalog

Cho, Sun-Yeong

doi:10.4275/KSLIS.2003.37.4.069

Journal of the Korean Society for Library and Information Science (한국문헌정보학회지)

Volume 37 Issue 4 / Pages 69-88 / 2003 / 1225-598X (pISSN)

Korean Society for Library and Information Science (한국문헌정보학회)

종합목록의 중복레코드 검증을 위한 알고리즘 연구

Cho, Sun-Yeong (Academic Research Information Office, Korea Education and Research Information Service (KERIS))

Published : 2003.12.01

https://doi.org/10.4275/KSLIS.2003.37.4.069

Abstract

This study develops a new duplicate detection algorithm to improve database quality. The algorithm varies its analysis by language and bibliographic type, and it checks elements within the bibliographic data rather than only whole MARC fields. It computes the degree of similarity between elements and applies weight values, so that records are not eliminated because of a simple input error. The study was performed on 7,649 records newly uploaded during the preceding year, matched against a sample master database of 210,000 records. The findings show that the new algorithm improved the duplicate recall rate by 36.2%.

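This study developed a new type of duplicate detection algorithm to improve the quality of the KERIS union catalog. The new algorithm departs from the currently applied approach of checking whether MARC data match, and instead applies different comparison methods by language and by bibliographic type. In addition, it measures the similarity between the compared elements and assigns differential weights according to the importance of each element. To demonstrate the effectiveness of the new algorithm, 210,000 records recently uploaded to the union catalog were extracted to build an experimental master file, and 7,649 records were processed with both algorithms. The new algorithm detected duplicate records at a rate 36.2% higher.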
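
The combination of element-level similarity and importance-based weights described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the element names, the weight values, the 0.85 threshold, and the use of difflib.SequenceMatcher as the similarity measure are all assumptions made for illustration, and the sample records are invented. The abstract does not publish the actual elements, weights, or cut-off, and the sketch leaves out the per-language and per-type rules entirely.

```python
from difflib import SequenceMatcher

# Hypothetical element weights: the abstract says elements are weighted by
# importance but does not give the values, so these are placeholders.
WEIGHTS = {"title": 0.5, "author": 0.2, "publisher": 0.1, "year": 0.2}

# Assumed cut-off at or above which two records are treated as duplicates.
THRESHOLD = 0.85


def normalize(value: str) -> str:
    """Lower-case the value and collapse runs of whitespace."""
    return " ".join(value.lower().split())


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity; an element missing on either side scores 0."""
    a, b = normalize(a), normalize(b)
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, a, b).ratio()


def duplicate_score(rec_a: dict, rec_b: dict) -> float:
    """Weighted sum of per-element similarities (the weights sum to 1)."""
    return sum(
        weight * similarity(rec_a.get(name, ""), rec_b.get(name, ""))
        for name, weight in WEIGHTS.items()
    )


def is_duplicate(rec_a: dict, rec_b: dict, threshold: float = THRESHOLD) -> bool:
    """Flag a pair as duplicates when the weighted score reaches the threshold."""
    return duplicate_score(rec_a, rec_b) >= threshold


# Invented records for illustration only.
master = {"title": "Library Cataloging Basics", "author": "Kim, A.",
          "publisher": "Example Press", "year": "2003"}
upload = {"title": "Library Cataloging Bascis", "author": "Kim, A.",
          "publisher": "Example Press", "year": "2003"}  # title has a typo
other = {"title": "Advanced Metadata Schemas", "author": "Lee, B.",
         "publisher": "Other House", "year": "1999"}

print(upload == master)             # exact comparison misses the typo'd record
print(is_duplicate(master, upload))
print(is_duplicate(master, other))
```

An exact field-by-field comparison treats the mistyped upload as a different record, while the weighted score still places it above the assumed threshold. This is the kind of input error the abstract says the similarity and weight values are meant to tolerate.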
