• Title/Summary/Keyword: Text Retrieval

Search Result 342, Processing Time 0.029 seconds

Word Extraction from Table Regions in Document Images (문서 영상 내 테이블 영역에서의 단어 추출)

  • Jeong, Chang-Bu;Kim, Soo-Hyung
    • The KIPS Transactions:PartB
    • /
    • v.12B no.4 s.100
    • /
    • pp.369-378
    • /
    • 2005
  • Document image is segmented and classified into text, picture, or table by a document layout analysis, and the words in table regions are significant for keyword spotting because they are more meaningful than the words in other regions. This paper proposes a method to extract words from table regions in document images. As word extraction from table regions is practically regarded extracting words from cell regions composing the table, it is necessary to extract the cell correctly. In the cell extraction module, table frame is extracted first by analyzing connected components, and then the intersection points are extracted from the table frame. We modify the false intersections using the correlation between the neighboring intersections, and extract the cells using the information of intersections. Text regions in the individual cells are located by using the connected components information that was obtained during the cell extraction module, and they are segmented into text lines by using projection profiles. Finally we divide the segmented lines into words using gap clustering and special symbol detection. The experiment performed on In table images that are extracted from Korean documents, and shows $99.16\%$ accuracy of word extraction.

Developing of Text Plagiarism Detection Model using Korean Corpus Data (한글 말뭉치를 이용한 한글 표절 탐색 모델 개발)

  • Ryu, Chang-Keon;Kim, Hyong-Jun;Cho, Hwan-Gue
    • Journal of KIISE:Computing Practices and Letters
    • /
    • v.14 no.2
    • /
    • pp.231-235
    • /
    • 2008
  • Recently we witnessed a few scandals on plagiarism among academic paper and novels. Plagiarism on documents is getting worse more frequently. Although plagiarism on English had been studied so long time, we hardly find the systematic and complete studies on plagiarisms in Korean documents. Since the linguistic features of Korean are quite different from those of English, we cannot apply the English-based method to Korean documents directly. In this paper, we propose a new plagiarism detecting method for Korean, and we throughly tested our algorithm with one benchmark Korean text corpus. The proposed method is based on "k-mer" and "local alignment" which locates the region of plagiarized document pairs fast and accurately. Using a Korean corpus which contains more than 10 million words, we establish a probability model (or local alignment score (random similarity by chance). The experiment has shown that our system was quite successful to detect the plagiarized documents.

Question Similarity Measurement of Chinese Crop Diseases and Insect Pests Based on Mixed Information Extraction

  • Zhou, Han;Guo, Xuchao;Liu, Chengqi;Tang, Zhan;Lu, Shuhan;Li, Lin
    • KSII Transactions on Internet and Information Systems (TIIS)
    • /
    • v.15 no.11
    • /
    • pp.3991-4010
    • /
    • 2021
  • The Question Similarity Measurement of Chinese Crop Diseases and Insect Pests (QSM-CCD&IP) aims to judge the user's tendency to ask questions regarding input problems. The measurement is the basis of the Agricultural Knowledge Question and Answering (Q & A) system, information retrieval, and other tasks. However, the corpus and measurement methods available in this field have some deficiencies. In addition, error propagation may occur when the word boundary features and local context information are ignored when the general method embeds sentences. Hence, these factors make the task challenging. To solve the above problems and tackle the Question Similarity Measurement task in this work, a corpus on Chinese crop diseases and insect pests(CCDIP), which contains 13 categories, was established. Then, taking the CCDIP as the research object, this study proposes a Chinese agricultural text similarity matching model, namely, the AgrCQS. This model is based on mixed information extraction. Specifically, the hybrid embedding layer can enrich character information and improve the recognition ability of the model on the word boundary. The multi-scale local information can be extracted by multi-core convolutional neural network based on multi-weight (MM-CNN). The self-attention mechanism can enhance the fusion ability of the model on global information. In this research, the performance of the AgrCQS on the CCDIP is verified, and three benchmark datasets, namely, AFQMC, LCQMC, and BQ, are used. The accuracy rates are 93.92%, 74.42%, 86.35%, and 83.05%, respectively, which are higher than that of baseline systems without using any external knowledge. Additionally, the proposed method module can be extracted separately and applied to other models, thus providing reference for related research.

A Text Network Analysis of North Korean Library Journal, 『Reference Materials for Librarian』 (북한 도서관잡지 『도서관일군 참고자료』의 텍스트 네트워크 분석)

  • Lee, Seongsin;Kim, Hyunsook;Baek, Sumin;Yoon, Subin;Choi, Jae-Hwang
    • Journal of Korean Library and Information Science Society
    • /
    • v.53 no.3
    • /
    • pp.169-191
    • /
    • 2022
  • The purpose of this study is to attempt a text network analysis for two years of 『Reference Materials for Librarian』 (2016-2017) published by the Library Operation Methodology Research Institute in North Korea. A text network analysis can measure how important a particular word by grasping the connectivity and relationship between words beyond a simple word frequency analysis, and it is also possible to interpret specific social phenomena and derive implications. Frequency, degree centrality, the betweenness centrality, community analysis of the collected words were calculated using NetMiner. As a result, the terms 'users', 'information services', 'information needs', 'information technology', 'social learning', 'computers', 'databases', 'information acquisition', 'information retrieval' and 'librarian' were appeared as important ones in understanding North Korean libraries.

Analysis of Research Trends on Archival Information Services Using Text Mining (텍스트마이닝을 활용한 국내외 기록서비스 연구동향 분석)

  • Seohee Park;Hye-Eun Lee
    • Journal of Korean Society of Archives and Records Management
    • /
    • v.24 no.1
    • /
    • pp.89-109
    • /
    • 2024
  • The study analyzed the research trends of domestic and international record information services from 2003 to 2022. A total of 136 academic papers registered in the Korea Citation Index (KCI) and 74 from the Library, Information Science & Technology Abstracts (LISTA) were examined by quantitative and qualitative content analysis to understand the research status of 20 years from various angles, such as publication year, research type, researcher type, subject, and purpose. Frequency analysis, co-occurrence frequency analysis, centrality analysis, and topic modeling were performed by applying text mining techniques. Results showed that domestic papers demonstrated a research flow focused on specific institutions or records, and user-centered satisfaction surveys and content-centered studies were conducted. Moreover, foreign papers confirmed various evaluation-oriented and information provision studies, such as data, resources, and collections, along with the research trend focusing on the relationship between archivists and users. The management of information resources was identified as a common topic in both domestic and foreign papers, but it is possible to identify that domestic research focuses on maintaining the quality of domestic information resources, while foreign research focuses on the storage and retrieval of information.

RGB Channel Selection Technique for Efficient Image Segmentation (효율적인 이미지 분할을 위한 RGB 채널 선택 기법)

  • 김현종;박영배
    • Journal of KIISE:Software and Applications
    • /
    • v.31 no.10
    • /
    • pp.1332-1344
    • /
    • 2004
  • Upon development of information super-highway and multimedia-related technoiogies in recent years, more efficient technologies to transmit, store and retrieve the multimedia data are required. Among such technologies, firstly, it is common that the semantic-based image retrieval is annotated separately in order to give certain meanings to the image data and the low-level property information that include information about color, texture, and shape Despite the fact that the semantic-based information retrieval has been made by utilizing such vocabulary dictionary as the key words that given, however it brings about a problem that has not yet freed from the limit of the existing keyword-based text information retrieval. The second problem is that it reveals a decreased retrieval performance in the content-based image retrieval system, and is difficult to separate the object from the image that has complex background, and also is difficult to extract an area due to excessive division of those regions. Further, it is difficult to separate the objects from the image that possesses multiple objects in complex scene. To solve the problems, in this paper, I established a content-based retrieval system that can be processed in 5 different steps. The most critical process of those 5 steps is that among RGB images, the one that has the largest and the smallest background are to be extracted. Particularly. I propose the method that extracts the subject as well as the background by using an Image, which has the largest background. Also, to solve the second problem, I propose the method in which multiple objects are separated using RGB channel selection techniques having optimized the excessive division of area by utilizing Watermerge's threshold value with the object separation using the method of RGB channels separation. The tests proved that the methods proposed by me were superior to the existing methods in terms of retrieval performances insomuch as to replace those methods that developed for the purpose of retrieving those complex objects that used to be difficult to retrieve up until now.

Similar sub-Trajectory Retrieval Technique based on Grid for Video Data (비디오 데이타를 위한 그리드 기반의 유사 부분 궤적 검색 기법)

  • Lee, Ki-Young;Lim, Myung-Jae;Kim, Kyu-Ho;Kim, Joung-Joon
    • The Journal of the Institute of Internet, Broadcasting and Communication
    • /
    • v.9 no.5
    • /
    • pp.183-189
    • /
    • 2009
  • Recently, PCS, PDA and mobile devices, such as the proliferation of spread, GPS (Global Positioning System) the use of, the rapid development of wireless network and a regular user even images, audio, video, multimedia data, such as increased use is for. In particular, video data among multimedia data, unlike the moving object, text or image data that contains information about the movements and changes in the space of time, depending on the kinds of changes that have sigongganjeok attributes. Spatial location of objects on the flow of time, changing according to the moving object (Moving Object) of the continuous movement trajectory of the meeting is called, from the user from the database that contains a given query trajectory and data trajectory similar to the finding of similar trajectory Search (Similar Sub-trajectory Retrieval) is called. To search for the trajectory, and these variations, and given the similar trajectory of the user query (Tolerance) in the search for a similar trajectory to approximate data matching (Approximate Matching) should be available. In addition, a large multimedia data from the database that you only want to be able to find a faster time-effective ways to search different from the existing research is required. To this end, in this paper effectively divided into a grid to search for the trajectory to the trajectory of moving objects, similar to the effective support of the search trajectory offers a new grid-based search techniques.

  • PDF

Design for Database Retrieval System using Virtual Database in Intranet (인트라넷에서 가상데이터베이스를이용한 데이터베이스 검색 시스템의 설계)

  • Lee, Dong-Wook;Park, Young-Bae
    • The Transactions of the Korea Information Processing Society
    • /
    • v.5 no.6
    • /
    • pp.1404-1417
    • /
    • 1998
  • Currently, there exists two different methods for database retrieval in the internet. First is to use the search engine and the second is to use the plug-in or ActiveX technology, If a search engine, which makes use of indices built from keywords of simple text data in order to do a search, is used when accessing a database, first it is not possible to access more than one database at a time, second it is also not possible to support various conditional retrievals as in using query language, and third the set of data received might include many unwanted data, in other words, precision rate might be relatively low. Plug in or Active technology make use of Web browset to execute chents' query in order to do a database retrieval. Problems associated with this is that it is not possible to activate more than one DBMS simultaneously even if they are of the same data model. sefond it is not possible to execute a user query other than the ones thai arc previou sly defined by the client program In this paper, to resolve those aforementioned problems we design and implement database retrieval system using a virtual database, which makes it possible to provide direct query jntertacc through the conventional Web browser. We assume that the virtual database is designed and aggregated from more than one relational database using the same data model.

  • PDF

Automatic Foreign Word Transliteration Model for Information Retrieval (정보검색을 위한 외래어 자동표기 모델)

  • 이재성;최기선
    • Proceedings of the Korean Society for Information Management Conference
    • /
    • 1997.08a
    • /
    • pp.17-24
    • /
    • 1997
  • 조사에 따르면 한글 문서에서 사용되는 단어 중 외래어 또는 영어가 포함된 단어가 약 26%정도를 차지하고 있으며, 이는 정보검색의 중요 색인어로 사용된다(권윤형 1996). 그러나 이들 단어들은 서로 같은 단어인데도 영어로 표기되기도 하고 이형의 외래어들로 표기되기도 하여, 정보검색의 효율을 떨어뜨리고 있다. 본 논문에서는 영어 단어와 그에 대응되어 표기되는 외래어들을 찾기 위한 한 단계로서, 영어를 한글로 음차(transliteration)하여 자동표기하는 통계적 모델을 제안하고 실험한다. 제안된 모델은 통계적 기계번역 방식과 그의 한 방법인 문서 정렬(text alignment) 방식에 근거하고 있다. 특히 이 모델에서는 효과적으로 발음의 단위를 분리한 다음 정렬을 하여. 전체적인 계산량을 줄이고 성능도 향상시켰다. 음차표기는 피봇방식과 직접방식의 두가지로 구현하였다. 피봇방식은 영어에서 발음을 생성한 후, 그 발음을 다시 한글로 표기하는 방식이고, 직접방식은 직접 영어 단어에서 한글 표기로 포기하는 방식이다. 두 방식을 제안된 모델을 이용하여 비교 테스트한 결과 직접방식이 보다 정확하게 표준 외래어로 표기하였다.

  • PDF

Enabling a fast annotation process with the Table2Annotation tool

  • Larmande, Pierre;Jibril, Kazim Muhammed
    • Genomics & Informatics
    • /
    • v.18 no.2
    • /
    • pp.19.1-19.6
    • /
    • 2020
  • In semantic annotation, semantic concepts are linked to natural language. Semantic annotation helps in boosting the ability to search and access resources and can be used in information retrieval systems to augment the queries from the user. In the research described in this paper, we aimed to identify ontological concepts in scientific text contained in spreadsheets. We developed a tool that can handle various types of spreadsheets. Furthermore, we used the NCBO Annotator API provided by BioPortal to enhance the semantic annotation functionality to cover spreadsheet data. Table2Annotation has strengths in certain criteria such as speed, error handling, and complex concept matching.