Search | Korea Science

An Efficient Information Retrieval System for Unstructured Data Using Inverted Index

Abdullah Iftikhar;Muhammad Irfan Khan;Kulsoom Iftikhar
- International Journal of Computer Science & Network Security, v.24 no.7, pp.31-44, 2024
An inverted index is a combination of keywords and the posting lists associated with them, used for indexing documents. In the modern age, heavy use of technology has increased data volume at a very high rate, and big data is a major concern for researchers. Efficient document indexing in big data has become a major challenge. All organizations and web search engines have limited resources, such as space and storage, which are crucial to data management in an information retrieval system. An information retrieval system therefore needs to be very efficient. This research introduces an inverted indexing technique to minimize the delay in retrieving data in an information retrieval system. The inverted index is illustrated, and its issues are then discussed and resolved by implementing a scalable inverted index. The existing inverted index algorithm is then compared with the naïve inverted index. The interval list of the inverted index is stored in primary storage rather than auxiliary memory. This research proposes an efficient information retrieval system architecture, particularly for unstructured data, which has no predefined structure, format, or data volume.
https://doi.org/10.22937/IJCSNS.2024.24.7.4
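
The keyword-to-posting-list structure described in this abstract can be sketched in a few lines of Python. This is only an illustrative in-memory version, not the architecture proposed in the paper; the function names, the tokenizer, and the sample documents are all assumptions made for the example.

```python
from collections import defaultdict
import re

def build_inverted_index(docs):
    """Map each keyword to a sorted posting list of document IDs."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Hypothetical tokenizer: lowercase alphanumeric runs.
        for token in re.findall(r"[a-z0-9]+", text.lower()):
            index[token].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def search_all(index, query):
    """Return IDs of documents that contain every query term (AND semantics)."""
    terms = re.findall(r"[a-z0-9]+", query.lower())
    if not terms:
        return []
    postings = [set(index.get(t, ())) for t in terms]
    return sorted(set.intersection(*postings))

# Made-up sample collection of unstructured text.
docs = {
    1: "Big data needs efficient document indexing",
    2: "An inverted index maps keywords to posting lists",
    3: "Efficient retrieval uses an inverted index",
}
index = build_inverted_index(docs)
```

A query such as `search_all(index, "inverted index")` only touches the posting lists of the two query terms, which is why lookup cost depends on list length rather than on the size of the whole collection.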

Segment-Based Inverted Index for Querying Large XML Documents (대용량 XML 문서의 효율적인 질의 처리를 위한 세그먼트 기반 역 인덱스)

Jeong, Byeong-Soo;Lee, Hiye-Ja
- Journal of Information Technology Services, v.7 no.3, pp.145-157, 2008
Existing XML storage methods that use the relational data model usually store path information for every node type, including literal contents, in order to preserve the structural information of XML documents. Such path information is usually maintained in an inverted index so that XPath queries over large XML documents can be processed efficiently. In this study, we propose an improved approach that retrieves information from a large volume of XML documents stored in a relational database, using a segment-based inverted index for path searches. Our approach reduces the number of inverted index lookups needed to obtain the target path information. We show the effectiveness of this approach through several experiments that compare its XPath query performance with that of existing methods.

An Efficient Inverted Index Technique based on RDBMS for XML Documents (XML 문서에 대한 RDBMS에 기반을 둔 효율적인 역색인 기법)

서치영;이상원;김형주
- Journal of KIISE:Databases, v.30 no.1, pp.27-40, 2003
The inverted index widely used in the information retrieval field should be extended for XML documents so that XML information retrieval systems can support containment queries. In this paper, we consider the two methods suggested in previous work for storing the inverted index and processing containment queries over XML documents: using an RDBMS or using an inverted list engine. The way previous work extends the inverted index has two drawbacks. One is that using an RDBMS performs much worse than using an inverted list engine. The other is that, when containment queries are processed in an RDBMS, the number of join operations increases with the path length of the query, and each join always takes place between large tables. In this paper, we extend the inverted index in a different way to solve these problems and show the effectiveness of using an RDBMS.

n-Gram/2L: A Space and Time Efficient Two-Level n-Gram Inverted Index Structure (n-gram/2L: 공간 및 시간 효율적인 2단계 n-gram 역색인 구조)

Kim Min-Soo;Whang Kyu-Young;Lee Jae-Gil;Lee Min-Jae
- Journal of KIISE:Databases, v.33 no.1, pp.12-31, 2006
The n-gram inverted index has two major advantages: it is language-neutral and error-tolerant. Due to these advantages, it has been widely used in information retrieval and in similar sequence matching for DNA and protein databases. Nevertheless, the n-gram inverted index also has drawbacks: its size tends to be very large, and query performance tends to be poor. In this paper, we propose the two-level n-gram inverted index (simply, the n-gram/2L index), which significantly reduces the size and improves the query performance while preserving the advantages of the n-gram inverted index. The proposed index eliminates the redundancy of the position information that exists in the n-gram inverted index. The proposed index is constructed in two steps: 1) extracting subsequences of length m from documents and 2) extracting n-grams from those subsequences. We formally prove that this two-step construction is identical to the relational normalization process that removes the redundancy caused by a non-trivial multivalued dependency. The n-gram/2L index has excellent properties: 1) it significantly reduces the size and improves the performance compared with the n-gram inverted index, with these improvements becoming more marked as the database size gets larger; 2) the query processing time increases only very slightly as the query length gets longer. Experimental results using databases of 1 GByte show that the size of the n-gram/2L index is reduced by up to 1.9 to 2.7 times and, at the same time, the query performance is improved by up to 13.1 times compared with those of the n-gram inverted index.
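
The two-step construction named in this abstract can be illustrated with a small Python sketch. It is not the authors' implementation: the function names and sample strings are hypothetical, and the choice to let neighbouring subsequences overlap by n-1 characters (so that no n-gram straddling a boundary is lost) is an assumption that the abstract itself does not state.

```python
from collections import defaultdict

def build_two_level_index(docs, m=4, n=2):
    """Step 1: cut documents into length-m subsequences; step 2: n-gram those."""
    step = m - n + 1  # assumption: neighbours overlap by n-1 characters
    subseq_ids = {}                  # subsequence text -> integer ID
    back_end = defaultdict(list)     # subsequence ID -> [(doc_id, offset)]
    for doc_id, text in docs.items():
        for off in range(0, max(len(text) - n + 1, 1), step):
            sub = text[off:off + m]
            sid = subseq_ids.setdefault(sub, len(subseq_ids))
            back_end[sid].append((doc_id, off))
    front_end = defaultdict(list)    # n-gram -> [(subsequence ID, inner offset)]
    for sub, sid in subseq_ids.items():
        for i in range(len(sub) - n + 1):
            front_end[sub[i:i + n]].append((sid, i))
    return dict(front_end), dict(back_end)

def ngram_positions(front_end, back_end, gram):
    """Resolve an n-gram to absolute (doc_id, position) pairs via both levels."""
    hits = set()
    for sid, inner in front_end.get(gram, []):
        for doc_id, off in back_end[sid]:
            hits.add((doc_id, off + inner))
    return sorted(hits)

# Made-up sample data; "abca" occurs twice in document 1.
docs = {1: "abcabcab", 2: "bcabca"}
front_end, back_end = build_two_level_index(docs)
```

In this example the subsequence `"abca"` occurs twice in document 1, yet its n-grams are stored only once in the front-end level; the repeated occurrences cost only two entries in the back-end level. That is the kind of positional redundancy the abstract says the two-level structure removes.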

A Space-Efficient Inverted Index Technique using Data Rearrangement for String Similarity Searches (유사도 검색을 위한 데이터 재배열을 이용한 공간 효율적인 역 색인 기법)

Im, Manu;Kim, Jongik
- Journal of KIISE, v.42 no.10, pp.1247-1253, 2015
An inverted index structure is widely used for efficient string similarity search. One of the main requirements of similarity search is a fast response time; to this end, most techniques use an in-memory index structure. Since the size of an inverted index structure is usually very large, however, it is not practical to assume that an index structure will fit into main memory. To alleviate this problem, we propose a novel technique that reduces the size of an inverted index. To reduce the index size, the proposed technique rearranges data strings so that strings containing the same q-grams are placed close to one another. The technique then encodes those multiple strings into a range. Through an experimental study using real data sets, we show that our technique significantly reduces the size of an inverted index without sacrificing query processing time.
https://doi.org/10.5626/JOK.2015.42.10.1247
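
The idea of rearranging strings and then storing ranges instead of individual IDs can be sketched as follows. The abstract does not say how the rearrangement is computed, so a plain lexicographic sort is used here purely as a stand-in; the function names and sample strings are likewise assumptions.

```python
from collections import defaultdict

def qgrams(s, q=2):
    """Set of q-grams (length-q substrings) of s."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def to_ranges(ids):
    """Collapse a sorted ID list into inclusive (start, end) ranges."""
    ranges = []
    for i in ids:
        if ranges and i == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], i)   # extend the current run
        else:
            ranges.append((i, i))             # start a new run
    return ranges

def build_range_index(strings, q=2):
    # Stand-in rearrangement (assumption): sort so similar strings get nearby IDs.
    ordered = sorted(strings)
    postings = defaultdict(list)
    for new_id, s in enumerate(ordered):
        for g in qgrams(s, q):
            postings[g].append(new_id)
    return ordered, {g: to_ranges(ids) for g, ids in postings.items()}

# Made-up sample strings.
strings = ["index", "binder", "indent", "cinder", "india"]
ordered, range_index = build_range_index(strings)
```

After rearrangement, the posting list for the q-gram `"in"` is a single range covering all five strings rather than five separate IDs. The better the ordering clusters strings that share q-grams, the fewer ranges each posting list needs.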

An Update-Efficient, Disk-Based Inverted Index Structure for Keyword Search on Data Streams (데이터 스트림에 대한 키워드 검색을 위한, 효율적인 갱신이 가능한 디스크 기반 역색인 구조)

Park, Eun Ju;Lee, Ki Yong
- KIPS Transactions on Software and Data Engineering, v.5 no.4, pp.171-180, 2016
As social networking services such as Twitter become increasingly popular, data streams are now widely prevalent. To search data accumulated from data streams efficiently, an index structure is essential. In this paper, we propose an update-efficient, disk-based inverted index structure for efficient keyword search on data streams. When new data arrive in the data stream, the index needs to be updated to incorporate them. The traditional inverted index is very inefficient to update in terms of disk I/O, because all index data stored on disk need to be read and written back each time the index is updated. To solve this problem, we divide the whole inverted index into a sequence of inverted indices of exponentially increasing size. When new data arrive, they are first inserted into the smallest index; later, the small indices are merged into the larger indices, which leads to a small amortized update cost for each new data item. Furthermore, when indices stored on disk are merged with each other, we minimize the disk I/O cost incurred by the merge operation, resulting in an even smaller update cost. Through various experiments, we compare the update efficiency of the proposed index structure with that of the previous one and show the performance advantage of the proposed structure in terms of update cost.
https://doi.org/10.3745/KTSDE.2016.5.4.171
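
The "sequence of inverted indices with exponentially increasing size" can be illustrated with a binary-counter style merge scheme. The sketch below keeps everything in memory and ignores the disk I/O optimizations that the paper is about; the class name, the doubling factor of 2, and the sample stream are assumptions made for illustration.

```python
from collections import defaultdict

def merge(a, b):
    """Merge two small inverted indices into one with sorted posting lists."""
    merged = defaultdict(list)
    for part in (a, b):
        for term, ids in part.items():
            merged[term].extend(ids)
    return {t: sorted(ids) for t, ids in merged.items()}

class GeometricIndex:
    """levels[i] is either None or an inverted index over exactly 2**i documents."""

    def __init__(self):
        self.levels = []

    def add(self, doc_id, text):
        # A new document starts as a one-document index at the smallest level.
        carry = {t: [doc_id] for t in set(text.lower().split())}
        i = 0
        # Like a binary increment: merge upward while the level is occupied.
        while i < len(self.levels) and self.levels[i] is not None:
            carry = merge(self.levels[i], carry)
            self.levels[i] = None
            i += 1
        if i == len(self.levels):
            self.levels.append(None)
        self.levels[i] = carry

    def search(self, term):
        """Query every occupied level and combine the posting lists."""
        hits = []
        for level in self.levels:
            if level is not None:
                hits.extend(level.get(term.lower(), []))
        return sorted(hits)

# Made-up sample stream of five short messages.
idx = GeometricIndex()
stream = ["stream data", "keyword search", "data stream index",
          "disk index", "stream update"]
for doc_id, text in enumerate(stream, start=1):
    idx.add(doc_id, text)
```

With this scheme a document takes part in at most one merge per level, so over N insertions each document is rewritten O(log N) times instead of once per update. The trade-off is that a query must consult every occupied level.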

A Multi-level Inverted Index Technique for Structural Document Search (구조화 문서 검색을 위한 다단계 역색인 기법)

Kim, Jong-Ik
- The KIPS Transactions:PartB, v.15B no.4, pp.355-364, 2008
In general, an inverted index can be used to retrieve element lists from structured documents: it returns the list of elements that share the same tag name. In this approach, however, the cost of query processing is linear in the length of a path query, because all the structural relationships (parent-child and ancestor-descendant) must be resolved by structural join operations. In this paper, we propose an inverted index technique and a novel structural join technique for accelerating XML path query evaluation. Our inverted index can retrieve element lists for path segments in a parent-child relationship. Our structural join technique can handle lists of element pairs, whereas existing techniques handle lists of elements. We show through experiments that the two proposed techniques, used together, accelerate the evaluation of XML path queries.
https://doi.org/10.3745/KIPSTB.2008.15-B.4.355
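
One way to picture an index over parent-child path segments, joined as lists of element pairs, is the sketch below. It is a simplified reading of the abstract, not the paper's multi-level index or join algorithm; the ID assignment, the function names, and the sample XML are all hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def build_pair_index(root):
    """Index (parent_tag, child_tag) -> list of (parent_id, child_id) pairs."""
    index = defaultdict(list)
    counter = 0                      # the root element gets ID 0
    stack = [(root, 0)]
    while stack:
        node, node_id = stack.pop()
        for child in node:
            counter += 1
            index[(node.tag, child.tag)].append((node_id, counter))
            stack.append((child, counter))
    return dict(index)

def eval_path3(index, a, b, c):
    """Evaluate the path a/b/c with a single join of the a/b and b/c pair lists."""
    by_parent = defaultdict(list)
    for b_id, c_id in index.get((b, c), []):
        by_parent[b_id].append(c_id)
    left = index.get((a, b), [])
    return sorted(c_id for _, b_id in left for c_id in by_parent.get(b_id, []))

# Made-up sample document.
xml_text = ("<lib><book><title/><author/></book>"
            "<book><title/></book><journal><title/></journal></lib>")
index = build_pair_index(ET.fromstring(xml_text))
```

With a per-tag index, the path `lib/book/title` needs two structural joins (`lib` with `book`, then `book` with `title`). With pair lists it needs only one, because each list already encodes one parent-child step.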

Multivariate Process Capability Index Using Inverted Normal Loss Function (역정규 손실함수를 이용한 다변량 공정능력지수)

Moon, Hye-Jin;Chung, Young-Bae
- Journal of Korean Society of Industrial and Systems Engineering, v.41 no.2, pp.174-183, 2018
In industry, the process capability index has been used to evaluate the variation of quality in a process. Traditional process capability indices such as $C_p$, $C_{pk}$, $C_{pm}$ and $C^+_{pm}$ have been applied in industrial fields. These traditional indices are mainly applied in univariate analysis. However, the mainstream in recent industry is the multivariate manufacturing process, in which multiple quality characteristics are correlated with each other. Therefore, multivariate statistical methods should be used in process capability analysis. Multivariate process indices need to be enhanced with more useful information and broader applicability in recent industrial fields. Hence, the purpose of this study is to develop a more effective multivariate process index ($MC_{pI}$) using the multivariate inverted normal loss function. The multivariate inverted normal loss function is flexible enough to accommodate any type of symmetrical or asymmetrical loss function as well as economic information. In particular, the modeling method proposed in this paper for the multivariate inverted normal loss function (MINLF), and the expected loss derived from the MINLF, can be applied to any type of symmetrical or asymmetrical loss function. This modeling method can also be easily extended from the bivariate case to the multivariate case.
https://doi.org/10.11627/jkise.2018.41.2.174
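
For orientation, the textbook definitions of the univariate indices named in this abstract are given below, together with the commonly cited univariate form of the inverted normal loss function. These are standard forms from the general literature, not formulas taken from this paper. The symbols are assumptions introduced here: $USL$ and $LSL$ are the specification limits, $\mu$ and $\sigma$ the process mean and standard deviation, $T$ the target, $K$ the maximum loss, and $\gamma$ a shape parameter. The paper's $MC_{pI}$ and its multivariate loss function may be defined differently.

```latex
\begin{align}
C_p    &= \frac{USL - LSL}{6\sigma} \\
C_{pk} &= \min\left\{ \frac{USL - \mu}{3\sigma},\ \frac{\mu - LSL}{3\sigma} \right\} \\
C_{pm} &= \frac{USL - LSL}{6\sqrt{\sigma^2 + (\mu - T)^2}} \\
L(y)   &= K\left[ 1 - \exp\left( -\frac{(y - T)^2}{2\gamma^2} \right) \right]
\end{align}
```

Unlike a quadratic loss, $L(y)$ is bounded above by $K$, so the loss levels off for values far from the target instead of growing without limit.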

Query Processing using Information of Parent Nodes in Partitioned Inverted Index Tables (분할된 역 인덱스 테이블에서 부모노드의 정보를 이용한 질의 처리)

Kim, Myung-Soo;Hwang, Byung-Yeon
- Journal of Korea Multimedia Society, v.11 no.7, pp.905-913, 2008
With the increasing adoption of XML, many heterogeneous XML documents are in wide use, and research on data structures for more efficient document management has grown steadily in importance. We propose a query processing technique that uses parent node information in a partitioned inverted index table. The efficiency of searching these heterogeneous documents is strongly influenced by the number of query processing steps and the size of the target data sets. Considering these two factors is therefore very important when designing a data structure. First, our technique stores each parent node's information in an inverted index table. Using this information, we can halve the number of query processing steps. In addition, the size of the target data sets can be reduced by using a partitioned inverted index table. XML documents collected from the Internet are used to demonstrate the new method, and its efficiency is compared with that of some existing search methods.

Efficient Dynamic Index Structure for SSD (SPM) (SSD에 적합한 동적 색인 저장 구조 : SPM)

Jin, Du-Seok;Kim, Jin-Suk;You, Beom-Jong;Jung, Hoe-Kyung
- The Journal of the Korea Contents Association, v.10 no.2, pp.54-62, 2010
Inverted index structures have become the most efficient data structure for high-performance indexing of large text collections, especially for online index maintenance. In-place and merge-based index structures are the two main competing strategies for index construction in dynamic search environments. In both strategies, contiguity of posting information is the mainstay of the design for online index maintenance and query time. With the emergence of new storage devices (SSD, SCRAM), index structures no longer need to consider contiguity of posting information in their design, because of the devices' advantages such as low access latency and high I/O throughput. However, the SSD (Solid State Drive) is not well suited to traditional inverted structures because of its poor random write throughput in practical systems. In this paper, we propose a new efficient online index structure (SPM) for SSDs that significantly reduces query time and improves index maintenance performance.
https://doi.org/10.5392/JKCA.2010.10.2.054
