Search | Korea Science

Improving Elasticsearch for Chinese, Japanese, and Korean Text Search through Language Detector

Kim, Ki-Ju;Cho, Young-Bok
- Journal of information and communication convergence engineering / v.18 no.1 / pp.33-38 / 2020
Elasticsearch is an open-source search and analytics engine that can search petabytes of data in near real time. It is designed as a distributed system that is horizontally scalable and highly available. It provides RESTful APIs, which makes it programming-language agnostic. Full-text search of multilingual text requires language-specific analyzers and field mappings appropriate for indexing and searching that text. Additionally, a language detector can be used in conjunction with the analyzers to improve multilingual text search. Elasticsearch provides more than 40 language analysis plugins that process text and extract language-specific tokens, as well as language detector plugins that determine the language of a given text. This study investigates three approaches to indexing and searching Chinese, Japanese, and Korean (CJK) text (single analyzer, multi-fields, and language detector-based) and identifies the advantages of the language detector-based approach over the other two.
https://doi.org/10.6109/jicce.2020.18.1.33
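
The language detector-based approach described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the field names (`content_zh`, `content_ja`, `content_ko`) are hypothetical, the script-based detector is a crude stand-in for a real language detector plugin, and the analyzer names `smartcn`, `kuromoji`, and `nori` assume the corresponding official Elasticsearch analysis plugins are installed.

```python
# Hypothetical per-language fields. The analyzer names assume the official
# analysis-smartcn, analysis-kuromoji, and analysis-nori plugins are installed.
ANALYZERS = {"zh": "smartcn", "ja": "kuromoji", "ko": "nori"}


def build_mapping():
    """Return an index mapping body with one text field per language."""
    return {
        "mappings": {
            "properties": {
                f"content_{lang}": {"type": "text", "analyzer": analyzer}
                for lang, analyzer in ANALYZERS.items()
            }
        }
    }


def detect_cjk_language(text):
    """Crude stand-in for a language detector: decide by Unicode script."""
    has_han = False
    for ch in text:
        code = ord(ch)
        if 0xAC00 <= code <= 0xD7A3 or 0x1100 <= code <= 0x11FF:
            return "ko"  # Hangul syllables or jamo
        if 0x3040 <= code <= 0x30FF:
            return "ja"  # hiragana or katakana
        if 0x4E00 <= code <= 0x9FFF:
            has_han = True  # Han ideographs occur in all three languages
    return "zh" if has_han else None


def route_document(text):
    """Build a document body that stores text in the detected language's field."""
    lang = detect_cjk_language(text)
    if lang is None:
        raise ValueError("no CJK script detected")
    return {f"content_{lang}": text}
```

Routing each document to a single language-specific field means only one analyzer runs per document, whereas a multi-fields mapping would analyze the same text with every configured analyzer.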

A Study on the Index Insect as Forensic Entomological Evidence

Jung, Jae-Bong;Yoon, Myung-Hee
- Proceedings of the Korean Society of Life Science Conference / 2007.10a / pp.44.1-44.1 / 2007

The Evaluation Measure of Text Clustering for the Variable Number of Clusters (가변적 클러스터 개수에 대한 문서군집화 평가방법)

Jo, Tae-Ho
- Proceedings of the Korean Information Science Society Conference / 2006.10b / pp.233-237 / 2006
This study proposes an innovative measure for evaluating the performance of text clustering. When the K-means algorithm or Kohonen networks are used for text clustering, the number of clusters is fixed in advance as a parameter, whereas with the single-pass algorithm the number of clusters is not predictable. Using labeled documents, the result of text clustering with the K-means algorithm or a Kohonen network can be evaluated by setting the number of clusters to the number of given target categories, mapping each cluster to a target category, and applying the evaluation measures of text categorization. With the single-pass algorithm, however, if the number of clusters differs from the number of target categories, such measures are useless for evaluating the clustering result. This study proposes an evaluation measure of text clustering based on intra-cluster similarity and inter-cluster similarity, called CI (Clustering Index) in this article.
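
The two quantities the abstract builds on, intra-cluster and inter-cluster similarity, can be computed for any number of clusters. The sketch below assumes documents are numeric vectors compared by cosine similarity; the abstract does not state how CI combines the two values, so the code returns them separately rather than guessing a formula.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def intra_cluster_similarity(clusters):
    """Mean similarity over all document pairs that share a cluster."""
    sims = [cosine(u, v) for c in clusters for u, v in combinations(c, 2)]
    # Assumption: report 0.0 when every cluster is a singleton (no pairs).
    return sum(sims) / len(sims) if sims else 0.0


def inter_cluster_similarity(clusters):
    """Mean similarity over all document pairs from different clusters."""
    sims = [
        cosine(u, v)
        for a, b in combinations(clusters, 2)
        for u in a
        for v in b
    ]
    return sum(sims) / len(sims) if sims else 0.0
```

Neither function refers to target categories, so the result is defined even when a single-pass algorithm yields more or fewer clusters than there are categories. A good clustering should show high intra-cluster and low inter-cluster similarity.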

Comparisons of Practical Performance for Constructing Compressed Suffix Arrays (압축된 써픽스 배열 구축의 실제적인 성능 비교)

Park, Chi-Seong;Kim, Min-Hwan;Lee, Suk-Hwan;Kwon, Ki-Ryong;Kim, Dong-Kyue
- Journal of KIISE:Computer Systems and Theory / v.34 no.5_6 / pp.169-175 / 2007
Suffix arrays, fundamental full-text index data structures, can be used efficiently where patterns are queried many times. Although many useful full-text index data structures have been proposed, their O(n log n)-bit space consumption motivates researchers to develop more space-efficient ones. Space-efficient versions such as the compressed suffix array and the FM-index have been developed, but they cannot reduce the practical working space because their construction is based on an existing suffix array. Recently, two algorithms have been proposed that construct compressed suffix arrays directly from the text without first constructing the suffix array. In this paper, we compare the practical performance of these compressed suffix array construction algorithms with that of various suffix array construction algorithms by measuring construction time, peak memory usage during construction, and the size of the final output.
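
For orientation, a plain (uncompressed) suffix array and a pattern query against it might look like the sketch below. It uses a naive sort-based construction for clarity and is not one of the algorithms compared in the paper.

```python
def build_suffix_array(text):
    """Naive construction: sort suffix start positions lexicographically."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """Return the sorted start positions of pattern, by binary search on sa."""
    m = len(pattern)

    # Lower bound: first suffix whose length-m prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose length-m prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(sa[start:lo])
```

The array stores n positions, each needing about log n bits, which is the O(n log n)-bit cost the abstract refers to. Once built, every query is a pair of binary searches, which is why the structure pays off when patterns are queried many times.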

Inverted Indexes for XML Updates and Full-Text Retrievals in Relational Model (관계형 모델에서 XML 변경과 전문 검색을 지원하기 위한 역 인덱스 구축 기법)

Cheon, Yun-Woo;Hong, Dong-Kweon
- The KIPS Transactions:PartD / v.11D no.3 / pp.509-518 / 2004
Recently there have been efforts to add XML full-text retrieval and XML updates to the new standardization of XML queries. XML full-text retrieval plays an important role in XML query languages. Unlike tables in the relational model, an XML document has a complex and unstructured nature. We believe that when we try to get information from unstructured XML documents, a full-text retrieval query is a much more convenient approach than a regular structured query. XML update is another core function that an XML query language must have. In this paper we propose an inverted index to support XML updates and XML full-text queries in a relational environment. Performance comparisons show that our approach maintains a comparable inverted index size and supports many full-text retrieval functions very well. It also shows very stable retrieval performance, especially for large XML documents. Above all, our approach handles XML updates efficiently by removing cascading effects.
https://doi.org/10.3745/KIPSTD.2004.11D.3.509

Theory and Practice of Automatic Indexing (자동색인의 이론과 실제)

- Journal of Korean Library and Information Science Society / v.30 no.3 / pp.27-51 / 1999
This paper deals with the methods of, and the problems associated with, automatic extraction indexing and assignment indexing, expert systems for indexing, and the major approaches currently used to index Internet resources. It also briefly reviews basic methods for establishing hypertext/hypermedia links automatically. The methods used in much of text processing today are not particularly new. Most of them were used, perhaps in a more rudimentary form, 30 or more years ago by Luhn and many other investigators. Better results can be achieved today because much greater bodies of electronic text are now available and the power of present-day computers allows such text to be processed with reasonable efficiency.

A Text Similarity Measurement Method Based on Singular Value Decomposition and Semantic Relevance

Li, Xu;Yao, Chunlong;Fan, Fenglong;Yu, Xiaoqiang
- Journal of Information Processing Systems / v.13 no.4 / pp.863-875 / 2017
Traditional text similarity measurement methods based on word frequency vectors ignore the semantic relationships between words, which, together with the high dimensionality and sparsity of document vectors, has become an obstacle to text similarity calculation. To address these problems, an improved singular value decomposition is used to reduce the dimensionality and remove noise from the text representation model. The optimal number of singular values is analyzed, and the semantic relevance between words is calculated in the constructed semantic space. An inverted index construction algorithm and similarity definitions between vectors are proposed to calculate the similarity between two documents at the semantic level. Experimental results on a benchmark corpus demonstrate that the proposed method improves the F-measure.
https://doi.org/10.3745/JIPS.02.0067
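
As background for the dimensionality reduction step, the standard truncated SVD used in latent semantic analysis can be written as below. This is the textbook formulation, not the paper's improved variant; the symbols $A$ (term-document matrix), $k$ (number of retained singular values), and $\hat{d}_j$ (reduced vector of document $j$) are notation assumed here for illustration.

```latex
% Rank-k approximation of the m-by-n term-document matrix A
A \approx A_k = U_k \Sigma_k V_k^{\top},
\qquad U_k \in \mathbb{R}^{m \times k},\;
\Sigma_k \in \mathbb{R}^{k \times k},\;
V_k \in \mathbb{R}^{n \times k}

% Document j in the k-dimensional semantic space (e_j is the j-th unit vector)
\hat{d}_j = \Sigma_k V_k^{\top} e_j

% Cosine similarity between documents i and j in that space
\operatorname{sim}(d_i, d_j) =
  \frac{\hat{d}_i^{\top} \hat{d}_j}{\lVert \hat{d}_i \rVert \, \lVert \hat{d}_j \rVert}
```

Keeping only the $k$ largest singular values discards the low-energy directions, which is how the reduction both shrinks the vectors and filters noise.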

On supporting full-text retrievals in XML query

Hong, Dong-Kweon
- International Journal of Fuzzy Logic and Intelligent Systems / v.7 no.4 / pp.274-278 / 2007
As XML becomes the standard format for digital data exchange, we need to manage large amounts of XML data effectively. Unlike tables in the relational model, XML documents are not structured, which makes it difficult to store them as relational tables. To solve this problem, there has been significant research in relational database systems. There are two kinds of approaches: 1) one decomposes XML documents so that XML elements match the fields of relational tables; 2) the other stores a whole XML document as a field of a relational table. In this paper we adopt the second approach, because it is sometimes not easy to decompose XML documents and in some cases the order of elements in a document is very meaningful. We suggest an efficient table schema that stores only the inverted index as tables, in order to retrieve the required data from the XML fields of relational tables, and we show the SQL translations that correspond to XML full-text retrievals. The XML retrieval functionality is based on W3C XQuery, which includes full-text retrieval. We show the superiority of our method by comparing performance in terms of response time and the space needed to store the inverted index. Experiments show that our approach uses less space and has faster response times.
https://doi.org/10.5391/IJFIS.2007.7.4.274
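
The second approach, storing each XML document whole and keeping a separate inverted index table that SQL can query, can be sketched as follows. The schema (`docs`, `postings`), the tokenizer, and the AND-query translation are illustrative assumptions, not the table design or SQL translations proposed in the paper.

```python
import re
import sqlite3
import xml.etree.ElementTree as ET


def build_store(xml_docs):
    """Store whole XML documents plus a hypothetical inverted index table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE docs (doc_id INTEGER PRIMARY KEY, xml TEXT)")
    conn.execute(
        "CREATE TABLE postings (term TEXT, doc_id INTEGER, position INTEGER)"
    )
    conn.execute("CREATE INDEX idx_postings_term ON postings (term)")
    for doc_id, xml in enumerate(xml_docs, start=1):
        conn.execute("INSERT INTO docs VALUES (?, ?)", (doc_id, xml))
        # Concatenate all text nodes, then split into lowercase word tokens.
        text = " ".join(ET.fromstring(xml).itertext())
        tokens = re.findall(r"\w+", text.lower())
        conn.executemany(
            "INSERT INTO postings VALUES (?, ?, ?)",
            [(tok, doc_id, pos) for pos, tok in enumerate(tokens)],
        )
    return conn


def search_all_terms(conn, terms):
    """Return ids of documents that contain every given term (AND query)."""
    terms = sorted({t.lower() for t in terms})
    if not terms:
        return []
    placeholders = ", ".join("?" for _ in terms)
    rows = conn.execute(
        f"SELECT doc_id FROM postings WHERE term IN ({placeholders}) "
        "GROUP BY doc_id HAVING COUNT(DISTINCT term) = ? ORDER BY doc_id",
        (*terms, len(terms)),
    )
    return [row[0] for row in rows]
```

Because the document itself stays intact in `docs.xml`, element order is preserved, and a full-text query touches only the `postings` table before fetching the matching documents.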

Improving Lookup Time Complexity of Compressed Suffix Arrays using Multi-ary Wavelet Tree

Wu, Zheng;Na, Joong-Chae;Kim, Min-Hwan;Kim, Dong-Kyue
- Journal of Computing Science and Engineering / v.3 no.1 / pp.1-4 / 2009
Given a text $T$ of size $n$, we need to search for the information we are interested in. To support fast searching, an index must be constructed by preprocessing the text. The suffix array is one such index data structure. The compressed suffix array (CSA) is a compressed index based on the regularity of the suffix array, and can be compressed to the $k$th-order empirical entropy. In this paper we improve the lookup time complexity of the compressed suffix array by using the multi-ary wavelet tree at the cost of more space. In our implementation, the lookup time complexity of the compressed suffix array is $O(\log_{\sigma}^{\varepsilon/(1-\varepsilon)} n \, \log_r \sigma)$, and the space of the compressed suffix array is $\varepsilon^{-1} n H_k(T) + O(n \log\log n / \log_{\sigma}^{\varepsilon} n)$ bits, where $\sigma$ is the size of the alphabet, $H_k$ is the $k$th-order empirical entropy, $r$ is the branching factor of the multi-ary wavelet tree such that $2 \leq r \leq \sqrt{n}$ and $r \leq O(\log_{\sigma}^{1-\varepsilon} n)$, and $0 < \varepsilon < 1/2$ is a constant.
https://doi.org/10.5626/JCSE.2009.3.1.001

In-depth Analysis of Soccer Game via Webcast and Text Mining (웹 캐스트와 텍스트 마이닝을 이용한 축구 경기의 심층 분석)

Jung, Ho-Seok;Lee, Jong-Uk;Yu, Jae-Hak;Lee, Han-Sung;Park, Dai-Hee
- The Journal of the Korea Contents Association / v.11 no.10 / pp.59-68 / 2011
As the role of the soccer game analyst, who analyzes soccer games and devises winning strategies, is increasingly emphasized, the IT-based soccer broadcasting research community needs high-level analysis that goes beyond procedural tasks such as main event detection. In this paper, we propose a novel approach that generates high-level, in-depth analysis results from real-time text-based soccer Webcasts using text mining. The proposed method creates metadata such as attributes, actions, and events, builds an index, and then generates useful knowledge through text mining techniques such as association rule mining, the event growth index, and pathfinder network analysis, using the Webcast and domain knowledge. We carried out a feasibility experiment on the proposed technique using the Webcast text of the Spanish team's 2010 World Cup games.
https://doi.org/10.5392/JKCA.2011.11.10.059
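
Of the techniques named in the abstract, association rule mining rests on two standard quantities, support and confidence. The sketch below computes both over sets of event labels; treating each Webcast segment as a set of labels is an assumption made here for illustration, and the paper's actual metadata and rules are not reproduced.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent) for the rule antecedent -> consequent."""
    base = support(transactions, antecedent)
    if base == 0:
        return 0.0
    return support(transactions, set(antecedent) | set(consequent)) / base
```

For example, with hypothetical segments `[{"corner", "shot"}, {"corner", "shot", "goal"}, {"foul"}, {"corner"}]`, the rule corner -> shot has support 2/4 and confidence 2/3, since two of the three segments containing a corner also contain a shot.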

