Improving Elasticsearch for Chinese, Japanese, and Korean Text Search through Language Detector

  • Received : 2019.11.14
  • Accepted : 2020.03.24
  • Published : 2020.03.31

Abstract

Elasticsearch is an open-source search and analytics engine that can search petabytes of data in near real time. It is designed as a distributed system that is horizontally scalable and highly available. It exposes RESTful APIs, which makes it programming-language agnostic. Full-text search of multilingual text requires language-specific analyzers and field mappings suited to indexing and searching such text. In addition, a language detector can be used together with the analyzers to improve multilingual search. Elasticsearch provides more than 40 language analysis plugins, which process text and extract language-specific tokens, as well as language detector plugins, which determine the language of a given text. This study investigates three approaches to indexing and searching Chinese, Japanese, and Korean (CJK) text (single analyzer, multi-fields, and language detector-based) and identifies the advantages of the language detector-based approach over the other two.
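The three approaches named above can be illustrated with a minimal sketch. It is not the configuration used in the study: the field names (`body`, `body_zh`, `body_ja`, `body_ko`) and all helper functions are hypothetical, and the Unicode-range heuristic is only a stand-in for a real language detector plugin. The analyzer names `cjk`, `smartcn`, `kuromoji`, and `nori` are those registered by Elasticsearch and its Smart Chinese, Kuromoji, and Nori analysis plugins. The code only builds mapping dictionaries and routes documents; it does not contact a cluster.

```python
# Analyzer names registered by the official CJK analysis plugins.
ANALYZERS = {"zh": "smartcn", "ja": "kuromoji", "ko": "nori"}


def single_analyzer_mapping():
    """Approach 1: one field analyzed by the built-in `cjk` analyzer."""
    return {"properties": {"body": {"type": "text", "analyzer": "cjk"}}}


def multi_fields_mapping():
    """Approach 2: index the same text once per language using sub-fields."""
    sub_fields = {
        lang: {"type": "text", "analyzer": analyzer}
        for lang, analyzer in ANALYZERS.items()
    }
    return {"properties": {"body": {"type": "text", "fields": sub_fields}}}


def detector_based_mapping():
    """Approach 3: one field per language; a detector chooses the field."""
    properties = {"body": {"type": "text"}}  # fallback for undetected text
    for lang, analyzer in ANALYZERS.items():
        properties[f"body_{lang}"] = {"type": "text", "analyzer": analyzer}
    return {"properties": properties}


def guess_cjk_language(text):
    """Crude stand-in for a language detector (assumption, not the plugin).

    Hangul implies Korean, kana implies Japanese, and text that contains
    only Han ideographs is treated as Chinese.
    """
    has_han = False
    for ch in text:
        cp = ord(ch)
        if 0xAC00 <= cp <= 0xD7A3 or 0x1100 <= cp <= 0x11FF:
            return "ko"  # Hangul syllables or Jamo
        if 0x3040 <= cp <= 0x30FF:
            return "ja"  # Hiragana or Katakana
        if 0x4E00 <= cp <= 0x9FFF:
            has_han = True  # CJK Unified Ideographs
    return "zh" if has_han else None


def route_document(text):
    """Place the text in the language-specific field chosen by the detector."""
    lang = guess_cjk_language(text)
    field = f"body_{lang}" if lang else "body"
    return {field: text}
```

With the detector-based layout, each document is analyzed once by the analyzer matching its language, whereas the multi-fields layout analyzes every document with all three analyzers. A production setup would replace `guess_cjk_language` with a proper detector, for example one running in an ingest pipeline.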
