• Title/Abstract/Keyword: text analytics

Search results: 109 items

Improving Elasticsearch for Chinese, Japanese, and Korean Text Search through Language Detector

  • Kim, Ki-Ju; Cho, Young-Bok
    • Journal of Information and Communication Convergence Engineering, Vol. 18, No. 1, pp. 33-38, 2020
  • Elasticsearch is an open source search and analytics engine that can search petabytes of data in near real time. It is designed as a distributed system that is horizontally scalable and highly available, and it provides RESTful APIs, making it programming-language agnostic. Full-text search of multilingual text requires language-specific analyzers and field mappings appropriate for indexing and searching such text, and a language detector can be used in conjunction with the analyzers to improve multilingual search. Elasticsearch provides more than 40 language analysis plugins that process text and extract language-specific tokens, as well as language detector plugins that determine the language of a given text. This study investigates three different approaches to indexing and searching Chinese, Japanese, and Korean (CJK) text (single analyzer, multi-fields, and language detector-based) and identifies the advantages of the language detector-based approach over the other two.
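
For illustration, here is a minimal sketch of the multi-fields approach combined with client-side language detection, assuming a local Elasticsearch node with the smartcn, kuromoji, and nori analysis plugins installed and the elasticsearch-py 8.x client; the index name and helper function are hypothetical, and the langdetect library stands in for a detector plugin.

```python
from elasticsearch import Elasticsearch
from langdetect import detect  # lightweight detector standing in for an ES plugin

es = Elasticsearch("http://localhost:9200")

# One text field indexed three ways, one sub-field per CJK analyzer.
es.indices.create(
    index="cjk_docs",
    mappings={
        "properties": {
            "content": {
                "type": "text",
                "fields": {
                    "zh": {"type": "text", "analyzer": "smartcn"},
                    "ja": {"type": "text", "analyzer": "kuromoji"},
                    "ko": {"type": "text", "analyzer": "nori"},
                },
            }
        }
    },
)

def index_doc(text):
    # Store the detected language so queries can target the matching sub-field.
    lang = detect(text)  # e.g. 'zh-cn', 'ja', 'ko'
    es.index(index="cjk_docs", document={"content": text, "lang": lang})

index_doc("엘라스틱서치는 분산형 검색 엔진입니다.")
```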

데이터 분석 기반 미래 신기술의 사회적 위험 예측과 위험성 평가 (Data Analytics for Social Risk Forecasting and Assessment of New Technology)

  • 서용윤
    • 한국안전학회지, Vol. 32, No. 3, pp. 83-89, 2017
  • New technologies have provided the nation, industry, society, and people with innovative and useful functions, and the national economy and society have improved through such technological innovation. However, as the technology-driven society has matured, the unintended side effects and negative impacts of new technologies on society and human beings have been highlighted. It is therefore important to investigate the risks that new technologies pose to future society. Recently, the social risks of new technologies have been surfacing in large amounts of social data such as news articles and reports, and these data can serve as effective sources for quantitatively and systematically forecasting such risks. In this respect, this paper proposes a data-driven process for forecasting and assessing the social risks of future new technologies using text mining, the 4M (Man, Machine, Media, and Management) framework, and the analytic hierarchy process (AHP). First, social risk factors are forecasted from social risk keywords extracted by text mining of documents containing social risk information about new technologies. Second, the social risk keywords are classified into the 4M causes to identify the degree of each risk cause. Finally, the AHP is applied to assess the impact of the social risk factors and 4M causes based on the social risk keywords. The proposed approach helps technology engineers, safety managers, and policy makers consider the social risks of new technologies and their impact.
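
As a rough illustration of the AHP step, the sketch below computes priority weights and a consistency ratio from a pairwise comparison matrix; the 3x3 matrix values are invented for illustration and are not the paper's data.

```python
import numpy as np

# Hypothetical pairwise comparison matrix for three risk causes
# (values are illustrative, not from the paper).
A = np.array([
    [1.0, 3.0, 5.0],
    [1/3, 1.0, 2.0],
    [1/5, 1/2, 1.0],
])

# The principal eigenvector of A gives the AHP priority weights.
eigvals, eigvecs = np.linalg.eig(A)
k = np.argmax(eigvals.real)
w = eigvecs[:, k].real
w = w / w.sum()

# Consistency ratio check (random index RI = 0.58 for n = 3).
n = A.shape[0]
lambda_max = eigvals.real[k]
ci = (lambda_max - n) / (n - 1)
cr = ci / 0.58
print("weights:", w.round(3), "CR:", round(cr, 3))
```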

Understanding the Food Hygiene of Cruise through the Big Data Analytics using the Web Crawling and Text Mining

  • Shuting, Tao; Kang, Byongnam; Kim, Hak-Seon
    • 한국조리학회지, Vol. 24, No. 2, pp. 34-43, 2018
  • The objective of this study was to acquire a general, text-based awareness and recognition of cruise food hygiene through big data analytics. For this purpose, the study collected data by querying the keywords "food hygiene, cruise" against web pages and news on Google from October 1st, 2015 to October 1st, 2017 (two years). Data collection was carried out with SCTM, a data collecting and processing program, and eventually 899 KB of text, approximately 20,000 words, were collected. For the data analysis, UCINET 6.0, packaged with the visualization tool NetDraw, was utilized. The analysis showed that words such as "jobs" and "news" had high frequency, while the centrality results (Freeman's degree centrality and eigenvector centrality) and proximity produced rankings distinct from the frequency ranking. Meanwhile, the CONCOR analysis yielded four clusters: a "food hygiene group", a "person group", a "location related group", and a "brand group". This big data-based diagnosis of food hygiene in the cruise industry is expected to provide useful implications for both academic research and practical application.
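
The degree and eigenvector centrality measures mentioned above can be reproduced on a small scale with NetworkX, as in the hedged sketch below; the toy documents and word co-occurrence network stand in for the crawled texts processed in UCINET.

```python
import itertools
import networkx as nx

# Hypothetical tokenized documents standing in for the crawled texts.
docs = [
    ["cruise", "food", "hygiene", "news"],
    ["cruise", "jobs", "news"],
    ["food", "hygiene", "inspection", "cruise"],
]

# Build a word co-occurrence network: an edge links words in the same document.
G = nx.Graph()
for doc in docs:
    for u, v in itertools.combinations(set(doc), 2):
        w = G.get_edge_data(u, v, {"weight": 0})["weight"]
        G.add_edge(u, v, weight=w + 1)

# Degree and eigenvector centrality, analogous to the measures computed in UCINET.
deg = nx.degree_centrality(G)
eig = nx.eigenvector_centrality(G, weight="weight")
for word in sorted(G.nodes, key=deg.get, reverse=True):
    print(f"{word:12s} degree={deg[word]:.2f} eigenvector={eig[word]:.2f}")
```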

빅데이터 분석을 위해 아파치 스파크를 이용한 원시 데이터 소스에서 데이터 추출 (Capturing Data from Untapped Sources using Apache Spark for Big Data Analytics)

  • ;구흥서
    • 전기학회논문지, Vol. 65, No. 7, pp. 1277-1282, 2016
  • The term "Big Data" has been defined to encapsulate a broad spectrum of data sources and data formats. It is often described to be unstructured data due to its properties of variety in data formats. Even though the traditional methods of structuring data in rows and columns have been reinvented into column families, key-value or completely replaced with JSON documents in document-based databases, the fact still remains that data have to be reshaped to conform to certain structure in order to persistently store the data on disc. ETL processes are key in restructuring data. However, ETL processes incur additional processing overhead and also require that data sources are maintained in predefined formats. Consequently, data in certain formats are completely ignored because designing ETL processes to cater for all possible data formats is almost impossible. Potentially, these unconsidered data sources can provide useful insights when incorporated into big data analytics. In this project, using big data solution, Apache Spark, we tapped into other sources of data stored in their raw formats such as various text files, compressed files etc and incorporated the data with persistently stored enterprise data in MongoDB for overall data analytics using MongoDB Aggregation Framework and MapReduce. This significantly differs from the traditional ETL systems in the sense that it is compactible regardless of the data formats at source.

GNI Corpus Version 1.0: Annotated Full-Text Corpus of Genomics & Informatics to Support Biomedical Information Extraction

  • Oh, So-Yeon; Kim, Ji-Hyeon; Kim, Seo-Jin; Nam, Hee-Jo; Park, Hyun-Seok
    • Genomics & Informatics, Vol. 16, No. 3, pp. 75-77, 2018
  • Genomics & Informatics (NLM title abbreviation: Genomics Inform) is the official journal of the Korea Genome Organization. Text corpus for this journal annotated with various levels of linguistic information would be a valuable resource as the process of information extraction requires syntactic, semantic, and higher levels of natural language processing. In this study, we publish our new corpus called GNI Corpus version 1.0, extracted and annotated from full texts of Genomics & Informatics, with NLTK (Natural Language ToolKit)-based text mining script. The preliminary version of the corpus could be used as a training and testing set of a system that serves a variety of functions for future biomedical text mining.

텍스트 마이닝 기법을 활용한 인공지능과 헬스케어 융·복합 분야 연구동향 분석 (Research Trend Analysis by using Text-Mining Techniques on the Convergence Studies of AI and Healthcare Technologies)

  • 윤지은; 서창진
    • 한국IT서비스학회지, Vol. 18, No. 2, pp. 123-141, 2019
  • The goal of this study is to review the major research trends in the convergence of AI and healthcare technologies. For the study, 15,260 English articles on AI and healthcare related topics, covering the 55 years from 1963, were collected from Scopus, and text mining techniques such as text analysis, frequency analysis, topic modeling based on LDA (Latent Dirichlet Allocation), word clouds, and ego network analysis were applied. As a result, seven key research topics were identified: "AI for Clinical Decision Support System (CDSS)", "AI for Medical Image", "Internet of Healthcare Things (IoHT)", "Big Data Analytics in Healthcare", "Medical Robotics", "Blockchain in Healthcare", and "Evidence Based Medicine (EBM)". The results of this study can be used by researchers and the government to set up and develop appropriate healthcare R&D strategies.
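
As a toy illustration of the LDA topic modeling step, the sketch below fits a two-topic model to a handful of invented abstracts with scikit-learn; the real study used 15,260 Scopus articles and identified seven topics.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Invented toy abstracts standing in for the Scopus corpus.
abstracts = [
    "clinical decision support system for diagnosis",
    "deep learning for medical image segmentation",
    "internet of things sensors for healthcare monitoring",
    "blockchain for secure healthcare records",
]

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(abstracts)

# Two topics suffice for this toy corpus; the paper identified seven.
lda = LatentDirichletAllocation(n_components=2, random_state=42)
lda.fit(X)

terms = vec.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = [terms[i] for i in weights.argsort()[::-1][:4]]
    print(f"topic {k}: {top}")
```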

Opinion: Strategy of Semi-Automatically Annotating a Full-Text Corpus of Genomics & Informatics

  • Park, Hyun-Seok
    • Genomics & Informatics, Vol. 16, No. 4, pp. 40.1-40.3, 2018
  • There is a community-wide need for an annotated corpus consisting of the full texts of biomedical journal articles. In response to this need, a prototype version of the full-text corpus of Genomics & Informatics, called GNI version 1.0, has recently been published, with 499 annotated full-text articles available as a corpus resource. However, GNI needs to be updated, as the texts were shallow-parsed and annotated with several existing parsers. I list the issues associated with upgrading the annotations and give an opinion on the methodology for developing the next version of the GNI corpus, based on a semi-automatic strategy for more linguistically rich corpus annotation.

Trend Analysis of the Agricultural Industry Based on Text Analytics

  • Choi, Solsaem; Kim, Junhwan; Nam, Seungju
    • Agribusiness and Information Management, Vol. 11, No. 1, pp. 1-9, 2019
  • This research proposes a methodology for analyzing current trends in agriculture, an industry directly connected to the survival of the nation, and uses it to identify the agricultural trends of Korea. Drawing on the relationships among three types of data - policy reports, academic articles, and news articles - the research extracts the major issues contained in each data source through LDA, a representative topic modeling method. By comparing and analyzing the LDA results derived from each data source, the study identifies implications regarding the current agricultural trends of Korea. The methodology can also be applied to industrial trends other than agricultural ones, and it can further serve as a basic resource for exploring potential future areas through insight into the current situation.
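
A hedged sketch of the cross-source comparison step: fit LDA separately to each data source and compare the resulting topic vocabularies; the toy documents, topic counts, and the Jaccard overlap measure are illustrative choices, not the paper's exact procedure.

```python
from gensim import corpora
from gensim.models import LdaModel

# Toy tokenized documents standing in for the three data sources.
sources = {
    "policy":   [["smart", "farm", "subsidy"], ["rural", "income", "policy"]],
    "academic": [["smart", "farm", "sensor"], ["crop", "yield", "model"]],
    "news":     [["farm", "export", "price"], ["rural", "labor", "shortage"]],
}

def top_words(docs, num_topics=2, topn=3):
    """Fit LDA to one source and return the union of its top topic words."""
    d = corpora.Dictionary(docs)
    bow = [d.doc2bow(doc) for doc in docs]
    lda = LdaModel(bow, id2word=d, num_topics=num_topics, random_state=0, passes=20)
    words = set()
    for t in range(num_topics):
        words |= {w for w, _ in lda.show_topic(t, topn=topn)}
    return words

vocab = {name: top_words(docs) for name, docs in sources.items()}

# Jaccard overlap between the topic vocabularies of each pair of sources.
names = list(vocab)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        j = len(vocab[a] & vocab[b]) / len(vocab[a] | vocab[b])
        print(f"{a} vs {b}: overlap = {j:.2f}")
```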

딥러닝 기반 광학 문자 인식 기술 동향 (Recent Trends in Deep Learning-Based Optical Character Recognition)

  • 민기현; 이아람; 김거식; 김정은; 강현서; 이길행
    • 전자통신동향분석, Vol. 37, No. 5, pp. 22-32, 2022
  • Optical character recognition (OCR) is a primary technology required in many fields, including the digitization of archival documents, industrial automation, automatic driving, video analytics, medicine, and finance, among others. It originated in 1928 with pattern matching, but with the advent of artificial intelligence it has since evolved into a high-performance character recognition technology. Recently, methods for detecting curved text and characters in complicated backgrounds are being studied. Additionally, deep learning models are being developed to recognize text in various orientations and resolutions, under perspective distortion and illumination reflection, as well as partially occluded text, complex fonts, special characters, and artistic text, among others. This report reviews recent deep learning-based text detection and recognition methods and their various applications.
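
As a minimal illustration of OCR in practice, the sketch below uses pytesseract (a wrapper around the Tesseract engine, whose recent versions use an LSTM recognizer); the image path is hypothetical, and the detection and recognition models surveyed in the report are separate systems.

```python
from PIL import Image
import pytesseract

# Hypothetical input image containing printed or scene text.
image = Image.open("sample_document.png")

# Plain recognition: returns the recognized text as a string.
text = pytesseract.image_to_string(image, lang="eng")
print(text)

# Word-level results with confidence scores, useful for downstream analytics.
data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)
for word, conf in zip(data["text"], data["conf"]):
    if word.strip():
        print(word, conf)
```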

텍스트 마이닝을 활용한 사용자 핵심 요구사항 분석 방법론 : 중국 온라인 화장품 시장을 중심으로 (A Methodology for Customer Core Requirement Analysis by Using Text Mining : Focused on Chinese Online Cosmetics Market)

  • 신윤식; 백동현
    • 산업경영시스템학회지, Vol. 44, No. 2, pp. 66-77, 2021
  • Companies widely use surveys to identify customer requirements, but surveys have some problems. First, responses are passive because the questionnaire is pre-designed by the company conducting the survey. Second, the surveyor needs good preliminary knowledge to improve the quality of the survey. Text mining, on the other hand, is an excellent way to compensate for these limitations. Recently, the importance of online reviews has grown steadily, and the amount of text data has increased enormously as Internet usage has risen; at the same time, text mining techniques for extracting high-quality information from text data keep improving. However, previous studies have tended to focus on improving the accuracy of individual analysis techniques. This study proposes a methodology that combines several text mining techniques and makes three main contributions. First, it can extract information from text data without a questionnaire pre-designed by the surveyor. Second, it requires no prior knowledge to extract the information. Lastly, it provides a quantitative sentiment score that can be used in decision-making.
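
A hedged sketch of the quantitative sentiment scoring step for Chinese review text, using SnowNLP's built-in sentiment model; the reviews are invented examples, and the study's own pipeline may combine different techniques.

```python
from snownlp import SnowNLP

# Invented Chinese cosmetics reviews standing in for the study's data.
reviews = [
    "这款面霜很好用，保湿效果非常好",    # "This cream works well; very moisturizing"
    "包装破损，味道也很奇怪，不会回购",  # "Packaging damaged, smells odd, won't repurchase"
]

for review in reviews:
    score = SnowNLP(review).sentiments   # probability of positive sentiment, 0 to 1
    print(f"{score:.2f}  {review}")
```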