• Title/Abstract/Keywords: Text Analysis

3,342 search results (processing time: 0.027 seconds)

Improving Elasticsearch for Chinese, Japanese, and Korean Text Search through Language Detector

  • Kim, Ki-Ju;Cho, Young-Bok
    • Journal of information and communication convergence engineering
    • /
    • Vol. 18, No. 1
    • /
    • pp.33-38
    • /
    • 2020
  • Elasticsearch is an open source search and analytics engine that can search petabytes of data in near real time. It is designed as a distributed system that is horizontally scalable and highly available. It provides RESTful APIs, which makes it programming-language agnostic. Full-text search of multilingual text requires language-specific analyzers and field mappings appropriate for indexing and searching that text. Additionally, a language detector can be used in conjunction with the analyzers to improve multilingual text search. Elasticsearch provides more than 40 language analysis plugins that can process text and extract language-specific tokens, as well as language detector plugins that can determine the language of a given text. This study investigates three different approaches to indexing and searching Chinese, Japanese, and Korean (CJK) text (single analyzer, multi-fields, and language detector-based), and identifies the advantages of the language detector-based approach compared to the other two.
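
The multi-fields and language detector-based approaches described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the Unicode script heuristic stands in for a real language detector plugin, and the field name `body` and all function names are hypothetical. The analyzer names `smartcn`, `kuromoji`, and `nori` are the ones registered by Elasticsearch's analysis-smartcn, analysis-kuromoji, and analysis-nori plugins.

```python
# Hypothetical sketch: one sub-field per CJK analyzer, plus a crude router.
ANALYZERS = {"zh": "smartcn", "ja": "kuromoji", "ko": "nori"}


def build_mapping(field="body"):
    """Return an index mapping with one language-specific sub-field each."""
    sub_fields = {
        lang: {"type": "text", "analyzer": name}
        for lang, name in ANALYZERS.items()
    }
    return {"mappings": {"properties": {field: {"type": "text", "fields": sub_fields}}}}


def detect_language(text):
    """Guess the language from Unicode script ranges (assumption, not a real detector)."""
    has_han = False
    for ch in text:
        cp = ord(ch)
        if 0xAC00 <= cp <= 0xD7A3 or 0x1100 <= cp <= 0x11FF:  # Hangul
            return "ko"
        if 0x3040 <= cp <= 0x30FF:  # Hiragana and Katakana
            return "ja"
        if 0x4E00 <= cp <= 0x9FFF:  # CJK Unified Ideographs
            has_han = True
    return "zh" if has_han else "unknown"


def target_field(text, field="body"):
    """Pick the sub-field to index or query; fall back to the base field."""
    lang = detect_language(text)
    return f"{field}.{lang}" if lang in ANALYZERS else field
```

With a detector in front, each document or query touches only the sub-field for its detected language, instead of every sub-field as in the plain multi-fields approach.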

Text Line Segmentation of Handwritten Documents by Area Mapping

  • Boragule, Abhijeet;Lee, GueeSang
    • 스마트미디어저널
    • /
    • Vol. 4, No. 3
    • /
    • pp.44-49
    • /
    • 2015
  • Text line segmentation is a preprocessing step in OCR, which can significantly influence the accuracy of document analysis applications. This paper proposes a novel methodology for the text line segmentation of handwritten documents. First, the average width of the connected components is used to form a 1-D Gaussian kernel and a smoothing operation is then applied to the input binary image. The adaptive binarization of the smoothed image forms the final text lines. In this work, the segmentation method involves two stages: firstly, the large connected components are labelled as a unique text line using text line area mapping. Secondly, the final refinement of the segmentation is performed using the Euclidean distance between the text line and small connected components. The group of uniquely labelled text candidates achieves promising segmentation results. The proposed approach works well on Korean and English language handwritten documents captured using a camera.
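
The smoothing step in this abstract can be illustrated with a small pure-Python sketch. It is an assumption-laden simplification rather than the paper's method: the rule tying sigma to half the average component width, the zero padding, and the fixed threshold (used here in place of adaptive binarization) are all illustrative choices, and the function names are hypothetical.

```python
import math


def gaussian_kernel(avg_width):
    """Build a normalized 1-D Gaussian kernel from the average component width."""
    sigma = max(avg_width / 2.0, 0.5)  # assumed rule, not taken from the paper
    radius = int(math.ceil(3 * sigma))
    weights = [math.exp(-(x * x) / (2 * sigma * sigma))
               for x in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth_rows(image, kernel):
    """Convolve every row of a binary image with the kernel (zero padding)."""
    radius = len(kernel) // 2
    out = []
    for row in image:
        n = len(row)
        smoothed = []
        for x in range(n):
            acc = 0.0
            for k, w in enumerate(kernel):
                j = x + k - radius
                if 0 <= j < n:
                    acc += w * row[j]
            smoothed.append(acc)
        out.append(smoothed)
    return out


def binarize(smoothed, threshold=0.2):
    """Fixed-threshold binarization (a stand-in for adaptive binarization)."""
    return [[1 if v >= threshold else 0 for v in row] for row in smoothed]
```

Horizontal smoothing merges nearby components in the same row into a single run, which is what lets the binarized result approximate text line regions.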

A Technical Approach for Suggesting Research Directions in Telecommunications Policy

  • Oh, Junseok;Lee, Bong Gyou
    • KSII Transactions on Internet and Information Systems (TIIS)
    • /
    • Vol. 8, No. 12
    • /
    • pp.4467-4488
    • /
    • 2014
  • Bibliometric analysis is widely used for understanding research domains, trends, and knowledge structures in a particular field. It has been used mainly in information science and is now being applied to other academic fields. This paper analyzes the academic literature to classify research domains and to suggest unexplored research areas in telecommunications policy. Application software was developed to retrieve Thomson Reuters' Web of Knowledge (WoK) data via web services; it was also used to conduct text mining analysis of the contents and citations of publications. We used three text mining techniques: Keyword Extraction Algorithm (KEA) analysis, co-occurrence analysis, and citation analysis. R software was used to visualize the term frequencies and the co-occurrence network among publications. We found that research on policies related to social communication services and the distribution of telecommunications infrastructure, as well as more practical and data-driven analysis, has been conducted in the most recent decade. The citation analysis showed that publications in telecommunications policy generally received citations, but most were not highly cited. However, although recent publications were not highly cited, the productivity of papers in terms of citations increased in the most recent ten years compared to research before 2004. The distribution methods of infrastructure, and inequity and gaps, also appeared as topics in important references. We propose that new research domains are needed, since the results imply that the decrease of political approaches to technical problems is an issue in past research. In addition, research on policies for new technologies in the field of telecommunications is insufficient.
This research is significant as the first bibliometric analysis using abstracts and citation data in telecommunications, and for its development of software that combines web services and text mining functions. Further research will be conducted with Big Data techniques and additional text mining techniques.
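
Co-occurrence analysis, one of the three techniques named in this abstract, can be sketched as counting how often two keywords appear in the same publication. The example below is a generic illustration in Python (the study itself used R for visualization); the function name and the sample keywords are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(docs):
    """Count keyword pairs that appear together in the same document."""
    counts = Counter()
    for keywords in docs:
        # Deduplicate and sort so each unordered pair has one canonical key.
        for a, b in combinations(sorted(set(keywords)), 2):
            counts[(a, b)] += 1
    return counts


# Hypothetical keyword lists, one per publication.
docs = [
    ["policy", "broadband", "spectrum"],
    ["policy", "broadband"],
    ["spectrum", "policy"],
]
pairs = cooccurrence(docs)
```

The resulting pair counts can serve as edge weights of a co-occurrence network among keywords.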

텍스트마이닝 기법을 이용한 모바일 피트니스 애플리케이션 주요 요인 분석 : 사용자 경험 관점 (An Analysis on Key Factors of Mobile Fitness Application by Using Text Mining Techniques : User Experience Perspective)

  • 이소현;김진솔;윤상혁;김희웅
    • 한국IT서비스학회지
    • /
    • Vol. 19, No. 3
    • /
    • pp.117-137
    • /
    • 2020
  • The development of information technology leads to changes in various industries. The health care industry in particular has been strongly influenced and has become a focus of attention. As the health care market widens, the market for smart device-based personal health care is also drawing attention. Since a variety of fitness applications for smartphone-based exercise were introduced, interest in the health care industry has grown. However, although the use of mobile fitness applications is increasing, it does not lead to sustained use. It is necessary to find and understand what matters to mobile fitness application users. Therefore, this study analyzes the reviews of mobile fitness application users to draw out key factors and thereby propose detailed strategies for promoting mobile fitness applications. We use text mining techniques (LDA topic modeling, term frequency analysis, and keyword extraction) to draw out and analyze the issues related to mobile fitness applications. In particular, the key factors drawn out by the text mining techniques are explained through the concept of user experience. This study is academically meaningful in that the key factors of mobile fitness applications are derived through user experience-based text mining, and practically meaningful in that it proposes detailed strategies for promoting mobile fitness applications in the health care area.

국민청원글의 토픽 모델링을 통한 교육이슈 분석 (Analysis of Educational Issues through Topic Modeling of National Petitions Text)

  • 심재권
    • 정보교육학회논문지
    • /
    • Vol. 25, No. 4
    • /
    • pp.633-640
    • /
    • 2021
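  • Education-related issues are social problems in which diverse groups and circumstances are intricately intertwined, so analyzing education-related phenomena to identify specific issues and problems is not easy. Korean-language text analysis permits quantitative analysis and, with advances in text analysis techniques, has been producing research results, so it can readily be used to derive education-related issues from Korean text data. This study collected petitions in the childcare/education category of the Blue House (Cheong Wa Dae) national petition website board and applied text analysis methods to derive issues and problems in the education sector. Six topics were derived using Latent Dirichlet Allocation (LDA), a topic modeling technique, and association rules among the main keywords were analyzed and visualized as a graph. Because issues can be adequately identified through text-based analysis in addition to the conventional survey-based approach, the study offers implications for the direction of future research and policy.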

빅데이터 텍스트 마이닝 분석을 활용한 아메카지 패션 트렌드 특징 고찰 (A Study on the Characteristics of Amekaji Fashion Trends Using Big Data Text Mining Analysis)

  • 김지형
    • 패션비즈니스
    • /
    • Vol. 26, No. 3
    • /
    • pp.138-154
    • /
    • 2022
  • The purpose of this study is to identify the characteristics of domestic American casual fashion trends using big data text mining analysis. A total of 108,524 posts and 2,038,999 extracted keywords related to American casual fashion from Naver and Daum over the past 5 years were collected and refined with the Textom program, and frequency analysis, word cloud, N-gram, centrality analysis, and CONCOR analysis were performed. In the frequency analysis, 'vintage', 'style', 'daily look', 'coordination', 'workwear', and 'men's wear' appeared as the main keywords. The main nationality of the representative brands was Japanese, followed by American, Korean, and others. The CONCOR analysis yielded four clusters: "general American casual trend", "vintage taste", "direct sales mania", and "American styling". The results showed that Japanese American casual clothes are influenced by American casual clothes, and that American casual fashion in Korea, which has been reinterpreted, is completed with various coordination and creative styles such as workwear, street, military, and classic, focusing on items and brands. Looks were worn and shared on social networks, and the study also confirmed the existence of an active consumer group and the market potential for obtaining genuine products, ranging from second-hand transactions for limited edition vintage items to individual transactions. The significance of this study is that it presents the characteristics of American casual fashion trends academically, based on online text data that the public actually uses, since the trend has been spread by the public.

Patent Document Similarity Based on Image Analysis Using the SIFT-Algorithm and OCR-Text

  • Park, Jeong Beom;Mandl, Thomas;Kim, Do Wan
    • International Journal of Contents
    • /
    • Vol. 13, No. 4
    • /
    • pp.70-79
    • /
    • 2017
  • Images are an important element in patents, and many experts use images to analyze a patent or to check differences between patents. However, there is little research on image analysis for patents, partly because image processing is an advanced technology and patent images typically consist of visual parts as well as text and numbers. This study suggests two methods for using image processing: the Scale Invariant Feature Transform (SIFT) algorithm and Optical Character Recognition (OCR). The first method, which works with SIFT, uses image feature points. Through feature matching, it can be applied to calculate the similarity between documents containing these images. In the second method, OCR is used to extract text from the images. By using numbers extracted from an image, it is possible to extract the corresponding related text within the text passages. Subsequently, document similarity can be calculated based on the extracted text. The feasibility of the suggested methods is shown by comparing them with an existing method that calculates similarity based only on text. Additionally, the correlation between the two similarity measures is low, which shows that they capture different aspects of the patent content.
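
The final step of the OCR-based method, computing document similarity from extracted text, can be sketched as below. The abstract does not say which similarity measure was used; bag-of-words cosine similarity is an assumption chosen here for illustration, and the function name and whitespace tokenization are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine similarity between two texts using raw term counts."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    # Only terms present in both texts contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)
```

The function returns a value between 0 (no shared terms) and 1 (identical term distributions), so scores for different document pairs can be ranked directly.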

정보적 과학 텍스트의 유형에 따른 초등학생들의 내용 이해도와 인식 비교 (A Comparative Analysis of Elementary Students' Content Understanding and Perceptions by Different Types of Informational Science Texts)

  • 임희준;김연상
    • 한국초등과학교육학회지:초등과학교육
    • /
    • Vol. 29, No. 4
    • /
    • pp.526-537
    • /
    • 2010
  • The purpose of this study was to compare the effects of two types of texts, narrative and expository, on the understanding of content. Elementary students' perceptions of the two types of texts were also investigated. To compare the effects on understanding of the text contents, scores on a mind-mapping test, closed-answer questions, and an essay test were used. The analyses of the mind-mapping tests showed that narrative text was more effective for identifying the main concepts of the text, whereas expository text was more effective for the hierarchical organization of the concepts. In the closed-answer questions and the essay test, narrative text was more effective than expository text. However, when the contents of the text were difficult and complex, there was no meaningful difference between the two types of texts. The analyses of students' perceptions showed that narrative texts were preferred. Students perceived the narrative text as more interesting and familiar. However, perceptions of how helpful the text was for their science learning did not differ by text type.


언어 네트워크 분석 방법을 활용한 학술논문의 내용분석 (A Content Analysis of Journal Articles Using the Language Network Analysis Methods)

  • 이수상
    • 정보관리학회지
    • /
    • Vol. 31, No. 4
    • /
    • pp.49-68
    • /
    • 2014
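  • The purpose of this study is to identify the basic framework of language network analysis methods through a content analysis of 53 Korean journal articles on language network analysis retrieved from domestic academic article databases. The content analysis categories are the type of language text analyzed, the keyword selection method, the method of identifying co-occurrence relations, the network construction method, and the types of network analysis tools and analysis indicators. The main characteristics found are as follows. First, journal articles and interview data are frequently used as the language texts for analysis. Second, keywords are selected mainly by the frequency of words extracted from the body of the text. Third, relations between keywords are identified almost exclusively by co-occurrence frequency. Fourth, multiple networks are constructed more often than a single network. Fifth, NetMiner, UCINET/NetDraw, NodeXL, Pajek, and other tools are used for network analysis. Sixth, various analysis indicators such as density, centrality, and subnetworks are used. These characteristics can be used to build the basic framework of language network analysis methods.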
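
Two of the analysis indicators listed in this abstract, density and (degree) centrality, have standard definitions for an undirected network and can be sketched in a few lines. This is a generic illustration rather than the output of any tool named above; the function names and the sample keyword network are hypothetical, and each edge is assumed to be listed once.

```python
def density(nodes, edges):
    """Share of possible undirected edges that are present: 2E / (N(N-1))."""
    n = len(nodes)
    if n < 2:
        return 0.0
    return 2 * len(edges) / (n * (n - 1))


def degree_centrality(nodes, edges):
    """Degree of each node divided by the maximum possible degree (N-1)."""
    n = len(nodes)
    degree = {v: 0 for v in nodes}
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    if n < 2:
        return {v: 0.0 for v in nodes}
    return {v: d / (n - 1) for v, d in degree.items()}


# Hypothetical keyword network: "network" co-occurs with three other keywords.
nodes = ["network", "keyword", "text", "density"]
edges = [("network", "keyword"), ("network", "text"), ("network", "density")]
centrality = degree_centrality(nodes, edges)
```

In this star-shaped example the hub keyword has centrality 1.0 while the others have 1/3, which is the kind of contrast a centrality indicator is meant to expose.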

동시적 텍스트 기반 매체를 이용한 집단의사결정에 관한 질적 연구 (Qualitative Study on Group Decision Making with Synchronous Text Communication Medium)

  • 박상혁;조남재
    • Journal of Information Technology Applications and Management
    • /
    • Vol. 11, No. 4
    • /
    • pp.1-23
    • /
    • 2004
  • This study identifies communication patterns of groups using a synchronous text communication medium for group decision-making, and examines how these patterns are associated with creative solutions to problems. Our research suggests that certain communication behaviors of groups, when appropriately organized, can help enhance the creative production of outcomes. A qualitative study was conducted on communication patterns based on an analysis of text-based electronic conversation protocols. Specifically, this research tried to overcome the limitations of existing studies on electronic groups by focusing on the interactive process of communication among participants. The major conclusions are: (1) The production of creative outcomes may depend on the process or sequence of discussion among group members using a synchronous text communication medium. That is, proper interactive responses and appropriate control of the discussion process are essential to obtain a high level of performance. (2) It is important to establish discussion rules based on meta-cognitive and interactive protocols at an early stage. Explicit rules relating to internal group processes as well as communication medium use are even more important for groups using an electronic communication medium than for face-to-face groups.
