• Title/Summary/Keyword: Text data


Trends in Deep Learning-based Medical Optical Character Recognition (딥러닝 기반의 의료 OCR 기술 동향)

  • Sungyeon Yoon;Arin Choi;Chaewon Kim;Sumin Oh;Seoyoung Sohn;Jiyeon Kim;Hyunhee Lee;Myeongeun Han;Minseo Park
    • The Journal of the Convergence on Culture Technology
    • /
    • v.10 no.2
    • /
    • pp.453-458
    • /
    • 2024
  • Optical Character Recognition (OCR) is a technology that recognizes text in images and converts it into a digital format. Deep learning-based OCR is used in many industries that hold large quantities of recorded data because of its high recognition performance. To improve medical services, the medical industry has actively introduced deep learning-based OCR. In this paper, we discuss trends in OCR engines and medical OCR and provide a roadmap for the development of medical OCR. By applying natural language processing to the detected text data, current medical OCR has improved its recognition performance. However, recognition performance remains limited, especially for non-standard handwriting and modified text. Developing advanced medical OCR will require building databases of medical data, image pre-processing, and natural language processing.
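The natural-language post-processing step mentioned above can be sketched with a toy example: fuzzy-matching noisy OCR tokens against a small medical vocabulary. This is an illustrative assumption, not the paper's method; the vocabulary and tokens are hypothetical.

```python
# Illustrative sketch (not the paper's method): post-correct noisy OCR tokens
# by fuzzy-matching them against a small medical vocabulary.
# The vocabulary and sample tokens below are hypothetical.
import difflib

VOCAB = ["ibuprofen", "amoxicillin", "hypertension", "diabetes", "aspirin"]

def correct_token(token, vocab=VOCAB, cutoff=0.75):
    """Return the closest vocabulary word, or the token unchanged."""
    matches = difflib.get_close_matches(token.lower(), vocab, n=1, cutoff=cutoff)
    return matches[0] if matches else token

ocr_line = ["1buprofen", "200mg", "hypertensi0n"]
corrected = [correct_token(t) for t in ocr_line]
print(corrected)  # ['ibuprofen', '200mg', 'hypertension']
```

Tokens with no sufficiently close vocabulary entry (such as the dosage "200mg") pass through unchanged, so the correction only fires where the vocabulary supports it.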

Design Characteristics of Augmented Reality Digital Fashion (증강 현실 디지털 패션의 디자인 특성)

  • Eunjeong Kim;Seunghee Suh
    • Journal of Fashion Business
    • /
    • v.28 no.4
    • /
    • pp.1-20
    • /
    • 2024
  • The aim of this study was to analyze contemporary sociocultural phenomena and values through the characteristics of augmented reality (AR) digital fashion design. The research method combined a literature review on the metaverse and augmented reality with a case study using both quantitative analysis through big-data text mining and qualitative analysis through constant comparison. Data analysis was conducted using Python-based open-source tools. First, 6,725 data entries were collected from AR digital fashion platforms and brands identified in articles from Vogue and Vogue Business containing the keywords 'augmented reality' and 'digital fashion'. Second, text preprocessing involved stop-word removal, tokenization, and POS-tagging of nouns and adjectives using the NLTK library. Third, the top 50 keywords were extracted through term frequency (TF) and TF-IDF analysis, with the results visualized as a word cloud. Fourth, the external design characteristics and internal concepts of products containing the top keywords were classified, and their values were examined through constant comparison. The results indicate that AR digital fashion design has the following characteristics. First, it embodies surreal fantasy through designs that mimic natural biological patterns using 3D scanning and modeling technology. Second, it presents a trans-boundary aspect by utilizing the fluidity of body and space to challenge vertical and discriminatory social structures. Third, it imagines a new future transcending traditional sociocultural concepts by expanding perceptions of space and time based on advanced technological aesthetics. Fourth, it contributes to sustainability by exploring alternatives for the fashion industry in response to climate change and ecological concerns.
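The TF / TF-IDF step in the pipeline above can be sketched with the standard library alone; the three toy word lists below are hypothetical stand-ins for the preprocessed corpus, not the study's data.

```python
# Minimal TF / TF-IDF sketch over a toy corpus of preprocessed token lists.
# The documents here are hypothetical examples, not the study's 6,725 entries.
import math
from collections import Counter

docs = [
    ["digital", "fashion", "augmented", "reality", "fashion"],
    ["digital", "avatar", "fashion"],
    ["augmented", "reality", "filter"],
]

def tf(doc):
    """Term frequency: raw count normalized by document length."""
    counts = Counter(doc)
    return {w: c / len(doc) for w, c in counts.items()}

def idf(term, corpus):
    """Inverse document frequency: log of (corpus size / document frequency)."""
    df = sum(term in d for d in corpus)
    return math.log(len(corpus) / df)

def tfidf(doc, corpus):
    return {w: weight * idf(w, corpus) for w, weight in tf(doc).items()}

weights = tfidf(docs[0], docs)
top = sorted(weights, key=weights.get, reverse=True)
print(top[0])  # 'fashion' scores highest in the first document
```

Ranking all documents' scores this way and keeping the 50 highest-weighted terms would yield a keyword list of the kind the study visualizes as a word cloud.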

Big data mining for natural disaster analysis (자연재해 분석을 위한 빅데이터 마이닝 기술)

  • Kim, Young-Min;Hwang, Mi-Nyeong;Kim, Taehong;Jeong, Chang-Hoo;Jeong, Do-Heon
    • Journal of the Korean Data and Information Science Society
    • /
    • v.26 no.5
    • /
    • pp.1105-1115
    • /
    • 2015
  • Big data analysis for disasters has recently begun, especially on text data such as social media. Social data usually supports the final two stages of disaster management, which consists of four stages: prevention, preparation, response, and recovery. In contrast, big data analysis of meteorological data can contribute to prevention and preparation. This motivated us to review big data technologies dealing with non-text rather than text data in the natural disaster area. To this end, we first explain the main keywords (big data, data mining, and machine learning) in Sec. 2. We then introduce state-of-the-art machine learning techniques in meteorology-related fields in Sec. 3, showing how traditional machine learning techniques have been adapted to climatic data by taking domain specificity into account. The application of these techniques to natural disaster response is then introduced (Sec. 4), and we finally conclude with several future research directions.

Individual Interests Tracking : Beyond Macro-level Issue Tracking (거시적 이슈 트래킹의 한계 극복을 위한 개인 관심 트래킹 방법론)

  • Liu, Chen;Kim, Namgyu
    • Journal of Information Technology Services
    • /
    • v.13 no.4
    • /
    • pp.275-287
    • /
    • 2014
  • Recently, the volume of unstructured text data generated by various social media has been increasing rapidly; consequently, the use of text mining to support decision-making has also been growing. In particular, academia and industry are paying significant attention to topic analysis as a way to discover the main issues in a large volume of text documents. Topic analysis can be regarded as static analysis because it analyzes a snapshot of the distribution of various issues. In contrast, some recent studies have attempted dynamic issue tracking, which analyzes and traces issue trends over a predefined period. However, most traditional issue tracking methods share a common limitation: when a new period is included, topic analysis must be repeated over the documents of the entire period, rather than being conducted only on the new documents of the added period. Additionally, traditional issue tracking methods can illustrate macro-level issue trends but do not capture the transition of individuals' interests from certain issues to others. In this paper, we propose an individual interests tracking methodology to overcome these two limitations. Our main goal is not to track macro-level issue trends but to analyze trends in the flow of individual interests. Furthermore, our methodology is extensible because it analyzes only the newly added documents when the period of analysis is extended. We also analyze the results of applying our methodology to news articles and their access logs.
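The extensibility property described above (analyzing only newly added documents) can be sketched as incremental per-user counting over access logs; the users, topics, and log entries below are hypothetical.

```python
# Sketch of the incremental-update idea: per-user interest counts grow with
# each new period's access logs, with no re-analysis of earlier periods.
# Users, topics, and logs are hypothetical examples.
from collections import Counter, defaultdict

interests = defaultdict(Counter)  # user -> Counter of topics read

def add_period(access_logs):
    """access_logs: iterable of (user, topic) pairs for the NEW period only."""
    for user, topic in access_logs:
        interests[user][topic] += 1

add_period([("u1", "economy"), ("u1", "sports"), ("u2", "economy")])
add_period([("u1", "sports"), ("u1", "sports")])  # later period: counts only updated

print(interests["u1"].most_common(1))  # [('sports', 3)]
```

Each call touches only the new period's logs, which is the property that lets the analysis window grow without reprocessing the whole corpus.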

Business Model Mining: Analyzing a Firm's Business Model with Text Mining of Annual Report

  • Lee, Jihwan;Hong, Yoo S.
    • Industrial Engineering and Management Systems
    • /
    • v.13 no.4
    • /
    • pp.432-441
    • /
    • 2014
  • As the business model receives considerable attention these days, the ability to collect business-model-related information has become an essential requirement for a company. The annual report is one of the most important external documents containing crucial information about a company's business model. By investigating the business descriptions and future strategies within the annual report, we can analyze a company's business model. However, given the sheer volume of the data, usually over a hundred pages, it is not practical to depend only on manual extraction. The purpose of this study is to complement the manual extraction process with text mining techniques. In this study, text mining is applied to business model concept extraction and business model evolution analysis. By concept, we mean the overview of a company's business model within a specific year, and by evolution, we mean temporal changes in the business model concept over time. The efficiency and effectiveness of our methodology are illustrated by a case example of three companies in the US video rental industry.
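One simple way to quantify the "evolution" notion above, offered as an assumption rather than the paper's exact method, is to represent each year's report as a bag-of-words vector and compare consecutive years by cosine similarity; the two toy "reports" below are hypothetical.

```python
# Sketch (hypothetical method and data): each year's annual-report text becomes
# a bag-of-words vector; low cosine similarity between consecutive years
# signals a shift in the business model concept.
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse count vectors (dicts)."""
    dot = sum(a[w] * b.get(w, 0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb)

report_2013 = Counter("video rental store rental dvd".split())
report_2014 = Counter("video streaming subscription online dvd".split())

similarity = cosine(report_2013, report_2014)
print(round(similarity, 3))  # well below 1.0: the vocabulary has shifted
```

A sequence of such year-over-year similarities gives a crude trajectory of how quickly the described business model is changing.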

Improving Lookup Time Complexity of Compressed Suffix Arrays using Multi-ary Wavelet Tree

  • Wu, Zheng;Na, Joong-Chae;Kim, Min-Hwan;Kim, Dong-Kyue
    • Journal of Computing Science and Engineering
    • /
    • v.3 no.1
    • /
    • pp.1-4
    • /
    • 2009
  • In a given text T of size n, we need to search for the information in which we are interested. To support fast searching, an index must be constructed by preprocessing the text. The suffix array is one kind of index data structure. The compressed suffix array (CSA) is one of the compressed indices based on the regularity of the suffix array, and it can be compressed to the $k$-th order empirical entropy. In this paper we improve the lookup time complexity of the compressed suffix array by using a multi-ary wavelet tree, at the cost of more space. In our implementation, the lookup time complexity of the compressed suffix array is $O(\log_{\sigma}^{\varepsilon/(1-\varepsilon)} n \cdot \log_r \sigma)$, and the space of the compressed suffix array is $\varepsilon^{-1} n H_k(T) + O(n \log \log n / \log_{\sigma}^{\varepsilon} n)$ bits, where $\sigma$ is the size of the alphabet, $H_k$ is the $k$-th order empirical entropy, $r$ is the branching factor of the multi-ary wavelet tree such that $2 \leq r \leq \sqrt{n}$ and $r \leq O(\log_{\sigma}^{1-\varepsilon} n)$, and $0 < \varepsilon < 1/2$ is a constant.
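For contrast with the compressed index, a minimal uncompressed suffix array can be sketched in a few lines: sort all suffix start positions, then locate a pattern by binary search. This illustrates the baseline structure the CSA compresses, not the paper's implementation.

```python
# Plain (uncompressed) suffix array sketch: naive O(n^2 log n) construction
# by sorting suffixes, and pattern lookup by binary search over sorted suffixes.
from bisect import bisect_left

def build_suffix_array(text):
    """Start positions of all suffixes, sorted lexicographically."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def occurrences(text, sa, pattern):
    """All positions where pattern occurs, found via binary search on the SA."""
    suffixes = [text[i:] for i in sa]
    lo = bisect_left(suffixes, pattern)
    hits = []
    while lo < len(suffixes) and suffixes[lo].startswith(pattern):
        hits.append(sa[lo])
        lo += 1
    return sorted(hits)

text = "banana"
sa = build_suffix_array(text)
print(occurrences(text, sa, "ana"))  # [1, 3]
```

Materializing every suffix costs quadratic space; the CSA's point is precisely to answer the same lookups in space close to $nH_k(T)$ bits instead.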

Conceptual Graph Matching Method for Reading Comprehension Tests

  • Zhang, Zhi-Chang;Zhang, Yu;Liu, Ting;Li, Sheng
    • Journal of information and communication convergence engineering
    • /
    • v.7 no.4
    • /
    • pp.419-430
    • /
    • 2009
  • Reading comprehension (RC) systems aim to understand a given text and return answers to questions about the text. Many previous studies extract the sentences most similar to the questions as answers. However, texts for RC tests are generally short, facts about an event or entity are often expressed across multiple sentences, and the answers to some questions may be presented only indirectly, in sentences having few overlapping words with the questions. This paper proposes a conceptual graph matching method for RC tests to extract answer strings. The method first represents the text and questions as conceptual graphs, and then extracts sub-graphs for every candidate answer concept from the text graph. All candidate answer concepts are scored and ranked according to the matching similarity between their sub-graphs and the question graph. The top one is returned as an answer seed to form a concise answer string. Since the sub-graphs for candidate answer concepts are not restricted to covering a single sentence, our approach improves the performance of answer extraction on the Remedia test data.
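The matching idea can be sketched in a highly simplified form: represent the text and question as sets of (concept, relation, concept) triples, substitute each candidate concept into the question's wh-slot, and rank candidates by how many question triples then appear in the text graph. The relations and sentences below are hypothetical, not from the Remedia data.

```python
# Toy conceptual-graph matching sketch (hypothetical relations and sentences).
# Text: "The brown dog buried a bone in the garden."
# Question: "Where did the dog bury the bone?" -> WHERE slot in the graph.
text_graph = {
    ("dog", "agent-of", "bury"),
    ("bury", "object", "bone"),
    ("bury", "location", "garden"),
    ("dog", "attribute", "brown"),
}
question_graph = {
    ("dog", "agent-of", "bury"),
    ("bury", "object", "bone"),
    ("bury", "location", "WHERE"),
}

def score(candidate):
    """Fill the wh-slot with the candidate and count question triples
    that are then supported by the text graph."""
    filled = {(h, r, candidate if t == "WHERE" else t)
              for h, r, t in question_graph}
    return len(filled & text_graph)

candidates = ["garden", "bone", "brown"]
best = max(candidates, key=score)
print(best)  # 'garden'
```

Because the triples may come from different sentences, a candidate can be supported even when no single sentence overlaps much with the question, which is the advantage the abstract describes.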

Chinese Prosody Generation Based on C-ToBI Representation for Text-to-Speech (음성합성을 위한 C-ToBI기반의 중국어 운율 경계와 F0 contour 생성)

  • Kim, Seung-Won;Zheng, Yu;Lee, Gary-Geunbae;Kim, Byeong-Chang
    • MALSORI
    • /
    • no.53
    • /
    • pp.75-92
    • /
    • 2005
  • Prosody modeling is critical in developing text-to-speech (TTS) systems, in which speech synthesis is used to automatically generate natural speech. In this paper, we present a prosody generation architecture based on the Chinese Tone and Break Index (C-ToBI) representation. ToBI is a multi-tier representation system, based on linguistic knowledge, for transcribing events in an utterance. A TTS system that adopts ToBI as an intermediate representation is known to exhibit higher flexibility, modularity, and domain/task portability compared with direct prosody generation TTS systems. However, corpus preparation is very expensive for practical-level performance, because ToBI-labeled corpora have been constructed manually by many prosody experts, and a large amount of data is normally required for accurate statistical prosody modeling. This paper proposes a new method that transcribes C-ToBI labels automatically in Chinese speech. We model Chinese prosody generation as a classification problem and apply conditional Maximum Entropy (ME) classification to it. We empirically verify the usefulness of various natural language and phonology features in building well-integrated features for the ME framework.


Combining Distributed Word Representation and Document Distance for Short Text Document Clustering

  • Kongwudhikunakorn, Supavit;Waiyamai, Kitsana
    • Journal of Information Processing Systems
    • /
    • v.16 no.2
    • /
    • pp.277-300
    • /
    • 2020
  • This paper presents a method for clustering short text documents such as news headlines, social media statuses, or instant messages. Because these documents are usually short and sparse, an appropriate technique is required to discover the hidden knowledge in them. The objective of this paper is to identify the combination of document representation, document distance, and document clustering that yields the best clustering quality. Document representations are expanded with external knowledge sources through a distributed representation. To cluster documents, a K-means partitioning-based clustering technique is applied, with document similarity measured by Word Mover's Distance. To validate the effectiveness of the proposed method, experiments were conducted comparing its clustering quality against several leading methods. The proposed method produced document clusters with higher precision, recall, F1-score, and adjusted Rand index on both real-world and standard data sets. Furthermore, manual inspection of the clustering results confirmed that the topics of each cluster are clearly reflected by its members.
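The document-distance component can be sketched with a relaxed variant of Word Mover's Distance, in which each word of one document simply moves to its nearest word in the other; real WMD solves an optimal transport problem over learned embeddings. The 2-d "embeddings" below are hand-made for illustration only.

```python
# Relaxed Word Mover's Distance sketch with tiny hand-made 2-d "embeddings".
# Real WMD uses learned word vectors and solves an optimal transport problem.
import math

EMB = {  # hypothetical toy embeddings, for illustration only
    "obama":  (1.0, 1.0), "president": (1.1, 0.9),
    "speaks": (0.0, 1.0), "greets":    (0.1, 1.1),
    "media":  (1.0, 0.0), "press":     (0.9, 0.1),
    "banana": (5.0, 5.0),
}

def dist(a, b):
    return math.dist(EMB[a], EMB[b])

def relaxed_wmd(doc_a, doc_b):
    """Average cost of moving each word in doc_a to its closest word in doc_b."""
    return sum(min(dist(w, v) for v in doc_b) for w in doc_a) / len(doc_a)

d1 = ["obama", "speaks", "media"]
d2 = ["president", "greets", "press"]
d3 = ["banana", "banana", "banana"]

print(relaxed_wmd(d1, d2) < relaxed_wmd(d1, d3))  # True
```

Even though d1 and d2 share no words, their distance is small because each word has a nearby counterpart in embedding space, which is exactly why WMD suits short, sparse texts.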

Evaluation of Vulnerability on Rural Emergency Relief Service using Text Mining (Text Mining 기법을 활용한 농촌마을 긴급구호서비스 접근 취약성 평가)

  • Woo, Jaehyeong;Park, Jinseon;Yoon, Seongsoo
    • Journal of Korean Society of Rural Planning
    • /
    • v.24 no.1
    • /
    • pp.67-74
    • /
    • 2018
  • Rural areas are large residential spaces with fewer people than urban areas, which makes them vulnerable in social services such as health care and security. This research analyzed the vulnerability of emergency relief services in rural villages through text mining, and weighting values were calculated. Based on the calculated statistics, police facilities are the most important, while fire-fighting and hospital facilities are important as well. In addition, the distance from each emergency relief service facility to the rural villages was obtained using an Open API. By combining these results, the vulnerability of rural villages to emergency relief services was calculated and classified into 5 levels: 33 villages in the 1st class, followed by 1,179 in the 2nd class, 199 in the 3rd class, 17 in the 4th class, and 8 in the 5th class. To further supplement the areas vulnerable to emergency relief services, geographical relocation of emergency relief service facilities and a policy approach are necessary.
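The final grading step can be sketched as a weighted-distance score cut into 5 levels; the weights and thresholds below are hypothetical assumptions, not the paper's calibrated values.

```python
# Sketch of the weighting-and-grading step. The facility weights (police
# heaviest, per the abstract's ordering) and the level cut points are
# hypothetical assumptions, not the study's calibrated values.
WEIGHTS = {"police": 0.5, "fire": 0.3, "hospital": 0.2}

def vulnerability(distances_km):
    """Weighted sum of distances (km) to each emergency relief facility."""
    return sum(WEIGHTS[f] * d for f, d in distances_km.items())

def grade(score, cuts=(2, 4, 6, 8)):
    """Map a score to levels 1 (least vulnerable) .. 5 (most vulnerable)."""
    return 1 + sum(score > c for c in cuts)

village = {"police": 3.0, "fire": 5.0, "hospital": 10.0}
s = vulnerability(village)  # 0.5*3 + 0.3*5 + 0.2*10 = 5.0
print(grade(s))  # 3
```

Running this over every village and tallying the grades would reproduce a 5-class breakdown of the kind the abstract reports.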