• Title/Summary/Keyword: automated text analysis

Automatic Title Detection by Spatial Feature and Projection Profile for Document Images (공간 정보와 투영 프로파일을 이용한 문서 영상에서의 타이틀 영역 추출)

  • Park, Hyo-Jin;Kim, Bo-Ram;Kim, Wook-Hyun
    • Journal of the Institute of Convergence Signal Processing / v.11 no.3 / pp.209-214 / 2010
  • This paper proposes a segmentation and title detection algorithm for document images. The automated title detection method we have developed is composed of two phases: segmentation and title area detection. In the first phase, the document image is binarized and segmented by a combination of morphological operations and CCA (connected component analysis). This phase provides the segmented regions from which the title area is detected in the second phase. Candidate title areas are detected using geometric information, and the title region is then extracted by removing non-title regions. After a classification step that removes non-text regions, projection is performed to detect the title region: since the largest font in a document is usually used for the title, horizontal projection is performed within the text areas. In summary, we propose a method of segmentation and title detection for various forms of document images using geometric features and projection profile analysis. The proposed system is expected to have various applications, such as document title recognition, multimedia data searching, real-time image processing, and so on.
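
A minimal sketch of the two-phase idea in this abstract, assuming a pre-binarized page and using OpenCV and NumPy for illustration; the structuring element size and the geometric filter thresholds are assumptions, not the authors' parameters.

```python
import cv2
import numpy as np

def detect_title_band(binary_page: np.ndarray) -> tuple[int, int]:
    """Return the (top, bottom) rows of the text band with the tallest ink,
    via connected components plus a horizontal projection profile."""
    # Phase 1: merge nearby glyphs into line blobs, then label components.
    merged = cv2.morphologyEx(binary_page, cv2.MORPH_CLOSE,
                              np.ones((3, 15), np.uint8))
    n, _, stats, _ = cv2.connectedComponentsWithStats(merged)

    # Keep components whose geometry looks like a text line (wide, not huge).
    mask = np.zeros_like(binary_page)
    for x, y, w, h, area in stats[1:]:
        if w > 2 * h and h < binary_page.shape[0] // 4:
            mask[y:y + h, x:x + w] = 255

    # Phase 2: horizontal projection; the title is assumed to be the band
    # of consecutive ink rows with the greatest height (largest font).
    rows = np.where(mask.sum(axis=1) > 0)[0]
    bands = np.split(rows, np.where(np.diff(rows) > 1)[0] + 1)
    title = max(bands, key=len)
    return int(title[0]), int(title[-1])
```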

Currents in Integrative Biochip Informatics

  • Kim, Ju-Han
    • Proceedings of the Korean Society for Bioinformatics Conference / 2001.10a / pp.1-9 / 2001
  • …scale genomic and postgenomic data means that many of the challenges in biomedical research are now challenges in computational sciences and information technology. The informatics revolutions in both clinical informatics and bioinformatics will change the current paradigm of biomedical sciences and the practice of clinical medicine, including diagnostics, therapeutics, and prognostics. Postgenome informatics, powered by high-throughput technologies and genomic-scale databases, is likely to transform our biomedical understanding forever, much the same way that biochemistry did a generation ago. In this talk, I will describe how these technologies will impact biomedical research and clinical care, emphasizing recent advances in biochip-based functional genomics. Basic data preprocessing with normalization and filtering, primary pattern analysis, and machine learning algorithms will be presented. Issues of integrated biochip informatics technologies, including multivariate data projection, gene-metabolic pathway mapping, automated biomolecular annotation, text mining of factual and literature databases, and integrated management of biomolecular databases, will be discussed. Each step will be illustrated with real examples from ongoing research activities in the context of clinical relevance. Issues of linking molecular genotype and clinical phenotype information will be discussed.
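
The preprocessing step mentioned in this talk (normalization and filtering) is not specified in detail; the sketch below shows one conventional treatment of a hypothetical genes-by-samples expression matrix, not the speaker's pipeline.

```python
import numpy as np

def preprocess(expr: np.ndarray, min_var: float = 0.5) -> np.ndarray:
    """Log-transform, median-center each sample, drop low-variance genes.
    expr: genes x samples matrix of raw intensities."""
    logged = np.log2(expr + 1.0)                   # variance stabilization
    centered = logged - np.median(logged, axis=0)  # per-sample normalization
    keep = centered.var(axis=1) >= min_var         # filter flat genes
    return centered[keep]
```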

Developing the Automated Sentiment Learning Algorithm to Build the Korean Sentiment Lexicon for Finance (재무분야 감성사전 구축을 위한 자동화된 감성학습 알고리즘 개발)

  • Su-Ji Cho;Ki-Kwang Lee;Cheol-Won Yang
    • Journal of Korean Society of Industrial and Systems Engineering / v.46 no.1 / pp.32-41 / 2023
  • With the recent development of big data analysis technology, many studies have attempted to extract sentiment from text and verify its information power in the field of finance. A number of prior studies use pre-defined sentiment dictionaries or machine learning methods to extract sentiment from financial documents. However, both methods have the disadvantage of being labor-intensive and subjective because they require a manual sentiment learning process. In this study, we developed a financial sentiment dictionary that automatically extracts sentiment from the body text of analyst reports using a modified Bayes rule, and verified its performance through a binary classification model that predicts actual stock price movements. The proposed financial dictionary showed about 4% better predictive power for actual stock price movements than the representative Loughran and McDonald (2011) financial dictionary. The sentiment extraction method proposed in this study enables efficient and objective judgment because it automatically learns the sentiment of words using both the change in target price and the cumulative abnormal returns. In addition, the dictionary can be easily updated by re-calculating the conditional probabilities. The results of this study are expected to be readily expandable and applicable not only to analyst reports but also to other financial texts such as performance reports, IR reports, press articles, and social media.
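
A minimal sketch of the kind of automated learning described here: each word is scored by Bayes-style conditional probabilities of appearing in reports followed by positive versus negative price moves. The paper's modified Bayes rule, which combines target-price changes and cumulative abnormal returns, is richer than this; the binary label and tokenization below are simplifying assumptions.

```python
from collections import Counter

def build_lexicon(reports: list[tuple[list[str], int]]) -> dict[str, float]:
    """reports: (tokens, label) pairs, label = +1 (price up) / -1 (down).
    Returns word -> sentiment score in [-1, 1]."""
    pos, neg = Counter(), Counter()
    for tokens, label in reports:
        (pos if label > 0 else neg).update(set(tokens))
    n_pos, n_neg = sum(pos.values()), sum(neg.values())
    lexicon = {}
    for word in set(pos) | set(neg):
        p_pos = (pos[word] + 1) / (n_pos + 1)  # Laplace-smoothed conditionals
        p_neg = (neg[word] + 1) / (n_neg + 1)
        lexicon[word] = (p_pos - p_neg) / (p_pos + p_neg)  # normalized polarity
    return lexicon
```

Updating the dictionary then amounts to re-running this count-and-normalize pass over the enlarged report set, which matches the abstract's point about easy updates via re-calculated conditional probabilities.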

Research on text mining based malware analysis technology using string information (문자열 정보를 활용한 텍스트 마이닝 기반 악성코드 분석 기술 연구)

  • Ha, Ji-hee;Lee, Tae-jin
    • Journal of Internet Computing and Services / v.21 no.1 / pp.45-55 / 2020
  • Due to the development of information and communication technology, the number of new and variant malware samples is increasing rapidly every year, and various types of malware are spreading with the growth of Internet of Things and cloud computing technology. In this paper, we propose a malware analysis method based on string information that can be used regardless of the operating system environment and that represents library call information related to malicious behavior. Attackers can easily create malware by reusing existing code or by using automated authoring tools, and the generated malware operates in a similar way to existing malware. Since most strings extractable from malicious code consist of information closely related to malicious behavior, they are weighted with a text mining based method to produce effective features for malware analysis. Based on the processed data, models are built with various machine learning algorithms to perform experiments on malware detection and malware group classification. The data were compared and verified against files from both the Windows and Linux operating systems. The accuracy of malware detection is about 93.5%, and the accuracy of group classification is about 90%. The proposed technique has a wide range of applications: it is relatively simple, fast, and operating system independent, and since it works as a single model, there is no need to build a separate model for each group when classifying malware groups. In addition, since the string information is extracted through static analysis, it can be processed faster than analysis methods that directly execute the code.
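
A minimal sketch of the pipeline this abstract describes: statically extract printable strings, weight them with a text mining scheme (TF-IDF here, as an assumption; the paper's exact weighting may differ), and feed them to a machine learning classifier. The Random Forest choice is illustrative.

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier

def printable_strings(path: str, min_len: int = 4) -> str:
    """Emulate `strings`: pull runs of printable ASCII from a binary."""
    data = open(path, "rb").read()
    runs = re.findall(rb"[\x20-\x7e]{%d,}" % min_len, data)
    return " ".join(r.decode("ascii") for r in runs)

def train(paths: list[str], labels: list[int]):
    """labels: 0 = benign, 1 = malicious (or group IDs for classification)."""
    docs = [printable_strings(p) for p in paths]     # one document per sample
    vec = TfidfVectorizer(token_pattern=r"\S+")      # strings, not natural words
    model = RandomForestClassifier().fit(vec.fit_transform(docs), labels)
    return vec, model
```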

Analyzing Contextual Polarity of Unstructured Data for Measuring Subjective Well-Being (주관적 웰빙 상태 측정을 위한 비정형 데이터의 상황기반 긍부정성 분석 방법)

  • Choi, Sukjae;Song, Yeongeun;Kwon, Ohbyung
    • Journal of Intelligence and Information Systems / v.22 no.1 / pp.83-105 / 2016
  • Measuring an individual's subjective wellbeing in an accurate, unobtrusive, and cost-effective manner is a core success factor of the wellbeing support system, which is a type of medical IT service. However, despite being very accurate, measurements with self-report questionnaires and wearable sensors are cost-intensive and obtrusive when the wellbeing support system must run in real time. Recently, inferring the state of subjective wellbeing from unstructured data with conventional sentiment analysis has been proposed as an alternative that resolves these drawbacks. However, this approach does not consider contextual polarity, which results in lower measurement accuracy. Moreover, there is no sentiment wordnet or ontology for the subjective wellbeing area. Hence, this paper proposes a method to extract keywords and their contextual polarity representing the subjective wellbeing state from unstructured text on online websites in order to improve the reasoning accuracy of the sentiment analysis. The proposed method is as follows. First, a set of general sentiment words is prepared. SentiWordNet, the most widely used dictionary, was adopted; it contains about 100,000 words, including nouns, verbs, adjectives, and adverbs, with polarities from -1.0 (extremely negative) to 1.0 (extremely positive). Second, corpora on subjective wellbeing (SWB corpora) were obtained by crawling online text. A survey was conducted to prepare a learning dataset that includes an individual's opinion and the level of self-reported wellness, such as stress and depression. The participants were asked to respond with their feelings about online news on two topics. Next, three data sources were extracted from the SWB corpora: demographic information, psychographic information, and the structural characteristics of the text (e.g., the number of words used in the text, simple statistics on the special characters used). These were used to adjust the level of a specific SWB factor. Finally, a set of reasoning rules was generated for each wellbeing factor to estimate the SWB of an individual based on the text written by that individual. The experimental results suggest that using contextual polarity for each SWB factor (e.g., stress, depression) significantly improved the estimation accuracy compared to conventional sentiment analysis methods incorporating SentiWordNet. Although literature is available on Korean sentiment analysis, such studies used only a limited set of sentiment words; due to the small number of words, many sentences are overlooked when estimating the level of sentiment. In contrast, the proposed method can identify multiple sentiment-neutral words as sentiment words in the context of a specific SWB factor. The results also suggest that a senti-word dictionary containing contextual polarity needs to be constructed alongside a common-sense dictionary such as SenticNet. These efforts will enrich and enlarge the application area of sentic computing. The study is helpful to practitioners and managers of wellness services in that a couple of characteristics of unstructured text have been identified for improving SWB measurement. Consistent with the literature, the results showed that gender and age affect the SWB state when the individual is exposed to an identical cue from the online text. In addition, the length of the textual response and the usage pattern of special characters were found to indicate the individual's SWB. These findings imply that better SWB measurement should involve collecting the textual structure and the individual's demographic conditions. In the future, the proposed method should be improved by automated identification of contextual polarity in order to enlarge the vocabulary in a cost-effective manner.
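
A minimal sketch of context-aware scoring in the spirit of this method: SentiWordNet supplies prior polarities, and a per-factor table of contextual polarities overrides them. The STRESS_CONTEXT entries are hypothetical stand-ins for the learned contextual polarities; NLTK's sentiwordnet and wordnet corpora must be downloaded first.

```python
from nltk.corpus import sentiwordnet as swn

# Hypothetical contextual overrides for one SWB factor ("stress"):
# word -> polarity within that context, standing in for the learned table.
STRESS_CONTEXT = {"deadline": -0.6, "quiet": 0.4}

def word_polarity(word: str, context: dict[str, float]) -> float:
    if word in context:                      # contextual polarity wins
        return context[word]
    synsets = list(swn.senti_synsets(word))  # SentiWordNet prior
    if not synsets:
        return 0.0
    first = synsets[0]                       # most frequent sense
    return first.pos_score() - first.neg_score()

# Score a toy response; run nltk.download("sentiwordnet") and
# nltk.download("wordnet") once beforehand.
score = sum(word_polarity(w, STRESS_CONTEXT)
            for w in "the deadline made me anxious".split())
```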

A Research on Developing a Card News System based on News Generation Algorithm (알고리즘 기반의 개인화된 카드뉴스 생성 시스템 연구)

  • Kim, Dongwhan;Lee, Sanghyuk;Oh, Jonghwan;Kim, Junsuk;Park, Sungmin;Choi, Woobin;Lee, Joonhwan
    • Journal of Korea Multimedia Society / v.23 no.2 / pp.301-316 / 2020
  • Algorithm journalism refers to the practice of automated news generation using algorithms that produce human-sounding narratives. Algorithm journalism is known for its strength in automating repetitive tasks through rapid and accurate analysis of data, and it has been actively used in news domains such as sports and finance. In this paper, we propose an interactive card news system that generates personalized articles about the 2018 local elections. The system consists of modules that collect and analyze election data, generate texts and images, and allow users to specify their interests in the local elections. When a user selects regions of interest, election types, candidate names, and political parties, the system generates card news accordingly. In the study, we examined how the personalized card news was evaluated in comparison with text and card news articles written by human journalists, and derived implications for the potential use of algorithms in reporting political events.
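
A minimal sketch of template-based generation of one card-news sentence, the basic mechanism of this kind of system; the field names and the template are illustrative assumptions, not the authors' actual modules.

```python
# Hypothetical analyzed election record for one user-selected race.
result = {"region": "Jongno-gu", "office": "mayor",
          "winner": "Kim", "party": "Party A", "share": 52.3}

TEMPLATE = ("In the {region} {office} race, {winner} of {party} "
            "won with {share:.1f}% of the vote.")

def generate_sentence(data: dict) -> str:
    """Fill the narrative template with the analyzed election data."""
    return TEMPLATE.format(**data)

print(generate_sentence(result))
```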

Preparation of Soil Input Files to a Crop Model Using the Korean Soil Information System (흙토람 데이터베이스를 활용한 작물 모델의 토양입력자료 생성)

  • Yoo, Byoung Hyun;Kim, Kwang Soo
    • Korean Journal of Agricultural and Forest Meteorology / v.19 no.3 / pp.174-179 / 2017
  • Soil parameters are required inputs to crop models, which estimate crop yield under given environmental conditions. The Korean Soil Information System (KSIS), which provides detailed soil profile records for 390 soil series in HTML (HyperText Markup Language) format, is useful for preparing such soil input files. The Korean Soil Information System Processing Tool (KSISPT) was developed to aid the generation of soil input data from the KSIS database. The tool was implemented in Java and consists of modules for parsing the HTML documents of the KSIS, storing the data required for soil input files, calculating additional soil parameters, and writing soil input files to a local disk. Using this automated soil data preparation tool, about 940 soil input files were created for each of the DSSAT and ORYZA 2000 models. In combination with a soil series distribution map at 30 m resolution, crop yield could be projected spatially under climate change, which would help the development of adaptation strategies.
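
KSISPT itself is a Java tool; the Python sketch below only illustrates the same parse-and-write flow. The table markup and column names assumed here are hypothetical, not the real KSIS page structure.

```python
from bs4 import BeautifulSoup

def parse_profile(html: str) -> list[dict]:
    """Read one assumed soil-horizon table: depth, clay %, sand %."""
    soup = BeautifulSoup(html, "html.parser")
    rows = soup.select("table.profile tr")[1:]   # skip the header row
    keys = ("depth_cm", "clay_pct", "sand_pct")
    return [dict(zip(keys, (td.get_text(strip=True) for td in r.find_all("td"))))
            for r in rows]

def write_soil_input(horizons: list[dict], path: str) -> None:
    """Write a simplistic fixed-width soil input file, one line per horizon."""
    with open(path, "w") as f:
        for h in horizons:
            f.write(f"{h['depth_cm']:>6} {h['clay_pct']:>6} {h['sand_pct']:>6}\n")
```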

A Study on the Effect of the Document Summarization Technique on the Fake News Detection Model (문서 요약 기법이 가짜 뉴스 탐지 모형에 미치는 영향에 관한 연구)

  • Shim, Jae-Seung;Won, Ha-Ram;Ahn, Hyunchul
    • Journal of Intelligence and Information Systems / v.25 no.3 / pp.201-220 / 2019
  • Fake news has emerged as a significant issue over the last few years, igniting discussions and research on how to solve this problem. In particular, studies on automated fact-checking and fake news detection using artificial intelligence and text analysis techniques have drawn attention. Fake news detection research entails a form of document classification; thus, document classification techniques have been widely used in this type of research. However, document summarization techniques have received little attention in this field. At the same time, automatic news summarization services have become popular, and a recent study found that using news summarized through abstractive summarization strengthened the predictive performance of fake news detection models. Therefore, the integration of document summarization technology needs to be studied in the Korean news data environment. In order to examine the effect of extractive summarization on the fake news detection model, we first summarized news articles through extractive summarization. Second, we created a detection model based on the summarized news. Finally, we compared our model with the full-text-based detection model. The study found that BPN (Back Propagation Neural Network) and SVM (Support Vector Machine) did not exhibit a large difference in performance; however, for DT (Decision Tree), the full-text-based model demonstrated somewhat better performance. In the case of LR (Logistic Regression), our model exhibited superior performance, although the difference between our model and the full-text-based model was not statistically significant. Therefore, when summarization is applied, at least the core information of the fake news is preserved, and the LR-based model confirms the possibility of performance improvement. This study features an experimental application of extractive summarization in fake news detection research employing various machine learning algorithms. Its limitations are, essentially, the relatively small amount of data and the lack of comparison between summarization technologies. Therefore, an in-depth analysis that applies various analytical techniques to a larger data volume would be helpful in the future.
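
A minimal sketch of the summarize-then-classify flow: a simple TF-IDF sentence scorer stands in for the study's extractive summarizer, and Logistic Regression stands in for the best-performing detector. Sentence splitting on periods is a simplifying assumption.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def extract_summary(article: str, k: int = 3) -> str:
    """Keep the k sentences with the highest mean TF-IDF weight."""
    sents = [s.strip() for s in article.split(".") if s.strip()]
    if len(sents) <= k:
        return article
    tfidf = TfidfVectorizer().fit_transform(sents)
    scores = np.asarray(tfidf.mean(axis=1)).ravel()
    top = sorted(np.argsort(scores)[-k:])          # keep original sentence order
    return ". ".join(sents[i] for i in top)

def train_detector(articles: list[str], labels: list[int]):
    """labels: 0 = real news, 1 = fake news."""
    summaries = [extract_summary(a) for a in articles]
    vec = TfidfVectorizer()
    model = LogisticRegression(max_iter=1000).fit(
        vec.fit_transform(summaries), labels)
    return vec, model
```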

Analysis of Naver CAPTCHA with Effective Segmentation (효과적인 글자 분리 방법을 사용한 네이버 캡차 공격)

  • Nyang, Dae Hun;Choi, Yong Heon;Hong, Seok Jun;Lee, Kyunghee
    • Journal of the Korea Institute of Information Security & Cryptology / v.23 no.5 / pp.909-917 / 2013
  • CAPTCHA is an automated test, used mainly by web services, to tell computers apart from humans; it has evolved since the introduction of its most naive form, in which users are asked to type simple strings. Though many types of CAPTCHAs have been proposed, text-based CAPTCHAs have prevailed for user convenience. In this paper, we introduce new segmentation schemes and show an attack method that breaks the CAPTCHA of Naver, which holds more than 70% of the Korean search engine market. The experimental results show that 938 out of 1,000 trials were successfully analyzed, which implies that this CAPTCHA can no longer be considered usable.
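
A minimal sketch of projection-based character segmentation, a common ingredient of attacks on text CAPTCHAs; the paper's own segmentation schemes are more elaborate than this split-at-empty-columns baseline.

```python
import numpy as np

def segment_characters(binary: np.ndarray) -> list[np.ndarray]:
    """binary: 2-D array, nonzero = ink. Split the image at empty columns."""
    profile = (binary > 0).sum(axis=0)        # vertical projection profile
    cols = np.where(profile > 0)[0]
    if cols.size == 0:
        return []
    # Break the ink columns into runs separated by gaps; each run is
    # treated as one character candidate for the recognizer.
    groups = np.split(cols, np.where(np.diff(cols) > 1)[0] + 1)
    return [binary[:, g[0]:g[-1] + 1] for g in groups]
```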

Standard-based Integration of Heterogeneous Large-scale DNA Microarray Data for Improving Reusability

  • Jung, Yong;Seo, Hwa-Jeong;Park, Yu-Rang;Kim, Ji-Hun;Bien, Sang Jay;Kim, Ju-Han
    • Genomics & Informatics / v.9 no.1 / pp.19-27 / 2011
  • The Gene Expression Omnibus (GEO) holds the largest collection of gene-expression microarray data, and that collection has grown exponentially. Microarray data in GEO have been generated in many different formats and often lack standardized annotation and documentation, so it is hard to know whether preprocessing has been applied to a dataset, and if so, in what way. Standard-based integration of heterogeneous data formats and metadata is necessary for comprehensive data query, analysis, and mining. We attempted to integrate the heterogeneous microarray data in GEO based on the Minimum Information About a Microarray Experiment (MIAME) standard. We unified the data fields of the GEO Data tables and mapped the attributes of GEO metadata onto MIAME elements. We also discriminated non-preprocessed raw datasets from processed ones by using a two-step classification method. Most of the procedures were developed as semi-automated algorithms incorporating text mining techniques. We localized 2,967 Platforms, 4,867 Series, and 103,590 Samples covering 279 organisms, integrated them into a standard-based relational schema, and developed a comprehensive query interface for data extraction. Our tool, GEOQuest, is available at http://www.snubi.org/software/GEOQuest/.
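
A minimal sketch of the raw-versus-processed discrimination described here, using a value-range heuristic as a first pass; the thresholds are illustrative assumptions, not the paper's two-step classifier.

```python
import numpy as np

def looks_preprocessed(values: np.ndarray) -> bool:
    """Raw microarray intensities are typically large and strictly positive;
    log-transformed or log-ratio data are small and often signed."""
    return values.max() < 30 or values.min() < 0

def classify_sample(values: np.ndarray) -> str:
    return "processed" if looks_preprocessed(values) else "raw"
```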