• Title/Summary/Keyword: Text analysis

Search Result 3,350, Processing Time 0.028 seconds

Generative probabilistic model with Dirichlet prior distribution for similarity analysis of research topic

  • Milyahilu, John;Kim, Jong Nam
    • Journal of Korea Multimedia Society
    • /
    • v.23 no.4
    • /
    • pp.595-602
    • /
    • 2020
  • We propose a generative probabilistic model with Dirichlet prior distribution for topic modeling and text similarity analysis. It assigns a topic and calculates text correlation between documents within a corpus. It also provides posterior probabilities that are assigned to each topic of a document based on the prior distribution in the corpus. We then present a Gibbs sampling algorithm for inference about the posterior distribution and compute text correlation among 50 abstracts from the papers published by IEEE. We also conduct a supervised learning to set a benchmark that justifies the performance of the LDA (Latent Dirichlet Allocation). The experiments show that the accuracy for topic assignment to a certain document is 76% for LDA. The results for supervised learning show the accuracy of 61%, the precision of 93% and the f1-score of 96%. A discussion for experimental results indicates a thorough justification based on probabilities, distributions, evaluation metrics and correlation coefficients with respect to topic assignment.

Financial Footnote Analysis for Financial Ratio Predictions based on Text-Mining Techniques (재무제표 주석의 텍스트 분석 통한 재무 비율 예측 향상 연구)

  • Choe, Hyoung-Gyu;Lee, Sang-Yong Tom
    • Knowledge Management Research
    • /
    • v.21 no.2
    • /
    • pp.177-196
    • /
    • 2020
  • Since the adoption of K-IFRS(Korean International Financial Reporting Standards), the amount of financial footnotes has been increased. However, due to the stereotypical phrase and the lack of conciseness, deriving the core information from footnotes is not really easy yet. To propose a solution for this problem, this study tried financial footnote analysis for financial ratio predictions based on text-mining techniques. Using the financial statements data from 2013 to 2018, we tried to predict the earning per share (EPS) of the following quarter. We found that measured prediction errors were significantly reduced when text-mined footnotes data were jointly used. We believe this result came from the fact that discretionary financial figures, which were hardly predicted with quantitative financial data, were more correlated with footnotes texts.

A Study of Main Contents Extraction from Web News Pages based on XPath Analysis

  • Sun, Bok-Keun
    • Journal of the Korea Society of Computer and Information
    • /
    • v.20 no.7
    • /
    • pp.1-7
    • /
    • 2015
  • Although data on the internet can be used in various fields such as source of data of IR(Information Retrieval), Data mining and knowledge information servece, and contains a lot of unnecessary information. The removal of the unnecessary data is a problem to be solved prior to the study of the knowledge-based information service that is based on the data of the web page, in this paper, we solve the problem through the implementation of XTractor(XPath Extractor). Since XPath is used to navigate the attribute data and the data elements in the XML document, the XPath analysis to be carried out through the XTractor. XTractor Extracts main text by html parsing, XPath grouping and detecting the XPath contains the main data. The result, the recognition and precision rate are showed in 97.9%, 93.9%, except for a few cases in a large amount of experimental data and it was confirmed that it is possible to properly extract the main text of the news.

The Analysis for the Ergonomic Design of the Topographical Objects Marking Method (인간공학적 설계를 위한 지형지물 명칭 표기에 대한 분석)

  • Moon, Hyung-Don;Cha, Doo-Won;Park, Peom
    • Proceedings of the ESK Conference
    • /
    • 1997.10a
    • /
    • pp.260-266
    • /
    • 1997
  • The Navigation Streets Map that directly supplies information for a driver plays an important role in Car Navigation System(CNS). Therefore, cause the compatibility, preference, visibility, and readability of the navigation streets map will be hold sway over driver's information acquisition, safety, and performance in CNS, it is important that designer's requirements should be analyuzed for ergonomic design of navigation streets map. Especially, there are differences between the topographical objects marking method and that of the general map in a navigation streets map. Because much information is marked in a narrow space, it is important that the text information should be done in navigation streets map. Also, because it is a important factor to mark the text format for the marking method of the topographical objects, the design requirements will be drawn out through the factor analysis for the text marking method. And when ergonomic designers are doing, this result will be basic research for developing driving simulator of CNS and it will be a ergonomic guideline of navigation streets map.

  • PDF

PubMiner: Machine Learning-based Text Mining for Biomedical Information Analysis

  • Eom, Jae-Hong;Zhang, Byoung-Tak
    • Genomics & Informatics
    • /
    • v.2 no.2
    • /
    • pp.99-106
    • /
    • 2004
  • In this paper we introduce PubMiner, an intelligent machine learning based text mining system for mining biological information from the literature. PubMiner employs natural language processing techniques and machine learning based data mining techniques for mining useful biological information such as protein­protein interaction from the massive literature. The system recognizes biological terms such as gene, protein, and enzymes and extracts their interactions described in the document through natural language processing. The extracted interactions are further analyzed with a set of features of each entity that were collected from the related public databases to infer more interactions from the original interactions. An inferred interaction from the interaction analysis and native interaction are provided to the user with the link of literature sources. The performance of entity and interaction extraction was tested with selected MEDLINE abstracts. The evaluation of inference proceeded using the protein interaction data of S. cerevisiae (bakers yeast) from MIPS and SGD.

Comparison of Neural Network Techniques for Text Data Analysis

  • Kim, Munhee;Kang, Kee-Hoon
    • International Journal of Advanced Culture Technology
    • /
    • v.8 no.2
    • /
    • pp.231-238
    • /
    • 2020
  • Generally, sequential data refers to data having continuity. Text data, which is a representative type of unstructured data, is also sequential data in that it is necessary to know the meaning of the preceding word in order to know the meaning of the following word or context. So far, many techniques for analyzing sequential data such as text data have been proposed. In this paper, four methods of 1d-CNN, LSTM, BiLSTM, and C-LSTM are introduced, focusing on neural network techniques. In addition, by using this, IMDb movie review data was classified into two classes to compare the performance of the techniques in terms of accuracy and analysis time.

Analysis of 'Better Class' Characteristics and Patterns from College Lecture Evaluation by Longitudinal Big Data

  • Nam, Min-Woo;Cho, Eun-Soon
    • International Journal of Contents
    • /
    • v.15 no.3
    • /
    • pp.7-12
    • /
    • 2019
  • The purpose of this study was to analyze characteristics and patterns of 'better class' by using the longitudinal text mining big data analysis technique from subjective lecture evaluation comments. First, this study classified upper 30% classes to deduce certain characteristics and patterns from every five-year subjective text data for 10 years. A total of 47,177courses (100%) from spring semester 2005 to fall semester 2014 were analyzed from a university at a metropolitan city in the mid area of South Korea. This study extracted meaningful words such as good, course, professor, appreciation, lecture, interesting, useful, know, easy, improvement, progress, teaching material, passion, and concern from the order of frequency 2005-2009. The other set of words were class, appreciation, professor, good, course, interesting, understanding, useful, help, student, effort, thinking, not difficult, explanation, lecture, hard, pleasant, easy, study, examination, like, various, fun, and knowledge 2010-2014. This study suggests that the characteristics and patterns of 'better class' at college, should be analyzed according to different academic code such as liberal arts, fine arts, social science, engineering, math and science, and etc.

A Study on De-Identification Methods to Create a Basis for Safety Report Text Mining Analysis (항공안전 보고 데이터 텍스트 분석 기반 조성을 위한 비식별 처리 기술 적용 연구)

  • Hwang, Do-bin;Kim, Young-gon;Sim, Yeong-min
    • Journal of the Korean Society for Aviation and Aeronautics
    • /
    • v.29 no.4
    • /
    • pp.160-165
    • /
    • 2021
  • In order to identify and analyze potential aviation safety hazards, analysis of aviation safety report data must be preceded. Therefore, in consideration of the provisions of the Aviation Safety Act and the recommendations of ICAO Doc 9859 SMM Edition 4th, personal information in the reporting data and sensitive information of the reporter, etc. It identifies the scope of de-identification targets and suggests a method for applying de-identification processing technology to personal and sensitive information including unstructured text data.

Novel Optimizer AdamW+ implementation in LSTM Model for DGA Detection

  • Awais Javed;Adnan Rashdi;Imran Rashid;Faisal Amir
    • International Journal of Computer Science & Network Security
    • /
    • v.23 no.11
    • /
    • pp.133-141
    • /
    • 2023
  • This work take deeper analysis of Adaptive Moment Estimation (Adam) and Adam with Weight Decay (AdamW) implementation in real world text classification problem (DGA Malware Detection). AdamW is introduced by decoupling weight decay from L2 regularization and implemented as improved optimizer. This work introduces a novel implementation of AdamW variant as AdamW+ by further simplifying weight decay implementation in AdamW. DGA malware detection LSTM models results for Adam, AdamW and AdamW+ are evaluated on various DGA families/ groups as multiclass text classification. Proposed AdamW+ optimizer results has shown improvement in all standard performance metrics over Adam and AdamW. Analysis of outcome has shown that novel optimizer has outperformed both Adam and AdamW text classification based problems.

Research Trends on Literature Reviews in Scopus Journals by Authors from Indonesia, Japan, South Korea, Vietnam, Singapore, and Malaysia: A Bibliometric Analysis from 2003 to 2022

  • Prakoso Bhairawa Putera;Amelya Gustina
    • Asian Journal of Innovation and Policy
    • /
    • v.12 no.3
    • /
    • pp.304-322
    • /
    • 2023
  • Text data mining ('big data methods') is one of the most widely used approaches during the COVID-19 pandemic. In particular, text data mining on Scopus databases or Web of Science (WoS). Text data mining is widely used to collect literature for later bibliometric analysis, and in the end, it becomes a literature review article. Therefore, in this article, we reveal the trend of publication of literature reviews in Scopus journals from Indonesia, Japan, South Korea, Vietnam, Singapore, and Malaysia. This article describes two essential parts, namely 1) a comparison of international publication trends and subject area of literature review publications, and 2) a comparison of Top 5 for Authors, Affiliation, Source Title, and Collaboration Country.