• Title/Summary/Keyword: Document Embedding

Estimation of the journal distance of Genomics & Informatics from other bioinformatics-driven journals, 2003-2018

  • Oh, Ji-Hye;Nam, Hee-Jo;Park, Hyun-Seok
    • Genomics & Informatics / v.19 no.4 / pp.51.1-51.8 / 2021
  • This study explored the trends of Genomics & Informatics during the period of 2003-2018 in comparison with 11 other scholarly journals: BMC Bioinformatics, Algorithms for Molecular Biology: AMB, BMC Systems Biology, Journal of Computational Biology, Briefings in Bioinformatics, BMC Genomics, Nucleic Acids Research, American Journal of Human Genetics, Oncogenesis, Disease Markers, and Microarrays. In total, 22,423 research articles were reviewed. Content analysis was the main method employed in the current research. The results were interpreted using descriptive analysis, a clustering analysis, word embedding, and deep learning techniques. Trends are discussed for the 12 journals, both individually and collectively. This is an extension of our previous study (PMCID: PMC6808643).

Research Paper Classification Scheme based on Word Embedding (워드 임베딩 기반 연구 논문 분류 기법)

  • Dipto, Biswas;Gil, Joon-Min
    • Proceedings of the Korea Information Processing Society Conference / 2021.11a / pp.494-497 / 2021
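  • Text classification, which sorts large volumes of text data into areas of interest using techniques that extract information from raw text, has recently attracted considerable attention. This paper proposes a scheme that uses word embedding to classify and recommend research papers in a specific field. CBOW (Continuous Bag-of-Words) and Sg (Skip-gram) are applied as word embeddings to research paper classification, and their performance is compared with that of the conventional TF-IDF (Term Frequency-Inverse Document Frequency) approach. The evaluation results indicate that the word-embedding-based classification scheme outperforms the TF-IDF-based one.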
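
As a rough illustration of how the two word embedding variants named in this abstract differ, the sketch below builds the training examples each one would see. It is not the authors' code: the function names, the window size, and the sample sentence are assumptions made for illustration, and no model is actually trained.

```python
def context_of(tokens, i, window):
    """Words within `window` positions to the left and right of position i."""
    return tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]


def cbow_pairs(tokens, window=2):
    """CBOW examples: (context_words, target). The target is predicted from its context."""
    pairs = []
    for i, target in enumerate(tokens):
        context = context_of(tokens, i, window)
        if context:
            pairs.append((context, target))
    return pairs


def skipgram_pairs(tokens, window=2):
    """Skip-gram examples: (target, context_word). Each context word is predicted from the target."""
    pairs = []
    for i, target in enumerate(tokens):
        pairs.extend((target, c) for c in context_of(tokens, i, window))
    return pairs


# Hypothetical sample sentence, not taken from the paper's data.
tokens = "word embedding maps words to vectors".split()
print(cbow_pairs(tokens, window=1)[:2])
print(skipgram_pairs(tokens, window=1)[:3])
```

CBOW produces one example per position (many inputs, one output), while Skip-gram produces one example per target-context pair, which is why Skip-gram typically yields more training examples from the same text.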

A Study on the Improvement Model of Document Retrieval Efficiency of Tax Judgment (조세심판 문서 검색 효율 향상 모델에 관한 연구)

  • Lee, Hoo-Young;Park, Koo-Rack;Kim, Dong-Hyun
    • Journal of the Korea Convergence Society / v.10 no.6 / pp.41-47 / 2019
  • When handling a court judgment, it is very important to search for and obtain examples of similar judgments. Existing judgment document search relies on keywords entered by the user. This requires an accurate keyword, so if the keyword is unknown, the necessary document cannot be found. In addition, a retrieved document may turn out to have different content. In this paper, we aim to improve the effectiveness of a search method that vectorizes documents into a three-dimensional space, calculates cosine similarity, and retrieves nearby documents in order to find accurate judgment examples. To this end, after analyzing the similarity of the words used in judgment examples, we extract the mode and insert it into the text of the document, thereby improving the cosine similarity of the document to be retrieved. We hope the proposed model will give users a fast, accurate search for tax-related judgment examples.
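
The retrieval idea in this abstract (vectorize documents, compare by cosine similarity, and raise similarity by inserting a representative word into the text) can be sketched as follows. This is a minimal sketch under stated assumptions: it uses plain term-count vectors rather than the paper's three-dimensional vectorization, and the query, document text, and inserted word are all hypothetical.

```python
import math
from collections import Counter


def vectorize(text):
    """Sparse term-count vector (an assumed stand-in for the paper's vectorization)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical query and judgment text.
query = vectorize("income tax deduction")
doc = "ruling on deduction of business expenses"

before = cosine_similarity(query, vectorize(doc))
# Insert a representative word into the document text, then compare again.
after = cosine_similarity(query, vectorize(doc + " tax"))
print(before, after)
```

In this toy case the inserted word also occurs in the query, so the document moves closer to the query; an inserted word that never appears in queries would only lengthen the document vector and lower the similarity.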

Mobile App Clustering and Analyzing using Document Embedding (문서임베딩 기반 모바일 앱 분류 및 이를 이용한 마켓 분석)

  • Yoon, Yeo Chan;Pahk, Soo Myung;Lim, Heui Seok
    • Annual Conference on Human and Language Technology / 2018.10a / pp.378-381 / 2018
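  • Since the smartphone was introduced, countless applications have been released for mobile devices. This paper proposes a method for automatically classifying mobile apps. The proposed method classifies apps effectively using a deep-learning-based document embedding method. The paper also uses the proposed method to analyze an actual app market in terms of degree of monopoly, saturation, and popularity ranking.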

EmXJ : A Framework of Configurable XML Processor for Flexible Embedding (EmXJ : 유연한 임베딩을 위한 XML 처리기 구성 프레임워크)

  • Chung, Won-Ho;Kang, Mi-Yeon
    • The KIPS Transactions:PartA / v.9A no.4 / pp.467-478 / 2002
  • With the rapid development of the wired and wireless Internet, various kinds of resource-constrained mobile devices, such as cellular phones, PDAs, homepads, smart phones, and handheld PCs, have come into personal and commercial use. Most software embedded in these devices now needs to be flexible, in contrast to the fixedness that was an inherent property of embedded systems. In other words, recent technologies require flexible embedding into a variety of resource-constrained mobile devices. A document processor for XML, which has been positioned as a standard markup language for information representation on the Web, is one of the essential pieces of software to embed in these devices for browsing information. In this paper, a framework for a configurable XML processor called EmXJ is designed and implemented for flexible embedding into various types of resource-constrained mobile devices, and its advantages are compared with conventional XML processors.

OLE File Analysis and Malware Detection using Machine Learning

  • Choi, Hyeong Kyu;Kang, Ah Reum
    • Journal of the Korea Society of Computer and Information / v.27 no.5 / pp.149-156 / 2022
  • Recently, there have been many reports of document-type malware in which malicious code is injected into Microsoft Office files. Such malware is often hidden by encoding the malicious code within the document, so it can easily bypass anti-virus programs. We found that malicious code was inserted into Visual Basic for Applications (VBA) macros, a feature supported by Microsoft Office. We identified malicious code such as shellcode that runs external programs and URL-related code that downloads files from external URLs. We selected 354 keywords that appear repeatedly in malicious Microsoft Office files and defined the number of times each keyword appears in the body of the document as a feature. We then trained SVM, naïve Bayes, logistic regression, and random forest classifiers, which achieved accuracies of 0.994, 0.659, 0.995, and 0.998, respectively.
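
The feature definition in this abstract (one feature per keyword, equal to the number of times that keyword appears in the document body) can be sketched as below. The keyword list and the macro text are hypothetical placeholders, not the paper's 354 keywords, and case-insensitive substring matching is an assumption about how occurrences are counted.

```python
import re


def keyword_count_features(text, keywords):
    """Return one count per keyword: how often it occurs in the text.

    Matching is case-insensitive and substring-based (an assumption).
    """
    lowered = text.lower()
    return [len(re.findall(re.escape(k.lower()), lowered)) for k in keywords]


# Hypothetical keyword list for illustration only.
KEYWORDS = ["Shell", "CreateObject", "http", "powershell"]

# Hypothetical, inert macro text used only as input to the counter.
macro_text = (
    "Sub AutoOpen()\n"
    '  Set o = CreateObject("WScript.Shell")\n'
    '  o.Run "http://example.invalid/a.exe"\n'
    "End Sub"
)

features = keyword_count_features(macro_text, KEYWORDS)
print(dict(zip(KEYWORDS, features)))
```

Each document becomes a fixed-length vector of counts, which is the kind of input that classifiers such as SVM or random forest can consume directly.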

Impact of Word Embedding Methods on Performance of Sentiment Analysis with Machine Learning Techniques

  • Park, Hoyeon;Kim, Kyoung-jae
    • Journal of the Korea Society of Computer and Information / v.25 no.8 / pp.181-188 / 2020
  • In this study, we conduct a comparative study to examine the impact of various word embedding techniques on the performance of sentiment analysis. Sentiment analysis is an opinion mining technique that uses natural language processing to identify and extract subjective information from text, and it can be used to classify the sentiment of product reviews or comments. Since sentiment can be classified as either positive or negative, it can be treated as a general classification problem. For sentiment analysis, the text must be converted into a form that a computer can process. Therefore, text such as a word or document is transformed into a vector, a step called word embedding in natural language processing. Various techniques, such as Bag of Words, TF-IDF, and Word2Vec, are used for word embedding. Until now, there have not been many studies on which word embedding techniques are suitable for sentiment analysis. In this study, Bag of Words, TF-IDF, and Word2Vec are used to compare and analyze the performance of movie review sentiment analysis. The data set for this study is the IMDB data set, which is widely used in text mining. The results show that TF-IDF and Bag of Words outperformed Word2Vec, and that TF-IDF performed better than Bag of Words, although the difference was not very significant.
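
To make the difference between two of the compared representations concrete, the sketch below computes Bag of Words counts and TF-IDF weights for a toy corpus. It is an illustration only: the reviews are hypothetical, and the weighting used here (raw term count times `log(N / df)`, with no smoothing or normalization) is one common textbook variant, assumed for simplicity; libraries often use different formulas.

```python
import math
from collections import Counter


def bag_of_words(docs):
    """Return (vocabulary, count matrix) with one row of term counts per document."""
    vocab = sorted({w for d in docs for w in d.split()})
    counts = [Counter(d.split()) for d in docs]
    return vocab, [[c[w] for w in vocab] for c in counts]


def tf_idf(docs):
    """Return (vocabulary, weight matrix) using tf * log(N / df), unsmoothed."""
    vocab, bow = bag_of_words(docs)
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = [sum(1 for row in bow if row[j] > 0) for j in range(len(vocab))]
    idf = [math.log(n_docs / df_j) for df_j in df]
    return vocab, [[row[j] * idf[j] for j in range(len(vocab))] for row in bow]


# Hypothetical toy reviews.
reviews = ["great movie great cast", "boring movie"]
vocab, bow = bag_of_words(reviews)
_, weighted = tf_idf(reviews)
print(vocab)
print(bow)
print(weighted)
```

A term that occurs in every document (here "movie") gets an IDF of zero and so contributes nothing under TF-IDF, whereas Bag of Words still counts it.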

A Study of Research on Methods of Automated Biomedical Document Classification using Topic Modeling and Deep Learning (토픽모델링과 딥 러닝을 활용한 생의학 문헌 자동 분류 기법 연구)

  • Yuk, JeeHee;Song, Min
    • Journal of the Korean Society for Information Management / v.35 no.2 / pp.63-88 / 2018
  • This research evaluated differences in classification performance across feature selection methods (the LDA topic model and Doc2Vec, which is based on word embedding using deep learning), feature corpus sizes, and classification algorithms. In addition, to find the feature corpus with the highest classification performance, experiments were conducted with feature corpora composed differently according to location within the document and with adjusted feature corpus sizes. Finally, the deep learning experiments evaluated training frequency and specifically considered the information used for context inference. This study constructed a biomedical document dataset, Disease-35083, which consists of biomedical scholarly documents provided by PMC and categorized by disease category. The research verifies which type and size of feature corpus produces the highest performance and also suggests feature corpora that can be extended to specific features by demonstrating their efficiency in training time. Additionally, it compares deep learning with existing methods and suggests an appropriate method for each classification environment.

Categorization of Korean News Articles Based on Convolutional Neural Network Using Doc2Vec and Word2Vec (Doc2Vec과 Word2Vec을 활용한 Convolutional Neural Network 기반 한국어 신문 기사 분류)

  • Kim, Dowoo;Koo, Myoung-Wan
    • Journal of KIISE / v.44 no.7 / pp.742-747 / 2017
  • In this paper, we propose a novel approach that improves the performance of a Convolutional Neural Network (CNN) word embedding model built on word2vec by making it perform like doc2vec in a document classification task. The Word Piece Model (WPM) is empirically shown to outperform other tokenization methods, such as the phrase unit and a part-of-speech tagger (classification rate: 79.5%). Further, we conducted an experiment classifying Korean news articles into ten categories by feeding word and document vectors generated with WPM to the baseline and the proposed model. The proposed model showed a higher classification rate (89.88%) than its counterpart (86.89%), achieving a 22.80% improvement. This research demonstrates that applying doc2vec to document classification yields more effective results because doc2vec generates similar document vector representations for documents belonging to the same category.
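
The abstract does not say how the 22.80% improvement is computed from the two classification rates. The arithmetic below checks two plausible readings; treating the figure as a relative reduction in classification error is an assumption, not something stated in the source.

```python
# Classification rates reported in the abstract (percent).
baseline_acc = 86.89
proposed_acc = 89.88

# Reading 1 (assumed): relative reduction in error rate.
baseline_err = 100.0 - baseline_acc
proposed_err = 100.0 - proposed_acc
relative_error_reduction = (baseline_err - proposed_err) / baseline_err * 100.0

# Reading 2 (assumed): relative gain in the classification rate itself.
relative_acc_gain = (proposed_acc - baseline_acc) / baseline_acc * 100.0

print(f"relative error reduction: {relative_error_reduction:.2f}%")
print(f"relative accuracy gain:   {relative_acc_gain:.2f}%")
```

Under the first reading the result is roughly 22.8%, close to the reported figure, while the second reading gives only about 3.4%, so the reported improvement is more consistent with a relative error reduction.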

Multilayer Knowledge Representation of Customer's Opinion in Reviews (리뷰에서의 고객의견의 다층적 지식표현)

  • Vo, Anh-Dung;Nguyen, Quang-Phuoc;Ock, Cheol-Young
    • Annual Conference on Human and Language Technology / 2018.10a / pp.652-657 / 2018
  • With the rapid development of e-commerce, many customers can now express their opinions on various kinds of products in discussion groups, on merchant sites, on social networks, and elsewhere. Discerning a consensus opinion about a product sold online is difficult because more and more reviews are becoming available on the Internet. Opinion mining, also known as sentiment analysis, is the task of automatically detecting and understanding sentiment expressions about a product in customers' textual reviews. Recently, researchers have proposed various approaches to sentiment mining that apply several techniques at the document, sentence, and aspect levels. Aspect-based sentiment analysis is attracting wide interest from researchers; however, more complex algorithms are needed to address this issue precisely with larger corpora. This paper introduces a knowledge representation approach for the task of analyzing product aspect ratings. We focus on how to form sentiment representations from textual opinions by using representation learning methods, including word embedding and compositional vector models. Our experiment is performed on a dataset of reviews from the electronics domain, and the results show that the proposed system outperforms methods from previous studies.
