• Title/Summary/Keyword: VEC Model


A Comparative Study on Topic Modeling of LDA, Top2Vec, and BERTopic Models Using LIS Journals in WoS (LDA, Top2Vec, BERTopic 모형의 토픽모델링 비교 연구 - 국외 문헌정보학 분야를 중심으로 -)

  • Yong-Gu Lee;SeonWook Kim
    • Journal of the Korean Society for Library and Information Science
    • /
    • v.58 no.1
    • /
    • pp.5-30
    • /
    • 2024
  • The purpose of this study is to extract topics from experimental data using three topic modeling methods (LDA, Top2Vec, and BERTopic) and to compare the characteristics and differences between these models. The experimental data consist of 55,442 papers published in 85 academic journals in the field of library and information science, indexed in the Web of Science (WoS). The experimental process was as follows: the first topic modeling results were obtained using the default parameters for each model, and the second topic modeling results were obtained by setting the same optimal number of topics for each model. In the first stage, the LDA, Top2Vec, and BERTopic models generated significantly different numbers of topics (100, 350, and 550, respectively); the Top2Vec and BERTopic models divided the topics roughly three to five times more finely than the LDA model. There were also substantial differences among the models in the average and standard deviation of documents per topic: the LDA model assigned many documents to a relatively small number of topics, while the BERTopic model showed the opposite trend. In the second stage, with all models generating the same 25 topics, the Top2Vec model tended to assign more documents on average per topic and showed small deviations between topics, resulting in an even distribution across the 25 topics. When comparing the creation of similar topics between models, LDA and Top2Vec generated 18 similar topics (72%) out of 25, suggesting that the Top2Vec model is the more similar to the LDA model. For a more comprehensive comparative analysis, expert evaluation is needed to determine whether the documents assigned to each topic are thematically accurate.
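
The two-stage comparison described above can be approximated with the public APIs of the three libraries. The following is only a minimal sketch, not the paper's pipeline: the WoS corpus is not public, so a 20-newsgroups stand-in is used, and the tokenization and all parameters other than the 25-topic setting named in the abstract are assumptions.

```python
# Sketch of the two-stage comparison: default parameters first, then a fixed
# number of topics (25) for every model.
from sklearn.datasets import fetch_20newsgroups
from gensim import corpora
from gensim.models import LdaModel
from top2vec import Top2Vec
from bertopic import BERTopic

# stand-in corpus (the paper uses 55,442 WoS abstracts)
docs = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes")).data

# --- LDA (gensim) ---
tokens = [d.lower().split() for d in docs]                 # assumed tokenization
dictionary = corpora.Dictionary(tokens)
corpus = [dictionary.doc2bow(t) for t in tokens]
lda_default = LdaModel(corpus, id2word=dictionary)         # stage 1: defaults (100 topics)
lda_25 = LdaModel(corpus, id2word=dictionary, num_topics=25)  # stage 2: fixed 25

# --- Top2Vec ---
t2v = Top2Vec(docs, speed="learn", workers=4)              # stage 1: defaults
t2v.hierarchical_topic_reduction(num_topics=25)            # stage 2: reduce to 25

# --- BERTopic ---
bt_default = BERTopic().fit(docs)                          # stage 1: defaults
bt_25 = BERTopic(nr_topics=25).fit(docs)                   # stage 2: reduce to 25

print(lda_default.num_topics, t2v.get_num_topics(), len(bt_default.get_topics()))
```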

Study on the Forecasting and Relationship of Busan Cargo by ARIMA and VAR·VEC (ARIMA와 VAR·VEC 모형에 의한 부산항 물동량 예측과 관련성연구)

  • Lee, Sung-Yhun;Ahn, Ki-Myung
    • Journal of Navigation and Port Research
    • /
    • v.44 no.1
    • /
    • pp.44-52
    • /
    • 2020
  • More accurate forecasting of port cargo during the prolonged global recession is critical for the implementation of port policy. In this study, Busan Port container volume (import-export cargo and transshipment cargo) was estimated using the vector autoregressive (VAR) model and the vector error correction (VEC) model, which take into account the causal relationship with the economic scale (GDP) of Korea, China, and the U.S., as well as ARIMA, a univariate volume model. The data were the monthly container volumes at Busan Port from January 2014 to August 2019. According to the analysis, the import-export volume series was estimated with the VAR model because it was relatively stationary, while the transshipment cargo series was non-stationary but had a cointegration (long-term equilibrium) relationship with economic scale, interest rates, and economic fluctuations, and was therefore estimated with the VEC model. The estimation results show that ARIMA is superior for the stationary time series (local cargo), whereas trending transshipment cargo is better predicted by the multivariate VEC model. Import-export cargo, in particular, is closely related to the size of Korea's economy, while transshipment cargo is closely related to the size of the Chinese and U.S. economies. Since the size of China's economy appears more closely related than that of the U.S., the results also suggest a strategy for increasing transshipment cargo.
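
A minimal sketch of the ARIMA vs. VAR/VECM comparison described above, using statsmodels. The synthetic series, lag orders, and deterministic-term choice are assumptions; the abstract only specifies the monthly January 2014 - August 2019 sample and that the non-stationary transshipment series was modeled with a VECM because of cointegration.

```python
# Sketch: stationarity check, ARIMA benchmark for the stationary series, and a
# VECM for the cointegrated, non-stationary system (transshipment + GDP series).
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.vector_ar.vecm import VECM, select_coint_rank

# synthetic stand-in for the paper's monthly data, Jan 2014 - Aug 2019
rng = np.random.default_rng(0)
idx = pd.date_range("2014-01-01", "2019-08-01", freq="MS")
n = len(idx)
df = pd.DataFrame({
    "local_cargo": 100 + rng.normal(0, 5, n),                  # roughly stationary
    "transshipment": 80 + np.cumsum(rng.normal(0.5, 2, n)),    # trending
    "gdp_china": 50 + np.cumsum(rng.normal(0.4, 1, n)),
    "gdp_us": 60 + np.cumsum(rng.normal(0.3, 1, n)),
}, index=idx)

# 1) ADF test: the stationary import-export series is handled by ARIMA / VAR
print("ADF p-value:", adfuller(df["local_cargo"])[1])

# 2) univariate ARIMA benchmark (order is an assumption)
arima_fit = ARIMA(df["local_cargo"], order=(1, 0, 1)).fit()
print(arima_fit.forecast(steps=12))

# 3) VECM for the cointegrated system
endog = df[["transshipment", "gdp_china", "gdp_us"]]
rank = select_coint_rank(endog, det_order=0, k_ar_diff=2, method="trace")
vecm_fit = VECM(endog, k_ar_diff=2, coint_rank=max(rank.rank, 1),
                deterministic="ci").fit()
print(vecm_fit.predict(steps=12))          # 12-month-ahead forecast
```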

Expansion of Topic Modeling with Word2Vec and Case Analysis (Word2Vec를 이용한 토픽모델링의 확장 및 분석사례)

  • Yoon, Sang Hun;Kim, Keun Hyung
    • The Journal of Information Systems
    • /
    • v.30 no.1
    • /
    • pp.45-64
    • /
    • 2021
  • Purpose The traditional topic modeling technique makes it difficult to distinguish the meaning of topics because the key words assigned to one topic are often also assigned to other topics. This problem can become severe when the number of online reviews is small. In this paper, an extended topic modeling technique that can be used for analyzing a small amount of online reviews is proposed. Design/methodology/approach The extended model proposed in this paper combines the traditional topic modeling technique with the Word2Vec technique. The extended model not only allocates main words to the extracted topics but also generates discriminatory words between topics. In particular, the Word2Vec technique is applied in the process of extracting semantically related words for each discriminatory word. In the extended model, the main words and the discriminatory words, together with their semantically similar words, are used in the semantic classification and naming of the extracted topics, so that the classification and naming of topics can be performed more clearly. For a case study, online reviews about Udo on the TripAdvisor website were analyzed by applying both the traditional topic modeling technique and the proposed extended model, and the two approaches were compared in the process of semantically classifying and naming the extracted topics. Findings Since the extended model utilizes additional information on top of the existing topic modeling output, it is confirmed to be more effective than existing topic modeling in distinguishing the meanings of topics and in assigning topic names.
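
A rough illustration of the combination described above: standard LDA for topic extraction, plus Word2Vec to pull in words semantically related to each topic's discriminatory words (here taken as top words that occur in only one topic's list). This is a sketch under assumed parameters; the paper's actual preprocessing and settings are not given in the abstract.

```python
# Sketch: LDA topics + Word2Vec expansion of topic-discriminating words.
from collections import Counter
from gensim import corpora
from gensim.models import LdaModel, Word2Vec

reviews = [["beautiful", "island", "bike", "tour"],
           ["ferry", "crowded", "beach", "peanut", "ice", "cream"]]  # toy tokens

dictionary = corpora.Dictionary(reviews)
corpus = [dictionary.doc2bow(r) for r in reviews]
lda = LdaModel(corpus, id2word=dictionary, num_topics=2, passes=10)

# main (top) words per topic
top_words = {t: [w for w, _ in lda.show_topic(t, topn=5)] for t in range(2)}

# discriminatory words = top words appearing in exactly one topic's list
counts = Counter(w for ws in top_words.values() for w in ws)
discrim = {t: [w for w in ws if counts[w] == 1] for t, ws in top_words.items()}

# Word2Vec trained on the same reviews supplies semantically related words
w2v = Word2Vec(reviews, vector_size=50, window=3, min_count=1, sg=1)
for t, words in discrim.items():
    for w in words:
        print(t, w, w2v.wv.most_similar(w, topn=3))
```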

Development of the Machine Learning-based Employment Prediction Model for Internship Applicants (인턴십 지원자를 위한 기계학습기반 취업예측 모델 개발)

  • Kim, Hyun Soo;Kim, Sunho;Kim, Do Hyun
    • Journal of the Semiconductor & Display Technology
    • /
    • v.21 no.2
    • /
    • pp.138-143
    • /
    • 2022
  • The employment prediction model proposed in this paper uses 16 independent variables, including the self-introduction essays of M University students who applied for IPP and work-study internships, and a dependent variable with three classes: employment at a large company, employment at a mid-sized company, and unemployment. The employment prediction model for large companies was developed using Random Forest and Word2Vec, achieving a weighted F1 score of 82.4%. The employment prediction model for mid-sized companies and above was developed using logistic regression and Word2Vec, achieving a weighted F1 score of 73.24%. These two models can be actively used in the future to predict employment at large and mid-sized companies for M University students.
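
A minimal sketch of the kind of pipeline the abstract describes: self-introduction text embedded by averaging Word2Vec word vectors, concatenated with other applicant features, fed to a Random Forest, and scored with a weighted F1. The toy data, feature names, and hyperparameters are illustrative assumptions, not the paper's settings.

```python
# Sketch: Word2Vec document vectors + Random Forest, evaluated with weighted F1.
import numpy as np
from gensim.models import Word2Vec
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

essays = [["motivated", "embedded", "project"], ["team", "intern", "design"],
          ["research", "semiconductor", "lab"], ["marketing", "club", "leader"]]
other_feats = np.array([[3.8, 2], [3.2, 0], [3.9, 3], [3.0, 1]])  # e.g. GPA, certificates (assumed)
labels = np.array([1, 0, 1, 0])                     # 1 = employed at large company (toy)

w2v = Word2Vec(essays, vector_size=100, min_count=1, sg=1)
def doc_vec(tokens):
    vecs = [w2v.wv[t] for t in tokens if t in w2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)

X = np.hstack([np.vstack([doc_vec(e) for e in essays]), other_feats])
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.5,
                                          random_state=0, stratify=labels)

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("weighted F1:", f1_score(y_te, clf.predict(X_te), average="weighted"))
```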

Modeling of Convolutional Neural Network-based Recommendation System

  • Kim, Tae-Yeun
    • Journal of Integrative Natural Science
    • /
    • v.14 no.4
    • /
    • pp.183-188
    • /
    • 2021
  • Collaborative filtering is one of the methods commonly used in web recommendation systems, and numerous studies have proposed measures for enhancing its accuracy. This study suggests a movie recommendation system that applies Word2Vec and ensemble convolutional neural networks. First, user sentences and movie sentences are constructed from the user, movie, and rating information. These sentences are then fed into Word2Vec to obtain a user vector and a movie vector. The user vector is input to a user convolutional model and the movie vector to a movie convolutional model, and the two convolutional models are connected to a fully connected neural network whose output layer produces the rating forecast for a user-movie pair. The test results showed that the proposed system achieved higher accuracy than a conventional collaborative filtering system and than the Word2Vec and deep neural network-based systems suggested in similar research. The proposed Word2Vec and convolutional neural network-based recommendation system is expected to help enhance user satisfaction by taking user characteristics into account.
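
A compact sketch of the architecture outlined above, in Keras: the Word2Vec user and movie vectors are treated as 1-D signals, passed through separate convolutional branches, concatenated, and regressed onto the rating. The kernel sizes, filter counts, vector size, and random stand-in inputs are assumptions.

```python
# Sketch: dual Conv1D branches over Word2Vec user/movie vectors -> rating.
import numpy as np
from tensorflow.keras import layers, Model

VEC = 100  # Word2Vec dimensionality (assumed)

def conv_branch(name):
    inp = layers.Input(shape=(VEC,), name=name)
    x = layers.Reshape((VEC, 1))(inp)
    x = layers.Conv1D(32, 5, activation="relu")(x)
    x = layers.GlobalMaxPooling1D()(x)
    return inp, x

user_in, user_feat = conv_branch("user_vector")
movie_in, movie_feat = conv_branch("movie_vector")
merged = layers.Concatenate()([user_feat, movie_feat])
hidden = layers.Dense(64, activation="relu")(merged)      # fully connected part
rating = layers.Dense(1, name="rating")(hidden)           # predicted rating

model = Model([user_in, movie_in], rating)
model.compile(optimizer="adam", loss="mse")

# toy forward pass with random stand-ins for the Word2Vec vectors
u = np.random.rand(4, VEC).astype("float32")
m = np.random.rand(4, VEC).astype("float32")
r = np.array([3.0, 4.5, 2.0, 5.0], dtype="float32")
model.fit([u, m], r, epochs=1, verbose=0)
```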

Semantic Extention Search for Documents Using the Word2vec (Word2vec을 활용한 문서의 의미 확장 검색방법)

  • Kim, Woo-ju;Kim, Dong-he;Jang, Hee-won
    • The Journal of the Korea Contents Association
    • /
    • v.16 no.10
    • /
    • pp.687-692
    • /
    • 2016
  • The conventional way to search documents is keyword-based querying using a vector space model such as tf-idf. Keyword-based document search has a problem: it cannot recognize words that are lexically different but semantically the same. This paper studies a scheme of document search based on document queries. In particular, it uses centrality vectors, instead of tf-idf vectors, to represent query documents, combined with the Word2vec method to capture the semantic similarity of the contained words. This scheme improves the performance of document search and provides a way to find documents that are not only lexically but also semantically close to a query document.
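
The abstract does not spell out how the centrality vectors are built, so the following is only a generic stand-in: query and candidate documents are represented through Word2vec word vectors (here a simple average) and ranked by cosine similarity, which illustrates how "lexically different but semantically same" matches can be recovered.

```python
# Sketch: document-as-query search via Word2vec-based document vectors.
import numpy as np
from gensim.models import Word2Vec

docs = [["car", "engine", "repair"], ["automobile", "motor", "maintenance"],
        ["recipe", "cake", "sugar"]]
w2v = Word2Vec(docs, vector_size=50, min_count=1, sg=1)

def doc_vec(tokens):
    vecs = [w2v.wv[t] for t in tokens if t in w2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

query = ["vehicle", "engine", "fix"]            # the query is a document, not keywords
qv = doc_vec(query)
ranking = sorted(range(len(docs)), key=lambda i: cosine(qv, doc_vec(docs[i])),
                 reverse=True)
print(ranking)   # semantically close documents rank first
```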

On Characteristics of Word Embeddings by the Word2vec Model (Word2vec 모델의 단어 임베딩 특성 연구)

  • Kang, Hyungsuc;Yang, Janghoon
    • Proceedings of the Korea Information Processing Society Conference
    • /
    • 2019.05a
    • /
    • pp.263-266
    • /
    • 2019
  • Among word embedding models, the currently widely used word2vec model is known to reflect semantic similarity in language well. This paper aims to verify how well word vectors trained with the word2vec model actually reflect semantic similarity: that is, whether words of similar categories are embedded close together in the vector space, and whether words of clearly distinct categories are embedded in clearly separated regions. Verification using a simple clustering algorithm showed that, contrary to common-sense linguistic knowledge, words of certain categories were not clearly separated in the embedded vector space. In conclusion, the similarity of word vectors does not always indicate the semantic similarity of the corresponding words, and future research applying the results of the word2vec model should take this limitation into account.
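
The kind of check described above can be reproduced with a simple clustering pass over trained word vectors: pick words from a few human-labeled categories, cluster their vectors, and measure how well the clusters align with the categories. The word lists, training corpus, and k-means settings below are assumptions.

```python
# Sketch: do category-labeled words form separable clusters in embedding space?
import numpy as np
from gensim.models import Word2Vec
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

sentences = [["dog", "cat", "horse", "run"], ["apple", "pear", "grape", "eat"],
             ["red", "blue", "green", "paint"]] * 50     # toy training corpus
w2v = Word2Vec(sentences, vector_size=50, min_count=1, sg=1, epochs=20)

categories = {"animal": ["dog", "cat", "horse"],
              "fruit": ["apple", "pear", "grape"],
              "color": ["red", "blue", "green"]}
words = [w for ws in categories.values() for w in ws]
labels = [i for i, ws in enumerate(categories.values()) for _ in ws]

X = np.vstack([w2v.wv[w] for w in words])
pred = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print("agreement with human categories:", adjusted_rand_score(labels, pred))
```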

DroidVecDeep: Android Malware Detection Based on Word2Vec and Deep Belief Network

  • Chen, Tieming;Mao, Qingyu;Lv, Mingqi;Cheng, Hongbing;Li, Yinglong
    • KSII Transactions on Internet and Information Systems (TIIS)
    • /
    • v.13 no.4
    • /
    • pp.2180-2197
    • /
    • 2019
  • With the proliferation of malicious Android applications, malware has become more capable of hiding or confusing its malicious intent through code obfuscation, which has significantly weakened the effectiveness of conventional defense mechanisms. Therefore, in order to effectively detect unknown malicious applications on the Android platform, we propose DroidVecDeep, an Android malware detection method using a deep learning technique. First, we extract various features and rank them using Mean Decrease Impurity. Second, we transform the features into compact vectors based on word2vec. Finally, we train the classifier based on a deep learning model. A comprehensive experimental study on a real sample collection was performed to compare various malware detection approaches. Experimental results demonstrate that the proposed method outperforms other Android malware detection techniques.
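
A rough sketch of the three steps named above. Mean Decrease Impurity ranking is taken from a random forest's impurity-based feature importances; the deep classifier here is a plain MLP stand-in rather than the deep belief network used in the paper, and all feature names and labels are illustrative.

```python
# Sketch: MDI feature ranking -> word2vec feature embedding -> deep classifier.
import numpy as np
from gensim.models import Word2Vec
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

# each app is a "document" of extracted feature tokens (APIs, permissions, ...)
apps = [["SEND_SMS", "getDeviceId", "READ_CONTACTS"],
        ["INTERNET", "onCreate", "setContentView"]] * 20
labels = np.array([1, 0] * 20)                        # 1 = malware (toy labels)

# 1) rank binary presence features with Mean Decrease Impurity
vocab = sorted({t for a in apps for t in a})
X_bin = np.array([[int(t in a) for t in vocab] for a in apps])
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_bin, labels)
top = [vocab[i] for i in np.argsort(rf.feature_importances_)[::-1][:4]]

# 2) embed the top-ranked features with word2vec and average per app
w2v = Word2Vec(apps, vector_size=32, min_count=1, sg=1)
def app_vec(tokens):
    vecs = [w2v.wv[t] for t in tokens if t in top]
    return np.mean(vecs, axis=0) if vecs else np.zeros(32)
X_emb = np.vstack([app_vec(a) for a in apps])

# 3) deep classifier (MLP stand-in for the paper's deep belief network)
clf = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500).fit(X_emb, labels)
print(clf.score(X_emb, labels))
```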

Performance Comparison of Automatic Classification Using Word Embeddings of Book Titles (단행본 서명의 단어 임베딩에 따른 자동분류의 성능 비교)

  • Yong-Gu Lee
    • Journal of the Korean Society for Information Management
    • /
    • v.40 no.4
    • /
    • pp.307-327
    • /
    • 2023
  • To analyze the impact of word embeddings on book titles, this study utilized word embedding models (Word2vec, GloVe, fastText) to generate embedding vectors from book titles. These vectors were then used as classification features for automatic classification. The classifier used the k-nearest neighbors (kNN) algorithm, with the categories for automatic classification based on the DDC (Dewey Decimal Classification) main class 300 assigned by libraries to the books. In the automatic classification experiment applying word embeddings to book titles, the Skip-gram architectures of Word2vec and fastText produced better kNN classification performance than TF-IDF features. In the optimization of various hyperparameters across the three models, the Skip-gram architecture of the fastText model demonstrated good performance overall; in particular, better performance was observed with hierarchical softmax and larger embedding dimensions. From a performance perspective, fastText can generate embeddings for substrings or subwords using the n-gram method, which was shown to increase recall. The Skip-gram architecture of the Word2vec model generally showed good performance at low dimensions (size 300) and with small negative-sampling sizes (3 or 5).
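
A minimal sketch of the title-classification setup described above: fastText skip-gram embeddings with hierarchical softmax, averaged over the tokens of a title, classified with kNN. The tokenization, toy titles, DDC labels, and all hyperparameter values other than those named in the abstract are assumptions.

```python
# Sketch: fastText (skip-gram, hierarchical softmax) title vectors + kNN classifier.
import numpy as np
from gensim.models import FastText
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

titles = [["introduction", "to", "sociology"], ["public", "economics", "primer"],
          ["urban", "sociology", "reader"], ["principles", "of", "economics"]]
ddc = ["301", "330", "301", "330"]                    # DDC 300-class labels (toy)

ft = FastText(titles, vector_size=300, sg=1, hs=1, min_count=1)  # sg + hs per abstract

def title_vec(tokens):
    return np.mean([ft.wv[t] for t in tokens], axis=0)  # subword n-grams cover OOV words

X = np.vstack([title_vec(t) for t in titles])
X_tr, X_te, y_tr, y_te = train_test_split(X, ddc, test_size=0.5,
                                          random_state=0, stratify=ddc)
knn = KNeighborsClassifier(n_neighbors=1).fit(X_tr, y_tr)
print(knn.score(X_te, y_te))
```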

A Word Semantic Similarity Measure Model using Korean Open Dictionary (우리말샘 사전을 이용한 단어 의미 유사도 측정 모델 개발)

  • Kim, Hoyong;Lee, Min-Ho;Seo, Dongmin
    • Proceedings of the Korea Contents Association Conference
    • /
    • 2018.05a
    • /
    • pp.3-4
    • /
    • 2018
  • Measuring word semantic similarity is of great help in solving natural language processing problems such as information retrieval and document classification. Previous studies have used word hierarchies to address this similarity measurement problem, but they do not take word meanings into account and therefore show unsatisfactory results. In this paper, we derived a hierarchy of Korean words based on the Urimalsaem dictionary, which adds 500,000 entries to the Standard Korean Language Dictionary published by the National Institute of Korean Language. We then trained a word2vec model on word usage examples to capture the contextual meaning of words, and a sent2vec model on word definitions to capture their dictionary meaning. Using the constructed hierarchy and the trained word2vec and sent2vec models, we proposed a model for measuring the semantic similarity of Korean words. Finally, a performance evaluation demonstrated that the proposed model outperforms existing models.
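
A sketch of how the three signals described above might be combined. The hierarchy similarity is a simple depth-based path score over an assumed parent map, the definition embedding uses an average of word2vec vectors as a stand-in for the paper's sent2vec model, and the combination weights are arbitrary.

```python
# Sketch: combine hierarchy-based, contextual (word2vec), and definition-based
# similarity into one word-similarity score. All data and weights are toy values.
import numpy as np
from gensim.models import Word2Vec

# assumed child -> parent map (stand-in for the Urimalsaem-derived hierarchy)
parent = {"사과": "과일", "배": "과일", "과일": "식품", "버스": "차량", "차량": "기계"}

def ancestors(w):
    chain = []
    while w in parent:
        w = parent[w]
        chain.append(w)
    return chain

def depth(w):
    return len(ancestors(w))

def hier_sim(a, b):
    # Wu-Palmer-style score from the deepest shared ancestor
    shared = [x for x in ancestors(b) if x in ancestors(a)]
    lca = max((depth(x) for x in shared), default=0)
    return 2 * lca / (depth(a) + depth(b) + 1e-12)

usage = [["사과", "배", "과일", "먹다"], ["버스", "차량", "타다"]]  # toy usage examples
w2v = Word2Vec(usage, vector_size=50, min_count=1, sg=1)

defs = {"사과": ["배", "과일"], "배": ["과일", "열매"]}            # toy definition tokens
def def_vec(w):
    vecs = [w2v.wv[t] for t in defs.get(w, []) if t in w2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(50)

def cos(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def similarity(a, b, w=(0.4, 0.3, 0.3)):   # arbitrary weights
    return (w[0] * hier_sim(a, b) + w[1] * float(w2v.wv.similarity(a, b))
            + w[2] * cos(def_vec(a), def_vec(b)))

print(similarity("사과", "배"))
```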
