• Title/Summary/Keyword: doc2vec

Proposing a New Approach for Detecting Malware Based on the Event Analysis Technique

  • Vu Ngoc Son
    • International Journal of Computer Science & Network Security / v.23 no.12 / pp.107-114 / 2023
  • Malware distribution is a dangerous attack technique that is difficult to detect and prevent. Current malware detection studies and proposals are generally based on two main methods: signature sets, and analysis of abnormal behavior using machine learning or deep learning techniques. This paper proposes a method for detecting malware on endpoints based on Event IDs using deep learning. Event IDs are malware behaviors tracked and collected in the kernel of the endpoint's operating system. Malware detection based on Event IDs is a new research approach that has received little study. To this end, the paper proposes combining different data mining methods and deep learning algorithms. The data mining process is presented in detail in Section 2 of the paper.

A Method for Measuring Similarity Measure of Thesaurus Transformation Documents using DBSCAN (DBSCAN을 활용한 유의어 변환 문서 유사도 측정 방법)

  • Kim, Byeongsik;Shin, Juhyun
    • Journal of Korea Multimedia Society / v.21 no.9 / pp.1035-1043 / 2018
  • There are cases in which the core content of another person's work is presented as the writer's own thinking, with the wording changed and the source not cited. The free CopyKiller service used for plagiarism checks tests for plagiarism by looking for matches of six or more words. However, a six-word match is not a sufficient criterion when words have been replaced with similar words. Therefore, in this paper, we construct word clusters using the DBSCAN algorithm, find synonyms, convert the words in each cluster into a representative synonym, and construct L-R tables through L-R parsing. We then propose a method for determining the similarity of documents by applying thesaurus weights and weights for each paragraph of the thesis.
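
The clustering and synonym-replacement step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the 2-D word vectors, the `eps` and `min_pts` values, and the rule that the first word seen in a cluster becomes its representative are all assumptions made for the example. The L-R parsing and weighting steps are not shown.

```python
import math

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN. Returns one label per point; -1 marks noise."""
    labels = [None] * len(points)

    def neighbors(i):
        # All points within eps of point i (including i itself).
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1           # noise for now; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # former noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:  # j is a core point: keep expanding
                queue.extend(nbrs)
    return labels

# Hypothetical 2-D word vectors; a real system would use learned embeddings.
words = ["big", "large", "huge", "small", "tiny", "banana"]
vectors = [(1.0, 1.0), (1.1, 0.9), (0.9, 1.1), (5.0, 5.0), (5.1, 4.9), (9.0, 0.0)]

labels = dbscan(vectors, eps=0.5, min_pts=2)

# Assumption: the first word seen in each cluster is its representative synonym.
representative = {}
mapping = {}
for word, label in zip(words, labels):
    if label == -1:
        mapping[word] = word         # noise words are left unchanged
    else:
        mapping[word] = representative.setdefault(label, word)

def normalize(text):
    """Replace every clustered word with its cluster's representative."""
    return " ".join(mapping.get(tok, tok) for tok in text.split())

normalized = normalize("a huge banana and a tiny apple")
print(labels)
print(normalized)
```

After this normalization, two documents that differ only in their choice of synonyms produce the same word sequence, so a word-match comparison can be applied to the normalized text.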

Evaluation of Similarity Analysis of Newspaper Article Using Natural Language Processing

  • Ayako Ohshiro;Takeo Okazaki;Takashi Kano;Shinichiro Ueda
    • International Journal of Computer Science & Network Security / v.24 no.6 / pp.1-7 / 2024
  • Comparing text features involves evaluating the "similarity" between texts, so it is crucial to use appropriate similarity measures. This study used various techniques to assess the similarity between newspaper articles, including deep learning and a previously proposed method that combines Pointwise Mutual Information (PMI) and Word Pair Matching (WPM), denoted PMI+WPM. For performance comparison, law data from medical research in Japan were used as validation data in evaluating the PMI+WPM method. The comparative analysis revealed that the distribution of similarities in text data varies with the evaluation technique and genre. For newspaper data, non-deep learning methods showed better similarity evaluation accuracy than deep learning methods. Additionally, evaluating similarities in law data is more challenging than in newspaper articles. Although deep learning is the prevalent method for evaluating textual similarity, this study demonstrates that non-deep learning methods can be effective for Japanese-language texts.
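
As background for the PMI component, the sketch below computes pointwise mutual information, PMI(a, b) = log2(p(a, b) / (p(a) p(b))), for word pairs. It assumes probabilities are estimated from document-level co-occurrence in a tiny hypothetical corpus; the abstract does not say how the authors estimate them, and the WPM step is not defined there, so it is not shown.

```python
import math
from collections import Counter
from itertools import combinations

def pmi_table(documents):
    """PMI for word pairs, with probabilities estimated from
    document-level co-occurrence (an assumption of this sketch)."""
    n_docs = len(documents)
    word_counts = Counter()
    pair_counts = Counter()
    for doc in documents:
        vocab = sorted(set(doc.split()))            # each word counted once per document
        word_counts.update(vocab)
        pair_counts.update(combinations(vocab, 2))  # pairs stored in alphabetical order
    table = {}
    for (a, b), n_ab in pair_counts.items():
        p_ab = n_ab / n_docs
        p_a = word_counts[a] / n_docs
        p_b = word_counts[b] / n_docs
        table[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return table

# Hypothetical toy corpus, for illustration only.
docs = [
    "court ruling appeal",
    "court ruling",
    "weather rain",
    "weather rain court",
]
pmi = pmi_table(docs)
for pair in [("rain", "weather"), ("court", "ruling"), ("court", "rain")]:
    print(pair, round(pmi[pair], 3))
```

A positive value means the two words co-occur more often than independence would predict, and a negative value means less often. Pairs that never co-occur are simply absent from the table.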

An Intelligence Support System Research on KTX Rolling Stock Failure Using Case-based Reasoning and Text Mining (사례기반추론과 텍스트마이닝 기법을 활용한 KTX 차량고장 지능형 조치지원시스템 연구)

  • Lee, Hyung Il;Kim, Jong Woo
    • Journal of Intelligence and Information Systems / v.26 no.1 / pp.47-73 / 2020
  • KTX rolling stock is a system consisting of many machines, electrical devices, and components. Maintaining it requires considerable expertise and experience from maintenance workers. When a rolling stock failure occurs, the maintainer's knowledge and experience determine how quickly and how well the problem is solved, and therefore the resulting availability of the vehicle. Although problem solving is generally based on fault manuals, experienced and skilled professionals can diagnose faults and take action quickly by applying personal know-how. Because this knowledge is tacit, it is difficult to pass on completely to a successor, and earlier studies have developed case-based rolling stock expert systems to make it data-driven. Nonetheless, research is still lacking on KTX rolling stock, the most commonly used on the main line, and on systems that extract the meaning of text and search for similar cases. Therefore, this study proposes an intelligent support system that provides an action guide for new failures by using the know-how of rolling stock maintenance experts as problem-solving cases. For this purpose, a case base was constructed from rolling stock failure data generated from 2015 to 2017, and an integrated dictionary was built separately from the case base to cover the essential terminology and failure codes specific to the railway rolling stock sector. Using the deployed case base, each new failure is matched against past cases, and the three most similar failure cases are extracted so that their actual actions can be proposed as a diagnostic guide.
Earlier case-based expert systems for rolling stock failures searched for cases by keyword matching. To compensate for the limitations of that approach, this study applied various dimensionality reduction techniques to calculate similarity in a way that takes the semantic relationships among failure details into account, and verified their usefulness through experiments. Similar cases were retrieved by applying three algorithms, Non-negative Matrix Factorization (NMF), Latent Semantic Analysis (LSA), and Doc2Vec, to extract the characteristics of each failure and by measuring the cosine distance between the resulting vectors. Precision, recall, and F-measure were used to assess the performance of the proposed actions. To compare the dimensionality reduction techniques, they were evaluated alongside two baselines: an algorithm that randomly extracts failure cases with identical failure codes, and an algorithm that applies cosine similarity directly to words. Analysis of variance confirmed that the performance differences among the five algorithms were statistically significant. In addition, optimal techniques for practical application were derived by verifying how performance differs with the number of dimensions used for dimensionality reduction. The analysis showed that direct word-based cosine similarity outperformed dimensionality reduction with NMF and LSA, and that the algorithm using Doc2Vec performed best. Furthermore, for the dimensionality reduction techniques, performance improved as the number of dimensions increased, up to an appropriate level.
Through this study, we confirmed the usefulness of effective methods for extracting the characteristics of data and converting unstructured data when applying case-based reasoning in the specialized field of KTX rolling stock, where most attributes are recorded as text. Text mining is being studied for use in many areas, but studies using text data are still lacking in environments with many specialized terms and limited access to data, such as the one addressed in this study. In this regard, the study is significant as the first to present an intelligent diagnostic system that suggests actions by retrieving cases with text mining techniques that extract the characteristics of a failure, complementing keyword-based case searches. It is expected to serve as a basic study for developing diagnostic systems that can be used immediately on site.
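
The retrieval and evaluation loop described in this abstract can be sketched as follows. The sketch covers only the word-based cosine similarity baseline, not the NMF, LSA, or Doc2Vec variants, and the case texts, case IDs, query, and set of relevant cases are all hypothetical.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def top_k_cases(query, case_base, k=3):
    """Return the IDs of the k past cases most similar to the query."""
    q = Counter(query.split())
    scored = [(cosine(q, Counter(text.split())), case_id)
              for case_id, text in case_base.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # best score first, ties by ID
    return [case_id for _, case_id in scored[:k]]

def precision_recall_f1(retrieved, relevant):
    """Precision, recall, and F-measure of a retrieved list against a relevant set."""
    hits = len(set(retrieved) & set(relevant))
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Hypothetical failure descriptions; real cases would come from the case base.
case_base = {
    "C1": "brake pressure low alarm",
    "C2": "door sensor fault",
    "C3": "brake pressure sensor fault",
    "C4": "traction motor overheat",
    "C5": "brake alarm reset",
}

top3 = top_k_cases("brake pressure alarm", case_base, k=3)
# Assumption: C1 and C3 are the cases whose recorded action fits the new failure.
precision, recall, f1 = precision_recall_f1(top3, {"C1", "C3"})
print(top3, round(precision, 3), round(recall, 3), round(f1, 3))
```

Replacing the `Counter` term vectors with dense vectors from a dimensionality reduction model would leave the ranking and evaluation steps unchanged; only the vector construction differs.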

Card Transaction Data-based Deep Tourism Recommendation Study (카드 데이터 기반 심층 관광 추천 연구)

  • Hong, Minsung;Kim, Taekyung;Chung, Namho
    • Knowledge Management Research / v.23 no.2 / pp.277-299 / 2022
  • The massive card transaction data generated in the tourism industry has become an important resource that reflects tourist consumption behaviors and patterns. Developing a smart service system based on this transaction data has become one of the major goals of both tourism businesses and knowledge management system developer communities. However, the lack of rating scores, which are the basis of traditional recommendation techniques, makes it hard for system designers to evaluate a learning process. In addition, auxiliary factors such as temporal, spatial, and demographic information are needed to increase the performance of a recommendation system, but gathering them is not easy in the card transaction context. In this paper, we introduce CTDDTR, a novel approach that uses card transaction data to recommend tourism services. It consists of two main components: i) Temporal preference Embedding (TE) represents tourist groups and services as vectors through Doc2Vec, and ii) Deep tourism Recommendation (DR) integrates the vectors and the auxiliary factors from a tourism RDF (resource description framework) through an MLP (multi-layer perceptron) to provide services to tourist groups. In addition, we adopt RFM analysis from the field of knowledge management to generate the explicit feedback (i.e., rating scores) used in the DR part. To evaluate CTDDTR, card transaction data collected over eight years on Jeju Island is used. Experimental results demonstrate that the proposed method performs better in terms of effectiveness and efficacy.

A Hybrid Collaborative Filtering-based Product Recommender System using Search Keywords (검색 키워드를 활용한 하이브리드 협업필터링 기반 상품 추천 시스템)

  • Lee, Yunju;Won, Haram;Shim, Jaeseung;Ahn, Hyunchul
    • Journal of Intelligence and Information Systems / v.26 no.1 / pp.151-166 / 2020
  • A recommender system recommends the products or services that best meet the preferences of each customer using statistical or machine learning techniques. Collaborative filtering (CF) is the most commonly used algorithm for implementing recommender systems. However, in most cases, it uses only purchase history or customer ratings, even though customers provide numerous other data. E-commerce customers frequently use a search function to find the products in which they are interested among the vast array of products offered. Such search keyword data may be a very useful information source for modeling customer preferences, yet it is rarely used as a source of information for recommendation systems. In this paper, we propose a novel hybrid CF model based on the Doc2Vec algorithm using search keywords and purchase history data of online shopping mall customers. To validate the applicability of the proposed model, we empirically tested its performance using real-world online shopping mall data from Korea. As the number of recommended products increased, the recommendation performance of the proposed CF (that is, the hybrid CF based on customers' search keywords) improved, whereas the performance of a conventional CF gradually decreased. As a result, we found that search keyword data effectively represents customer preferences and might contribute to an improvement in conventional CF recommender systems.

Opera Clustering: K-means on librettos datasets

  • Jeong, Harim;Yoo, Joo Hun
    • Journal of Internet Computing and Services / v.23 no.2 / pp.45-52 / 2022
  • With the development of artificial intelligence analysis methods, especially machine learning, their range of application is expanding across various fields. However, some difficulties remain in applying machine learning techniques to classical music. Genre classification and music recommendation systems built with deep learning algorithms are actively used for general music, but not for classical music. In this paper, we attempt to classify opera, a genre of classical music. To this end, an experiment was conducted to determine which criterion is most suitable among composer, period of composition, and emotional atmosphere, which are basic features of music. To generate emotional labels, we adopted zero-shot classification with four basic emotions: 'happiness', 'sadness', 'anger', and 'fear'. After embedding the opera librettos with the doc2vec model, the optimal number of clusters is computed with the elbow method. The resulting four centroids are then used in k-means clustering to group the unlabeled libretto dataset. We obtained the optimized clustering based on adjusted Rand index scores and compared the results with the annotated variables of the music. As a result, it was confirmed that the four clusters computed by the model after training were most similar to the grouping by period. Additionally, we were able to verify that the emotional similarity between composer and period was not significant. Knowing that period is the right criterion, we hope this study makes it easier for music listeners to find music that suits their tastes.
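
The adjusted Rand index used above to compare cluster assignments with annotated groupings can be computed from pair counts, as in this minimal sketch. The period and composer labels and the cluster assignments are hypothetical; the embedding, elbow method, and k-means steps are not shown.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same items."""
    n = len(labels_true)
    if n < 2:
        return 1.0  # fewer than two items: no pairs to disagree on
    # Pairs of items that share a group in both labelings, in the first, in the second.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)  # agreement expected by chance
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings put everything in one group
    return (sum_cells - expected) / (max_index - expected)

# Hypothetical annotations for six librettos and the clusters assigned to them.
period = ["baroque", "baroque", "classical", "classical", "romantic", "romantic"]
composer = ["A", "B", "A", "B", "A", "B"]
clusters = [0, 0, 1, 1, 2, 2]

ari_period = adjusted_rand_index(period, clusters)
ari_composer = adjusted_rand_index(composer, clusters)
print(ari_period, ari_composer)
```

A score near 1 means the clusters reproduce the annotated grouping, and a score near or below 0 means the agreement is no better than chance. In this toy data the clusters match the period labels exactly and do not match the composer labels.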

Investigation on the Effect of Multi-Vector Document Embedding for Interdisciplinary Knowledge Representation

  • Park, Jongin;Kim, Namgyu
    • Knowledge Management Research / v.21 no.1 / pp.99-116 / 2020
  • Text is the most widely used means of exchanging or expressing knowledge and information in the real world. Recently, research on structuring unstructured text data for text analysis has been actively performed. One of the most representative document embedding methods, Doc2Vec, generates a single vector for each document using all the words in the document. As a result, the document vector is affected not only by core words but also by miscellaneous words. Additionally, traditional document embedding algorithms map each document to only one vector, so it is not easy to properly represent a complex document with interdisciplinary subjects as a single vector. In this paper, we introduce a multi-vector document embedding method to overcome these limitations of traditional document embedding methods. After introducing the previous study on multi-vector document embedding, we visually analyze the effects of the multi-vector document embedding method. First, the new method vectorizes the document using only predefined keywords instead of all words. Second, the new method decomposes the various subjects included in the document and generates multiple vectors for each document. Experiments on about three thousand academic papers revealed that the traditional single-vector approach cannot properly map complex documents because of interference among subjects in each vector. With the multi-vector method, we ascertained that the information and knowledge in complex documents can be represented more accurately by eliminating the interference among subjects.

Firm Classification based on MBTI Organizational Character Type: Using Firm Review Big Data (MBTI 조직성격유형화에 따른 기업분류: 기업리뷰 빅데이터를 활용하여)

  • Lee, Hanjun;Shin, Dongwon;An, Byungdae
    • Asia-Pacific Journal of Business / v.12 no.3 / pp.361-378 / 2021
  • Purpose - The purpose of this study is to classify KOSPI-listed companies according to their organizational character type based on MBTI. Design/methodology/approach - This study collected 109,989 reviews from an online firm review website, Jobplanet. Using these reviews and descriptions of organizational character, we conducted a document similarity analysis with the Doc2Vec technique. Findings - First, more companies belong to Extraversion (E), Intuition (N), Feeling (F), and Judging (J) than to Introversion (I), Sensing (S), Thinking (T), and Perceiving (P) as MBTI organizational character types. Second, more companies have EJ and EP as the behavior type and NT and NF as the decision-making type. Third, the three most common organizational character types among the 16 types are ENTJ, ENFP, and ENFJ. Finally, companies belonging to the same industry group were found to have similar organizational character. Research implications or Originality - This study provides a novel way to measure organizational character type using firm review big data and a document similarity analysis technique. The research results can be used in practice by firms for organizational diagnosis and organizational management, and are meaningful as a basic study for future studies that empirically analyze the impact of organizational character.

Analysis of Global Entrepreneurship Trends Due to COVID-19: Focusing on Crunchbase (Covid-19에 따른 글로벌 창업 트렌드 분석: Crunchbase를 중심으로)

  • Shinho Kim;Youngjung Geum
    • Asia-Pacific Journal of Business Venturing and Entrepreneurship / v.18 no.3 / pp.141-156 / 2023
  • Due to the unprecedented worldwide Covid-19 pandemic, the business trends of companies have changed significantly. It is therefore essential to monitor rapid changes in innovation trends in order to design and plan future businesses. Since the pandemic, many studies have attempted to analyze business changes, but they are limited to specific industries and are insufficient in terms of data objectivity. In response, this study aims to analyze business trends after Covid-19 using Crunchbase, a global startup database. The data are collected and preprocessed in two-year periods from 2018 to 2021 to compare business trends. To capture the major trends, a network analysis is conducted on the industry groups and industry information based on co-occurrence. To analyze the minor trends, LDA-based topic modelling and word2vec-based clustering are used. The results show that the e-commerce, education, delivery, game, and entertainment industries are promising based on their technological advances, showing extension and diversification of industry boundaries as well as digitalization and servitization of business contents. This study is expected to help venture capitalists and entrepreneurs understand the rapid changes under the impact of Covid-19 and make the right decisions for the future.
