• Title/Summary/Keyword: text embedding

Emerging Topic Detection Using Text Embedding and Anomaly Pattern Detection in Text Streaming Data (텍스트 스트리밍 데이터에서 텍스트 임베딩과 이상 패턴 탐지를 이용한 신규 주제 발생 탐지)

  • Choi, Semok;Park, Cheong Hee
    • Journal of Korea Multimedia Society / v.23 no.9 / pp.1181-1190 / 2020
  • Detecting an anomaly pattern that deviates from the normal data distribution in streaming data is an important technique in many application areas. In this paper, a method for detecting a newly emerging pattern in text streaming data, which is an ordered sequence of texts, is proposed based on text embedding and anomaly pattern detection. The detection performance of the proposed method is compared across text embedding methods such as BOW (Bag of Words), Word2Vec, and BERT. Experimental results show that anomaly pattern detection using BERT embedding gave an average F1 value of 0.85 and an F1 value of 1 in three of five test cases.
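The abstract does not say how the anomaly pattern is detected, so the following is only a minimal sketch of the general idea under stated assumptions: texts are embedded as bag-of-words counts, a reference centroid is built from the first few texts, and an emerging topic is flagged when several consecutive texts fall below a cosine-similarity threshold. The function names, window sizes, and threshold are all hypothetical and are not taken from the paper.

```python
from collections import Counter
import math

def bow(text):
    """Bag-of-words embedding: token -> count."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def detect_emerging(stream, reference_size=4, window=3, threshold=0.2):
    """Return the index where `window` consecutive texts first look
    unlike the reference texts, or None if no such run exists.
    All parameter values are illustrative assumptions."""
    reference = Counter()
    for text in stream[:reference_size]:
        reference.update(bow(text))
    run = 0
    for i in range(reference_size, len(stream)):
        similar = cosine(bow(stream[i]), reference) >= threshold
        run = 0 if similar else run + 1
        if run >= window:
            return i - window + 1  # start of the anomalous run
    return None
```

Requiring a run of dissimilar texts, rather than a single one, is what separates an anomaly pattern from an isolated outlier in this sketch; swapping `bow` for a Word2Vec or BERT encoder would leave the rest unchanged.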

Investigation on the Effect of Multi-Vector Document Embedding for Interdisciplinary Knowledge Representation

  • Park, Jongin;Kim, Namgyu
    • Knowledge Management Research / v.21 no.1 / pp.99-116 / 2020
  • Text is the most widely used means of exchanging or expressing knowledge and information in the real world. Recently, research on structuring unstructured text data for text analysis has been actively conducted. One of the most representative document embedding methods (i.e., doc2Vec) generates a single vector for each document using all of the words in the document. As a result, the document vector is affected not only by core words but also by miscellaneous words. Additionally, traditional document embedding algorithms map each document to only one vector. Therefore, it is not easy to properly represent a complex document covering interdisciplinary subjects as a single vector with the traditional approach. In this paper, we introduce a multi-vector document embedding method to overcome these limitations of traditional document embedding methods. After introducing the previous study on multi-vector document embedding, we visually analyze the effects of the multi-vector document embedding method. First, the new method vectorizes the document using only predefined keywords instead of all of its words. Second, the new method decomposes the various subjects included in the document and generates multiple vectors for each document. Experiments on about three thousand academic papers revealed that the traditional single-vector approach cannot properly map complex documents because of interference among subjects in each vector. With the multi-vector method, we ascertained that the information and knowledge in complex documents can be represented more accurately by eliminating the interference among subjects.

Multi-Vector Document Embedding Using Semantic Decomposition of Complex Documents (복합 문서의 의미적 분해를 통한 다중 벡터 문서 임베딩 방법론)

  • Park, Jongin;Kim, Namgyu
    • Journal of Intelligence and Information Systems / v.25 no.3 / pp.19-41 / 2019
  • With the rapidly increasing demand for text data analysis, research and investment in text mining are being actively conducted not only in academia but also in various industries. Text mining is generally conducted in two steps. In the first step, the text of the collected document is tokenized and structured to convert the original document into a computer-readable form. In the second step, tasks such as document classification, clustering, and topic modeling are conducted according to the purpose of analysis. Until recently, text mining studies have focused on applications of the second step, such as document classification, clustering, and topic modeling. However, with the discovery that the text structuring process substantially influences the quality of the analysis results, various embedding methods have been actively studied to improve that quality by preserving the meaning of words and documents when representing text data as vectors. Unlike structured data, which can be directly applied to a variety of operations and traditional analysis techniques, unstructured text must first undergo a structuring task that transforms the original document into a form that the computer can understand. Mapping arbitrary objects into a space of a specific dimension while maintaining their algebraic properties, in order to structure the text data, is called "embedding". Recently, attempts have been made to embed not only words but also sentences, paragraphs, and entire documents. In particular, as the demand for document embedding analysis increases rapidly, many algorithms have been developed to support it. Among them, doc2Vec, which extends word2Vec and embeds each document into one vector, is the most widely used. However, the traditional document embedding method represented by doc2Vec generates a vector for each document using all of the words included in the document. 
As a result, the document vector is affected not only by core words but also by miscellaneous words. Additionally, traditional document embedding schemes usually map each document to a single corresponding vector. Therefore, it is difficult to accurately represent a complex document with multiple subjects as a single vector using the traditional approach. In this paper, we propose a new multi-vector document embedding method to overcome these limitations of traditional document embedding methods. This study targets documents that explicitly separate body content and keywords. For a document without keywords, the method can be applied after extracting keywords through various analysis methods. However, since this is not the core subject of the proposed method, we describe the process of applying it to documents with predefined keywords. The proposed method consists of (1) Parsing, (2) Word Embedding, (3) Keyword Vector Extraction, (4) Keyword Clustering, and (5) Multiple-Vector Generation. The specific process is as follows. First, all text in a document is tokenized, and each token is represented as an N-dimensional real-valued vector through word embedding. After that, to overcome the limitation of the traditional document embedding method, which is affected not only by core words but also by miscellaneous words, the vectors corresponding to the keywords of each document are extracted to form a set of keyword vectors for each document. Next, clustering is conducted on the set of keywords for each document to identify the multiple subjects included in the document. Finally, a multi-vector is generated from the vectors of the keywords constituting each cluster. Experiments on 3,147 academic papers revealed that the traditional single-vector approach cannot properly map complex documents because of interference among subjects in each vector. 
With the proposed multi-vector based method, we ascertained that complex documents can be vectorized more accurately by eliminating the interference among subjects.
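The five steps named in this abstract can be sketched roughly as follows. This is an illustration under assumptions, not the authors' implementation: the abstract does not name the clustering algorithm or say how a cluster becomes a vector, so a tiny k-means and a per-cluster mean are used here as stand-ins, and `word_vectors` is a hypothetical lookup table replacing a trained word embedding model.

```python
import math

def mean(vectors):
    """Component-wise mean of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def dist(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, k, iters=10):
    """Tiny k-means; centers start at the first k points (a simplification)."""
    k = min(k, len(points))
    centers = [list(p) for p in points[:k]]
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: dist(p, centers[c]))
            clusters[nearest].append(p)
        # Keep the old center when a cluster ends up empty.
        centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
    return [c for c in clusters if c]

def multi_vector_embedding(text, keywords, word_vectors, k):
    tokens = text.lower().split()                                   # (1) parsing
    embedded = {t: word_vectors[t] for t in tokens if t in word_vectors}  # (2) word embedding
    keyword_vecs = [embedded[w] for w in keywords if w in embedded]  # (3) keyword vectors only
    clusters = kmeans(keyword_vecs, k)                               # (4) keyword clustering
    return [mean(c) for c in clusters]                               # (5) one vector per subject

# Hypothetical 2-D word vectors; a real model would supply N-dimensional ones.
word_vectors = {
    "neural": [1.0, 0.0], "network": [0.9, 0.1],
    "protein": [0.0, 1.0], "genome": [0.1, 0.9], "study": [0.5, 0.5],
}
text = "this study applies a neural network to protein and genome data"
vectors = multi_vector_embedding(
    text, ["neural", "network", "protein", "genome"], word_vectors, k=2)
```

In this toy run the non-keyword token "study" never reaches the clustering step, and the two subject groups yield two separate vectors instead of one blended average.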

SMS Text Messages Filtering using Word Embedding and Deep Learning Techniques (워드 임베딩과 딥러닝 기법을 이용한 SMS 문자 메시지 필터링)

  • Lee, Hyun Young;Kang, Seung Shik
    • Smart Media Journal / v.7 no.4 / pp.24-29 / 2018
  • Text analysis techniques for natural language processing in deep learning represent words in vector form through word embedding. In this paper, we propose a method of constructing a document vector and classifying it as a spam or normal text message using word embedding and deep learning. Automatic spacing applied during preprocessing ensures that words with similar contexts are represented adjacently in vector space. Additionally, intentional word-formation errors using non-alphabetic or unusual characters are designed to avoid being blocked by spam message filters. Two embedding algorithms, CBOW and skip-gram, are used to produce the sentence vector, and the performance and accuracy of the deep learning based spam filter model are measured by comparison with those of SVM Light.
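As a reminder of how the two embedding algorithms named above differ, here is a small sketch of the training examples each one derives from a tokenized message: skip-gram predicts each context word from the center word, while CBOW predicts the center word from its context. The helper names and the window size are illustrative assumptions, and the abstract does not say how the paper turns word vectors into a sentence vector.

```python
def context_indices(i, length, window):
    """Indices within `window` positions of i, excluding i itself."""
    lo, hi = max(0, i - window), min(length, i + window + 1)
    return [j for j in range(lo, hi) if j != i]

def skipgram_pairs(tokens, window=2):
    """(center, context) pairs: one prediction target per context word."""
    return [(tokens[i], tokens[j])
            for i in range(len(tokens))
            for j in context_indices(i, len(tokens), window)]

def cbow_examples(tokens, window=2):
    """(context words, center) examples: the whole context predicts the center."""
    return [([tokens[j] for j in context_indices(i, len(tokens), window)], tokens[i])
            for i in range(len(tokens))]

message = ["free", "prize", "call", "now"]
pairs = skipgram_pairs(message, window=1)
examples = cbow_examples(message, window=1)
```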

A Study on the Alternative Method of Video Characteristics Using Captioning in Text-Video Retrieval Model (텍스트-비디오 검색 모델에서의 캡션을 활용한 비디오 특성 대체 방안 연구)

  • Dong-hun, Lee;Chan, Hur;Hyeyoung, Park;Sang-hyo, Park
    • IEMEK Journal of Embedded Systems and Applications / v.17 no.6 / pp.347-353 / 2022
  • In this paper, we propose a text-video retrieval method that replaces video properties with captions. In general, existing embedding-based models consist of both joint embedding space construction and a CNN-based video encoding process, which require a lot of computation in training as well as inference. To overcome this problem, we introduce a video-captioning module and replace the visual properties of a video with the captions it generates. Specifically, we adopt a caption generator that converts candidate videos into captions in the inference process, thereby enabling direct comparison between the text given as a query and the candidate videos without a joint embedding space. In experiments on two benchmark datasets, MSR-VTT and VATEX, the proposed model successfully reduces the amount of computation and the inference time by skipping visual processing and joint embedding space construction.
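The retrieval idea can be pictured with a short sketch in which each candidate video is already represented by a caption and the query is compared with those captions as plain text. The abstract does not state which text similarity the paper uses, so Jaccard token overlap stands in here, and the caption strings are hypothetical placeholders for the output of a caption generator.

```python
def token_set(text):
    """Lowercased whitespace tokens as a set."""
    return set(text.lower().split())

def jaccard(a, b):
    """Size of the intersection over size of the union of two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def rank_videos(query, captions):
    """Rank video ids by text similarity between the query and each caption.
    `captions` maps video id -> caption text (assumed to be pre-generated)."""
    q = token_set(query)
    scored = [(jaccard(q, token_set(c)), vid) for vid, c in captions.items()]
    # Highest score first; ties broken by video id for determinism.
    return [vid for score, vid in sorted(scored, key=lambda s: (-s[0], s[1]))]

captions = {
    "v1": "a man is cooking pasta in a kitchen",
    "v2": "a dog runs on the beach",
    "v3": "two people play guitar on stage",
}
ranking = rank_videos("a dog running on a beach", captions)
```

Because both sides of the comparison are text, no video encoder or joint embedding space is involved at query time, which is the source of the savings the abstract reports.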

A Comparative Study of Word Embedding Models for Arabic Text Processing

  • Assiri, Fatmah;Alghamdi, Nuha
    • International Journal of Computer Science & Network Security / v.22 no.8 / pp.399-403 / 2022
  • Natural texts are analyzed to obtain their intended meaning so that they can be classified depending on the problem under study. One way to represent words is by generating vectors of real values that encode their meaning; this is called word embedding. Similarities between word representations are measured to identify the text class. Word embeddings can be created using the word2vec technique. More recently, however, fastText was introduced to provide better results when used with classifiers. In this paper, we study the performance of well-known classifiers when using both word embedding techniques with an Arabic dataset. We applied them to real data collected from Wikipedia and found that word2vec and fastText had similar accuracy with all classifiers used.

Group-based speaker embeddings for text-independent speaker verification (문장 독립 화자 검증을 위한 그룹기반 화자 임베딩)

  • Jung, Youngmoon;Eom, Youngsik;Lee, Yeonghyeon;Kim, Hoirin
    • The Journal of the Acoustical Society of Korea / v.40 no.5 / pp.496-502 / 2021
  • Recently, the deep speaker embedding approach has been widely used in text-independent speaker verification and shows better performance than the traditional i-vector approach. In this work, to improve the deep speaker embedding approach, we propose a novel method called group-based speaker embedding, which incorporates group information. We cluster all speakers of the training data into a predefined number of groups in an unsupervised manner, so that a fixed-length group embedding represents the corresponding group. A Group Decision Network (GDN) produces group weights, and an aggregated group embedding is generated as the sum of the group embeddings weighted by the group weights. Finally, we generate a group-based embedding by adding the aggregated group embedding to the deep speaker embedding. In this way, a speaker embedding can reduce the search space of the speaker identity by incorporating group information, and can thereby flexibly represent a significant number of speakers. We conducted experiments using the VoxCeleb1 database to show that our proposed approach improves on the previous approaches.
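The aggregation described in this abstract amounts to a weighted sum followed by an addition, which the sketch below spells out. It is a toy illustration only: the group weights are passed in directly as hypothetical numbers, whereas in the paper they come from the Group Decision Network, and the vectors here are two-dimensional for readability.

```python
def aggregate_groups(group_embeddings, group_weights):
    """Weighted sum of the group embeddings, one weight per group."""
    dims = len(group_embeddings[0])
    return [sum(w * g[d] for w, g in zip(group_weights, group_embeddings))
            for d in range(dims)]

def group_based_embedding(speaker_embedding, group_embeddings, group_weights):
    """Add the aggregated group embedding to the deep speaker embedding."""
    aggregated = aggregate_groups(group_embeddings, group_weights)
    return [s + a for s, a in zip(speaker_embedding, aggregated)]

# Hypothetical values: two groups, weights that would come from the GDN.
speaker = [1.0, 2.0]
groups = [[1.0, 0.0], [0.0, 1.0]]
weights = [0.25, 0.75]
embedding = group_based_embedding(speaker, groups, weights)
```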

CR-M-SpanBERT: Multiple embedding-based DNN coreference resolution using self-attention SpanBERT

  • Joon-young Jung
    • ETRI Journal / v.46 no.1 / pp.35-47 / 2024
  • This study introduces CR-M-SpanBERT, a coreference resolution (CR) model that utilizes multiple embedding-based span bidirectional encoder representations from transformers for antecedent recognition in natural language (NL) text. Information extraction studies aim to extract knowledge from NL text autonomously and cost-effectively. However, the extracted information may not represent knowledge accurately owing to the presence of ambiguous entities. Therefore, we propose a CR model that identifies mentions referring to the same entity in NL text. CR requires understanding both the syntax and the semantics of the NL text simultaneously. Therefore, multiple embeddings that can include syntactic and semantic information for each word are generated for CR. We evaluate the effectiveness of CR-M-SpanBERT by comparing it to a model that uses SpanBERT as the language model in CR studies. The results demonstrate that our proposed deep neural network model achieves high recognition accuracy for extracting antecedents from NL text. Additionally, it requires fewer epochs to achieve an average F1 accuracy greater than 75% compared with the conventional SpanBERT approach.

New Text Steganography Technique Based on Part-of-Speech Tagging and Format-Preserving Encryption

  • Mohammed Abdul Majeed;Rossilawati Sulaiman;Zarina Shukur
    • KSII Transactions on Internet and Information Systems (TIIS) / v.18 no.1 / pp.170-191 / 2024
  • The transmission of confidential data using cover media is called steganography. The three requirements of any effective steganography system are high embedding capacity, security, and imperceptibility. The structure of a text file, which makes syntax and grammar more visually obvious than in other media, contributes to its poor imperceptibility. Text is regarded as the most challenging carrier for hiding secret data because it has less redundant data than other digital objects. Unicode characters, especially non-printing or invisible ones, are employed for hiding data by mapping a specific number of secret data bits to each character and inserting the character into the spaces of the cover text. These characters are known to offer limited space for embedding secret data. Current studies that used Unicode characters in text steganography focused on increasing the data hiding capacity despite the insufficient redundant data in a text file. A sequential embedding pattern is often selected and applied to all available positions in the cover text. This embedding pattern negatively affects the imperceptibility and security of the text steganography system. Thus, this study attempts to overcome these limitations using the part-of-speech (POS) tagging technique combined with the randomization concept in data hiding. Combining these two techniques allows the Unicode characters to be inserted in randomized patterns at specific positions in the cover text, increasing data hiding capacity with minimal effect on imperceptibility and security. Format-preserving encryption (FPE) is also used to encrypt a secret message without changing its size before the embedding process. A comparison of the proposed technique with existing ones demonstrates that it fulfils the cover file's capacity, imperceptibility, and security requirements.
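For readers unfamiliar with the Unicode approach, the sketch below shows the sequential baseline that the abstract describes as common practice, not the proposed POS-tagging, randomization, or FPE technique. It assumes a mapping of two secret bits to one of four invisible characters and inserts one such character after each space in the cover text; the choice of characters and the two-bit mapping are illustrative assumptions.

```python
# Four invisible code points, so each inserted character carries two bits:
# zero width space, zero width non-joiner, zero width joiner, word joiner.
INVISIBLE = ["\u200b", "\u200c", "\u200d", "\u2060"]

def embed(cover, secret):
    """Hide `secret` (bytes) by inserting one invisible character after
    each space of `cover`, in order (sequential embedding pattern)."""
    bits = "".join(f"{byte:08b}" for byte in secret)
    chunks = [bits[i:i + 2] for i in range(0, len(bits), 2)]
    if len(chunks) > cover.count(" "):
        raise ValueError("cover text has too few spaces for this secret")
    remaining = iter(chunks)
    out = []
    for ch in cover:
        out.append(ch)
        if ch == " ":
            chunk = next(remaining, None)
            if chunk is not None:
                out.append(INVISIBLE[int(chunk, 2)])
    return "".join(out)

def extract(stego):
    """Recover the hidden bytes from the invisible characters."""
    bits = "".join(f"{INVISIBLE.index(ch):02b}" for ch in stego if ch in INVISIBLE)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))

cover = "the quick brown fox jumps over the lazy dog today"
stego = embed(cover, b"hi")
recovered = extract(stego)
```

The capacity limit is visible directly in the code: a cover text with n spaces can carry at most 2n bits under this mapping, which is the kind of constraint the abstract refers to.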

Performance analysis of Various Embedding Models Based on Hyper Parameters (다양한 임베딩 모델들의 하이퍼 파라미터 변화에 따른 성능 분석)

  • Lee, Sanga;Park, Jaeseong;Kang, Sangwoo;Lee, Jeong-Eom;Kim, Seona
    • Annual Conference on Human and Language Technology / 2018.10a / pp.510-513 / 2018
  • This paper studies how various word embedding models perform in a specific domain when combined with different hyperparameters. For three word embedding models, Word2Vec, FastText, and GloVe, we varied the dimension, window size, and minimum count to create a total of 36 embedding vectors. Each embedding vector was applied to the Fast and Accurate Dependency Parser model to measure the performance of each model. For all models, performance improved as the dimension increased, and FastText achieved high performance in most cases.
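A grid of this kind can be enumerated as shown below. The abstract gives only the three models and the total of 36 configurations, so the specific dimension, window size, and minimum count values are hypothetical, chosen so that 3 models x 3 dimensions x 2 window sizes x 2 minimum counts gives 36.

```python
from itertools import product

models = ["Word2Vec", "FastText", "GloVe"]
dimensions = [100, 200, 300]  # hypothetical values
window_sizes = [5, 10]        # hypothetical values
min_counts = [1, 5]           # hypothetical values

# One configuration per combination: 3 * 3 * 2 * 2 = 36.
configs = [
    {"model": m, "dimension": d, "window": w, "min_count": c}
    for m, d, w, c in product(models, dimensions, window_sizes, min_counts)
]
```

Each entry in `configs` would then be used to train one embedding and evaluate it with the downstream parser.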
