• Title/Summary/Keyword: text embedding

Sentence model based subword embeddings for a dialog system

  • Chung, Euisok;Kim, Hyun Woo;Song, Hwa Jeon
    • ETRI Journal / v.44 no.4 / pp.599-612 / 2022
  • This study focuses on improving a word embedding model to enhance the performance of downstream tasks, such as those of dialog systems. To improve traditional word embedding models, such as skip-gram, it is critical to refine the word features and expand the context model. In this paper, we approach the word model from the perspective of subword embedding and attempt to extend the context model by integrating various sentence models. Our proposed sentence model is a subword-based skip-thought model that integrates self-attention and relative position encoding techniques. We also propose a clustering-based dialog model for downstream task verification and evaluate its relationship with the sentence-model-based subword embedding technique. The proposed subword embedding method produces better results than previous methods in evaluating word and sentence similarity. In addition, the downstream task verification, a clustering-based dialog system, demonstrates an improvement of up to 4.86% over the results of FastText in previous research.
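
The subword view of a word mentioned in this abstract can be illustrated with FastText-style character n-grams, since FastText is the baseline the abstract compares against. The sketch below is not the authors' skip-thought model: the function names, the n-gram range of 3 to 6, and the toy n-gram table are assumptions made for illustration only.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return the character n-grams of a word wrapped in boundary markers."""
    wrapped = f"<{word}>"  # "<" and ">" mark the start and end of the word
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams


def subword_vector(word, ngram_table, dim):
    """Sum the vectors of every n-gram of `word` found in `ngram_table`.

    `ngram_table` is a hypothetical dict mapping n-gram -> list of floats;
    n-grams missing from the table contribute nothing.
    """
    vec = [0.0] * dim
    for gram in char_ngrams(word):
        for k, value in enumerate(ngram_table.get(gram, [])):
            vec[k] += value
    return vec
```

Because the vector is built from n-grams, a word never seen in training still receives a representation as long as some of its n-grams are in the table.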

Trends in Clinical Research of Catgut Embedding for Obesity Treatment (비만 치료에 매선을 이용한 임상 연구 동향 분석)

  • Jung-Sik Park
    • Journal of Korean Medicine Rehabilitation / v.33 no.3 / pp.129-134 / 2023
  • Objectives: The purpose of this study was to review studies of catgut embedding for obesity treatment. Methods: We searched for papers using the keywords "obesity" and "catgut embedding" in the Research Information Sharing Service, DBpia, Koreanstudies Information Service System, Oriental Medicine Advanced Searching Integrated System, Scopus, and PubMed. Additional data, including study design, study topics, characteristics of participants and treatment, and outcomes, were extracted from the full text of each study. Results: Nine studies on catgut embedding for obesity treatment were found. Five were conducted in China, two were conducted in Mexico, and two were published in Korea. The seven experimental studies and two observational studies were analyzed to describe the subjects, methods, and results of each. Conclusions: More interest and further research on catgut embedding for obesity treatment are needed in Korean medicine to achieve clinical application and to develop treatment protocols for obesity.

Association Modeling on Keyword and Abstract Data in Korean Port Research

  • Yoon, Hee-Young;Kwak, Il-Youp
    • Journal of Korea Trade / v.24 no.5 / pp.71-86 / 2020
  • Purpose - This study investigates research trends by searching for English keywords and abstracts in 1,511 Korean journal articles in the Korea Citation Index from the 2002-2019 period using the term "Port." The study aims to lay the foundation for a more balanced development of port research. Design/methodology - Using abstract and keyword data, we perform frequency analysis and word embedding (Word2vec). A t-SNE plot shows the main keywords extracted using the TextRank algorithm. To analyze which words were used in what context in our two nine-year subperiods (2002-2010 and 2010-2019), we use Scattertext and scaled F-scores. Findings - First, during the 18-year study period, port research has developed through the convergence of diverse academic fields, covering 102 subject areas and 219 journals. Second, our frequency analysis of 4,431 keywords in 1,511 papers shows that the words "Port" (60 times), "Port Competitiveness" (33 times), and "Port Authority" (29 times), among others, are attractive to most researchers. Third, a word embedding analysis identifies the words highly correlated with the top eight keywords and visually shows four different subject clusters in a t-SNE plot. Fourth, we use Scattertext to compare words used in the two research sub-periods. Originality/value - This study is the first to apply abstract and keyword analysis and various text mining techniques to Korean journal articles in port research and thus has important implications. Further in-depth studies should collect a greater variety of textual data and analyze and compare port studies from different countries.

Word-Level Embedding to Improve Performance of Representative Spatio-temporal Document Classification

  • Byoungwook Kim;Hong-Jun Jang
    • Journal of Information Processing Systems / v.19 no.6 / pp.830-841 / 2023
  • Tokenization is the process of segmenting input text into smaller units, and it is a preprocessing task performed mainly to improve the efficiency of the machine learning process. Various tokenization methods have been proposed in the field of natural language processing, but studies have primarily focused on segmenting text efficiently. Few studies have explored which tokenization methods are suitable for document classification tasks in Korean. In this paper, an exploratory study was performed to find the most suitable tokenization method for improving the performance of a representative spatio-temporal document classifier in Korean. A convolutional neural network model was used for the experiment, and for the final performance comparison, document classification tasks were selected in which performance largely depends on the tokenization method. The commonly used Jamo, character, and word units were adopted as the tokenization methods for the comparative experiments. The experiment confirmed that word-unit tokenization showed excellent performance on the representative spatio-temporal document classification task, where the semantic embedding ability of the token itself is important.
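
The three tokenization units compared in this abstract (Jamo, character, word) can be sketched as follows. This is a minimal illustration and not the authors' preprocessing code: the function names are hypothetical, word tokens are assumed to be whitespace-separated, and Jamo decomposition uses the standard Unicode arithmetic for precomposed Hangul syllables.

```python
# Conjoining Jamo blocks defined by Unicode.
LEADS = [chr(0x1100 + i) for i in range(19)]          # leading consonants
VOWELS = [chr(0x1161 + i) for i in range(21)]         # vowels
TAILS = [""] + [chr(0x11A8 + i) for i in range(27)]   # optional trailing consonants


def to_jamo(text):
    """Decompose precomposed Hangul syllables (U+AC00..U+D7A3) into Jamo."""
    out = []
    for ch in text:
        index = ord(ch) - 0xAC00
        if 0 <= index < 19 * 21 * 28:
            lead, rest = divmod(index, 21 * 28)
            vowel, tail = divmod(rest, 28)
            out.append(LEADS[lead])
            out.append(VOWELS[vowel])
            if tail:
                out.append(TAILS[tail])
        else:
            out.append(ch)  # non-Hangul characters pass through unchanged
    return out


def tokenize(text, unit):
    """Tokenize `text` at the 'word', 'char', or 'jamo' level."""
    if unit == "word":
        return text.split()
    if unit == "char":
        return [ch for ch in text if not ch.isspace()]
    if unit == "jamo":
        return to_jamo("".join(text.split()))
    raise ValueError(f"unknown unit: {unit}")
```

The finer the unit, the longer the token sequence and the less meaning each token carries on its own, which is the trade-off the abstract refers to.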

Impact of Word Embedding Methods on Performance of Sentiment Analysis with Machine Learning Techniques

  • Park, Hoyeon;Kim, Kyoung-jae
    • Journal of the Korea Society of Computer and Information / v.25 no.8 / pp.181-188 / 2020
  • In this study, we conduct a comparative study to confirm the impact of various word embedding techniques on the performance of sentiment analysis. Sentiment analysis is an opinion mining technique that identifies and extracts subjective information from text using natural language processing, and it can be used to classify the sentiment of product reviews or comments. Since sentiment can be classified as either positive or negative, it can be treated as a general classification problem. For sentiment analysis, the text must be converted into a form that a computer can recognize. In natural language processing, text such as a word or document is therefore transformed into a vector, a process called word embedding. Various techniques, such as Bag of Words, TF-IDF, and Word2Vec, are used as word embedding techniques. Until now, there have not been many studies on which word embedding techniques are suitable for sentiment analysis. In this study, Bag of Words, TF-IDF, and Word2Vec are used to compare and analyze the performance of movie review sentiment analysis. The data set for this study is the IMDB data set, which is widely used in text mining. The results show that TF-IDF and Bag of Words outperformed Word2Vec, and that TF-IDF performed better than Bag of Words, although the difference was not very significant.
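
Two of the vectorization techniques named in this abstract, Bag of Words and TF-IDF, can be sketched in a few lines. The code below assumes whitespace tokenization and the textbook weighting tf × log(N / df); libraries often use smoothed variants, and the abstract does not say which formula the authors used.

```python
import math
from collections import Counter


def bag_of_words(docs):
    """Return (vocabulary, count vectors) for whitespace-tokenized documents."""
    tokenized = [Counter(doc.lower().split()) for doc in docs]
    vocab = sorted(set().union(*tokenized))
    return vocab, [[counts[word] for word in vocab] for counts in tokenized]


def tf_idf(docs):
    """Weight each raw count by log(N / df), the unsmoothed textbook IDF."""
    vocab, counts = bag_of_words(docs)
    n_docs = len(docs)
    idf = []
    for j in range(len(vocab)):
        df = sum(1 for row in counts if row[j] > 0)  # documents containing word j
        idf.append(math.log(n_docs / df))
    weights = [[tf * idf[j] for j, tf in enumerate(row)] for row in counts]
    return vocab, weights
```

For the two toy reviews "good movie" and "bad movie", the word "movie" appears in every document, so its TF-IDF weight is zero while "good" and "bad" keep a positive weight.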

Improving Abstractive Summarization by Training Masked Out-of-Vocabulary Words

  • Lee, Tae-Seok;Lee, Hyun-Young;Kang, Seung-Shik
    • Journal of Information Processing Systems / v.18 no.3 / pp.344-358 / 2022
  • Text summarization is the task of producing a shorter version of a long document while accurately preserving the main contents of the original text. Abstractive summarization generates novel words and phrases using a language generation method through text transformation and prior-embedded word information. However, newly coined words or out-of-vocabulary words decrease the performance of automatic summarization because they are not pre-trained in the machine learning process. In this study, we demonstrated an improvement in summarization quality through the contextualized embedding of BERT with out-of-vocabulary masking. In addition, by explicitly providing precise pointing and an optional copy instruction along with the BERT embedding, we achieved higher accuracy than the baseline model. The recall-based word-generation metric ROUGE-1 score was 55.11, and the word-order-based ROUGE-L score was 39.65.
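
The two metrics reported in this abstract can be sketched as follows: ROUGE-1 counts overlapping unigrams, and ROUGE-L is based on the longest common subsequence (LCS), which rewards matching word order. The sketch assumes whitespace tokenization and computes recall only; the official ROUGE toolkit also reports precision and F-measure and applies its own preprocessing, so these functions will not reproduce the published scores.

```python
from collections import Counter


def rouge_1_recall(candidate, reference):
    """Fraction of reference unigrams that also appear in the candidate."""
    cand = Counter(candidate.split())
    ref = Counter(reference.split())
    overlap = sum(min(count, cand[word]) for word, count in ref.items())
    return overlap / max(sum(ref.values()), 1)


def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l_recall(candidate, reference):
    """LCS length divided by the number of reference tokens."""
    cand, ref = candidate.split(), reference.split()
    return lcs_length(cand, ref) / max(len(ref), 1)
```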

Semantic Feature Analysis for Multi-Label Text Classification on Topics of the Al-Quran Verses

  • Gugun Mediamer;Adiwijaya
    • Journal of Information Processing Systems / v.20 no.1 / pp.1-12 / 2024
  • Nowadays, Islamic content, including the Hadith and the Al-Quran, is widely used in research. Both are mostly used in the field of natural language processing, especially in text classification research. One of the difficulties in learning the Al-Quran is ambiguity, even though the Al-Quran is the main source of Islamic law and life guidance for Muslims. This research aims to help people learn the Al-Quran. We propose a word embedding feature based on the Tensor Space Model for feature extraction, which is used to reduce the ambiguity. Based on the experimental results and analysis, we show that the proposed method yields the best performance, with a Hamming loss of 0.10317.
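
Hamming loss, the metric reported in this abstract, is the fraction of individual label assignments that are wrong across all samples and all labels, so lower is better. The sketch below assumes labels are given as equal-length binary indicator rows; the function name and the toy data in the usage note are illustrative.

```python
def hamming_loss(y_true, y_pred):
    """Fraction of label positions where prediction and truth disagree."""
    total = 0
    wrong = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            total += 1
            wrong += int(t != p)
    return wrong / total
```

With two verses and three topic labels each, `hamming_loss([[1, 0, 1], [0, 1, 0]], [[1, 1, 1], [0, 1, 1]])` counts two wrong positions out of six.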

Design of a Mirror for Fragrance Recommendation based on Personal Emotion Analysis (개인의 감성 분석 기반 향 추천 미러 설계)

  • Hyeonji Kim;Yoosoo Oh
    • Journal of Korea Society of Industrial Information Systems / v.28 no.4 / pp.11-19 / 2023
  • This paper proposes a smart mirror system that recommends fragrances based on user emotion analysis. It combines natural language processing embedding techniques (CountVectorizer and TF-IDF) with machine learning classification models (DecisionTree, SVM, RandomForest, SGD Classifier) to build models and compares the results. After the comparison, the paper constructs a personal emotion-based fragrance recommendation mirror model using the best-performing emotion classifier, a pipeline of word embedding and SVM. The proposed system implements a personalized fragrance recommendation mirror based on emotion analysis and provides web services using the Flask web framework. It uses the Google Speech Cloud API to recognize users' voices and speech-to-text (STT) to convert the speech into text data. The proposed system also provides users with information about weather, humidity, location, quotes, time, and schedule management.

A novel, reversible, Chinese text information hiding scheme based on lookalike traditional and simplified Chinese characters

  • Feng, Bin;Wang, Zhi-Hui;Wang, Duo;Chang, Ching-Yun;Li, Ming-Chu
    • KSII Transactions on Internet and Information Systems (TIIS) / v.8 no.1 / pp.269-281 / 2014
  • Compared to hiding information in a digital image, hiding information in a digital text file requires less storage space and less bandwidth for data transmission, and it is widely applicable. However, text files have low redundancy, so it is more difficult to hide information in them. To overcome this difficulty, Wang et al. proposed a reversible information hiding scheme using left-right and up-down representations of Chinese characters, but when the scheme is implemented, it does not provide good visual steganographic effectiveness, and the embedding and extracting processes are too complicated to be done with reasonable effort and cost. We observed that many traditional and simplified Chinese characters look nearly the same (so-called lookalikes), so we use this feature to propose a novel information hiding scheme that hides secret data in lookalike Chinese characters. Compared to Wang et al.'s scheme, the proposed scheme significantly simplifies the embedding and extracting procedures and improves the visual steganographic effectiveness. The experimental results demonstrated the advantages of our proposed scheme.
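
The general idea of hiding bits in lookalike character pairs can be sketched as below. This is not the authors' algorithm, whose details the abstract does not give. The sketch assumes a cover text written entirely in simplified characters, a small hypothetical lookalike table, and one bit per embeddable character (simplified form for 0, traditional form for 1). Under those assumptions the cover text is recoverable by mapping every traditional lookalike back to its simplified form.

```python
# Hypothetical lookalike table: simplified form -> traditional form.
# Written as escapes so the distinct code points are unambiguous.
PAIRS = {
    "\u522b": "\u5225",  # bie
    "\u5415": "\u5442",  # lu
    "\u5185": "\u5167",  # nei
    "\u9ec4": "\u9ec3",  # huang
}
REVERSE = {trad: simp for simp, trad in PAIRS.items()}


def capacity(cover):
    """Number of bits the cover text can carry (one per lookalike character)."""
    return sum(ch in PAIRS for ch in cover)


def embed(cover, bits):
    """Encode bit 1 by swapping a simplified character for its traditional lookalike."""
    if len(bits) > capacity(cover):
        raise ValueError("cover text too short for the payload")
    out, i = [], 0
    for ch in cover:
        if ch in PAIRS and i < len(bits):
            out.append(PAIRS[ch] if bits[i] else ch)
            i += 1
        else:
            out.append(ch)
    return "".join(out)


def extract(stego, n_bits):
    """Read back the first n_bits from the lookalike characters."""
    bits = []
    for ch in stego:
        if len(bits) == n_bits:
            break
        if ch in PAIRS:
            bits.append(0)
        elif ch in REVERSE:
            bits.append(1)
    return bits


def recover(stego):
    """Restore the all-simplified cover text."""
    return "".join(REVERSE.get(ch, ch) for ch in stego)
```

The receiver needs to know the payload length `n_bits`, which in this sketch is assumed to be shared out of band.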

A Discourse-based Compositional Approach to Overcome Drawbacks of Sequence-based Composition in Text Modeling via Neural Networks (신경망 기반 텍스트 모델링에 있어 순차적 결합 방법의 한계점과 이를 극복하기 위한 담화 기반의 결합 방법)

  • Lee, Kangwook;Han, Sanggyu;Myaeng, Sung-Hyon
    • KIISE Transactions on Computing Practices / v.23 no.12 / pp.698-702 / 2017
  • Since the introduction of Deep Neural Networks to the Natural Language Processing field, two major approaches have been considered for modeling text. The first learns embeddings, i.e., distributed representations containing the abstract semantics of words or sentences, from the textual context. The second composes the embeddings trained by the first approach to obtain embeddings of longer texts. However, most studies of composition methods simply adopt word embeddings without considering the optimal embedding unit or the optimal method of composition. In this paper, we conducted experiments to analyze the optimal embedding unit and the optimal composition method for modeling longer texts, such as documents. In addition, we suggest a new discourse-based composition method to overcome the limitations of sequential composition when composing sentence embeddings.