• Title/Summary/Keyword: BERT Embedding Model

Search Result 24, Processing Time 0.023 seconds

HTML Tag Depth Embedding: An Input Embedding Method of the BERT Model for Improving Web Document Reading Comprehension Performance (HTML 태그 깊이 임베딩: 웹 문서 기계 독해 성능 개선을 위한 BERT 모델의 입력 임베딩 기법)

  • Mok, Jin-Wang;Jang, Hyun Jae;Lee, Hyun-Seob
    • Journal of Internet of Things and Convergence
    • /
    • v.8 no.5
    • /
    • pp.17-25
    • /
    • 2022
  • Recently the massive amount of data has been generated because of the number of edge devices increases. And especially, the number of raw unstructured HTML documents has been increased. Therefore, MRC(Machine Reading Comprehension) in which a natural language processing model finds the important information within an HTML document is becoming more important. In this paper, we propose HTDE(HTML Tag Depth Embedding Method), which allows the BERT to train the depth of the HTML document structure. HTDE makes a tag stack from the HTML document for each input token in the BERT and then extracts the depth information. After that, we add a HTML embedding layer that takes the depth of the token as input to the step of input embedding of BERT. Since tokenization using HTDE identifies the HTML document structures through the relationship of surrounding tokens, HTDE improves the accuracy of BERT for HTML documents. Finally, we demonstrated that the proposed idea showing the higher accuracy compared than the accuracy using the conventional embedding of BERT.

CR-M-SpanBERT: Multiple embedding-based DNN coreference resolution using self-attention SpanBERT

  • Joon-young Jung
    • ETRI Journal
    • /
    • v.46 no.1
    • /
    • pp.35-47
    • /
    • 2024
  • This study introduces CR-M-SpanBERT, a coreference resolution (CR) model that utilizes multiple embedding-based span bidirectional encoder representations from transformers, for antecedent recognition in natural language (NL) text. Information extraction studies aimed to extract knowledge from NL text autonomously and cost-effectively. However, the extracted information may not represent knowledge accurately owing to the presence of ambiguous entities. Therefore, we propose a CR model that identifies mentions referring to the same entity in NL text. In the case of CR, it is necessary to understand both the syntax and semantics of the NL text simultaneously. Therefore, multiple embeddings are generated for CR, which can include syntactic and semantic information for each word. We evaluate the effectiveness of CR-M-SpanBERT by comparing it to a model that uses SpanBERT as the language model in CR studies. The results demonstrate that our proposed deep neural network model achieves high-recognition accuracy for extracting antecedents from NL text. Additionally, it requires fewer epochs to achieve an average F1 accuracy greater than 75% compared with the conventional SpanBERT approach.

E-commerce data based Sentiment Analysis Model Implementation using Natural Language Processing Model (자연어처리 모델을 이용한 이커머스 데이터 기반 감성 분석 모델 구축)

  • Choi, Jun-Young;Lim, Heui-Seok
    • Journal of the Korea Convergence Society
    • /
    • v.11 no.11
    • /
    • pp.33-39
    • /
    • 2020
  • In the field of Natural Language Processing, Various research such as Translation, POS Tagging, Q&A, and Sentiment Analysis are globally being carried out. Sentiment Analysis shows high classification performance for English single-domain datasets by pretrained sentence embedding models. In this thesis, the classification performance is compared by Korean E-commerce online dataset with various domain attributes and 6 Neural-Net models are built as BOW (Bag Of Word), LSTM[1], Attention, CNN[2], ELMo[3], and BERT(KoBERT)[4]. It has been confirmed that the performance of pretrained sentence embedding models are higher than word embedding models. In addition, practical Neural-Net model composition is proposed after comparing classification performance on dataset with 17 categories. Furthermore, the way of compressing sentence embedding model is mentioned as future work, considering inference time against model capacity on real-time service.

Fine-tuning BERT Models for Keyphrase Extraction in Scientific Articles

  • Lim, Yeonsoo;Seo, Deokjin;Jung, Yuchul
    • Journal of Advanced Information Technology and Convergence
    • /
    • v.10 no.1
    • /
    • pp.45-56
    • /
    • 2020
  • Despite extensive research, performance enhancement of keyphrase (KP) extraction remains a challenging problem in modern informatics. Recently, deep learning-based supervised approaches have exhibited state-of-the-art accuracies with respect to this problem, and several of the previously proposed methods utilize Bidirectional Encoder Representations from Transformers (BERT)-based language models. However, few studies have investigated the effective application of BERT-based fine-tuning techniques to the problem of KP extraction. In this paper, we consider the aforementioned problem in the context of scientific articles by investigating the fine-tuning characteristics of two distinct BERT models - BERT (i.e., base BERT model by Google) and SciBERT (i.e., a BERT model trained on scientific text). Three different datasets (WWW, KDD, and Inspec) comprising data obtained from the computer science domain are used to compare the results obtained by fine-tuning BERT and SciBERT in terms of KP extraction.

Improving Abstractive Summarization by Training Masked Out-of-Vocabulary Words

  • Lee, Tae-Seok;Lee, Hyun-Young;Kang, Seung-Shik
    • Journal of Information Processing Systems
    • /
    • v.18 no.3
    • /
    • pp.344-358
    • /
    • 2022
  • Text summarization is the task of producing a shorter version of a long document while accurately preserving the main contents of the original text. Abstractive summarization generates novel words and phrases using a language generation method through text transformation and prior-embedded word information. However, newly coined words or out-of-vocabulary words decrease the performance of automatic summarization because they are not pre-trained in the machine learning process. In this study, we demonstrated an improvement in summarization quality through the contextualized embedding of BERT with out-of-vocabulary masking. In addition, explicitly providing precise pointing and an optional copy instruction along with BERT embedding, we achieved an increased accuracy than the baseline model. The recall-based word-generation metric ROUGE-1 score was 55.11 and the word-order-based ROUGE-L score was 39.65.

A Concept Language Model combining Word Sense Information and BERT (의미 정보와 BERT를 결합한 개념 언어 모델)

  • Lee, Ju-Sang;Ock, Cheol-Young
    • Annual Conference on Human and Language Technology
    • /
    • 2019.10a
    • /
    • pp.3-7
    • /
    • 2019
  • 자연어 표상은 자연어가 가진 정보를 컴퓨터에게 전달하기 위해 표현하는 방법이다. 현재 자연어 표상은 학습을 통해 고정된 벡터로 표현하는 것이 아닌 문맥적 정보에 의해 벡터가 변화한다. 그 중 BERT의 경우 Transformer 모델의 encoder를 사용하여 자연어를 표상하는 기술이다. 하지만 BERT의 경우 학습시간이 많이 걸리며, 대용량의 데이터를 필요로 한다. 본 논문에서는 빠른 자연어 표상 학습을 위해 의미 정보와 BERT를 결합한 개념 언어 모델을 제안한다. 의미 정보로 단어의 품사 정보와, 명사의 의미 계층 정보를 추상적으로 표현했다. 실험을 위해 ETRI에서 공개한 한국어 BERT 모델을 비교 대상으로 하며, 개체명 인식을 학습하여 비교했다. 두 모델의 개체명 인식 결과가 비슷하게 나타났다. 의미 정보가 자연어 표상을 하는데 중요한 정보가 될 수 있음을 확인했다.

  • PDF

Korean Dependency Parsing Using Various Ensemble Models (다양한 앙상블 알고리즘을 이용한 한국어 의존 구문 분석)

  • Jo, Gyeong-Cheol;Kim, Ju-Wan;Kim, Gyun-Yeop;Park, Seong-Jin;Gang, Sang-U
    • Annual Conference on Human and Language Technology
    • /
    • 2019.10a
    • /
    • pp.543-545
    • /
    • 2019
  • 본 논문은 최신 한국어 의존 구문 분석 모델(Korean dependency parsing model)들과 다양한 앙상블 모델(ensemble model)들을 결합하여 그 성능을 분석한다. 단어 표현은 미리 학습된 워드 임베딩 모델(word embedding model)과 ELMo(Embedding from Language Model), Bert(Bidirectional Encoder Representations from Transformer) 그리고 다양한 추가 자질들을 사용한다. 또한 사용된 의존 구문 분석 모델로는 Stack Pointer Network Model, Deep Biaffine Attention Parser와 Left to Right Pointer Parser를 이용한다. 최종적으로 각 모델의 분석 결과를 앙상블 모델인 Bagging 기법과 XGBoost(Extreme Gradient Boosting) 이용하여 최적의 모델을 제안한다.

  • PDF

Document Classification Methodology Using Autoencoder-based Keywords Embedding

  • Seobin Yoon;Namgyu Kim
    • Journal of the Korea Society of Computer and Information
    • /
    • v.28 no.9
    • /
    • pp.35-46
    • /
    • 2023
  • In this study, we propose a Dual Approach methodology to enhance the accuracy of document classifiers by utilizing both contextual and keyword information. Firstly, contextual information is extracted using Google's BERT, a pre-trained language model known for its outstanding performance in various natural language understanding tasks. Specifically, we employ KoBERT, a pre-trained model on the Korean corpus, to extract contextual information in the form of the CLS token. Secondly, keyword information is generated for each document by encoding the set of keywords into a single vector using an Autoencoder. We applied the proposed approach to 40,130 documents related to healthcare and medicine from the National R&D Projects database of the National Science and Technology Information Service (NTIS). The experimental results demonstrate that the proposed methodology outperforms existing methods that rely solely on document or word information in terms of accuracy for document classification.

BERT-Based Logits Ensemble Model for Gender Bias and Hate Speech Detection

  • Sanggeon Yun;Seungshik Kang;Hyeokman Kim
    • Journal of Information Processing Systems
    • /
    • v.19 no.5
    • /
    • pp.641-651
    • /
    • 2023
  • Malicious hate speech and gender bias comments are common in online communities, causing social problems in our society. Gender bias and hate speech detection has been investigated. However, it is difficult because there are diverse ways to express them in words. To solve this problem, we attempted to detect malicious comments in a Korean hate speech dataset constructed in 2020. We explored bidirectional encoder representations from transformers (BERT)-based deep learning models utilizing hyperparameter tuning, data sampling, and logits ensembles with a label distribution. We evaluated our model in Kaggle competitions for gender bias, general bias, and hate speech detection. For gender bias detection, an F1-score of 0.7711 was achieved using an ensemble of the Soongsil-BERT and KcELECTRA models. The general bias task included the gender bias task, and the ensemble model achieved the best F1-score of 0.7166.

Multi-Document Summarization Method of Reviews Using Word Embedding Clustering (워드 임베딩 클러스터링을 활용한 리뷰 다중문서 요약기법)

  • Lee, Pil Won;Hwang, Yun Young;Choi, Jong Seok;Shin, Young Tae
    • KIPS Transactions on Software and Data Engineering
    • /
    • v.10 no.11
    • /
    • pp.535-540
    • /
    • 2021
  • Multi-document refers to a document consisting of various topics, not a single topic, and a typical example is online reviews. There have been several attempts to summarize online reviews because of their vast amounts of information. However, collective summarization of reviews through existing summary models creates a problem of losing the various topics that make up the reviews. Therefore, in this paper, we present method to summarize the review with minimal loss of the topic. The proposed method classify reviews through processes such as preprocessing, importance evaluation, embedding substitution using BERT, and embedding clustering. Furthermore, the classified sentences generate the final summary using the trained Transformer summary model. The performance evaluation of the proposed model was compared by evaluating the existing summary model, seq2seq model, and the cosine similarity with the ROUGE score, and performed a high performance summary compared to the existing summary model.