• Title/Summary/Keyword: Unknown Words

Search results: 69

Improving Abstractive Summarization by Training Masked Out-of-Vocabulary Words

  • Lee, Tae-Seok;Lee, Hyun-Young;Kang, Seung-Shik
    • Journal of Information Processing Systems / v.18 no.3 / pp.344-358 / 2022
  • Text summarization is the task of producing a shorter version of a long document while accurately preserving the main contents of the original text. Abstractive summarization generates novel words and phrases using a language generation method through text transformation and prior-embedded word information. However, newly coined words or out-of-vocabulary words decrease the performance of automatic summarization because they are not pre-trained in the machine learning process. In this study, we demonstrated an improvement in summarization quality through the contextualized embedding of BERT with out-of-vocabulary masking. In addition, by explicitly providing precise pointing and an optional copy instruction along with the BERT embedding, we achieved higher accuracy than the baseline model. The recall-based word-generation metric ROUGE-1 score was 55.11 and the word-order-based ROUGE-L score was 39.65.
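
The out-of-vocabulary masking mentioned in this abstract can be illustrated with a minimal sketch. It assumes a plain whitespace-tokenized input and a toy vocabulary; the function name `mask_oov`, the vocabulary, and the sample sentence are hypothetical and are not taken from the paper, whose actual pipeline (BERT embedding, pointing, and copy instruction) is not reproduced here.

```python
from typing import Iterable, List, Set

# "[MASK]" is the mask token conventionally used by BERT-style models.
MASK_TOKEN = "[MASK]"


def mask_oov(tokens: Iterable[str], vocab: Set[str],
             mask_token: str = MASK_TOKEN) -> List[str]:
    """Replace every token that is missing from `vocab` with `mask_token`."""
    return [tok if tok in vocab else mask_token for tok in tokens]


# Hypothetical toy vocabulary and sentence, for illustration only.
vocab = {"the", "model", "summarizes", "news"}
tokens = ["the", "model", "summarizes", "metaverse", "news"]

masked = mask_oov(tokens, vocab)
print(masked)
```

In this sketch only the position of the unknown word is kept; a contextual encoder would then have to represent it from the surrounding tokens.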

Automatically Extracting Unknown Translations Using Phrase Alignment (정렬기법을 이용한 미등록 대역어의 자동 추출)

  • Kim, Jae-Hoon;Yang, Sung-Il
    • The KIPS Transactions:PartB / v.14B no.3 s.113 / pp.231-240 / 2007
  • In this paper, we propose an automatic extraction model for unknown translations and implement an unknown-translation extraction system based on it. The proposed phrase-alignment model incorporates three models: a phrase-boundary model, a language model, and a translation model. The system consists of three parts: construction of parallel corpora, alignment of Korean and English words, and extraction of unknown translations. To evaluate the system, we established a reference corpus for extracting unknown translations, which comprises 2,220 parallel sentences including about 1,500 unknown translations. Through several experiments, we observed that the proposed model is very useful for extracting unknown translations. In the future, research on objective evaluation and on building high-quality parallel corpora should be carried out, and work on improving the performance of unknown-translation extraction should continue.

A Methodology for Urdu Word Segmentation using Ligature and Word Probabilities

  • Khan, Yunus;Nagar, Chetan;Kaushal, Devendra S.
    • International Journal of Ocean System Engineering / v.2 no.1 / pp.24-31 / 2012
  • This paper introduces a technique for word segmentation for the handwriting recognition of Urdu script. Word segmentation, or word tokenization, is a primary technique for understanding sentences written in the Urdu language. Several techniques are available for word segmentation in other languages, but little work has been done on word segmentation for Urdu optical character recognition (OCR) systems. The method proposed in this paper finds the boundaries of words in a sequence of ligatures using probabilistic formulas, utilizing knowledge of the collocation of ligatures and words in the corpus. The word identification rate using this technique is 97.10%, with a 66.63% identification rate for unknown words.
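
Probabilistic boundary detection over a ligature sequence can be sketched as a dynamic program. The version below is a simplified stand-in, not the paper's formulas: it assumes a unigram word-probability table keyed by ligature tuples, lets any single unseen ligature stand alone with a small fallback probability, and uses Latin letters as placeholders for Urdu ligatures. The names `segment`, `word_prob`, `max_len`, and `unk_prob` are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Word = Tuple[str, ...]


def segment(ligatures: List[str], word_prob: Dict[Word, float],
            max_len: int = 4, unk_prob: float = 1e-6) -> List[Word]:
    """Return the word sequence with the highest product of unigram probabilities."""
    n = len(ligatures)
    best = [-math.inf] * (n + 1)  # best log-probability of a segmentation of the first i ligatures
    back = [0] * (n + 1)          # start index of the last word in that segmentation
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            cand = tuple(ligatures[start:end])
            p = word_prob.get(cand)
            if p is None:
                if end - start > 1:
                    continue      # unseen multi-ligature candidates are skipped
                p = unk_prob      # a single unseen ligature may stand alone
            score = best[start] + math.log(p)
            if score > best[end]:
                best[end], back[end] = score, start
    # Follow the back-pointers from the end to recover the word boundaries.
    words: List[Word] = []
    end = n
    while end > 0:
        words.append(tuple(ligatures[back[end]:end]))
        end = back[end]
    words.reverse()
    return words


# Hypothetical probabilities; the letters are placeholders for ligatures.
word_prob = {("a", "b"): 0.4, ("c", "d"): 0.1, ("c",): 0.3,
             ("d",): 0.2, ("a",): 0.05, ("b",): 0.05}
print(segment(["a", "b", "c", "d"], word_prob))
```

A fuller model along the lines the abstract describes would also score collocations between adjacent ligatures and words rather than treating each word independently.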

The Method of Color Image Processing Using Adaptive Saturation Enhancement Algorithm (적응형 채도 향상 알고리즘을 이용한 컬러 영상 처리 기법)

  • Yang, Kyoung-Ok;Yun, Jong-Ho;Cho, Hwa-Hyun;Choi, Myung-Ryul
    • The KIPS Transactions:PartB / v.14B no.3 s.113 / pp.145-152 / 2007

Relationship between Vocabulary and Design in Design Ideation Process -Focusing on Avant-garde Fashion Design- (디자인 발상 과정에 나타난 어휘와 디자인의 연관성 연구 -아방가르드 패션디자인을 중심으로-)

  • Kim, Yoon Kyoung
    • Journal of the Korean Society of Clothing and Textiles / v.45 no.4 / pp.727-739 / 2021
  • The purpose of this study is to present an objective semantic evaluation scale for avant-garde design. Apparel majors were asked to express associative vocabulary, design development, and final design intentions for the avant-garde, and the final 70 copies were used for analysis. The results showed that the most frequent item styles were, in order, the dress, the coat, and the combination of shirt and pants. In order of frequency, the silhouettes appeared as atypical, complex, square, and triangular; the decorations as feathers, frills, and round sculptures; and the idea methods as the extreme, association, and removal methods. In examining the relations between associative words and idea designs, the dress was related to associative words such as 'peculiar,' 'futuristic,' 'fancy,' 'Comme des Garcons,' and 'deconstruction.' As for the relationship between the idea design and the expression image vocabulary, 'one piece' recalled 'huge,' 'volume,' 'abundant,' 'peculiar,' and 'unknown,' while 'coat' recalled 'huge,' 'big silhouette,' and 'padding.' Applying the word cloud technique to the overall designs showed that the central keywords were 'huge,' 'big silhouette,' 'unbalance,' 'feather,' 'structural,' 'unknown,' and 'frill,' in that order.

Disease-Related Vocabulary and its translingual practice in Late 19th to Early 20th century (19세기 말 20세기 초 질병 어휘와 언어횡단적 실천)

  • Lee, Eunryoung
    • Journal of Sasang Constitutional Medicine / v.31 no.1 / pp.65-78 / 2019
  • Objectives: This study investigates how Korean disease-related vocabulary was established or changed when it was translated into French or English. Through this, we examine changes in the meaning of diseases and the ecosystem of disease-related vocabulary in the transition period from the 19th to the 20th century. Methods: Korean disease-related vocabulary was extracted from a total of 148,000 Korean headwords included in our corpus of three bilingual dictionaries. The scope of analysis is limited to the group of vocabulary items that include the high-frequency words disease (病) and symptom (症). Results: The first type of change is the emergence of neologisms; in this case, existing vocabulary and new words are observed to coexist. The second change is the appearance of loanwords written in Hangul. The third is a change in the interpretation of meaning while the word form is maintained. Finally, the fourth change is the appearance of orthographic variants while the meaning of the existing vocabulary is maintained. Discussion: Disease-related vocabulary increased greatly between 1897 and 1931. The factors behind this increase were the emergence of coined words and compound words and the influx of foreign words. The Korean language and the Western languages created new lexical forms in order to introduce new, unknown concepts to Koreans. We could also confirm that English words expanded their semantic fields by modifying the way the meaning of Korean disease-related vocabulary was represented.

Real Time Recognition of Unknown Words based on the Analysis of Similar Words with an Extended Definition (확장 정의된 유사어절의 분석에 근거한 실시간 미등록어 인식)

  • Park, Bong-Rae;Hwang, Young-Sook;Rim, Hae-Chang
    • Annual Conference on Human and Language Technology / 1996.10a / pp.222-228 / 1996
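  • Most existing methods for estimating unknown words take a single-eojeol approach; because too little information can be extracted from a single eojeol, they are prone to over-analysis and misanalysis. A similar-eojeol approach was therefore proposed, which analyzes eojeols containing the same unknown word together. However, because this method defines similar eojeols as those that differ only in their particles or endings, the number of similar eojeols that can be collected is limited, and the method is applicable only to large texts. This paper extends the definition of similar eojeols to eojeols that share the same syllable sequence, which raises the likelihood of finding similar eojeols within a small text window of size N and thereby makes it possible to estimate unknown words in real time. In an experiment with N set to 100, the precision of unknown-word estimation was 99.3% and the recall was about 32%.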
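
The extended similar-eojeol idea can be sketched roughly as follows. This is an illustrative simplification, not the paper's algorithm: it assumes the shared syllable sequence is a common prefix, treats each precomposed Hangul character as one syllable, and keeps the last N eojeols in a bounded deque. The names `common_prefix`, `guess_unknown_stem`, and `min_len`, and the sample text, are hypothetical.

```python
from collections import deque
from typing import Iterable, Optional


def common_prefix(a: str, b: str) -> str:
    """Longest run of leading syllables (characters) shared by two eojeols."""
    i = 0
    for x, y in zip(a, b):
        if x != y:
            break
        i += 1
    return a[:i]


def guess_unknown_stem(target: str, window: Iterable[str],
                       min_len: int = 2) -> Optional[str]:
    """Return the longest prefix of `target` shared with another eojeol in the window."""
    best = ""
    for other in window:
        if other == target:
            continue
        prefix = common_prefix(target, other)
        if len(prefix) >= min_len and len(prefix) > len(best):
            best = prefix
    return best or None


# Hypothetical sample text; the window keeps at most the last N = 100 eojeols.
text = "메타버스가 새로운 시장을 만들고 있다 기업들은 메타버스를 주목한다".split()
window = deque(text, maxlen=100)

print(guess_unknown_stem("메타버스가", window))
```

Here the two eojeols that differ only in their final particle share the prefix 메타버스, which becomes the candidate unknown word; an eojeol with no sufficiently long shared prefix in the window yields no candidate.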

Practical Development and Application of a Korean Morphological Analyzer for Automatic Indexing (자동 색인을 위한 한국어 형태소 분석기의 실제적인 구현 및 적용)

  • Choi, Sung-Pil;Seo, Jerry;Chae, Young-Suk
    • The KIPS Transactions:PartB / v.9B no.5 / pp.689-700 / 2002
  • In this paper, we developed a Korean morphological analyzer for automatic indexing, which is essential for information retrieval. Because it is important to index large-scale document sets efficiently, we concentrated on maximizing the speed of word analysis and on the modularization and structuring of the system, rather than on introducing new concepts or ideas. In this respect, our system is characterized by its software engineering aspects, aimed at real-world use rather than theoretical issues. First, a dictionary of words was structured. Then, modules that analyze substantive words and inflected words were introduced. Furthermore, a numeral analyzer was developed, and we introduced an unknown-word analyzer that uses morpheme patterns. The whole system was integrated into K-2000, an information retrieval system.

Relevant Image Retrieval of Korean Documents based on Sentence and Word Importance (문장 및 단어 중요도를 통한 한국어 문서 연관 이미지 검색)

  • Kim, Nam-Gyu;Kang, Shin-Jae
    • Journal of the Korea Academia-Industrial cooperation Society / v.20 no.3 / pp.43-48 / 2019
  • When readers of text-only documents encounter unknown words, they lose focus and cannot understand the content of the documents. Children in particular, having little experience, find it difficult to understand a text correctly if the description in context is unfamiliar or ambiguous. In this paper, to help readers understand the text and to increase their interest, we analyze the text of a document, select the content considered important, and implement a system that automatically retrieves the most relevant images from the web and links the text and the images together. The system divides the article into paragraphs, analyzes the text, selects important sentences for each paragraph and the important words that best represent the meaning of those sentences, searches the web for images related to the words, and then links the images to the corresponding paragraphs. The experiments examined how to select important sentences and how to select important words within those sentences. Evaluating the relevance between three selected images and the corresponding important sentences, we obtained an accuracy of 60%.

Unmanned Aerial Vehicle Recovery Using a Simultaneous Localization and Mapping Algorithm without the Aid of Global Positioning System

  • Lee, Chang-Hun;Tahk, Min-Jea
    • International Journal of Aeronautical and Space Sciences / v.11 no.2 / pp.98-109 / 2010
  • This paper deals with a new method of unmanned aerial vehicle (UAV) recovery for cases in which a UAV fails to get a global positioning system (GPS) signal at an unprepared site. The proposed method is based on the simultaneous localization and mapping (SLAM) algorithm, a process by which a vehicle can build a map of an unknown environment and simultaneously use this map to determine its position. Extensive research on SLAM algorithms proves that the error in the map reaches a lower limit, which is a function of the error that existed when the first observation was made. For this reason, the proposed method can help an inertial navigation system prevent its vehicle-position error from diverging. In other words, a UAV can navigate with reasonable positional accuracy in an unknown environment without the aid of GPS. This is the main idea of the present paper. In particular, this paper focuses on path planning that maximizes this ability of a SLAM algorithm. In this work, a SLAM algorithm based on an extended Kalman filter is used. For simplicity's sake, a blimp-type UAV model is discussed and three-dimensional pointed-shape landmarks are considered. Finally, the proposed method is evaluated through a number of simulations.