• Title/Summary/Keyword: Word Input

Search Result 225, Processing Time 0.029 seconds

Text Summarization on Large-scale Vietnamese Datasets

  • Ti-Hon, Nguyen;Thanh-Nghi, Do
    • Journal of information and communication convergence engineering
    • /
    • v.20 no.4
    • /
    • pp.309-316
    • /
    • 2022
  • This investigation is aimed at automatic text summarization on large-scale Vietnamese datasets. Vietnamese articles were collected from newspaper websites and plain text was extracted to build the dataset, that included 1,101,101 documents. Next, a new single-document extractive text summarization model was proposed to evaluate this dataset. In this summary model, the k-means algorithm is used to cluster the sentences of the input document using different text representations, such as BoW (bag-of-words), TF-IDF (term frequency - inverse document frequency), Word2Vec (Word-to-vector), Glove, and FastText. The summary algorithm then uses the trained k-means model to rank the candidate sentences and create a summary with the highest-ranked sentences. The empirical results of the F1-score achieved 51.91% ROUGE-1, 18.77% ROUGE-2 and 29.72% ROUGE-L, compared to 52.33% ROUGE-1, 16.17% ROUGE-2, and 33.09% ROUGE-L performed using a competitive abstractive model. The advantage of the proposed model is that it can perform well with O(n,k,p) = O(n(k+2/p)) + O(nlog2n) + O(np) + O(nk2) + O(k) time complexity.

Psalm Text Generator Comparison Between English and Korean Using LSTM Blocks in a Recurrent Neural Network (순환 신경망에서 LSTM 블록을 사용한 영어와 한국어의 시편 생성기 비교)

  • Snowberger, Aaron Daniel;Lee, Choong Ho
    • Proceedings of the Korean Institute of Information and Commucation Sciences Conference
    • /
    • 2022.10a
    • /
    • pp.269-271
    • /
    • 2022
  • In recent years, RNN networks with LSTM blocks have been used extensively in machine learning tasks that process sequential data. These networks have proven to be particularly good at sequential language processing tasks by being more able to accurately predict the next most likely word in a given sequence than traditional neural networks. This study trained an RNN / LSTM neural network on three different translations of 150 biblical Psalms - in both English and Korean. The resulting model is then fed an input word and a length number from which it automatically generates a new Psalm of the desired length based on the patterns it recognized while training. The results of training the network on both English text and Korean text are compared and discussed.

  • PDF

Design of Downlink Beamforming Transmitter in OFDMA/ TDD system (OFDMA/TDD 시스템의 하향링크 빔형성 송신기 설계)

  • Park Hyeong-Sook;Park Youn-Ok;Kim Cheol-Sung
    • The Journal of Korean Institute of Communications and Information Sciences
    • /
    • v.31 no.5A
    • /
    • pp.493-500
    • /
    • 2006
  • This paper presents the efficient structure and parameter optimization of downlink beamforming transmitter in OFDMA/TDD system. To design downlink beamforming transmitter for multiple transmit antennas, an efficient beamforming structure for multiple users and the choice of word-length of each block are critical in the aspect of its performance and hardware complexity. We propose an efficient beamforming scheme, which stores the weights of subcarriers into memory without user identification at the receiver of base station and calculates the weights for corresponding user in a subcarrier unit of IFFT input at high speed. Also, we obtain the word-length of main data path and other design parameters by fixed-point simulation analysis. The proposed architecture could reduce the memory size proportional to the maximum number of users per frame, and the processing time of an OFDM symbol at the receiver of base station without the need of additional processing time for calculating the weights at the transmitter.

Word Recognition Using VQ and Fuzzy Theory (VQ와 Fuzzy 이론을 이용한 단어인식)

  • Kim, Ja-Ryong;Choi, Kap-Seok
    • The Journal of the Acoustical Society of Korea
    • /
    • v.10 no.4
    • /
    • pp.38-47
    • /
    • 1991
  • The frequency variation among speakers is one of problems in the speech recognition. This paper applies fuzzy theory to solve the variation problem of frequency features. Reference patterns are expressed by fuzzified patterns which are produced by the peak frequency and the peak energy extracted from codebooks which are generated from training words uttered by several speakers, as they should include common features of speech signals. Words are recognized by fuzzy inference which uses the certainty factor between the reference patterns and the test fuzzified patterns which are produced by the peak frequency and the peak energy extracted from the power spectrum of input speech signals. Practically, in computing the certainty factor, to reduce memory capacity and computation requirements we propose a new equation which calculates the improved certainty factor using only the difference between two fuzzy values. As a result of experiments to test this word recognition method by fuzzy interence with Korean digits, it is shown that this word recognition method using the new equation presented in this paper, can solve the variation problem of frequency features and that the memory capacity and computation requirements are reduced.

  • PDF

Development of Autonomous Mobile Robot with Speech Teaching Command Recognition System Based on Hidden Markov Model (HMM을 기반으로 한 자율이동로봇의 음성명령 인식시스템의 개발)

  • Cho, Hyeon-Soo;Park, Min-Gyu;Lee, Hyun-Jeong;Lee, Min-Cheol
    • Journal of Institute of Control, Robotics and Systems
    • /
    • v.13 no.8
    • /
    • pp.726-734
    • /
    • 2007
  • Generally, a mobile robot is moved by original input programs. However, it is very hard for a non-expert to change the program generating the moving path of a mobile robot, because he doesn't know almost the teaching command and operating method for driving the robot. Therefore, the teaching method with speech command for a handicapped person without hands or a non-expert without an expert knowledge to generate the path is required gradually. In this study, for easily teaching the moving path of the autonomous mobile robot, the autonomous mobile robot with the function of speech recognition is developed. The use of human voice as the teaching method provides more convenient user-interface for mobile robot. To implement the teaching function, the designed robot system is composed of three separated control modules, which are speech preprocessing module, DC servo motor control module, and main control module. In this study, we design and implement a speaker dependent isolated word recognition system for creating moving path of an autonomous mobile robot in the unknown environment. The system uses word-level Hidden Markov Models(HMM) for designated command vocabularies to control a mobile robot, and it has postprocessing by neural network according to the condition based on confidence score. As the spectral analysis method, we use a filter-bank analysis model to extract of features of the voice. The proposed word recognition system is tested using 33 Korean words for control of the mobile robot navigation, and we also evaluate the performance of navigation of a mobile robot using only voice command.

Speaker-adaptive Word Recognition Using Mapped Membership Function (사상멤버쉽함수에 의한 화자적응 단어인식)

  • Lee, Ki-Yeong;Choi, Kap-Seok
    • The Journal of the Acoustical Society of Korea
    • /
    • v.11 no.3
    • /
    • pp.40-52
    • /
    • 1992
  • In this paper, we propose the speaker adaptive word recognition method using a mapped membership function, in order to absorb a fluctuation owing to personal difference which is a problem of speaker independent speech recognition. In the training procedure of this method, the mapped membership function is made with the fuzzy theory introducded into a mapped codebook, between an unknown speaker's spectrum pattern and a standard speaker's one. In the recognition procedure, an input pattern of an unknown speaker is reconstructed to the pattern which is adapted to that of a standard speaker by the mapped membership function. To show the validity of this method, word recognition experiments are carried out using 28 DDD area names. The recognition rate of the conventional speaker-adaptive method using a mapped codebook by VQ is 64.9[%], and that made by a fuzzy VQ is 76.2[%]. Throughout the experiment using a mapped membership function, we can achieve 95.4[%] recognition rate. This shows that our proposed method is more excellent in recognition performance. Moreover, this method doesn't need an iterative training procedure to make the mapped membership function, and memory capacity and computation requirements for this method are reduced to 1/30 and 1/500 time of those for the conventional method using a mapped codebook, respectively.

  • PDF

Question Similarity Measurement of Chinese Crop Diseases and Insect Pests Based on Mixed Information Extraction

  • Zhou, Han;Guo, Xuchao;Liu, Chengqi;Tang, Zhan;Lu, Shuhan;Li, Lin
    • KSII Transactions on Internet and Information Systems (TIIS)
    • /
    • v.15 no.11
    • /
    • pp.3991-4010
    • /
    • 2021
  • The Question Similarity Measurement of Chinese Crop Diseases and Insect Pests (QSM-CCD&IP) aims to judge the user's tendency to ask questions regarding input problems. The measurement is the basis of the Agricultural Knowledge Question and Answering (Q & A) system, information retrieval, and other tasks. However, the corpus and measurement methods available in this field have some deficiencies. In addition, error propagation may occur when the word boundary features and local context information are ignored when the general method embeds sentences. Hence, these factors make the task challenging. To solve the above problems and tackle the Question Similarity Measurement task in this work, a corpus on Chinese crop diseases and insect pests(CCDIP), which contains 13 categories, was established. Then, taking the CCDIP as the research object, this study proposes a Chinese agricultural text similarity matching model, namely, the AgrCQS. This model is based on mixed information extraction. Specifically, the hybrid embedding layer can enrich character information and improve the recognition ability of the model on the word boundary. The multi-scale local information can be extracted by multi-core convolutional neural network based on multi-weight (MM-CNN). The self-attention mechanism can enhance the fusion ability of the model on global information. In this research, the performance of the AgrCQS on the CCDIP is verified, and three benchmark datasets, namely, AFQMC, LCQMC, and BQ, are used. The accuracy rates are 93.92%, 74.42%, 86.35%, and 83.05%, respectively, which are higher than that of baseline systems without using any external knowledge. Additionally, the proposed method module can be extracted separately and applied to other models, thus providing reference for related research.

A Comparative study on the Effectiveness of Segmentation Strategies for Korean Word and Sentence Classification tasks (한국어 단어 및 문장 분류 태스크를 위한 분절 전략의 효과성 연구)

  • Kim, Jin-Sung;Kim, Gyeong-min;Son, Jun-young;Park, Jeongbae;Lim, Heui-seok
    • Journal of the Korea Convergence Society
    • /
    • v.12 no.12
    • /
    • pp.39-47
    • /
    • 2021
  • The construction of high-quality input features through effective segmentation is essential for increasing the sentence comprehension of a language model. Improving the quality of them directly affects the performance of the downstream task. This paper comparatively studies the segmentation that effectively reflects the linguistic characteristics of Korean regarding word and sentence classification. The segmentation types are defined in four categories: eojeol, morpheme, syllable and subchar, and pre-training is carried out using the RoBERTa model structure. By dividing tasks into a sentence group and a word group, we analyze the tendency within a group and the difference between the groups. By the model with subchar-level segmentation showing higher performance than other strategies by maximal NSMC: +0.62%, KorNLI: +2.38%, KorSTS: +2.41% in sentence classification, and the model with syllable-level showing higher performance at maximum NER: +0.7%, SRL: +0.61% in word classification, the experimental results confirm the effectiveness of those schemes.

Exploiting Chunking for Dependency Parsing in Korean (한국어에서 의존 구문분석을 위한 구묶음의 활용)

  • Namgoong, Young;Kim, Jae-Hoon
    • KIPS Transactions on Software and Data Engineering
    • /
    • v.11 no.7
    • /
    • pp.291-298
    • /
    • 2022
  • In this paper, we present a method for dependency parsing with chunking in Korean. Dependency parsing is a task of determining a governor of every word in a sentence. In general, we used to determine the syntactic governor in Korean and should transform the syntactic structure into semantic structure for further processing like semantic analysis in natural language processing. There is a notorious problem to determine whether syntactic or semantic governor. For example, the syntactic governor of the word "먹고 (eat)" in the sentence "밥을 먹고 싶다 (would like to eat)" is "싶다 (would like to)", which is an auxiliary verb and therefore can not be a semantic governor. In order to mitigate this somewhat, we propose a Korean dependency parsing after chunking, which is a process of segmenting a sentence into constituents. A constituent is a word or a group of words that function as a single unit within a dependency structure and is called a chunk in this paper. Compared to traditional dependency parsing, there are some advantage of the proposed method: (1) The number of input units in parsing can be reduced and then the parsing speed could be faster. (2) The effectiveness of parsing can be improved by considering the relation between two head words in chunks. Through experiments for Sejong dependency corpus, we have shown that the USA and LAS of the proposed method are 86.48% and 84.56%, respectively and the number of input units is reduced by about 22%p.

Korean Word Sense Disambiguation using Dictionary and Corpus (사전과 말뭉치를 이용한 한국어 단어 중의성 해소)

  • Jeong, Hanjo;Park, Byeonghwa
    • Journal of Intelligence and Information Systems
    • /
    • v.21 no.1
    • /
    • pp.1-13
    • /
    • 2015
  • As opinion mining in big data applications has been highlighted, a lot of research on unstructured data has made. Lots of social media on the Internet generate unstructured or semi-structured data every second and they are often made by natural or human languages we use in daily life. Many words in human languages have multiple meanings or senses. In this result, it is very difficult for computers to extract useful information from these datasets. Traditional web search engines are usually based on keyword search, resulting in incorrect search results which are far from users' intentions. Even though a lot of progress in enhancing the performance of search engines has made over the last years in order to provide users with appropriate results, there is still so much to improve it. Word sense disambiguation can play a very important role in dealing with natural language processing and is considered as one of the most difficult problems in this area. Major approaches to word sense disambiguation can be classified as knowledge-base, supervised corpus-based, and unsupervised corpus-based approaches. This paper presents a method which automatically generates a corpus for word sense disambiguation by taking advantage of examples in existing dictionaries and avoids expensive sense tagging processes. It experiments the effectiveness of the method based on Naïve Bayes Model, which is one of supervised learning algorithms, by using Korean standard unabridged dictionary and Sejong Corpus. Korean standard unabridged dictionary has approximately 57,000 sentences. Sejong Corpus has about 790,000 sentences tagged with part-of-speech and senses all together. For the experiment of this study, Korean standard unabridged dictionary and Sejong Corpus were experimented as a combination and separate entities using cross validation. Only nouns, target subjects in word sense disambiguation, were selected. 93,522 word senses among 265,655 nouns and 56,914 sentences from related proverbs and examples were additionally combined in the corpus. Sejong Corpus was easily merged with Korean standard unabridged dictionary because Sejong Corpus was tagged based on sense indices defined by Korean standard unabridged dictionary. Sense vectors were formed after the merged corpus was created. Terms used in creating sense vectors were added in the named entity dictionary of Korean morphological analyzer. By using the extended named entity dictionary, term vectors were extracted from the input sentences and then term vectors for the sentences were created. Given the extracted term vector and the sense vector model made during the pre-processing stage, the sense-tagged terms were determined by the vector space model based word sense disambiguation. In addition, this study shows the effectiveness of merged corpus from examples in Korean standard unabridged dictionary and Sejong Corpus. The experiment shows the better results in precision and recall are found with the merged corpus. This study suggests it can practically enhance the performance of internet search engines and help us to understand more accurate meaning of a sentence in natural language processing pertinent to search engines, opinion mining, and text mining. Naïve Bayes classifier used in this study represents a supervised learning algorithm and uses Bayes theorem. Naïve Bayes classifier has an assumption that all senses are independent. Even though the assumption of Naïve Bayes classifier is not realistic and ignores the correlation between attributes, Naïve Bayes classifier is widely used because of its simplicity and in practice it is known to be very effective in many applications such as text classification and medical diagnosis. However, further research need to be carried out to consider all possible combinations and/or partial combinations of all senses in a sentence. Also, the effectiveness of word sense disambiguation may be improved if rhetorical structures or morphological dependencies between words are analyzed through syntactic analysis.