• Title/Summary/Keyword: Force sense


Korean Sentence Generation Using Phoneme-Level LSTM Language Model (한국어 음소 단위 LSTM 언어모델을 이용한 문장 생성)

  • Ahn, SungMahn; Chung, Yeojin; Lee, Jaejoon; Yang, Jiheon
    • Journal of Intelligence and Information Systems / v.23 no.2 / pp.71-88 / 2017
  • Language models were originally developed for speech recognition and language processing. Given a set of example sentences, a language model predicts the next word or character from sequential input data. N-gram models have been widely used, but they cannot model correlations between input units efficiently because they are probabilistic models based on the frequency of each unit in the training set. Recently, with the development of deep learning, recurrent neural network (RNN) and long short-term memory (LSTM) models have been widely used as neural language models (Ahn, 2016; Kim et al., 2016; Lee et al., 2016). These models can capture dependencies between objects entered sequentially into the model (Gers and Schmidhuber, 2001; Mikolov et al., 2010; Sundermeyer et al., 2012). To train a neural language model, texts need to be decomposed into words or morphemes. However, because a training set of sentences generally contains a huge number of words or morphemes, the dictionary becomes very large, which increases model complexity. In addition, word-level or morpheme-level models can generate only the vocabulary contained in the training set. Furthermore, for highly morphological languages such as Turkish, Hungarian, Russian, Finnish, or Korean, morpheme analyzers are more likely to introduce errors during decomposition (Lankinen et al., 2016). Therefore, this paper proposes a phoneme-level language model for Korean based on LSTM models. A phoneme, such as a vowel or a consonant, is the smallest unit of Korean text. We construct the language model using three or four LSTM layers. Each model was trained with the stochastic gradient algorithm as well as more advanced optimization algorithms such as Adagrad, RMSprop, Adadelta, Adam, Adamax, and Nadam. A simulation study was conducted on Old Testament texts using the deep learning package Keras on the Theano backend. After pre-processing, the dataset contained 74 unique characters, including vowels, consonants, and punctuation marks. We then constructed input vectors from 20 consecutive characters, with the following (21st) character as the output. In total, 1,023,411 input-output pairs were included in the dataset, which we divided into training, validation, and test sets in a 70:15:15 ratio. All simulations were conducted on a system equipped with an Intel Xeon CPU (16 cores) and an NVIDIA GeForce GTX 1080 GPU. We compared the loss evaluated on the validation set, the perplexity evaluated on the test set, and the training time of each model. As a result, all optimization algorithms except the stochastic gradient algorithm showed similar validation loss and perplexity, clearly superior to those of the stochastic gradient algorithm. The stochastic gradient algorithm also took the longest to train for both the 3- and 4-layer LSTM models. On average, the 4-layer LSTM model took 69% longer to train than the 3-layer model, yet its validation loss and perplexity did not improve significantly and even worsened under some conditions. On the other hand, when comparing the automatically generated sentences, the 4-layer model tended to produce sentences closer to natural language than the 3-layer model.
Although the completeness of the generated sentences differed slightly between the models, sentence generation performance was quite satisfactory under all simulation conditions: the models produced only legitimate Korean letters, and the use of postpositions and the conjugation of verbs were almost grammatically perfect. The results of this study are expected to be widely applied to Korean-language processing and speech recognition, which are foundations of artificial intelligence systems.
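The experimental setup described in this abstract can be illustrated with a short sketch. The code below is a minimal illustration rather than the authors' implementation: it shows the phoneme-level windowing (20 input characters, the 21st as the target), a stacked LSTM with a softmax over the 74-symbol vocabulary, and compilation with each optimizer the study compares. The hidden width, batch size, and epoch count are assumed values not reported in the abstract, and the sketch is written against the Keras Sequential API (the original experiments ran Keras on the Theano backend).

import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, Dense
from keras.utils import to_categorical

SEQ_LEN = 20    # 20 consecutive phoneme-level characters per input window
VOCAB = 74      # unique characters after pre-processing (vowels, consonants, punctuation)
HIDDEN = 256    # assumed LSTM width; not reported in the abstract

def make_windows(encoded):
    """Slide a window over the phoneme-index sequence: 20 characters in, the 21st as target."""
    X = np.array([encoded[i:i + SEQ_LEN] for i in range(len(encoded) - SEQ_LEN)])
    y = np.array([encoded[i + SEQ_LEN] for i in range(len(encoded) - SEQ_LEN)])
    # One-hot encode: X becomes (N, 20, 74), y becomes (N, 74)
    return to_categorical(X, VOCAB), to_categorical(y, VOCAB)

def build_model(n_layers=3, optimizer="adam"):
    """Stack 3 or 4 LSTM layers and predict the next phoneme with a softmax."""
    model = Sequential()
    model.add(LSTM(HIDDEN, return_sequences=True, input_shape=(SEQ_LEN, VOCAB)))
    for _ in range(n_layers - 2):
        model.add(LSTM(HIDDEN, return_sequences=True))
    model.add(LSTM(HIDDEN))                       # last LSTM layer returns only its final state
    model.add(Dense(VOCAB, activation="softmax"))
    model.compile(loss="categorical_crossentropy", optimizer=optimizer)
    return model

# The study compares plain SGD against adaptive optimizers on 3- and 4-layer models.
for opt in ["sgd", "adagrad", "rmsprop", "adadelta", "adam", "adamax", "nadam"]:
    for n_layers in (3, 4):
        model = build_model(n_layers=n_layers, optimizer=opt)
        # model.fit(X_train, y_train, validation_data=(X_val, y_val), batch_size=128, epochs=20)
        # test_loss = model.evaluate(X_test, y_test, verbose=0)
        # perplexity = np.exp(test_loss)   # exp of mean cross-entropy on the test set

Character-level perplexity, as compared in the abstract, is typically computed as the exponential of the mean cross-entropy loss on the test set, as indicated in the final commented lines.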

Characteristic on the Layout and Semantic Interpretation of Chungryu-Gugok, Dongaksan Mountain, Gokseong (곡성 동악산 청류구곡(淸流九曲)의 형태 및 의미론적 특성)

  • Rho, Jae-Hyun; Shin, Sang-Sup; Huh, Joon; Lee, Jung-Han; Han, Sang-Yub
    • Journal of the Korean Institute of Traditional Landscape Architecture / v.32 no.4 / pp.24-36 / 2014
  • This research investigated the layout and semantic value of Cheongryu Gugok in Dorimsa Valley, which exhibits a high level of completeness and scenic preservation value among the three gugoks distributed around Mt. Dongak in Gokseong; the results are as follows. The area around Cheongryu Gugok is a case in which gugok culture, enjoyed as a model of Neo-Confucian culture, was realized in bedrock scenery such as waterfalls, riversides, ponds, and flatlands along the beautiful valley. It is an outstanding scenic site, recorded on an 1872 local map of Gokseong-hyeon as "Samnam Jeil Amban Gyeryu Cheongryu-dong (三南第一巖盤溪流 淸流洞: Cheongryu-dong, the finest rock-bed stream in the Samnam area)." Cheongryu Gugok, distinguished by seasonal scenery and epigrams set along both a land route and a waterway, was probably established under the lead of Sun-tae Jeong (丁舜泰, ?~1916) and Byeong-sun Cho (曺秉順, 1876~1921) before 1916, during the Japanese colonial period. However, given that a number of Janggugiso of ancient sages, including political activists, Buddhist leaders, and Neo-Confucian scholars, had been established there, the site is presumed to have long been used as a hermitage and scenic spot visited by masters. Cheongryu Gugok, formed on the bedrock floor of Dorimsa Valley, runs along a mountain stream with a total length of 1.2 km and an average gok (曲) length of 149 m, which is shorter than other gugoks in Korea. The rock inscriptions of the three gugoks on Mt. Dongak, including Cheongryu Gugok, the only gugok verified in the Jeonnam area, total 165, believed to be the largest concentration of rock inscriptions in the nation. In particular, an analysis of the 112 rock inscriptions in Cheongryu Gugok showed 49 pieces (43.8%) carrying epigrams on 'moral training', 21 pieces (18.8%) on human life, 16 pieces (14.2%) on seasonal scenery, and 12 pieces (10.6%) on Janggugiso such as Jangguchur, while poem verses accounted for six cases (3.6%). Sweyeonmun (鎖烟門), the first gok of the land route, and Jesiinganbyeolyucheon (除是人間別有天), the ninth gok of the waterway, correspond to Hongdanyeonse (虹斷烟鎖), the first gok, and Jesiinganbyeolyucheon, the ninth gok, established in Jaecheon, Chungbuk by Se-hwa Park (朴世和, 1834~1910), which suggests that the gugok names share the same origin. In addition, Daeeunbyeong (大隱屛), the sixth gok of the land route, corresponds to the seventh gok of Chu Hsi's Wuyi Gugok, which is acknowledged as the origin of Gugok Wollim, and the rock inscriptions and stonework of 'Amseojae (巖棲齋)' and 'Pogyeongjae (抱經齋)' between the seventh and eighth gok are traces comparable to Wuyi Jeongsa (武夷精舍), located below Eunbyeon-bong of Wuyi Gugok, and are understood to mark the activity base of the Giho Sarim (畿湖士林) in Cheongryu-dong. The rock inscriptions in the Mt. Dongak area, including famous sayings by masters such as Sunsaeuhje (鮮史御帝, Emperor Gojong), Bogahyowoo (保家孝友, Emperor Gojong), Manchunmungywol (萬川明月, King Jeongjo), Biryeobudong (非禮不動, the Chongzhen Emperor of the Ming Dynasty), Samusa (思無邪, Euijong of the Ming Dynasty), Baksechungpwoong (百世淸風, Chu Hsi), and Chungryususuk-Dongakpungkyung (淸流水石 動樂風景, Heungseon Daewongun), can therefore be seen as a repository of symbolic cultural scenery rather than a mere expression of Confucian aesthetics.
In addition, Cheongryu Gugok is notable as a cluster of cultural scenery of the three religions of Confucianism, Buddhism, and Taoism, where the Confucian value system and Buddhist and Taoist concepts coexist for mind training and cultivation. Cheongryu Gugok also carries semantic and spatial significance as a base for the historical and cultural struggle of the anti-Japanese spirit conceived in the course of its establishment and use, through the spirit of learning, loyalty to the Emperor and expulsion of barbarians, and the inspiration of anti-Japanese forces, inheriting the Neo-Confucian sense of Dotong (道統) held by the Confucian scholar class at the end of the Joseon era, represented by Ik-hyun Choi (崔益鉉, 1833~1906), Woo Jeon (田愚, 1841~1922), Woo-man Gi (奇宇萬, 1846~1916), Byung-sun Song (宋秉璿, 1836~1905), and Hyeon Hwang (黃玹, 1855~1910).