• Title/Summary/Keyword: N-gram


Korean Sentence Generation Using Phoneme-Level LSTM Language Model (한국어 음소 단위 LSTM 언어모델을 이용한 문장 생성)

  • Ahn, SungMahn;Chung, Yeojin;Lee, Jaejoon;Yang, Jiheon
    • Journal of Intelligence and Information Systems / v.23 no.2 / pp.71-88 / 2017
  • Language models were originally developed for speech recognition and language processing. Given a set of example sentences, a language model predicts the next word or character from sequential input data. N-gram models have been widely used, but they cannot model the correlation between input units efficiently because they are probabilistic models based on the frequency of each unit in the training set. Recently, with the development of deep learning algorithms, recurrent neural network (RNN) and long short-term memory (LSTM) models have been widely used for neural language models (Ahn, 2016; Kim et al., 2016; Lee et al., 2016). These models can reflect dependencies between objects that are entered sequentially into the model (Gers and Schmidhuber, 2001; Mikolov et al., 2010; Sundermeyer et al., 2012). To train a neural language model, texts need to be decomposed into words or morphemes. However, because a training set of sentences generally includes a huge number of words or morphemes, the dictionary is very large, which increases model complexity. In addition, word-level or morpheme-level models can generate only the vocabulary contained in the training set. Furthermore, for morphologically rich languages such as Turkish, Hungarian, Russian, Finnish, or Korean, morpheme analyzers are more likely to make errors in the decomposition process (Lankinen et al., 2016). Therefore, this paper proposes a phoneme-level language model for Korean based on LSTM models. A phoneme, such as a vowel or a consonant, is the smallest unit that makes up Korean text. We construct the language model using three or four LSTM layers. Each model was trained using the stochastic gradient algorithm and more advanced optimization algorithms: Adagrad, RMSprop, Adadelta, Adam, Adamax, and Nadam. A simulation study was conducted on Old Testament texts using the deep learning package Keras based on Theano. After pre-processing, the dataset contained 74 unique characters, including vowels, consonants, and punctuation marks. We then constructed each input vector from 20 consecutive characters and each output from the following 21st character. In total, 1,023,411 input-output pairs were included in the dataset, which we divided into training, validation, and test sets in the proportion 70:15:15. All simulations were conducted on a system equipped with an Intel Xeon CPU (16 cores) and an NVIDIA GeForce GTX 1080 GPU. We compared the loss evaluated on the validation set, the perplexity evaluated on the test set, and the time taken to train each model. All the optimization algorithms except the stochastic gradient algorithm showed similar validation loss and perplexity, and these were clearly superior to those of the stochastic gradient algorithm. The stochastic gradient algorithm also took the longest to train for both the 3- and 4-LSTM-layer models. On average, the 4-LSTM-layer model took 69% longer to train than the 3-LSTM-layer model, yet its validation loss and perplexity did not improve significantly and even became worse under some conditions. On the other hand, when the automatically generated sentences were compared, the 4-LSTM-layer model tended to generate sentences closer to natural language than the 3-LSTM-layer model. Although the completeness of the generated sentences differed slightly between the models, sentence generation performance was quite satisfactory under all simulation conditions: the models generated only legitimate Korean letters, and the use of postpositions and the conjugation of verbs were almost perfect grammatically. The results of this study are expected to be widely used for Korean language processing in the fields of language processing and speech recognition, which are the basis of artificial intelligence systems.
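The data preparation described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, the decomposition uses standard Unicode Hangul syllable arithmetic (the paper's own pre-processing and character set may differ), and the split is assumed to be sequential without shuffling, which the abstract does not specify.

```python
import math

# Standard Unicode layout of precomposed Hangul syllables:
# 19 initial consonants, 21 vowels, 28 finals (index 0 means "no final").
INITIALS = [chr(0x1100 + i) for i in range(19)]
VOWELS = [chr(0x1161 + i) for i in range(21)]
FINALS = [""] + [chr(0x11A8 + i) for i in range(27)]

def to_phonemes(text):
    """Decompose Hangul syllables into jamo; pass other characters through."""
    out = []
    for ch in text:
        code = ord(ch) - 0xAC00
        if 0 <= code < 19 * 21 * 28:
            lead, rest = divmod(code, 21 * 28)
            vowel, tail = divmod(rest, 28)
            out.append(INITIALS[lead])
            out.append(VOWELS[vowel])
            if tail:
                out.append(FINALS[tail])
        else:
            out.append(ch)  # punctuation, spaces, etc.
    return out

def make_windows(seq, window=20):
    """Pair each run of `window` consecutive units with the unit that follows."""
    return [(seq[i:i + window], seq[i + window])
            for i in range(len(seq) - window)]

def split_70_15_15(pairs):
    """Sequential 70:15:15 split (assumption: no shuffling)."""
    n = len(pairs)
    a, b = n * 70 // 100, n * 85 // 100
    return pairs[:a], pairs[a:b], pairs[b:]

def perplexity(target_probs):
    """exp of the mean negative log-probability assigned to the true next units."""
    return math.exp(-sum(math.log(p) for p in target_probs) / len(target_probs))
```

For example, `make_windows(to_phonemes(corpus))` would yield the (20-unit input, 21st-unit target) pairs, where `corpus` stands for whatever text is being modeled; `perplexity` takes the probabilities a trained model assigned to each true target on the test set.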

Differential Diagnosis By Analysis of Pleural Effusion (흉수분석에 의한 질병의 감별진단)

  • Ko, Won-Ki;Lee, Jun-Gu;Jung, Jae-Ho;Park, Mu-Suk;Jeong, Nak-Yeong;Kim, Young-Sam;Yang, Dong-Gyoo;Yoo, Nae-Choon;Ahn, Chul-Min;Kim, Sung-Kyu
    • Tuberculosis and Respiratory Diseases / v.51 no.6 / pp.559-569 / 2001
  • Background: Pleural effusion is one of the most common clinical manifestations of a variety of pulmonary diseases such as malignancy, tuberculosis, and pneumonia. However, there are no useful laboratory tests to determine the specific cause of a pleural effusion. Therefore, we analyzed the various types of pleural effusion and searched for laboratory tests that could differentiate between the diseases, especially between malignant and non-malignant pleural effusion. Methods: Ninety-three patients with a pleural effusion who visited Severance Hospital from January 1998 to August 1999 were enrolled in this study. Ultrasound-guided thoracentesis was performed, and a confirmatory diagnosis was made by Gram stain, bacterial culture, Ziehl-Neelsen stain, mycobacterial culture, pleural biopsy, and cytology. Results: The male to female ratio was 56:37 and the average age was $47.1{\pm}21.8$ years. There were 16 cases of malignant effusion, 12 cases of para-malignant effusion, 36 cases of tuberculosis, 22 cases of para-pneumonic effusion, and 7 cases of transudate. The LDH2 fraction was significantly higher in the para-malignant effusion group than in the para-pneumonic effusion group [$30.6{\pm}6.4\%$ and $20.2{\pm}7.5\%$, respectively (p<0.05)], and both the LDH1 and LDH2 fractions were significantly higher in the para-malignant effusion group than in the tuberculosis group [$16.4{\pm}7.2\%$ vs. $7.6{\pm}4.7\%$, and $30.6{\pm}6.4\%$ vs. $17.6{\pm}6.3\%$, respectively (p<0.05)]. The pleural effusion/serum LDH4 fraction ratio was significantly lower in the malignant effusion group than in the tuberculosis group [$1.5{\pm}0.8$ vs. $2.1{\pm}0.6$, respectively (p<0.05)]. The LDH4 fraction and the pleural effusion/serum LDH4 fraction ratio were significantly lower in the para-malignant effusion group than in the tuberculosis group [$17.0{\pm}5.8\%$ vs. $23.5{\pm}4.6\%$ and $1.3{\pm}0.4$ vs. $2.1{\pm}0.6$, respectively (p<0.05)]. Conclusion: These results suggest that the LDH isoenzyme was the only useful biochemical test for the differential diagnosis of the various diseases. In particular, the most useful test for distinguishing between a para-malignant effusion and a tuberculous effusion was the pleural effusion/serum LDH4 fraction ratio.


Studies on the Natural Distribution and Ecology of Ilex cornuta Lindley et Pax. in Korea (호랑가시나무의 천연분포(天然分布)와 군낙생태(群落生態)에 관한 연구(研究))

  • Lee, Jeong Seok
    • Journal of Korean Society of Forest Science / v.62 no.1 / pp.24-42 / 1983
  • To develop Ilex cornuta, which grows naturally in the southwestern seaside district, as a new ornamental tree, the author sampled I. cornuta growing in four natural communities and trees cultivated in Kwangju city, and investigated its ecology, morphology, and characteristics. The results are summarized as follows. 1) The natural distribution of I. cornuta extends to $35^{\circ}43'$N and $126^{\circ}44'$E in the southwestern part of Korea and to $33^{\circ}20'$N and $126^{\circ}15'$E on Jejoo island. This area meets the following conditions necessary for Ilex cornuta: the annual average temperature is above $12^{\circ}C$, the coldness index is below $-12.7^{\circ}C$, the annual average relative humidity is 75-80%, and the number of snow-covered days is 20-25; it lies within 20 km of the coastline and within 100 m above sea level, mainly at the foot of mountains facing southeast. 2) The vegetation of the I. cornuta community can be divided into an upper layer composed of Pinus thunbergii and P. densiflora, a middle layer of Eurya japonica var. montana, Ilex cornuta, and Vaccinium bracteatum, and ground vegetation composed of Carex lanceolata and Arundinella hirta var. ciliare. The community has high species diversity, which indicates that it is at a developing stage. Although I. cornuta is a species of the southern temperate zone, where coniferous and broad-leaved evergreen trees grow together, it occasionally grows in the subtropical zone. 3) The parent rock is gneiss, rhyolite, etc.; the soil is acidic (about pH 4.5-5.0) and its content of available phosphorus is low. 4) At maturity, height growth averaged $10.48{\pm}0.23$ cm a year and diameter growth 0.43 cm a year, and the annual rings were not clear. The mean leaf number was 11.34. There is a significant positive correlation between twig elongation and leaf number. 5) A one-year-old seedling grows to 10.66 cm in shoot height (max. 18.2 cm, min. 4.0 cm), with a leaf number of 12.1 (max. 18, min) and a basal diameter of 2.24 mm (max. 4.0 mm, min. 1.0 mm), and shows rhythmical growth in the high-temperature period. There were significant positive correlations between stalk height and leaf number, between stalk height and basal diameter, and between leaf number and basal diameter. 6) Flowering ranged from the end of April to the beginning of May; the flower has a tetramerous corolla and is borne in a yellowish-green corymb. It has a bisexual flower and dioecism with a sex ratio of 1:1. 7) The fruit, after fertilization, grows to 0.87 cm long (0.61-1.31 cm) and 0.8 cm wide (0.62-1.05 cm) by the beginning of May. Fruits begin to turn red and continue to ripen until the end of October or the beginning of November, and they remain unfaded until the end of the following May. With a partial change in color to dark brown at the beginning of June, the fruits begin to fall, but some remain even after three years. 8) The seed acquisition ratio is 24.7% by weight, the number of grains per fruit averages 3.9, the seed weight per liter is 114.2 grams, and the average weight of 1,000 seeds is 24.56 grams. 9) Seeds, after complete removal of the sarcocarp, were buried underground at a fixed temperature and humidity; they began to develop roots in October a year later and germinated the following April. Under sunlight or drought, however, the dormant state may continue.
