• Title/Summary/Keyword: Generate Data


Korean Sentence Generation Using Phoneme-Level LSTM Language Model (한국어 음소 단위 LSTM 언어모델을 이용한 문장 생성)

  • Ahn, SungMahn;Chung, Yeojin;Lee, Jaejoon;Yang, Jiheon
    • Journal of Intelligence and Information Systems / v.23 no.2 / pp.71-88 / 2017
  • Language models were originally developed for speech recognition and language processing. Given a set of example sentences, a language model predicts the next word or character from sequential input data. N-gram models have been widely used, but they cannot model correlations between input units efficiently because they are probabilistic models based on the frequency of each unit in the training set. Recently, with the development of deep learning, recurrent neural network (RNN) and long short-term memory (LSTM) models have been widely used as neural language models (Ahn, 2016; Kim et al., 2016; Lee et al., 2016). These models can reflect dependencies between the objects entered sequentially into the model (Gers and Schmidhuber, 2001; Mikolov et al., 2010; Sundermeyer et al., 2012). To train a neural language model, texts must be decomposed into words or morphemes. However, because a training set of sentences generally includes a huge number of words or morphemes, the dictionary becomes very large and model complexity increases. In addition, word-level or morpheme-level models can only generate vocabulary contained in the training set. Furthermore, for highly morphological languages such as Turkish, Hungarian, Russian, Finnish, or Korean, morpheme analyzers are more likely to introduce errors during decomposition (Lankinen et al., 2016). Therefore, this paper proposes a phoneme-level language model for Korean based on LSTM models. A phoneme, such as a vowel or a consonant, is the smallest unit that makes up Korean text. We constructed language models using three or four LSTM layers. Each model was trained with the stochastic gradient algorithm and with more advanced optimization algorithms: Adagrad, RMSprop, Adadelta, Adam, Adamax, and Nadam. A simulation study was performed on Old Testament texts using the deep learning package Keras on a Theano backend. After pre-processing, the dataset contained 74 unique characters, including vowels, consonants, and punctuation marks. We then constructed input vectors of 20 consecutive characters, with the following (21st) character as the output. In total, 1,023,411 input-output pairs were included in the dataset, divided into training, validation, and test sets in a 70:15:15 ratio. All simulations were conducted on a system equipped with an Intel Xeon CPU (16 cores) and an NVIDIA GeForce GTX 1080 GPU. We compared the loss evaluated on the validation set, the perplexity evaluated on the test set, and the training time of each model. All optimization algorithms except the stochastic gradient algorithm showed similar validation loss and perplexity, clearly superior to those of the stochastic gradient algorithm, which also required the longest training time for both the 3- and 4-LSTM models. On average, the 4-LSTM model took 69% longer to train than the 3-LSTM model, yet its validation loss and perplexity did not improve significantly and even worsened under some conditions. On the other hand, when comparing the automatically generated sentences, the 4-LSTM model tended to produce sentences closer to natural language than the 3-LSTM model. Although the completeness of the generated sentences differed slightly between the models, sentence generation performance was quite satisfactory under all simulation conditions: the models generated only legitimate Korean letters, and the use of postpositions and the conjugation of verbs were almost perfectly grammatical. The results of this study are expected to be widely used in Korean language processing and speech recognition, which are foundations of artificial intelligence systems.
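
To make the architecture concrete, here is a minimal sketch of the kind of model described: a 3-layer LSTM over one-hot windows of 20 characters drawn from a 74-symbol vocabulary. The hidden width (256) is an assumption not stated in the abstract, and the sketch uses the TensorFlow Keras API rather than the Keras-on-Theano setup the paper reports.

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

SEQ_LEN, VOCAB = 20, 74   # 20-character input window, 74 unique symbols (from the paper)
HIDDEN = 256              # hidden width is an assumption; not reported in the abstract

model = Sequential([
    LSTM(HIDDEN, return_sequences=True, input_shape=(SEQ_LEN, VOCAB)),
    LSTM(HIDDEN, return_sequences=True),
    LSTM(HIDDEN),                        # the 4-LSTM variant adds one more layer here
    Dense(VOCAB, activation="softmax"),  # predicts the 21st character
])

# The paper compares SGD, Adagrad, RMSprop, Adadelta, Adam, Adamax, and Nadam;
# swapping the optimizer string reproduces each experimental condition.
model.compile(loss="categorical_crossentropy", optimizer="adam")

# Perplexity is the exponential of the mean cross-entropy loss:
#   perplexity = np.exp(validation_loss)
```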

Multi-Dimensional Analysis Method of Product Reviews for Market Insight (마켓 인사이트를 위한 상품 리뷰의 다차원 분석 방안)

  • Park, Jeong Hyun;Lee, Seo Ho;Lim, Gyu Jin;Yeo, Un Yeong;Kim, Jong Woo
    • Journal of Intelligence and Information Systems / v.26 no.2 / pp.57-78 / 2020
  • With the development of the Internet, consumers can easily check product information through e-commerce. Product reviews used in the purchasing process are based on user experience, allowing consumers to act as producers of information as well as consumers of it. Reviews can increase the efficiency of purchasing decisions for consumers, and from the seller's point of view they can inform product development and strengthen competitiveness. However, it takes much time and effort to grasp the overall assessment, along the dimensions a consumer considers important, from the vast number of reviews that e-commerce sites offer for the products being compared. This is because product reviews are unstructured information, and their sentiment and assessment dimensions cannot be read off immediately. For example, a consumer who wants to purchase a laptop would like to check the assessment of comparable products along each dimension, such as performance, weight, delivery, speed, and design. Therefore, this paper proposes a method for automatically generating multi-dimensional assessment scores from the reviews of products to be compared. The proposed method consists of two phases: a pre-preparation phase and an individual product scoring phase. In the pre-preparation phase, a dimension classification model and a sentiment analysis model are built from the reviews of a large product category. By combining word embedding with association analysis, the dimension classification model compensates for a limitation of earlier word-embedding approaches to relating dimensions and words, which considered only the distance between words within sentences. The sentiment analysis model is a CNN trained on phrase-level data tagged as positive or negative, for accurate polarity detection. In the individual product scoring phase, these pre-built models are applied to phrase-level reviews: phrases judged to describe a specific dimension are grouped, and multi-dimensional assessment scores are obtained by aggregating sentiment by assessment dimension in proportion to the reviews. In the experiments, approximately 260,000 reviews of the large product category were collected to build the dimension classification and sentiment analysis models, and reviews of laptops from companies S and L sold on e-commerce sites were collected as experimental data. The dimension classification model classified individual product reviews, broken down into phrases, into six assessment dimensions, combining the existing word embedding method with an association analysis of the frequency between words and dimensions. Combining word embedding with association analysis increased the accuracy of the model by 13.7%. The sentiment analysis model analyzed assessments more closely when trained at the phrase level rather than the sentence level; its accuracy was 29.4% higher than that of the sentence-based model. Through this study, both sellers and consumers can expect more efficient decision-making in purchasing and product development, given that products can be compared multi-dimensionally. In addition, text reviews, which are unstructured data, were transformed into objective values such as frequencies and morphemes and analyzed with word embedding and association analysis, improving the objectivity of the multi-dimensional analysis. This should be an attractive analysis model for deploying more effective services in the evolving and fiercely competitive e-commerce market while satisfying both sellers and customers.
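
As an illustration of combining word-embedding similarity with association analysis for dimension classification, here is a hedged sketch. The seed words, the lift formula, and the blending weight are assumptions for illustration; the paper does not publish its exact scoring rule.

```python
import numpy as np
from gensim.models import Word2Vec

# Toy phrase-level corpus; in practice these would be tokenized review phrases.
phrases = [["battery", "lasts", "long"], ["screen", "is", "bright"],
           ["battery", "drains", "fast"], ["bright", "vivid", "screen"]] * 50

w2v = Word2Vec(phrases, vector_size=50, window=3, min_count=1, seed=1)

# Hypothetical seed words per assessment dimension (not from the paper)
dim_seeds = {"performance": ["battery"], "design": ["screen"]}

def embed_score(word, seeds):
    # average embedding similarity between the word and the dimension's seeds
    return float(np.mean([w2v.wv.similarity(word, s) for s in seeds]))

def lift(word, seeds):
    # association-analysis lift: observed co-occurrence relative to independence
    n = len(phrases)
    pw = sum(word in p for p in phrases) / n
    pd = sum(any(s in p for s in seeds) for p in phrases) / n
    pwd = sum(word in p and any(s in p for s in seeds) for p in phrases) / n
    return pwd / (pw * pd + 1e-9)

def dim_score(word, seeds, alpha=0.5):
    # blend of the two signals; the weighting is an assumption
    return alpha * embed_score(word, seeds) + (1 - alpha) * lift(word, seeds)

print(dim_score("long", dim_seeds["performance"]))
```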

Climate Change Impact on Nonpoint Source Pollution in a Rural Small Watershed (기후변화에 따른 농촌 소유역에서의 비점오염 영향 분석)

  • Hwang, Sye-Woon;Jang, Tae-Il;Park, Seung-Woo
    • Korean Journal of Agricultural and Forest Meteorology / v.8 no.4 / pp.209-221 / 2006
  • The purpose of this study is to analyze the effects of climate change on nonpoint source pollution in a small watershed using a mid-range model. The study area is a rural basin covering 384 ha, composed of 50% forest and 19% paddy. Hydrologic and water quality data were monitored from 1996 to 2004, and the feasibility of the GWLF (Generalized Watershed Loading Function) model for the agricultural small watershed was examined using the data obtained from the study area. In its work on climate change, KEI (Korea Environment Institute) has presented monthly rainfall variation ratios for Korea based on a climate change scenario for rainfall and temperature. These values, together with 41 years of observed daily rainfall data (1964 to 2004) from Suwon, were used to generate daily weather data with the stochastic weather generator model (WGEN). Stream runoff was calibrated on the 1996-1999 data and verified on the 2002-2004 data, yielding coefficients of determination (R²) of 0.70 to 0.91 and root mean square errors (RMSE) of 2.11 to 5.71. Water quality simulations for SS, TN, and TP showed R² values of 0.58, 0.47, and 0.62, respectively. The results for the impact of climate change on nonpoint source pollution show that, if the watershed factors are maintained under present circumstances, TN and TP pollutant loads would be expected to increase remarkably during the rainy season over the next fifty years.
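
WGEN-style generators typically model daily rainfall occurrence with a two-state Markov chain and wet-day amounts with a gamma distribution. The sketch below shows that structure only; the transition probabilities and gamma parameters are illustrative, not those fitted to the 1964-2004 Suwon record.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two-state Markov chain for wet/dry occurrence; gamma-distributed wet-day amounts.
# All parameters are illustrative, not fitted to the Suwon record.
p_wd, p_ww = 0.25, 0.65     # P(wet | dry), P(wet | wet)
shape, scale = 0.8, 8.0     # gamma parameters for wet-day rainfall (mm)

def generate_daily_rainfall(n_days, wet=False):
    series = []
    for _ in range(n_days):
        wet = rng.random() < (p_ww if wet else p_wd)
        series.append(rng.gamma(shape, scale) if wet else 0.0)
    return np.array(series)

rain = generate_daily_rainfall(365)
print(rain.sum())  # annual total (mm) for one synthetic year
```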

Generation of Pseudo Porosity Logs from Seismic Data Using a Polynomial Neural Network Method (다항식 신경망 기법을 이용한 탄성파 탐사 자료로부터의 유사공극률 검층자료 생성)

  • Choi, Jae-Won;Byun, Joong-Moo;Seol, Soon-Jee
    • Journal of the Korean earth science society / v.32 no.6 / pp.665-673 / 2011
  • In order to estimate hydrocarbon reserves, the porosity of the reservoir must be determined. The porosity of an area without a well is generally calculated by extrapolating porosity logs measured at wells. However, if both well logs and seismic data exist for the same site, a more accurate pseudo porosity log can be obtained through an artificial neural network technique by extracting the relations between the seismic data and the well logs at the site. In this study, we developed a module that creates pseudo porosity logs using the polynomial neural network method. To obtain more accurate pseudo porosity logs, we selected the seismic attributes with high correlation values in a correlation analysis between the seismic attributes and the porosity logs. Through training on the selected seismic attributes and well logs, our module produces the correlation weights that can be used to generate the pseudo porosity log in well-free areas. To verify the reliability and applicability of the developed module, we applied it to field data acquired from the F3 Block in the North Sea and compared the results with those from the probabilistic neural network method in a commercial program. Both results showed similar trends, confirming the reliability of our module. Moreover, since the pseudo porosity logs from the polynomial neural network method were closer to the true porosity logs at the wells than those from the probabilistic method, we concluded that the polynomial neural network method is effective for data sets with insufficient wells, such as the F3 Block in the North Sea.
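
Polynomial neural networks of the GMDH type build layers of quadratic nodes, each fitted over a pair of inputs by least squares, keeping the best-performing nodes. A minimal sketch of one such node, with seismic attributes as inputs and a porosity log as the target, follows; the function names and the value of k are illustrative assumptions.

```python
import numpy as np

def design(x1, x2):
    # GMDH-type quadratic node:
    # y ≈ a0 + a1*x1 + a2*x2 + a3*x1² + a4*x2² + a5*x1*x2
    return np.column_stack([np.ones_like(x1), x1, x2,
                            x1**2, x2**2, x1 * x2])

def fit_node(x1, x2, y):
    # least-squares fit of one polynomial node over a pair of attributes
    coef, *_ = np.linalg.lstsq(design(x1, x2), y, rcond=None)
    return coef

def predict_node(coef, x1, x2):
    return design(x1, x2) @ coef

def select_attributes(attributes, porosity, k=3):
    # keep the k seismic attributes most correlated (in magnitude) with the
    # porosity log, mirroring the correlation-based selection described above
    corr = [abs(np.corrcoef(a, porosity)[0, 1]) for a in attributes]
    order = np.argsort(corr)[::-1][:k]
    return [attributes[i] for i in order]
```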

Seismic Data Processing and Inversion for Characterization of CO2 Storage Prospect in Ulleung Basin, East Sea (동해 울릉분지 CO2 저장소 특성 분석을 위한 탄성파 자료처리 및 역산)

  • Lee, Ho Yong;Kim, Min Jun;Park, Myong-Ho
    • Economic and Environmental Geology / v.48 no.1 / pp.25-39 / 2015
  • CO2 geological storage plays an important role in reducing greenhouse gas emissions, but research toward CCS demonstration is still lacking. To achieve the goal of CCS, storing CO2 safely and permanently in underground geological formations, it is essential to understand the characteristics of those formations, such as total storage capacity and stability, and to establish an injection strategy. We performed impedance inversion on seismic data acquired from the Ulleung Basin in 2012. To review the possibility of CO2 storage, we also constructed porosity models and extracted attributes of the prospects from the seismic data. To improve the quality of the seismic data, amplitude-preserving processing methods were applied: SWD (Shallow Water Demultiple), SRME (Surface-Related Multiple Elimination), and Radon demultiple. Three well logs were also analyzed; the log correlations of the wells were 0.648, 0.574, and 0.342, respectively. All wells were used in building the low-frequency model to generate a more robust initial model. Simultaneous pre-stack inversion was performed on all of the 2D profiles, and inverted P-impedance, S-impedance, and Vp/Vs ratio were generated from the inversion process. With the porosity profiles generated from the seismic inversion, porous and non-porous zones can be identified for the purposes of the CO2 sequestration initiative. More detailed characterization of the geological storage and simulation of CO2 migration will be essential for CCS demonstration.
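
For context on the inversion outputs: P-impedance is the product of density and P-wave velocity, and a porosity profile is commonly obtained from inverted impedance through a regression calibrated at the wells. A hedged sketch follows; the sample values and the linear porosity-impedance form are illustrative assumptions, not the paper's actual relation.

```python
import numpy as np

# Well-log based P-impedance: Zp = density * P-velocity
rho = np.array([2.2, 2.3, 2.4])           # g/cc (illustrative samples)
vp = np.array([2500.0, 2800.0, 3100.0])   # m/s
zp = rho * vp

# Porosity-impedance transform calibrated at the wells; a linear fit is a
# common choice, but the paper's actual relation is not given here.
porosity_log = np.array([0.30, 0.24, 0.18])
slope, intercept = np.polyfit(zp, porosity_log, 1)

# Apply the calibrated transform to inverted impedance away from the wells
porosity_from_seismic = slope * zp + intercept
```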

The Generation of Westerly Waves by Sobaek Mountains (소백산맥에 의한 서풍 파동 발생)

  • Kim, Jin wook;Youn, Daeok
    • Journal of the Korean earth science society / v.38 no.1 / pp.24-34 / 2017
  • The generation of westerly waves is described in the advanced earth science textbook used at high school as follows: as westerly wind approaches and blows over large mountains, the air flow shows wave motions on the downwind side, which can be explained by the conservation of potential vorticity. However, there has been no case study demonstrating mesoscale westerly waves with observational data in the area of small mountains in Korea, and thus the wind speed and persistence of westerly winds, along with the width and length of the mountains, have never been studied to explain the generation of these waves. As a first step, we confirmed, from nearby surface station wind data, that westerly waves are generated on the downwind side of the Sobaek mountains. Furthermore, the critical (minimum) wind velocity of the westerly flow over the Sobaek mountains needed to generate the downwind wave was derived and calculated to be about 0.6 m s⁻¹, which means the westerly waves could be generated in most cases of westerlies blowing over the mountains. Using surface station data and the 4-dimensional assimilation data of RDAPS (Regional Data Assimilation and Prediction System) provided by the Korea Meteorological Agency, we also analyzed cases of westerly wave occurrence and their life cycles on the downwind side of the Sobaek mountains for the year 2014. The westerly waves occurred on meso-β or meso-γ scales, and the waves generated by the mountains disappeared gradually as the wind speed decreased. The occurrence frequency of meso-β-scale vorticity increased when stronger westerly winds blew. When we extended the spatial range of the analysis, westerly wave phenomena were also observed on the downwind side of the Yensan mountains in northeastern China. Our current work will serve as study material to help students understand atmospheric phenomena perturbed by mountains.
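
The textbook explanation invoked here is the conservation of barotropic potential vorticity for an air column crossing the ridge: as the column is squashed over the crest and stretched in the lee, its relative vorticity must change, producing the downstream wave. In its standard form (the paper's own derivation of the 0.6 m s⁻¹ threshold is not reproduced here):

```latex
\frac{D}{Dt}\left(\frac{\zeta + f}{H}\right) = 0
```

where ζ is the relative vorticity, f the Coriolis parameter, and H the depth of the air column.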

A Study on the Application of the Smartphone Hiking Apps for Analyzing the User Characteristics in Forest Recreation Area: Focusing on Daegwallyoung Area (산림휴양공간 이용특성 분석을 위한 국내 스마트폰 산행앱(APP)의 적용성 및 활용방안 연구: 대관령 선자령 일대를 중심으로)

  • Jang, Youn-Sun;Yoo, Rhee-Hwa;Lee, Jeong-Hee
    • Journal of Korean Society of Forest Science / v.108 no.3 / pp.382-391 / 2019
  • This study was conducted to verify whether smartphone hiking apps, which generate social network data including location information, are useful tools for analyzing the use characteristics of a forest recreation area. For this purpose, the study identified the functions and service characteristics of smartphone hiking apps, analyzed the use characteristics of the Daegwallyoung area, compared them with the results of a field survey, and reviewed the applicability of hiking apps. The service types of hiking apps fell into three categories: "information offering," "hiking record," and "information sharing." This study focused on one of the "hiking record" apps with the greatest number of users. Analysis of the data from hiking apps and a field survey in the Daegwallyoung area showed, first, that both hiking apps and the field survey can identify movement patterns, but hiking apps based on the global positioning system (GPS) are more efficient and objective tools for understanding the use patterns of a forest recreation area, as well as for extracting user-generated photos. Second, although the generated walking-speed data make it possible to analyze use patterns objectively, field surveys and observation are needed as complements for understanding the types of activities in each space. The hiking apps are based on cellphone use and are specific to "hiking," so user bias can limit the usefulness of the data. It is significant that this research shows the applicability of hiking apps for analyzing the use patterns of forest recreation areas through the location-based social network data of app users who voluntarily record their hiking information.
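
As an example of how walking speeds can be derived from app-recorded tracks, the sketch below computes point-to-point speeds from timestamped GPS fixes using the haversine distance. The track format is an assumption; real exports (e.g., GPX files) would be parsed first.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    # great-circle distance in meters between two lat/lon points
    R = 6371000.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * R * math.asin(math.sqrt(a))

def walking_speeds(track):
    # track: list of (timestamp_s, lat, lon) fixes, ordered in time
    speeds = []
    for (t0, la0, lo0), (t1, la1, lo1) in zip(track, track[1:]):
        dt = t1 - t0
        if dt > 0:
            speeds.append(haversine_m(la0, lo0, la1, lo1) / dt)
    return speeds

track = [(0, 37.687, 128.757), (60, 37.688, 128.758)]  # illustrative fixes
print(walking_speeds(track))  # m/s between consecutive points
```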

A Study on the Medical Application and Personal Information Protection of Generative AI (생성형 AI의 의료적 활용과 개인정보보호)

  • Lee, Sookyoung
    • The Korean Society of Law and Medicine / v.24 no.4 / pp.67-101 / 2023
  • The use of generative AI in the medical field is being rapidly researched. Access to vast data sets reduces the time and energy spent selecting information. However, as the effort put into content creation decreases, associated issues become more likely. For example, with generative AI, users must judge the accuracy of results themselves, as these systems learn from data within a set period and then generate outcomes. While the answers may appear plausible, their sources are often unclear, making it hard to determine their veracity. Additionally, the possibility of results presented from a biased or distorted perspective cannot currently be discounted on ethical grounds. Despite these concerns, the field of generative AI continues to advance, with an increasing number of users in various sectors, including the biomedical and life sciences. This raises important legal questions about who bears responsibility, and to what extent, for damage caused by these high-performance AI algorithms. A general overview of the issues with generative AI includes those discussed above, but another perspective arises from its fundamental nature as a large language model ('LLM'). There is a civil law concern regarding "the memorization of training data within artificial neural networks and its subsequent reproduction." Medical data, by nature, often reflect the personal characteristics of patients, potentially leading to issues such as the regeneration of personal information. The extensive application of generative AI in scenarios beyond traditional AI brings with it the possibility of legal challenges that cannot be ignored. Examining the technical characteristics of generative AI with a focus on legal issues, especially the protection of personal information, makes it evident that current personal information protection laws, particularly in the context of health and medical data utilization, are inadequate. These laws provide processes for anonymizing and de-identifying specific personal information, but fall short when generative AI is applied as software in medical devices. To address the functionalities of generative AI in clinical software, a re-evaluation and adjustment of existing personal information protection laws is imperative.

A Study on the Digital Drawing of Archaeological Relics Using Open-Source Software (오픈소스 소프트웨어를 활용한 고고 유물의 디지털 실측 연구)

  • LEE Hosun;AHN Hyoungki
    • Korean Journal of Heritage: History & Science / v.57 no.1 / pp.82-108 / 2024
  • With the transition of archaeological recording methods from analog to digital, 3D scanning technology has been actively adopted in the field, and research on digital archaeological data gathered from 3D scanning and photogrammetry is ongoing. However, due to cost and manpower issues, most buried cultural heritage organizations hesitate to adopt such digital technology. This paper presents a digital recording method for relics using open-source software and photogrammetry, believed to be the most efficient of the 3D scanning methods. The digital recording process consists of three stages: acquiring a 3D model, creating a joining map with the edited 3D model, and creating a digital drawing. To enhance accessibility, the method uses only open-source software throughout. The results confirm that, in quantitative evaluation, the deviation between measurements of the actual artifact and of the 3D model was minimal, and the quantitative quality analyses from the open-source and commercial software showed high similarity. However, data processing was overwhelmingly faster in the commercial software, presumably owing to the higher computational speed of its improved algorithms. In qualitative evaluation, some differences in mesh and texture quality occurred: 3D models generated by some open-source software showed noise and harshness on the mesh surface, making it difficult to confirm the production marks on relics and the expression of patterns. Nevertheless, some of the open-source software generated quality comparable to that of commercial software in both quantitative and qualitative evaluations. Open-source software for editing 3D models could not only post-process, match, and merge 3D models, but also adjust scale, produce joining surfaces, and render the images necessary for the actual measurement of relics. The final drawing was traced in a CAD program that is also open-source software. In archaeological research, photogrammetry is applicable to many processes, including excavation, report writing, and research on numerical data from 3D models. With breakthrough developments in computer vision, the types of open-source software have diversified and their performance has significantly improved. Given the high accessibility of such digital technology, the acquisition of 3D model data in archaeology will serve as basic data for the preservation of and active research on cultural heritage.
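
A small sketch of the scale-adjustment step with open-source tooling, using the trimesh Python library (an assumption; the paper does not name its specific editing software, and the file names and measurements below are illustrative):

```python
import trimesh

# Load the photogrammetry mesh (file name is illustrative)
mesh = trimesh.load("artifact.obj", force="mesh")

# Photogrammetric models carry no absolute scale: calibrate against a
# distance measured on the real artifact (values below are assumptions).
real_distance_mm = 52.0   # caliper measurement on the artifact
model_distance = 1.3      # same distance measured on the raw model
mesh.apply_scale(real_distance_mm / model_distance)

# Light cleanup before producing renders for the drawing
mesh.remove_unreferenced_vertices()
mesh.export("artifact_scaled.obj")
```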

An Intelligent Intrusion Detection Model Based on Support Vector Machines and the Classification Threshold Optimization for Considering the Asymmetric Error Cost (비대칭 오류비용을 고려한 분류기준값 최적화와 SVM에 기반한 지능형 침입탐지모형)

  • Lee, Hyeon-Uk;Ahn, Hyun-Chul
    • Journal of Intelligence and Information Systems / v.17 no.4 / pp.157-173 / 2011
  • As Internet use has exploded recently, malicious attacks and hacking against network-connected systems occur frequently, meaning such intrusions can cause fatal damage to government agencies, public offices, and companies operating various systems. For these reasons, there is growing interest in and demand for intrusion detection systems (IDS): security systems for detecting, identifying, and responding appropriately to unauthorized or abnormal activities. The intrusion detection models applied in conventional IDS are generally designed by modeling experts' implicit knowledge of network intrusions or of hackers' abnormal behaviors. These models perform well in normal situations but poorly when they meet new or unknown patterns of network attack. For this reason, several recent studies have adopted various artificial intelligence techniques that can respond proactively to unknown threats. Artificial neural networks (ANNs) in particular have been popular in prior studies because of their superior prediction accuracy. However, ANNs have intrinsic limitations such as the risk of overfitting, the requirement of a large sample size, and the opacity of the prediction process (the black-box problem). As a result, the most recent studies on IDS have started to adopt the support vector machine (SVM), a classification technique that is more stable and powerful than ANNs and known for relatively high predictive power and generalization capability. Against this background, this study proposes a novel intelligent intrusion detection model that uses SVM as the classification model to improve the predictive ability of IDS, and that considers asymmetric error costs by optimizing the classification threshold. There are two common forms of error in intrusion detection. The first is the False-Positive Error (FPE), in which normal activity is misjudged as an intrusion; such misjudgments may result in unnecessary fixes. The second is the False-Negative Error (FNE), which misjudges malicious activity as normal. Compared to FPE, FNE is more fatal. Thus, when considering the total cost of misclassification in IDS, it is more reasonable to assign a heavier weight to FNE than to FPE. We therefore designed our intrusion detection model to optimize the classification threshold so as to minimize the total misclassification cost. Conventional SVM cannot be applied in this case because it is designed to generate a discrete output (i.e., a class); to resolve this, we used the revised SVM technique proposed by Platt (2000), which can generate probability estimates. To validate the practical applicability of our model, we applied it to a real-world dataset for network intrusion detection, collected from the IDS sensor of an official institution in Korea from January to June 2010. We collected 15,000 log records in total and selected 1,000 samples by random sampling. In addition, the SVM model was compared with logistic regression (LOGIT), decision trees (DT), and an ANN to confirm its superiority. LOGIT and DT were run using PASW Statistics v18.0, and the ANN using Neuroshell 4.0; for SVM, LIBSVM v2.90, a freeware tool for training SVM classifiers, was used. Empirical results showed that our proposed SVM-based model outperformed all the comparative models in detecting network intrusions from the accuracy perspective, and that it reduced the total misclassification cost compared to the ANN-based intrusion detection model. The intrusion detection model proposed in this paper is therefore expected not only to enhance the performance of IDS but also to lead to better management of FNE.
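
A hedged sketch of the two key ingredients, probability outputs via Platt-style sigmoid fitting and cost-sensitive threshold selection, using scikit-learn rather than the LIBSVM build used in the paper. The 10:1 cost ratio and the synthetic data are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the intrusion log data (illustrative only)
X, y = make_classification(n_samples=1000, weights=[0.9], random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# probability=True fits a Platt (2000)-style sigmoid on the SVM decision
# values, yielding P(intrusion | x) instead of a discrete class label.
svm = SVC(kernel="rbf", probability=True, random_state=0)
svm.fit(X_train, y_train)            # y: 1 = intrusion, 0 = normal
p_val = svm.predict_proba(X_val)[:, 1]

# Asymmetric misclassification costs: FNE weighted heavier than FPE
# (the 10:1 ratio is an assumption, not the paper's actual weighting).
C_FN, C_FP = 10.0, 1.0

def total_cost(threshold, p, y_true):
    pred = (p >= threshold).astype(int)
    fn = np.sum((pred == 0) & (y_true == 1))
    fp = np.sum((pred == 1) & (y_true == 0))
    return C_FN * fn + C_FP * fp

# Pick the classification threshold minimizing total cost on validation data;
# a heavy FNE weight pushes the optimal threshold below the default 0.5.
grid = np.linspace(0.05, 0.95, 91)
best = min(grid, key=lambda t: total_cost(t, p_val, y_val))
print(best, total_cost(best, p_val, y_val))
```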