• Title/Summary/Keyword: distribution systems

Search Result 6,365, Processing Time 0.034 seconds

A Study of 'Emotion Trigger' by Text Mining Techniques (텍스트 마이닝을 이용한 감정 유발 요인 'Emotion Trigger'에 관한 연구)

  • An, Juyoung;Bae, Junghwan;Han, Namgi;Song, Min
    • Journal of Intelligence and Information Systems
    • /
    • v.21 no.2
    • /
    • pp.69-92
    • /
    • 2015
  • The explosion of social media data has led to apply text-mining techniques to analyze big social media data in a more rigorous manner. Even if social media text analysis algorithms were improved, previous approaches to social media text analysis have some limitations. In the field of sentiment analysis of social media written in Korean, there are two typical approaches. One is the linguistic approach using machine learning, which is the most common approach. Some studies have been conducted by adding grammatical factors to feature sets for training classification model. The other approach adopts the semantic analysis method to sentiment analysis, but this approach is mainly applied to English texts. To overcome these limitations, this study applies the Word2Vec algorithm which is an extension of the neural network algorithms to deal with more extensive semantic features that were underestimated in existing sentiment analysis. The result from adopting the Word2Vec algorithm is compared to the result from co-occurrence analysis to identify the difference between two approaches. The results show that the distribution related word extracted by Word2Vec algorithm in that the words represent some emotion about the keyword used are three times more than extracted by co-occurrence analysis. The reason of the difference between two results comes from Word2Vec's semantic features vectorization. Therefore, it is possible to say that Word2Vec algorithm is able to catch the hidden related words which have not been found in traditional analysis. In addition, Part Of Speech (POS) tagging for Korean is used to detect adjective as "emotional word" in Korean. In addition, the emotion words extracted from the text are converted into word vector by the Word2Vec algorithm to find related words. Among these related words, noun words are selected because each word of them would have causal relationship with "emotional word" in the sentence. The process of extracting these trigger factor of emotional word is named "Emotion Trigger" in this study. As a case study, the datasets used in the study are collected by searching using three keywords: professor, prosecutor, and doctor in that these keywords contain rich public emotion and opinion. Advanced data collecting was conducted to select secondary keywords for data gathering. The secondary keywords for each keyword used to gather the data to be used in actual analysis are followed: Professor (sexual assault, misappropriation of research money, recruitment irregularities, polifessor), Doctor (Shin hae-chul sky hospital, drinking and plastic surgery, rebate) Prosecutor (lewd behavior, sponsor). The size of the text data is about to 100,000(Professor: 25720, Doctor: 35110, Prosecutor: 43225) and the data are gathered from news, blog, and twitter to reflect various level of public emotion into text data analysis. As a visualization method, Gephi (http://gephi.github.io) was used and every program used in text processing and analysis are java coding. The contributions of this study are as follows: First, different approaches for sentiment analysis are integrated to overcome the limitations of existing approaches. Secondly, finding Emotion Trigger can detect the hidden connections to public emotion which existing method cannot detect. Finally, the approach used in this study could be generalized regardless of types of text data. The limitation of this study is that it is hard to say the word extracted by Emotion Trigger processing has significantly causal relationship with emotional word in a sentence. The future study will be conducted to clarify the causal relationship between emotional words and the words extracted by Emotion Trigger by comparing with the relationships manually tagged. Furthermore, the text data used in Emotion Trigger are twitter, so the data have a number of distinct features which we did not deal with in this study. These features will be considered in further study.

A Time Series Graph based Convolutional Neural Network Model for Effective Input Variable Pattern Learning : Application to the Prediction of Stock Market (효과적인 입력변수 패턴 학습을 위한 시계열 그래프 기반 합성곱 신경망 모형: 주식시장 예측에의 응용)

  • Lee, Mo-Se;Ahn, Hyunchul
    • Journal of Intelligence and Information Systems
    • /
    • v.24 no.1
    • /
    • pp.167-181
    • /
    • 2018
  • Over the past decade, deep learning has been in spotlight among various machine learning algorithms. In particular, CNN(Convolutional Neural Network), which is known as the effective solution for recognizing and classifying images or voices, has been popularly applied to classification and prediction problems. In this study, we investigate the way to apply CNN in business problem solving. Specifically, this study propose to apply CNN to stock market prediction, one of the most challenging tasks in the machine learning research. As mentioned, CNN has strength in interpreting images. Thus, the model proposed in this study adopts CNN as the binary classifier that predicts stock market direction (upward or downward) by using time series graphs as its inputs. That is, our proposal is to build a machine learning algorithm that mimics an experts called 'technical analysts' who examine the graph of past price movement, and predict future financial price movements. Our proposed model named 'CNN-FG(Convolutional Neural Network using Fluctuation Graph)' consists of five steps. In the first step, it divides the dataset into the intervals of 5 days. And then, it creates time series graphs for the divided dataset in step 2. The size of the image in which the graph is drawn is $40(pixels){\times}40(pixels)$, and the graph of each independent variable was drawn using different colors. In step 3, the model converts the images into the matrices. Each image is converted into the combination of three matrices in order to express the value of the color using R(red), G(green), and B(blue) scale. In the next step, it splits the dataset of the graph images into training and validation datasets. We used 80% of the total dataset as the training dataset, and the remaining 20% as the validation dataset. And then, CNN classifiers are trained using the images of training dataset in the final step. Regarding the parameters of CNN-FG, we adopted two convolution filters ($5{\times}5{\times}6$ and $5{\times}5{\times}9$) in the convolution layer. In the pooling layer, $2{\times}2$ max pooling filter was used. The numbers of the nodes in two hidden layers were set to, respectively, 900 and 32, and the number of the nodes in the output layer was set to 2(one is for the prediction of upward trend, and the other one is for downward trend). Activation functions for the convolution layer and the hidden layer were set to ReLU(Rectified Linear Unit), and one for the output layer set to Softmax function. To validate our model - CNN-FG, we applied it to the prediction of KOSPI200 for 2,026 days in eight years (from 2009 to 2016). To match the proportions of the two groups in the independent variable (i.e. tomorrow's stock market movement), we selected 1,950 samples by applying random sampling. Finally, we built the training dataset using 80% of the total dataset (1,560 samples), and the validation dataset using 20% (390 samples). The dependent variables of the experimental dataset included twelve technical indicators popularly been used in the previous studies. They include Stochastic %K, Stochastic %D, Momentum, ROC(rate of change), LW %R(Larry William's %R), A/D oscillator(accumulation/distribution oscillator), OSCP(price oscillator), CCI(commodity channel index), and so on. To confirm the superiority of CNN-FG, we compared its prediction accuracy with the ones of other classification models. Experimental results showed that CNN-FG outperforms LOGIT(logistic regression), ANN(artificial neural network), and SVM(support vector machine) with the statistical significance. These empirical results imply that converting time series business data into graphs and building CNN-based classification models using these graphs can be effective from the perspective of prediction accuracy. Thus, this paper sheds a light on how to apply deep learning techniques to the domain of business problem solving.

Shielding for Critical Organs and Radiation Exposure Dose Distribution in Patients with High Energy Radiotherapy (고 에너지 방사선치료에서 환자의 피폭선량 분포와 생식선의 차폐)

  • Chu, Sung-Sil;Suh, Chang-Ok;Kim, Gwi-Eon
    • Journal of Radiation Protection and Research
    • /
    • v.27 no.1
    • /
    • pp.1-10
    • /
    • 2002
  • High energy photon beams from medical linear accelerators produce large scattered radiation by various components of the treatment head, collimator and walls or objects in the treatment room including the patient. These scattered radiation do not provide therapeutic dose and are considered a hazard from the radiation safety perspective. Scattered dose of therapeutic high energy radiation beams are contributed significant unwanted dose to the patient. ICRP take the position that a dose of 500mGy may cause abortion at any stage of pregnancy and that radiation detriment to the fetus includes risk of mental retardation with a possible threshold in the dose response relationship around 100 mGy for the gestational period. The ICRP principle of as low as reasonably achievable (ALARA) was recommended for protection of occupation upon the linear no-threshold dose response hypothesis for cancer induction. We suggest this ALARA principle be applied to the fetus and testicle in therapeutic treatment. Radiation dose outside a photon treatment filed is mostly due to scattered photons. This scattered dose is a function of the distance from the beam edge, treatment geometry, primary photon energy, and depth in the patient. The need for effective shielding of the fetus and testicle is reinforced when young patients ate treated with external beam radiation therapy and then shielding designed to reduce the scattered photon dose to normal organs have to considered. Irradiation was performed in phantom using high energy photon beams produced by a Varian 2100C/D medical linear accelerator (Varian Oncology Systems, Palo Alto, CA) located at the Yonsei Cancer Center. The composite phantom used was comprised of a commercially available anthropomorphic Rando phantom (Phantom Laboratory Inc., Salem, YN) and a rectangular solid polystyrene phantom of dimensions $30cm{\times}30cm{\times}20cm$. the anthropomorphic Rando phantom represents an average man made from tissue equivalent materials that is transected into transverse 36 slices of 2.5cm thickness. Photon dose was measured using a Capintec PR-06C ionization chamber with Capintec 192 electrometer (Capintec Inc., Ramsey, NJ), TLD( VICTOREEN 5000. LiF) and film dosimetry V-Omat, Kodak). In case of fetus, the dosimeter was placed at a depth of loom in this phantom at 100cm source to axis distance and located centrally 15cm from the inferior edge of the $30cm{\times}30cm^2$ x-ray beam irradiating the Rando phantom chest wall. A acryl bridge of size $40cm{\times}40cm^2$ and a clear space of about 20 cm was fabricated and placed on top of the rectangular polystyrene phantom representing the abdomen of the patient. The leaf pot for testicle shielding was made as various shape, sizes, thickness and supporting stand. The scattered photon with and without shielding were measured at the representative position of the fetus and testicle. Measurement of radiation scattered dose outside fields and critical organs, like fetus position and testicle region, from chest or pelvic irradiation by large fie]d of high energy radiation beam was performed using an ionization chamber and film dosimetry. The scattered doses outside field were measured 5 - 10% of maximum doses in fields and exponentially decrease from field margins. The scattered photon dose received the fetus and testicle from thorax field irradiation was measured about 1 mGy/Gy of photon treatment dose. Shielding construction to reduce this scattered dose was investigated using lead sheet and blocks. Lead pot shield for testicle reduced the scatter dose under 10 mGy when photon beam of 60 Gy was irradiated in abdomen region. The scattered photon dose is reduced when the lead shield was used while the no significant reduction of scattered photon dose was observed and 2-3 mm lead sheets refuted the skin dose under 80% and almost electron contamination. The results indicate that it was possible to improve shielding to reduce scattered photon for fetus and testicle when a young patients were treated with a high energy photon beam.

The Measurement of Sensitivity and Comparative Analysis of Simplified Quantitation Methods to Measure Dopamine Transporters Using [I-123]IPT Pharmacokinetic Computer Simulations ([I-123]IPT 약역학 컴퓨터시뮬레이션을 이용한 민감도 측정 및 간편화된 운반체 정량분석 방법들의 비교분석 연구)

  • Son, Hye-Kyung;Nha, Sang-Kyun;Lee, Hee-Kyung;Kim, Hee-Joung
    • The Korean Journal of Nuclear Medicine
    • /
    • v.31 no.1
    • /
    • pp.19-29
    • /
    • 1997
  • Recently, [I-123]IPT SPECT has been used for early diagnosis of Parkinson's patients(PP) by imaging dopamine transporters. The dynamic time activity curves in basal ganglia(BG) and occipital cortex(OCC) without blood samples were obtained for 2 hours. These data were then used to measure dopamine transporters by operationally defined ratio methods of (BG-OCC)/OCC at 2 hrs, binding potential $R_v=k_3/k_4$ using graphic method or $R_A$= (ABBG-ABOCC)/ABOCC for 2 hrs, where ABBG represents accumulated binding activity in basal ganglia(${\int}^{120min}_0$ BG(t)dt) and ABOCC represents accumulated binding activity in occipital cortex(${\int}^{120min}_0$ OCC(t)dt). The purpose of this study was to examine the IPT pharmacokinetics and investigate the usefulness of simplified methods of (BG-OCC)/OCC, $R_A$, and $R_v$ which are often assumed that these values reflect the true values of $k_3/k_4$. The rate constants $K_1,\;k_2\;k_3$ and $k_4$ to be used for simulations were derived using [I-123]IPT SPECT and aterialized blood data with a standard three compartmental model. The sensitivities and time activity curves in BG and OCC were computed by changing $K_l$ and $k_3$(only BG) for every 5min over 2 hours. The values (BG-OCC)/OCC, $R_A$, and $R_v$ were then computed from the time activity curves and the linear regression analysis was used to measure the accuracies of these methods. The late constants $K_l,\;k_2\;k_3\;k_4$ at BG and OCC were $1.26{\pm}5.41%,\;0.044{\pm}19.58%,\;0.031{\pm}24.36%,\;0.008{\pm}22.78%$ and $1.36{\pm}4.76%,\;0.170{\pm}6.89%,\;0.007{\pm}23.89%,\;0.007{\pm}45.09%$, respectively. The Sensitivities for ((${\Delta}S/S$)/(${\Delta}k_3/k_3$)) and ((${\Delta}S/S$)/(${\Delta}K_l/K_l$)) at 30min and 120min were measured as (0.19, 0.50) and (0.61, 0,23), respectively. The correlation coefficients and slopes of ((BG-OCC)/OCC, $R_A$, and $R_v$) with $k_3/k_4$ were (0.98, 1.00, 0.99) and (1.76, 0.47, 1.25), respectively. These simulation results indicate that a late [I-123]IPT SPECT image may represent the distribution of the dopamine transporters. Good correlations were shown between (3G-OCC)/OCC, $R_A$ or $R_v$ and true $k_3/k_4$, although the slopes between them were not unity. Pharmacokinetic computer simulations may be a very useful technique in studying dopamine transporter systems.

  • PDF

Image Watermarking for Copyright Protection of Images on Shopping Mall (쇼핑몰 이미지 저작권보호를 위한 영상 워터마킹)

  • Bae, Kyoung-Yul
    • Journal of Intelligence and Information Systems
    • /
    • v.19 no.4
    • /
    • pp.147-157
    • /
    • 2013
  • With the advent of the digital environment that can be accessed anytime, anywhere with the introduction of high-speed network, the free distribution and use of digital content were made possible. Ironically this environment is raising a variety of copyright infringement, and product images used in the online shopping mall are pirated frequently. There are many controversial issues whether shopping mall images are creative works or not. According to Supreme Court's decision in 2001, to ad pictures taken with ham products is simply a clone of the appearance of objects to deliver nothing but the decision was not only creative expression. But for the photographer's losses recognized in the advertising photo shoot takes the typical cost was estimated damages. According to Seoul District Court precedents in 2003, if there are the photographer's personality and creativity in the selection of the subject, the composition of the set, the direction and amount of light control, set the angle of the camera, shutter speed, shutter chance, other shooting methods for capturing, developing and printing process, the works should be protected by copyright law by the Court's sentence. In order to receive copyright protection of the shopping mall images by the law, it is simply not to convey the status of the product, the photographer's personality and creativity can be recognized that it requires effort. Accordingly, the cost of making the mall image increases, and the necessity for copyright protection becomes higher. The product images of the online shopping mall have a very unique configuration unlike the general pictures such as portraits and landscape photos and, therefore, the general image watermarking technique can not satisfy the requirements of the image watermarking. Because background of product images commonly used in shopping malls is white or black, or gray scale (gradient) color, it is difficult to utilize the space to embed a watermark and the area is very sensitive even a slight change. In this paper, the characteristics of images used in shopping malls are analyzed and a watermarking technology which is suitable to the shopping mall images is proposed. The proposed image watermarking technology divide a product image into smaller blocks, and the corresponding blocks are transformed by DCT (Discrete Cosine Transform), and then the watermark information was inserted into images using quantization of DCT coefficients. Because uniform treatment of the DCT coefficients for quantization cause visual blocking artifacts, the proposed algorithm used weighted mask which quantizes finely the coefficients located block boundaries and coarsely the coefficients located center area of the block. This mask improves subjective visual quality as well as the objective quality of the images. In addition, in order to improve the safety of the algorithm, the blocks which is embedded the watermark are randomly selected and the turbo code is used to reduce the BER when extracting the watermark. The PSNR(Peak Signal to Noise Ratio) of the shopping mall image watermarked by the proposed algorithm is 40.7~48.5[dB] and BER(Bit Error Rate) after JPEG with QF = 70 is 0. This means the watermarked image is high quality and the algorithm is robust to JPEG compression that is used generally at the online shopping malls. Also, for 40% change in size and 40 degrees of rotation, the BER is 0. In general, the shopping malls are used compressed images with QF which is higher than 90. Because the pirated image is used to replicate from original image, the proposed algorithm can identify the copyright infringement in the most cases. As shown the experimental results, the proposed algorithm is suitable to the shopping mall images with simple background. However, the future study should be carried out to enhance the robustness of the proposed algorithm because the robustness loss is occurred after mask process.

Analysis of Twitter for 2012 South Korea Presidential Election by Text Mining Techniques (텍스트 마이닝을 이용한 2012년 한국대선 관련 트위터 분석)

  • Bae, Jung-Hwan;Son, Ji-Eun;Song, Min
    • Journal of Intelligence and Information Systems
    • /
    • v.19 no.3
    • /
    • pp.141-156
    • /
    • 2013
  • Social media is a representative form of the Web 2.0 that shapes the change of a user's information behavior by allowing users to produce their own contents without any expert skills. In particular, as a new communication medium, it has a profound impact on the social change by enabling users to communicate with the masses and acquaintances their opinions and thoughts. Social media data plays a significant role in an emerging Big Data arena. A variety of research areas such as social network analysis, opinion mining, and so on, therefore, have paid attention to discover meaningful information from vast amounts of data buried in social media. Social media has recently become main foci to the field of Information Retrieval and Text Mining because not only it produces massive unstructured textual data in real-time but also it serves as an influential channel for opinion leading. But most of the previous studies have adopted broad-brush and limited approaches. These approaches have made it difficult to find and analyze new information. To overcome these limitations, we developed a real-time Twitter trend mining system to capture the trend in real-time processing big stream datasets of Twitter. The system offers the functions of term co-occurrence retrieval, visualization of Twitter users by query, similarity calculation between two users, topic modeling to keep track of changes of topical trend, and mention-based user network analysis. In addition, we conducted a case study on the 2012 Korean presidential election. We collected 1,737,969 tweets which contain candidates' name and election on Twitter in Korea (http://www.twitter.com/) for one month in 2012 (October 1 to October 31). The case study shows that the system provides useful information and detects the trend of society effectively. The system also retrieves the list of terms co-occurred by given query terms. We compare the results of term co-occurrence retrieval by giving influential candidates' name, 'Geun Hae Park', 'Jae In Moon', and 'Chul Su Ahn' as query terms. General terms which are related to presidential election such as 'Presidential Election', 'Proclamation in Support', Public opinion poll' appear frequently. Also the results show specific terms that differentiate each candidate's feature such as 'Park Jung Hee' and 'Yuk Young Su' from the query 'Guen Hae Park', 'a single candidacy agreement' and 'Time of voting extension' from the query 'Jae In Moon' and 'a single candidacy agreement' and 'down contract' from the query 'Chul Su Ahn'. Our system not only extracts 10 topics along with related terms but also shows topics' dynamic changes over time by employing the multinomial Latent Dirichlet Allocation technique. Each topic can show one of two types of patterns-Rising tendency and Falling tendencydepending on the change of the probability distribution. To determine the relationship between topic trends in Twitter and social issues in the real world, we compare topic trends with related news articles. We are able to identify that Twitter can track the issue faster than the other media, newspapers. The user network in Twitter is different from those of other social media because of distinctive characteristics of making relationships in Twitter. Twitter users can make their relationships by exchanging mentions. We visualize and analyze mention based networks of 136,754 users. We put three candidates' name as query terms-Geun Hae Park', 'Jae In Moon', and 'Chul Su Ahn'. The results show that Twitter users mention all candidates' name regardless of their political tendencies. This case study discloses that Twitter could be an effective tool to detect and predict dynamic changes of social issues, and mention-based user networks could show different aspects of user behavior as a unique network that is uniquely found in Twitter.

Derivation of Digital Music's Ranking Change Through Time Series Clustering (시계열 군집분석을 통한 디지털 음원의 순위 변화 패턴 분류)

  • Yoo, In-Jin;Park, Do-Hyung
    • Journal of Intelligence and Information Systems
    • /
    • v.26 no.3
    • /
    • pp.171-191
    • /
    • 2020
  • This study focused on digital music, which is the most valuable cultural asset in the modern society and occupies a particularly important position in the flow of the Korean Wave. Digital music was collected based on the "Gaon Chart," a well-established music chart in Korea. Through this, the changes in the ranking of the music that entered the chart for 73 weeks were collected. Afterwards, patterns with similar characteristics were derived through time series cluster analysis. Then, a descriptive analysis was performed on the notable features of each pattern. The research process suggested by this study is as follows. First, in the data collection process, time series data was collected to check the ranking change of digital music. Subsequently, in the data processing stage, the collected data was matched with the rankings over time, and the music title and artist name were processed. Each analysis is then sequentially performed in two stages consisting of exploratory analysis and explanatory analysis. First, the data collection period was limited to the period before 'the music bulk buying phenomenon', a reliability issue related to music ranking in Korea. Specifically, it is 73 weeks starting from December 31, 2017 to January 06, 2018 as the first week, and from May 19, 2019 to May 25, 2019. And the analysis targets were limited to digital music released in Korea. In particular, digital music was collected based on the "Gaon Chart", a well-known music chart in Korea. Unlike private music charts that are being serviced in Korea, Gaon Charts are charts approved by government agencies and have basic reliability. Therefore, it can be considered that it has more public confidence than the ranking information provided by other services. The contents of the collected data are as follows. Data on the period and ranking, the name of the music, the name of the artist, the name of the album, the Gaon index, the production company, and the distribution company were collected for the music that entered the top 100 on the music chart within the collection period. Through data collection, 7,300 music, which were included in the top 100 on the music chart, were identified for a total of 73 weeks. On the other hand, in the case of digital music, since the cases included in the music chart for more than two weeks are frequent, the duplication of music is removed through the pre-processing process. For duplicate music, the number and location of the duplicated music were checked through the duplicate check function, and then deleted to form data for analysis. Through this, a list of 742 unique music for analysis among the 7,300-music data in advance was secured. A total of 742 songs were secured through previous data collection and pre-processing. In addition, a total of 16 patterns were derived through time series cluster analysis on the ranking change. Based on the patterns derived after that, two representative patterns were identified: 'Steady Seller' and 'One-Hit Wonder'. Furthermore, the two patterns were subdivided into five patterns in consideration of the survival period of the music and the music ranking. The important characteristics of each pattern are as follows. First, the artist's superstar effect and bandwagon effect were strong in the one-hit wonder-type pattern. Therefore, when consumers choose a digital music, they are strongly influenced by the superstar effect and the bandwagon effect. Second, through the Steady Seller pattern, we confirmed the music that have been chosen by consumers for a very long time. In addition, we checked the patterns of the most selected music through consumer needs. Contrary to popular belief, the steady seller: mid-term pattern, not the one-hit wonder pattern, received the most choices from consumers. Particularly noteworthy is that the 'Climbing the Chart' phenomenon, which is contrary to the existing pattern, was confirmed through the steady-seller pattern. This study focuses on the change in the ranking of music over time, a field that has been relatively alienated centering on digital music. In addition, a new approach to music research was attempted by subdividing the pattern of ranking change rather than predicting the success and ranking of music.

Open Digital Textbook for Smart Education (스마트교육을 위한 오픈 디지털교과서)

  • Koo, Young-Il;Park, Choong-Shik
    • Journal of Intelligence and Information Systems
    • /
    • v.19 no.2
    • /
    • pp.177-189
    • /
    • 2013
  • In Smart Education, the roles of digital textbook is very important as face-to-face media to learners. The standardization of digital textbook will promote the industrialization of digital textbook for contents providers and distributers as well as learner and instructors. In this study, the following three objectives-oriented digital textbooks are looking for ways to standardize. (1) digital textbooks should undertake the role of the media for blended learning which supports on-off classes, should be operating on common EPUB viewer without special dedicated viewer, should utilize the existing framework of the e-learning learning contents and learning management. The reason to consider the EPUB as the standard for digital textbooks is that digital textbooks don't need to specify antoher standard for the form of books, and can take advantage od industrial base with EPUB standards-rich content and distribution structure (2) digital textbooks should provide a low-cost open market service that are currently available as the standard open software (3) To provide appropriate learning feedback information to students, digital textbooks should provide a foundation which accumulates and manages all the learning activity information according to standard infrastructure for educational Big Data processing. In this study, the digital textbook in a smart education environment was referred to open digital textbook. The components of open digital textbooks service framework are (1) digital textbook terminals such as smart pad, smart TVs, smart phones, PC, etc., (2) digital textbooks platform to show and perform digital contents on digital textbook terminals, (3) learning contents repository, which exist on the cloud, maintains accredited learning, (4) App Store providing and distributing secondary learning contents and learning tools by learning contents developing companies, and (5) LMS as a learning support/management tool which on-site class teacher use for creating classroom instruction materials. In addition, locating all of the hardware and software implement a smart education service within the cloud must have take advantage of the cloud computing for efficient management and reducing expense. The open digital textbooks of smart education is consdered as providing e-book style interface of LMS to learners. In open digital textbooks, the representation of text, image, audio, video, equations, etc. is basic function. But painting, writing, problem solving, etc are beyond the capabilities of a simple e-book. The Communication of teacher-to-student, learner-to-learnert, tems-to-team is required by using the open digital textbook. To represent student demographics, portfolio information, and class information, the standard used in e-learning is desirable. To process learner tracking information about the activities of the learner for LMS(Learning Management System), open digital textbook must have the recording function and the commnincating function with LMS. DRM is a function for protecting various copyright. Currently DRMs of e-boook are controlled by the corresponding book viewer. If open digital textbook admitt DRM that is used in a variety of different DRM standards of various e-book viewer, the implementation of redundant features can be avoided. Security/privacy functions are required to protect information about the study or instruction from a third party UDL (Universal Design for Learning) is learning support function for those with disabilities have difficulty in learning courses. The open digital textbook, which is based on E-book standard EPUB 3.0, must (1) record the learning activity log information, and (2) communicate with the server to support the learning activity. While the recording function and the communication function, which is not determined on current standards, is implemented as a JavaScript and is utilized in the current EPUB 3.0 viewer, ths strategy of proposing such recording and communication functions as the next generation of e-book standard, or special standard (EPUB 3.0 for education) is needed. Future research in this study will implement open source program with the proposed open digital textbook standard and present a new educational services including Big Data analysis.

Subject-Balanced Intelligent Text Summarization Scheme (주제 균형 지능형 텍스트 요약 기법)

  • Yun, Yeoil;Ko, Eunjung;Kim, Namgyu
    • Journal of Intelligence and Information Systems
    • /
    • v.25 no.2
    • /
    • pp.141-166
    • /
    • 2019
  • Recently, channels like social media and SNS create enormous amount of data. In all kinds of data, portions of unstructured data which represented as text data has increased geometrically. But there are some difficulties to check all text data, so it is important to access those data rapidly and grasp key points of text. Due to needs of efficient understanding, many studies about text summarization for handling and using tremendous amounts of text data have been proposed. Especially, a lot of summarization methods using machine learning and artificial intelligence algorithms have been proposed lately to generate summary objectively and effectively which called "automatic summarization". However almost text summarization methods proposed up to date construct summary focused on frequency of contents in original documents. Those summaries have a limitation for contain small-weight subjects that mentioned less in original text. If summaries include contents with only major subject, bias occurs and it causes loss of information so that it is hard to ascertain every subject documents have. To avoid those bias, it is possible to summarize in point of balance between topics document have so all subject in document can be ascertained, but still unbalance of distribution between those subjects remains. To retain balance of subjects in summary, it is necessary to consider proportion of every subject documents originally have and also allocate the portion of subjects equally so that even sentences of minor subjects can be included in summary sufficiently. In this study, we propose "subject-balanced" text summarization method that procure balance between all subjects and minimize omission of low-frequency subjects. For subject-balanced summary, we use two concept of summary evaluation metrics "completeness" and "succinctness". Completeness is the feature that summary should include contents of original documents fully and succinctness means summary has minimum duplication with contents in itself. Proposed method has 3-phases for summarization. First phase is constructing subject term dictionaries. Topic modeling is used for calculating topic-term weight which indicates degrees that each terms are related to each topic. From derived weight, it is possible to figure out highly related terms for every topic and subjects of documents can be found from various topic composed similar meaning terms. And then, few terms are selected which represent subject well. In this method, it is called "seed terms". However, those terms are too small to explain each subject enough, so sufficient similar terms with seed terms are needed for well-constructed subject dictionary. Word2Vec is used for word expansion, finds similar terms with seed terms. Word vectors are created after Word2Vec modeling, and from those vectors, similarity between all terms can be derived by using cosine-similarity. Higher cosine similarity between two terms calculated, higher relationship between two terms defined. So terms that have high similarity values with seed terms for each subjects are selected and filtering those expanded terms subject dictionary is finally constructed. Next phase is allocating subjects to every sentences which original documents have. To grasp contents of all sentences first, frequency analysis is conducted with specific terms that subject dictionaries compose. TF-IDF weight of each subjects are calculated after frequency analysis, and it is possible to figure out how much sentences are explaining about each subjects. However, TF-IDF weight has limitation that the weight can be increased infinitely, so by normalizing TF-IDF weights for every subject sentences have, all values are changed to 0 to 1 values. Then allocating subject for every sentences with maximum TF-IDF weight between all subjects, sentence group are constructed for each subjects finally. Last phase is summary generation parts. Sen2Vec is used to figure out similarity between subject-sentences, and similarity matrix can be formed. By repetitive sentences selecting, it is possible to generate summary that include contents of original documents fully and minimize duplication in summary itself. For evaluation of proposed method, 50,000 reviews of TripAdvisor are used for constructing subject dictionaries and 23,087 reviews are used for generating summary. Also comparison between proposed method summary and frequency-based summary is performed and as a result, it is verified that summary from proposed method can retain balance of all subject more which documents originally have.

A Study of Anomaly Detection for ICT Infrastructure using Conditional Multimodal Autoencoder (ICT 인프라 이상탐지를 위한 조건부 멀티모달 오토인코더에 관한 연구)

  • Shin, Byungjin;Lee, Jonghoon;Han, Sangjin;Park, Choong-Shik
    • Journal of Intelligence and Information Systems
    • /
    • v.27 no.3
    • /
    • pp.57-73
    • /
    • 2021
  • Maintenance and prevention of failure through anomaly detection of ICT infrastructure is becoming important. System monitoring data is multidimensional time series data. When we deal with multidimensional time series data, we have difficulty in considering both characteristics of multidimensional data and characteristics of time series data. When dealing with multidimensional data, correlation between variables should be considered. Existing methods such as probability and linear base, distance base, etc. are degraded due to limitations called the curse of dimensions. In addition, time series data is preprocessed by applying sliding window technique and time series decomposition for self-correlation analysis. These techniques are the cause of increasing the dimension of data, so it is necessary to supplement them. The anomaly detection field is an old research field, and statistical methods and regression analysis were used in the early days. Currently, there are active studies to apply machine learning and artificial neural network technology to this field. Statistically based methods are difficult to apply when data is non-homogeneous, and do not detect local outliers well. The regression analysis method compares the predictive value and the actual value after learning the regression formula based on the parametric statistics and it detects abnormality. Anomaly detection using regression analysis has the disadvantage that the performance is lowered when the model is not solid and the noise or outliers of the data are included. There is a restriction that learning data with noise or outliers should be used. The autoencoder using artificial neural networks is learned to output as similar as possible to input data. It has many advantages compared to existing probability and linear model, cluster analysis, and map learning. It can be applied to data that does not satisfy probability distribution or linear assumption. In addition, it is possible to learn non-mapping without label data for teaching. However, there is a limitation of local outlier identification of multidimensional data in anomaly detection, and there is a problem that the dimension of data is greatly increased due to the characteristics of time series data. In this study, we propose a CMAE (Conditional Multimodal Autoencoder) that enhances the performance of anomaly detection by considering local outliers and time series characteristics. First, we applied Multimodal Autoencoder (MAE) to improve the limitations of local outlier identification of multidimensional data. Multimodals are commonly used to learn different types of inputs, such as voice and image. The different modal shares the bottleneck effect of Autoencoder and it learns correlation. In addition, CAE (Conditional Autoencoder) was used to learn the characteristics of time series data effectively without increasing the dimension of data. In general, conditional input mainly uses category variables, but in this study, time was used as a condition to learn periodicity. The CMAE model proposed in this paper was verified by comparing with the Unimodal Autoencoder (UAE) and Multi-modal Autoencoder (MAE). The restoration performance of Autoencoder for 41 variables was confirmed in the proposed model and the comparison model. The restoration performance is different by variables, and the restoration is normally well operated because the loss value is small for Memory, Disk, and Network modals in all three Autoencoder models. The process modal did not show a significant difference in all three models, and the CPU modal showed excellent performance in CMAE. ROC curve was prepared for the evaluation of anomaly detection performance in the proposed model and the comparison model, and AUC, accuracy, precision, recall, and F1-score were compared. In all indicators, the performance was shown in the order of CMAE, MAE, and AE. Especially, the reproduction rate was 0.9828 for CMAE, which can be confirmed to detect almost most of the abnormalities. The accuracy of the model was also improved and 87.12%, and the F1-score was 0.8883, which is considered to be suitable for anomaly detection. In practical aspect, the proposed model has an additional advantage in addition to performance improvement. The use of techniques such as time series decomposition and sliding windows has the disadvantage of managing unnecessary procedures; and their dimensional increase can cause a decrease in the computational speed in inference.The proposed model has characteristics that are easy to apply to practical tasks such as inference speed and model management.