Subject-Balanced Intelligent Text Summarization Scheme (주제 균형 지능형 텍스트 요약 기법)
Journal of Intelligence and Information Systems, v.25 no.2, pp.141-166, 2019
Channels such as social media and SNS now generate enormous amounts of data, and the share of unstructured text data within it has grown geometrically. Because it is impractical to read all of this text, it is important to access it quickly and grasp its key points. To meet this need, many text summarization studies have been proposed for handling and using very large volumes of text. In particular, many recent methods, known as "automatic summarization", use machine learning and artificial intelligence algorithms to generate summaries objectively and effectively. However, almost all summarization methods proposed to date construct summaries based on how frequently content appears in the original documents. Such summaries tend to omit low-weight subjects that are mentioned less often in the original text. If a summary covers only the major subjects, it is biased and loses information, so it is hard to identify every subject the documents contain. One way to avoid this bias is to summarize with attention to the balance between the topics in a document so that every subject can be identified, but an imbalance in the distribution of those subjects still remains. To keep the subjects balanced in the summary, it is necessary to consider the proportion of every subject in the original documents and to allocate space to the subjects equally, so that sentences on minor subjects are also sufficiently represented. In this study, we propose a "subject-balanced" text summarization method that secures balance among all subjects and minimizes the omission of low-frequency subjects. For the subject-balanced summary, we use two summary evaluation criteria, "completeness" and "succinctness".
Completeness means that the summary should fully cover the contents of the original documents, and succinctness means that the summary has minimal duplication within itself. The proposed method has three phases. The first phase constructs subject term dictionaries. Topic modeling is used to calculate topic-term weights, which indicate how strongly each term is related to each topic. From these weights, the terms most closely related to each topic can be identified, and the subjects of the documents can be found from topics composed of terms with similar meanings. A few terms that represent each subject well are then selected; in this method they are called "seed terms". Because the seed terms are too few to describe each subject adequately, enough terms similar to them are needed to build a well-constructed subject dictionary. Word2Vec is used for this word expansion, that is, to find terms similar to the seed terms. Word vectors are created by Word2Vec modeling, and the similarity between any two terms can be derived from those vectors using cosine similarity: the higher the cosine similarity between two terms, the stronger their relationship is taken to be. Terms with high similarity to the seed terms of each subject are selected, and after these expanded terms are filtered the subject dictionary is complete. The second phase allocates a subject to every sentence in the original documents. To grasp the content of each sentence, frequency analysis is first conducted on the terms in the subject dictionaries. The TF-IDF weight of each subject is then calculated, which shows how much each sentence says about each subject. However, TF-IDF weights have no upper bound, so the weights for every subject are normalized to values between 0 and 1.
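The seed-term expansion and the weight normalization can be illustrated with a minimal sketch. It assumes word vectors are already available as plain lists. The function names, the `top_n` cutoff, and the choice of min-max scaling are illustrative assumptions, not details taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def expand_seed_terms(seeds, vectors, top_n=2):
    """Return the top_n non-seed terms most similar to any seed term.

    `vectors` maps term -> word vector (hypothetical toy data here;
    in the paper these come from a trained Word2Vec model).
    """
    scores = {}
    for term, vec in vectors.items():
        if term in seeds:
            continue
        scores[term] = max(cosine(vec, vectors[s]) for s in seeds)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

def min_max(values):
    """Rescale unbounded weights into the range 0..1 (assumed scheme)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Toy vectors: "bed" and "suite" point in nearly the same direction as "room".
toy_vectors = {
    "room": [1.0, 0.0],
    "bed": [0.9, 0.1],
    "suite": [0.8, 0.3],
    "pasta": [0.0, 1.0],
}
expanded = expand_seed_terms({"room"}, toy_vectors, top_n=2)
scaled = min_max([2.0, 0.0, 1.0])
```

The paper's filtering step after expansion is not specified in the abstract, so it is left out of this sketch.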
Each sentence is then allocated to the subject with its maximum normalized TF-IDF weight, which forms a sentence group for each subject. The last phase is summary generation. Sen2Vec is used to measure the similarity between the sentences of a subject, and a similarity matrix is formed. By selecting sentences iteratively, it is possible to generate a summary that fully covers the contents of the original documents while minimizing duplication within the summary itself. To evaluate the proposed method, 50,000 TripAdvisor reviews were used to construct the subject dictionaries and 23,087 reviews were used to generate summaries. A comparison between summaries from the proposed method and frequency-based summaries verified that the proposed method better preserves the balance of all subjects originally present in the documents.
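The allocation and selection steps can be sketched as follows. The abstract does not give the exact selection rule, so the greedy criterion here is an assumption: a sentence's average similarity to its group stands in for completeness, and its highest similarity to an already selected sentence is subtracted as a redundancy penalty for succinctness. All names and toy vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def allocate_subjects(weights):
    """Group sentence indices by the subject with the largest weight.

    `weights` is a list with one dict per sentence: {subject: weight in 0..1}.
    """
    groups = {}
    for index, row in enumerate(weights):
        best = max(row, key=row.get)
        groups.setdefault(best, []).append(index)
    return groups

def select_sentences(vectors, group, k):
    """Greedily pick up to k sentences from one subject group.

    Assumed criterion: coverage of the group minus redundancy with
    the sentences already chosen.
    """
    chosen = []
    candidates = list(group)

    def gain(i):
        coverage = sum(cosine(vectors[i], vectors[j]) for j in group) / len(group)
        redundancy = max((cosine(vectors[i], vectors[j]) for j in chosen), default=0.0)
        return coverage - redundancy

    while candidates and len(chosen) < k:
        best = max(candidates, key=gain)
        chosen.append(best)
        candidates.remove(best)
    return chosen

# Toy data: sentence 2 mentions both subjects but leans toward "room".
groups = allocate_subjects([
    {"room": 1.0, "food": 0.0},
    {"room": 0.0, "food": 1.0},
    {"room": 0.5, "food": 0.25},
])
# Sentences 0 and 1 are duplicates, so the second pick should be sentence 2.
picked = select_sentences([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]], [0, 1, 2], k=2)
```

Running the selection once per subject group and taking the same number of sentences from each is one simple way to keep minor subjects in the summary.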
Maintaining ICT infrastructure and preventing failures through anomaly detection is becoming increasingly important. System monitoring data is multidimensional time series data, and it is difficult to take into account the characteristics of both multidimensional data and time series data at once. With multidimensional data, the correlation between variables must be considered. Existing methods, such as probability-based, linear, and distance-based approaches, degrade because of the curse of dimensionality. In addition, time series data is usually preprocessed with a sliding window and time series decomposition for autocorrelation analysis. These techniques increase the dimension of the data, so they need to be supplemented. Anomaly detection is a long-established research field; statistical methods and regression analysis were used in the early days, and machine learning and artificial neural networks are now being actively applied. Statistical methods are difficult to apply when the data is non-homogeneous and do not detect local outliers well. Regression-based methods learn a regression formula based on parametric statistics and then detect anomalies by comparing predicted and actual values. Their performance drops when the model is not robust or when the data contains noise or outliers, which imposes the restriction that the training data must be free of noise and outliers. An autoencoder built from artificial neural networks is trained to produce output as similar as possible to its input. It has many advantages over existing probability and linear models, cluster analysis, and supervised learning. It can be applied to data that does not satisfy probability distribution or linearity assumptions.
In addition, it can be trained in an unsupervised manner, without labeled training data. However, autoencoders are limited in identifying local outliers in multidimensional data, and the characteristics of time series data greatly increase the data dimension. In this study, we propose CMAE (Conditional Multimodal Autoencoder), which improves anomaly detection performance by taking local outliers and time series characteristics into account. First, we applied a Multimodal Autoencoder (MAE) to address the limitation in identifying local outliers in multidimensional data. Multimodal models are commonly used to learn from different types of input, such as voice and image. The modalities share the autoencoder's bottleneck, through which correlations are learned. In addition, a CAE (Conditional Autoencoder) was used to learn the characteristics of time series data effectively without increasing the data dimension. Conditional inputs are usually categorical variables, but in this study time was used as the condition in order to learn periodicity. The proposed CMAE model was verified by comparison with a Unimodal Autoencoder (UAE) and a Multimodal Autoencoder (MAE). The reconstruction performance of the autoencoders for 41 variables was examined in the proposed and comparison models. Reconstruction performance differs by variable; reconstruction works well for the Memory, Disk, and Network modalities in all three autoencoder models, where the loss is small. The Process modality showed no significant difference among the three models, and the CPU modality showed excellent performance with CMAE. ROC curves were drawn to evaluate anomaly detection performance, and AUC, accuracy, precision, recall, and F1-score were compared. On all indicators, performance ranked in the order CMAE, MAE, and UAE.
In particular, recall was 0.9828 for CMAE, which means it detects almost all anomalies. Accuracy also improved, to 87.12%, and the F1-score was 0.8883, which is considered suitable for anomaly detection. In practical terms, the proposed model has an advantage beyond the performance improvement. Techniques such as time series decomposition and sliding windows add procedures that must be managed, and the dimensional increase they cause can slow computation at inference time. The proposed model is easy to apply in practice in terms of inference speed and model management.
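The evaluation metrics named above can be made concrete with a short sketch. It assumes the common autoencoder convention of flagging an observation as anomalous when its reconstruction error exceeds a threshold. The abstract does not state the paper's decision rule, so the threshold and the toy error values below are hypothetical.

```python
def flag_anomalies(errors, threshold):
    """Mark an observation as anomalous when its reconstruction error
    exceeds the threshold (assumed decision rule)."""
    return [error > threshold for error in errors]

def classification_metrics(predicted, actual):
    """Accuracy, precision, recall and F1 from boolean predictions and labels."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(not p and a for p, a in zip(predicted, actual))
    tn = sum(not p and not a for p, a in zip(predicted, actual))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(actual),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Hypothetical reconstruction errors and ground-truth anomaly labels.
errors = [0.1, 0.9, 0.8, 0.2, 0.7]
labels = [False, True, False, False, True]
predictions = flag_anomalies(errors, threshold=0.5)
metrics = classification_metrics(predictions, labels)
```

In this toy case every true anomaly is caught (recall 1.0) at the cost of one false alarm, which is the same trade-off a high recall with a lower F1-score indicates.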
The wall shear stress in the vicinity of end-to-end anastomoses under steady flow conditions was measured using a flush-mounted hot-film anemometer (FMHFA) probe. The experimental measurements were in good agreement with numerical results except in flow with low Reynolds numbers. The wall shear stress increased proximal to the anastomosis in flow from the Penrose tubing (simulating an artery) to the PTFE graft. In flow from the PTFE graft to the Penrose tubing, low wall shear stress was observed distal to the anastomosis. Abnormal distributions of wall shear stress in the vicinity of the anastomosis, resulting from the compliance mismatch between the graft and the host artery, might be an important factor in the formation of anastomotic neointimal fibrous hyperplasia (ANFH) and in graft failure. The present study suggests a correlation between regions of low wall shear stress and the development of ANFH in end-to-end anastomoses.
Air pressure decay (APD) rate and ultrafiltration rate (UFR) tests were performed on new and saline-rinsed dialyzers as well as on those reused in patients several times. C-DAK 4000 (Cordis Dow) and CF IS-11 (Baxter Travenol) reused dialyzers obtained from the dialysis clinic were used in the present study. The new dialyzers exhibited a relatively flat APD, whereas saline-rinsed and reused dialyzers showed a considerable amount of decay. C-DAK dialyzers had a larger APD(11.70
The purpose of this paper was to schedule an optimum cutting strategy that maximizes total yield from an industrial forest under given restrictions on periodic timber removals and harvest areas, using linear programming. The sensitivity of the regulation model to variations in the restrictions was also analyzed to learn how total yield changes over the planning period. The regulation procedure was applied to the experimental forest of the Agricultural College of Seoul National University. The forest is composed of 219 cutting units and is characterized by young age groups, which are very common in Korea. The planning period is divided into 10 cutting periods of five years each, and cutting is permissible only in stands of age groups 5-9. The study also assumes that subsequent forests are established immediately after existing forests are cut, that non-stocked forest lands are planted in the first cutting period, and that established forests remain fully stocked until the next harvest. All feasible cutting regimes were defined for each unit according to its age group. The total yield (Vi,k) of each regime expected in the planning period was projected using stand yield tables and forest inventory data, and the regime giving the highest Vi,k was selected as the optimum cutting regime. After the periodic yields, cutting areas, and total yield were calculated from the optimum regimes selected without any restrictions, the upper and lower limits of periodic yields (Vj-max, Vj-min) and of periodic cutting areas (Aj-max, Aj-min) were decided. The optimum regimes under these restrictions were then selected by linear programming. The results of the study may be summarized as follows: 1. The fluctuations of periodic harvest yields and areas under the cutting regimes selected without restrictions were very large, because of the irregular composition of age classes and growing stocks of the existing stands.
About 68.8 percent of the total yield is expected in period 10, while no yield is expected in periods 6 and 7. 2. After inspection of the above solution, restricted optimum cutting regimes were obtained under the restrictions of Amin=150 ha, Amax=400 ha,
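The two-stage procedure described in this abstract (pick the highest-yield regime per unit, then re-select under periodic limits) can be illustrated on a toy forest. This sketch is not the paper's linear program: it enumerates whole-unit regime choices by brute force as a stand-in, restricts only periodic yields rather than areas, and uses made-up yields for two units and two periods.

```python
from itertools import product

def best_unrestricted(regimes):
    """For each unit, pick the regime index with the highest total yield."""
    return [
        max(range(len(options)), key=lambda r: sum(options[r]))
        for options in regimes
    ]

def periodic_yields(regimes, choice):
    """Sum the yield of the chosen regimes in each cutting period."""
    n_periods = len(regimes[0][0])
    return [
        sum(regimes[unit][r][period] for unit, r in enumerate(choice))
        for period in range(n_periods)
    ]

def best_restricted(regimes, v_min, v_max):
    """Brute-force stand-in for the LP: maximize total yield while every
    periodic yield stays within [v_min, v_max]. Returns (choice, total),
    or (None, None) when no combination is feasible."""
    best_choice, best_total = None, None
    for choice in product(*(range(len(options)) for options in regimes)):
        totals = periodic_yields(regimes, choice)
        if all(v_min <= t <= v_max for t in totals):
            total = sum(totals)
            if best_total is None or total > best_total:
                best_choice, best_total = choice, total
    return best_choice, best_total

# regimes[unit][regime] = yield per period (hypothetical values).
toy_regimes = [
    [[0, 100], [60, 0]],  # unit 0: cut late (100) or early (60)
    [[0, 90], [55, 0]],   # unit 1: cut late (90) or early (55)
]
free_choice = best_unrestricted(toy_regimes)
free_yields = periodic_yields(toy_regimes, free_choice)
limited_choice, limited_total = best_restricted(toy_regimes, v_min=40, v_max=120)
```

Without limits, all yield falls in the last period, which mirrors the fluctuation reported in result 1. With limits, total yield drops but the harvest is spread across periods. Enumeration is only workable for a toy case; a forest with 219 units needs an LP solver.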
Background: The prevalence of tuberculosis in Korea has decreased remarkably over the past 30 years, while the incidence of disease caused by mycobacteria other than tuberculosis is unknown. The Korean Academy of Tuberculosis and Respiratory Diseases performed a national survey to estimate the incidence of mycobacterial diseases other than tuberculosis in Korea. We analyzed the clinical data of confirmed cases for use in the practice of primary care physicians and pulmonary specialists. Methods: The study period was from January 1981 to October 1994. We collected the data retrospectively by correspondence with physicians in the hospitals that had referred specimens to the Korean Institute of Tuberculosis, The Korean National Tuberculosis Association, for the detection of mycobacteria other than tuberculosis. For confirmed cases, we obtained detailed records of clinical, laboratory, and radiological findings using protocols. Results: 1) Mycobacterial diseases other than tuberculosis were confirmed in 1 case in 1981, 2 cases in 1982, 4 cases in 1983, 2 cases in 1984, 5 cases in 1985, 1 case in 1986, 3 cases in 1987, 1 case in 1988, 6 cases in 1989, 9 cases in 1990, 14 cases in 1991, 10 cases in 1992, 4 cases in 1993, and 96 cases in 1994. Cases since 1990 numbered 133 (84.2% of the total). 2) Fifty-seven percent of patients were over 60 years of age. The ratio of male to female patients was 2.6:1. 3) By referring institution, 61 cases (38.6%) were referred from the Double Cross Clinic, 42 cases (26.6%) from health centers, 21 cases (13.3%) from tertiary referral hospitals, 15 cases (9.5%) from secondary referral hospitals, and 10 cases (6.3%) from primary care hospitals. By region, 98 cases (62%) were in Seoul, 17 cases (10.8%) in Gyeongsangbuk-do, 12 cases (7.6%) in Kyongki-do, 8 cases (5.1%) in Chungchongnam-do, 5 cases (3.2%) each in Gyeongsangnam-do and Chungchongbuk-do, and 6 cases (3.8%) in other areas.
4) Among the isolated species of mycobacteria other than tuberculosis, M. avium-intracellulare was found in 104 cases (65.2%), M. fortuitum in 20 cases (12.7%), M. chelonae in 15 cases (9.5%), M. gordonae in 7 cases (4.4%), M. terrae in 5 cases (3.2%), M. scrofulaceum in 3 cases (1.9%), M. kansasii and M. szulgai in 2 cases (1.3%) each, and M. avium-intracellulare coexisting with M. terrae in 1 case (0.6%). 5) Pre-existing pulmonary diseases were pulmonary tuberculosis in 113 cases (71.5%), bronchiectasis in 6 cases (3.8%), chronic bronchitis in 10 cases (6.3%), and pulmonary fibrosis in 6 cases (3.8%). Pulmonary tuberculosis had been diagnosed within the previous year in 7 cases (6.2%), 2~5 years earlier in 32 cases (28.3%), 6~10 years earlier in 29 cases (25.7%), 11~15 years earlier in 16 cases (14.2%), 16~20 years earlier in 15 cases (13.3%), and more than 20 years earlier in 14 cases (12.4%). The duration of anti-tuberculous treatment was within 3 months in 6 cases (5.3%), 4~6 months in 17 cases (15%), 7~9 months in 16 cases (14.2%), 10~12 months in 11 cases (9.7%), 1~2 years in 21 cases (18.6%), and over 2 years in 8 cases (7.1%). The results of treatment were cure in 44 cases (27.9%) and failure in 25 cases (15.8%). 6) Associated extra-pulmonary diseases were chronic liver disease coexisting with chronic renal failure in 1 case (0.6%), diabetes mellitus in 9 cases (5.7%), cardiovascular diseases in 2 cases (1.3%), long-term steroid therapy in 2 cases (1.3%), and chronic liver disease, chronic renal failure, colitis, and pneumoconiosis in 1 case (0.6%) each. 7) The clinical presentations of mycobacterial diseases other than tuberculosis were chronic pulmonary infection in 86 cases (54.4%), lymphadenitis of the cervical or other sites in 1 case (0.6%), endobronchial tuberculosis in 3 cases (1.9%), and intestinal tuberculosis in 1 case (0.6%). 8) The symptoms of patients were cough (62%), sputum (61.4%), dyspnea (30.4%), hemoptysis or blood-tinged sputum (20.9%), weight loss (13.3%), fever (6.3%), and others (4.4%).
9) Smear-negative, culture-negative cases numbered 24 (15.2%) at the first examination, 27 (17.1%) at the second, 22 (13.9%) at the third, and 17 (10.8%) at the fourth. Smear-negative, culture-positive cases numbered 59 (37.3%) at the first examination, 36 (22.8%) at the second, 24 (15.2%) at the third, and 23 (14.6%) at the fourth. Smear-positive, culture-negative cases numbered 1 (0.6%) at the first examination, 4 (2.5%) at the second, 1 (0.6%) at the third, and 2 (1.3%) at the fourth. Smear-positive, culture-positive cases numbered 48 (30.4%) at the first examination, 34 (21.5%) at the second, 34 (21.5%) at the third, and 22 (13.9%) at the fourth. 10) The specimens from which mycobacteria other than tuberculosis were isolated were sputum in 143 cases (90.5%), sputum and bronchial washing in 4 cases (2.5%), and bronchial washing in 1 case (0.6%). 11) Drug resistance rates across all species of mycobacteria other than tuberculosis were INH 62%, EMB 55.7%, RMP 52.5%, PZA 34.8%, OFX 29.1%, SM 36.7%, KM 27.2%, TUM 24.1%, CS 23.4%, TH 34.2%, and PAS 44.9%. For M. avium-intracellulare, the rates were INH 62.5%, EMB 59.6%, RMP 51.9%, PZA 29.8%, OFX 33.7%, SM 30.8%, KM 20.2%, TUM 17.3%, CS 14.4%, TH 31.7%, and PAS 38.5%. For M. chelonae, the rates were INH 66.7%, EMB 66.7%, RMP 66.7%, PZA 40%, OFX 26.7%, SM 66.7%, KM 53.3%, TUM 53.3%, CS 60%, TH 53.3%, and PAS 66.7%. For M. fortuitum, the rates were INH 65%, EMB 55%, RMP 65%, PZA 50%, OFX 25%, SM 55%, KM 45%, TUM 55%, CS 65%, TH 45%, and PAS 60%. 12) Disease activity on chest roentgenogram was no active disease in 7 cases (4.4%), mild in 20 cases (12.7%), moderate in 67 cases (42.4%), and severe in 47 cases (29.8%). Cavities were found in 43 cases (27.2%) and pleurisy in 18 cases (11.4%). 13) Treatment of mycobacterial diseases other than tuberculosis was given in 129 cases (81.7%).
Among cases treated with first-line anti-tuberculous drugs, combination chemotherapy including both INH and RMP was given in 86 cases (66.7%), including INH or RMP in 30 cases (23.3%), and including neither INH nor RMP in 9 cases (7%). Among the 65 cases treated with second-line anti-tuberculous drugs, combination chemotherapy included 2 or fewer drugs in 2 cases (3.1%), 3 drugs in 15 cases (23.1%), 4 drugs in 20 cases (30.8%), 5 drugs in 9 cases (13.8%), and 6 or more drugs in 19 cases (29.2%). The results of treatment were improvement in 36 cases (27.9%), no interval change in 65 cases (50.4%), aggravation in 4 cases (3.1%), and death in 4 cases (3.1%). Of the 36 improved cases, 34 (94.4%) attained negative conversion of mycobacteria other than tuberculosis on culture. Negative conversion on culture was attained within 1 month in 2 cases (1.3%), within 3 months in 11 cases (7%), within 6 months in 14 cases (8.9%), within 1 year in 2 cases (1.3%), and after more than 1 year in 1 case (0.6%). Conclusion: Clinical, laboratory, and radiological findings of mycobacterial diseases other than tuberculosis were summarized. These data will assist in detecting more mycobacterial diseases other than tuberculosis in Korea in the near future.
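The yearly counts in Results 1) can be tallied to check the reported share of cases since 1990. This is only an arithmetic check. It assumes that the entry of 14 cases belongs to 1991, since the year 1990 would otherwise be listed twice.

```python
# Confirmed cases per year as listed in Results 1).
# Assumption: the 14-case entry is for 1991.
cases_by_year = {
    1981: 1, 1982: 2, 1983: 4, 1984: 2, 1985: 5, 1986: 1, 1987: 3,
    1988: 1, 1989: 6, 1990: 9, 1991: 14, 1992: 10, 1993: 4, 1994: 96,
}

total_cases = sum(cases_by_year.values())
cases_since_1990 = sum(n for year, n in cases_by_year.items() if year >= 1990)
share_since_1990 = round(100 * cases_since_1990 / total_cases, 1)
```

The tally gives 158 cases in total and 133 since 1990, which matches the 84.2% stated in the abstract.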
Volatility in stock market returns is a measure of investment risk. It plays a central role in portfolio optimization, asset pricing, and risk management, as well as in most theoretical financial models. Engle (1982) presented a pioneering paper on stock market volatility that explains the time-varying characteristics of stock market return volatility. His model, Autoregressive Conditional Heteroscedasticity (ARCH), was generalized by Bollerslev (1986) into the GARCH models. Empirical studies have shown that GARCH models describe well the fat-tailed return distributions and the volatility clustering that appear in stock prices. The parameters of GARCH models are generally estimated by maximum likelihood estimation (MLE) based on the standard normal density. However, since Black Monday in 1987, stock market prices have become very complex and noisy. Recent studies have begun to apply artificial intelligence approaches to estimating GARCH parameters as a substitute for MLE. This paper presents an SVR-based GARCH process and compares it with the MLE-based GARCH process for estimating the parameters of GARCH models, which are known to forecast stock market volatility well. The kernel functions used in the SVR estimation process are linear, polynomial, and radial. We analyzed the suggested models with the KOSPI 200 Index, which consists of 200 blue chip stocks listed on the Korea Exchange. We sampled KOSPI 200 daily closing values from 2010 to 2015, giving 1,487 daily observations. We used 1,187 days to train the suggested GARCH models and the remaining 300 days as testing data. First, symmetric and asymmetric GARCH models were estimated by MLE. We forecasted KOSPI 200 Index return volatility, and the MSE metric shows better results for asymmetric GARCH models such as E-GARCH and GJR-GARCH. This is consistent with the documented non-normal return distribution characteristics of fat tails and leptokurtosis.
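For readers unfamiliar with GARCH, the symmetric GARCH(1,1) conditional variance follows the standard recursion sigma2[t] = omega + alpha * r[t-1]^2 + beta * sigma2[t-1]. The sketch below only applies that recursion with given parameters. It does not estimate them by MLE or SVR, and the parameter values and returns are illustrative assumptions, not figures from the paper.

```python
def garch11_variances(returns, omega, alpha, beta, initial_variance):
    """Apply the GARCH(1,1) recursion to a series of (demeaned) returns.

    Returns a list one element longer than `returns`: the starting
    variance, then the conditional variance after each observed return.
    The final element is the one-step-ahead variance forecast.
    """
    variances = [initial_variance]
    for r in returns:
        next_variance = omega + alpha * r * r + beta * variances[-1]
        variances.append(next_variance)
    return variances

# Illustrative parameters; alpha + beta < 1 keeps the process stationary.
path = garch11_variances(
    returns=[1.0, -2.0], omega=0.1, alpha=0.2, beta=0.7, initial_variance=1.0
)
```

The large second return raises the next variance from 1.0 to 1.6 regardless of its sign. That symmetry is what asymmetric variants such as E-GARCH and GJR-GARCH relax.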
Compared with the MLE estimation process, SVR-based GARCH models outperform the MLE methodology in KOSPI 200 Index return volatility forecasting. The polynomial kernel function shows exceptionally low forecasting accuracy. We suggest an Intelligent Volatility Trading System (IVTS) that utilizes the forecasted volatility. The IVTS entry rules are as follows. If tomorrow's volatility is forecast to increase, buy volatility today. If tomorrow's volatility is forecast to decrease, sell volatility today. If the forecasted direction of volatility does not change, hold the existing buy or sell position. IVTS is assumed to buy and sell historical volatility values. This is somewhat unrealistic because historical volatility values themselves cannot be traded. Nevertheless, our simulation results are meaningful because the Korea Exchange introduced a tradable volatility futures contract in November 2014. The trading systems with SVR-based GARCH models show higher returns than those with MLE-based GARCH in the testing period. The percentage of profitable trades ranges from 47.5% to 50.0% for the MLE-based GARCH IVTS models and from 51.8% to 59.7% for the SVR-based GARCH IVTS models. MLE-based symmetric S-GARCH shows a +150.2% return and SVR-based symmetric S-GARCH shows a +526.4% return. MLE-based asymmetric E-GARCH shows a -72% return and SVR-based asymmetric E-GARCH shows a +245.6% return. MLE-based asymmetric GJR-GARCH shows a -98.7% return and SVR-based asymmetric GJR-GARCH shows a +126.3% return. The linear kernel function shows higher trading returns than the radial kernel function. The best performance of the SVR-based IVTS is +526.4% and that of the MLE-based IVTS is +150.2%. The SVR-based GARCH IVTS also trades more frequently. This study has some limitations. Our models are based solely on SVR; other artificial intelligence models should be explored in search of better performance. We do not consider costs incurred in trading, including brokerage commissions and slippage.
IVTS trading performance is not realistic, since we use historical volatility values as trading objects. Accurate forecasting of stock market volatility is essential in real trading as well as in asset pricing models. Further studies on other machine learning-based GARCH models can give better information to stock market investors.
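The IVTS entry rules can be written down as a small sketch. It reflects one reading of the abstract: a position of +1 (long volatility) is taken when tomorrow's forecast exceeds today's value, -1 (short) when it is lower, and the previous position is kept otherwise. Function names and the toy series are hypothetical, and returns and trading costs are not modeled.

```python
def ivts_positions(today_volatility, forecast_tomorrow):
    """Return the position held after each day: +1 long, -1 short, 0 flat.

    Assumed rule: buy when the forecast is above today's volatility,
    sell when it is below, otherwise hold the existing position.
    """
    positions = []
    current = 0  # flat before the first signal
    for now, forecast in zip(today_volatility, forecast_tomorrow):
        if forecast > now:
            current = 1
        elif forecast < now:
            current = -1
        positions.append(current)
    return positions

def count_trades(positions):
    """Count how often the position changes, starting from flat."""
    trades = 0
    previous = 0
    for position in positions:
        if position != previous:
            trades += 1
        previous = position
    return trades

# Toy series: forecast rises twice, then falls, then is unchanged.
positions = ivts_positions(
    today_volatility=[10.0, 12.0, 11.0, 11.0],
    forecast_tomorrow=[12.0, 13.0, 9.0, 11.0],
)
trades = count_trades(positions)
```

Counting position changes in this way is one simple measure of trading frequency, the quantity on which the SVR-based and MLE-based systems are compared.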