• Title/Summary/Keyword: Sparse Data Set

OLAP System and Performance Evaluation for Analyzing Web Log Data (웹 로그 분석을 위한 OLAP 시스템 및 성능 평가)

  • 김지현;용환승
    • Journal of Korea Multimedia Society / v.6 no.5 / pp.909-920 / 2003
  • IT for customer relationship management (CRM) has grown rapidly in recent years. Typical techniques include statistical analysis tools, on-line analytical processing (OLAP) tools, and data mining algorithms (such as neural networks, decision trees, and association rules). Among customer data, web log data is especially important, and OLAP technology can be applied to analyze it multidimensionally and use it efficiently. Building an OLAP cube requires precalculating multidimensional summary results to achieve fast response times. However, as the number of dimensions and sparse cells increases, data explosion becomes severe and OLAP performance degrades. In this paper, we explain why sparsity occurs in web log data and what kinds of sparsity patterns arise in two and three dimensions for OLAP. Based on this analysis, we set up multidimensional data models and query models for benchmarking each sparsity pattern. Finally, we evaluate the performance of three OLAP systems (MS SQL 2000 Analysis Service, Oracle Express, and C-MOLAP).
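
The link between dimensionality, sparsity, and data explosion described in this abstract can be made concrete with a small sketch. The dimension names, cardinalities, and record count below are hypothetical and are not taken from the paper; the density is only an upper bound, since several log records may fall into the same cell.

```python
from math import prod

def cube_stats(cardinalities, n_facts):
    """Return (total cells, upper bound on density, number of cuboids).

    cardinalities: number of distinct members per dimension.
    n_facts: number of fact records loaded into the cube.
    """
    total_cells = prod(cardinalities)
    # At most n_facts cells of the base cuboid can be non-empty.
    density = min(n_facts, total_cells) / total_cells
    # Without hierarchies, a d-dimensional cube has 2**d group-by cuboids
    # that could be precalculated.
    n_cuboids = 2 ** len(cardinalities)
    return total_cells, density, n_cuboids

# Hypothetical web-log dimensions (illustrative values only).
dims = {"page": 1000, "visitor": 5000, "hour": 24}
n_facts = 200_000

cards = []
for name, size in dims.items():
    cards.append(size)
    cells, density, cuboids = cube_stats(cards, n_facts)
    print(f"+{name}: cells={cells}, density<={density:.4f}, cuboids={cuboids}")
```

Each added dimension multiplies the cell count while the number of facts stays fixed, so the achievable density falls and the number of aggregates to precalculate doubles.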

Feature-selection algorithm based on genetic algorithms using unstructured data for attack mail identification (공격 메일 식별을 위한 비정형 데이터를 사용한 유전자 알고리즘 기반의 특징선택 알고리즘)

  • Hong, Sung-Sam;Kim, Dong-Wook;Han, Myung-Mook
    • Journal of Internet Computing and Services / v.20 no.1 / pp.1-10 / 2019
  • Because big-data text mining extracts a large number of features, clustering and classification can suffer from high computational complexity and unreliable results. In particular, the term-document matrix obtained through text mining represents term-document features but is sparse. We designed an advanced genetic algorithm (GA) to select features in text mining for a detection model. Term frequency-inverse document frequency (TF-IDF) is used to reflect the document-term relationships in feature extraction, and a predetermined number of features is selected through an iterative process. We also used a sparsity score to improve the performance of the detection model: if a spam mail data set is highly sparse, the detection model performs poorly and an optimal detection model is difficult to find. In addition, by using s(F) as the numerator of the fitness function, we find models that have low sparsity and also a high TF-IDF score. We verified the performance of the proposed algorithm by applying it to text classification and found that it achieves higher performance (speed and accuracy) in attack mail classification.
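
As a rough illustration of why a term-document matrix is sparse and how the choice of features changes that, the sketch below builds a TF-IDF matrix for a toy corpus and measures the fraction of zero entries in a selected feature subset. The corpus is invented, and treating "sparsity" as the fraction of zero entries is an assumption; the abstract does not define s(F) or the fitness function, so neither is reproduced here.

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """Return (vocab, rows) where rows[i][j] is the TF-IDF of term j in doc i."""
    tokenized = [d.lower().split() for d in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(t for doc in tokenized for t in set(doc))
    rows = []
    for doc in tokenized:
        tf = Counter(doc)
        rows.append([
            (tf[t] / len(doc)) * math.log(n_docs / df[t]) if t in tf else 0.0
            for t in vocab
        ])
    return vocab, rows

def sparsity(rows, selected):
    """Fraction of zero entries among the selected feature columns (assumed definition)."""
    cells = [row[j] for row in rows for j in selected]
    return sum(1 for v in cells if v == 0.0) / len(cells)

# Toy corpus (hypothetical, for illustration only).
docs = [
    "win free prize now",
    "free invoice attached",
    "meeting agenda attached",
    "win prize",
]
vocab, rows = tfidf_matrix(docs)
all_features = range(len(vocab))
subset = [vocab.index(t) for t in ("attached", "free", "prize", "win")]
print(sparsity(rows, all_features), sparsity(rows, subset))
```

A feature-selection procedure such as a GA could use a score like this to prefer subsets whose columns are less sparse.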

Case-Related News Filtering via Topic-Enhanced Positive-Unlabeled Learning

  • Wang, Guanwen;Yu, Zhengtao;Xian, Yantuan;Zhang, Yu
    • Journal of Information Processing Systems / v.17 no.6 / pp.1057-1070 / 2021
  • Case-related news filtering is crucial in legal text mining and divides news into case-related and case-unrelated categories. Because case-related news originates from various fields and has different writing styles, it is difficult to establish complete filtering rules or keywords for data collection. In addition, the labeled corpus for case-related news is sparse; therefore, to train a high-performance classification model, it is necessary to annotate the corpus. To address this challenge, we propose topic-enhanced positive-unlabeled learning, which selects positive and negative samples guided by topics. Specifically, a topic model based on a variational autoencoder (VAE) is trained to extract topics from unlabeled samples. By using these topics in the iterative process of positive-unlabeled (PU) learning, the accuracy of identifying case-related news can be improved. From the experimental results, it can be observed that the F1 value of our method on the test set is 1.8% higher than that of the PU learning baseline model. In addition, our method is more robust with low initial samples and high iterations, and compared with advanced PU learning baselines such as nnPU and I-PU, we obtain a 1.1% higher F1 value, which indicates that our method can effectively identify case-related news.

Gaussian models for bond strength evaluation of ribbed steel bars in concrete

  • Prabhat R., Prem;Branko, Savija
    • Structural Engineering and Mechanics / v.84 no.5 / pp.651-664 / 2022
  • Precise prediction of the ultimate bond strength between rebar and the surrounding concrete plays a major role in structural design, as it significantly affects the load-carrying capacity and serviceability of a member. In the present study, Gaussian models are employed to model the bond strength of ribbed steel bars embedded in concrete. Gaussian models offer a non-parametric method based on a Bayesian framework that is powerful, versatile, robust, and accurate. Five Gaussian models are explored in this paper: Gaussian Process (GP), Variational Heteroscedastic Gaussian Process (VHGP), Warped Gaussian Process (WGP), Sparse Spectrum Gaussian Process (SSGP), and Twin Gaussian Process (TGP). The effectiveness of the models is also evaluated against the design formulae provided by various codes. The predictions from the Gaussian models are found to be closer to the experimental results than those from the code design equations. The sensitivity of the models to various parameters, the input feature space, and sampling is also presented. GP, VHGP, and SSGP are found to be effective in predicting bond strength. For large data sets, GP, VHGP, WGP, and TGP can be computationally expensive; in such cases, SSGP can be used.

Compressive Sensing of the FIR Filter Coefficients for Multiplierless Implementation (무곱셈 구현을 위한 FIR 필터 계수의 압축 센싱)

  • Kim, Seehyun
    • Journal of the Korea Institute of Information and Communication Engineering / v.18 no.10 / pp.2375-2381 / 2014
  • When the coefficient set of an FIR filter is represented in canonic signed digit (CSD) format with few nonzero digits, high-data-rate digital filters can be implemented at low hardware cost. Designing an FIR filter with CSD coefficients whose number of nonzero signed digits is minimal is equivalent to finding a sparse set of nonzero signed digits in the coefficient set that satisfies the target frequency response with minimal maximum error. In this paper, a compressive sensing based CSD coefficient FIR filter design algorithm is proposed for multiplierless, high-speed implementation. Design examples show that multiplierless FIR filters can be designed using fewer than two additions per tap on average while approximating the target frequency response, making them suitable for high-speed filtering applications.
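
For readers unfamiliar with the CSD format mentioned above, the following sketch converts an integer (for example, a quantized filter coefficient) into signed digits from {-1, 0, 1} with no two adjacent nonzero digits. It only illustrates the number representation; the compressive sensing design algorithm proposed in the paper is not reproduced, and the function names are chosen here for illustration.

```python
def to_csd(n):
    """Return the CSD digits of integer n, least significant first."""
    digits = []
    while n != 0:
        if n % 2:
            # +1 if n is 1 mod 4, -1 if n is 3 mod 4; this forces the next digit to 0.
            d = 2 - (n % 4)
            n -= d
        else:
            d = 0
        digits.append(d)
        n //= 2
    return digits

def from_csd(digits):
    """Rebuild the integer from its CSD digits (least significant first)."""
    return sum(d << i if d >= 0 else -(1 << i) for i, d in enumerate(digits))

def nonzero_digits(digits):
    """Count nonzero digits; k nonzero digits need k - 1 shift-add/subtract operations."""
    return sum(1 for d in digits if d)

coeff = 23  # binary 10111 has four nonzero bits
csd = to_csd(coeff)
print(csd, from_csd(csd), nonzero_digits(csd))
```

Here 23 is written as 32 - 8 - 1, so multiplying by it takes two additions or subtractions of shifted inputs instead of three.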

Evaluation of Spatio-temporal Fusion Models of Multi-sensor High-resolution Satellite Images for Crop Monitoring: An Experiment on the Fusion of Sentinel-2 and RapidEye Images (작물 모니터링을 위한 다중 센서 고해상도 위성영상의 시공간 융합 모델의 평가: Sentinel-2 및 RapidEye 영상 융합 실험)

  • Park, Soyeon;Kim, Yeseul;Na, Sang-Il;Park, No-Wook
    • Korean Journal of Remote Sensing / v.36 no.5_1 / pp.807-821 / 2020
  • The objective of this study is to evaluate the applicability of representative spatio-temporal fusion models developed for the fusion of mid- and low-resolution satellite images in order to construct a set of time-series high-resolution images for crop monitoring. Particularly, the effects of the characteristics of input image pairs on the prediction performance are investigated by considering the principle of spatio-temporal fusion. An experiment on the fusion of multi-temporal Sentinel-2 and RapidEye images in agricultural fields was conducted to evaluate the prediction performance. Three representative fusion models, including Spatial and Temporal Adaptive Reflectance Fusion Model (STARFM), SParse-representation-based SpatioTemporal reflectance Fusion Model (SPSTFM), and Flexible Spatiotemporal DAta Fusion (FSDAF), were applied to this comparative experiment. The three spatio-temporal fusion models exhibited different prediction performance in terms of prediction errors and spatial similarity. However, regardless of the model types, the correlation between coarse resolution images acquired on the pair dates and the prediction date was more significant than the difference between the pair dates and the prediction date to improve the prediction performance. In addition, using vegetation index as input for spatio-temporal fusion showed better prediction performance by alleviating error propagation problems, compared with using fused reflectance values in the calculation of vegetation index. These experimental results can be used as basic information for both the selection of optimal image pairs and input types, and the development of an advanced model in spatio-temporal fusion for crop monitoring.

A Bitmap Index for Chunk-Based MOLAP Cubes (청크 기반 MOLAP 큐브를 위한 비트맵 인덱스)

  • Lim, Yoon-Sun;Kim, Myung
    • Journal of KIISE:Databases / v.30 no.3 / pp.225-236 / 2003
  • MOLAP systems store data in a multidimensional array called a 'cube' and access them using array indexes. When a cube is stored on disk, it can be partitioned into a set of chunks of the same side length. Such a storage scheme is called the chunk-based MOLAP cube storage scheme. It clusters data so that all dimensions get a fair chance in terms of query processing speed. To achieve high space utilization, sparse chunks are further compressed. Because of this compression, the relative position of chunks cannot be obtained in constant time without an index. In this paper, we propose a bitmap index for chunk-based MOLAP cubes. The index can be constructed while the corresponding cube is generated. The relative position of chunks is retained in the index so that chunk retrieval can be done in constant time. We place as many chunks as possible in an index block so that the number of index searches is minimized for OLAP operations such as range queries. We show that the proposed index is efficient by comparing it with multidimensional indexes such as the UB-tree and the grid file in terms of time and space.
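
One way to picture how a bitmap can recover chunk positions after compression is sketched below. It assumes that empty chunks are simply not stored, so a chunk's slot in the file equals the number of non-empty chunks before it. The class name, the 64-bit block size, and the row-major chunk numbering are illustrative assumptions, not the index layout proposed in the paper.

```python
def chunk_number(cell, dim_sizes, side):
    """Row-major number of the chunk that contains the given cell coordinates."""
    no = 0
    for c, size in zip(cell, dim_sizes):
        chunks_along = -(-size // side)  # ceiling division
        no = no * chunks_along + c // side
    return no

class ChunkBitmapIndex:
    """One bit per chunk (1 = stored) plus a running count per 64-bit word."""
    BLOCK = 64

    def __init__(self, nonempty_flags):
        self.words = []
        self.rank_before = []  # number of stored chunks before each word
        count = 0
        for start in range(0, len(nonempty_flags), self.BLOCK):
            word = 0
            for off, flag in enumerate(nonempty_flags[start:start + self.BLOCK]):
                if flag:
                    word |= 1 << off
            self.words.append(word)
            self.rank_before.append(count)
            count += bin(word).count("1")

    def position(self, chunk_no):
        """Slot of the chunk among stored chunks, or None if the chunk is empty."""
        w, off = divmod(chunk_no, self.BLOCK)
        word = self.words[w]
        if not (word >> off) & 1:
            return None
        below = word & ((1 << off) - 1)  # bits of earlier chunks in this word
        return self.rank_before[w] + bin(below).count("1")

# An 8 x 8 cube split into 4 x 4 chunks; chunk 1 is empty and not stored.
index = ChunkBitmapIndex([True, False, True, True])
no = chunk_number((5, 2), dim_sizes=(8, 8), side=4)
print(no, index.position(no))
```

Because each lookup touches one word and one precomputed count, its cost does not depend on the number of chunks.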

Korea Emissions Inventory Processing Using the US EPA's SMOKE System

  • Kim, Soon-Tae;Moon, Nan-Kyoung;Byun, Dae-Won W.
    • Asian Journal of Atmospheric Environment / v.2 no.1 / pp.34-46 / 2008
  • Emissions inputs for use in air quality modeling of Korea were generated with the emissions inventory data from the National Institute of Environmental Research (NIER), maintained under the Clean Air Policy Support System (CAPSS) database. Source Classification Codes (SCC) in the Korea emissions inventory were adapted to use with the U.S. EPA's Sparse Matrix Operator Kernel Emissions (SMOKE) by finding the best-matching SMOKE default SCCs for the chemical speciation and temporal allocation. A set of 19 surrogate spatial allocation factors for South Korea were developed utilizing the Multi-scale Integrated Modeling System (MIMS) Spatial Allocator and Korean GIS databases. The mobile and area source emissions data, after temporal allocation, show typical sinusoidal diurnal variations with high peaks during daytime, while point source emissions show weak diurnal variations. The model-ready emissions are speciated for the carbon bond version 4 (CB-4) chemical mechanism. Volatile organic carbon (VOC) emissions from painting related industries in area source category significantly contribute to TOL (Toluene) and XYL (Xylene) emissions. ETH (Ethylene) emissions are largely contributed from point industrial incineration facilities and various mobile sources. On the other hand, a large portion of OLE (Olefin) emissions are speciated from mobile sources in addition to those contributed by the polypropylene industry in point source. It was found that FORM (Formaldehyde) is mostly emitted from petroleum industry and heavy duty diesel vehicles. Chemical speciation of PM2.5 emissions shows that PEC (primary fine elemental carbon) and POA (primary fine organic aerosol) are the most abundant species from diesel and gasoline vehicles. 
    To reduce the uncertainties in processing the Korea emission inventory that arise from mapping Korean SCCs to those of the U.S., it would be practical to develop and use domestic source profiles for the top 10 SCCs for area and point sources and the top 5 SCCs for on-road mobile sources when VOC emissions from the sources are more than 90% of the total.

Development of a Prototype of Guidance System for Rice-transplanter

  • Zhang, Fang-Ming;Shin, Beom-Soo;Feng, Xi-Ming;Li, Yuan;Shou, Ru-Jiang
    • Journal of Biosystems Engineering / v.38 no.4 / pp.255-263 / 2013
  • Purpose: It is not easy to drive a rice-transplanter without underlapped or overlapped transplanting in paddy fields. An automated guidance system for the riding-type rice-transplanter is needed to operate it autonomously or to assist novice drivers as a driving aid. Methods: A prototype guidance system was composed of embedded computers, an RTK-GPS, and a power-steering mechanism. Two Kalman filters were adopted to compensate for the sparse positioning data (1 Hz) from the RTK-GPS. A global Kalman filter estimated the posture of the rice-transplanter every second, and a local Kalman filter calculated the posture at 200 ms intervals from each new estimate of the global Kalman filter. A PID controller was applied to row-following mode control, and a control method for U-turning mode was developed as well. A stepping motor with a reduction gear set was used to rotate the steering wheel shaft. Results: Test trials for the U-turning and row-following modes were conducted in a paddy field at ground speeds of 0.3 to 1.2 m/s after some parameters had been tuned. The minimum RMS offset error was 3.13 cm at a ground speed of 0.3 m/s, while the maximum was 13.01 cm at 1.2 m/s. The RMS offset error tended to increase with ground speed. The target point distance, LT, also affected system performance, and the PID controller parameters should be adjusted for different ground speeds. Conclusions: A target angle-based PID controller combined with a stationary steering angle controller made it possible for the rice-transplanter to steer autonomously, accurately following a reference line even in U-turning mode. However, because conditions in paddy fields are very complicated, the system should also control the ground speed to prevent the rice-transplanter from deviating too much due to ditches and slopes.
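
The idea of bridging 1 Hz position fixes with 200 ms estimates can be sketched with a single one-dimensional constant-velocity Kalman filter that predicts every 200 ms and corrects once per second. This collapses the paper's global/local filter pair into one filter with two rates, which is a simplification and not the authors' design; the noise values, the 1-D state, and the noise-free simulated fixes are all assumptions made for illustration.

```python
class ConstantVelocityKF:
    """1-D position/velocity Kalman filter (illustrative only)."""

    def __init__(self, q=1e-4, r=4e-4):
        self.p, self.v = 0.0, 0.0          # state: position [m], velocity [m/s]
        self.P = [[1.0, 0.0], [0.0, 1.0]]  # state covariance
        self.q, self.r = q, r              # assumed process / measurement noise

    def predict(self, dt):
        """Propagate the state by dt seconds between position fixes."""
        (a, b), (c, d) = self.P
        self.p += self.v * dt
        # P = F P F^T + Q with F = [[1, dt], [0, 1]] and Q = q * I.
        self.P = [[a + dt * (b + c) + dt * dt * d + self.q, b + dt * d],
                  [c + dt * d, d + self.q]]

    def update(self, z):
        """Correct the state with a position fix z (measurement matrix H = [1, 0])."""
        (a, b), (c, d) = self.P
        s = a + self.r                # innovation variance
        k0, k1 = a / s, c / s         # Kalman gain
        y = z - self.p                # innovation
        self.p += k0 * y
        self.v += k1 * y
        self.P = [[(1 - k0) * a, (1 - k0) * b],
                  [c - k1 * a, d - k1 * b]]

def simulate(speed=0.3, seconds=20):
    """Drive at constant speed; predict every 200 ms, correct at 1 Hz."""
    kf = ConstantVelocityKF()
    for step in range(1, seconds * 5 + 1):
        kf.predict(0.2)
        if step % 5 == 0:                  # a fix arrives once per second
            kf.update(speed * step * 0.2)  # noise-free fix, for illustration
    return kf.p, kf.v

position, velocity = simulate()
print(round(position, 3), round(velocity, 3))
```

Between fixes the controller can read `kf.p` after each `predict` call, so it receives a posture estimate five times per second even though the receiver reports only once.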

Assessing Spatial Uncertainty Distributions in Classification of Remote Sensing Imagery using Spatial Statistics (공간 통계를 이용한 원격탐사 화상 분류의 공간적 불확실성 분포 추정)

  • Park, No-Wook;Chi, Kwang-Hoon;Kwon, Byung-Doo
    • Korean Journal of Remote Sensing / v.20 no.6 / pp.383-396 / 2004
  • This paper investigates the application of spatial statistics to obtain spatial uncertainty distributions in the classification of remote sensing images. Two quantitative methods are presented for describing two kinds of uncertainty: one related to class assignment and the other related to the connection of reference samples. Three quantitative indices are addressed for the first category of uncertainty. Geostatistical simulation is applied both to integrate the exhaustive classification results with the sparse reference samples and to obtain the spatial uncertainty or accuracy distributions connected to those reference samples. To illustrate the proposed methods and discuss operational issues, an experiment was carried out on a multi-sensor remote sensing data set for supervised land-cover classification. The experimental results indicate that the two quantitative methods can provide additional information for interpreting and evaluating classification results; further experiments are needed to verify the presented methods.