• Title/Summary/Keyword: small data set

Search Result 664, Processing Time 0.033 seconds

Dynamic RNN-CNN malware classifier correspond with Random Dimension Input Data (임의 차원 데이터 대응 Dynamic RNN-CNN 멀웨어 분류기)

  • Lim, Geun-Young;Cho, Young-Bok
    • Journal of the Korea Institute of Information and Communication Engineering
    • /
    • v.23 no.5
    • /
    • pp.533-539
    • /
    • 2019
  • This study proposes a malware classification model that can handle arbitrary length input data using the Microsoft Malware Classification Challenge dataset. We are based on imaging existing data from malware. The proposed model generates a lot of images when malware data is large, and generates a small image of small data. The generated image is learned as time series data by Dynamic RNN. The output value of the RNN is classified into malware by using only the highest weighted output by applying the Attention technique, and learning the RNN output value by Residual CNN again. Experiments on the proposed model showed a Micro-average F1 score of 92% in the validation data set. Experimental results show that the performance of a model capable of learning and classifying arbitrary length data can be verified without special feature extraction and dimension reduction.

Optimizing Multi-way Join Query Over Data Streams (데이타 스트림에서의 다중 조인 질의 최적화 방법)

  • Park, Hong-Kyu;Lee, Won-Suk
    • Journal of KIISE:Databases
    • /
    • v.35 no.6
    • /
    • pp.459-468
    • /
    • 2008
  • A data stream which is a massive unbounded sequence of data elements continuously generated at a rapid rate. Many recent research activities for emerging applications often need to deal with the data stream. Such applications can be web click monitoring, sensor data processing, network traffic analysis. telephone records and multi-media data. For this. data processing over a data stream are not performed on the stored data but performed the newly updated data with pre-registered queries, and then return a result immediately or periodically. Recently, many studies are focused on dealing with a data stream more than a stored data set. Especially. there are many researches to optimize continuous queries in order to perform them efficiently. This paper proposes a query optimization algorithm to manage continuous query which has multiple join operators(Multi-way join) over data streams. It is called by an Extended Greedy query optimization based on a greedy algorithm. It defines a join cost by a required operation to compute a join and an operation to process a result and then stores all information for computing join cost and join cost in the statistics catalog. To overcome a weak point of greedy algorithm which has poor performance, the algorithm selects the set of operators with a small lay, instead of operator with the smallest cost. The set is influenced the accuracy and execution time of the algorithm and can be controlled adaptively by two user-defined values. Experiment results illustrate the performance of the EGA algorithm in various stream environments.

Reduction of Approximate Rule based on Probabilistic Rough sets (확률적 러프 집합에 기반한 근사 규칙의 간결화)

  • Kwon, Eun-Ah;Kim, Hong-Gi
    • The KIPS Transactions:PartD
    • /
    • v.8D no.3
    • /
    • pp.203-210
    • /
    • 2001
  • These days data is being collected and accumulated in a wide variety of fields. Stored data itself is to be an information system which helps us to make decisions. An information system includes many kinds of necessary and unnecessary attribute. So many algorithms have been developed for finding useful patterns from the data and reasoning approximately new objects. We are interested in the simple and understandable rules that can represent useful patterns. In this paper we propose an algorithm which can reduce the information in the system to a minimum, based on a probabilistic rough set theory. The proposed algorithm uses a value that tolerates accuracy of classification. The tolerant value helps minimizing the necessary attribute which is needed to reason a new object by reducing conditional attributes. It has the advantage that it reduces the time of generalizing rules. We experiment a proposed algorithm with the IRIS data and Wisconsin Breast Cancer data. The experiment results show that this algorithm retrieves a small reduct, and minimizes the size of the rule under the tolerant classification rate.

  • PDF

A Study about the Influence of Pollutant Load on Water Quality in a Small Stream Watershed (소하천의 오염부하량이 수질에 미치는 영향에 관한 연구)

  • Lee, Sang-Hoon;Cho, Wook-Sang
    • Journal of Environmental Impact Assessment
    • /
    • v.10 no.1
    • /
    • pp.9-19
    • /
    • 2001
  • An intensive watershed survey including water quality measurement of 6 times was carried out in order to find out the relationship between pollutant load and water quality in a small stream watershed where livestock wastewater is the main source of water pollution. The findings from the survey are as follows. 1) The number of livestock showed large disagreement among county office, myon, and insite survey. It is vital to check the data at the beginning of watershed survey. 2) The fluctuation of streamflow and water quality was so large depending on the day of measurement that it is essential to set up continuous telemetering system to get reliable data about delivery ratio of pollutants. 3) It was helpful for setting the priority of investigation to check water quality and quantity at several points along the stream after dividing the watershed into 5 drainage areas. 4) To control the livestock wastewater, especially in case of cows, it is necessary to have roof system and prevent overland flow from the ground. In case of pig farms, it is recommended to have public treatment system instead of private treatment system. The exact emission load of livestock wastewater was difficult to estimate, and requires more study.

  • PDF

Rank-Based Nonlinear Normalization of Oligonucleotide Arrays

  • Park, Peter J.;Kohane, Isaac S.;Kim, Ju Han
    • Genomics & Informatics
    • /
    • v.1 no.2
    • /
    • pp.94-100
    • /
    • 2003
  • Motivation: Many have observed a nonlinear relationship between the signal intensity and the transcript abundance in microarray data. The first step in analyzing the data is to normalize it properly, and this should include a correction for the nonlinearity. The commonly used linear normalization schemes do not address this problem. Results: Nonlinearity is present in both cDNA and oligonucleotide arrays, but we concentrate on the latter in this paper. Across a set of chips, we identify those genes whose within-chip ranks are relatively constant compared to other genes of similar intensity. For each gene, we compute the sum of the squares of the differences in its within-chip ranks between every pair of chips as our statistic and we select a small fraction of the genes with the minimal changes in ranks at each intensity level. These genes are most likely to be non-differentially expressed and are subsequently used in the normalization procedure. This method is a generalization of the rank-invariant normalization (Li and Wong, 2001), using all available chips rather than two at a time to gather more information, while using the chip that is least likely to be affected by nonlinear effects as the reference chip. The assumption in our method is that there are at least a small number of non­differentially expressed genes across the intensity range. The normalized expression values can be substantially different from the unnormalized values and may result in altered down-stream analysis.

A Study on the Applicability of ENERWATER for Evaluation of the Energy Consumption Label of WWTPs in Korea (국내 하수처리시설 에너지 등급 평가를 위한 ENERWATER의 적용 가능성에 관한 연구)

  • Park, Minoh;Lee, Hosik
    • Journal of Korean Society on Water Environment
    • /
    • v.38 no.5
    • /
    • pp.231-239
    • /
    • 2022
  • In this study, we applied ENERWATER to evaluate the energy consumption labeling of wastewater treatment plants in Korea using the Korea sewerage statistics data. The results showed that the energy label status was excellent in the SBR process for small and medium-scale wastewater treatment plants and the A2O process for large-scale wastewater treatment plants. The energy labeling of wastewater treatment plants of 50,000 tons capacity was excellent. The statuses of metropolitan cities and Jeollanam-do province were excellent. We analyzed the effects of renewable energy on wastewater treatment plants' energy consumption and found out that digestion gas for large-scale plants and photovoltaic energy for small-scale plants were effective in improving energy labeling. In addition, we compared the energy labels of four wastewater treatment plants in "Z" city and wastewater treatment plant "X" had the best energy label, and the wastewater treatment plants "V" and "Y" had to be selected as priorities for the energy diagnosis and improvement project. In a comprehensive conclusion, the applicability of ENERWATER was confirmed based on sewage statistics data and labeling can be used to set priorities for the energy diagnosis and improvement project.

CBIR-based Data Augmentation and Its Application to Deep Learning (CBIR 기반 데이터 확장을 이용한 딥 러닝 기술)

  • Kim, Sesong;Jung, Seung-Won
    • Journal of Broadcast Engineering
    • /
    • v.23 no.3
    • /
    • pp.403-408
    • /
    • 2018
  • Generally, a large data set is required for learning of deep learning. However, since it is not easy to create large data sets, there are a lot of techniques that make small data sets larger through data expansion such as rotation, flipping, and filtering. However, these simple techniques have limitation on extendibility because they are difficult to escape from the features already possessed. In order to solve this problem, we propose a method to acquire new image data by using existing data. This is done by retrieving and acquiring similar images using existing image data as a query of the content-based image retrieval (CBIR). Finally, we compare the performance of the base model with the model using CBIR.

Trapdoor Digital Shredder: A New Technique for Improved Data Security without Cryptographic Encryption

  • Youn, Taek-Young;Jho, Nam-Su
    • KSII Transactions on Internet and Information Systems (TIIS)
    • /
    • v.14 no.3
    • /
    • pp.1249-1262
    • /
    • 2020
  • Along with the increase of the importance of information used in practice, adversaries tried to take valuable information in diverse ways. The simple and fundamental solution is to encrypt the whole data. Since the cost of encryption is increasing along with the size of data, the cost for securing the data is a burden to a system where the size of the data is not small. For the reason, in some applications where huge data are used for service, service providers do not use any encryption scheme for higher security, which could be a source of trouble. In this work, we introduce a new type of data securing technique named Trapdoor Digital Shredder(TDS) which disintegrates a data to multiple pieces to make it hard to re-construct the original data except the owner of the file who holds some secret keys. The main contribution of the technique is to increase the difficulty in obtaining private information even if an adversary obtains some shredded pieces. To prove the security of our scheme, we first introduce a new security model so called IND-CDA to examine the indistinguishability of shredded pieces. Then, we show that our scheme is secure under IND-CDA model, which implies that an adversary cannot distinguish a subset of shreds of a file from a set of random shreds.

Design 5Q MPI Hardware Unit Supporting Standard Mode (표준 모드를 지원하는 5Q MPI 하드웨어 유닛 설계)

  • Park, Jae-Won;Chung, Won-Young;Lee, Seung-Woo;Lee, Yong-Surk
    • The Journal of Korean Institute of Communications and Information Sciences
    • /
    • v.37 no.1B
    • /
    • pp.59-66
    • /
    • 2012
  • The use of MPSoC has been increasing because of a rise of use of mobile devices and complex applications. For improving the performance of MPSoC, number of processor has been increasing. Standard MPI is used for efficiently sending data in distributed memory architecture that has advantage in multi processor. Standard In this paper, we propose a scalable distributed memory system with a low cost hardware message passing interface(MPI). The proposed architecture improves transfer rate with buffered send for small size packet. Three queues, Ready Queue, Request Queue, and Reservation Queue, work as previous architecture, and two queues, Small Ready Queue and Small Request Queue, are added to send small size packet. When the critical point is set 8 bytes, the proposed architecture takes more than 2 times the performance improvement in the data that below the critical point.

Performance Enhancement for Speaker Verification Using Incremental Robust Adaptation in GMM (가무시안 혼합모델에서 점진적 강인적응을 통한 화자확인 성능개선)

  • Kim, Eun-Young;Seo, Chang-Woo;Lim, Yong-Hwan;Jeon, Seong-Chae
    • The Journal of the Acoustical Society of Korea
    • /
    • v.28 no.3
    • /
    • pp.268-272
    • /
    • 2009
  • In this paper, we propose a Gaussian Mixture Model (GMM) based incremental robust adaptation with a forgetting factor for the speaker verification. Speaker recognition system uses a speaker model adaptation method with small amounts of data in order to obtain a good performance. However, a conventional adaptation method has vulnerable to the outlier from the irregular utterance variations and the presence noise, which results in inaccurate speaker model. As time goes by, a rate in which new data are adapted to a model is reduced. The proposed algorithm uses an incremental robust adaptation in order to reduce effect of outlier and use forgetting factor in order to maintain adaptive rate of new data on GMM based speaker model. The incremental robust adaptation uses a method which registers small amount of data in a speaker recognition model and adapts a model to new data to be tested. Experimental results from the data set gathered over seven months show that the proposed algorithm is robust against outliers and maintains adaptive rate of new data.