• Title/Summary/Keyword: Theory-based data science

Search Result 119, Processing Time 0.03 seconds

Design and Implementation of a Metadata Structure for Large-Scale Shared-Disk File System (대용량 공유디스크 파일 시스템에 적합한 메타 데이타 구조의 설계 및 구현)

  • 이용주;김경배;신범주
    • Journal of KIISE:Computer Systems and Theory / v.30 no.1 / pp.33-49 / 2003
  • Recently, the demand for large-scale storage to handle multimedia data has grown rapidly. One major line of research addressing this demand is the SAN (Storage Area Network), which serves local file requests directly from shared-disk storage and eliminates the server bottlenecks that limit performance and availability. A SAN also improves network latency and bandwidth through new channel interfaces such as FC (Fibre Channel). However, traditional local file systems and distributed file systems are not well suited to storage networks such as a SAN, and there has been little research on metadata structures for large-scale inode objects such as files and directories. In this paper, we describe the architecture and design issues of our shared-disk file system and present an efficient bitmap that supports well-formed block allocation on each host, an extent-based semi-flat structure for storing large-scale file data, and a two-phase directory structure that uses Extendible Hashing. We also describe a detailed algorithm for implementing the file system's device driver in the Linux kernel and compare our file system with a general file system, EXT2, and a shared-disk file system, GFS, in terms of file creation, directory creation, and I/O rate.
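
The abstract above names Extendible Hashing as the basis of its directory structure. As a rough illustration of that general technique only (not the paper's two-phase structure, whose details the abstract does not give), the following minimal Python sketch maps file names to inode numbers. The class names, the bucket capacity, and the use of CRC32 as the hash function are all assumptions made for the example, and there is no overflow handling for more than one bucket's worth of names with identical hash values.

```python
import zlib


class Bucket:
    """A fixed-capacity bucket that remembers how many hash bits it uses."""

    def __init__(self, local_depth, capacity):
        self.local_depth = local_depth
        self.capacity = capacity
        self.entries = {}  # file name -> inode number


class ExtendibleHashDirectory:
    """Hypothetical directory index: a slot table of 2**global_depth bucket references."""

    def __init__(self, bucket_capacity=4):
        self.bucket_capacity = bucket_capacity
        self.global_depth = 1
        self.slots = [Bucket(1, bucket_capacity), Bucket(1, bucket_capacity)]

    @staticmethod
    def _hash(name):
        # CRC32 is chosen only because it is deterministic (an assumption).
        return zlib.crc32(name.encode("utf-8"))

    def _bucket_for(self, name):
        mask = (1 << self.global_depth) - 1  # keep the low global_depth bits
        return self.slots[self._hash(name) & mask]

    def lookup(self, name):
        return self._bucket_for(name).entries.get(name)

    def insert(self, name, inode):
        while True:
            bucket = self._bucket_for(name)
            if name in bucket.entries or len(bucket.entries) < bucket.capacity:
                bucket.entries[name] = inode
                return
            self._split(bucket)  # bucket is full: split it and retry

    def _split(self, bucket):
        if bucket.local_depth == self.global_depth:
            # Double the slot table; slot i and slot i + old_size share a bucket.
            self.slots = self.slots + self.slots
            self.global_depth += 1
        new_bit = 1 << bucket.local_depth  # the extra hash bit used after the split
        bucket.local_depth += 1
        sibling = Bucket(bucket.local_depth, self.bucket_capacity)
        # Redirect the slots whose new bit is set to the sibling bucket.
        for i, slot in enumerate(self.slots):
            if slot is bucket and i & new_bit:
                self.slots[i] = sibling
        # Move the entries whose hash has the new bit set.
        for name in list(bucket.entries):
            if self._hash(name) & new_bit:
                sibling.entries[name] = bucket.entries.pop(name)
```

Only the full bucket is split on overflow, and the slot table doubles only when that bucket already uses every bit of the global depth, which is why lookups stay at one table access plus one bucket access as the directory grows.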

An Efficient Periodic-Request-Grouping Technique for Reduced Seek Time in Disk Array-based Video-on-Demand Server (디스크 배열-기반 주문형 비디오 서버에서의 탐색 시간 단축을 위한 효율적인 주기적 요청 묶음 기법)

  • Kim, Un-Seok;Kim, Ji-Hong;Min, Sang-Ryeol;No, Sam-Hyeok
    • Journal of KIISE:Computer Systems and Theory / v.28 no.12 / pp.660-673 / 2001
  • In Video-on-Demand (VoD) servers, disk throughput is an important system design parameter because it is directly related to the number of user requests that can be served simultaneously. In this paper, we propose an efficient periodic request grouping scheme for disk array-based VoD servers that reduces the disk seek time, thus improving the disk throughput of VoD disk arrays. To reduce the disk seek time, the proposed scheme groups the periodic requests that access data blocks stored in adjacent regions into one group, and arranges these groups in a pre-determined order (e.g., in left-symmetric or right-symmetric fashion). Our simulation results show that the proposed scheme reduces the average disk bandwidth required by a single video stream and can serve more user requests than existing schemes. For a data block size of 192KB, the number of simultaneously served user requests increases by 8% while the average waiting time for a user request decreases by 20%. We also propose a technique that adapts the proposed scheme to changes in user preferences for video streams.

Design and Implementation of An I/O System for Irregular Application under Parallel System Environments (병렬 시스템 환경하에서 비정형 응용 프로그램을 위한 입출력 시스템의 설계 및 구현)

  • No, Jae-Chun;Park, Seong-Sun;;Gwon, O-Yeong
    • Journal of KIISE:Computer Systems and Theory / v.26 no.11 / pp.1318-1332 / 1999
  • In this paper we present the design, implementation, and evaluation of a runtime system based on collective I/O techniques for irregular applications. We present two designs, "Collective I/O" and "Pipelined Collective I/O". In the first scheme, all processors participate in the I/O simultaneously, which makes scheduling of I/O requests simpler but creates a possibility of contention at the I/O nodes. In the second approach, processors are divided into several groups so that only one group performs I/O at a time while the next group performs communication to rearrange data; this entire process is pipelined to reduce I/O node contention dynamically. In other words, the design provides support for dynamic contention management. We then present a software caching method that uses collective I/O to reduce I/O cost by reusing data already present in the memory of other nodes. Finally, chunking and on-line compression mechanisms are included in both models. We demonstrate that significantly higher I/O performance can be obtained than has been possible so far. Performance results are presented for an Intel Paragon and the ASCI/Red teraflops machine. Application-level I/O bandwidth of up to 55% of the peak is observed.

Accelerated Learning of Latent Topic Models by Incremental EM Algorithm (점진적 EM 알고리즘에 의한 잠재토픽모델의 학습 속도 향상)

  • Chang, Jeong-Ho;Lee, Jong-Woo;Eom, Jae-Hong
    • Journal of KIISE:Software and Applications / v.34 no.12 / pp.1045-1055 / 2007
  • Latent topic models are statistical models that automatically capture, in a probabilistic way, salient patterns or correlations among the features underlying a data collection. They are gaining popularity as an effective tool for automatic semantic feature extraction from text corpora, for multimedia data analysis including image data, and for bioinformatics. One of the important issues in applying latent topic models to massive data sets is efficient learning of the model. This paper proposes an accelerated learning technique for the PLSA model, one of the popular latent topic models, that uses an incremental EM algorithm instead of the conventional EM algorithm. The incremental EM algorithm is characterized by a series of partial E-steps, each performed on a subset of the data collection, whereas the conventional EM algorithm performs one batch E-step over the whole data set. By replacing a single batch E-step and M-step with a series of partial E-steps and M-steps, the inference result for the previous data subset is directly reflected in the next inference step, which can speed up learning over the entire data set. The algorithm also has the advantages that it is guaranteed to converge to a local maximum solution and that it can be implemented with only slight modification of an existing implementation based on conventional EM. We present the basic application of the incremental EM algorithm to the learning of PLSA and empirically evaluate the acceleration achieved with several possible data partitioning methods for practical application. Experimental results on a real-world news data set show that the proposed approach achieves a meaningful improvement in the convergence rate when learning a latent topic model. Additionally, we present an interesting result that supports a possible synergistic effect of combining the incremental EM algorithm with parallel computing.
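
The partial E-step idea described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: it uses the asymmetric PLSA parameterization P(w|d) = Σ_z P(z|d) P(w|z), interleaved document blocks as the partitioning method, an initial full E-step to seed the expected counts, and a small numerical floor in the normalization. All function and parameter names are hypothetical.

```python
import math
import random


def _normalize(values):
    """Scale non-negative numbers so they sum to 1 (with a tiny floor for safety)."""
    floored = [max(v, 1e-12) for v in values]
    total = sum(floored)
    return [v / total for v in floored]


def _expected_counts(doc, p_z_given_d, p_w_z):
    """Partial E-step for one document: n(d, w) * P(z | d, w) for every topic z."""
    n_topics = len(p_z_given_d)
    expected = [dict() for _ in range(n_topics)]
    for w, n in doc.items():
        post = _normalize([p_z_given_d[z] * p_w_z[z][w] for z in range(n_topics)])
        for z in range(n_topics):
            expected[z][w] = n * post[z]
    return expected


def incremental_em_plsa(docs, n_words, n_topics=2, n_blocks=2, n_sweeps=20, seed=0):
    """docs: list of {word_id: count} dicts with word ids in range(n_words)."""
    rng = random.Random(seed)
    p_w_z = [_normalize([rng.random() + 0.5 for _ in range(n_words)])
             for _ in range(n_topics)]
    p_z_d = [_normalize([rng.random() + 0.5 for _ in range(n_topics)]) for _ in docs]

    # One full E-step seeds the sufficient statistics (expected topic-word counts).
    contrib = [_expected_counts(doc, p_z_d[d], p_w_z) for d, doc in enumerate(docs)]
    stats = [[0.0] * n_words for _ in range(n_topics)]
    for expected in contrib:
        for z in range(n_topics):
            for w, c in expected[z].items():
                stats[z][w] += c
    p_w_z = [_normalize(row) for row in stats]
    p_z_d = [_normalize([sum(e[z].values()) for z in range(n_topics)]) for e in contrib]

    blocks = [range(b, len(docs), n_blocks) for b in range(n_blocks)]
    for _ in range(n_sweeps):
        for block in blocks:
            for d in block:
                # Partial E-step: refresh only this document's expected counts.
                new = _expected_counts(docs[d], p_z_d[d], p_w_z)
                for z in range(n_topics):
                    for w, c in new[z].items():
                        stats[z][w] += c - contrib[d][z][w]  # swap old for new
                contrib[d] = new
                p_z_d[d] = _normalize([sum(new[z].values()) for z in range(n_topics)])
            # M-step after every block, so the next block sees updated topics.
            p_w_z = [_normalize(row) for row in stats]
    return p_z_d, p_w_z


def log_likelihood(docs, p_z_d, p_w_z):
    """Sum over observed (d, w) of n(d, w) * log P(w | d)."""
    return sum(n * math.log(sum(p_z_d[d][z] * p_w_z[z][w] for z in range(len(p_w_z))))
               for d, doc in enumerate(docs) for w, n in doc.items())
```

Because the topic-word distributions are re-estimated after each block rather than once per pass, documents in later blocks are scored against parameters that already reflect earlier blocks, which is the mechanism the abstract credits for faster convergence.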

Using a Learning Progression to Characterize Korean Secondary Students' Knowledge and Submicroscopic Representations of the Particle Nature of Matter (Learning Progression을 적용한 중·고등학생의 '물질의 입자성'에 관한 지식과 미시적 표상에 대한 특성 분석)

  • Shin, Namsoo;Koh, Eun Jung;Choi, Chui Im;Jeong, Dae Hong
    • Journal of The Korean Association For Science Education / v.34 no.5 / pp.437-447 / 2014
  • Learning progressions (LP), which describe how students may develop more sophisticated understanding over a defined period of time, can inform the design of instructional materials and assessment by providing a coherent, systematic measure of what can be regarded as "level appropriate." We developed LPs for the nature of matter for grades K-16. In order to empirically test Korean students, we revised one of the constructs and associated assessment items based on Korean National Science Standards. The assessment was administered to 124 Korean secondary students to measure their knowledge and submicroscopic representations, and to assign them to a level of learning progression for the particle nature of matter. We characterized the level of students' understanding and models of the particle nature of matter, and described how students interpret various representations of atoms and molecules to explain scientific phenomena. The results revealed that students have difficulties in understanding the relationship between the macroscopic and molecular levels of phenomena, even in high school science. Their difficulties may be attributed to a limited understanding of scientific modeling, a lack of understanding of the models used to represent the particle nature of matter, or limited understanding of the structure of matter. This work will inform assessment and curriculum materials development related to the fundamental relationship between macroscopic, observed phenomena and the behavior of atoms and molecules, and can be used to create individualized learning environments. In addition, the results contribute to scientific research literature on learning progressions on the nature of matter.

A Non-Shared Metadata Management Scheme for Large Distributed File Systems (대용량 분산파일시스템을 위한 비공유 메타데이타 관리 기법)

  • Yun, Jong-Byeon;Park, Yang-Bun;Lee, Seok-Jae;Jang, Su-Min;Yoo, Jae-Soo;Kim, Hong-Yeon;Kim, Young-Kyun
    • Journal of KIISE:Computer Systems and Theory / v.36 no.4 / pp.259-273 / 2009
  • Most large-scale distributed file systems decouple metadata operations from the read and write operations on a file. In such distributed file systems, a dedicated server called a metadata server (MDS) maintains file system metadata such as the access information for a file, the position of a file in the repository, and the namespace of the file system. However, existing systems use restrictive metadata management schemes because most distributed file systems were designed to focus on distributed management and input/output performance of data rather than on metadata. As a result, the metadata throughput and expandability of the metadata server are limited in existing systems. In this paper, we propose a new non-shared metadata management scheme that provides high metadata throughput and scalability for a cluster of MDSs. First, we derive a dictionary partitioning scheme as a new metadata distribution technique. Then, we present a load balancing technique based on this distribution technique. Various experiments show that our scheme outperforms existing metadata management schemes in terms of scalability and load balancing.

Recognition of Superimposed Patterns with Selective Attention based on SVM (SVM기반의 선택적 주의집중을 이용한 중첩 패턴 인식)

  • Bae, Kyu-Chan;Park, Hyung-Min;Oh, Sang-Hoon;Choi, Youg-Sun;Lee, Soo-Young
    • Journal of the Institute of Electronics Engineers of Korea SP / v.42 no.5 s.305 / pp.123-136 / 2005
  • We propose a recognition system for superimposed patterns based on a selective attention model and an SVM, which performs better than an artificial neural network. The proposed selective attention model places an attention layer before the SVM; this layer modifies the SVM's input and also acts as a selective filter. The key issues in the selective attention model are finding the criterion for stopping training and defining a confidence measure for the selective attention's outcome. A support vector represents its surrounding sample vectors, so the support vector closest to the initial input vector under consideration is chosen. The stopping criterion is defined as the minimal Euclidean distance between the input vector modified by selective attention and the chosen support vector. It is difficult to define a confidence measure if the common selective attention model is applied. A new confidence measure can be defined under the constraint that each modified input pixel does not cross the boundary of the original input pixel, which increases the range of applicable information. This method uses the following information: the Euclidean distance between an input pattern and the modified pattern, the output of the SVM, and the hidden-neuron output of the support vector closest to the initial input pattern. For the recognition experiment, 45 different combinations of USPS digit data are used. Recognition performance is better when selective attention is applied together with the SVM than with the SVM alone. The proposed selective attention also performs better than common selective attention.

Optimal Construction of Multiple Indexes for Time-Series Subsequence Matching (시계열 서브시퀀스 매칭을 위한 최적의 다중 인덱스 구성 방안)

  • Lim, Seung-Hwan;Kim, Sang-Wook;Park, Hee-Jin
    • Journal of KIISE:Databases / v.33 no.2 / pp.201-213 / 2006
  • A time-series database is a set of time-series data sequences, each of which is a list of the changing values of an object over a given period of time. Subsequence matching is an operation that searches a time-series database for data subsequences whose changing patterns are similar to a query sequence. This paper addresses a performance issue of time-series subsequence matching. First, we quantitatively examine the performance degradation caused by the window size effect, and then show that the performance of subsequence matching with a single index is not satisfactory in real applications. We argue that index interpolation is useful for resolving this problem. Index interpolation performs subsequence matching by selecting the most appropriate index from multiple indexes, each built on windows of its own size. For index interpolation, we first decide the sizes of the windows for the multiple indexes to be built. In this paper, we solve the problem of selecting optimal window sizes from the perspective of physical database design. To this end, given a set of query sequences to be performed on a target time-series database and a set of window sizes for building multiple indexes, we devise a formula that estimates the cost of all the subsequence matchings. Based on this formula, we propose an algorithm that determines the optimal window sizes for maximizing the performance of all the subsequence matchings. We formally prove the optimality as well as the effectiveness of the algorithm. Finally, we perform a series of extensive experiments with a real-life stock data set and a large synthetic data set. The results reveal that the proposed approach improves on the previous one by a factor of 1.5 to 7.8.

An Exploration of MIS Quarterly Research Trends: Applying Topic Modeling and Keyword Network Analysis (MIS Quarterly 연구동향 탐색: 토픽모델링 및 키워드 네트워크 분석 활용)

  • Kang, Eunkyung;Jung, Yeonsik;Yang, Seonuk;Kwon, Jiyoon;Yang, Sung-Byung
    • Journal of Intelligence and Information Systems / v.28 no.2 / pp.207-235 / 2022
  • In a knowledge-based society where knowledge and information industries are the main pillars of the economy, knowledge sharing and diffusion and their systematic management are recognized as essential strategies for improving national competitiveness and sustainable social development. In the field of Information Systems (IS) research, where the convergence of information technology and management takes place in various ways, the evolution of knowledge occurs only when researchers cooperate in turning old knowledge into new knowledge from the perspective of the scientific knowledge network. In particular, it is possible to derive new insights by identifying topics of interest in the relevant research field, applied methodologies, and research trends through network-based interdisciplinary links such as citations, co-authorships, and keywords. In previous studies, various attempts have been made to understand the structure of the knowledge system and the research trends of the relevant community by revealing the relationships among research topics, methodologies, and co-authors. However, most studies have compared two or more journals and been limited to a certain period; hence, there is a lack of research covering the entire history of IS research. Therefore, this study examined all the papers published in the MIS Quarterly (MISQ) journal, which plays a leading role in revealing knowledge in the IS research field, from its first issue in 1977 to the first quarter of 2022, in the following order: (1) extracting keywords, (2) classifying the extracted keywords into research topics, methodologies, and theories, and (3) using topic modeling and keyword network analysis to identify, in chronological order, the changes in IS research from the beginning to the present. Through this study, it is expected that by examining the changes in IS research published in MISQ, the developing patterns of IS research can be revealed, and a new research direction can be presented to IS researchers, nurturing the sustainability of future research.
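
As a rough illustration of what a keyword network is, the sketch below builds a co-occurrence network from per-paper keyword lists: two keywords are linked each time they appear on the same paper. The abstract does not describe the study's actual pipeline or tools, so the function names, the lower-casing step, and the sample keywords are all assumptions made for the example.

```python
from collections import Counter
from itertools import combinations


def keyword_network(papers):
    """Return edge weights: how many papers list both keywords of each pair."""
    edges = Counter()
    for keywords in papers:
        # Normalize case and drop duplicates within one paper (an assumption).
        unique = sorted({k.strip().lower() for k in keywords if k.strip()})
        for a, b in combinations(unique, 2):
            edges[(a, b)] += 1
    return edges


def degree(edges):
    """Number of distinct neighbors per keyword in the co-occurrence network."""
    neighbors = Counter()
    for a, b in edges:
        neighbors[a] += 1
        neighbors[b] += 1
    return neighbors


if __name__ == "__main__":
    # Hypothetical keyword lists, one per paper.
    papers = [
        ["TAM", "Trust"],
        ["trust", "e-commerce"],
        ["TAM", "trust", "E-commerce"],
    ]
    edges = keyword_network(papers)
    print(edges.most_common())
    print(degree(edges))
```

Splitting the papers by publication period and building one network per period would give the kind of chronological comparison the abstract describes.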

A Comparison Analysis among Structural Equation Modeling (AMOS, LISREL and PLS) Using the Same Data (동일 데이터를 이용한 구조방정식 툴 간의 비교분석)

  • Nam, Soo-tai;Kim, Do-goan;Jin, Chan-yong
    • Journal of the Korea Institute of Information and Communication Engineering / v.22 no.7 / pp.978-984 / 2018
  • Structural equation modeling refers to statistical procedures that simultaneously perform path analysis and confirmatory factor analysis. Today, it is an essential tool for researchers in the social sciences. AMOS, LISREL, and PLS are representative tools that can perform structural equation modeling analysis. AMOS provides a graphical user interface that is convenient for beginners. PLS also offers a graphical user interface and has the advantage of not requiring normally distributed data. We therefore compared and analyzed these three tools, which are the most commonly used in the social sciences. Confirmatory factor analysis based on structural equation modeling was performed with IBM AMOS Ver. 23, LISREL 8.70, and SmartPLS 2.0. The comparison shows that LISREL yields the highest explanatory power for the dependent variables among the three tools. The path coefficients and t-values were similar across all three tools. This study suggests practical and theoretical implications based on the results.