• Title/Abstract/Keyword: high dimensional large-scale data

Search results: 45 (processing time: 0.03 s)

고차원 대용량 자료의 시각화에 대한 고찰 (A study on high dimensional large-scale data visualization)

  • 이은경;황나영;이윤동
    • The Korean Journal of Applied Statistics
    • /
    • Vol. 29, No. 6
    • /
    • pp.1061-1075
    • /
    • 2016
  • In this paper, we examine the problems that can arise in visualizing high-dimensional, large-scale data and discuss the methods developed to address them. For high-dimensional data, important variables must be selected in order to represent the data in two-dimensional space, and more variables can be displayed by using various visual encoding attributes and faceting methods. Projection pursuit methods can also be used to find low-dimensional views of interest. For large-scale data, the problem of overlapping points can be alleviated with jittering and alpha blending. We also review tabplot and scagnostics, R packages developed for exploring high-dimensional large-scale data, as well as various R packages for interactive web graphics.
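The jittering and alpha-blending remedies for overplotting mentioned above can be sketched as follows. This is a minimal NumPy illustration, not code from the paper; the `jitter` helper and its parameters are invented for the example, and the actual rendering call (shown as a comment) would use matplotlib.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate a large dataset where many points share identical coordinates,
# so a naive scatter plot would suffer from heavy overplotting.
x = rng.integers(0, 10, size=100_000).astype(float)
y = rng.integers(0, 10, size=100_000).astype(float)

def jitter(v, scale=0.15, rng=rng):
    """Spread coincident points by adding small uniform noise."""
    return v + rng.uniform(-scale, scale, size=v.shape)

xj, yj = jitter(x), jitter(y)

# With matplotlib, alpha blending then makes point density visible:
#   plt.scatter(xj, yj, s=2, alpha=0.05)
# Overlapping regions accumulate opacity, revealing dense areas.
```

Jittering separates exactly coincident points, while a low alpha value lets the eye read point density directly from accumulated opacity.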

Enhanced Locality Sensitive Clustering in High Dimensional Space

  • Chen, Gang;Gao, Hao-Lin;Li, Bi-Cheng;Hu, Guo-En
    • Transactions on Electrical and Electronic Materials
    • /
    • Vol. 15, No. 3
    • /
    • pp.125-129
    • /
    • 2014
  • A dataset can be clustered by merging the bucket indices that come from the random projections of locality-sensitive hashing functions; for this to work, the merging interval must be calculated first. To improve the feasibility of large-scale data clustering in high-dimensional space, we propose an enhanced locality-sensitive hashing clustering method. First, multiple hashing functions are generated. Second, data points are projected to bucket indices. Third, bucket indices are clustered to obtain class labels. Experimental results on synthetic datasets show that the method achieves high accuracy at much improved clustering speed, making it well suited to clustering data in high-dimensional space.
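The first two steps in the abstract, generating hash functions and projecting points to bucket indices, can be sketched roughly as follows. This is a minimal NumPy sketch of p-stable LSH bucketing for Euclidean distance; all names and parameter values are illustrative assumptions, not the authors' implementation, and the bucket-merging step is omitted.

```python
import numpy as np

rng = np.random.default_rng(42)

def lsh_bucket_indices(X, n_hashes=4, w=2.0, rng=rng):
    """Project points onto random directions and quantize into buckets
    (p-stable locality-sensitive hashing for Euclidean distance)."""
    d = X.shape[1]
    A = rng.normal(size=(d, n_hashes))    # random projection directions
    b = rng.uniform(0, w, size=n_hashes)  # random offsets
    return np.floor((X @ A + b) / w).astype(int)

# Two well-separated Gaussian blobs in 50-dimensional space.
X = np.vstack([rng.normal(0.0, 0.1, size=(50, 50)),
               rng.normal(5.0, 0.1, size=(50, 50))])

codes = lsh_bucket_indices(X)
# Nearby points tend to share bucket-index tuples, so merging equal
# (or adjacent, within the merging interval) codes yields cluster labels.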

대용량 데이터의 내용 기반 검색을 위한 분산 고차원 색인 구조 (A Distributed High Dimensional Indexing Structure for Content-based Retrieval of Large Scale Data)

  • 최현화;이미영;김영창;장재우;이규철
    • Journal of KIISE: Databases
    • /
    • Vol. 37, No. 5
    • /
    • pp.228-237
    • /
    • 2010
  • Although various index structures for high-dimensional data have been proposed, supporting content-based retrieval of images and video as an Internet service urgently requires a new high-dimensional index structure that provides both high scalability and improved k-nearest-neighbor search performance. We therefore propose the Distributed Vector Approximation-tree, a distributed index structure built over multiple computing nodes. It is a two-level distributed index structure that builds a hybrid spill-tree from sample data extracted from the large volume of high-dimensional data, and then maps a distributed computing node to each leaf node of the hybrid spill-tree to build a VA-file. We improve search performance by performing parallel k-nearest-neighbor search over the distributed vector approximation-tree built across the computing nodes. Experimental results on datasets with different distributions show that the distributed vector approximation-tree supports faster k-nearest-neighbor search than existing highly scalable index structures, with no loss of search accuracy.
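The VA-file at the leaves of the proposed structure rests on vector approximation: quantizing each dimension into a few cells so that compact codes can filter candidates before exact distance computation. A minimal sketch of that idea follows; the function name and parameters are illustrative, not the authors' code.

```python
import numpy as np

def va_approximation(X, bits=2):
    """Vector approximation (VA-file idea): quantize each dimension into
    2**bits equally spaced cells, keeping only the compact cell codes."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    cells = 2 ** bits
    # Scale each dimension to [0, cells), then clip onto the last cell.
    codes = np.floor((X - lo) / (hi - lo + 1e-12) * cells).astype(int)
    return np.clip(codes, 0, cells - 1), (lo, hi)

X = np.random.default_rng(0).uniform(0, 1, size=(1000, 32))
codes, (lo, hi) = va_approximation(X, bits=2)
# With 2 bits per dimension the codes are a coarse filter that rules out
# most candidates before exact k-NN distance computations on the vectors.
```

The cell boundaries give lower and upper bounds on each true distance, which is what allows most vectors to be pruned without touching the full-precision data.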

Cooperative Coevolution Differential Evolution Based on Spark for Large-Scale Optimization Problems

  • Tan, Xujie;Lee, Hyun-Ae;Shin, Seong-Yoon
    • Journal of information and communication convergence engineering
    • /
    • Vol. 19, No. 3
    • /
    • pp.155-160
    • /
    • 2021
  • Differential evolution is an efficient algorithm for solving continuous optimization problems. However, when differential evolution is applied to large-scale optimization problems, its performance deteriorates rapidly and the runtime increases exponentially. Hence, a novel cooperative coevolution differential evolution based on Spark (known as SparkDECC) is proposed. SparkDECC uses a divide-and-conquer strategy. First, the large-scale problem is decomposed into several low-dimensional subproblems using the random grouping strategy. Subsequently, each subproblem is addressed in parallel by exploiting the parallel computation capability of the resilient distributed dataset (RDD) model in Spark. Finally, the optimal solution of the entire problem is obtained using a cooperation mechanism. Experimental results on 13 high-dimensional benchmark functions show that the new algorithm performs well in terms of speedup and scalability, verifying its effectiveness and applicability.
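The random grouping decomposition used in the first step can be sketched as follows. This is a minimal Python sketch; the function name and group sizes are invented for illustration, and the per-group differential evolution optimizers and the Spark distribution are omitted.

```python
import random

def random_grouping(dim, n_groups, seed=0):
    """Decompose a large-scale problem's decision variables into
    disjoint low-dimensional subproblems (random grouping strategy)."""
    idx = list(range(dim))
    random.Random(seed).shuffle(idx)  # randomize variable interactions
    size = dim // n_groups
    return [idx[i * size:(i + 1) * size] for i in range(n_groups)]

# A 1000-dimensional problem split into 10 subproblems of 100 variables;
# each group could then be optimized by DE on a separate Spark partition,
# with a shared context vector supplying values for the other groups.
groups = random_grouping(1000, 10)
```

Re-randomizing the grouping every cycle gives interacting variables a chance to land in the same subproblem, which is the key property of the random grouping strategy.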

ADMM for least square problems with pairwise-difference penalties for coefficient grouping

  • Park, Soohee;Shin, Seung Jun
    • Communications for Statistical Applications and Methods
    • /
    • Vol. 29, No. 4
    • /
    • pp.441-451
    • /
    • 2022
  • In the era of big data, scalability is a crucial issue in learning models. Among many others, the Alternating Direction Method of Multipliers (ADMM; Boyd et al., 2011) has gained great popularity for solving large-scale problems efficiently. In this article, we propose applying the ADMM algorithm to solve the least square problem penalized by the pairwise-difference penalty, which is frequently used to identify group structures among coefficients. The ADMM algorithm enables us to solve the high-dimensional problem efficiently in a unified fashion and thus allows us to employ several different types of penalty functions, such as LASSO, Elastic Net, SCAD, and MCP, for the penalized problem. In addition, the ADMM algorithm extends naturally to distributed computation and real-time updates, both desirable when dealing with large amounts of data.
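The ADMM iterations for the pairwise-difference-penalized least squares problem can be sketched as follows, here with the LASSO-type absolute-difference penalty, whose proximal step is soft-thresholding. This is a minimal illustration using the standard ADMM splitting z = Db, not the authors' implementation; all names and tuning values are assumptions.

```python
import numpy as np

def soft(v, t):
    """Soft-thresholding: proximal operator of t * |.|."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def admm_pairwise_fusion(X, y, lam=1.0, rho=1.0, n_iter=200):
    """ADMM for (1/2)||y - Xb||^2 + lam * sum_{j<k} |b_j - b_k|,
    with the splitting z = Db, D the pairwise-difference operator."""
    n, p = X.shape
    pairs = [(j, k) for j in range(p) for k in range(j + 1, p)]
    D = np.zeros((len(pairs), p))
    for r, (j, k) in enumerate(pairs):
        D[r, j], D[r, k] = 1.0, -1.0
    z = np.zeros(len(pairs))
    u = np.zeros(len(pairs))                      # scaled dual variable
    Q = np.linalg.inv(X.T @ X + rho * D.T @ D)    # cached inverse
    for _ in range(n_iter):
        b = Q @ (X.T @ y + rho * D.T @ (z - u))   # b-update (ridge-like)
        z = soft(D @ b + u, lam / rho)            # z-update (prox step)
        u = u + D @ b - z                         # dual ascent
    return b

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))
beta_true = np.array([2.0, 2.0, -1.0, -1.0])      # two coefficient groups
y = X @ beta_true + 0.05 * rng.normal(size=100)
b = admm_pairwise_fusion(X, y, lam=2.0)
# The penalty pulls within-group coefficient estimates together.
```

Swapping the `soft` prox for the SCAD or MCP proximal operator in the z-update is what makes the framework accommodate those other penalties without changing the rest of the algorithm.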

3D/BIM Applications to Large-scale Complex Building Projects in Japan

  • Yamazaki, Yusuke;Tabuchi, Tou;Kataoka, Makoto;Shimazaki, Dai
    • International Journal of High-Rise Buildings
    • /
    • Vol. 3, No. 4
    • /
    • pp.311-323
    • /
    • 2014
  • This paper introduces recent applications of three-dimensional building/construction data modeling (3D) and building information modeling (BIM) to large-scale complex building construction projects in Japan. Recently, BIM has been utilized as a tool in construction process innovation through planning, design, engineering, procurement and construction to establish a front-loading-type design building system. Firstly, the background and introduction processes of 3D and BIM are described to clarify their purposes and scopes of applications. Secondly, 3D and BIM applications for typical large-scale complex building construction projects to improve planning and management efficiency in building construction are presented. Finally, future directions and further research issues with 3D and BIM applications are proposed.

Very deep super-resolution for efficient cone-beam computed tomographic image restoration

  • Hwang, Jae Joon;Jung, Yun-Hoa;Cho, Bong-Hae;Heo, Min-Suk
    • Imaging Science in Dentistry
    • /
    • Vol. 50, No. 4
    • /
    • pp.331-337
    • /
    • 2020
  • Purpose: As cone-beam computed tomography (CBCT) has become the most widely used 3-dimensional (3D) imaging modality in the dental field, storage space and costs for large-capacity data have become an important issue. Therefore, if 3D data can be stored at a clinically acceptable compression rate, the burden in terms of storage space and cost can be reduced and data can be managed more efficiently. In this study, a deep learning network for super-resolution was tested to restore compressed virtual CBCT images. Materials and Methods: Virtual CBCT image data were created with a publicly available online dataset (CQ500) of multidetector computed tomography images using CBCT reconstruction software (TIGRE). A very deep super-resolution (VDSR) network was trained to restore high-resolution virtual CBCT images from the low-resolution virtual CBCT images. Results: The images reconstructed by VDSR showed better image quality than bicubic interpolation in restored images at various scale ratios. The highest scale ratio with clinically acceptable reconstruction accuracy using VDSR was 2.1. Conclusion: VDSR showed promising restoration accuracy in this study. In the future, it will be necessary to experiment with new deep learning algorithms and large-scale data for clinical application of this technology.

Data Mining for High Dimensional Data in Drug Discovery and Development

  • Lee, Kwan R.;Park, Daniel C.;Lin, Xiwu;Eslava, Sergio
    • Genomics & Informatics
    • /
    • Vol. 1, No. 2
    • /
    • pp.65-74
    • /
    • 2003
  • Data mining differs from traditional data analysis primarily on one important dimension, namely the scale of the data; that is why not only statistical but also computer science principles are needed to extract information from large data sets. In this paper we briefly review data mining, its characteristics, typical data mining algorithms, and potential and ongoing applications of data mining in the biopharmaceutical industry. The distinguishing characteristics of data mining lie in its understandability, scalability, problem-driven nature, and its analysis of retrospective or observational data, in contrast to experimentally designed data. At a high level, one can identify three types of problems for which data mining is useful: description, prediction, and search. Our brief review of data mining algorithms covers decision trees and rules, nonlinear classification methods, memory-based methods, model-based clustering, and graphical dependency models. Application areas covered include compound libraries in drug discovery, clinical trial and disease management data, genomics and proteomics, structural databases for candidate drug compounds, and other applications of pharmaceutical relevance.

Set Covering 기반의 대용량 오믹스데이터 특징변수 추출기법 (Set Covering-based Feature Selection of Large-scale Omics Data)

  • 마정우;안기동;김광수;류홍서
    • Journal of the Korean Operations Research and Management Science Society
    • /
    • Vol. 39, No. 4
    • /
    • pp.75-84
    • /
    • 2014
  • In this paper, we address the feature selection problem for large-scale, high-dimensional biological data such as omics data. Most previous approaches to this problem use a simple score function to reduce the number of original variables and then select features from the small number of remaining variables. Methods that do not rely on such filtering techniques either fail to consider the interactions between variables or generate approximate solutions to a simplified problem. In contrast, by combining set covering and clustering techniques, we develop a new method that can handle the full set of variables and account for combinatorial effects among them when selecting good features. To demonstrate the efficacy and effectiveness of the method, we downloaded gene expression datasets from TCGA (The Cancer Genome Atlas) and compared our method with other algorithms, including the feature selection algorithms embedded in WEKA. The experimental results show that our method selects high-quality features that yield more accurate classifiers than other feature selection algorithms.
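The set covering component can be illustrated with the classic greedy approximation, where each feature "covers" the sample pairs it can discriminate. This is a toy sketch: the gene names and covered sets are invented for illustration, and the paper's clustering step and exact formulation are omitted.

```python
def greedy_set_cover(universe, subsets):
    """Greedy approximation to set cover: repeatedly pick the feature
    whose covered set adds the most still-uncovered elements."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        best = max(subsets, key=lambda f: len(subsets[f] & uncovered))
        gained = subsets[best] & uncovered
        if not gained:
            break  # remaining elements cannot be covered by any feature
        chosen.append(best)
        uncovered -= gained
    return chosen

# Toy example: each feature covers the sample pairs it can discriminate.
subsets = {
    "gene_A": {1, 2, 3},
    "gene_B": {3, 4},
    "gene_C": {4, 5, 6},
    "gene_D": {1, 6},
}
picked = greedy_set_cover({1, 2, 3, 4, 5, 6}, subsets)
# → ["gene_A", "gene_C"]: two features suffice to cover every pair.
```

The greedy rule gives the well-known logarithmic approximation guarantee for set cover, which is why it is a common building block when an exact covering formulation is too large to solve directly.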

필터링에 기반한 고차원 색인구조의 동시성 제어기법의 설계 및 구현 (Design and Implementation of High-dimensional Index Structure for the support of Concurrency Control)

  • 이용주;장재우;김학영;김명준
    • The KIPS Transactions: Part D
    • /
    • Vol. 10D, No. 1
    • /
    • pp.1-12
    • /
    • 2003
  • Recently, many multidimensional and high-dimensional index structures have been actively studied for efficient retrieval of multimedia data such as images and video. However, existing research on index structures has focused on maximizing retrieval efficiency, making them ill-suited to multi-user environments such as recent multimedia database and data mining applications. In this paper, we design and implement a concurrency control scheme for a filtering-based high-dimensional index structure, which mitigates the rapid performance degradation of existing structures as dimensionality grows by constructing signatures of the feature vectors, and we tightly integrate it with the SHORE storage system, a persistent object storage system developed at the University of Wisconsin. The extended SHORE storage system supports not only efficient retrieval of high-dimensional data but also record-level concurrency control on index data, and it improves on the design that loads the entire signature file into memory so that page-level management becomes possible. In addition, to apply the extended SHORE storage system to real applications, we present a middleware architecture implemented in Java, which provides a platform-independent environment. Using this middleware, we evaluate performance per thread in a multi-user environment for the representative content-based query types: point queries, range queries, and k-nearest-neighbor queries.
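The signature-based filtering idea, compressing feature vectors into bit signatures that cheaply prune non-matching records before exact distance checks, can be sketched as follows. This is a toy Python sketch; the thresholding scheme, names, and values are illustrative assumptions, not the paper's design.

```python
def signature(vector, threshold=0.5):
    """Compress a feature vector into a bit signature: one bit per
    dimension, set when that component exceeds the threshold."""
    return sum(1 << i for i, v in enumerate(vector) if v > threshold)

def may_match(sig_query, sig_data):
    """Filtering step: a stored vector can only be a strong candidate
    if its signature overlaps the query's; mismatches are pruned
    with a single bitwise AND instead of a full distance computation."""
    return (sig_query & sig_data) != 0

db = [[0.9, 0.1, 0.8, 0.2],   # strong in dims 0 and 2
      [0.1, 0.2, 0.1, 0.1],   # weak everywhere
      [0.7, 0.6, 0.9, 0.1]]   # strong in dims 0, 1, 2
sigs = [signature(v) for v in db]

q = signature([0.8, 0.1, 0.9, 0.3])
candidates = [i for i, s in enumerate(sigs) if may_match(q, s)]
# → [0, 2]: only the surviving candidates proceed to the exact
# high-dimensional distance check on the full feature vectors.
```

Because signatures are tiny relative to the feature vectors, they can also be managed page by page in the storage layer, which matches the paper's move away from loading the whole signature file into memory.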