• Title/Summary/Keyword: SparkR

Comparison of Scala and R for Machine Learning in Spark (스파크에서 스칼라와 R을 이용한 머신러닝의 비교)

  • Woo-Seok Ryu
    • The Journal of the Korea institute of electronic communication sciences / v.18 no.1 / pp.85-90 / 2023
  • Data analysis methodology in the healthcare field is shifting from traditional statistics-oriented research methods to predictive research using machine learning. In this study, we survey various machine learning tools and compare several programming models that combine R and Spark, with the aim of applying R, a statistical tool widely used in the healthcare field, to machine learning. In addition, we compare the performance of a linear regression model implemented in Scala, the base language of Spark, and in R. In the experiment, the training time with SparkR was 10 to 20% longer than with Scala. Given this level of performance degradation, SparkR's distributed processing was confirmed to be useful because R, a traditional statistical analysis tool, can be used as is.

Distributed Processing of Big Data Analysis based on R using SparkR (SparkR을 이용한 R 기반 빅데이터 분석의 분산 처리)

  • Ryu, Woo-Seok
    • The Journal of the Korea institute of electronic communication sciences / v.17 no.1 / pp.161-166 / 2022
  • In this paper, we analyze the problems that occur when performing big data analysis with R as the data analysis tool, and show the usefulness of data analysis with SparkR, which connects R and Spark to support distributed processing of big data effectively. First, we study the memory allocation problem that occurs in R when loading large amounts of data and performing operations, as well as the characteristics and programming environment of SparkR. We then compare the execution performance of linear regression analysis in each environment. The analysis shows that R can be used for data analysis through SparkR without learning an additional language, and that code written in R can be processed effectively in a distributed manner as the number of nodes in the cluster increases.

A Pulser System with Parallel Spark Gaps at High Repetition Rate

  • Lee, Byung-Joon;Nam, Jong-Woo;Rahaman, Hasibur;Nam, Sang-Hoon;Ahn, Jae-Woon;Jo, Seung-Whan;Kwon, Hae-Ok
    • Journal of IKEEE / v.15 no.4 / pp.305-312 / 2011
  • The primary aim of this work is to develop an efficient and powerful repetitive pulser system for ultra-wideband generation. The key component of the pulser system is a small coaxial spark gap with planar electrodes, filled with SF6 gas. Repetitive switching of the coaxial spark gap generates two consecutive pulses in less than a microsecond, with rise times of a few hundred picoseconds (ps). Several parameters for the repetitive switching of the spark gap must be optimized in the charging and discharging systems of the pulser. The parameters in the charging system include the circuit scheme, the circuit elements, and the applied voltage and current ratings of the power supplies. The parameters in the discharging system include the spark gap geometry, electrode gap distance, gas type, gas pressure, and the load. The characteristics of the spark gap discharge, such as breakdown voltage, output current pulse, and recovery rate, are too dynamic to control when switching continuously at a high pulse repetition rate (PRR). This leads to a low charging efficiency of the spark gap system. The low charging efficiency is overcome by operating two spark gaps in parallel. The operational behavior of the two-spark-gap system is presented in this paper. The work focuses on improving the charging efficiency by scaling the PRR of each spark gap in the two-spark-gap system.

Processing large-scale data with Apache Spark (Apache Spark를 활용한 대용량 데이터의 처리)

  • Ko, Seyoon;Won, Joong-Ho
    • The Korean Journal of Applied Statistics / v.29 no.6 / pp.1077-1094 / 2016
  • Apache Spark is a fast, general-purpose cluster computing package. It provides a new abstraction, the resilient distributed dataset, which supports fault tolerance while keeping data in memory. This abstraction yields a significant speedup over MapReduce, the legacy large-scale data framework. In particular, the Spark framework is well suited to iterative machine learning applications such as logistic regression and K-means clustering, and to interactive data querying. Thanks to its versatility, Spark also supports high-level libraries for applications such as machine learning, streaming data processing, database querying, and graph data mining. In this work, we introduce the concept and programming model of Spark and show implementations of some simple statistical computing applications. We also review the machine learning package MLlib and the R language interface SparkR.

Consolidation Behavior of Ti-6Al-4V Powder by Spark Plasma Sintering (Spark plasma sintering에 의한 Ti-6Al-4V 합금분말의 성형성)

  • Kim, J.H.;Lee, J.K.;Kim, T.S.
    • Journal of Powder Materials / v.14 no.1 s.60 / pp.32-37 / 2007
  • Using the spark plasma sintering (SPS) process, Ti-6Al-4V alloy powders were successfully consolidated without any contamination from reaction between the alloy powders and the graphite mold. Variations in microstructure and mechanical properties were investigated as a function of SPS temperature and time. Compared with hot isostatic pressing (HIP), the sintering time and temperature could be lowered to 10 min and $900^{\circ}C$, respectively. Under this SPS condition, the UTS and elongation were about 890 MPa and 24%, respectively. Considering the density of 98.5% and the elongation of 24%, a further improvement in tensile strength could be obtained by increasing the SPS pressure.

Direct Solid Sample Analysis in the Moderate Power He MIP with the Spark Generation

  • S. R. Koirtyohann;Yong-Nam Pak
    • Bulletin of the Korean Chemical Society / v.15 no.8 / pp.622-627 / 1994
  • Conducting solid samples are successfully analyzed by spark ablation combined with a moderate-power (500 W) helium microwave-induced plasma (He MIP). The relative standard deviations are in the range of 3-10%, and the detection limits are around 50 $\mu$g g$^{-1}$. These values are higher than those of Ar MIP or Ar inductively coupled plasma. Spark-ablated particles are examined to investigate the analytical characteristics of the system.

The Phase Analysis of MgB2 Fabricated by Spark Plasma Sintering after Ball Milling (볼 밀링 후 방전플라즈마 소결법에 의해 제조된 MgB2의 상 분석)

  • Kang, Deuk-Kyun;Choi, Sung-Hyun;Ahn, In-Shup
    • Journal of Powder Materials / v.15 no.5 / pp.371-377 / 2008
  • This paper deals with the phase analysis of bulk $MgB_2$ produced by spark plasma sintering after ball milling. Mg and amorphous B powders were used as raw materials and milled in a planetary mill for 9 hours in an argon atmosphere. DTA and XRD were used to confirm the formation of the $MgB_2$ phase. The milled powders were consolidated into bulk $MgB_2$ at various temperatures by spark plasma sintering. The fabricated bulk $MgB_2$ was evaluated with XRD, EDS, FE-SEM, and PPMS. In the DTA result, the reaction forming the $MgB_2$ phase started at $340^{\circ}C$, which means that the ball milling process improves reactivity for the formation of the $MgB_2$ phase. The $MgB_2$, MgO, and FeB phases were identified from the XRD result. MgO and FeB were undesirable phases that affect the formation of the $MgB_2$ phase, and their distribution could be confirmed from the EDS mapping result. The sample spark plasma sintered for 5 min at $700^{\circ}C$ was relatively dense; its density and superconducting transition temperature were $1.87\;g/cm^3$ and 21 K, respectively.

Computer Simulation of Fuel States and Spark Timing in Engine Model (엔진모델에서의 연료상태와 점화시기의 컴퓨터 해석)

  • Lee, Deog-Kyoo;Kim, You-Nam;Park, Hee-Chul;Woo, Kwang-Bang
    • Proceedings of the KIEE Conference / 1989.07a / pp.89-93 / 1989
  • In this paper, a mathematical engine model based on actual engine operation is formulated for use in the evaluation and development of engine control systems. The model takes the classification of fuel particle size into account. The model is simulated through a mathematical treatment of the intake manifold in the rapidly accelerated state. The spark timing is analyzed with respect to engine r.p.m. The results show that the model behaves similarly to actual engine operation and that spark timing is very important in characterizing engine r.p.m.

A Performance Comparison of Machine Learning Library based on Apache Spark for Real-time Data Processing (실시간 데이터 처리를 위한 아파치 스파크 기반 기계 학습 라이브러리 성능 비교)

  • Song, Jun-Seok;Kim, Sang-Young;Song, Byung-Hoo;Kim, Kyung-Tae;Youn, Hee-Yong
    • Proceedings of the Korean Society of Computer Information Conference / 2017.01a / pp.15-16 / 2017
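  • With the arrival of the IoT era, large-scale data is being generated in real time, and interest is growing in the distributed processing and machine learning needed to process and use it efficiently. Apache Spark is a distributed processing platform that supports RDD-based in-memory processing. Because it supports integration with a variety of machine learning libraries, it has recently attracted attention as a next-generation big data analysis engine. In this paper, we compare the data processing performance of MLlib, Apache Mahout, and SparkR, machine learning libraries that can be integrated with Apache Spark. To this end, we use the naive Bayes algorithm, a representative machine learning algorithm, and compare training time and prediction time to identify the machine learning library best suited to real-time data processing on Apache Spark.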

Performance Comparison of Python and Scala APIs in Spark Distributed Cluster Computing System (Spark 기반에서 Python과 Scala API의 성능 비교 분석)

  • Ji, Keung-yeup;Kwon, Youngmi
    • Journal of Korea Multimedia Society / v.23 no.2 / pp.241-246 / 2020
  • Hadoop is a framework for processing large data sets in a distributed way across clusters of nodes. It has been a popular platform for processing big data, but in recent years other platforms have become competitive, depending on the characteristics of the application. Spark is a distributed platform that enables real-time data processing and improves overall processing performance over Hadoop by using in-memory processing instead of disk I/O. Whereas Hadoop is designed to work with Java and data analysis is done through the Java API, Spark provides APIs for Scala, Python, Java, and R. In this paper, the goal is to find out whether the APIs of different programming languages affect performance in Spark. We chose two popular APIs: Python and Scala. Python is easy to learn and is widely used in the AI domain. Scala is a programming language with advantages for parallelism. Our experiment shows much faster processing with the Scala API than with the Python API. Further study is needed on performance issues in AI-based analysis.