• Title/Summary/Keyword: Big data algorithms


Support vector machines for big data analysis (빅 데이터 분석을 위한 지지벡터기계)

  • Choi, Hosik;Park, Hye Won;Park, Changyi
    • Journal of the Korean Data and Information Science Society / v.24 no.5 / pp.989-998 / 2013
  • Big data, which has recently attracted attention in industry and academia, cannot be analyzed with the batch processing algorithms developed in data mining because big data, by definition, cannot be loaded and processed in the memory of a single system. An urgent issue is therefore to develop learning algorithms that can be applied to big data. In this paper, we review various algorithms for support vector machines in the literature. In particular, we introduce online and parallel processing algorithms that are expected to be useful in big data classification, and we compare the strengths, weaknesses, and performance of those algorithms through simulations for linear classification.
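The abstract does not say which online algorithms the authors compare. As a hedged illustration of the online idea only, the sketch below uses a Pegasos-style stochastic subgradient update for a linear SVM, which processes one example at a time and never holds the full data set in memory. The function names, the regularization value `lam`, and the toy data generator are assumptions made for this example and are not taken from the paper.

```python
import random


def pegasos_train(stream, dim, lam=0.01):
    """One pass of Pegasos-style SGD over (x, y) pairs, with y in {-1, +1}."""
    w = [0.0] * dim
    for t, (x, y) in enumerate(stream, start=1):
        eta = 1.0 / (lam * t)  # decaying step size
        margin = y * sum(wi * xi for wi, xi in zip(w, x))
        # Shrink the weights (subgradient of the L2 regularizer).
        w = [(1.0 - eta * lam) * wi for wi in w]
        # Add the hinge-loss subgradient only when the margin is violated.
        if margin < 1.0:
            w = [wi + eta * y * xi for wi, xi in zip(w, x)]
    return w


def predict(w, x):
    """Return the predicted label (+1 or -1) for one example."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0.0 else -1


def make_separable_data(n, seed=0):
    """Hypothetical toy data: label is +1 when x1 > x2, with a gap around x1 == x2."""
    rng = random.Random(seed)
    data = []
    while len(data) < n:
        x1, x2 = rng.uniform(-1, 1), rng.uniform(-1, 1)
        if abs(x1 - x2) < 0.2:
            continue  # keep the two classes clearly separated
        data.append(((x1, x2), 1 if x1 > x2 else -1))
    return data


data = make_separable_data(2000)
w = pegasos_train(iter(data), dim=2)
accuracy = sum(predict(w, x) == y for x, y in data) / len(data)
print(f"training accuracy: {accuracy:.3f}")
```

Because `pegasos_train` consumes an iterator, the same loop could in principle read examples from disk or a network stream instead of a list.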

The Effect of AI and Big Data on an Entry Firm: Game Theoretic Approach (인공지능과 빅데이터가 시장진입 기업에 미치는 영향관계 분석, 게임이론 적용을 중심으로)

  • Jeong, Jikhan
    • Journal of Digital Convergence / v.19 no.7 / pp.95-111 / 2021
  • Despite the innovation of AI and Big Data, theoretical research about the effect of AI and Big Data on market competition is still in its early stages; therefore, this paper analyzes the effect of AI, Big Data, and data sharing on an entry firm by using game theory. In detail, the firms' business environments are divided into internal and external ones. Then, AI algorithms are divided into algorithms for (1) customer marketing, (2) cost reduction without automation, and (3) cost reduction with automation. Big Data is also divided into external and internal data. This study shows that the sharing of external data does not affect the incumbent firm's algorithms for consumer marketing while lowering the entry firm's entry barrier. Improving the incumbent firm's algorithms for cost reduction (with and without automation) and external data can be an entry barrier for the entry firm. These findings can be helpful (1) to analyze the effect of AI, Big Data, and data sharing on market structure, market competition, and firm behaviors and (2) to design policy for AI and Big Data.

An Efficient Algorithm of Data Anonymity based on Anonymity Groups (익명 그룹 기반의 효율적인 데이터 익명화 알고리즘)

  • Kwon, Ho Yeol
    • Journal of Industrial Technology / v.36 / pp.89-92 / 2016
  • In this paper, we propose an efficient anonymity algorithm for personal information protection in big data systems. First, we briefly introduce the fundamental algorithms of k-anonymity, l-diversity, and t-closeness. We then propose an anonymity algorithm that controls the size of anonymity groups and exchanges data tuples between anonymity groups. Finally, we demonstrate the proposed algorithm on an example. The proposed scheme yields an efficient and simple algorithm for processing large amounts of data.
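The abstract does not describe the proposed group-size control or tuple-exchange procedure, so the sketch below covers only the standard k-anonymity condition it builds on: every combination of quasi-identifier values must be shared by at least k records. The field names and sample records are hypothetical.

```python
from collections import Counter


def anonymity_group_sizes(records, quasi_identifiers):
    """Count how many records share each combination of quasi-identifier values."""
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in records)


def is_k_anonymous(records, quasi_identifiers, k):
    """True when every anonymity group contains at least k records."""
    sizes = anonymity_group_sizes(records, quasi_identifiers)
    return all(size >= k for size in sizes.values())


# Hypothetical, already-generalized records (age ranges, truncated postal codes).
records = [
    {"age_range": "20-29", "zip_prefix": "130", "diagnosis": "flu"},
    {"age_range": "20-29", "zip_prefix": "130", "diagnosis": "asthma"},
    {"age_range": "30-39", "zip_prefix": "148", "diagnosis": "flu"},
    {"age_range": "30-39", "zip_prefix": "148", "diagnosis": "diabetes"},
    {"age_range": "30-39", "zip_prefix": "148", "diagnosis": "flu"},
]
quasi_identifiers = ["age_range", "zip_prefix"]

print(anonymity_group_sizes(records, quasi_identifiers))
print(is_k_anonymous(records, quasi_identifiers, k=2))
```

A group that falls below k is where an anonymization scheme would have to generalize further, suppress records, or move tuples between groups.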


Scalable Prediction Models for Airbnb Listing in Spark Big Data Cluster using GPU-accelerated RAPIDS

  • Muralidharan, Samyuktha;Yadav, Savita;Huh, Jungwoo;Lee, Sanghoon;Woo, Jongwook
    • Journal of information and communication convergence engineering / v.20 no.2 / pp.96-102 / 2022
  • We aim to build predictive models for Airbnb's prices using GPU-accelerated RAPIDS in a big data cluster. The Airbnb Listings datasets are used for the predictive analysis. Several machine-learning algorithms have been adopted to build models that predict the price of Airbnb listings. We compare the results of traditional and big data approaches to machine learning for price prediction and discuss the performance of the models. We built big data models using Databricks Spark Cluster, a distributed parallel computing system. Furthermore, we implemented models on multiple GPUs using RAPIDS in the Spark cluster. The model was developed using the XGBoost algorithm, whereas other models were developed using traditional central processing unit (CPU)-based algorithms. This study compared all models in terms of accuracy metrics and computing time. We observed that the XGBoost model with RAPIDS using GPUs had the highest accuracy and computing time.

Understanding Child Abuse Based on Big Data Analysis -A Basic Study on the Development of Machine Learning Algorithm- (빅데이터 분석에 기반한 아동학대의 이해 -머신러닝 알고리즘 개발 기초연구-)

  • Bae, Jungho;Burm, Eunae
    • Journal of Internet of Things and Convergence / v.8 no.4 / pp.57-63 / 2022
  • The purpose of this study is to provide basic data for policy development using big data analysis and machine learning algorithms as part of preparing measures to prevent child abuse. To analyze big data for developing machine learning algorithms to prevent child abuse, frequency analysis, related word analysis, and emotional analysis were performed after defining academic databases and social network service data as big data. The study finds that a preventive child abuse algorithm can be developed by preparing a data collection and sharing network system to prevent child abuse from the perspective of children affected by child abuse, perpetrators, and government authorities. This may become possible by institutionalizing self-esteem, depression, and anxiety tests for infants, given indications that depression and anxiety arise from a decrease in self-concept among children affected by child abuse. We suggest continued big data collection and analysis and continued algorithm development research to prevent child abuse, and we expect that effective policies to prevent child abuse will be realized to eradicate child abuse crimes.

Analysis of problems caused by Big Data's private information handling (빅데이터 개인정보 취급에 따른 문제점 분석)

  • Choi, Hee Sik;Cho, Yang Hyun
    • Journal of Korea Society of Digital Industry and Information Management / v.10 no.1 / pp.89-97 / 2014
  • Recently, the spread of smartphones has driven the growth of mobile services and, with it, Big Data technologies such as cloud services, which can handle large amounts of data that are hard to collect, store, search, and analyze. Many companies have collected a variety of private and personal information without users' consent for their business strategy and marketing, which has raised social issues. As companies use Big Data, the number of damage cases is growing. This paper argues that the methods used to analyze and study data are very important when processing Big Data, and that the choice of security level and algorithm is important for the security of private information. To use Big Data, personal data must be encrypted, which underscores the importance of the security level and the choice of algorithm. The paper also suggests that research on the utilization of Big Data and the protection of private information, together with guidelines for users, is required for the security of private information and the growth of Big Data industries.

Iowa Liquor Sales Data Predictive Analysis Using Spark

  • Ankita Paul;Shuvadeep Kundu;Jongwook Woo
    • Asia pacific journal of information systems / v.31 no.2 / pp.185-196 / 2021
  • The paper aims to analyze and predict sales of liquor in the state of Iowa by applying machine learning algorithms to models built for prediction. We use Azure ML and Spark ML for our predictive analysis, which are a legacy machine learning (ML) system and a Big Data ML system, respectively. We worked on the Iowa liquor sales dataset, comprising records from 2012 to 2019 in 24 columns and approximately 1.8 million rows. We conclude by comparing the models built with different algorithms and their accuracy in predicting sales using both Azure ML and Spark ML. We find that the Linear Regression model has the highest precision and Decision Forest Regression has the fastest computing time with the sample data set using the legacy Azure ML systems. The Decision Tree Regression model in Spark ML has the highest accuracy with the quickest computing time for the entire data set using the Big Data Spark systems.

Feature Selection for Creative People Based on Big 5 Personality traits and Machine Learning Algorithms (Big 5 성격 요소와 머신 러닝 알고리즘을 통한 창의적인 사람들의 특징 연구)

  • Kim, Yong-Jun
    • The Journal of the Institute of Internet, Broadcasting and Communication / v.19 no.1 / pp.97-102 / 2019
  • Creative people are difficult to define because there is no systematic classification and analysis method that uses accurate criteria or numerical values for them. To address this problem, this study attempts to analyze how to distinguish creative people and what kind of personality they have. In this study, I first survey the Big 5 personality traits, classify and analyze the data set using the data mining tool WEKA, and then analyze the data set related to creativity. The goal is to analyze the features using various machine learning techniques. I use seven feature selection algorithms, select the feature groups they identify, apply them to machine learning algorithms to measure accuracy, and derive the results.

Agriculture Big Data Analysis System Based on Korean Market Information

  • Chuluunsaikhan, Tserenpurev;Song, Jin-Hyun;Yoo, Kwan-Hee;Rah, Hyung-Chul;Nasridinov, Aziz
    • Journal of Multimedia Information System / v.6 no.4 / pp.217-224 / 2019
  • As the world's population grows, how to maintain the food supply is becoming a bigger problem. Now and in the future, big data will play a major role in decision making in the agriculture industry. The challenge is how to obtain valuable information to help us make future decisions. Big data helps us see history more clearly, uncover hidden value, and make the right decisions for the government and farmers. To contribute to solving this challenge, we developed the Agriculture Big Data Analysis System. The system consists of agricultural big data collection, big data analysis, and big data visualization. First, we collected structured data such as price, climate, and yield, and unstructured data such as news, blogs, and TV programs. Using the data that we collected, we implemented prediction algorithms such as ARIMA, Decision Tree, LDA, and LSTM and show the results in data visualizations.

Feature Selection Using Submodular Approach for Financial Big Data

  • Attigeri, Girija;Manohara Pai, M.M.;Pai, Radhika M.
    • Journal of Information Processing Systems / v.15 no.6 / pp.1306-1325 / 2019
  • As the world moves towards digitization, data is generated from various sources at a faster rate. It is growing enormous and is termed big data. The financial sector is one domain that needs to leverage the big data being generated to identify financial risks, fraudulent activities, and so on. The design of predictive models for such financial big data is imperative for maintaining the health of the country's economy. Financial data has many features such as transaction history, repayment data, purchase data, investment data, and so on. The main problem in predictive algorithms is finding the right subset of representative features from which the predictive model can be constructed for a particular task. This paper proposes a correlation-based method using submodular optimization for selecting the optimum number of features, thereby reducing the dimensions of the data for faster and better prediction. The central proposition is that the optimal feature subset should contain features that correlate highly with the class label but do not correlate with each other. Experiments are conducted to understand the effect of the various subsets on different classification algorithms for loan data. The IBM Bluemix BigData platform is used for experimentation along with the Spark notebook. The results indicate that the proposed approach achieves considerable accuracy with optimal subsets in significantly less execution time. The algorithm is also compared with existing feature selection and extraction algorithms.
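The proposition in this abstract (high correlation with the label, low correlation among the chosen features) can be illustrated with a simple greedy heuristic. This is a sketch under stated assumptions, not the paper's method: the abstract does not give the submodular objective, so the score used here (absolute Pearson correlation with the label minus summed absolute correlation with already selected features), the `redundancy_weight` parameter, and the toy columns are all hypothetical.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def greedy_select(features, label, max_features, redundancy_weight=1.0):
    """Greedily pick features that are relevant to the label but not redundant."""
    relevance = {name: abs(pearson(col, label)) for name, col in features.items()}
    selected = []
    remaining = set(features)

    def gain(name):
        redundancy = sum(abs(pearson(features[name], features[s])) for s in selected)
        return relevance[name] - redundancy_weight * redundancy

    while remaining and len(selected) < max_features:
        best = max(sorted(remaining), key=gain)
        if gain(best) <= 0:
            break  # every remaining feature is more redundant than relevant
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical toy columns: "a_copy" duplicates the information in "a".
features = {
    "a": [-2, -1, 0, 1, 2],
    "a_copy": [-6, -3, 0, 3, 6],
    "b": [1, -1, 0, -1, 1],
}
label = [-3, -3, 0, 1, 5]  # equals 2 * a + b

print(greedy_select(features, label, max_features=3))
```

In this toy case the duplicate column is skipped because its redundancy with the first pick outweighs its relevance, which is the behavior the abstract's proposition asks for.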