• Title/Summary/Keyword: Big data processing

A Study On YouTube Fake News Detection System Using Sentence-BERT (Sentence-BERT를 활용한 YouTube 가짜뉴스 탐지 시스템 연구)

  • Beom Jung Kim;Ji Hye Huh;Hyeopgeon Lee;Young Woon Kim
    • Proceedings of the Korea Information Processing Society Conference
    • /
    • 2023.11a
    • /
    • pp.667-668
    • /
    • 2023
  • Advances in IT have diversified the platforms that deliver news, and a problem has recently emerged in which foreign interviews and foreign news are repackaged as YouTube Shorts with captions that differ from the speaker's intent, producing fake news. This paper therefore proposes a YouTube fake news detection system based on Sentence-BERT. The proposed system uses Python libraries to separate the audio and video data of a YouTube clip; the video data are passed through EasyOCR to extract the caption text, and Sentence-BERT is then used to analyze the textual similarity between the captions and the speech. In the analysis, videos whose audio matched their captions showed about 62% higher sentence similarity than videos where the two disagreed.
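A minimal sketch of the caption-versus-speech similarity step described above, assuming the sentence-transformers library; the model name and the 0.7 threshold are assumptions, and the upstream extraction steps (speech-to-text, EasyOCR) are omitted:

```python
# Hypothetical similarity check between transcribed speech and OCR'd captions.
from sentence_transformers import SentenceTransformer, util

# Multilingual model chosen because the source captions are Korean (assumption).
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def caption_matches_speech(speech_text: str, caption_text: str,
                           threshold: float = 0.7) -> bool:
    """Embed both texts and flag a mismatch when cosine similarity is low."""
    emb = model.encode([speech_text, caption_text], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item() >= threshold
```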

Analysis of Difference between W3C WebAssembly and CNCF WebAssembly For Cross-Platform Application (크로스 플랫폼 어플리케이션 개발을 위한 W3C WebAssembly와 CNCF WebAssembly의 차이점 비교 분석)

  • Hayoon Kim;Wonjib Kim;Hyeop Geon Lee;Young Woon Kim
    • Proceedings of the Korea Information Processing Society Conference
    • /
    • 2024.05a
    • /
    • pp.79-80
    • /
    • 2024
  • Cross-platform development produces, from a single codebase, an application that behaves identically on multiple platforms, which reduces development cost and simplifies maintenance. Against this background, this paper compares and analyzes the differences between W3C WebAssembly and CNCF WebAssembly for cross-platform application development.

Implementation of High Speed Big Data Processing System using In Memory Data Grid in Semiconductor Process (반도체 공정에서 인 메모리 데이터 그리드를 이용한 고속의 빅데이터 처리 시스템 구현)

  • Park, Jong-Beom;Lee, Alex;Kim, Tony
    • The Journal of The Korea Institute of Intelligent Transport Systems
    • /
    • v.15 no.5
    • /
    • pp.125-133
    • /
    • 2016
  • Data processing capacity and speed have increased rapidly with recent advances in hardware and software. As a result, data usage is growing geometrically, and the volume computers must process already exceeds five thousand transactions per second. The importance of big data lies in its real-time nature, which makes it possible to analyze all of the data and obtain accurate results at the right time under any circumstances. Related research is also active, since smart factories built on big data are expected to reduce development, production, and quality-management costs. In this paper, a high-speed processing system using an In-Memory Data Grid is implemented for the semiconductor process, where large volumes of data are generated, and the performance improvement is demonstrated experimentally. The implemented system is expected to be applicable not only to semiconductors but to any field that uses big data, and further research will pursue such applications.
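The abstract does not name a specific data grid product, so the sketch below uses Hazelcast purely as a representative In-Memory Data Grid; the map name and reading fields are hypothetical. The point is that hot process data is written to and read from cluster memory rather than disk-backed storage:

```python
# Illustrative IMDG usage with the Hazelcast Python client (an assumption;
# the paper's actual grid product is not specified).
import hazelcast

client = hazelcast.HazelcastClient()  # connects to a local cluster by default
sensors = client.get_map("wafer-sensor-readings").blocking()

# Writes land in distributed cluster memory, so hot process data can be
# read back without a round trip to disk-based storage.
sensors.put("lot42/step7", {"temp_c": 212.4, "pressure_pa": 101325})
latest = sensors.get("lot42/step7")
client.shutdown()
```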

LDBAS: Location-aware Data Block Allocation Strategy for HDFS-based Applications in the Cloud

  • Xu, Hua;Liu, Weiqing;Shu, Guansheng;Li, Jing
    • KSII Transactions on Internet and Information Systems (TIIS)
    • /
    • v.12 no.1
    • /
    • pp.204-226
    • /
    • 2018
  • Big data processing applications have gradually been migrated into the cloud due to the advantages of cloud computing. The Hadoop Distributed File System (HDFS) is one of the fundamental support systems for big data processing on MapReduce-like frameworks such as Hadoop and Spark. Since HDFS is not aware of the co-location of virtual machines in the cloud, its default block allocation scheme does not fit cloud environments well, which manifests in two ways: data reliability loss and performance degradation. In this paper, we present a novel location-aware data block allocation strategy (LDBAS). LDBAS jointly optimizes data reliability and performance for upper-layer applications by allocating data blocks according to the locations and differing processing capacities of virtual nodes in the cloud. We apply LDBAS to two stages of data allocation in HDFS in the cloud (initial data allocation and data recovery) and design the corresponding algorithms. Finally, we implement LDBAS in an actual Hadoop cluster and evaluate its performance with the benchmark suite BigDataBench. The experimental results show that LDBAS guarantees the designed data reliability while reducing the job execution time of I/O-intensive applications in Hadoop by 8.9% on average, and up to 11.2%, compared with original Hadoop in the cloud.
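A toy rendering of the placement objective stated above, not the paper's actual algorithm: spread a block's replicas across distinct physical hosts (so co-located VMs cannot lose every copy together) while preferring higher-capacity virtual nodes. All names and the greedy rule are assumptions:

```python
# Hypothetical location- and capacity-aware replica placement.
from dataclasses import dataclass

@dataclass
class VirtualNode:
    name: str
    physical_host: str   # physical machine the VM is co-located on
    capacity: float      # relative processing capacity

def place_replicas(nodes: list[VirtualNode], replicas: int = 3) -> list[VirtualNode]:
    """Greedy pass: take the highest-capacity node on each unused host."""
    chosen: list[VirtualNode] = []
    used_hosts: set[str] = set()
    for node in sorted(nodes, key=lambda n: n.capacity, reverse=True):
        if node.physical_host not in used_hosts:
            chosen.append(node)
            used_hosts.add(node.physical_host)
        if len(chosen) == replicas:
            break
    return chosen
```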

An Assessment System for Evaluating Big Data Capability Based on a Reference Model (빅데이터 역량 평가를 위한 참조모델 및 수준진단시스템 개발)

  • Cheon, Min-Kyeong;Baek, Dong-Hyun
    • Journal of Korean Society of Industrial and Systems Engineering
    • /
    • v.39 no.2
    • /
    • pp.54-63
    • /
    • 2016
  • As technology has developed and the cost of data processing has fallen, the big data market has grown. Developed countries such as the United States have invested steadily in the big data industry and achieved remarkable results, such as improved advertising effectiveness and patents for customer service. Every company aims at long-term survival and profit maximization, but accomplishing those goals in the big data industry requires a sound strategy grounded in current industrial conditions. However, since the domestic big data industry is at an early stage, local companies lack a systematic method for establishing a competitive strategy. This research therefore helps local companies diagnose their big data capabilities through a reference model and a big data capability assessment system. The reference model consists of five maturity levels (Ad hoc, Repeatable, Defined, Managed, and Optimizing) and five key dimensions (Organization, Resources, Infrastructure, People, and Analytics). The assessment system is designed around the reference model's key factors. The Organization dimension has four diagnosis factors: big data leadership, big data strategy, analytical culture, and data governance. Resources has three: data management, data integrity, and data security/privacy. Infrastructure has two: big data platform and data management technology. People has three: training, big data skills, and business-IT alignment. Analytics has two: data analysis and data visualization. The reference model and assessment system should serve as a useful guideline for local companies.
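The five dimensions and their diagnosis factors, taken directly from the abstract, translate naturally into a small scoring structure. In the hedged sketch below, each factor is rated on the five maturity levels (1 = Ad hoc through 5 = Optimizing) and a dimension's score is the mean of its factor ratings; that aggregation rule is an assumption, not the paper's method:

```python
# Reference-model dimensions and factors as listed in the abstract.
DIMENSIONS = {
    "Organization": ["big data leadership", "big data strategy",
                     "analytical culture", "data governance"],
    "Resources": ["data management", "data integrity", "data security/privacy"],
    "Infrastructure": ["big data platform", "data management technology"],
    "People": ["training", "big data skills", "business-IT alignment"],
    "Analytics": ["data analysis", "data visualization"],
}

def dimension_scores(ratings: dict[str, int]) -> dict[str, float]:
    """ratings maps each factor name to a maturity level from 1 to 5."""
    return {dim: sum(ratings[f] for f in factors) / len(factors)
            for dim, factors in DIMENSIONS.items()}
```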

Evaluation of Predictive Models for Early Identification of Dropout Students

  • Lee, JongHyuk;Kim, Mihye;Kim, Daehak;Gil, Joon-Min
    • Journal of Information Processing Systems
    • /
    • v.17 no.3
    • /
    • pp.630-644
    • /
    • 2021
  • Educational data analysis is attracting increasing attention with the rise of the big data industry. The amount and variety of available learning data are increasing steadily, and the information technology required to analyze them continues to develop. Early identification of potential dropout students is very important, because education plays a key role in social mobility and achievement. Here, we analyze educational data and build predictive models for student dropout using logistic regression, a decision tree, a naïve Bayes method, and a multilayer perceptron. The multilayer perceptron, using independent variables selected via variance analysis, performed better than the other models. In addition, we found experimentally that not only grades but also extracurricular activities matter for preventing student dropout.
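A hedged scikit-learn sketch of the four-model comparison described above; the dataset, feature count, and hyperparameters are assumptions, and the abstract's variance analysis is approximated here with an ANOVA F-test (f_classif):

```python
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

MODELS = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "naive Bayes": GaussianNB(),
    "multilayer perceptron": MLPClassifier(max_iter=1000, random_state=0),
}

def compare_models(X, y, k_features: int = 10) -> dict[str, float]:
    """Mean 5-fold CV accuracy per model after ANOVA feature selection."""
    return {
        name: cross_val_score(
            make_pipeline(StandardScaler(),
                          SelectKBest(f_classif, k=k_features), model),
            X, y, cv=5).mean()
        for name, model in MODELS.items()
    }
```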

Design of Distributed Cloud System for Managing large-scale Genomic Data

  • Seine Jang;Seok-Jae Moon
    • International Journal of Internet, Broadcasting and Communication
    • /
    • v.16 no.2
    • /
    • pp.119-126
    • /
    • 2024
  • The volume of genomic data is constantly increasing across modern industries and research fields, and this growth presents new challenges and opportunities in both the quantity and the diversity of genetic data. In this paper, we propose a distributed cloud system for integrating and managing large-scale gene databases. Introducing distributed storage and processing based on the Hadoop Distributed File System (HDFS) allows genomic data of various formats and sizes to be integrated efficiently. Leveraging Spark on YARN provides efficient management of distributed computing tasks and optimal resource allocation, establishing a foundation for rapid processing and analysis of large-scale genomic data. In addition, machine learning models built with BigQuery ML support genetic search and prediction, enabling researchers to use the data more effectively. This is expected to drive innovative advances in genetic research and its applications.
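A minimal sketch of the ingest side of the described stack: a Spark job submitted to a YARN cluster reading genomic records out of HDFS. The application name, paths, and TSV layout are assumptions, and the BigQuery ML step is a separate service-side stage not shown here:

```python
from pyspark.sql import SparkSession

# Submit to the YARN resource manager (assumes Hadoop config is on the node).
spark = (SparkSession.builder
         .appName("genomic-ingest")
         .master("yarn")
         .getOrCreate())

# Hypothetical variant table stored as tab-separated files in HDFS.
variants = (spark.read
            .option("sep", "\t")
            .option("header", True)
            .csv("hdfs:///genomics/variants/*.tsv"))

# Example distributed aggregation: variant counts per chromosome.
variants.groupBy("chrom").count().show()
spark.stop()
```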

Financial and Economic Risk Prevention and Countermeasures Based on Big Data and Internet of Things

  • Songyan Liu;Pengfei Liu;Hecheng Wang
    • Journal of Information Processing Systems
    • /
    • v.20 no.3
    • /
    • pp.391-398
    • /
    • 2024
  • With the continued advance of economic globalization, China's financial market has expanded, but it currently faces substantial risks. The main financial and economic risks in China lie in policy, credit, exchange rates, accounting, and interest rates. The current state of the market shows insufficient attention from upper management, insufficient innovation in the development of the financial economy, and the lack of a sound financial and economic risk protection system. To understand the situation further, we conducted a questionnaire survey of the financial market and reached the following conclusions. A comprehensive enterprise questionnaire covering government, enterprise, and individual perspectives showed the following problems in big data and Internet of Things based financial and economic risk prevention in China: the political system at the grassroots level is not comprehensive enough; the legal regulatory system is not comprehensive enough, leading to serious incidents of loan fraud; and enterprise top management does not pay enough attention to financial risk prevention. We therefore constructed a financial and economic risk prevention model based on big data and the Internet of Things that offers effective protection for both enterprises and individuals. The idea behind the model is to collect data through the Internet of Things, screen them with big data techniques, and pass them to a grassroots-level big data analysis system for in-depth analysis, yielding data that can support decision-making. Finally, we put forward corresponding recommendations, whose main points are these: the key is to build a sound national system for financial and economic risk prevention and assessment, the guarantee is to strengthen supervision of national financial risks, and the purpose is to promote the marketization of financial interest rates.

An Efficient Implementation of Mobile Raspberry Pi Hadoop Clusters for Robust and Augmented Computing Performance

  • Srinivasan, Kathiravan;Chang, Chuan-Yu;Huang, Chao-Hsi;Chang, Min-Hao;Sharma, Anant;Ankur, Avinash
    • Journal of Information Processing Systems
    • /
    • v.14 no.4
    • /
    • pp.989-1009
    • /
    • 2018
  • Rapid advances in science and technology, with exponential development of smart mobile devices, workstations, supercomputers, smart gadgets, and network servers, have been witnessed over the past few years. The sudden increase in the Internet population and the manifold growth in Internet speeds have generated an enormous amount of data, now termed 'big data'. In this scenario, storing data on local servers or personal computers becomes a problem, which can be resolved through cloud computing, and several cloud service providers now address big data workloads. This paper establishes a framework that builds Hadoop clusters on the new single-board computer (SBC) Mobile Raspberry Pi; these clusters provide both storage and computing. Conventional data centers require large amounts of energy, need cooling equipment, and occupy prime real estate, whereas a Mobile Raspberry Pi running Hadoop clusters provides a cost-effective, low-power, high-speed solution with micro-data-center support for big data. Hadoop supplies the modules for distributed processing of big data through map-reduce programming. In this work, the performance of the SBC clusters was compared with that of a single computer: the experimental data show that the SBC clusters outperform a single computer by around 20%, and the cluster's processing speed for large volumes of data can be raised further by adding SBC nodes. Data storage uses the Hadoop Distributed File System (HDFS), which offers more flexibility and greater scalability than a single-computer system.
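Hadoop Streaming is one standard way to run map-reduce jobs on such a cluster with plain Python scripts; the word-count pair below is the canonical illustrative example, not a job from the paper:

```python
# mapper.py - emits one (word, 1) pair per token read from stdin.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
# reducer.py - sums counts per word. Hadoop Streaming delivers mapper
# output sorted by key, so equal words arrive adjacently.
import sys

current, total = None, 0
for line in sys.stdin:
    word, count = line.rsplit("\t", 1)
    if word != current:
        if current is not None:
            print(f"{current}\t{total}")
        current, total = word, 0
    total += int(count)
if current is not None:
    print(f"{current}\t{total}")
```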

Development of Big-data Management Platform Considering Docker Based Real Time Data Connecting and Processing Environments (도커 기반의 실시간 데이터 연계 및 처리 환경을 고려한 빅데이터 관리 플랫폼 개발)

  • Kim, Dong Gil;Park, Yong-Soon;Chung, Tae-Yun
    • IEMEK Journal of Embedded Systems and Applications
    • /
    • v.16 no.4
    • /
    • pp.153-161
    • /
    • 2021
  • Real-time access is required to handle continuous, unstructured data, and management must remain flexible under dynamic conditions. A platform can be built to collect, store, and process data on a single server or across multiple servers. The former, centralized method is easy to control but creates an overload problem because all processing runs on one unit; the latter, distributed method performs parallel processing, so it responds quickly and scales easily, but its design is complex. This paper provides data collection and processing on one platform built in the distributed manner, so that significant insights can be derived from the various data held by an enterprise or agency and made intuitively available on dashboards, with Spark used to improve distributed processing performance. All services are distributed and managed with Docker. The data used in this study were collected entirely through Kafka; for a 4.4-gigabyte file, processing in Spark cluster mode took 2 minutes 15 seconds, about 3 minutes 19 seconds faster than local mode.
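A hedged sketch of the Kafka-to-Spark path implied above, written with Spark Structured Streaming; the broker address, topic name, and console sink are assumptions (the paper's sink is its dashboard layer), and the spark-sql-kafka connector package must be on the classpath:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("kafka-ingest").getOrCreate()

# Subscribe to a hypothetical topic on a hypothetical broker.
stream = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "kafka:9092")
          .option("subscribe", "collected-events")
          .load())

# Kafka delivers key/value as binary; decode the payload before use.
events = stream.select(col("value").cast("string").alias("payload"))

query = (events.writeStream
         .format("console")   # stand-in for the dashboard sink
         .outputMode("append")
         .start())
query.awaitTermination()
```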