• Title/Summary/Keyword: Big Data Clustering

Development of Medical Cost Prediction Model Based on the Machine Learning Algorithm (머신러닝 알고리즘 기반의 의료비 예측 모델 개발)

  • Han Bi KIM;Dong Hoon HAN
    • Journal of Korea Artificial Intelligence Association / v.1 no.1 / pp.11-16 / 2023
  • Accurate hospital case modeling and prediction are crucial for efficient healthcare. In this study, we demonstrate the implementation of regression analysis methods in machine learning systems using mathematical statistics and machine learning techniques. The developed machine learning models include Bayesian linear, artificial neural network, decision tree, decision forest, and linear regression analysis models. Through the application of these algorithms, corresponding regression models were constructed and analyzed. The results suggest the potential of leveraging machine learning systems for medical research. The experiment aimed to create an Azure Machine Learning Studio tool for the speedy evaluation of multiple regression models. The tool facilitates the comparison of five types of regression models in a unified experiment and presents assessment results with performance metrics. Evaluation of the regression models highlighted the advantages of boosted decision tree regression and decision forest regression in hospital case prediction. These findings could lay the groundwork for the deliberate development of new directions in medical data processing and decision making. Furthermore, potential avenues for future research may include exploring methods such as clustering, classification, and anomaly detection in healthcare systems.
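
The side-by-side comparison of regression models by performance metrics can be illustrated with a small sketch. This is not the Azure Machine Learning Studio experiment from the paper: the metric choice (MAE, RMSE, R^2), the names `model_a` and `model_b`, and all numbers are assumptions made up for illustration.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE and R^2 for one model's predictions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)
    rmse = math.sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return {"mae": mae, "rmse": rmse, "r2": 1.0 - ss_res / ss_tot}


# Hypothetical held-out targets and predictions from two candidate models.
y_true = [10.0, 12.0, 15.0, 20.0]
predictions = {
    "model_a": [11.0, 12.0, 14.0, 19.0],
    "model_b": [8.0, 13.0, 18.0, 24.0],
}

# Score every model with the same metrics so they can be compared directly.
scores = {name: regression_metrics(y_true, y_pred)
          for name, y_pred in predictions.items()}
best = min(scores, key=lambda name: scores[name]["rmse"])
```

Ranking by RMSE is only one possible choice; a real comparison would report several metrics on a properly held-out data set.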

The study of Defense Artificial Intelligence and Block-chain Convergence (국방분야 인공지능과 블록체인 융합방안 연구)

  • Kim, Seyong;Kwon, Hyukjin;Choi, Minwoo
    • Journal of Internet Computing and Services / v.21 no.2 / pp.81-90 / 2020
  • The purpose of this study is to examine how block-chain technology can be applied to prevent data forgery and alteration in defense-sector AI (artificial intelligence). AI is a technology that makes predictions from big data by clustering or classifying it with various machine learning methodologies, and military powers including the U.S. have reached the final stage of developing the technology. If the data behind data-driven AI is forged or altered, then even a flawless processing pipeline could become the greatest risk factor, and such falsification can be carried out all too easily through hacking. Unexpected attacks could occur if data used by weaponized AI is hacked and manipulated by North Korea. Therefore, a technology that prevents data from being falsified and altered is essential for the use of AI. Block-chain is expected to solve the data forgery problem: it stores data encrypted with a hash function in distributed form, so that even if a single computer is hacked, the data cannot be damaged unless more than half of the connected computers agree.
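
The tamper-resistance idea described here (hash-linked records replicated across machines, with a majority deciding which copy is trusted) can be sketched as follows. This is a toy illustration, not the scheme proposed in the paper: the block layout, the helper names, and the simple majority vote are all assumptions, and a real block-chain uses a full consensus protocol.

```python
import hashlib
import json
from collections import Counter

GENESIS = "0" * 64  # placeholder "previous hash" for the first block


def block_hash(index, data, prev_hash):
    """SHA-256 over the block contents, including the previous block's hash."""
    payload = json.dumps({"index": index, "data": data, "prev": prev_hash},
                         sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def build_chain(records):
    """Link records so that each block commits to the one before it."""
    chain, prev = [], GENESIS
    for i, data in enumerate(records):
        h = block_hash(i, data, prev)
        chain.append({"index": i, "data": data, "prev": prev, "hash": h})
        prev = h
    return chain


def is_valid(chain):
    """Recompute every hash; any edited block breaks the check."""
    prev = GENESIS
    for block in chain:
        recomputed = block_hash(block["index"], block["data"], block["prev"])
        if block["prev"] != prev or block["hash"] != recomputed:
            return False
        prev = block["hash"]
    return True


def majority_head(replicas):
    """Final hash shared by more than half of all replicas, else None."""
    heads = Counter(c[-1]["hash"] for c in replicas if is_valid(c))
    if not heads:
        return None
    head, votes = heads.most_common(1)[0]
    return head if votes > len(replicas) / 2 else None


# Hypothetical records replicated on three nodes; one node is then tampered with.
records = ["sensor reading 1", "sensor reading 2", "sensor reading 3"]
replicas = [build_chain(records) for _ in range(3)]
replicas[0][1]["data"] = "forged reading"
```

In this sketch, `is_valid(replicas[0])` fails after the forgery, while the two untouched replicas still agree on the same final hash and therefore form the majority.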

Outlier Detection and Labeling of Ship Main Engine using LSTM-AutoEncoder (LSTM-AutoEncoder를 활용한 선박 메인엔진의 이상 탐지 및 라벨링)

  • Dohee Kim;Yeongjae Han;Hyemee Kim;Seong-Phil Kang;Ki-Hun Kim;Hyerim Bae
    • The Journal of Bigdata / v.7 no.1 / pp.125-137 / 2022
  • Transportation is a key industry for Korea, which is surrounded by the sea on three sides and, being poor in natural resources, relies on imports for most of its resource consumption. Shipping accounts for most of the transportation industry, and ship maintenance is important for improving operational efficiency and reducing costs. At present, however, ships are inspected at fixed intervals, which costs time and money, and the cause of a problem is often not properly identified. Therefore, in this study, the proposed methodology, LSTM-AutoEncoder, is used to detect abnormalities that may cause ship failure while taking into account the temporal nature of actual ship operation data. In addition, the detected outliers are clustered, and the potential causes of ship main engine failure are identified by grouping the outliers by factor. This enables faster monitoring of various information on the ship and identification of the degree of abnormality. Furthermore, the approach can equip current ship fault monitoring systems with concrete alarm-point settings and fault diagnosis, and can help determine when maintenance is needed.
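
Autoencoder-based outlier detection typically flags a time window when its reconstruction error exceeds a threshold learned from normal data. The sketch below shows only that thresholding step. It does not implement an LSTM-AutoEncoder, the reconstructions are hypothetical stand-ins for a trained model's output, and the mean-plus-k-standard-deviations rule is an assumption rather than the paper's stated criterion.

```python
import statistics


def reconstruction_errors(windows, reconstructions):
    """Mean squared error between each input window and its reconstruction."""
    return [
        sum((x - r) ** 2 for x, r in zip(window, recon)) / len(window)
        for window, recon in zip(windows, reconstructions)
    ]


def fit_threshold(normal_errors, k=3.0):
    """Assumed rule: mean + k * standard deviation of errors on normal data."""
    return statistics.mean(normal_errors) + k * statistics.pstdev(normal_errors)


def flag_outliers(errors, threshold):
    """Indices of windows whose reconstruction error exceeds the threshold."""
    return [i for i, e in enumerate(errors) if e > threshold]


# Hypothetical errors observed on normal operation data.
normal_errors = [0.010, 0.012, 0.009, 0.011, 0.013, 0.010]
threshold = fit_threshold(normal_errors)

# Hypothetical sensor windows and the reconstructions a trained model might return.
windows = [[1.0, 1.1, 1.0], [1.0, 1.2, 1.1], [1.0, 3.0, 1.0]]
reconstructions = [[1.0, 1.0, 1.0], [1.1, 1.1, 1.1], [1.0, 1.1, 1.0]]

errors = reconstruction_errors(windows, reconstructions)
outliers = flag_outliers(errors, threshold)
```

Only the third window, with its spike to 3.0, is reconstructed poorly enough to be flagged. The flagged windows would then be the input to the clustering step the abstract describes.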

Identify the Failure Mode of Weapon System (or equipment) using Machine Learning (Machine Learning을 이용한 무기 체계(or 구성품) 고장 유형 식별)

  • Park, Yun-Kyung;Lee, Hye-Won;Kim, Sang-Moon
    • Journal of the Korea Academia-Industrial cooperation Society / v.19 no.8 / pp.64-70 / 2018
  • In the development of weapon systems (or components), the number of tests is constrained by the limited development period and cost, which reduces the amount of accumulated failure data. Nevertheless, because a large amount of failure data and maintenance details from the operational period are managed as computerized data, the causes of failure of weapon systems (or components) can be analyzed using that data. On the other hand, analyzing the failure and maintenance details of various weapon systems is difficult because of the variation among groups and companies, and because details of the causes of failure are recorded as unstructured text. Fortunately, recent developments in big data processing technology, machine learning algorithms, and hardware computing power have supported major research into various methods for processing such unstructured data. In this paper, doc2vec, a machine learning technique, is applied to unstructured data related to the failure/maintenance of defense weapon systems (or components) to analyze the failure cases.

Big Data Analysis of News on Purchasing Second-hand Clothing and Second-hand Luxury Goods: Identification of Social Perception and Current Situation Using Text Mining (중고의류와 중고명품 구매 관련 언론 보도 빅데이터 분석: 텍스트마이닝을 활용한 사회적 인식과 현황 파악)

  • Hwa-Sook Yoo
    • Human Ecology Research / v.61 no.4 / pp.687-707 / 2023
  • This study was conducted to obtain useful information for the development of the future second-hand fashion market by examining the current situation through unstructured text data distributed as news articles related to 'purchase of second-hand clothing' and 'purchase of second-hand luxury goods'. Text-based unstructured data was collected on a daily basis from Naver news from January 1st to December 31st, 2022, using 'purchase of second-hand clothing' and 'purchase of second-hand luxury goods' as collection keywords. This was analyzed using text mining, and the results are as follows. First, in terms of frequency, the data collected on the purchase of second-hand luxury goods was almost four times the volume of the data on the purchase of second-hand clothing, indicating that the purchase of second-hand luxury goods is receiving more social attention. Second, the data obtained with the two collection keywords shared some words, but each also contained distinct words. Regarding second-hand clothing, words related to donations, sharing, and compensation sales were mainly mentioned, indicating that the purchase of second-hand clothing tends to be recognized as an eco-friendly transaction. For second-hand luxury goods, resale and authenticity controversies related to second-hand luxury transactions, second-hand trading platforms, and luxury brands were frequently mentioned. Third, as a result of clustering, data related to the purchase of second-hand clothing were divided into five groups, and data related to the purchase of second-hand luxury goods were divided into six groups.

Feature-selection algorithm based on genetic algorithms using unstructured data for attack mail identification (공격 메일 식별을 위한 비정형 데이터를 사용한 유전자 알고리즘 기반의 특징선택 알고리즘)

  • Hong, Sung-Sam;Kim, Dong-Wook;Han, Myung-Mook
    • Journal of Internet Computing and Services / v.20 no.1 / pp.1-10 / 2019
  • Since big-data text mining extracts many features and data, clustering and classification can suffer from high computational complexity and low reliability of the analysis results. In particular, a term-document matrix obtained through text mining represents term-document features but is sparse. We designed an advanced genetic algorithm (GA) to extract features in text mining for a detection model. Term frequency-inverse document frequency (TF-IDF) is used to reflect the document-term relationships in feature extraction. Through a repetitive process, a predetermined number of features are selected. We also used a sparsity score to improve the performance of the detection model: if a spam mail data set has high sparsity, the detection model performs poorly and an optimal detection model is difficult to find. In addition, by using s(F) as the numerator of the fitness function, we find a low-sparsity model that also has a high TF-IDF score. We verified its performance by applying the proposed algorithm to text classification. As a result, we found that our algorithm shows higher performance (speed and accuracy) in attack mail classification.
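
The two quantities the abstract relies on, TF-IDF weights and the sparsity of the term-document matrix, can be computed as in the sketch below. The abstract does not give the paper's exact sparsity score or fitness function, so the definitions here (relative term frequency times log inverse document frequency, and sparsity as the fraction of zero cells) are common textbook forms used as assumptions. The genetic algorithm itself is not shown, and the toy documents are invented.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


def sparsity(weights, vocabulary):
    """Fraction of zero cells in the document-by-term weight matrix."""
    cells = len(weights) * len(vocabulary)
    zeros = sum(1 for row in weights for term in vocabulary
                if row.get(term, 0.0) == 0.0)
    return zeros / cells


# Hypothetical tokenized mails.
docs = [
    ["win", "free", "prize", "now"],
    ["meeting", "agenda", "now"],
    ["free", "prize", "claim", "prize"],
]
weights = tf_idf(docs)
vocabulary = sorted({term for doc in docs for term in doc})
matrix_sparsity = sparsity(weights, vocabulary)
```

A feature-selection procedure could evaluate a candidate feature subset by recomputing `sparsity` over only the selected terms, which is one way to read the abstract's preference for low-sparsity models with high TF-IDF scores.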

Analysis of shopping website visit types and shopping pattern (쇼핑 웹사이트 탐색 유형과 방문 패턴 분석)

  • Choi, Kyungbin;Nam, Kihwan
    • Journal of Intelligence and Information Systems / v.25 no.1 / pp.85-107 / 2019
  • Online consumers browse products belonging to a particular product line or brand with the intention of purchasing, or simply navigate widely and leave without making a purchase. Research on the behavior and purchases of online consumers has progressed steadily, and related services and applications based on consumer behavior data have been developed in practice. In recent years, customization strategies and recommendation systems have been adopted thanks to the development of big data technology, and attempts are being made to optimize users' shopping experience. Even so, the likelihood that an online consumer who visits a website actually moves on to the purchase stage remains very low. This is because online consumers do not visit a website only to purchase products; they use and browse websites differently according to their shopping motives and purposes. Therefore, to understand the behavior of online consumers, it is important to analyze the various types of visits, not only visits that lead to a purchase. In this study, we performed session-level clustering analysis on the clickstream data of an e-commerce company in order to explain the diversity and complexity of online consumers' search behavior and to classify that behavior into types. For the analysis, we converted more than 8 million page-level data points into visit-level sessions, resulting in a total of over 500,000 website visit sessions. For each visit session, 12 characteristics such as page views, duration, search diversity, and page type concentration were extracted for clustering analysis. Considering the size of the data set, we performed the analysis using the Mini-Batch K-means algorithm, which has advantages in terms of learning speed and efficiency while maintaining clustering performance similar to that of the standard K-means algorithm. The optimal number of clusters was found to be four, and the differences in session-level characteristics and purchase rates were identified for each cluster. An online consumer visits a website several times, learns about a product, and then decides whether to purchase. To analyze the purchasing process over several visits, we constructed visit sequence data for each consumer based on the navigation patterns derived from the clustering analysis. Each sequence covers the series of visits leading up to one purchase, and the items making up a sequence are the cluster labels derived above. We built separate sequence data for consumers who made purchases and for consumers who only explored products without purchasing during the same period. Sequential pattern mining was then applied to extract frequent patterns from each set of sequence data. The minimum support was set to 10%, and each frequent pattern consists of a sequence of cluster labels. While some patterns are common to both sets of sequence data, others are frequent in only one of them. Comparative analysis of the extracted frequent patterns showed that consumers who made purchases tended to visit repeatedly while searching for a specific product before deciding to buy it. The implication of this study is that it analyzes the search types of online consumers using large-scale clickstream data and analyzes their patterns to explain the purchasing process from a data-driven perspective. Most studies on the typology of online consumers have focused on the characteristics of each type and on which factors are key to distinguishing the types. In this study, we classified the behavior of online consumers into types and further analyzed the order in which the types are linked to form a series of search patterns. In addition, online retailers will be able to try to improve purchase conversion through marketing strategies and recommendations tailored to the various visit types, and to evaluate the effect of those strategies through changes in consumers' visit patterns.
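
The sequential pattern mining step, in which frequent patterns are sequences of cluster labels with at least 10% support, can be sketched with a brute-force search. The four cluster labels and the 10% minimum support come from the abstract. The visit sequences, the subsequence-based definition of support, and the exhaustive candidate enumeration are assumptions for illustration; the study's actual mining algorithm is not named in the abstract.

```python
from itertools import product

MIN_SUPPORT = 0.10       # minimum support stated in the abstract
CLUSTER_LABELS = range(4)  # four visit-type clusters, labeled 0..3


def contains(sequence, pattern):
    """True if pattern occurs in sequence in order (gaps between items allowed)."""
    remaining = iter(sequence)
    return all(label in remaining for label in pattern)


def support(sequences, pattern):
    """Share of sequences that contain the pattern."""
    return sum(contains(s, pattern) for s in sequences) / len(sequences)


def frequent_patterns(sequences, labels, max_len, min_support):
    """Enumerate every label pattern up to max_len and keep the frequent ones."""
    frequent = {}
    for length in range(1, max_len + 1):
        for pattern in product(labels, repeat=length):
            s = support(sequences, pattern)
            if s >= min_support:
                frequent[pattern] = s
    return frequent


# Hypothetical visit sequences: each item is the cluster label of one session.
sequences = [
    [0, 1, 1, 3],
    [0, 2],
    [1, 1, 3],
    [2, 0, 1, 3],
    [0],
]
frequent = frequent_patterns(sequences, CLUSTER_LABELS, max_len=3,
                             min_support=MIN_SUPPORT)
```

Running the same function separately on purchaser and non-purchaser sequences and comparing the two result sets mirrors the comparison described above. For half a million sessions, a dedicated algorithm such as PrefixSpan would be more practical than exhaustive enumeration.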

A study on the Robust and Systolic Topology for the Resilient Dynamic Multicasting Routing Protocol

  • Lee, Kang-Whan;Kim, Sung-Uk
    • Journal of information and communication convergence engineering / v.6 no.3 / pp.255-260 / 2008
  • In recent years, there has been great interest in ad hoc wireless networks because of their tremendous military and commercial potential. An ad hoc wireless network is a multi-hop wireless network formed by mobile computing devices with no fixed infrastructure. Because resources are limited, such a network needs a robust, simple, and energy-conserving framework. In this paper, we propose a new ad hoc multicast routing protocol based on an ontology scheme called an inference network. An ontology knowledge base is one of the structures used for context awareness. The ontology clustering adopts a tree structure to enhance resilience against mobility and routing complexity. The proposed multicast routing protocol uses node locality to improve flexible connectivity and stable mobility in local discovery routing and flooding discovery routing. It also attempts to improve route recovery efficiency and reduce context-awareness data transmissions. We provide simulation results to validate the model complexity. The proposed algorithm is designed with multi-hierarchy layered networks to simulate the desired system.

Analysis of Characteristics of Clusters of Middle School Students Using K-Means Cluster Analysis (K-평균 군집분석을 활용한 중학생의 군집화 및 특성 분석)

  • Jaebong, Lee
    • Journal of The Korean Association For Science Education / v.42 no.6 / pp.611-619 / 2022
  • The purpose of this study is to explore the possibility of applying big data analysis to evaluation data in science education in order to provide appropriate feedback to students, at a time of growing interest in educational data mining. In this study, we use the evaluation data of 2,576 students who answered 24 questions of the national assessment of educational achievement. We use K-means cluster analysis, an unsupervised machine learning method, for clustering. As a result of clustering, students were divided into six clusters. The middle-ranking students are divided into more varied clusters than the upper- or lower-ranking students. According to the results of the cluster analysis, the most important factor influencing the clustering is academic achievement, and each cluster shows different characteristics in terms of content domains, subject competencies, and affective characteristics. Learning motivation is important among the affective domains in the lower-ranking achievement cluster, and scientific inquiry and problem-solving competency, as well as scientific communication competency, have a major influence in terms of subject competencies. In the content domain, achievement in motion and energy and in matter are important factors that distinguish the characteristics of the clusters. As a result, we can provide students with customized feedback for learning based on the characteristics of each cluster. We discuss implications of these results for science education, such as the possibility of using the results of this study, balanced learning across content domains, enhancement of subject competency, and improvement of scientific attitude.
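
K-means clustering alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its members. The sketch below is a minimal pure-Python version on invented two-dimensional score vectors with k = 2. The study's actual data (2,576 students, 24 questions, six clusters) and tooling are not reproduced, and the random initialization and stopping rule are assumptions.

```python
import random


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=100, seed=0):
    """Lloyd's algorithm: returns (cluster index per point, final centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumed initialization: k random points
    assignments = None
    for _ in range(iterations):
        # Assignment step: nearest centroid for every point.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # converged: no point changed cluster
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members)
                                     for dim in zip(*members))
    return assignments, centroids


# Hypothetical per-student achievement rates in two content domains.
points = [
    (0.90, 0.80), (0.85, 0.90), (0.95, 0.85),  # higher-achieving students
    (0.20, 0.30), (0.25, 0.20), (0.30, 0.25),  # lower-achieving students
]
assignments, centroids = kmeans(points, k=2)
```

Inspecting the centroid of each cluster (here, the mean achievement per domain) is the kind of per-cluster profile on which customized feedback could be based.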

Signed Hellinger measure for directional association (연관성 방향을 고려한 부호 헬링거 측도의 제안)

  • Park, Hee Chang
    • Journal of the Korean Data and Information Science Society / v.27 no.2 / pp.353-362 / 2016
  • According to Wikipedia, data mining is the process of discovering patterns in a big data set involving methods at the intersection of association rule, decision tree, clustering, artificial intelligence, machine learning, and database systems. An association rule is a method for discovering interesting relations between items in large transactions by means of interestingness measures. Association rule interestingness measures play a major role in the knowledge discovery process in databases and have been developed by many researchers. Among them, the Hellinger measure is a good association threshold that considers the information content and the generality of a rule. However, it has the drawback that it cannot determine the direction of the association. In this paper, we propose a signed Hellinger measure that can be interpreted operationally, and we check the three conditions of an association threshold. Furthermore, we investigate some of its aspects through a few examples. The results show that the signed Hellinger measure is better than the Hellinger measure because the signed one is able to estimate the right direction of association.
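
To make the idea of a direction-aware measure concrete, the sketch below computes the standard Hellinger distance between two discrete distributions and attaches a sign to it. The abstract does not give the paper's formula, so everything specific here is an assumption: the rule A -> B is scored by comparing (P(B|A), 1 - P(B|A)) with (P(B), 1 - P(B)), and the sign is taken from P(B|A) - P(B). The function names are hypothetical.

```python
import math


def hellinger(p, q):
    """Standard Hellinger distance between two discrete distributions (0 to 1)."""
    return math.sqrt(
        sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    ) / math.sqrt(2)


def signed_hellinger(p_b_given_a, p_b):
    """Assumed signed variant: positive if A raises the probability of B,
    negative if A lowers it, zero if A and B are unrelated."""
    distance = hellinger((p_b_given_a, 1 - p_b_given_a), (p_b, 1 - p_b))
    if p_b_given_a > p_b:
        return distance
    if p_b_given_a < p_b:
        return -distance
    return 0.0


# Hypothetical rule statistics: P(B) = 0.5 in all three cases.
positive_rule = signed_hellinger(0.8, 0.5)   # A makes B more likely
negative_rule = signed_hellinger(0.2, 0.5)   # A makes B less likely
independent = signed_hellinger(0.5, 0.5)     # A tells nothing about B
```

The unsigned distance is the same for the first two rules, which is exactly the limitation the abstract attributes to the original measure. Only the sign separates a positive association from a negative one.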