• Title/Summary/Keyword: machine learning

Building the Outlier Candidate Discrimination Training Data based on Inventory for Automatic Classification of Transferred Records (이관 기록물 분류 자동화를 위한 목록 기반 이상치 판별 학습데이터 구축)

  • Jeong, Ji-Hye;Lee, Gemma;Wang, Hosung;Oh, Hyo-Jung
    • Journal of Korean Society of Archives and Records Management / v.22 no.1 / pp.43-59 / 2022
  • Electronic public records are classified at the time of production and assigned a preservation period; after a certain period, they are transferred to an archive and preserved. This study seeks a way to improve the efficiency of classifying transferred records while maintaining consistent standards. To this end, the current record classification workflow of the National Archives of Korea was analyzed and its problems were identified. To minimize the manual work of record classification, the required improvements were consolidated into a proposed, systematized process for identifying outlier candidates from an inventory containing the classification information of the transferred records. The proposed outlier discrimination process was then applied to actual records transferred to the National Archives of Korea, and the results were standardized into a training data format that can be used for machine learning in the future.

A Study on Verification of Back TranScription(BTS)-based Data Construction (Back TranScription(BTS)기반 데이터 구축 검증 연구)

  • Park, Chanjun;Seo, Jaehyung;Lee, Seolhwa;Moon, Hyeonseok;Eo, Sugyeong;Lim, Heuiseok
    • Journal of the Korea Convergence Society / v.12 no.11 / pp.109-117 / 2021
  • Recently, speech-based interfaces have been increasingly used as a means of human-computer interaction (HCI). Accordingly, interest in post-processors that correct errors in speech recognition results is also growing. However, building a sequence-to-sequence (S2S) speech recognition post-processor requires a great deal of human labor for data construction. To alleviate this limitation of existing construction methods, a new data construction method called Back TranScription (BTS) was proposed. BTS combines text-to-speech (TTS) and speech-to-text (STT) technology to create a pseudo-parallel corpus. This methodology eliminates the need for a phonetic transcriber and can automatically generate vast amounts of training data, reducing cost. Extending the existing BTS research, this paper verifies through experiments that data should be constructed with text style and domain in mind rather than without any criteria.
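
The BTS round trip described in this abstract can be sketched as follows: clean text is synthesized to speech, recognized again, and the (recognized, original) pair becomes one training example for the post-processor. This is an illustrative sketch only; `synthesize` and `recognize` are hypothetical placeholders for a TTS and an STT engine, not functions from the paper or from any particular library, and the toy stand-ins exist just so the example runs.

```python
from typing import Callable, Iterable, List, Tuple


def build_bts_corpus(
    sentences: Iterable[str],
    synthesize: Callable[[str], bytes],   # hypothetical TTS: text -> audio
    recognize: Callable[[bytes], str],    # hypothetical STT: audio -> text
) -> List[Tuple[str, str]]:
    """Return (noisy_source, clean_target) pairs via a TTS -> STT round trip."""
    pairs = []
    for clean in sentences:
        audio = synthesize(clean)
        noisy = recognize(audio)
        pairs.append((noisy, clean))
    return pairs


# Toy stand-ins so the sketch runs without any speech engine.
def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


def fake_stt(audio: bytes) -> str:
    # Simulate typical recognizer output: lowercased, punctuation dropped.
    text = audio.decode("utf-8").lower()
    return "".join(ch for ch in text if ch.isalnum() or ch.isspace())


corpus = build_bts_corpus(["Hello, world!"], fake_tts, fake_stt)
```

With real engines plugged in, the same loop would yield a pseudo-parallel corpus whose source side carries the recognizer's actual error patterns.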

Domain Knowledge Incorporated Counterfactual Example-Based Explanation for Bankruptcy Prediction Model (부도예측모형에서 도메인 지식을 통합한 반사실적 예시 기반 설명력 증진 방법)

  • Cho, Soo Hyun;Shin, Kyung-shik
    • Journal of Intelligence and Information Systems / v.28 no.2 / pp.307-332 / 2022
  • One of the most intensively studied topics in business applications is bankruptcy prediction, a representative classification problem related to lending, investment decision-making, and the profitability of financial institutions. Many studies have demonstrated outstanding performance for bankruptcy prediction models using artificial intelligence techniques. However, since most machine learning algorithms are "black boxes," providing users with explanations of AI decisions has become a prominent research topic. Although there are many different approaches to explanation, this study focuses on explaining a bankruptcy prediction model using counterfactual examples. A counterfactual-based explanation provides an alternative case that shows users how to obtain a desired output from the model. This study introduces a counterfactual generation technique based on a genetic algorithm (GA) that leverages both domain knowledge (i.e., causal feasibility) and feature importance from a black-box model, along with other critical counterfactual criteria, including proximity, distribution, and sparsity. The proposed method was evaluated quantitatively and qualitatively to measure its quality and validity.

Degree Programs in Data Science at the School of Information in the States (미국 정보 대학의 데이터사이언스 학위 현황 연구)

  • Park, Hyoungjoo
    • Journal of Korean Library and Information Science Society / v.53 no.2 / pp.305-332 / 2022
  • This preliminary study examined degree programs in data science at Schools of Information in the United States. The study focused on data science degrees offered by Schools of Information among the 64 Library and Information Science (LIS) programs accredited by the American Library Association (ALA) in 2022. In addition, it examined the degrees, majors, minors, specialized tracks, and certificates in data science, as well as the potential careers after earning a data science degree. Overall, eight Schools of Information (iSchools) offered 12 data science degrees. Data science courses at these schools cover topics such as introduction to data science, information retrieval, data mining, databases, data and humanities, machine learning, metadata, research methods, data analysis and visualization, internship/capstone, ethics and security, users, policy, and curation and management. Most schools did not offer traditional LIS courses. Potential careers after earning a data science degree at a School of Information included data scientist, data engineer, and data analyst. The researcher hopes the findings can serve as a starting point for discussing the direction of data science programs from the perspective of the information field, specifically the degrees, majors, minors, specialized tracks, and certificates in data science.

Development and Validation of Digital Twin for Analysis of Plant Factory Airflow (식물공장 기류해석을 위한 디지털트윈 개발 및 실증)

  • Jeong, Jin-Lip;Won, Bo-Young;Yoo, Ho-Dong;Kim, Tag Gon;Kang, Dae-Hyun;Hong, Kyung-Jin
    • Journal of the Korea Society for Simulation / v.31 no.1 / pp.29-41 / 2022
  • The need for plant factories is increasing as one way to address the unstable food supply and the supply-demand imbalance caused by abnormal climate change. Airflow in a plant factory is recognized as an important factor influencing plant transpiration and heat transfer. Meanwhile, the digital twin (DT) is attracting attention as a means of providing services that are impossible with the real system alone, by replicating the real system in the virtual world. This study aimed to develop a digital twin model that can predict airflow in various situations by applying the digital twin concept to a plant factory in operation. To this end, the mathematical formalism of the digital twin model for airflow analysis in plant factories is first presented, and based on it, the information needed for airflow prediction modeling of a plant factory in operation is specified. Next, the geometry of the plant factory is implemented in CAD, and the DT model is developed by combining it with computational fluid dynamics (CFD) components for airflow behavior analysis. Finally, the DT model for high-accuracy airflow prediction is completed through model validation and a machine learning-based calibration process that compares the simulation results of the DT model with actual airflow values collected from the plant factory.

A Study on the Performance Degradation Pattern of Caisson-type Quay Wall Port Facilities (케이슨식 안벽 항만시설의 성능저하패턴 연구)

  • Na, Yong Hyoun;Park, Mi Yeon;Jang, Shinwoo
    • Journal of the Society of Disaster Information / v.18 no.1 / pp.146-153 / 2022
  • Purpose: Domestic port structures that have been in use for a long time have many problems in safety performance and functionality due to the increasing size of ships, increased frequency of use, and the effects of natural disasters caused by climate change. A big data analysis method was studied to develop an approximate model that can predict the aging pattern of a port facility from its maintenance history data. Method: Member-level maintenance history data for caisson-type quay walls were collected and treated as big data, and a predictive approximation model was derived from them to estimate the aging pattern and deterioration of the facility at the project level. State-based aging pattern prediction models generated with Gaussian process (GP) and linear interpolation (SLPT) techniques were proposed, and the models were compared through validation to identify the one suitable for big data utilization. Result: In examining the suitability of the proposed method, the SLPT method has RMSE of 0.9215 and 0.0648, and the predictive model based on the SLPT method is considered suitable. Conclusion: This study is expected to help make big data-based prediction of facility performance degradation an important tool in maintenance decision-making.
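
To make the two ingredients named in this abstract concrete, the sketch below shows generic piecewise-linear interpolation over a condition history and an RMSE check against observations. It is not the paper's SLPT model; the inspection history, the condition scores, and the function names are hypothetical and chosen only for illustration.

```python
import math
from typing import Sequence, Tuple


def interpolate(points: Sequence[Tuple[float, float]], x: float) -> float:
    """Piecewise-linear interpolation through (x, y) points sorted by distinct x."""
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    raise ValueError("x is outside the range of the known points")


def rmse(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean squared error between two equal-length sequences."""
    if len(observed) != len(predicted):
        raise ValueError("sequences must have the same length")
    return math.sqrt(
        sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)
    )


# Hypothetical inspection history: (years in service, condition score).
history = [(0.0, 1.0), (10.0, 0.8), (30.0, 0.4)]
predicted = [interpolate(history, year) for year in (5.0, 20.0)]
# Hypothetical observed scores at years 5 and 20.
error = rmse([0.9, 0.7], predicted)
```

A lower RMSE on held-out inspections would indicate that the interpolated aging curve tracks the observed condition more closely.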

A Study on the Prediction of Strawberry Production in Machine Learning Infrastructure (머신러닝 기반 시설재배 딸기 생산량 예측 연구)

  • Oh, HanByeol;Lim, JongHyun;Yang, SeungWeon;Cho, YongYun;Shin, ChangSun
    • Smart Media Journal / v.11 no.5 / pp.9-16 / 2022
  • Recently, agricultural sites are being automated into digital smart farms by applying technologies such as big data and the Internet of Things (IoT). These smart farms aim to increase production and improve crop quality by measuring the crop environment and collecting and processing the data. Production prediction is an important research topic in smart farm digital agriculture, and it requires analyzing environmental data with big data techniques as well as further standardization research to manage the quality of growth information data. In this paper, environmental and production data collected from smart farm strawberry farms were analyzed. Based on regression analysis, crop production prediction models were built and compared using Ridge Regression, LightGBM, and XGBoost. Among the three models, XGBoost performed best, with its R² indicating 82.5 percent explanatory power. The study confirmed a correlation between the amount of nutrient solution absorbed and the environmental data, and obtained significant results for production prediction. In the future, studying nutrient solution absorption together with information such as the crop growing environment and the composition of the nutrient solution is expected to help prevent environmental pollution and reduce nutrient solution use through better nutrient solution management.
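
The R² figure reported in this abstract is the coefficient of determination. As a minimal sketch of how such a score is computed for any regression model (the yield values below are hypothetical and are not data from the paper):

```python
from typing import Sequence


def r_squared(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Hypothetical yields and model predictions, for illustration only.
actual = [10.0, 12.0, 14.0, 16.0]
predicted = [11.0, 12.0, 14.0, 15.0]
score = r_squared(actual, predicted)
```

A score of 1.0 means the predictions explain all of the variance in the observed values; a score near 0 means the model does no better than predicting the mean.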

Design and Implementation of Memory-Centric Computing System for Big Data Analysis

  • Jung, Byung-Kwon
    • Journal of the Korea Society of Computer and Information / v.27 no.7 / pp.1-7 / 2022
  • Recently, applications such as big data and machine learning programs that generate large amounts of data while running have become common, and existing main memory alone is no longer sufficient to execute them quickly. In particular, the need to obtain results faster has emerged in situations where entire genome sequences must be analyzed for genetic variation owing to the coronavirus outbreak. When large-capacity data were processed on a computing system equipped with a self-developed memory pool MOCA host adapter instead of an existing SSD, performance improved by 16% compared with the existing SSD system. In addition, in other benchmark tests on systems equipped with the memory pool MOCA host adapter, the I/O performance of the workflow tasks SortSampleBam, ApplyBQSR, and GatherBamFiles was 92.8%, 80.6%, and 32.8% faster, respectively, than on SSD. When analyzing large amounts of data, as in whole-genome pipeline analysis, the measurement delay that occurs at runtime can be expected to decrease on a computing system equipped with the memory pool MOCA host adapter developed in this research.

Prediction and Analysis of PM2.5 Concentration in Seoul Using Ensemble-based Model (앙상블 기반 모델을 이용한 서울시 PM2.5 농도 예측 및 분석)

  • Ryu, Minji;Son, Sanghun;Kim, Jinsoo
    • Korean Journal of Remote Sensing / v.38 no.6_1 / pp.1191-1205 / 2022
  • Particulate matter (PM), an air pollutant with complex and widespread causes, is classified according to particle size. PM2.5 is very small and can cause diseases of the respiratory tract or cardiovascular system if inhaled. To prepare for these risks, state-led management and preventive monitoring and forecasting are important. This study predicted PM2.5 in Seoul, where high concentrations of fine dust occur frequently, using two ensemble models, random forest (RF) and extreme gradient boosting (XGB), with 15 weather-related factors from the local data assimilation and prediction system (LDAPS), aerosol optical depth (AOD), and 4 chemical factors as independent variables. Performance and factor importance were evaluated for both models, and a seasonal model analysis was also performed. Both models showed high prediction accuracy (RF: R² = 0.85; XGB: R² = 0.91), and XGB was confirmed to be more suitable for PM2.5 prediction than RF. The seasonal model analysis showed that prediction performance was good relative to the observed values in spring, when concentrations were high. In this study, PM2.5 in Seoul was predicted using various factors, and an ensemble-based PM2.5 prediction model with good performance was constructed.

A study on time series linkage in the Household Income and Expenditure Survey (가계동향조사 지출부문 시계열 연계 방안에 관한 연구)

  • Kim, Sihyeon;Seong, Byeongchan;Choi, Young-Geun;Yeo, In-kwon
    • The Korean Journal of Applied Statistics / v.35 no.4 / pp.553-568 / 2022
  • The Household Income and Expenditure Survey is a representative survey of Statistics Korea that aims to measure and analyze national income and consumption levels and their changes by capturing the current state of household balances. Recently, the discontinuity in these time series caused by the large-scale reorganization of the survey methods in 2017 and 2019 has become an issue. In this study, we model the characteristics of the time series in the Household Income and Expenditure Survey up to 2016 and use the models to compute forecasts for linking the expenditures in 2017 and 2018. To reflect the characteristics of all expenditure item series evenly and to reduce the influence of any single forecast model, we combine a total of 8 models, including regression models, time series models, and machine learning techniques. A noteworthy aspect of this study is that it improves the forecasts by using the optimal combination technique, which reflects the hierarchical structure of the Household Income and Expenditure Survey exactly, without the loss of information that occurs in top-down or bottom-up methods. Applying the proposed method to forecast expenditure series from 2017 to 2019 helped restore time series linkage and improved the forecasts. In addition, it was confirmed that the hierarchical time series forecasts from the optimal combination method bring the linkage results closer to the actual survey series.
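
The optimal combination idea mentioned in this abstract can be illustrated on the smallest possible hierarchy, total = a + b. The sketch below assumes the ordinary least squares variant of reconciliation, in which independent base forecasts for every level are projected onto the set of forecasts that add up. The hierarchy, the numbers, and the function name are hypothetical; the paper's actual expenditure hierarchy and weighting are not reproduced here.

```python
from typing import Tuple


def reconcile_two_level(total: float, a: float, b: float) -> Tuple[float, float, float]:
    """OLS reconciliation for the hierarchy total = a + b.

    Closed form of S (S^T S)^-1 S^T y_hat with
    S = [[1, 1], [1, 0], [0, 1]] and y_hat = [total, a, b].
    """
    a_rec = (total + 2.0 * a - b) / 3.0
    b_rec = (total + 2.0 * b - a) / 3.0
    return a_rec + b_rec, a_rec, b_rec


# Hypothetical base forecasts that do not add up: 40 + 50 != 100.
total_rec, a_rec, b_rec = reconcile_two_level(100.0, 40.0, 50.0)
```

Unlike a top-down or bottom-up approach, every base forecast contributes to every reconciled value, so the result stays coherent without discarding the information in any level.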