• Title/Summary/Keyword: 통계데이터 (statistical data)

3,229 search results

Toxicity prediction of chemicals using OECD test guideline data with graph-based deep learning models (OECD TG데이터를 이용한 그래프 기반 딥러닝 모델 분자 특성 예측)

  • Daehwan Hwang;Changwon Lim
    • The Korean Journal of Applied Statistics / v.37 no.3 / pp.355-380 / 2024
  • In this paper, we compare the performance of graph-based deep learning models using OECD test guideline (TG) data. OECD TG are a unique tool for assessing the potential effects of chemicals on health and the environment, but many guidelines involve animal testing. Animal testing is time-consuming and expensive and raises ethical issues, so methods to replace or minimize it are being studied. Deep learning is used in various fields involving chemicals, including toxicity prediction, and research on graph-based models is particularly active. Our goal is to compare the performance of graph-based deep learning models on OECD TG data and identify the best-performing model. We collected OECD TG results from the website eChemportal.org operated by the OECD, and chemicals that were impossible or inappropriate to learn from were removed through pre-processing. The toxicity prediction performance of five graph-based models was compared using the collected OECD TG data and MoleculeNet data, a benchmark dataset for predicting chemical properties.
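The graph-based models compared in work like this typically share the graph-convolution update sketched below. This is a minimal NumPy illustration on a made-up 4-atom molecule, not the authors' pipeline; a real workflow would featurize SMILES strings with a toolkit such as RDKit and learn the weights by training.

```python
import numpy as np

# Toy molecular graph: 4 atoms, bonds as an adjacency matrix (hypothetical)
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 1, 0, 0],
              [0, 1, 0, 0]], dtype=float)
X = np.eye(4)                       # one-hot atom features (placeholder)

# GCN propagation: H = ReLU(D^{-1/2} (A+I) D^{-1/2} X W)
A_hat = A + np.eye(4)               # add self-loops
D_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 2))         # random weights; trained in practice
H = np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ X @ W, 0)

# Graph-level readout (mean pooling) would feed a toxic / non-toxic classifier
graph_embedding = H.mean(axis=0)
print(graph_embedding.shape)        # (2,)
```

The five models the paper compares differ mainly in how this neighborhood-aggregation step is parameterized and pooled.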

A Plan for Database Education in Statistics Departments (통계학과에서의 데이터베이스 교육 방안)

  • An, Jeong-Yong;Han, Gyeong-Su
    • Proceedings of the Korean Statistical Society Conference / 2002.11a / pp.231-234 / 2002
  • Although the importance of database education in statistics departments has occasionally been mentioned in studies on improving statistics curricula and in several applied papers, research discussing concrete teaching methods is hard to find. This study considers the necessity of database education in statistics departments and possible teaching approaches. Accordingly, the purpose of this paper is to examine how database education in a statistics department can be connected to the discipline of statistics itself.


Development of Web Contents for Statistical Analysis Using Statistical Package and Active Server Page (통계패키지와 Active Server Page를 이용한 통계 분석 웹 컨텐츠 개발)

  • Kang, Tae-Gu;Lee, Jae-Kwan;Kim, Mi-Ah;Park, Chan-Keun;Heo, Tae-Young
    • Journal of Korea Society of Industrial Information Systems / v.15 no.1 / pp.109-114 / 2010
  • In this paper, we developed web content for statistical analysis using a statistical package and Active Server Pages (ASP). Statistical packages are difficult for non-statisticians to learn and use; however, non-statisticians want to analyze data without learning packages such as SAS, S-PLUS, and R. Therefore, we developed web-based statistical analysis content using S-PLUS, a popular statistical package, together with ASP. In a real application, we built web content for various statistical analyses, such as exploratory data analysis, analysis of variance, and time series analysis, using water quality data. The developed web content is very useful for non-statisticians such as public servants and researchers. Consequently, by combining web-based content with a statistical package, users can access the site quickly and analyze data easily.

Feature selection for text data via sparse principal component analysis (희소주성분분석을 이용한 텍스트데이터의 단어선택)

  • Won Son
    • The Korean Journal of Applied Statistics / v.36 no.6 / pp.501-514 / 2023
  • When analyzing high-dimensional data such as text data, inputting all the variables as explanatory variables can cause statistical learning procedures to suffer from over-fitting. Furthermore, computational efficiency can deteriorate with a large number of variables. Dimensionality-reduction techniques such as feature selection or feature extraction are useful for dealing with these problems. Sparse principal component analysis (SPCA) is a regularized least-squares method that employs an elastic-net-type objective function. SPCA can be used to remove insignificant principal components and identify important variables from noisy observations. In this study, we propose a dimension-reduction procedure for text data based on SPCA. Applying the proposed procedure to real data, we find that the reduced feature set maintains sufficient information about the text while its size shrinks as redundant variables are removed. As a result, the proposed procedure can improve classification accuracy and computational efficiency, especially for classifiers such as the k-nearest neighbors algorithm.
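The mechanics of SPCA-based word selection can be sketched with scikit-learn's `SparsePCA`. Note two assumptions: scikit-learn's variant uses an ℓ1 dictionary-learning penalty rather than the exact elastic-net formulation the paper refers to, and the document-term matrix here is synthetic, not the paper's data.

```python
import numpy as np
from sklearn.decomposition import SparsePCA

# Hypothetical document-term matrix: 20 documents x 10 words (counts)
rng = np.random.default_rng(42)
X = rng.poisson(1.0, size=(20, 10)).astype(float)
X -= X.mean(axis=0)                        # center columns, as in PCA

# The sparsity penalty (alpha) zeroes out loadings of unimportant words
spca = SparsePCA(n_components=3, alpha=0.5, random_state=0)
spca.fit(X)

# Keep only words with a nonzero loading in at least one component
selected = np.flatnonzero(np.abs(spca.components_).sum(axis=0) > 0)
print("selected word indices:", selected)
```

Larger `alpha` drives more loadings to exactly zero, shrinking the selected vocabulary further.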

Statistical Data Extraction and Validation from Graph for Data Integration and Meta-analysis (데이터통합과 메타분석을 위한 그래프 통계량 추출과 검증)

  • Sung Ryul Shim;Yo Hwan Lim;Myunghee Hong;Gyuseon Song;Hyun Wook Han
    • The Journal of Bigdata / v.6 no.2 / pp.61-70 / 2021
  • The objective of this study was to describe specific approaches for extracting data from graphs when statistical information is not directly reported in an article, enabling data integration and meta-analysis for quantitative data synthesis. Meta-analysis is an important analysis tool that supports sound decision-making for evidence-based medicine by systematically and objectively selecting target literature, quantifying the results of individual studies, and providing the overall effect size. For data integration and meta-analysis, we examined the strengths of introducing and applying Adobe Acrobat Reader and Python-based JupyterLab, computer tools that extract accurate statistical figures from graphs. As an example, we used data that had been statistically verified in previous studies and whose original data could be obtained from ClinicalTrials.gov. A meta-analysis of the original data and of the values extracted with each computer tool showed no statistically significant difference between the extraction methods. In addition, the inter-rater reliability between researchers was confirmed, and consistency was high. Therefore, in terms of maintaining the integrity of statistical information, measurement using a computational tool is recommended over the classically used methods.
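Once effect sizes and standard errors have been extracted from graphs, the pooling step the abstract refers to is standard inverse-variance weighting. The sketch below uses made-up numbers for three studies, not values from the paper.

```python
import numpy as np

# Hypothetical effect sizes (e.g., mean differences) and standard errors
# from three studies -- in practice extracted from graphs or tables
effects = np.array([0.42, 0.35, 0.50])
se = np.array([0.10, 0.15, 0.12])

# Fixed-effect (inverse-variance) pooling
w = 1.0 / se**2
pooled = np.sum(w * effects) / np.sum(w)
pooled_se = np.sqrt(1.0 / np.sum(w))
ci = (pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se)

print(round(pooled, 3), tuple(round(x, 3) for x in ci))
```

Comparing this pooled estimate computed from extracted values against the one computed from original data is exactly the validation the study performs.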

A Spatial Analysis of Seismic Vulnerability of Buildings Using Statistical and Machine Learning Techniques Comparative Analysis (통계분석 기법과 머신러닝 기법의 비교분석을 통한 건물의 지진취약도 공간분석)

  • Seong H. Kim;Sang-Bin Kim;Dae-Hyeon Kim
    • Journal of Industrial Convergence / v.21 no.1 / pp.159-165 / 2023
  • While the frequency of earthquakes has been increasing recently, the domestic seismic response system remains weak; the objective of this research is therefore to compare and analyze the seismic vulnerability of buildings using statistical analysis and machine learning techniques. With the statistical technique, the model developed through the optimal scaling method showed a prediction accuracy of about 87%. Among the four machine learning methods analyzed, Random Forest had the highest accuracy (94% on the training set and 76.7% on the test set) and was therefore chosen as the final machine learning technique. Accordingly, the statistical technique showed a higher accuracy of about 87%, whereas the machine learning technique showed an accuracy of about 76.7%. Of the 22,296 buildings analyzed, 1,627 (7.3%) were rated more vulnerable by the statistical technique, 10,146 (45.5%) received the same rating from both techniques, and the remaining 10,523 (47.2%) were rated more vulnerable by the machine learning technique. By comparing advanced machine learning techniques with existing statistical analysis techniques in spatial analysis decisions, it is hoped that these results will help prepare more reliable seismic countermeasures.
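The train/test gap the abstract reports (94% vs. 76.7%) is why both figures matter. A minimal sketch of the Random Forest evaluation, on synthetic stand-in data since the building attributes are not public:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Stand-in data of roughly similar shape (binary vulnerable / not vulnerable)
X, y = make_classification(n_samples=1000, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# A large gap between the two accuracies signals overfitting, which is why
# the test-set figure is the one to compare against the statistical model
acc_train = accuracy_score(y_tr, rf.predict(X_tr))
acc_test = accuracy_score(y_te, rf.predict(X_te))
print("train:", acc_train, " test:", acc_test)
```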

A Study on Construction of Multimedia Statistic Post Office Box for Wireless Internet Services (무선인터넷 서비스를 위한 멀티미디어 통계사서함 구축에 관한 연구)

  • 이종득;김대경
    • Journal of the Korea Computer Industry Society / v.5 no.1 / pp.1-8 / 2004
  • As more and more information is processed and stored in digital form, many techniques and systems have been developed to serve multimedia information over the wireless internet. In this paper, we propose an MSPOB (Multimedia Statistics Post Office Box) structure that serves data grouped by subject similarity over a set of documents. The proposed structure is determined by the relationships among data, based on a count index and an inverted file, and by the semantic similarity between objects.
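The "count index and inverted file" the abstract builds on can be sketched as follows; the document texts here are made up for illustration.

```python
from collections import defaultdict

# Hypothetical document collection
docs = {
    "d1": "weather statistics daily weather",
    "d2": "stock statistics report",
    "d3": "daily stock report report",
}

# Inverted file with term counts: term -> {doc_id: count}
inverted = defaultdict(dict)
for doc_id, text in docs.items():
    for term in text.split():
        inverted[term][doc_id] = inverted[term].get(doc_id, 0) + 1

# Grouping documents that share a subject term, as in the proposed structure
print(sorted(inverted["statistics"]))   # ['d1', 'd2']
```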


A Study on Secure and Statistical Data Aggregation in Ad Hoc Networks (애드혹 네트워크에서 안전한 통계정보수집 기법에 관한 연구)

  • Cho, Kwantae;Lee, Byung-Gil
    • Proceedings of the Korea Information Processing Society Conference / 2014.11a / pp.561-563 / 2014
  • Ad hoc networks are widely used in environments that have recently drawn attention, such as the Internet of Things, the smart grid, people-centric urban sensing, and maritime communications. Applications in these environments may request users' sensitive private information in order to provide various conveniences. However, if the collected private information is exposed to an unauthorized attacker, users may feel anxious, and the service provider collecting the data may suffer large economic losses. To prevent such exposure, secure data aggregation schemes have been studied. Most existing schemes, however, provide only confidentiality, not non-repudiation or anonymity, and they do not simultaneously support both statistical aggregation and individual data collection. This paper introduces a new data aggregation scheme that supports both individual and statistical data collection while providing users with unlinkability, a strengthened notion of anonymity.

Exploratory data analysis for Chatterjee's ξ coefficient (Chatterjee의 ξ 계수에 대한 탐색적자료분석)

  • Jang, Dae-Heung
    • The Korean Journal of Applied Statistics / v.35 no.3 / pp.421-434 / 2022
  • Chatterjee (2021) proposed a new correlation coefficient, ξ. Focusing on two questions (1. Can the ξ coefficient distinguish the datasets in Anscombe's quartet? 2. How does the ξ coefficient change with the number of data points for various kinds of scatterplots?), we attempt an exploratory data analysis of the ξ coefficient and compare three measures: the ξ coefficient, Pearson's correlation coefficient, and mutual information.
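For a sense of what ξ measures, the no-ties formula from Chatterjee (2021) is short enough to compute directly: sort the pairs by x, rank the resulting y values, and set ξ = 1 − 3 Σ|r₍ᵢ₊₁₎ − rᵢ| / (n² − 1). The parabola example below is our own illustration, not from the paper.

```python
import numpy as np

def xi_corr(x, y):
    """Chatterjee's xi coefficient (no-ties formula, Chatterjee 2021)."""
    x = np.asarray(x)
    y = np.asarray(y)
    n = len(x)
    order = np.argsort(x, kind="stable")              # sort pairs by x
    ranks = np.argsort(np.argsort(y[order], kind="stable"), kind="stable") + 1
    return 1 - 3 * np.sum(np.abs(np.diff(ranks))) / (n**2 - 1)

# xi detects non-monotone dependence that Pearson's r misses entirely:
x = np.linspace(-1, 1, 201)
print(round(xi_corr(x, x**2), 2))            # close to 1: y is a function of x
print(round(np.corrcoef(x, x**2)[0, 1], 2))  # near zero: no linear trend
```

Unlike Pearson's r, ξ approaches 1 whenever y is a (noiseless) function of x, monotone or not, which is what makes it interesting for Anscombe-style comparisons.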

A Study on Scheduling of Distributed Log Analysis by the importance of the measure (중요도에 따른 분산 로그분석 스케줄링)

  • Back, BongHyun;Ahn, Byoungchul
    • Proceedings of the Korea Information Processing Society Conference / 2009.04a / pp.1511-1514 / 2009
  • The numerous log data generated in heterogeneous system environments require real-time analysis according to their importance, and large volumes of log data must be processed within a specific time. Security-related log data demand real-time analysis and fast statistical processing, while large non-real-time logs must be analyzed and statistically processed within a given deadline. This paper proposes a log-analysis scheduling policy that performs real-time analysis of logs according to their importance while meeting the processing deadlines of large non-real-time statistical logs.
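The policy described, importance first, then earliest deadline among batch jobs, maps naturally onto a priority queue. This is a minimal sketch with a hypothetical job model; the paper does not publish its actual data structures.

```python
import heapq

# (importance, deadline_sec, name): lower tuples are scheduled first, so
# importance dominates and deadline breaks ties among equal-importance jobs
jobs = [
    (1, 0,    "security-log"),     # importance 1 = real-time, run immediately
    (2, 3600, "web-access-log"),   # batch, must finish within an hour
    (2, 600,  "error-log"),        # batch, tighter deadline
]

heap = []
for job in jobs:
    heapq.heappush(heap, job)

# Pop order: real-time logs first, then batch logs by earliest deadline
order = [heapq.heappop(heap)[2] for _ in range(len(heap))]
print(order)   # ['security-log', 'error-log', 'web-access-log']
```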