• Title/Abstract/Keywords: Dataset: Review

133 search results, processing time 0.019 s

A Crowdsourcing-Based Paraphrased Opinion Spam Dataset and Its Implication on Detection Performance

  • 이성운;김성순;박동현;강재우
    • 정보과학회 컴퓨팅의 실제 논문지 / Vol. 22, No. 7 / pp.338-343 / 2016
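  • As the web has become the primary means of exchanging information, online reviews have grown in importance, and opinion spam, which hinders users from making sound decisions, has emerged as an issue that is being actively studied. However, the scarcity and limitations of the reference datasets needed for analysis and training are slowing progress in this area. This paper introduces a new kind of dataset, the Paraphrased Opinion Spam (POS) dataset, which imitates truthful reviews. Noting that real spammers tend to consult real reviews when writing spam, we generated a spam dataset that carries the factual information and experiences contained in the original text by paraphrasing reviews written by real reviewers. Experiments showed that the newly generated POS dataset is linguistically similar to real reviews and is therefore harder to classify with spam classification models than existing datasets. We also confirmed that the classification accuracy for spam reviews increases in proportion to the amount of training data, indicating that data volume is an important factor in the performance of spam classification models.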

COVID-19 International Collaborative Research by the Health Insurance Review and Assessment Service Using Its Nationwide Real-world Data: Database, Outcomes, and Implications

  • Rho, Yeunsook;Cho, Do Yeon;Son, Yejin;Lee, Yu Jin;Kim, Ji Woo;Lee, Hye Jin;You, Seng Chan;Park, Rae Woong;Lee, Jin Yong
    • Journal of Preventive Medicine and Public Health / Vol. 54, No. 1 / pp.8-16 / 2021
  • This article aims to introduce the inception and operation of the COVID-19 International Collaborative Research Project, the world's first coronavirus disease 2019 (COVID-19) open data project for research, along with its dataset and research method, and to discuss relevant considerations for collaborative research using nationwide real-world data (RWD). COVID-19 has spread across the world since early 2020, becoming a serious global health threat to life, safety, and social and economic activities. However, insufficient RWD from patients was available to help clinicians efficiently diagnose and treat patients with COVID-19, or to provide necessary information to the government for policy-making. Countries that saw a rapid surge of infections had to focus on leveraging medical professionals to treat patients, and the circumstances made it even more difficult to promptly use COVID-19 RWD. Against this backdrop, the Health Insurance Review and Assessment Service (HIRA) of Korea decided to open its COVID-19 RWD collected through Korea's universal health insurance program, under the title of the COVID-19 International Collaborative Research Project. The dataset, consisting of 476 508 claim statements from 234 427 patients (7590 confirmed cases) and 18 691 318 claim statements of the same patients for the previous 3 years, was established and hosted on HIRA's in-house server. Researchers who applied to participate in the project uploaded analysis code on the platform prepared by HIRA, and HIRA conducted the analysis and provided outcome values. As of November 2020, analyses have been completed for 129 research projects, which have been published or are in the process of being published in prestigious journals.

Development of Dataset Evaluation Criteria for Learning Deepfake Video

  • 김량형;김태구
    • 산업경영시스템학회지 / Vol. 44, No. 4 / pp.193-207 / 2021
  • The Deepfake phenomenon is spreading worldwide, mainly through videos on web platforms, and it is urgent to address the issue in time. Recently, researchers have extensively discussed Deepfake video datasets. However, it has been pointed out that existing Deepfake datasets do not properly reflect the potential threat and realism due to various limitations. Although research is needed to establish an agreed-upon concept of a high-quality dataset or to suggest evaluation criteria, only a handful of studies have examined this to date. Therefore, this study focused on developing evaluation criteria for Deepfake video datasets. The fitness of a Deepfake dataset was presented, and evaluation criteria were derived through a review of previous studies. AHP structuring and analysis were performed to refine the evaluation criteria. The results showed that Facial Expression, Validation, and Data Characteristics are important determinants of data quality. This is interpreted as reflecting the importance of minimizing defects and presenting results based on scientific methods when evaluating quality. This study has implications in that it suggests the fitness and evaluation criteria of Deepfake datasets. Since the evaluation criteria presented here were derived from items considered in previous studies, all of them are expected to be effective for quality improvement. They are also expected to serve as criteria for selecting an appropriate Deepfake dataset or as a reference for designing a Deepfake data benchmark. This study could not apply the presented evaluation criteria to existing Deepfake datasets. In future research, the proposed criteria will be applied to existing datasets to evaluate the strengths and weaknesses of each and to consider the implications for their use in Deepfake research.
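
The abstract mentions AHP analysis without giving the comparison data. As a minimal sketch of how AHP priority weights are commonly derived, the following uses the row geometric-mean approximation; the criterion names and pairwise values are hypothetical placeholders, not figures from the paper, and the authors may have used a different AHP variant.

```python
import math

def ahp_weights(matrix):
    """Approximate AHP priority weights from a reciprocal pairwise-comparison
    matrix using the geometric mean of each row (a common approximation of
    the principal eigenvector)."""
    n = len(matrix)
    geo_means = [math.prod(row) ** (1.0 / n) for row in matrix]
    total = sum(geo_means)
    return [g / total for g in geo_means]

# Hypothetical criteria and judgments, for illustration only.
criteria = ["criterion_a", "criterion_b", "criterion_c"]
pairwise = [
    [1.0, 2.0, 4.0],    # criterion_a compared with a, b, c
    [0.5, 1.0, 2.0],    # criterion_b compared with a, b, c
    [0.25, 0.5, 1.0],   # criterion_c compared with a, b, c
]

weights = ahp_weights(pairwise)
for name, weight in zip(criteria, weights):
    print(f"{name}: {weight:.3f}")
```

The weights sum to 1, so they can be read directly as relative importance; a full AHP analysis would also check the consistency ratio of the judgments, which this sketch omits.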

SimKoR: A Sentence Similarity Dataset based on Korean Review Data and Its Application to Contrastive Learning for NLP

  • 김재민;나요한;김강민;이상락;채동규
    • 한국정보과학회 언어공학연구회:학술대회논문집(한글 및 한국어 정보처리) / 한국정보과학회언어공학연구회 2022년도 제34회 한글 및 한국어 정보처리 학술대회 / pp.245-248 / 2022
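  • Contrastive learning for capturing contextual meaning has recently been studied actively in natural language processing. Here it is important to use high-quality training and validation data for contrastive learning. For Korean, however, most of the available datasets consist of English data that has been machine-translated into Korean and released after review, which has the drawback of depending on the performance of machine translation. This paper proposes a simple method for building a validation dataset that can measure how well embeddings reflect meaning using Korean review data, and proposes SimKoR (Similarity Korean Review dataset), a dataset built with this method. We perform contrastive learning using the proposed validation dataset and show its effectiveness.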

Applications of Machine Learning Models on Yelp Data

  • Ruchi Singh;Jongwook Woo
    • Asia pacific journal of information systems / Vol. 29, No. 1 / pp.35-49 / 2019
  • This paper documents the application of relevant Machine Learning (ML) models to the Yelp (a crowd-sourced local business review and social networking site) dataset to analyze, predict, and recommend businesses. Two cloud platforms are used strategically to minimize the effort and time required for the project. Seven machine learning algorithms are applied in Azure ML, four of which are implemented in Databricks Spark ML. The analyzed Yelp business dataset contains 70 business attributes for more than 350,000 registered businesses. Additionally, review tips and likes from 500,000 users are processed for the project. A Recommendation Model is built to provide Yelp users with recommendations for business categories based on their previous business ratings, as well as the business ratings of other users. A Classification Model is implemented to predict the popularity of a business, defining a popular business as one with more than 3 stars and an unpopular business as one with fewer than 3 stars. A Text Analysis model is developed by comparing two algorithms, uni-gram feature extraction and n-feature extraction, in Azure ML Studio and a logistic regression model in Spark. Comparative conclusions are drawn regarding the efficiency of Spark ML and Azure ML for these models.
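
The popularity rule stated in the abstract can be sketched as a small labeling function. This is an illustrative reading only: the function name is hypothetical, and how the authors treated a rating of exactly 3 stars is not stated, so the sketch leaves that case unlabeled.

```python
from typing import Optional

def popularity_label(stars: float) -> Optional[str]:
    """Label a business by its star rating, following the abstract's rule."""
    if stars > 3:
        return "popular"
    if stars < 3:
        return "unpopular"
    # Exactly 3 stars is not covered by the stated rule (assumption: skip it).
    return None

# Hypothetical ratings, for illustration only.
ratings = [4.5, 2.0, 3.0]
labels = [popularity_label(r) for r in ratings]
print(labels)
```

Records labeled `None` would need to be dropped or assigned by some other rule before training a binary classifier.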

Precision Agriculture using Internet of Thing with Artificial Intelligence: A Systematic Literature Review

  • Noureen Fatima;Kainat Fareed Memon;Zahid Hussain Khand;Sana Gul;Manisha Kumari;Ghulam Mujtaba Sheikh
    • International Journal of Computer Science & Network Security / Vol. 23, No. 7 / pp.155-164 / 2023
  • With the high-precision algorithms of machine learning, precision agriculture (PA) has become an emerging concept. Many researchers have worked on the quality and quantity of PA by using sensors, networking, machine learning (ML) techniques, and big data. However, there has been no attempt to examine trends in artificial intelligence (AI) techniques, datasets, and crop types in precision agriculture using the internet of things (IoT). This research aims to systematically analyze the domains of AI techniques and datasets that have been used in IoT-based prediction in the area of PA. A systematic literature review is performed on AI-based techniques and datasets for crop management, weather, irrigation, plant, soil, and pest prediction. We took the papers on precision agriculture published in the last six years (2013-2019). We considered 42 primary studies related to the research objectives. After critical analysis of the studies, we found that crop management, soil, and temperature are the areas of PA most commonly addressed with the help of IoT devices and AI techniques. Moreover, different artificial intelligence techniques such as ANN, CNN, SVM, Decision Tree, and RF have been utilized in different fields of precision agriculture. Image processing with supervised and unsupervised learning is also used for prediction and monitoring in PA. In addition, most of the studies use sensor datasets to measure different properties of soil, weather, irrigation, and crops. Finally, we provide future directions for researchers and guidelines for practitioners based on the findings of this review.

Integration of Single-Cell RNA-Seq Datasets: A Review of Computational Methods

  • Yeonjae Ryu;Geun Hee Han;Eunsoo Jung;Daehee Hwang
    • Molecules and Cells / Vol. 46, No. 2 / pp.106-119 / 2023
  • With the increased number of single-cell RNA sequencing (scRNA-seq) datasets in public repositories, integrative analysis of multiple scRNA-seq datasets has become commonplace. Batch effects among different datasets are inevitable because of differences in cell isolation and handling protocols, library preparation technology, and sequencing platforms. To remove these batch effects for effective integration of multiple scRNA-seq datasets, a number of methodologies have been developed based on diverse concepts and approaches. These methods have proven useful for examining whether cellular features, such as cell subpopulations and marker genes, identified from a certain dataset, are consistently present, or whether their condition-dependent variations, such as increases in cell subpopulations in particular disease-related conditions, are consistently observed in different datasets generated under similar or distinct conditions. In this review, we summarize the concepts and approaches of the integration methods and their pros and cons as has been reported in previous literature.

Latent class model for mixed variables with applications to text data

  • 신현수;서병태
    • 응용통계연구 / Vol. 32, No. 6 / pp.837-849 / 2019
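  • The latent class model, which can be viewed as a kind of mixture of multinomial distributions, is a useful tool for extracting important information that is not directly observed from categorical data. However, the model cannot be applied directly when the data contain continuous or count variables in addition to categorical variables. In this paper, we consider the case in which categorical and count variables occur together and use a mixed-mode latent class model to analyze drug review data that contain both text reviews and categorical response items. The analysis confirmed that the mixed-mode latent class model, which uses the text data as well, yields more detailed information about the latent classes than an ordinary latent class model that uses only the categorical responses.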

Developing a Hospital-Wide All-Cause Risk-Standardized Readmission Measure Using Administrative Claims Data in Korea: Methodological Explorations and Implications

  • 김명화;김홍수;황수희
    • 보건행정학회지 / Vol. 25, No. 3 / pp.197-206 / 2015
  • Background: The purpose of this study was to propose a method for developing a measure of hospital-wide all-cause risk-standardized readmissions using administrative claims data in Korea and to discuss further considerations in the refinement and implementation of the readmission measure. Methods: By adapting the methodology of the United States Centers for Medicare & Medicaid Services for creating a 30-day readmission measure, we developed a 6-step approach for generating a comparable measure using Korean datasets. Using the 2010 Korean National Health Insurance (NHI) claims data as the development dataset, hierarchical regression models were fitted to calculate a hospital-wide all-cause risk-standardized readmission measure. Six regression models were fitted to calculate the readmission rates of six clinical condition groups, respectively, and a single, weighted, overall readmission rate was calculated from the readmission rates of these subgroups. Lastly, the case mix differences among hospitals were risk-adjusted using patient-level comorbidity variables. The model was validated using the 2009 NHI claims data as the validation dataset. Results: The unadjusted, hospital-wide all-cause readmission rate was 13.37%, and the adjusted risk-standardized rate was 10.90%, varying by hospital type. The highest risk-standardized readmission rate was in hospitals (11.43%), followed by general hospitals (9.40%) and tertiary hospitals (7.04%). Conclusion: The newly developed, hospital-wide all-cause readmission measure can be used in quality and performance evaluations of hospitals in Korea. Further methodological refinements of the readmission measure are needed, along with strategies to implement it as a hospital performance indicator.
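
The abstract says the six subgroup readmission rates were combined into a single weighted rate but does not state the weights. The sketch below shows one plausible reading, a weighted arithmetic mean with admission counts as weights; that weighting scheme, the function name, and all numbers are assumptions for illustration, not values or methods taken from the study.

```python
def overall_weighted_rate(group_rates, group_weights):
    """Combine per-group rates into one overall rate as a weighted mean."""
    if len(group_rates) != len(group_weights):
        raise ValueError("rates and weights must have the same length")
    total_weight = sum(group_weights)
    if total_weight <= 0:
        raise ValueError("total weight must be positive")
    weighted_sum = sum(r * w for r, w in zip(group_rates, group_weights))
    return weighted_sum / total_weight

# Hypothetical rates for six clinical condition groups and hypothetical
# admission counts used as weights (assumption: volume weighting).
rates = [0.10, 0.12, 0.08, 0.15, 0.11, 0.09]
admissions = [100, 200, 300, 100, 200, 100]

overall = overall_weighted_rate(rates, admissions)
print(f"overall readmission rate: {overall:.3f}")
```

Other weighting choices, for example inverse-variance weights on standardized ratios, would change the combined figure, so the result depends on which scheme the measure actually specifies.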

A Case-Based Reasoning Method Improving Real-Time Computational Performances: Application to Diagnose for Heart Disease

  • 박윤주
    • 경영정보학연구 / Vol. 16, No. 1 / pp.37-50 / 2014
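  • Because case-based reasoning must search for and retrieve, in real time, past data similar to the current problem from a large volume of data, it suffers from a scalability problem: computational cost rises sharply when the accumulated data are massive or accumulate quickly. To address this, some earlier studies proposed hybrids of clustering and case-based reasoning that partition the entire dataset into several groups in advance and then search for past cases only within a specific cluster. However, the performance of such techniques varies widely with the chosen number of clusters, and they generally yield lower predictive performance than basic case-based reasoning. This study empirically analyzes these problems of existing cluster-based case-based reasoning and proposes a new hybrid case-based reasoning technique that can overcome them. The proposed technique was applied to the problem of predicting actual heart disease patients, and the results confirmed that it achieves a level of predictive performance similar to conventional case-based reasoning at a markedly lower computational cost.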
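
The earlier cluster-then-search hybrid that the abstract describes (not the paper's new method, whose details the abstract does not give) can be sketched as follows. The sketch assumes the clusters and centroids were computed beforehand; the function name, the use of Euclidean distance, and the toy cases are all hypothetical.

```python
import math

def retrieve_in_cluster(query, centroids, clusters):
    """Retrieve the most similar past case, searching only the cluster whose
    centroid is closest to the query.

    centroids[i] is the centre of cluster i; clusters[i] is a list of
    (features, outcome) cases assigned to that cluster."""
    # Step 1: find the nearest centroid (one distance per cluster).
    nearest = min(range(len(centroids)),
                  key=lambda i: math.dist(query, centroids[i]))
    # Step 2: search for the nearest case inside that cluster only.
    return min(clusters[nearest], key=lambda case: math.dist(query, case[0]))

# Toy data, for illustration only.
centroids = [(0.0, 0.0), (10.0, 10.0)]
clusters = [
    [((0.0, 1.0), "neg"), ((1.0, 0.0), "neg")],
    [((9.0, 10.0), "pos"), ((11.0, 10.0), "pos")],
]

case = retrieve_in_cluster((8.5, 9.5), centroids, clusters)
print(case)
```

The saving comes from skipping every case outside the chosen cluster; the cost is that a better match in a neighbouring cluster is never examined, which is consistent with the lower predictive performance the abstract attributes to this family of methods.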