• Title/Summary/Keyword: Dataset: Review

Search Result 130, Processing Time 0.026 seconds

A Crowdsourcing-Based Paraphrased Opinion Spam Dataset and Its Implication on Detection Performance (크라우드소싱 기반 문장재구성 방법을 통한 의견 스팸 데이터셋 구축 및 평가)

  • Lee, Seongwoon;Kim, Seongsoon;Park, Donghyeon;Kang, Jaewoo
    • KIISE Transactions on Computing Practices
    • /
    • v.22 no.7
    • /
    • pp.338-343
    • /
    • 2016
  • Today, opinion reviews on the Web are often used as a means of information exchange. As the importance of opinion reviews continues to grow, the number of issues for opinion spam also increases. Even though many research studies on detecting spam reviews have been conducted, some limitations of gold-standard datasets hinder research. Therefore, we introduce a new dataset called "Paraphrased Opinion Spam (POS)" that contains a new type of review spam that imitates truthful reviews. We have noticed that spammers refer to existing truthful reviews to fabricate spam reviews. To create such a seemingly truthful review spam dataset, we asked task participants to paraphrase truthful reviews to create a new deceptive review. The experiment results show that classifying our POS dataset is more difficult than classifying the existing spam datasets since the reviews in our dataset more linguistically look like truthful reviews. Also, training volume has been found to be an important factor for classification model performance.

COVID-19 International Collaborative Research by the Health Insurance Review and Assessment Service Using Its Nationwide Real-world Data: Database, Outcomes, and Implications

  • Rho, Yeunsook;Cho, Do Yeon;Son, Yejin;Lee, Yu Jin;Kim, Ji Woo;Lee, Hye Jin;You, Seng Chan;Park, Rae Woong;Lee, Jin Yong
    • Journal of Preventive Medicine and Public Health
    • /
    • v.54 no.1
    • /
    • pp.8-16
    • /
    • 2021
  • This article aims to introduce the inception and operation of the COVID-19 International Collaborative Research Project, the world's first coronavirus disease 2019 (COVID-19) open data project for research, along with its dataset and research method, and to discuss relevant considerations for collaborative research using nationwide real-world data (RWD). COVID-19 has spread across the world since early 2020, becoming a serious global health threat to life, safety, and social and economic activities. However, insufficient RWD from patients was available to help clinicians efficiently diagnose and treat patients with COVID-19, or to provide necessary information to the government for policy-making. Countries that saw a rapid surge of infections had to focus on leveraging medical professionals to treat patients, and the circumstances made it even more difficult to promptly use COVID-19 RWD. Against this backdrop, the Health Insurance Review and Assessment Service (HIRA) of Korea decided to open its COVID-19 RWD collected through Korea's universal health insurance program, under the title of the COVID-19 International Collaborative Research Project. The dataset, consisting of 476 508 claim statements from 234 427 patients (7590 confirmed cases) and 18 691 318 claim statements of the same patients for the previous 3 years, was established and hosted on HIRA's in-house server. Researchers who applied to participate in the project uploaded analysis code on the platform prepared by HIRA, and HIRA conducted the analysis and provided outcome values. As of November 2020, analyses have been completed for 129 research projects, which have been published or are in the process of being published in prestigious journals.

Development of Dataset Evaluation Criteria for Learning Deepfake Video (딥페이크 영상 학습을 위한 데이터셋 평가기준 개발)

  • Kim, Rayng-Hyung;Kim, Tae-Gu
    • Journal of Korean Society of Industrial and Systems Engineering
    • /
    • v.44 no.4
    • /
    • pp.193-207
    • /
    • 2021
  • As Deepfakes phenomenon is spreading worldwide mainly through videos in web platforms and it is urgent to address the issue on time. More recently, researchers have extensively discussed deepfake video datasets. However, it has been pointed out that the existing Deepfake datasets do not properly reflect the potential threat and realism due to various limitations. Although there is a need for research that establishes an agreed-upon concept for high-quality datasets or suggests evaluation criterion, there are still handful studies which examined it to-date. Therefore, this study focused on the development of the evaluation criterion for the Deepfake video dataset. In this study, the fitness of the Deepfake dataset was presented and evaluation criterions were derived through the review of previous studies. AHP structuralization and analysis were performed to advance the evaluation criterion. The results showed that Facial Expression, Validation, and Data Characteristics are important determinants of data quality. This is interpreted as a result that reflects the importance of minimizing defects and presenting results based on scientific methods when evaluating quality. This study has implications in that it suggests the fitness and evaluation criterion of the Deepfake dataset. Since the evaluation criterion presented in this study was derived based on the items considered in previous studies, it is thought that all evaluation criterions will be effective for quality improvement. It is also expected to be used as criteria for selecting an appropriate deefake dataset or as a reference for designing a Deepfake data benchmark. This study could not apply the presented evaluation criterion to existing Deepfake datasets. In future research, the proposed evaluation criterion will be applied to existing datasets to evaluate the strengths and weaknesses of each dataset, and to consider what implications there will be when used in Deepfake research.

SimKoR: A Sentence Similarity Dataset based on Korean Review Data and Its Application to Contrastive Learning for NLP (SimKoR: 한국어 리뷰 데이터를 활용한 문장 유사도 데이터셋 제안 및 대조학습에서의 활용 방안 )

  • Jaemin Kim;Yohan Na;Kangmin Kim;Sang Rak Lee;Dong-Kyu Chae
    • Annual Conference on Human and Language Technology
    • /
    • 2022.10a
    • /
    • pp.245-248
    • /
    • 2022
  • 최근 자연어 처리 분야에서 문맥적 의미를 반영하기 위한 대조학습 (contrastive learning) 에 대한 연구가 활발히 이뤄지고 있다. 이 때 대조학습을 위한 양질의 학습 (training) 데이터와 검증 (validation) 데이터를 이용하는 것이 중요하다. 그러나 한국어의 경우 대다수의 데이터셋이 영어로 된 데이터를 한국어로 기계 번역하여 검토 후 제공되는 데이터셋 밖에 존재하지 않는다. 이는 기계번역의 성능에 의존하는 단점을 갖고 있다. 본 논문에서는 한국어 리뷰 데이터로 임베딩의 의미 반영 정도를 측정할 수 있는 간단한 검증 데이터셋 구축 방법을 제안하고, 이를 활용한 데이터셋인 SimKoR (Similarity Korean Review dataset) 을 제안한다. 제안하는 검증 데이터셋을 이용해서 대조학습을 수행하고 효과성을 보인다.

  • PDF

Applications of Machine Learning Models on Yelp Data

  • Ruchi Singh;Jongwook Woo
    • Asia pacific journal of information systems
    • /
    • v.29 no.1
    • /
    • pp.35-49
    • /
    • 2019
  • The paper attempts to document the application of relevant Machine Learning (ML) models on Yelp (a crowd-sourced local business review and social networking site) dataset to analyze, predict and recommend business. Strategically using two cloud platforms to minimize the effort and time required for this project. Seven machine learning algorithms in Azure ML of which four algorithms are implemented in Databricks Spark ML. The analyzed Yelp business dataset contained 70 business attributes for more than 350,000 registered business. Additionally, review tips and likes from 500,000 users have been processed for the project. A Recommendation Model is built to provide Yelp users with recommendations for business categories based on their previous business ratings, as well as the business ratings of other users. Classification Model is implemented to predict the popularity of the business as defining the popular business to have stars greater than 3 and unpopular business to have stars less than 3. Text Analysis model is developed by comparing two algorithms, uni-gram feature extraction and n-feature extraction in Azure ML studio and logistic regression model in Spark. Comparative conclusions have been made related to efficiency of Spark ML and Azure ML for these models.

Precision Agriculture using Internet of Thing with Artificial Intelligence: A Systematic Literature Review

  • Noureen Fatima;Kainat Fareed Memon;Zahid Hussain Khand;Sana Gul;Manisha Kumari;Ghulam Mujtaba Sheikh
    • International Journal of Computer Science & Network Security
    • /
    • v.23 no.7
    • /
    • pp.155-164
    • /
    • 2023
  • Machine learning with its high precision algorithms, Precision agriculture (PA) is a new emerging concept nowadays. Many researchers have worked on the quality and quantity of PA by using sensors, networking, machine learning (ML) techniques, and big data. However, there has been no attempt to work on trends of artificial intelligence (AI) techniques, dataset and crop type on precision agriculture using internet of things (IoT). This research aims to systematically analyze the domains of AI techniques and datasets that have been used in IoT based prediction in the area of PA. A systematic literature review is performed on AI based techniques and datasets for crop management, weather, irrigation, plant, soil and pest prediction. We took the papers on precision agriculture published in the last six years (2013-2019). We considered 42 primary studies related to the research objectives. After critical analysis of the studies, we found that crop management; soil and temperature areas of PA have been commonly used with the help of IoT devices and AI techniques. Moreover, different artificial intelligence techniques like ANN, CNN, SVM, Decision Tree, RF, etc. have been utilized in different fields of Precision agriculture. Image processing with supervised and unsupervised learning practice for prediction and monitoring the PA are also used. In addition, most of the studies are forfaiting sensory dataset to measure different properties of soil, weather, irrigation and crop. To this end, at the end, we provide future directions for researchers and guidelines for practitioners based on the findings of this review.

Integration of Single-Cell RNA-Seq Datasets: A Review of Computational Methods

  • Yeonjae Ryu;Geun Hee Han;Eunsoo Jung;Daehee Hwang
    • Molecules and Cells
    • /
    • v.46 no.2
    • /
    • pp.106-119
    • /
    • 2023
  • With the increased number of single-cell RNA sequencing (scRNA-seq) datasets in public repositories, integrative analysis of multiple scRNA-seq datasets has become commonplace. Batch effects among different datasets are inevitable because of differences in cell isolation and handling protocols, library preparation technology, and sequencing platforms. To remove these batch effects for effective integration of multiple scRNA-seq datasets, a number of methodologies have been developed based on diverse concepts and approaches. These methods have proven useful for examining whether cellular features, such as cell subpopulations and marker genes, identified from a certain dataset, are consistently present, or whether their condition-dependent variations, such as increases in cell subpopulations in particular disease-related conditions, are consistently observed in different datasets generated under similar or distinct conditions. In this review, we summarize the concepts and approaches of the integration methods and their pros and cons as has been reported in previous literature.

Latent class model for mixed variables with applications to text data (혼합모드 잠재범주모형을 통한 텍스트 자료의 분석)

  • Shin, Hyun Soo;Seo, Byungtae
    • The Korean Journal of Applied Statistics
    • /
    • v.32 no.6
    • /
    • pp.837-849
    • /
    • 2019
  • Latent class models (LCM) are useful tools to draw hidden information from categorical data. This model can also be interpreted as a mixture model with multinomial component distributions. In some cases, however, an available dataset may contain both categorical and count or continuous data. For such cases, we can extend the LCM to a mixture model with both multinomial and other component distributions such as normal and Poisson distributions. In this paper, we consider a LCM for the data containing categorical and count data to analyze the Drug Review dataset which contains categorical responses and text review. From this data analysis, we show that we can obtain more specific hidden inforamtion than those from the LCM only with categorical responses.

Developing a Hospital-Wide All-Cause Risk-Standardized Readmission Measure Using Administrative Claims Data in Korea: Methodological Explorations and Implications (건강보험 청구자료를 이용한 일반 질 지표로서의 위험도 표준화 재입원율 산출: 방법론적 탐색과 시사점)

  • Kim, Myunghwa;Kim, Hongsoo;Hwang, Soo-Hee
    • Health Policy and Management
    • /
    • v.25 no.3
    • /
    • pp.197-206
    • /
    • 2015
  • Background: The purpose of this study was to propose a method for developing a measure of hospital-wide all-cause risk-standardized readmissions using administrative claims data in Korea and to discuss further considerations in the refinement and implementation of the readmission measure. Methods: By adapting the methodology of the United States Center for Medicare & Medicaid Services for creating a 30-day readmission measure, we developed a 6-step approach for generating a comparable measure using Korean datasets. Using the 2010 Korean National Health Insurance (NHI) claims data as the development dataset, hierarchical regression models were fitted to calculate a hospital-wide all-cause risk-standardized readmission measure. Six regression models were fitted to calculate the readmission rates of six clinical condition groups, respectively and a single, weighted, overall readmission rate was calculated from the readmission rates of these subgroups. Lastly, the case mix differences among hospitals were risk-adjusted using patient-level comorbidity variables. The model was validated using the 2009 NHI claims data as the validation dataset. Results: The unadjusted, hospital-wide all-cause readmission rate was 13.37%, and the adjusted risk-standardized rate was 10.90%, varying by hospital type. The highest risk-standardized readmission rate was in hospitals (11.43%), followed by general hospitals (9.40%) and tertiary hospitals (7.04%). Conclusion: The newly developed, hospital-wide all-cause readmission measure can be used in quality and performance evaluations of hospitals in Korea. Needed are further methodological refinements of the readmission measures and also strategies to implement the measure as a hospital performance indicator.

A Case-Based Reasoning Method Improving Real-Time Computational Performances: Application to Diagnose for Heart Disease (대용량 데이터를 위한 사례기반 추론기법의 실시간 처리속도 개선방안에 대한 연구: 심장병 예측을 중심으로)

  • Park, Yoon-Joo
    • Information Systems Review
    • /
    • v.16 no.1
    • /
    • pp.37-50
    • /
    • 2014
  • Conventional case-based reasoning (CBR) does not perform efficiently for high volume dataset because of case-retrieval time. In order to overcome this problem, some previous researches suggest clustering a case-base into several small groups, and retrieve neighbors within a corresponding group to a target case. However, this approach generally produces less accurate predictive performances than the conventional CBR. This paper suggests a new hybrid case-based reasoning method which dynamically composing a searching pool for each target case. This method is applied to diagnose for the heart disease dataset. The results show that the suggested hybrid method produces statistically the same level of predictive performances with using significantly less computational cost than the CBR method and also outperforms the basic clustering-CBR (C-CBR) method.