• Title/Summary/Keyword: random sets

Quantile estimation using near optimal unbalanced ranked set sampling

  • Nautiyal, Raman;Tiwari, Neeraj;Chandra, Girish
    • Communications for Statistical Applications and Methods / v.28 no.6 / pp.643-653 / 2021
  • Few studies in the literature address the estimation of population quantiles using ranked set sampling (RSS). The optimal RSS strategy is to select observations with at most two fixed rank order statistics from different ranked sets. In this paper, a near-optimal unbalanced RSS model for estimating the pth (0 < p < 1) population quantile is proposed. The main advantage of this model is that it uses every rank order statistic and is distribution-free. The asymptotic relative efficiencies (AREs) of the balanced RSS, unbalanced optimal, and proposed near-optimal methods are computed for different values of p. We also compare these AREs with respect to simple random sampling. The results show that the proposed unbalanced RSS performs uniformly better than balanced RSS for all set sizes and is very close to the optimal RSS for large set sizes. For practical use, the near-optimal unbalanced RSS is recommended for estimating quantiles.
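
To make the sampling scheme in the abstract above concrete, here is a minimal sketch of balanced RSS with a plain sample-quantile estimator. It is not the near-optimal unbalanced allocation the paper proposes, whose details the abstract does not give. The function names, the standard normal population, and the assumption of perfect ranking on the true values are all illustrative choices.

```python
import math
import random


def balanced_rss(draw, k, cycles, rng):
    """Collect a balanced ranked set sample.

    In each cycle, for every rank r = 1..k, a fresh set of k units is drawn,
    ranked, and only its r-th smallest unit is kept (perfect ranking assumed).
    """
    sample = []
    for _ in range(cycles):
        for r in range(k):
            ranked_set = sorted(draw(rng) for _ in range(k))
            sample.append(ranked_set[r])
    return sample


def sample_quantile(values, p):
    """Return the p-th sample quantile as the ceil(n * p)-th order statistic."""
    ordered = sorted(values)
    index = max(math.ceil(len(ordered) * p) - 1, 0)
    return ordered[index]


# Hypothetical population: standard normal, whose true median is 0.
rng = random.Random(0)
rss = balanced_rss(lambda g: g.gauss(0.0, 1.0), k=5, cycles=400, rng=rng)
median_estimate = sample_quantile(rss, 0.5)
```

An unbalanced design would change only the inner loop, keeping some ranks more often than others; the quantile estimator would then need to account for the unequal allocation.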

Incorporating BERT-based NLP and Transformer for An Ensemble Model and its Application to Personal Credit Prediction

  • Sophot Ky;Ju-Hong Lee;Kwangtek Na
    • Smart Media Journal / v.13 no.4 / pp.9-15 / 2024
  • Tree-based algorithms have been the dominant methods used to build prediction models for tabular data, including personal credit data. However, they are compatible only with categorical and numerical data, and they do not capture the relationships between features. In this work, we propose an ensemble model using the Transformer architecture that includes text features and harnesses the self-attention mechanism to address the feature-relationship limitation. We describe a text formatter module that converts the original tabular data into sentence data, which is fed into FinBERT along with other text features. Furthermore, we employ an FT-Transformer that is trained on the original tabular data. We evaluate this multi-modal approach against two popular tree-based algorithms, Random Forest and Extreme Gradient Boosting (XGBoost), and against TabTransformer. Our proposed method shows superior Default Recall, F1 score, and AUC results across two public data sets. Our results are significant for financial institutions seeking to reduce the risk of financial loss from defaulters.

BMDL of blood lead for ADHD based on two longitudinal data sets (주의력 결핍 과잉 행동장애를 종점으로 하는 혈중 납의 벤치마크 용량 하한 도출: 두 동집단 자료의 병합)

  • Kim, Si Yeon;Ha, Mina;Kwon, Hojang;Kim, Byung Soo
    • The Korean Journal of Applied Statistics / v.31 no.1 / pp.13-28 / 2018
  • The Ministry of Environment of Korea initiated two follow-up surveys in 2005 and 2006 to investigate environmental effects on children's health. These two cohorts, referred to as the 2005 Cohort and the 2006 Cohort, were followed up three times, once every two years. This data set is referred to as the Children's Health and Environmental Research (CHEER) data set. This paper reproduces the existing research results of Kim et al. (Journal of the Korean Data and Information Science Society, 25, 987-998, 2014) and Lee et al. (The Korean Journal of Applied Statistics, 29, 1295-1310, 2016) and derives a benchmark dose lower limit (BMDL) of blood lead level for attention deficit hyperactivity disorder (ADHD) after pooling the two cohort data sets. The different ADHD rating scales were unified by applying the conversion formula proposed by Lee et al. (2016). A random effects model and an AR(1) model were built to reflect the longitudinal characteristics and the regression-to-the-mean phenomenon. Based on these models, the BMDLs for blood lead levels were derived using the BMDL formula and simulation. We obtained a high level of BMDLs when we pooled the two independent cohort data sets.

A Report on the Inter-Gene Correlations in cDNA Microarray Data Sets (cDNA 마이크로어레이에서 유전자간 상관 관계에 대한 보고)

  • Kim, Byung-Soo;Jang, Jee-Sun;Kim, Sang-Cheol;Lim, Jo-Han
    • The Korean Journal of Applied Statistics / v.22 no.3 / pp.617-626 / 2009
  • A series of recent papers reported that the inter-gene correlations in Affymetrix microarray data sets were strong and long-ranged, and that the assumption of independence or weak dependence among gene expression signals, often employed without justification, conflicted with actual data. Qiu et al. (2005) indicated that applying the nonparametric empirical Bayes method, in which test statistics are pooled across genes for statistical inference, resulted in a large variance in the number of differentially expressed genes, and attributed this effect to strong and long-ranged inter-gene correlations. Klebanov and Yakovlev (2007) demonstrated that the inter-gene correlations provide a rich source of information rather than being a nuisance in the statistical analysis. By transforming the original gene expression sequence, they developed a sequence of independent random variables, which they referred to as a $\delta$-sequence. Using two cDNA microarray data sets produced in this country, we note in this report that strong and long-ranged inter-gene correlations are also present in cDNA microarray data and that a $\delta$-sequence of independent random variables can likewise be derived from cDNA microarray data. This note suggests that inter-gene correlations be considered in future analyses of cDNA microarray data sets.

Accuracy of genomic-polygenic estimated breeding value for milk yield and fat yield in the Thai multibreed dairy population with five single nucleotide polymorphism sets

  • Wongpom, Bodin;Koonawootrittriron, Skorn;Elzo, Mauricio A.;Suwanasopee, Thanathip;Jattawa, Danai
    • Asian-Australasian Journal of Animal Sciences / v.32 no.9 / pp.1340-1348 / 2019
  • Objective: The objectives were to compare variance components, genetic parameters, prediction accuracies, and genomic-polygenic estimated breeding value (EBV) rankings for milk yield (MY) and fat yield (FY) in the Thai multibreed dairy population using five single nucleotide polymorphism (SNP) sets from the GeneSeek GGP80K chip. Methods: The dataset contained monthly MY and FY of 8,361 first-lactation cows from 810 farms. Variance components, genetic parameters, and EBV for the five SNP sets were obtained using a 2-trait single-step average-information restricted maximum likelihood procedure. The SNP sets were the complete SNP set (all available SNP; SNP100), the top 75% set (SNP75), the top 50% set (SNP50), the top 25% set (SNP25), and the top 5% set (SNP5). The 2-trait models included herd-year-season, heterozygosity, and age at first calving as fixed effects, and animal additive genetic and residual as random effects. Results: The estimates of additive genetic variances for MY and FY from the SNP subsets were mostly higher than those from the complete set. The SNP25 MY and FY heritability estimates (0.276 and 0.183) were higher than those from SNP75 (0.265 and 0.168), SNP50 (0.275 and 0.179), SNP5 (0.231 and 0.169), and SNP100 (0.251 and 0.159). The SNP25 EBV accuracies for MY and FY (39.76% and 33.82%) were higher than those for SNP75 (35.01% and 32.60%), SNP50 (39.64% and 33.38%), SNP5 (38.61% and 29.70%), and SNP100 (34.43% and 31.61%). All rank correlations between SNP100 and the SNP subsets were above 0.98 for both traits, except for SNP100 and SNP5 (0.93 for MY; 0.92 for FY). Conclusion: The high SNP25 estimates of genetic variances, heritabilities, and EBV accuracies, together with the high rank correlations between SNP100 and SNP25 for MY and FY, indicated that genotyping animals with a dedicated SNP25 chip would be a suitable way to keep genotyping costs low while speeding up genetic progress for MY and FY in the Thai dairy population.

Dual-mode Pseudorandom Number Generator Extension for Embedded System (임베디드 시스템에 적합한 듀얼 모드 의사 난수 생성 확장 모듈의 설계)

  • Lee, Suk-Han;Hur, Won;Lee, Yong-Surk
    • Journal of the Institute of Electronics Engineers of Korea SD / v.46 no.8 / pp.95-101 / 2009
  • Random numbers are used in many kinds of applications. Some applications, such as simple software simulation tests, communication protocol verification, and cryptography verification, need various levels of randomness at various processing speeds. In this paper, we propose a fast pseudorandom generator module for embedded systems. The generator module is implemented in hardware and can run in two modes: one generates random numbers with higher randomness but requires six cycles, while the other provides its result within one cycle but with less randomness. An ASIP (Application Specific Instruction set Processor) was designed to implement the proposed pseudorandom generator instruction sets. We designed a processor based on the MIPS architecture using LISA and ran statistical tests, in which the generated sequences passed the Diehard test suite. The HDL models of the processor were generated using CoWare's Processor Designer and synthesized into the Dong-bu 0.18 µm CMOS cell library using the Synopsys Design Compiler. With the proposed pseudorandom generator module, random number generation was 239% faster than the software model, while the area of the proposed ASIP increased by only 2.0%.

Prediction of Customer Satisfaction Using RFE-SHAP Feature Selection Method (RFE-SHAP을 활용한 온라인 리뷰를 통한 고객 만족도 예측)

  • Olga Chernyaeva;Taeho Hong
    • Journal of Intelligence and Information Systems / v.29 no.4 / pp.325-345 / 2023
  • In the rapidly evolving domain of e-commerce, our study presents a cohesive approach to enhancing customer satisfaction prediction from online reviews, aligning methodological innovation with practical insights. We integrate RFE-SHAP feature selection with LDA topic modeling to streamline predictive analytics in e-commerce. This integration facilitates the identification of key features, narrowing an initial set of 28 features down to an optimal subset of 14 for the Random Forest algorithm. Our approach mitigates the common issue of overfitting in models with an excess of features, leading to an improved accuracy rate of 84% in our Random Forest model. Central to our analysis is the understanding that certain aspects of review content, such as quality, fit, and durability, play a pivotal role in influencing customer satisfaction, especially in the clothing sector. We explain how each of the selected features impacts customer satisfaction, providing a comprehensive view of the elements most appreciated by customers. Our research makes significant contributions in two key areas. First, it enhances predictive modeling within e-commerce analytics by introducing a streamlined, feature-centric approach. This refinement in methodology not only improves the accuracy of customer satisfaction predictions but also sets a new standard for handling feature selection in predictive models. Second, the study provides actionable insights for e-commerce platforms, especially those in the clothing sector. By highlighting which aspects of customer reviews, such as quality, fit, and durability, most influence satisfaction, we offer a strategic direction for businesses to tailor their products and services.

Software Development Effort Estimation Using Partition of Project Delivery Rate Group (프로젝트 인도율 그룹 분할 방법을 이용한 소프트웨어 개발노력 추정)

  • Lee, Sang-Un;No, Myeong-Ok;Lee, Bu-Gwon
    • The KIPS Transactions:PartD / v.9D no.2 / pp.259-266 / 2002
  • A main issue in software development is the ability to estimate software project effort and cost in the early phase of the software life cycle. Regression models for project effort and cost estimation are expressed in terms of function points, a measure of software size. The data sets used in previous studies are often small and not very recent. When these models are applied to data from 789 projects developed since 1990, they explain less than 0.53 of the data variation in terms of $R^2$ (coefficient of determination). The data sets are therefore divided into homogeneous groups according to project delivery rate (PDR), and this paper presents general effort estimation models using the project delivery rate. The presented model has a random distribution of residuals and explains more than 0.93 of the data variation ($R^2$) in most PDR ranges.
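
The effect described in the abstract above, a pooled regression that fits poorly and per-PDR-group regressions that fit well, can be illustrated with a small sketch. The project data below are made up purely for illustration and are not the 789 projects from the paper. PDR is taken here as effort hours divided by function points, the group boundary of 10 is arbitrary, and the helper names are hypothetical.

```python
import bisect


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(xs, ys):
    """Coefficient of determination of the least squares line through (xs, ys)."""
    slope, intercept = fit_line(xs, ys)
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


def group_by_pdr(projects, edges):
    """Split (function_points, effort_hours) pairs into groups by PDR range."""
    groups = [[] for _ in range(len(edges) + 1)]
    for function_points, effort in projects:
        pdr = effort / function_points  # assumed definition: hours per function point
        groups[bisect.bisect_right(edges, pdr)].append((function_points, effort))
    return groups


# Made-up projects: three with PDR 5 and three with PDR 20.
projects = [(100, 500), (200, 1000), (300, 1500),
            (100, 2000), (200, 4000), (300, 6000)]

pooled_r2 = r_squared([fp for fp, _ in projects], [e for _, e in projects])
group_r2 = [r_squared([fp for fp, _ in g], [e for _, e in g])
            for g in group_by_pdr(projects, edges=[10]) if g]
```

With these toy numbers the pooled fit explains well under half of the variation, while each PDR group is fitted almost exactly, which is the qualitative pattern the abstract reports.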

Non-rigid 3D Shape Recovery from Stereo 2D Video Sequence (스테레오 2D 비디오 영상을 이용한 비정형 3D 형상 복원)

  • Koh, Sung-shik
    • Journal of the Korea Institute of Information and Communication Engineering / v.20 no.2 / pp.281-288 / 2016
  • Most natural moving objects are non-rigid shapes with randomly time-varying deformation, and their types are very diverse. Methods of non-rigid shape reconstruction have been widely applied in the movie and game industries in recent years. However, a realistic approach requires attaching many beacon sets to the moving object. To resolve this drawback, non-rigid shape reconstruction from input video without beacon sets has been investigated in multimedia application fields. In this regard, this paper proposes a novel CPSRF (Chained Partial Stereo Rigid Factorization) algorithm that can reconstruct a non-rigid 3D shape. Our method focuses on the real-time, per-frame reconstruction of non-rigid 3D shape and motion from stereo 2D video sequences. We do not constrain the deformation of the time-varying non-rigid shape to follow a Gaussian distribution. The experimental results show that the 3D reconstruction performance of the proposed CPSRF method is superior to that of the previous method, which does not consider the random deformation of shape.

Classifying Social Media Users' Stance: Exploring Diverse Feature Sets Using Machine Learning Algorithms

  • Kashif Ayyub;Muhammad Wasif Nisar;Ehsan Ullah Munir;Muhammad Ramzan
    • International Journal of Computer Science & Network Security / v.24 no.2 / pp.79-88 / 2024
  • The use of social media has become part of our daily lives. Social web channels provide content generation facilities to their users, who can share their views, opinions, and experiences on certain topics. Researchers use social media content in various research areas. Sentiment analysis, one of the most active research areas of the last decade, is the process of extracting the reviews, opinions, and sentiments of people. Sentiment analysis is applied in diverse sub-areas such as subjectivity analysis, polarity detection, and emotion detection. Stance classification has emerged as a new and interesting research area, as it aims to determine whether the content writer is in favor of, against, or neutral towards the target topic or issue. Stance classification is significant because it has many research applications, such as rumor stance classification, stance classification in public forums, claim stance classification, neural attention stance classification, online debate stance classification, and dialogic properties stance classification. This study explores different feature sets, such as lexical, sentiment-specific, and dialog-based features, which were extracted using the standard datasets in the relevant area. Supervised learning approaches were applied, including generative algorithms such as Naïve Bayes and discriminative machine learning algorithms such as Support Vector Machine, Naïve Bayes, Decision Tree, and k-Nearest Neighbor, followed by ensemble-based algorithms such as Random Forest and AdaBoost. The empirical results were evaluated using the standard performance measures of Accuracy, Precision, Recall, and F-measure.
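
The performance measures named at the end of the abstract above can be computed as in the following minimal sketch. The stance labels and the four example predictions are made up for illustration, and the function names are hypothetical rather than taken from the paper.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall, and F1 for one class treated as the positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Made-up stance labels for four posts.
y_true = ["favor", "favor", "against", "neutral"]
y_pred = ["favor", "against", "against", "neutral"]

overall_accuracy = accuracy(y_true, y_pred)
per_class = {label: precision_recall_f1(y_true, y_pred, label)
             for label in ("favor", "against", "neutral")}
```

Reporting the measures per class, as here, keeps a minority stance visible instead of letting it be averaged away by overall accuracy.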