A Study on Improving Performance of Software Requirements Classification Models by Handling Imbalanced Data

  • 최종우 (Graduate School of Software, KAIST) ;
  • 이영준 (School of Computing, KAIST) ;
  • 임채균 (School of Computing, KAIST) ;
  • 최호진 (School of Computing, KAIST)
  • Received : 2022.12.22
  • Accepted : 2023.04.21
  • Published : 2023.07.31

Abstract

Software requirements written in natural language may be interpreted differently depending on the stakeholder's viewpoint. When designing an architecture based on quality attributes, quality attribute requirements must be classified accurately, because an efficient design is possible only when appropriate architectural tactics are selected for each quality attribute. Accordingly, although many natural language processing models have been studied for requirements classification, which is a costly manual task, few studies address improving classification performance on imbalanced quality attribute datasets. In this study, we first show through experiments that a classification model can automatically classify a Korean requirements dataset. Building on this result, we explain how data augmentation with EDA (Easy Data Augmentation) techniques and an undersampling strategy can mitigate the imbalance of the quality attribute dataset, and we show that they are effective for requirements classification. The F1-score improved by up to 5.24 percentage points, confirming that handling imbalanced data helps classification models classify Korean requirements. Furthermore, detailed experiments on EDA illustrate which augmentation operations help improve classification performance.
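The abstract describes two imbalance-handling steps: EDA-style text augmentation and undersampling of majority classes. The following is a minimal Python sketch of the four EDA operations (synonym replacement, random insertion, random swap, random deletion) and random undersampling. The toy synonym table and English example sentence are illustrative stand-ins, not the authors' implementation; the paper works with Korean requirements and a Korean wordnet.

```python
import random

# Tiny stand-in thesaurus; the paper uses a Korean WordNet instead.
SYNONYMS = {"system": ["platform"], "shall": ["must"], "respond": ["reply"]}

def synonym_replacement(words, n, rng):
    """Replace up to n words that have a synonym with a random synonym."""
    words = words[:]
    candidates = [i for i, w in enumerate(words) if w in SYNONYMS]
    rng.shuffle(candidates)
    for i in candidates[:n]:
        words[i] = rng.choice(SYNONYMS[words[i]])
    return words

def random_insertion(words, n, rng):
    """Insert n synonyms of random in-vocabulary words at random positions."""
    words = words[:]
    for _ in range(n):
        candidates = [w for w in words if w in SYNONYMS]
        if not candidates:
            break
        words.insert(rng.randrange(len(words) + 1),
                     rng.choice(SYNONYMS[rng.choice(candidates)]))
    return words

def random_swap(words, n, rng):
    """Swap two randomly chosen positions, n times."""
    words = words[:]
    for _ in range(n):
        i, j = rng.randrange(len(words)), rng.randrange(len(words))
        words[i], words[j] = words[j], words[i]
    return words

def random_deletion(words, p, rng):
    """Drop each word with probability p, but never return an empty sentence."""
    kept = [w for w in words if rng.random() > p]
    return kept or [rng.choice(words)]

def undersample(dataset, rng):
    """Randomly drop majority-class samples down to the minority-class size."""
    by_label = {}
    for text, label in dataset:
        by_label.setdefault(label, []).append(text)
    n_min = min(len(texts) for texts in by_label.values())
    balanced = []
    for label, texts in by_label.items():
        balanced += [(t, label) for t in rng.sample(texts, n_min)]
    return balanced

rng = random.Random(0)
sentence = "the system shall respond within two seconds".split()
augmented = synonym_replacement(sentence, 1, rng)
```

In the paper's setting, the augmentation is applied only to minority quality attribute classes while undersampling trims the majority classes, so the two strategies can be combined or compared.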

Acknowledgement

This work was supported by an Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korean government (Ministry of Science and ICT) in 2022 (No. 2013-2-00131, (Exobrain - Overall/Sub-project 1) Development of WiseQA platform technology with evolving intelligence for human knowledge augmentation services).

References

  1. K. Wiegers and J. Beatty, "Software Requirements," Sydney: Pearson Education, 2013.
  2. J. Eckhardt, A. Vogelsang, and D. M. Fernandez, "Are non-functional requirements really non-functional?: An investigation of non-functional requirements in practice," Proceedings of the 38th International Conference on Software Engineering, 2016.
  3. M. Glinz, "On non-functional requirements," in 15th IEEE International Requirements Engineering Conference (RE), pp.21-26, 2007.
  4. L. Bass, P. Clements, and R. Kazman, "Software architecture in practice," Upper Saddle River (N.J.): Addison-Wesley, 2013.
  5. Z. S. H. Abad, O. Karras, P. Ghazi, M. Glinz, G. Ruhe, and K. Schneider, "What works better? A study of classifying requirements," in 2017 IEEE 25th International Requirements Engineering Conference (RE), pp.496-501, 2017.
  6. M. Binkhonain and L. Zhao, "A machine learning approach for hierarchical classification of software requirements," arXiv preprint arXiv:2302.12599, 2023.
  7. X. Luo, Y. Xue, Z. Xing, and J. Sun, "PRCBERT: Prompt learning for requirement classification using BERT-based pretrained language models," in Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering, pp.1-13, 2022.
  8. Z. Kurtanovic and W. Maalej, "Automatically classifying functional and non-functional requirements using supervised machine learning," in 2017 IEEE 25th International Requirements Engineering Conference (RE), pp.490-495, 2017.
  9. J. W. Wei and K. Zou, "EDA: Easy data augmentation techniques for boosting performance on text classification tasks," arXiv:1901.11196, 2019.
  10. E. Dias Canedo and B. Cordeiro Mendes, "Software requirements classification using machine learning algorithms," Entropy, Vol.22, No.9, pp.1057, 2020.
  11. M. Lima, V. Valle, E. Costa, F. Lira, and B. Gadelha, "Software engineering repositories: Expanding the PROMISE database," in Proceedings of the 33rd Brazilian Symposium on Software Engineering, pp.427-436, 2019.
  12. R. Navarro-Almanza, R. Juarez-Ramirez, and G. Licea, "Towards supporting software engineering using deep learning: A case of software requirements classification," in 2017 5th International Conference in Software Engineering Research and Innovation (CONISOFT), pp.116-120, 2017.
  13. T. Hey, J. Keim, A. Koziolek, and W. Tichy, "NoRBERT: Transfer learning for requirements classification," in 2020 IEEE 28th International Requirements Engineering Conference (RE), pp.169-179, 2020.
  14. ISO/IEC/IEEE 29148:2018. Systems and software engineering. Life cycle processes. Requirements engineering (2018) [Internet], https://www.iso.org/standard/72089.html
  15. M. I. Limaylla-Lunarejo, N. Condori-Fernandez, and M. R. Luaces, "Towards an automatic requirements classification in a new Spanish dataset," 2022 IEEE 30th International Requirements Engineering Conference (RE), IEEE, 2022.
  16. J. Devlin, M. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Vol.1 (Long and Short Papers), pp.4171-4186, 2019.
  17. kiyoungkim1/LMKor, GitHub, 2022. [Internet], https://github.com/kiyoungkim1/LMkor
  18. Beomi/KcBERT, GitHub, 2022. [Internet], https://github.com/Beomi/KcBERT
  19. Wordnet.kaist.ac.kr, 2022. [Internet], http://wordnet.kaist.ac.kr/
  20. J. Eisenschlos, S. Ruder, P. Czapla, M. Kadras, S. Gugger, and J. Howard, "MultiFiT: Efficient multi-lingual language model fine-tuning," in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp.5706-5711, 2019.