• Title/Summary/Keyword: Korean text classification

Research on text mining based malware analysis technology using string information (문자열 정보를 활용한 텍스트 마이닝 기반 악성코드 분석 기술 연구)

  • Ha, Ji-hee;Lee, Tae-jin
    • Journal of Internet Computing and Services / v.21 no.1 / pp.45-55 / 2020
  • With advances in information and communication technology, the number of new and variant malware samples is increasing rapidly every year, and the development of Internet of Things and cloud computing technology is spreading many types of malware. In this paper, we propose a malware analysis method based on string information, which can be used regardless of the operating system environment and which represents library call information related to malicious behavior. Attackers can easily create malware by reusing existing code or by using automated authoring tools, and the generated malware operates in a way similar to existing malware. Since most of the strings that can be extracted from malicious code consist of information closely related to malicious behavior, we weight the data features with a text mining based method to extract them as effective features for malware analysis. Based on the processed data, models are built with various machine learning algorithms to run experiments on malware detection and malware group classification. The data were compared and verified against all files used on Windows and Linux operating systems. Malware detection accuracy is about 93.5%, and group classification accuracy is about 90%. The proposed technique has a wide range of applications because it is relatively simple and fast, and because it works as a single, operating system independent model: there is no need to build a separate model for each group when classifying malware groups. In addition, since the string information is extracted through static analysis, it can be processed faster than analysis methods that execute the code directly.
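
The string-based pipeline described in the abstract above can be illustrated with a minimal sketch. The abstract does not specify the extraction rule or the weighting scheme, so the 4-character minimum length, the smoothed TF-IDF formula, and the toy byte blobs below are all assumptions chosen for illustration, not the authors' implementation.

```python
import math
import re
from collections import Counter

# Assumption: runs of 4+ printable ASCII bytes count as a "string",
# similar in spirit to the Unix `strings` tool.
PRINTABLE_RUN = re.compile(rb"[\x20-\x7e]{4,}")


def extract_strings(blob: bytes) -> list[str]:
    """Return printable ASCII strings found in a binary blob (static analysis only)."""
    return [m.decode("ascii") for m in PRINTABLE_RUN.findall(blob)]


def tfidf(docs: list[list[str]]) -> list[dict[str, float]]:
    """Weight each string by term frequency times a smoothed inverse document frequency."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weighted = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1  # avoid division by zero for files with no strings
        weighted.append({
            term: (count / total) * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return weighted


# Hypothetical byte blobs standing in for executable files.
samples = [
    b"\x00\x01CreateRemoteThread\x00\x02WriteProcessMemory\x00kernel32.dll\x00",
    b"\x7fELF\x02\x01socket\x00\x00connect\x00/bin/sh\x00\x03libc.so.6\x00",
    b"\x00kernel32.dll\x00\x05GetTickCount\x00\x01\x02",
]
docs = [extract_strings(blob) for blob in samples]
weights = tfidf(docs)
```

Strings that occur in fewer files receive a higher weight, so a rare API name contributes more to a file's feature vector than a library name shared by many files. The resulting dictionaries could then be fed to any standard classifier.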

A Classification Model for Attack Mail Detection based on the Authorship Analysis (작성자 분석 기반의 공격 메일 탐지를 위한 분류 모델)

  • Hong, Sung-Sam;Shin, Gun-Yoon;Han, Myung-Mook
    • Journal of Internet Computing and Services / v.18 no.6 / pp.35-46 / 2017
  • Recently, cyber attacks that attach malicious code to an e-mail and induce the user to execute it have been increasing. They are especially dangerous because a malicious document file attached to a mail is easy to execute. Authorship analysis is a research area in NLP (Natural Language Processing) and text mining that studies methods of identifying authors by analyzing sentences, texts, and documents in a specific language. An attack mail is written by the attacker. Therefore, by analyzing the contents of the mail and the attached document file and identifying the corresponding author, it is possible to find features that distinguish it more clearly from normal mail and to improve detection accuracy. In this paper, we propose the IADA2 (Intelligent Attack mail Detection based on Authorship Analysis) model for attack mail detection. We build a feature vector that can classify and detect attack mail from the features used in existing machine learning based spam detection models and the features used in authorship analysis of documents, and we present the IADA2 detection model. Whereas existing attack mail detection models simply detect term features, we improve on them by applying n-grams to extract features that reflect the sequence characteristics of words. Experimental results show that the proposed method improves performance with suitable feature combinations, feature selection techniques, and models.
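
The difference between plain term features and n-gram features mentioned in the abstract above can be sketched as follows. Whitespace tokenization, lowercasing, and the sample sentence are assumptions made for this illustration; the abstract does not describe the actual IADA2 feature extraction.

```python
from collections import Counter


def word_ngrams(text: str, n: int) -> list[tuple[str, ...]]:
    """Return the word n-grams of a whitespace-tokenized, lowercased text, in order."""
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_features(text: str, max_n: int = 2) -> Counter:
    """Count all n-grams for n = 1..max_n; the unigrams are the plain term features."""
    features = Counter()
    for n in range(1, max_n + 1):
        features.update(word_ngrams(text, n))
    return features


# Hypothetical mail body used only for illustration.
features = ngram_features("please open the attached invoice and open the file")
```

Unigram counts alone record that "open" and "the" each occur twice, while the bigram ("open", "the") additionally records that they occur together in that order, which is the kind of word-sequence information an authorship feature can exploit.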

The Study on Penetration Acupuncture - Classification and Indication (투자침법(透刺鍼法)에 관(關)한 문헌적(文獻的) 고찰(考察) -분류(分類)와 적응증(適應症)을 중심(中心)으로-)

  • Jun, Chul-Ki;Kim, Young-Suk;Choi, Do-Young;Park, Dong-Suk
    • Journal of Acupuncture Research / v.17 no.4 / pp.51-68 / 2000
  • To study the classification, indications, acupoints, and technique of penetration acupuncture, we searched related journals and textbooks. The results are as follows: 1. Penetration acupuncture (PA) can already be found in Okryongka (玉龍歌). 2. PA is classified according to the angle of insertion as perpendicular PA, oblique PA, and horizontal PA. 3. The effects of PA are an increase in the association between the involved acupoints, in the quantity of stimulation, and in the lesion of stimulation. 4. PA decreased the pain according to many acupoints. 5. The indications of PA were facial palsy and hemiparesis. Conclusion: PA was classified into perpendicular PA, oblique PA, and horizontal PA according to the angle of insertion, and the indications of PA were diseases such as facial palsy and hemiparesis.

Development of Classification Model for Healthcare Contents on the Online Community (온라인 커뮤니티에서의 건강 관련 콘텐츠 분류 모형 개발)

  • Kim, Tae-Yun;Kim, Yoo-Sin;Choi, Sang-Hyun;Kim, Do-Hun;Chang, You-Jin
    • The Journal of Information Systems / v.26 no.4 / pp.285-301 / 2017
  • Purpose: In this paper, we verified the reliability of healthcare-related information provided by various users on Naver Jisikin, a representative Korean search platform. Based on Q&A content, we validated the reliability of answers to questions about lung cancer with the help of professors at a medical school. Design/methodology/approach: In the content analysis, questions are classified into types such as symptom/diagnosis, therapy, prognosis, after-management, and so on. The answers contain advice, advertisement, oriental medicine, and religion as well as the above 5 question categories. Validating the medical evidence for each answer shows that only 49% of all answers have medical grounds. Findings: We classified the medically grounded answers into three levels: high, medium, and low. Among all answers, those containing advertisements need to be identified because they can be harmful to patients. Using text mining, we developed a method for selecting the answers that contain advertising content. The selection model achieves 84% classification accuracy.

The Auto Regressive Parameter Estimation and Pattern Classification of EKG Signals for Automatic Diagnosis (심전도 신호의 자동분석을 위한 자기회귀모델 변수추정과 패턴분류)

  • 이윤선;윤형로
    • Journal of Biomedical Engineering Research / v.9 no.1 / pp.93-100 / 1988
  • This paper presents the results of pattern discriminant analysis of an AR (autoregressive) model parameter group that represents HRV (heart rate variability), treated as time series data. HRV data were extracted using the correct R-point of the EKG wave, which was A/D converted from the I/O port by both hardware and software functions. The number of data points (N) and the optimal order (P) used for the analysis were determined using Burg's maximum entropy method and Akaike's Information Criterion test. Representative values were extracted from the distribution of the results. These values were in turn used as the index for determining the range of the pattern discriminant analysis. By carrying out the pattern discriminant analysis, the performance of clustering was checked, creating the text pattern where the clustering was optimum. The analysis results showed, first, that the HRV data were sufficient to ensure the stationarity of the data and, next, that the pattern discriminant analysis was able to discriminate even though the optimal order of each syndrome was dissimilar.
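
The order selection mentioned in the abstract above can be written out in the usual textbook form. This is a general statement of an AR model of order P and of Akaike's criterion, not a formula quoted from the paper; the symbols x_t (the HRV series), a_k (AR coefficients), e_t (residual), and \hat{\sigma}_P^2 (estimated residual variance at order P) are notation assumed here.

```latex
x_t = \sum_{k=1}^{P} a_k \, x_{t-k} + e_t,
\qquad
\mathrm{AIC}(P) = N \ln \hat{\sigma}_P^{2} + 2P
```

Under this form, the order P with the smallest AIC value is taken as optimal: the first term rewards a better fit over the N data points, and the second penalizes additional parameters.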

A Dynamic Recommendation Agent System for E-Mail Management based on Rule Filtering Component (이메일 관리를 위한 룰 필터링 컴포넌트 기반 능동형 추천 에이전트 시스템)

  • Jeong, Ok-Ran;Cho, Dong-Sub
    • Proceedings of the KIEE Conference / 2004.05a / pp.126-128 / 2004
  • As e-mail becomes increasingly important in everyday life, mail users spend more and more time organizing and classifying the e-mails they receive into folders. Most existing recommendation systems and text classification methods focus on recommending products for commercial purposes or on web documents. This study therefore aims to apply these techniques to e-mail, where users need them more. As an effective method for e-mail management, this paper proposes a dynamic recommendation agent system based on a Rule Filtering Component that recommends the relevant category when a new e-mail is received, so that users can directly manage the optimum classification. Moreover, we try to improve accuracy by reducing misclassification, which can be a key problem in classifying e-mails by category. While the existing Bayesian learning algorithm mostly uses a fixed threshold, we show that replacing the fixed threshold with a dynamic threshold increases accuracy and improves user satisfaction. We designed the main modules around the rule filtering component for enhanced scalability and reusability of the system.

Towards Improving Causality Mining using BERT with Multi-level Feature Networks

  • Ali, Wajid;Zuo, Wanli;Ali, Rahman;Rahman, Gohar;Zuo, Xianglin;Ullah, Inam
    • KSII Transactions on Internet and Information Systems (TIIS) / v.16 no.10 / pp.3230-3255 / 2022
  • Causality mining in NLP is a significant area of interest that benefits many everyday applications, including decision making, business risk management, question answering, future event prediction, scenario generation, and information retrieval. Mining those causalities was a challenging and open problem for prior non-statistical and statistical techniques using web sources, which required hand-crafted linguistic patterns for feature engineering, depended on domain knowledge, and demanded much human effort. Those studies overlooked implicit, ambiguous, and heterogeneous causality and focused on explicit causality mining. In contrast to statistical and non-statistical approaches, we present Bidirectional Encoder Representations from Transformers (BERT) integrated with Multi-level Feature Networks (MFN), called BERT+MFN, for causality recognition in noisy and informal web datasets without human-designed features. In our model, MFN consists of a three-column knowledge-oriented network (TC-KN), bi-LSTM, and Relation Network (RN) that mine causality information at the segment level. BERT captures semantic features at the word level. We perform experiments on Alternative Lexicalization (AltLexes) datasets. The experimental outcomes show that our model outperforms baseline causality and text mining techniques.

A Transformer-Based Emotion Classification Model Using Transfer Learning and SHAP Analysis (전이 학습 및 SHAP 분석을 활용한 트랜스포머 기반 감정 분류 모델)

  • Subeen Leem;Byeongcheon Lee;Insu Jeon;Jihoon Moon
    • Annual Conference of KIPS / 2023.05a / pp.706-708 / 2023
  • In this study, we embark on a journey to uncover the essence of emotions by exploring the depths of transfer learning on three pre-trained transformer models. Our quest to classify five emotions culminates in discovering the KLUE (Korean Language Understanding Evaluation)-BERT (Bidirectional Encoder Representations from Transformers) model, which is the most exceptional among its peers. Our analysis of F1 scores attests to its superior learning and generalization abilities on the experimental data. To delve deeper into the mystery behind its success, we employ the powerful SHAP (Shapley Additive Explanations) method to unravel the intricacies of the KLUE-BERT model. The findings of our investigation are presented with a mesmerizing text plot visualization, which serves as a window into the model's soul. This approach enables us to grasp the impact of individual tokens on emotion classification and provides irrefutable, visually appealing evidence to support the predictions of the KLUE-BERT model.

Developing and Evaluating Damage Information Classifier of High Impact Weather by Using News Big Data (재해기상 언론기사 빅데이터를 활용한 피해정보 자동 분류기 개발)

  • Su-Ji, Cho;Ki-Kwang Lee
    • Journal of Korean Society of Industrial and Systems Engineering / v.46 no.3 / pp.7-14 / 2023
  • Recently, the importance of impact-based forecasting has increased as the socio-economic impacts of severe weather have emerged. As news articles contain unstructured information closely related to people's lives, this study developed and evaluated a binary classification algorithm for snowfall damage information using text mining of media articles. We collected news articles from 2009 to 2021 that contain 'heavy snow' in the body text and labelled whether each article corresponds to specific damage fields such as car accidents. To develop the classifier, we proposed a probability-based classifier built on the ratio of two conditional probabilities, defined as the I/O Ratio in this study. During construction, we also adopted the n-gram approach to consider the contextual meaning of each keyword. The accuracy of the classifier was 75%, supporting the possibility of applying news big data to impact-based forecasting. We expect the performance of the classifier to improve in further research as more varied training data accumulate. The results of this study can be readily extended by applying the same methodology to other disasters in the future. Furthermore, the results can help reduce the social and economic damage of high impact weather by supporting the establishment of an integrated meteorological decision support system.
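
A classifier built on the ratio of two conditional probabilities over n-grams, as the abstract above describes, might look like the sketch below. The abstract does not give the exact definition of the I/O Ratio, so this is only one plausible reading: the ratio P(bigram | damage-related article) / P(bigram | other article) with add-one smoothing, summed in log space. The function names, the smoothing, the zero threshold, and the toy articles are all assumptions.

```python
import math
from collections import Counter


def bigrams(text: str) -> set[tuple[str, str]]:
    """Distinct word bigrams of a whitespace-tokenized, lowercased text."""
    tokens = text.lower().split()
    return set(zip(tokens, tokens[1:]))


def fit_log_ratios(articles: list[str], labels: list[int]) -> dict[tuple[str, str], float]:
    """log[P(bigram | label 1) / P(bigram | label 0)] with add-one smoothing (assumed form)."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    pos_counts, neg_counts = Counter(), Counter()
    for text, label in zip(articles, labels):
        (pos_counts if label else neg_counts).update(bigrams(text))
    vocab = set(pos_counts) | set(neg_counts)
    return {
        g: math.log((pos_counts[g] + 1) / (n_pos + 2))
        - math.log((neg_counts[g] + 1) / (n_neg + 2))
        for g in vocab
    }


def classify(text: str, log_ratios: dict, threshold: float = 0.0) -> int:
    """Return 1 when the summed log ratios of the article's known bigrams exceed the threshold."""
    score = sum(log_ratios.get(g, 0.0) for g in bigrams(text))
    return int(score > threshold)


# Hypothetical training articles: 1 = mentions car-accident damage, 0 = does not.
articles = [
    "heavy snow caused a car accident on the highway",
    "car accident reported after heavy snow overnight",
    "heavy snow expected this weekend in the mountains",
    "ski resorts welcome heavy snow this weekend",
]
labels = [1, 1, 0, 0]
log_ratios = fit_log_ratios(articles, labels)
```

In this toy data the bigram ("heavy", "snow") appears equally in both classes and gets a log ratio of zero, so the decision is driven by bigrams such as ("car", "accident") that occur mostly in one class.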

Developing a Text Categorization System Based on Unsupervised Learning Using an Information Retrieval Technique (정보검색 기술을 이용한 비지도 학습 기반 문서 분류 시스템 개발)

  • Noh, Dae-Wook;Lee, Soo-Yong;Ra, Dong-Yul
    • Journal of KIISE:Software and Applications / v.34 no.2 / pp.160-168 / 2007
  • Developing a text classifier with supervised learning requires a large manually labeled corpus, which takes a lot of time and human effort to build. Recently, a research paradigm was proposed that uses a raw corpus and a small amount of seed information instead of a manually labeled corpus. In this paper, we introduce an unsupervised learning method that achieves better performance than other related work. The characteristic of our approach is that average mutual information is used to learn representative words and their weights, and the weights are then updated using a technique inspired by work in information retrieval. By iterating this learning process, it was shown that a high performance system can be developed.
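
The use of average mutual information to find representative words, mentioned in the abstract above, can be illustrated with a small sketch. It computes the standard mutual information between the presence of a word and a category assignment. The category labels here stand in for whatever provisional assignments a seed-based method would produce; that substitution, the function name, and the toy documents are assumptions, and the paper's weight update step is not reproduced.

```python
import math


def average_mutual_information(docs: list[set[str]], labels: list[str], word: str) -> float:
    """Mutual information in bits between presence of `word` and the category label."""
    n = len(docs)
    mi = 0.0
    for present in (True, False):
        # Probability that the word is present (or absent) in a document.
        p_w = sum((word in d) == present for d in docs) / n
        for category in set(labels):
            p_c = labels.count(category) / n
            # Joint probability of this presence state and this category.
            p_wc = sum(
                (word in d) == present and label == category
                for d, label in zip(docs, labels)
            ) / n
            if p_wc > 0:
                mi += p_wc * math.log2(p_wc / (p_w * p_c))
    return mi


# Hypothetical documents (as word sets) with provisional category assignments.
docs = [{"goal", "match"}, {"goal", "team"}, {"vote", "election"}, {"vote", "team"}]
labels = ["sports", "sports", "politics", "politics"]
mi_goal = average_mutual_information(docs, labels, "goal")
mi_team = average_mutual_information(docs, labels, "team")
```

A word such as "goal" that appears only in one category carries high mutual information and is a good candidate for a representative word, while "team", which is spread evenly across categories, carries none.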