• Title/Summary/Keyword: Training Document

Efficient Hangul Word Processor (HWP) Malware Detection Using Semi-Supervised Learning with Augmented Data Utility Valuation (효율적인 HWP 악성코드 탐지를 위한 데이터 유용성 검증 및 확보 기반 준지도학습 기법)

  • JinHyuk Son;Gihyuk Ko;Ho-Mook Cho;Young-Kuk Kim
    • Journal of the Korea Institute of Information Security & Cryptology
    • /
    • v.34 no.1
    • /
    • pp.71-82
    • /
    • 2024
  • With the advancement of information and communication technology (ICT), the use of electronic document types such as PDF, MS Office, and HWP files has increased. This trend has led cyber attackers to increasingly spread malicious documents through e-mail and messengers. To counter such attacks, AI-based methodologies have been actively employed to detect malicious document files. The main challenge in detecting malicious HWP (Hangul Word Processor) files is the lack of quality datasets, because HWP is used mainly in Korea, whereas PDF and MS Office files are used worldwide. To address this limitation, data augmentation has been proposed to diversify training data by transforming existing datasets, but because the usefulness of the augmented data is not evaluated, augmented data can end up harming the model's performance. In this paper, we propose an effective semi-supervised learning technique for detecting malicious HWP document files, which improves overall AI model performance by quantifying the utility of augmented data and filtering out useless training data.

Host Plant-Antheraea mylitta Interactions and Its Effect on Reproductive and Commercial Parameters

  • Rath, S.S.;Singh, G.S.;Singh, S.S.;Singh, M.K.;Suryanarayana, N.;Vijayaprakash, N.B.
    • International Journal of Industrial Entomology and Biomaterials
    • /
    • v.17 no.2
    • /
    • pp.205-209
    • /
    • 2008
  • The impact of food plants on reproductive and commercial parameters in Antheraea mylitta, a polyphagous insect of economic importance, was studied by feeding the larvae on the same host plants for six continuous generations. A. mylitta larvae were fed on Terminalia tomentosa, Terminalia arjuna and Zizyphus jujuba and restricted to the same host plant for six generations to document the quantitative improvement in reproductive and commercial parameters. The parameters showed significant improvement on all the host plants studied over their respective controls. Among the reproductive parameters, fecundity improved the most (85.9% in T. tomentosa, 58% in T. arjuna and 49.7% in Z. jujuba). Likewise, among the commercial parameters, shell weight in males showed the highest improvement (by 52.9%, 45.8% and 42.1% in T. tomentosa, T. arjuna and Z. jujuba, respectively). On the other hand, the shell ratio percentage in females recorded the lowest improvement. The values for all characters declined in T. arjuna- and Z. jujuba-fed insects relative to T. tomentosa-fed ones, except that the shell ratio percentage in females registered an increase in Z. jujuba-fed insects. The study thus revealed the comparative superiority of T. tomentosa over T. arjuna and Z. jujuba.

An Analysis on the Factors Affecting Online Search Effect (온라인 정보탐색의 효과변인 분석)

  • Kim Sun-Ho
    • Journal of the Korean Society for Library and Information Science
    • /
    • v.22
    • /
    • pp.361-396
    • /
    • 1992
  • The purpose of this study is to verify the correlations between the amount of online searchers' search experience and their search effect. To this end, 28 online searchers working at selected libraries and information centers participated in the study as subjects. The subjects were classified into two types of cognitive style by the Group Embedded Figures Test (GEFT). As a result of the GEFT, two groups were identified: 15 Field Independence (FI) searchers and 13 Field Dependence (FD) searchers. The subjects' search experience consists of three elements: disciplinary, training, and working experience. To obtain data on these elements, a questionnaire was sent to the 28 subjects. An online search request form prepared by an actual user was sent to all subjects, who searched overseas databases through Dialog to retrieve what was requested. The search results were collected and sent back to the user, who evaluated the relevance and pertinence of each individual's search effect. In this study, the search effect is divided into relevance and pertinence. Relevance is subdivided into three elements: the number of relevant documents, the recall ratio, and the cost per relevant document. Pertinence is subdivided into three elements: the number of pertinent documents, the utility ratio, and the cost per pertinent document. The correlations between the three elements of the subjects' experience and the six elements of the search effect were analysed separately for the FI and the FD searchers. At the 0.01 significance level, the findings and conclusions of the study are summarised as follows: 1. There are strong correlations between the amount of training and the recall ratio, the number of pertinent documents, and the utility ratio among FI searchers. 2. There are strong correlations between the amount of working experience and the number of relevant documents and the recall ratio among FD searchers. However, there is also a significant inverse correlation between the amount of working experience and the search cost per pertinent document among FD searchers. 3. Among FD searchers, the amount of working experience has stronger correlations with the number of pertinent documents and the utility ratio than the amount of training does. 4. There is a strong correlation between the amount of training and pertinence for both FI and FD searchers.

Utilizing Unlabeled Documents in Automatic Classification with Inter-document Similarities (문헌간 유사도를 이용한 자동분류에서 미분류 문헌의 활용에 관한 연구)

  • Kim, Pan-Jun;Lee, Jae-Yun
    • Journal of the Korean Society for Information Management
    • /
    • v.24 no.1 s.63
    • /
    • pp.251-271
    • /
    • 2007
  • This paper studies the problem of classifying documents with labeled and unlabeled learning data, especially with regard to using document similarity features. The problem of using unlabeled data is practically important because in many information systems obtaining training labels is expensive, while large quantities of unlabeled documents are readily available. A general semi-supervised learning algorithm has two steps. First, it trains a classifier using the available labeled documents and classifies the unlabeled documents. Then, it trains a new classifier using all the training documents, which were labeled either manually or automatically. We suggest two types of semi-supervised learning algorithm that use document similarity features. The first is one-step semi-supervised learning, which uses unlabeled documents only to generate document similarity features. The second is two-step semi-supervised learning, which uses unlabeled documents as learning examples as well as for similarity features. Experimental results, obtained using support vector machines and a naive Bayes classifier, show that a small set of labeled documents combined with a large set of unlabeled documents yields better performance than supervised learning that uses labeled data only. When the efficiency of a classifier system is considered, the one-step semi-supervised learning algorithm suggested in this study could be a good solution for improving classification performance with unlabeled documents.
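
The general two-step procedure described in this abstract (train on labeled documents, label the unlabeled ones automatically, then retrain on everything) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the small naive Bayes classifier, the function names `train_nb`, `predict_nb`, and `two_step_self_training`, and the toy documents are all assumptions made for the example, and the document similarity features studied in the paper are not modeled.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Fit a multinomial naive Bayes model (hypothetical helper)."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in zip(docs, labels):
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def predict_nb(model, text):
    """Return the most probable class, using add-one smoothing."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for token in text.lower().split():
            if token in vocab:  # tokens never seen in training are ignored
                score += math.log((word_counts[label][token] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def two_step_self_training(labeled_docs, labels, unlabeled_docs):
    # Step 1: train on the labeled documents and classify the unlabeled ones.
    first_model = train_nb(labeled_docs, labels)
    pseudo_labels = [predict_nb(first_model, d) for d in unlabeled_docs]
    # Step 2: retrain on manually and automatically labeled documents together.
    final_model = train_nb(labeled_docs + unlabeled_docs, labels + pseudo_labels)
    return final_model, pseudo_labels

# Toy data, invented for illustration only.
labeled = ["cheap pills offer", "meeting agenda notes"]
labels = ["spam", "ham"]
unlabeled = ["cheap offer today", "agenda for the meeting", "today pills"]

final_model, pseudo_labels = two_step_self_training(labeled, labels, unlabeled)
print(pseudo_labels)
print(predict_nb(final_model, "today offer"))
```

In this toy run the word "today" appears only in unlabeled documents, so the retrained model can use it as evidence while the first model cannot. A practical system would usually also filter pseudo-labels by confidence before retraining.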

The Study on the Effective Automatic Classification of Internet Document Using the Machine Learning (기계학습을 기반으로 한 인터넷 학술문서의 효과적 자동분류에 관한 연구)

  • 노영희
    • Journal of Korean Library and Information Science Society
    • /
    • v.32 no.3
    • /
    • pp.307-330
    • /
    • 2001
  • This study evaluated the performance of categorization methods using the kNN classifier. Most sample-based automatic text categorization techniques, like the kNN classifier, reduce the feature set of the training documents. We sought to find out which percentage reductions in the feature set would result in high performance. In addition, the kNN classifier has to find the k training documents most similar to each test document. We sought to verify the most appropriate k value through experiments.
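
To make the role of k concrete, here is a minimal kNN text classifier sketch. It assumes bag-of-words vectors and cosine similarity, neither of which is specified in the abstract; the function names and the toy training set are hypothetical, and feature-set reduction is not shown.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def knn_classify(train, query, k=3):
    """Label a query by majority vote among its k most similar training documents."""
    q = Counter(query.lower().split())
    ranked = sorted(
        train,
        key=lambda item: cosine(Counter(item[0].lower().split()), q),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy training set, invented for illustration only.
train = [
    ("library catalog search index", "library"),
    ("library user search training", "library"),
    ("catalog index database", "library"),
    ("neural network training data", "ml"),
    ("network classifier training", "ml"),
]

print(knn_classify(train, "classifier training data", k=3))
print(knn_classify(train, "classifier training data", k=5))
```

With k=3 the two closest "ml" documents win the vote, but with k=5 every training document votes and the larger "library" class wins, which is one reason the most appropriate k has to be found experimentally.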

An Ensemble Approach for Cyber Bullying Text messages and Images

  • Zarapala Sunitha Bai;Sreelatha Malempati
    • International Journal of Computer Science & Network Security
    • /
    • v.23 no.11
    • /
    • pp.59-66
    • /
    • 2023
  • Text mining (TM) is widely used to find patterns in text documents. Cyber-bullying is the term for abusing a person on an online or offline platform. Cyber-bullying has become increasingly dangerous to people who use social networking sites (SNS). It takes many forms, such as text messages, morphed images, and morphed videos. Preventing this type of abuse on online SNS is a very difficult task. Finding accurate text mining patterns gives better results in detecting cyber-bullying on any platform. Cyber-bullying has grown with online SNS, which are used to send defamatory statements, to bully other persons verbally, or to abuse them in front of other SNS users. Deep Learning (DL) is one of the significant domains used to extract and learn quality features dynamically from low-level text inclusions. In this scenario, convolutional neural networks (CNN) are used for training on text data, images, and videos. CNN is a very powerful approach to training on these types of data and has achieved better text classification. In this paper, an ensemble model is introduced that integrates Term Frequency (TF)-Inverse Document Frequency (IDF) and a Deep Neural Network (DNN) with advanced feature-extraction techniques to classify bullying text, images, and videos. The proposed approach also focuses on reducing training time and memory usage, which helps improve classification.

Use Analysis and Evaluation of MEDLIS(MEDical Library Information System) Document Delivery Service (의학학술지종합정보시스템(MEDLIS)의 원문제공서비스 이용 분석과 평가)

  • Chang, Hye-Rhan;Kim, Jeong-A
    • Journal of the Korean Society for Library and Information Science
    • /
    • v.46 no.3
    • /
    • pp.233-250
    • /
    • 2012
  • The purpose of this study is to assess the development, current states, and problems of MEDLIS document delivery service. With the analysis of MEDLIS transaction data from 2001 to 2011, we identified continuous usage decrease, unbalanced contribution by type of institution, high dependence on back issues, use differences among subfields of medicine, relatively low success rate, and various reasons for failure. Based on the results, recommendations for the maintenance of union catalog database, technical support for search capability enhancements, establishment of back issue archiving policy, user training and publicity, and membership expansion are suggested to promote the service.

News Recommendation Exploiting Document Summarization based on Deep Learning (딥러닝 기반의 문서요약기법을 활용한 뉴스 추천)

  • Heu, Jee-Uk
    • The Journal of the Institute of Internet, Broadcasting and Communication
    • /
    • v.22 no.4
    • /
    • pp.23-28
    • /
    • 2022
  • Recently, smart devices (such as smartphones and tablet PCs) have come to serve as information gateways, and reading web news through web portals has become increasingly important for many users. However, the sheer quantity of news created on the web makes it hard for users to find the information they want and confuses them with similar and repeated content. In this paper, we propose a news recommendation system using document summarization based on KoBART, which delivers selected news to users from the candidate news on a news portal. Our proposed system shows higher performance and recommends news efficiently by pre-training and fine-tuning KoBART on collected news data.

An Optimal Weighting Method in Supervised Learning of Linguistic Model for Text Classification

  • Mikawa, Kenta;Ishida, Takashi;Goto, Masayuki
    • Industrial Engineering and Management Systems
    • /
    • v.11 no.1
    • /
    • pp.87-93
    • /
    • 2012
  • This paper discusses a new weighting method for text analysis from the viewpoint of supervised learning. The term frequency-inverse document frequency measure (tf-idf measure) is a well-known weighting method for information retrieval, and it can also be used for text analysis. However, it is an empirical weighting method for information retrieval whose effectiveness has not been clarified from a theoretical viewpoint. Therefore, a more effective weighting measure may be obtainable for document classification problems. In this study, we propose an optimal weighting method for document classification problems from the viewpoint of supervised learning. Because it makes use of training data, the proposed measure is more suitable for text classification than the tf-idf measure. The effectiveness of our proposal is demonstrated by simulation experiments on text classification of newspaper articles and customer reviews posted on a web site.
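
For reference, the baseline tf-idf measure that this abstract compares against can be sketched as below. This shows one common textbook variant (relative term frequency multiplied by the natural log of N over document frequency), not the optimal weighting proposed in the paper; the function name `tf_idf` and the toy documents are assumptions made for the example.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute tf-idf weights per document (one common variant)."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights

# Toy documents, invented for illustration only.
docs = ["the game was good", "the review was bad", "the game review"]
w = tf_idf(docs)
print(w[0])
```

A term that occurs in every document, such as "the" here, gets weight zero, while a term unique to one document gets the largest weight. Note that this weighting never looks at class labels, which is the gap a supervised weighting method aims to address.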

An Active Learning-based Method for Composing Training Document Set in Bayesian Text Classification Systems (베이지언 문서분류시스템을 위한 능동적 학습 기반의 학습문서집합 구성방법)

  • 김제욱;김한준;이상구
    • Journal of KIISE:Software and Applications
    • /
    • v.29 no.12
    • /
    • pp.966-978
    • /
    • 2002
  • There are two important problems in improving text classification systems based on machine learning. The first, called the "selection problem", is how to select a minimum number of informative documents from a given document collection. The second, called the "composition problem", is how to reorganize the selected training documents so that they fit the adopted learning method. The former problem is addressed by "active learning" algorithms, and the latter by "boosting" algorithms. This paper proposes a new learning method, called AdaBUS, which proactively solves both problems in the context of Naive Bayes classification systems. The proposed method constructs a more accurate classification hypothesis by increasing the variance in the "weak" hypotheses that determine the final classification hypothesis. Consequently, the proposed algorithm yields a perturbation effect that makes the boosting algorithm work properly. Through an empirical experiment using the Reuters-21578 document collection, we show that the AdaBUS algorithm improves the Naive Bayes-based classification system more significantly than other conventional learning methods.
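
As a rough illustration of the "selection problem", the sketch below uses uncertainty sampling, a generic active learning heuristic that picks the documents a binary classifier is least sure about. It is not the AdaBUS algorithm, whose details the abstract does not give; the function name `select_most_uncertain` and the probability values are hypothetical.

```python
def select_most_uncertain(probabilities, n):
    """Return indices of the n documents whose predicted positive-class
    probability is closest to 0.5 (the most uncertain ones)."""
    ranked = sorted(
        range(len(probabilities)),
        key=lambda i: abs(probabilities[i] - 0.5),
    )
    return ranked[:n]

# Hypothetical classifier outputs for five unlabeled documents.
probs = [0.95, 0.52, 0.10, 0.45, 0.70]
chosen = select_most_uncertain(probs, 2)
print(chosen)
```

The selected documents would then be labeled by a person and added to the training set, so that labeling effort goes to the examples most likely to change the classifier.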