• Title/Summary/Keyword: document classification


A Research for Web Documents Genre Classification using STW (STW를 이용한 웹 문서 장르 분류에 관한 연구)

  • Ko, Byeong-Kyu;Oh, Kun-Seok;Kim, Pan-Koo
    • Journal of Information Technology and Architecture / v.9 no.4 / pp.413-422 / 2012
  • Many researchers have studied how to let machines understand the meaning of human natural language, using text-based, page-rank-based, and other approaches. In particular, URL and HTML tag information in web documents is attracting attention again as a means of analyzing huge numbers of web documents automatically. In this paper, we propose an STW (Semantic Term Weight) approach based on the syntactic and linguistic structure of web documents in order to classify their genres. For the evaluation, we used more than 1,000 documents from the 20-Genre-collection corpus to train an SVM-based classifier. Afterwards, we tested on the KI-04 corpus to evaluate the performance of the proposed method. Accuracy was measured in two experiments, one using STW and one without STW. As a result, the proposed STW-based approach showed an accuracy approximately 10.2% higher than the approach without STW.
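The idea of weighting terms by where they appear in a web document's markup can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not give the STW formula, so the per-tag weights in `TAG_WEIGHTS` and the helper `weighted_terms` are hypothetical, and the resulting weights would still have to be fed to an SVM as features.

```python
from collections import Counter
from html.parser import HTMLParser

# Hypothetical per-tag weights; the paper's actual STW values are not given here.
TAG_WEIGHTS = {"title": 3.0, "h1": 2.0, "a": 1.5}


class TagWeightedTerms(HTMLParser):
    """Accumulate term weights, boosting terms inside selected HTML tags."""

    def __init__(self):
        super().__init__()
        self.stack = []            # currently open tags
        self.weights = Counter()   # term -> accumulated weight

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag, if there is one.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        # Use the largest weight among the enclosing tags (default 1.0).
        w = max([TAG_WEIGHTS.get(t, 1.0) for t in self.stack], default=1.0)
        for term in data.lower().split():
            self.weights[term] += w


def weighted_terms(html):
    """Return {term: weight} for one HTML document."""
    parser = TagWeightedTerms()
    parser.feed(html)
    parser.close()
    return dict(parser.weights)
```

A term that occurs once in the title and once in the body ends up with a larger weight than a term that occurs only in the body, which is the kind of structural signal the abstract describes.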

Study on Security Grade Classification of Financial Company Documents (금융기관 문서 보안등급 분류에 관한 연구)

  • Kang, Bu Il;Kim, Seung Joo
    • Journal of the Korea Institute of Information Security & Cryptology / v.24 no.6 / pp.1319-1328 / 2014
  • While recent advances in network systems have made it easier to collect and process personal information, the losses suffered by customers, financial companies, and even nations from personal information leakage are growing. Measures are therefore required to prevent additional damage from the illegal use of leaked personal information. Currently, financial companies apply access control based on job title or position to general documents as well as to important documents containing personal information. As a result, even if a document is confidential, anyone with the same job title or position can legitimately access it. This paper proposes assigning security grades to documents to improve the current access control system, which will help prevent the leakage of personal information.
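The difference between position-based access control and access control with an added document security grade can be sketched as follows. The grade names, the `User` and `Document` types, and both check functions are hypothetical illustrations, not the scheme defined in the paper.

```python
from dataclasses import dataclass
from enum import IntEnum


class Grade(IntEnum):
    """Hypothetical security grades, ordered from least to most sensitive."""
    PUBLIC = 0
    INTERNAL = 1
    CONFIDENTIAL = 2


@dataclass(frozen=True)
class User:
    position: str
    clearance: Grade


@dataclass(frozen=True)
class Document:
    allowed_positions: frozenset
    grade: Grade


def can_access_by_position(user, doc):
    """Current practice described in the abstract: position alone decides."""
    return user.position in doc.allowed_positions


def can_access_with_grade(user, doc):
    """Position check plus a document security grade check (assumed rule)."""
    return can_access_by_position(user, doc) and user.clearance >= doc.grade
```

Under the position-only check, two users with the same position always get the same answer; with the grade check, a confidential document is withheld from the one whose clearance is lower.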

The Classification System of the Official Documents in the Colonial Period (일제하 조선총독부의 공문서 분류방식)

  • Park, Sung-jin
    • The Korean Journal of Archival Studies / no.5 / pp.179-208 / 2002
  • In this paper, I explain the dominating/dominated relationship between Japan and colonized Korea by analysing the management system of official documents. I examine the theory and practice of the classification used by the office of the Governor-General for preserving official documents whose production and circulation had ended. In summary, first, the office of the Governor-General and its municipal authorities classified and filed documents according to the nature of the organizations and the regulations on their apportionment. The apportionment of the central and local organs was not fixed throughout the colonial period but changed over time. The organization and apportionment of the central and local organs reflected changes in colonial policy. As a result, even within the same organ, the composition of documents differed from period to period. The essential way of classifying documents in the colonial period was to sort official documents that were to be preserved serially and successively according to each function of the colonial authorities. The filing of documents directly reflected the organization and apportionment of functions among the branches of the office of the Governor-General and other governmental organs. However, because filing was guided at the level of the organ, the members of each organ responsible for documents rarely formed a filing unit as a sub-category of the organ itself. Second, Japan constructed the infrastructure of colonial rule through the management system of official documents. After the Kabo Reform, the management system of official documents followed the same principles as those of Japan proper. The office of the Governor-General not only adopted several regulations on the management of official documents, but also controlled the arrangement and state of document management in local governmental organizations through constant censorship.
The management system of documents was fundamentally based on the reality of colonial rule and neglected many principles of archival science. For example, the office of the Governor-General labelled many policy documents as classified and burnt them solely for administrative and managerial purposes. These practices were inherited by the document management system of post-colonial Korea and resulted in the scrapping of official documents in large quantities because the system produced too many "classified documents".

An Ensemble Approach for Cyber Bullying Text messages and Images

  • Zarapala Sunitha Bai;Sreelatha Malempati
    • International Journal of Computer Science & Network Security / v.23 no.11 / pp.59-66 / 2023
  • Text mining (TM) is widely used to find patterns in text documents. Cyber-bullying is the term used for abusing a person on an online or offline platform. It is becoming more dangerous to people who use social networking sites (SNS). Cyber-bullying takes many forms, such as text messages, morphed images, and morphed videos, and preventing this type of abuse on SNS is very difficult. Finding accurate text mining patterns gives better results in detecting cyber-bullying on any platform. On SNS, cyber-bullying involves sending defamatory statements, verbally bullying other people, or abusing someone in front of other SNS users. Deep learning (DL) is a significant domain used to extract and learn quality features dynamically from low-level text inclusions. In this scenario, convolutional neural networks (CNN) are used for training on text data, images, and videos. CNN is a powerful approach for training on these types of data and has achieved better text classification. In this paper, an ensemble model is introduced that integrates term frequency (TF)-inverse document frequency (IDF) and a deep neural network (DNN) with advanced feature-extraction techniques to classify bullying text, images, and videos. The proposed approach also focuses on reducing training time and memory usage, which helps improve classification.
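The TF-IDF part of such a model can be sketched in a few lines. This is a minimal illustration that assumes the common definitions tf = count / document length and idf = log(N / document frequency); the abstract does not say which TF-IDF variant the authors use, and the DNN stage is not shown. The function name `tf_idf` is illustrative.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumes tf = count / len(doc) and idf = log(N / df); other variants exist.
    """
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append(
            {t: (c / len(doc)) * math.log(n / df[t]) for t, c in counts.items()}
        )
    return vectors
```

Terms that occur in every message get weight zero, so the vectors emphasize the words that distinguish one message from another before they are passed to a classifier.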

A Korean Document Sentiment Classification System based on Semantic Properties of Sentiment Words (감정 단어의 의미적 특성을 반영한 한국어 문서 감정분류 시스템)

  • Hwang, Jae-Won;Ko, Young-Joong
    • Journal of KIISE:Software and Applications / v.37 no.4 / pp.317-322 / 2010
  • This paper proposes a way to improve the performance of a Korean document sentiment classification system using the semantic properties of sentiment words. A sentiment word is a word that carries sentiment, and sentiment features are defined as a set of sentiment words, which are an important lexical resource for sentiment classification. A sentiment feature has a different sentiment intensity in the general field and in a specific domain. In the general field, we can estimate the sentiment intensity using snippets from a search engine, while in a specific domain, training data can be used for this estimation. The estimated sentiment intensity of a sentiment feature is called its semantic orientation and is used to estimate the sentiment intensity of the sentences in text documents. After estimating the sentiment intensity of the sentences, we apply it to the weights of the sentiment features. We evaluate our system in three cases (general, domain-specific, and combined general/domain-specific semantic orientation) using a support vector machine. Our experimental results show improved performance in all cases; in particular, with general/domain-specific semantic orientation, the proposed method performs 3.1% better than a baseline system indexed by content words only.
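For the domain-specific case, one common way to estimate a word's semantic orientation from labeled training data is a smoothed log-odds ratio between positive and negative documents. The sketch below uses that formula as an assumption; the abstract does not state the estimator the authors use, and `semantic_orientation` is a hypothetical helper.

```python
import math
from collections import Counter


def semantic_orientation(pos_docs, neg_docs):
    """Smoothed log-odds of each term appearing in positive vs. negative text.

    Assumed estimator (add-one smoothing); not necessarily the paper's formula.
    Positive values lean positive, negative values lean negative.
    """
    pos = Counter(t for doc in pos_docs for t in doc)
    neg = Counter(t for doc in neg_docs for t in doc)
    n_pos, n_neg = sum(pos.values()), sum(neg.values())
    vocab = set(pos) | set(neg)
    v = len(vocab)
    return {
        t: math.log(((pos[t] + 1) / (n_pos + v)) / ((neg[t] + 1) / (n_neg + v)))
        for t in vocab
    }
```

Scores of this kind could then be used to weight sentiment features before training a classifier, in the spirit of the approach described above.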

A Study on Spam Document Classification Method using Characteristics of Keyword Repetition (단어 반복 특징을 이용한 스팸 문서 분류 방법에 관한 연구)

  • Lee, Seong-Jin;Baik, Jong-Bum;Han, Chung-Seok;Lee, Soo-Won
    • The KIPS Transactions:PartB / v.18B no.5 / pp.315-324 / 2011
  • In the Web environment, a flood of spam causes serious social problems such as personal information leaks, monetary loss from phishing, and the distribution of harmful content. Moreover, the types and techniques of spam distribution that must be controlled change day by day. Learning-based spam classification using the Bag-of-Words model has been the most widely used method so far. However, this method is vulnerable to the techniques for evading anti-spam filters that recent spam commonly employs, because it classifies spam documents using only keyword occurrence information from the classification model training process. In this paper, we propose a spam document detection method that uses the word repetition characteristic of spam documents as a countermeasure to these evasion techniques. Most recent spam documents tend to repeat the key phrases they are designed to spread, and this tendency can be used as a measure for classifying spam documents. We define six variables that represent characteristics of word repetition and use them as the feature set for constructing a classification model. The effectiveness of the proposed method is evaluated in an experiment with blog posts and e-mail data. The experimental results show that the proposed method outperforms other approaches.
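Word repetition can be turned into numeric features along the following lines. The abstract does not list the six variables the authors define, so the three features below (`top_term_ratio`, `repeat_ratio`, `longest_run`) are hypothetical stand-ins chosen only to illustrate the idea.

```python
from collections import Counter
from itertools import groupby


def repetition_features(tokens):
    """Compute simple word-repetition features for one tokenized document.

    These three features are illustrative, not the paper's six variables.
    """
    if not tokens:
        return {"top_term_ratio": 0.0, "repeat_ratio": 0.0, "longest_run": 0}
    counts = Counter(tokens)
    total = len(tokens)
    return {
        # Share of the document taken up by its most frequent word.
        "top_term_ratio": counts.most_common(1)[0][1] / total,
        # Share of tokens that repeat an earlier word.
        "repeat_ratio": 1 - len(counts) / total,
        # Longest run of the same word appearing consecutively.
        "longest_run": max(len(list(group)) for _, group in groupby(tokens)),
    }
```

For example, `repetition_features("buy buy buy now".split())` gives a `top_term_ratio` of 0.75, a `repeat_ratio` of 0.5, and a `longest_run` of 3; feature vectors like this can be passed to any standard classifier.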

Self Introduction Essay Classification Using Doc2Vec for Efficient Job Matching (Doc2Vec 모형에 기반한 자기소개서 분류 모형 구축 및 실험)

  • Kim, Young Soo;Moon, Hyun Sil;Kim, Jae Kyeong
    • Journal of Information Technology Services / v.19 no.1 / pp.103-112 / 2020
  • Job seekers make various efforts to find a good company, and companies attempt to recruit good people. Job search activity through self-introduction essays is nowadays one of the most active processes. Companies spend time and money reviewing the numerous self-introduction essays of job seekers, and job seekers worry about whether companies will accept their essays. This research builds a classification model and conducts experiments to classify self-introduction essays as pass or fail using deep learning and decision tree techniques. Real-world data were sampled using stratified sampling to alleviate the data imbalance between passed and failed essays. Documents were embedded using the Doc2Vec method, which was developed from Word2Vec, and classified using logistic regression. A decision tree model was chosen as the benchmark, and K-fold cross-validation was conducted for performance evaluation. Across several experiments, the area under the curve (AUC) of PV-DM was better than that of the other Doc2Vec models, i.e., PV-DBOW and Concatenate. Furthermore, PV-DM classifies passed essays as well as failed essays, while PV-DBOW cannot classify passed essays even though it classifies failed essays well. In addition, the classification performance of the logistic regression model with PV-DM embeddings is better than that of the decision-tree-based classification model. The experimental results imply that companies can reduce the cost of recruiting good job seekers. In addition, the suggested model can help job candidates pre-evaluate their self-introduction essays.
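The AUC metric used to compare the models can be computed directly from labels and predicted scores. The sketch below uses the pairwise definition (the probability that a randomly chosen passed essay is scored above a randomly chosen failed one, with ties counted as one half); the function name `auc` and the 0/1 label encoding are assumptions for illustration.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    labels: 1 for a passed essay, 0 for a failed one (assumed encoding).
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    # Count positive/negative pairs ranked correctly; ties count as half.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Because AUC depends only on how the two classes are ranked against each other, it is a reasonable choice when passed and failed essays are imbalanced, as the abstract notes they are.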

Classification Performance Analysis of Cross-Language Text Categorization using Machine Translation (기계번역을 이용한 교차언어 문서 범주화의 분류 성능 분석)

  • Lee, Yong-Gu
    • Journal of the Korean Society for Library and Information Science / v.43 no.1 / pp.313-332 / 2009
  • Cross-language text categorization (CLTC) can classify documents automatically using a training set from another language. In this study, collections appropriate for CLTC were extracted from KTSET. The classification performance of various CLTC methods was compared using an SVM classifier and machine translation. Results showed that classification performance ranked in the order of the poly-lingual training method, training-set translation, and test-set translation. However, training-set translation could be regarded as the most useful CLTC method, because it was efficient for machine translation and easily adapted to a general environment. On the other hand, low performance was shown to be due to feature reduction or features with no subject characteristics, which arose in the machine translation process of CLTC.
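The difference between training-set translation and test-set translation can be sketched with toy components. Everything here is an assumption for illustration: the dictionary `KO_TO_EN` stands in for a machine translation system, the word-overlap scorer stands in for the SVM used in the study, and the Korean/English language pair is chosen only as an example.

```python
from collections import Counter, defaultdict

# Toy bilingual dictionary standing in for a machine translation system.
KO_TO_EN = {"축구": "soccer", "경기": "match", "주식": "stock", "시장": "market"}


def translate(tokens, table):
    """Word-by-word lookup; unknown words pass through unchanged."""
    return [table.get(t, t) for t in tokens]


def train(labeled_docs):
    """Build per-label term counts (a stand-in for training an SVM)."""
    model = defaultdict(Counter)
    for tokens, label in labeled_docs:
        model[label].update(tokens)
    return model


def classify(model, tokens):
    """Pick the label whose training terms overlap most with the document."""
    return max(model, key=lambda label: sum(model[label][t] for t in tokens))


def classify_via_training_translation(train_docs_ko, test_tokens_en):
    """Translate the training set once, then classify test documents as-is."""
    translated = [(translate(toks, KO_TO_EN), label) for toks, label in train_docs_ko]
    return classify(train(translated), test_tokens_en)


def classify_via_test_translation(train_docs_ko, test_tokens_en):
    """Keep the training set as-is and translate every incoming test document."""
    en_to_ko = {en: ko for ko, en in KO_TO_EN.items()}
    return classify(train(train_docs_ko), translate(test_tokens_en, en_to_ko))
```

In the first strategy translation happens once at training time, while in the second every new document must be translated before classification, which is consistent with the abstract's remark that training-set translation is efficient for machine translation.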

A Study on the Improvement of the BRM Classification System for Policy Information Service (정책정보제공서비스를 위한 BRM분류체계 개선에 관한 연구)

  • Noh, Younghee;Park, Yang-Ha
    • Journal of the Korean Society for Library and Information Science / v.48 no.4 / pp.135-171 / 2014
  • The aim of this study was to suggest a classification system suited to providing policy information services. For this purpose, the study carried out the following processes: BRM taxonomy analysis, document analysis, analysis of classification systems that provide policy information, consultation with classification experts, surveys and interviews with policy information consumers, and empirical validation through the actual construction of policy information materials. Finally, the study complemented and modified the BRM taxonomy system and proposed a classification system appropriate for policy information resources. Through expert discussion, the BRM analysis appropriate for providing policy information services was determined to consist of three steps. The appropriateness of the BRM taxonomy system was confirmed through an analysis of the systems and services of domestic institutional websites that provide policy information resources. Through interviews with specialists, the BRM was validated and improvements were derived. Through the questionnaires, the study analyzed the appropriateness of the available BRM taxonomy system and the requirements by subject. Through empirical verification, it determined the subjects of the BRM taxonomy system for policy information services.

A Study on Patent Literature Classification Using Distributed Representation of Technical Terms (기술용어 분산표현을 활용한 특허문헌 분류에 관한 연구)

  • Choi, Yunsoo;Choi, Sung-Pil
    • Journal of the Korean Society for Library and Information Science / v.53 no.2 / pp.179-199 / 2019
  • In this paper, we propose optimal methodologies for classifying patent literature by examining various feature extraction methods and machine learning and deep learning models, and we identify the best-performing configuration through experiments. We compared the traditional BoW method and a distributed representation method (word embedding vectors) for feature extraction, and compared morphological analysis and multi-gram as methods of constructing the document collection. In addition, classification performance was verified using traditional machine learning models and a deep learning model. Experimental results show that the best performance is achieved when we apply the deep learning model with distributed representation and morphological-analysis-based feature extraction. In the Section, Class, and Subclass classification experiments, we improved performance by 5.71%, 18.84%, and 21.53%, respectively, compared with traditional classification methods.
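The two feature extraction styles compared above can be contrasted in a short sketch. It assumes a fixed vocabulary for BoW and a small table of pre-trained word vectors for the distributed representation; averaging word vectors is one simple way to get a document vector and is not necessarily how the authors feed embeddings to their deep learning model. The function names and the toy embeddings in the usage note are hypothetical.

```python
from collections import Counter


def bow_vector(tokens, vocab):
    """Bag-of-words: one count per vocabulary term, in vocabulary order."""
    counts = Counter(tokens)
    return [counts[t] for t in vocab]


def mean_embedding(tokens, embeddings, dim):
    """Distributed representation: average the vectors of known tokens."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    if not vecs:
        return [0.0] * dim
    # zip(*vecs) groups the values of each dimension across all tokens.
    return [sum(col) / len(vecs) for col in zip(*vecs)]
```

With a vocabulary of `["battery", "cell", "engine"]`, the tokens `["battery", "cell", "cell"]` map to the BoW vector `[1, 2, 0]`, whose length grows with the vocabulary. The embedding average stays at the fixed dimension `dim` regardless of vocabulary size, and related technical terms can end up with nearby vectors.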