• Title/Summary/Keyword: Document Categorizing

Search Result 10, Processing Time 0.026 seconds

Weighted Bayesian Automatic Document Categorization Based on Association Word Knowledge Base by Apriori Algorithm (Apriori알고리즘에 의한 연관 단어 지식 베이스에 기반한 가중치가 부여된 베이지만 자동 문서 분류)

  • 고수정;이정현
    • Journal of Korea Multimedia Society
    • /
    • v.4 no.2
    • /
    • pp.171-181
    • /
    • 2001
  • The previous Bayesian document categorization method has problems that it requires a lot of time and effort in word clustering and it hardly reflects the semantic information between words. In this paper, we propose a weighted Bayesian document categorizing method based on association word knowledge base acquired by mining technique. The proposed method constructs weighted association word knowledge base using documents in training set. Then, classifier using Bayesian probability categorizes documents based on the constructed association word knowledge base. In order to evaluate performance of the proposed method, we compare our experimental results with those of weighted Bayesian document categorizing method using vocabulary dictionary by mutual information, weighted Bayesian document categorizing method, and simple Bayesian document categorizing method. The experimental result shows that weighted Bayesian categorizing method using association word knowledge base has improved performance 0.87% and 2.77% and 5.09% over weighted Bayesian categorizing method using vocabulary dictionary by mutual information and weighted Bayesian method and simple Bayesian method, respectively.

  • PDF

Dynamic Text Categorizing Method using Text Mining and Association Rule

  • Kim, Young-Wook;Kim, Ki-Hyun;Lee, Hong-Chul
    • Journal of the Korea Society of Computer and Information
    • /
    • v.23 no.10
    • /
    • pp.103-109
    • /
    • 2018
  • In this paper, we propose a dynamic document classification method which breaks away from existing document classification method with artificial categorization rules focusing on suppliers and has changing categorization rules according to users' needs or social trends. The core of this dynamic document classification method lies in the fact that it creates classification criteria real-time by using topic modeling techniques without standardized category rules, which does not force users to use unnecessary frames. In addition, it can also search the details through the relevance analysis by calculating the relationship between the words that is difficult to grasp by word frequency alone. Rather than for logical and systematic documents, this method proposed can be used more effectively for situation analysis and retrieving information of unstructured data which do not fit the category of existing classification such as VOC (Voice Of Customer), SNS and customer reviews of Internet shopping malls and it can react to users' needs flexibly. In addition, it has no process of selecting the classification rules by the suppliers and in case there is a misclassification, it requires no manual work, which reduces unnecessary workload.

Hierarchical Automatic Classification of News Articles based on Association Rules (연관규칙을 이용한 뉴스기사의 계층적 자동분류기법)

  • Joo, Kil-Hong;Shin, Eun-Young;Lee, Joo-Il;Lee, Won-Suk
    • Journal of Korea Multimedia Society
    • /
    • v.14 no.6
    • /
    • pp.730-741
    • /
    • 2011
  • With the development of the internet and computer technology, the amount of information through the internet is increasing rapidly and it is managed in document form. For this reason, the research into the method to manage for a large amount of document in an effective way is necessary. The conventional document categorization method used only the keywords of related documents for document classification. However, this paper proposed keyword extraction method of based on association rule. This method extracts a set of related keywords which are involved in document's category and classifies representative keyword by using the classification rule proposed in this paper. In addition, this paper proposed the preprocessing method for efficient keywords creation and predicted the new document's category. We can design the classifier and measure the performance throughout the experiment to increase the profile's classification performance. When predicting the category, substituting all the classification rules one by one is the major reason to decrease the process performance in a profile. Finally, this paper suggested automatically categorizing plan which can be applied to hierarchical category architecture, extended from simple category architecture.

A Methodology for Automatic Multi-Categorization of Single-Categorized Documents (단일 카테고리 문서의 다중 카테고리 자동확장 방법론)

  • Hong, Jin-Sung;Kim, Namgyu;Lee, Sangwon
    • Journal of Intelligence and Information Systems
    • /
    • v.20 no.3
    • /
    • pp.77-92
    • /
    • 2014
  • Recently, numerous documents including unstructured data and text have been created due to the rapid increase in the usage of social media and the Internet. Each document is usually provided with a specific category for the convenience of the users. In the past, the categorization was performed manually. However, in the case of manual categorization, not only can the accuracy of the categorization be not guaranteed but the categorization also requires a large amount of time and huge costs. Many studies have been conducted towards the automatic creation of categories to solve the limitations of manual categorization. Unfortunately, most of these methods cannot be applied to categorizing complex documents with multiple topics because the methods work by assuming that one document can be categorized into one category only. In order to overcome this limitation, some studies have attempted to categorize each document into multiple categories. However, they are also limited in that their learning process involves training using a multi-categorized document set. These methods therefore cannot be applied to multi-categorization of most documents unless multi-categorized training sets are provided. To overcome the limitation of the requirement of a multi-categorized training set by traditional multi-categorization algorithms, we propose a new methodology that can extend a category of a single-categorized document to multiple categorizes by analyzing relationships among categories, topics, and documents. First, we attempt to find the relationship between documents and topics by using the result of topic analysis for single-categorized documents. Second, we construct a correspondence table between topics and categories by investigating the relationship between them. Finally, we calculate the matching scores for each document to multiple categories. The results imply that a document can be classified into a certain category if and only if the matching score is higher than the predefined threshold. For example, we can classify a certain document into three categories that have larger matching scores than the predefined threshold. The main contribution of our study is that our methodology can improve the applicability of traditional multi-category classifiers by generating multi-categorized documents from single-categorized documents. Additionally, we propose a module for verifying the accuracy of the proposed methodology. For performance evaluation, we performed intensive experiments with news articles. News articles are clearly categorized based on the theme, whereas the use of vulgar language and slang is smaller than other usual text document. We collected news articles from July 2012 to June 2013. The articles exhibit large variations in terms of the number of types of categories. This is because readers have different levels of interest in each category. Additionally, the result is also attributed to the differences in the frequency of the events in each category. In order to minimize the distortion of the result from the number of articles in different categories, we extracted 3,000 articles equally from each of the eight categories. Therefore, the total number of articles used in our experiments was 24,000. The eight categories were "IT Science," "Economy," "Society," "Life and Culture," "World," "Sports," "Entertainment," and "Politics." By using the news articles that we collected, we calculated the document/category correspondence scores by utilizing topic/category and document/topics correspondence scores. The document/category correspondence score can be said to indicate the degree of correspondence of each document to a certain category. As a result, we could present two additional categories for each of the 23,089 documents. Precision, recall, and F-score were revealed to be 0.605, 0.629, and 0.617 respectively when only the top 1 predicted category was evaluated, whereas they were revealed to be 0.838, 0.290, and 0.431 when the top 1 - 3 predicted categories were considered. It was very interesting to find a large variation between the scores of the eight categories on precision, recall, and F-score.

Measuring the Confidence of Human Disaster Risk Case based on Text Mining (텍스트마이닝 기반의 인적재난사고사례 신뢰도 측정연구)

  • Lee, Young-Jai;Lee, Sung-Soo
    • The Journal of Information Systems
    • /
    • v.20 no.3
    • /
    • pp.63-79
    • /
    • 2011
  • Deducting the risk level of infrastructure and buildings based on past human disaster risk cases and implementing prevention measures are important activities for disaster prevention. The object of this study is to measure the confidence to proceed quantitative analysis of various disaster risk cases through text mining methodology. Indeed, by examining confidence calculation process and method, this study suggests also a basic quantitative framework. The framework to measure the confidence is composed into four stages. First step describes correlation by categorizing basic elements based on human disaster ontology. Secondly, terms and cases of Term-Document Matrix will be created and the frequency of certain cases and terms will be quantified, the correlation value will be added to the missing values. In the third stage, association rules will be created according to the basic elements of human disaster risk cases. Lastly, the confidence value of disaster risk cases will be measured through association rules. This kind of confidence value will become a key element when deciding a risk level of a new disaster risk, followed up by preventive measures. Through collection of human disaster risk cases related to road infrastructure, this study will demonstrate a case where the four steps of the quantitative framework and process had been actually used for verification.

A Study on Characteristics of Pink Color and Fashion Images Used in Gender Neutral Men's Fashion (젠더 뉴트럴 남성 패션에 사용된 핑크색의 특성과 패션이미지 분석)

  • Hong, YunJung;Joo, Mi Young
    • Journal of Fashion Business
    • /
    • v.24 no.5
    • /
    • pp.52-71
    • /
    • 2020
  • This study examines the color characteristics of the usage of the color pink in menswear by analyzing its usage status and method. It involves an empirical research method establishing the frame of the study through a document study centered on trend, gender neutral considerations, and the utilization of the color pink in men's fashion, by analyzing the characteristics of color and tone by extracting the pink color shown in menswear collections as well as analyzing and categorizing the fashion image and genderless characteristics. Analyzing the color and tone of the pink color shown indicate that bright, light and pale tones had higher proportions. Pink color can also be said to be utilized as a design element that gives off a younger and more vital color image in menswear. Further, the use of brighter and softer pink colors can be interpreted as reflecting modern society's demands of masculinity to change into a more sophisticated and soft image. To analyze the characteristics of the color pink utilized in gender neutral fashion, fashion images were presented as the analysis standard. An image grouping technique was used to classify pink while utilizing genderless types-fashion style. The result showed that even with the same pink color, the fashion image can vary with different methods of expression in terms of clothes and styling. The results of this study can serve as basic data for planning fashion design concepts as it analyzed pink-using fashion images and the genderless concept type.

How can we narrow the digital divide among SMEs in APEC member economies? (중소기업 정보화 수준 격차 해소방안에 관한 국가 간 비교연구)

  • Kwon, Sun-Dong;Yang, Hee-Dong;Sohn, Yong-Yeop;Lee, Seong-Bong;Sirh, Jin-Young;Cho, Taek-Hee
    • Journal of Information Technology Applications and Management
    • /
    • v.12 no.2
    • /
    • pp.79-106
    • /
    • 2005
  • This study, by adopting case study methodology, is focused on examining the present state and analyzing the cause of the digital divide, and suggesting policies for bridging the divide, specifically in view of SMEs. We have taken cases of manufacturing companies, visiting and interviewing 18 SMEs in 10 APEC member economies which show sharp difference in usage of ICT. In order to analyze the digital gap among SMEs, we used 5 variables that are composed of computer hardware, computer software, Internet, readiness of ICT, and performance of ICT adoption, while categorizing the cases into low and high tier based on the national ICT index. From a computer hardware perspective, the high tier (0.66) has almost double the number of PC’s per employee, compared with the low tiers (0.34). This gap can be explained by financial availability of low income and high tariff in the developing economies. In the computer software perspective, the SMEs in the low tier had some restrictive use of computer applications such as financial and accounting management and document management, while those in the high tier enjoyed more diversity in the use of applications such as inventory management, sales management, financial and accounting management, procurement management, CRM, and ERP. In view of the readiness of ICT, the difference in ICT infrastructure and financial status between the low and high tier was far wider than any other variables. As a result of ICT adoption, SMEs benefited in view of learning and growth, internal business processes, customer service, and financial affairs. To effectively bridge the digital divide between the low and high tier, actions such as setting up a secondary market of used computers among cooperating developed and developing countries, developing and diffusing good business applications, and building speedy, low-cost telecommunication infrastructures should be taken.

  • PDF

Classifying Sub-Categories of Apartment Defect Repair Tasks: A Machine Learning Approach (아파트 하자 보수 시설공사 세부공종 머신러닝 분류 시스템에 관한 연구)

  • Kim, Eunhye;Ji, HongGeun;Kim, Jina;Park, Eunil;Ohm, Jay Y.
    • KIPS Transactions on Software and Data Engineering
    • /
    • v.10 no.9
    • /
    • pp.359-366
    • /
    • 2021
  • A number of construction companies in Korea invest considerable human and financial resources to construct a system for managing apartment defect data and for categorizing repair tasks. Thus, this study proposes machine learning models to automatically classify defect complaint text-data into one of the sub categories of 'finishing work' (i.e., one of the defect repair tasks). In the proposed models, we employed two word representation methods (Bag-of-words, Term Frequency-Inverse Document Frequency (TF-IDF)) and two machine learning classifiers (Support Vector Machine, Random Forest). In particular, we conducted both binary- and multi- classification tasks to classify 9 sub categories of finishing work: home appliance installation work, paperwork, painting work, plastering work, interior masonry work, plaster finishing work, indoor furniture installation work, kitchen facility installation work, and tiling work. The machine learning classifiers using the TF-IDF representation method and Random Forest classification achieved more than 90% accuracy, precision, recall, and F1 score. We shed light on the possibility of constructing automated defect classification systems based on the proposed machine learning models.

Review of International Cases for Managing Input Data in Safety Assessment for High-Level Radioactive Waste Deep Disposal Facilities (고준위방사성폐기물 심층처분시설 안전성평가 입력자료 관리를 위한 해외사례 분석)

  • Mi Kyung Kang;Hana Park;Sunju Park;Hae Sik Jeong;Woon Sang Yoon;Jeonghwan Lee
    • Economic and Environmental Geology
    • /
    • v.56 no.6
    • /
    • pp.887-897
    • /
    • 2023
  • Leading waste disposal countries, such as Sweden, Switzerland, and the United Kingdom, conduct safety assessments across all stages of High-Level Radioactive Waste Deep Geological Disposal Facilities-from planning and site selection to construction, operation, closure, and post-closure management. As safety assessments are repeatedly performed at each stage, generating vast amounts of diverse data over extended periods, it is essential to construct a database for safety assessment and establish a data management system. In this study, the safety assessment data management systems of leading countries, were analyzed, categorizing them into 1) input and reference data for safety assessments, 2) guidelines for data management, 3) organizational structures for data management, and 4) computer systems for data management. While each country exhibited differences in specific aspects, commonalities included the classification of safety assessment input data based on disposal system components, the establishment of organizations to supply, use, and manage this data, and the implementation of quality management systems guided by instructions and manuals. These cases highlight the importance of data management systems and document management systems for securing the safety and enhancing the reliability of High-Level Radioactive Waste Disposal Facilities. To achieve this, the classification of input data that can be flexibly and effectively utilized, ensuring the consistency and traceability of input data, and establishing a quality management system for input data and document management are necessary.

Value and Prosect of individual diary as research materials : Based on the "The 12th May Diaries Collection" (개인 일기의 연구 자료로서의 가치와 전망 "5월12일 일기컬렉션"을 중심으로)

  • Choi, Hyo Jin;Yim, Jin Hee
    • The Korean Journal of Archival Studies
    • /
    • no.46
    • /
    • pp.95-152
    • /
    • 2015
  • "Archives of Everyday Life" refers to an organization or facility which collects, appraises, selects and preserves the document from the memory of individuals, groups, or a society through categorizing and classifying lives and cultures of ordinary people. The document includes materials such as diaries, autobiography, letters, and notes. It also covers any digital files or hypertext like posts from blogs and online communities, or photos uploaded on Social Network Services. Many research fields including the Records Management Studies has continuously claimed the necessity of collection and preservation of ordinary people's records on daily life produced every moment. Especially diary is a written record reflecting the facts experienced by an individual and his self-examination. Its originality, individuality and uniqueness are considered truly valuable as a document regardless of the era. Lately many diaries have been discovered and presented to the historical research communities, and diverse researchers in human and social studies have embarked more in-depth research on diaries, their authors, and social background of the time. Furthermore, researchers from linguistics, educational studies, and psychology analyze linguistic behaviors, status of cultural assimilation, and emotional or psychological changes of an author. In this study, we are conducting a metastudy from various research on diaries in order to reaffirm the value of "The 12th May Diaries Collection" as everyday life archives. "The 12th May Diaries Collection" consists of diaries produced and donated directly by citizens on the 12th May every year. It was only 2013 when Digital Archiving Institute in Univ. of Myungji organized the first "Annual call for the 12th May". Now more than 2,000 items were collected including hand writing diaries, digital documents, photos, audio and video files, etc. The age of participants also varies from children to senior citizens. In this study, quantitative analysis will be made on the diaries collected as well as more profound discoveries on the detailed contents of each item. It is not difficult to see stories about family and friends, school life, concerns over career path, daily life and feelings of citizens ranging all different generations, regions, and professions. Based on keyword and descriptors of each item, more comprehensive examination will be further made. Additionally this study will also provide suggestions to examine future research opportunities of these diaries for different fields such as linguistics, educational studies, historical studies or humanities considering diverse formats and contents of diaries. Finally this study will also discuss necessary tasks and challenges for "the 12th May Diaries Collection" to be continuously collected and preserved as Everyday Life Archives.