• Title/Summary/Keyword: Document-Classification

Analysis of News Articles on Child Welfare Policies in South Korea: K-Means Clustering (대한민국 정권별 아동복지정책 관련 뉴스 기사 분석: K-평균 군집 분석)

  • Kim, Eun Joo;Kim, Seong Kwang;Park, Bit Na
    • Journal of East-West Nursing Research / v.29 no.2 / pp.185-195 / 2023
  • Purpose: This study analyzes changes in child welfare policies and provides insights based on the collection and classification of newspaper articles. Methods: Articles related to child welfare policies were collected from 1990, during the Kim, Young-sam administration, to May 9, 2022, under the Moon, Jae-in administration. K-Means clustering and Term Frequency-Inverse Document Frequency (TF-IDF) keyword analysis were used to cluster and analyze newspaper articles with similar themes. Results: The administrations of Kim, Young-sam, Kim, Dae-jung, Roh, Moo-hyun, and Park, Geun-hye were classified into two clusters, and the Lee, Myung-bak and Moon, Jae-in administrations were classified into three clusters. Conclusion: South Korea's child welfare policies have focused on ensuring the safety and healthy development of children through diverse policy initiatives over the years. However, challenges related to child protection and child abuse persist, and addressing them requires additional resources and budget allocation. It is important to establish a comprehensive support system for children and families, including comprehensive nursing support.
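The clustering approach named in the abstract above (TF-IDF weighting followed by K-Means) can be illustrated with a minimal standard-library sketch. The toy documents, the smoothed IDF formula, the fixed initial centroids, and all function names are assumptions made for illustration; they are not taken from the paper.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, L2-normalized TF-IDF vectors) for tokenized docs."""
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    # Smoothed IDF (an assumption) so that no term gets a zero weight.
    idf = {tok: math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1.0 for tok in vocab}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vec = [counts[tok] / len(doc) * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        vectors.append([v / norm for v in vec])
    return vocab, vectors


def kmeans(vectors, init_indices, n_iter=20):
    """Plain Lloyd's k-means; initial centroids are the vectors at init_indices."""
    centroids = [list(vectors[i]) for i in init_indices]
    labels = [0] * len(vectors)
    for _ in range(n_iter):
        # Assignment step: each vector joins its nearest centroid.
        labels = [
            min(range(len(centroids)),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(vec, centroids[c])))
            for vec in vectors
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(len(centroids)):
            members = [vec for vec, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


def top_terms(vocab, centroid, n=3):
    """Highest-weighted terms of a centroid, ties broken alphabetically."""
    ranked = sorted(zip(vocab, centroid), key=lambda pair: (-pair[1], pair[0]))
    return [term for term, _ in ranked[:n]]


# Hypothetical, already-tokenized headlines (English stand-ins for news text).
docs = [
    "child abuse protection law".split(),
    "child abuse report protection".split(),
    "childcare subsidy budget support".split(),
    "childcare subsidy family support".split(),
]
vocab, vectors = tfidf_vectors(docs)
labels, centroids = kmeans(vectors, init_indices=[0, 2])
cluster0_terms = top_terms(vocab, centroids[0])
```

Inspecting the top-weighted terms of each centroid is one simple way to name a cluster's theme; a real study would also need a Korean tokenizer and a principled choice of the number of clusters.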

Applications of Machine Learning Models on Yelp Data

  • Ruchi Singh;Jongwook Woo
    • Asia Pacific Journal of Information Systems / v.29 no.1 / pp.35-49 / 2019
  • The paper documents the application of relevant Machine Learning (ML) models to the Yelp dataset (Yelp is a crowd-sourced local business review and social networking site) to analyze, predict, and recommend businesses. Two cloud platforms are used strategically to minimize the effort and time required for the project. Seven machine learning algorithms are applied in Azure ML, four of which are implemented in Databricks Spark ML. The analyzed Yelp business dataset contains 70 business attributes for more than 350,000 registered businesses. Additionally, review tips and likes from 500,000 users have been processed for the project. A Recommendation Model is built to provide Yelp users with recommendations for business categories based on their previous business ratings, as well as the business ratings of other users. A Classification Model is implemented to predict the popularity of a business, defining a popular business as one with more than 3 stars and an unpopular business as one with fewer than 3 stars. A Text Analysis model is developed by comparing two algorithms, uni-gram feature extraction and n-feature extraction, in Azure ML Studio, and a logistic regression model in Spark. Comparative conclusions are drawn about the efficiency of Spark ML and Azure ML for these models.
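The popularity labels described in the abstract above can be sketched as a small rule. The abstract does not say how a rating of exactly 3 stars is treated, so this sketch leaves such businesses unlabeled; that choice and the function name are assumptions.

```python
from typing import Optional


def popularity_label(stars: float) -> Optional[str]:
    """Label a business by its star rating, following the abstract's rule."""
    if stars > 3:
        return "popular"
    if stars < 3:
        return "unpopular"
    # Exactly 3 stars is not covered by the stated rule (assumption: no label).
    return None


# Hypothetical ratings, for illustration only.
ratings = [4.5, 2.0, 3.0, 3.5]
labels = [popularity_label(s) for s in ratings]
```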

Sensitivity Identification Method for New Words of Social Media based on Naive Bayes Classification (나이브 베이즈 기반 소셜 미디어 상의 신조어 감성 판별 기법)

  • Kim, Jeong In;Park, Sang Jin;Kim, Hyoung Ju;Choi, Jun Ho;Kim, Han Il;Kim, Pan Koo
    • Smart Media Journal / v.9 no.1 / pp.51-59 / 2020
  • From PC communication to the development of the internet, new terms have been coined on social media; with the spread of smartphones, a social media culture has formed, and newly coined words are becoming part of that culture. With social networking sites and smartphones serving as a bridge, the amount of data has increased in real time. New words have many advantages, including allowing short sentences that work around the character limits of various messengers and reduce data. However, new words have no dictionary meaning, which limits and degrades algorithms such as data mining. Therefore, in this paper, the opinion of a document is identified by collecting data through web crawling, extracting the new words contained in the text data, and establishing an emotional classification. The experiment proceeds in three stages. First, new words collected from social media are learned as affirmative or negative. Next, to derive and verify emotional values using standard documents, TF-IDF is used to score the sentiment of nouns and assign emotional values to the data. As with the new words, the classified emotional values are applied to verify that emotions are classified in standard-language documents. Finally, a combination of the newly coined words and standard emotional values is used to perform a comparative analysis of the technology of the instrument.
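The title of the entry above names Naive Bayes classification of affirmative and negative words. As a minimal sketch, a multinomial Naive Bayes classifier with Laplace smoothing could look like the following; the class name, the smoothing constant, and the English slang tokens are hypothetical stand-ins and do not come from the paper.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes with Laplace (add-alpha) smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.class_counts = Counter()            # documents per class
        self.word_counts = defaultdict(Counter)  # token counts per class
        self.vocab = set()

    def fit(self, docs, labels):
        for tokens, label in zip(docs, labels):
            self.class_counts[label] += 1
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def log_scores(self, tokens):
        total_docs = sum(self.class_counts.values())
        scores = {}
        for label, n_docs in self.class_counts.items():
            total_words = sum(self.word_counts[label].values())
            denom = total_words + self.alpha * len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for tok in tokens:
                if tok in self.vocab:  # tokens never seen in training are skipped
                    count = self.word_counts[label][tok]
                    score += math.log((count + self.alpha) / denom)
            scores[label] = score
        return scores

    def predict(self, tokens):
        scores = self.log_scores(tokens)
        return max(scores, key=scores.get)


# Hypothetical training posts; "lit" and "cringe" stand in for newly coined words.
train_docs = [
    ["lit", "great", "movie"],
    ["lit", "love", "song"],
    ["cringe", "bad", "movie"],
    ["cringe", "boring", "song"],
]
train_labels = ["positive", "positive", "negative", "negative"]
model = NaiveBayes().fit(train_docs, train_labels)
prediction = model.predict(["lit", "song"])
```

Because "song" appears equally often in both classes, the prediction here is driven by the new word "lit", which is the situation the abstract describes: the coined word carries the sentiment.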

The Present State and Solutions for Archival Arrangement and Description of National Archives & Records Service of Korea (국가기록원의 기록물 정리기술의 현황과 개선방안)

  • Yoon, Ju-Bom
    • Journal of Korean Society of Archives and Records Management / v.4 no.2 / pp.118-162 / 2004
  • Archival description plays an important role in document control and reference service. The Archives has made an effort to carry out archival description, but its theory and practical processes differ from those of advanced countries and have problems. The most serious difference is that function classification, maintenance of original order, and multi-level description are not reflected in practice: records are shelved after being arranged by registration order in units of volumes, like books. In addition, there are problems with the history of agency changes and with index control, which can inconvenience users. To improve this, the study introduces the meaning and importance of arrangement and description, the current state and problems of arrangement and description in the National Archives, and description guidelines in other countries, followed by an example applying ISAD(G). The paper has eight chapters. Chapter 1 is the introduction, Chapter 2 covers the meaning and importance of arrangement and description, and Chapter 8 is the conclusion; Chapters 3 to 7 are as follows. Chapter 3 explains GOVT, which is currently in use, and the description element categories, as part of the current state and problems of arrangement and description in the Archives. Chapter 4 covers guidelines from archives in the U.S.A., England, and Australia: (1) the Lifecycle Data Requirements Guide from NARA is introduced, and among its description fields, the way of describing a single element, the title, is presented; (2) the description guideline of the Public Record Office, named National Archives Cataloguing Guidelines Introduction, is presented, along with "PROCAT" and the seven procedures of description; (3) the Commonwealth Record Series (CRS) of the National Archives of Australia is covered, with a study of the registration and description procedures for the CRS system. Chapter 5 presents an example applying ISAD(G): a description of documents produced by the Appeals Commission in the Ministry of Government Administration. Chapters 6 and 7 cover the problems identified after applying ISAD(G) and plans for improving them: naming of documents at the processing section of each institution, the lack of description field categories, classification by kind or form, reference or identification numbers, the absence of detailed description rules, function classification, multi-level description, input format, shelf arrangement, and authority control. The best way to improve arrangement and description in the Archives is to examine the standards, guidelines, and manuals of archives in advanced countries. We therefore suggest that much more research and study on this topic is needed in the academic field.

A Novel Methodology for Extracting Core Technology and Patents by IP Mining (핵심 기술 및 특허 추출을 위한 IP 마이닝에 관한 연구)

  • Kim, Hyun Woo;Kim, Jongchan;Lee, Joonhyuck;Park, Sangsung;Jang, Dongsik
    • Journal of the Korean Institute of Intelligent Systems / v.25 no.4 / pp.392-397 / 2015
  • Society has developed through the analogue, digital, and smart eras. Every technology is going through constant change and rapid development. In this competitive society, establishing an R&D strategy is highly useful for improving technological competitiveness. A patent document includes technical and legal rights information such as the title, abstract, description, claims, and patent classification code. From a patent document, many people can understand and collect legal and technical information. This unique feature of patents can be applied quantitatively to technology analysis. This paper proposes a methodology for extracting core technologies and patents based on quantitative methods. Statistical analysis and social network analysis are applied to IPC codes in order to extract core technologies with active R&D and high centrality. Core patents are then extracted by analyzing citation and family information.
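The abstract above applies social network analysis to IPC codes and looks for codes with high centrality, without stating which centrality measure is used. As one possible reading, the sketch below builds a co-classification network (two IPC codes are linked when they appear on the same patent) and computes degree centrality. The measure, the function name, and the sample patents are assumptions for illustration.

```python
from collections import defaultdict
from itertools import combinations


def ipc_degree_centrality(patents):
    """Degree centrality of IPC codes in a co-classification network.

    patents: list of IPC code lists, one list per patent.
    Two codes are neighbors if at least one patent carries both.
    """
    neighbors = defaultdict(set)
    for codes in patents:
        unique = sorted(set(codes))
        for code in unique:
            neighbors[code]  # register the node even if it has no links
        for a, b in combinations(unique, 2):
            neighbors[a].add(b)
            neighbors[b].add(a)
    n = len(neighbors)
    if n < 2:
        return {code: 0.0 for code in neighbors}
    # Normalize by the maximum possible number of neighbors.
    return {code: len(nbrs) / (n - 1) for code, nbrs in neighbors.items()}


# Hypothetical patents, each listed with its IPC subclasses.
patents = [
    ["G06F", "H04L"],
    ["G06F", "G06Q"],
    ["G06F", "H04L", "H04W"],
]
centrality = ipc_degree_centrality(patents)
core_code = max(centrality, key=centrality.get)
```

In this toy network G06F co-occurs with every other code, so it would be the candidate core technology under the degree-centrality assumption.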

Chatbot Design Method Using Hybrid Word Vector Expression Model Based on Real Telemarketing Data

  • Zhang, Jie;Zhang, Jianing;Ma, Shuhao;Yang, Jie;Gui, Guan
    • KSII Transactions on Internet and Information Systems (TIIS) / v.14 no.4 / pp.1400-1418 / 2020
  • In the development of commercial promotion, the chatbot is known as one of the significant applications of natural language processing (NLP). Conventional design methods use the bag-of-words (BOW) model alone, based on the Google database and other online corpora. For one thing, in the bag-of-words model the vectors are unrelated to one another. Even though this method is friendly to discrete features, it does not help the machine understand continuous statements, because the connection between words is lost in the encoded word vector. For another, existing methods are tested on state-of-the-art online corpora but are hard to apply to real applications such as telemarketing data. In this paper, we propose an improved chatbot design method using a hybrid of the bag-of-words model and the skip-gram model, based on real telemarketing data. Specifically, we first collect real data in the telemarketing field and perform data cleaning and data classification on the constructed corpus. Second, the word representation adopts the hybrid bag-of-words and skip-gram model. The skip-gram model maps synonyms close together in vector space. The correlation between words is thus expressed, so the amount of information contained in the word vector increases, making up for the shortcomings of using the bag-of-words model alone. Third, we use the term frequency-inverse document frequency (TF-IDF) weighting method to increase the weight of key words, then output the final word expression. Finally, the answer is produced using a hybrid of a retrieval model and a generative model. The retrieval model can accurately answer questions in the field. The generative model supplements it by answering open-domain questions, where the final reply is produced by long short-term memory (LSTM) training and prediction. Experimental results show that the hybrid word vector expression model improves the accuracy of the response and that the whole system can communicate with humans.

A Technique to Recommend Appropriate Developers for Reported Bugs Based on Term Similarity and Bug Resolution History (개발자 별 버그 해결 유형을 고려한 자동적 개발자 추천 접근법)

  • Park, Seong Hun;Kim, Jung Il;Lee, Eun Joo
    • KIPS Transactions on Software and Data Engineering / v.3 no.12 / pp.511-522 / 2014
  • During software development, a variety of bugs are reported. Several bug tracking systems, such as Bugzilla, MantisBT, Trac, and JIRA, are used to handle reported bug information in many open source development projects. Bug reports in a bug tracking system are triaged to manage bugs and to determine the developer responsible for resolving each report. As software grows in size and bug reports tend to be duplicated, bug triage becomes more and more complex and difficult. In this paper, we present an approach to assigning bug reports to appropriate developers, which is a main part of the bug triage task. First, words that appear in resolved bug reports are classified by developer. Second, words in new bug reports are selected. After these two steps, vectors whose items are the selected words are generated. Third, the TF-IDF (term frequency-inverse document frequency) of each selected word is computed and used as the weight of the corresponding vector item. Finally, developers are recommended based on the similarity between each developer's word vector and the vector of the new bug report. We conducted an experiment on the Eclipse JDT and CDT projects to show the applicability of the proposed approach. We also compared the proposed approach with an existing study based on machine learning. The experimental results show that the proposed approach is superior to the existing method.
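The steps listed in the abstract above (per-developer word profiles, TF-IDF weights, similarity to the new report) can be sketched roughly as follows. The abstract does not name the similarity measure or the exact TF-IDF variant, so cosine similarity, the smoothed IDF, the treatment of each developer's resolved reports as a single document, and the developer names and reports are all assumptions.

```python
import math
from collections import Counter


def tfidf(counts, idf):
    """Weight raw term counts by relative frequency times IDF."""
    total = sum(counts.values())
    return {term: (c / total) * idf.get(term, 0.0) for term, c in counts.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def recommend(resolved, new_report, top_n=1):
    """Rank developers by similarity of their word profile to a new report.

    resolved: dict mapping developer -> list of tokenized resolved reports.
    """
    # Step 1: group the words of resolved reports by developer.
    profiles = {dev: Counter(tok for report in reports for tok in report)
                for dev, reports in resolved.items()}
    # IDF over developer profiles (assumption: one "document" per developer).
    n = len(profiles)
    doc_freq = Counter(tok for profile in profiles.values() for tok in profile)
    idf = {tok: math.log((1 + n) / (1 + df)) + 1.0 for tok, df in doc_freq.items()}
    # Steps 2-3: weight the new report's words; unknown words get weight 0.
    query = tfidf(Counter(new_report), idf)
    # Step 4: rank by similarity, ties broken by developer name.
    ranked = sorted(
        profiles,
        key=lambda dev: (-cosine(tfidf(profiles[dev], idf), query), dev),
    )
    return ranked[:top_n]


# Hypothetical developers and resolved reports.
resolved = {
    "alice": [["null", "pointer", "editor"], ["editor", "crash", "syntax"]],
    "bob": [["build", "classpath", "error"], ["classpath", "jar", "missing"]],
}
suggestion = recommend(resolved, ["editor", "crash", "on", "save"])
```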

Web Site Keyword Selection Method by Considering Semantic Similarity Based on Word2Vec (Word2Vec 기반의 의미적 유사도를 고려한 웹사이트 키워드 선택 기법)

  • Lee, Donghun;Kim, Kwanho
    • The Journal of Society for e-Business Studies / v.23 no.2 / pp.83-96 / 2018
  • Extracting keywords that represent documents is very important because they can be used for automated services such as document search, classification, and recommendation systems, as well as for quickly conveying document information. However, when keywords are extracted based on the frequency of words appearing in web site documents or by graph algorithms based on word co-occurrence, the web page structure may contain various words unrelated to the topic, and semantic keywords are difficult to extract because of the limited performance of the Korean tokenizer. In this paper, we propose a method that selects candidate keywords based on semantic similarity, addressing both the failure to extract semantic keywords and the poor accuracy of Korean tokenizer analysis. Finally, a filtering process removes inconsistent keywords to extract the final semantic keywords. Experimental results on real web pages of small businesses show that the proposed method improves performance by 34.52% over the statistical-similarity-based keyword selection technique. This confirms that keyword extraction from documents improves when semantic similarity between words is considered and inconsistent keywords are removed.
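The filtering idea in the abstract above (drop keywords that are semantically inconsistent with the rest) can be illustrated with a rough sketch. The abstract does not describe the actual filtering rule, so the mean-cosine-similarity criterion, the threshold, and the hand-made two-dimensional vectors standing in for trained Word2Vec embeddings are all assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def filter_keywords(candidates, embeddings, threshold=0.5):
    """Keep candidates whose mean similarity to the other candidates is high."""
    kept = []
    for word in candidates:
        others = [w for w in candidates if w != word]
        if not others:
            kept.append(word)
            continue
        mean_sim = sum(cosine(embeddings[word], embeddings[w]) for w in others) / len(others)
        if mean_sim >= threshold:
            kept.append(word)
    return kept


# Toy vectors (assumption): in practice these would come from a trained
# Word2Vec model, not be written by hand.
embeddings = {
    "bakery": (0.90, 0.10),
    "bread": (0.80, 0.20),
    "cake": (0.85, 0.15),
    "login": (0.10, 0.90),  # page-structure word unrelated to the topic
}
keywords = filter_keywords(["bakery", "bread", "cake", "login"], embeddings)
```

Here "login" is the kind of page-structure word the abstract warns about: it is frequent on a web page but far from the topic words in embedding space, so it falls below the threshold.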

Discussion about the Self Disposal Guideline of Medical Radioactive Waste (의료용 방사성폐기물 자체처분 가이드라인에 관한 고찰)

  • Lee, Kyung-Jae;Sul, Jin-Hyung;Lee, In-Won;Park, Young-Jae
    • The Korean Journal of Nuclear Medicine Technology / v.21 no.2 / pp.13-27 / 2017
  • Purpose: In the domestic self-disposal procedure for medical radioactive waste, there are many supplementation requests and difficulties in the screening process. In this regard, presenting a basic guideline will improve the efficiency with which medical institutions process radioactive waste. We reviewed and compared the supplementary requests made from 2015 to 2016 on the self-disposal Plan & Procedure manuals of fifteen domestic medical institutions. From these, we derived the details of the radioactive waste document based on the relevant regulations of the Nuclear Safety Act. The representative supplementary requests of the Korea Institute of Nuclear Safety concern the disposal method for non-flammable radioactive waste, the storage method for waste scheduled for self-disposal, the legitimacy of self-disposal and pre-treatment for self-disposal, the reference radioactivity of disused filters and the calculation of the storage period, and the attachment of evidence of measurement efficiency when using a gamma counter. By establishing a medical radioactive waste guideline, we can clearly suggest a classification standard by radioactive nuclide and type of occurrence. As a result, the examination processing period when preparing a self-disposal document can be reduced, and no expenses are incurred for a business agency. In addition, the storage efficiency of facilities will improve and costs will be reduced. Based on this guideline, we expect to contribute to improving the work efficiency of officials who face working-level difficulties with radioactive waste self-disposal.

A Korean Community-based Question Answering System Using Multiple Machine Learning Methods (다중 기계학습 방법을 이용한 한국어 커뮤니티 기반 질의-응답 시스템)

  • Kwon, Sunjae;Kim, Juae;Kang, Sangwoo;Seo, Jungyun
    • Journal of KIISE / v.43 no.10 / pp.1085-1093 / 2016
  • A community-based Question Answering system provides answers to questions from documents uploaded to web communities. To enhance question analysis, former methods have developed specific rules suited to a target region or have applied machine learning to parts of the process. However, these methods incur excessive cost when expanding to new fields or lead to systems overfitted to a specific field. This paper proposes a multiple machine learning method that automates the overall process by applying an appropriate machine learning technique at each step, for efficient processing in a community-based Question Answering system. The system is divided into a question analysis part and an answer selection part. The question analysis part consists of a question focus extractor, which analyzes the focused phrases in questions using conditional random fields, and a question type classifier, which classifies the topics of questions using a support vector machine. In the answer selection part, we train the weights used by the similarity estimation models through an artificial neural network. In addition, there are a number of cases in which the results of morphological analysis are not reliable for data uploaded to web communities. Therefore, we suggest a method that minimizes the impact of morphological analysis by using character features in the question analysis stage. The proposed system outperforms the former system, achieving a Mean Average Precision of 0.765 and an R-Precision of 0.872.
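The two evaluation measures reported in the abstract above, Mean Average Precision and R-Precision, follow standard definitions that can be sketched as below. The ranked answer lists and relevance sets are made up for illustration and are unrelated to the paper's data.

```python
def average_precision(ranked, relevant):
    """Mean of precision@k taken at each rank k where a relevant item appears."""
    hits = 0
    total = 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """Average of average_precision over (ranked, relevant) pairs, one per question."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


def r_precision(ranked, relevant):
    """Precision within the top R results, where R is the number of relevant items."""
    r = len(relevant)
    if r == 0:
        return 0.0
    return sum(1 for item in ranked[:r] if item in relevant) / r


# Hypothetical ranked answers for two questions, with their relevant answers.
runs = [
    (["a1", "a2", "a3", "a4"], {"a1", "a3"}),
    (["b1", "b2"], {"b2"}),
]
map_score = mean_average_precision(runs)
rp_first = r_precision(*runs[0])
```

For the first question the relevant answers sit at ranks 1 and 3, giving an average precision of (1/1 + 2/3) / 2; only one of the top two results is relevant, so its R-Precision is 0.5.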