• Title/Summary/Keyword: Document-Classification


An Enhanced Feature Selection Method Based on the Impurity of Words Considering Unbalanced Distribution of Documents (문서의 불균등 분포를 고려한 단어 불순도 기반 특징 선택 방법)

  • Kang, Jin-Beom;Yang, Jae-Young;Choi, Joong-Min
    • Journal of KIISE:Software and Applications / v.34 no.9 / pp.804-816 / 2007
  • Sample training data for machine learning often contain irrelevant information or redundant concepts, and the original data may also include noise. If the information collected for constructing a learning model is not reliable, it is difficult to obtain accurate results. In the learning phase, the system attempts to find relations or rules between features and categories. Feature selection removes irrelevant or redundant information before the learning model is constructed, in order to improve its performance. Existing feature selection methods assume that the distribution of documents is balanced in terms of the number of documents in each class and the length of each document. In practice, however, it is difficult both to prepare a set of documents of almost equal length and to define classes that each contain a fixed number of documents. In this paper, we propose a new feature selection method that considers the impurity of words and the unbalanced distribution of documents across categories. We obtain feature candidates using word impurity and then select the final features by taking the unbalanced distribution of documents into account. Experiments demonstrate that our method performs better than existing methods.
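The abstract does not state which impurity measure is used or how the class imbalance is corrected. As a minimal sketch, assuming Gini impurity over per-class document rates (both the measure and the class-size normalisation are assumptions, and the names `word_impurity` and `select_features` are hypothetical), impurity-based feature selection could look like this:

```python
from collections import Counter, defaultdict


def word_impurity(docs):
    """docs: list of (label, tokens) pairs.

    Returns {word: Gini impurity of the word's distribution over classes}.
    Assumption: Gini impurity stands in for the paper's unspecified measure.
    """
    class_sizes = Counter(label for label, _ in docs)
    # Number of documents in each class that contain the word.
    doc_freq = defaultdict(Counter)
    for label, tokens in docs:
        for word in set(tokens):
            doc_freq[word][label] += 1

    impurity = {}
    for word, per_class in doc_freq.items():
        # Assumed imbalance correction: use the fraction of each class's
        # documents containing the word, so large classes do not dominate.
        rates = [per_class[c] / class_sizes[c] for c in class_sizes]
        total = sum(rates)
        impurity[word] = 1.0 - sum((r / total) ** 2 for r in rates)
    return impurity


def select_features(docs, k):
    """Keep the k words with the lowest impurity (most class-specific)."""
    imp = word_impurity(docs)
    return sorted(imp, key=lambda w: (imp[w], w))[:k]


docs = [
    ("sports", ["goal", "match", "the"]),
    ("sports", ["goal", "the"]),
    ("sports", ["match", "the"]),
    ("politics", ["vote", "the"]),
]
features = select_features(docs, 3)
```

In this toy corpus a word that occurs in every document of both classes gets the maximum two-class impurity, while words confined to one class get zero and are selected first.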

Analysis of the abstracts of research articles in food related to climate change using a text-mining algorithm (텍스트 마이닝 기법을 활용한 기후변화관련 식품분야 논문초록 분석)

  • Bae, Kyu Yong;Park, Ju-Hyun;Kim, Jeong Seon;Lee, Yung-Seop
    • Journal of the Korean Data and Information Science Society / v.24 no.6 / pp.1429-1437 / 2013
  • Research articles in the food field related to climate change were analyzed with a text-mining algorithm, one of the unstructured data analysis tools used in big data analysis, with a focus on the frequencies of terms appearing in the abstracts. As a first step, a term-document matrix was constructed; a hierarchical clustering algorithm, based on dissimilarities among the selected terms and on expertise in the field, was then applied to classify the documents under consideration into a few labeled groups. Through this research, we identified important topics in the field of food related to climate change and their trends over the past years. The results are expected to be useful for future research on systematic responses and adaptation to climate change.
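The two steps named in the abstract, a term-document matrix followed by hierarchical clustering on term dissimilarities, can be sketched as follows. The abstract does not specify the dissimilarity or linkage, so cosine dissimilarity and average linkage are assumptions here, and the toy documents and function names are hypothetical:

```python
import math
from collections import Counter


def term_document_matrix(docs):
    """Return (vocab, rows), where rows[i][j] counts term i in document j."""
    vocab = sorted({w for d in docs for w in d})
    counts = [Counter(d) for d in docs]
    return vocab, [[c[w] for c in counts] for w in vocab]


def cosine_dissimilarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv) if nu and nv else 1.0


def agglomerative(rows, k):
    """Average-linkage agglomerative clustering of rows into k clusters."""
    clusters = [[i] for i in range(len(rows))]

    def linkage(a, b):
        total = sum(cosine_dissimilarity(rows[i], rows[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > k:
        # Merge the pair of clusters with the smallest average dissimilarity.
        _, i, j = min(
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


docs = [
    ["rice", "yield", "drought"],
    ["rice", "yield"],
    ["fish", "ocean"],
    ["fish", "ocean", "warming"],
]
vocab, rows = term_document_matrix(docs)
groups = [sorted(vocab[i] for i in c) for c in agglomerative(rows, 2)]
```

Terms that occur in the same abstracts end up in the same group; in the study, labeling the groups additionally relied on domain expertise, which this sketch does not model.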

Analysis of Potential Construction Risk Types in Formal Documents Using Text Mining (텍스트 마이닝을 통한 건설공사 공문 잠재적 리스크 유형 분석)

  • Eom, Sae Ho;Cha, Gichun;Park, Sun Kyu;Park, Seunghee;Park, Jongho
    • KSCE Journal of Civil and Environmental Engineering Research / v.43 no.1 / pp.91-98 / 2023
  • Since risks occurring in construction projects can have a significant impact on schedules and costs, there have been many studies on this topic. However, risk analysis is often limited to certain construction situations, and decision-making therefore relies mainly on experience. Data-based analyses have only been partially applied to safety and contract documents. Therefore, in this study, cluster analysis and a Word2Vec algorithm were applied to formal documents that contain important elements for contractors or clients. An initial classification of document content into six types was performed through cluster analysis, and 157 occurrence types were subdivided through application of the Word2Vec algorithm. The derived terms were re-classified into five categories and reviewed as to whether they could develop into potential construction risk factors. Identifying potential construction risk factors will be helpful as basic data for process management in the construction industry.

The Reform of the National Records Management System and Change of Administrative System in Korean Government from 1948 to 1964 (한국정부 수립 이후 행정체제의 변동과 국가기록관리체제의 개편(1948년~64년))

  • Lee, Sang-Hun
    • The Korean Journal of Archival Studies / no.21 / pp.169-246 / 2009
  • The national records management system of the Korean Government has developed in close relationship with changes in the administrative system. The national records management system established immediately after the founding of the Korean Government began to be reformed into a system with new features during the rapid transition of the administrative system in the early 1960s. This new system is particularly significant in that it began to cope with the mass production of records and was established at the government level for the first time since the founding of the government. It was also the basic framework that defined the records management pattern of the Korean Government for the following 40 years. Therefore, this study aims to identify the origin and meaning of the national records management system established during the early 1960s. At the time the government was established, the administrative system of the Korean Government was not completely free from the framework of the administrative system of the Chosen General Government, mainly because the Korean Government lacked the capability to renovate the administrative system. The national records management system was no exception. In other words, the forms and preparation methods of official documents, the official document management process, and the classification and appraisal system followed the records management system of the Chosen General Government without any alteration. The main factors that brought about the reform of the national records management system, as well as the change in the Korean administrative system during the early 1960s, had been forming in Korean society since the mid-1950s. This resulted from the growth of the Korean Army, public officers, and students of administrative science into the intrinsic elites of Korean society through their respective experience of the US administration.
In particular, the reform of the creation, classification, filing, transfer, and preservation system that took place when the scientific management system of the US Army was introduced into the Korean Army was a meaningful change in the historical development of the Korean records management system. This change had a decisive effect on the reform of the national records management system during the early 1960s. As the Korean Army, public officers, and students of administrative science, who had been growing since the mid-1950s, emerged as administrative elites during the early 1960s, the administrative system of the Korean Government underwent a change that differed qualitatively from the past, and the modernization of documentary administration pursued during the period was extended to the reform of the national records management system. The direction of reform was 'the efficient and effective control' of records based on scientific management, which was advanced by adapting the US office management system and a decimal filing system to Korean administrative circumstances. Consequently, various official document forms, standards, and processing procedures were improved and standardized, and the appraisal system based on function-based classification was unified at the government level by introducing a decimal filing system.

Automatic Quality Evaluation with Completeness and Succinctness for Text Summarization (완전성과 간결성을 고려한 텍스트 요약 품질의 자동 평가 기법)

  • Ko, Eunjung;Kim, Namgyu
    • Journal of Intelligence and Information Systems / v.24 no.2 / pp.125-148 / 2018
  • Recently, as the demand for big data analysis increases, cases of analyzing unstructured data and using the results are also increasing. Among the various types of unstructured data, text is used as a means of communicating information in almost all fields. In addition, many analysts are interested in text because the amount of data is very large and it is relatively easy to collect compared to other unstructured and structured data. Among the various text analysis applications, the following have been actively studied: document classification, which classifies documents into predetermined categories; topic modeling, which extracts major topics from a large number of documents; sentiment analysis or opinion mining, which identifies emotions or opinions contained in texts; and text summarization, which summarizes the main contents of one or several documents. In particular, text summarization is actively applied in business through news summary services, privacy policy summary services, etc. In addition, much research has been done in academia on the extraction approach, which selectively provides the main elements of the document, and the abstraction approach, which extracts elements of the document and composes new sentences by combining them. However, techniques for evaluating the quality of automatically summarized documents have not made as much progress as techniques for automatic text summarization. Most existing studies on the quality evaluation of summarization manually summarized documents, used these summaries as reference documents, and measured the similarity between the automatic summary and the reference document. Specifically, automatic summarization is performed on the full text through various techniques, and the result is compared with the reference document, which is an ideal summary, to measure the quality of the automatic summarization.
Reference documents are provided in two major ways; the most common is manual summarization, in which a person creates an ideal summary by hand. Since this method requires human intervention in preparing the summary, it takes a lot of time and cost to write the summary, and the evaluation result may differ depending on the subjectivity of the summarizer. Therefore, in order to overcome these limitations, attempts have been made to measure the quality of summary documents without human intervention. As a representative attempt, a method has recently been devised that reduces the size of the full text and measures the similarity between the reduced full text and the automatic summary. In this method, the more the frequent terms of the full text appear in the summary, the better the quality of the summary is judged to be. However, since summarization essentially means condensing a large amount of content while minimizing content omissions, it is unreasonable to say that a "good summary" based only on frequency is always a "good summary" in the essential sense. In order to overcome the limitations of this previous work on summarization evaluation, this study proposes an automatic quality evaluation method for text summarization based on the essential meaning of summarization. Specifically, succinctness is defined as an element indicating how little content is duplicated among the sentences of the summary, and completeness is defined as an element indicating how little of the content is missing from the summary. In this paper, we propose a method for automatic quality evaluation of text summarization based on the concepts of succinctness and completeness.
In order to evaluate the practical applicability of the proposed methodology, 29,671 sentences were extracted from TripAdvisor's hotel reviews, the reviews were summarized for each hotel, and the results of experiments evaluating the quality of the summaries according to the proposed methodology are presented. The study also provides a way to integrate completeness and succinctness, which are in a trade-off relationship, into an F-score, and proposes a method to perform optimal summarization by changing the threshold of sentence similarity.
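The abstract defines completeness and succinctness only in words, so the following is a rough sketch under stated assumptions rather than the authors' formulation: sentence similarity is taken to be Jaccard overlap of tokens, a source sentence counts as covered when some summary sentence reaches the threshold, and a summary sentence counts as a duplicate when it reaches the threshold against an earlier summary sentence. All function names are hypothetical, and both inputs are assumed to be non-empty.

```python
def jaccard(a, b):
    """Token-set overlap between two tokenized sentences (assumed similarity)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def completeness(source, summary, threshold):
    """Fraction of source sentences covered by at least one summary sentence."""
    covered = sum(
        1 for s in source if any(jaccard(s, t) >= threshold for t in summary)
    )
    return covered / len(source)


def succinctness(summary, threshold):
    """Fraction of summary sentences that do not duplicate an earlier one."""
    unique = sum(
        1
        for i, t in enumerate(summary)
        if not any(jaccard(t, u) >= threshold for u in summary[:i])
    )
    return unique / len(summary)


def f_score(c, s):
    """Harmonic mean of completeness and succinctness."""
    return 2 * c * s / (c + s) if c + s else 0.0


source = [
    ["room", "clean"],
    ["room", "clean", "quiet"],
    ["staff", "friendly"],
    ["breakfast", "cold"],
]
summary = [["room", "clean"], ["room", "clean"], ["staff", "friendly"]]
c = completeness(source, summary, 0.5)
s = succinctness(summary, 0.5)
score = f_score(c, s)
```

Raising the threshold makes coverage harder to achieve and duplication harder to trigger, which is one way to see the trade-off the abstract describes between the two measures.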

An Attention Method-based Deep Learning Encoder for the Sentiment Classification of Documents (문서의 감정 분류를 위한 주목 방법 기반의 딥러닝 인코더)

  • Kwon, Sunjae;Kim, Juae;Kang, Sangwoo;Seo, Jungyun
    • KIISE Transactions on Computing Practices / v.23 no.4 / pp.268-273 / 2017
  • Recently, deep learning encoder-based approaches have been actively applied in the field of sentiment classification. However, the Long Short-Term Memory network encoder, the commonly used architecture, produces lower-quality vector representations as documents become longer. In this study, for effective classification of sentiment documents, we suggest the use of an attention method-based deep learning encoder that generates the document vector representation as a weighted sum of the outputs of the Long Short-Term Memory network, weighted by importance. In addition, we propose modifications of the attention method-based deep learning encoder to suit the sentiment classification field, consisting of a window attention method part and an attention weight adjustment part. In the window attention method part, the weights are obtained in window units to effectively recognize sentiment features that consist of more than one word. In the attention weight adjustment part, the learned weights are smoothed. Experimental results showed that the proposed method outperformed the Long Short-Term Memory network encoder, achieving an accuracy of 89.67%.
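The core idea, a document vector formed as an attention-weighted sum of per-token encoder outputs, can be illustrated with a small sketch. It assumes the per-token importance scores are already given (in the paper they would be learned), and it models the window step as a simple moving average of scores, which is an assumption and not necessarily the authors' formulation. The function names are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]


def window_scores(scores, window):
    """Average each score with its neighbours inside a window of odd width.

    Assumed stand-in for window attention; window=1 leaves scores unchanged.
    """
    half = window // 2
    out = []
    for i in range(len(scores)):
        segment = scores[max(0, i - half): i + half + 1]
        out.append(sum(segment) / len(segment))
    return out


def attend(hidden_states, scores, window=1):
    """Return (document vector, attention weights).

    hidden_states: one output vector per token, e.g. from an LSTM encoder.
    scores: one unnormalised importance score per token.
    """
    weights = softmax(window_scores(scores, window))
    dim = len(hidden_states[0])
    doc = [
        sum(w * h[d] for w, h in zip(weights, hidden_states)) for d in range(dim)
    ]
    return doc, weights


hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
doc_uniform, w_uniform = attend(hidden, [0.0, 0.0, 0.0])
doc_peaked, w_peaked = attend(hidden, [0.0, 5.0, 0.0])
```

With equal scores the document vector is the plain average of the token vectors; a high score on one token pulls the document vector toward that token's output.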

Mining Intellectual History Using Unstructured Data Analytics to Classify Thoughts for Digital Humanities (디지털 인문학에서 비정형 데이터 분석을 이용한 사조 분류 방법)

  • Seo, Hansol;Kwon, Ohbyung
    • Journal of Intelligence and Information Systems / v.24 no.1 / pp.141-166 / 2018
  • Information technology improves the efficiency of humanities research. In humanities research, information technology can be used to analyze a given topic or document automatically, facilitate connections to other ideas, and increase our understanding of intellectual history. We suggest a method to identify and automatically analyze the relationships between arguments contained in unstructured data collected from humanities writings such as books, papers, and articles. Our method, which is called history mining, reveals influential relationships between arguments and the philosophers who present them. We utilize several classification algorithms, including a deep learning method. To verify the performance of the proposed methodology, we selected philosophers associated with empiricism and rationalism and collected their related writings or articles accessible on the internet. The performance of the classification algorithms was measured by recall, precision, F-score, and elapsed time. DNN, Random Forest, and Ensemble showed better performance than the other algorithms. Using the selected classification algorithm, we classified the writings of specific philosophers as rationalist or empiricist, and generated a history map that takes into account each philosopher's years of activity.

A Suggestion of the Direction of Construction Disaster Document Management through Text Data Classification Model based on Deep Learning (딥러닝 기반 분류 모델의 성능 분석을 통한 건설 재해사례 텍스트 데이터의 효율적 관리방향 제안)

  • Kim, Hayoung;Jang, YeEun;Kang, HyunBin;Son, JeongWook;Yi, June-Seong
    • Korean Journal of Construction Engineering and Management / v.22 no.5 / pp.73-85 / 2021
  • This study proposes an efficient management direction for Korean construction accident cases through a deep learning-based text data classification model. A deep learning model was developed that classifies construction accidents into five categories: fall, electric shock, flying object, collapse, and narrowness, which are representative accident types of KOSHA. In initial model tests, the classification accuracy for fall disasters was relatively high, while other types were classified as fall disasters. From these results, it was found that 1) specific accident-causing behavior, 2) similar sentence structure, and 3) complex accidents corresponding to multiple types affect the results. Two accuracy improvement experiments were then conducted: 1) reclassification and 2) elimination. As a result, the classification performance improved by 185.7% when complex accidents were eliminated. Through this, the multicollinearity of complex accidents, which include the contents of multiple accident types, was resolved. In conclusion, this study suggests the need to manage complex accidents independently while preparing a system to describe the situation of future accidents in detail.

A Study on the Correlation Lee Jae Ma's Four Types of Essential Physical Constitution and From index - Concerning Male and Female 3rd Year High School Student in Some Urban and Rural Areas - (사상체질류형(四象體質類型)과 체격(體格) 및 신체형태지수(身體形態指數)와의 비교연구(比較硏究) - 도시(都市)와 농촌(農村)의 일부지역(一部地域) 남녀고등학교(男女高等學校) 3학년(學年) 학생(學生)을 대상(對象)으로 -)

  • Lee, Moon-Ho;Hong, Sun-Yong
    • Journal of Sasang Constitutional Medicine / v.2 no.1 / pp.71-85 / 1990
  • 673 third-year students of boys' and girls' high schools in Taegu city and in Kuni-gun, Youngyang-gun, and Euisung-gun in Kyongbuk province were selected and investigated as the subjects of this study on the correlation between Lee Jae Ma's Four Types of Essential Physical Constitution and Physical Form Index. The results of the study were as follows. First, as for Height, the findings were not identical with the expression that "persons of Shaoyin (minor Yin) Type are short and small -- while persons of Taiyin (major Yin) Type are tall and big," cited in the classification of four different constitutions in a document named "Dong-Eu-Su-Se-Bo-Won". Comparison for persons of Shaoyang (minor Yang) Type was not possible owing to the lack of data on Height in documents concerning Lee Jae Ma's four types of essential physical constitution. Second, as for Sitting Height, a correlation was proved between the findings of this study and the expression in the above document describing external physical characteristics of Shaoyin-Type persons that "The upper part and the lower part of the body are well balanced", but in point of Relative Sitting Height, none was found between the two. Third, as for Chest-Girth and Relative Chest-Girth plus Weight and Relative Weight, the expression that "Persons of Taiyin (major Yin) Type have the largest physique of the four types of persons in the characteristics of external physical features, and that they also tend to have continental (wide-chested or large-scaled) character and strong nerve, that they are stoutly-built and fat." proved to have a correlation with the findings of this study.
Fourth, in point of Chest-Girth and Relative Chest-Girth, this study found that its findings have a correlation with the phrase that "Chests are well developed upward -- and sturdy and solid." in describing the characteristics of Shaoyang (minor Yang)-Type persons' external physical features, and with the phrase that "Chests are narrow" in the case of Shaoyin (minor Yin)-Type persons. Fifth, as for Weight and Relative Weight, a correlation was found between the findings and the expression that "Shaoyin-Type persons have comparatively less flesh" as a sign of external physical characteristics of Shaoyin-Type persons. The above-cited findings proved that there exist some correlations between the external physique of Lee Jae Ma's four types of essential constitution and Physical Form Indexes. In clinical classification, however, it is desirable that this approach be consulted only after careful consideration based on Lee Jae Ma's theory, and it seems imperative to continue the study of the objectivization of Lee's theory.

A Deep Learning-based Depression Trend Analysis of Korean on Social Media (딥러닝 기반 소셜미디어 한글 텍스트 우울 경향 분석)

  • Park, Seojeong;Lee, Soobin;Kim, Woo Jung;Song, Min
    • Journal of the Korean Society for Information Management / v.39 no.1 / pp.91-117 / 2022
  • The number of depressed patients in Korea and around the world is rapidly increasing every year. However, most mentally ill patients are not aware that they are suffering from the disease, so adequate treatment is not being provided. If depressive symptoms are neglected, they can lead to suicide, anxiety, and other psychological problems. Therefore, early detection and treatment of depression are very important in improving mental health. To address this problem, this study presents a deep learning-based depression tendency model using Korean social media text. After collecting data from Naver Knowledge iN, Naver Blog, Hidoc, and Twitter, DSM-5 major depressive disorder diagnosis criteria were used to classify and annotate classes according to the number of depressive symptoms. Afterwards, TF-IDF analysis and co-occurrence word analysis were performed to examine the characteristics of each class of the constructed corpus. In addition, word embedding, dictionary-based sentiment analysis, and LDA topic modeling were performed to generate a depression tendency classification model using various text features. Through this, the embedded text, sentiment score, and topic number for each document were calculated and used as text features. As a result, it was confirmed that the highest accuracy of 83.28% was achieved when the depression tendency was classified based on the KorBERT algorithm by combining both the sentiment score and the topic of the document with the embedded text. This study establishes a classification model for Korean depression tendency with improved performance using various text features, and is significant in that detecting potentially depressed individuals early among Korean online community users can enable rapid treatment and prevention, thereby helping to promote the mental health of Korean society.
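As an illustration of the TF-IDF step used to examine each class of the corpus, here is a minimal sketch. The abstract does not say which TF-IDF variant was used; this version assumes length-normalised term frequency and the unsmoothed `log(N / df)` inverse document frequency, and the toy tokens are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: list of tokenized documents. Returns one {term: score} dict per doc."""
    n_docs = len(docs)
    # Number of documents in which each term appears at least once.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        term_freq = Counter(doc)
        scores.append(
            {
                term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
                for term, count in term_freq.items()
            }
        )
    return scores


docs = [["sad", "tired", "sad"], ["tired", "happy"]]
scores = tf_idf(docs)
```

A term that appears in every document receives a score of zero, so the highest-scoring terms are the ones that distinguish one document (or class) from the rest.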