• Title/Summary/Keyword: document clustering (문서 군집화)

A Comparative Study on Clustering Methods for Grouping Related Tags (연관 태그의 군집화를 위한 클러스터링 기법 비교 연구)

  • Han, Seung-Hee
    • Journal of the Korean Society for Library and Information Science / v.43 no.3 / pp.399-416 / 2009
  • In this study, methods for clustering related tags were examined as a way to improve search and exploration in the tag space. Experiments were performed on 10 Delicious tags and the strongly related tags extracted from 300 documents for each of them, and hierarchical and non-hierarchical clustering methods were applied based on tag co-occurrences. Cluster relevance was measured to evaluate the results. Ward's method with the cosine coefficient, which is known to perform well in term clustering, performed best and showed a consistent clustering tendency. Furthermore, the analysis indicated that cluster membership among related tags reflects users' tagging purposes or interests and can disambiguate word senses. Tag clusters would therefore be helpful for improving search and exploration in the tag space.
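
As a rough illustration of the similarity step this abstract describes, the sketch below computes the cosine coefficient between tag co-occurrence vectors. The tag names and counts are invented for the example and do not come from the study, and the block covers only the similarity matrix that a method such as Ward's would take as input, not the clustering itself.

```python
import math


def cosine_coefficient(u, v):
    """Cosine coefficient between two equal-length co-occurrence vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: an all-zero vector is similar to nothing
    return dot / (norm_u * norm_v)


# Hypothetical counts: how often each tag co-occurs with four context tags.
cooccurrence = {
    "python": [10, 8, 0, 1],
    "programming": [9, 7, 1, 0],
    "recipes": [0, 1, 12, 9],
}


def similarity_matrix(vectors):
    """Pairwise cosine coefficients keyed by (tag, tag)."""
    tags = sorted(vectors)
    return {
        (a, b): cosine_coefficient(vectors[a], vectors[b])
        for a in tags
        for b in tags
    }


sims = similarity_matrix(cooccurrence)
```

With these made-up counts, "python" and "programming" come out far more similar to each other than either is to "recipes", which is the kind of signal a hierarchical method would use to merge them first.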

A study on the User Experience at Unmanned Checkout Counter Using Big Data Analysis (빅데이터를 활용한 편의점 간편식에 대한 의미 분석)

  • Kim, Ae-sook;Ryu, Gi-hwan;Jung, Ju-hee;Kim, Hee-young
    • The Journal of the Convergence on Culture Technology / v.8 no.4 / pp.375-380 / 2022
  • The purpose of this study is to identify, using big data, how consumers perceive convenience store convenience food and what it means to them. News, blogs, cafes, Q&A services (Knowledge iN and Tips), and web documents on NAVER and Daum were analyzed, with 'convenience store convenience food' as the search keyword. The analysis period was the three years from January 1, 2019 to December 31, 2021. For data collection and analysis, frequency and matrix data were extracted using TEXTOM, and network analysis and visualization were conducted using the NetDraw function of the UCINET 6 program. As a result, convenience store convenience foods were clustered into health, diversity, convenience, and economy according to consumers' selection attributes. The findings are expected to serve as a basis for developing new convenience menus that reflect what convenience store convenience food means to consumers, including appropriate prices, discount coupons, and events.

Automatic Clustering of Same-Name Authors Using Full-text of Articles (논문 원문을 이용한 동명 저자 자동 군집화)

  • Kang, In-Su;Jung, Han-Min;Lee, Seung-Woo;Kim, Pyung;Goo, Hee-Kwan;Lee, Mi-Kyung;Goo, Nam-Ang;Sung, Won-Kyung
    • Proceedings of the Korea Contents Association Conference / 2006.11a / pp.652-656 / 2006
  • Bibliographic information retrieval systems require bibliographic data such as authors, organizations, and sources of publication to be uniquely identified using keys. In particular, when authors are represented simply by their names, users bear the burden of manually discriminating between different authors of the same name. Previous approaches to resolving the problem of same-name authors rely on bibliographic data such as co-author information and titles of articles. However, these methods cannot handle single-author articles, or articles that have no terms in common in their titles. To complement the previous methods, this study introduces a classification-based approach using similarity between the full texts of articles. Experiments using recent domestic proceedings showed that the proposed method has the potential to supplement the previous metadata-based approaches.

Coreference Resolution for Korean Using Random Forests (랜덤 포레스트를 이용한 한국어 상호참조 해결)

  • Jeong, Seok-Won;Choi, MaengSik;Kim, HarkSoo
    • KIPS Transactions on Software and Data Engineering / v.5 no.11 / pp.535-540 / 2016
  • Coreference resolution identifies mentions in documents and groups the mentions that refer to the same entity. It is an essential step for natural language processing applications such as information extraction, event tracking, and question answering. Recently, various coreference resolution models based on machine learning (ML) have been proposed. As is well known, these ML-based models need large training data that are manually annotated with coreferent mention tags. Unfortunately, we could not find usable open data for training ML-based models in Korean. Therefore, we propose an efficient coreference resolution model that needs less training data than other ML-based models. The proposed model identifies coreferent mentions using random forests based on sieve-guided features. In experiments with baseball news articles, the proposed model achieved a CoNLL F1-score of 0.6678, better than other ML-based models.

The Relationship Between Character and Costume in literary Work using Semantic networks -The novel 「Norwegian Wood」- (시맨틱 네트워크를 통한 문학작품 속 인물과 의상의 관계 -소설 「노르웨이의 숲」-)

  • Choi, Yeong-Hyeon;Kim, Seong Eun;Lee, Kyu-Hye
    • Journal of Digital Convergence / v.19 no.1 / pp.307-314 / 2021
  • This study aimed to apply the principle of the semantic network to a long novel in an attempt to understand the structure of the entire document and the relationships manifested between its words. The costume expressions in Murakami's novel Norwegian Wood were analyzed based on the characters' symbols, relationships, and personality characteristics. The study identified the symbols of the characters in the novel and the properties of the relationships between them through the Clauset-Newman-Moore clustering algorithm. The descriptions and symbols of the relationships between the characters were identified within the worldview that the author had intended. Further, it was confirmed that the way each costume is expressed according to a character's personality also serves as a clue that explains that character. This fusion study is academically significant in that it presents a new methodology for analyzing literary works.

Hierarchical Overlapping Clustering to Detect Complex Concepts (중복을 허용한 계층적 클러스터링에 의한 복합 개념 탐지 방법)

  • Hong, Su-Jeong;Choi, Joong-Min
    • Journal of Intelligence and Information Systems / v.17 no.1 / pp.111-125 / 2011
  • Clustering is a process of grouping similar or relevant documents into a cluster and assigning a meaningful concept to the cluster. By this process, clustering facilitates fast and correct search for the relevant documents by narrowing down the range of searching only to the collection of documents belonging to related clusters. For effective clustering, techniques are required for identifying similar documents and grouping them into a cluster, and discovering a concept that is most relevant to the cluster. One of the problems often appearing in this context is the detection of a complex concept that overlaps with several simple concepts at the same hierarchical level. Previous clustering methods were unable to identify and represent a complex concept that belongs to several different clusters at the same level in the concept hierarchy, and also could not validate the semantic hierarchical relationship between a complex concept and each of simple concepts. In order to solve these problems, this paper proposes a new clustering method that identifies and represents complex concepts efficiently. We developed the Hierarchical Overlapping Clustering (HOC) algorithm that modified the traditional Agglomerative Hierarchical Clustering algorithm to allow overlapped clusters at the same level in the concept hierarchy. The HOC algorithm represents the clustering result not by a tree but by a lattice to detect complex concepts. We developed a system that employs the HOC algorithm to carry out the goal of complex concept detection. This system operates in three phases; 1) the preprocessing of documents, 2) the clustering using the HOC algorithm, and 3) the validation of semantic hierarchical relationships among the concepts in the lattice obtained as a result of clustering. The preprocessing phase represents the documents as x-y coordinate values in a 2-dimensional space by considering the weights of terms appearing in the documents. 
First, the documents are refined by applying stopword removal and stemming to extract index terms. Then, each index term is assigned a TF-IDF weight, and the x-y coordinate value for each document is determined by combining the TF-IDF values of the terms in it. The clustering phase uses the HOC algorithm, in which the similarity between documents is calculated using the Euclidean distance. Initially, a cluster is generated for each document by grouping the documents that are closest to it. Then, the distance between every pair of clusters is measured, and the closest clusters are grouped into a new cluster. This process is repeated until the root cluster is generated. In the validation phase, feature selection is applied to check whether the cluster concepts built by the HOC algorithm have meaningful hierarchical relationships. Feature selection is a method of extracting key features from a document by identifying important and representative terms in the document and assigning weight values to them. In order to select key features correctly, a method is needed to determine how each term contributes to the class of the document. Among several methods for achieving this goal, this paper adopted the $\chi^2$ statistic, which measures the degree of dependency of a term t on a class c and represents the relationship between t and c by a numerical value. To demonstrate the effectiveness of the HOC algorithm, a series of performance evaluations was carried out using the well-known Reuters-21578 news collection. The results showed that the HOC algorithm contributes greatly to detecting and producing complex concepts by generating the concept hierarchy in a lattice structure.
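
The abstract does not give the exact form of the statistic the authors used. As a hedged sketch, the block below implements the standard 2x2 contingency-table form of the $\chi^2$ statistic commonly used for term selection in text categorization. The function name, argument names, and document counts are assumptions chosen for illustration.

```python
def chi_square(a, b, c, d):
    """Chi-square dependency between a term t and a class c.

    a: documents in class c that contain t
    b: documents outside class c that contain t
    c: documents in class c that do not contain t
    d: documents outside class c that do not contain t
    """
    n = a + b + c + d
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    if denominator == 0:
        return 0.0  # assumption: treat a degenerate table as "no dependency"
    return n * (a * d - c * b) ** 2 / denominator


# Hypothetical counts: the term appears mostly inside the class.
dependent = chi_square(40, 10, 10, 40)
# Hypothetical counts: the term is spread evenly, so it is independent.
independent = chi_square(25, 25, 25, 25)
```

A value of zero means the term carries no information about the class, and larger values mean stronger dependency, so terms can be ranked by this score when checking whether a cluster concept is well supported.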

Visualization of the Intellectual Structure on the Internet of Things Focuses on the Industry 4.0 (제 4차 산업혁명 중심의 사물인터넷 지적 구조 시각화)

  • Hyaejung, Lim;Chang-Kyo, Suh
    • Journal of Korea Society of Industrial Information Systems / v.27 no.6 / pp.127-140 / 2022
  • With the recent development of information and communication technology (ICT), industry has moved on from the third industrial revolution to the fourth. There is little doubt that companies will not survive in the future without adopting these technologies. The purpose of this research is to analyze the intellectual structure of the Internet of Things (IoT) literature on Industry 4.0 in order to offer better insight into the field. The data for this research were extracted from the Web of Science database. A total of 1,631 documents and 72,754 references were analyzed with the program CiteSpace. Author co-citation analysis was used to analyze the intellectual structure, together with clustering, timeline, and burst detection analyses. We identified 12 sub-areas of IoT for Industry 4.0, including 'Supply Chain', 'Digital Twin', and 'Smart Manufacturing System'. The timeline analysis shows which clusters are likely to gain or lose prominence. As concluding remarks, limitations and suggestions for further research are discussed.
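
To make the idea of author co-citation concrete, the sketch below counts how often two authors are cited together in the same document's reference list. It is a simplified illustration, not a description of how CiteSpace works internally, and the author labels "A", "B", and "C" are placeholders.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(reference_lists):
    """Count, for each author pair, the documents that cite both authors."""
    counts = Counter()
    for cited_authors in reference_lists:
        # Each author counts once per citing document; sorting makes the
        # pair key independent of the order in the reference list.
        for pair in combinations(sorted(set(cited_authors)), 2):
            counts[pair] += 1
    return counts


# Hypothetical reference lists of three citing documents.
citing_documents = [
    ["A", "B", "C"],
    ["B", "A"],
    ["C", "B", "B"],
]
counts = cocitation_counts(citing_documents)
```

The resulting pair counts form a symmetric co-citation matrix, which is the kind of input a clustering step needs in order to group authors into sub-areas.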

Analyzing the discriminative characteristic of cover letters using text mining focused on Air Force applicants (텍스트 마이닝을 이용한 공군 부사관 지원자 자기소개서의 차별적 특성 분석)

  • Kwon, Hyeok;Kim, Wooju
    • Journal of Intelligence and Information Systems / v.27 no.3 / pp.75-94 / 2021
  • The low birth rate and shortened military service period are causing concerns about selecting excellent military officers. The Republic of Korea entered a low birth rate society in 1984 and an aged society in 2018 respectively, and is expected to be in a super-aged society in 2025. In addition, the troop-oriented military is changed as a state-of-the-art weapons-oriented military, and the reduction of the military service period was implemented in 2018 to ease the burden of military service for young people and play a role in the society early. Some observe that the application rate for military officers is falling due to a decrease of manpower resources and a preference for shortened mandatory military service over military officers. This requires further consideration of the policy of securing excellent military officers. Most of the related studies have used social scientists' methodologies, but this study applies the methodology of text mining suitable for large-scale documents analysis. This study extracts words of discriminative characteristics from the Republic of Korea Air Force Non-Commissioned Officer Applicant cover letters and analyzes the polarity of pass and fail. It consists of three steps in total. First, the application is divided into general and technical fields, and the words characterized in the cover letter are ordered according to the difference in the frequency ratio of each field. The greater the difference in the proportion of each application field, the field character is defined as 'more discriminative'. Based on this, we extract the top 50 words representing discriminative characteristics in general fields and the top 50 words representing discriminative characteristics in technology fields. Second, the number of appropriate topics in the overall cover letter is calculated through the LDA. It uses perplexity score and coherence score. 
Based on the appropriate number of topics, we then use LDA to generate topics and probabilities, and estimate which topic each discriminative word belongs to. Subsequently, the keyword indicators of the cover letter questions are used to set the candidate labels, and the most appropriate indicator is chosen as the label for each topic in light of the topic-specific word distribution. Third, using L-LDA, with each cover letter labeled as pass or fail, we generate topics and probabilities for each field under the pass and fail labels. From these, we extract only the discriminative words that belong to labeled topics. Next, for each of these words, we compute the difference between its probability under the pass label and its probability under the fail label. A positive figure can be seen as having the polarity of pass, and a negative figure as having the polarity of fail. This study is the first to reflect the characteristics of cover letters written by Republic of Korea Air Force non-commissioned officer applicants rather than by applicants in the private sector. Moreover, the methodology applies text mining techniques to multiple documents, rather than survey or interview methods, which reduces analysis time and increases reliability for the entire population. For this reason, the methodology proposed in the study is also applicable to other forms of multiple documents in the field of military personnel. This study shows that L-LDA is more suitable than LDA for extracting the discriminative characteristics of Republic of Korea Air Force non-commissioned officer cover letters. Furthermore, this study proposes a methodology that uses a combination of LDA and L-LDA. 
Therefore, through the analysis of the results of the acquisition of non-commissioned Republic of Korea Air Force officers, we would like to provide information available for acquisition and promotional policies and propose a methodology available for research in the field of military manpower acquisition.
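
The first and last steps described in this abstract can be sketched without any topic-modeling library. The block below ranks words by the difference in their frequency ratio between two application fields, and assigns a pass or fail polarity from the sign of a probability difference. All tokens and probabilities are invented, the function names are hypothetical, and the "neutral" case for a zero difference is an assumption the abstract does not mention.

```python
from collections import Counter


def frequency_ratios(tokens):
    """Share of each word among all tokens of one application field."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def discriminative_words(general_tokens, technical_tokens, top_k=50):
    """Top-k words leaning toward each field by frequency-ratio difference."""
    general = frequency_ratios(general_tokens)
    technical = frequency_ratios(technical_tokens)
    diffs = {
        word: general.get(word, 0.0) - technical.get(word, 0.0)
        for word in set(general) | set(technical)
    }
    ranked = sorted(diffs.items(), key=lambda item: item[1], reverse=True)
    top_general = [w for w, d in ranked[:top_k] if d > 0]
    top_technical = [w for w, d in reversed(ranked[-top_k:]) if d < 0]
    return top_general, top_technical


def polarity(p_pass, p_fail):
    """Sign of the pass-minus-fail probability difference for one word."""
    diff = p_pass - p_fail
    if diff > 0:
        return "pass"
    if diff < 0:
        return "fail"
    return "neutral"  # assumption: not covered by the abstract


# Hypothetical tokens from cover letters in the two fields.
general = ["service", "team", "service", "duty"]
technical = ["repair", "engine", "team", "repair"]
top_general, top_technical = discriminative_words(general, technical, top_k=2)
```

In the study the probabilities passed to a function like `polarity` would come from the L-LDA topic-word distributions under the pass and fail labels; here they are simply arguments supplied by the caller.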

Identification of Strategic Fields for Developing Smart City in Busan Using Text Mining (텍스트 마이닝을 이용한 스마트 도시계획 수립을 위한 전략분야 도출연구: 부산 사례를 바탕으로)

  • Chae, Yoonsik;Lee, Sanghoon
    • Journal of Digital Convergence / v.16 no.11 / pp.1-15 / 2018
  • The purpose of this study is to analyze the bibliographic information of urban development initiative reports from Busan and other cities and to identify the strategic fields for a future smart city plan. A text mining method is used to extract keywords and identify the characteristics and patterns of information in urban development reports. The results show that, in the earlier stage, Busan focused on service creation for industrial development, but there was a lack of discussion on linking information systems with ICT. However, recent urban planning in Busan covers a variety of content related to the integrated connection of infrastructure, ICT systems, and city operation management in the specific fields of traffic, tourism, welfare, port/logistics, and culture/MICE. The results of this study are expected to provide policy implications for planning future smart city development initiatives.

Science and Technology Policy Studies, Society, and the State : An Analysis of a Co-evolution Among Social Issue, Governmental Policy, and Academic Research in Science and Technology (과학기술정책 연구와 사회, 정부 : 과학기술의 사회이슈, 정부정책, 학술연구의 공진화 분석)

  • Kwon, Ki-Seok;Jeong, Seohwa;Yi, Chan-Goo
    • Journal of Korea Technology Innovation Society / v.21 no.1 / pp.64-91 / 2018
  • This study explores the pattern of interaction among social issues, academic research, and governmental policy on science and technology during the last 20 years. In particular, we try to understand whether science and technology policy research and governmental policy meet social needs appropriately. To do this, we collected text data from news articles, papers, and governmental documents. Based on these data, social network analysis and cluster analysis were carried out. According to the results, science and technology policy research tended to focus on fragmented technological innovation meeting urgent practical needs at the initial stage. Recently, however, science and technology policy research has shown co-evolutionary patterns in response to society. Furthermore, a time lag has also been observed in the process of interaction among the three bodies. Based on these results, we put forward some suggestions for future research in science and technology policy. Firstly, the level of analysis needs to shift from the micro level to the meso or macro level. Secondly, more research effort needs to be focused on the policy process in science and technology and its public management. Finally, we have to enhance sensitivity to social issues through studies on agenda setting in science and technology policy.