• Title/Summary/Keyword: word clustering


Information Technology Application for Oral Document Analysis (구술문서 자료분석을 위한 정보검색기술의 응용)

  • Park, Soon-Cheol;Hahm, Han-Hee
    • Journal of Korea Society of Industrial Information Systems / v.13 no.2 / pp.47-55 / 2008
  • The purpose of this paper is to develop an analytical methodology for oral documents by applying information technologies. The system consists of keyword search, content summary, clustering, classification, and topic tracing of the contents. The integrated model of these five levels of retrieval technologies can be used exhaustively in the analysis of oral documents, which were collected as the oral history of five men and women in the North Jeolla area. Of the five methods, topic tracing is the most pioneering accomplishment both at home and abroad. Finally, this research will shed light on the methodological and theoretical studies of oral history and culture.


Study on CEO New Year's Address: Using Text Mining Method (텍스트마이닝을 활용한 주요 대기업 신년사 분석)

  • YuKyoung Kim;Daegon Cho
    • Journal of Information Technology Services / v.22 no.2 / pp.93-127 / 2023
  • This study analyzed the CEO New Year's addresses of major Korean companies, extracting key topics for employees via text mining techniques. An intended contribution of this study is to assist reporters, analysts, and researchers in gaining a better understanding of the New Year's addresses by elucidating the implicit features of the messages within. To this end, this study collected and analyzed 545 New Year's addresses published between 2012 and 2021 by the top 66 Korean companies in terms of market capitalization. The research methodologies applied include text clustering, word embedding of keywords, frequency analysis, and topic modeling. Our main findings suggest that the messages in the New Year's addresses fall into nine topics: organizational culture, global advancement, substantial management, business reorganization, capacity building, market leadership, management innovation, sustainable management, and technology development. Next, this study further analyzed the managerial significance of each topic and discussed their characteristics from the perspectives of time, industry, and corporate groups. Companies were typically found to emphasize sound management, market leadership, and business reorganization during economic downturns while stressing capacity building and organizational culture during market transition periods. Also, companies belonging to corporate groups tended to emphasize founding philosophy and corporate culture.

Speaker-Independent Isolated Word Recognition Using A Modified ISODATA Method (Modified ISODATA 방법을 이용한 불특정화자 단독어 인식)

  • Hwang, U-Geun;An, Tae-Ok;Lee, Hyeong-Jun
    • The Journal of the Acoustical Society of Korea / v.6 no.4 / pp.31-43 / 1987
  • As a study on Speaker-Independent Isolated Word Recognition, a Modified ISODATA clustering method is proposed. This method simplifies the outlier processing and the splitting procedure of the conventional ISODATA algorithm, and eliminates the lumping procedure. With this method, cluster centers can be found precisely and automatically. When the method was applied to 11 digits spoken by 10 males and 4 females, its recognition rate of $84.42\%$ for K=4 was better than that of the latest Modified K-means, $82.5\%$. Judging from these results, we conclude that this method is the best for finding cluster centers precisely.


Identification of Research Areas and Evolution of 2D Materials by the Keyword Mapping Methodology (키워드 매핑 기반 2차원 물질 연구 영역 탐지와 발전 과정 분석)

  • Ahn, Sejung;Lee, June Young
    • Journal of the Korean Institute of Electrical and Electronic Material Engineers / v.31 no.1 / pp.11-18 / 2018
  • Two-dimensional (2D) materials such as transition metal dichalcogenides have attracted tremendous scientific interest owing to their potential to solve the zero band-gap issue of graphene. In this work, the research areas and technology evolutionary dynamics of 2D materials were identified using a scientometric method focusing on keyword mapping and clustering. The time-series analysis showed that the technological progress of 2D materials is in the early growth period. An overlay mapping analysis was carried out to investigate the technology evolution of 2D materials over time. A strategic diagram of co-word analysis, classifying the topological positions of keywords, was derived to support the analysis results. It is conjectured that extensive research will be conducted on the application of 2D materials not only in electronic and optoelectronic devices but also in various other fields such as biomedical applications, and that their development will be more rapid based on the accumulated results of existing graphene research.

Efficient Illegal Contents Detection and Attacker Profiling in Real Environments

  • Kim, Jin-gang;Lim, Sueng-bum;Lee, Tae-jin
    • KSII Transactions on Internet and Information Systems (TIIS) / v.16 no.6 / pp.2115-2130 / 2022
  • With the development of over-the-top (OTT) services, the demand for content is increasing, and users can easily and conveniently acquire various content in the online environment. As a result, copyrighted content can be easily copied and distributed, resulting in serious copyright infringement. Some special forms of online service providers (OSP) use filtering-based technologies to protect copyrights, but illegal uploaders use methods that bypass traditional filters. When content is uploaded with a title that bypasses the filter, a similarity search cannot be used to detect the illegal content. In this paper, we propose a technique for profiling heavy uploaders by normalizing the bypassed content titles and efficiently detecting illegal content. First, words are extracted from the normalized title and converted into a bit array to detect illegal works. This Bloom Filter method has the characteristic that there are false positives but no false negatives. The false positive rate has a trade-off relationship with processing performance: as the false positive rate increases, the processing performance increases, and when the false positive rate decreases, the processing performance decreases. We increased the detection rate by directly comparing the word against the result obtained with an increased Bloom Filter false positive rate. The processing time was also as fast as when the false positive rate was increased. Afterwards, we create a function that includes information about overall piracy and identify heavy uploaders based on clustering. Finally, we analyze the behavior of heavy uploaders to find the first uploader and detect the source site.
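
The two-stage check described in this abstract (a fast Bloom Filter pre-check followed by a direct word comparison) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class `BloomFilter`, the helper `is_flagged`, the SHA-256 based hashing, and the parameters `num_bits` and `num_hashes` are all assumptions chosen for illustration.

```python
import hashlib


class BloomFilter:
    """Illustrative Bloom filter over words (hypothetical helper, not from the paper)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        # Fewer bits means less memory and work but a higher false positive rate.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, word):
        # Derive num_hashes bit positions by salting one hash function (an assumption).
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{word}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, word):
        for pos in self._positions(word):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, word):
        # True may be a false positive; False is never a false negative.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(word)
        )


def is_flagged(word, bloom, exact_words):
    """Cheap Bloom pre-check first, then a direct comparison to drop false positives."""
    return bloom.might_contain(word) and word in exact_words
```

In this sketch every word added to the filter is always reported by `might_contain`, while the direct lookup in `exact_words` only runs for words that pass the pre-check, which is one way to keep a small, fast filter without accepting its false positives.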

Comparative analysis of model performance for predicting the customer of cafeteria using unstructured data

  • Seungsik Kim;Nami Gu;Jeongin Moon;Keunwook Kim;Yeongeun Hwang;Kyeongjun Lee
    • Communications for Statistical Applications and Methods / v.30 no.5 / pp.485-499 / 2023
  • This study aimed to predict the number of meals served in a group cafeteria using machine learning methodology. Features of the menu were created through the Word2Vec methodology and clustering, and a stacking ensemble model was constructed using Random Forest, Gradient Boosting, and CatBoost as sub-models. Results showed that CatBoost had the best performance with the ensemble model showing an 8% improvement in performance. The study also found that the date variable had the greatest influence on the number of diners in a cafeteria, followed by menu characteristics and other variables. The implications of the study include the potential for machine learning methodology to improve predictive performance and reduce food waste, as well as the removal of subjective elements in menu classification. Limitations of the research include limited data cases and a weak model structure when new menus or foreign words are not included in the learning data. Future studies should aim to address these limitations.

Improving the Retrieval Effectiveness by Incorporating Word Sense Disambiguation Process (정보검색 성능 향상을 위한 단어 중의성 해소 모형에 관한 연구)

  • Chung, Young-Mee;Lee, Yong-Gu
    • Journal of the Korean Society for information Management / v.22 no.2 s.56 / pp.125-145 / 2005
  • This paper presents a semantic vector space retrieval model incorporating a word sense disambiguation algorithm in an attempt to improve retrieval effectiveness. Nine Korean homonyms are selected for the sense disambiguation and retrieval experiments. A total of approximately 120,000 news articles make up the raw test collection, and 18 queries including homonyms as query words are used for the retrieval experiments. A Naive Bayes classifier and the EM algorithm, representing supervised and unsupervised learning algorithms respectively, are used for the disambiguation process. The Naive Bayes classifier achieved $92\%$ disambiguation accuracy, while the clustering performance of the EM algorithm is $67\%$ on average. The retrieval effectiveness of the semantic vector space model incorporating the Naive Bayes classifier showed $39.6\%$ precision, achieving about $7.4\%$ improvement. However, the retrieval effectiveness of the EM algorithm-based semantic retrieval is $3\%$ lower than the baseline retrieval without disambiguation. It is worth noting that the performance of disambiguation and retrieval depends on the distribution patterns of the homonyms to be disambiguated as well as on the characteristics of the queries.

Headword Finding System Using Document Expansion (문서 확장을 이용한 표제어 검색시스템)

  • Kim, Jae-Hoon;Kim, Hyung-Chul
    • Journal of Information Management / v.42 no.4 / pp.137-154 / 2011
  • A headword finding system is defined as an information retrieval system that uses a word gloss as a query. We use the gloss as a document in order to implement such a system. The gloss is generally very short, which makes it very difficult to find the most appropriate headword for a given query. To alleviate this problem, we expand the document using the concept of query expansion in information retrieval. In this paper, we use two document expansion methods: gloss expansion and similar word expansion. The former is the process of inserting the glosses of words that are included in the document into a seed document. The latter is the process of inserting similar words into a seed document. We use a featureless clustering algorithm to obtain the similar words. The performance (r-inclusion rate) amounts to almost 100% when the queries are word glosses and r is 16, and to 66.9% when the queries are written in person by users. Through several experiments, we have observed that the document expansions are very useful for the headword finding system. In the future, new measures, including our proposed r-inclusion rate, are required for the performance evaluation of headword finding systems, and new evaluation sets are also needed for objective assessment.

Domain Analysis on the Field of Open Access by Co-Word Analysis: Based on Published Journals of Library and Information Science during 2013 to 2018 (동시출현단어 분석을 활용한 오픈액세스 분야의 지적구조 분석: 2013년부터 2018년까지 출판된 문헌정보학 저널을 기반으로)

  • Kim, Sun-Kyum;Kim, Wan-Jong;Seo, Tae-Sul;Choi, Hyun-Jin
    • Journal of Korean Library and Information Science Society / v.50 no.1 / pp.333-356 / 2019
  • Open access has emerged as an alternative to overcome the crisis in scholarly communication brought about by commercial publishers. The purpose of this study is to suggest an intellectual structure that reflects the newest research trends in the field of open access, to identify how the subject area is structured by using co-word analysis, and to compare the results with the existing study. To do this, a dataset of 761 papers in information science was collected from Web of Science for the period from January 2012 to November 2018, and 2,321 keywords in the form of noun phrases were extracted from titles and abstracts. To analyze the intellectual structure of open access, 13 topic clusters were extracted by network analysis, and the keywords with higher centrality were identified by visualizing the intellectual relationships. In addition, after clustering analysis, the relationships were analyzed by plotting the result on a multidimensional scaling map. As a result, it is expected that our research will help guide the future research direction of open access.

Clustering Analysis of Films on Box Office Performance : Based on Web Crawling (영화 흥행과 관련된 영화별 특성에 대한 군집분석 : 웹 크롤링 활용)

  • Lee, Jai-Ill;Chun, Young-Ho;Ha, Chunghun
    • Journal of Korean Society of Industrial and Systems Engineering / v.39 no.3 / pp.90-99 / 2016
  • Forecasting box office performance after a film's release is very important from the viewpoint of increasing profitability by reducing production and marketing costs. Analysis of psychological factors such as word-of-mouth and expert assessment is essential, but hard to perform due to the difficulties of data collection. Information technology such as web crawling and text mining can help to overcome this situation. For effective text mining, categorization of objects is required. From this perspective, the objective of this study is to provide a framework for classifying films according to their characteristics. Data including psychological factors are collected from Web sites using web crawling. A clustering analysis is conducted to classify films, and a series of one-way ANOVA analyses are conducted to statistically verify the differences in characteristics among groups. The result of the cluster analysis based on reviews and revenues shows that the films can be categorized into four distinct groups and that the differences in characteristics are statistically significant. The first group has high box office sales, and its number of clicks on reviews is higher than that of the other groups. The second group is similar to the first group, but its reviews are longer and its box office sales are not good. The third group's audiences prefer documentaries and animations, and its numbers of comments and interests are significantly lower than those of the other groups. The last group prefers the crime, thriller, and suspense genres. Correspondence analysis is also conducted to match the groups with intrinsic characteristics of films such as genre, movie rating, and nation.