• Title/Summary/Keyword: Text Mining Method

Search Result 448, Processing Time 0.023 seconds

A Feature Selection Technique for an Efficient Document Automatic Classification (효율적인 문서 자동 분류를 위한 대표 색인어 추출 기법)

  • 김지숙;김영지;문현정;우용태
    • The Journal of Information Technology and Database
    • /
    • v.8 no.1
    • /
    • pp.117-128
    • /
    • 2001
  • Recently there are many researches of text mining to find interesting patterns or association rules from mass textual documents. However, the words extracted from informal documents are tend to be irregular and there are too many general words, so if we use pre-exist method, we would have difficulty in retrieving knowledge information effectively. In this paper, we propose a new feature extraction method to classify mass documents using association rule based on unsupervised learning technique. In experiment, we show the efficiency of suggested method by extracting features and classifying of documents.

  • PDF

A Text Mining-based Intrusion Log Recommendation in Digital Forensics (디지털 포렌식에서 텍스트 마이닝 기반 침입 흔적 로그 추천)

  • Ko, Sujeong
    • KIPS Transactions on Computer and Communication Systems
    • /
    • v.2 no.6
    • /
    • pp.279-290
    • /
    • 2013
  • In digital forensics log files have been stored as a form of large data for the purpose of tracing users' past behaviors. It is difficult for investigators to manually analysis the large log data without clues. In this paper, we propose a text mining technique for extracting intrusion logs from a large log set to recommend reliable evidences to investigators. In the training stage, the proposed method extracts intrusion association words from a training log set by using Apriori algorithm after preprocessing and the probability of intrusion for association words are computed by combining support and confidence. Robinson's method of computing confidences for filtering spam mails is applied to extracting intrusion logs in the proposed method. As the results, the association word knowledge base is constructed by including the weights of the probability of intrusion for association words to improve the accuracy. In the test stage, the probability of intrusion logs and the probability of normal logs in a test log set are computed by Fisher's inverse chi-square classification algorithm based on the association word knowledge base respectively and intrusion logs are extracted from combining the results. Then, the intrusion logs are recommended to investigators. The proposed method uses a training method of clearly analyzing the meaning of data from an unstructured large log data. As the results, it complements the problem of reduction in accuracy caused by data ambiguity. In addition, the proposed method recommends intrusion logs by using Fisher's inverse chi-square classification algorithm. So, it reduces the rate of false positive(FP) and decreases in laborious effort to extract evidences manually.

User Experience Evaluation of Menstrual Cycle Measurement Application Using Text Mining Analysis Techniques (텍스트 마이닝 분석 기법을 활용한 월경주기측정 애플리케이션 사용자 경험 평가)

  • Wookyung Jeong;Donghee Shin
    • Journal of the Korean Society for information Management
    • /
    • v.40 no.4
    • /
    • pp.1-31
    • /
    • 2023
  • This study conducted user experience evaluation by introducing various text mining techniques along with topic modeling techniques for mobile menstrual cycle measurement applications that are closely related to women's health and analyzed the results by combining them with a honeycomb model. To evaluate the user experience revealed in the menstrual cycle measurement application review, 47,117 Korean reviews of the menstrual cycle measurement application were collected. Topic modeling analysis was conducted to confirm the overall discourse on the user experience revealed in the review, and text network analysis was conducted to confirm the specific experience of each topic. In addition, sentimental analysis was conducted to understand the emotional experience of users. Based on this, the development strategy of the menstrual cycle measurement application was presented in terms of accuracy, design, monitoring, data management, and user management. As a result of the study, it was confirmed that the accuracy and monitoring function of the menstrual cycle measurement of the application should be improved, and it was observed that various design attempts were required. In addition, the necessity of supplementing personal information and the user's biometric data management method was also confirmed. By exploring the user experience (UX) of the menstrual cycle measurement application in-depth, this study revealed various factors experienced by users and suggested practical improvements to provide a better experience. It is also significant in that it presents a methodology by combines topic modeling and text network analysis techniques so that researchers can closely grasp vast amounts of review data in the process of evaluating user experiences.

Self-Supervised Document Representation Method

  • Yun, Yeoil;Kim, Namgyu
    • Journal of the Korea Society of Computer and Information
    • /
    • v.25 no.5
    • /
    • pp.187-197
    • /
    • 2020
  • Recently, various methods of text embedding using deep learning algorithms have been proposed. Especially, the way of using pre-trained language model which uses tremendous amount of text data in training is mainly applied for embedding new text data. However, traditional pre-trained language model has some limitations that it is hard to understand unique context of new text data when the text has too many tokens. In this paper, we propose self-supervised learning-based fine tuning method for pre-trained language model to infer vectors of long-text. Also, we applied our method to news articles and classified them into categories and compared classification accuracy with traditional models. As a result, it was confirmed that the vector generated by the proposed model more accurately expresses the inherent characteristics of the document than the vectors generated by the traditional models.

Keyword Reorganization Techniques for Improving the Identifiability of Topics (토픽 식별성 향상을 위한 키워드 재구성 기법)

  • Yun, Yeoil;Kim, Namgyu
    • Journal of Information Technology Services
    • /
    • v.18 no.4
    • /
    • pp.135-149
    • /
    • 2019
  • Recently, there are many researches for extracting meaningful information from large amount of text data. Among various applications to extract information from text, topic modeling which express latent topics as a group of keywords is mainly used. Topic modeling presents several topic keywords by term/topic weight and the quality of those keywords are usually evaluated through coherence which implies the similarity of those keywords. However, the topic quality evaluation method based only on the similarity of keywords has its limitations because it is difficult to describe the content of a topic accurately enough with just a set of similar words. In this research, therefore, we propose topic keywords reorganizing method to improve the identifiability of topics. To reorganize topic keywords, each document first needs to be labeled with one representative topic which can be extracted from traditional topic modeling. After that, classification rules for classifying each document into a corresponding label are generated, and new topic keywords are extracted based on the classification rules. To evaluated the performance our method, we performed an experiment on 1,000 news articles. From the experiment, we confirmed that the keywords extracted from our proposed method have better identifiability than traditional topic keywords.

Automatic Construction of Reduced Dimensional Cluster-based Keyword Association Networks using LSI (LSI를 이용한 차원 축소 클러스터 기반 키워드 연관망 자동 구축 기법)

  • Yoo, Han-mook;Kim, Han-joon;Chang, Jae-young
    • Journal of KIISE
    • /
    • v.44 no.11
    • /
    • pp.1236-1243
    • /
    • 2017
  • In this paper, we propose a novel way of producing keyword networks, named LSI-based ClusterTextRank, which extracts significant key words from a set of clusters with a mutual information metric, and constructs an association network using latent semantic indexing (LSI). The proposed method reduces the dimension of documents through LSI, decomposes documents into multiple clusters through k-means clustering, and expresses the words within each cluster as a maximal spanning tree graph. The significant key words are identified by evaluating their mutual information within clusters. Then, the method calculates the similarities between the extracted key words using the term-concept matrix, and the results are represented as a keyword association network. To evaluate the performance of the proposed method, we used travel-related blog data and showed that the proposed method outperforms the existing TextRank algorithm by about 14% in terms of accuracy.

Effects of Preprocessing on Text Classification in Balanced and Imbalanced Datasets

  • Mehmet F. Karaca
    • KSII Transactions on Internet and Information Systems (TIIS)
    • /
    • v.18 no.3
    • /
    • pp.591-609
    • /
    • 2024
  • In this study, preprocessings with all combinations were examined in terms of the effects on decreasing word number, shortening the duration of the process and the classification success in balanced and imbalanced datasets which were unbalanced in different ratios. The decreases in the word number and the processing time provided by preprocessings were interrelated. It was seen that more successful classifications were made with Turkish datasets and English datasets were affected more from the situation of whether the dataset is balanced or not. It was found out that the incorrect classifications, which are in the classes having few documents in highly imbalanced datasets, were made by assigning to the class close to the related class in terms of topic in Turkish datasets and to the class which have many documents in English datasets. In terms of average scores, the highest classification was obtained in Turkish datasets as follows: with not applying lowercase, applying stemming and removing stop words, and in English datasets as follows: with applying lowercase and stemming, removing stop words. Applying stemming was the most important preprocessing method which increases the success in Turkish datasets, whereas removing stop words in English datasets. The maximum scores revealed that feature selection, feature size and classifier are more effective than preprocessing in classification success. It was concluded that preprocessing is necessary for text classification because it shortens the processing time and can achieve high classification success, a preprocessing method does not have the same effect in all languages, and different preprocessing methods are more successful for different languages.

Analysis of Research Trends Related to Start-Up Using Text Mining (텍스트마이닝을 이용한 창업 관련 연구 동향 분석)

  • Han, Sung-Soo;Yang, Dong-Woo
    • Asia-Pacific Journal of Business Venturing and Entrepreneurship
    • /
    • v.12 no.5
    • /
    • pp.1-12
    • /
    • 2017
  • The purpose of this study is to investigate the trends of the start-up research in Korea. To accomplish this, meta-analysis was carried out using text mining methodology by dividing the entrepreneur-related master's and doctoral theses registered in RISS into the first term of entrepreneurship research by 2009 and the second term of entrepreneurship research from 2010. As a result of this study, it can be seen from the three different analysis that the entrepreneurship education and government policy and support are the subject of continuous research topics in the whole period and that the researches on small business start-ups have been studied continuously and conducted more in the second half. In addition, empirical analysis is strengthened in the latter stage of entrepreneurial research. The TF-IDF analysis reveals that many researches on veterans have been carried out in the field of entrepreneurship research, and in the latter period, it was found that many studies related to the elderly were conducted with cultural contents and aging society. In addition, research on brand-related research has been carried out throughout the entire period, and research on venture-related research, characteristics of entrepreneurs, entrepreneurship motivation and start-up strategy have been conducted a lot and female entrepreneurship was also studied. In the latter period, we have emphasized entrepreneurial achievements and found that research on start-ups such as industry-academia cooperation, start-up investment, and social enterprise diversified. This study is meaningful to apply the method which is becoming a recent issue such as text mining and topic analysis to the meta-analysis related to start-up. Future research will need to be undertaken on a variety of more detailed topics related to entrepreneurship.

  • PDF

Text Mining-Based Analysis for Research Trends in Vocational Studies (텍스트 마이닝을 활용한 직업학 연구동향 분석)

  • Yook, Dong-In
    • Journal of the Korea Academia-Industrial cooperation Society
    • /
    • v.18 no.3
    • /
    • pp.586-599
    • /
    • 2017
  • This study attempts to understand the overall research trends in Vocational Studies using a text mining method, which is a means to analyze big data. The findings of the research show that Vocational Studies in Korea has been directly influenced by global economic crises, as evidenced by its exponential growth after the 1997 foreign exchange crisis that resulted in a bailout from the IMF. In addition, the topics of research have been shifting from such macro subjects as government policies and systems to such micro topics as individual career development. Moreover, the perspective of research is being moved from the socially vulnerable, including women and the disabled, to the economically marginalized, including retirees and the unemployed. As for the research targets, college students overwhelmingly outnumbered primary and secondary school students. However, few cases analyzed the clinical outcomes of career counseling or attempted to process job information and study the history of jobs. This research is limited in that it only analyzed journal abstracts. Nonetheless, it is meaningful because it used topic analysis, one of the text mining methods, to give a complete enumeration of all articles available for search, thereby crafting a framework of quantitative analysis methodology for Vocational Studies. It is also significant in that it is the first attempt to analyze themes in every stage of the development of Vocational Studies.

Occupational Therapy in Long-Term Care Insurance For the Elderly Using Text Mining (텍스트 마이닝을 활용한 노인장기요양보험에서의 작업치료: 2007-2018년)

  • Cho, Min Seok;Baek, Soon Hyung;Park, Eom-Ji;Park, Soo Hee
    • Journal of Society of Occupational Therapy for the Aged and Dementia
    • /
    • v.12 no.2
    • /
    • pp.67-74
    • /
    • 2018
  • Objective : The purpose of this study is to quantitatively analyze the role of occupational therapy in long - term care insurance for the elderly using text mining, one of the big data analysis techniques. Method : For the analysis of newspaper articles, "Long - Term Care Insurance for the Elderly + Occupational Therapy for the Elderly" was collected after the period from 2007 to 208. Naver, which has a high share of the domestic search engine, utilized the database of Naver News by utilizing Textom, a web crawling tool. After collecting the article title and original text of 510 news data from the collection of the elderly long term care insurance + occupational therapy search, we analyzed the article frequency and key words by year. Result : In terms of the frequency of articles published by year, the number of articles published in 2015 and 2017 was the highest with 70 articles (13.7%), and the top 10 terms of the key word analysis showed the highest frequency of 'dementia' (344) In terms of key words, dementia, treatment, hospital, health, service, rehabilitation, facilities, institution, grade, elderly, professional, salary, industrial complex and people are related. Conclusion : In this study, it is meaningful that the textual mining technique was used to more objectively confirm the social needs and the role of the occupational therapist for the dementia and rehabilitation in the related key keywords based on the media reporting trend of the elderly long - term care insurance for 11 years. Based on the results of this study, future research should expand research field and period and supplement the research methodology through various analysis methods according to the year.