• Title/Summary/Keyword: Text mining

Search Result 1,521, Processing Time 0.028 seconds

LDA Topic Modeling and Recommendation of Similar Patent Document Using Word2vec (LDA 토픽 모델링과 Word2vec을 활용한 유사 특허문서 추천연구)

  • Apgil Lee;Keunho Choi;Gunwoo Kim
    • Information Systems Review
    • /
    • v.22 no.1
    • /
    • pp.17-31
    • /
    • 2020
  • With the start of the fourth industrial revolution era, technologies of various fields are merged and new types of technologies and products are being developed. In addition, the importance of the registration of intellectual property rights and patent registration to gain market dominance of them is increasing in oversea as well as in domestic. Accordingly, the number of patents to be processed per examiner is increasing every year, so time and cost for prior art research are increasing. Therefore, a number of researches have been carried out to reduce examination time and cost for patent-pending technology. This paper proposes a method to calculate the degree of similarity among patent documents of the same priority claim when a plurality of patent rights priority claims are filed and to provide them to the examiner and the patent applicant. To this end, we preprocessed the data of the existing irregular patent documents, used Word2vec to obtain similarity between patent documents, and then proposed recommendation model that recommends a similar patent document in descending order of score. This makes it possible to promptly refer to the examination history of patent documents judged to be similar at the time of examination by the examiner, thereby reducing the burden of work and enabling efficient search in the applicant's prior art research. We expect it will contribute greatly.

A Gap Analysis Using Spatial Data and Social Media Big Data Analysis Results of Island Tourism Resources for Sustainable Resource Management (지속가능한 자원관리를 위한 섬 지역 관광자원의 공간정보와 소셜미디어 빅데이터 분석 결과를 활용한 격차분석)

  • Lee, Sung-Hee;Lee, Ju-Kyung;Son, Yong-Hoon;Kim, Young-Jin
    • Journal of Korean Society of Rural Planning
    • /
    • v.30 no.2
    • /
    • pp.13-24
    • /
    • 2024
  • This study conducts an analysis of social media big data pertaining to island tourism resources, aiming to discern the diverse forms and categories of island tourism favored by consumers, ascertain predominant resources, and facilitate objective decision-making grounded in scientific methodologies. To achieve this objective, an examination of blog posts published on Naver from 2022 to 2023 was undertaken, utilizing keywords such as 'Island tourism', 'Island travel', and 'Island backpacking' as focal points for analysis. Text mining techniques were applied to sift through the data. Among the resources identified, the port emerged as a significant asset, serving as a pivotal conduit linking the island and mainland and holding substantial importance as a focal point and resource for tourist access to the island. Furthermore, an analysis of the disparity between existing island tourism resources and those acknowledged by tourists who actively engage with and appreciate island destinations led to the identification of 186 newly emerging resources. These nascent resources predominantly clustered within five regions: Incheon Metropolitan City, Tongyeong/Geoje City, Jeju Island, Ulleung-gun, and Shinan-gun. A scrutiny of these resources, categorized according to the tourism resource classification system, revealed a notable presence of new resources, chiefly in the domains of 'rural landscape', 'tourist resort/training facility', 'transportation facility', and 'natural resource'. Notably, many of these emerging resources were previously overlooked in official management targets or resource inventories pertaining to existing island tourism resources. Noteworthy examples include ports, beaches, and mountains, which, despite constituting a substantial proportion of the newly identified tourist resources, were not accorded prominence in spatial information datasets. This study holds significance in its ability to unearth novel tourism resources recognized by island tourism consumers through a gap analysis approach that juxtaposes the existing status of island tourism resource data with techniques utilizing social media big data. Furthermore, the methodology delineated in this research offers a valuable framework for domestic local governments to gauge local tourism demand and embark on initiatives for tourism development or regional revitalization.

A Study on the Perception of Pit and Fissure Sealant using Unstructured Big Data (비정형 빅데이터를 이용한 치면열구전색(치아홈메우기)에 대한 인식분석)

  • Han-A Cho
    • Journal of Korean Dental Hygiene Science
    • /
    • v.6 no.2
    • /
    • pp.101-114
    • /
    • 2023
  • Background: This study aimed to explore the overall perception of pit and fissure sealants and suggest methods to revitalize their current stagnation. Methods: To determine the social perception of the change in coverage policy for pit and fissure sealants, we categorized them into five time periods. The first period (December 1, 2009 to November 30, 2010), the second period (December 1, 2010 to September 30, 2012), the third period (October 1, 2012 to May 5, 2013), the fourth period (May 6, 2013 to September 30, 2017), and the fifth period (October 1, 2017 to December 31, 2022). We utilized text mining, an unstructured big data analysis method. Keywords were collected and analyzed using Textom, and the frequency analysis of the top 30 keywords, structural features of the semantic network, centrality analysis, QAP correlation analysis, and co-occurrence analysis were conducted. Results: The frequency analysis showed that the top keywords for each time period were 'Cavities', 'Treatment', and 'Children'. In the structural features of the semantic network of pit and fissure sealants by time period, the density index was found to be around 1.00 for all time periods. The QAP correlation analysis showed the highest correlation between the first and second periods and the fourth and fifth periods with a correlation coefficient of 0.834. The co-occurrence analysis showed that 'cavities' and 'prevention were the top two words across all time periods. Conclusion: This study showed that pit and fissure sealants are well accepted by the society as a preventive treatment for caries. However, the awareness of health education related to these sealants was found to be low. Efforts to revitalize stagnant pit and fissure sealants need to be strengthened with effective education.

A Study on Trends of Key Issues in Port Safety at Busan Port (부산항 항만안전 주요 이슈 동향에 관한 연구)

  • Jeong-Min Lee;Do-Yeon Ha;Joo-Hye Kim
    • Journal of Navigation and Port Research
    • /
    • v.48 no.1
    • /
    • pp.34-48
    • /
    • 2024
  • As global supply chain risks proliferate unpredictably, the high interdependence of port and logistics industry intensifies the risk burden. This study conducted fundamental research to explore diverse safety issues in domestic ports. Utilizing news article data about Busan Port, we employed LDA topic modeling and time-series linear regression to understand key safety trends. Over the past 30 years, Busan Port faced nine major safety issues-maritime safety, import cargo inspection, labor strikes, and natural disasters emerged cyclically. Major port safety issues in Busan Port are primarily characterized by an unpredictable nature, falling under socio-environmental and natural phenomena types, indicating a significant impact of global uncertainty. Therefore, systematic policies need to be formulated based on identified port safety issues to enhance port safety in Busan Port. Additionally, there is a need to strengthen the resilience of port safety for unpredictable risk situations. In conclusion, advanced research activities are necessary to promote port safety enhancement in response to dynamically changing social conditions.

Online Privacy Protection: An Analysis of Social Media Reactions to Data Breaches (온라인 정보 보호: 소셜 미디어 내 정보 유출 반응 분석)

  • Seungwoo Seo;Youngjoon Go;Hong Joo Lee
    • Knowledge Management Research
    • /
    • v.25 no.1
    • /
    • pp.1-19
    • /
    • 2024
  • This study analyzed the changes in social media reactions of data subjects to major personal data breach incidents in South Korea from January 2014 to October 2022. We collected a total of 1,317 posts written on Naver Blogs within a week immediately following each incident. Applying the LDA topic modeling technique to these posts, five main topics were identified: personal data breaches, hacking, information technology, etc. Analyzing the temporal changes in topic distribution, we found that immediately after a data breach incident, the proportion of topics directly mentioning the incident was the highest. However, as time passed, the proportion of mentions related indirectly to the personal data breach increased. This suggests that the attention of data subjects shifts from the specific incident to related topics over time, and interest in personal data protection also decreases. The findings of this study imply a future need for research on the changes in privacy awareness of data subjects following personal data breach incidents.

What Concerns Does ChatGPT Raise for Us?: An Analysis Centered on CTM (Correlated Topic Modeling) of YouTube Video News Comments (ChatGPT는 우리에게 어떤 우려를 초래하는가?: 유튜브 영상 뉴스 댓글의 CTM(Correlated Topic Modeling) 분석을 중심으로)

  • Song, Minho;Lee, Soobum
    • Informatization Policy
    • /
    • v.31 no.1
    • /
    • pp.3-31
    • /
    • 2024
  • This study aimed to examine public concerns in South Korea considering the country's unique context, triggered by the advent of generative artificial intelligence such as ChatGPT. To achieve this, comments from 102 YouTube video news related to ethical issues were collected using a Python scraper, and morphological analysis and preprocessing were carried out using Textom on 15,735 comments. These comments were then analyzed using a Correlated Topic Model (CTM). The analysis identified six primary topics within the comments: "Legal and Ethical Considerations"; "Intellectual Property and Technology"; "Technological Advancement and the Future of Humanity"; "Potential of AI in Information Processing"; "Emotional Intelligence and Ethical Regulations in AI"; and "Human Imitation."Structuring these topics based on a correlation coefficient value of over 10% revealed 3 main categories: "Legal and Ethical Considerations"; "Issues Related to Data Generation by ChatGPT (Intellectual Property and Technology, Potential of AI in Information Processing, and Human Imitation)"; and "Fear for the Future of Humanity (Technological Advancement and the Future of Humanity, Emotional Intelligence, and Ethical Regulations in AI)."The study confirmed the coexistence of various concerns along with the growing interest in generative AI like ChatGPT, including worries specific to the historical and social context of South Korea. These findings suggest the need for national-level efforts to ensure data fairness.

A Study on Determining the Priority of Introducing Smart Ports in Korea (국내 스마트 항만 도입 우선순위 도출 연구)

  • Ryu, Won-Hyeong;Nam, Hyung-Sik
    • Journal of Korea Port Economic Association
    • /
    • v.40 no.1
    • /
    • pp.31-59
    • /
    • 2024
  • In June 2016, the term "Fourth Industrial Revolution" was first used at the World Economic Forum in Davos, Switzerland, and it gained worldwide attention. Consequently, the importance of smart ports has increased as the shipping industry has been incorporating various Fourth Industrial Revolution technologies. Currently, major countries around the world are working to achieve digital transformation in the maritime and port industry by establishing comprehensive smart ports. However, the smartification of domestic ports in South Korea is currently limited to a few areas such as Busan, Incheon, and Gwangyang, focusing on port automation. In this context, this study performed keyword analysis to identify key components of smart ports and conducted Analytic Hierarchy Process (AHP) analysis among relevant stakeholders to determine the priorities for the Introduction of smart ports in South Korea. The analysis revealed that universities prioritized automation, intelligenceization, informatization and environmentalization in that order. Research institutes prioritized informatization, intelligenceization, automation and environmentalization. Government agencies prioritized informatization, automation, intelligenceization and environmentalization, while private sector enterprises prioritized automation, intelligenceization, informatization, and environmentalization.

Improving Performance of Recommendation Systems Using Topic Modeling (사용자 관심 이슈 분석을 통한 추천시스템 성능 향상 방안)

  • Choi, Seongi;Hyun, Yoonjin;Kim, Namgyu
    • Journal of Intelligence and Information Systems
    • /
    • v.21 no.3
    • /
    • pp.101-116
    • /
    • 2015
  • Recently, due to the development of smart devices and social media, vast amounts of information with the various forms were accumulated. Particularly, considerable research efforts are being directed towards analyzing unstructured big data to resolve various social problems. Accordingly, focus of data-driven decision-making is being moved from structured data analysis to unstructured one. Also, in the field of recommendation system, which is the typical area of data-driven decision-making, the need of using unstructured data has been steadily increased to improve system performance. Approaches to improve the performance of recommendation systems can be found in two aspects- improving algorithms and acquiring useful data with high quality. Traditionally, most efforts to improve the performance of recommendation system were made by the former approach, while the latter approach has not attracted much attention relatively. In this sense, efforts to utilize unstructured data from variable sources are very timely and necessary. Particularly, as the interests of users are directly connected with their needs, identifying the interests of the user through unstructured big data analysis can be a crew for improving performance of recommendation systems. In this sense, this study proposes the methodology of improving recommendation system by measuring interests of the user. Specially, this study proposes the method to quantify interests of the user by analyzing user's internet usage patterns, and to predict user's repurchase based upon the discovered preferences. There are two important modules in this study. The first module predicts repurchase probability of each category through analyzing users' purchase history. We include the first module to our research scope for comparing the accuracy of traditional purchase-based prediction model to our new model presented in the second module. This procedure extracts purchase history of users. The core part of our methodology is in the second module. This module extracts users' interests by analyzing news articles the users have read. The second module constructs a correspondence matrix between topics and news articles by performing topic modeling on real world news articles. And then, the module analyzes users' news access patterns and then constructs a correspondence matrix between articles and users. After that, by merging the results of the previous processes in the second module, we can obtain a correspondence matrix between users and topics. This matrix describes users' interests in a structured manner. Finally, by using the matrix, the second module builds a model for predicting repurchase probability of each category. In this paper, we also provide experimental results of our performance evaluation. The outline of data used our experiments is as follows. We acquired web transaction data of 5,000 panels from a company that is specialized to analyzing ranks of internet sites. At first we extracted 15,000 URLs of news articles published from July 2012 to June 2013 from the original data and we crawled main contents of the news articles. After that we selected 2,615 users who have read at least one of the extracted news articles. Among the 2,615 users, we discovered that the number of target users who purchase at least one items from our target shopping mall 'G' is 359. In the experiments, we analyzed purchase history and news access records of the 359 internet users. From the performance evaluation, we found that our prediction model using both users' interests and purchase history outperforms a prediction model using only users' purchase history from a view point of misclassification ratio. In detail, our model outperformed the traditional one in appliance, beauty, computer, culture, digital, fashion, and sports categories when artificial neural network based models were used. Similarly, our model outperformed the traditional one in beauty, computer, digital, fashion, food, and furniture categories when decision tree based models were used although the improvement is very small.

Digital Archives of Cultural Archetype Contents: Its Problems and Direction (디지털 아카이브즈의 문제점과 방향 - 문화원형 콘텐츠를 중심으로 -)

  • Hahm, Han-Hee;Park, Soon-Cheol
    • Journal of the Korean BIBLIA Society for library and Information Science
    • /
    • v.17 no.2
    • /
    • pp.23-42
    • /
    • 2006
  • This is a study of the digital archives of Culturecontent.com where 'Cultural Archetype Contents' are currently in service. One of the major purposes of our study is to point out problems in the current system and eventually propose improvements to the digital archives. The government launched a four-year project for developing the cultural archetype content sources and establishing its related business with the hope of enhancing the nation's competitiveness. More specifically, the project focuses on the production of source materials of cultural archetype contents in the subjects of Korea's history. tradition, everyday life. arts and general geographical books. In addition, through this project, the government also intends to establish a proper distribution system of digitalized culture contents and to control copyright issues. This paper analyzes the digital archives system that stores the culture content data that have been produced from 2002 to 2005 and evaluates the current system's weaknesses and strengths. The summary of our findings is as follows. First. the digital archives system does not contain a semantic search engine and therefore its full function is 1agged. Second, similar data is not classified into the same categories but into the different ones, thereby confusing and inconveniencing users. Users who want to find source materials could be disappointed by the current distributive system. Our paper suggests a better system of digital archives with text mining technology which consists of five significant intelligent process-keyword searches, summarization, clustering, classification and topic tracking. Our paper endeavors to develop the best technical environment for preserving and using culture contents data. With the new digitalized upgraded settings, users of culture contents data will discover a world of new knowledge. The technology we introduce in this paper will lead to the highest achievable digital intelligence through a new framework.

A Proposal of a Keyword Extraction System for Detecting Social Issues (사회문제 해결형 기술수요 발굴을 위한 키워드 추출 시스템 제안)

  • Jeong, Dami;Kim, Jaeseok;Kim, Gi-Nam;Heo, Jong-Uk;On, Byung-Won;Kang, Mijung
    • Journal of Intelligence and Information Systems
    • /
    • v.19 no.3
    • /
    • pp.1-23
    • /
    • 2013
  • To discover significant social issues such as unemployment, economy crisis, social welfare etc. that are urgent issues to be solved in a modern society, in the existing approach, researchers usually collect opinions from professional experts and scholars through either online or offline surveys. However, such a method does not seem to be effective from time to time. As usual, due to the problem of expense, a large number of survey replies are seldom gathered. In some cases, it is also hard to find out professional persons dealing with specific social issues. Thus, the sample set is often small and may have some bias. Furthermore, regarding a social issue, several experts may make totally different conclusions because each expert has his subjective point of view and different background. In this case, it is considerably hard to figure out what current social issues are and which social issues are really important. To surmount the shortcomings of the current approach, in this paper, we develop a prototype system that semi-automatically detects social issue keywords representing social issues and problems from about 1.3 million news articles issued by about 10 major domestic presses in Korea from June 2009 until July 2012. Our proposed system consists of (1) collecting and extracting texts from the collected news articles, (2) identifying only news articles related to social issues, (3) analyzing the lexical items of Korean sentences, (4) finding a set of topics regarding social keywords over time based on probabilistic topic modeling, (5) matching relevant paragraphs to a given topic, and (6) visualizing social keywords for easy understanding. In particular, we propose a novel matching algorithm relying on generative models. The goal of our proposed matching algorithm is to best match paragraphs to each topic. Technically, using a topic model such as Latent Dirichlet Allocation (LDA), we can obtain a set of topics, each of which has relevant terms and their probability values. In our problem, given a set of text documents (e.g., news articles), LDA shows a set of topic clusters, and then each topic cluster is labeled by human annotators, where each topic label stands for a social keyword. For example, suppose there is a topic (e.g., Topic1 = {(unemployment, 0.4), (layoff, 0.3), (business, 0.3)}) and then a human annotator labels "Unemployment Problem" on Topic1. In this example, it is non-trivial to understand what happened to the unemployment problem in our society. In other words, taking a look at only social keywords, we have no idea of the detailed events occurring in our society. To tackle this matter, we develop the matching algorithm that computes the probability value of a paragraph given a topic, relying on (i) topic terms and (ii) their probability values. For instance, given a set of text documents, we segment each text document to paragraphs. In the meantime, using LDA, we can extract a set of topics from the text documents. Based on our matching process, each paragraph is assigned to a topic, indicating that the paragraph best matches the topic. Finally, each topic has several best matched paragraphs. Furthermore, assuming there are a topic (e.g., Unemployment Problem) and the best matched paragraph (e.g., Up to 300 workers lost their jobs in XXX company at Seoul). In this case, we can grasp the detailed information of the social keyword such as "300 workers", "unemployment", "XXX company", and "Seoul". In addition, our system visualizes social keywords over time. Therefore, through our matching process and keyword visualization, most researchers will be able to detect social issues easily and quickly. Through this prototype system, we have detected various social issues appearing in our society and also showed effectiveness of our proposed methods according to our experimental results. Note that you can also use our proof-of-concept system in http://dslab.snu.ac.kr/demo.html.