• Title/Summary/Keyword: Text-based classification

Search Result 455, Processing Time 0.041 seconds

The Effect of Domain Specificity on the Performance of Domain-Specific Pre-Trained Language Models (도메인 특수성이 도메인 특화 사전학습 언어모델의 성능에 미치는 영향)

  • Han, Minah;Kim, Younha;Kim, Namgyu
    • Journal of Intelligence and Information Systems
    • /
    • v.28 no.4
    • /
    • pp.251-273
    • /
    • 2022
  • Recently, research on applying text analysis to deep learning has steadily continued. In particular, researches have been actively conducted to understand the meaning of words and perform tasks such as summarization and sentiment classification through a pre-trained language model that learns large datasets. However, existing pre-trained language models show limitations in that they do not understand specific domains well. Therefore, in recent years, the flow of research has shifted toward creating a language model specialized for a particular domain. Domain-specific pre-trained language models allow the model to understand the knowledge of a particular domain better and reveal performance improvements on various tasks in the field. However, domain-specific further pre-training is expensive to acquire corpus data of the target domain. Furthermore, many cases have reported that performance improvement after further pre-training is insignificant in some domains. As such, it is difficult to decide to develop a domain-specific pre-trained language model, while it is not clear whether the performance will be improved dramatically. In this paper, we present a way to proactively check the expected performance improvement by further pre-training in a domain before actually performing further pre-training. Specifically, after selecting three domains, we measured the increase in classification accuracy through further pre-training in each domain. We also developed and presented new indicators to estimate the specificity of the domain based on the normalized frequency of the keywords used in each domain. Finally, we conducted classification using a pre-trained language model and a domain-specific pre-trained language model of three domains. As a result, we confirmed that the higher the domain specificity index, the higher the performance improvement through further pre-training.

Discovering Promising Convergence Technologies Using Network Analysis of Maturity and Dependency of Technology (기술 성숙도 및 의존도의 네트워크 분석을 통한 유망 융합 기술 발굴 방법론)

  • Choi, Hochang;Kwahk, Kee-Young;Kim, Namgyu
    • Journal of Intelligence and Information Systems
    • /
    • v.24 no.1
    • /
    • pp.101-124
    • /
    • 2018
  • Recently, most of the technologies have been developed in various forms through the advancement of single technology or interaction with other technologies. Particularly, these technologies have the characteristic of the convergence caused by the interaction between two or more techniques. In addition, efforts in responding to technological changes by advance are continuously increasing through forecasting promising convergence technologies that will emerge in the near future. According to this phenomenon, many researchers are attempting to perform various analyses about forecasting promising convergence technologies. A convergence technology has characteristics of various technologies according to the principle of generation. Therefore, forecasting promising convergence technologies is much more difficult than forecasting general technologies with high growth potential. Nevertheless, some achievements have been confirmed in an attempt to forecasting promising technologies using big data analysis and social network analysis. Studies of convergence technology through data analysis are actively conducted with the theme of discovering new convergence technologies and analyzing their trends. According that, information about new convergence technologies is being provided more abundantly than in the past. However, existing methods in analyzing convergence technology have some limitations. Firstly, most studies deal with convergence technology analyze data through predefined technology classifications. The technologies appearing recently tend to have characteristics of convergence and thus consist of technologies from various fields. In other words, the new convergence technologies may not belong to the defined classification. Therefore, the existing method does not properly reflect the dynamic change of the convergence phenomenon. Secondly, in order to forecast the promising convergence technologies, most of the existing analysis method use the general purpose indicators in process. This method does not fully utilize the specificity of convergence phenomenon. The new convergence technology is highly dependent on the existing technology, which is the origin of that technology. Based on that, it can grow into the independent field or disappear rapidly, according to the change of the dependent technology. In the existing analysis, the potential growth of convergence technology is judged through the traditional indicators designed from the general purpose. However, these indicators do not reflect the principle of convergence. In other words, these indicators do not reflect the characteristics of convergence technology, which brings the meaning of new technologies emerge through two or more mature technologies and grown technologies affect the creation of another technology. Thirdly, previous studies do not provide objective methods for evaluating the accuracy of models in forecasting promising convergence technologies. In the studies of convergence technology, the subject of forecasting promising technologies was relatively insufficient due to the complexity of the field. Therefore, it is difficult to find a method to evaluate the accuracy of the model that forecasting promising convergence technologies. In order to activate the field of forecasting promising convergence technology, it is important to establish a method for objectively verifying and evaluating the accuracy of the model proposed by each study. To overcome these limitations, we propose a new method for analysis of convergence technologies. First of all, through topic modeling, we derive a new technology classification in terms of text content. It reflects the dynamic change of the actual technology market, not the existing fixed classification standard. In addition, we identify the influence relationships between technologies through the topic correspondence weights of each document, and structuralize them into a network. In addition, we devise a centrality indicator (PGC, potential growth centrality) to forecast the future growth of technology by utilizing the centrality information of each technology. It reflects the convergence characteristics of each technology, according to technology maturity and interdependence between technologies. Along with this, we propose a method to evaluate the accuracy of forecasting model by measuring the growth rate of promising technology. It is based on the variation of potential growth centrality by period. In this paper, we conduct experiments with 13,477 patent documents dealing with technical contents to evaluate the performance and practical applicability of the proposed method. As a result, it is confirmed that the forecast model based on a centrality indicator of the proposed method has a maximum forecast accuracy of about 2.88 times higher than the accuracy of the forecast model based on the currently used network indicators.

Suggestion of Urban Regeneration Type Recommendation System Based on Local Characteristics Using Text Mining (텍스트 마이닝을 활용한 지역 특성 기반 도시재생 유형 추천 시스템 제안)

  • Kim, Ikjun;Lee, Junho;Kim, Hyomin;Kang, Juyoung
    • Journal of Intelligence and Information Systems
    • /
    • v.26 no.3
    • /
    • pp.149-169
    • /
    • 2020
  • "The Urban Renewal New Deal project", one of the government's major national projects, is about developing underdeveloped areas by investing 50 trillion won in 100 locations on the first year and 500 over the next four years. This project is drawing keen attention from the media and local governments. However, the project model which fails to reflect the original characteristics of the area as it divides project area into five categories: "Our Neighborhood Restoration, Housing Maintenance Support Type, General Neighborhood Type, Central Urban Type, and Economic Base Type," According to keywords for successful urban regeneration in Korea, "resident participation," "regional specialization," "ministerial cooperation" and "public-private cooperation", when local governments propose urban regeneration projects to the government, they can see that it is most important to accurately understand the characteristics of the city and push ahead with the projects in a way that suits the characteristics of the city with the help of local residents and private companies. In addition, considering the gentrification problem, which is one of the side effects of urban regeneration projects, it is important to select and implement urban regeneration types suitable for the characteristics of the area. In order to supplement the limitations of the 'Urban Regeneration New Deal Project' methodology, this study aims to propose a system that recommends urban regeneration types suitable for urban regeneration sites by utilizing various machine learning algorithms, referring to the urban regeneration types of the '2025 Seoul Metropolitan Government Urban Regeneration Strategy Plan' promoted based on regional characteristics. There are four types of urban regeneration in Seoul: "Low-use Low-Level Development, Abandonment, Deteriorated Housing, and Specialization of Historical and Cultural Resources" (Shon and Park, 2017). In order to identify regional characteristics, approximately 100,000 text data were collected for 22 regions where the project was carried out for a total of four types of urban regeneration. Using the collected data, we drew key keywords for each region according to the type of urban regeneration and conducted topic modeling to explore whether there were differences between types. As a result, it was confirmed that a number of topics related to real estate and economy appeared in old residential areas, and in the case of declining and underdeveloped areas, topics reflecting the characteristics of areas where industrial activities were active in the past appeared. In the case of the historical and cultural resource area, since it is an area that contains traces of the past, many keywords related to the government appeared. Therefore, it was possible to confirm political topics and cultural topics resulting from various events. Finally, in the case of low-use and under-developed areas, many topics on real estate and accessibility are emerging, so accessibility is good. It mainly had the characteristics of a region where development is planned or is likely to be developed. Furthermore, a model was implemented that proposes urban regeneration types tailored to regional characteristics for regions other than Seoul. Machine learning technology was used to implement the model, and training data and test data were randomly extracted at an 8:2 ratio and used. In order to compare the performance between various models, the input variables are set in two ways: Count Vector and TF-IDF Vector, and as Classifier, there are 5 types of SVM (Support Vector Machine), Decision Tree, Random Forest, Logistic Regression, and Gradient Boosting. By applying it, performance comparison for a total of 10 models was conducted. The model with the highest performance was the Gradient Boosting method using TF-IDF Vector input data, and the accuracy was 97%. Therefore, the recommendation system proposed in this study is expected to recommend urban regeneration types based on the regional characteristics of new business sites in the process of carrying out urban regeneration projects."

A Study on Kiosk Satisfaction Level Improvement: Focusing on Kano, Timko, and PCSI Methodology (키오스크 소비자의 만족수준 연구: Kano, Timko, PCSI 방법론을 중심으로)

  • Choi, Jaehoon;Kim, Pansoo
    • Asia-Pacific Journal of Business Venturing and Entrepreneurship
    • /
    • v.17 no.4
    • /
    • pp.193-204
    • /
    • 2022
  • This study analyzed the degree of influence of measurement and improvement of customer satisfaction level targeting kiosk users. In modern times, due to the development of technology and the improvement of the online environment, the probability that simple labor tasks will disappear after 10 years is close to 90%. Even in domestic research, it is predicted that 'simple labor jobs' will disappear due to the influence of advanced technology with a probability of about 36%. there is. In particular, as the demand for non-face-to-face services increases due to the Corona 19 virus, which is recently spreading globally, the trend of introducing kiosks has accelerated, and the global market will grow to 83.5 billion won in 2021, showing an average annual growth rate of 8.9%. there is. However, due to the unmanned nature of these kiosks, some consumers still have difficulties in using them, and consumers who are not familiar with the use of these technologies have a negative attitude towards service co-producers due to rejection of non-face-to-face services and anxiety about service errors. Lack of understanding leads to role conflicts between sales clerks and consumers, or inequality is being created in terms of service provision and generations accustomed to using technology. In addition, since kiosk is a representative technology-based self-service industry, if the user feels uncomfortable or requires additional labor, the overall service value decreases and the growth of the kiosk industry itself can be suppressed. It is important. Therefore, interviews were conducted on the main points of direct use with actual users centered on display color scheme, text size, device design, device size, internal UI (interface), amount of information, recognition sensor (barcode, NFC, etc.), Display brightness, self-event, and reaction speed items were extracted. Afterwards, using the questionnaire, the Kano model quality attribute classification of each expected evaluation item was carried out, and Timko's customer satisfaction coefficient, which can be calculated with accurate numerical values The PCSI Index analysis was additionally performed to determine the improvement priorities by finally classifying the improvement impact of the kiosk expected evaluation items through research. As a result, the impact of improvement appears in the order of internal UI (interface), text size, recognition sensor (barcode, NFC, etc.), reaction speed, self-event, display brightness, amount of information, device size, device design, and display color scheme. Through this, we intend to contribute to a comprehensive comparison of kiosk-based research in each field and to set the direction for improvement in the venture industry.

Prediction of Correct Answer Rate and Identification of Significant Factors for CSAT English Test Based on Data Mining Techniques (데이터마이닝 기법을 활용한 대학수학능력시험 영어영역 정답률 예측 및 주요 요인 분석)

  • Park, Hee Jin;Jang, Kyoung Ye;Lee, Youn Ho;Kim, Woo Je;Kang, Pil Sung
    • KIPS Transactions on Software and Data Engineering
    • /
    • v.4 no.11
    • /
    • pp.509-520
    • /
    • 2015
  • College Scholastic Ability Test(CSAT) is a primary test to evaluate the study achievement of high-school students and used by most universities for admission decision in South Korea. Because its level of difficulty is a significant issue to both students and universities, the government makes a huge effort to have a consistent difficulty level every year. However, the actual levels of difficulty have significantly fluctuated, which causes many problems with university admission. In this paper, we build two types of data-driven prediction models to predict correct answer rate and to identify significant factors for CSAT English test through accumulated test data of CSAT, unlike traditional methods depending on experts' judgments. Initially, we derive candidate question-specific factors that can influence the correct answer rate, such as the position, EBS-relation, readability, from the annual CSAT practices and CSAT for 10 years. In addition, we drive context-specific factors by employing topic modeling which identify the underlying topics over the text. Then, the correct answer rate is predicted by multiple linear regression and level of difficulty is predicted by classification tree. The experimental results show that 90% of accuracy can be achieved by the level of difficulty (difficult/easy) classification model, whereas the error rate for correct answer rate is below 16%. Points and problem category are found to be critical to predict the correct answer rate. In addition, the correct answer rate is also influenced by some of the topics discovered by topic modeling. Based on our study, it will be possible to predict the range of expected correct answer rate for both question-level and entire test-level, which will help CSAT examiners to control the level of difficulties.

Study of Rhetorical Puns in Korean Comic Strips in Daily Newspaper (한국 신문만화의 언어유희적 기법 연구)

  • Kim, Eul-Ho
    • Cartoon and Animation Studies
    • /
    • s.10
    • /
    • pp.1-16
    • /
    • 2006
  • This thesis aims to recall the importance of language in comics by studying comic strips in Korean daily newspapers: the comic strips are analyzed for rhetorical puns in its language text as they representatively show the value and role of language in comics. Moreover, Korean comic strips, as they developed into current affairs comics, acquired a stronger media characteristic of communicating information compared to other genres of cartoons. As a result, comics strips have become a genre where language plays an important role and the words needing to be able to convey the meaning quickly and implicitly. Due to tight control of national authority, the language technique developed into an indirect expression rather than a stronger direct imaging technique. The political oppression of the comic strip paradoxically brought on the rhetorical development in the creative techniques. Based on this analysis, the writer studied the rhetorical puns of the texts Korean comic strips by implementing the classification techniques of rhetoric expressions. As a result, through quotes and analysis of actual comic strips, the writer confirmed that Korean comic strips do actually show tremendously vast rhetorical puns in its language application techniques. The writer was also able to conclude that the rhetorical puns in comics were the force entertaining and impressing the readers, and also acting as the creative principle. Concluding this study, the writer emphasizes that language, not only in comic strips, is a combination of words and images and is also an important factor in all cartoons in general. Thus the thesis proposes that the training of humanistic thoughts and linguistic sensitivity are as important as learning to draw in the creation of cartoons.

  • PDF

Building Hierarchical Knowledge Base of Research Interests and Learning Topics for Social Computing Support (소셜 컴퓨팅을 위한 연구·학습 주제의 계층적 지식기반 구축)

  • Kim, Seonho;Kim, Kang-Hoe;Yeo, Woondong
    • The Journal of the Korea Contents Association
    • /
    • v.12 no.12
    • /
    • pp.489-498
    • /
    • 2012
  • This paper consists of two parts: In the first part, we describe our work to build hierarchical knowledge base of digital library patron's research interests and learning topics in various scholarly areas through analyzing well classified Electronic Theses and Dissertations (ETDs) of NDLTD Union catalog. Journal articles from ACM Transactions and conference web sites of computing areas also are added in the analysis to specialize computing fields. This hierarchical knowledge base would be a useful tool for many social computing and information service applications, such as personalization, recommender system, text mining, technology opportunity mining, information visualization, and so on. In the second part, we compare four grouping algorithms to select best one for our data mining researches by testing each one with the hierarchical knowledge base we described in the first part. From these two studies, we intent to show traditional verification methods for social community miming researches, based on interviewing and answering questionnaires, which are expensive, slow, and privacy threatening, can be replaced with systematic, consistent, fast, and privacy protecting methods by using our suggested hierarchical knowledge base.

A Study on Marine Accident Ontology Development and Data Management: Based on a Situation Report Analysis of Southwest Coast Marine Accidents in Korea (해양사고 온톨로지 구축 및 데이터 관리방안 연구: 서해남부해역 선박사고 상황보고서 분석을 중심으로)

  • Lee, Young Jai;Kang, Seong Kyung;Gu, Ja-Yeong
    • Journal of the Korean Society of Marine Environment & Safety
    • /
    • v.25 no.4
    • /
    • pp.423-432
    • /
    • 2019
  • Along with an increase in marine activities every year, the frequency of marine accidents is on the rise. Accordingly, various research activities and policies for marine safety are being implemented. Despite these efforts, the number of accidents are increasing every year, bringing their effectiveness into question. Preliminary studies relying on annual statistical reports provide precautionary measures for items that stand out significantly, through the comparison of statistical provision items. Since the 2000s, large-scale marine accidents have repeatedly occurred, and case studies have examined the "accident response." Likewise, annual statistics or accident cases are used as core data in policy formulation for domestic maritime safety. However, they are just a summary of post-accident results. In this study, limitations of current marine research and policy are evaluated through a literature review of case studies and analyses of marine accidents. In addition, the ontology of the marine accident information classification system will be revised to improve the current limited usage of the information through an attribute analysis of boating accident status reports and text mining. These aspects consist of the reporter, the report method, the rescue organization, corrective measures, vulnerability of response, payloads, cause of oil spill, damage pattern, and the result of an accident response. These can be used consistently in the future as classified standard terms to collect and utilize information more efficiently. Moreover, the research proposes a data collection and quality assurance method for the practical use of ontology. A clear understanding of the problems presently faced in marine safety will allow "suf icient quality information" to be leveraged for the purpose of conducting various researches and realizing effective policies.

A Study on the Bibliographic Characteristics and Data Value of YoungNamRooSiun(嶺南樓詩韻) - focusing on the classification of the recorded characters - (『영남루시운(嶺南樓詩韻)』의 서지적 특징과 자료적 가치 - 수록 인물 분류를 중심으로 -)

  • Jun, Jae Dong
    • Journal of Korean Library and Information Science Society
    • /
    • v.49 no.4
    • /
    • pp.353-371
    • /
    • 2018
  • This paper aims to analyze the bibliographic characteristics and the material value of the recently introduced manuscript YoungNamRooSiun(嶺南樓詩韻). YoungNamRooSiun(嶺南樓詩韻) is a collection of six buildings including the Yeongnamroo(嶺南樓) main hall floor, Reulpadang(凌波堂), Chimryugak(枕流閣), Gaksadonghun(客舍東軒), Dukminjung(德民亭), and Namsujung(攬秀亭). And poetry. The number of artists listed here is 412, and the total number of works is 570 including 11 prose and 559 rhyme. Through the text, YoungNamRoo(嶺南樓) was able to confirm that it was a space where official affairs such as observatory, master's office space, and hospitality reception were held. Based on this, an attempt should be made to take the reliability of the YoungNamRoo(嶺南樓) Ruling through the comparison of the original texts of the YoungNamRoo(嶺南樓) Ruling Prize and the collected writers.

A Literature Review and Classification of Recommender Systems on Academic Journals (추천시스템관련 학술논문 분석 및 분류)

  • Park, Deuk-Hee;Kim, Hyea-Kyeong;Choi, Il-Young;Kim, Jae-Kyeong
    • Journal of Intelligence and Information Systems
    • /
    • v.17 no.1
    • /
    • pp.139-152
    • /
    • 2011
  • Recommender systems have become an important research field since the emergence of the first paper on collaborative filtering in the mid-1990s. In general, recommender systems are defined as the supporting systems which help users to find information, products, or services (such as books, movies, music, digital products, web sites, and TV programs) by aggregating and analyzing suggestions from other users, which mean reviews from various authorities, and user attributes. However, as academic researches on recommender systems have increased significantly over the last ten years, more researches are required to be applicable in the real world situation. Because research field on recommender systems is still wide and less mature than other research fields. Accordingly, the existing articles on recommender systems need to be reviewed toward the next generation of recommender systems. However, it would be not easy to confine the recommender system researches to specific disciplines, considering the nature of the recommender system researches. So, we reviewed all articles on recommender systems from 37 journals which were published from 2001 to 2010. The 37 journals are selected from top 125 journals of the MIS Journal Rankings. Also, the literature search was based on the descriptors "Recommender system", "Recommendation system", "Personalization system", "Collaborative filtering" and "Contents filtering". The full text of each article was reviewed to eliminate the article that was not actually related to recommender systems. Many of articles were excluded because the articles such as Conference papers, master's and doctoral dissertations, textbook, unpublished working papers, non-English publication papers and news were unfit for our research. We classified articles by year of publication, journals, recommendation fields, and data mining techniques. The recommendation fields and data mining techniques of 187 articles are reviewed and classified into eight recommendation fields (book, document, image, movie, music, shopping, TV program, and others) and eight data mining techniques (association rule, clustering, decision tree, k-nearest neighbor, link analysis, neural network, regression, and other heuristic methods). The results represented in this paper have several significant implications. First, based on previous publication rates, the interest in the recommender system related research will grow significantly in the future. Second, 49 articles are related to movie recommendation whereas image and TV program recommendation are identified in only 6 articles. This result has been caused by the easy use of MovieLens data set. So, it is necessary to prepare data set of other fields. Third, recently social network analysis has been used in the various applications. However studies on recommender systems using social network analysis are deficient. Henceforth, we expect that new recommendation approaches using social network analysis will be developed in the recommender systems. So, it will be an interesting and further research area to evaluate the recommendation system researches using social method analysis. This result provides trend of recommender system researches by examining the published literature, and provides practitioners and researchers with insight and future direction on recommender systems. We hope that this research helps anyone who is interested in recommender systems research to gain insight for future research.