• Title/Summary/Keyword: TF-IDF


Content-based Recommendation Based on Social Network for Personalized News Services (개인화된 뉴스 서비스를 위한 소셜 네트워크 기반의 콘텐츠 추천기법)

  • Hong, Myung-Duk; Oh, Kyeong-Jin; Ga, Myung-Hyun; Jo, Geun-Sik
    • Journal of Intelligence and Information Systems, v.19 no.3, pp.57-71, 2013
  • Over a billion people around the world generate news minute by minute. Some news can be anticipated, but most arises from unexpected events such as natural disasters, accidents, and crimes. People spend much time watching the huge amount of news delivered by many media outlets because they want to understand what is happening now, predict what might happen in the near future, and share and discuss the news; watching the news and obtaining useful information from it helps them make better daily decisions. However, it is difficult for people to choose the news that suits them and to obtain useful information from it, because there are so many news media, such as portal sites and broadcasters, and most articles consist of gossip and breaking news. User interest also changes over time, and many people have no interest in outdated news, so a personalized news service must reflect users' recent interests; in other words, it should manage user profiles dynamically. In this paper, a content-based news recommendation system is proposed to provide such a personalized news service. Personalization necessarily requires the user's personal information, and a social network service is used to extract it. The proposed system constructs a dynamic user profile from recent user information on Facebook, one such social network service. This information comprises personal information, recent articles, and Facebook Page information. Facebook Pages are used by businesses, organizations, and brands to share their content and connect with people, and Facebook users can add a Page to express their interest in it. The proposed system uses this Page information to create user profiles and to match user preferences to news topics. However, some Pages do not map directly to a news topic, because a Page deals with an individual object and does not itself provide topic information suited to news. Freebase, a large collaborative database of well-known people, places, and things, is therefore used to match Pages to news topics through the hierarchy information of its objects. By using the recent Page information and articles of Facebook users, the proposed system maintains a dynamic user profile, which is used to measure user preferences on news. To generate news profiles, the news categories predefined by the media are used, and the keywords of each article are extracted after analyzing the news content, including title, category, and script; the TF-IDF technique, which reflects how important a word is to a document in a corpus, identifies these keywords. User profiles and news profiles share the same format so that the similarity between user preferences and news can be measured efficiently, and the proposed system calculates all similarity values between user profiles and news profiles. Existing similarity calculations in the vector space model do not cover synonyms, hypernyms, or hyponyms, because they handle only the given words; the proposed system applies WordNet to the similarity calculation to overcome this limitation. The top-N news articles with the highest similarity values for a target user are recommended to that user.
To evaluate the proposed news recommendation system, user profiles were generated from Facebook accounts with the participants' consent, and a Web crawler was implemented to extract news information from PBS, a non-profit public broadcasting television network in the United States, from which news profiles were constructed. The performance of the proposed method was compared with that of two benchmark algorithms: a traditional TF-IDF-based method and the 6Sub-Vectors method, which divides the sources from which keywords are obtained into six parts. Experimental results demonstrate that, in terms of the prediction error of recommended news, the proposed system provides useful news to users by applying the user's social network information and WordNet.
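As a rough illustration of the pipeline this abstract describes, the sketch below combines TF-IDF keyword extraction with WordNet path similarity to rank articles against profile terms. It is a minimal approximation using scikit-learn and NLTK, not the authors' implementation; the profile terms and articles are invented placeholders.

```python
from itertools import product

import nltk
from nltk.corpus import wordnet as wn
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download("wordnet", quiet=True)  # one-time WordNet data download

# Hypothetical profile terms and articles, for illustration only.
profile_terms = ["hurricane", "flood", "disaster"]
articles = [
    "Storm and flooding force evacuations along the coast",
    "Stock markets rally as tech shares climb higher",
]

# Step 1: TF-IDF identifies each article's keywords, as in the paper.
vec = TfidfVectorizer(stop_words="english")
tfidf = vec.fit_transform(articles)
terms = vec.get_feature_names_out()

def top_keywords(row, k=3):
    weights = tfidf[row].toarray().ravel()
    return [terms[i] for i in weights.argsort()[::-1][:k] if weights[i] > 0]

# Step 2: WordNet path similarity catches synonyms/hypernyms that plain
# vector-space matching misses (e.g. "flood" vs. "flooding").
def word_sim(a, b):
    pairs = product(wn.synsets(a), wn.synsets(b))
    return max((s1.path_similarity(s2) or 0.0 for s1, s2 in pairs), default=0.0)

def article_score(row):
    keywords = top_keywords(row)
    if not keywords:
        return 0.0
    return sum(max(word_sim(p, k) for k in keywords)
               for p in profile_terms) / len(profile_terms)

# Top-N recommendation: rank articles by similarity to the profile.
ranked = sorted(range(len(articles)), key=article_score, reverse=True)
print(articles[ranked[0]])  # expect the storm/flood article first
```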

Suggestion of Urban Regeneration Type Recommendation System Based on Local Characteristics Using Text Mining (텍스트 마이닝을 활용한 지역 특성 기반 도시재생 유형 추천 시스템 제안)

  • Kim, Ikjun; Lee, Junho; Kim, Hyomin; Kang, Juyoung
    • Journal of Intelligence and Information Systems, v.26 no.3, pp.149-169, 2020
  • "The Urban Renewal New Deal project", one of the government's major national projects, is about developing underdeveloped areas by investing 50 trillion won in 100 locations on the first year and 500 over the next four years. This project is drawing keen attention from the media and local governments. However, the project model which fails to reflect the original characteristics of the area as it divides project area into five categories: "Our Neighborhood Restoration, Housing Maintenance Support Type, General Neighborhood Type, Central Urban Type, and Economic Base Type," According to keywords for successful urban regeneration in Korea, "resident participation," "regional specialization," "ministerial cooperation" and "public-private cooperation", when local governments propose urban regeneration projects to the government, they can see that it is most important to accurately understand the characteristics of the city and push ahead with the projects in a way that suits the characteristics of the city with the help of local residents and private companies. In addition, considering the gentrification problem, which is one of the side effects of urban regeneration projects, it is important to select and implement urban regeneration types suitable for the characteristics of the area. In order to supplement the limitations of the 'Urban Regeneration New Deal Project' methodology, this study aims to propose a system that recommends urban regeneration types suitable for urban regeneration sites by utilizing various machine learning algorithms, referring to the urban regeneration types of the '2025 Seoul Metropolitan Government Urban Regeneration Strategy Plan' promoted based on regional characteristics. There are four types of urban regeneration in Seoul: "Low-use Low-Level Development, Abandonment, Deteriorated Housing, and Specialization of Historical and Cultural Resources" (Shon and Park, 2017). In order to identify regional characteristics, approximately 100,000 text data were collected for 22 regions where the project was carried out for a total of four types of urban regeneration. Using the collected data, we drew key keywords for each region according to the type of urban regeneration and conducted topic modeling to explore whether there were differences between types. As a result, it was confirmed that a number of topics related to real estate and economy appeared in old residential areas, and in the case of declining and underdeveloped areas, topics reflecting the characteristics of areas where industrial activities were active in the past appeared. In the case of the historical and cultural resource area, since it is an area that contains traces of the past, many keywords related to the government appeared. Therefore, it was possible to confirm political topics and cultural topics resulting from various events. Finally, in the case of low-use and under-developed areas, many topics on real estate and accessibility are emerging, so accessibility is good. It mainly had the characteristics of a region where development is planned or is likely to be developed. Furthermore, a model was implemented that proposes urban regeneration types tailored to regional characteristics for regions other than Seoul. Machine learning technology was used to implement the model, and training data and test data were randomly extracted at an 8:2 ratio and used. 
In order to compare the performance between various models, the input variables are set in two ways: Count Vector and TF-IDF Vector, and as Classifier, there are 5 types of SVM (Support Vector Machine), Decision Tree, Random Forest, Logistic Regression, and Gradient Boosting. By applying it, performance comparison for a total of 10 models was conducted. The model with the highest performance was the Gradient Boosting method using TF-IDF Vector input data, and the accuracy was 97%. Therefore, the recommendation system proposed in this study is expected to recommend urban regeneration types based on the regional characteristics of new business sites in the process of carrying out urban regeneration projects."
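The two-vectorizer by five-classifier comparison described above can be approximated with scikit-learn as below. This is a hedged sketch on invented placeholder texts and labels, not the study's data or tuning.

```python
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Invented stand-ins for region texts and urban regeneration type labels.
texts = ["old housing repair alley", "factory closed vacant lot",
         "palace heritage tourism", "subway station development plan"] * 25
labels = ["deteriorated", "abandoned", "historical", "low-use"] * 25

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42)  # the paper's 8:2 split

vectorizers = {"count": CountVectorizer(), "tfidf": TfidfVectorizer()}
classifiers = {
    "svm": LinearSVC(),
    "tree": DecisionTreeClassifier(),
    "rf": RandomForestClassifier(),
    "logreg": LogisticRegression(max_iter=1000),
    "gb": GradientBoostingClassifier(),
}

# 2 vectorizers x 5 classifiers = 10 models, scored on held-out accuracy.
for vname, vec in vectorizers.items():
    for cname, clf in classifiers.items():
        pipe = make_pipeline(vec, clf).fit(X_train, y_train)
        print(f"{vname}+{cname}: {pipe.score(X_test, y_test):.2f}")
```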

A Study on Development of Patent Information Retrieval Using Textmining (텍스트 마이닝을 이용한 특허정보검색 개발에 관한 연구)

  • Go, Gwang-Su; Jung, Won-Kyo; Shin, Young-Geun; Park, Sang-Sung; Jang, Dong-Sik
    • Journal of the Korea Academia-Industrial cooperation Society, v.12 no.8, pp.3677-3688, 2011
  • A patent information retrieval system can serve a variety of purposes. In general, patent information is retrieved using a limited set of keywords, so repeated effort is needed to identify earlier technology and priority rights. This study proposes a content-based retrieval method using text mining. With the proposed algorithm, each document is assigned characteristic values, which are used to compare the similarity between query documents and database documents. The text analysis consists of three steps: stop-word removal, keyword analysis, and weight calculation. In the tests, general retrieval and the proposed algorithm were compared using accuracy measurements. Because the method ranks the result documents by their similarity to the query document, searchers can work more efficiently by reviewing the most similar documents first. Moreover, because the full text of a patent document can be given as input, users unfamiliar with patent searching can use the system easily and quickly. By using content-based rather than keyword-based retrieval to extend the scope of the search, it also reduces the amount of missing data in the displayed results.
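A minimal sketch of the content-based retrieval idea: vectorize the full text of the query patent and rank database documents by cosine similarity. This uses scikit-learn defaults rather than the paper's own weighting scheme; the patent snippets are invented.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Invented placeholder patent texts standing in for a patent database.
database = [
    "A battery electrode coating method using graphene slurry",
    "An image sensor pixel circuit with reduced dark current",
    "A lithium battery anode with silicon graphene composite",
]
query = "Electrode materials for lithium batteries based on graphene"

# Stop-word removal plus keyword weighting in one step (approximating the
# paper's three-step text analysis with TF-IDF defaults).
vec = TfidfVectorizer(stop_words="english")
db_matrix = vec.fit_transform(database)
q_vector = vec.transform([query])

# Rank database documents by similarity to the query document.
scores = cosine_similarity(q_vector, db_matrix).ravel()
for idx in scores.argsort()[::-1]:
    print(f"{scores[idx]:.3f}  {database[idx]}")
```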

An Effective User-Profile Generation Method based on Identification of Informative Blocks in Web Document (웹 문서의 정보블럭 식별을 통한 효과적인 사용자 프로파일 생성방법)

  • Ryu, Sang-Hyun; Lee, Seung-Hwa; Jung, Min-Chul; Lee, Eun-Seok
    • Proceedings of the Korean Information Science Society Conference, 2007.10c, pp.253-257, 2007
  • With the recent explosive growth of information on the Web, recommender systems that select and provide information matching a user's tastes are being actively studied. Because a recommender system operates on a user profile describing the user's interests, generating an accurate user profile is very important. A representative approach to analyzing tastes from a user's implicit behavior is to analyze the Web documents the user has viewed, generating the profile from the words that appear frequently in those documents. Recent Web documents, however, contain many components unrelated to the user's tastes, such as logos and copyright notices, so analyzing documents with all of this content included lowers the accuracy of the generated profile. This paper therefore proposes a new profile generation method in which the user's device analyzes the user's Web browsing history, removes the blocks that appear repeatedly in documents obtained from the same site, identifies the informative blocks, and extracts the user's words of interest from them, enabling more accurate and faster profile generation. To evaluate the proposed method, we collected Web document data used by users with recent purchase activity and extracted user profiles with both the TF-IDF method and the proposed method. Comparing how strongly each generated profile correlated with the purchase data, the more accurate profiles and the shorter profile analysis time demonstrate the effectiveness of the proposed method.
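The core idea, removing blocks that repeat across pages of the same site before extracting interest terms, can be sketched as follows. The block segmentation and sample pages are simplified stand-ins, not the paper's method.

```python
from collections import Counter

# Each page is pre-split into text blocks (a real system would segment the
# DOM); the pages below are invented placeholders from one site.
site_pages = [
    ["ACME Shop | Login | Cart", "Waterproof hiking boots, vibram sole",
     "Copyright 2007 ACME"],
    ["ACME Shop | Login | Cart", "Lightweight trekking poles, carbon",
     "Copyright 2007 ACME"],
    ["ACME Shop | Login | Cart", "Down sleeping bag for winter camping",
     "Copyright 2007 ACME"],
]

# A block repeated on most pages of one site is boilerplate, not content.
block_counts = Counter(block for page in site_pages for block in page)
threshold = len(site_pages) * 0.8
informative = [
    block for page in site_pages for block in page
    if block_counts[block] < threshold
]

# Interest words are extracted only from the informative blocks.
words = Counter(w.lower().strip(",") for b in informative for w in b.split())
print(words.most_common(5))  # hiking/camping terms, no "login"/"copyright"
```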

Patent data analysis using clique analysis in a keyword network (키워드 네트워크의 클릭 분석을 이용한 특허 데이터 분석)

  • Kim, Hyon Hee; Kim, Donggeon; Jo, Jinnam
    • Journal of the Korean Data and Information Science Society, v.27 no.5, pp.1273-1284, 2016
  • In this paper, we analyzed patents on machine learning using keyword network analysis and clique analysis. To construct a keyword network, important keywords were extracted based on their TF-IDF weights and associations, and network structure analysis and clique analysis were performed. The density and clustering coefficient of the patent keyword network are low, which shows that patent keywords on machine learning are only weakly connected with each other; this is because the important patents on machine learning are mainly registered for applications of machine learning rather than for machine learning techniques themselves. Our clique analysis also showed that the keywords found in cliques of the 2005 patents concern subjects such as newsmaker verification, product forecasting, virus detection, biomarkers, and workflow management, while those of the 2015 patents concern subjects such as digital imaging, payment cards, calling systems, mammogram systems, and price prediction. Clique analysis can thus be used not only to identify specialized subjects but also to suggest search keywords for patent search systems.
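A minimal sketch of the keyword network and clique analysis using networkx; the co-occurring keyword pairs are invented placeholders, not the patent data.

```python
import networkx as nx

# Edges connect keywords that co-occur in the same patent (invented pairs).
cooccurrence = [
    ("neural network", "image"), ("image", "mammogram"),
    ("mammogram", "diagnosis"), ("neural network", "diagnosis"),
    ("image", "diagnosis"),
    ("payment", "card"), ("card", "fraud"), ("payment", "fraud"),
]
G = nx.Graph(cooccurrence)

# The paper's structural indicators: density and clustering coefficient.
print("density:", round(nx.density(G), 3))
print("clustering:", round(nx.average_clustering(G), 3))

# Maximal cliques surface tightly linked subject groups.
for clique in nx.find_cliques(G):
    if len(clique) >= 3:
        print("clique:", clique)
```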

A Method for Prediction of Quality Defects in Manufacturing Using Natural Language Processing and Machine Learning (자연어 처리 및 기계학습을 활용한 제조업 현장의 품질 불량 예측 방법론)

  • Roh, Jeong-Min; Kim, Yongsung
    • Journal of Platform Technology, v.9 no.3, pp.52-62, 2021
  • Quality control is critical at manufacturing sites and is key to predicting the risk of quality defects before manufacturing. However, the reliability of manual quality control methods is limited by human and physical constraints, because manufacturing processes vary across industries. These limitations become particularly obvious in domains with numerous manufacturing processes, such as the manufacture of major nuclear equipment. This study proposes a novel method for predicting the risk of quality defects using natural language processing and machine learning. Production data collected over six years at a factory that manufactures main equipment installed in nuclear power plants were used. In the text preprocessing stage, a mapping onto a word dictionary was applied so that domain knowledge would be appropriately reflected, and a hybrid algorithm combining n-grams, Term Frequency-Inverse Document Frequency (TF-IDF), and Singular Value Decomposition (SVD) was constructed for sentence vectorization. Next, in the experiment classifying the risky processes that result in poor quality, k-fold cross-validation was applied to cases ranging from unigrams to cumulative trigrams. For objective experimental results, Naive Bayes and Support Vector Machine classifiers were used, achieving a maximum accuracy of 0.7685 and a maximum F1-score of 0.8641 and showing that the proposed method is effective. The performance of the proposed method was also compared with the votes of field engineers, and the results revealed that the proposed method outperformed them; the method can therefore be implemented for quality control at manufacturing sites.
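The hybrid vectorization (n-gram TF-IDF reduced by SVD) and k-fold evaluation can be sketched with scikit-learn as below. The production notes and labels are invented, and LinearSVC stands in for the paper's SVM; this is an approximation, not the authors' pipeline.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Invented stand-ins for production notes and defect-risk labels.
notes = ["weld seam porosity found on nozzle", "surface finish within spec",
         "dimension deviation on flange bolt hole",
         "routine inspection passed"] * 30
risky = [1, 0, 1, 0] * 30  # 1 = process led to a quality defect

pipe = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 3)),            # cumulative uni- to trigrams
    TruncatedSVD(n_components=20, random_state=0),  # the SVD reduction step
    LinearSVC(),                                    # SVM classifier
)

# k-fold cross-validation, scored by F1 as in the paper's evaluation.
scores = cross_val_score(pipe, notes, risky, cv=5, scoring="f1")
print("F1 per fold:", scores.round(3))
```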

Strategies for the Development of Watermelon Industry Using Unstructured Big Data Analysis

  • LEE, Seung-In; SON, Chansoo; SHIM, Joonyong; LEE, Hyerim; LEE, Hye-Jin; CHO, Yongbeen
    • The Journal of Industrial Distribution & Business, v.12 no.1, pp.47-62, 2021
  • Purpose: Our purpose in this study was to examine strategies for the development of the watermelon industry using unstructured big data analysis. That is, this study looked at changes in issues and consumers' perceptions of watermelon using big data and social network analysis, and investigated ways to strengthen the competitiveness of the watermelon industry on that basis. Methodology: For this purpose, data were collected from Naver (blog, news) and Daum (blog, news) with TEXTOM 4.5, and the analysis periods were set to 2015-2016, 2017-2018, and 2019-2020 in order to track changes in issues and consumers' perceptions of watermelon and the watermelon industry. For the data analysis, TEXTOM 4.5 was used to conduct keyword frequency analysis and word cloud analysis and to extract matrix data, and UCINET 6.0 with its NetDraw function was used to find the connection structure of words, visualize the network relations, and cluster the words. Results: The keywords extracted for watermelon include 'the stalk end of a watermelon', 'E-mart', 'Haman', 'Gochang', and 'Lotte Mart' (news: 2015-2016); 'apple watermelon', 'Haman', 'E-mart', 'Gochang', and 'Mudeungsan watermelon' (news: 2017-2018); 'E-mart', 'apple watermelon', 'household', 'chobok', and 'donation' (news: 2019-2020); 'watermelon salad', 'taste', 'the heat', 'baby', and 'effect' (blog: 2015-2016); 'taste', 'watermelon juice', 'method', 'watermelon salad', and 'baby' (blog: 2017-2018); and 'taste', 'effect', 'watermelon juice', 'method', and 'apple watermelon' (blog: 2019-2020), with frequency and TF-IDF results reported for each. In the CONCOR analysis, the words in each period grouped into four clusters. Conclusions: Based on the results, the authors discuss strategies and policies for boosting the watermelon industry, as well as the limitations of this study and future research directions. The results will help prioritize strategies and policies for boosting watermelon consumption and contribute to improving the competitiveness of the watermelon industry in Korea. This study is also expected to serve as an important basis for future agricultural big data studies and to offer watermelon producers and policy-makers practical points for crafting tailor-made marketing strategies.
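One way to track period-by-period keyword shifts, pooling each period's posts into one document and letting TF-IDF surface period-specific terms, is sketched below with scikit-learn. The snippets are invented, and the study itself used TEXTOM and UCINET rather than this code.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented placeholder text pooled per analysis period.
periods = {
    "2015-2016": "watermelon salad recipe summer heat baby watermelon taste",
    "2017-2018": "apple watermelon juice method watermelon salad baby",
    "2019-2020": "apple watermelon effect juice method single household",
}

vec = TfidfVectorizer()
tfidf = vec.fit_transform(periods.values())
terms = vec.get_feature_names_out()

# TF-IDF down-weights terms common to all periods, so the top terms of each
# row are the ones characteristic of that period.
for row, period in enumerate(periods):
    weights = tfidf[row].toarray().ravel()
    top = [terms[i] for i in weights.argsort()[::-1][:3]]
    print(period, "->", top)
```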

The Analysis of Fashion Trend Cycle using Big Data (패션 트렌드의 주기적 순환성에 관한 빅데이터 융합 분석)

  • Kim, Ki-Hyun; Byun, Hae-Won
    • Journal of the Korea Convergence Society, v.11 no.12, pp.113-123, 2020
  • In this paper, big data analysis was conducted on past and present fashion trends and the fashion cycle. We focused on the daily look of ordinary people rather than on fashion professionals and fashion shows. Using the social matrix tool Textom, we performed frequency analysis, N-gram analysis, network analysis, and structural equivalence analysis on big data covering fashion trends and cycles. The results are as follows. First, this study extracted the major keywords related to daily-look fashion trends for the past (1980s, 1990s) and the present (2019 and 2020). Second, the frequency analysis and N-gram analysis showed that the fashion cycle has shortened to 30-40 years. Third, the structural equivalence analysis found four representative clusters in each period: jean, retro codi, athleisure look, and celebrity retro in the past, and retro, newtro, lady chic, and retro futurism in the present. Fourth, the network analysis and N-gram analysis showed that past fashion is reproduced and evolves into current fashion for identifiable reasons.
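N-gram analysis of the kind used here reduces, at its simplest, to counting adjacent keyword pairs; a minimal sketch on invented token lists follows (the study used Textom, not this code).

```python
from collections import Counter

# Invented tokenized posts standing in for the crawled fashion texts.
posts = [
    ["retro", "futurism", "jacket"],
    ["newtro", "retro", "futurism"],
    ["lady", "chic", "blouse"],
    ["retro", "futurism", "sneakers"],
]

# Bigrams: every pair of adjacent tokens within a post.
bigrams = Counter(
    (a, b) for tokens in posts for a, b in zip(tokens, tokens[1:])
)
print(bigrams.most_common(3))  # ('retro', 'futurism') dominates
```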

Sentiment Analysis of Product Reviews to Identify Deceptive Rating Information in Social Media: A SentiDeceptive Approach

  • Marwat, M. Irfan; Khan, Javed Ali; Alshehri, Dr. Mohammad Dahman; Ali, Muhammad Asghar; Hizbullah; Ali, Haider; Assam, Muhammad
    • KSII Transactions on Internet and Information Systems (TIIS), v.16 no.3, pp.830-860, 2022
  • [Introduction] Nowadays, many companies are moving their businesses online because of the growing trend among customers to shop online. [Problem] Users share a vast amount of information about products, making it difficult and challenging for end-users to make purchase decisions. [Motivation] We therefore need a mechanism to automatically analyze end-users' opinions, thoughts, and feelings about products on social media platforms, which could help customers make or change their purchasing decisions. [Proposed Solution] For this purpose, we propose an automated SentiDeceptive approach that classifies end-user reviews into negative, positive, and neutral sentiments and identifies deceptive crowd-user rating information on social media platforms to support user decision-making. [Methodology] We first collected 11,781 end-user comments from the Amazon store and the Flipkart web application, covering diverse products such as watches, mobile phones, shoes, clothes, and perfumes. Next, we developed a coding guideline used as the basis for the comment annotation process. We then applied a content analysis approach and the existing VADER library to annotate the end-user comments with the identified codes, producing a labelled data set used as input to the machine learning classifiers. Finally, we applied the sentiment analysis approach to identify end-users' opinions and to counter deceptive rating information: the input data were preprocessed to remove irrelevant items (stop words, special characters, etc.), two standard resampling approaches (oversampling and undersampling) were employed to balance the data set, TF-IDF and bag-of-words (BOW) features were extracted from the text, and the machine learning algorithms were trained and tested with standard cross-validation (KFold and ShuffleSplit). [Results/Outcomes] To support the study, we also developed an automated tool that analyzes each piece of customer feedback and displays the collective sentiment about a specific product in a graph, helping customers make decisions. In a nutshell, the proposed sentiment approach produces good results when identifying customer sentiment from online feedback, obtaining on average 94.01% precision, 93.69% recall, and a 93.81% F-measure for classifying positive sentiments.
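The labelling-and-classification loop can be approximated as below: VADER scores bootstrap sentiment labels, TF-IDF provides features, oversampling balances classes, and KFold cross-validation evaluates a classifier. The comments are invented, imbalanced-learn (imblearn) is an assumed extra dependency, and LogisticRegression stands in for the paper's classifiers.

```python
import nltk
from imblearn.over_sampling import RandomOverSampler
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

nltk.download("vader_lexicon", quiet=True)

# Invented placeholder product comments.
comments = ["great watch, love the strap", "terrible shoes, fell apart",
            "battery died fast, awful phone",
            "lovely perfume, lasts long"] * 30

# Step 1: VADER compound score -> positive / negative / neutral label.
sia = SentimentIntensityAnalyzer()
def label(text):
    c = sia.polarity_scores(text)["compound"]
    return "pos" if c >= 0.05 else "neg" if c <= -0.05 else "neu"
y = [label(c) for c in comments]

# Step 2: TF-IDF features, then oversampling to balance the classes.
X = TfidfVectorizer().fit_transform(comments)
X_bal, y_bal = RandomOverSampler(random_state=0).fit_resample(X, y)

# Step 3: KFold cross-validation, as in the paper's evaluation.
scores = cross_val_score(LogisticRegression(max_iter=1000), X_bal, y_bal,
                         cv=KFold(n_splits=5, shuffle=True, random_state=0))
print("accuracy per fold:", scores.round(3))
```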

Analysis of ICT Education Trends using Keyword Occurrence Frequency Analysis and CONCOR Technique (키워드 출현 빈도 분석과 CONCOR 기법을 이용한 ICT 교육 동향 분석)

  • Youngseok Lee
    • Journal of Industrial Convergence, v.21 no.1, pp.187-192, 2023
  • In this study, trends in ICT education were investigated by analyzing the frequency of appearance of keywords related to machine learning and by applying the convergence of iterated correlations (CONCOR) technique. A total of 304 papers published from 2018 to the present on registered sites were retrieved from Google Scholar using "ICT education" as the keyword, and 60 papers pertaining to ICT education were selected through a systematic literature review. Keywords were then extracted from each paper's title and abstract. For the word frequency and indicator data, 49 high-frequency keywords were extracted by analyzing term frequency via the term frequency-inverse document frequency (TF-IDF) technique of natural language processing, together with co-occurrence frequencies. The degree of relationship between words was verified by analyzing the connection structure and degree centrality, and clusters of similar words were derived via CONCOR analysis. First, "education," "research," "result," "utilization," and "analysis" emerged as the main keywords. Second, an N-gram network graph with "education" as the keyword showed that "curriculum" and "utilization" had the highest correlation level. Third, a cluster analysis with "education" as the keyword yielded five groups: "curriculum," "programming," "student," "improvement," and "information." These results indicate that the practical research needed for ICT education can be guided by analyzing ICT education trends and identifying where they are heading.
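CONCOR itself is simple to sketch: correlate a keyword co-occurrence matrix, re-correlate the result until entries approach ±1, and split keywords by sign pattern. The co-occurrence counts below are invented placeholders, not the study's data.

```python
import numpy as np

keywords = ["education", "curriculum", "programming", "student", "analysis"]
# Invented symmetric co-occurrence counts between the keywords above.
cooc = np.array([
    [0, 8, 2, 7, 1],
    [8, 0, 1, 6, 2],
    [2, 1, 0, 2, 9],
    [7, 6, 2, 0, 1],
    [1, 2, 9, 1, 0],
], dtype=float)

# CONCOR: iterate row correlations until the matrix converges to +/-1 blocks.
M = np.corrcoef(cooc)
for _ in range(50):
    M = np.corrcoef(M)
    if np.all(np.abs(M) > 0.999):
        break

# Rows sharing the sign pattern of the first row form one block.
block = M[0] > 0
for name, in_block in zip(keywords, block):
    print(name, "-> block", 1 if in_block else 2)
```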