• Title/Summary/Keyword: Document Frequency

Search Result 303, Processing Time 0.026 seconds

An Automated Topic Specific Web Crawler Calculating Degree of Relevance (연관도를 계산하는 자동화된 주제 기반 웹 수집기)

  • Seo Hae-Sung;Choi Young-Soo;Choi Kyung-Hee;Jung Gi-Hyun;Noh Sang-Uk
    • Journal of Internet Computing and Services / v.7 no.3 / pp.155-167 / 2006
  • It is desirable for users surfing the Internet to find Web pages that match their interests as closely as possible. Toward this end, this paper presents a topic-specific Web crawler that computes the degree of relevance, collects a cluster of pages for a given topic, and refines the preliminary set of related Web pages using term frequency/document frequency, entropy, and compiled rules. In the experiments, we tested our topic-specific crawler in terms of classification accuracy, crawling efficiency, and crawling consistency. First, the classification accuracy using the set of rules compiled by CN2 was higher than that of the C4.5 and back-propagation learning algorithms. Second, we measured the classification efficiency to determine the best threshold value affecting the degree of relevance. In the third experiment, the consistency of our topic-specific crawler was measured by the number of resulting URLs that overlapped across different starting URLs. The experimental results imply that our topic-specific crawler was fairly consistent regardless of the randomly chosen starting URLs.

Affinity and Variety between Words in the Framework of Hypernetwork (하이퍼네트워크에서 본 단어간 긴밀성과 다양성)

  • Kim, Joon-Shik;Park, Chan-Hoon;Lee, Eun-Seok;Zhang, Byoung-Tak
    • Journal of KIISE:Computer Systems and Theory / v.35 no.4 / pp.166-171 / 2008
  • We studied the variety and affinity between successive words in a text document. A number of groups were defined by the frequency of a following word in the whole text (corpus). In previous studies, Zipf's power law was explained by the Chinese restaurant process, and hub nodes were sought by examining the edge number profile in a scale-free network. We observed both a power law and a hub profile at the same time by studying the conditional frequency and degeneracy of a group. A symmetry between the affinity and the variety between words was found during the data analysis, and this phenomenon can be explained from the viewpoint of "exploitation and exploration." We also remark on a small symmetry-breaking phenomenon in the TIPSTER data.

A Study on the Development of Search Algorithm for Identifying the Similar and Redundant Research (유사과제파악을 위한 검색 알고리즘의 개발에 관한 연구)

  • Park, Dong-Jin;Choi, Ki-Seok;Lee, Myung-Sun;Lee, Sang-Tae
    • The Journal of the Korea Contents Association / v.9 no.11 / pp.54-62 / 2009
  • To avoid redundant investment in the project selection process, it is necessary to check whether submitted research topics have already been proposed or carried out at other institutions. This is possible through search engines that apply a Boolean keyword matching algorithm to a national-scale database of research results. Even though the accuracy and speed of information retrieval have improved, such engines still have fundamental limits caused by keyword matching. This paper examines an implemented TFIDF-based algorithm and presents a search engine experiment that retrieves and prioritizes documents that are similar or redundant with respect to research proposals. In addition to the generic TFIDF algorithm, feature weighting and K-Nearest Neighbors classification methods are implemented in this algorithm. The documents used to test the algorithm are extracted from the NDSL (National Digital Science Library) web directory service.
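
The TFIDF-based prioritization described in the abstract above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the whitespace tokenizer, the `tf * log(N/df)` weighting, cosine similarity as the ranking score, and the names `tokenize`, `tfidf_vectors`, `cosine`, and `rank_similar` are all assumptions chosen for illustration, and the feature weighting and K-Nearest Neighbors steps are left out.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization; a real system would
    # likely use a language-specific tokenizer.
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict) per document, plus the IDF table."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))  # count each term once per document
    idf = {t: math.log(n / df[t]) for t in df}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({t: (c / len(toks)) * idf[t] for t, c in tf.items()})
    return vectors, idf


def cosine(a, b):
    """Cosine similarity of two sparse vectors; 0.0 if either has zero norm."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_similar(query, docs):
    """Rank documents by similarity to the query, most similar first."""
    vectors, idf = tfidf_vectors(docs)
    q_tokens = tokenize(query)
    # Terms that never occur in the collection have no IDF and are ignored.
    q_tf = Counter(t for t in q_tokens if t in idf)
    q_vec = {t: (c / len(q_tokens)) * idf[t] for t, c in q_tf.items()}
    scored = [(i, cosine(q_vec, v)) for i, v in enumerate(vectors)]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


if __name__ == "__main__":
    # Hypothetical toy collection, not data from the paper.
    archive = [
        "solar cell efficiency study",
        "deep learning image classification",
        "solar panel efficiency improvement",
    ]
    for index, score in rank_similar("solar cell efficiency", archive):
        print(index, round(score, 3))
```

Because ranking uses weighted term overlap rather than exact Boolean matching, a document that shares only some terms with a proposal still receives a nonzero score and appears lower in the list instead of being excluded.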

Incorporating Time Constraints into a Recommender System for Museum Visitors

  • Kovavisaruch, La-or;Sanpechuda, Taweesak;Chinda, Krisada;Wongsatho, Thitipong;Wisadsud, Sodsai;Chaiwongyen, Anuwat
    • Journal of information and communication convergence engineering / v.18 no.2 / pp.123-131 / 2020
  • After observing that most tourists plan to complete their visits to multiple cultural heritage sites within one day, we surmised that for many museum visitors, the foremost concerns are how much time to spend at each location and how to maximize their enjoyment at a site while still balancing their travel itinerary. Recommendation systems in e-commerce are built on knowledge about the users' previous purchasing history; recommendation systems for museums, on the other hand, do not have an equivalent data source available. Recent solutions have incorporated advanced technologies such as algorithms that rely on social filtering, which builds recommendations from the nearest identified similar user. Our paper proposes a different approach that provides dynamic recommendations by deploying social filtering as well as content-based filtering using term frequency-inverse document frequency. The main challenge is to overcome a cold start, whereby no information is available on new users entering the system, and thus there is no strong background information for generating the recommendation. In these cases, our solution deploys statistical methods to create a recommendation, which can then be used to gather data for future iterations. We are currently running a pilot test at Chao Samphraya national museum and have received positive feedback to date on the implementation.

Research on Designing Korean Emotional Dictionary using Intelligent Natural Language Crawling System in SNS (SNS대상의 지능형 자연어 수집, 처리 시스템 구현을 통한 한국형 감성사전 구축에 관한 연구)

  • Lee, Jong-Hwa
    • The Journal of Information Systems / v.29 no.3 / pp.237-251 / 2020
  • Purpose This research studied a hierarchical Hangul emotion index by organizing all the emotions that SNS users have in mind. In a preliminary study, the researcher reinterpreted Plutchik's (1980) English-based emotional standard in Korean and studied hashtags with implicit meaning on SNS. To build a multidimensional emotion dictionary and classify three-dimensional emotions, an emotion seed was selected for each of seven emotion sets, and an emotion word dictionary was constructed by collecting SNS hashtags derived from each emotion seed. We also explore the priority of each Hangul emotion index. Design/methodology/approach When the words constituting each sentence were vectorized into a matrix, weights were extracted using TF-IDF (Term Frequency-Inverse Document Frequency), and the NMF (Nonnegative Matrix Factorization) algorithm was used to reduce the dimension of the matrix in each emotion set. The emotional dimension was resolved by using the characteristic value of the emotion word. Cosine distance was used to measure the distance between vectors and thereby the similarity of emotion words within an emotion set. Findings Customer needs analysis relies on the ability to read changes in emotions, and research on Korean emotion words addresses that customer need. In addition, the ranking of the emotion words within an emotion set will be a special criterion for reading the depth of the emotion. We believe that the sentiment index of this study, by providing companies with effective information for emotional marketing, will expand new business opportunities and add to their value. In addition, if the emotion dictionary is eventually connected to the emotional DNA of the product, it will be possible to define the "emotional DNA", which is a set of emotions that the product should have.

Properties of the Twenty-seven Pulses in DongUiBoGam Based on the Eight Important Pulses (팔요맥을 중심으로 살펴본 『동의보감』 27맥 속성 연구)

  • Lee, Taehyung;Jung, Won-Mo;Go, Byeongho;Park, Hi-Joon;Kim, Namil;Chae, Younbyoung
    • Korean Journal of Acupuncture / v.32 no.4 / pp.151-159 / 2015
  • Objectives : Pulse diagnosis is considered particularly important among the several methods of diagnosis in DongUiBoGam. In spite of its importance, the numerous and varied pulse descriptions have made it difficult to learn and practice pulse diagnosis. In this article, we analyzed the properties of the twenty-seven pulses from pulse diagnosis cases in DongUiBoGam to enable a practical understanding of pulse diagnosis. Methods : We constructed four axes according to the eight important pulses, and we analyzed the properties of the twenty-seven pulses through the relationship between the four pairs of important pulses and the twenty-seven pulses. To quantify the relevance of the important pulses to the twenty-seven pulses, we used the term frequency-inverse document frequency (TF-IDF) method. Results : We could elicit the properties of the twenty-seven pulses according to the four axes. We also reexamined, in light of the analysis results, the categorization of the seven exterior pulses / the eight interior pulses and the similar pulses from DongUiBoGam. Conclusions : We could understand the properties of the twenty-seven pulses more specifically with the eight important pulses, and we could also see the relationship among the twenty-seven pulses on each axis. However, the limitation arising from the insufficient number of pulse diagnosis cases in this research calls for further research with more sources, such as other traditional medical records or present-day clinical records.

A Structural Analysis of Acupuncture & Moxibustion Points in the NaeGyeong Chapter of DongUiBoGam Using Text Mining (텍스트마이닝을 이용한 동의보감의 질병인식방식과 내경편 침구법 경혈 특성 분석)

  • Lee, Taehyung;Jung, Won-Mo;Lee, In-Seon;Lee, Hyejung;Kim, Namil;Chae, Younbyoung
    • Korean Journal of Acupuncture / v.30 no.4 / pp.230-242 / 2013
  • Objectives : DongUiBoGam is a representative medical text of Korea. This research intends to grasp structurally how DongUiBoGam understands the human body and to review the methods of acupuncture and moxibustion in its NaeGyeong chapter using text mining. Methods : The structure of DongUiBoGam was analyzed using specific parts of the book that describe its contents, the major premises of understanding the human body, and the processes of treatment. We analyzed the characteristics of each acupoint in relation to causes of diseases & symptoms in the NaeGyeong chapter using Term Frequency-Inverse Document Frequency (TFIDF). Results : Three different categories of pattern identification (PI) were formed after structural analysis of DongUiBoGam. All causes of diseases & symptoms were transformed according to the three categories of PI. After analyzing the relationship between acupoints and causes of diseases & symptoms, 114 acupoints were visualized with the TFIDF values of the three PI categories. Conclusions : The selection of acupoints in the NaeGyeong chapter of DongUiBoGam was linked to causes of diseases & symptoms based on the three PI categories. Through visualization of the bipartite relationships between acupoints and causes of diseases & symptoms, we could easily understand the characteristics of each acupoint.

A Feasibility Study on Adopting Individual Information Cognitive Processing as Criteria of Categorization on Apple iTunes Store

  • Zhang, Chao;Wan, Lili
    • The Journal of Information Systems / v.27 no.2 / pp.1-28 / 2018
  • Purpose More than 7.6 million mobile apps have been approved on the Apple iTunes Store and Google Play. To manage these apps, Apple Inc. established twenty-four primary categories, and Google Play had thirty-three primary categories. However, both categorizations have shown more and more problems in managing and classifying numerous apps, such as miscategorized apps, cross-attribution problems, and the lack of a categorization keyword index. This study introduces individual information cognitive processing as the classification criterion for updating the current categorization on the Apple iTunes Store. We also tried to observe the effectiveness of the new criterion in a classification process on the Apple iTunes Store. Design/Methodology/Approach A research approach with four stages was carried out, and a series of mixed methods was developed to assess the feasibility of adopting individual information cognitive processing as a categorization criterion. Keyword lists were extracted using machine-learning techniques with Term Frequency-Inverse Document Frequency and Singular Value Decomposition. Using prior research results related to car app categorization, we developed individual information cognitive processing. A further keyword extraction process was then performed on the extracted keyword lists. Findings Using TF-IDF and SVD, keyword lists were extracted from more than five thousand apps. Furthermore, we developed individual information cognitive processing that included a categorization teaching process and a learning process. The top three keywords for each category were extracted. When the extracted results were compared with prior studies, the inter-rater reliability between the two methods was significant, which indicates that individual information cognitive processing is reliable as a criterion of categorization on the Apple iTunes Store. Suggestions for updating the Apple iTunes Store are discussed in this paper, and the results may help app store hosts improve the current categorizations on app stores and increase the efficiency of the app discovery and location process for both app developers and users.

An Analysis of IoT Service using Sentiment Analysis on Online Reviews: Focusing on the Characteristics of Service Providers (감성분석을 활용한 사물인터넷(IoT) 서비스 리뷰 분석: 사업자 특성에 따른 차이를 중심으로)

  • Ryu, Min Ho;Cho, Hosoo
    • Journal of Korea Society of Industrial Information Systems / v.25 no.5 / pp.91-102 / 2020
  • The Internet of Things (IoT) is characterized as a market where various companies compete for the same consumers. Thus, the functions and performance provided differ according to the main business area and other characteristics of the service providers. This paper investigates whether satisfaction with the service provided depends on the characteristics of the operator by using sentiment analysis of comments. To achieve this goal, word importance analysis and sensitivity analysis are conducted on 34,310 reviews of 41 applications registered on Google Play. The review analysis was conducted at various levels, including the TF-IDF (term frequency-inverse document frequency) value of keywords, service sectors, the origin of providers, and domestic/foreign providers. The results show that users' overall assessment of IoT services was low, that smart homes received relatively high reviews compared to other services, and that manufacturing-based and overseas providers received relatively higher evaluations than others.

Analysis on the Trend of The Journal of Information Systems Using TLS Mining (TLS 마이닝을 이용한 '정보시스템연구' 동향 분석)

  • Yun, Ji Hye;Oh, Chang Gyu;Lee, Jong Hwa
    • The Journal of Information Systems / v.31 no.1 / pp.289-304 / 2022
  • Purpose The development of the network and mobile industries has induced companies to invest in information systems, leading a new industrial revolution. The Journal of Information Systems (JIS), which developed the information system field into a theoretical and practical study in the 1990s, retains a 30-year history of information systems. This study aims to identify the academic values and research trends of JIS by analyzing those trends. Design/methodology/approach This study analyzes the trend of JIS by combining various methods, collectively named TLS mining analysis. TLS mining analysis consists of a series of analyses including the Term Frequency-Inverse Document Frequency (TF-IDF) weight model, Latent Dirichlet Allocation (LDA) topic modeling, and text mining with Semantic Network Analysis. First, keywords are extracted from the research data using the TF-IDF weight model; after that, topic modeling is performed using the LDA algorithm to identify issue keywords. Findings The current study used the summary service for published research papers provided by the Korea Citation Index to analyze JIS. The 714 papers published from 2002 to 2021 were divided into two periods: 2002-2011 and 2012-2021. In the first period (2002-2011), the research trend in the information system field focused on E-business strategies, as most companies adopted online business models. In the second period (2012-2021), data-based information technology and new industrial revolution technologies such as artificial intelligence, SNS, and mobile were the main research issues in the information system field. In addition, keywords for improving the JIS citation index were presented.
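
The first stage named in the abstract above, keyword extraction with a TF-IDF weight model, might look like the following minimal Python sketch. This is an illustration under stated assumptions rather than the authors' pipeline: the whitespace tokenizer, the `tf * log(N/df)` weighting, the alphabetical tie-breaking, and the function name `top_keywords` are hypothetical, and the LDA and Semantic Network Analysis stages are not shown.

```python
import math
from collections import Counter


def top_keywords(docs, k=3):
    """Return the k highest-weighted TF-IDF terms for each document."""
    # Assumption: lowercase whitespace tokenization.
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(t for toks in tokenized for t in set(toks))
    keywords = []
    for toks in tokenized:
        tf = Counter(toks)
        scores = {
            t: (count / len(toks)) * math.log(n / df[t])
            for t, count in tf.items()
        }
        # Sort by descending weight; break ties alphabetically for stability.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        keywords.append(ranked[:k])
    return keywords


if __name__ == "__main__":
    # Hypothetical toy abstracts, not data from the study.
    abstracts = [
        "e-business strategy e-business adoption",
        "mobile data mobile analytics",
        "data strategy",
    ]
    for words in top_keywords(abstracts, k=2):
        print(words)
```

Terms that appear in many abstracts receive a low weight, so the keywords that surface are those that distinguish one paper from the rest of the collection; a topic model would then be fitted on the filtered vocabulary.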