• Title/Summary/Keyword: Wikipedia


Thesaurus Updating Using Collective Intelligence: Based on Wikipedia Encyclopedia (집단지성을 활용한 시소러스 갱신에 관한 연구: 위키피디아를 중심으로)

  • Han, Seung-Hee
    • Journal of the Korean Society for Information Management / v.26 no.3 / pp.25-43 / 2009
  • This study proposes a way to mine and update the classic thesaurus structure of terms and links from Wikipedia, a leading example of collective intelligence. A comparison with the ASIS&T thesaurus showed that Wikipedia offers substantial coverage of domain-specific concepts and semantic relations. The results also showed that structural features of Wikipedia, such as redirects, categories, and mutual links, are suitable for extracting the semantic relationships of a thesaurus. To generalize these findings, the approach should be applied to updating various other thesauri, including multilingual ones.
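
The abstract above names redirects, categories, and mutual links as sources of thesaurus relations but does not say how each one maps to a relation type. The sketch below is one plausible reading, not the author's procedure: it assumes redirects give equivalence (USE/UF), categories give hierarchy (BT/NT), and mutual links give association (RT). The function name `mine_relations` and the toy data are hypothetical.

```python
from collections import defaultdict


def mine_relations(redirects, categories, links):
    """Derive thesaurus-style relations from three Wikipedia structures.

    Assumed mapping (illustrative only, not taken from the paper):
      redirects:  {alias_title: target_title}      -> USE / UF
      categories: {article_title: [category, ...]} -> BT / NT
      links:      {article_title: {linked_title}}  -> RT when the link is mutual
    """
    relations = defaultdict(set)
    for alias, target in redirects.items():
        relations[alias].add(("USE", target))
        relations[target].add(("UF", alias))
    for article, cats in categories.items():
        for cat in cats:
            relations[article].add(("BT", cat))
            relations[cat].add(("NT", article))
    for source, outgoing in links.items():
        for dest in outgoing:
            # Only keep a link as "related term" if the target links back.
            if source != dest and source in links.get(dest, set()):
                relations[source].add(("RT", dest))
    return relations


# Toy data, invented for this example.
redirects = {"IR": "Information retrieval"}
categories = {"Information retrieval": ["Information science"]}
links = {
    "Information retrieval": {"Search engine", "Library"},
    "Search engine": {"Information retrieval"},
}
relations = mine_relations(redirects, categories, links)
```

With this toy input, "Search engine" becomes a related term of "Information retrieval" because the two articles link to each other, while the one-way link to "Library" is ignored.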

Document Clustering using Clustering and Wikipedia (군집과 위키피디아를 이용한 문서군집)

  • Park, Sun;Lee, Seong Ho;Park, Hee Man;Kim, Won Ju;Kim, Dong Jin;Chandra, Abel;Lee, Seong Ro
    • Proceedings of the Korean Institute of Information and Communication Sciences Conference / 2012.10a / pp.392-393 / 2012
  • This paper proposes a new document clustering method that uses clustering and Wikipedia. The proposed method represents the concepts of cluster topics well by means of NMF (non-negative matrix factorization). It also addresses a weakness of the "bag of words" model, which does not consider meaningful relationships between documents and clusters, by expanding the important terms of each cluster with synonyms from Wikipedia. The experimental results demonstrate that the proposed method achieves better performance than other document clustering methods.

A Study on Utilization of Wikipedia Contents for Automatic Construction of Linguistic Resources (언어자원 자동 구축을 위한 위키피디아 콘텐츠 활용 방안 연구)

  • Yoo, Cheol-Jung;Kim, Yong;Yun, Bo-Hyun
    • Journal of Digital Convergence / v.13 no.5 / pp.187-194 / 2015
  • Machines need a variety of linguistic knowledge resources to understand the diverse variations found in natural language. This paper aims to devise a method for automatically constructing linguistic resources that reflects the characteristics of online content and supports continuous expansion. We focus in particular on building an NE (named entity) dictionary, because NEs are highly applicable in linguistic analysis. Based on an investigation of Korean Wikipedia, we suggest an efficient method for constructing an NE dictionary using syntactic patterns and structural features such as metadata.

Automated Development of Rank-Based Concept Hierarchical Structures using Wikipedia Links (위키피디아 링크를 이용한 랭크 기반 개념 계층구조의 자동 구축)

  • Lee, Ga-hee;Kim, Han-joon
    • The Journal of Society for e-Business Studies / v.20 no.4 / pp.61-76 / 2015
  • Hierarchical concept trees are widely used as a crucial data structure for indexing huge amounts of textual data. This paper proposes a generality rank-based method that automatically develops hierarchical concept structures from Wikipedia data. The method treats each Wikipedia article as a concept and generates hierarchical relationships among concepts. To estimate the generality of concepts, we devised a ranking function that mainly uses the number of hyperlinks among Wikipedia articles. The ranking function is used to compute the probabilistic subsumption among concepts, which makes it possible to generate relatively more stable hierarchical structures. Finally, the set of concept pairs with a hierarchical relationship is visualized as a DAG (directed acyclic graph). In an empirical analysis using the concept hierarchy of the Open Directory Project, the proposed method outperformed a representative baseline method and extracted concept hierarchies automatically with high accuracy.
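
The abstract above does not give the ranking function or the exact subsumption rule, so the following is only a minimal sketch under stated assumptions: generality is approximated by the number of articles linking to a concept, and a concept is taken to subsume another when most of the articles that link to the child also link to the parent. The names `generality`, `subsumption_edges`, the 0.8 threshold, and the toy link sets are all hypothetical.

```python
def generality(concept, inlinks):
    """Assumed generality score: number of articles that link to the concept."""
    return len(inlinks.get(concept, set()))


def subsumption_edges(inlinks, threshold=0.8):
    """Return sorted (parent, child) pairs where parent probably subsumes child.

    inlinks maps each concept to the set of article ids linking to it.
    """
    edges = []
    for parent, parent_links in inlinks.items():
        for child, child_links in inlinks.items():
            if parent == child or not child_links:
                continue
            # P(parent | child): share of the child's linking articles
            # that also link to the parent.
            p_parent_given_child = len(parent_links & child_links) / len(child_links)
            more_general = generality(parent, inlinks) > generality(child, inlinks)
            if p_parent_given_child >= threshold and more_general:
                edges.append((parent, child))
    return sorted(edges)


# Toy link sets, invented for this example.
inlinks = {
    "Science": {1, 2, 3, 4, 5, 6},
    "Physics": {1, 2, 3, 4},
    "Optics": {1, 2},
}
edges = subsumption_edges(inlinks)
```

Because an edge is only added when the parent is strictly more general than the child, the resulting edge set cannot contain a cycle, which is consistent with the DAG the abstract describes.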

An Experimental Study on Feature Selection Using Wikipedia for Text Categorization (위키피디아를 이용한 분류자질 선정에 관한 연구)

  • Kim, Yong-Hwan;Chung, Young-Mee
    • Journal of the Korean Society for Information Management / v.29 no.2 / pp.155-171 / 2012
  • In text categorization, core terms of an input document are rarely selected as classification features if they do not occur in the training document set. In addition, synonymous terms that express the same concept are usually treated as different features. This study aims to improve text categorization performance by integrating synonyms into a single feature and by using Wikipedia to replace input terms that are absent from the training document set with the most similar term that does occur in the training documents. For the selection of classification features, experiments were performed in various settings composed of three conditions: the use of category information of non-training terms, the part of Wikipedia used for measuring term-term similarity, and the type of similarity measure. The categorization performance of a kNN classifier improved by 0.35~1.85% in $F_1$ value in all experimental settings when non-training terms were replaced by the training term with the highest similarity above the threshold value. Although the improvement is not as high as expected, several semantic and structural devices of Wikipedia could be used for selecting more effective classification features.
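
The replacement step described above can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: each term is represented by a sparse vector built from some part of its Wikipedia article, similarity is cosine similarity, and the names `cosine`, `replace_unseen`, the 0.5 threshold, and the toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(key, 0.0) for key, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def replace_unseen(tokens, training_vocab, vectors, threshold=0.5):
    """Replace terms missing from the training vocabulary with the most
    similar training term, if its similarity is above the threshold."""
    result = []
    for token in tokens:
        if token in training_vocab or token not in vectors:
            result.append(token)  # already known, or nothing to compare with
            continue
        best_term, best_sim = None, threshold
        for candidate in sorted(training_vocab):
            sim = cosine(vectors[token], vectors.get(candidate, {}))
            if sim > best_sim:
                best_term, best_sim = candidate, sim
        result.append(best_term if best_term is not None else token)
    return result


# Toy vectors, invented for this example (keys stand for linked article titles).
vectors = {
    "automobile": {"Vehicle": 1.0, "Engine": 1.0},
    "car": {"Vehicle": 1.0, "Engine": 1.0, "Road": 1.0},
    "bank": {"Finance": 1.0},
}
training_vocab = {"car", "bank"}
replaced = replace_unseen(["automobile", "bank", "zebra"], training_vocab, vectors)
```

Here "automobile" is mapped to "car" because their similarity (about 0.82) exceeds the threshold, while "zebra" is left unchanged because no Wikipedia vector is available for it.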

A Study on the Knowledge Formation Process of Wikipedia in Korea through Big Data Analysis (빅데이터 분석을 통해 본 한국 위키피디아의 지식형성 과정에 관한 연구)

  • Lee, Jungyeoun;Jeon, Suhyeon
    • Journal of the Korean Society for Information Management / v.37 no.2 / pp.171-195 / 2020
  • This study analyzed the collaborative process as a time series by breaking down the large-scale edit logs of Wikipedia Korea, a representative online collaboration community, from early 2002 to 2019. Analysis elements were extracted from the document edit records, formatted in standardized XML, and analyzed using Python and R. The analysis explains how editors contribute, the characteristics of the content, and trends in document creation. It revealed active contribution by a small set of editors and loose participation by the majority. Sociocultural characteristics that appear in online communities were also found in Wikipedia Korea. A new, diverse set of external resources is necessary to sustain the collective intelligence. The study suggests efforts to help new editors settle into the Wikipedia community, and openness through a circulation structure to avoid exclusiveness in the management group.
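
As a rough illustration of this kind of edit-log analysis in Python, the sketch below counts revisions per editor and per year from a small XML fragment. The fragment is a simplified, invented stand-in loosely modeled on MediaWiki export files (real dumps declare an XML namespace and carry many more fields), and `load_revisions` is a hypothetical helper, not code from the study.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Simplified, invented sample; real export files are namespaced and far larger.
SAMPLE_LOG = """
<mediawiki>
  <page>
    <title>Seoul</title>
    <revision>
      <timestamp>2005-03-01T10:00:00Z</timestamp>
      <contributor><username>editor_a</username></contributor>
    </revision>
    <revision>
      <timestamp>2005-07-12T08:30:00Z</timestamp>
      <contributor><username>editor_a</username></contributor>
    </revision>
    <revision>
      <timestamp>2019-01-05T21:15:00Z</timestamp>
      <contributor><ip>192.0.2.1</ip></contributor>
    </revision>
  </page>
  <page>
    <title>Kimchi</title>
    <revision>
      <timestamp>2019-02-02T02:02:00Z</timestamp>
      <contributor><username>editor_b</username></contributor>
    </revision>
  </page>
</mediawiki>
"""


def load_revisions(xml_text):
    """Return a list of (title, year, editor) tuples, one per revision."""
    root = ET.fromstring(xml_text.strip())
    rows = []
    for page in root.iter("page"):
        title = page.findtext("title", default="")
        for rev in page.iter("revision"):
            # Revisions without a username (e.g. IP edits) are grouped together.
            editor = rev.findtext("contributor/username", default="(anonymous)")
            year = rev.findtext("timestamp", default="")[:4]
            rows.append((title, year, editor))
    return rows


rows = load_revisions(SAMPLE_LOG)
edits_per_editor = Counter(editor for _, _, editor in rows)
edits_per_year = Counter(year for _, year, _ in rows)
```

Sorting `edits_per_editor` by count is one simple way to look at how concentrated contributions are among a small set of editors.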

Automatic Construction of Class Hierarchies and Named Entity Dictionaries using Korean Wikipedia (한국어 위키피디아를 이용한 분류체계 생성과 개체명 사전 자동 구축)

  • Bae, Sang-Joon;Ko, Young-Joong
    • Journal of KIISE:Computing Practices and Letters / v.16 no.4 / pp.492-496 / 2010
  • Wikipedia, as an open encyclopedia, contains immense human knowledge written by thousands of volunteer editors, and its reliability is high. In this paper, we propose a method for automatically constructing a Korean named entity dictionary using several features of Wikipedia. First, we generate class hierarchies using the class information from each Wikipedia article. Second, we map the title of each article to our class hierarchies and calculate the entropy value of the root node of each class hierarchy. Finally, we construct a high-performance named entity dictionary by removing the class hierarchies whose entropy value is higher than a threshold. In our experiments, the method achieved an overall F1-measure of 81.12% (precision: 83.94%, recall: 78.48%).
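
The abstract does not state which distribution the entropy is computed over. The sketch below assumes that each title mapped to a class hierarchy carries a candidate named entity type, and that the root's entropy is the Shannon entropy of those types, so a hierarchy mixing many types scores high and is dropped. The function names, the 1.0 threshold, and the toy labels are hypothetical. The last helper simply recomputes the reported F1 from the reported precision and recall.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of the label distribution."""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def filter_hierarchies(hierarchies, threshold=1.0):
    """Keep hierarchies whose root entropy does not exceed the threshold."""
    return {
        root: labels
        for root, labels in hierarchies.items()
        if entropy(labels) <= threshold
    }


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Toy data, invented for this example: NE types of titles under each root class.
hierarchies = {
    "Cities": ["LOC", "LOC", "LOC", "LOC"],
    "1990 births": ["PER", "PER", "PER", "ORG"],
    "Disambiguation": ["PER", "LOC", "ORG", "LOC"],
}
kept = filter_hierarchies(hierarchies)
reported_f1 = round(f1_score(83.94, 78.48), 2)
```

In this toy run the mixed "Disambiguation" hierarchy (entropy 1.5 bits) is removed, and the recomputed F1 agrees with the 81.12% given in the abstract.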

An Ontology-based Analysis of Wikipedia Usage Data for Measuring degree-of-interest in Country (국가별 관심도 측정을 위한 온톨로지 기반 위키피디아 사용 데이터 분석)

  • Kim, Hyon Hee;Jo, Jinnam;Kim, Donggeon
    • Journal of the Korea Society of Computer and Information / v.19 no.4 / pp.43-53 / 2014
  • In this paper, we propose an ontology-based approach to measuring degree of interest by country by analyzing Wikipedia usage data. First, we developed a degree-of-interest ontology, called the DOI ontology, by extracting concept hierarchies from Wikipedia categories. Second, we map the titles of frequently edited articles onto the DOI ontology and measure degree of interest based on the DOI ontology by analyzing Wikipedia page views. Finally, we perform a chi-square test of independence to determine whether fields of interest are independent of country. The results show that fields of interest are closely related to country, and that this approach provides degree of interest by country in a more timely and flexible way than conventional questionnaire survey analysis.
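
For readers unfamiliar with the final step, here is a minimal sketch of the chi-square statistic for a test of independence on a country-by-field table. The counts are invented for illustration and are not data from the paper; `chi_square` is a hypothetical helper written with the standard library only.

```python
def chi_square(table):
    """Return (statistic, degrees_of_freedom) for a contingency table.

    table is a list of rows, e.g. one row per country and one column per
    field of interest, holding observed counts.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if row and column variables were independent.
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(col_totals) - 1)
    return statistic, dof


# Invented counts: rows are two countries, columns are two fields of interest.
observed = [
    [30, 10],
    [10, 30],
]
statistic, dof = chi_square(observed)
```

For this toy table the statistic is 20.0 with 1 degree of freedom, well above the usual 5% critical value of about 3.84, so independence between country and field would be rejected.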

An Online Terminology Dictionary of Traditional Korean Medicine (온라인 한의학 용어 사전 시스템 구축)

  • Kim, Sang-Kyun;Jang, Hyun-Chul;Yea, Sang-Jun;Kim, Chul;Song, Mi-Young
    • Korean Journal of Oriental Medicine / v.18 no.1 / pp.45-52 / 2012
  • Objectives: Our study aims to provide a collaborative Internet terminology dictionary like Wikipedia, in which about 30,000 concept terms related to traditional Korean medicine (TKM) are shared and can be edited by TKM experts. Methods: The concept terms were collected and refined over three years using the terminology management system, custom-made software built on the Oracle database, in which each term is divided and normalized into one or more tables. Wikipedia runs on MediaWiki, free and open source wiki software built on the MySQL database. The database schema of our terminology management system differs from that of MediaWiki, so MediaWiki cannot be used as our terminology dictionary. We therefore propose a way to share and edit TKM terminology through a wiki-like user interface. Results: We devised a new terminology dictionary system for searching and editing terminology on top of the database of the terminology management system. The online terminology dictionary of TKM has a user interface and functions similar to those of Wikipedia to support collaborative work. Conclusions: Wikipedia runs on MediaWiki, which can be downloaded and used freely under the GNU General Public License. However, problems arise when MediaWiki is used on top of a legacy system. These problems should be taken into account when other wiki projects start.

Extracting Korean-English Parallel Sentences from Wikipedia (위키피디아로부터 한국어-영어 병렬 문장 추출)

  • Kim, Sung-Hyun;Yang, Seon;Ko, Youngjoong
    • Journal of KIISE:Software and Applications / v.41 no.8 / pp.580-585 / 2014
  • This paper conducts a variety of experiments on "the extraction of Korean parallel sentences using Wikipedia data". We draw on various methods previously proposed for other languages and use two approaches. The first uses translation probabilities extracted from existing resources such as the Sejong parallel corpus, and the second uses dictionaries such as a Wiki dictionary consisting of Wikipedia titles and MRDs (machine readable dictionaries). Experimental results show a significant improvement for the system that uses Wikipedia data compared with one that uses only the existing resources. We finally achieve an outstanding performance, an F1-score of 57.6%. We additionally conduct experiments using a topic model. Although this experiment shows a relatively lower performance, an F1-score of 51.6%, it is expected to be worthy of further study.
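
To make the dictionary-based approach more concrete, the sketch below scores candidate sentence pairs by how many Korean tokens have a dictionary translation in the English sentence. It is a simplified illustration, not the authors' system: whitespace tokenization, the scoring rule, the 0.5 threshold, and the names `overlap_score` and `extract_pairs` are all assumptions, and the tiny dictionary is invented.

```python
def overlap_score(ko_tokens, en_tokens, dictionary):
    """Fraction of Korean tokens with at least one translation in the
    English sentence, according to the bilingual dictionary."""
    if not ko_tokens:
        return 0.0
    en_set = {tok.lower() for tok in en_tokens}
    hits = sum(1 for tok in ko_tokens if dictionary.get(tok, set()) & en_set)
    return hits / len(ko_tokens)


def extract_pairs(ko_sentences, en_sentences, dictionary, threshold=0.5):
    """Pair each Korean sentence with its best-scoring English sentence,
    keeping the pair only if the score reaches the threshold."""
    pairs = []
    for ko in ko_sentences:
        scored = [
            (overlap_score(ko.split(), en.split(), dictionary), en)
            for en in en_sentences
        ]
        if not scored:
            continue
        best_score, best_en = max(scored, key=lambda item: item[0])
        if best_score >= threshold:
            pairs.append((ko, best_en, best_score))
    return pairs


# Invented toy data; a real Wiki dictionary would be built from article titles.
DICTIONARY = {"서울": {"seoul"}, "한국": {"korea"}, "수도": {"capital"}, "김치": {"kimchi"}}
KO_SENTENCES = ["서울 한국 수도", "김치 발효 음식"]
EN_SENTENCES = ["Seoul is the capital of Korea", "The weather is nice today"]
pairs = extract_pairs(KO_SENTENCES, EN_SENTENCES, DICTIONARY)
```

Only the first Korean sentence finds a match here; the second has no dictionary overlap with either English sentence and is discarded.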