• Title/Abstract/Keywords: web mining

Search Result 548, Processing Time 0.028 seconds

Evaluation of Thyroid Cancer Medical Information Sites using HONCODE (HONCODE를 근거로 한 갑상선암에 대한 의료정보 제공사이트의 질 평가)

  • Heo, Jun;Jung, Yong Gyu;Sihn, Sung Chul;Kim, Jang Il
    • Journal of Service Research and Studies / v.3 no.2 / pp.45-52 / 2013
  • With the development of information and communication technology, the social and economic influence of the Internet is growing rapidly, and health care is no exception. As the volume of health information on the Internet increases, its availability becomes more important to health care professionals and information specialists. However, health information continues to be published on the Internet without any guarantee or assessment of its quality. Users need access to quality-assured health information on the Internet. HONCODE was established and is managed by the HON (Health On the Net) Foundation. In this paper, Web sites that provide domestic medical information about thyroid cancer are evaluated using HONCODE. Through this evaluation, more accurate and better-vetted information about thyroid cancer could be provided on the Internet.

A Scalable Clustering Method for Categorical Sequences (범주형 시퀀스들에 대한 확장성 있는 클러스터링 방법)

  • Oh, Seung-Joon;Kim, Jae-Yearn
    • Journal of the Korean Institute of Intelligent Systems / v.14 no.2 / pp.136-141 / 2004
  • There has been enormous growth in the amount of commercial and scientific data, such as retail transactions, protein sequences, and web-logs. Such datasets consist of sequence data that have an inherent sequential nature. However, few clustering algorithms consider sequentiality. In this paper, we study how to cluster sequence datasets. We propose a new similarity measure to compute the similarity between two sequences. We also present an efficient method for determining the similarity measure and develop a clustering algorithm. Due to the high computational complexity of hierarchical clustering algorithms for clustering large datasets, a new clustering method is required. Therefore, we propose a new scalable clustering method using sampling and a k-nearest-neighbor method. Using a real dataset and a synthetic dataset, we show that the quality of clusters generated by our proposed approach is better than that of clusters produced by traditional algorithms.
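
The sampling plus k-nearest-neighbor idea in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' method: the abstract does not define its similarity measure, so `lcs_similarity` (longest common subsequence length divided by the longer sequence length) is a stand-in, and the function names `cluster_sample`, `assign_by_knn`, and `scalable_cluster` are hypothetical. Only the small sample is clustered hierarchically; every other sequence is labeled by a vote among its nearest sampled neighbors.

```python
import random
from collections import Counter


def lcs_similarity(a, b):
    """Stand-in similarity (an assumption): LCS length / longer length, in [0, 1]."""
    if not a or not b:
        return 0.0
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1] / max(len(a), len(b))


def cluster_sample(sample, n_clusters):
    """Average-link agglomerative clustering, run only on the small sample."""
    clusters = [[i] for i in range(len(sample))]
    while len(clusters) > n_clusters:
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                sims = [lcs_similarity(sample[i], sample[j])
                        for i in clusters[p] for j in clusters[q]]
                avg = sum(sims) / len(sims)
                if best is None or avg > best[0]:
                    best = (avg, p, q)
        _, p, q = best
        clusters[p].extend(clusters.pop(q))  # merge the most similar pair
    labels = [0] * len(sample)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


def assign_by_knn(seq, sample, labels, k=3):
    """Label an unsampled sequence by majority vote of its k most similar sampled sequences."""
    ranked = sorted(range(len(sample)),
                    key=lambda i: lcs_similarity(seq, sample[i]),
                    reverse=True)
    votes = Counter(labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]


def scalable_cluster(sequences, n_clusters, sample_size, k=3, seed=0):
    """Cluster a random sample, then extend the labels to all sequences via k-NN."""
    rng = random.Random(seed)
    sample_idx = sorted(rng.sample(range(len(sequences)), sample_size))
    sample = [sequences[i] for i in sample_idx]
    sample_labels = cluster_sample(sample, n_clusters)
    result = dict(zip(sample_idx, sample_labels))
    for i, seq in enumerate(sequences):
        if i not in result:
            result[i] = assign_by_knn(seq, sample, sample_labels, k)
    return [result[i] for i in range(len(sequences))]
```

The expensive pairwise step is confined to the sample, so the cost of labeling the remaining sequences grows linearly with the dataset size for a fixed sample. The result depends on the sample covering every cluster, which is why the sample size is left as an explicit parameter.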

Detecting Spam Data for Securing the Reliability of Text Analysis (텍스트 분석의 신뢰성 확보를 위한 스팸 데이터 식별 방안)

  • Hyun, Yoonjin;Kim, Namgyu
    • The Journal of Korean Institute of Communications and Information Sciences / v.42 no.2 / pp.493-504 / 2017
  • Recently, the tremendous amount of unstructured text data distributed through news, blogs, and social media has gained much attention from researchers and practitioners because these data contain abundant information about consumers' opinions. However, as the usefulness of text data increases, so do attempts to profit by distorting text data, whether maliciously or not. This increase in spam text data not only burdens users who want to obtain useful information with a large amount of inappropriate information, but also damages the reliability of information and information providers. Therefore, efforts must be made to improve the reliability of information and the quality of analysis results by detecting and removing spam data in advance. For this purpose, many studies on spam detection have been actively conducted in areas such as opinion spam detection, spam e-mail detection, and web spam detection. In this study, we introduce the core concepts and current research trends of spam detection and propose a methodology for detecting spam tags in blogs as one of the challenging attempts to improve the reliability of blog information.

A Sparse Data Preprocessing Using Support Vector Regression (Support Vector Regression을 이용한 희소 데이터의 전처리)

  • Jun, Sung-Hae;Park, Jung-Eun;Oh, Kyung-Whan
    • Journal of the Korean Institute of Intelligent Systems / v.14 no.6 / pp.789-792 / 2004
  • In fields such as web mining, bioinformatics, and statistical data analysis, missing values occur in many different forms. These missing values make the training data sparse. Typically, missing values are replaced by values predicted from the mean or mode. More advanced imputation methods, such as the conditional mean, tree methods, and Markov chain Monte Carlo algorithms, can also be used. However, the predictive accuracy of general imputation models decreases as the ratio of missing values in the training data increases. Moreover, the number of available imputations becomes limited as the missing ratio increases. To address this problem, we propose a preprocessing method for missing values based on statistical learning theory, specifically the support vector regression of Vapnik. The proposed method can be applied to sparse training data. We verified the performance of our model using data sets from the UCI Machine Learning Repository.
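
For orientation, the mean and mode replacement that this abstract names as the usual baseline can be sketched as below. It does not implement the proposed support vector regression approach; the function name `impute_column` and the use of `None` to mark a missing value are assumptions made for illustration.

```python
from collections import Counter
from statistics import mean


def impute_column(values, kind):
    """Fill missing entries (None) with the column mean or mode.

    kind: "numeric" uses the mean of the observed values;
          any other value uses the most frequent observed value (mode).
    """
    observed = [v for v in values if v is not None]
    if not observed:
        # Nothing to estimate from, so the column is returned unchanged.
        return list(values)
    if kind == "numeric":
        fill = mean(observed)
    else:
        fill = Counter(observed).most_common(1)[0][0]
    return [fill if v is None else v for v in values]
```

Every gap in a column receives the same value, so the estimate carries no information from the other attributes of the record. A regression-based imputer such as the one the abstract proposes would instead predict each missing value from the remaining attributes.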

Statistical Profiles of Users' Interactions with Videos in Large Repositories: Mining of Khan Academy Repository

  • Yassine, Sahar;Kadry, Seifedine;Sicilia, Miguel Angel
    • KSII Transactions on Internet and Information Systems (TIIS) / v.14 no.5 / pp.2101-2121 / 2020
  • The rapid growth of instructional video repositories and their widespread use as a tool to support education have raised the need for studies that assess the quality of these educational resources and their impact on the quality of the learning process that depends on them. The Khan Academy (KA) repository is one of the most prominent educational video repositories. It is widely used by different types of learners, students, and teachers. To better understand its characteristics and the impact of such repositories on education, we gathered a large amount of KA data using its API and various web scraping techniques, and then analyzed the data. This paper reports the first quantitative and descriptive analysis of the KA repository of open video lessons. First, we describe the structure of the repository. Then, we present analyses highlighting content-based growth and evolution. These descriptive analyses identify the main findings about the KA repository. Finally, we focus on users' interactions with video lessons, which consist of questions and answers posted on videos. We developed interaction profiles for the videos based on the number of users' interactions. We conducted regression analysis and statistical tests to mine the relation between those profiles and several proposed quality-related metrics. The results show that all interaction profiles are highly affected by video length and reuse rate in different subjects. We believe that this study provides valuable information for understanding the logic and the learning mechanism inside learning repositories, which can have major impacts on the education field in general, and particularly on the informal learning process and the instructional design process. This study can be considered one of the first quantitative studies to shed light on Khan Academy as an open educational resources (OER) repository. The results presented in this paper are crucial to understanding the KA video repository, its characteristics, and its impact on education.

A Sentiment Analysis Algorithm for Automatic Product Reviews Classification in On-Line Shopping Mall (온라인 쇼핑몰의 상품평 자동분류를 위한 감성분석 알고리즘)

  • Chang, Jae-Young
    • The Journal of Society for e-Business Studies / v.14 no.4 / pp.19-33 / 2009
  • With the continuously increasing volume of e-commerce transactions, it is now common to buy products and evaluate them on the World Wide Web. Product reviews are very useful to customers because they can make better decisions based on the indirect experience obtainable through the reviews. Product reviews express customers' sentiments and can thus be divided into positive and negative reviews. However, as the number of reviews in on-line shopping increases, it is inefficient or sometimes impossible for users to read all the relevant review documents. In this paper, we present a sentiment analysis algorithm that automatically classifies the subjective opinions in customers' reviews using opinion mining technology. The proposed algorithm focuses on product reviews in on-line shopping and provides summarized results from large volumes of product review data by determining whether each review is positive or negative. Additionally, this paper introduces an automatic review analysis system implemented on the basis of the proposed algorithm and presents experimental results verifying the efficiency of the algorithm.
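
The classify-then-summarize flow described in this abstract can be illustrated with a small lexicon-based sketch. The abstract does not specify the algorithm, so everything here is an assumption: the toy English word lists, the one-word negation rule, the extra "neutral" label for reviews with no net polarity, and the names `classify_review` and `summarize`.

```python
import re
from collections import Counter

# Toy lexicons (hypothetical); a real system would use a much larger, domain-tuned list.
POSITIVE = {"good", "great", "fast", "love", "excellent", "satisfied"}
NEGATIVE = {"bad", "slow", "broken", "poor", "disappointed", "terrible"}
NEGATIONS = {"not", "never", "no"}


def classify_review(text):
    """Return "positive", "negative", or "neutral" from summed word polarities."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = 0
    for i, tok in enumerate(tokens):
        polarity = (tok in POSITIVE) - (tok in NEGATIVE)  # +1, -1, or 0
        # Flip the polarity when the previous word is a negation ("not good").
        if polarity and i > 0 and tokens[i - 1] in NEGATIONS:
            polarity = -polarity
        score += polarity
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def summarize(reviews):
    """Count how many reviews fall into each class."""
    return Counter(classify_review(r) for r in reviews)
```

For example, `summarize(["Great product, fast delivery", "Not good, arrived broken"])` counts one positive and one negative review, which is the kind of summary a shopper could read instead of the full review list.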

Construction of the Digital Archive System from the Records of Westerners Who Stayed in Korea during the Enlightenment Period of Chosun (개화기 조선 체류 서양인 기록물의 디지털 아카이브 시스템 구축)

  • Chung, Heesun;Kim, Heesoon;Song, Hyun-Sook;Lee, Myeong-Hee
    • Journal of the Korean BIBLIA Society for library and Information Science / v.27 no.4 / pp.229-249 / 2016
  • This study was conducted to create a digital archive of local cultural contents compiled from the records of westerners who stayed in Korea during the Enlightenment Period of Chosun. The information was gathered from 22 records, and 10 main subjects, 40 sub-subjects, and 239 mini-subjects were derived through the subject classification scheme. Item analysis was conducted using 38 metadata elements, and input data types were classified and entered into a database in Excel. Finally, a web-based digital archiving system was developed for searching and providing information through various access points. Suggestions for future research are to expand the archive contents through continued discovery of westerners' records, to build an integrated information system of Korean digital archives incorporating individual archive systems, to standardize classification schemes and develop a multidimensional classification system that considers facet structure in cultural heritage areas, to keep contents consistent through standardization of the metadata format, and to build an ontology using semantic search and data mining functions.

The Korean HapMap Project Website

  • Kim, Young-Uk;Kim, Seung-Ho;Jin, Hoon;Park, Young-Kyu;Ji, Mi-Hyun;Kim, Young-Joo
    • Genomics & Informatics / v.6 no.2 / pp.91-94 / 2008
  • Single nucleotide polymorphisms (SNPs) are the most abundant form of human genetic variation and are a resource for mapping complex genetic traits. A genome is covered by millions of these markers, and researchers are able to compare which SNPs predominate in people who have a certain disease. The International HapMap Project, launched in October, 2002, motivated us to start the Korean HapMap Project in order to support Korean HapMap infrastructure development and to accelerate the finding of genes that affect health, disease, and individual responses to medications and environmental factors. A Korean SNP and haplotype database system was developed through the Korean HapMap Project to provide Korean researchers with useful data-mining information about disease-associated biomarkers for studies on complex diseases, such as diabetes, cancer, and stroke. Also, we have developed a series of software programs for association studies as well as the comparison and analysis of Korean HapMap data with other populations, such as European, Chinese, Japanese, and African populations. The developed software includes HapMapSNPAnalyzer, SNPflank, HWE Test, FESD, D2GSNP, SNP@Domain, KMSD, KFOD, KFRG, and SNP@WEB. We developed a disease-related SNP retrieval system, in which OMIM, GeneCards, and MeSH information were integrated and analyzed for medical research scientists. The kHapMap Browser system that we developed and integrated provides haplotype retrieval and comparative study tools of human ethnicities for comprehensive disease association studies (http://www.khapmap.org). It is expected that researchers may be able to retrieve useful information from the kHapMap Browser to find useful biomarkers and genes in complex disease association studies and use these biomarkers and genes to study and develop new drugs for personalized medicine.
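
Among the tools listed in this abstract is an HWE (Hardy-Weinberg equilibrium) test. The textbook chi-square form of that test for a biallelic SNP can be sketched as follows. This is the standard goodness-of-fit calculation with one degree of freedom, not necessarily what the Korean HapMap software implements; the function name `hwe_chi_square` and its argument names are hypothetical.

```python
import math


def hwe_chi_square(n_major_hom, n_het, n_minor_hom):
    """Chi-square test of Hardy-Weinberg equilibrium from genotype counts.

    Returns (chi2, p_value), using one degree of freedom.
    """
    n = n_major_hom + n_het + n_minor_hom
    if n == 0:
        raise ValueError("at least one genotype must be observed")
    p = (2 * n_major_hom + n_het) / (2 * n)  # major allele frequency
    q = 1.0 - p
    if p in (0.0, 1.0):
        # Monomorphic site: expected counts equal observed counts.
        return 0.0, 1.0
    observed = (n_major_hom, n_het, n_minor_hom)
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value
```

Genotype counts of 25, 50, and 25 match the expected proportions exactly and give a statistic of zero, whereas 50, 0, and 50 (no heterozygotes) give a statistic of 100 and a p-value close to zero. For small samples or rare alleles an exact test is usually preferred over this approximation.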

A Public-oriented e-marketplace Framework for the Mining Industry (광산업의 B2B 공적 e-Marketplace 프레임워크 구축에 관한 연구)

  • Park, Ki-Nam
    • Journal of Korea Society of Industrial Information Systems / v.11 no.5 / pp.53-61 / 2006
  • We propose the construction of a public-oriented e-Marketplace framework that efficiently stimulates transactions in non-metal industrial resources, based on the case of Mineralland. Firms in the non-metal industrial resources domain have a low level of informatization and weak capital structures. A public enterprise therefore has to construct an e-marketplace in which firms can trade using accurate market information. The framework consists of five domains: contents, commerce, communities, collaboration, and electronic authentication. To derive this framework, we reviewed many web sites and the literature on B2B in the industrial resources domain. In addition, this study provides practical implications and guidelines for activating a public-oriented e-Marketplace for non-metal industrial resources.

A Customer Profile Model for Collaborative Recommendation in e-Commerce (전자상거래에서의 협업 추천을 위한 고객 프로필 모델)

  • Lee, Seok-Kee;Jo, Hyeon;Chun, Sung-Yong
    • The Journal of the Korea Contents Association / v.11 no.5 / pp.67-74 / 2011
  • Collaborative recommendation is one of the most widely used methods of automated product recommendation in e-Commerce. For analyzing customers' preferences, traditional explicit ratings are less desirable than implicit ratings because they may impose an additional burden on the customers of e-commerce companies that deal with a large number of products. Cardinal scales, which are generally used to represent preference intensity, are also ineffective owing to their increasing estimation errors. In this paper, we propose a new way of constructing an ordinal scale-based customer profile for collaborative recommendation. A Web usage mining technique and lexicographic consensus are employed. An experiment shows that the proposed method performs better than existing CF methodologies.
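
One plausible reading of an ordinal, implicit-rating profile is sketched below. The abstract does not explain how the Web usage data or the lexicographic consensus are actually used, so the following is an assumption throughout: products are ranked per customer by comparing purchase, cart, and click counts lexicographically, and two customers are compared with a Kendall-tau-style rank agreement. The event names and the functions `ordinal_profile` and `rank_agreement` are hypothetical.

```python
from collections import defaultdict

# Assumed importance order of logged events, most important first.
EVENT_PRIORITY = ("purchase", "cart", "click")


def ordinal_profile(events):
    """Turn (product, event_type) log entries into a ranking, most preferred first.

    Products are compared lexicographically on (purchases, carts, clicks);
    remaining ties are broken by product name to keep the result deterministic.
    """
    counts = defaultdict(lambda: [0, 0, 0])
    for product, event in events:
        counts[product][EVENT_PRIORITY.index(event)] += 1
    return sorted(counts, key=lambda p: (tuple(-c for c in counts[p]), p))


def rank_agreement(profile_a, profile_b):
    """Kendall-tau-style agreement in [-1, 1] over products both customers ranked."""
    shared = [p for p in profile_a if p in profile_b]
    pos_b = {p: i for i, p in enumerate(profile_b)}
    pairs = concordant = 0
    for i in range(len(shared)):
        for j in range(i + 1, len(shared)):
            pairs += 1
            if pos_b[shared[i]] < pos_b[shared[j]]:
                concordant += 1
    if pairs == 0:
        return 0.0
    return (2 * concordant - pairs) / pairs
```

Because only the order of products matters, the profile needs no estimate of how much more a customer likes one product than another, which is the estimation problem the abstract attributes to cardinal scales. Customers with high agreement could then serve as neighbors in a collaborative filtering step.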