• Title/Summary/Keyword: web text analysis

Authorship Attribution of Web Texts with Korean Language Applying Deep Learning Method (딥러닝을 활용한 웹 텍스트 저자의 남녀 구분 및 연령 판별 : SNS 사용자를 중심으로)

  • Park, Chan Yub;Jang, In Ho;Lee, Zoon Ky
    • Journal of Information Technology Services / v.15 no.3 / pp.147-155 / 2016
  • With the rapid development of technology, web text is growing explosively and is attracting attention in many fields as a substitute for surveys. Facebook reaches up to 113 million users per month, and Twitter is used by various institutions and companies as a behavioral analysis tool. However, most research has focused on the meaning of the text itself, and studies of the text's author are lacking. This research therefore classifies text by the writer's sex and age, using 20,187 Facebook users' posts that reveal the sex and age of the writer. It applies Convolutional Neural Networks, a type of deep learning algorithm that recently came into the spotlight as an image classifier, to web text analysis. The results showed 92% accuracy, confirming the approach's potential as a text classifier. In addition, this research minimized Korean morpheme analysis and applied authorship attribution to Korean web text. Based on these features, the study can support uses such as web text management as an information resource for practitioners and non-grammatical analysis systems for researchers. Thus, this study proposes a new method for web text analysis.

HTML Text Extraction Using Frequency Analysis (빈도 분석을 이용한 HTML 텍스트 추출)

  • Kim, Jin-Hwan;Kim, Eun-Gyung
    • Journal of the Korea Institute of Information and Communication Engineering / v.25 no.9 / pp.1135-1143 / 2021
  • Recently, text has frequently been collected with web crawlers for big data analysis. However, to collect only the necessary text from a web page composed of numerous tags and texts, the HTML tags and style attributes that contain the required text must be specified in the web crawler, which is cumbersome. In this paper, we propose a method of extracting text using the frequency with which text appears in web pages, without specifying HTML tags and style attributes. In the proposed method, text is extracted from the DOM trees of all collected web pages, the frequency of appearance of each text is analyzed, and the main text is extracted by excluding text with a high frequency of appearance. Through this study, the superiority of the proposed method was verified.
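
The frequency idea in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract gives no threshold, so the `max_share` parameter, the helper names, and the choice to count frequency as "number of pages containing the text" are all assumptions made for illustration, using only Python's standard `html.parser`.

```python
from collections import Counter
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects the non-empty text nodes of one HTML page."""

    SKIPPED = {"script", "style"}  # assumption: these never hold main text

    def __init__(self):
        super().__init__()
        self.texts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.texts.append(text)


def text_nodes(html):
    """Return the visible text nodes of one page, in document order."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return parser.texts


def extract_main_text(pages, max_share=0.5):
    """Per page, keep text nodes that occur in at most max_share of all pages.

    max_share is a hypothetical threshold; the abstract does not state one.
    """
    per_page = [text_nodes(html) for html in pages]
    # Count in how many pages each distinct text occurs.
    page_count = Counter(t for texts in per_page for t in set(texts))
    limit = max_share * len(pages)
    return [[t for t in texts if page_count[t] <= limit] for texts in per_page]
```

Calling `extract_main_text` on several pages from the same site drops menu and footer strings that repeat across pages and keeps the text unique to each page.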

The Informative Support and Emotional Support Classification Model for Medical Web Forums using Text Analysis (의료 웹포럼에서의 텍스트 분석을 통한 정보적 지지 및 감성적 지지 유형의 글 분류 모델)

  • Woo, Jiyoung;Lee, Min-Jung;Ku, Yungchang
    • Journal of Information Technology Services / v.11 no.sup / pp.139-152 / 2012
  • In medical web forums, patients and patients' families share medical experience and information. Some people search for medical information written in non-expert language, and some offer words of comfort to those who are suffering from diseases. Medical web forums thus provide both informative support and emotional support. We propose a model that automatically classifies articles in a medical web forum into informative support and emotional support. We extract text features of forum articles using text mining techniques from a linguistic perspective and then perform supervised learning to classify texts into the informative support and emotional support types. We adopt the Support Vector Machine (SVM), Naive Bayes, and decision tree classifiers for automatic classification. We apply the proposed model to the HealthBoards forum, one of the largest and most dynamic medical web forums.
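
As a rough illustration of the supervised step, the sketch below implements one of the three classifier families named in the abstract, a multinomial Naive Bayes with add-one smoothing, in plain Python. The bag-of-words features, the whitespace tokenizer, and the label names are assumptions; the paper's linguistic features and its SVM and decision tree models are not reproduced here.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Assumption: lowercase whitespace tokens stand in for the paper's features.
    return text.lower().split()


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for word in tokenize(text):
                if word in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

After `NaiveBayes().fit(texts, labels)` on labeled forum posts, `predict(post)` returns the more probable of the training labels for a new post.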

WCTT: Web Crawling System based on HTML Document Formalization (WCTT: HTML 문서 정형화 기반 웹 크롤링 시스템)

  • Kim, Jin-Hwan;Kim, Eun-Gyung
    • Journal of the Korea Institute of Information and Communication Engineering / v.26 no.4 / pp.495-502 / 2022
  • Web crawlers, which are mainly used to collect text on the web today, are difficult to maintain and extend because researchers must implement different collection logic for each collection channel after analyzing the tags and styles of its HTML documents. To solve this problem, a web crawler should be able to collect text by formalizing HTML documents into the same structure. In this paper, we designed and implemented WCTT (Web Crawling system based on Tag path and Text appearance frequency), a web crawling system that collects text with a single collection logic by formalizing HTML documents based on tag path and text appearance frequency. Because WCTT collects text with the same logic for all collection channels, collection channels are easy to maintain and extend. In addition, it provides a preprocessing function that removes stopwords and extracts only nouns, for uses such as keyword network analysis.

HTML Text Extraction Using Tag Path and Text Appearance Frequency (태그 경로 및 텍스트 출현 빈도를 이용한 HTML 본문 추출)

  • Kim, Jin-Hwan;Kim, Eun-Gyung
    • Journal of the Korea Institute of Information and Communication Engineering / v.25 no.12 / pp.1709-1715 / 2021
  • To accurately extract the necessary text from a web page, the tag and style attributes where the main content exists can be specified to the web crawler, but this method has the problem that the logic for extracting the main content must be modified whenever the web page configuration changes. To solve this problem, a previous study proposed extracting text by analyzing the frequency of appearance of the text, but that method was limited in that its performance varied widely depending on the collection channel of the web page. Therefore, in this paper, we propose a method of extracting text with high accuracy from various collection channels by analyzing not only the frequency of appearance of text but also the parent tag paths of text nodes extracted from the DOM trees of web pages.
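
One way the two signals might be combined is sketched below. The abstract does not say how tag paths and frequencies are weighed against each other, so the rule used here (drop text that repeats across pages, then keep the tag path that carries the most remaining text) is an assumption, as are the `max_share` threshold, the `VOID_TAGS` list, and all function names.

```python
from collections import Counter, defaultdict
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}  # never closed


class PathCollector(HTMLParser):
    """Records (parent tag path, text) for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.nodes = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.nodes.append(("/".join(self.stack), text))


def path_nodes(html):
    parser = PathCollector()
    parser.feed(html)
    parser.close()
    return parser.nodes


def extract_by_path(pages, max_share=0.5):
    """Return, per page, the page-specific text under the dominant tag path."""
    per_page = [path_nodes(html) for html in pages]
    # Frequency step: how many pages contain each (path, text) pair.
    seen_in = Counter(n for nodes in per_page for n in set(nodes))
    limit = max_share * len(pages)  # hypothetical threshold
    # Tag-path step: total length of page-specific text under each path.
    weight = defaultdict(int)
    for nodes in per_page:
        for path, text in nodes:
            if seen_in[(path, text)] <= limit:
                weight[path] += len(text)
    if not weight:
        return [[] for _ in pages]
    main_path = max(weight, key=weight.get)
    return [
        [t for p, t in nodes if p == main_path and seen_in[(p, t)] <= limit]
        for nodes in per_page
    ]
```

Using the tag path as a second filter removes short page-specific strings, such as dates in a sidebar, that a frequency filter alone would keep.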

Locating Text in Web Images Using Image Based Approaches (웹 이미지로부터 이미지기반 문자추출)

  • Chin, Seongah;Choo, Moonwon
    • Journal of Intelligence and Information Systems / v.8 no.1 / pp.27-39 / 2002
  • A text-locating technique capable of locating and extracting text blocks in various Web images is presented here. Until now this area of work has been largely ignored by researchers, even though this sort of text may be meaningful for internet users. The algorithms associated with the technique work without prior knowledge of the text orientation, size, or font. In this work, our text extraction algorithm uses edge detection followed by histogram analysis of the genuine characteristics of letters, defined by the text clustering region, to extract text regions independently of font style and size. A number of experiments have shown acceptable results.

Empirical Analysis on the Effect of Design Pattern of Web Page, Perceived Risk and Media Richness to Customer Satisfaction (콘텐츠 제작방식, 지각된 위험, 미디어 풍부성이 고객만족에 미치는 영향 분석)

  • Park, Bong-Won;Lee, Jung-Mann;Lee, Jong-Won
    • The Journal of the Korea Contents Association / v.11 no.6 / pp.385-396 / 2011
  • Internet web pages can be classified into three major types: text only, images with text, and videos with text. The purpose of this paper is to analyze how customers perceive and respond to the design patterns of internet web pages in terms of perceived risk and media richness. Additionally, we examine the extent to which these factors affect customer satisfaction. The analysis of perceived risk revealed that customers feel less personal risk, including performance, psychological, and time/convenience risk, when using text-image and text-video web pages than when using text-only web pages. Customers also feel that web pages consisting of image-text or video-text score higher on symbolism and social presence in media richness than text-only web pages. Finally, we showed that personal risk and text-based Web pages negatively affect customer satisfaction, whereas symbolism and social presence affect it positively. This study therefore offers a clue as to why video-based Web content did not grow as many people expected.

A Technical Approach for Suggesting Research Directions in Telecommunications Policy

  • Oh, Junseok;Lee, Bong Gyou
    • KSII Transactions on Internet and Information Systems (TIIS) / v.8 no.12 / pp.4467-4488 / 2014
  • Bibliometric analysis is widely used for understanding research domains, trends, and knowledge structures in a particular field. It has mainly been used in information science and is now being applied to other academic fields. This paper describes an analysis of academic literature for classifying research domains and suggesting unexplored research areas in telecommunications policy. Application software was developed to retrieve Thomson Reuters' Web of Knowledge (WoK) data via web services. It was also used to conduct text mining analysis on the contents and citations of publications. We used three text mining techniques: Keyword Extraction Algorithm (KEA) analysis, co-occurrence analysis, and citation analysis. In addition, R software was used to visualize the term frequencies and the co-occurrence network among publications. We found that research on policies related to social communication services and the distribution of telecommunications infrastructure, as well as more practical and data-driven analyses, has been conducted in the recent decade. The citation analysis showed that the publications generally received citations, but most of them were not highly cited in telecommunications policy. However, although recent publications were not highly cited, the productivity of papers in terms of citations increased in the most recent ten years compared with research before 2004. The distribution methods of infrastructure, and inequity and gaps, also appeared as topics in important references. We propose the need for new research domains, since the analysis results imply that the decline of political approaches to technical problems is an issue in past research. Research on policies for new technologies in the field of telecommunications is also insufficient. This research is significant as the first bibliometric analysis using abstracts and citation data in telecommunications, and for its development of software with web service and text mining functions. Further research will be conducted with Big Data techniques and additional text mining techniques.
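
Of the three techniques listed in this abstract, co-occurrence analysis is the simplest to illustrate. The sketch below counts how many publications contain each keyword pair; the input format (one keyword list per publication) and the function name are assumptions, and it is written in Python rather than the R software the authors used for visualization.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(keyword_lists):
    """Count how many documents contain each unordered keyword pair."""
    pairs = Counter()
    for keywords in keyword_lists:
        # Deduplicate and sort so each pair has one canonical (a, b) form.
        for a, b in combinations(sorted(set(keywords)), 2):
            pairs[(a, b)] += 1
    return pairs
```

The resulting counts can serve as edge weights in a co-occurrence network, with keywords as nodes.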

Text Extraction In WWW Images (웹 영상에 포함된 문자 영역의 추출)

  • 김상현;심재창;김중수
    • Proceedings of the IEEK Conference / 2000.06d / pp.15-18 / 2000
  • In this paper, we propose a method for text extraction in Web images. Our approach is based on contrast detection and pixel component ratio analysis at the mouse position. The data extracted with OCR can be used for real-time dictionary lookup or language translation applications in a Web browser.

A Study of Main Contents Extraction from Web News Pages based on XPath Analysis

  • Sun, Bok-Keun
    • Journal of the Korea Society of Computer and Information / v.20 no.7 / pp.1-7 / 2015
  • Although data on the internet can be used in various fields, for example as a data source for IR (Information Retrieval), data mining, and knowledge information services, it contains a lot of unnecessary information. Removing the unnecessary data is a problem that must be solved before studying knowledge-based information services built on web page data. In this paper, we solve the problem by implementing XTractor (XPath Extractor). Since XPath is used to navigate the attribute data and data elements in an XML document, the XPath analysis is carried out through XTractor. XTractor extracts the main text by HTML parsing, XPath grouping, and detecting the XPath that contains the main data. In experiments on a large amount of data, the recognition and precision rates were 97.9% and 93.9%, respectively, except for a few cases, confirming that the main text of news articles can be properly extracted.
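
The three stages named in this abstract (parsing, XPath grouping, detecting the main XPath) can be sketched as follows. This is not XTractor itself: the sketch assumes well-formed XHTML so that the standard `xml.etree.ElementTree` parser can be used, builds index-free paths such as `/html/body/div/p`, and picks the group with the most text, a selection rule that the abstract does not specify.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def group_text_by_xpath(xhtml):
    """Map each index-free XPath to the text fragments found under it."""
    root = ET.fromstring(xhtml)
    groups = defaultdict(list)

    def walk(elem, path):
        path = f"{path}/{elem.tag}"
        if elem.text and elem.text.strip():
            groups[path].append(elem.text.strip())
        for child in elem:
            walk(child, path)
            # Text after a child's closing tag belongs to the parent element.
            if child.tail and child.tail.strip():
                groups[path].append(child.tail.strip())

    walk(root, "")
    return groups


def main_xpath(groups):
    """Pick the XPath whose grouped text is longest in total (an assumption)."""
    return max(groups, key=lambda p: sum(len(t) for t in groups[p]))
```

Real news pages are rarely well-formed XML, so a lenient HTML parser would be needed in front of this step.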