• Title/Summary/Keyword: HTML Documents

X-tree Diff: An Efficient Change Detection Algorithm for Tree-structured Data (X-tree Diff: 트리 기반 데이터를 위한 효율적인 변화 탐지 알고리즘)

  • Lee, Suk-Kyoon;Kim, Dong-Ah
    • The KIPS Transactions:PartC
    • /
    • v.10C no.6
    • /
    • pp.683-694
    • /
    • 2003
  • We present X-tree Diff, a change detection algorithm for tree-structured data. Our work is motivated by the need to monitor a massive volume of web documents and detect suspicious changes, known as defacement attacks on web sites. In this context, the algorithm must be very efficient in both speed and memory use. X-tree Diff uses a special ordered labeled tree, the X-tree, to represent XML/HTML documents. X-tree nodes have a special field, tMD, which stores a 128-bit hash value representing the structure and data of a subtree, so that identical subtrees from the old and new versions can be matched. During this process, X-tree Diff applies the Rule of Delaying Ambiguous Matchings: it performs exact matching only where a node in the old version has a one-to-one correspondence with a node in the new version, and delays all other matchings. This drastically reduces the possibility of wrong matchings. X-tree Diff propagates such exact matchings upwards in Step 2 and obtains more matchings downwards from the roots in Step 3. In Step 4, the nodes to be inserted or deleted are decided. We also show that X-tree Diff runs in O(n) time, where n is the number of nodes in the X-trees, in the worst case as well as in the average case. This result is even better than that of the BULD Diff algorithm, which is O(n log(n)) in the worst case. We tested X-tree Diff on real data, about 11,000 home pages from about 20 web sites, instead of synthetic documents manipulated for experimentation. Currently, the X-tree Diff algorithm is being used in a commercial hacking detection system called WIDS (Web-Document Intrusion Detection System), which finds changes that occurred in registered web sites and reports suspicious changes to users.
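
The subtree-hash matching described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Node` class, the function names, and the choice of MD5 as the 128-bit hash are assumptions made for illustration, and only the first step (matching identical subtrees whose hash is unique in both versions, while delaying ambiguous ones) is shown.

```python
import hashlib
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical tree node; tmd holds the 128-bit subtree digest."""
    label: str
    text: str = ""
    children: list = field(default_factory=list)
    tmd: bytes = b""


def compute_tmd(node):
    """Hash the node's label, its text, and its children's digests, in order."""
    h = hashlib.md5()  # assumption: any 128-bit hash would serve here
    h.update(node.label.encode("utf-8"))
    h.update(b"\x00")
    h.update(node.text.encode("utf-8"))
    for child in node.children:
        h.update(compute_tmd(child))
    node.tmd = h.digest()
    return node.tmd


def walk(node):
    """Yield every node of the tree in pre-order."""
    yield node
    for child in node.children:
        yield from walk(child)


def match_unique_subtrees(old_root, new_root):
    """Pair subtrees whose digest occurs exactly once in each version.

    Digests that occur more than once on either side are ambiguous and are
    left unmatched here, in the spirit of delaying ambiguous matchings.
    """
    compute_tmd(old_root)
    compute_tmd(new_root)
    old_by_digest = defaultdict(list)
    new_by_digest = defaultdict(list)
    for n in walk(old_root):
        old_by_digest[n.tmd].append(n)
    for n in walk(new_root):
        new_by_digest[n.tmd].append(n)
    matches = []
    for digest, olds in old_by_digest.items():
        news = new_by_digest.get(digest, [])
        if len(olds) == 1 and len(news) == 1:
            matches.append((olds[0], news[0]))
    return matches
```

Each node is hashed once and looked up once in a dictionary, so this step is linear in the number of nodes on average, which is consistent with the O(n) behaviour the abstract reports for the full algorithm.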

Effect of Rule Identification in Acquiring Rules from Web Pages (웹 페이지의 내재 규칙 습득 과정에서 규칙식별 역할에 대한 효과 분석)

  • Kang, Ju-Young;Lee, Jae-Kyu;Park, Sang-Un
    • Journal of Intelligence and Information Systems
    • /
    • v.11 no.1
    • /
    • pp.123-151
    • /
    • 2005
  • In the world of Web pages, there are oceans of documents in natural language texts and tables. To extract rules from Web pages and maintain consistency between them, we have developed the framework of XRML (eXtensible Rule Markup Language). XRML allows the identification of rules on Web pages and generates the identified rules automatically. For this purpose, we have designed the Rule Identification Markup Language (RIML), which is similar to the formal Rule Structure Markup Language (RSML); both are parts of XRML. RIML is designed to identify rules not only from texts but also from tables on Web pages, and to transform them into formal rules in RSML syntax automatically. While designing RIML, we considered the features of sharing variables and values, omitted terms, and synonyms. Using these features, rules can be identified or changed once, automatically generating their corresponding RSML rules. We have conducted an experiment to evaluate the effect of the RIML approach with real-world Web pages of Amazon.com, BarnesandNoble.com, and Powells.com. We found that 97.7% of the rules can be detected on the Web pages, and the completeness of generated rule components is 88.5%. This is good evidence that XRML can facilitate the extraction and maintenance of rules from Web pages while building expert systems in the Semantic Web environment.

Seeking Alternative Models and Research Trends for Big Deals in the Electronic Journal Consortium (전자저널 빅딜 계약의 연구 동향과 대안 탐색)

  • Kim, Sang-Jun;Kim, Jeong-Hwan
    • Journal of Information Management
    • /
    • v.42 no.1
    • /
    • pp.85-111
    • /
    • 2011
  • The purpose of this study was to seek a workable alternative to the big deal, which is the largest issue for e-journal consortia and is tied to the journal budgets that sustain academic libraries. The study examined the current situation, strengths, and weaknesses of the big deal contract and the responses aimed at replacing it. After reviewing the literature, we looked into alternatives to the big deal, such as open access-based, usage-based, consortium improvement-based, publisher-led, and other models. As a result, the 'consortium cost reapportion model' was identified as an alternative for KESLI. In the short term the alternative takes a cost-division format, while in the long term it is oriented toward a consortium single (bloc) payment or national licence model. The model was based on data from the previous year. It weighted downloads of PDF and HTML documents three times more heavily than the other factors, and scored the remaining 14 factors at 0.5 to 5 points out of a total score of 100. In the final step, the total amount negotiated at the national level was allocated to the participating libraries of the KESLI consortium in 10, 20, and 30 grades.

A Design and Implementation of Event Processor for Playing SMIL 2.0 Documents (SMIL 2.0 문서 재생을 위한 이벤트 처리기의 설계 및 구현)

  • 김혜은;채진석;이재원;김성동;이종우
    • Journal of Korea Multimedia Society
    • /
    • v.7 no.2
    • /
    • pp.251-263
    • /
    • 2004
  • The Synchronized Multimedia Integration Language (SMIL), recommended by the World Wide Web Consortium (W3C) in 1998, is an XML-based declarative language for synchronizing and presenting multimedia documents. SMIL can create new multimedia data by integrating various types of separately existing multimedia objects such as text, video, graphics, and audio. It supports the synchronization of multimedia data, which is limited in current HTML-based Web technology. For SMIL to become popular, a multimedia server guaranteeing Quality of Service (QoS), an authoring tool, and a player need to be developed. Developing a SMIL authoring tool and player requires technologies that read and analyze a SMIL document and play various types of synchronized media objects on a timeline. In this paper, we describe the design and implementation of an event processor that supports the SMIL 2.0 timing model. We also develop a SMIL 2.0 player using the proposed event processor. This will facilitate the playback of SMIL content and thereby contribute to the prosperity of SMIL technology. The event processor can be reused in the various language profiles defined in the SMIL standard. The player is expected to be used with other standards that integrate SMIL, such as XHTML+SMIL and SMIL Animation.

A Machine Learning Approach to Web Image Classification (기계학습 기반의 웹 이미지 분류)

  • Cho, Soo-Sun;Lee, Dong-Woo;Han, Dong-Won;Hwang, Chi-Jung
    • The KIPS Transactions:PartB
    • /
    • v.9B no.6
    • /
    • pp.759-764
    • /
    • 2002
  • Although images occupy a large and important part of Web documents, there has not been much research on analyzing and understanding them. Many Web images carry important information, but others do not. In this paper, we classify Web images from currently served Web sites into erasable and non-erasable classes based on machine learning methods. For this research, we identified 16 special and rich features of Web images and experimented with Bayesian and decision tree methods. As a result, F-measures of 87.09% and 82.72% were achieved for the two methods, respectively. In particular, experiments comparing the effects of feature groups showed that the features added in this study are very useful for Web image classification.

Design and Implementation of Education Multimedia Content Authoring Tool (교육용 멀티미디어 컨텐츠 저작도구의 설계 및 구현)

  • 이혜정;정성태;정석태
    • Journal of the Korea Institute of Information and Communication Engineering
    • /
    • v.7 no.5
    • /
    • pp.955-963
    • /
    • 2003
  • In this paper, to help users write educational content more effectively, we implement a SMIL editor that lets anyone author educational content and multimedia text easily, based on the multimedia language SMIL. Thanks to its WYSIWYG interface, the editor allows teachers and other users who do not know SMIL to write educational content and multimedia text easily, to check in real time how a partially completed document works, and to revise it when it is not satisfactory. It also allows users to write content with explanations that aid learning, because usable multimedia objects can be inserted. The editor spares the user the inconvenience of memorizing SMIL tags and reduces the time and effort spent writing documents.

Introduction to Automatic Generation of Design Documents for Flight Software using Doxygen (Doxygen을 이용한 위성비행소프트웨어 설계문서 작성 자동화 방안 소개)

  • Lee, Jae-Seung;Yang, Seung-Eun;Choi, Jong-Wook;Cheon, Yee-Jin;Yun, Jeong-Oh
    • Proceedings of the Korea Information Processing Society Conference
    • /
    • 2012.11a
    • /
    • pp.844-847
    • /
    • 2012
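  • The development of a satellite is completed by integrating the results produced over a long period by experts in various fields. When many developers work together to produce a single product, as in satellite development, a vast amount of documentation accompanies the development process. In particular, for flight software, where code written by different developers is integrated and built into a single image, documents on the development process and the source code are essential for resolving the problems that arise and for debugging the required functions. Because such software design documents are large in volume and must be linked to the source code, the developers who wrote the source code have written them by hand. For example, in previous satellite flight software development, the SDD (Software Design Description), which records detailed design information such as the inputs, outputs, and functions of each code unit of the entire flight software, was written manually on the basis of the code written by the developers. This approach is prone to input errors by the author and requires manual work separate from software development, which costs time in document preparation. To address these problems, Europe uses Doxygen, an automatic documentation tool applicable to software development in various languages such as C, C++, C#, JAVA, and VHDL, for writing design and development documents. Doxygen also supports various output formats such as PDF, HTML, LaTeX, and RTF. This paper introduces a method that uses Doxygen to shorten the time needed to write satellite flight software development documents and to generate documents automatically by extracting the relevant design content from the source code.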

Web Document Transcoding Technique for Small Display Devices (소형 화면 단말기를 위한 웹 문서 변환 기법)

  • Shin, Hee-Sook;Mah, Pyeong-Soo;Cho, Soo-Sun;Lee, Dong-Woo
    • The KIPS Transactions:PartD
    • /
    • v.9D no.6
    • /
    • pp.1145-1156
    • /
    • 2002
  • We propose a web document transcoding technique that translates existing web pages designed for desktop computers into a form appropriate for hand-held devices connected to the wireless internet. By defining a content block based on visual separation and using it as the minimum unit for the analysis and conversion processes, web pages can be converted more accurately. We also reallocate the content blocks and generate a new index in order to provide a convenient interface without left-right scrolling on small-screen devices. Compared with existing approaches such as text-level summarization or partial extraction, these methods provide efficient navigation and full recognition of web documents. To gain these transcoding benefits, we propose the Layout-Forming Tag Analysis Algorithm, which analyzes the structural tags that cause visual separation, and the Component Grouping Algorithm, which extracts the content blocks. We also classify and rearrange the content blocks and generate the new index to produce web pages in a form appropriate for small display devices. We have designed and implemented our transcoding system in a proxy server and evaluated the methods and algorithms through an analysis of the transcoded results. Our transcoding system showed good results on most popular web pages with complicated structures.

A Proposal of a Keyword Extraction System for Detecting Social Issues (사회문제 해결형 기술수요 발굴을 위한 키워드 추출 시스템 제안)

  • Jeong, Dami;Kim, Jaeseok;Kim, Gi-Nam;Heo, Jong-Uk;On, Byung-Won;Kang, Mijung
    • Journal of Intelligence and Information Systems
    • /
    • v.19 no.3
    • /
    • pp.1-23
    • /
    • 2013
  • To discover significant social issues such as unemployment, economic crisis, and social welfare, which are urgent issues to be solved in modern society, researchers in the existing approach usually collect opinions from professional experts and scholars through online or offline surveys. However, such a method is not always effective. Because of the expense, a large number of survey replies is seldom gathered. In some cases, it is also hard to find experts who deal with specific social issues. Thus, the sample set is often small and may be biased. Furthermore, several experts may reach totally different conclusions about a social issue because each expert has a subjective point of view and a different background. In this case, it is considerably hard to figure out what the current social issues are and which of them are really important. To overcome the shortcomings of the current approach, in this paper we develop a prototype system that semi-automatically detects social issue keywords representing social issues and problems from about 1.3 million news articles issued by about 10 major domestic news outlets in Korea from June 2009 until July 2012. Our proposed system consists of (1) collecting and extracting texts from the collected news articles, (2) identifying only news articles related to social issues, (3) analyzing the lexical items of Korean sentences, (4) finding a set of topics regarding social keywords over time based on probabilistic topic modeling, (5) matching relevant paragraphs to a given topic, and (6) visualizing social keywords for easy understanding. In particular, we propose a novel matching algorithm relying on generative models. The goal of our proposed matching algorithm is to best match paragraphs to each topic.
Technically, using a topic model such as Latent Dirichlet Allocation (LDA), we can obtain a set of topics, each of which has relevant terms and their probability values. In our problem, given a set of text documents (e.g., news articles), LDA produces a set of topic clusters, and each topic cluster is then labeled by human annotators, where each topic label stands for a social keyword. For example, suppose there is a topic (e.g., Topic1 = {(unemployment, 0.4), (layoff, 0.3), (business, 0.3)}) and a human annotator labels Topic1 as "Unemployment Problem". In this example, it is non-trivial to understand what happened to the unemployment problem in our society. In other words, by looking only at social keywords, we have no idea of the detailed events occurring in our society. To tackle this matter, we develop a matching algorithm that computes the probability value of a paragraph given a topic, relying on (i) topic terms and (ii) their probability values. For instance, given a set of text documents, we segment each text document into paragraphs. In the meantime, using LDA, we extract a set of topics from the text documents. Based on our matching process, each paragraph is assigned to the topic that it best matches. Finally, each topic has several best matched paragraphs. Furthermore, suppose there is a topic (e.g., Unemployment Problem) and its best matched paragraph (e.g., Up to 300 workers lost their jobs in XXX company at Seoul). In this case, we can grasp the detailed information behind the social keyword, such as "300 workers", "unemployment", "XXX company", and "Seoul". In addition, our system visualizes social keywords over time. Therefore, through our matching process and keyword visualization, most researchers will be able to detect social issues easily and quickly.
Through this prototype system, we have detected various social issues appearing in our society and have also shown the effectiveness of our proposed methods through our experimental results. Note that our proof-of-concept system is available at http://dslab.snu.ac.kr/demo.html.
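
The paragraph-to-topic matching described in this abstract can be illustrated with a minimal sketch. The abstract does not give the authors' formula, so the scoring below is an assumption: a simple unigram log-likelihood of the paragraph's tokens under each topic, with a small floor probability for terms the topic does not list. The function names, the `floor` parameter, and the second topic ("Social Welfare") are hypothetical; Topic1 is taken from the abstract.

```python
import math


def paragraph_log_likelihood(tokens, topic, floor=1e-6):
    """Sum of log P(term | topic) over the tokens.

    Terms missing from the topic get the floor probability (an assumption,
    used only so that unknown words do not produce log(0)).
    """
    return sum(math.log(topic.get(t, floor)) for t in tokens)


def best_topic(paragraph, topics, floor=1e-6):
    """Return the label of the topic under which the paragraph scores highest."""
    tokens = paragraph.lower().split()
    return max(
        topics,
        key=lambda label: paragraph_log_likelihood(tokens, topics[label], floor),
    )


# Topic1 from the abstract, plus one hypothetical topic for contrast.
topics = {
    "Unemployment Problem": {"unemployment": 0.4, "layoff": 0.3, "business": 0.3},
    "Social Welfare": {"welfare": 0.5, "pension": 0.5},
}
paragraph = "layoff hits business as unemployment rises"
label = best_topic(paragraph, topics)
```

Because every topic is scored on the same token list, no length normalization is needed when choosing the best topic for a single paragraph; comparing paragraphs against each other for one topic would require it.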