• Title/Summary/Keyword: Web document

Harmful Document Classification Using the Harmful Word Filtering and SVM (유해어 필터링과 SVM을 이용한 유해 문서 분류 시스템)

  • Lee, Won-Hee;Chung, Sung-Jong;An, Dong-Un
    • The KIPS Transactions:PartB / v.16B no.1 / pp.85-92 / 2009
  • As the World Wide Web has become more popular, it has been flooded with information delivered through web pages. Despite its convenience, the web also creates many problems because this flood of information is uncontrolled. Pornographic, violent, and other harmful content is freely available to young people, whom society must protect, and to other users who lack judgment or self-control, which creates serious social problems. Various methods have been proposed and studied to address such harmful content. This paper proposes and implements a system that protects young internet users from harmful content. To classify content effectively as harmful or harmless, the system uses a two-step classification: harmful-word filtering followed by SVM learning-based filtering. The system achieved an average precision of 92.1%.
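
The two-step classification described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the word list, weights, bias, and threshold are hypothetical placeholders, the second step uses a hand-set linear decision function as a stand-in for a trained SVM, and the rule that a filter hit labels a document harmful immediately is an assumption about how the two steps combine.

```python
import re

# Hypothetical harmful-word list and linear weights; placeholders only.
HARMFUL_WORDS = {"gore", "explicit"}
WEIGHTS = {"violence": 1.2, "weapon": 0.8, "school": -1.0, "recipe": -0.9}
BIAS = -0.5


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def word_filter(tokens, threshold=1):
    """Step 1: flag the document if it contains enough listed harmful words."""
    hits = sum(1 for t in tokens if t in HARMFUL_WORDS)
    return hits >= threshold


def linear_decision(tokens):
    """Step 2: linear decision value w.x + b over term counts (SVM stand-in)."""
    return sum(WEIGHTS.get(t, 0.0) for t in tokens) + BIAS


def classify(text):
    """Return 'harmful' or 'harmless' using the two-step cascade."""
    tokens = tokenize(text)
    if word_filter(tokens):
        return "harmful"
    return "harmful" if linear_decision(tokens) > 0 else "harmless"
```

In practice the weights and bias would come from a classifier trained on labeled documents rather than being set by hand.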

Web Attack Classification via WAF Log Analysis: AutoML, CNN, RNN, ALBERT (웹 방화벽 로그 분석을 통한 공격 분류: AutoML, CNN, RNN, ALBERT)

  • Youngbok Jo;Jaewoo Park;Mee Lan Han
    • Journal of the Korea Institute of Information Security & Cryptology / v.34 no.4 / pp.587-596 / 2024
  • Cyber attacks and cyber threats are becoming increasingly complex and continue to evolve. Using AI (Artificial Intelligence), the most important technology of the Fourth Industrial Revolution, to build cyber threat detection systems is therefore becoming important. In particular, government SOCs (Security Operation Centers) are highly interested in using AI to build SOAR (Security Orchestration, Automation and Response) solutions that predict and build CTI (Cyber Threat Intelligence). In this paper, we introduce a cyber threat detection system that analyzes network traffic and Web Application Firewall (WAF) log data. Additionally, we apply the well-known TF-IDF (Term Frequency-Inverse Document Frequency) method and AutoML technology to classify web traffic attack types.
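
As a rough illustration of the TF-IDF weighting mentioned in this abstract, the sketch below computes per-term weights for tokenized log lines. The log payloads are made up for the example, whitespace tokenization is an assumption, and the formula (tf = count / document length, idf = log(N / df)) is one common variant rather than necessarily the one the authors used.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document.

    Uses tf = count / document length and idf = log(N / df), one common
    TF-IDF variant; libraries often apply smoothing instead.
    """
    n_docs = len(documents)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical WAF log payloads, tokenized on whitespace.
logs = [
    "GET /index.php id=1 union select".split(),
    "GET /index.php id=2".split(),
    "GET /search q=<script>".split(),
]
vectors = tf_idf(logs)
```

Terms that appear in every log line, such as `GET` here, receive a weight of zero, while rare tokens such as `union` are weighted more heavily; the resulting vectors could then be passed to a classifier.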

A Study on Layout Extraction from Internet Documents Through Xpath (Xpath에 의한 인터넷 문서의 레이아웃 추출 방법에 관한 연구)

  • Han Kwang-Rok;Sun Bok-Keun
    • The Journal of the Korea Contents Association / v.5 no.4 / pp.237-244 / 2005
  • Most Internet documents today, including news data, are built from predefined templates, but templates are usually defined only for the main data and do not help information retrieval deal with indexes, advertisements, header data, and the like. Templates in this form are not appropriate when Internet documents are used as data for information retrieval. To process Internet documents in the various areas of information retrieval, it is necessary to detect additional information such as advertisements and page indexes. This study therefore proposes a method of detecting the layout of web pages by identifying the characteristics and structure of the block tags that affect page layout and by calculating distances between web pages. In the experiment, we successfully extracted 640 documents out of 1,000 samples, a recall rate of 64%. The method is intended to reduce the cost of automatic web document processing and improve its efficiency when applied to document preprocessing for information retrieval tasks such as data extraction and document summarization.
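
One way to picture the idea of comparing pages by their block-tag structure is the sketch below. It is only an assumption-laden illustration: the set of block tags, the XPath-like path representation, and the use of Jaccard distance are all choices made for this example, since the abstract does not specify the paper's tag list or distance measure.

```python
from html.parser import HTMLParser

# Assumed set of layout-affecting block tags; the paper's list is not given here.
BLOCK_TAGS = {"table", "tr", "td", "div", "ul", "p"}
# Tags without closing counterparts, skipped so the open-tag stack stays balanced.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class BlockPathCollector(HTMLParser):
    """Collect XPath-like paths (e.g. /html/body/div) of block tags."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self.stack.append(tag)
        if tag in BLOCK_TAGS:
            self.paths.add("/" + "/".join(self.stack))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed elements.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def block_paths(html):
    """Return the set of block-tag paths found in an HTML string."""
    collector = BlockPathCollector()
    collector.feed(html)
    return collector.paths


def layout_distance(html_a, html_b):
    """Jaccard distance between the two pages' block-tag path sets."""
    a, b = block_paths(html_a), block_paths(html_b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)
```

Pages generated from the same template would be expected to share most block-tag paths and so have a small distance under this measure.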

Implementation of an XML-Based Editor/Transformer for Large Volume of Similar Documents (XML 기반의 대용량 유사 문서 편집기/변환기 구현)

  • 황인준
    • The Journal of Society for e-Business Studies / v.9 no.1 / pp.21-38 / 2004
  • With its recent popularity, the Web is now considered a huge repository of information. Most documents on the web have been created using HTML (HyperText Markup Language). Although HTML is simple and easy to learn, several of its features are obstacles to efficient information retrieval. XML (eXtensible Markup Language) can solve such problems and has already been used in many applications. XML is a standard markup language for exchanging data on the web. It can describe a document's structure freely by defining a DTD, which enables efficient integration and retrieval of data on the web. In this paper, we propose a versatile and efficient XML document manager. Its features include (i) a form-based XML editor that enables easy creation of new XML documents, (ii) an automatic document converter that transforms HTML documents with similar structure into XML documents, and (iii) a GUI-based DTD editor.

Recommendation System using Associative Web Document Classification by Word Frequency and α-Cut (단어 빈도와 α-cut에 의한 연관 웹문서 분류를 이용한 추천 시스템)

  • Jung, Kyung-Yong;Ha, Won-Shik
    • The Journal of the Korea Contents Association / v.8 no.1 / pp.282-289 / 2008
  • Although collaborative filtering has seen some technological improvements, it has yet to fully reflect the actual relations among items. In this paper, we propose a recommendation system that uses associative web document classification by word frequency and $\alpha$-cut to address the shortcomings of collaborative filtering. The proposed method extracts words from web documents through morpheme analysis and accumulates term-frequency weights. It generates association rules with the Apriori algorithm and applies the term-frequency weights to their confidence. It then calculates the similarity among words using hypergraph partitioning. Lastly, it classifies related web documents using the $\alpha$-cut and calculates similarity using adjusted cosine similarity. The results show that the proposed method significantly outperforms the existing methods.
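
The $\alpha$-cut step named in this abstract can be illustrated with a short sketch: given membership degrees in [0, 1], the $\alpha$-cut keeps the items whose degree is at least $\alpha$. The document names, degrees, and threshold below are hypothetical, and how the paper derives the degrees is not shown here.

```python
def alpha_cut(memberships, alpha):
    """Return the items whose membership degree is at least alpha."""
    return {item for item, degree in memberships.items() if degree >= alpha}


# Hypothetical similarity of web documents to one topic class, in [0, 1].
similarity = {"doc1": 0.91, "doc2": 0.42, "doc3": 0.70, "doc4": 0.15}

# Documents treated as related to the class at alpha = 0.7.
related = alpha_cut(similarity, alpha=0.7)
```

Raising `alpha` makes the classification stricter, and lowering it admits more loosely related documents.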

An Improved Combined Content-similarity Approach for Optimizing Web Query Disambiguation

  • Kamal, Shahid;Ibrahim, Roliana;Ghani, Imran
    • Journal of Internet Computing and Services / v.16 no.6 / pp.79-88 / 2015
  • Web search engines face uncertainty because users enter ambiguous queries when trying to retrieve accurate results. Ambiguous queries constitute a significant fraction of such instances and pose real challenges to web search engines. Moreover, web search has drawn researchers' interest in handling search by considering context from a location perspective. Our proposed disambiguation approach is designed to improve user experience by combining location relevance with document relevance. The premise is that giving the user a comprehensive location perspective on a topic is more informative than retrieving a result that contains only temporal or context information. From a user's perspective, the ability to use this location information is potentially useful for several tasks, including query understanding and location-based clustering. To carry out the approach, we developed a Java-based prototype that derives contextual information from web results for queries taken from well-known datasets. Among those results, queries are further classified so that the search can be performed broadly. After results are provided to users and they make their selections, feedback is recorded implicitly to improve web search based on contextual information. The experimental results show that our approach achieved precision of 75%, accuracy of 73%, recall of 81%, and F-measure of 78% when compared with a generic temporal evaluation approach, and precision of 86%, accuracy of 71%, recall of 67%, and F-measure of 75% when compared with a web document clustering approach.

WCTT: Web Crawling System based on HTML Document Formalization (WCTT: HTML 문서 정형화 기반 웹 크롤링 시스템)

  • Kim, Jin-Hwan;Kim, Eun-Gyung
    • Journal of the Korea Institute of Information and Communication Engineering / v.26 no.4 / pp.495-502 / 2022
  • Web crawlers, which are mainly used today to collect text from the web, are difficult to maintain and extend because researchers must analyze the tags and styles of HTML documents and then implement different collection logic for each collection channel. To solve this problem, a web crawler should be able to collect text by formalizing HTML documents into the same structure. In this paper, we design and implement WCTT (Web Crawling system based on Tag path and Text appearance frequency), a web crawling system that collects text with a single collection logic by formalizing HTML documents based on tag paths and text appearance frequency. Because WCTT collects text with the same logic for all collection channels, it is easy to maintain and to extend with new collection channels. In addition, it provides a preprocessing function that removes stopwords and extracts only nouns for uses such as keyword network analysis.

Document Clustering Methods using Hierarchy of Document Contents (문서 내용의 계층화를 이용한 문서 비교 방법)

  • Hwang, Myung-Gwon;Bae, Yong-Geun;Kim, Pan-Koo
    • Journal of the Korea Institute of Information and Communication Engineering / v.10 no.12 / pp.2335-2342 / 2006
  • The web today is accumulating abundant information. Text-based documents in particular are a type that people use easily and frequently. Much research has therefore been conducted on retrieving text documents with methods based on probability, statistics, vector similarity, Bayesian models, and so on. However, these studies could not consider both the subject and the semantics of documents. To overcome these problems, we propose a document similarity method for semantic retrieval of the documents users want. This is the core method of document clustering. The method first expresses the document content as a semantic hierarchy and assigns weights to the important hierarchy domains of the document. With this, we can measure the similarity between documents using both the domain weights and the coincidence of concepts in the domain hierarchies.

Design and Implementation of Lesson Plan System for teacher-student based on XML (XML 기반 교수-학생 학습지도 시스템의 설계 및 구현)

  • Choi, Mun-Kyoung;Kim, Haeng-Kon
    • The KIPS Transactions:PartD / v.9D no.6 / pp.1055-1062 / 2002
  • Lesson plan documents recently introduced in education do not provide educational information systematically, and teachers find it difficult to compose them. Developing lesson plan documents therefore takes additional time and effort. As networks become more widely distributed, a web-based lesson plan system is required across all areas of education. We therefore need to compose lesson plans that can meet teachers' various requirements by supporting the creation, retrieval, and reuse of documents using standard XML on the web. In this paper, we develop a system that creates a common DTD (Document Type Definition) from an analysis of lesson plans and provides standard XML documents through that DTD. The system provides an editor for composing lesson plans and a search function that improves the reusability of existing lesson plans. We design structure-based, facet, and keyword search functions. The composed lesson plans interoperate with a database. Consequently, we can share information on the web by composing lesson plans in XML, and save time and cost by writing lesson plans directly on the web. We can also provide an improved learning environment.

XSLT document editing for XML document conversion in WYSIWYG environment (WYSIWYG 환경에서 XML 문서 변환을 위한 XSLT 문서편집 시스템)

  • 차원준;박주상;이용준;정회경
    • Proceedings of the Korean Institute of Information and Communication Sciences Conference / 2003.10a / pp.500-503 / 2003
  • XML, widely used as a standard for data exchange on the Internet, is drawing attention as a technology to replace existing document markup languages such as HTML. The defining characteristic of XML is that the logical information expressing a document's structural content is separated from the physical information expressing its style. Accordingly, the W3C recommended XSL, which offers styling functions similar in form to HTML, for styling and transforming XML data. XSL's transformation function also converts XML documents into other data formats and can describe style information through conversion among various document formats. However, XML document conversion technology using XSLT is still immature domestically, and there is a growing need for a solution that can edit XSLT documents efficiently. This paper makes XML documents convertible and outputtable in various document formats, and presents research on an XSLT document editing system that can edit and create XSLT documents efficiently in a WYSIWYG environment.
