
Integration of Ontology Model and Product Structure for the Requirement Management of Building Specification (건조사양서 요구사항의 추적을 위한 온톨로지 모델과 제품구조 통합 기초 연구)

  • Kim, Seung-Hyun;Lee, Jang-Hyun;Han, Eun-Jung
    • Journal of the Society of Naval Architects of Korea / v.48 no.3 / pp.207-214 / 2011
  • Ship design requirements described in the building specification should be reflected in the design process. This paper identifies the configuration of the requirements stated in the building specification using the Web Ontology Language (OWL). An ontology-based semantic search system specifies the requirement items, and through this extraction the items mentioned in the building specification are organized into a tree. The tracking of ship design requirements, together with the set of procedures that instruct the design, also follows the V model of systems engineering. A semantic search engine combining a robot agent and the ontology can search the requirement specification documents and extract design information. Finally, a requirement tracking model is proposed that relates the design requirements to the associated BOM (bill of materials) and product structure.

An Incremental Clustering Technique of XML Documents using Cluster Histograms (클러스터의 히스토그램을 이용한 XML 문서의 점진적 클러스터링 기법)

  • Hwang, Jeong-Hee
    • Journal of KIISE:Databases / v.34 no.3 / pp.261-269 / 2007
  • As basic research toward the efficient integration and retrieval of XML documents, this paper proposes a method for clustering XML documents by structure. We apply an algorithm designed for processing large numbers of transactions to the clustering of XML documents, which is quite different from previous algorithms that measure structural similarity. Our method clusters XML documents not only by using cluster histograms, which represent the distribution of items within clusters, but also by considering global cluster cohesion. We compare the proposed method with existing techniques through experiments, which show that our method not only creates good-quality clusters but also improves processing time.
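The transaction-style, histogram-based assignment this abstract describes can be sketched roughly as follows. This is a minimal illustration, not the paper's algorithm: representing each XML document as a set of element paths, the similarity measure, and the `threshold` parameter are all assumptions.

```python
from collections import Counter

def histogram_similarity(doc_items, histogram, size):
    """Average fraction of cluster members that already contain each
    of the document's items (read off the cluster histogram)."""
    if size == 0:
        return 0.0
    return sum(histogram[i] / size for i in doc_items) / len(doc_items)

def incremental_cluster(docs, threshold=0.5):
    """Assign each document (a set of structure items, e.g. element
    paths) to the best-matching cluster, or open a new one."""
    clusters = []  # each: {"histogram": Counter, "size": int, "members": list}
    for idx, items in enumerate(docs):
        best, best_sim = None, threshold
        for c in clusters:
            sim = histogram_similarity(items, c["histogram"], c["size"])
            if sim >= best_sim:
                best, best_sim = c, sim
        if best is None:
            best = {"histogram": Counter(), "size": 0, "members": []}
            clusters.append(best)
        best["histogram"].update(items)  # incremental histogram update
        best["size"] += 1
        best["members"].append(idx)
    return clusters

docs = [
    {"/a", "/a/b", "/a/c"},   # two structurally similar documents
    {"/a", "/a/b"},
    {"/x", "/x/y"},           # a structurally different one
]
result = incremental_cluster(docs)
```

Because only per-cluster histograms are kept, a new document is placed with a single pass over the clusters rather than pairwise structure comparisons, which is the source of the processing-time gain the abstract reports.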

Text Extraction Algorithm using the HTML Logical Structure Analysis (HTML 논리적 구조분석을 통한 본문추출 알고리즘)

  • Jeon, Hyun-Gee;Koh, Chan
    • Journal of Digital Contents Society / v.16 no.3 / pp.445-455 / 2015
  • As internet and computer technology have developed, the amount of information has increased exponentially: a variety of web authoring tools, new web standards, and improved web accessibility allow a wide variety of web content to be produced very quickly. However, web documents on a given topic are divided into blocks, some of which are unrelated to the main content, such as navigation menus, simple decorations, advertisements, and copyright notices. To solve this problem and meet user requirements, this study extracts only the exact body area of a web document and investigates effective information extraction. As a reconstruction method, we then propose a web search system that can systematically manage and optimize documents.
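Block-based body extraction of the kind this abstract describes can be sketched with the standard-library HTML parser. The tag lists, the longest-block heuristic, and the class name are assumptions for illustration, not the paper's algorithm.

```python
from html.parser import HTMLParser

class BlockTextExtractor(HTMLParser):
    """Collects text per block-level element; text inside tags that
    usually hold navigation or decoration is skipped."""
    SKIP = {"script", "style", "nav", "header", "footer", "aside"}
    BLOCK = {"p", "div", "article", "section", "td", "li"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.blocks = [[]]  # one text buffer per block element seen

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag in self.BLOCK:
            self.blocks.append([])

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.blocks[-1].append(data.strip())

def extract_main_text(html):
    """Pick the block with the most text as the candidate article body."""
    parser = BlockTextExtractor()
    parser.feed(html)
    return max((" ".join(b) for b in parser.blocks), key=len, default="")

sample = ("<html><nav>menu links</nav><div><p>ad</p>"
          "<p>This is the long main body text of the page.</p></div></html>")
body = extract_main_text(sample)
```

The `SKIP` set drops navigation, decoration, and script content outright, while choosing the longest remaining block approximates "the exact area of the web document body".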

Knowledge Domain and Emerging Trends of Intelligent Green Building and Smart City - A Visual Analysis Using CiteSpace

  • Li, Hongyang;Dai, Mingjie
    • International conference on construction engineering and project management / 2017.10a / pp.24-31 / 2017
  • As the concept of sustainability becomes more and more popular, a large amount of literature has recently been published on intelligent green building and smart city (IGB&SC). It is therefore necessary to systematically analyse the existing knowledge structure as well as future developments of this domain through the identification of thematic trends, landmark articles, and typical keywords, together with co-operating researchers. In this paper, the CiteSpace software package is applied to analyse citation networks and other relevant data of the past eleven years (from 2006 to 2016) collected from the Web of Science (WOS). Through this, a series of professional document analyses are conducted, including the output of core authors, the influence of the most cited authors, keyword extraction and timezone analysis, hot research topics, and highly cited papers and trends with regard to co-citation analysis. As a result, the development track of the IGB&SC domain is revealed and visualized, and the following results are reached: (i) in the research area of IGB&SC, the most productive researcher is Winters JV, while Caragliu A is the most influential; (ii) different focuses of IGB&SC research have emerged continually from 2006 to 2016, e.g. smart growth, sustainability, smart city, and big data; (iii) Hollands's work is identified as the most cited, and the emerging trends revealed by the burst analysis of document co-citations can be summarized as smart growth and the assessment of intelligent green building and smart city.

Korean Summarization System using Automatic Paragraphing (단락 자동 구분을 이용한 문서 요약 시스템)

  • 김계성;이현주;이상조
    • Journal of KIISE:Software and Applications / v.30 no.7_8 / pp.681-686 / 2003
  • In this paper, we describe a system that extracts important sentences from Korean newspaper articles using automatic paragraphing. First, we detect words repeated between sentences. From the repeated words, the system computes the Closeness Degree between Sentences (CDS) using the degree of morphological agreement and the change of grammatical role. It then automatically divides the document into meaningful paragraphs, with the number of paragraphs defined by the user's need. Finally, it selects one representative sentence from each paragraph and generates a summary from the representative sentences. Although our system does not utilize features such as title, sentence position, or rhetorical structure, it is able to extract meaningful sentences to be included in the summary.
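The pipeline this abstract describes (closeness between adjacent sentences, cutting at the weakest links to form paragraphs, one representative sentence per paragraph) can be sketched as follows. The word-overlap closeness used here is a crude stand-in for the paper's morphological-agreement and grammatical-role score.

```python
def closeness(s1, s2):
    """Jaccard word overlap between two sentences; a stand-in for the
    paper's Closeness Degree between Sentences (CDS)."""
    w1, w2 = set(s1.split()), set(s2.split())
    return len(w1 & w2) / max(len(w1 | w2), 1)

def summarize(sentences, n_paragraphs):
    # closeness between each adjacent pair of sentences
    gaps = [closeness(a, b) for a, b in zip(sentences, sentences[1:])]
    # cut the document at the (n_paragraphs - 1) weakest links
    cuts = sorted(sorted(range(len(gaps)), key=lambda i: gaps[i])[:n_paragraphs - 1])
    paragraphs, start = [], 0
    for c in cuts:
        paragraphs.append(sentences[start:c + 1])
        start = c + 1
    paragraphs.append(sentences[start:])
    # representative sentence: the one closest to the rest of its paragraph
    summary = []
    for para in paragraphs:
        summary.append(max(para, key=lambda s: sum(closeness(s, t)
                                                   for t in para if t != s)))
    return summary

sents = ["the cat sat on the mat", "the cat slept",
         "stock markets rose today", "markets rose sharply"]
summary = summarize(sents, 2)
```

Because the paragraph count is a parameter, the user's requested summary length directly controls how many cuts (and thus representative sentences) are produced, matching the abstract's "number of paragraphs defined by the user's need".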

Condition assessment of fire affected reinforced concrete shear wall building - A case study

  • Mistri, Abhijit;Pa, Robin Davis;Sarkar, Pradip
    • Advances in concrete construction / v.4 no.2 / pp.89-105 / 2016
  • A post-fire investigation was conducted on a fire-affected reinforced concrete shear wall building to ascertain the level of strength degradation due to the fire incident. The fire took place in a three-storey building made of reinforced concrete shear walls and roof, with operating floors made of steel beams and chequered plates; the building is used to handle explosives. The elevated temperature during the fire is estimated at $350^{\circ}C$ based on visual inspection. Destructive (core extraction) and non-destructive (rebound hammer and ultrasonic pulse velocity) tests were conducted to evaluate the concrete strength, and X-ray diffraction (XRD) and Field Emission Scanning Electron Microscopy (FESEM) were used to analyze microstructural changes in the concrete due to the fire. Tests were conducted on the concrete walls and roof slab at both burnt and unburnt locations. The analysis of the test results reveals no significant degradation of the building after the fire, which signifies that the structure can be used with full expectancy of performance for its remaining service life. This document can be used as a reference for future forensic investigations of similar fire-affected concrete structures.

Academic Conference Categorization According to Subjects Using Topical Information Extraction from Conference Websites (학회 웹사이트의 토픽 정보추출을 이용한 주제에 따른 학회 자동분류 기법)

  • Lee, Sue Kyoung;Kim, Kwanho
    • The Journal of Society for e-Business Studies / v.22 no.2 / pp.61-77 / 2017
  • Recently, as the amount of academic conference information on the Internet has rapidly increased, the automatic classification of conference information according to research subject enables researchers to find related conferences efficiently. The information provided by most conference listing services is limited to title, date, location, and website URL; among these features, the only one containing topical words is the title, which causes an information insufficiency problem. We therefore propose methods that aim to resolve this problem by utilizing web contents. Specifically, the proposed methods extract the main contents from an HTML document collected via the website URL. Based on the similarity between the title of a conference and its main contents, topical keywords are selected to reinforce the important keywords among the main contents. Experiment results on a real-world dataset showed that the use of additional information extracted from conference websites successfully improves conference classification performance. We plan to further improve the accuracy of conference classification by considering the structure of websites.
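The title-driven keyword reinforcement described here can be sketched as a frequency ranking in which content words that also appear in the title are boosted. The boost factor and the bag-of-words scoring are assumptions, not the paper's actual similarity measure.

```python
def topical_keywords(title, page_text, top_k=5):
    """Rank content words by frequency, boosting words that also appear
    in the conference title (a crude stand-in for title-content
    similarity)."""
    title_words = set(title.lower().split())
    counts = {}
    for w in page_text.lower().split():
        counts[w] = counts.get(w, 0) + 1
    # assumed boost: a title word counts double
    scored = {w: c * (2.0 if w in title_words else 1.0)
              for w, c in counts.items()}
    return sorted(scored, key=scored.get, reverse=True)[:top_k]

kws = topical_keywords(
    "machine learning conference",
    "learning deep learning models and models and models venue hotel",
    top_k=2)
```

The boosted ranking pulls subject terms shared with the title above frequent but generic site words, which is the effect the abstract attributes to its title-content similarity step.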

Query-based Answer Extraction using Korean Dependency Parsing (의존 구문 분석을 이용한 질의 기반 정답 추출)

  • Lee, Dokyoung;Kim, Mintae;Kim, Wooju
    • Journal of Intelligence and Information Systems / v.25 no.3 / pp.161-177 / 2019
  • In this paper, we study improving the performance of answer extraction in a question-answering system by using sentence dependency parsing results. A question-answering (QA) system consists of query analysis, which analyzes the user's query, and answer extraction, which extracts appropriate answers from documents; various studies have been conducted on both. To improve the performance of answer extraction, it is necessary to accurately reflect the grammatical information of sentences. Because Korean word order is free and the omission of sentence components is frequent, dependency parsing is a good way to analyze Korean syntax. Therefore, in this study, we improved the performance of answer extraction by adding features generated by dependency parsing to the inputs of the answer extraction model (Bidirectional LSTM-CRF), and we compared the model's performance when given only basic word features, generated without dependency parsing, against its performance when an Eojeol tag feature and a dependency graph embedding feature were added. Since dependency parsing operates on the Eojeol, the basic sentence unit separated by spaces, the tag information of each Eojeol is obtained as a result of the parsing; the Eojeol tag feature is this tag information. Generating the dependency graph embedding consists of building the dependency graph from the parsing result and learning an embedding of the graph: each Eojeol becomes a node, each dependency between Eojeols an edge, and each Eojeol tag a node label. In this process, an undirected or a directed graph is generated according to whether the direction of the dependency relation is considered. To obtain the embedding of the graph, we used Graph2Vec, a method that finds the embedding of a graph from the subgraphs constituting it. The maximum path length between nodes can be specified when finding the subgraphs: if the maximum path length is 1, the graph embedding is generated only from direct dependencies between Eojeols, and indirect dependencies are included as the maximum path length grows. In the experiment, the maximum path length between nodes is varied from 1 to 3, with and without dependency direction, and answer extraction performance is measured. Experimental results show that both the Eojeol tag feature and the dependency graph embedding feature improve the performance of answer extraction. In particular, the highest performance was obtained when the direction of the dependency relation was considered and the dependency graph embedding was generated with a maximum path length of 1 in Graph2Vec's subgraph extraction step. From these experiments, we conclude that it is better to take the direction of dependency into account and to consider only direct connections rather than indirect dependencies between words. The significance of this study is as follows. First, we improved the performance of answer extraction by adding features based on dependency parsing results, taking into account the characteristics of Korean, with its free word order and frequent omission of sentence components. Second, we generated features from the dependency parsing result with a learning-based graph embedding method, without manually defining patterns of dependency between Eojeols. Future research directions are as follows. In this study, the features generated from dependency parsing are applied only to the answer extraction model. In the future, if the performance gain is confirmed by applying the features to other natural language processing models, such as sentiment analysis or named entity recognition, the validity of the features can be verified more accurately.
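The graph construction and maximum-path-length-1 subgraph labelling described in this abstract can be sketched as follows. The triple format and tag names are assumptions, and the single Weisfeiler-Lehman-style relabelling step is a stand-in for Graph2Vec's rooted-subgraph extraction, not its full embedding training.

```python
def dependency_graph(parse):
    """parse: list of (child_idx, head_idx, tag) triples from a
    dependency parser (format assumed). Returns a directed adjacency
    map (head -> dependents) and node labels (Eojeol tags)."""
    edges, labels = {}, {}
    for child, head, tag in parse:
        labels[child] = tag
        edges.setdefault(head, []).append(child)
    return edges, labels

def wl_relabel(edges, labels):
    """One relabelling step: each node's new label combines its own tag
    with the sorted tags of its direct dependents, i.e. the rooted
    subgraphs with maximum path length 1. Graph2Vec treats such labels
    as the 'subgraph words' of a graph when learning its embedding."""
    new = {}
    for node, tag in labels.items():
        neigh = sorted(labels.get(c, "?") for c in edges.get(node, []))
        new[node] = tag + "(" + ",".join(neigh) + ")"
    return new

# hypothetical 3-Eojeol sentence: two arguments depending on a predicate
parse = [(0, 2, "NP_SBJ"), (1, 2, "NP_OBJ"), (2, -1, "VP")]
edges, labels = dependency_graph(parse)
relabelled = wl_relabel(edges, labels)
```

Restricting the labels to direct dependents mirrors the paper's best setting (directed edges, maximum path length 1); longer path lengths would correspond to repeating the relabelling step so that indirect dependencies enter the labels.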

Web Site Keyword Selection Method by Considering Semantic Similarity Based on Word2Vec (Word2Vec 기반의 의미적 유사도를 고려한 웹사이트 키워드 선택 기법)

  • Lee, Donghun;Kim, Kwanho
    • The Journal of Society for e-Business Studies / v.23 no.2 / pp.83-96 / 2018
  • Extracting keywords that represent documents is very important because the keywords can be used for automated services such as document search, classification, and recommendation, as well as for quickly conveying document information. However, when keywords are extracted based on the frequency of words appearing in a web site's documents, or with graph algorithms based on word co-occurrence, the web page structure potentially contains various words unrelated to the topic, and the limited accuracy of Korean tokenizers makes it difficult to extract semantic keywords. In this paper, we propose a method that selects candidate keywords based on semantic similarity, addressing both the failure to extract semantic keywords and the poor accuracy of Korean tokenizer analysis. Finally, we extract the final semantic keywords through a filtering process that removes inconsistent keywords. Experimental results on real web pages of small businesses show that the performance of the proposed method improves by 34.52% over a statistical-similarity-based keyword selection technique. This confirms that considering the semantic similarity between words and removing inconsistent keywords improves keyword extraction from documents.
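The semantic filtering step this abstract describes can be sketched as cosine-similarity filtering against the centroid of the candidates' word vectors. The toy 2-dimensional vectors and the threshold are assumptions; in practice the vectors would come from a trained Word2Vec model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def select_keywords(candidates, embeddings, threshold=0.6):
    """Keep candidates semantically close to the centroid of all
    candidate vectors; words with no vector or low similarity (the
    'inconsistent' keywords) are filtered out."""
    vecs = [embeddings[w] for w in candidates if w in embeddings]
    if not vecs:
        return []
    dim = len(vecs[0])
    centroid = [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]
    return [w for w in candidates
            if w in embeddings and cosine(embeddings[w], centroid) >= threshold]

# toy embeddings standing in for Word2Vec vectors of a bakery's web page
emb = {"bakery": [1.0, 0.0], "bread": [0.9, 0.1], "login": [0.0, 1.0]}
keywords = select_keywords(["bakery", "bread", "login", "sale"], emb)
```

The centroid acts as a rough topic vector for the page, so structural words such as "login" that sit far from it in embedding space are dropped even if they are frequent.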

Keyword Network Visualization for Text Summarization and Comparative Analysis (문서 요약 및 비교분석을 위한 주제어 네트워크 가시화)

  • Kim, Kyeong-rim;Lee, Da-yeong;Cho, Hwan-Gue
    • Journal of KIISE / v.44 no.2 / pp.139-147 / 2017
  • Most of the information in the Internet space consists of text, so one of the main topics in the large-scale document analysis required in the "big data" era is the development of automated understanding systems for textual data; accordingly, automating keyword extraction for text summarization and abstraction is a typical research problem. But simply listing a few keywords is insufficient to reveal the complex semantic structure of general texts. In this paper, a text-visualization method is developed that constructs a graph by computing the relatedness degrees among the selected keywords of the target text. Two construction models providing the edge relation are proposed for computing the degree of relation among keywords: the influence-interval model and the word-distance model. The graph visualized from the keyword-derived edge relation is more flexible and useful for displaying the meaning structure of the target text, and this abstract graph enables a fast and easy understanding of the text. The authors' experiment showed that the proposed abstract-graph model is superior to a keyword list for attaining a semantic and comparative understanding of text.
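A word-distance edge model of the kind this abstract proposes can be sketched as follows: two keywords are linked whenever they occur within a fixed token window of each other, and the number of such co-occurrences becomes the edge weight. The window size and the hit-count weighting are assumptions, not the paper's exact formulation.

```python
def keyword_edges(tokens, keywords, window=5):
    """Word-distance edge model: connect two keywords whenever their
    occurrences lie within `window` tokens of each other; the edge
    weight is the number of such close pairs."""
    positions = {k: [i for i, t in enumerate(tokens) if t == k]
                 for k in keywords}
    edges = {}
    kws = sorted(keywords)
    for i, a in enumerate(kws):
        for b in kws[i + 1:]:
            hits = sum(1 for pa in positions[a] for pb in positions[b]
                       if abs(pa - pb) <= window)
            if hits:
                edges[(a, b)] = hits
    return edges

tokens = "graph model builds graph from text graph model".split()
edges = keyword_edges(tokens, {"graph", "model"}, window=5)
```

The resulting weighted edge list is exactly what a graph layout needs to draw the keyword network; the influence-interval model would differ mainly in letting each keyword occurrence project an interval of influence rather than a symmetric window.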