• Title/Summary/Keyword: Documents Generation


A Knowledge-based Wrapper Learning Agent for Semi-Structured Information Sources (준구조화된 정보소스에 대한 지식기반의 Wrapper 학습 에이전트)

  • Seo, Hee-Kyoung; Yang, Jae-Young; Choi, Joong-Min
    • Journal of KIISE: Software and Applications / v.29 no.1_2 / pp.42-52 / 2002
  • Information extraction (IE) is the process of recognizing and fetching particular information fragments from a document. In previous work, most IE systems generated their extraction rules, called wrappers, manually; although manual wrapper generation may achieve more accurate extraction, it suffers from problems of flexibility, extensibility, and efficiency. Other studies that generate wrappers automatically have difficulty acquiring and representing useful domain knowledge and coping with the structural heterogeneity among different information sources, so real-world information sources with complex document structures cannot be analyzed correctly. To resolve these problems, this paper presents an agent-based information extraction system named XTROS that exploits domain knowledge to learn from documents in a semi-structured information source. The system automatically generates a wrapper for each information source and performs information extraction and integration by applying this wrapper to the corresponding source. In XTROS, both the domain knowledge and the wrapper are represented as XML documents. The wrapper-generation algorithm first recognizes the meaning of each logical line of a sample document by using the domain knowledge, and then finds the most frequent pattern in the sequence of semantic representations of the logical lines. The location and structure of this pattern, represented as an XML document, become the wrapper. Tests of XTROS on several real-estate information sites show that it creates correct wrappers for most Web sources and consequently facilitates effective information extraction and integration for heterogeneous and complex information sources.
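The pattern-finding step above, locating the most frequent pattern in the sequence of semantic line labels, can be sketched as follows. This is a minimal illustration with hypothetical labels, not XTROS's actual algorithm:

```python
from collections import Counter

def most_frequent_pattern(labels, min_len=2, max_len=6):
    """Return the most frequent contiguous label pattern.

    Counts every contiguous label n-gram and ranks candidates by
    frequency first, then by length, mirroring the idea of recovering
    a repeating record structure from labeled logical lines.
    """
    counts = Counter()
    for n in range(min_len, max_len + 1):
        for i in range(len(labels) - n + 1):
            counts[tuple(labels[i:i + n])] += 1
    return max(counts, key=lambda p: (counts[p], len(p)))

# hypothetical semantic labels for the logical lines of a sample page
lines = ["PRICE", "BEDS", "ADDRESS", "PRICE", "BEDS", "ADDRESS", "PRICE", "BEDS"]
print(most_frequent_pattern(lines))  # → ('PRICE', 'BEDS')
```

A real wrapper generator would also record where the pattern occurs and serialize it (here, as XML) so it can be replayed against new pages from the same source.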

Subject-Balanced Intelligent Text Summarization Scheme (주제 균형 지능형 텍스트 요약 기법)

  • Yun, Yeoil; Ko, Eunjung; Kim, Namgyu
    • Journal of Intelligence and Information Systems / v.25 no.2 / pp.141-166 / 2019
  • Recently, channels such as social media and SNS generate enormous amounts of data, and the portion represented as unstructured text has grown geometrically. Because it is impractical to read all of this text, it is important to access it rapidly and grasp its key points. To meet this need for efficient understanding, many studies on text summarization for handling huge volumes of text have been proposed. In particular, many recent methods use machine learning and artificial intelligence algorithms to generate summaries objectively and effectively, so-called "automatic summarization". However, most text summarization methods proposed to date build the summary around the most frequent content in the original documents. Such summaries tend to omit minor subjects that are mentioned less often; a summary covering only the major subjects is biased and loses information, making it hard to ascertain every subject the documents contain. This bias can be reduced by summarizing with a balance among the document's topics, but an unbalanced distribution among subjects still remains. To retain subject balance in the summary, it is necessary to consider the proportion of each subject in the original documents and to allocate the summary's portions equally, so that even sentences on minor subjects are sufficiently included. In this study, we propose a "subject-balanced" text summarization method that preserves balance among all subjects and minimizes the omission of low-frequency subjects. For subject-balanced summaries, we use two summary evaluation criteria, "completeness" and "succinctness": completeness means the summary should fully cover the contents of the original documents, and succinctness means the summary should contain minimal internal duplication. The proposed method has three phases. The first phase constructs subject-term dictionaries. Topic modeling is used to calculate topic-term weights, which indicate how strongly each term is related to each topic. From these weights, highly related terms for each topic can be identified, and the subjects of the documents emerge from topics composed of terms with similar meanings. A few terms that represent each subject well, called "seed terms", are then selected. Because the seed terms alone are too few to characterize each subject fully, additional terms similar to them are needed for a well-constructed subject dictionary. Word2Vec is used for this word expansion: after training, the similarity between any two terms can be derived from their word vectors using cosine similarity, where a higher cosine similarity indicates a stronger relationship. Terms with high similarity to the seed terms of each subject are selected and filtered, completing the subject dictionary. The second phase allocates a subject to every sentence in the original documents. To grasp the content of each sentence, frequency analysis is conducted with the terms in the subject dictionaries, and a TF-IDF weight for each subject is calculated, indicating how much a sentence discusses that subject. Because TF-IDF weights can grow without bound, the weights for each sentence are normalized to values between 0 and 1. Each sentence is then assigned the subject with the maximum TF-IDF weight, yielding a sentence group for each subject. The last phase is summary generation. Sen2Vec is used to measure the similarity between the sentences of a subject, forming a similarity matrix, and sentences are selected iteratively to generate a summary that fully covers the original documents while minimizing duplication within the summary itself. For evaluation, 50,000 TripAdvisor reviews were used to construct the subject dictionaries and 23,087 reviews were used to generate summaries. A comparison between the proposed method's summaries and frequency-based summaries verified that the proposed method better retains the balance of subjects that the documents originally have.
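The Word2Vec-based dictionary expansion described above amounts to keeping every term whose vector is cosine-close to a seed term. A minimal sketch with toy two-dimensional vectors; the terms, vectors, and threshold are illustrative, not from the paper:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def expand_seed_terms(seeds, vectors, threshold=0.6):
    """Keep every term whose vector is cosine-close to any seed term."""
    expanded = set(seeds)
    for term, vec in vectors.items():
        if term not in expanded and any(
                cosine(vec, vectors[s]) >= threshold for s in seeds if s in vectors):
            expanded.add(term)
    return expanded

# toy 2-D "embeddings"; a real dictionary would use trained Word2Vec vectors
vectors = {"clean": [1.0, 0.1], "spotless": [0.9, 0.2], "noisy": [0.0, 1.0]}
print(sorted(expand_seed_terms(["clean"], vectors)))  # → ['clean', 'spotless']
```

In the paper's pipeline the expanded set would then be filtered before becoming a subject dictionary; the threshold controls how aggressively the dictionary grows.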

Collection and Utilization of Unstructured Environmental Disaster by Using Disaster Information Standardization (재난정보 표준화를 통한 환경 재난정보 수집 및 활용)

  • Lee, Dong Seop; Kim, Byung Sik
    • Ecology and Resilient Infrastructure / v.6 no.4 / pp.236-242 / 2019
  • In this study, we developed a system that collects and stores environmental disaster data in a database and uses it for environmental disaster management by converting structured and unstructured documents, such as images, into electronic documents. With the 4th Industrial Revolution, various intelligent technologies have been developed in many fields. Environmental disaster information is an important element of the disaster cycle, and environmental disaster information management refers to managing and processing electronic data across that cycle. However, such information is mainly managed in the form of structured and unstructured reports, so the unstructured data must also be managed as disaster information. In this paper, an intelligent document-generation approach is used to convert handouts into electronic documents. The converted disaster data are then organized under a disaster code system as disaster information and stored in a disaster database system. These converted structured data are managed in a standardized disaster-information form connected with the disaster code system, which covers storage and retrieval of structured information across the entire disaster cycle. The expected effect of this research is that it can be applied to smart environmental disaster management and decision making by combining artificial intelligence technologies with historical big data.

A Search-Result Clustering Method based on Word Clustering for Effective Browsing of the Paper Retrieval Results (논문 검색 결과의 효과적인 브라우징을 위한 단어 군집화 기반의 결과 내 군집화 기법)

  • Bae, Kyoung-Man; Hwang, Jae-Won; Ko, Young-Joong; Kim, Jong-Hoon
    • Journal of KIISE: Software and Applications / v.37 no.3 / pp.214-221 / 2010
  • The search-results clustering problem is defined as the automatic, online grouping of similar documents in the results returned from a search engine. In this paper, we propose a new search-results clustering algorithm specialized for a paper search service. Our system consists of two algorithmic phases: the Category Hierarchy Generation System (CHGS) and the Paper Clustering System (PCS). In CHGS, we first build the category hierarchy for each research field, called the Field Thesaurus, from an existing research category hierarchy (KOSEF's), and then expand the field thesaurus's keywords with a word clustering method based on the K-means algorithm. In PCS, the proposed algorithm determines the category of each paper using top-down and bottom-up methods. The proposed system can be used in retrieval services for specialized fields, such as a paper search service.
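The keyword-expansion step pairs the field thesaurus with K-means word clustering. Below is a self-contained sketch of K-means over toy word vectors; the words, vectors, and k are illustrative, and a real system would cluster high-dimensional embeddings:

```python
import random

def kmeans(vectors, k, iters=20, seed=0):
    """Minimal K-means over vectors (lists of floats); returns cluster labels."""
    rng = random.Random(seed)
    centers = [list(v) for v in rng.sample(vectors, k)]
    labels = [0] * len(vectors)
    for _ in range(iters):
        # assignment step: nearest center by squared Euclidean distance
        for i, v in enumerate(vectors):
            labels[i] = min(range(k),
                            key=lambda c: sum((a - b) ** 2
                                              for a, b in zip(v, centers[c])))
        # update step: move each center to the mean of its members
        for j in range(k):
            members = [vectors[i] for i in range(len(vectors)) if labels[i] == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# toy word vectors forming two well-separated groups
words = ["retrieval", "search", "protein", "genome"]
vecs = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
labels = kmeans(vecs, 2)
clusters = {}
for w, l in zip(words, labels):
    clusters.setdefault(l, []).append(w)
print(clusters)
```

Words landing in the same cluster would be treated as related keywords when expanding a thesaurus node.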

Conceptual Transformation for Code Generation from SDL-92 to Object-oriented Languages (SDL-92에서 객체지향 언어의 코드 생성을 위한 개념 변환)

  • Lee, Si-Young; Lee, Dong-Gill; Lee, Joon-Kyung; Kim, Sung-Ho
    • Journal of KIISE: Software and Applications / v.27 no.5 / pp.473-487 / 2000
  • SDL-92, a language for system specification and description, retained its process- and signal-based communication model when it adopted object-oriented concepts, in order to remain compatible with existing specification documents and their users. This causes problems when automatically generating programs in object-oriented languages based on objects and methods: some concepts have no direct counterpart, and side effects arise in areas such as visibility and the communication method. In this paper, we present a general object-oriented language model based on objects and methods, study the problems in transforming SDL-92 into the proposed model, and propose conceptual transformation methods to solve them. The proposed transformation can exploit the parallelism built into objects and guarantees compiler-level portability of the translated program by translating into the syntax of the target language.
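One way to bridge the conceptual gap described above is to map each SDL process to an object whose incoming signals are queued and dispatched to handler methods, rather than turned into direct synchronous calls. The sketch below is an illustrative mapping, not the paper's actual transformation rules; the `Caller` process and `connect` signal are hypothetical:

```python
import queue

class SDLProcess:
    """Sketch: an SDL-style process mapped onto an object.

    Incoming signals are queued and dispatched to handler methods,
    preserving SDL's asynchronous signal semantics.
    """
    def __init__(self):
        self.inbox = queue.Queue()

    def send(self, signal, *args):   # another process "outputs" a signal
        self.inbox.put((signal, args))

    def step(self):                  # consume and dispatch one queued signal
        signal, args = self.inbox.get()
        handler = getattr(self, "on_" + signal, None)
        if handler is not None:
            handler(*args)

class Caller(SDLProcess):            # hypothetical SDL process type
    def __init__(self):
        super().__init__()
        self.state = "idle"

    def on_connect(self):            # transition triggered by signal 'connect'
        self.state = "connected"

p = Caller()
p.send("connect")
p.step()
print(p.state)  # → connected
```

Keeping the queue explicit is what preserves the built-in parallelism of SDL processes when each one runs in its own thread or task.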

The Biometric Signature Delegation Method with Undeniable Property (부인봉쇄 성질을 갖는 바이오메트릭 서명 위임 기법)

  • Yun, Sunghyun
    • Journal of Digital Convergence / v.12 no.1 / pp.389-395 / 2014
  • In a biometric signature scheme, a user's biometric key is used to sign a document, and the user must be authenticated with a biometric recognition method before signing. Because biometric recognition is launched every time a signature session starts, such schemes are not suitable for electronic commerce applications, such as shopping malls, where a large number of documents must be signed. Therefore, to commercialize biometric signature schemes, a new proxy signature scheme is needed to ease the signer's burden. In a proxy signature scheme, the signer can delegate signing activities to trusted third parties. In this study, a biometric signature delegation method is proposed. The proposed scheme is suitable for applications that require a large amount of signing, and it consists of biometric key generation, PKI-based mutual authentication, and signature generation and verification protocols.

Development of Response Spectrum Generation Program for Seismic Analysis of the Nuclear Equipment (원자력기기 내진해석응답스펙트럼 생성프로그램 개발)

  • Byun, Hoon-Seok; Kim, Yu-Chull; Lee, Joon-Keun
    • Proceedings of the Korean Society for Noise and Vibration Engineering Conference / 2004.11a / pp.755-762 / 2004
  • In Korea, when an individual component of equipment in a nuclear power plant must be replaced, it is very difficult to establish individual criteria, i.e., the Required Response Spectra (RRS) for seismic testing/analysis of the component, because the existing qualification documents contain no Test Response Spectra (TRS) for the component to be replaced. In this case, a structural analysis of the nuclear equipment, including the components to be replaced, is required; the Analysis Response Spectra (ARS) at the location of the component are then generated and used for its seismic testing. As of today, however, no authorized standard program for generating response spectra from structural analysis exists in Korea. For this reason, the STAR-Egs computer program was developed. It directly calculates the expected response spectrum (frequency vs. acceleration) at selected points of the nuclear equipment from an input spectrum (the RRS), based on the dynamic characteristics of a finite element (FE) model equivalent to the equipment. STAR-Egs controls the ANSYS/I-DEAS commercial software, automatically extracts the modal parameters of the FE model, and calculates the response spectrum with an established algorithm based on those parameters. The reliability of STAR-Egs was verified by comparing its output with results from MATLAB based on the identical algorithm. Moreover, actual seismic testing was performed per IEEE 344-1987 to verify the program against the FE analysis results.
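Combining modal parameters with an input spectrum to estimate a point response is conventionally done with a modal combination rule such as SRSS (square root of the sum of squares). The sketch below illustrates that standard rule with hypothetical modal parameters; it is not the specific algorithm implemented in STAR-Egs:

```python
import math

def srss_peak_acceleration(modes, spectrum):
    """SRSS combination of modal peak responses at one point.

    modes:    list of (natural_frequency_hz, participation_factor,
              mode_shape_value_at_point) tuples
    spectrum: function mapping frequency (Hz) -> spectral acceleration
    """
    return math.sqrt(sum((gamma * phi * spectrum(f)) ** 2
                         for f, gamma, phi in modes))

# hypothetical modal parameters and a flat 2.0 g input spectrum
modes = [(5.0, 1.0, 1.0), (12.0, 0.5, 0.4)]
peak = srss_peak_acceleration(modes, lambda f: 2.0)
print(round(peak, 3))  # → 2.04
```

A production tool would also interpolate the RRS between tabulated frequencies, handle closely spaced modes (e.g. with CQC instead of SRSS), and apply peak broadening before issuing a test spectrum.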

Design of a Data Model for the Rainfall-Runoff Simulation Based on Spatial Database (공간DB 기반의 강우-유출 모의를 위한 데이터 모델 설계)

  • Kim, Ki-Uk; Kim, Chang-Soo
    • Journal of the Korean Association of Geographic Information Studies / v.13 no.4 / pp.1-11 / 2010
  • This study proposed a method for generating SWMM data connected with a spatial database and designed a data model to display flooding information such as the runoff sewer system, flooded areas, and flood depth. A variety of data, including UIS data, disaster-related documents, and rainfall data, are used to generate the attributes of the flooding analysis areas. The spatial data are constructed with ArcSDE and an Oracle database. A prototype system was also developed to display the runoff areas on a GIS using ArcGIS ArcObjects and the spatial database. The results will be applied to flooding analysis based on the SWMM.

Heuristic-based Korean Coreference Resolution for Information Extraction

  • Chung, Euisok; Lim, Soojong; Yun, Bo-Hyun
    • Proceedings of the Korean Society for Language and Information Conference / 2002.02a / pp.50-58 / 2002
  • Information extraction delimits in advance, as part of the task specification, the semantic range of the output and filters information from large volumes of text. The most representative words of a document are named entities and pronouns, so resolving coreference is important for extracting meaningful information. Coreference resolution finds the named entities in a document that co-refer to the same real-world entities, and its results are used for named entity detection and template generation. This paper presents a heuristic-based approach to coreference resolution in Korean. We constructed heuristics expanded gradually using a corpus and derived the salience factors of antecedents as an importance measure for Korean. Our approach consists of antecedent selection and antecedent weighting, using three kinds of salience factors to weight each candidate antecedent of an anaphor. The experimental results show 80% precision.
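Salience-factor weighting of candidate antecedents can be sketched as a weighted score per candidate, with the anaphor resolved to the highest-scoring one. The factor names and weights below are hypothetical, not the paper's actual factors:

```python
# hypothetical salience factors and weights (not the paper's actual values)
SALIENCE_WEIGHTS = {
    "recency": 40,       # nearer antecedents score higher
    "subject": 30,       # sentence subjects are preferred antecedents
    "exact_match": 50,   # surface-form match with the anaphor
}

def resolve(candidates):
    """Pick the candidate antecedent with the highest summed salience score."""
    def score(cand):
        return sum(w for factor, w in SALIENCE_WEIGHTS.items()
                   if cand["factors"].get(factor))
    return max(candidates, key=score)

candidates = [
    {"entity": "Samsung", "factors": {"recency": True, "subject": True}},  # 70
    {"entity": "LG",      "factors": {"exact_match": True}},               # 50
]
print(resolve(candidates)["entity"])  # → Samsung
```

In practice the candidate set is first filtered by selection constraints (gender, number, syntactic compatibility) before the weighting stage runs.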

Automatic Generation of Explanatory 2D Vector Drawing from 3D CAD Data for Technical Documents (기술문서 작성을 위한 3 차원 CAD 데이터의 도해저작 알고리즘)

  • Shim, H.S.; Yang, S.W.; Choi, Y.; Cho, S.W.
    • Proceedings of the Korean Society of Precision Engineering Conference / 2005.06a / pp.177-180 / 2005
  • Three-dimensional shaded images are the standard way to visualize CAD models on a computer screen, so much of the effort in CAD visualization has focused on displaying models conveniently and realistically. However, shaded 3D CAD images captured from the screen may not be suitable for some applications. A technical document, whether on paper or in electronic form, can describe a shape and annotate parts of the model more clearly by using a projected 2D line drawing viewed from a user-defined direction. This paper describes an efficient method for generating such 2D line-drawing data in vector format. The algorithm is composed of silhouette line detection, hidden line removal, and cleaning processes.
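Silhouette line detection, the first step of the algorithm, classically marks an edge as a silhouette when one of its two adjacent faces is front-facing and the other back-facing for the view direction. A minimal sketch of that test (the face normals below are illustrative):

```python
def dot(u, v):
    """Dot product of two vectors."""
    return sum(a * b for a, b in zip(u, v))

def is_silhouette_edge(n1, n2, view_dir):
    """An edge shared by two faces lies on the silhouette when one face
    is front-facing and the other back-facing for the view direction,
    i.e. the signs of the two normal-view dot products differ."""
    return dot(n1, view_dir) * dot(n2, view_dir) < 0

# edge between a front-facing and a side face of a box, viewed obliquely
print(is_silhouette_edge((0, 0, 1), (1, 0, 0), (0.5, 0, -1)))  # → True
# edge between the top face and a side face, seen from straight above
print(is_silhouette_edge((0, 0, 1), (0, 1, 0), (0, 0, 1)))     # → False
```

The detected silhouette edges, together with sharp feature edges, form the candidate line set that the hidden-line-removal and cleaning stages then prune.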