• Title/Summary/Keyword: Dataset construction

Automated Data Extraction from Unstructured Geotechnical Report based on AI and Text-mining Techniques (AI 및 텍스트 마이닝 기법을 활용한 지반조사보고서 데이터 추출 자동화)

  • Park, Jimin;Seo, Wanhyuk;Seo, Dong-Hee;Yun, Tae-Sup
    • Journal of the Korean Geotechnical Society / v.40 no.4 / pp.69-79 / 2024
  • Field geotechnical data are obtained from various field and laboratory tests and are documented in geotechnical investigation reports. For efficient design and construction, digitizing these geotechnical parameters is essential. However, current practices involve manual data entry, which is time-consuming, labor-intensive, and prone to errors. Thus, this study proposes an automatic data extraction method from geotechnical investigation reports using image-based deep learning models and text-mining techniques. A deep-learning-based page classification model and a text-searching algorithm were employed to classify geotechnical investigation report pages with 100% accuracy. Computer vision algorithms were utilized to identify valid data regions within report pages, and text analysis was used to match and extract the corresponding geotechnical data. The proposed model was validated using a dataset of 205 geotechnical investigation reports, achieving an average data extraction accuracy of 93.0%. Finally, a user-interface-based program was developed to enhance the practical application of the extraction model. It allowed users to upload PDF files of geotechnical investigation reports, automatically analyze these reports, and extract and edit data. This approach is expected to improve the efficiency and accuracy of digitizing geotechnical investigation reports and building geotechnical databases.
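The text-searching half of the page classification step can be illustrated with a minimal sketch. The abstract does not list the page types or search terms, so the labels and keyword patterns below (`PAGE_KEYWORDS`, `classify_page`) are hypothetical, and the deep-learning classifier is not reproduced here.

```python
import re

# Hypothetical page types and keyword patterns; the paper does not
# publish its actual search terms, so these are placeholders.
PAGE_KEYWORDS = {
    "boring_log": [r"boring\s+log", r"\bSPT\b"],
    "lab_test": [r"laboratory\s+test", r"unit\s+weight"],
}


def classify_page(text):
    """Return the page type with the most keyword hits, or 'other' if none match."""
    scores = {
        label: sum(len(re.findall(p, text, flags=re.IGNORECASE)) for p in patterns)
        for label, patterns in PAGE_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "other"
```

In practice the input text would come from a PDF text layer or OCR, and a keyword pass like this would likely serve as a cross-check on the image-based model rather than replace it.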

Performance analysis of Frequent Itemset Mining Technique based on Transaction Weight Constraints (트랜잭션 가중치 기반의 빈발 아이템셋 마이닝 기법의 성능분석)

  • Yun, Unil;Pyun, Gwangbum
    • Journal of Internet Computing and Services / v.16 no.1 / pp.67-74 / 2015
  • In recent years, frequent itemset mining that considers the importance of each item has been intensively studied as one of the important issues in the data mining field. Depending on how item importance is used, these approaches are classified as weighted frequent itemset mining, frequent itemset mining using transactional weights, and utility itemset mining. In this paper, we perform an empirical analysis of frequent itemset mining algorithms based on transactional weights. These algorithms compute transactional weights from the weight of each item in large databases, and they discover weighted frequent itemsets on the basis of the item frequency and the weight of each transaction. Consequently, database analysis reveals the importance of a given transaction, because a transaction's weight is higher if it contains many items with high weight values. We analyze the advantages and disadvantages of the best-known algorithms in this field and compare their performance. As a representative of frequent itemset mining using transactional weights, WIS introduces the concept and strategies of transactional weights. In addition, there are several state-of-the-art algorithms, WIT-FWIs, WIT-FWIs-MODIFY, and WIT-FWIs-DIFF, for extracting itemsets with the weight information. To mine weighted frequent itemsets efficiently, these three algorithms use a special lattice-like data structure called the WIT-tree. The algorithms do not need an additional database scan after the WIT-tree has been constructed, since each node of the WIT-tree holds item information such as item and transaction IDs.
In particular, traditional algorithms scan the database many times to mine weighted itemsets, whereas the WIT-tree-based algorithms avoid this overhead by reading the database only once. Additionally, the algorithms generate each new itemset of length N+1 from two different itemsets of length N. To discover new weighted itemsets, WIT-FWIs combines itemsets by using the information of the transactions that contain all the itemsets. WIT-FWIs-MODIFY reduces the operations needed to calculate the frequency of the new itemset. WIT-FWIs-DIFF uses a technique based on the difference of two itemsets. To compare and analyze the performance of the algorithms in various environments, we use two types of real datasets (i.e., dense and sparse) and measure runtime and maximum memory usage. Moreover, a scalability test is conducted to evaluate the stability of each algorithm as the size of the database changes. As a result, WIT-FWIs and WIT-FWIs-MODIFY show the best performance on the dense dataset, and on the sparse dataset WIT-FWIs-DIFF mines more efficiently than the other algorithms. Compared to the algorithms using the WIT-tree, WIS, which is based on the Apriori technique, has the worst efficiency because on average it requires more computations than the others.
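The ideas in this abstract can be sketched on toy data: transaction weights derived from item weights, a single database pass that records transaction IDs per item, and the joining of two itemsets by intersection or by difference. This is an illustration under stated assumptions, not the published algorithms. It assumes a transaction's weight is the mean of its item weights, which the abstract does not specify, and the difference-based variant is a generic diffset formulation. All names and data values are hypothetical.

```python
# Hypothetical toy data: item weights and a small transaction database.
ITEM_WEIGHTS = {"A": 0.6, "B": 0.1, "C": 0.9, "D": 0.3}
TRANSACTIONS = {
    1: {"A", "B", "C"},
    2: {"A", "C"},
    3: {"B", "D"},
    4: {"A", "C", "D"},
}


def transaction_weight(items, weights):
    # Assumption: a transaction's weight is the mean weight of its items.
    return sum(weights[i] for i in items) / len(items)


def build_tidsets(transactions):
    # One pass over the database: map each item to the IDs of the
    # transactions that contain it, so no further scan is needed.
    tidsets = {}
    for tid, items in transactions.items():
        for item in items:
            tidsets.setdefault(item, set()).add(tid)
    return tidsets


def weighted_support(tidset, tw):
    # Share of the total transaction weight carried by the given transactions.
    return sum(tw[t] for t in tidset) / sum(tw.values())


TW = {tid: transaction_weight(items, ITEM_WEIGHTS)
      for tid, items in TRANSACTIONS.items()}
TIDSETS = build_tidsets(TRANSACTIONS)

# Join two 1-itemsets into a 2-itemset by intersecting their tidsets.
tids_ad = TIDSETS["A"] & TIDSETS["D"]
ws_ad = weighted_support(tids_ad, TW)

# Difference-based alternative: keep only the transactions that contain
# A but not D, and subtract their weight share from the support of A.
diff_ad = TIDSETS["A"] - TIDSETS["D"]
ws_ad_from_diff = weighted_support(TIDSETS["A"], TW) - weighted_support(diff_ad, TW)
```

Both routes give the same weighted support for {A, D}. The difference set tends to be smaller when the two itemsets share most of their transactions, which happens more often in dense data.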

Ecological Health Assessments on Turbidwater in the Downstream After a Construction of Yongdam Dam (용담댐 건설후 하류부 하천 생태계의 탁수영향 평가)

  • Kim, Ja-Hyun;Seo, Jin-Won;Na, Young-Eun;An, Kwang-Guk
    • Korean Journal of Ecology and Environment / v.40 no.1 / pp.130-142 / 2007
  • This study examined the impacts of turbid water on the fish community downstream of Yongdam Dam from June to October 2006. We selected six sampling sites in the field: two sites were controls with no influence of turbid water from the dam, and the remaining four sites were stations for assessing potential turbidity effects. We evaluated integrative health conditions by applying various models, such as a necropsy-based fish health assessment model (FHA), the Index of Biological Integrity (IBI) using fish assemblages, and the Qualitative Habitat Evaluation Index (QHEI). Laboratory tests exposing fish to 400 NTU were performed to determine the impact of turbid water using a scanning electron microscope (SEM). Results showed that fine solid particles clogged the gills in the treatments, whereas no particles were found in the control. These results indicate that when inorganic turbidity increases abruptly, fish may suffer mechanical abrasion or respiratory blockage. The stream health condition, based on the IBI values, ranged between 38 and 48 (average: 42), indicating an "excellent" or "good" condition according to the criteria of the US EPA (1993). Meanwhile, the physical habitat condition, based on the QHEI, ranged from 97 to 187 (average: 154), indicating a "suboptimal condition". These biological outcomes were compared with the chemical dataset: IBI values were more strongly correlated with QHEI (r=0.526, p<0.05, n=18) than with chemical water quality based on turbidity (r=0.260, p>0.05, n=18). Analysis of the FHA showed that individual health indicated an "excellent condition", while the QHEI showed no disturbance of habitat (especially bottom substrate and embeddedness), food web, or spawning places. Consequently, we concluded that the ecological health downstream of Yongdam Dam was not impacted by the turbid water.
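The two reported correlations can be checked for consistency with their stated significance levels. The sketch below assumes the p-values come from the standard two-tailed t-test for a Pearson correlation, which the abstract does not state. The critical value is the tabulated two-tailed 5% point of Student's t with n - 2 = 16 degrees of freedom.

```python
import math


def correlation_t_statistic(r, n):
    """t statistic for testing zero correlation, given Pearson r and sample size n."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)


# Two-tailed 5% critical value of Student's t with 16 degrees of freedom
# (taken from standard tables).
T_CRIT_DF16 = 2.120

t_qhei = correlation_t_statistic(0.526, 18)       # IBI vs. QHEI
t_turbidity = correlation_t_statistic(0.260, 18)  # IBI vs. turbidity

qhei_significant = abs(t_qhei) > T_CRIT_DF16
turbidity_significant = abs(t_turbidity) > T_CRIT_DF16
```

Under this assumption the IBI-QHEI correlation exceeds the critical value and the IBI-turbidity correlation does not, matching the reported p<0.05 and p>0.05.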