• Title/Summary/Keyword: Standard Dataset


Standard-based Integration of Heterogeneous Large-scale DNA Microarray Data for Improving Reusability

  • Jung, Yong;Seo, Hwa-Jeong;Park, Yu-Rang;Kim, Ji-Hun;Bien, Sang Jay;Kim, Ju-Han
    • Genomics & Informatics
    • /
    • v.9 no.1
    • /
    • pp.19-27
    • /
    • 2011
  • Gene Expression Omnibus (GEO) holds the largest collection of gene-expression microarray data, and it has grown exponentially. Microarray data in GEO have been generated in many different formats and often lack standardized annotation and documentation, so it is hard to know whether, and in what way, preprocessing has been applied to a dataset. Standard-based integration of heterogeneous data formats and metadata is necessary for comprehensive data query, analysis and mining. We attempted to integrate the heterogeneous microarray data in GEO based on the Minimum Information About a Microarray Experiment (MIAME) standard. We unified the data fields of the GEO Data table and mapped the attributes of GEO metadata onto MIAME elements. We also discriminated non-preprocessed raw datasets from processed ones by using a two-step classification method. Most of the procedures were developed as semi-automated algorithms incorporating text-mining techniques. We localized 2,967 Platforms, 4,867 Series and 103,590 Samples covering 279 organisms, integrated them into a standard-based relational schema, and developed a comprehensive query interface to extract them. Our tool, GEOQuest, is available at http://www.snubi.org/software/GEOQuest/.
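The attribute mapping described above can be sketched as a simple lookup table; the GEO keys and MIAME element names below are illustrative assumptions, not the actual GEOQuest schema.

```python
# Hypothetical mapping of GEO Sample attributes onto MIAME elements.
# Both the GEO keys and the MIAME element names are illustrative only.
GEO_TO_MIAME = {
    "organism_ch1": "experiment_design.organism",
    "source_name_ch1": "sample.source",
    "extract_protocol_ch1": "sample.extract_protocol",
    "label_ch1": "labeling.label",
    "data_processing": "data_processing.protocol",
}

def map_to_miame(geo_sample):
    """Translate one GEO Sample metadata record into MIAME-keyed fields."""
    return {miame_key: geo_sample[geo_key]
            for geo_key, miame_key in GEO_TO_MIAME.items()
            if geo_key in geo_sample}

record = {"organism_ch1": "Homo sapiens", "data_processing": "RMA normalized"}
print(map_to_miame(record))
```

In a real pipeline the unmapped attributes would be kept aside for the text-mining step rather than discarded.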

Randomized Bagging for Bankruptcy Prediction (랜덤화 배깅을 이용한 재무 부실화 예측)

  • Min, Sung-Hwan
    • Journal of Information Technology Services
    • /
    • v.15 no.1
    • /
    • pp.153-166
    • /
    • 2016
  • Ensemble classification combines individually trained classifiers in order to improve prediction accuracy over that of the individual classifiers. Ensemble techniques have been shown to be very effective in improving the generalization ability of a classifier, but the base classifiers need to be as accurate and diverse as possible to realize this benefit. Bagging is one of the most popular ensemble methods. In bagging, different training data subsets are randomly drawn with replacement from the original training dataset, and the base classifiers are trained on these bootstrap samples. In this study we propose a new bagging variant, Randomized Bagging (RBagging), that improves on the standard bagging ensemble model. The proposed model was applied to the bankruptcy prediction problem using a real dataset, and the results were compared with those of the other models. The experimental results showed that the proposed model outperformed the standard bagging model.
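The standard bagging procedure that RBagging builds on (bootstrap resampling plus majority voting) can be sketched as follows; the toy threshold learner and the one-dimensional dataset are invented for illustration, and the paper's RBagging randomization itself is not reproduced here.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw len(data) points with replacement from the training data."""
    return [rng.choice(data) for _ in data]

def train_stump(sample):
    """Toy base learner: threshold at the midpoint of the two class means."""
    xs0 = [x for x, y in sample if y == 0]
    xs1 = [x for x, y in sample if y == 1]
    if not xs0 or not xs1:  # degenerate bootstrap sample: one class only
        majority = Counter(y for _, y in sample).most_common(1)[0][0]
        return lambda x: majority
    t = (sum(xs0) / len(xs0) + sum(xs1) / len(xs1)) / 2
    return lambda x: 1 if x > t else 0

def bagging_predict(models, x):
    """Combine the ensemble by majority vote."""
    return Counter(m(x) for m in models).most_common(1)[0][0]

rng = random.Random(42)
data = [(0.1, 0), (0.2, 0), (0.3, 0), (0.8, 1), (0.9, 1), (1.0, 1)]
models = [train_stump(bootstrap_sample(data, rng)) for _ in range(25)]
print(bagging_predict(models, 0.15), bagging_predict(models, 0.95))  # → 0 1
```

Diversity comes entirely from the resampling: each stump sees a different bootstrap sample, and the vote averages out their individual errors.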

Lie Detection Technique using Video from the Ratio of Change in the Appearance

  • Hossain, S.M. Emdad;Fageeri, Sallam Osman;Soosaimanickam, Arockiasamy;Kausar, Mohammad Abu;Said, Aiman Moyaid
    • International Journal of Computer Science & Network Security
    • /
    • v.22 no.7
    • /
    • pp.165-170
    • /
    • 2022
  • Lying is a nuisance to all, and even liars know it, yet people keep lying; meanwhile, others struggle to detect when they are being lied to. In this research we aim to establish a dynamic platform that identifies liars through video analysis, specifically by calculating the ratio of change in their facial appearance while they speak. The platform will be developed using a machine learning algorithm along with a dynamic classifier to classify liars. For the experimental analysis, the dataset is processed in two classes (people lying and people telling the truth), and the facial-appearance parameters of both are stored for future identification. Similarly, standard parameters are built for truthful speakers and for liars. We expect these standard parameters to be able to diagnose a liar even without pre-captured data.
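The ratio-of-change idea can be sketched roughly as below; the mouth-width measurements, baseline value, and threshold factor are all invented for illustration and are not the paper's actual classifier.

```python
def change_ratios(series):
    """Relative change between consecutive appearance measurements."""
    return [abs(b - a) / a for a, b in zip(series, series[1:]) if a != 0]

def looks_deceptive(series, truthful_baseline, factor=2.0):
    """Flag a clip whose mean change ratio exceeds factor x the baseline."""
    ratios = change_ratios(series)
    return sum(ratios) / len(ratios) > factor * truthful_baseline

# Invented per-frame mouth-width measurements (pixels):
agitated = [30.0, 30.5, 33.0, 36.0]
calm = [30.0, 30.2, 30.1, 30.3]
print(looks_deceptive(agitated, truthful_baseline=0.02),
      looks_deceptive(calm, truthful_baseline=0.02))  # → True False
```

A real system would extract such measurements per facial landmark from video frames and learn the baseline and threshold from labeled clips.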

Development of relational river data model based on river network for multi-dimensional river information system (다차원 하천정보체계 구축을 위한 하천네트워크 기반 관계형 하천 데이터 모델 개발)

  • Choi, Seungsoo;Kim, Dongsu;You, Hojun
    • Journal of Korea Water Resources Association
    • /
    • v.51 no.4
    • /
    • pp.335-346
    • /
    • 2018
  • A vast amount of riverine spatial data has recently become available, including hydrodynamic and morphological surveys by advanced instruments such as the ADCP (Acoustic Doppler Current Profiler), transect measurements obtained while drafting various river basic plans, riverine environmental and ecological data, optical images from UAVs, and river facilities such as multi-purpose weirs and hydrophilic sectors. A standardized data model is therefore required to efficiently store, manage, and share riverine spatial datasets. Given that riverine spatial data such as river facilities, transect measurements, and time-varying observations should be managed jointly along a specified river network, the conventional data model tends to maintain them individually as separate layers corresponding to each theme; this can lose their spatial relationships and makes it inefficient to derive synthesized information. Moreover, the conventional model had to be significantly modified to ingest newly produced data and hampered efficient searches under specific conditions. To avoid these drawbacks of the layer-based data model, this research proposes a relational data model built around the river network, which serves as a backbone relating additional spatial datasets such as flowlines, river facilities, transect measurements and surveyed data. The new data model is flexible enough to minimize structural changes when handling any multi-dimensional river data, and assigns a reach code to each of the river segments delineated from a river. The newly developed data model was applied to the Seom River, for which geographic information on the national and local rivers is available.
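A minimal sketch of such a network-keyed relational schema, using SQLite: every feature table references a reach code on the river network, so a new data type attaches as one more table without restructuring the existing ones. All table and column names here are assumptions, not the paper's actual model.

```python
import sqlite3

# Toy schema: features relate to each other through the shared reach code.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE reach (
    reach_code TEXT PRIMARY KEY,
    river_name TEXT,
    from_km    REAL,
    to_km      REAL
);
CREATE TABLE facility (
    facility_id TEXT PRIMARY KEY,
    reach_code  TEXT REFERENCES reach(reach_code),
    kind        TEXT,
    station_km  REAL
);
CREATE TABLE transect (
    transect_id TEXT PRIMARY KEY,
    reach_code  TEXT REFERENCES reach(reach_code),
    station_km  REAL,
    surveyed_on TEXT
);
""")
con.execute("INSERT INTO reach VALUES ('SR-01', 'Seom', 0.0, 12.5)")
con.execute("INSERT INTO facility VALUES ('W-1', 'SR-01', 'weir', 4.2)")
con.execute("INSERT INTO transect VALUES ('T-1', 'SR-01', 4.0, '2017-05-01')")

# Join through the reach code to find transects near each facility.
rows = con.execute("""
    SELECT f.facility_id, t.transect_id
    FROM facility f JOIN transect t ON f.reach_code = t.reach_code
    WHERE ABS(f.station_km - t.station_km) < 1.0
""").fetchall()
print(rows)  # → [('W-1', 'T-1')]
```

This is the contrast with a layer-based model: the relationship between a weir and a nearby transect is recoverable from keys and stationing, not from overlaying separate layers.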

Development of 4D CT Data Generation Program based on CAD Models through the Convergence of Biomedical Engineering (CAD 모델 기반의 4D CT 데이터 제작 의용공학 융합 프로그램 개발)

  • Seo, Jeong Min;Han, Min Cheol;Lee, Hyun Su;Lee, Se Hyung;Kim, Chan Hyeong
    • Journal of the Korea Convergence Society
    • /
    • v.8 no.4
    • /
    • pp.131-137
    • /
    • 2017
  • In the present study, we developed a program that generates 4D CT data from CAD-based models. To evaluate the developed program, a CAD-based respiratory motion phantom was designed using CAD software and converted into a 4D CT dataset comprising 10 phases of 3D CT. The generated 4D CT dataset was evaluated for effectiveness and accuracy by importing it into a radiation therapy planning system (RTPS). The results show that the generated 4D CT dataset can be successfully used in the RTPS, and the targets in all phases of the 4D CT dataset moved as specified by the user parameters (10 mm) while maintaining a constant volume (8.8 cc). Unlike a real 4D CT scanner, the developed program can produce a gold-standard dataset free of motion-induced artifacts; we therefore believe it will be useful wherever motion effects are important, such as in 4D radiation treatment planning and 4D radiation imaging.

Integrated Verification of Hadoop Cluster Prototypes and Analysis Software for SMB (중소기업을 위한 하둡 클러스터의 프로토타입과 분석 소프트웨어의 통합된 검증)

  • Cha, Byung-Rae;Kim, Nam-Ho;Lee, Seong-Ho;Ji, Yoo-Kang;Kim, Jong-Won
    • Journal of Advanced Navigation Technology
    • /
    • v.18 no.2
    • /
    • pp.191-199
    • /
    • 2014
  • Recently, research aimed at helping small and medium businesses (SMBs) adopt the booming cloud computing and big data paradigms has been on the increase. As one of these efforts, in this paper we design and implement a prototype that tentatively builds a Hadoop cluster on a private cloud infrastructure. Prototype implementations are made on each hardware type, such as single-board computers, PCs, and servers, and their performance is measured. We also present integrated verification results for the data analysis performance of the analysis software running on top of the realized prototypes, using the ASA (American Statistical Association) dataset. For this, we implement the analysis software using several open-source tools, including R, Python, D3, and Java, and perform the tests.

Evaluation and validation of stem volume models for Quercus glauca in the subtropical forest of Jeju Island, Korea

  • Seo, Yeon Ok;Lumbres, Roscinto Ian C.;Won, Hyun Kyu;Jung, Sung Cheol;Lee, Young Jin
    • Journal of Ecology and Environment
    • /
    • v.38 no.4
    • /
    • pp.485-491
    • /
    • 2015
  • This study was conducted to develop stem volume models for the volume estimation of Quercus glauca Thunb. in Jeju Island, Republic of Korea, and to validate the developed models using an independent dataset. A total of 167 trees were measured for diameter at breast height (DBH), total height and stem volume using non-destructive sampling methods. Eighty percent of the dataset was used for initial model development, while the remaining 20% was used for model validation. The performance of the different models was evaluated using the following fit statistics: standard error of estimate (SEE), absolute mean deviation (AMD), coefficient of determination (R²), and root mean square error (RMSE). The AMD of the five models across the different DBH classes was determined using the validation dataset. Model 5 (V = aD^bH^c), which estimates volume using DBH and total height as predictor variables, had the best SEE (0.02745), AMD (0.01538), R² (0.97603) and RMSE (0.02746). Overall, volume models with two independent variables (DBH and total height) performed better than those with only one (DBH) based on the model evaluation and validation. The models developed in this study can provide forest managers with accurate estimates of the stem volumes of Quercus glauca in the subtropical forests of Jeju Island, Korea.
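Model 5 can be fitted by ordinary least squares after a log transform, since ln V = ln a + b ln D + c ln H is linear in the parameters. The sketch below uses synthetic data with invented coefficients, not the paper's fitted values.

```python
import math

def solve3(A, b):
    """Gaussian elimination with partial pivoting for a 3x3 system."""
    M = [row[:] + [v] for row, v in zip(A, b)]
    for i in range(3):
        p = max(range(i, 3), key=lambda r: abs(M[r][i]))
        M[i], M[p] = M[p], M[i]
        for r in range(i + 1, 3):
            f = M[r][i] / M[i][i]
            for c in range(i, 4):
                M[r][c] -= f * M[i][c]
    x = [0.0] * 3
    for i in range(2, -1, -1):
        x[i] = (M[i][3] - sum(M[i][j] * x[j] for j in range(i + 1, 3))) / M[i][i]
    return x

def fit_volume_model(D, H, V):
    """Fit V = a * D^b * H^c by least squares on ln V."""
    X = [[1.0, math.log(d), math.log(h)] for d, h in zip(D, H)]
    y = [math.log(v) for v in V]
    # Normal equations: (X^T X) beta = X^T y
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(3)] for i in range(3)]
    Xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(3)]
    b0, b, c = solve3(XtX, Xty)
    return math.exp(b0), b, c

D = [10, 20, 30, 15, 25]  # DBH (cm), invented
H = [8, 15, 20, 12, 18]   # total height (m), invented
V = [0.0001 * d ** 1.8 * h ** 1.1 for d, h in zip(D, H)]  # synthetic volumes
a, b, c = fit_volume_model(D, H, V)
print(round(b, 3), round(c, 3))  # recovers the generating exponents
```

Fit statistics such as SEE and RMSE would then be computed on the back-transformed predictions against the held-out 20% validation set.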

Compound Outlier Assessment and Verification for Multiple Field Monitoring Data (다수 계측 데이터에 대한 복합 이상치 평가 및 검증)

  • Jeon, Jesung
    • Journal of the Korean GEO-environmental Society
    • /
    • v.19 no.1
    • /
    • pp.5-14
    • /
    • 2018
  • Monitoring data of all kinds at construction sites can contain outliers arising from diverse causes. In this study, a synthesis-value generation technique, its regression, and a final outlier detection and assessment step are used to identify outlier data within extensive time-series datasets. The synthesis value, weighted by the correlations among a number of monitoring datasets, makes outliers easier to detect by amplifying their deviation from the shared trend. Standard artificial datasets with intentionally inserted outliers were used to assess the synthesis-value technique. The results showed increased detection accuracy for outliers and for the general tendency, even when the individual series followed different time-series models. Detection accuracy increased further when more datasets were used and when the series showed similar time-series patterns.
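A rough sketch of the synthesis-value idea, assuming correlation-based weights and a simple k-sigma outlier rule; both choices are assumptions for illustration, and the paper's exact formulation is not reproduced here.

```python
import statistics as st

def corr(a, b):
    """Pearson correlation between two equal-length series."""
    ma, mb = st.fmean(a), st.fmean(b)
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = (sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b)) ** 0.5
    return num / den if den else 0.0

def synthesis(series_list):
    """Combine series, weighting each by its correlation with the mean series."""
    mean_series = [st.fmean(vals) for vals in zip(*series_list)]
    w = [max(corr(s, mean_series), 0.0) for s in series_list]
    total = sum(w) or 1.0
    return [sum(wi * s[i] for wi, s in zip(w, series_list)) / total
            for i in range(len(mean_series))]

def flag_outliers(values, k=2.0):
    """Indices deviating from the mean by more than k standard deviations."""
    m, sd = st.fmean(values), st.pstdev(values)
    return [i for i, v in enumerate(values) if sd and abs(v - m) > k * sd]

s1 = [1.0, 2.0, 3.0, 30.0, 5.0, 6.0]  # spike at index 3
s2 = [1.1, 2.0, 3.1, 4.0, 5.1, 6.0]
print(flag_outliers(synthesis([s1, s2]), k=1.5))  # → [3]
```

Because the spiked series correlates strongly with the combined trend, its weight stays high and the spike survives into the synthesis value, where the k-sigma rule picks it out.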

Comparative Analysis of Machine Learning Models for Crop's yield Prediction

  • Babar, Zaheer Ud Din;UlAmin, Riaz;Sarwar, Muhammad Nabeel;Jabeen, Sidra;Abdullah, Muhammad
    • International Journal of Computer Science & Network Security
    • /
    • v.22 no.5
    • /
    • pp.330-334
    • /
    • 2022
  • In light of decreasing crop production and food shortages across the world, one of the crucial tasks in agriculture today is selecting the right crop for the right piece of land at the right time. The first problem is how farmers, who typically have no background in predictive modeling, can predict the right crop for cultivation. The second problem is determining which algorithm provides the maximum accuracy for crop prediction. Therefore, in this research the authors propose a method to select the most suitable crop(s) for a specific piece of land based on an analysis of the affecting parameters (temperature, humidity, soil moisture) using machine learning. In this work, the authors implemented a Random Forest classifier, Support Vector Machine, k-Nearest Neighbor, and Decision Tree for crop selection, trained these algorithms on the training dataset, and then tested them on the test dataset. The performances of all tested methods were compared to arrive at the best outcome, and in this way the best algorithm among those mentioned above was selected for crop prediction.
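A minimal version of such a train/test comparison, here with a hand-rolled k-NN measured against a majority-class baseline; the crop labels, feature ranges, and split are invented for illustration, and the paper's actual models and data are not reproduced.

```python
import random

def knn_predict(train, x, k=3):
    """k-nearest-neighbour vote over (features, label) pairs."""
    nearest = sorted(train,
                     key=lambda p: sum((a - b) ** 2 for a, b in zip(p[0], x)))[:k]
    labels = [lab for _, lab in nearest]
    return max(set(labels), key=labels.count)

def accuracy(predict, test_set):
    return sum(predict(x) == y for x, y in test_set) / len(test_set)

# Invented (temperature, humidity, soil moisture) -> crop samples.
rng = random.Random(0)
def sample(crop, t, h, m):
    return ((t + rng.uniform(-2, 2), h + rng.uniform(-5, 5),
             m + rng.uniform(-3, 3)), crop)

data = ([sample("rice", 28, 80, 40) for _ in range(30)]
        + [sample("wheat", 18, 50, 25) for _ in range(30)])
rng.shuffle(data)
train, test = data[:48], data[48:]

knn_acc = accuracy(lambda x: knn_predict(train, x), test)
labels = [y for _, y in train]
majority = max(set(labels), key=labels.count)
base_acc = accuracy(lambda x: majority, test)
print(f"kNN accuracy {knn_acc:.2f} vs majority baseline {base_acc:.2f}")
```

Comparing every candidate model against the same held-out test set, as the authors do, is what makes the accuracy figures commensurable.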

Evaluation of Urban Weather Forecast Using WRF-UCM (Urban Canopy Model) Over Seoul (WRF-UCM (Urban Canopy Model)을 이용한 서울 지역의 도시기상 예보 평가)

  • Byon, Jae-Young;Choi, Young-Jean;Seo, Bum-Geun
    • Atmosphere
    • /
    • v.20 no.1
    • /
    • pp.13-26
    • /
    • 2010
  • The Urban Canopy Model (UCM) implemented in the WRF model is applied to improve urban meteorological forecasts in fine-scale (about 1-km horizontal grid spacing) simulations over the city of Seoul. The surface air temperature and wind speed predicted by the WRF-UCM model are compared with those of the standard WRF model. The 2-m air temperature and wind speed of the standard WRF are found to be lower than observed, while the nocturnal urban canopy temperature from the WRF-UCM is superior to the surface air temperature from the standard WRF. Although the urban canopy temperature (TC) is found to be lower at industrial sites, TC in high-intensity residential areas compares better with surface observations than the 2-m temperature does. The 10-m wind speed is overestimated in urban areas, while the urban canopy wind (UC) is weaker than observed because of the drag effect of buildings. The coupled WRF-UCM represents the increase of urban heat from urban effects such as anthropogenic heat and buildings. The study indicates that the WRF-UCM contributes to the improvement of urban weather forecasts, such as of the nocturnal heat island, especially when an accurate urban information dataset is provided.