
Semantic Segmentation of Heterogeneous Unmanned Aerial Vehicle Datasets Using Combined Segmentation Network

  • Song, Ahram (Department of Location-based Information System, Kyungpook National University)
  • Received : 2023.02.07
  • Accepted : 2023.02.24
  • Published : 2023.02.28

Abstract

Unmanned aerial vehicles (UAVs) can capture high-resolution imagery from a variety of viewing angles and altitudes; however, they are generally limited to collecting images of small scenes within larger regions. To improve the utility of UAV-acquired datasets for deep learning applications, multiple datasets created from various regions under different conditions are needed. To demonstrate a new method for integrating heterogeneous UAV datasets, this paper applies a combined segmentation network (CSN), whose encoding blocks are shared between UAVid and the semantic drone dataset to learn their general features, whereas its decoding blocks are trained separately on each dataset. Experimental results show that the CSN improves the accuracy of specific classes (e.g., cars) that comprise low ratios in both datasets. From this result, it is expected that the range of UAV dataset utilization will increase.

1. Introduction

Deep learning model performance clearly depends on the availability of sufficient data to avoid problems such as overfitting (Ying, 2019) and other performance degradations (Li et al., 2019; Althnian et al., 2021). Compared with the wide availability of ground-survey natural image datasets, large-scale aerial-survey remote-sensing datasets are quite difficult to construct for several reasons; for example, label annotation is expensive, as prior expert ground-survey knowledge is needed.

Despite these difficulties, many remote sensing datasets have been built for semantic segmentation and object detection (Li et al., 2021). In particular, unmanned aerial vehicle (UAV)-acquired datasets have grown in importance; however, several challenges to their optimized application remain. Although UAVs can capture high-resolution imagery from a variety of viewing angles and altitudes, they are generally limited to collecting images of small scenes within larger regions. Furthermore, sensors are commonly hampered by weather and environmental conditions. Moreover, image features must have a clear precedent based on ground-survey knowledge; hence, unexpected and complex scenes are difficult to interpret (Ma et al., 2019).

To improve the utility of UAV datasets for deep learning applications, the many datasets that have already been created under diverse weather and environmental conditions should be integrated for data expansion. However, doing so is particularly difficult owing to proprietary restrictions and the diversity of resolutions, seasons, class types, and labeling rubrics. Several studies have adapted heterogeneous datasets. For example, Yang et al. (2020) proposed a semi-supervised learning framework to increase the number of recognizable visual classes in various domains; their experimental results showed that a jointly trained model can significantly improve accuracy as it learns more illumination-invariant features. Additionally, Valada et al. (2017) developed AdapNet, a multi-stream deep neural network for adaptive semantic segmentation in adverse environmental conditions, and demonstrated its effective utility on three different publicly available datasets. In previous research, Song and Kim (2020) used a combined U-Net to segment aerial imagery by training the model jointly with natural image data obtained from vehicles, showing the efficiency of training on different datasets at once to overcome data deficiencies.

Although the above studies were beneficial, they relied heavily on ground-survey datasets, and the optimal architecture of a combined segmentation network (CSN) was not analyzed. In the present study, to leverage heterogeneous UAV datasets for semantic segmentation, a combined convolutional semantic network that shares specific layers is developed, and its prediction accuracy is analyzed with different network architectures to identify the structures that most influence the training results.

2. Materials and Methods

2.1. Dataset

Two UAV datasets, UAVid (Lyu et al., 2020) and the semantic drone dataset (SDD; Graz University of Technology, 2019), were used to train the proposed network. Both datasets contain high-resolution images acquired from UAV platforms; however, their numbers of classes and spatial resolutions differ. Detailed information on the two datasets is given in Table 1.

Table 1. Information on the two datasets


2.1.1. UAVid

The UAVid dataset provides 300 optical images (4,096 × 2,160 or 3,840 × 2,160 pixels) acquired in complex urban areas at an altitude of ~50 m. The dataset consists of eight classes (i.e., building, road, static car, moving car, tree, low vegetation, human, and background). Fig. 1a presents examples of UAVid.


Fig. 1. Examples of images and labels of (a) UAVid and (b) semantic drone dataset (SDD).

2.1.2. Semantic Drone Dataset

The SDD contains optical images acquired in urban areas (Fig. 1b). The images depict more than 20 houses from a nadir view acquired at an altitude of 5–30 m, which is, on average, lower than that of UAVid imagery; therefore, SDD images can reveal more detailed features. The dataset contains 23 categories grouped into vegetation, construction, vehicle, living-object, and natural-object classes. SDD shares the car, tree, vegetation, and human classes with UAVid; however, several classes are included only in SDD.

2.2. Combined Segmentation Network

The CSN performs semantic segmentation based on a simplified U-Net architecture (Fig. 2). The network accepts images of 512 × 512 pixels; because the image sizes of the two datasets differ, nearest-neighbor interpolation is used to adjust them accordingly. The network uses an encoder and a decoder. Its convolutional blocks comprise two two-dimensional (2D) convolutional layers, each with batch normalization and an activation layer, followed by max-pooling for downscaling. During training, these blocks receive input from both datasets and share their training weights. The purpose of sharing specific layers is to overcome the data-shortage problem: Lee et al. (2018) shared the middle of their network to classify three hyperspectral images, and their results showed notably improved classification accuracy. In this paper, the initial encoding blocks are shared because the initial convolutional layers capture low-level features (e.g., lines), even though the class types and resolutions of the datasets differ slightly. After encoding, feature maps are decoded separately with separate training weights, because later convolutional layers extract high-level features (e.g., shapes and specific objects). Each convolutional block in the decoding phase therefore comprises one transposed convolutional layer to upscale the feature map, followed by two convolutional layers. There are two decoding paths: the first for UAVid and the second for SDD. The output size is the same as the input (i.e., 512 × 512 × n), where n is the number of label classes. Furthermore, during training, the CSN is trained with a combined loss: the sum of the losses of the two paths. When the spatial cross-entropy loss is defined as shown in Eq. (1), the combined loss is calculated as shown in Eq. (2).


Fig. 2. Architecture of the combined segmentation network.

\(L_{n}=-\sum_{i=1}^{H \times W} \sum_{k=1}^{C} y_{i, k} \log \left(\hat{y}_{i, k}\right)\)       (1)

\(L_{total}=L_{1}+L_{2}\)       (2)

where \(L_{1}\) and \(L_{2}\) are the spatial cross-entropy losses of the UAVid and SDD decoding paths, respectively; in Eq. (1), \(H \times W\) is the number of pixels, \(C\) is the number of classes, \(y_{i,k}\) is the one-hot reference label, and \(\hat{y}_{i,k}\) is the predicted class probability.
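
For illustration, a minimal PyTorch sketch of this shared-encoder, dual-decoder design is given below. It is a simplified exposition rather than the exact network used in this study: the block counts, channel widths, and the extra_shared/extra_path options are assumptions, and the skip connections of a full U-Net are omitted for brevity.

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Two 3x3 convolutions, each followed by batch normalization and ReLU."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

    def forward(self, x):
        return self.block(x)

class Decoder(nn.Module):
    """One transposed convolution for upscaling, followed by a convolutional
    block, repeated until the input resolution is restored."""
    def __init__(self, chs, n_classes, extra=0):
        super().__init__()
        # Optional extra per-path blocks at the bottleneck (no upscaling).
        self.extra = nn.Sequential(*[ConvBlock(chs[0], chs[0]) for _ in range(extra)])
        self.ups = nn.ModuleList([nn.ConvTranspose2d(i, o, 2, stride=2)
                                  for i, o in zip(chs[:-1], chs[1:])])
        self.convs = nn.ModuleList([ConvBlock(o, o) for o in chs[1:]])
        self.head = nn.Conv2d(chs[-1], n_classes, 1)   # per-pixel class scores

    def forward(self, x):
        x = self.extra(x)
        for up, conv in zip(self.ups, self.convs):
            x = conv(up(x))
        return self.head(x)

class CombinedSegNet(nn.Module):
    """Shared encoding blocks with one separate decoding path per dataset."""
    def __init__(self, n_classes_a, n_classes_b,
                 enc_chs=(3, 64, 128, 256, 512),
                 dec_chs=(512, 256, 128, 64, 64),
                 extra_shared=0, extra_path=0):
        super().__init__()
        self.n_pool = len(enc_chs) - 1
        enc = [ConvBlock(i, o) for i, o in zip(enc_chs[:-1], enc_chs[1:])]
        # Optional additional shared bottleneck blocks (no pooling).
        enc += [ConvBlock(enc_chs[-1], enc_chs[-1]) for _ in range(extra_shared)]
        self.enc = nn.ModuleList(enc)
        self.pool = nn.MaxPool2d(2)
        self.dec_a = Decoder(dec_chs, n_classes_a, extra=extra_path)  # e.g., UAVid
        self.dec_b = Decoder(dec_chs, n_classes_b, extra=extra_path)  # e.g., SDD

    def encode(self, x):
        for i, block in enumerate(self.enc):
            x = block(x)
            if i < self.n_pool:          # extra shared blocks keep resolution
                x = self.pool(x)
        return x

    def forward(self, x_a, x_b):
        # Both datasets pass through the same weight-shared encoder,
        # then diverge into dataset-specific decoders.
        return self.dec_a(self.encode(x_a)), self.dec_b(self.encode(x_b))
```

Because the encoder weights are shared, gradients from both decoding paths update the same low-level filters, which is how the network learns features that are general across datasets; the combined loss of Eq. (2) is simply the sum of the spatial cross-entropy losses computed at the two decoder heads.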

2.3. Effect of Network Architecture

In this study, the lengths of the encoding and decoding blocks are configured in three ways to confirm the effects of the CSN architecture. Case 1 uses six convolutional blocks in the encoding phase and ten convolutional blocks in the decoding phase. In Case 2, the number of convolutional blocks in the initial shared encoding phase is increased, while the number of convolutional blocks in the decoding phase is reduced. In contrast, Case 3 reduces the number of shared convolutional blocks and increases the number of decoding convolutional blocks, as diagrammed in Fig. 3.


Fig. 3. Architecture of the combined segmentation network (CSN) with three cases.
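
For illustration only, the three cases could be mimicked by varying the depth parameters of the CombinedSegNet sketch in Section 2.2; the block counts below are placeholders rather than the exact configurations of Fig. 3.

```python
# Hypothetical depth variants loosely mirroring the three cases.
# UAVid has 8 classes and SDD has 23 (Table 1).
case1 = CombinedSegNet(n_classes_a=8, n_classes_b=23)   # baseline split
case2 = CombinedSegNet(8, 23, extra_shared=2)           # longer shared encoding phase
case3 = CombinedSegNet(8, 23, extra_path=2)             # longer per-dataset decoding paths

x = torch.randn(1, 3, 512, 512)                         # dummy 512 x 512 input
out_uavid, out_sdd = case2(x, x)
print(out_uavid.shape, out_sdd.shape)   # (1, 8, 512, 512) and (1, 23, 512, 512)
```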

2.4. Environmental Setting

The CSN was built and trained in Google's cloud-based Colab environment, which provides NVIDIA Tesla P100 GPUs. The initial learning rate was set to 0.0001, and Adam was used as the optimizer. The model was trained for 50 epochs with the two heterogeneous inputs. Because the objective of this study was to confirm the efficacy of the combined network when the amount of data is insufficient, only some of the images in each dataset were used for training the model. From UAVid, 200 images were taken and divided 70%/30% into training and validation sets, respectively. From SDD, 400 images were taken and split at the same ratio.
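
A minimal sketch of this training procedure follows; uavid_loader and sdd_loader are hypothetical PyTorch data loaders yielding batches of images and integer label maps already resized to 512 × 512, and CombinedSegNet is the sketch from Section 2.2, not the exact implementation used here.

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = CombinedSegNet(n_classes_a=8, n_classes_b=23).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # Section 2.4 settings
criterion = nn.CrossEntropyLoss()   # spatial cross entropy of Eq. (1)

for epoch in range(50):             # 50 epochs, as in the experiment
    for (img_a, lbl_a), (img_b, lbl_b) in zip(uavid_loader, sdd_loader):
        optimizer.zero_grad()
        out_a, out_b = model(img_a.to(device), img_b.to(device))
        # Combined loss of Eq. (2): both paths update the shared encoder.
        loss = criterion(out_a, lbl_a.to(device)) + criterion(out_b, lbl_b.to(device))
        loss.backward()
        optimizer.step()
```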

2.5. Accuracy Assessment

Pixel-based accuracy was measured using a confusion matrix comparing label images with prediction results. Overall accuracy is representative of classification accuracy, but it is limited in that it poorly reflects the accuracy of minority classes when class imbalances exist. Thus, the F1 score (Eq. 3) was calculated from precision and recall, which are in turn computed from the true positive (TP), false positive (FP), and false negative (FN) counts of the confusion matrix (Eq. 4).

\(F1\,score=2 \times \frac{\text {Precision} \times \text {Recall}}{\text {Precision}+\text {Recall}}\)       (3)

\(\text{Precision}=\frac{TP}{TP+FP}, \quad \text{Recall}=\frac{TP}{TP+FN}\)       (4)
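
As a worked illustration of Eqs. (3) and (4), the following sketch derives per-class precision, recall, and F1 scores from a confusion matrix; the toy matrix values are made up for illustration and are not data from the experiment.

```python
import numpy as np

def per_class_scores(conf):
    """Precision, recall, and F1 per class from a confusion matrix whose
    rows are reference labels and columns are predicted labels."""
    tp = np.diag(conf).astype(float)
    fp = conf.sum(axis=0) - tp              # predicted as the class, but wrong
    fn = conf.sum(axis=1) - tp              # belongs to the class, but missed
    precision = tp / np.maximum(tp + fp, 1e-12)                          # Eq. (4)
    recall = tp / np.maximum(tp + fn, 1e-12)
    f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)  # Eq. (3)
    return precision, recall, f1

# Toy 3-class confusion matrix.
conf = np.array([[50,  2,  3],
                 [ 4, 40,  6],
                 [ 1,  5, 30]])
print(per_class_scores(conf)[2])    # per-class F1 scores
```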

3. Results and Discussion

Fig. 4 displays the learning accuracies and losses of the four models on the UAVid and SDD datasets per epoch, with the total number of epochs set to 50. UAVid training accuracy reached almost 0.8, and validation accuracy was between 0.5 and 0.7. Similarly, SDD training accuracy reached 0.8, and SDD validation accuracy remained under 0.7. The accuracies of the four networks were very similar; however, the basic U-Net performed slightly better than the CSNs, and its total losses were lower. Because the CSNs used the combined losses of the UAVid and SDD datasets, their total loss was higher; the CSNs were trained so that the sum of the two losses was minimized. Nevertheless, the validation learning curves of the CSNs were more stable than those of the basic U-Net.


Fig. 4. Learning accuracy and loss of training and validation sets obtained from (a) UAVid and (b) semantic drone dataset (SDD).

Fig. 5 presents the prediction results for randomly selected test images from UAVid. Figs. 5a and g show the RGB input images, and Figs. 5b and h show the label images; the scenes consist mostly of moving cars, buildings, roads, and trees. Figs. 5c and i show the predictions of the basic U-Net, which could not identify moving cars. In contrast, the CSNs successfully identified moving cars (Figs. 5d–f and j–l); in particular, CSN Cases 1 and 2 segmented the moving cars on the road in the test images. However, the accuracy of CSN Case 1 was lower than that of U-Net, mainly because Case 1 predicted background clutter on the road (e.g., road-like materials). This occurs because the background clutter class contains parking lots and car-free areas, whose spectral characteristics are very similar to those of the road class.


Fig. 5. Prediction results of randomly selected test images from UAVid: (a), (g) RGB images; (b), (h) reference images; (c), (i) U-Net results; (d), (j) CSN Case 1; (e), (k) CSN Case 2; (f), (l) CSN Case 3.

Table 2 lists the accuracy results for 50 randomly selected test images. The overall accuracy, F1 score, and kappa coefficient over all classes were highest with the basic U-Net. However, the F1 scores of the static and moving car classes improved with CSN Cases 1 and 2; in particular, CSN Case 2 showed the highest car accuracy because the length of its initial shared layers was increased. Because the static and moving car classes occupy very small ratios of UAVid (1.4 and 1.1%, respectively), joint learning with SDD improved the accuracy of car classification.

Table 2. Accuracy assessment of test images from UAVid using four models


Fig. 6 displays the prediction results for two test images from SDD, and Table 3 shows the accuracy over 50 randomly selected test images. As with the UAVid results, the overall accuracy, F1 score, and kappa coefficient over all classes were lower with the CSNs. However, the vehicle class improved with the CSNs relative to the basic U-Net. Furthermore, the human class was identified only by CSN Cases 2 and 3 (Figs. 6k and l). This is because the CSNs can learn more reliable vehicle characteristics by sharing the UAVid and SDD data in the encoding phase.

Table 3. Accuracy assessment of test images from SDD using four models


Fig. 6. Prediction results of randomly selected test images from the SDD: (a), (g) RGB images; (b), (h) reference images; (c), (i) U-Net results; (d), (j) CSN Case 1; (e), (k) CSN Case 2; (f), (l) CSN Case 3.

Learning with the CSN has the drawbacks of lower overall accuracy and longer training time than learning each dataset separately. However, the accuracy of several classes improved, such as the car classes, for which sufficient information is difficult to obtain from a single dataset because their ratios within each dataset are small. From this result, it can be inferred that class information from heterogeneous datasets can be integrated through the CSN. Meanwhile, there was no clear difference among the CSN structures; because this study only partially added convolutional blocks to the encoder and decoder phases of the basic CSN, the effect may be insignificant. Nevertheless, in the case of SDD, the human class accuracy improved when the number of convolutional blocks was increased.

To overcome the limitation of data shortage, datasets acquired from various platforms can also be learned together through transfer learning approaches (Panboonyuen et al., 2019; Cui et al., 2020). Although transfer learning approaches have proven effective in dealing with heterogeneous datasets, they are complex, and as the amount of data increases, datasets that negatively affect the transferred segmentation rules may be included. Compared with transfer learning approaches, the proposed technique has the potential to improve the learning ability of the initial layers through the simple method of sharing layers. In addition, it demonstrates the possibility of using multiple UAV datasets by focusing on UAV images with similar characteristics, in contrast to existing layer-sharing methods (Lee et al., 2018; Meletis and Dubbelman, 2018).

4. Conclusions

Because remote-sensing datasets are difficult to build on a large scale, it is crucial that researchers be able to train their models with heterogeneous datasets. This paper used a CSN to segment two different datasets (i.e., UAVid and SDD); the network shared its initial layers and was updated using the combined losses. Owing to this configuration, the CSN's total losses were higher than those of the basic U-Net. Notably, for UAVid, the accuracy of specific classes such as moving car improved when using the CSNs. In particular, CSN Case 2, which applied longer initial shared layers, was the most efficient at segmenting cars, whereas there was no significant difference between CSN Cases 1 and 3. This means that the shared-layer structures devised in the experiment did not significantly affect the segmentation results. Although the accuracy of the car classes in SDD improved with the CSN, the accuracies of the other classes were lower than with the basic U-Net because SDD has many classes but the training images were insufficient. Nevertheless, the CSNs improved the prediction accuracy of specific classes whose ratios in UAVid were very small. Ultimately, it was confirmed that segmentation accuracy can be improved for specific classes that coexist in more than one dataset, as long as they have similar characteristics that can be capitalized on through layer-sharing.

Acknowledgments

This work was supported by a National Research Foundation of Korea (NRF) grant funded by the Korean government Ministry of Science and ICT (MSIT) (No. 2022R1F1A1063254).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Althnian, A., AlSaeed, D., Al-Baity, H., Samha, A., Dris, A.B., Alzakari, N.H., et al., 2021. Impact of dataset size on classification performance: An empirical evaluation in the medical domain. Applied Sciences, 11(2), 796. https://doi.org/10.3390/app11020796
  2. Cui, B., Chen, X., and Lu, Y., 2020. Semantic segmentation of remote sensing images using transfer learning and deep convolutional neural network with dense connection. IEEE Access, 8, 116744-116755. https://doi.org/10.1109/ACCESS.2020.3003914
  3. Lee, H., Eum, S., and Kwon, H., 2018. Cross-domain CNN for hyperspectral image classification. In Proceedings of the IGARSS 2018 - 2018 IEEE International Geoscience and Remote Sensing Symposium, Valencia, Spain, July 22-27, pp. 3627-3630. https://doi.org/10.1109/IGARSS.2018.8519419
  4. Li, J., Huang, X., and Gong, J., 2019. Deep neural network for remote-sensing image interpretation: Status and perspectives. National Science Review, 6(6), 1082-1086. https://doi.org/10.1093/nsr/nwz058
  5. Li, Y., Ma, J., and Zhang, Y., 2021. Image retrieval from remote sensing big data: A survey. Information Fusion, 67, 94-115. https://doi.org/10.1016/j.inffus.2020.10.008
  6. Lyu, Y., Vosselman, G., Xia, G., Yilmaz, A., and Yang, M.Y., 2020. UAVid: A semantic segmentation dataset for UAV imagery. ISPRS Journal of Photogrammetry and Remote Sensing, 165, 108-119. https://doi.org/10.1016/j.isprsjprs.2020.05.009
  7. Meletis, P. and Dubbelman, G., 2018. Training of convolutional networks on multiple heterogeneous datasets for street scene semantic segmentation. In Proceedings of the 2018 IEEE Intelligent Vehicles Symposium (IV), Changshu, China, June 26-30, pp. 1045-1050. https://doi.org/10.1109/IVS.2018.8500398
  8. Graz University of Technology, 2019. Semantic Drone Dataset. Available online: http://dronedataset.icg.tugraz.at (accessed on Feb. 25, 2023).
  9. Panboonyuen, T., Jitkajornwanich, K., Lawawirojwong, S., Srestasathiern, P., and Vateekul, P., 2019. Semantic segmentation on remotely sensed images using an enhanced global convolutional network with channel attention and domain specific transfer learning. Remote Sensing, 11(1), 83. https://doi.org/10.3390/rs11010083
  10. Song, A. and Kim, Y., 2020. Semantic segmentation of remote-sensing imagery using heterogeneous big data: International Society for Photogrammetry and Remote Sensing Potsdam and Cityscape datasets. ISPRS International Journal of Geo-Information, 9(10), 601. https://doi.org/10.3390/ijgi9100601
  11. Valada, A., Vertens, J., Dhall, A., and Burgard, W., 2017. AdapNet: Adaptive semantic segmentation in adverse environmental conditions. In Proceedings of the 2017 IEEE International Conference on Robotics and Automation (ICRA), Singapore, May 29-June 3, pp. 4644-4651. https://doi.org/10.1109/ICRA.2017.7989540
  12. Yang, K., Hu, X., Wang, K., and Stiefelhagen, R., 2020. In defense of multi-source omni-supervised efficient ConvNet for robust semantic segmentation in heterogeneous unseen domains. In Proceedings of the 2020 IEEE Intelligent Vehicles Symposium (IV), Las Vegas, NV, USA, Oct. 19-Nov. 13, pp. 1386-1393. https://doi.org/10.1109/IV47402.2020.9304768
  13. Ying, X., 2019. An overview of overfitting and its solutions. Journal of Physics: Conference Series, 1168, 022022. https://doi.org/10.1088/1742-6596/1168/2/022022