Keypoint-based Deep Learning Approach for Building Footprint Extraction Using Aerial Images

  • Jeong, Doyoung (Department of Civil and Environmental Engineering, Seoul National University) ;
  • Kim, Yongil (Department of Civil and Environmental Engineering, Seoul National University)
  • Received : 2021.02.08
  • Accepted : 2021.02.20
  • Published : 2021.02.26

Abstract

Building footprint extraction is an active topic in the domain of remote sensing, since buildings are a fundamental unit of urban areas. Deep convolutional neural networks successfully perform footprint extraction from optical satellite images. However, semantic segmentation produces coarse results in the output, such as blurred and rounded boundaries, which are caused by the use of convolutional layers with large receptive fields and pooling layers. The objective of this study is to generate visually enhanced building objects by directly extracting the vertices of individual buildings by combining instance segmentation and keypoint detection. The target keypoints in building extraction are defined as points of interest based on the local image gradient direction, that is, the vertices of a building polygon. The proposed framework follows a two-stage, top-down approach that is divided into object detection and keypoint estimation. Keypoints between instances are distinguished by merging the rough segmentation masks and the local features of regions of interest. A building polygon is created by grouping the predicted keypoints through a simple geometric method. Our model achieved an F1-score of 0.650 with an mIoU of 62.6 for building footprint extraction using the OpenCitiesAI dataset. The results demonstrated that the proposed framework using keypoint estimation exhibited better segmentation performance when compared with Mask R-CNN in terms of both qualitative and quantitative results.


1. Introduction

Buildings provide key cadastral information related to populations and cities, and are fundamental to urban planning, disaster management, and 3D city modeling. Modern remote sensing data such as satellite and drone imagery provide an unprecedented range of visual information for mapping buildings. Knowledge of a building’s location can be valuable for protecting the well-being of urban citizens by monitoring a city’s expansion, and for detecting and regulating illegal building construction activities. Although it is possible to delineate the location and shape of buildings manually, the cost of such annotation over large areas highlights the need to develop automatic building extraction algorithms.

Building footprint extraction, which is an actively researched topic in the domain of remote sensing, is challenging due to the variability of building shapes, materials, and dimensions in addition to the different types of backgrounds against which they are located (Pasquali et al., 2019). In early works, building footprints were often delineated with multistep, bottom-up approaches and a combination of multispectral satellite imagery and LiDAR (Sohn et al., 2007). However, these methods have poor generalization abilities. Recently, Deep Neural Networks (DNNs) have shown successful performance in building footprint extraction (Zhao et al., 2018) by using only optical images. DNNs with multiple nonlinear layers can automatically learn high-level abstract features from large amounts of training data and outperform conventional algorithms.

Building footprint extraction is normally processed by combining two tasks: (i) segmentation, which is the extraction of building regions from the given area, and (ii) instantiation, which is the identification of individual buildings. In other words, related studies aim to extract individual, vectorized buildings by integrating the two tasks together. The result can be classified depending on which method is performed first. One approach is semantic segmentation (segmentation before instantiation) (Ji et al., 2019; Xu et al., 2018), which classifies image pixels into building and nonbuilding pixels. Through post-processing, each individual building is then identified by grouping connected pixels. The second approach is instance segmentation, where each building is detected within a bounding box and each detected object is segmented into building and nonbuilding pixels (Zhang et al., 2020).

Building footprint extraction is essentially a binary classification problem in that there are only two categories, namely, buildings and nonbuildings. In several challenges and previous papers, semantic segmentation was used to classify each pixel class by using deep features through U-Net-based deep learning networks (Iglovikov et al., 2018), resulting in superior performance for winning submissions in numerous challenges on building footprint extraction. However, the semantic segmentation approach produces coarse segmentation results, such as non-sharp boundaries in the output. These results are caused by the use of convolutional layers with large receptive fields and by the pooling layers in Deep Convolutional Neural Networks (DCNNs), which fail to detect fine local details in the image because they do not consider the interactions occurring between pixels. In this case, the segmentation result is produced as a binary classification image, which is not a desirable output from a user’s point of view for many applications (Li et al., 2019).

Recently, a series of studies has been conducted to create polygon representations that describe geometric objects of vector structures in an end-to-end learnable approach (Li et al., 2019; Zhang et al., 2020; Zhao et al., 2018). Instead of pixel-wise segmentation maps, instance segmentation approaches have been introduced to directly generate polygons in an end-to-end network. Li et al. (2019) adapted Recurrent Neural Networks for building footprint extraction to predict the vertices and edge masks of an instance using detection modules. This approach can also produce a visually qualitative segmentation mask with sharp boundaries while connecting each vertex with its nearest neighbors.

In this study, the public dataset from the Open Cities AI Challenge was used for building footprint extraction. This dataset consists of orthorectified aerial images in which the annotations match both roof outlines and building footprints, as shown in Fig. 1.


Fig. 1. Sample Images of the OpenCitiesAI dataset corresponding to the four different sites: (a) Accra, Ghana; (b) Dar es Salaam, Tanzania; (c) Pointe Noire, Rep. of Congo; (d) Monrovia, Liberia.

Each polygon point is regarded as a keypoint, and keypoint extraction is performed for instantiation after object detection. Keypoints in building extraction are points of interest based on the local image gradient direction, that is, the vertices of a building polygon. This approach is composed of two modules: object detection and vectorization. Polygons are acquired by predicting the optimal locations of the polygon vertices and linking the outer vertices with straight lines, thereby creating formulaic polygons. In previous studies, PolygonRNN (Castrejon et al., 2017) and PolygonRNN++ (Acuna et al., 2018) use fully convolutional layers to extract the bounding boxes of each instance. These layers are then fed into recurrent neural networks (RNNs), which predict a boundary mask, the locations of the polygon vertices, and the first vertex from which to start edge generation. In one sequence, the current boundary and vertex predictions are influenced by previous predictions. Keypoint detection networks can be deployed after region proposal networks (RPNs) to extract additional information following detection for applications such as pose estimation (Wei et al., 2020). The objective of this study is therefore to derive visually enhanced building objects by directly extracting the vertices of independent buildings using a combination of instance segmentation and keypoint detection.

2. Methodology

1) Mask R-CNN

The backbone network follows the typical two-stage instance segmentation approach proposed in Mask R-CNN (He et al., 2017), as illustrated in Fig. 2. This approach predicts a segmentation mask and is separated into detection and segmentation tasks. The architecture outputs well-localized RoI features, which play a key role in the model.


Fig. 2. Overview of backbone network and proposed framework.

Mask R-CNN performs detection and then segmentation. The detection process generates localized RoIs from the feature map produced by a feature extractor, such as the residual network (ResNet) used in the common two-stage object detector Faster R-CNN (He et al., 2016; Ren et al., 2015). After this step, the features of each RoI are fed into simple convolutional layers and an FCN to obtain object masks for semantic segmentation. Many other instance segmentation models have been developed based on R-CNN, such as R-FCN (Dai et al., 2016) and CenterMask (Lee et al., 2020). The performance of these networks has been improved by applying FCN-based anchor-free detectors such as DenseBox (Huang et al., 2015) and FCOS (Tian et al., 2019) instead of Faster R-CNN with its pre-defined anchor boxes. In this study, the effect of the detector on the performance of the proposed framework was evaluated by comparing the use of Faster R-CNN and FCOS for localizing RoIs.

The goal of the backbone network is to provide a segmentation mask for polygon initialization for each individual object. The instance segmentation model is exploited to generate a segmentation mask for each instance in the scene, as demonstrated in the original Mask R-CNN. A bounding-box detection step is added to predict separate keypoints and partition the image into individual building instances. ResNet-FPN (Lin et al., 2017) is integrated into the framework to extract features for the corresponding RoIs. The FPN enhances the performance of the RPN by adding additional information through a multiscale pyramidal hierarchy of CNNs called feature pyramids.
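
As an illustration of this backbone, the minimal sketch below instantiates a ResNet-FPN feature extractor with torchvision. It is not the authors' implementation (which is built on Detectron2); the function resnet_fpn_backbone, the pyramid level names, and the 512 × 512 input size are simply torchvision conventions used here for demonstration.

```python
# Minimal sketch of a ResNet-FPN feature extractor (illustrative only; the
# study's implementation is built on Detectron2, not torchvision).
import torch
from torchvision.models.detection.backbone_utils import resnet_fpn_backbone

# ResNet-101 with a Feature Pyramid Network on top; returns multi-scale maps.
backbone = resnet_fpn_backbone('resnet101', pretrained=False)

image = torch.randn(1, 3, 512, 512)   # one RGB patch, matching the dataset tiles
features = backbone(image)            # OrderedDict of pyramid levels
for name, fmap in features.items():
    print(name, tuple(fmap.shape))    # e.g. '0' -> (1, 256, 128, 128), ..., 'pool'
```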

2) Keypoint Detection and Grouping

The keypoint detection step outputs a heatmap of keypoints for each object. The proposed keypoint prediction network is similar to the one outlined in Law et al. (2018). Suppose an input image of height H and width W, \(I \in \mathbb{R}^{H \times W \times 3}\). For each image patch, there exists a corresponding ground truth heatmap \(Y \in[0,1]^{H \times W}\). The aim of this process is to produce a corresponding heatmap of candidate keypoints, \(\hat{Y} \in[0,1]^{H \times W}\), which represents the vertices of each instance. A prediction \(\hat{Y}_{x, y}=1\) corresponds to a detected keypoint, while \(\hat{Y}_{x, y}=0\) denotes the background.

For each ground truth keypoint, \(p \in \mathbb{R}^{2}\), the ground truth keypoint map is generated using the Gaussian kernel \(Y_{x, y}=\exp \left(-\frac{\left(x-p_{x}\right)^{2}+\left(y-p_{y}\right)^{2}}{2 \sigma_{p}^{2}}\right)\), where \(\sigma_{p}\) is an object size-adaptive standard deviation (Law et al., 2018). In this regard, the penalty is reduced at negative locations within a radius of the positive location instead of applying equal penalization during training (Zhang et al., 2020). Therefore, the training objective is set as a penalty-reduced pixel-wise logistic regression with a modified focal loss, Lkeypoint, to maintain a balance between positive and negative locations (Lin et al., 2017). The loss function of the keypoint extraction is defined in Equation (1). An example of the heatmap generation result is shown in Fig. 3.


Fig. 3. Input image with annotation and its keypoint heatmap.

\(L_{\text {keypoint }}=-\frac{1}{N} \sum_{x, y}\left\{\begin{array}{ll}\left(1-\hat{Y}_{x y}\right)^{\alpha} \log \left(\hat{Y}_{x y}\right) & \text { if } Y_{x y}=1 \\ \left(1-Y_{x y}\right)^{\beta}\left(\hat{Y}_{x y}\right)^{\alpha} \log \left(1-\hat{Y}_{x y}\right) & \text { otherwise }\end{array}\right.\)       (1)

where α and β are hyper-parameters for focal loss and N is the number of objects in a patch. For this study, the hyper-parameters are fixed as α=2 and β=4, in accordance with Lin et al. (2017).
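
The sketch below illustrates how the Gaussian ground-truth heatmap and the penalty-reduced focal loss of Equation (1) could be implemented. The function names and tensor shapes are illustrative rather than the authors' code; α = 2 and β = 4 are fixed as stated above, and N is passed in as the number of objects in the patch.

```python
# Illustrative sketch of the ground-truth heatmap and the penalty-reduced
# focal loss of Eq. (1); names and shapes are ours, not the authors'.
import torch

def gaussian_heatmap(keypoints, height, width, sigma):
    """Splat ground-truth vertices onto a [0, 1] heatmap with a Gaussian kernel."""
    ys = torch.arange(height, dtype=torch.float32).view(-1, 1)
    xs = torch.arange(width, dtype=torch.float32).view(1, -1)
    heatmap = torch.zeros(height, width)
    for px, py in keypoints:  # vertex coordinates in pixels
        g = torch.exp(-((xs - px) ** 2 + (ys - py) ** 2) / (2 * sigma ** 2))
        heatmap = torch.max(heatmap, g)  # keep the strongest response per pixel
    return heatmap

def keypoint_focal_loss(pred, gt, num_objects, alpha=2, beta=4, eps=1e-6):
    """Penalty-reduced pixel-wise focal loss (Eq. 1); pred and gt share one shape."""
    pos = gt.eq(1).float()  # exact vertex locations (Y = 1)
    neg = 1.0 - pos
    pos_loss = ((1 - pred) ** alpha) * torch.log(pred + eps) * pos
    # negatives close to a vertex (gt near 1) are penalized less via (1 - gt)^beta
    neg_loss = ((1 - gt) ** beta) * (pred ** alpha) * torch.log(1 - pred + eps) * neg
    return -(pos_loss.sum() + neg_loss.sum()) / max(num_objects, 1)

# toy usage: one rectangular building with four vertices on a 64 x 64 patch
gt = gaussian_heatmap([(10, 12), (40, 12), (40, 45), (10, 45)], 64, 64, sigma=2.0)
pred = torch.rand(64, 64)
print(keypoint_focal_loss(pred, gt, num_objects=1))
```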

Segmentation masks are regarded as a rough probability map that estimates the probability of each pixel belonging to the foreground (Li et al., 2020; Luo et al., 2018). By merging the segmentation mask with the features inside an RoI, the mask can be used to distinguish whether a keypoint is enclosed inside an object or not, as displayed in Fig. 4. An FCN is then applied to the local features acquired by RoIAlign, together with the predicted mask, to predict the keypoint heatmap (a sketch of this branch is given after Fig. 4).


Fig. 4. Mask and keypoint branches.
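
Below is a hedged sketch of such a mask-plus-RoI-feature keypoint branch: RoI features are pooled with RoIAlign, concatenated with the predicted rough mask, and passed through a small fully convolutional head. The class name, channel sizes, RoI resolution, and the stride-4 feature level are assumptions made for illustration, not values reported in the paper.

```python
# Hedged sketch of the keypoint branch of Fig. 4: RoIAlign features are merged
# with the predicted (rough) mask and fed to a small fully convolutional head.
import torch
import torch.nn as nn
from torchvision.ops import roi_align

class KeypointBranch(nn.Module):
    def __init__(self, in_channels=256, roi_size=56):
        super().__init__()
        self.roi_size = roi_size
        # FCN head: the extra input channel carries the predicted mask probability
        self.head = nn.Sequential(
            nn.Conv2d(in_channels + 1, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 1, 1),  # one keypoint heatmap per RoI
        )

    def forward(self, feature_map, boxes, mask_probs):
        # feature_map: (N, C, H, W) backbone features; boxes: list of (K_i, 4) per image
        # mask_probs: (sum K_i, 1, roi_size, roi_size) rough masks predicted per RoI
        rois = roi_align(feature_map, boxes, output_size=self.roi_size,
                         spatial_scale=1.0 / 4, sampling_ratio=2)  # assumes stride-4 features
        fused = torch.cat([rois, mask_probs], dim=1)  # merge mask with local features
        return torch.sigmoid(self.head(fused))        # per-RoI keypoint heatmaps

# toy usage with a single image and two candidate boxes
features = torch.randn(1, 256, 128, 128)
boxes = [torch.tensor([[10., 10., 200., 180.], [220., 60., 400., 300.]])]
masks = torch.rand(2, 1, 56, 56)
heatmaps = KeypointBranch()(features, boxes, masks)
print(heatmaps.shape)  # (2, 1, 56, 56)
```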

Overall, the total loss function of the proposed network is a multitask loss function expressed as follows:

\(L=L_{c l s}+L_{r e g}+\lambda_{\text {mask }} L_{\text {mask }}+\lambda_{\text {polygon }} L_{\text {polygon }}\)       (2)

where Lcls is a cross-entropy loss for bounding-box classification and Lreg is a bounding-box loss for bounding-box regression, which is defined in He et al. (2017). For all experiments in this study, λmask=0.2 and λpolygon=1, unless specified otherwise. The features of the backbone are passed through separate layers of 3×3 convolution, ReLU, and 1×1 convolution.

To create a polygon whose edges sequentially connect the keypoints, a simple geometric method is adopted for the predicted keypoints, which is illustrated in Fig. 5 (a sketch of this grouping procedure is given after Fig. 5). First, four extreme keypoints, located at the farthest left, right, bottom, or top of an instance, are selected as the group of start points. Then, a Euclidean distance matrix of all keypoints is calculated to find the keypoints closest to the extreme keypoints. The point with the shortest distance to an extreme point is selected as the start point, and the first edge is generated by connecting this initial point to its extreme point. The newly connected point is then considered as the initial point for the next edge, and the next vertex is connected from its neighborhood. The grouping of keypoints is iterated until the final keypoint meets the initial keypoint. Finally, a polygon of an object is formed by integrating all of the generated edges. One limitation occurs when the shape of a ground truth object is concave, and the grouping method fails to utilize all keypoints to create a complete polygon. This problem can be resolved by establishing a connection between the initial point and its next closest point.


Fig. 5. Strategy for grouping keypoints.
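
The following is one way to realize the greedy grouping described above: starting from an extreme vertex, keypoints are chained to their nearest unvisited neighbour until the loop closes. This is an illustrative sketch rather than the authors' exact implementation, and, as noted above, such greedy chaining can fail on concave shapes.

```python
# Illustrative greedy keypoint grouping: chain each vertex to its nearest
# unvisited neighbour, starting from an extreme (here, leftmost) point.
import numpy as np

def group_keypoints(points):
    """points: (N, 2) array of predicted vertices of one instance.
    Returns the vertices ordered into a closed polygon."""
    pts = np.asarray(points, dtype=float)
    dist = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)  # distance matrix
    start = int(np.argmin(pts[:, 0]))  # leftmost vertex as one of the extreme points
    order, visited = [start], {start}
    while len(order) < len(pts):
        current = order[-1]
        d = dist[current].copy()
        d[list(visited)] = np.inf      # do not revisit grouped keypoints
        nxt = int(np.argmin(d))        # nearest remaining neighbour
        order.append(nxt)
        visited.add(nxt)
    return pts[order]                  # closing edge runs from the last vertex back to the start

# toy usage: vertices of a rectangle given in scrambled order
print(group_keypoints([[0, 0], [10, 8], [10, 0], [0, 8]]))
```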

3. Experiments

1) Dataset

The OpenCitiesAI dataset features drone imagery at 2 to 7 cm resolution acquired over African cities, containing small and highly diverse building roof styles. The images have a size of 512 × 512 pixels with RGB bands. Building footprints are annotated with local OpenStreetMap (OSM) data. The dataset covers more than 700,000 buildings in 12 African cities and regions.

Unlike the SpaceNet 2 dataset (Van Etten et al., 2018), which consists of non-nadir satellite images for building footprint extraction, OpenCitiesAI contains orthorectified images; hence, the roof boundaries and building footprints match correctly. In other words, the OpenCitiesAI dataset images are free from misalignment bias between the roof contour observed in a satellite image and the building footprint, given that the label is only annotated to the building’s ground footprint. However, since the labels were annotated automatically using OSM, the ground truth inevitably contains omission error. The entire dataset contains over 790,000 building footprint labels, and examples of the images and annotations are shown in Fig. 1.

The texture of building boundaries may be distorted when processing geometric correction during image mosaicking. In addition, some buildings are often not annotated due to omissions in OSM data layers. However, since the images in the dataset contain geographic information, any sign of geometric distortion can easily be detected via visual inspection. For this study, the multi-resolution drone images were all resampled to 10 cm spatial resolution.
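
For reference, the following is a hedged sketch of such a resampling step using rasterio; the file path and the choice of bilinear interpolation are illustrative assumptions rather than details reported in the paper.

```python
# Hedged sketch: resample a drone orthomosaic to a 10 cm grid with rasterio.
import rasterio
from rasterio.enums import Resampling

target_res = 0.10  # metres per pixel

with rasterio.open("drone_mosaic.tif") as src:  # hypothetical input file
    scale_x = src.res[0] / target_res           # src.res gives (x, y) pixel size in metres
    scale_y = src.res[1] / target_res
    data = src.read(
        out_shape=(src.count,
                   int(src.height * scale_y),
                   int(src.width * scale_x)),
        resampling=Resampling.bilinear,          # smooth interpolation for imagery
    )
    # adjust the affine transform so the resampled raster stays georeferenced
    transform = src.transform * src.transform.scale(
        src.width / data.shape[-1], src.height / data.shape[-2])

print(data.shape, transform)
```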

2) Training and Testing Details

The proposed framework was evaluated on the OpenCitiesAI dataset, which consists of 790,000 annotated buildings. A total of 31 drone images from eight cities was split into 20,000 patches with a size of 512 by 512 pixels. Of these patches, 80% were used for training the weights, while the remaining 20% served as the testing set. The experiment was implemented using PyTorch 1.5.0 on Python 3.7. Instance segmentation models were implemented on top of Detectron2 (Wu et al., 2019). The configuration of the backbone network is displayed in Table 1. The ResNet-101 architecture was used as the feature extractor, and the anchor stride of the RPN layer was adjusted for object detection in the drone images.

Table 1. Configurations of the backbone network


When training the Mask R-CNN model, the pretrained weights of ResNet-101 were adopted to initialize the backbone network. The batch size was set to 2, and the Adam optimizer was employed (Kingma and Ba, 2014). The learning rate was initialized as \(10^{-4}\), with a weight decay of \(10^{-7}\) per 1000 epochs. Mask R-CNN was also trained and evaluated on the same dataset as the baseline with the same configuration using Detectron2. The network was trained on a single NVIDIA GeForce 2080 Ti with 12 GB memory.
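
As a hedged illustration of this setup, the sketch below configures a Mask R-CNN baseline with a ResNet-101 FPN backbone in Detectron2 using its standard config keys. The dataset names are hypothetical placeholders, and the custom keypoint branch and the switch from Detectron2's default SGD to Adam are not shown.

```python
# Minimal Detectron2 configuration sketch for the Mask R-CNN baseline
# (ResNet-101 + FPN, batch size 2, base learning rate 1e-4).
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultTrainer

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file(
    "COCO-InstanceSegmentation/mask_rcnn_R_101_FPN_3x.yaml"))
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(
    "COCO-InstanceSegmentation/mask_rcnn_R_101_FPN_3x.yaml")  # pretrained ResNet-101 weights
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 1           # single class: building
cfg.SOLVER.IMS_PER_BATCH = 2                  # batch size used in this study
cfg.SOLVER.BASE_LR = 1e-4                     # initial learning rate
cfg.DATASETS.TRAIN = ("opencities_train",)    # hypothetical registered dataset names
cfg.DATASETS.TEST = ("opencities_test",)

trainer = DefaultTrainer(cfg)                 # Detectron2's default training loop (SGD)
trainer.resume_or_load(resume=False)
trainer.train()
```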

3) Metrics

To measure the performance of the proposed framework, Intersection over Union (IoU) and F1-score were used. These metrics are widely used in semantic segmentation and building footprint extraction and are presented by the following equations:

\(I o U=\frac{T P}{T P+F P+F N}\)       (1)

\(\text { precision }=\frac{T P}{T P+F P}\)       (2)

\(\text { recall }=\frac{T P}{T P+F N}\)       (3)

\(F 1=\frac{2 \times \text { precision } \times \text { recall }}{\text { precision }+\text { recall }}\)       (4)

where TP, TN, FP, and FN denote the true positives, true negatives, false positives, and false negatives, respectively. In addition, Structural Similarity (SSIM) is employed to evaluate the similarity between the predicted binary building mask and the ground truth mask. SSIM is represented by the following expression:

\(\operatorname{SSIM}(x, y)=\frac{\left(2 \mu_{x} \mu_{y}+C_{1}\right)\left(2 \sigma_{x y}+C_{2}\right)}{\left(\mu_{x}^{2}+\mu_{y}^{2}+C_{1}\right)\left(\sigma_{x}^{2}+\sigma_{y}^{2}+C_{2}\right)}\)       (5)

where x and y are the sets of pixel values in a fixed-size window; μx and μy are the means and σx and σy the standard deviations of x and y. Further, σxy is the covariance of x and y, while \(C_{1}=0.01^{2}\) and \(C_{2}=0.03^{2}\) are constants that avoid a zero-valued denominator.
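
For reference, a short sketch of how these metrics could be computed for binary masks is given below; the helper function name is ours, and SSIM is taken from scikit-image rather than from any implementation described in the paper.

```python
# Illustrative computation of IoU, precision, recall, F1 and SSIM for binary masks.
import numpy as np
from skimage.metrics import structural_similarity as ssim

def mask_metrics(pred, gt, eps=1e-9):
    """pred, gt: binary numpy arrays of the same shape (1 = building)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    iou = tp / (tp + fp + fn + eps)
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    f1 = 2 * precision * recall / (precision + recall + eps)
    return iou, precision, recall, f1

# toy usage on random masks the size of one image patch
pred = np.random.rand(512, 512) > 0.5
gt = np.random.rand(512, 512) > 0.5
print(mask_metrics(pred, gt))
print(ssim(pred.astype(float), gt.astype(float), data_range=1.0))
```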

4. Results and Discussion

In this section, the performance of the proposed framework is evaluated using three metrics. The mask accuracy was evaluated by the F1-score, Intersection over Union (IoU) and Structural Similarity (SSIM).

1) Building Extraction Accuracy

The evaluation results for the dataset and a total score were computed for the proposed model and the control groups. The results are shown in Table 2. Compared with the typical instance segmentation model, Mask R-CNN, the proposed method generated consistently higher segmentation accuracy than the other models. However, since the proposed network uses the same object detection structure as Mask R-CNN, there is a clear limitation in performance improvement. Since the proposed branch predicts the mask and keypoints for each RoI localized by the RPN, its performance depends on the bounding box predicted in the first stage. For an object whose ground-truth keypoints fall outside the predicted RoI, or whose keypoints cannot be detected, subsequent keypoint detection may be limited in performance. This issue is further emphasized by ambiguous boundaries between adjacent buildings: the OpenCitiesAI dataset contains a large number of densely packed buildings with highly heterogeneous patterns, building sizes, and roof styles, which restricts the model's ability to distinguish instances.

Table 2. Building extraction accuracy.


Since the two-stage detector has a limitation in that the bounding box of small objects may be omitted by the NMS algorithm, the detection accuracy for small objects tended to decrease more than others. The semantic segmentation inferences of the building mask predictions are produced in a pixel-wise manner, thus helping improve the overall accuracy of the mask prediction. However, the semantic segmentation models only produced pixel-wise semantic labels to classify buildings and were not able to distinguish individual building objects effectively. To address this problem, submissions in previous competitions employed post-processing algorithms, such as the watershed algorithm, to separate building regions or train the edges between adjacent buildings (Long et al., 2015).

The Open Cities AI dataset consists of images from urban areas in developing countries in Africa, which contain a large proportion of small and densely distributed buildings. This heterogeneity in the dataset makes it difficult to distinguish between the background and the building in cases of adjacent buildings. The accuracy metric results organized in Table 2 reveal that the incorporation of keypoint geometry helped to increase accuracy in comparison to conventional segmentation models such as Mask R-CNN and U-Net.

2) Impact of the Object Detector

Since localized features are used for segmentation, the localization performance of the object detector significantly impacts the performance of building extraction. For comparison with state-of-the-art object detectors, FCOS (Tian et al., 2019) was substituted for the detection stage, while the same feature extractor, ResNet-101, was used. By performing deeper inference with an FCN, FCOS can perform better than the RPN, which consists of multiple CNNs (Lee et al., 2020). FCOS is also an anchor-free detector; hence, it is less affected by the NMS algorithm, which removes anchor candidates. The model performance was evaluated after integrating the proposed keypoint detection module. Table 3 shows that the building footprint extraction performance of the two-stage instance segmentation is closely related to the object detection performance, since the segmentation tasks are executed on localized RoI features extracted from the detection results. Additional improvement in the detector performance can be expected through optimal hyper-parameter tuning.

Table 3. Accuracy indices of different instance segmentation methods


3) Keypoint Detection

To detect the keypoints of an instance, the keypoint branch not only uses the localized features but also concatenates them with the predicted mask. Experiments were conducted for three scenarios to verify this approach: (1) predicting the segmentation mask only with multiple convolutional networks, (2) detecting keypoints through a fully convolutional layer on the localized features directly, and (3) predicting keypoints through a fully convolutional layer by adding the localized features and the mask. The results of the three experiments are summarized in Table 4.

Table 4. Accuracy of three different segmentation scenarios


When the keypoints were extracted from the localized features alone, the performance was lower than in the other two cases. This is because the keypoints of other instances are not distinguished when creating a heatmap for learning keypoints, and the keypoints of adjacent buildings are also detected. As a result, the recall accuracy improved, but the IoU and F1-score were lower. This problem can be resolved if the predicted mask is used as a feature together with the localized RoI features. The generated mask indicates the pixels that likely belong to the same instance, and its inclusion in the prediction process enables the keypoints within the RoI to be assigned to the correct instance.

4) Qualitative Analysis

Fig. 6 reveals that the proposed network, an extension of Mask R-CNN, produces visually superior results in extracting building footprints. Our network's results demonstrate that the shape and location of the buildings are well depicted, suggesting that sharper boundaries and geometric details are preserved. However, since keypoint estimation was performed only on the objects detected by the RPN, many false negative pixels were observed. The instance segmentation model generates pixel-based masks after the instantiation of each object and considers each boundary as a building footprint. In DCNNs with repeated pooling and downsampling layers, edge information is partially lost, and the generated mask appears rough and rounded. By contrast, the proposed framework directly predicts the vertices of the object. The segmentation mask predicted by the model is considered as additional information for keypoint estimation. By grouping keypoints into independent polygons, the proposed framework can predict more realistic building footprints from satellite and aerial images.


Fig. 6. Comparison of the Results of Mask R-CNN and the Proposed Framework. (a) Segmentation by Mask R-CNN, (b) Results from proposed framework.

5. Conclusion

In this study, a building footprint extraction framework using keypoint detection is presented. First, a keypoint detection module integrates the localized RoI features with the mask prediction. Since the number of keypoints of a detected building is not fixed, keypoint detection was conducted using an FCN to predict a variable number of keypoints. The proposed methodology involves deep learning models composed of a backbone network and a keypoint detection module, which groups the points to generate polygons for building footprint detection. The proposed network operates by simply adding the keypoint detection module to a common two-stage instance segmentation network in an end-to-end learning framework, thereby predicting vectorized building polygons without the need for heavy post-processing algorithms.

The proposed framework was evaluated using the OpenCitiesAI dataset composed of satellite and aerial images with differing spatial resolutions and variable spatial distribution of building features. The experiments demonstrated that the proposed framework successfully improved the visibility of the output mask’s shape. Additional experiments were conducted in this study to verify the validity of the keypoint detection module’s branch design.

State-of-the-art, one-stage models (Law et al., 2018; Li et al., 2020) can directly detect bounding boxes instead of relying on anchors, and are more suitable for building detection in urban areas with densely distributed buildings. Aside from the integration of keypoints, the use of information extracted from kinetic polygonal partitioning, inspired by the superpixel algorithm, can also improve the accuracy of building footprint extraction.

References

  1. Acuna, D., H. Ling, A. Kar, and S. Fidler, 2018. Efficient interactive annotation of segmentation datasets with polygon-rnn++, Proc. of 2018 the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, Jun. 18-22, pp. 859-868.
  2. Castrejon, L., K. Kundu, R. Urtasun, and S. Fidler, 2017. Annotating object instances with a polygonrnn, Proc. of 2017 the IEEE conference on computer vision and pattern recognition, Hawaii convention Center Honolulu, HI, Jul. 21-26, pp. 5230-5238.
  3. Dai, J., Y. Li, K. He, and J.J. Sun, 2016. R-fcn: Object detection via region-based fully convolutional networks, arXiv preprint, arXiv(1605.06409): 379-387.
  4. He, K., G. Gkioxari, P. Dollar, and R. Girshick, 2017. Mask r-cnn, Proc. of 2017 the IEEE international conference on computer vision, Venice, ITA, Oct. 22-29, pp. 2961-2969.
  5. He, K., X. Zhang, S. Ren, and J. Sun, 2016. Deep residual learning for image recognition, Proc. of 2016 the IEEE conference on computer vision and pattern recognition, Las Vegas, NV, Jun. 27-30, Vol. 1, pp. 770-778.
  6. Huang, L., Y. Yang, Y. Deng, and Y.J. Yu, 2015. Densebox: Unifying landmark localization with end to end object detection, arXiv preprint, arXiv(1509.04874): 1-13.
  7. Iglovikov, V. and A.J. Shvets, 2018. Ternausnet: U-net with vgg11 encoder pre-trained on imagenet for image segmentation, arXiv preprint, arXiv(1801.05746): 1-5.
  8. Ji, S., S. Wei, and M. Lu, 2019. A scale robust convolutional neural network for automatic building extraction from aerial and satellite imagery, International Journal of Remote Sensing, 40(9): 3308-3322. https://doi.org/10.1080/01431161.2018.1528024
  9. Kingma, D.P. and J. Ba, 2014. Adam: A method for stochastic optimization, arXiv preprint, arXiv(1412.6980): 1-15.
  10. Law, H. and J. Deng, 2018. Cornernet: Detecting objects as paired keypoints, Proc. of 2018 the European Conference on Computer Vision (ECCV), Munich, GER, Sep. 8-15, pp. 734-750.
  11. Lee, Y. and J. Park, 2020. Center Mask: Real-time anchor-free instance segmentation, Proc. of 2020 the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, Jun. 13-19, Vol. 1, pp. 13906-13915.
  12. Li, M., F. Lafarge, and R. Marlet, 2020. Approximating shapes in images with low-complexity polygons, Proc. of 2020 the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, Jun. 13-19, pp. 8633-8641.
  13. Li, Z., J.D. Wegner, and A. Lucchi, 2019. Topological map extraction from overhead images, Proc. of 2019 the IEEE International Conference on Computer Vision, Seoul, KOR, Oct. 27-Nov. 2, pp. 1715-1724.
  14. Lin, T.-Y., P. Dollar, R. Girshick, K. He, B. Hariharan, and S. Belongie, 2017. Feature pyramid networks for object detection, Proc. of 2017 the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, Jul. 21-26, pp. 2117-2125.
  15. Lin, T.-Y., P. Goyal, R. Girshick, K. He, and P. Dollar, 2017. Focal loss for dense object detection, Proc. of 2017 the IEEE International Conference on Computer Vision, Venice, ITA, Oct. 22-29, pp. 2980-2988.
  16. Long, J., E. Shelhamer, and T. Darrell, 2015, Fully convolutional networks for semantic segmentation, Proc. of 2015 the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, JUN. 7-12, pp. 3431-3440.
  17. Luo, C., X. Chu, and A. Yuille, 2018. Orinet: A fully convolutional network for 3d human pose estimation, arXiv preprint, arXiv(1811.04989): 1-14.
  18. Pasquali, G., G.C. Iannelli, and F.J. Dell'Acqua, 2019. Building Footprint Extraction from Multispectral, Spaceborne Earth Observation Datasets Using a Structurally Optimized U-Net Convolutional Neural Network, Remote Sensing, 11(23): 2803. https://doi.org/10.3390/rs11232803
  19. Ren, S., K. He, R. Girshick, and J. Sun, 2015. Faster r-cnn: Towards real-time object detection with region proposal networks, Advances in Neural Information Processing Systems, arXiv(1506.01497): 91-99.
  20. Sohn, G., I. Dowman, and R. Sensing, 2007. Data fusion of high-resolution satellite imagery and LiDAR data for automatic building extraction, ISPRS Journal of Photogrammetry and Remote Sensing, 62(1): 43-63. https://doi.org/10.1016/j.isprsjprs.2007.01.001
  21. Tian, Z., C. Shen, H. Chen, and T. He, 2019. FCOS: Fully convolutional one-stage object detection, Proc. of the IEEE International Conference on Computer Vision, Seoul, KOR, Oct. 27-Nov. 2, pp. 9627-9636.
  22. Van Etten, A., D. Lindenbaum, and T. Bacastow, 2018. SpaceNet: A remote sensing dataset and challenge series, arXiv preprint, arXiv(1807.01232): 1-21.
  23. Wei, F., X. Sun, H. Li, J. Wang, and S. Lin, 2020, Point-set anchors for object detection, instance segmentation and pose estimation, Proc. of 2020 European Conference on Computer Vision, Glasgow, UK, Aug. 23-28, pp. 527-544. https://doi.org/10.1007/978-3-030-58607-2_31
  24. Wu, Y., A. Kirillov, F. Massa, W.-Y. Lo, and R. Girshick, 2019, Detectron2, https://github.com/facebookresearch/detectron2, Accessed on Nov. 6, 2020.
  25. Xu, Y., L. Wu, Z. Xie, and Z. Chen, 2018. Building extraction in very high resolution remote sensing imagery using deep learning and guided filters, Remote Sensing, 10(1): 144. https://doi.org/10.3390/rs10010144
  26. Zhang, L., J. Wu, Y. Fan, H. Gao, and Y. Shao, 2020. An Efficient Building Extraction Method from High Spatial Resolution Remote Sensing Images Based on Improved Mask R-CNN, Sensors, 20(5): 1465. https://doi.org/10.3390/s20051465
  27. Zhao, K., J. Kang, J. Jung, and G. Sohn, 2018, Building Extraction From Satellite Images Using Mask R-CNN With Building Boundary Regularization, Proc. of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Salt Lake City, UT, Jun. 18-22, pp. 247-251.
