
Aerial Dataset Integration For Vehicle Detection Based on YOLOv4

  • Omar, Wael (Department of Geoinformatics, University of Seoul) ;
  • Oh, Youngon (Department of Geoinformatics, University of Seoul) ;
  • Chung, Jinwoo (Department of Geoinformatics, University of Seoul) ;
  • Lee, Impyeong (Department of Geoinformatics, University of Seoul)
  • Received : 2021.06.10
  • Accepted : 2021.08.24
  • Published : 2021.08.31

Abstract

With the increasing application of UAVs in intelligent transportation systems, vehicle detection in aerial images has become an essential engineering technology with clear academic research significance. In this paper, a vehicle detection method for aerial images based on the YOLOv4 deep learning algorithm is presented. At present, the best-known datasets are VOC (The PASCAL Visual Object Classes Challenge), ImageNet, and COCO (Microsoft Common Objects in Context), but they are not well suited to vehicle detection from UAVs. The value of an integrated dataset lies not only in its quantity and photo quality but also in its diversity, which affects detection accuracy. The proposed method integrates three public aerial image datasets, VAID, UAVDT, and DOTA, into a form suitable for YOLOv4. The trained model shows good test results, especially for small objects, rotated objects, and compact, dense objects, and meets real-time detection requirements. For future work, we will integrate one more aerial image dataset acquired by our lab to increase the number and diversity of training samples while still meeting the real-time requirements.


1. Introduction

With the rapid growth of information technology in recent years, smart transportation networks have become an important and unavoidable trend in modern traffic management. Vehicle detection, as a core technology of the smart transport system, is the basis for many important functions (Qiu, 2014): estimating traffic flow and density, locating and monitoring vehicles, and mining traffic data, among others.

At the same time, UAV (Unmanned Aerial Vehicle) technology has matured and the industry has become widespread, offering platforms that are lightweight, flexible, and inexpensive. Aerial images from UAVs therefore offer a tremendous advantage in many fields of application.

In addition, vehicle detection in aerial images plays an important role in engineering applications. These applications draw on computer vision, artificial intelligence, image processing, and other disciplines, making them typical interdisciplinary research problems; they are therefore also of great importance to academic research.

This paper introduces a vehicle detection method for aerial images based on the YOLOv4 deep learning algorithm: we integrate three public aerial image datasets, unify their class names and annotation format, and optimize the network configuration for the YOLOv4 training process. This integration yields a slight improvement over relatively similar research based on YOLOv3 (Lu, 2018), which reported a mean Average Precision (mAP) score of 76.7%.

2. Research Purpose

Airborne vehicle detection is essential for various applications, such as large-scale traffic monitoring, parking lot usage analysis, urban planning, disaster management, and search and rescue operations. Aerial photos with a wide field of view can quickly provide valuable information about large open spaces (Ajay, 2017). Due to the substantial increase in the number of vehicles, traffic management and control have become more complex, especially in cities. The main socioeconomic consequences of traffic problems, such as air pollution, traffic congestion, and health problems, have driven the development of new automatic algorithms and the collection of sufficient traffic data (Lewandowski, 2018). Detection algorithms based on aerial photos can usually provide information on the location, number, and type of vehicles in various traffic scenarios at low cost.

Most of the proposed methods transfer object detection algorithms developed for natural scene images to aerial photos, for lack of a large-format aerial image dataset; the underlying networks were trained on large natural scene datasets such as ImageNet (Deng, 2009), MSCOCO (Lin, 2014), and PASCAL VOC (Everingham, 2010).

Other datasets, such as VEDAI (Razakarivony and Jurie, 2016), COWC (Mundhenk, 2016), DLR 3K Munich Vehicle (Liu and Mattyus, 2015), and UCAS-AOD (Zhu, 2015), mostly focus on vehicle detection but include only a limited number of annotated vehicles. To advance vehicle detection research, including detection, counting, and tracking, we provide three customized aerial image datasets for real-time vehicle detection.

3. Related Work

The widely used vehicle detection methods proposed by domestic and foreign scholars fall mainly into three categories: those based on motion information, on features, and on template matching. Background subtraction and registration methods have been used to detect moving vehicles (Cheng, 2009), and others detect vehicles in aerial images with a median background difference method (Azevedo, 2014). However, because aerial video features complex scenes and diverse objects, these two methods, although they detect moving objects, do not achieve the desired accuracy, and false and missed detections remain severe. To track and classify vehicles on highways, other scholars combined Haar features and AdaBoost (Sivaraman, 2010). A vehicle detection method based on Histogram of Oriented Gradients (HOG) features and a Support Vector Machine (SVM) was suggested to detect vehicles on urban roads (Tehrani, 2012). These two methods increase detection accuracy, but since traditional machine learning approaches only support training with a limited amount of data, they still fall short on the diversity of detected vehicles.

Recently, with advances in computer hardware, particularly the Graphics Processing Unit (GPU), deep learning algorithms have developed rapidly, especially for problems in pattern recognition and image processing, where they are more effective and accurate than conventional algorithms. Therefore, to achieve vehicle detection, this paper utilizes a deep learning algorithm, YOLOv4.

Different techniques have been proposed in previous studies to solve the problem of car detection in aerial images and similar related problems. The main challenge is the small size and high density of objects in the aerial view, which may lead to information loss. Each type of aerial imagery (fixed CCTV cameras, satellite, or UAV) brings its own challenges, due to differing angles and resolutions. In this section, we present the most recent and relevant works in vehicle detection for each of these three imagery types, and the added value of our work.

1) Fixed Surveillance Cameras

Vehicle detection from overhead surveillance images was addressed by Xi et al. (2019). They proposed a multi-task approach based on the Faster R-CNN algorithm that subdivides the object detection task into simpler subtasks. This approach deals with objects at enlarged scales, thus improving the detection accuracy of the small objects that appear frequently in aerial views. Moreover, a cost-sensitive loss gives more weight to objects that are occluded or hard to detect against a complex background. Their method relies on a private dataset collected from surveillance cameras installed on top of parking lot buildings; however, it has not been tested on other public datasets or UAV images. In a similar application, Kim and Oghaz (2018) compared different implementations of CNN-based object detectors, namely YOLO and the Single Shot MultiBox Detector (SSD) (Liu et al., 2016). They applied these algorithms to the problem of pedestrian detection, training and testing them on an in-house dataset composed of images captured by surveillance cameras in retail stores. They found that YOLOv3 (with a 416 input size) and SSD (with a VGG-500 feature extractor) provide the better tradeoff between accuracy and response latency.

2) Satellite Imagery

To solve the vehicle detection problem in Google Earth images, Chen et al. (2014) applied a technique based on a Hybrid Deep Convolutional Neural Network (HDNN) and a sliding-window search. Particular feature maps of the CNN (the last convolutional layer and max-pooling layer) are split into blocks of variable receptive field sizes, which makes it possible to extract features at different scales. Moreover, they modified the sliding windows to include the essential parts of the vehicle to be detected. They thereby improved the detection rate compared with the traditional deep architectures of the time, but at a higher execution time.

3) UAV Imagery

Fewer works have addressed the problem of vehicle detection from UAV images. Ammour et al. (2017) applied a pre-trained CNN with a linear support vector machine (SVM) classifier to detect and count vehicles in high-resolution UAV images of complex urban areas. The VGG16 CNN model (Simonyan and Zisserman, 2015) is applied to windows extracted around each candidate region in order to generate descriptive features, which are then classified with a linear binary SVM. Finally, they applied fine-grained morphological post-processing to smooth the detected regions. These techniques performed well on the testing dataset (5 images containing 127 car instances), but the approach lacks real-time capability due to the high computational cost of the mean-shift segmentation stage.

4. YOLO Deep Learning Object Detection Algorithm

YOLO, proposed in 2015 by Joseph Redmon and others (Redmon, 2016), is a CNN-based (Convolutional Neural Network) real-time object detection system. In 2017, Joseph Redmon and Ali Farhadi published YOLO v2 at CVPR (the Conference on Computer Vision and Pattern Recognition), improving the precision and speed of the algorithm (Redmon, 2017). In April 2018, they presented YOLO v3, which further improved object detection performance (Redmon, 2018). YOLO v4, proposed in 2020, enables anyone with a 1080 Ti or 2080 Ti GPU to train a super-fast and accurate object detector (Bochkovskiy, 2020). It modifies state-of-the-art methods to make them more efficient and suitable for single-GPU training. Following this line of development, this chapter presents the basic concepts of the YOLO algorithm.

1) YOLO v1

YOLO divides the input image into S×S grid cells. If the center of an object's GT (Ground Truth) box falls into a cell, that cell is responsible for detecting the object. YOLO's innovation is that it reforms the region-proposal detection framework: the Region-Based Convolutional Neural Network (R-CNN) series must generate region proposals before completing classification and regression, and because those proposals overlap, much work is repeated. YOLO instead predicts, in a single pass, the bounding boxes of the objects in all grid cells, the confidence of each location, and the class probability vectors, thus solving the problem in one shot.
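
To make the one-shot output layout concrete, the minimal sketch below computes the shape of the YOLO v1 prediction tensor. The values S = 7, B = 2, and C = 20 are the configuration of the original YOLO paper and are assumptions here, not values taken from this study.

```python
# Minimal sketch of the YOLO v1 one-shot output layout, assuming the
# original paper's configuration: S = 7 grid cells per side, B = 2
# boxes per cell, and C = 20 classes (not values from this study).
# Each cell predicts B * 5 box values (x, y, w, h, confidence) plus
# C class probabilities, all in a single forward pass.
S, B, C = 7, 2, 20

values_per_cell = B * 5 + C              # 2 * 5 + 20 = 30
output_shape = (S, S, values_per_cell)   # (7, 7, 30)

print("output tensor shape:", output_shape)
print("predictions per image:", S * S * values_per_cell)  # 1470
```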

Simply put, the YOLO network structure borrows from GoogLeNet, the difference being that YOLO uses a 1×1 convolutional layer followed by a 3×3 convolutional layer instead of the inception module. The network structure of YOLOv1 consists of 24 convolutional layers and 2 fully connected layers, as shown in Fig. 2.


Fig. 1. Datasets integration flowchart.


Fig. 2. YOLO v1 network structure.

2) YOLO v2

Compared with region-proposal-based methods such as Faster R-CNN, YOLO v1 has a much larger positioning error and a lower recall score. The main enhancements of YOLO v2 therefore aim to improve recall and positioning, and include batch normalization, anchor boxes, and multi-scale training.

Batch normalization is a popular training technique: by adding a batch normalization layer after each layer, an entire batch of data can be normalized to a space with mean 0 and variance 1, which helps avoid vanishing as well as exploding gradients and makes the network converge faster. In YOLO v1, a fully connected layer after the convolutional layers predicts the coordinates of the bounding boxes, and the size of the input image for the training network is fixed; YOLO v2 instead adjusts the size of the input image every 10 batches during training. These adjustments give the model a good detection effect when tested on multi-scale input images.
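
A minimal sketch of this multi-scale scheme follows, assuming the resolutions are the multiples of 32 between 320 and 608 used in the YOLO v2 paper; the helper name and seeding are illustrative only, not the original code.

```python
import random

# Sketch of YOLO v2-style multi-scale training: a new square input
# resolution is drawn every 10 batches from the multiples of 32 in
# [320, 608], as in the YOLO v2 paper.
SCALES = [320 + 32 * i for i in range(10)]  # 320, 352, ..., 608

def input_size_for_batch(batch_idx: int, seed: int = 0) -> int:
    """Return the input resolution used for this batch; a new size is
    drawn once per 10-batch block and reused within the block."""
    rng = random.Random(seed + batch_idx // 10)
    return rng.choice(SCALES)

for b in (0, 10, 20, 30):
    size = input_size_for_batch(b)
    print(f"batch {b:2d}: train at {size}x{size}")
```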

3) YOLO v3

This model is more complex than the previous version, but its detection of small objects and of compact, dense, or highly overlapping objects is much better. YOLO v3 introduced several improvements: 1) it replaces the softmax loss of YOLO v2 with a logistic loss; when the predicted objects are complex, and especially when many labels in the dataset overlap, logistic regression is more reliable; 2) it uses nine anchors instead of YOLO v2's five, which efficiently improves the IoU; 3) it detects at three scales, which significantly helps small object detection, whereas YOLO v2 detects at only one; 4) it improves detection accuracy by deepening the network, replacing the Darknet-19 backbone of YOLO v2 with Darknet-53.
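
The sketch below illustrates the three-scale, nine-anchor layout from points 2) and 3). The anchor sizes are the COCO defaults from the YOLO v3 paper and the 416 input size follows the text above; treat both as assumptions rather than values from this study.

```python
# Sketch of the three-scale detection layout in YOLO v3, assuming the
# COCO default anchors from the YOLO v3 paper and a 416 x 416 input.
ANCHORS = [(10, 13), (16, 30), (33, 23),       # finest scale,   stride 8
           (30, 61), (62, 45), (59, 119),      # middle scale,   stride 16
           (116, 90), (156, 198), (373, 326)]  # coarsest scale, stride 32

INPUT_SIZE = 416

# Each detection head predicts on a (416 / stride)^2 grid with three
# anchors; the fine 52 x 52 grid is what helps small-object detection.
for stride, anchors in zip((8, 16, 32),
                           (ANCHORS[0:3], ANCHORS[3:6], ANCHORS[6:9])):
    grid = INPUT_SIZE // stride
    print(f"stride {stride:2d}: {grid}x{grid} grid, anchors {anchors}")
```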

4) YOLO v4

The majority of CNN-based object detectors are largely suitable only for recommendation systems. For example, searching for free parking spaces via urban video cameras relies on slow, precise models, whereas vehicle collision warning relies on fast, less accurate models. Improving the accuracy of a real-time object detector allows it to be used not only for hint-producing recommendation systems but also for stand-alone process management with reduced human intervention. Running real-time object detectors on conventional Graphics Processing Units (GPUs) enables their mass use at an affordable cost. The most accurate modern neural networks do not run in real-time and require a large number of GPUs for training with large mini-batch sizes (Bochkovskiy, 2020). YOLO v4 addresses these problems by creating a CNN that operates in real-time on a conventional GPU, and whose training requires only one GPU.


Fig. 3. Comparison of YOLOv4 and other state-of-the-art object detectors.

YOLOv4 runs twice as fast as EfficientDet with comparable performance, and improves YOLOv3's AP and FPS by 10% and 12%, respectively, as shown in Fig. 3.

This efficiency in object detection means that anyone can use a 1080 Ti or 2080 Ti GPU to train a very fast and accurate object detector.

5. Public Datasets for YOLO Training

Classifiers trained on conventional datasets perform poorly on aerial images, because aerial images have several special characteristics, as follows:

1) Scale diversity: the shooting altitude of UAVs ranges from tens of meters to kilometers, resulting in a wide range of sizes for similar objects on the ground.

2) Perspective specificity: aerial images are generally high-altitude, downward-looking views, while most common datasets use ground-level perspectives.

3) Small objects: objects in aerial images typically occupy only a few dozen or even a few pixels, so they carry correspondingly little information.

4) Multi-directionality: aerial images are taken from a bird's-eye view, so object orientations are arbitrary (whereas objects in conventional datasets tend to have consistent orientations; pedestrians, for example, are generally upright).

5) High background complexity: aerial images have a large field of view (usually covering a few square kilometers) and may contain a wide variety of backgrounds, which can cause strong interference and false detections.

For the above reasons, it is often difficult to train an optimal object detection classifier using conventional image datasets; a specialized aerial image dataset is necessary. In this paper, we introduce the three public aerial image datasets that we used and processed. The resulting optimized aerial image dataset is suitable for YOLOv4 training. This chapter presents the essential information about these three datasets.

1) UAVDT Dataset

The UAVDT benchmark (Unmanned Aerial Vehicle Benchmark: Object Detection and Tracking) consists of 100 video sequences selected from over 10 hours of footage shot with a UAV platform at a variety of urban sites, representing typical scenes such as squares, arterial streets, toll stations, highways, bridges, and T-junctions. The videos were taken at 30 frames per second (fps), with a JPEG image resolution of 1080 × 540 pixels. Annotation errors were reduced as far as possible. Specifically, about 80,000 frames in the UAVDT benchmark dataset are annotated with over 2,700 vehicles and 0.84 million bounding boxes (Yu, 2019). Following PASCAL VOC practice (Everingham, 2014), regions that cover vehicles that are too small or unclear due to low resolution are ignored in each frame. Three altitude levels are annotated: at low altitude (10 m ~ 30 m), more object details are captured and an object may occupy a larger area, e.g., 22.6% of the pixels of a frame; at medium altitude (30 m ~ 70 m), more angles and views are acquired; at higher altitudes (> 70 m), vehicles appear with less clarity. The dataset thus constitutes a large-scale, challenging UAV Detection and Tracking (UAVDT) benchmark for three significant tasks: object DETection (DET), Single Object Tracking (SOT), and Multiple Object Tracking (MOT).


Fig. 4. Examples of the manually annotated frames in the UAVDT benchmark. The three rows represent the DET, MOT, and SOT tasks, respectively. The shooting conditions of UAVs are presented in the lower right corner. The pink areas represent the ignored regions in the dataset. Different bounding box colors denote different classes of vehicles.

2) VAID Dataset

VAID (Vehicle Aerial Imaging from Drone) is a new vehicle detection dataset of aerial images captured by a drone. It consists of about 6,000 aerial images taken under different illumination conditions and viewing angles at different sites in Taiwan. The images have a resolution of 1137 × 640 pixels in JPG format. The dataset contains seven classes of vehicle, namely 'sedan', 'minibus', 'truck', 'pickup truck', 'bus', 'cement truck', and 'trailer'. Fig. 6 shows some example images from the VAID dataset. The vehicles are much smaller than the objects in general recognition and classification datasets: each image contains 5 vehicles on average, and a vehicle occupies about 0.7% of an image. The annotation of each sample includes the sample class, the center point coordinates, the direction, and the four corner point coordinates of the ground truth. The targets in VAID are relatively easy to identify, as most of the vehicles in the images are sparsely distributed in an uncluttered environment (Lin, 2020).


Fig. 5. In the VAID dataset, the common vehicles are classified into 7 categories: (a) sedan, (b) minibus, (c) truck, (d) pickup truck, (e) bus, (f) cement truck, and (g) trailer. Sample images are shown from left to right accordingly.

3) DOTA

DOTA (Dataset for Object Detection in Aerial images) is an aerial image dataset created by Xia Guisong of Wuhan University, Bai Xiang of Huazhong University of Science and Technology, and others (Xia, 2018). Because different sensors inevitably introduce deviations, the original material was gathered from various platforms (such as Google Earth). DOTA is characterized as multi-sensor and multi-resolution; specifically, the GSDs of the images vary. DOTA contains a total of 2,806 images of approximately 4000 × 4000 pixels, manually annotated with 15 classes of samples ('plane', 'ship', 'storage tank', 'baseball diamond', 'tennis court', 'swimming pool', 'ground track field', 'harbor', 'bridge', 'large vehicle', 'small vehicle', 'helicopter', 'roundabout', 'soccer ball field', and 'basketball court') for a total of 188,282 instances. Each annotated sample includes the sample class and the coordinates of the 4 corners of the GT (with the top-left corner as the starting point, arranged in clockwise order).


Fig. 6. Some aerial images in the VAID dataset, captured using a drone, covering different road types and traffic scenes.


Fig. 7. Samples of annotated images in DOTA. It shows three samples per category, except six for large vehicle.

4) LSM

LSM (Lab of Sensors and Modelling, University of Seoul) is a dataset prepared for our detection, classification, and tracking experiments. Eight videos at 25 frames per second were acquired with a Mavic Pro drone using a visual camera with a 35 mm lens, for a total flight duration of 1 hour in Seoul, South Korea. We annotated 1,002 orthoimages of 4000 × 3000 pixels for 4 classes (car, truck, heavy truck, and bus). The altitude for each flight was 140 meters above the ground, from the same position. The dataset was split evenly into a training and a test set: 501 images for training and 501 for testing.


Fig. 8. Samples of annotated images in LSM.

6. A Vehicle Detection Method for Aerial Images Based on YOLO v4

In this paper, we integrate the three public aerial image datasets described above and modify the parameters of the YOLO v4 network configuration to train the model. Moreover, we unified the class names into 4 classes representing different vehicle categories ('car', 'truck', 'heavy truck', 'bus'). We therefore propose a four-class vehicle detection method for aerial images, with the specific steps as follows.

1) Make Standard Datasets for YOLO v4 Training

The standard training dataset for YOLO v4 consists of two parts, images and labels, where images are in JPEG format and labels are in text format. Labels and images are in one-to-one correspondence: each label file records the annotated samples of the corresponding image. The annotation format is: class, the GT's center point coordinates (x, y), and the GT's width and height (w, h), where (x, y, w, h) are normalized values; when there are multiple samples in one image, each occupies its own line. Since the input dimension of the YOLO v4 training network is 416 × 416 × 3, the image size used for training should not be too large; otherwise, sample characteristics may be seriously degraded after resizing. The basic information of the three public aerial image datasets described in Chapter 5 is shown in Table 1.

Table 1. The basic information of the three public aerial image datasets


We process the above three datasets separately.

1) UAVDT

a) Merge the dataset's 4 parts

b) Split them randomly into test and training sets (20%, 80%)

c) Convert labels to YOLO format

d) Keep the desired classes only (car, truck, heavy_truck, bus)

e) Calculate width and height from the coordinates of the GT's 4 corners (a conversion sketch follows the equation):

w = xmax − xmin, h = ymax − ymin (1)
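
As a hedged illustration of steps c) and e), the sketch below converts a corner-format GT box into one YOLO-format label line using Eq. (1) and the normalization described in step 1) above; the sample coordinates and class id are made up for illustration.

```python
def corners_to_yolo(x_min, y_min, x_max, y_max, img_w, img_h, class_id):
    """Turn a corner-format GT box into one YOLO-format label line.

    Applies Eq. (1), w = xmax - xmin and h = ymax - ymin, then
    normalizes the center point and size by the image dimensions,
    matching the label layout described in step 1).
    """
    w = x_max - x_min
    h = y_max - y_min
    cx = x_min + w / 2.0
    cy = y_min + h / 2.0
    return (f"{class_id} {cx / img_w:.6f} {cy / img_h:.6f} "
            f"{w / img_w:.6f} {h / img_h:.6f}")

# Hypothetical vehicle in a 1080 x 540 UAVDT frame (made-up numbers):
print(corners_to_yolo(500, 200, 560, 230, 1080, 540, class_id=0))
# -> "0 0.490741 0.398148 0.055556 0.055556"
```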

The UAVDT benchmark dataset includes more than 2,700 vehicles with 0.84 million bounding boxes; as in PASCAL VOC, areas containing vehicles that are too small or of too low resolution are ignored in each frame. The footage covers daylight, night, and fog conditions; in particular, videos shot in sunlight introduce shadow interference.

Table 2. The number of each category vehicle in the UAVDT dataset


2) VAID

a) Convert labels to YOLO format

b) Merge the 7 vehicle categories (sedan, minibus, truck, pickup truck, bus, cement truck, and trailer) into our desired classes only (car, truck, heavy_truck, bus); a mapping sketch follows below

c) Split them randomly into test and training sets (20%, 80%)

The dataset was manually annotated with objects of 7 classes and a total of 2,950 samples. Each image contains an average of 5 vehicles, and a vehicle occupies approximately 0.7% of the image. The annotations for each sample include the sample class, the coordinates of the center point, the direction, and the coordinates of the four vertices of the ground truth.
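
A minimal sketch of the class merge in step b) is given below. The dictionary reflects our reading of the seven VAID category names; the exact assignment is an assumption for illustration, not a mapping published with the dataset.

```python
# Sketch of the VAID class merge from step b). The assignment below
# is an assumed reading of the seven category names, not a published
# mapping.
VAID_TO_TARGET = {
    "sedan": "car",
    "minibus": "bus",
    "bus": "bus",
    "pickup truck": "truck",
    "truck": "truck",
    "cement truck": "heavy_truck",
    "trailer": "heavy_truck",
}

def remap(vaid_label: str) -> str:
    """Map a VAID category to one of the four target classes."""
    return VAID_TO_TARGET[vaid_label]

print(remap("cement truck"))  # heavy_truck
```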

Table 3. The number of each category vehicle in the VAID dataset


3) DOTA

a) Map DOTA's "small vehicle" and "large vehicle" classes to "car" and "heavy truck", respectively, and delete all annotations of the other 13 classes from the labels; after processing, the new dataset information is shown in Table 4.

b) Select image size 3000 × 3000 ~ 4000 × 4000

DOTA contains more than 188,000 instances with widely varying scales, orientations, and shapes, annotated with oriented quadrilaterals rather than the commonly used axis-aligned boxes. Although the dataset is large in both the number of images and the number of instances per image, of its 15 categories only the two vehicle classes (large and small) are of use for our purpose, so it is not directly suitable for object detection in our application. DOTA comprises 2,806 aerial images in total; in our experiments, we map its vehicle objects to two classes (car and heavy truck), as sketched below.
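
The sketch below illustrates this preprocessing step, assuming the public DOTA label layout of eight corner coordinates followed by a category name and a difficulty flag; the class mapping mirrors the text above and the sample line is made up.

```python
# Sketch of the DOTA preprocessing step a): keep only the two vehicle
# categories and collapse each 4-corner (oriented) annotation into an
# axis-aligned box. The label layout (8 coordinates, category name,
# difficulty flag) is assumed from the public DOTA description.
KEEP = {"small-vehicle": "car", "large-vehicle": "heavy_truck"}

def parse_dota_line(line: str):
    parts = line.split()
    xs = [float(v) for v in parts[0:8:2]]   # x1, x2, x3, x4
    ys = [float(v) for v in parts[1:8:2]]   # y1, y2, y3, y4
    category = parts[8]
    if category not in KEEP:
        return None                          # one of the 13 dropped classes
    # Axis-aligned bounds of the oriented quadrilateral:
    return KEEP[category], min(xs), min(ys), max(xs), max(ys)

line = "939 96 1069 96 1069 141 939 141 small-vehicle 0"  # made-up sample
print(parse_dota_line(line))  # ('car', 939.0, 96.0, 1069.0, 141.0)
```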

Table 4. The number of each category vehicle in the DOTA dataset


4) LSM

The LSM dataset was annotated with a tool that labels object bounding boxes with IDs across videos and images. We annotated a small part of it for our testing purposes.

Table 5. The actual class numbers in the LSM dataset are used in the training process


2) Summarization of Three Datasets

The three datasets were optimized and preprocessed by merging classes, converting the labeling format, and deleting unnecessary classes; the results are summarized in Table 6.

Table 6. The processed dataset information


3) Configure Network Parameters for YOLO Training

To train our YOLOv4 model, we first need to provision GPU resources for the training job, because YOLOv4 training requirements scale up substantially when larger networks in the family are used.

Table 7. Merged class names according to the research purpose


1) Batch size

We use the YOLO v4 default parameter batch_size = 64.

2) Number of iterations

The dataset contains a total of 7,547 images, so one epoch requires 7547 / 64 ≈ 117.9 iterations. YOLO training defaults to 160 epochs, so the total number of iterations is 160 × 7547 / 64 ≈ 18,867.

3) Learning rate

The initial learning rate is 0.001; it is divided by 10 after 60 epochs, and divided by 10 once again after 90 epochs.

4) Number of filters in the last layer of the network

filters = (classes + 5) × 3 = (5 + 5) × 3 = 30 (the computation is sketched below)
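
The following sketch reproduces the schedule arithmetic of items 1) to 4). The relationships are the standard YOLO configuration rules; the concrete numbers, including the 5 plugged into the filter formula, are the ones stated in the text above.

```python
# Sketch reproducing the schedule arithmetic of items 1)-4). The
# relationships are standard YOLO configuration rules; the concrete
# values are the ones stated in the text above.
num_images = 7547
batch_size = 64          # item 1)
epochs = 160

iters_per_epoch = num_images / batch_size        # ~117.9 iterations
max_batches = int(epochs * iters_per_epoch)      # 18867, item 2)

lr_steps = (int(60 * iters_per_epoch),           # divide lr by 10 here
            int(90 * iters_per_epoch))           # and by 10 again here

filters = (5 + 5) * 3                            # item 4): 30 filters

print(max_batches, lr_steps, filters)            # 18867 (7075, 10612) 30
```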

7. Experimental Results

In this paper, we used an NVIDIA GeForce RTX 2060 graphics card for training; the training duration was about 72 hours. To test the model, we used our Lab of Sensors and Modeling (LSM) dataset, which consists of 1 hour of video. We selected 501 images for testing; these images include our 4 classes, whose distribution is shown in Fig. 9.

Table 8. The test result



Fig. 10. (a) LSM testing image, (b) result on LSM images.

Testing results are shown in Fig. 10(a, b). Image (a) shows one of our LSM images, and the result in Fig. 10(b) shows that the trained model detects small objects well. The vehicles in Fig. 10 are mostly horizontal, with some vertical and rotated vehicles, and the test result shows that the model also performs well on rotated objects; the detection may miss one object in the far lower-right corner, but the model achieves a mean average precision (mAP) score of 79.77% on the LSM testing dataset.


Fig. 9. The distribution of the number of each class in the LSM dataset.

8. Conclusion

Existing datasets such as VEDAI, COWC, DLR 3K Munich Vehicle, and UCAS-AOD mostly focus on vehicle detection but include only a limited number of annotated vehicles. To improve vehicle detection research, including vehicle detection, counting, and tracking, we provided three customized aerial image datasets for real-time vehicle detection. In this paper, a vehicle detection method for aerial images based on the YOLO deep learning algorithm was presented. The method integrates three public aerial image datasets into a form suitable for YOLOv4. The trained model shows good test results, especially for small, rotated, and compact, dense objects; it meets real-time requirements and reached a mean Average Precision (mAP) score of 79.77%. Next, we will integrate one more aerial image dataset acquired by our lab to increase the number and diversity of training samples. While integrating more datasets, we will also adopt and optimize the latest version of the YOLO algorithm to further improve detection accuracy.

Conflict of interest

The authors declare no conflicts of interest regarding the publication of this paper.

Acknowledgments

This research was supported by a grant (20200495) from the advancement of equipment to respond to illegal ships technology development funded by the Korea Institute of Marine Science & Technology Promotion.

References

  1. Ajay, A., V. Sowmya, and K.P. Soman, 2017. Vehicle detection in aerial imagery using eigen features, Proc. of 2017 International Conference on Communication and Signal Processing (ICCSP), Chennai, IN, Apr. 6-8, pp. 1620-1624. https://doi.org/10.1109/ICCSP.2017.8286664
  2. Ammour, N., H. Alhichri, Y. Bazi, B. Benjdira, N. Alajlan, and M. Zuair, 2017. Deep Learning Approach for Car Detection in UAV Imagery, Remote Sensing, 9(4): 312. https://doi.org/10.3390/rs9040312
  3. Azevedo, C.L., J.L. Cardoso, M. Ben-Akiva, J.P. Costeira, and M. Marques, 2014. Automatic Vehicle Trajectory Extraction by Aerial Remote Sensing, Procedia - Social and Behavioral Sciences, 111: 849-858. https://doi.org/10.1016/j.sbspro.2014.01.119
  4. Bochkovskiy, A., C.Y. Wang, and H.Y.M. Liao, 2020. YOLOv4: Optimal speed and accuracy of object detection, arXiv preprint, arXiv: 2004.10934v1.
  5. Chen, X., S. Xiang, C.-L. Liu, and C.-H. Pan, 2014. Vehicle Detection in Satellite Images by Hybrid Deep Convolutional Neural Networks, IEEE Geoscience and Remote Sensing Letters, 11(10): 1797-1801. https://doi.org/10.1109/LGRS.2014.2309695
  6. Cheng, P.Z., 2009. Detecting and Counting Vehicles from Small Low-Cost UAV Images, Proc. of ASPRS 2009 Annual Conference, Baltimore, MD, Mar. 9-13, pp. 1-7.
  7. Deng, J., W. Dong, R. Socher, L.J. Li, K. Li, and L. Fei-Fei, 2009. ImageNet: A large-scale hierarchical image database, Proc. of 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, US, Jun. 20-25, pp. 248-255.
  8. Everingham, M., S.M.A. Eslami, L.V. Gool, C.K.I. Williams, J. Winn, and A. Zisserman, 2014. The Pascal Visual Object Classes Challenge: A Retrospective, International Journal of Computer Vision, 111: 98-136. https://doi.org/10.1007/s11263-014-0733-5
  9. Everingham, M., L.V. Gool, C.K.I. Williams, J. Winn, and A. Zisserman, 2010. The Pascal Visual Object Classes (VOC) Challenge, International Journal of Computer Vision, 88(2): 303-338. https://doi.org/10.1007/s11263-009-0275-4
  10. Kim, C.E., M.M.D. Oghaz, J. Fajtl, V. Argyriou, and P. Remagnino, 2018. A comparison of embedded deep learning methods for person detection, arXiv preprint, arXiv: 1812.03451.
  11. Lewandowski, M., B. Placzek, M. Bernas, and P. Szymala, 2018. Road traffic monitoring system based on mobile devices and Bluetooth low energy beacons, Wireless Communications and Mobile Computing, 2018: 1-12. https://doi.org/10.1155/2018/3251598
  12. Lin, H.-Y., K.-C. Tu, and C.-Y. Li, 2020. VAID: An Aerial Image Dataset for Vehicle Detection and Classification, IEEE Access, 8: 212209-212219. https://doi.org/10.1109/ACCESS.2020.3040290
  13. Lin, T.-Y., M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollar, and C.L. Zitnick, 2014. Microsoft COCO: Common Objects in Context, Proc. of European Conference on Computer Vision (ECCV) 2014.
  14. Liu, K. and G. Mattyus, 2015. Fast multiclass vehicle detection on aerial images, IEEE Geoscience and Remote Sensing Letters, 12(9): 1938-1942. https://doi.org/10.1109/LGRS.2015.2439517
  15. Liu, W., D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.Y. Fu, and A.C. Berg, 2016. SSD: Single Shot MultiBox Detector, Proc. of European Conference on Computer Vision 2016, Cham, CH, Oct. 8-16, pp. 21-37.
  16. Lu, J., C. Ma, L. Li, X. Xing, Y. Zhang, Z. Wang, and J. Xu, 2018. A Vehicle Detection Method for Aerial Image Based on YOLO, Journal of Computer and Communications, 6(11): 98-107. https://doi.org/10.4236/jcc.2018.611009
  17. Qiu, Y., 2014. Video-Based Vehicle Detection in Intelligent Transportation System, Master's Thesis, Jilin University, Changchun, CN.
  18. Razakarivony, S. and F. Jurie, 2016. Vehicle detection in aerial imagery: A small target detection benchmark, Journal of Visual Communication and Image Representation, 34: 187-203. https://doi.org/10.1016/j.jvcir.2015.11.002
  19. Redmon, J. and A. Farhadi, 2017. YOLO9000: Better, Faster, Stronger, Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, US, Jul. 21-26, Vol. 1, pp. 6517-6525.
  20. Redmon, J. and A. Farhadi, 2018. YOLOv3: An Incremental Improvement, arXiv preprint, arXiv: 1804.02767.
  21. Redmon, J., S. Divvala, R. Girshick, and A. Farhadi, 2016. You Only Look Once: Unified, Real-Time Object Detection, Proc. of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, US, Jun. 27-30, Vol. 1, pp. 779-788.
  22. Simonyan, K. and A. Zisserman, 2015. Very Deep Convolutional Networks for Large-Scale Image Recognition, arXiv preprint, arXiv: 1409.1556.
  23. Sivaraman, S. and M.M. Trivedi, 2010. A General Active-Learning Framework for On-Road Vehicle Recognition and Tracking, IEEE Transactions on Intelligent Transportation Systems, 11(2): 267-276. https://doi.org/10.1109/TITS.2010.2040177
  24. Mundhenk, T.N., G. Konjevod, W.A. Sakla, and K. Boakye, 2016. A large contextual dataset for classification, detection and counting of cars with deep learning, Proc. of European Conference on Computer Vision 2016, Amsterdam, NL, Oct. 8-16, Vol. 4, pp. 740-755. https://doi.org/10.1007/978-3-319-46487-9_48
  25. Tehrani Niknejad, H., A. Takeuchi, S. Mita, and D. McAllester, 2012. On-Road Multivehicle Tracking Using Deformable Object Model and Particle Filter with Improved Likelihood, IEEE Transactions on Intelligent Transportation Systems, 13(2): 748-758. https://doi.org/10.1109/TITS.2012.2187894
  26. Xi, X., Z. Yu, Z. Zhan, and C. Tian, 2019. Multi-task Cost-sensitive-Convolutional Neural Network for Car Detection, IEEE Access, 7: 98061-98068. https://doi.org/10.1109/ACCESS.2019.2927866
  27. Xia, G.-S., X. Bai, J. Ding, Z. Zhu, S. Belongie, J. Luo, M. Datcu, M. Pelillo, and L. Zhang, 2018. DOTA: A large-scale dataset for object detection in aerial images, Proc. of IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, US, Jun. 19-21, pp. 3974-3983.
  28. Yu, H., G. Li, W. Zhang, Q. Huang, D. Du, Q. Tian, and N. Sebe, 2019. The Unmanned Aerial Vehicle Benchmark: Object Detection, Tracking and Baseline, International Journal of Computer Vision, 128(5): 1141-1159. https://doi.org/10.1007/s11263-019-01266-1
  29. Zhu, H., X. Chen, W. Dai, K. Fu, Q. Ye, and J. Jiao, 2015. Orientation robust object detection in aerial images using deep convolutional neural network, Proc. of 2015 IEEE International Conference on Image Processing (ICIP), Quebec, QC, CA, Sep. 27-30, pp. 3735-3739.