BEV perception algorithm: core technology for next-generation autonomous driving
01 What is a BEV perception algorithm?
BEV stands for Bird's-Eye-View, a top-down view of the scene. BEV perception algorithms offer several advantages.
First, the BEV view suffers little from occlusion. Because of perspective effects, real-world objects easily block one another in 2D images, so traditional 2D-based perception methods can only perceive visible targets and are helpless against occluded ones.
In BEV space, temporal information can be fused easily, and the algorithm can use prior knowledge to predict, or "imagine", whether objects exist in occluded areas. Although such "imagined" objects inevitably carry some uncertainty, they are still valuable to the downstream control modules.
In addition, objects in the BEV view show little scale variation, and feeding data with relatively consistent scales into the network yields better perception results.
02 Introduction to BEV perception datasets
2.1 KITTI-360 dataset
KITTI-360 is a large-scale dataset with rich sensory information and complete annotations. It was recorded across several suburbs of Karlsruhe, Germany, over a driving distance of 73.7 km, corresponding to more than 320,000 images and 100,000 laser scans. Static and dynamic 3D scene elements are annotated with coarse bounding primitives, and this information is transferred to the image domain, yielding dense semantic and instance annotations for both 3D point clouds and 2D images.
For data collection, a station wagon was equipped with a 180° fisheye camera on each side and a 90° perspective stereo camera (60 cm baseline) at the front. In addition, a Velodyne HDL-64E and a SICK LMS 200 laser scanning unit were mounted on the roof in a push-broom configuration. This setup is similar to the one used for KITTI, except that the additional fisheye cameras and the push-broom laser scanner provide a full 360° field of view, whereas KITTI offers only perspective images and Velodyne laser scans with a 26.8° vertical field of view. The system is also equipped with an IMU/GPS localization unit. The sensor layout of the collection vehicle is shown in the figure.
Figure 1 KITTI-360 data collection vehicle
2.2 nuScenes dataset
nuScenes is the first large-scale dataset to provide the full sensor suite of an autonomous vehicle: 6 cameras, 1 LiDAR, 5 millimeter-wave radars, plus GPS and IMU. Compared with the KITTI dataset, it contains more than 7 times as many object annotations. The sensor layout of the collection vehicle is shown in the figure.
Figure 2 nuScenes data collection vehicle
03 Classification of BEV perception algorithms
Based on the input data, BEV perception research is mainly divided into three branches: BEV Camera, BEV LiDAR, and BEV Fusion. The figure below gives an overview of the BEV perception family. Specifically, BEV Camera denotes vision-only or vision-centric algorithms that perform 3D object detection or segmentation from multiple surround-view cameras; BEV LiDAR covers detection or segmentation tasks with point cloud input; and BEV Fusion describes mechanisms that fuse multiple sensor inputs, such as cameras, LiDAR, global navigation satellite systems, odometry, high-definition maps, and the CAN bus.
Figure 3 Basic perception algorithm for autonomous driving
As shown in the figure, the basic perception tasks of autonomous driving (classification, detection, segmentation, tracking, etc.) are organized into three levels, with the concept of BEV perception in the middle. Based on different combinations of sensor inputs, basic tasks, and product scenarios, a specific BEV perception algorithm can be formulated accordingly. For example, M2BEV and BEVFormer belong to the visual BEV direction and perform multiple tasks including 3D object detection and BEV map segmentation, while BEVFusion designs a fusion strategy in BEV space to perform 3D detection and tracking from camera and LiDAR inputs simultaneously.
A representative work in the BEV Camera branch is BEVFormer, which implements 3D object detection and map segmentation and achieved SOTA results.
3.1 BEVFormer's pipeline:
1) Backbone + Neck (ResNet-101-DCN + FPN) extracts multi-scale features from the surround-view images;
2) The Encoder module proposed in the paper (containing the Temporal Self-Attention and Spatial Cross-Attention modules) transforms the surround-view image features into BEV features;
3) A Decoder module similar to that of Deformable DETR performs the classification and localization tasks of 3D object detection;
4) Definition of positive and negative samples (using the Hungarian matching common in DETR-style Transformers, which minimizes the total Focal Loss + L1 Loss cost);
5) Loss computation (Focal Loss for classification + L1 Loss for regression);
6) Back-propagation to update the network parameters.
Figure 4 BEVFormer framework diagram
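Steps 4 and 5 above can be sketched as follows. This is a minimal, hypothetical illustration of DETR-style set matching, not BEVFormer's actual implementation: a brute-force search over permutations stands in for the Hungarian algorithm, the classification cost is simplified to the negative predicted probability (rather than the full Focal Loss), and all names, box parameterizations, and weights are illustrative.

```python
from itertools import permutations

def match_predictions(cls_prob, pred_boxes, gt_labels, gt_boxes,
                      cls_weight=2.0, l1_weight=5.0):
    """Find the prediction-to-ground-truth assignment with minimal total
    cost (classification + L1 box regression), as in DETR-style matching.
    Brute force over permutations for clarity; real implementations use
    the O(n^3) Hungarian algorithm instead."""
    num_preds, num_gts = len(pred_boxes), len(gt_boxes)

    def pair_cost(p, g):
        # Classification cost: negative predicted prob of the GT class.
        c_cls = -cls_prob[p][gt_labels[g]]
        # Regression cost: L1 distance between box parameters.
        c_l1 = sum(abs(a - b) for a, b in zip(pred_boxes[p], gt_boxes[g]))
        return cls_weight * c_cls + l1_weight * c_l1

    best, best_cost = None, float("inf")
    for perm in permutations(range(num_preds), num_gts):
        cost = sum(pair_cost(p, g) for g, p in enumerate(perm))
        if cost < best_cost:
            best, best_cost = perm, cost
    # best[g] is the index of the prediction matched to ground truth g;
    # all unmatched predictions become negative (background) samples.
    return list(best)
```

The matched pairs then receive the classification and regression losses of step 5, while unmatched predictions are supervised only toward the background class.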
Fusion algorithms build on both the BEV LiDAR and BEV Camera branches and usually employ a fusion module to merge point cloud and image features; BEVFusion is a representative work.
3.2 BEVFusion's pipeline:
1) Given the different perceptual inputs, modality-specific encoders are first applied to extract their features;
2) The multi-modal features are converted into a unified BEV representation that preserves both geometric and semantic information;
3) The efficiency bottleneck of view transformation is addressed through precomputation and interval reduction, which accelerate the BEV pooling process;
4) A convolution-based BEV encoder is then applied to the unified BEV features to alleviate the local misalignment between different features;
5) Finally, task-specific heads are added to support different 3D scene understanding tasks.
Figure 5 BEV Fusion framework diagram
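The precomputation idea in step 3 can be sketched as follows. This is a simplified, assumed illustration (the real BEVFusion kernel works on GPU tensors with interval reduction): because the camera geometry is fixed, the BEV grid cell of every lifted 3D point can be computed once offline, leaving only a cheap scatter-add at runtime.

```python
def precompute_bev_indices(points, x_range, y_range, resolution):
    """Offline step: map each lifted 3D point (x, y, z) to a flat BEV
    grid cell index. The camera geometry is fixed, so this mapping
    never has to be recomputed at inference time."""
    nx = int((x_range[1] - x_range[0]) / resolution)
    ny = int((y_range[1] - y_range[0]) / resolution)
    indices = []
    for x, y, _z in points:
        ix = int((x - x_range[0]) / resolution)
        iy = int((y - y_range[0]) / resolution)
        if 0 <= ix < nx and 0 <= iy < ny:
            indices.append(iy * nx + ix)
        else:
            indices.append(-1)  # point falls outside the BEV grid
    return indices, nx * ny

def bev_pool(features, indices, num_cells):
    """Runtime step: scatter-add per-point features into their
    precomputed BEV cells (sum pooling over each cell)."""
    grid = [0.0] * num_cells
    for feat, idx in zip(features, indices):
        if idx >= 0:
            grid[idx] += feat
    return grid
```

Scalar features are used here for brevity; in practice each point carries a feature vector and the scatter-add runs per channel.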
04 Advantages and Disadvantages of BEV Sensing Algorithm
At present, industry research on pure-vision perception and prediction usually focuses on an image-view solution to a single sub-problem, such as 3D object detection, semantic map recognition, or object motion prediction, and combines the perception results of the different networks through pre-fusion or post-fusion. As a result, when building the overall system, the sub-modules can only be stacked in a serial structure. Although this approach decomposes the problem and facilitates independent academic research, the serial architecture has several important drawbacks:
1) Model errors in upstream modules are propagated downstream. Because independent research on each sub-problem usually takes ground-truth values as input, the error that accumulates in a real pipeline significantly degrades the performance of downstream tasks.
2) Different sub-modules repeat computations such as feature extraction and dimension conversion, but the serial architecture cannot share these redundant calculations, which hurts the overall efficiency of the system.
3) Temporal information cannot be fully exploited. On the one hand, temporal information complements spatial information, helping to detect objects that are occluded at the current moment and providing more reference for localizing them. On the other hand, temporal information helps determine an object's motion state; without it, pure vision-based methods can hardly estimate an object's speed.
Unlike the image-view solution, the BEV solution uses multiple cameras or radars to transform visual information into a bird's-eye view for the relevant perception tasks. This provides a larger field of view for autonomous driving perception and allows multiple perception tasks to be completed in parallel. Moreover, since BEV perception integrates information in BEV space, it naturally drives exploration of the 2D-to-3D view transformation.
At the same time, BEV perception algorithms still lag behind existing point cloud solutions on 3D detection tasks. Exploring visual BEV perception will help reduce costs: a LiDAR setup often costs around 10 times as much as a camera setup, so visual BEV is a clear trend for the future, although the huge amount of data it brings demands enormous computing resources.
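The simplest form of the 2D-to-3D conversion mentioned above is geometric inverse perspective mapping: cast a ray through a pixel and intersect it with an assumed flat road plane. Learned methods such as BEVFormer replace this with attention-based view transformation, so the sketch below is only a hedged baseline; the pinhole intrinsics and camera height are illustrative assumptions.

```python
def pixel_to_bev(u, v, fx, fy, cx, cy, cam_height):
    """Project an image pixel to ground-plane (BEV) coordinates under a
    flat-ground assumption. Camera frame: x right, y down, z forward;
    the camera sits cam_height meters above the road."""
    dx = (u - cx) / fx           # normalized ray direction, x component
    dy = (v - cy) / fy           # normalized ray direction, y component
    if dy <= 0:
        return None              # pixel at or above the horizon: no ground hit
    t = cam_height / dy          # ray length where the ray meets the road
    return (t * dx, t)           # (lateral offset, forward distance) in meters
```

Pixels near the horizon map to very distant BEV cells, which is why geometric BEV maps degrade with range and why learned depth or attention is preferred in practice.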
Review Editor: Huang Fei