The convolutional layers are designed to perform feature extraction, and are chosen empirically through a series of experiments that vary layer configurations. This model uses strided convolutions in the first three convolutional layers with a 2×2 stride and a 5×5 kernel, and a non-strided convolution with a 3×3 kernel size in the final two convolutional layers. The five convolutional layers are followed with three fully connected layers, leading to a final output control value which is the inverse-turning-radius. The fully connected layers are designed to function as a controller for steering, but by training the system end-to-end, it is not possible to make a clean break between which parts of the network function primarily as feature extractor, and which serve as controller.
Fig. 3: Drive simulator
Training and simulator
The first step in training a neural network is to select the frames for use. The collected data is labeled with road type, weather condition and the driver’s activity (staying in a lane, switching lanes, turning and so forth). To train a CNN to do lane following, data is selected where the driver is staying in a lane, while the rest is discarded.
Then, the video is sampled at a rate of 10 frames per second because a higher sampling rate would include images that are highly similar, and thus not provide much additional useful information. To remove a bias towards driving straight, the training data includes a higher proportion of frames that represent road curves.
After selecting the final set of frames, the data is augmented by adding artificial shifts and rotations to teach the network how to recover from a poor position or orientation. The magnitude of these perturbations is chosen randomly from a normal distribution. The distribution has zero mean, and the standard deviation is twice the standard deviation that is measured with human drivers. Artificially augmenting the data does add undesirable artifacts as the magnitude increases.
Fig. 4 shows a screenshot of the simulator in interactive mode. The simulator takes prerecorded videos from a forward-facing on-board camera connected to a human-driven data-collection vehicle, and generates images that approximate what would appear if the CNN were instead steering the vehicle. These test videos are time-synchronised with the recorded steering commands generated by the human driver. Since human drivers don’t drive in the centre of the lane all the time, there is a need to manually calibrate the lane’s centre as it is associated with each frame in the video used by the simulator.
Fig. 4: Simulator in interactive mode
The simulator transforms the original images to account for departures from the ground truth. Note that this transformation also includes any discrepancy between the human driven path and the ground truth. The transformation is accomplished by the same methods. The simulator accesses the recorded test video along with the synchronised steering commands that occurred when the video was captured. The simulator sends the first frame of the chosen test video, adjusted for any departures from the ground truth, to the input of the trained CNN, which then returns a steering command for that frame.
The CNN steering commands as well as the recorded human-driver commands are fed into the dynamic model of the vehicle to update the position and orientation of the simulated vehicle. In Fig. 4, the green area on the left is unknown because of the viewpoint transformation. The highlighted wide rectangle below the horizon is the area which is sent to the CNN.
The simulator then modifies the next frame in the test video so that the image appears as if the vehicle was at the position that resulted by following steering commands from the CNN. This new image is then fed to the CNN and the process repeats. The simulator records the off-centre distance (distance from the car to the lane centre), the yaw, and the distance travelled by the virtual car. When the off-centre distance exceeds one metre, a virtual human intervention is triggered, and the virtual vehicle position and orientation is reset to match the ground truth of the corresponding frame of the original test video.
Fig. 5 shows how the CNN learns to detect useful road features on its own, with only the human steering angle as training signal. It was not explicitly trained to detect road outlines. CNNs are able to learn the entire task of lane and road following without manual decomposition into road or lane marking detection, semantic abstraction, path planning and control. A small amount of training data from less than a hundred hours of driving is sufficient to train the car to operate in diverse conditions, on highways, local and residential roads in sunny, cloudy and rainy conditions.
Fig. 5: How the CNN sees an unpaved road. Top: Camera image sent to the CNN; bottom left: activation of the first-layer feature maps; bottom right: activation of the second-layer feature maps
Image identification in autonomous vehicles
To be able to identify images, autonomous vehicles need to process a full 360-degree dynamic environment. This creates the need for dual-frame processing because collected frames must be combined and considered in context with each other. A vehicle can be equipped with a rotating camera to collect all relevant driving data. The machine must be able to recognise metric, symbolic and conceptual knowledge as demonstrated in Fig. 6.
Metric knowledge is the identification of the geometry of static and dynamic objects, which is required to keep the vehicle in its lane and at a safe distance from other vehicles. Symbolic knowledge allows the vehicle to classify lanes and conform to basic rules of the road. Conceptual knowledge allows the vehicle to understand relationships between traffic participants and anticipate the evolution of the driving scene. Conceptual knowledge is the most important aspect for being able to detect specific objects and avoid collisions.
One current method of obstacle detection in autonomous vehicles is the use of detectors and sets of appearance-based parameters. The first step in this method is the selection of areas of interest. This process narrows down areas of the field of vision that contain potential obstacles.
Appearance cues are used by the detectors to find areas of interest. These appearance cues analyse two-dimensional data and may be sensitive to symmetry, shadows, or local texture and colour gradients. Three-dimensional analysis of scene geometry provides greater classification of areas of interest. These additional cues include disparity, optical flow and clustering techniques.
Disparity is the pixel difference for an object from frame to frame. If you look at an object alternately closing one eye after the other, the ‘jumping’ you see in the object is the disparity. It can be used to detect and reconstruct arbitrarily shaped objects in the field.
Optical flow combines scene geometry and motion. It samples the environment and analyses images to determine the motion of objects. Finally, clustering techniques group image regions with similar motion vectors as these areas are likely to contain the same object. A combination of these cues is used to locate all areas of interest.
Fig. 6: Image identification in autonomous vehicles
While any combination of cues is attainable, it is necessary to include both appearance cues and three-dimensional cues as the accuracy of three-dimensional cues decreases quadratically with increasing distance. In addition, only persistent detections are flagged as obstacles so as to lower the rate of false alarms.
After areas of interest have been identified, these must be classified by passing them through many filters that search for characteristic features of on-road objects. This method takes a large amount of computation and time. The use of CNNs can increase the efficiency of this detection process. CNN-based detection system can classify areas that contain any type of obstacle. Motion-based methods such as optical flow heavily rely on the identification of feature points, which are often misclassified or not present in the image.
All the knowledge-based methods are for special obstacles (pedestrians, cars, etc) or in special environments (flat road, obstacles differing in appearance from ground). Convolutional neural networks are the most promising for classifying complex scenes because these closely mimic the structure and classification abilities of the human brain. Obstacle detection is only one important part of avoiding a collision. It is also vital for the vehicle to recognise how far away the obstacles are located in relation to its own physical boundaries.
Fig. 7: Obstacle detection test results: Input images (top), ground truths with black as positive (middle) and detected obstacles with orange as positive (bottom)
Depth estimation is an important consideration in autonomous driving as it ensures the safety of passengers as well as other vehicles. Estimating the distance between an obstacle and the vehicle is an important safety concern.
A CNN may be used for this task as CNNs are a viable method to estimate depth in an image. In a study, researchers trained their network on a large dataset of object scans, which is a public database of over ten thousand scans of everyday 3D objects, focused on images of chairs and used two different loss functions for training. They found that bi-weight trained network was more accurate at finding depth than the L2 norm. With images of varying size and resolution, it had an accuracy between 0.8283 and 0.9720 with a perfect accuracy being 1.0.
While estimating depth on single-frame stationary objects is simpler than on moving objects seen by vehicles, researchers found that CNNs can also be used for depth estimation in driving scenes. They fed detected obstacle blocks to a second CNN programmed to find depth. The blocks were split into strips parallel to the lower image boundary. These strips were weighted with depth codes from bottom to top with the notion that closer objects would normally appear closer to the lower bound of the image. The depth codes went from ‘1’ to ‘6’ with ‘1’ representing the most shallow areas and ‘6’ representing the deepest areas. The obstacle blocks were assigned the depth code for the strip they appeared in.
The CNN then used feature extraction in each block area to determine whether vertically adjacent blocks belonged to the same obstacle. If the blocks were determined to be the same obstacle, they were assigned the lower depth code to alert the vehicle of the closest part of the obstacle. CNN was trained on image block pairs to develop a base for detecting depth and then tested on street images as in the obstacle detection method. The CNN had an accuracy of 91.46 per cent in two-block identification.
Read part 2