Qualcomm Lays Groundwork for Robotic Vision Processing

In the last four to five years, the use of artificial intelligence (AI) has grown more rapidly than ever before, and this trend is expected to continue as AI becomes ubiquitous in everyday life. One area benefitting from this growth is the application of AI to robotics, where it is used for everything from making decisions to controlling movements. However, perhaps the biggest contribution of AI to robotics is the application of computer vision (CV) to allow for “robotic autonomy”.

Robotic autonomy is the ability for a robot to perform tasks with little or no external influence or assistance. Generally speaking, robotic autonomy involves gaining information about the surrounding environment, and performing full or partial movement without human intervention. CV can be used to help reach both of these goals.

Addressing Challenges
To achieve such autonomy, there are three primary challenges for developing robotic vision. The first is determining the orientation of objects, where objects in the surrounding environment must not only be identified, but their orientation in 3D space must be determined for a robot to interact with and/or avoid these objects.

Next is dealing with moving objects as objects in a given environment may not be static and would mean robots need to detect, identify, and track objects over space and time. The third challenge to achieve autonomy is navigating. For a robot to be autonomous, it needs algorithms that allow it to plan its motion in a dynamic environment.

Traditionally, robots addressed these challenges by sensing and detecting their surroundings. Two common methods included the use of simple shock sensors to detect collisions and the use of barriers with infrared transmitters to ascertain boundaries.

Today, robots are able to do this more autonomously, thanks to advances in mobile processing power, AI, machine learning (ML), and CV techniques. This has led to a general approach for CV development involving four phases. The first phase is preprocessing, where data collected from the real world (i.e., from sensors and cameras) is converted into a cleaner, more usable state. Next is feature detection, where features such as corners and edges are extracted from the preprocessed data. The third phase is object detection and classification, where objects are detected from the features and may be classified according to known feature maps. The last phase is object tracking and navigation, where identified objects are tracked across time; this can include both moving objects and the changing viewpoints of the environment as the robot navigates.

The data generated by these phases can then be used to control servos, make decisions, and perform other high-level robotic tasks. These four phases are discussed further below, along with some of the frameworks and hardware available.

Preprocessing
Photo 1: Data is collected from sensors and cameras and converted into a more clean and usable state.
A robot collects data from the real world using one or more cameras and/or other sensors. However, this raw data may not be in a suitable state for the accurate calculations and predictions required to meet established goals. Here, methods such as digital signal processing (DSP) can be used to clean the data into a more usable form. Image data, for example, can be cleaned in numerous ways, including resizing, gamma correction, and contrast enhancement; while sensor data, such as that from an inertial measurement unit (IMU), accelerometer, or barometer, can be fused, interpolated, and/or filtered.
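The kind of sensor filtering described above can be as simple as smoothing a noisy stream of readings. A minimal sketch, assuming a 1D stream such as one accelerometer axis (the function name and window size are illustrative, not from any particular SDK):

```python
# Minimal sketch of sensor-data cleanup: a centered moving-average filter
# applied to a noisy 1D sensor stream (e.g., one accelerometer axis).

def moving_average(samples, window=3):
    """Smooth a list of sensor readings with a centered moving average."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number")
    half = window // 2
    smoothed = []
    for i in range(len(samples)):
        # Clamp the window at the boundaries of the stream.
        lo = max(0, i - half)
        hi = min(len(samples), i + half + 1)
        smoothed.append(sum(samples[lo:hi]) / (hi - lo))
    return smoothed
```

In practice the same idea generalizes to interpolation and sensor fusion; the choice of window trades responsiveness against noise rejection.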

When it comes to image data, choose a system that can process two (stereo) images or two planes at the same time, and can support various resolutions and frame rates. This provides more flexibility in the amount and rate of data that is collected. Similarly, sensor data can be collected at various frequencies and bit rates. Needless to say, to reduce the overhead of processing all of this data, it is generally best to use the lowest sample rate and resolutions that provide the required amount of data.

Feature detection
With clean data available, features can then be extracted. Four common features that CV techniques look for in visual data are:

  • Corners: point-like features with a local 2D structure
  • Edges: sets of points forming a boundary between two regions
  • Blobs: regions of interest
  • Ridges: curves whose points are local maxima in one direction
Photo 2: Image shows a number of features that might be detected from visual data.
Feature detection algorithms can require a lot of processing power but generally operate on a pixel-by-pixel basis, which makes them suitable for parallel execution on the different compute blocks (such as the computer vision processing engine, DSP, GPU, etc. of Qualcomm processors). There are a number of CV algorithms that can be used including the Harris Corner Detector, FAST, Hough Transform, and other detectors, as well as object detection approaches based on Maximally Stable Extremal Regions (MSER).
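To make the pixel-by-pixel nature of these detectors concrete, here is a hedged, pure-Python sketch of the Harris corner response (central-difference gradients, a 3x3 window, and the usual k = 0.04); production code would use an optimized library, but the per-pixel math is the same:

```python
# Illustrative Harris corner response on a tiny grayscale image
# (list of lists of floats). Not optimized; shows the math only.

def harris_response(img, k=0.04):
    h, w = len(img), len(img[0])
    # Central-difference gradients (valid for interior pixels).
    Ix = [[0.0] * w for _ in range(h)]
    Iy = [[0.0] * w for _ in range(h)]
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            Ix[r][c] = (img[r][c + 1] - img[r][c - 1]) / 2.0
            Iy[r][c] = (img[r + 1][c] - img[r - 1][c]) / 2.0
    # Corner response R = det(M) - k * trace(M)^2 over a 3x3 window.
    R = [[0.0] * w for _ in range(h)]
    for r in range(2, h - 2):
        for c in range(2, w - 2):
            sxx = syy = sxy = 0.0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    ix, iy = Ix[r + dr][c + dc], Iy[r + dr][c + dc]
                    sxx += ix * ix
                    syy += iy * iy
                    sxy += ix * iy
            R[r][c] = (sxx * syy - sxy * sxy) - k * (sxx + syy) ** 2
    return R
```

Because each output pixel depends only on a small neighborhood, the loop body maps naturally onto parallel hardware such as a DSP or GPU.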

Detecting objects and orientations
With features detected, the next step is to detect and classify objects from them. This has traditionally been a difficult task, due to challenges such as variations in viewpoint, differences in image scale, deformation (i.e., change in shape) of objects over time, and varying illumination conditions. Fortunately, a neural network that has been trained to detect and classify objects with such variations can help. These networks are often trained from datasets containing large quantities of labelled images.

One popular approach is to employ a convolutional neural network (CNN), where small regions of the image are fed into the network in a process known as ‘sliding windows’. There are a number of ML frameworks that support object detection with neural networks such as Caffe2 and TensorFlow, as well as the ONNX format.
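The sliding-window idea itself is framework-independent: crop each window from the image and hand it to a classifier. The sketch below uses a stand-in stub in place of a real trained CNN, and the window size and stride are illustrative:

```python
# Illustrative sketch of 'sliding windows': crop each win x win region
# from the image and pass it to a classifier (here, a stub callable).

def sliding_windows(img, win, stride):
    """Yield (row, col, window) crops of size win x win."""
    h, w = len(img), len(img[0])
    for r in range(0, h - win + 1, stride):
        for c in range(0, w - win + 1, stride):
            window = [row[c:c + win] for row in img[r:r + win]]
            yield r, c, window

def detect(img, classify, win=2, stride=1):
    """Return top-left positions of windows where the classifier fires."""
    return [(r, c) for r, c, w in sliding_windows(img, win, stride)
            if classify(w)]
```

With a real network, `classify` would run inference on each crop (or, more efficiently, the convolution itself would be shared across windows).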

Determining the orientation of objects is important for both object interaction and navigation. The main challenge here is determining the orientation of an object and/or the robot itself in 3D world-space. A popular approach is to apply homography algorithms such as linear least square solver, random sampling and consensus (RANSAC), and least median of squares, to compute points between frames of 2D imagery.
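The common thread in these estimators is the sample-and-consensus loop of RANSAC: fit a model to a minimal random sample, count how many correspondences agree, and keep the best model. A full homography fit solves an 8-DoF system; to keep this sketch short it uses a pure-translation model between matched 2D points, which is an assumption for illustration only:

```python
# Hedged sketch of the RANSAC loop, using a 1-point translation model
# between matched 2D points instead of a full homography.
import random

def ransac_translation(matches, iters=100, tol=1.0, seed=0):
    """matches: list of ((x1, y1), (x2, y2)) correspondences.
    Returns the translation (dx, dy) with the largest inlier count."""
    rng = random.Random(seed)
    best, best_inliers = (0.0, 0.0), -1
    for _ in range(iters):
        # Minimal sample: one correspondence defines a translation.
        (x1, y1), (x2, y2) = rng.choice(matches)
        dx, dy = x2 - x1, y2 - y1
        # Consensus: count matches consistent with this model.
        inliers = sum(1 for (a, b), (c, d) in matches
                      if abs(c - a - dx) <= tol and abs(d - b - dy) <= tol)
        if inliers > best_inliers:
            best, best_inliers = (dx, dy), inliers
    return best
```

The same loop applies to a homography: the minimal sample becomes four point pairs, and the model fit becomes a linear solve.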

Once objects have been detected, they are assigned metadata such as an ID, bounding box, etc., which can be used during object tracking and navigation.

Photo 3: Determining the orientation of an object and/or the robot itself in 3D world-space is the main challenge.
Object tracking and navigation
With objects and aspects of the surrounding environment identified, a robot then needs to track them. Because objects can move around, and the robot’s viewport will change as it navigates, a mechanism is needed to track these elements over time and across frames captured by the camera(s) and other sensors. Because this mechanism must be fast enough to run every frame, numerous algorithms have been devised over the years that approach the problem in different ways.

For example, Centroid Tracking computes the center point of the bounding box around an identified object in each frame, and computes the distance between these points across frames, under the assumption that the object will only move a limited distance between frames. Another approach is to use a Kalman filter, which uses statistics accumulated over time to predict the location of an object.
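A minimal centroid-tracking sketch, assuming axis-aligned boxes and a hypothetical distance threshold (one detection per track per frame, for brevity):

```python
# Minimal sketch of centroid tracking: associate each new detection with
# the nearest existing track, assuming objects move only a short distance
# between frames. IDs and the max_dist threshold are illustrative.

def centroid(box):
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)

def update_tracks(tracks, detections, max_dist=10.0):
    """tracks: dict of id -> centroid. detections: list of boxes.
    Matches each detection to the nearest track within max_dist,
    otherwise starts a new track. Returns the updated dict."""
    next_id = max(tracks) + 1 if tracks else 0
    for box in detections:
        cx, cy = centroid(box)
        best_id, best_d = None, max_dist
        for tid, (tx, ty) in tracks.items():
            d = ((cx - tx) ** 2 + (cy - ty) ** 2) ** 0.5
            if d <= best_d:
                best_id, best_d = tid, d
        if best_id is None:
            best_id, next_id = next_id, next_id + 1
        tracks[best_id] = (cx, cy)
    return tracks
```

A production tracker would also age out stale tracks and resolve conflicting assignments (e.g., with the Hungarian algorithm).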

The mean shift algorithm is an approach that basically finds the mean of some aspect of an image (e.g., color histogram) within a sub region of a frame. It then looks for the same description within the next frame by seeking to maximize similarities in features. This allows it to account for changes such as scale, orientation etc. and to ultimately track where the object is.
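The core of mean shift is a short iterative loop: move the search window to the weighted centroid of the similarity values inside it, and repeat until it stops moving. In the sketch below the weight grid stands in for a histogram back-projection; the radius and convergence threshold are illustrative:

```python
# Rough sketch of mean shift over a 2D weight grid (list of lists).
# The window repeatedly moves to the weighted centroid of the values
# within `radius` of the current center until it converges.

def mean_shift(weights, start, radius=3, max_iter=20):
    h, w = len(weights), len(weights[0])
    cy, cx = float(start[0]), float(start[1])
    for _ in range(max_iter):
        total = sy = sx = 0.0
        for r in range(max(0, int(cy) - radius), min(h, int(cy) + radius + 1)):
            for c in range(max(0, int(cx) - radius), min(w, int(cx) + radius + 1)):
                total += weights[r][c]
                sy += r * weights[r][c]
                sx += c * weights[r][c]
        if total == 0:
            break  # nothing to track inside the window
        ny, nx = sy / total, sx / total
        moved = abs(ny - cy) >= 0.5 or abs(nx - cx) >= 0.5
        cy, cx = ny, nx
        if not moved:
            break
    return cy, cx
```

Variants such as CAMShift extend this loop to adapt the window's size and orientation as the tracked object changes scale.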

Because these techniques only need to track a subset of the original features, they can generally deal with changes such as orientation, occlusion, among others, efficiently and with good success, which makes them effective for robotics vision processing.

However, objects are not the only thing that need to be tracked. The robot itself should be able to successfully navigate its environment and this is where Simultaneous Localization and Mapping (SLAM) comes in. SLAM seeks to estimate a robot’s location and derive a map of the environment. It can be implemented using several algorithms, such as Kalman filters. SLAM is often implemented by fusing data from multiple sensors, and when it involves visual data, the process is often referred to as Visual-Inertial Simultaneous Localization and Mapping (VISLAM).
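The Kalman filter that often sits inside a SLAM pipeline follows a simple predict/update cycle. A 1D sketch fusing odometry (motion) with a noisy position measurement, with placeholder noise variances (a real SLAM system tracks a full multi-dimensional state and map):

```python
# Sketch of the Kalman-filter building block used inside many SLAM
# systems: a 1D predict/update cycle. q and r are illustrative.

def kalman_step(x, p, u, z, q=0.1, r=0.5):
    """x, p: current state estimate and its variance.
    u: odometry motion; z: position measurement.
    q, r: process and measurement noise variances."""
    # Predict: apply the motion and grow the uncertainty.
    x_pred = x + u
    p_pred = p + q
    # Update: blend prediction and measurement via the Kalman gain.
    k = p_pred / (p_pred + r)
    x_new = x_pred + k * (z - x_pred)
    p_new = (1 - k) * p_pred
    return x_new, p_new
```

Run over many frames, the estimate tracks the true position while the variance settles to a steady state that reflects the two noise sources.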

Of course, SLAM is only as good as what the robot can sense, so the use of high-quality cameras and sensors is important, as well as ensuring that they are not blocked from capturing data. From a safety perspective, fail safes are important in case data cannot be acquired (e.g., the cameras become covered).

Photo 4: High-quality cameras and sensors play an important role in complementing today’s computational methods and algorithms.
Algorithms for Path Planning
Another important aspect of autonomous navigation is path planning. Path planning is the process of taking a target location, and deriving the necessary commands to control the speeds and turns that the robot needs to perform, to successfully navigate to that location.

Path planning typically involves identifying several domains or “spaces”:

  • Configuration space: defines the robot’s pose DoFs (position and orientation)
  • Free space: the set of configurations that avoid collisions
  • Target space: the target location to move the robot to
  • Obstacle space: the area the robot cannot enter
  • Danger space: a less desirable area the robot can pass through, such as a mud puddle

There are numerous path planning algorithms that engineers can employ, many of which seek to identify and use the aforementioned spaces in different ways to achieve their goals. Several categories of these algorithms include the following:

  • Grid-based: the Configuration space is organized into a grid, and collision detection is used to determine if a robot can navigate between the grid’s cells.
  • Artificial potential fields: goals are assigned “attraction” properties and obstacles are assigned “repulsion” properties, so that the robot is drawn toward its target while being pushed away from obstacles.
  • Sampling-based: samples of configurations called “milestones”, are taken from the Configuration space, and collision detection is used to determine if a line segment can be created to connect two samples. The resulting collection of connected milestones represents the “road map” of a path to follow.
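The grid-based category can be sketched in a few lines. The example below uses breadth-first search over a 4-connected occupancy grid (1 = obstacle); real planners typically use A* with a heuristic, but the grid-and-collision-check idea is the same:

```python
# Hedged sketch of grid-based path planning: breadth-first search over
# a 4-connected occupancy grid, where 1 marks an obstacle cell.
from collections import deque

def plan_path(grid, start, goal):
    """Return a list of (row, col) cells from start to goal, or None."""
    h, w = len(grid), len(grid[0])
    queue = deque([start])
    came_from = {start: None}
    while queue:
        cell = queue.popleft()
        if cell == goal:
            path = []
            while cell is not None:  # walk back to the start
                path.append(cell)
                cell = came_from[cell]
            return path[::-1]
        r, c = cell
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if (0 <= nr < h and 0 <= nc < w and grid[nr][nc] == 0
                    and (nr, nc) not in came_from):
                came_from[(nr, nc)] = cell
                queue.append((nr, nc))
    return None  # goal unreachable
```

Swapping the FIFO queue for a priority queue ordered by cost-plus-heuristic turns this into A*; weighting danger-space cells with a higher traversal cost discourages, without forbidding, passing through them.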

The next generation of autonomous robots, with the use of CV and ML, will be more sophisticated in their abilities to ‘see’ their surroundings, ‘analyze’ dynamic scenarios or changing conditions, and ‘make decisions’. At the same time, they will require powerful processors and rich tool support.

Thankfully, there are a number of frameworks and technologies in place to help developers. For example, Qualcomm Technologies, Inc. (QTI) recently announced the Qualcomm Robotics RB3 Platform and the Qualcomm Robotics RB3 Development Kit based on the Qualcomm SDA 845 SoC, which is designed to provide mobile hardware capabilities and tools, to support the development of cutting-edge robotics.

With such tools and frameworks available, developers can focus on higher-level robotic vision functionality and approaches for object detection and classification, and object tracking and navigation.

Qualcomm Robotics RB3 Platform and Qualcomm SDA845 are products of Qualcomm Technologies, Inc. and/or its subsidiaries.

About this Article:

The author is Dev Singh, Senior Director, Business Development and Head of Robotics, Drones, and Intelligent Machines, Qualcomm Technologies, Inc.