Allen Watson, Sr. Product Marketing Manager, Synopsys
Johan Kraft, CEO, Percepio AB
Today’s powerful vision processors allow for excellent performance, but making sure your solution makes efficient use of the hardware is another matter. The ability to visualize the runtime behavior of your system can help accelerate development, debugging, and validation. Percepio Tracealyzer for OpenVX™ allows you to visualize the execution of OpenVX applications and identify bottlenecks where optimization can make a big difference. Tracealyzer for OpenVX is available for Synopsys’ DesignWare® ARC® EV6x embedded vision processors and leverages the built-in trace support in the ARC MetaWare EV Development Toolkit.
Embedded vision applications are typically written as OpenVX graphs. OpenVX is an open standard for the acceleration of computer vision applications and has many embedded and real-time use cases. This includes applications such as face, body, and gesture tracking, smart video surveillance, advanced driver assistance systems (ADAS), object and scene reconstruction, augmented reality, visual inspection, and robotics.
An OpenVX graph is constructed from one or more kernels. Each kernel performs a vision function and may be one of the standard OpenVX kernels, a custom supplied kernel, or a user-defined kernel (Figure 1).
Figure 1: An OpenVX Graph
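To make the idea of a graph built from kernels concrete, the following minimal Python sketch models a graph as a set of nodes with dependencies and derives one valid execution order. It is an illustration only and does not use the OpenVX API; the node names are borrowed from the demo application discussed later in this article, and the dependency structure is an assumption.

```python
from graphlib import TopologicalSorter

# Hypothetical model: each key is a graph node (named after the kernel it runs)
# and each value is the set of nodes whose output it consumes.
graph = {
    "sobel3x3": set(),               # reads the input image
    "magnitude": {"sobel3x3"},       # consumes the Sobel gradients
    "convert_depth": {"magnitude"},  # converts the magnitude image to the output depth
}

def execution_order(graph):
    """Return one valid order in which the graph nodes can be executed."""
    return list(TopologicalSorter(graph).static_order())

order = execution_order(graph)
print(order)
```

A real OpenVX runtime decides the schedule itself, including which core runs each node; the sketch only shows that the data dependencies constrain the order.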
For example, with a vision processor like the ARC EV6x processor, a kernel may run on one or more of the vision CPUs or on the CNN Engine (Figure 2). The processor’s OpenVX-based runtime software manages the execution of the kernels and handles memory allocation and use.
Figure 2: OpenVX-based runtime on EV6x processor
It can be a challenge to make sure the processor hardware is being used as efficiently as possible. For example, an OpenVX graph node may require more processing time than expected and overload one core, while the other cores remain mostly idle. Or, perhaps the application is spending a lot of time waiting for DMA transfers to complete. You may also have tried to improve performance by adding more compute resources, but the performance gain is less than expected. To address these types of issues, Percepio developed a version of their visualization solution, Tracealyzer, for OpenVX.
With this solution, you can identify bottlenecks where optimization can make a big difference. Tracealyzer for OpenVX provides a variety of graphical views showing different perspectives of the recorded behavior, ranging from a detailed trace view to high-level overviews and statistics. This article describes the different views available when using the Tracealyzer tool.
The trace view displays a timeline of the OpenVX graph execution so that you can study the scheduling, pipelining, and timing in detail. The trace view can be adapted in many ways and supports both horizontal and vertical display.
As an example, we are using the demo trace provided with Tracealyzer for OpenVX (“demo_openvx.xml”). The trace was recorded from an OpenVX demo application, shown in Figure 3 together with a screenshot of the trace view.
Figure 3: OpenVX Demo Application
You can see how the runtime software schedules the graph using two cores. Core 0 reads the input frames and feeds them to the sobel3x3 node. The result is then processed further on Core 1, using the magnitude and convert_depth filters. The processing is run-to-completion, so each rectangle (fragment) in the trace is a separate job that runs without preemption. The short fragments of the filter functions are precondition checks, while the long fragments show the actual filter processing.
The magnitude node starts before the sobel3x3 node is completed, which is possible because the OpenVX implementation in this example divides each frame into tiles. One node may output multiple tiles that are written to the output buffer one by one, as soon as completed. Thus, the following node (e.g., magnitude) does not need to wait for a full frame, but can start as soon as the first tile is available, assuming the nodes run on different cores. This allows for a pipelined processing that utilizes the cores efficiently.
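The benefit of tile-based pipelining can be estimated with a simple model. The sketch below is a simplified illustration, not the EV runtime's scheduler: it assumes one producer node on one core, one consumer node on another core, and fixed per-tile processing times. The function name and parameters are hypothetical.

```python
def frame_latency(num_tiles, produce_time, consume_time, tiled):
    """Time until the consumer node has finished processing one frame.

    Assumes the producer and consumer run on different cores and that
    every tile takes the same time to produce and to consume.
    """
    if not tiled:
        # The consumer waits for the full frame before it starts.
        return num_tiles * produce_time + num_tiles * consume_time

    consumer_free = 0.0  # time at which the consumer core becomes available
    for tile in range(1, num_tiles + 1):
        tile_ready = tile * produce_time        # producer finishes this tile
        start = max(tile_ready, consumer_free)  # wait for the data and the core
        consumer_free = start + consume_time
    return consumer_free

unpipelined = frame_latency(8, 1.0, 1.0, tiled=False)
pipelined = frame_latency(8, 1.0, 1.0, tiled=True)
print(unpipelined, pipelined)
```

With eight tiles and equal per-tile costs, the pipelined case finishes shortly after the producer does, because the consumer only has the last tile left to process at that point.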
Figure 4: Trace View
The trace view is composed of fields, labeled “CPU 0” and “CPU 1” in Figure 4. Each type of field displays a different type of information. OpenVX nodes are shown in a “scheduling” field, either one field per CPU core (left example) or a single field for all nodes (right example).
To get an overview of how your OpenVX application utilizes your CPU cores, look at the CPU load graph, shown twice (once for each core) together with the trace view in vertical mode (Figure 5). The CPU load graph allows you to see the overall load on your CPU cores, as well as how the load varies over time and the contribution of each node.
The CPU load graph also works as an overview where you can spot anomalies, for instance, the two spikes in the “sobel3x3” node (shown in red) where the utilization is around 80-90%. You can see what causes these spikes by double-clicking in the CPU load graph to show the corresponding section in the trace view.
Figure 5: CPU Load Graph
All views in Tracealyzer are interconnected in similar ways, which makes it easy to drill down from high-level overviews into the detailed trace. The colors make it easier to identify the OpenVX graph nodes. The same color coding is used across all Tracealyzer views. It is also possible to open multiple instances of the CPU load graph or view all cores combined in a single CPU load graph. Note that the CPU loads are accumulated in this mode, so with two cores the scale goes up to 200%.
The CPU load graph works by dividing the displayed time window into a number of fixed-size time intervals (50 by default) and then calculating the amount of processing time used by each node within each time interval. The result is displayed as a stacked histogram, where the Y-axis shows the relative utilization within each time interval. Since the concept of CPU load is always relative to a certain time window (independent of what tool you use), zooming in or out may change the levels of the CPU load graph as the reference time window is changed. When zoomed in a lot, most time intervals will only contain a single node, so the graph will be similar to the trace view.
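The calculation described above can be sketched in a few lines of Python. This is an illustration of the general binning technique, not Tracealyzer's implementation; the function name and the `(node, start, end)` fragment format are assumptions made for the example.

```python
from collections import defaultdict

def cpu_load(fragments, window_start, window_end, num_intervals=50):
    """Compute per-node utilization for each fixed-size time interval.

    fragments: list of (node, start, end) execution fragments on one core.
    Returns a list with one dict per interval, mapping node -> fraction (0..1).
    """
    width = (window_end - window_start) / num_intervals
    loads = [defaultdict(float) for _ in range(num_intervals)]
    for node, start, end in fragments:
        for i in range(num_intervals):
            lo = window_start + i * width
            hi = lo + width
            # Portion of this fragment that falls inside interval i.
            overlap = min(end, hi) - max(start, lo)
            if overlap > 0:
                loads[i][node] += overlap / width
    return loads

# Hypothetical fragments, in arbitrary time units.
fragments = [("sobel3x3", 0.0, 5.0), ("magnitude", 5.0, 7.5)]
coarse = cpu_load(fragments, 0.0, 10.0, num_intervals=2)
fine = cpu_load(fragments, 0.0, 10.0, num_intervals=4)
```

In the coarse result, the second interval reports the magnitude node at 50% load, while in the fine result the same fragment fills an entire interval and reports 100%. This is the zoom effect described above: the load level depends on the reference time window.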
In Figure 6, we added two instances of the Actor Instance Graph, where the Y-axis shows the execution times of the graph nodes. This way, you can see where nodes execute longer than normal and inspect the trace view to see the details. Note that “Actor” is a Tracealyzer term meaning “execution context”, corresponding to nodes in OpenVX.
Figure 6: Actor Instance Graph
In addition to execution time, the Actor Instance Graph can show various timing properties, including separation and periodicity. You can change the property that is displayed in the “Execution Time” dropdown menu.
The Actor Statistics Report gives a statistical summary of the trace, including the highest, lowest, and average values observed for timing properties such as execution time. All extreme values in this report are links and can be clicked on to find the corresponding location in the trace view (Figure 7).
Figure 7: Actor Statistics Report
With the Actor Statistics Report, you can find the extreme values and see what was going on in the system at that time. Although not all details are recorded, you can still get valuable clues about what caused these values. For instance, using the Actor Instance Graph you can find other cases with similarly high execution time and check for correlations in the trace. Perhaps the high execution time only occurs under particular circumstances, e.g., due to intense activity on other CPU cores saturating the bus.
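The kind of summary such a report contains can be illustrated with a short Python sketch. It assumes a hypothetical list of `(actor, start, end)` instances and is not based on Tracealyzer's internal data model; it records where the longest instance started so that the extreme value can be located in the trace.

```python
from collections import defaultdict

def actor_statistics(instances):
    """Summarize execution times per actor.

    instances: list of (actor, start, end) tuples.
    Returns actor -> dict with min, max, avg, and the start time of the
    longest instance ("max_at").
    """
    samples = defaultdict(list)
    for actor, start, end in instances:
        samples[actor].append((end - start, start))

    stats = {}
    for actor, values in samples.items():
        durations = [duration for duration, _ in values]
        longest, longest_start = max(values)  # compares by duration first
        stats[actor] = {
            "min": min(durations),
            "max": longest,
            "avg": sum(durations) / len(durations),
            "max_at": longest_start,
        }
    return stats

# Hypothetical instances, in arbitrary time units.
stats = actor_statistics([
    ("sobel3x3", 0, 4),
    ("sobel3x3", 10, 15),
    ("sobel3x3", 20, 29),
    ("magnitude", 2, 5),
])
```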
Note that you can export and save the statistics reports, either as formatted HTML files (like above) or as tabbed text files (Figure 8). The latter allows for easy data import into other tools and is done by checking the option “Data Export” in the Actor Statistics Report dialog. This allows you to run measurements on alternative designs and compare the resulting performance metrics systematically. For instance, the statistics report shows that IDLE0 is running 40.7% of the time, meaning that Core 0 is only 59.3% loaded, while IDLE1 only runs for 17% of the time, meaning that Core 1 is 83% loaded.
Figure 8: Actor Statistics Report in tabbed text file format
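As a sketch of how a tabbed text export could be post-processed in another tool, the Python example below derives the load of each core from the share of time spent in its idle actor. The column names and file layout shown here are hypothetical and chosen only for illustration; check an actual export for the real headers.

```python
import csv
import io

# Hypothetical tab-separated export; the real column names and layout may differ.
EXPORT = "Actor\tCPU Usage (%)\nIDLE0\t40.7\nIDLE1\t17.0\nsobel3x3\t35.0\n"

def core_loads(tabbed_text, idle_prefix="IDLE"):
    """Map core index -> load in percent, computed as 100 minus the idle share."""
    loads = {}
    for row in csv.DictReader(io.StringIO(tabbed_text), delimiter="\t"):
        name = row["Actor"]
        if name.startswith(idle_prefix):
            core = int(name[len(idle_prefix):])
            loads[core] = round(100.0 - float(row["CPU Usage (%)"]), 1)
    return loads

loads = core_loads(EXPORT)
print(loads)
```

Running the same function on exports from two alternative designs gives per-core load figures that can be compared directly.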
Tracealyzer allows you to add your own user events, i.e., custom events logged from the application code. User events allow you to visualize just about anything in your application, like diagnostic messages, variable values, and states.
Figure 9 shows an example where user events have been logged on two user event channels, “MyVariable” showing values of an integer variable and “MyState” showing state names. Tracealyzer can display such user events in several ways, e.g., as event labels in the trace view (1) and as entries in the Event Log (2). The User Event Signal Plot (3) allows for plotting numerical data from user events.
Moreover, if you have important state variables in your system, you can log the state changes as user events and define a State Machine in Tracealyzer to see the states in the trace view timeline (4). You can also see a summary of the state changes as a state machine graph (5). You can even get statistics on the time spent in each state, or the time between any two events by defining a “custom interval”.
Figure 9: User Events
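The statistic on time spent in each state can be illustrated with a small Python sketch. It assumes a hypothetical, time-ordered list of `(timestamp, state_name)` events, such as those logged on a channel like “MyState”, and is not Tracealyzer's own implementation; the state names are invented for the example.

```python
from collections import defaultdict

def time_in_states(events, trace_end):
    """Total time spent in each state.

    events: time-ordered list of (timestamp, state_name) state changes.
    trace_end: timestamp at which the trace ends, closing the last state.
    """
    totals = defaultdict(float)
    # Pair each state change with the next one; the last state lasts until trace_end.
    following = events[1:] + [(trace_end, None)]
    for (entered, state), (left, _) in zip(events, following):
        totals[state] += left - entered
    return dict(totals)

# Hypothetical state changes, in arbitrary time units.
events = [(0.0, "Idle"), (2.0, "Busy"), (5.0, "Idle"), (6.0, "Busy")]
durations = time_in_states(events, trace_end=10.0)
print(durations)
```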
The example use case described is based on an OpenVX application developed using Synopsys’ DesignWare ARC MetaWare EV Development Toolkit for ARC EV6x processors.
The EV6x Embedded Vision Processors integrate one, two, or four high-performance vision CPUs, each consisting of a 32-bit scalar core with a 512-bit vector DSP. They can include an optimized convolutional neural network (CNN) engine for fast and accurate object detection, classification, and scene segmentation. The processors are fully programmable and configurable and combine the flexibility of software solutions with the high performance and low power consumption of dedicated hardware.
The ARC MetaWare EV Development Toolkit provides a complete set of tools, runtime software and libraries that enable the development of embedded vision applications and machine learning applications with the EV6x Processor family. The toolkit consists of the MetaWare Compiler and Debugger, ARC nSIM Instruction Set Simulator (ISS), EV Runtime and libraries, CNN Software Development Kit (CNN SDK), and the EV Virtualizer Development Kit (EV VDK).
Efficient development of advanced embedded vision and AI applications requires the ability to rapidly debug, validate, and optimize software. Percepio’s Tracealyzer for OpenVX visualization tool enables designers using Synopsys’ ARC EV6x processors to observe the runtime behavior of their software and optimize their applications for maximum performance while accelerating development cycles for real-time vision applications such as ADAS and self-driving vehicles.