Technology Today

2010 Issue 2

Monarch Meets Demanding, High-Stress Processing Requirements

Raytheon is bringing two DARPA-sponsored technologies together to meet challenging warfighter needs: Monarch, an exceptional processor architecture that provides an order of magnitude more processing per watt than other computing solutions, and SAVi (seismic and acoustic vibration imaging), an advanced sensor that uses laser vibrometry and a number of compute-intensive algorithms to detect buried objects such as mines and tunnels.

The trend in U.S. Department of Defense (DoD) systems is toward large data volume sensors with demanding signal and data processing throughput requirements. Processors for these systems also need outstanding energy efficiency. Even as commercial processors have increased in processing performance, the volume of data the sensor delivers to the front-end processor has put even greater pressure on the back-end processor to deliver more performance within a stricter power budget. The morphable networked micro-architecture (Monarch) is a high-performance processing chip developed with the goal of providing exceptional compute capacity and high data bandwidth coupled with state-of-the-art power efficiency and full programmability.

A chip is typically designed either for front-end signal processing or back-end control and data processing. The Monarch architecture and chip can efficiently do either, or both concurrently. It can perform as a single system on a chip, supporting single or multiple diverse processing functions, resulting in a significant reduction in the number of processor types required for computing systems, or it can perform as an array of chips to provide teraflop throughput.

Development History

Monarch got its start as the Raytheon-developed High-Performance Processing System (HPPS) architecture. HPPS was part of a challenge problem from a DoD agency to develop a 1-teraflop, 10-watt architecture for 2010. In 1999, Raytheon received a seedling contract and assembled a small team to study the problem. Out of that work and follow-on internal research and development investment came the core of Monarch, the dataflow-based field programmable computing array (FPCA). This approach was further developed in Phase I of a new Defense Advanced Research Projects Agency (DARPA) Information Processing Technology Office program: Polymorphic Computing Architecture (PCA).

First-Pin

The goal of the PCA program was to develop adaptive, high-performance processing architectures that can be optimized to mission requirements across DoD applications, whether in response to changes from mission to mission or to the dynamic evolution of in-mission processing requirements. A team from the Information Sciences Institute of the University of Southern California (USC/ISI), led by John Granacki, was developing the Data IntensiVe Architecture (DIVA). Raytheon became part of the USC/ISI team and the two elements, DIVA and HPPS, were combined to create Monarch.

By Phase III of PCA, Raytheon became the prime contractor and the team grew to include Exogi, Mercury Computers, IBM and Georgia Tech. The chip was fabricated by IBM using their Cu08 (90nm) CMOS ASIC process.

Architecture

The Monarch chip includes six reduced instruction set computer (RISC) processors, 12 megabytes of on-chip dynamic random access memory (DRAM), two DDR2 ports, two serial RapidIO ports, sixteen 2.6-gigabyte-per-second streaming I/O ports, and the FPCA, a reconfigurable computing array. A Monarch chip can boot from a single commercial flash memory part, providing a highly embeddable system-on-a-chip processing solution. Monarch can also be used as a tiled array of processing chips to build a multi-teraflop computer, again with no glue parts required, thus achieving excellent size, weight, energy, performance and time values and enabling embedded systems that demand high-performance computing for complex algorithms.

The FPCA is key to the Monarch chip’s high performance and efficiency. The FPCA contains 96 multiplier-ALUs, 124 dual-port memories, 248 address generators, and 20 direct memory access (DMA) engines, all connected through a rich, dynamically switched interconnect. The architecture of the FPCA has been optimized for signal processing algorithms, such as fast Fourier transforms and finite impulse response filters, using 16- and 32-bit integer and 32-bit IEEE floating point data. The FPCA uses a dataflow processing paradigm that supports streaming data, with hardware support for dataflow synchronization, and uses a novel distributed programming paradigm. Monarch also supports threaded-style execution through six independent RISC processors. The processors may also be configured to operate on 256-bit wide words or to perform single instruction, multiple data operations. Many of the on-chip data paths and memories are 256 bits wide for high bandwidth; others are 32 bits wide to match common data needs. Total on-chip memory bandwidth is 390 gigabytes per second, enabling a sustained throughput of 64 gigaflops.

For power efficiency, Monarch differs substantially from nearly all conventional digital signal processors or RISC processors. Conventional architectures use DMA to place data into memory, pull it out for computing, put it back into memory and then use DMA to send it to the output devices. Monarch data paths support direct execution of dataflow graphs, streaming data from input devices through computing elements to output devices with no need to store the data in memory. Streaming execution that bypasses memory saves the energy that would otherwise be spent repeatedly storing and retrieving data. Memory is used only when the algorithm requires time alignment or for saving state.
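The streaming idea can be illustrated in software, with the caveat that this is only an analogy and not Monarch code. The sketch below chains Python generators so that each sample flows from a source, through a finite impulse response filter, to a gain stage without any intermediate array being stored. The stage names (`source`, `fir`, `scale`, `run_pipeline`) and the filter taps are hypothetical and chosen purely for illustration; the only memory the filter keeps is a short delay line, which corresponds to the "time alignment" use of memory described above.

```python
from collections import deque
from typing import Iterable, Iterator, List, Sequence


def source(samples: Iterable[float]) -> Iterator[float]:
    """Stand-in for an input device: hands samples downstream one at a time."""
    for sample in samples:
        yield sample


def fir(stream: Iterable[float], taps: Sequence[float]) -> Iterator[float]:
    """FIR filter stage (hypothetical taps).

    The only state is a delay line holding the last len(taps) samples,
    i.e. memory used for time alignment, not for buffering whole blocks.
    """
    delay = deque([0.0] * len(taps), maxlen=len(taps))
    for x in stream:
        delay.appendleft(x)  # newest sample first; oldest falls off the end
        yield sum(c * d for c, d in zip(taps, delay))


def scale(stream: Iterable[float], gain: float) -> Iterator[float]:
    """Stateless stage: multiplies each sample as it passes through."""
    for x in stream:
        yield gain * x


def run_pipeline(samples: Iterable[float], taps: Sequence[float], gain: float) -> List[float]:
    """Wire the stages together; the list() call plays the output device."""
    return list(scale(fir(source(samples), taps), gain))


result = run_pipeline([1.0, 2.0, 3.0, 4.0], taps=[0.5, 0.5], gain=2.0)
print(result)
```

Each stage pulls one sample, does its work, and passes the result on, so no stage ever writes a full block of intermediate results to memory and reads it back.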

The Monarch programming environment is a combination of industry-standard languages (C/C++) and a machine-specific dataflow language. The RISC compiler handles scalar code and provides auto-vectorization for code targeting the wide-word processor. The dataflow assembler is the primary path for programming the FPCA. Math libraries can be used for both the threaded and dataflow streaming portions of the machine to reduce programmer workload. There are simulators, debuggers, and a real-time executive for the machine. The tools vary in maturity, but all are sufficient for developing application code.

Figure 1

Monarch Advantages

Monarch provides high signal processing throughput in a power-efficient balanced architecture. Monarch can perform as a single system on a chip or as an array of chips to provide teraflop throughput. Tests have shown that Monarch can sustain 64 gigaflops of throughput via the FPCA while consuming less than 20 watts of power. At more than 3 gigaflops of throughput per watt, the current generation is one of the most efficient processors available.
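The efficiency figure follows directly from the numbers quoted in this article, as the short calculation below shows. It treats the "less than 20 watts" figure as an upper bound on power, so the result is a lower bound on efficiency. It also divides the 390 gigabytes per second of on-chip memory bandwidth from the Architecture section by the sustained throughput to give a rough bytes-per-operation ratio. The variable and function names are illustrative only.

```python
# Figures quoted in the article.
SUSTAINED_GFLOPS = 64.0        # sustained FPCA throughput
POWER_CEILING_W = 20.0         # "less than 20 watts", used as an upper bound
ONCHIP_BANDWIDTH_GBPS = 390.0  # total on-chip memory bandwidth, gigabytes/s


def gflops_per_watt(gflops: float, watts: float) -> float:
    """Throughput delivered per watt of power consumed."""
    return gflops / watts


# Because the power figure is a ceiling, this is a floor on efficiency.
efficiency_floor = gflops_per_watt(SUSTAINED_GFLOPS, POWER_CEILING_W)

# Bytes of on-chip memory bandwidth available per floating point operation.
bytes_per_flop = ONCHIP_BANDWIDTH_GBPS / SUSTAINED_GFLOPS

print(f"at least {efficiency_floor:.1f} gigaflops per watt")
print(f"about {bytes_per_flop:.2f} bytes of on-chip bandwidth per flop")
```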

Dataflow processing is the key to Monarch’s power efficiency. Dataflow is rooted in an early 1900s technology process: the assembly line.

A standard general-purpose processor can be viewed as a single worker with a long “to-do” list to complete a job. A lot of time and energy is wasted checking what needs to be done next (instruction fetching).

The FPCA processor is an entire shop full of specialized workers. Each worker is given only a short list of operations to perform. Upon completion, the incremental product is sent down the line, and the worker receives the next piece and performs the same operations on it. The operations are nearly always the same, with only limited flexibility (e.g., a painter can be told to paint it green instead of blue, but never to weld). FPCA processors have extremely high throughput at the cost of lower overall flexibility. Dataflow architecture also eases programming workload.

Programming Monarch’s FPCA requires the mindset of an industrial engineer as much as a software engineer to:

  • Decompose a task into individually workable units
  • Effectively utilize workers
  • Balance the workload
  • Optimally route between workers
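The industrial-engineering view in the list above can be made concrete with a simple steady-state model of an assembly line. In the sketch below, which is a generic pipeline model and not a description of Monarch tools, each worker is represented only by the time it needs per item. The line's throughput is set by the slowest worker, and every other worker sits idle part of the time. The function name `pipeline_metrics` and the stage times are hypothetical.

```python
from typing import Dict, List, Union

Metrics = Dict[str, Union[float, List[float]]]


def pipeline_metrics(stage_times: List[float]) -> Metrics:
    """Steady-state model of a linear pipeline (assumes no stalls or queues fill up).

    stage_times: time each worker needs to process one item.
    """
    bottleneck = max(stage_times)
    return {
        # One item leaves the line each time the slowest worker finishes.
        "items_per_unit_time": 1.0 / bottleneck,
        # Time for a single item to pass through every worker in turn.
        "latency": sum(stage_times),
        # Fraction of time each worker is busy rather than waiting.
        "utilization": [t / bottleneck for t in stage_times],
    }


# Same total work (6 time units per item), split two different ways.
unbalanced = pipeline_metrics([1.0, 4.0, 1.0])
balanced = pipeline_metrics([2.0, 2.0, 2.0])

print(unbalanced["items_per_unit_time"], unbalanced["utilization"])
print(balanced["items_per_unit_time"], balanced["utilization"])
```

With the same total work per item, the balanced split doubles throughput in this model, which is why decomposing the task and balancing the workload matter as much as the code each worker runs.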

System Impact

DARPA’s SAVi program is an example of the trend toward large data volume sensors with demanding requirements, with a sensor capable of producing 1 to 2 gigasamples per second of data and needing hundreds of gigaflops of compute power.

SAVi uses laser Doppler vibrometry to detect mines and tunnels in real time, from a mobile platform. SAVi induces ground-surface vibrations by applying an acoustic stimulus for land mines and improvised explosive devices and a seismic stimulus for tunnels. Laser Doppler vibrometry allows for non-contact vibration measurements of a surface by detecting the Doppler shift of a laser beam frequency to derive the vibration velocity over time for a target.
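The relationship between Doppler shift and surface velocity can be sketched with the standard textbook formula for light reflected back from a moving surface: the shift is twice the velocity divided by the laser wavelength, provided the velocity is far below the speed of light and is measured along the beam. The sketch below applies that relation in both directions. The article does not give SAVi's laser wavelength or typical vibration velocities, so the 1550-nanometer wavelength and 1-millimeter-per-second velocity are assumed values for illustration only, and the function names are hypothetical.

```python
def doppler_shift_hz(velocity_m_s: float, wavelength_m: float) -> float:
    """Frequency shift of a beam reflected from a surface moving along the beam.

    Standard relation for velocities far below the speed of light:
    shift = 2 * v / wavelength (the factor 2 comes from the round trip).
    """
    return 2.0 * velocity_m_s / wavelength_m


def velocity_from_shift(shift_hz: float, wavelength_m: float) -> float:
    """Invert the relation to recover surface velocity from a measured shift."""
    return shift_hz * wavelength_m / 2.0


# Assumed values, not taken from the article.
ASSUMED_WAVELENGTH_M = 1.55e-6   # 1550 nm, a hypothetical laser wavelength
ASSUMED_VELOCITY_M_S = 1.0e-3    # 1 mm/s, a hypothetical surface velocity

shift = doppler_shift_hz(ASSUMED_VELOCITY_M_S, ASSUMED_WAVELENGTH_M)
recovered = velocity_from_shift(shift, ASSUMED_WAVELENGTH_M)

print(f"Doppler shift: {shift:.1f} Hz")
print(f"Recovered velocity: {recovered:.6f} m/s")
```

Repeating this conversion for every sample at every measurement point is one reason a sensor of this kind produces such a heavy processing load.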

The Monarch implementation will replace the original SAVi processing approach, which was estimated to require 96 conventional commercial processors, with 16 Monarch chips. The SAVi program will use quad-chip boards developed by Mercury Federal Systems for Raytheon, in a four-board chassis configuration, to meet SAVi system processing requirements.
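The chip count and a rough aggregate throughput follow from figures already given in this article, as the calculation below shows. The aggregate number assumes that the 64 gigaflops sustained per chip scales linearly across the array, which is an idealization for illustration and not a measured SAVi result. The power figure is an upper bound for the chips alone, based on the "less than 20 watts" per chip quoted earlier.

```python
# Figures quoted in the article.
BOARDS_PER_CHASSIS = 4
CHIPS_PER_BOARD = 4                 # "quad-chip boards"
GFLOPS_PER_CHIP = 64.0              # sustained FPCA throughput per chip
POWER_CEILING_PER_CHIP_W = 20.0     # "less than 20 watts" per chip
CONVENTIONAL_PROCESSORS = 96        # original SAVi estimate

monarch_chips = BOARDS_PER_CHASSIS * CHIPS_PER_BOARD

# Idealized: assumes perfect linear scaling across chips.
aggregate_gflops = monarch_chips * GFLOPS_PER_CHIP

# Upper bound on chip power only; boards and chassis are not included.
chip_power_ceiling_w = monarch_chips * POWER_CEILING_PER_CHIP_W

processors_replaced_per_chip = CONVENTIONAL_PROCESSORS / monarch_chips

print(f"{monarch_chips} Monarch chips")
print(f"about {aggregate_gflops:.0f} gigaflops under ideal scaling")
print(f"under {chip_power_ceiling_w:.0f} watts for the chips alone")
print(f"{processors_replaced_per_chip:.0f} conventional processors replaced per chip")
```

Under these assumptions the four-board chassis lands at roughly one teraflop, consistent with the "hundreds of gigaflops" the SAVi sensor is said to need.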

Raytheon is proud of this innovative processor. We plan to continue providing similar creative solutions to stay at the forefront of information systems and computing technologies to meet the future needs of our customers’ radar, electro-optical, missile, communications, and signal intelligence systems.

Kenneth Prager
