programmablelogicZONE Products for the week of August 4, 2008
Cswitch Says...
Industry’s Highest Bandwidth Configurable Device Family
- Supports the evolution of networks to higher data rates and increased services without power and cost penalties
- CS90 Family based on innovative Configurable Switch Array architecture uses familiar HDL-based design flow
Cswitch Corporation introduced its complete CS90 family of configurable logic devices addressing applications requiring up to 100 Gbits/s of packet processing bandwidth. The CS90 family is based upon Cswitch’s innovative Configurable Switch Array (CSA) architecture, which is a dramatic departure from currently existing solutions. It offers ASIC performance with FPGA flexibility by embedding configurable functions that are tailored to efficiently support any packet-based application.
“The heterogeneous Configurable Switch Array architecture used by the CS90 family offers designers all the building blocks necessary to build 20 to 100 Gbits/s datapath applications in a single device,” said Doug Laird, Cswitch President and CEO. “Furthermore, because the architecture utilizes embedded blocks, timing closure effort is significantly reduced. The CS90 family meets our customers’ challenges,” added Laird.
Cswitch’s innovative Configurable Switch Array (CSA) architecture includes fully configurable embedded blocks operating at up to 1 GHz, supporting common data plane functions such as header parsing and CRC generation. In addition, the architecture uses the proprietary dataCrossconnect packet transport fabric to move data between blocks at 2 GHz, an industry first. This combination of embedded blocks and innovative interconnect allows CS90 devices to break through bandwidth bottlenecks found in FPGAs, thereby offering a true ASIC alternative for next generation packet-based applications seeking 20 – 100 Gbits/s throughput.
The complete CS90 family is comprised of three devices: the CS9050, CS9070 and the CS9090. The family offers in excess of 9 million usable gates, complemented by up to 40 serial transceivers operating at 6.4 Gbits/s. Each member contains three types of embedded blocks called Configurable Packet Engines (CPE) designed to address header parsing, fast lookups, and high-bandwidth polynomial arithmetic. In addition, the family supports up to 19 Mbits of on-chip memory, as well as 4 embedded high-speed memory controllers supporting DDR2 and RLDRAMII at 533 MHz.
“The CS90 family from Cswitch marks a dramatic and necessary evolution for the semiconductor industry as it attempts to address high performance applications that are no longer served economically by FPGAs or ASICs,” said Rich Wawrzyniak, Semiconductor Analyst for Semico.
EN-Genius Says…
It’s almost as if the absence of product announcements from the major FPGA makers has created a news vacuum that Cswitch and other upstarts like XMOS and Atmel have rushed to fill with daring challenges to traditional programmable logic technologies. One of the most intriguing contenders is the Cswitch CS90 family of configurable logic devices. While it might not be appropriate for some of the more general-purpose applications that FPGAs are called upon to handle, Cswitch technology could plug an important gap in the high-performance packet processing and traffic management application space (somewhere north of 20 Gbit/s) where NPUs start to bottom out, FPGAs start to draw too much power, and ASICs just cost too much.
Cswitch devices are a mixed array of traditional LUTs (known as programmable logic blocks, or PLBs) and simple, fast, configurable packet engines (CPEs) tied together with an unusual 2-tiered interconnect fabric. The PLBs are somewhat similar to the 4-input LUTs found in Xilinx Virtex4 devices and run at 500 MHz. There are around 7000 8-LUT logic blocks spread across the chip for a total of 57,300 LUTs to play with. While this is a small fraction of the number of LUTs available in Brand A or Brand X larger chips, Cswitch’s hardware packet engines, math units, and I/O cores eliminate the need to waste gates implementing these functions.
Since Cswitch has specifically targeted networking/comms applications as its major focus, it’s no wonder that they have sprinkled a healthy dose of Configurable Packet Engine (CPE) cores throughout the chip. Boasting clock speeds of up to 1 GHz, the CPE’s dedicated processing blocks can be configured to handle packet inspection, analysis and classification as well as any repetitive math functions you might need.
A CPE consists of three separate elements that can be configured and assembled as needed:
- Packet parsers (PPs) – These simple 16-bit engines are specifically designed for efficient header parsing using either simple lookup tables or working in conjunction with the chip-embedded CAM elements. If your application requires it, you can also use the interconnect fabric to attach one or more CPEs to larger external lookup hardware
- Reconfigurable CAMs (RCAMs) – These handy blocks can give you either a 40-bit x 64-word search space for simple binary CAM functions or a 20-bit x 64-word TCAM. The compact size is handy for simple classification tasks but if you want to handle larger tables or larger search terms (such as a full IPv6 header) you can use a portion of the chip’s high-speed interconnect fabric to access an external TCAM or other search engine
- Reconfigurable arithmetic units (RAUs) – Somewhat like a traditional ALU, but more flexible, these cores can perform network-related math functions like CRC, hashing, checksum, and the heavy-duty computations associated with crypto applications. Cswitch says the RAUs also do well at DSP-like MAC functions that allow efficient implementation of digital filters, equalizers, and FFT-related functions. You can use a C-based development tool to write your own routines or use a sizeable library of canned math functions that’s already available from Cswitch
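To make the binary-versus-ternary distinction in the RCAM description concrete, here is a minimal software sketch of a ternary lookup. It is only an illustration of the general TCAM idea, not Cswitch's implementation: the 20-bit width and 64-word depth come from the figures quoted above, while the names `TernaryEntry`, `tcam_lookup`, and `EXAMPLE_TABLE`, the value/mask encoding, and the first-match priority rule are assumptions. A hardware TCAM compares all entries in parallel; the loop below only models the result.

```python
from dataclasses import dataclass
from typing import List, Optional

WIDTH = 20  # bits per ternary entry (RCAM figure quoted in the article)
DEPTH = 64  # words per RCAM block (RCAM figure quoted in the article)


@dataclass
class TernaryEntry:
    """Hypothetical entry format: a bit pattern plus a care mask."""
    value: int  # bit pattern to compare against
    mask: int   # 1 = this bit must match, 0 = don't care


def tcam_lookup(table: List[TernaryEntry], key: int) -> Optional[int]:
    """Return the index of the first matching entry, or None if nothing matches.

    First-match priority is an assumption made for this sketch.
    """
    if len(table) > DEPTH:
        raise ValueError("table exceeds the 64-word depth assumed here")
    key &= (1 << WIDTH) - 1  # keep only the low 20 bits of the search key
    for index, entry in enumerate(table):
        # Bits that differ between key and value are ignored where mask is 0.
        if (key ^ entry.value) & entry.mask == 0:
            return index
    return None


# Example table: one rule that checks only the top 12 bits, then a catch-all.
EXAMPLE_TABLE = [
    TernaryEntry(value=0xABC00, mask=0xFFF00),
    TernaryEntry(value=0x00000, mask=0x00000),
]
```

A binary CAM is the special case in which every mask bit is 1, which is one way to picture why the same block can offer a wider 40-bit word in binary mode than in ternary mode.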
A traditional SRAM-based local interconnect system is used to configure the PLBs and knit them together into small blocks with the CPEs. Depending on the size of the chip, you get between 32 and 64 CPE cores, each of which can be used in stand-alone mode or ganged together with other CPEs.
The CPE and PLB blocks are overlaid by a meshed interconnect fabric that’s responsible for moving data between blocks as well as to and from the device’s I/O resources. Each row and column is a stand-alone 40 Gbit/s (20-bit, 2 GHz) time-slotted channel that is linked to local logic structures and other channels by fabric switch/gearboxes that are sprinkled throughout the chip. Cswitch was hazy about some of the details but, from what I was told, each row and column functions like a scaled-down SONET TDM channel. Following this analogy, connections between CPEs, PLBs, and the I/O resources located around the edge of the chip are made by the switch nodes that allocate one or more dedicated time slots to a particular sub-channel. Because the top-layer interconnect uses a deterministic switched mechanism, the timing interactions between logic elements should be well-defined and make it much, much easier to achieve timing closure for your design.
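The time-slot idea can be sketched in a few lines. The 20-bit width and 2 GHz clock are the figures quoted above, and together they give the 40 Gbit/s per-channel rate. Everything else here is an assumption made for illustration, since Cswitch did not disclose the details: the frame length of 16 slots, the function names, and the simple first-come slot assignment.

```python
import math
from typing import Dict, List

LANE_BITS = 20        # channel width quoted in the article
CLOCK_GHZ = 2.0       # channel clock quoted in the article
SLOTS_PER_FRAME = 16  # assumption: the real frame length was not disclosed


def channel_gbps() -> float:
    """Raw capacity of one row or column channel: 20 bits x 2 GHz."""
    return LANE_BITS * CLOCK_GHZ


def slots_needed(required_gbps: float, slots: int = SLOTS_PER_FRAME) -> int:
    """Number of whole time slots needed to carry the requested bandwidth."""
    per_slot = channel_gbps() / slots
    return math.ceil(required_gbps / per_slot)


def allocate(demands: Dict[str, float],
             slots: int = SLOTS_PER_FRAME) -> Dict[str, List[int]]:
    """Give each named connection a fixed, dedicated set of slot indices."""
    schedule: Dict[str, List[int]] = {}
    next_slot = 0
    for name, gbps in demands.items():
        count = slots_needed(gbps, slots)
        if next_slot + count > slots:
            raise ValueError(f"channel oversubscribed while placing {name!r}")
        schedule[name] = list(range(next_slot, next_slot + count))
        next_slot += count
    return schedule


# Two hypothetical connections sharing one channel.
EXAMPLE_SCHEDULE = allocate({"parser_to_rau": 10.0, "serdes_rx": 6.4})
```

Because each connection owns its slots outright, its latency and bandwidth do not depend on what the other connections are doing, which is the property that should simplify timing closure.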
All devices in the CS90 family enjoy a menu of I/O choices that offers as many options as anything offered on the highest-density conventional FPGA. My only concern (as I’ll discuss further later) is that I’m unsure of the pedigree or actual performance of their SerDes I/Os, one of the trickiest and most critical elements of any modern networking component. Because the I/O menu is so extensive, I’ll just post the list as it was presented to me in the briefing:
640 General Purpose I/O pins for parallel interfaces
- 1.25 Gbit/s Differential
- 1.07 Gbit/s Single-ended
- Configurable I/O standard, drive, & termination
- Supports XAUI, PCIe, SGMII, FC, and other proprietary SerDes-based interconnects
- Supports SPI-4.2, XSBI, DDR, QDR interfaces
24 - 40 Multirate SERDES
- Each 1.0 - 6.4 Gbit/s full-duplex channel can be equipped with selectable MAC blocks to support a 10/100/1GbE, PCIe 1/2, SGMII, or 1/2/4G FibreChannel connection
- Each set of four SerDes channels has a 10GbE MAC
- Code to implement a soft 10-Gbit/s Interlaken interface is available now, hard core to follow in Q4 2008
4 High-Speed Memory Controllers (about 2x faster than FPGA-based controllers)
- Configurable for 8 bit to 72 bit
- Supports DDR-II, RLDRAM-II (533 MHz), QDR-II
1 GHz Embedded Memory
- 16 Mbit of embedded RAM (in 15 cascadable blocks).
- 2.4 Mbit of dual-port RAM (120 cascadable blocks)
- If needed the embedded RAM can be configured as pseudo-dual-port memory
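As a quick sanity check, the figures in this list can be totaled and set against the headline numbers. The sketch below uses only values quoted in the article. The function names are made up for illustration, and the SerDes totals are raw line rates per direction rather than usable payload throughput, which depends on protocol overhead that the briefing did not cover.

```python
SERDES_CHANNELS_MIN = 24   # smallest SerDes count in the list above
SERDES_CHANNELS_MAX = 40   # largest SerDes count in the list above
SERDES_GBPS = 6.4          # top line rate per channel
EMBEDDED_RAM_MBIT = 16.0   # embedded RAM from the list above
DUAL_PORT_RAM_MBIT = 2.4   # dual-port RAM from the list above
TARGET_GBPS = 100.0        # top of the 20 - 100 Gbit/s target range


def serdes_aggregate_gbps(channels: int, rate: float = SERDES_GBPS) -> float:
    """Raw aggregate line rate in one direction with every channel at full speed."""
    return channels * rate


def total_on_chip_mbit() -> float:
    """Sum of the two on-chip memory pools listed in the briefing."""
    return EMBEDDED_RAM_MBIT + DUAL_PORT_RAM_MBIT


def headroom(channels: int) -> float:
    """Ratio of raw SerDes line rate to the 100 Gbit/s processing target."""
    return serdes_aggregate_gbps(channels) / TARGET_GBPS


if __name__ == "__main__":
    for n in (SERDES_CHANNELS_MIN, SERDES_CHANNELS_MAX):
        print(f"{n} channels: {serdes_aggregate_gbps(n):.1f} Gbit/s raw")
    print(f"on-chip memory: {total_on_chip_mbit():.1f} Mbit")
```

The memory total of about 18.4 Mbit is consistent with the "up to 19 Mbits" in the press release, and 40 channels at 6.4 Gbit/s give 256 Gbit/s of raw line rate in each direction, comfortably above the 100 Gbit/s processing target.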
Thanks to a combination of gate-efficient hard cores and an interconnect system that allows you to use much more of the available logic, a Cswitch device promises to deliver more processing power per square centimeter of silicon than a conventional FPGA – at least for the applications it’s been designed to address. Although I’m not sure it would do as well when used in more general-purpose applications, I believe Cswitch claims that they should be able to replace 2 - 4 equivalent-priced Virtex/Stratix-class FPGAs for designs involving packet inspection, classification, or other networking-oriented tasks.
From what I can see, you could build a fair-sized router or security device using only one or two Cswitch devices, a honking-big TCAM, and a fast PowerPC. In fact, in one design exercise, Cswitch was able to build a 4 x 10GbE Ethernet-MPLS Switch that handled Ethernet termination, frame parsing and header field extraction, MPLS encapsulation, and most other must-have features using only one of their devices. According to Cswitch, the design did not use up all the logic resources on the chip and it still managed to do the work of at least three equivalent-priced Virtex-5 FPGAs.
The CS90 power consumption varies according to how many logic elements are actually used, how fast they are clocked, and other application-specific details. In some of the initial tests, the chip dissipated around 15 W when configured as a 40 Gbit/s traffic manager. Other applications that used more on-chip resources and pushed the I/O elements harder could roughly double its power draw but the processor-style packaging (rated for 100 W) they’ve housed it in should be more than up to the job.
For all its promise, the Cswitch architecture raises several concerns that I think will have to be addressed before it finds any widespread market acceptance. The first issue that comes to mind is whether Cswitch has provided the hardware and software resources that will allow designers to quickly become comfortable with the technology and begin to use it effectively.
If the unusual nature of Cswitch architecture gives you unsettling thoughts about what it’s going to take to program/configure it, you’re not alone. When I raised my concerns, Cswitch said that they’ve come up with a set of development tools that have a similar look and feel to Brand X and Brand A tools. The tools generate Verilog/VHDL files that are passed to a piece of software from Magma that handles front-end placement and synthesis. Timing analysis is derived from Magma QuartzTime technology, which can be used to drive constraint-driven synthesis, place, and route.
Nevertheless, Cswitch does admit that their architecture and development board are not quickly mastered by a designer who is simply used to dealing with FPGAs. They say that early experiences with customers have taught them that converting a pure LUT-based design to take full advantage of the chip architecture typically requires some cooperation/hand-holding with experts – at least for the moment. Now that the first products have hit the market, Cswitch is in the process of making improvements in their tools and developing a larger library of configurable reference designs that they hope will eliminate this expensive part of the sales cycle, for most applications. Whether or not they really accomplish this will be one of the deciding factors as to whether these devices actually gain market traction, or become an interesting science fair experiment.
I am also concerned about whether the device’s on-chip SerDes transceivers are robust enough to perform well in the less-than-ideal channel conditions that they’re likely to encounter in real-world applications. While I worry a bit about their reach in copper cabling, my main concern is how they will perform in the unruly environments of longer PCB traces and backplanes, which can have all sorts of unpleasant attenuation, reflection, and crosstalk.
Any of the major FPGA makers has probably spent almost as much on developing its SerDes transceivers as Cswitch has spent on its entire product, and yet at least one of them has experienced persistent problems in achieving reliable signal integrity in higher-speed applications. I regret that time constraints prevented me from finding out whether Cswitch developed its own SerDes or where it got the IP from because that might have provided some clues as to what to expect. For now, I’ll consider the issue open and hope to see a credible demonstration at one of the conferences I attend, or when I visit Cswitch on one of my trips through the Valley.
Since Cswitch has been shipping products to selected customers for several months, I am not as worried about whether the design will actually work as I am about what the company is doing to gain a toehold in a market that’s notoriously tough on upstart technologies. Most of the startups I’ve seen that actually got market traction aggressively pursued partnerships and reference designs with complementary semiconductors and put a lot of energy into cultivating an ecosystem of 3rd-party tools and IP. Cswitch’s partnership with Magma on development tools is a good start but the people who briefed me were somewhat vague on what else they intend to do.
So, if Cswitch manages to overcome these obstacles, does this spell the imminent death of the FPGA as we know it? Most likely the answer is no. Nevertheless, the rising cost of ASIC development is creating several opportunities for new kinds of programmable logic technology that deliver better performance and lower power consumption in certain kinds of applications than their SRAM-based counterparts can. So while the Cswitch CS90 family of configurable logic devices may not ever displace traditional FPGAs in general-purpose applications, its unique architecture may help it capture the high-speed packet processing jobs that lie beyond the grasp of today’s biggest, baddest Stratix/Virtex-class devices.
The CS9070 is the first member of the CS90 device family and is sampling today. 10-k unit pricing brackets Stratix/Virtex pricing at under $300 for the smaller CS9050 device to around $700 for the top-line CS9090. The CS9050 and the CS9090 will begin sampling later in 2008. Cswitch’s Andara Development Tool Suite supporting the CS90 family is available today.
Update 8/8/08:
After this review was originally posted, Ed McKernan, VP of Business Development at Cswitch, responded to some of my concerns about their SerDes interface circuits and agreed to have his comments posted here.
He told me that Cswitch did indeed license a 3rd party SerDes and found the performance to be better than some of the FPGAs they’ve seen. McKernan also explained that tests on their transceiver had produced a fairly clean receive eye diagram after running a 6.4 Gbit/s SerDes signal across 22 inches of trace but, as anyone who’s worked with this technology knows, that is only half the story. At this point I asked a few additional questions that might shed a little light on the actual performance we can expect from Cswitch’s SerDes transceivers under real-world conditions.
What was the dielectric material in the PCB? Was it FR4 or something more exotic?
The PCB material used on our board is FR4. The backplane had Nelco 4000-13SI. We have also tested long PCB traces on FR4 (but without backplane connectors).
What connectors did you use, and was there any backdrilling in the PCB for transitions?
We used HMZD backplane connectors (XAUI test backplane).
How many traces did you run at a time and how many layers did it take?
For the CS9070, pinout is done such that all SerDes routing can be done on one layer if necessary as this allows no-via pin escapes from BGA. System designers concerned with crosstalk mitigation in PCB routing may choose to use two layers - one for TX and one for RX. Our own validation boards have SerDes routing only on two layers for this reason and they run in bundles of 4 serial links each. For the CS9090, its pinout will support 2-layer SerDes routing on a larger number of serial links.
Was the SerDes testing done on a real-world application (like your router reference design) or a simple demonstration board?
SerDes testing has been simple demonstration using our validation board.