Intel Network Router IXP12xx User Manual

IXP12xx ATM OC12/Ethernet IP  
Router Example Design  
Performance and Headroom Analysis  
April, 2002  
Document Number: 301144-001  
 
Version 1.0, 4/10/02  
OVERVIEW  
This document details the performance and headroom analysis done on the IXP12xx ATM
OC12 / Ethernet IP Router Example Design. It covers the general performance aspects of the
protocols; cycle and instruction budgets; testing under different workloads; and performance
measurements in both simulation and hardware environments.
This document also attempts to analyze the amount of headroom available in this design for  
customers to add additional features by providing microengine and memory utilization metrics.  
Three different configurations are supported:
1. One ATM OC-12 port & eight 100Mbps Ethernet ports
   For use with the IXP1240/1250 with hardware CRC capability.
2. Four ATM OC-3 ports & eight 100Mbps Ethernet ports
   Similar to the above configuration (requires the IXP1240/50), except
   that it uses four OC-3 ports.
3. Two ATM OC-3 ports & four 100Mbps Ethernet ports
   For use with the IXP1200 (which does not have hardware CRC
   capability). Instead, CRC computation is performed by two
   microengines (thus the reduced data rates).
Since in each configuration aggregate Ethernet port bandwidth exceeds aggregate ATM port  
bandwidth, ATM port bandwidth is the limiting external factor. This example design supports  
full-duplex, full-bandwidth ATM communication on all available ATM ports.  
The design is able to simultaneously transmit and receive any traffic pattern on all available ATM  
ports at line rate. Line rate means that no idle cells should appear on the ATM links.  
Furthermore, no ATM PHY FIFO overflows or Ethernet MAC FIFO overflows or underflows  
should occur.  
MEASUREMENT ENVIRONMENT  
Simulation and hardware performance testing was performed under the following conditions:  
o 232 MHz IXP1240 with an 80 MHz IX Bus
  (IXP1200 measurements do not use the hardware-CRC on the IXP1240)
o 133 MHz SDRAM – ‘-75’ speed-grade
  (some results for 143 MHz (‘-7E’ speed-grade) are also provided where indicated)
Alternate DRAM Timing  
The project ships with two FLASH files for two different DRAM speed grades.  
atm_ether\tools\flash contains files for 133MHz (-75) and 143 MHz (-7E) DRAM. Most  
measurements were repeated with both settings to illustrate the sensitivity of the design to DRAM  
performance. Where not specifically mentioned in this document, the slower 133MHz settings  
were used.  
KEY WORKLOADS & APPROACHES TO TESTING THE EXAMPLE DESIGN  
Protocol Performance of IP over ATM vs. Ethernet  
Figure 1 details the protocol processing required to carry an IP packet over ATM and Ethernet.
Figure 1 – Protocol Processing
Figures 2 and 3 show that as the size of the IP packet varies so do the efficiencies of ATM and
Ethernet. This section details those efficiencies and the resulting performance implications.
Single Cell PDU Workload  
Single-cell PDUs result from IP packets of size 20 to 32 bytes – for example UDP packets with  
up to 4 payload bytes (8 bytes of LLC/SNAP plus 8 bytes of AAL5 trailer are included with the  
IP packet in the 48-byte cell payload). Adding a 4-byte ATM cell header plus 1-byte HEC results  
in a 53-byte cell. SONET overhead transparently adds about another 2 bytes/cell to the wire-time  
such that its total cost is 55-bytes in terms of a 155 or 622 Mbps ATM link.  
When the same packet is carried over Ethernet, it expands to consume a minimum-sized 64-byte
frame. Ethernet then adds at least 960ns of inter-packet gap (12 bytes), plus a preamble (8
bytes). The total packet cost is 84 bytes on a 100Mbps Ethernet link.
The result is that ATM is significantly more efficient than Ethernet in terms of Mbps for carrying
very small PDUs. Every Mbps of single-cell PDUs on the ATM link requires (84/55) Mbps on
the matching Ethernet link(s).
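The byte counts above can be turned into a small wire-cost model. The following Python sketch is illustrative only and is not part of the example design; the function and constant names are assumptions chosen for this note, and the 55-byte cell cost uses the approximate 2 bytes/cell of SONET overhead quoted above.

```python
# Wire-cost model for one IP packet, using the byte counts quoted in the text.
ATM_CELL_PAYLOAD = 48        # AAL5 payload bytes carried per cell
ATM_CELL_WIRE_BYTES = 55     # 53-byte cell plus about 2 bytes of SONET overhead
AAL5_OVERHEAD = 8 + 8        # LLC/SNAP header plus AAL5 trailer
ETH_OVERHEAD = 14 + 4        # Ethernet header plus FCS
ETH_MIN_FRAME = 64           # minimum Ethernet frame size
ETH_GAP_AND_PREAMBLE = 12 + 8


def cells_per_pdu(ip_len):
    """Number of ATM cells needed to carry an IP packet of ip_len bytes."""
    pdu_len = ip_len + AAL5_OVERHEAD
    return -(-pdu_len // ATM_CELL_PAYLOAD)   # integer ceiling division


def atm_wire_bytes(ip_len):
    """Bytes of ATM link time consumed by the packet."""
    return cells_per_pdu(ip_len) * ATM_CELL_WIRE_BYTES


def eth_wire_bytes(ip_len):
    """Bytes of Ethernet link time consumed by the packet."""
    frame = max(ETH_MIN_FRAME, ip_len + ETH_OVERHEAD)
    return frame + ETH_GAP_AND_PREAMBLE


for ip_len in (20, 32, 33, 40, 1500):
    print(ip_len, cells_per_pdu(ip_len), atm_wire_bytes(ip_len), eth_wire_bytes(ip_len))
```

For a 32-byte packet this gives one cell, 55 bytes on ATM and 84 bytes on Ethernet, which is the 84/55 ratio used above.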
[Figure 2 plot: Ethernet Frame Bytes and AAL5 PDU Length versus IP Packet Length [Bytes], for IP packets of 0 to 140 bytes]
Figure 2 – Frame and PDU Length versus IP Packet Length
[Figure 3 plot: expected Ethernet transmit bandwidth versus IP Packet Length [Bytes] (0 to 1500), with one curve for OC-12 Input and one for OC-3 Input]
Figure 3 – Expected Ethernet Transmit Bandwidth
As shown graphically in Figure 3, 622Mbps of single-cell-PDU input requires 622*(84/55) = 949  
Mbps of Ethernet output.  
This example design supplies 800Mbps of Ethernet bandwidth (IXP1240 configurations), so  
under a single cell/PDU workload the design can be expected to transmit Ethernet at line rate, and  
to discard the excess ATM input. In the reverse direction, if Ethernet data is received at wire rate  
(even with all 8 ports running at wire-rate), ATM transmit will not be saturated under a single  
cell/PDU workload.  
Multiple Cells/PDU Workload  
Following Figure 2 from left to right, it is clear that once a PDU size overflows from one cell into
two, Ethernet becomes more efficient in terms of Mbps.
When the packet completely fills two or three cells, ATM is again more efficient, but not by
much. For example, two full cells (an 80-byte IP packet: 118 bytes on Ethernet versus 2 x 55 bytes
on ATM) require 622*(118/110) = 667 Mbps; and three full cells (a 128-byte IP packet) require
622*(166/165) = 626 Mbps of Ethernet bandwidth for 622Mbps of ATM bandwidth. These
numbers are well below the 800Mbps of Ethernet bandwidth available in the example
configuration.
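The bandwidth comparison in this section can be checked numerically. The sketch below is a minimal illustration, not code from the project; `required_ethernet_mbps` is a hypothetical helper, and it assumes the 55-byte per-cell wire cost and the 20 bytes of Ethernet gap and preamble given earlier.

```python
ETHERNET_SUPPLY_MBPS = 800.0   # eight 100Mbps ports (IXP1240 configurations)


def required_ethernet_mbps(ip_len, atm_mbps=622.0):
    """Ethernet bandwidth needed to forward atm_mbps of PDUs carrying ip_len-byte packets."""
    cells = -(-(ip_len + 16) // 48)        # LLC/SNAP + AAL5 trailer, ceiling to whole cells
    atm_bytes = cells * 55                 # wire cost per cell including SONET overhead
    eth_bytes = max(64, ip_len + 18) + 20  # frame (minimum 64) plus gap and preamble
    return atm_mbps * eth_bytes / atm_bytes


for label, ip_len in (("1 full cell", 32), ("2 full cells", 80), ("3 full cells", 128)):
    need = required_ethernet_mbps(ip_len)
    print(f"{label}: {need:.1f} Mbps needed, fits in 800 Mbps: {need <= ETHERNET_SUPPLY_MBPS}")
```

Only the single-cell case exceeds the 800Mbps of Ethernet supply; every packet size that needs two or more cells stays below it under this model.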
Thus for multi-cell/PDU workloads, this design has more Ethernet bandwidth available than  
ATM bandwidth, and excess Ethernet input will be discarded. In the reverse direction, Ethernet  
transmit bandwidth will not be exceeded even if all ATM ports receive data at wire-rate (Figure  
3).  
While this design supports any IP packet size between 20 and 1500 bytes, 40-byte packets are
expected to be the most common. A 40-byte packet corresponds to a 20-byte IP header plus a
20-byte TCP header, with no payload. 40-byte IP packets form AAL5 PDUs that consume 2 ATM
cells.
The largest PDU supported by the design contains a 1500-byte packet. This packet is carried by a  
1518-byte Ethernet frame or by a 32-cell AAL5 PDU.  
CYCLE AND INSTRUCTION BUDGETS  
Cycle Budgets to support Line Rates  
OC-12 line rate is 622Mbps, but SONET overhead reduces it to 599Mbps available to ATM cells.  
53 bytes/cell * 8 bits/byte / 599 Mb/sec = 708 ns/cell. So 232MHz * 708 ns/cell = 164  
cycles/cell.  
OC-3 line rate is 155Mbps, but SONET overhead reduces it to 149Mbps available to ATM cells.  
53 bytes/cell * 8 bits/byte / 149 Mb/sec = 2.85 us/cell. So 232MHz * 2.85 us/cell = 660  
cycles/cell.  
Ethernet has a variable sized frame, and thus a variable per-frame cycle budget. The worst-case
is minimum-sized 64-byte frames, thus they are the focus for per-frame calculations here. A
64-byte frame actually occupies 84 bytes on the wire: (12 byte Inter Packet Gap) + (8 byte
preamble) + (14 byte Ethernet Header) + (46 byte payload carrying the IP packet) + (4 byte
Ethernet FCS) = 84 bytes/minimum frame. 84 bytes/frame * 8 bits/byte / 100Mb/sec = 6.72
usec/frame. 232MHz * 6.72 usec/frame = 1559 cycles/frame.
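The three line-rate budgets follow the same formula, which the short Python sketch below reproduces. It is an illustration only; `cycles_per_unit` is a name invented here, and the 599 and 149 Mbps figures are the post-SONET cell rates quoted above.

```python
CORE_MHZ = 232.0   # IXP1240 core clock used for the measurements


def cycles_per_unit(wire_bytes, link_mbps, core_mhz=CORE_MHZ):
    """Core clock cycles that elapse while one cell or frame occupies the wire."""
    wire_time_us = wire_bytes * 8 / link_mbps   # bits / (Mbit/s) = microseconds
    return wire_time_us * core_mhz              # microseconds * cycles per microsecond


oc12_budget = cycles_per_unit(53, 599.0)   # one ATM cell on OC-12
oc3_budget = cycles_per_unit(53, 149.0)    # one ATM cell on OC-3
eth_budget = cycles_per_unit(84, 100.0)    # one minimum-sized frame on 100Mbps Ethernet

print(round(oc12_budget), round(oc3_budget), round(eth_budget))
```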
These cycle budgets specify how frequently a cell or frame goes over the wire. If multiple  
threads handle multiple frames on the same wire, then the budgets are multiplied accordingly.  
For example, the OC-12 cycle budget is 164 cycles/cell, but since the four threads on a single
microengine can work on four cells simultaneously, the equivalent per-thread cycle budget
becomes 4 * 164 = 656 cycles, or approximately 660 cycles/cell. That is, four threads working on 4
different cells can each take up to 660 cycles to process a cell and still keep up with line rate.
This per-thread per-packet cycle budget is independent of how the thread consumes the cycles – it  
specifies only the maximum time (in cycles) between the beginning and end of packet processing.  
The cycles may be used for instruction execution, aborted instructions due to branches,  
microengine stalls due to command queue pushback, or idle cycles. Changes in any of these uses  
of time can cause a thread to meet or exceed its cycle budget.  
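As a rough illustration of how the per-thread budget is spent, the sketch below sums the four uses of time named above and compares the total with the budget. The function names and the example cycle counts are hypothetical and are not measurements from the design.

```python
def per_thread_budget(cycles_per_unit, threads):
    """Cycles each thread may spend on one cell when `threads` cells are in flight."""
    return cycles_per_unit * threads


def within_budget(executing, aborted, stalled, idle, budget):
    """True if the total elapsed cycles for one cell fit in the per-thread budget."""
    return executing + aborted + stalled + idle <= budget


budget = per_thread_budget(164, 4)   # OC-12 receive, four threads

# Hypothetical breakdown of one cell's processing time, in cycles.
print(budget, within_budget(executing=300, aborted=40, stalled=60, idle=200, budget=budget))
```

Because only the sum matters, a reduction in stall or idle cycles buys exactly as much headroom as removing the same number of instructions.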
The Developer’s Workbench IX Bus Device Simulator is typically configured to show  
performance in Mbps based on frames/sec. However, it can also be configured to display  
cycles/frame, which is useful in tuning a design to reach cycle budgets.  
Developer’s Workbench IX Bus Simulator – Bounded and Unbounded Wire Rates  
Simulations can be run with ports “bounded” or “unbounded” to the wire rate. Simulations run
with ports “bounded” to wire rate will always show exactly the correct cycle budget/frame,
because the ports are bound to the desired wire rate. It is also useful to run a simulation with the
ports “unbounded” to wire-rate (infinite bandwidth on the wire). This means that on the receive side
there is always data waiting on the wire, and on the transmit side the wire is always ready to
accept more data. If the design is able to run faster than wire-rate, then setting the IX Bus Device
Simulator to display in cycles/frame is useful for relating the achieved rate to instruction counts.
This technique was used to measure the OC-12 Receive Microengine over several workloads  
against its 164 cycles/cell budget. The 8 interleaved VC workloads were used to make sure that  
the VC-cache experienced a 100% miss rate. Figure 4 shows the results for both the –75 and –7E  
DRAM speed grades.  
Cells/PDU   Virtual Circuits   Cycles/Cell –75   Cycles/Cell –7E
1*          1                  154.6             137.9
1*          4 random           163.9             149.8
1*          8 interleaved      172.8**           159.0
2           1                  161.0             137.1
2           8 interleaved      158.4             149.2
32          1                  152.5             141.9
32          8 interleaved      131.5             127.4
* Simulations show that ATM Receive can handle the 1-cell/PDU workload, but that the IP Router in the  
next pipeline stage falls behind.  
** For –75 DRAM, the ATM Receive cycle budget is exceeded for a workload of single-cell, interleaved  
PDUs.  
Figure 4 – OC-12 Unbounded ATM Receive simulations versus 164 cycle budget  
One issue with running simulations unbounded to wire-rate is that it can hide errors, because there
is no concept of device overflows or underflows. Further, the design can become unbalanced:
for example, an efficient receiver may race ahead of the rest of the design, hogging shared
system resources and potentially penalizing another part of the system.
Another approach is to simulate bounded, but to bind to a wire-rate that is faster than the actual  
wire-rate. The disadvantage of this technique is that it is an iterative process. To discover the  
maximum performance one must raise the wire rate until the design fails to keep up, and then one  
must lower it until the design runs correctly without any overflows or underflows.  
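The raise-then-lower procedure can be automated as a search over the bound wire rate. The sketch below substitutes a binary search for the manual iteration described above. It assumes a hypothetical callback `runs_clean(rate_mbps)` that runs one bounded simulation and returns True when no overflow or underflow occurs; no such function exists in the Developer's Workbench, so it stands in for whatever scripting or manual step is used to run the simulation.

```python
def find_max_clean_rate(runs_clean, low_mbps, high_mbps, tolerance_mbps=1.0):
    """Binary search for the highest bound wire rate that still runs cleanly.

    Assumes runs_clean(low_mbps) is True and that failures are monotonic:
    once the design fails at some rate, it also fails at every higher rate.
    """
    if not runs_clean(low_mbps):
        raise ValueError("design does not keep up even at the lowest rate")
    if runs_clean(high_mbps):
        return high_mbps
    while high_mbps - low_mbps > tolerance_mbps:
        mid = (low_mbps + high_mbps) / 2.0
        if runs_clean(mid):
            low_mbps = mid     # design kept up: try a faster wire
        else:
            high_mbps = mid    # overflow or underflow: back off
    return low_mbps


# Stand-in for a real simulation run: pretend the design keeps up to 655 Mbps.
def fake_simulation(rate_mbps):
    return rate_mbps <= 655.0


best = find_max_clean_rate(fake_simulation, 622.0, 800.0)
print(f"highest clean rate found: {best:.1f} Mbps")
```

Each probe is a full simulation run, so the tolerance should be chosen no tighter than the result needs.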
SIMULATION MEASUREMENT PROCEDURE AND RESULTS  
In the simulation environment, 29, 40, and 1500 byte packets are measured using the Developer’s  
Workbench IX Bus Device Simulator’s streams facility. The workloads are homogeneous, in that  
the same sized packets are sent into both Ethernet and ATM ports.  
To measure the performance of the design, the simulation is run with the Ethernet ports bounded  
to 100Mbps, and the ATM ports bounded to 155 or 622 Mbps, as appropriate. The simulator is  
set to stop if it detects a device overflow or underflow.  
Full-bandwidth input streams of the specified packet size are simultaneously applied to all ATM  
and Ethernet ports present in the configuration for at least 1M cycles.  
Upon completion of the simulation run, the line rates in the IX Bus Device Status window are
observed. The Ethernet ports should be receiving at 100 Mbps each. The ATM port(s) should be
receiving at 622 Mbps (or 155 Mbps each). For 29-byte packets, the Ethernet side should transmit at
wire-rate and discard excess ATM input. For 40 and 1500-byte packet workloads the ATM side
should transmit at wire-rate and discard excess Ethernet input.
No device overflows or underflows were detected during the simulation.  
Simulated 29-byte packet performance¹
For the OC-12 and 4xOC-3 configurations running the 1 cell/PDU workload, the simulation stops  
with a watch-point when the MSGQ from the ATM Receive Microengine to the IP Route  
Microengine fills to capacity. This means that the IPR Microengine is not able to keep up with  
the 1 lookup/cell workload (1.4M lookups/sec). Upon disabling the watch-point and completing  
the 1M cycle simulation, the number of PDUs dropped due to the ATM_RX_IPR_FULLQ is  
compared to the total number of cells received via ATM. This shows that the IP Router  
Microengine drops 19-22% of the cells received via ATM. Conversely, it shows that the IP  
Router Microengine is routing 78-81% of the 1.4M cells/sec input, or about 1.1M routes/second.  
While this observation shows that under this workload the IP Router does not keep up with the
input, it also shows that for a workload with 2-cell PDUs, the IP Router has the capability of routing
(1.1M – 1.4M/2) = 400K routes/second more than the maximum 700K routes/second required.
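The routing headroom estimate above is simple arithmetic on the simulation counters. The following sketch restates it in Python; the variable and function names are illustrative assumptions, and the inputs are the rounded figures quoted in the text.

```python
OC12_CELLS_PER_SEC = 1.4e6   # one lookup per cell for the 1 cell/PDU workload


def routed_per_sec(input_per_sec, drop_fraction):
    """Routes completed per second when a fraction of the input is dropped."""
    return input_per_sec * (1.0 - drop_fraction)


routed_low = routed_per_sec(OC12_CELLS_PER_SEC, 0.22)    # worst observed drop rate
routed_high = routed_per_sec(OC12_CELLS_PER_SEC, 0.19)   # best observed drop rate

required_two_cell = OC12_CELLS_PER_SEC / 2   # one lookup per 2-cell PDU
headroom = 1.1e6 - required_two_cell         # using the rounded 1.1M routes/second

print(f"{routed_low:.0f} to {routed_high:.0f} routes/s measured, "
      f"{required_two_cell:.0f} needed for 2-cell PDUs, headroom {headroom:.0f}")
```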
¹ Simulations for 29-byte, 40-byte, and 1500-byte packet loads were run using 133 MHz memory (-75).
Both the OC-12 and 4xOC-3 configurations experience an ATM overflow after 1M cycles. This
indicates that under this system workload the receiver is not quite keeping up with the wire,
having dropped a cell within the first 6,000 cells.
Simulated 40-byte and 1500-byte packet performance²
The OC-12 and 4xOC-3 configurations perform at ATM wire-rate under full-duplex, full-  
bandwidth 40-byte and 1500-byte packet loads.  
HARDWARE MEASUREMENT PROCEDURE AND RESULTS  
While simulation provides a high degree of visibility into the design, there are several important  
benefits to measuring on hardware:  
1. The ability to do experiments on large numbers of packets. The simulator receives about  
10 cells/second, whereas the hardware can receive OC-12 data at 1.4M cells/second.  
Thus events that take a long time to occur on the simulator occur almost immediately on  
the hardware.  
2. Unplanned error conditions occur regularly on the hardware due to hot plugging cables,  
optical and electrical noise, mis-configured lab equipment or random error injection via  
lab equipment, etc – and the design must handle them correctly.  
3. The Transactor simulator does not model DRAM refresh overhead, and so configurations  
that are sensitive to DRAM bandwidth will notice a small performance hit on the  
hardware versus simulation.  
To measure the design in the lab, an ADTECH AX/4000 is attached to the ATM ports, and a  
Smartbits 600 is attached to the 8 Ethernet ports.  
Both pieces of equipment simultaneously generate traffic (at line-rate), and also observe the data  
that is being transmitted through the system from the other end of the design. Example AX/4000  
sequence files are included in the project under atm_ether\tools\AX4000.  
After the experiment, the Octal MAC and the IXF6012 “Volga” PHY counters are checked for  
evidence of underflows or overflows. Specifically, for the PHY, idle cells sent or received are  
searched for, as these would indicate that the ATM links were not fully utilized.  
The Counters_print() command at the VxWorks* command line also displays if the microcode  
discarded any packets, and why.  
Hardware Measurement Results  
Only the OC-12 configuration results are detailed here, as no WAN daughter-card was available  
with 4xOC-3 ports.  
ATM Transmit Rate is expressed as a percentage of OC-12 wire rate, as received and reported by
the AX/4000. IXF6012 Transmit Idle cells are reported by “_VolgaGetChanCounters”, and give
the number of times the PHY was not fed a cell in time to keep the wire busy, and thus had to
manufacture an idle cell. The number reported here is from the 2nd counters query when 2
“_VolgaGetChanCounters” commands are issued on the same line at the VxWorks prompt (this is
because “_VolgaGetChanCounters” prints out the delta between a previous invocation and the
present invocation). IXF6012 Overflows are measured the same way, and they are generally the result of
the StrongARM* core overhead involved in running the “_VolgaGetChanCounters” command
itself. “Ethernet Transmit Kframe/sec” captures the lowest and highest results as received and
reported by SmartBits600 over the 8 Ethernet ports.
² Simulations for 29-byte, 40-byte, and 1500-byte packet loads were run using 133 MHz memory (-75).
The test measurements are repeated with a variable number of full-bandwidth Ethernet ports  
driving the design. The test with “0” Ethernet input ports shows the maximum possible ATM-to-  
Ethernet performance, that is, when there is no Ethernet-to-ATM traffic to load down the system.  
This is effectively a half-duplex ATM-to-Ethernet forwarding measurement. More Ethernet input  
ports are added to show how the system handles the increase in load, even though for 40 and  
1500-byte packet measurements, 6-8 Ethernet ports over-subscribe available ATM transmit  
bandwidth.  
Hardware 29-byte packet performance  
Ethernet   ATM        IXF6012    ATM       IXF6012     Ethernet     Ethernet
Input      Transmit   Transmit   Receive   Overflows   Transmit     Transmit
Ports      Rate [%]   Idle       Ports                 KFrame/s     [MB/s]
8          84         N/A        1         4000        132 – 138    8 – 9
7          73         N/A        1         1000        127 – 147    8.5 – 9.5
6          63         N/A        1         0           133 – 148    8.5 – 9.5
0          0          N/A        1         0           148.8        9.5
Figure 5 – Single-cell/PDU Performance using 133 MHz DRAM
The bottom entry in the table with 0 Ethernet Input Ports shows half-duplex performance – i.e.
what the design does when it is only forwarding this workload from ATM to Ethernet. The result
is wire-rate ATM Receive and Ethernet Transmit performance, and the StrongARM core can run
“_VolgaGetChanCounters” without disturbing the data plane at all. As discussed above, this
workload is attempting to transmit 949Mbps out the 800Mbps of Ethernet ports. Indeed, 8
Ethernet ports x 148,808 frames/sec = 1.19M packets/second, while the ATM Receive packet
rate is 1.4M packets/sec. Looking at the microengine counters, the ratio between the packets
dropped due to full Ethernet Transmit queues and the packets dropped due to a full IP Router
input MSGQ shows that about 37% of the dropped packets are due to Ethernet transmit queues
being full, and the remaining 63% are due to the IP Router Microengine not being able to route
1.4M packets/second. This is consistent with the simulation result for the same workload that
showed the IP router couldn’t keep up with 1.4M routes/second.
Transmitting from Smartbits on 6 full-bandwidth Ethernet ports impacts Ethernet Transmit  
performance, but only on a couple of ports. But this is not enough Ethernet input to saturate  
ATM Transmit.  
Increasing the Ethernet workload to 7 ports, and then 8 ports, increases the ATM Transmit
performance, but with the ratio of 949Mbps Ethernet to 622Mbps ATM, this is still not enough
Ethernet input to saturate the ATM Transmitter. Also, Ethernet Transmit performance starts to
degrade in these scenarios, and the design becomes subject to ATM overflows from running
“_VolgaGetChanCounters”.
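The ATM Transmit Rate that a given number of Ethernet input ports can drive is predictable from the wire rates alone. The sketch below is an idealized model written for this note, not project code. It assumes every received Ethernet frame is forwarded to ATM with no drops, and it uses the 599Mbps OC-12 cell rate from the cycle budget section; the function names are hypothetical.

```python
ATM_CELLS_PER_SEC = 599e6 / (53 * 8)   # OC-12 cell rate after SONET overhead


def eth_frames_per_sec(ip_len, link_bps=100e6):
    """Frames per second on one saturated Ethernet port."""
    wire_bytes = max(64, ip_len + 18) + 20   # frame plus gap and preamble
    return link_bps / (wire_bytes * 8)


def cells_per_pdu(ip_len):
    return -(-(ip_len + 16) // 48)           # ceiling to whole cells


def atm_tx_utilization(eth_ports, ip_len):
    """Fraction of OC-12 transmit bandwidth offered by eth_ports full-rate inputs."""
    offered_cells = eth_ports * eth_frames_per_sec(ip_len) * cells_per_pdu(ip_len)
    return min(1.0, offered_cells / ATM_CELLS_PER_SEC)


for ports in (8, 7, 6):
    print(ports, f"{100 * atm_tx_utilization(ports, 29):.1f}%")
```

For 29-byte packets the model gives roughly 84%, 74%, and 63% for 8, 7, and 6 ports, which is why even eight Ethernet inputs cannot saturate the ATM transmitter for this workload.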
Ethernet   ATM        IXF6012    ATM       IXF6012     Ethernet     Ethernet
Input      Transmit   Transmit   Receive   Overflows   Transmit     Transmit
Ports      Rate [%]   Idle       Ports                 KFrame/s     [MB/s]
8          84         N/A        1         0           138 – 147    8.8 – 9.4
7          73         N/A        1         0           142 – 148    9.0 – 9.5
6          63         N/A        1         0           144 – 148    9.2 – 9.5
0          0          N/A        1         0           148.8        9.5
Figure 6 – Single-cell/PDU Performance using 143 MHz DRAM
Repeating the same measurements for 143 MHz DRAM results in the same ATM transmit  
bandwidth in all cases (Figure 6). Interference from “_VolgaGetChanCounters” is gone however,  
even in the 8 port configuration. Ethernet Transmit performance is slightly better, but still sub-  
wire-rate in the 8 port configuration. When there is no Ethernet input at all, the ratio between  
packets dropped by the IP Router versus full Ethernet transmit queues improved such that the  
Ethernet Transmit queues drop 45% and the IP Router drops only 55% of the excess input.  
Hardware 40-byte packet performance  
Ethernet   ATM        IXF6012    ATM       IXF6012     Ethernet     Ethernet
Input      Transmit   Transmit   Receive   Overflows   Transmit     Transmit
Ports      Rate [%]   Idle       Ports                 Frame/s      [MB/s]
8          97         30,000     1         700         88,300       5.6
7          100        1,500      1         69          88,300       5.6
6          100        700        1         0           88,300       5.6
5          100        0          1         0           88,300       5.6
4          84         N/A        1         0           88,300       5.6
Figure 7 – Two-cell/PDU Performance on 133 MHz DRAM
For the 40-byte (2 cell/PDU) workload there are half as many IP lookups/second as are required  
in the 29-byte (1 cell/PDU) workload. As expected, the IP Router was able to keep up with this  
workload and didn’t drop any packets.  
For five Ethernet input ports, the design performed perfectly, and “_VolgaGetChanCounters” did  
not cause any dropped cells.  
Reducing the Ethernet input to 4 ports did not allow enough input to saturate ATM Transmit.  
Increasing the Ethernet input ports to 6, 7 and 8 allowed over-subscription in the Ethernet to  
ATM forwarding direction, as evidenced by the ATM Transmit idle cells and ATM Receive  
overflows. Indeed, even under nominal conditions, the 8 port input configuration was able to  
drive only 97% of the ATM wire.  
For this packet size, Ethernet is more efficient than ATM, and the Ethernet transmitter cannot be
expected to be driven at wire rate. Indeed, under nominal conditions the Ethernet transmitter
performed perfectly for all tested configurations.
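The Ethernet transmit rates in the table can be estimated from the ATM receive rate. The sketch below is an idealized model under stated assumptions, not a measurement: it assumes ATM receive runs at the full 599Mbps OC-12 cell rate, that PDUs are spread evenly over the eight Ethernet ports, and that nothing is dropped. The function names are invented for illustration.

```python
ATM_CELLS_PER_SEC = 599e6 / (53 * 8)   # OC-12 cell rate after SONET overhead


def eth_tx_frames_per_port(ip_len, eth_ports=8, link_bps=100e6):
    """Expected transmit frame rate per Ethernet port for full-rate ATM input."""
    cells = -(-(ip_len + 16) // 48)                   # cells per AAL5 PDU
    frame = max(64, ip_len + 18)                      # Ethernet frame bytes
    wire_limit = link_bps / ((frame + 20) * 8)        # port cannot exceed wire rate
    offered = ATM_CELLS_PER_SEC / cells / eth_ports   # PDUs per second per port
    return min(offered, wire_limit)


def eth_tx_mbytes_per_sec(ip_len, eth_ports=8):
    """Frame bytes per second (in MB/s) transmitted on one port."""
    frame = max(64, ip_len + 18)
    return eth_tx_frames_per_port(ip_len, eth_ports) * frame / 1e6


for ip_len in (29, 40, 1500):
    print(ip_len, round(eth_tx_frames_per_port(ip_len)),
          round(eth_tx_mbytes_per_sec(ip_len), 2))
```

For 40-byte packets the model gives about 88,300 frames/second per port, well under the 148,810 frames/second wire limit, so the Ethernet transmitter is not the bottleneck for this workload.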
Ethernet   ATM        IXF6012    ATM       IXF6012     Ethernet     Ethernet
Input      Transmit   Transmit   Receive   Overflows   Transmit     Transmit
Ports      Rate [%]   Idle       Ports                 Frame/s      [MB/s]
8          100        0          1         0           88,300       5.6
Figure 8 – Two-cell/PDU Performance on 143 MHz DRAM
Using 143 MHz DRAM, the 40-byte (2-cell/PDU) workload performed perfectly, even with 8  
Ethernet ports over-subscribing the ATM Transmitter (Figure 8). “_VolgaGetChanCounters”  
recorded zero ATM Transmit Idle cells and zero ATM Receive overflows.  
This illustrates the sensitivity of the design to DRAM speed grade selection.  
Hardware 1500-byte packet performance  
Ethernet   ATM        IXF6012    ATM       IXF6012     Ethernet     Ethernet
Input      Transmit   Transmit   Receive   Overflows   Transmit     Transmit
Ports      Rate [%]   Idle       Ports                 Frame/s      [MB/s]
8          100        2          1         500         5,450        8.33
7          100        0          1         200         5,518        8.37
6          100        0          1         0           5,518        8.37
5          92         N/A        1         0           5,518        8.37
Figure 9 – 32-cell/PDU Performance on 133 MHz DRAM
Results for 1500-byte packets (32-cells/PDU) were similar to the 2-cell case. The design worked
flawlessly with six input Ethernet ports. Five ports were not enough to drive ATM to saturation,
and the design degraded slightly as it was over-subscribed by adding the 7th and then 8th Ethernet
input port (Figure 9).
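The number of Ethernet input ports needed to saturate ATM transmit follows from the two wire rates. The sketch below is an idealized calculation, assuming no drops and the 599Mbps OC-12 cell rate used in the cycle budgets; `ports_to_saturate_atm` is a hypothetical helper, not part of the design.

```python
import math

ATM_CELLS_PER_SEC = 599e6 / (53 * 8)   # OC-12 cell rate after SONET overhead


def ports_to_saturate_atm(ip_len, link_bps=100e6):
    """Smallest number of full-rate Ethernet inputs that fills the OC-12 transmitter."""
    cells = -(-(ip_len + 16) // 48)                        # cells per AAL5 PDU
    pdus_per_sec = ATM_CELLS_PER_SEC / cells               # PDUs the ATM link can carry
    wire_bytes = max(64, ip_len + 18) + 20                 # frame plus gap and preamble
    frames_per_port = link_bps / (wire_bytes * 8)          # frames one port can deliver
    return math.ceil(pdus_per_sec / frames_per_port)


for ip_len in (29, 40, 1500):
    print(ip_len, ports_to_saturate_atm(ip_len))
```

The model gives five ports for 40-byte packets and six for 1500-byte packets, matching the port counts at which the hardware first reached 100% ATM transmit rate. For 29-byte packets it gives ten ports, more than the eight available.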
Ethernet   ATM        IXF6012    ATM       IXF6012     Ethernet     Ethernet
Input      Transmit   Transmit   Receive   Overflows   Transmit     Transmit
Ports      Rate [%]   Idle       Ports                 Frame/s      [MB/s]
8          100        0          1         0           5,517        8.37
Figure 10 – 32-cell/PDU Performance on 143 MHz DRAM
Analogous to the 2-cell/PDU case, the 32-cell/PDU case performed perfectly using 143 MHz  
DRAM, even in the face of over-subscription with 8 Ethernet input ports and  
“_VolgaGetChanCounters” running on the core (Figure 10).  
StrongARM CORE PERFORMANCE  
This example design sends exception cells, PDUs and frames to the StrongARM core. It uses up  
to 4 message queues for this purpose – one for each of the microengines that can send data to the  
core. The core is alerted by an interrupt when data is put into the core message queues.  
Queue to Core Measurement Technique  
The performance of the queue-to-core path can be measured by modifying a nominal input data  
stream such that the entire stream is forwarded to core. For example, changing the IP version in  
the IP header from 4 to 5 will cause the packets to be forwarded to the core.  
The lab equipment sends this data stream at a known rate, and the amount of it that gets to the  
core is measured. Both the StrongARM core and the microcode count how many packets  
successfully reach the core. The microcode also counts how many packets failed to get queued to  
the core because the queue was full. In addition, the PHY or MAC count any packets that are  
dropped before they were received by the microengines. The ratio of packets that made it to the  
core over the total packets sent is multiplied by the input wire-rate to arrive at the queue to core  
success rate.  
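The success-rate calculation can be written out as follows. This is a minimal sketch with hypothetical counter names and example values. It also assumes that the total number of packets sent equals the sum of the three counters (delivered, dropped on a full queue, dropped at the PHY or MAC), whereas in the lab the total comes from the traffic generator.

```python
def queue_to_core_rate(reached_core, queue_full_drops, phy_mac_drops, input_rate):
    """Packets per second that successfully reach the core.

    reached_core     -- packets counted by the core and the microcode
    queue_full_drops -- packets the microcode could not queue (queue full)
    phy_mac_drops    -- packets dropped by the PHY or MAC before the microengines
    input_rate       -- known input wire rate in packets (or cells) per second
    """
    total_sent = reached_core + queue_full_drops + phy_mac_drops
    return reached_core / total_sent * input_rate


# Hypothetical counters for a 100,000 packets/second input stream.
rate = queue_to_core_rate(reached_core=900, queue_full_drops=80,
                          phy_mac_drops=20, input_rate=100_000)
print(f"{rate:.0f} packets/second reach the core")
```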
ATM Queue to Core Throughput  
For ATM management cells, and PDUs with IP exceptions, the core interrupt service routine  
schedules a user-context interrupt handler that consumes the appropriate message queue. Then it  
simply re-cycles the buffer descriptors by pushing them back onto the freelist without further  
processing.  
When faced with a stream of 100% exception PDUs, the IP Router Microengine is able to send
over 200,000 PDUs/second to the core by this method. This is over half of OC-3’s 353,207
cells/second link capacity. The throughput of this communication method peaks at over 320,000
cells/second, but at that point over 10% of the input stream is discarded due to full queues to
the core, and the VxWorks interrupt handler sub-system complains about ring buffer overflow as it
attempts to schedule the user-context interrupt handlers.
Ethernet Queue to Core Throughput  
Ethernet exception frames are sent to the StrongARM core via the same message queue technique  
as ATM exceptions. Again, the interrupt service routine schedules a user-context interrupt  
handler for every interrupt. However, the Ethernet core handler copies the packets into the  
VxWorks pseudo-Ethernet driver and sends them up the VxWorks IP stack before re-cycling the  
buffer descriptors.  
When forwarding exception packets to the VxWorks IP stack this way, the system is able to send
37,500 frames/second to the core. This is about 1/4 of the 148,810 frames/second on a 100Mbps
Ethernet link (minimum-sized Ethernet frames).
If the packets are simply discarded by the user-context interrupt handler instead of being copied  
into VxWorks, then the success rate rises to over 145,000 frames/second – nearly full 100Mbps  
link bandwidth. The system is able to send more packets to the StrongARM core – rising to over  
500,000 frames/second when faced with 4 ports at full bandwidth. However, the more it is over-  
subscribed, the higher percentage of packets the system drops – particularly as VxWorks  
succumbs to ring buffer overflow at about 200,000 frames/second.  
RESOURCE UTILIZATION AND HEADROOM ANALYSIS  
This section details system resource utilization, including per-microengine resources such as  
registers and microstore instructions; as well as shared resources such as Scratchpad RAM,  
SRAM, and DRAM. The memory utilization is shown using the default system memory map as  
shipped.³ One of the ATM utilities, config_print(), prints out the Scratchpad RAM, SRAM, and
DRAM address maps to show the memory map in detail. Almost all of the data structure sizes in  
these memories are configurable.  
Microengine Register and Microstore Headroom  
Register utilization is measured by using the free_register_test() macro to soak up all available  
registers. This is an iterative process – the macro either prints out how many registers it  
successfully allocated, or the assembler quits because it failed to allocate the requested registers.  
In this analysis this macro is applied at the global level, so the result is worst-case. It is generally  
possible to allocate more registers if their scope does not overlap the deepest scope in the design.  
Microengine               Available   Available   Available   Available   Available
                          General     SRAM        SDRAM       Threads     Microstore
                          Purpose     Transfer    Transfer                Instructions
                          Registers   Registers   Registers
Configuration 1 – IXP1240 1xOC-12, 8xEthernet
ATM Receive               2           0           5           0           1514
IP Route                  11          4           0           0           1677
Ethernet Receive          1           0           0           0           1363
(2 microengines)          1           0           0           0           1363
ATM Transmit              12          3           2           1           1728
Ethernet Transmit         5           0           6           0           1235
Configuration 2 – IXP1240 4xOC-3, 8xEthernet
ATM Receive               12          0           6           0           1724
IP Route                  11          4           0           0           1677
Ethernet Receive          1           0           0           0           1348
(2 microengines)          1           0           0           0           1348
ATM Transmit              8           1           2           0           1716
Ethernet Transmit         5           0           6           0           1235
Configuration 3 – IXP1200 2xOC-3, 4xEthernet
ATM Receive & IP Route    9           0           0           0           1403
Ethernet Receive          1           0           0           0           1370
CRC Check                 12          3           2           0           1811
CRC Generate              16          0           2           0           1847
ATM Transmit              8           1           4           2           1805
Ethernet Transmit         5           0           6           0           1235
Figure 12 – Microengine register and microstore headroom
³ So that a single version of the ATM utilities in the foreign model DLL and the vxWorks utilities could
handle any configuration, all three port configurations (IXP1240 1xOC12, IXP1240 4xOC3, and IXP1200
2xOC3) use the same memory map.
Microstore utilization can be observed by opening a microengine list window with line numbers  
enabled, and recording the last line number plus 1. Available instructions = 2048 – used  
instructions.  
Figure 12 shows the results for each of the three configurations.  
The CRC Check and CRC Generate microengines apply only to the IXP1200 configuration. In  
the IXP1200 configuration the ATM Receive and IP Route functions run on the same  
microengine. See the Application Note for this example design for more detail on microengine  
and thread allocation.  
Scratchpad RAM Capacity  
There are 1024 32-bit Scratchpad RAM locations on-chip, and just over 50% of them are  
available.  
This design uses 256 locations (25%) for statistics counters, including 16 counters for each of the  
12 ports; plus global counters.  
7 message queues consume 112 entries (11%), and a table to map port numbers to MAC  
addresses consumes 16 more entries. There are some smaller users of Scratchpad RAM locations  
that cause some address map fragmentation, but there are basically two blocks of about 256  
entries available at 0x100 and 0x300.  
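The Scratchpad RAM figures above can be tallied with a few lines of Python. This is only a bookkeeping sketch: the dictionary keys are descriptive labels invented here, the smaller users mentioned above are not itemized, and the 16-entry queue depth is derived from 112 entries across 7 queues rather than stated in the text.

```python
SCRATCHPAD_LOCATIONS = 1024   # 32-bit locations on-chip

usage = {
    "statistics counters": 256,
    "message queues": 112,
    "port-to-MAC table": 16,
}


def percent(locations):
    """Share of Scratchpad RAM, in percent."""
    return 100.0 * locations / SCRATCHPAD_LOCATIONS


per_port_counters = 16 * 12                  # 16 counters for each of 12 ports
global_counters = usage["statistics counters"] - per_port_counters
entries_per_queue = usage["message queues"] // 7
named_used = sum(usage.values())

for name, locations in usage.items():
    print(f"{name}: {locations} locations ({percent(locations):.0f}%)")
print(f"named users total {named_used}, leaving {SCRATCHPAD_LOCATIONS - named_used} "
      f"before the smaller users are counted")
```

The two free blocks of about 256 entries at 0x100 and 0x300 account for roughly 512 locations, which is the "just over 50%" cited above.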
SRAM Capacity  
The IXM1240 Network Processor Base Card comes with 8MB of SRAM. This design is  
currently configured so it uses less than 4MB, leaving over 50% available.  
As various configurations of this project may be integrated with other code that utilizes memory  
below 0x20000 (.5MB), the area below 0x20000 was simply left alone to avoid potential address  
map conflicts.  
The SRAM portion of the IP Lookup Table begins at 0x20000, and can grow almost to 0x80000
(1.5MB).
For convenience, the same utilities are compiled to run on both Software and Hardware CRC  
configurations. Thus the current project taxes all configurations with a 64K location (256KB)  
CRC-32 lookup table.  
The VC Table occupies 0x50000 (320K) locations, corresponding to a 64K entry table with 5  
locations per entry. The VC Miss Table has 8K additional entries consuming 0xA000 locations.  
All totaled, the VC tables consume 360K locations (1.4MB).  
The system is configured with 16K Buffer Descriptors at 4 words each, consuming 64K locations
(256KB).
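The SRAM figures in this section are quoted in 32-bit locations, so each converts to bytes by multiplying by four. The following is a minimal sketch of that arithmetic, assuming the low reserved region is counted as consumed and the IP Lookup Table is taken at its maximum extent; the region labels are illustrative.

```python
K = 1024
BYTES_PER_LOCATION = 4          # one SRAM location is a 32-bit word

# Region sizes in 32-bit locations, taken from the figures in this section.
regions = {
    "left unused below 0x20000":      0x20000,
    "IP Lookup Table (max extent)":   0x80000 - 0x20000,
    "CRC-32 lookup table":            64 * K,
    "VC Table (64K x 5)":             64 * K * 5,
    "VC Miss Table (8K x 5)":         8 * K * 5,
    "Buffer Descriptors (16K x 4)":   16 * K * 4,
}

def kilobytes(locations):
    """Convert a count of 32-bit locations to kilobytes."""
    return locations * BYTES_PER_LOCATION // K

for name, locations in regions.items():
    print(f"{name:32s} {locations:#9x} locations  {kilobytes(locations):5d} KB")

total = sum(regions.values())
print(f"{'total':32s} {total:#9x} locations  {kilobytes(total):5d} KB")
```

Under these assumptions the regions sum to 1000K locations, or 4000KB, which is consistent with the statement that the design uses less than 4MB of the 8MB available.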
SDRAM Capacity  
The IXM1240 Network Processor Base Card comes with 128MB of SDRAM. The project is  
configured to use less than 64MB: 32MB of Packet Data Buffers, 16MB for VxWorks, and the  
balance for IP Route Table Entries. This leaves over 50% available.  
The IP Route Table Entries live at 0x8100, simply because they do so by default in all the  
example designs.  
Associated with each of the 16K Buffer Descriptors in SRAM is a 2KB Data Buffer in SDRAM,  
for a total of 32MB of Data Buffers in SDRAM.  
Note that the Data Buffers span the multiple SDRAM bank selects as described in the IXP1200  
Hardware Reference Manual. Further, the descriptor freelist is “mixed” at initialization time to  
give the SDRAM controller the opportunity to access addresses from the multiple DRAM banks.  
See the appendix (“Buffer Allocation in DRAM”) for more information on “mixing” the freelist.  
VxWorks consumes the top 16MB of DRAM.  
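The SDRAM budget follows from the buffer count and buffer size. The sketch below repeats that arithmetic using only the figures in this section. The size of the IP Route Table Entries is not stated, so the code only derives the ceiling implied by "less than 64MB".

```python
K = 1024
MB = 1024 * K

SDRAM_BYTES   = 128 * MB     # SDRAM on the base card
NUM_BUFFERS   = 16 * K       # one Data Buffer per Buffer Descriptor
BUFFER_BYTES  = 2 * K        # 2KB per Data Buffer
VXWORKS_BYTES = 16 * MB      # top of DRAM, reserved for VxWorks

packet_buffers = NUM_BUFFERS * BUFFER_BYTES
named = packet_buffers + VXWORKS_BYTES

# "Less than 64MB" in total bounds what is left for the IP Route Table Entries.
route_table_ceiling = 64 * MB - named

print(f"packet data buffers: {packet_buffers // MB} MB")
print(f"VxWorks:             {VXWORKS_BYTES // MB} MB")
print(f"route table ceiling: {route_table_ceiling // MB} MB")
print(f"SDRAM left over:     at least {(SDRAM_BYTES - 64 * MB) // MB} MB")
```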
SRAM and SDRAM Bandwidth  
Bandwidth measurements are taken on the Transactor simulator using the full-duplex, full-  
bandwidth 40-byte packet workload on a 2Mcycle simulation. Percent idle cycles are 100%  
minus the active percentage shown in the Developer’s Workbench Statistics/Summary box.  
Configuration                         SRAM % Idle   SDRAM % Idle
IXP1240 1xOC-12 8x100Mbps Ethernet        65.2          42.9
IXP1240 4xOC-3 8x100Mbps Ethernet         57.0          43.3
IXP1200 2xOC-3 4x100Mbps Ethernet         43.0          57.5
Figure 13 – SRAM and SDRAM bandwidth headroom  
The IXP1200 configuration handles ½ the throughput of the IXP1240 configurations, but it has  
much less SRAM bandwidth available. This is because it uses a CRC-32 lookup table in SRAM.  
The two IXP1240 configurations handle exactly the same bandwidth, and their SDRAM usage is
essentially the same. The 4xOC-3 configuration uses more SRAM bandwidth, however, because
its ATM Transmitter uses an array of Buffer Descriptor Queues (BDQ) in SRAM, while
the simpler OC-12 configuration uses a pair of Message Queues (MSGQ) in Scratchpad RAM.
The OC-12 configuration’s VC Cache may also reduce accesses to the VC Table in SRAM.  
At a high level, the IXP1200 configuration can be described as SRAM bandwidth limited,
and the IXP1240 configurations as DRAM bandwidth limited. This can be explored further
with unbounded simulations, in which the design is not limited by wire rate. Additional
SRAM and SDRAM accesses can be added to the design to observe the impact on performance,
and the design can be run on different SDRAM speed grades.
APPENDIX  
Buffer Allocation in DRAM  
The microengines in this example design use two DRAM command queues. The ordered queue
is used by all sdram_crc[] commands to transfer packet data between DRAM and the receive and  
transmit FIFOs. The priority queue is used for all other microengine DRAM accesses, including  
access to IP lookup table entries, and modifications to packet headers.  
While the instruction set mandates that the sdram_crc[] commands use the ordered queue, the
design is free to use a different queue for the other DRAM accesses. For this purpose,
system_config.h defines DRAM_QUEUE as one of "ordered", "optimize_mem", or "priority".
The choice of the priority queue as the default was made by comparing the alternatives for the  
OC-12 full-bandwidth configuration in simulation.  
As described in the IXP1200 Hardware Reference Manual, the IXP12xx performs "Active  
Memory Optimization" to eliminate latencies when it accesses different DRAM banks. This is  
true even if the chip uses the ordered and priority queues as in this design. Thus there is a  
performance benefit if the system's DRAM accesses frequently alternate between DRAM banks.  
To take advantage of this optimization, the DRAM data buffer pool is positioned to equally span  
multiple DRAM banks. Further, the descriptor freelist describing this pool is initialized such that  
subsequent buffer allocations refer to buffers from alternate DRAM banks.  
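One way to picture the freelist "mixing" is as an interleave of buffer indices across banks. The sketch below is illustrative only and is not the design's initialization code. It assumes a hypothetical layout in which the pool is split into equal contiguous slices, one per bank, and it models the freelist as a simple list consumed front to back. The bank count, the mapping, and the function names are all assumptions.

```python
def bank_of(buffer_index, num_buffers, num_banks):
    """Hypothetical mapping: equal contiguous slices of the pool, one per bank."""
    return buffer_index // (num_buffers // num_banks)

def mixed_freelist(num_buffers, num_banks):
    """Order buffer indices so that successive allocations rotate through the banks."""
    if num_buffers % num_banks != 0:
        raise ValueError("pool must divide evenly across banks")
    per_bank = num_buffers // num_banks
    order = []
    for offset in range(per_bank):
        for bank in range(num_banks):
            # Take the buffer at the same offset from each bank's slice in turn.
            order.append(bank * per_bank + offset)
    return order

# Small example: 16 buffers over 4 banks (both numbers are illustrative).
freelist = mixed_freelist(16, 4)
print(freelist)
print([bank_of(i, 16, 4) for i in freelist])
```

With this ordering, no two consecutive allocations come from the same bank, which gives the SDRAM controller the chance to overlap accesses as described above.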