Force10 Networks has been flattening out and speeding up networks since it was founded over a decade ago, and today it is fleshing out its product line with a server virtualization–friendly top-of-rack Gigabit Ethernet switch called the S60. There are plenty of Gigabit Ethernet switches out there, but Steve Garrison, vice …
The S60 also has 1.25GB of deep packet buffering
"The fix for congestion is not necessarily to move to 10 Gigabit switches, according to Force10, but rather to push line rates in a Gigabit switch and give the device ultra-deep buffering to cope with those momentary rogue waves on the network."
"The S60 also has 1.25GB of deep packet buffering"
I'm assuming this means gigabytes, as written, and not gigabits.
Can someone explain why such large buffers are needed? My previous understanding was that excessively large buffers in routers could be detrimental because they don't do anything to actually increase the transmission speed, but will accumulate a large backlog of packets that ultimately causes significant latency.
Consider a 1Gbps generator and a 100Mbps consumer: what's the net benefit of keeping an enormous buffer in the middle?
Sliding window TCP implementations would ensure the buffer remains full (since the packets are not being dropped), which happens to be the worst case for latency.
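To put a rough number on that worst case, here is a minimal sketch of how long a packet at the back of a completely full FIFO buffer would wait. It assumes (my assumptions, not anything from the article) that "1.25GB" means decimal gigabytes and that the whole buffer is available to a single egress port; `drain_time_seconds` is just an illustrative helper name.

```python
def drain_time_seconds(buffer_bytes: float, egress_bps: float) -> float:
    """Time for the last byte in a full FIFO buffer to reach the wire."""
    return buffer_bytes * 8 / egress_bps


# Assumption: 1.25GB read as decimal gigabytes, all of it behind one port.
BUFFER_BYTES = 1.25e9

for label, rate_bps in [("100Mbps", 100e6), ("1Gbps", 1e9), ("10Gbps", 10e9)]:
    delay = drain_time_seconds(BUFFER_BYTES, rate_bps)
    print(f"{label}: {delay:.1f} s of queueing delay when the buffer is full")
```

Under those assumptions a full buffer draining into a 100Mbps consumer means about 100 seconds of added delay, which is why the question is really about whether the buffer ever stays full.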
I'll admit this is not my domain, so anyone who knows better feel free to let me know.
The benefit of the deep buffers is in handling momentary spikes, which are often observed on individual machines.
When such a spike occurs, it is buffered, while the egress line continues handling the data at its maximum speed, provided the scheduler can keep up. With deep buffers, even very large spikes can be handled without affecting the clients and without requiring retransmission of packets dropped due to congestion.
The deep buffer cannot do anything for your hypothetical 1Gbps of traffic through a 100Mbps bottleneck if the source keeps sending data at a uniform 1Gbps. If this is the case, you simply need to buy more bandwidth.
The solution presented by Force10 aims at handling traffic that fits within the available bandwidth most of the time, while allowing for occasional (although fairly large) traffic spikes that don't. As the buffers help avoid packet drops under congestion, packets do not need to be retransmitted (a lengthy process). They simply sit in the queue looking pretty and await their turn at the line, while the traffic suffers only a bit of delay.
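The difference between a temporary spike and sustained overload can be sketched with a toy queue model. This is only an illustration under made-up numbers: 1 ms ticks, a 1Gbps egress line, a 20 ms spike at 5Gbps, and a simple "drain first, then drop the overflow" rule. `simulate_queue` is a hypothetical helper, not how any real switch scheduler works.

```python
def simulate_queue(arrivals_bits, egress_bits_per_tick, buffer_bits):
    """Feed per-tick arrivals into a FIFO; return (peak backlog, dropped bits)."""
    backlog = 0
    peak = 0
    dropped = 0
    for arriving in arrivals_bits:
        backlog += arriving
        # Drain whatever the egress line can carry during this tick.
        backlog -= min(backlog, egress_bits_per_tick)
        # Anything that still does not fit in the buffer is dropped.
        if backlog > buffer_bits:
            dropped += backlog - buffer_bits
            backlog = buffer_bits
        peak = max(peak, backlog)
    return peak, dropped


EGRESS = 1_000_000            # 1Gbps expressed in bits per 1 ms tick
SHALLOW = 8_000_000           # 1MB buffer, in bits (illustrative)
DEEP = 10_000_000_000         # 1.25GB buffer, in bits (decimal GB assumed)

# 50% load with a 20 ms spike at 5x line rate in the middle.
spike = [500_000] * 100 + [5_000_000] * 20 + [500_000] * 100
# 20 seconds of steady 2x overload.
sustained = [2_000_000] * 20_000

print("spike, shallow buffer:", simulate_queue(spike, EGRESS, SHALLOW))
print("spike, deep buffer:   ", simulate_queue(spike, EGRESS, DEEP))
print("sustained, deep buffer:", simulate_queue(sustained, EGRESS, DEEP))
```

In this model the deep buffer rides out the spike with no drops, while the shallow one drops most of it. With steady 2x overload even the deep buffer eventually fills and starts dropping, which matches the "buy more bandwidth" point.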
First of all, the story is a bit misleading. A machine with a GigE interface will always send at GigE speed. If you send 10 bytes, they will be sent at a rate of 1Gbps, which just means the transmission will be short. Now, people looking at bandwidth graphs might see only 10Mbps of usage. That means the machine was transmitting at 1Gbps only 1 percent of the time over the past sample period.
The problem comes in where you have several machines talking at once. So imagine GigE switch ports with a GigE uplink, and 10 servers attempt to send at exactly the same time. Even if the data they are sending at that moment is short in length, the switch must store that burst of data coming in. If you have 10 soda straws in and one soda straw out, you need a reservoir to hold some of that traffic until it can drain out. You often see this with uplinks and backend database machines, where you might have 100 front-end servers trying to talk at once to a backend database. This problem gets worse when you have 10 virtual machines on one physical machine.
So one might say ... put a 10G interface on the database machine. That is fine for one direction, but now what happens when the db machine sends? It is sending at 10Gbps and the data can be delivered to the receiving interface at only 1Gbps. Again, you need to buffer it if you don't want to do things like TCP backoff. If you start dropping packets, backing off TCP, and reducing window sizes, you are killing throughput that could be avoided by simply adding some RAM to the switches for buffers.
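The soda straw picture can be turned into a back-of-the-envelope formula. A minimal sketch, assuming every sender starts at the same instant and sends one burst of equal size at its full port speed; the 64KB burst size and the helper name `incast_peak_backlog_bytes` are made up for illustration.

```python
def incast_peak_backlog_bytes(senders, burst_bytes, ingress_bps, egress_bps):
    """Peak queue depth when `senders` ports burst at once into one egress port."""
    arrived = senders * burst_bytes
    # Bytes the egress port manages to send while the bursts are still arriving.
    drained = burst_bytes * egress_bps / ingress_bps
    return max(0.0, arrived - drained)


BURST = 64_000  # bytes per sender, an illustrative value

# 10 GigE servers into one GigE uplink.
print(incast_peak_backlog_bytes(10, BURST, 1e9, 1e9))
# 100 front-end servers talking to one backend database port.
print(incast_peak_backlog_bytes(100, BURST, 1e9, 1e9))
# One 10G sender delivering a 1MB burst to a 1G receiver.
print(incast_peak_backlog_bytes(1, 1_000_000, 10e9, 1e9))
```

With equal port speeds the switch has to hold roughly (N - 1) bursts at the peak, so the reservoir grows with the number of straws rather than with the average load.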
Yes, buffers can be bad for high latency links but for the local LAN, they are generally a good thing.
@quartzie & @AC
"Again, you need to buffer it if you don't want to do things like tcp backoff. If you start dropping packets, backing off TCP, and reducing window sizes, you are killing througput that could be avoided by simply adding some RAM to the switches for buffer."
I somewhat disagree with this quote, because for constant/static saturated traffic, the buffer cannot increase throughput; it will just increase latency. However, the rest makes sense to me.
For constant traffic where the sender endpoints are slower than the receiver endpoints, obviously the buffers will never fill up, and therefore don't help in this scenario.
So I think quartzie's post about dynamic spikes is the main scenario where this might help, as long as the spike is temporary, since it avoids packet loss.
I wonder how likely the spike scenario is versus the oversaturation scenario on a typical LAN?
This doesn't sound particularly helpful for networks with heavy TCP traffic. With sufficient load, TCP is designed to eat up all the buffering you give it, and the consequence of that is high latency for other traffic. If the "rogue wave" is TCP traffic, what you really want to do is start dropping packets or set ECN bits so the sending stacks will back off. There is no reason to buffer in the switch what you can buffer on the sender. Without ECN, excessive buffering just makes things worse.
It would seem that what would be really useful is the equivalent of ECN at layer 2 (the Ethernet level), such that if an outgoing switch port is congested, the source Ethernet adapters can throttle back on a (source MAC, destination MAC) pair basis. And isn't that more or less what the new 802.1Qau congestion management draft is trying to accomplish?
Re: Deep buffering
"This doesn't sound particularly helpful for networks with heavy TCP traffic. With sufficient load, TCP is designed to eat up all the buffering you give it, and the consequence of that is high latency for other traffic."
That is just wrong. TCP will keep increasing the window size, and therefore the amount of data in flight, up to the point that the receiver is waiting for data. And even then, most OSes will cap the window at 1MB, or maybe 8 or 16MB at most.
1.25GB sounds like marketing rather than engineering
From my recollection, typical TCP sessions will retransmit after 2 seconds, so providing more buffering than that is futile. If a server is currently under heavy load, there will be a point where the additional buffering is just maintaining the overloaded state and preventing load-shedding (i.e. VMotion) until the box starts dropping the connections anyway.
My guess is that for a fully loaded switch (44 x 1Gbps, 4 x 10Gbps) buffering in excess of maybe 0.25 second of traffic (i.e. a bit over 256MB of buffering for the whole switch) will achieve very little additional performance - it would be interesting to see how this switch performs in load testing as compared to the other switches mentioned in the article.
Let me throw in some questionable ballpark calculations.
Assuming that two-second retransmit, you can meaningfully buffer up to one second assuming symmetric congestion, but for a rack mostly communicating with a local LAN, probably very close to two seconds. Assuming 84Gbit/sec, that's what, 10.5 GByte/sec ballpark. So 1.25GByte is more on the order of 1/8th of a second of full aggregate bandwidth, counted one-way only for obvious reasons. Only counting the GigE interfaces, that's 1/4th of a second of buffer space.
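For anyone who wants to check the ballpark, the arithmetic can be sketched as below. It uses the port counts quoted earlier in the thread (44 x 1Gbps, 4 x 10Gbps) and assumes decimal gigabytes and traffic counted one-way; `buffer_seconds` is just an illustrative helper.

```python
GIGE_PORTS, GIGE_BPS = 44, 1e9
TENGIG_PORTS, TENGIG_BPS = 4, 10e9
BUFFER_BYTES = 1.25e9  # assumption: 1.25GB as decimal gigabytes


def buffer_seconds(buffer_bytes: float, aggregate_bps: float) -> float:
    """Seconds of traffic at `aggregate_bps` that fit into the buffer."""
    return buffer_bytes * 8 / aggregate_bps


all_ports_bps = GIGE_PORTS * GIGE_BPS + TENGIG_PORTS * TENGIG_BPS
gige_only_bps = GIGE_PORTS * GIGE_BPS

print(f"aggregate: {all_ports_bps / 1e9:.0f} Gbit/s = {all_ports_bps / 8e9:.1f} GByte/s")
print(f"all ports: {buffer_seconds(BUFFER_BYTES, all_ports_bps):.3f} s of buffering")
print(f"GigE only: {buffer_seconds(BUFFER_BYTES, gige_only_bps):.3f} s of buffering")
print(f"one 1Gbps port: {GIGE_BPS / 8 / 1e6:.0f} MB per second of buffering")
```

That works out to roughly 0.12 seconds across all ports and roughly 0.23 seconds across the GigE ports alone, so the 1/8th and 1/4th figures hold up as ballparks.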
I think I can see what they're on about. Still, load testing would be nice, yes. Does that ``1.25GB'' of buffer space bring solace in the given scenario and does that occur often enough that the solution is worth the 10k USD pricetag?
Re: 1.25GB sounds like marketing rather than engineering
"From my recollection, typical TCP sessions will retransmit after 2 seconds, so providing more buffering than that is futile. "
"My guess is that for a fully loaded switch (44 x 1Gbps, 4 x 10Gbps) buffering in excess of maybe 0.25 second of traffic (i.e. a bit over 256MB of buffering for the whole switch) will achieve very little additional performance"
No, 256MB isn't enough. What I have found is that, to deal with spikes, you need about 1 second of buffering, preferably 2 seconds. You would need about 125MB for a 1 second buffer on a 1Gbps port.
Buffering on routers and switches is an active area of research. What isn't clear here is whether the buffers are partitioned per port. Partitioning means that an overload on one port doesn't degrade any other ports. Plus, it means that you can use slower memory for the buffer, as the memory only needs to operate at the line rate of one port.