I’ve been looking into data center fabrics and how to handle the scale of large networks lately, so today I decided to take the time to go through the full presentation (video and PDF) that David Swafford gave at NANOG 59 late last year.

I met David Swafford when Facebook came to town for MPLS 2013. He was a really cool guy. Even at the time, I was inspired by hearing how they go about supporting their networks. Very smart!

I took away a lot of nuggets from watching it. Here are a few:

Assume we can’t trust any rack
We can’t trust networking boxes either
Backbone devices are powerful in the wrong ways for a data center. They can handle many routes but don’t have the desired port density.
Going from 2 large leaf switches to many smaller leaf switches allows you to move from 1+1 to N+1.
Beware of silent failures by complex networking devices. They are hard to detect, BTW.
Automating ToR switch upgrades and handing a “push-button” interface to the service owners helped to remove the roadblocks for full upgrades of ToR switches. (I found it analogous to app upgrades on my phone)
They even scripted many parts of the process, such as determining who the on-call is for a given group at a given time. Fascinating.
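
The 1+1 versus N+1 point is easy to see with a little arithmetic. Here is a quick sketch of my own (not from the talk), assuming identical leaf switches with traffic spread evenly across them. The function name is just something I made up for illustration.

```python
def capacity_left_after_one_failure(leaf_count: int) -> float:
    """Fraction of capacity remaining after a single leaf fails.

    Assumes identical leaves and evenly spread load (my simplification).
    """
    if leaf_count < 1:
        raise ValueError("need at least one leaf switch")
    return (leaf_count - 1) / leaf_count


for n in (2, 4, 8, 16):
    left = capacity_left_after_one_failure(n)
    print(f"{n} leaves: {left:.0%} of capacity left after one failure")
```

With 2 leaves a single failure leaves you with half your capacity, while with 4 you keep three quarters. The more leaves, the smaller the hit from any one box.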

Monitor all the things:

interface statistics and state
BGP statistics and state
FIBs
TCP retransmits
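
To make one of these concrete, here is a rough sketch of how a TCP retransmit counter could be turned into a number you can alert on. This is my own illustration, not something shown in the talk. The dictionary keys and the idea of comparing two counter snapshots are assumptions.

```python
def retransmit_percent(prev: dict, curr: dict) -> float:
    """Percent of segments retransmitted between two counter snapshots.

    Both snapshots are dicts with the hypothetical keys
    'segments_sent' and 'segments_retransmitted' (cumulative counters).
    """
    sent = curr["segments_sent"] - prev["segments_sent"]
    retrans = curr["segments_retransmitted"] - prev["segments_retransmitted"]
    if sent <= 0:
        # No traffic in the interval (or a counter reset), nothing to report.
        return 0.0
    return 100.0 * retrans / sent


earlier = {"segments_sent": 1000, "segments_retransmitted": 10}
later = {"segments_sent": 3000, "segments_retransmitted": 50}
print(retransmit_percent(earlier, later))  # 40 retransmits out of 2000 segments
```

Working from deltas rather than raw totals matters because the counters only ever go up, so the raw value says little about what is happening right now.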

Respond to your Alerts with Automation:

FBAR stands for Facebook Auto-Remediation
Receive the alert, log in to the device, verify it is still down, then either ignore or remedy.
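
That receive, verify, then ignore-or-remedy flow can be sketched in a few lines. To be clear, this is not FBAR's actual code or API, just my own minimal illustration. Every name here (`Alert`, `handle_alert`, the check names) is hypothetical, and the "escalated" branch for alerts with no known fix is my own assumption.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Alert:
    device: str
    check: str  # hypothetical check name, e.g. "bgp_session_down"


def handle_alert(
    alert: Alert,
    is_still_failing: Callable[[Alert], bool],
    remedies: Dict[str, Callable[[Alert], None]],
) -> str:
    """Verify the alert is still valid, then ignore it or run a remedy."""
    # Step 1: re-check on the device; many alerts clear on their own.
    if not is_still_failing(alert):
        return "ignored"
    # Step 2: look up a known fix for this kind of failure.
    remedy = remedies.get(alert.check)
    if remedy is None:
        # Assumption: no automated fix known, so hand it to a human.
        return "escalated"
    remedy(alert)
    return "remediated"


# Stand-ins for the real device checks and fixes.
actions = []
alert = Alert(device="tor-switch-1", check="bgp_session_down")
result = handle_alert(
    alert,
    is_still_failing=lambda a: True,
    remedies={"bgp_session_down": lambda a: actions.append(f"reset bgp on {a.device}")},
)
print(result, actions)
```

Passing the verification and remedy steps in as functions keeps the decision logic separate from whatever actually talks to the devices, which also makes it easy to test without a network.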

He also covers a lot of thoughts on engineers that automate:

Spend less time doing repetitive tasks
Spend more time solving interesting problems or learning

His final challenge: What would you do if you weren’t afraid?

Blog: Ideas and Living