I’ve been looking into data center fabrics and how to handle the scale of large networks lately, so I decided to take some time today to go through the full presentation (video and PDF) that David Swafford gave at NANOG 59 late last year.
I met David Swafford when Facebook came to town for MPLS 2013. He was a really cool guy. Even then, I was inspired by hearing how they go about supporting their networks. Very smart!
I took away a lot of nuggets from watching it. Here are a few:
- Assume we can’t trust any rack
- We can’t trust networking boxes either
- Backbone devices are powerful in the wrong ways for a data center. They can handle many routes but don’t have the desired port density.
- Going from 2 large leaf switches to many smaller leaf switches allows you to move from 1+1 to N+1.
- Beware of silent failures by complex networking devices. They are hard to detect, BTW.
- Automating ToR switch upgrades and handing a “push-button” interface to the service owners helped remove the roadblocks to fully upgrading ToR switches. (I found it analogous to app upgrades on my phone.)
- They even scripted many parts of the process, such as determining who the on-call is for a given group at a given time. Fascinating.
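To make the 1+1 versus N+1 point concrete, here is my own back-of-the-envelope arithmetic (not from the talk). It assumes all switches in a layer are identical and share the load evenly; the function names are just ones I made up for illustration.

```python
from fractions import Fraction


def capacity_lost_on_one_failure(total_switches: int) -> Fraction:
    """Fraction of the layer's capacity that disappears when one switch fails.

    Assumes identical switches with load spread evenly across them.
    """
    if total_switches < 1:
        raise ValueError("need at least one switch")
    return Fraction(1, total_switches)


def overprovision_ratio(needed: int) -> Fraction:
    """Installed capacity divided by needed capacity for an N+1 design.

    `needed` is N, the number of switches required to carry the load;
    one extra switch is installed as spare capacity.
    """
    if needed < 1:
        raise ValueError("need at least one switch")
    return Fraction(needed + 1, needed)


if __name__ == "__main__":
    for n in (1, 4, 8):
        print(
            f"N={n}: install {n + 1} switches, "
            f"overprovision x{float(overprovision_ratio(n)):.3f}, "
            f"one failure removes {float(capacity_lost_on_one_failure(n + 1)):.1%}"
        )
```

With N=1 this is the classic 1+1 pair: you buy twice what you need and a single failure takes out half of what is installed. As N grows, both the spare overhead and the impact of one failure shrink.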
Monitor all the things:
- interface statistics and state
- BGP statistics and state
- FIBs
- TCP retransmits
Respond to your Alerts with Automation:
- FBAR stands for Facebook Auto-Remediation
- Receive the alert, log in to the device, verify the problem is still there, then either ignore it or remedy it.
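The alert flow above can be sketched roughly like this. To be clear, this is my own minimal illustration of the idea and not FBAR's actual code or API: `Alert`, `handle_alert`, and the two callbacks are hypothetical names, and the "log in and check" step is stubbed out as a plain function you would supply yourself.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Alert:
    """Hypothetical alert record: which device and which check fired."""
    device: str
    check: str


def handle_alert(
    alert: Alert,
    is_still_down: Callable[[Alert], bool],
    remediate: Callable[[Alert], None],
) -> str:
    """Re-verify the alert on the device, then ignore it or remedy it.

    `is_still_down` stands in for "log in to the device and check again";
    `remediate` stands in for whatever fix applies to this kind of alert.
    """
    if not is_still_down(alert):
        # The condition cleared on its own, so there is nothing to fix.
        return "ignored"
    remediate(alert)
    return "remediated"


if __name__ == "__main__":
    # Stub callbacks so the sketch runs without touching any real device.
    down_interfaces = {("tor-1", "eth0-link")}

    def check(alert: Alert) -> bool:
        return (alert.device, alert.check) in down_interfaces

    def fix(alert: Alert) -> None:
        down_interfaces.discard((alert.device, alert.check))

    print(handle_alert(Alert("tor-1", "eth0-link"), check, fix))
    print(handle_alert(Alert("tor-2", "eth0-link"), check, fix))
```

The re-verification step is the part I like most: it keeps the automation from acting on alerts that have already cleared.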
He also shares a lot of thoughts on what engineers gain when they automate:
- Spend less time doing repetitive tasks
- Spend more time solving interesting problems or learning
His final challenge: What would you do if you weren’t afraid?