Adventures with Cumulus Linux & Mellanox 100G Switches

Introduction

We decided to replace the Juniper EX series switches (in a Virtual Chassis configuration) that had caused us issues over the last couple of years across 4 separate datacenters.  Ok, "issues" is an understatement: we had entire switch clusters crash based purely on how long they had been running, and we completely lost confidence in them.

In researching our needs, and wanting to use this as an opportunity to upgrade our backbone (and not just admit we made a bad decision with Juniper), we decided on Mellanox's Spectrum SN2010 series switches with a mix of 25G and 100G ports in a half-width chassis; yes, 2 switches per 1U of rack space, each with redundant PSUs.  Mellanox has a great reputation in the HPC realm and is focused on just switches and NICs.  To prove out the solution, our vendor, Vology, was able to get us a couple of demo units before we shelled out the money for our 8 new switches.

Our Environment

Our total number of machines per datacenter is really small: just 3 hyperconverged VM hypervisors (Proxmox + Ceph); it's amazing how many CPU cores, how much RAM, and how many NVMe disks you can fit in a single server these days.  We also have a couple of Juniper SRX1500 routers in a chassis cluster with 10G interconnects (which are luckily much more reliable than their switches), and of course the pair of switches with redundant uplinks to various ISPs and cross-connects to our other datacenters.  Those switches are what we're talking about here.

Some datacenters have a couple of auxiliary servers for special purposes, but that's not really relevant.  These Mellanox switches fit the bill really well, and we have plenty of ports left over even with each system connected to both switches using 802.3ad (LACP bonding).

Switch OS Decisions

Our first task in the evaluation was determining which OS we wanted to run.  The great thing about Mellanox is that they are an Open Ethernet (whitebox) switch vendor, so you can choose your OS.  They of course have their own, called Onyx, but there is also Cumulus Linux, Microsoft's SONiC, and a few lesser-known options.  All are Linux-based (yes, even Microsoft's SONiC).

The first thing that surprised us, coming from the Juniper world, is that each switch is independent.  They're not clustered together.  You don't have a single management IP address.  Configurations are not synced across the switches automatically.  At first we started doubting our path: was this going to be so burdensome to maintain that we'd be better off with the devil we knew?

Then we remembered: we're an Ansible shop, and much larger organizations than ours run these things.  It couldn't be as bad as it first seemed; we just needed to change the way we think.  SONiC was ruled out pretty early on, as it's not a full plug-and-play distribution and is meant for much larger deployments than we run.  That left Onyx and Cumulus.  After reviewing the Onyx documentation, we were left with questions about how it handles switch failover with MLAG (Multi-chassis Link Aggregation Groups) across only 2 switches, as it seemed to have some shortcomings around split-brain (determining which switch is really the master).  The community and documentation for Cumulus seemed better.  We checked with our sales rep, and while Cumulus was more expensive, it added only a small percentage to the total cost of the solution, and Cumulus was born out of the DevOps world, which fits in nicely with us.

Evaluating Cumulus

Installation of Cumulus (v3.7) from a USB thumb drive was pretty straightforward; we didn't hit any issues.  The switches turned out to have a quad-core Intel Atom C2558 CPU at 2.4GHz with 8GB of RAM, and the install dropped us into a familiar Debian 8 (Jessie) environment.  Since the Ubuntu we had run for years is based on Debian, it felt comfortable; it looked like we were logged into any old vanilla server.  Looking at the network interfaces, we saw an eth0 (management port) and then a bunch of swpX interfaces, one Ethernet interface for each switch port driven by the Spectrum silicon.

Of course, we were still stuck in our old way of thinking and decided to use the Cumulus NCLU (Network Command Line Utility) to configure everything, since we figured that was the best (recommended) way.  We got a basic configuration up with a test server and a bond using this method.  NCLU told us it modified /etc/network/interfaces and /etc/cumulus/ports.conf.  The first is the normal Debian way to configure network interfaces; the second is Cumulus-specific.  Overall, we didn't think much of it: the NCLU command set was pretty rich and worked well.
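
For reference, the sort of thing we were typing looked roughly like this (the port, bond name, and VLAN numbers are placeholders, not our actual config):

    net add bond server1 bond slaves swp1      # LACP bond facing a test server
    net add bond server1 bridge vids 10,20     # trunk a couple of VLANs to it
    net add bridge bridge vids 10,20           # make sure the bridge carries those VLANs
    net pending                                # review the staged changes
    net commit                                 # apply and write out the config files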

Ansible has an NCLU module too, so we went ahead and scripted out what we had done, as well as some other standard system provisioning steps (joining our FreeIPA authentication server, remote logging, etc.).  We left some hanging chads where we figured we'd fill in our knowledge gaps over time and get our scripts hardened and idempotent, as is the Ansible way.

So we marched on… The next step was to bring up the second switch and figure out how to connect a single server to both switches, so that the server's interfaces could be LACP-bonded across them for redundancy and performance.  Since we knew this was standard Linux, we wondered how that could work; a normal LACP bond terminates on a single device, so we figured there must be some Mellanox Spectrum magic behind this advertised feature.

Turns out we were wrong here: Cumulus wrote an open-source clagd daemon that is not tied to the silicon at all; presumably, if we had the desire, we could use it elsewhere.  Their MLAG docs on this setup are pretty good, and while some naming conventions might lead you to believe there is some magic behind the scenes, there really isn't.  The clagd daemon transfers state information and MAC addresses between the switches (ok, Linux servers) for designated interfaces, which lets the attached servers believe they are connected to a single endpoint; more information on that here.  You basically just set up a bond between the 2 switches (their docs name this bond peerlink, I guess for clarity, but the name doesn't matter).  You also break out a dedicated VLAN for clagd to use for its communication; 4094 is used by convention.  The other VLANs on the peerlink naturally interconnect the two switches so traffic can flow back and forth.  Then you tag each inter-switch bond pair with a unique clag-id, so clagd knows which port on each side corresponds to the other.  This worked great on the first try; we thought it would be much harder than this.
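
To give a feel for it, here is a stripped-down sketch of the relevant pieces of /etc/network/interfaces on one switch, loosely following the Cumulus MLAG docs; the port names, addresses, and MAC are placeholders rather than our real values:

    # inter-switch bond (the docs call this "peerlink"; the name is arbitrary)
    auto peerlink
    iface peerlink
        bond-slaves swp13 swp14

    # dedicated VLAN 4094 on the peerlink for clagd's own communication
    auto peerlink.4094
    iface peerlink.4094
        address 169.254.1.1/30
        clagd-peer-ip 169.254.1.2
        clagd-sys-mac 44:38:39:ff:00:01
        clagd-priority 1000

    # server-facing bond; the matching bond on the other switch carries the same clag-id
    auto server1
    iface server1
        bond-slaves swp1
        clag-id 1

    # both the peerlink and the server bonds are ports on the VLAN-aware bridge
    auto bridge
    iface bridge
        bridge-vlan-aware yes
        bridge-ports peerlink server1
        bridge-vids 10 20

The second switch mirrors this, with the address and clagd-peer-ip swapped and a different clagd-priority so one side is elected primary.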

A little more research turned up this reference on failure scenarios.  So we added a backup link via a dedicated cable from eth0 to eth0, since we weren't using those ports anyway (we break out a management VLAN for access to the switches instead).  This link never carries data; it is only used to detect whether the peer is still online, to prevent problems when the peerlink itself goes down.  It turns out our implementation matches #4 in their MLAG best practices.
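
Wiring that in is one more clagd parameter in the peerlink.4094 stanza sketched earlier (the address here is a placeholder for the peer's eth0 management IP):

    # heartbeat over the dedicated eth0-to-eth0 cable; only consulted when the peerlink is down
    clagd-backup-ip 192.168.100.2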

We had also initially dedicated 2x 100G interfaces to the peerlink interconnecting our switches, but upon further review, and given that all of our servers are connected to both switches, we learned this was overkill and wasteful.  With LACP, the same MAC address is used on each interface in the bond, and learned MAC addresses are shared between the MLAG peers.  So when a packet ingresses an interface, the switch knows which interface learned the destination MAC address and forwards the packet there.  It is smart enough to NEVER traverse the peerlink if the destination MAC address is reachable on the same physical switch the packet ingressed; it forwards it locally instead.  That means the ONLY time a packet traverses the peerlink is when the destination MAC address exists ONLY on the remote switch, as is the case for singly-connected devices or in the event of a port, NIC, or cabling failure.  Given this, we decided to use 2x 25G ports for our peerlink, which would more than handle the bandwidth generated by singly-connected devices (basically IPMI iKVM interfaces) and still provide sufficient bandwidth in the event of a link failure on a server (our load is unlikely to be that high on average, and any issues should be promptly fixed).

At this point, we were happy.  Our solution was vetted.  There were still some rough edges to polish before deployment but we were confident this was going to work.

Abandoning NCLU

Since we use Ansible for configuration management on servers, as mentioned above, we decided this was the proper path for Cumulus Linux as well.  The Cumulus docs talk about Ansible support, and there's an NCLU module for Ansible.  However, one of the primary principles for writing Ansible scripts is to make sure they're idempotent (meaning if you run them back to back, the second run should report no changes and make no changes to the system).  It turns out this isn't achievable with the NCLU module.  The problem is that you can't declare your configuration through NCLU and expect it to be the entirety of the configuration unless your first NCLU command is "net del all".  And once you execute that command, even if you immediately replace the config with an identical one, idempotence is lost … the play always reports changes, and network interfaces get reconfigured when they don't need to be.  We contacted Cumulus support, and they basically asked why we were using the NCLU module in the first place instead of just writing the configuration files.  Hmm.
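
For illustration, the pattern that bit us looked roughly like this (a sketch rather than our actual playbook; the VLANs and bond name are placeholders):

    # "del all" wipes the staged config and everything is re-declared on every run,
    # so Ansible reports this task as changed even when nothing actually changed
    - name: Push the full switch configuration via NCLU
      nclu:
        commands:
          - del all
          - add bridge bridge vids 10,20
          - add bond server1 bond slaves swp1
          - add bond server1 bridge vids 10,20
        commit: true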

It seems like the real strength of NCLU is queries.  It provides nice summary output for all aspects of Cumulus that is more readable than the native Linux tools, since it combines output from multiple sources.  I'm not entirely sure why anyone would use NCLU for configuration management at this point.
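
A few of the read-only commands we still find ourselves reaching for (from memory; "net help" lists everything available on your version):

    net show interface          # port state, speed, and description in one table
    net show clag               # MLAG peer status and per-bond consistency checks
    net show bridge macs        # learned MAC addresses across the bridge
    net show configuration      # the running config as NCLU understands it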

We ended up defining a custom variable hierarchy in our Ansible configuration for the various switch groups, containing metadata such as VLANs, trunk/access port config, STP settings, clag-ids, bonds, and so on, which is evaluated by a fairly complex Jinja2 template to render our /etc/network/interfaces file.  It's mildly specific to our environment (such as pre-chosen ports for peerlinks) and supports some odd things many won't care about (like renumbering VLANs from port to port) … so I haven't yet decided if it's worth sharing.  We also added some sanity checks to our role, like running "ifreload -a -s" to validate the configuration before applying it, rolling back to the prior one if validation failed.  Other than that, we just had a couple of minor customizations to /etc/cumulus/ports.conf (such as splitting a couple of 100G ports into 2x50G using a breakout DirectAttach cable).  Unfortunately, at this time, making a change in ports.conf causes a brief outage while switchd restarts, which meant we had to add a "wait_for_connection" stanza to our Ansible tasks after restarting switchd asynchronously, though I hear that's fixed in Cumulus 4.x.  Any changes to /etc/network/interfaces are applied with "ifreload -a" and take effect almost instantaneously with no detectable outage (as long as you don't reconfigure the port you're connecting through).
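
The apply step of the role boils down to something like this sketch (the template name and task layout here are illustrative rather than lifted from our actual role, and the failure/rollback handling is omitted for brevity):

    - name: Render /etc/network/interfaces from our switch metadata
      template:
        src: interfaces.j2              # hypothetical template name
        dest: /etc/network/interfaces
        backup: yes                     # keeps the prior file around for rollback
      register: interfaces_file

    - name: Syntax-check the rendered configuration without applying it
      command: ifreload -a -s
      when: interfaces_file is changed

    - name: Apply the new configuration
      command: ifreload -a
      when: interfaces_file is changed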

This solution is fully idempotent and very easy to maintain.

Final Thoughts

Since deploying our Mellanox switches with Cumulus Linux many months ago, I'm happy to report we've had zero issues.  Making configuration changes has been as easy as editing a variable and re-running our Ansible playbook against the respective switches.  While it does take a little longer than a basic port change did in Juniper, we know we have a standardized configuration across all of our switches and are more confident that if we do have a failure, we can spin up a replacement in less time.  It is also a lot less error-prone for simple tasks, and the configuration is more uniform (Juniper can have different ways to configure ports depending on which switch model is in use).

In hindsight, Cumulus turned out to be a great choice for another reason: Nvidia ended up buying both Mellanox and Cumulus, so I'd imagine there will be much tighter coupling between the two going forward, and I wonder what the future holds for the Onyx OS.
 
Going forward, we need to start evaluating Cumulus Linux 4.1+.  There is unfortunately no direct upgrade path; it is a reinstallation from scratch due to the significant changes involved in moving from Debian 8 to Debian 10.  It doesn't seem like there are any major changes, and configurations stay the same, other than ports.conf changes possibly no longer requiring an outage, as mentioned earlier.  We don't anticipate any issues or a significant amount of time required for the migration, but we figured we'd let other companies work out the kinks before we jump on board.