When networking in Vagrant goes wrong

Majic

For some time now I have been a big fan of Vagrant, a very convenient tool for provisioning and working with virtual machines in development environments.

Out of the box, most people use Vagrant in a single-machine set-up. In this role, Vagrant is perfect for things like website development: it helps you reduce clutter on the local dev machine by moving it into an easily reproducible VM, and lets you install tools that might not be available on your host's distro.

A couple of weeks ago I embarked on learning how to use and work with Docker, an application container engine/manager/you name it software.

Since I did not want to mess with my Gentoo box by installing a bunch of new packages for testing purposes, I decided to bring up VMs instead. In order to make life easier, the decision was to use Vagrant for bringing up the VMs, and Ansible for setting them up. That way I can bring up and tear down the environment on a whim without spending too much energy.

At first, there was just one VM as I learned the basics of using Docker. This quickly grew to three VMs for testing more of Docker's functionality, predominantly for the purpose of running a Docker Swarm.

In order to have Docker Swarm working, I needed the same IP addresses assigned to the VMs every time. Out of the box, however, Vagrant relies on VirtualBox's internal DHCP for providing IPs, which means you will get random IPs assigned more often than not (and you can't control the internal DHCP for IP assignment purposes).

Luckily, Vagrant can set up an additional private network for connecting all the machines, where you can decide to use either DHCP or static IPs. A couple of updates to the Vagrantfile later, I was running with static IPs on an extra VM interface.
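For illustration, here is a minimal sketch of what such a multi-machine Vagrantfile could look like. The machine names and the 192.168.50.x addresses are made up for the example and are not the ones from my actual set-up; only the `private_network` option with a fixed `ip` is the part that matters.

```ruby
# -*- mode: ruby -*-
# Sketch only: machine names and IP addresses below are hypothetical.
MACHINES = {
  "node1" => "192.168.50.11",
  "node2" => "192.168.50.12",
  "node3" => "192.168.50.13",
}.freeze

Vagrant.configure("2") do |config|
  config.vm.box = "debian/contrib-jessie64"

  MACHINES.each do |name, ip|
    config.vm.define name do |node|
      node.vm.hostname = name
      # Adds a second interface with a fixed address. The first (NAT)
      # interface stays under Vagrant's control and still uses DHCP.
      node.vm.network "private_network", ip: ip
    end
  end
end
```

Keeping the name-to-address mapping in one hash makes it easy to reuse the same addresses elsewhere, for example in an Ansible inventory.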

As I read more about Docker, I decided to bring up a system closer to what would be run in production, e.g. proper HA with multiple VMs, separation of VM duties, etc.

This involved a lot more Ansible roles and configuration, and in order to make sure everything worked correctly I started destroying and bringing up the VMs on a more regular basis to test things out.

And then the problems started - Vagrant would often get stuck, randomly and on random machines, at the step "SSH auth method: private key". Sometimes it would work fine, and you could proceed. Other times you would need to destroy that one machine and resume the process.

Since this is rather annoying (along the lines of me swearing at the monitor at increasing intervals), everything was put on pause until I resolved the issue.

First off was searching the Internet to see if anyone had faced similar issues. People reported similar symptoms, but none of the underlying problems were applicable to my case.

Next off was testing if the port forwarding set up by Vagrant in VirtualBox worked correctly. It turned out that it did not for the faulty Vagrant machines. The weird part was that some faulty VMs would become reachable after a reboot.
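A check like this can be scripted. The sketch below is not what I ran at the time, just one way to do it: it connects to the forwarded port on the host and waits for the SSH banner. The banner matters because the listening socket on the host belongs to VirtualBox, so, as far as I can tell, a bare TCP connect may succeed even when the guest itself is unreachable. The function name `ssh_banner` and the default timeout are my own choices.

```ruby
require "socket"

# Returns the SSH banner line (e.g. "SSH-2.0-OpenSSH_...") read from
# host:port, or nil if the port refuses the connection or stays silent.
def ssh_banner(host, port, timeout: 5)
  Socket.tcp(host, port, connect_timeout: timeout) do |sock|
    # Wait for the server to speak first; an SSH daemon sends its
    # banner right after the connection is established.
    sock.gets&.strip if IO.select([sock], nil, nil, timeout)
  end
rescue SystemCallError, SocketError
  nil
end

# Example call (2222 is Vagrant's usual first forwarded SSH port;
# `vagrant port` shows the one actually in use):
#   puts ssh_banner("127.0.0.1", 2222) || "no SSH banner received"
```

A `nil` result for a machine that VirtualBox reports as running points at the guest's networking rather than at Vagrant itself.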

After logging in to the VirtualBox machine via its terminal (no SSH at this point), it turned out that the default network interface, which is supposed to get its IP via DHCP, was connected but had not been assigned an IP. This was definitely the underlying issue for Vagrant getting stuck, but not the end of the road. Bringing up the second interface (the private network one) manually did not yield any success either - I could not reach anything outside of the VM, not even the host (default gateway).

After the initial connectivity testing, I figured it had to be some kind of package version compatibility issue.

As I mentioned before, I am a Gentoo user, which involves rather frequent upgrades of both the kernel and VirtualBox (amongst other packages). Since I had some recent issues with VirtualBox after an upgrade, the first thing to do was to switch to a somewhat older VirtualBox version that had worked fine for me in another project for a number of months.

This did not help, though. Next off was downgrading the kernel from version 4.9.16 to 4.4.52. One host machine reboot later and... Nada. Still the intermittent Vagrant failures.

It was time for another round of searching on the Internet... Now I concentrated specifically on Debian base boxes. There were a couple of mentions here and there of boxes not being packaged properly.

So, I switched to a Bento base box to see if I could reproduce the issue. Surprisingly, I had no issues with that one.

After going back and forth a couple of times, I almost gave up on the semi-official Debian base boxes and went for Bento instead. Luckily, I still wanted to figure out what the difference was...

Yet again, this fired off another round of searching. And at some point I did run into issue 7876 on GitHub. The issue was related to network interface ordering.

However, this did not exactly apply to my case, plus it felt like that specific issue would happen only once the machine had already become reachable. It did nudge me in the correct direction, though...

What I discovered was that the network interface order on the VMs themselves was not 100% reliable. Sometimes the default (NAT) network interface would become eth0, with the private network interface being eth1, and sometimes it was reversed. Indeed, I was able to bring up the correct networks on the two devices manually, and could confirm the associated MAC addresses via the VM network settings page.
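Matching interfaces to adapters by MAC address can also be done with a few lines of Ruby inside the guest. This is a sketch that assumes Ruby is installed in the VM (which a minimal box may not have; `ip link` shows the same information). It reads the standard Linux sysfs tree under /sys/class/net and prints MACs in the colon-less, upper-case form that box.ovf uses. The function names are my own.

```ruby
# Strip separators so "08:00:27:fd:0a:74" becomes "080027FD0A74".
def normalize_mac(mac)
  mac.delete(":-").upcase
end

# Map each network interface name to its MAC address, as reported by
# the kernel under sysfs. Entries without an "address" file are skipped.
def interface_macs(sysfs_root = "/sys/class/net")
  Dir.children(sysfs_root).sort.each_with_object({}) do |name, macs|
    address_file = File.join(sysfs_root, name, "address")
    next unless File.file?(address_file)

    macs[name] = normalize_mac(File.read(address_file).strip)
  end
end

# Example call inside the guest:
#   interface_macs.each { |name, mac| puts "#{name}  #{mac}" }
```

If the MAC of the NAT adapter shows up on eth1 instead of eth0, the interfaces came up in the reversed order.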

When using the debian/contrib-jessie64 base box, I did notice that the second interface (the private network one) had a different network adapter type compared to the first (NAT) one.

Since you can specify the network adapter type manually in the Vagrantfile, I gave that a go, and things started working fine.
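As a rough sketch of that kind of change (the IP address is a placeholder, and this is not a copy of my actual Vagrantfile), the VirtualBox provider accepts a `nic_type` option on the network definition:

```ruby
# -*- mode: ruby -*-
# Sketch only: the IP address below is hypothetical.
Vagrant.configure("2") do |config|
  config.vm.box = "debian/contrib-jessie64"

  # Force the private network adapter to the same type as the NAT one
  # (82540EM), instead of whatever the base box declares for slot 1.
  config.vm.network "private_network", ip: "192.168.50.11", nic_type: "82540EM"

  # An alternative is to go through VBoxManage directly:
  #   config.vm.provider "virtualbox" do |vb|
  #     vb.customize ["modifyvm", :id, "--nictype2", "82540EM"]
  #   end
end
```

Either way, the point is that both adapters end up using the same emulated hardware.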

However, when using bento/debian-8.7, both network adapters would be set to 82540EM.

The big question was - what was the difference between the debian/contrib-jessie64 and bento/debian-8.7 base boxes? Is it the way they are built? Is there some metadata that signals to Vagrant what network adapter to use?

I started comparing the files associated with the two base boxes, and finally found the relevant difference - it was in the box.ovf file, which amongst other things describes the network adapters available.

The relevant part of bento/debian-8.7 had the following section:

<Network>
    <Adapter slot="0" enabled="true" MACAddress="080027FD0A74" type="82540EM">
        <NAT/>
    </Adapter>
    <Adapter slot="1" type="82540EM"/>
    <Adapter slot="2" type="82540EM"/>
    <Adapter slot="3" type="82540EM"/>
    <Adapter slot="4" type="82540EM"/>
    <Adapter slot="5" type="82540EM"/>
    <Adapter slot="6" type="82540EM"/>
    <Adapter slot="7" type="82540EM"/>
</Network>

The relevant part of debian/contrib-jessie64 had the following section:

<Network>
    <Adapter slot="0" enabled="true" MACAddress="080027884F2B" cable="true" type="82540EM">
        <NAT/>
    </Adapter>
    <Adapter slot="1" type="Am79C973"/>
    <Adapter slot="2" type="Am79C973"/>
    <Adapter slot="3" type="Am79C973"/>
    <Adapter slot="4" type="Am79C973"/>
    <Adapter slot="5" type="Am79C973"/>
    <Adapter slot="6" type="Am79C973"/>
    <Adapter slot="7" type="Am79C973"/>
</Network>

In other words, debian/contrib-jessie64 had a different network adapter type listed for the "unused" (non-default) network adapters. This probably resulted in slightly different heuristics within the VM for assigning network interface names, which in turn resulted in me wasting a couple of hours on this issue -.-
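The comparison I did by eye can be automated. Below is a small sketch using REXML (shipped with Ruby) that pulls the adapter type per slot out of a box.ovf and reports the slots where two boxes disagree. The function names are my own, and the location of box.ovf is an assumption: Vagrant normally unpacks boxes somewhere under ~/.vagrant.d/boxes/.

```ruby
require "rexml/document"

# Return { slot number => adapter type } for every <Adapter> element
# found in the given OVF/XML string, regardless of nesting depth.
def adapter_types(ovf_xml)
  doc = REXML::Document.new(ovf_xml)
  types = {}
  doc.each_recursive do |element|
    next unless element.name == "Adapter"

    types[element.attributes["slot"].to_i] = element.attributes["type"]
  end
  types
end

# Return { slot => [left type, right type] } for slots that differ.
def differing_slots(left, right)
  (left.keys | right.keys).sort.each_with_object({}) do |slot, diff|
    diff[slot] = [left[slot], right[slot]] if left[slot] != right[slot]
  end
end

# Example call, with paths to the two box.ovf files as placeholders:
#   a = adapter_types(File.read(path_to_bento_ovf))
#   b = adapter_types(File.read(path_to_debian_ovf))
#   differing_slots(a, b).each { |slot, (x, y)| puts "slot #{slot}: #{x} vs #{y}" }
```

For the two sections shown above, this would flag slots 1 through 7 and leave slot 0 alone, since both boxes use 82540EM for the NAT adapter.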

But oh well, at least I finally managed to find what the problem was, and can report it upstream. Plus, it was an interesting learning experience :)

April 2017

October 2017