Happy Holidays everyone,
I ran into an outage over the weekend and I am stumped. I do not believe this is a router- or switch-specific issue. I also don't think it is a hardware issue, as I've seen the same thing happen with another ESXi 4.1 box on different hardware.
Basically, I received an unexpected alert that my servers had gone down. Nagios kept sending alternating alerts (up, down, up, down, etc.), but none of the services were responding. The servers were running, but there was too much packet loss for anything to function.
Upon further inspection, the ESXi box was causing some sort of network loop / collision condition. If I unplug the ESXi network cable, all network issues go away. If I plug ESXi back in, packet loss immediately occurs again.
As a troubleshooting measure, I shut down the running virtual machines one at a time to see if any one of them was causing the problem. After I shut the final VM down, the loop issue disappeared. I then fired up each machine again and everything was back to normal. ?!
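The one-at-a-time test described above could be scripted roughly as follows. This is only a sketch of the procedure, not a VMware tool: `shut_down` and `measure_loss` are hypothetical callbacks you would have to supply yourself (for example, wrappers around whatever you use to power off a VM and to measure packet loss), and the 5% threshold is an arbitrary assumption.

```python
from typing import Callable, Iterable, Optional


def isolate_by_shutdown(
    vms: Iterable[str],
    shut_down: Callable[[str], None],     # hypothetical: powers off one VM
    measure_loss: Callable[[], float],    # hypothetical: returns packet loss in percent
    threshold: float = 5.0,               # assumed cutoff for "network is healthy"
) -> Optional[str]:
    """Shut VMs down one at a time, checking packet loss after each.

    Returns the VM whose shutdown first brought loss below the threshold,
    or None if loss stayed high after every VM was stopped.
    """
    for vm in vms:
        shut_down(vm)
        if measure_loss() < threshold:
            return vm
    return None
```

Note that a result equal to the last VM in the list does not by itself tell you whether that VM was at fault or whether the problem only cleared once every VM was off.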
So either the last VM I shut down was causing the issue, which I doubt, or some networking problem in VMware was going on and all VMs needed to be reset to resolve it.
I was running a single vSwitch with all my VMs and my Management Interface on it, connected to my main network switch.
As a preliminary measure, I made a separate vSwitch in VMware and connected a separate network cable to my physical switch. This new vSwitch now contains all the virtual machines, and the original vSwitch contains just the management interface on its own network cable. I'm not sure if or why this would resolve the issue, but I did it anyway.
Has anyone ever experienced this before, or can anyone shed some light on what specifically was occurring for the ESXi box to bring down my network? I've seen this on two different ESXi 4.1 boxes with different hardware and network topologies, so hopefully it's some configuration tweak I can make.
Thanks in advance.
Mike