About 50 of our 1000 VMs stopped responding for about 15 minutes this morning. The affected VMs run on different ESX 4.0.0 build 261974 servers, but all of them live on one of our 2 clustered NetApp controllers serving NFS. All 4 volumes on "MyFiler" disconnected. The filers and the blades running ESX connect through the same 2 L2 switches. The filer did not see any network disconnects. On the switches, traffic on the filer's ports dropped to near 0 KBps for the 15 minutes. The NFS volumes are several TB each and hold well over 50 VMs apiece, so I don't understand why only some of the VMs on them were affected. We have not implemented the NetApp-recommended disk timeout of 190s and left it at the default put in place by VMware Tools. Any ideas on what caused this?
I have edited and annotated some of the logs below to remove identifying information.
/var/log/messages [same logs appear on other ESX hosts in the same cluster]
Sep 27 04:01:03 server1 vobd: Sep 27 04:01:03.352: 34164107014900us: [vprob.vmfs.nfs.server.disconnect] ...
Sep 27 04:01:15 server1 vobd: Sep 27 04:01:15.358: 34164119021021us: [vprob.vmfs.nfs.server.disconnect] ...
Sep 27 04:01:15 server1 vobd: Sep 27 04:01:15.488: 34164119150841us: [vprob.vmfs.nfs.server.disconnect] ...
Sep 27 04:01:15 server1 vobd: Sep 27 04:01:15.489: 34164119151386us: [vprob.vmfs.nfs.server.disconnect] ...
/var/log/vmkwarning [same logs appear on other ESX hosts in the same cluster]
Sep 27 04:00:12 server1 vmkernel: 395:10:00:53.724 cpu16:127832)WARNING: VSCSI: 3116: handle 11526(vscsi0:1):WaitForCIF: Issuing reset; number of CIF:5
Sep 27 04:00:12 server1 vmkernel: 395:10:00:53.965 cpu6:110441)WARNING: VSCSI: 3116: handle 11506(vscsi0:1):WaitForCIF: Issuing reset; number of CIF:1
Sep 27 04:01:03 server1 vmkernel: 395:10:01:44.772 cpu2:4120)WARNING: NFS: 277: Lost connection to server <IP> mount point ...
Sep 27 04:01:15 server1 vmkernel: 395:10:01:56.778 cpu6:4120)WARNING: NFS: 277: Lost connection to server <IP> mount point ...
Sep 27 04:01:15 server1 vmkernel: 395:10:01:56.778 cpu6:4120)WARNING: NFS: 277: Lost connection to server <IP> mount point ...
Sep 27 04:01:15 server1 vmkernel: 395:10:01:56.778 cpu6:4120)WARNING: NFS: 277: Lost connection to server <IP> mount point ...
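In case it helps anyone line up the disconnect times across hosts, here is a rough Python sketch that counts the "Lost connection" warnings per host and timestamp. It assumes the syslog line layout shown in the vmkwarning excerpt above; the function name `count_nfs_disconnects` and the regular expression are my own and have only been checked against these sample lines, not against a full log.

```python
import re
from collections import Counter

# Matches the syslog prefix plus the vmkernel NFS warning as it appears in the
# excerpt above. The exact layout on other ESX builds is an assumption.
LOST_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+vmkernel:.*"
    r"WARNING: NFS: \d+: Lost connection to server"
)


def count_nfs_disconnects(lines):
    """Count 'Lost connection' warnings per (host, timestamp) pair."""
    counts = Counter()
    for line in lines:
        match = LOST_RE.match(line)
        if match:
            counts[(match.group("host"), match.group("stamp"))] += 1
    return counts


# Sample lines copied from the vmkwarning excerpt (already sanitized).
sample = [
    "Sep 27 04:00:12 server1 vmkernel: 395:10:00:53.965 cpu6:110441)WARNING: VSCSI: 3116: handle 11506(vscsi0:1):WaitForCIF: Issuing reset; number of CIF:1",
    "Sep 27 04:01:03 server1 vmkernel: 395:10:01:44.772 cpu2:4120)WARNING: NFS: 277: Lost connection to server <IP> mount point ...",
    "Sep 27 04:01:15 server1 vmkernel: 395:10:01:56.778 cpu6:4120)WARNING: NFS: 277: Lost connection to server <IP> mount point ...",
    "Sep 27 04:01:15 server1 vmkernel: 395:10:01:56.778 cpu6:4120)WARNING: NFS: 277: Lost connection to server <IP> mount point ...",
    "Sep 27 04:01:15 server1 vmkernel: 395:10:01:56.778 cpu6:4120)WARNING: NFS: 277: Lost connection to server <IP> mount point ...",
]

for (host, stamp), n in sorted(count_nfs_disconnects(sample).items()):
    print(host, stamp, n)
```

Running the same function over /var/log/vmkwarning from each host in the cluster should show whether every host lost the filer within the same few seconds or whether the disconnects were staggered.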
MyFiler logs - this is the filer hosting the VMs. Some of the errors are most likely unrelated.
Tue Sep 27 02:55:19 PDT [MyFiler: sis.changelog.full:warning]: Change logging metafile on volume <VOL> is full and can not hold any more fingerprint entries.
Tue Sep 27 04:11:56 PDT [MyFiler: nfsd.tcp.close.idle.notify:warning]: Shutting down idle connection to client (<IP of ESX17>) where receive side flow control has been enabled. There are 72104 bytes in the receive buffer. This socket is being closed from the deferred queue.
Tue Sep 27 04:19:03 PDT [MyFiler: rlm.orftp.failed:warning]: RLM communication error, receiver timeout waiting for RLM response.
Tue Sep 27 04:19:03 PDT [MyFiler: rlm.driver.hourly.stats:warning]: The software driver for the Remote LAN Module (RLM) detected a problem: Configuration Error (1).
Tue Sep 27 04:19:15 PDT [MyFiler: replication.src.err:error]: SnapVault: source transfer from ... transfer aborted because of network error.
Tue Sep 27 04:19:27 PDT [MyFiler: replication.src.err:error]: SnapVault: source transfer from ... transfer attempted from busy source.