Subnet masks are important…SharePoint Is Up…err Down

We had this awesome situation over the past 5 days.  Another team wanted to use ElasticSearch to index SharePoint.  They would attempt to connect to SharePoint, but were not able to.  Of course, the SharePoint servers were in fact up, as demonstrated by my ability to connect to them from my laptop and from other servers in the farm.  I therefore wrote them off as crazy and put it down to a firewall/F5/Linux issue.  But they kept nagging at me, eventually escalated to the higher powers that be, and I was forced to deal with it.  Here's how it played out:

Quick Facts:

  • ElasticSearch on its own /28 subnet
  • SharePoint on its own /28 subnet (more on this later)
  • F5 VIPs for load balancing on both sides (both SP WFEs and ElasticSearch queries)
  • Both subnets part of a larger /24 subnet allocation pool

The process (after 5 days of back and forth):

  • Can you ping our server IPs?  Yes
  • Can you hit our SP URLs?  No
  • What happens when you ping via DNS?  We see the F5 VIP IP
  • Change your hosts file to point to a WFE directly, can you hit our server?  Yes
  • Oh, we need a bounceback iRule for the SP servers to talk to each other, let's add that now
  • Maybe we need a reverse proxy on the VIP?  Let's add that.
  • Remove your hosts file, can you hit our servers?  No
  • Fire up Wireshark on all the servers, do logging on the F5
  • Traffic flows from ElasticSearch, through the F5, and does arrive at our SP WFE; however, the WFE kills the TCP connection and no IIS request is logged – WTF…
  • Chris – "OK guys, let's start at the bottom and work our way up the OSI layers…"
    • Ethernet adapters good? – Yup
    • Layer 2 ok? – Yup
    • Layer 3 – got IPs? – Yup.  Chris – "Hey, what is your guys' subnet mask?"  Them – "255.255.255.240".  Chris – "Ours is 255.255.252.0"….FUCK

5 hours over 5 days wasted, frustrated, starting to think they were crazy F5 guys…all because the network guys didn't set up our subnet properly.  What was happening is that the SharePoint servers had a huge subnet mask configured (255.255.252.0 instead of the 255.255.255.240 a /28 should have).  This caused the SP servers to think that the ElasticSearch servers were on the same subnet when they weren't.  So when a WFE couldn't reach them directly at layer 2, it would kill the TCP connection.  Awesome.
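
If you want to see the mismatch for yourself, here is a minimal sketch using Python's standard ipaddress module.  The IP addresses are made up for illustration (the real ones aren't in this post), and is_on_link is just a hypothetical helper name; only the two masks come from the story above.

```python
import ipaddress


def is_on_link(local_ip: str, netmask: str, peer_ip: str) -> bool:
    """Return True if a host configured with local_ip/netmask would
    treat peer_ip as being on its own subnet (no gateway needed)."""
    iface = ipaddress.ip_interface(f"{local_ip}/{netmask}")
    return ipaddress.ip_address(peer_ip) in iface.network


def prefix_length(netmask: str) -> int:
    """Convert a dotted-quad mask to its CIDR prefix length."""
    return ipaddress.ip_network(f"0.0.0.0/{netmask}").prefixlen


# Hypothetical addresses: two different /28s carved from the same pool.
SP_WFE = "10.0.0.20"   # would live in 10.0.0.16/28
ES_NODE = "10.0.0.36"  # would live in 10.0.0.32/28

GOOD_MASK = "255.255.255.240"  # what ElasticSearch had
BAD_MASK = "255.255.252.0"     # what SharePoint actually had

print(prefix_length(GOOD_MASK))  # 28
print(prefix_length(BAD_MASK))   # 22

# ElasticSearch side: correctly sees SharePoint as remote.
print(is_on_link(ES_NODE, GOOD_MASK, SP_WFE))  # False

# SharePoint side with the mask it should have had: also remote.
print(is_on_link(SP_WFE, GOOD_MASK, ES_NODE))  # False

# SharePoint side with the mask it really had: thinks ES is local.
print(is_on_link(SP_WFE, BAD_MASK, ES_NODE))   # True
```

The last line is the whole problem in one boolean: with the /22 mask, the WFE believes the ElasticSearch box is a neighbour on its own segment.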

Enjoy!
Chris