My first MTU issue: Outlook connectivity issues via a Site-to-Site VPN
When studying for my CCNA in Routing and Switching, I heard the importance of MTU on devices with routing capabilities mentioned various times. Yet for something supposedly so important, I had never come across a scenario where MTU was actually causing problems. Until today!
Symptoms:
- No Outlook 2010 Connectivity to Exchange 2010 via
Site-to-Site VPN.
- Creation of Outlook profile hangs and eventually fails at
"Logon to Server" stage.
- HTTPS connectivity, OWA for instance, works fine.
- Outlook.exe /rpcdiag shows a successful connection of type
"Directory" but nothing more
- Investigation shows that all routing is configured
correctly
Cause:
A device on the path between the Outlook client and the Exchange environment is operating with a lower MTU than the default. Usually "Path MTU Discovery" would ensure that packets are transmitted at the lower MTU, but here it is failing, so connections time out, seemingly disappearing into a black hole.
Related Devices and Network Diagram:
CheckPoint Firewall
Load Balancer
Microsoft Exchange 2010
Resolution:
Enable MSS clamping on the CheckPoint firewalls that terminate the site-to-site VPN. When Path MTU Discovery fails, the firewalls rewrite the "Maximum Segment Size" (MSS) advertised on connections through the tunnel, forcing endpoints to transmit at an artificially lower effective MTU and greatly increasing the chance of packets reaching their destination rather than being dropped.
To fix this, add the following lines to the respective config files on each CheckPoint Gateway and reboot:
Add this line to the $PPKDIR/boot/modules/simkern.conf file:
sim_clamp_vpn_mss=1
Add this line to the $FWDIR/boot/modules/fwkern.conf file:
fw_clamp_vpn_mss=1
The Evidence:
Firstly, what is MTU? You may have read, seen or heard this
mentioned over the years when discussing your broadband package with your ISP,
or when configuring a home broadband router. MTU stands for "Maximum
Transmission Unit" and is used to define the largest size packet that can
be transmitted on a network. The default MTU on many devices is 1500 bytes,
because that is the maximum payload size defined in the IEEE 802.3 Ethernet standard. The other value you will often see quoted is 576 bytes, which was historically the default assumed for remote destinations. MTU values, however, have largely become invisible to everyday internet users, because modern operating systems like Windows 7/8 can work out a suitable value automatically. But the general idea is that too large an
MTU size will mean more retransmissions if the packet encounters a router that can't handle a packet that large. The packet will be broken down into smaller
chunks and re-transmitted. Too small an MTU size means relatively more header
overhead and more acknowledgements that have to be sent and handled when
compared to the actual useful data in the packet. Large MTU sizes can be useful
when implemented in a private network where large packets are being sent a lot.
If all of your routers, switches and hosts support "Jumbo Frames" (an
MTU of 9000) then you can send large packets very quickly!
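As a side note, the same ping trick used later in this post can tell you whether a path genuinely supports jumbo frames end to end: send a payload of 8972 bytes (9000 minus the 28 bytes of headers discussed further down) with the don't-fragment flag set. The 10.0.0.1 address here is just an example target:
ping -f -l 8972 10.0.0.1
If every device on the path supports jumbo frames you get normal replies; otherwise you will see the fragmentation error covered later in this post.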
All of the above is true for MTU when the Transmission
Control Protocol (TCP) is being used. Other protocols have varying
implementations of MTU handling. On top of this, vendors often implement the TCP/IP stack slightly differently even across their own devices, meaning that the default MTU can differ, and can change between two devices with only a slight firmware version difference. Often, this is the reason you run into MTU problems.
So, to give you some context for the strange issue I experienced today: the company I work for has recently opened a new branch
office, with a two-way site-to-site VPN connection back to the existing head
office using CheckPoint firewalls. It’s one of these fully kitted out offices
where you pay rent to the management company and they make the office liveable
and usable. Therefore, all connections pass through the managed office network
equipment. The new office is part of the existing domain, and has been set up
with its own Active Directory Site and Domain Controller. Microsoft Exchange
2010 is hosted at the Head Office, and there are a handful of users in the new
office that need to use Outlook 2010 to connect to Exchange. There are two
Exchange Client Access Servers, with a KEMP Loadmaster 3600 HA deployment sat
in front, load balancing connections between the two. Finally, there is a
database availability group containing two Exchange Mailbox Servers behind the
CAS servers.
I've been looking into the issue that has existed since the initial
setup where Outlook 2010 doesn't properly connect from the new branch office.
The behaviour we see is that when creating the first profile on a workstation, the wizard freezes after it has looked up the user's account, at the "Log onto server" stage. If we run Outlook.exe /rpcdiag, I can see that a "Directory" connection has been made over port 59667, but the "Mail" connection I would also expect to see over port 59666 never appears. Eventually, the wizard times out saying the Exchange server is unavailable.
Initially, I suspected a routing issue. I went through the
entire network route with a fine-toothed comb and could not find anywhere that the traffic might be routing incorrectly. The management of the CheckPoint firewalls is outsourced to a third party, so I also raised this with them, asking for their top CheckPoint guru to take a look at it (what this guy doesn't know about firewalls isn't worth knowing!). He was stumped too! So
after back-tracking and going over the route multiple times I was convinced
that this was not a network routing issue, and decided to take a different
approach.
I knew that some sort of connection was possible, as I could see Outlook connecting on port 59667, and it was able to look up my account in
the directory. At some point however, packets were failing or being dropped. We
confirmed that the firewall was allowing ALL traffic between the two sites. So
finally, I decided a packet capture between the load balancer and the client
workstations was needed to see if this could give me any insight into the
issue.
I took a packet
capture between the KEMP Load Balancer and a client at the new branch site by
doing the following:
- Log into KEMP Load Balancer Management Interface.
- Go to System Configuration > Logging Options > System Log Files.
- Press Debug Options.
- Under "TCP Dump", select Interface All, set address to be that of the client workstation.
- Click Start.
- Attempt the Outlook connection by launching Outlook and stepping through the new profile wizard. The wizard will freeze at the "Log onto server" stage.
- Back in the KEMP interface, click Stop, then Download.
- Open the downloaded pcap file in Wireshark.
Within the packet capture, I could see that the load
balancer and the client workstation were talking to each other as I could see
lots of SYN, SYN ACKs and ACKs. Everything looked pretty normal. That was until
I got towards the end of the capture and noticed some black-filled entries
(black usually signifies bad in Wireshark!) and some ICMP packets flying
around. ICMP stands for "Internet Control Message Protocol", and is
basically a protocol that devices can use to tell each other useful
information. For example, the message you receive with each successful ping is
an ICMP message.
The ICMP messages I found stated "Destination
Unreachable (Fragmentation Needed)". I did some research into this ICMP
message and found that this message is sent if a packet sent is larger than the
MTU, but also has a flag set that says "Do Not Fragment". What
usually happens if a packet is larger than the MTU of a device is that the packet is "fragmented", i.e. broken down into chunks smaller than the MTU, re-transmitted, and then reassembled when the chunks reach the destination. If
a packet has a flag of "do not fragment", rather than doing this, the
device with the MTU smaller than the packet size will just drop the packet! So
this is how I knew that we had an MTU issue.
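As an aside, you don't need to scroll through a large capture to spot these messages: "Destination Unreachable (Fragmentation Needed)" is ICMP type 3, code 4, so a Wireshark display filter like the one below will isolate them. The tshark command-line equivalent follows it, assuming the downloaded capture was saved as capture.pcap:
icmp.type == 3 && icmp.code == 4
tshark -r capture.pcap -Y "icmp.type == 3 && icmp.code == 4"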
I wanted to be certain that what I was seeing was indeed the
cause of our connectivity issue and not just a red herring. I used the
following ping command, which allows me to send ICMP packets of a size of my choosing (the normal ping command sends only a 32-byte payload). As mentioned earlier, Windows and many other devices have a default MTU of 1500.
What I guessed we were experiencing was a hop along the journey where the MTU was lower than this default, and therefore lower than the packet size being sent. It made sense in theory, but I needed to prove it. I sent a ping packet with a payload of 1472 bytes (because a 28-byte header is added to each packet, and 1472+28=1500, the default MTU) from the Outlook client to the load balancer by
running the following from a command prompt:
ping -f -l 1472 10.0.0.1
Here, the -f flag creates the packet with the "Do not
fragment" flag set that I mentioned earlier and the -l flag (that’s a
lower-case L) allows me to define the packet size.
What I got back was
our trusty ICMP message:
"Reply from 10.0.0.1: Packet needs to be fragmented but
DF set.".
This instantly backed up my theory that there is a device between the Branch Office and Head Office with an MTU lower than 1500. So
next, I began reducing the packet size by 10 bytes, until my ping packet
succeeded. Eventually I found that the maximum packet size I could send was
1410. Therefore, there is a device on the path that has a maximum MTU of 1438
(1410+28).
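Incidentally, rather than adjusting the size by hand each time, the whole sweep can be typed straight into the command prompt as a rough one-liner (in a batch file you would need %%i instead of %i). Each iteration echoes the command with the substituted size, so you can read off the largest payload that gets a normal reply rather than the fragmentation error:
for /L %i in (1472,-10,1372) do ping -f -l %i -n 1 10.0.0.1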
So I was certain that there definitely was a device with a
lower MTU somewhere on our journey, but I still didn't know whether this was
what was causing our issue. After all, if a device does have a lower MTU, it’s
supposed to send an ICMP message back to the source, and the source should
break the packet into smaller chunks and re-transmit, right?
To prove I wasn't being led on a wild goose chase, I needed
to manipulate the packet size being sent when Outlook 2010 tried to connect, so
that it was within the restricted MTU of 1438, rather than 1500 in the first
place. If my theory was correct, Outlook would then connect without any issues.
After some research, I found that in Windows I could run the following commands in an elevated (administrator) command prompt to force the MTU of the NIC in the computer:
netsh interface ipv4 show interfaces
This gave me the following output:
Idx     Met         MTU          State                Name
---  ----------  ----------  ------------  ---------------------------
  1          50  4294967295  connected     Loopback Pseudo-Interface 1
 12          20        1500  connected     Local Area Connection
As you can see, the MTU for our NIC is the default 1500. I
ran the following to force this to 1438 until next reboot:
netsh interface ipv4 set subinterface "12" mtu=1438 store=active
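As a quick check, if you want to confirm the change has taken effect before retrying Outlook, re-running the earlier show command should now list 1438 in the MTU column against interface 12:
netsh interface ipv4 show interfaces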
I then tried Outlook again and it connected perfectly. Point
proven, but how do we fix it? Everything I've been reading is telling me that
packets should be fragmented and re-transmitted in the event that they hit a device with a lower MTU than the packet size. So why isn't a fundamental part
of the TCP/IP protocol working as described?
Well, I looked further into what should happen, and found
that the process is called "Path MTU Discovery" (PMTUD). Essentially, the sender learns the lowest MTU on the route from ICMP messages returned by the hops along the way, and therefore what size packets should be sent for successful transmission. So something wasn't able to successfully complete PMTUD. Looking at our network diagram, the culprit is likely to be one of these:
KEMP Load Balancer
CheckPoint Firewall at Head Office
Managed Office Firewall/Router
CheckPoint Firewall at Branch Office
As I said earlier, our firewalls are managed by a third
party, so whilst I investigated the load balancer and got in touch with the
Managed Office regarding their equipment, I reported my findings to the third
party so that they could investigate the CheckPoint Firewalls.
Shortly after, I received a response from our firewall support. They confirmed that the firewalls at both Head Office and Branch Office had an MTU of 1500. I had confirmed that the load balancer was also 1500,
so this meant that a device within the Managed Office setup was functioning
with a lower MTU. So a potential fix would be to get Managed Office to find
this device and change it. Chances are though, there is probably a reason why
it is set with a lower MTU, and so this may not be an option!
The firewall guys also confirmed that both firewalls supported Path MTU Discovery, that it was enabled, and that it should be sending the necessary ICMP packet back to the load balancer (as the source) telling it to re-transmit with a lower MTU. Basically, this should work out of the box.
As part of the Path MTU Discovery process, a device works out the size of the actual useful data in a packet by taking the MTU and subtracting around 40 bytes for headers. So if the transmitting MTU is 1500, its first guess for the useful data size will be 1500-40=1460 bytes. The device then sends data at this size. If the packet is too large, it expects to receive an ICMP message from the device that can't handle it, and it will then send again at a smaller size. It keeps doing this until it no longer receives an ICMP packet, at which point the actual data is transmitted at the correct size for the path.
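Applied to this case, and assuming standard 20-byte IP and 20-byte TCP headers with no options, the 1438-byte path MTU found earlier would correspond to a usable segment size of roughly 1438 - 20 - 20 = 1398 bytes, which is the sort of value you would expect a clamped MSS to end up at.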
Very strange issues can occur if these ICMP packets don't
arrive successfully. As you can imagine, if the device performing PMTUD doesn't receive an ICMP packet telling it that a packet was too large, it will carry on letting devices transmit at an MTU that is too large! The device with the lower MTU will then
drop the packet, and the connection will disappear into a black hole and
time out.
So how do you fix this you ask? How can we ensure our data
will be sent and received, even if the intermediary ICMP packets don't arrive?
Well, CheckPoint firewalls (and I'm sure most other vendors' too) are capable of doing something called MSS Clamping, which on CheckPoint appears to have been available since R77.20. This is, in essence, a "fudging" of the MSS value: the firewall tells the endpoints that the Maximum Segment Size is artificially lower than it really is. The endpoints then transmit data in smaller segments that fit within the lower MTU, so the packets are far more likely to reach their destination. In other words, MSS Clamping gets the client and server to use smaller packets even when the PMTUD messages are not getting through.
On each CheckPoint Gateway, you want to add the following lines to the respective config files:
Add this line to $PPKDIR/boot/modules/simkern.conf file and reboot:
sim_clamp_vpn_mss=1
Add this line to $FWDIR/boot/modules/fwkern.conf file and reboot:
fw_clamp_vpn_mss=1
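If you want a sanity check after the reboot, CheckPoint's fw ctl command can query firewall kernel parameters; assuming it applies to this parameter on your version (worth confirming with your firewall support), the following should return 1 once clamping is active:
fw ctl get int fw_clamp_vpn_mss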
Upon enabling the above parameters, Outlook sprang into life for all workstations at the Branch Office. So the root cause seems to be the Path MTU Discovery process not working correctly, probably due to multi-vendor devices talking to one another across the VPN tunnel between the Branch Office and Head Office. The resolution is to configure MSS clamping on one or more devices along the journey, which ensures that if PMTUD fails to set the correct MTU, the segment size is controlled artificially so packets still reach their destination.
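One final verification idea: take a fresh packet capture of a new Outlook connection (using the same KEMP TCP Dump procedure as earlier) and look at the MSS option in the TCP SYN and SYN/ACK packets; with clamping in place it should advertise a value below the usual 1460. In Wireshark, a display filter along these lines should surface those packets (tcp.options.mss_val is, to the best of my knowledge, the standard field name for the TCP MSS option, but check it against your Wireshark version):
tcp.options.mss_val < 1460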