Thursday, 1 February 2018

Configuring Topdesk SAML Single Sign On with F5 Big-IP IDP

I recently had a requirement to configure SAML 2.0 single sign-on for Topdesk SaaS. My company currently uses an on-premises F5 BIG-IP as a local IdP for SAML, and we have a few cloud apps working this way already. When I set those up, however, the documentation from both F5 and the cloud providers themselves was readily available and very thorough.

Topdesk, however, not so much. The Topdesk Operator Manual, which is only available to customers via their extranet, talks only about SAML 2.0 with Microsoft ADFS. A disclaimer states something along the lines of "SAML single sign-on should work with any SAML 2.0 compliant IdP in theory, but has only been thoroughly tested with Microsoft ADFS". Not useful, but reassuring that I should be able to get this to work.

Google searches didn't return anyone with experience of implementing F5 as an IdP for SAML authentication to Topdesk SaaS, but they did return a few cloud providers who offer it as a service. So hopefully this blog post will help someone else like me out there in the future.

The documentation is written for TOPdesk 8.01.015 and F5 Big-IP 13 Hotfix 3. Other versions may differ.

Configure F5 Big-IP

Configure Virtual Server

First things first, you will need a virtual server on your F5 to listen for HTTPS traffic.

  1. Go to Local Traffic > Virtual Servers > Virtual Server List.
  2. Click Create.
  3. Give your Virtual Server a name that will help you easily identify it in the future. A Description can help here too.
  4. Set type to Standard.
  5. Source Address allows you to filter which sources can access the virtual server. If you want it to be accessible from anywhere, set it to 0.0.0.0/0.
  6. Destination Address/Mask allows you to set the IP address the F5 will listen on. This IP address will need to be accessible from the internet, either by being a publicly routable address, or with NAT configured on your perimeter firewall. It should also be part of a public DNS A record.
  7. Service Port allows you to set what port the F5 will listen on. Set this to 443 HTTPS.
  8. Ensure state is set to Enabled.
  9. The only other options you need to set right now are your SSL Profile (Client) and SSL Profile (Server). The certificate needs to be trusted by the devices that will access Topdesk; the easiest option is to purchase one from a well-known Certificate Authority. I won't go over creating SSL Profiles in this blog post, as the documentation is out there if you need it.
  10. Click the Finished button.
Your config should look something like this:
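In essence (example values; the name, IP address and SSL profile names below are placeholders for your own):

Name: vs_topdesk_saml
Type: Standard
Source Address: 0.0.0.0/0
Destination Address/Mask: 203.0.113.10
Service Port: 443 (HTTPS)
SSL Profile (Client): topdesk-clientssl
SSL Profile (Server): serverssl
State: Enabled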



Configure SAML IdP Service

Next, we’ll configure the F5 SAML IdP Service.
  1. Go to Access > Federation > SAML Identity Providers > Local IdP Services.
  2. Click Create.
  3. On the General Settings page:
    1. In IdP Service Name, enter a name that will let you easily identify this service in the future, eg. Topdesk IdP.
    2. In IdP Entity ID, you need to enter a unique identifier for this IdP service. It needs to include the public DNS name of the virtual server you created earlier. For example: https://topdesksso.example.com/idp/topdesk.

  4. Click on the Assertion Settings section:
    1. Set Assertion Subject Type to Transient Identifier.
    2. Assertion Subject Value needs to be the session variable, pulled from the logged-in user, that will be used to log users in to Topdesk. Usually this is the email address, so set it to %{session.ad.last.attr.mail}.
    3. Set Authentication Context Class Reference to urn:oasis:names:tc:SAML:2.0:nameid-format:transient.


  5. Click on the SAML Attributes section:
    1. Click Add.
    2. Set Name to Email Address.
    3. Click Add.
    4. Set value to %{session.ad.last.attr.mail}.
    5. Click Update.
    6. Click OK.

  6. Click on the Security Settings section:
    1. Select the certificate and signing key that you want to use to sign the SAML assertion. This adds a layer of protection to the sign-in process.
    2. Click OK.



The SAML Local Identity Provider service is now configured, and we are ready to share its metadata with Topdesk. Select it and choose Export Metadata to download an .xml file that will be uploaded to Topdesk.

Configure Topdesk SAML Service

  1. Log into your Topdesk SaaS instance as an operator.
  2. Click Topdesk Menu > Settings.
  3. Drill down Functional Settings > Login Settings > General.
  4. Under SAML Login, you'll see two sections: Public and Secure. Public configures SAML for the Self-Service Portal, whilst Secure configures SAML for operator login. I'll describe configuring the Self-Service Portal, but the process is the same for Operator.
  5. Click Add Configuration under Public.
  6. Select Upload as file, click Browse, and select the .xml metadata file you exported from the F5.
  7. This will place an entry in the Entity ID drop-down box matching the IdP Entity ID you entered in the F5 IdP config (eg. https://topdesksso.example.com/idp/topdesk). Select it.
  8. The User name attribute needs to match the SAML attribute you configured on the F5 IdP. In our example this was Email Address.
  9. The logout URL will complete automatically when you select the Entity ID.
  10. The Topdesk endpoint should also have completed automatically; this is the URL of your Topdesk SaaS instance.
  11. Tick Host Topdesk Metadata.
  12. Untick Assertions will be encrypted.
  13. Upload the certificate and key that you configured in the Security Settings section of the F5 IdP. This can be a pain to get right: the certificate and key must be separate files, the certificate must be a .pem in X.509 format, and the private key must be an RSA key in PKCS#8 format with DER encoding. This can be done with OpenSSL; see the sketch after this list.
  14. Enter a display name; this is what the button will say on the Topdesk login page once SAML is enabled, so something simple like Login works well here.
  15. Click Save.
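For reference, the conversion can be done with OpenSSL along these lines (a sketch; cert.crt and key.key stand in for your own certificate and key files):

# certificate to PEM-encoded X.509 (add -inform DER if your source is DER)
openssl x509 -in cert.crt -out cert.pem -outform PEM
# RSA private key to PKCS#8, DER encoding, no passphrase
openssl pkcs8 -topk8 -inform PEM -outform DER -in key.key -out key.der -nocrypt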

  16. Select the config you just made, and click Download. This will download the .xml metadata copy we can use to complete the configuration on our F5.
  17. Just before we finish in Topdesk, we should enable SAML authentication on the Self-Service Portal. Head to Self-Service Portal and tick SAML Single Sign On.
  18. Click Save.

Complete F5 SAML Configuration


Create External SP Connector and Bind to Local IdP Service

Finally, back on our F5:
  1. Go to Access > Federation > SAML Identity Providers > External SP Connectors.
  2. Click the drop-down arrow next to Create and choose From Metadata.
  3. Click Browse and select the .xml metadata file you just downloaded from Topdesk.
  4. Enter a Service Provider Name (eg. Topdesk) and click OK.


Next, we must link the IdP configuration to the SP configuration you just created.

  1.  Go to Access > Federation > SAML Identity Providers > Local IdP services.
  2. Tick the Topdesk IdP configuration from earlier.
  3. Click Bind/Unbind SP Connectors.
  4. Tick the Topdesk SP configuration you just created.
  5. Click OK.



Configure SAML Resource and Attach to IdP Service

  1. Next, go to Access > Federation > SAML Resources.
  2. Click Create.
  3. Give the resource a name, eg. Topdesk.
  4. From the SSO Configuration dropdown, select the Topdesk IdP service you created earlier.
  5. Click Finished.
Note: The remainder of the process is standard F5 SAML configuration. The below is taken from the F5 guide; I've added notes where specific config is needed, but they are few and far between:

Configure a full webtop

A full webtop allows your users to connect and disconnect from a network access connection, portal access resources, SAML resources, app tunnels, remote desktops, and administrator-defined links.
  1. On the Main tab, click Access Policy > Webtops.
  2. Click Create to create a new webtop.
  3. Type a name for the webtop you are creating.
  4. From the Type list, select Full.
  5. Click Finished.
The webtop is now configured, and appears in the list. You can edit the webtop further, or assign it to an access policy.
To use this webtop, it must be assigned to an access policy with a full resource assign action or with a webtop and links assign action. All resources assigned to the full webtop are displayed on the full webtop.

Configuring an access policy for a SAML SSO portal

Configure an access policy so that the BIG-IP system (as an IdP) can authenticate users using any non-SAML authentication type, and assign SAML resources and a webtop to the session.
Note: This access policy supports users that initiate a connection at a SAML service provider or at the SAML IdP.
  1. On the Main tab, click Access Policy > Access Profiles. The Access Profiles List screen opens.
  2. In the Access Policy column, click the Edit link for the access profile you want to configure to launch the visual policy editor. The visual policy editor opens the access policy in a separate screen.
  3. Click the (+) sign anywhere in the access policy to add a new action item. An Add Item screen opens, listing Predefined Actions that are grouped by General Purpose, Authentication, and so on.
  4. From the General Purpose area, select Logon Page and click the Add Item button. The Logon Page Agent properties window opens.
  5. Make any changes that you require to logon page properties and click Save. The Access Policy window displays.
  6. Add one or more authentication checks on the fallback branch after the Logon Page action. Select the authentication checks that are appropriate for application access at your site.
  7. On a successful branch after an authentication check, assign SAML resources and a full webtop to the session.
    1. Click plus [+] on a successful branch after an authentication check. The Add Item window opens.
    2. From the General Purpose list, select the Full Resource Assign agent, and click Add Item. The Resource Assignment window opens.
    3. Click Add new entry. An Empty entry appears.
    4. Click the Add/Delete link below the entry. The window changes to display resources on multiple tabs.
    5. Select the SAML tab, then from it select the SAML resources that represent the service providers that authorized users can access.
    6. Click Update. The window changes to display the Properties screen, where the selected SAML resources are displayed.
    7. Click the Add/Delete link below the entry. The window changes to display resources on multiple tabs.
    8. Select the Webtop tab, then select a full webtop on which to present the selected resources. You must assign a full webtop to the session even if you have configured all SAML resources to not publish on a webtop.
    9. Click Update. The window changes to display the Properties screen. The selected webtop and SAML resources are displayed.
    10. Click Save. The Properties window closes and the Access Policy window is displayed.
You have configured a webtop to display resources that are available from service providers and that an authorized user can access.
  8. Optional: Add any other branches and actions that you need to complete the access policy.
  9. Change the Successful rule branch from Deny to Allow and click the Save button.
  10. Click the Apply Access Policy link to apply and activate your changes to the access policy.
  11. Click the Close button to close the visual policy editor.
You have an access policy that presents a logon page, authenticates the user, and assigns SAML resources and a full webtop on which to present them to the user.
Simple access policy for access to services on SAML service providers

As we want to send AD attributes (the email address) of the logged-in user to Topdesk, we need to add an AD Query object between "Successful" and "Full Resource Assign". You'll need your Active Directory configured as an AAA server under Access > Authentication > Active Directory, and the AD Query must return the mail attribute that populates %{session.ad.last.attr.mail}.
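A minimal sketch of the AD Query properties (field names as they appear in the APM visual policy editor; the server name and search filter are assumptions matching a typical username logon):

Server: /Common/your_ad_aaa_server
SearchFilter: sAMAccountName=%{session.logon.last.username}
Required Attributes: mail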

Adding the access profile to the virtual server

You associate the access profile with the virtual server so that Access Policy Manager® can apply the profile to incoming traffic and run an access policy if one is configured for this access profile.
  1. On the Main tab, click Local Traffic > Virtual Servers. The Virtual Server List screen opens.
  2. Click the name of the virtual server you want to modify. The one we created at the start!
  3. In the Access Policy area, from the Access Profile list, select the access profile.
  4. Click Update to save your changes.
Your access policy is now associated with the virtual server.

You're ready to go! You can either log in to your F5 webtop using AD credentials and click the Topdesk icon, or go to your Topdesk SaaS instance and use the SAML Login button we created earlier. The latter is a bit nicer for users. Either way, the webtop will prompt you for credentials. There are ways around this using Kerberos/NTLM, but that's for another time.

Monday, 18 December 2017

Nimble Windows Toolkit fails to install "Error 1920: Service Nimble Hotfix Monitor failed to start. Verify that you have sufficient privileges to start system services."

Today I was pre-empting the install of a new Nimble CS3000 array that's happening later in the week by installing the Nimble Windows Toolkit on my Server 2012 R2 Hyper-V hosts. This is recommended by Nimble, as it installs VSS providers and a handy iSCSI toolkit for managing connections.

Unfortunately, my installation was failing when getting to the stage of "Starting Services". The error message I would receive is:

 Error 1920: Service Nimble Hotfix Monitor failed to start. Verify that you have sufficient privileges to start system services.
Windows Event Logs would show that the Nimble Hotfix Monitor failed to start multiple times in a "timely fashion" (30,000 milliseconds, i.e. 30 seconds, by default). I tried a few things, including enabling UAC (why other admins feel the need to disable UAC in this day and age is beyond me!) but to no avail.
Eventually, I resigned myself to logging my first ticket with Nimble via their "Infosight" website. I was typing away the description of my problem when, on the right-hand side, a possible solution appeared: "NWT installation error: Error 1920: Service Nimble Hotfix Monitor failed to start. Verify that you have sufficient privileges to start system services.".
I opened the potential solution (bearing in mind none of the others were remotely like what my issue was) without much hope. And there it was, in all its glory:
"This seems to occur for the most part on Windows boxes that are not connected to the internet"
Why on earth would my Hyper-V host have access to the Internet??!!
So the solutions were:
  1. Temporarily allow access to the Internet (no thank you!)
  2. Increase the service timeout temporarily on the host machine.
I opted for option 2, as I didn't want to grant internet access on my Hyper-V host. To even recommend this as a workaround is ludicrous in my opinion. So the fix is to fudge a Registry Key, reboot the server and try the install again:

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\ServicesPipeTimeout
DWORD - set it to 120000 (a 120-second timeout)
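If you'd rather script it, the equivalent from an elevated command prompt should be something like this (a sketch; note that ServicesPipeTimeout is a value under the Control key, not a key itself):

reg add "HKLM\SYSTEM\CurrentControlSet\Control" /v ServicesPipeTimeout /t REG_DWORD /d 120000 /f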

Voila, my issue was resolved. According to Nimble, it is down to .NET trying to validate the service's signature against Microsoft servers; the normal 30-second service timeout expires before the outbound connection attempt times out.

I searched all over and couldn't find this solution anywhere, so hopefully this will help others out.

TL;DR:
If Nimble Windows Toolkit installation fails with the error "Error 1920: Service Nimble Hotfix Monitor failed to start. Verify that you have sufficient privileges to start system services.", either grant the server access to the internet (!) or, preferably, change the following registry value to increase the service timeout to 120 seconds (reboot required):
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\ServicesPipeTimeout
DWORD - set it to 120000 (a 120-second timeout)


Saturday, 29 October 2016

How To Install Nvidia Drivers for hybrid Intel/Nvidia GPU in Ubuntu 16.04 LTS

TL;DR - if you are having issues getting Nvidia drivers to work in Ubuntu, particularly in a hybrid graphics setup, then check the following:
  1. Secure Boot is disabled during installation if you have a UEFI BIOS.
  2. nomodeset is not set in /etc/default/grub
~

I, like many others before me, have endured the pure pain involved in trying to get a hybrid graphics setup working in Ubuntu. Having searched for a solution myself, it is clear that the exact system in question impacts the solution massively. Here's my setup and what I did to get it working:

Setup:
  • MSI GT70 2OC (UK edition)
  • CPU: Intel Core i7-4700MQ Processor
  • GPU: Nvidia Geforce GTX 770M/3GB GDDR5 (hybrid with Intel Embedded)
  • OS: Ubuntu 16.04 LTS (Xenial Xerus)

 1. Find out what graphics card you've got:

Pull up a terminal session by pressing CTRL+ALT+T and run the following command:

sudo lshw -numeric -C display

My output showed two display entries, one for each GPU.
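Illustratively, expect something like this (a hedged example; product strings, bus info and IDs will differ per machine):

  *-display
       description: 3D controller
       product: GK106M [GeForce GTX 770M] [10DE:11E0]
       vendor: NVIDIA Corporation [10DE]
  *-display
       description: VGA compatible controller
       product: 4th Gen Core Processor Integrated Graphics Controller [8086:416]
       vendor: Intel Corporation [8086]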


2. Find the latest supported driver


Now head over to https://www.nvidia.co.uk/Download/index.aspx?lang=en-uk and find the latest supported driver for your card (it's worth noting here that at the time of writing, the latest available driver from Nvidia was 370, and this may or may not work with my GTX 770M; I may test this when I'm bored):

 Make a note of the driver version that is displayed, but don't download it:


Instead, we are going to pull this straight from the graphics drivers team PPA.

3. Install the driver and Nvidia Prime


In your terminal, first add the repository:

sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt-get update

Once successfully added, install the driver and Nvidia Prime (used to switch between the integrated Intel graphics and the dedicated Nvidia graphics).

sudo apt-get install nvidia-367 nvidia-prime

 Once completed, reboot your laptop. Now this is where the fun can begin, so make sure you can access this blog from somewhere other than the laptop you're working on!


Troubleshooting


If at this point you are chucked back to the login prompt or you hit a black screen, do the following:


Workaround:


Hit CTRL+ALT+F1 to get to a command line. If this doesn't work, hold Shift whilst you power up your laptop to get to the GRUB menu, then go into recovery mode.
Run the following command to activate the Intel graphics card:

sudo prime-select intel

Reboot and your problem should go away; however, this doesn't fix the issue with the Nvidia GPU.

If this doesn't work, pull the nvidia drivers completely by running

sudo apt-get purge nvidia*

Reboot and your device will use the working nouveau open source driver.

Things to check:

Secure Boot:

If you have a modern laptop with a UEFI-style BIOS, it may have a feature called Secure Boot, which only allows signed code to run at boot. Essentially, you want to disable Secure Boot in your BIOS and get the Nvidia drivers working prior to re-enabling it (or leave it disabled altogether; that call is yours). So:

  1. Run the following command in a terminal to remove the Nvidia drivers:

sudo apt-get purge nvidia*

  2. Reboot your laptop.
  3. Disable Secure Boot in your BIOS.
  4. Follow the instructions above to install the Nvidia drivers.
  5. Reboot and check that you can log in fine after running prime-select nvidia.
  6. If you can, re-enable Secure Boot in your BIOS if you wish.
  7. Check that you can still log in with prime-select nvidia active.

nomodeset

If Secure Boot isn't your problem, or disabling it didn't help, then you may need to get Google searching. There are reams of different solutions out there for different models of system. If you have been trying to get the Nvidia drivers working prior to landing here, then before you do any more searching, I have one more suggestion for you to try:

Whilst searching for a resolution to this (which turned out to be Secure Boot), one suggested solution was to set the nomodeset parameter. So what is nomodeset? One answer I found puts it like this:

The newest kernels have moved the video mode setting into the kernel. So all the programming of the hardware specific clock rates and registers on the video card happen in the kernel rather than in the X driver when the X server starts. This makes it possible to have high resolution nice looking splash (boot) screens and flicker free transitions from boot splash to login screen. Unfortunately, on some cards this doesn't work properly and you end up with a black screen. Adding the nomodeset parameter instructs the kernel to not load video drivers and use BIOS modes instead until X is loaded.

So where to check? And trust me, it's worth a check, as I was banging my head on my desk for two days trying to find this:

Launch terminal and run the following command to edit the grub file (you can replace nano with a text editor of your choice eg. gedit, vi):

sudo nano /etc/default/grub

My grub file looked like this:
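The relevant line is GRUB_CMDLINE_LINUX_DEFAULT; with nomodeset set, the file looks something like this (illustrative 16.04 defaults; yours may carry other parameters):

GRUB_DEFAULT=0
GRUB_HIDDEN_TIMEOUT=0
GRUB_HIDDEN_TIMEOUT_QUIET=true
GRUB_TIMEOUT=10
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash nomodeset"
GRUB_CMDLINE_LINUX=""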


I remember setting nomodeset during the early days of troubleshooting my Nvidia woes, before I had found the Secure Boot solution. But with this set, configuring prime-select nvidia would give me a login loop, presumably because the kernel was unable to load the Nvidia drivers.

So edit your grub file to remove nomodeset and save it (in nano this is CTRL+O, Enter, CTRL+X).

Importantly, you must run the following command for the changes to take effect:

sudo update-grub

Run prime-select nvidia to force the Nvidia GPU to be used, then reboot, and you may have your issues resolved, like me!


Hopefully this fixes your issues and gets your Nvidia GPU working in Ubuntu, or at the very least points you in the right direction. Once working, a couple of things to remember:

prime-select query will show you which graphics card is currently selected

prime-select intel or prime-select nvidia will switch to that graphics card, taking effect after a reboot

Having used Windows on my GT70 for 4 or so years before, I got quite used to Nvidia automatically jumping between integrated and dedicated graphics. I'm not sure if Nvidia Prime can do this, but if you take a look at the Nvidia X Server Settings application, you will see something called "Application Profiles". I have a hunch that these may allow certain applications to run using a specific GPU, so in theory you could run prime-select intel normally, then have the Nvidia card jump in when you launch a game. I don't know if this works; there is very little information regarding it, but I may have a play with it in the future.

There is also a project called Bumblebee, created before Prime was available, for switching between integrated and dedicated graphics. This may be worth a look.

Wednesday, 26 October 2016

When my bad memory worked out for the best

TL;DR: Blue screen of death is commonly caused by bad memory! Bad memory can manifest itself in lots of different ways, and is different between operating systems. I saw BSOD in Windows, and kernel panics only when copying large files in Ubuntu. Bad Memory prompted my move to Ubuntu as my main operating system, but it wasn't without woes!

~

About a week ago, my MSI GT70-2OC running Windows 10 started acting weirdly. I leave the laptop on 99% of the time running Universal Media Server to allow me to watch the media I have stored on it anywhere in my home. Usually on my smartphone whilst washing up, or on my living room TV using my PS4.

I first noticed that now and again UMS would be uncontactable. Physically checking my laptop, I'd see that it was at the login screen. After being certain that I wasn't going mad, rebooting, or forgetting to log in, I checked the event logs to find that quite regularly the laptop was rebooting due to a bugcheck (more commonly known as the Blue Screen of Death!).

Checking the logs further, I could see that Microsoft had released an October driver pack around the time that I started having problems. I'd had frequent driver issues since upgrading from Windows 8.1 to 10: as MSI no longer support my model, they haven't released official Windows 10 drivers. Unfortunately, Microsoft also auto-installs driver updates. There are ways of stopping them from being installed, but only on a per-update basis, so unless you are eagle-eyed there is a good chance you won't notice until the driver goes on and, in my case, the trackpad stops working!

Presuming it was a driver issue, and because I had some time on my hands, I decided to bite the bullet and browse for the most user-friendly version of Linux available. I pulled out my 10-year-old laptop (still going strong, also an MSI!) and began browsing the Internet. By this point, my GT70 was unusable, rebooting with a BSOD after every login attempt. After some browsing, I came across Ubuntu 16.04 LTS. Supposedly user-friendly, and the screenshots looked beautiful. I downloaded the ISO image, burned it to disc and attempted the install. It failed on three separate attempts. So I grabbed the published checksum and compared it with my ISO's, and noticed the checksum of the ISO I had burned to CD was different. This usually means that the ISO became corrupt during download, or when burning to the CD (moral of the story: time and effort can be saved by checking the checksum first!).
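Checking takes seconds (a sketch, assuming the 64-bit desktop image; compare the output against the SHA256SUMS file published alongside the ISO):

sha256sum ubuntu-16.04-desktop-amd64.iso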

So I went back to the Ubuntu download site, grabbed the ISO again and compared the checksum. It checked out OK, so presumably the disc burner on my 10-year-old laptop is giving up the ghost a little and caused the issue. So, let's get with the times (who uses CDs anymore?) and go for the USB boot method. The process, following Ubuntu's official instructions, was surprisingly simple, and the resulting stick not only allows you to install Ubuntu, but also acts as a bootable live version, which can be very useful if you lose access to your OS for whatever reason.

So, I booted from the USB and ran the install successfully. I chose to format my 2x RAIDed 128GB SSDs and install Ubuntu there, as I didn't want to lose the media on my 750GB mechanical disk. The process was a lot quicker from the USB, and I was soon booted into a beautiful Ubuntu desktop.


Now I started to hit some more issues, and it was already getting on for 11pm. Firstly, the disk containing my media couldn't be mounted. An error message said that it contained hibernation data and could therefore only be opened in read-only mode. I did some digging: Windows 10 uses a feature called "Fast Startup" that essentially always keeps the OS in hibernation to give you quicker boot times. The usual workaround is to boot back into Windows, remove the hibernation file and disable Fast Startup. Impossible for me, as I had gone the whole hog and nuked my Windows setup!

I could mount the disk in read-only mode using the following commands, so at least I could still get to the data (here /dev/sda3 is my Windows data partition):

sudo mkdir -p /media/windows
sudo mount -t ntfs-3g -o ro /dev/sda3 /media/windows


So the best course of action was to back up the data, then wipe and re-partition the disk (this would also let me use the Linux-friendly ext4 file system over Windows' traditional NTFS). So I began copying my 600GB of media onto a USB hard disk I had lying around. Then I hit my next problem...

After around 5GB of transfer, my laptop would freeze completely, with just the Num Lock LED flashing. I had to hard power down the laptop and log back in, but every time I tried to copy the files, to either the USB hard disk or my RAIDed SSDs, I hit the same issue.

So I called it a night there, to tackle it another day. I had a working laptop, running a nice new OS; UMS was working, so the wife and I could carry on watching Arrow and The Flash, and we had our boy's favourite "In the Night Garden" ready for when he wanted it. The only downsides were that I couldn't write to my 750GB disk, and I hadn't even begun to ponder whether my Nvidia graphics card was working to its full potential. Or so I thought...

So the next day, I did some research into the flashing Num Lock and the freeze when copying files, and the most common answer tended to be dodgy RAM. At this point something clicked. When I had seen blue screens in the past on Windows, it was almost always down to a dodgy RAM stick. Could it be that one of my RAM sticks was faulty, and it was actually this that had been causing my problems all along?

Only one way to find out. I jumped onto my old laptop, went to http://www.memtest86.com/download.htm, and downloaded the bootable USB image. I followed the instructions at http://www.memtest86.com/technical.htm#win to create the USB stick, and soon had my GT70 booted into memtest. I kept the default settings and started the test, and quickly the errors were in their thousands.



So I was fairly certain I had some bad RAM. Next, I wanted to find out which RAM stick was causing the issues. I have two 8GB sticks that came pre-installed, so I wanted to run memtest with only one stick installed at a time to see if the errors disappeared. I found https://www.youtube.com/watch?v=QkL5q1-K15s and got to work taking my keyboard off and getting to my RAM. I quickly found the suspect RAM stick, took it out and began transferring my media off again. It worked perfectly!

Meanwhile, I headed over to uk.crucial.com and got myself a new 8GB stick of RAM. Once the transfer was finished, I used GParted to wipe and repartition my 750GB disk with the ext4 file system. I copied everything back and the laptop was working perfectly.

There was one last thing for me to do - get my hybrid graphics working. But that deserves its own separate blog post.

So it was a torrid time, but I'm extremely happy with Ubuntu 16.04 LTS. It's a beautiful, slick and responsive operating system. It actually feels like it's unlocked some hidden potential in my GT70, which is nearing 4 years old now.

Saturday, 1 August 2015

I got 99 problems but MTU ain't one



My first MTU issue: Outlook connectivity issues via a Site-to-Site VPN

When studying for my CCNA in Routing and Switching, I heard the importance of MTU on devices with routing capabilities mentioned various times. Yet for something of such importance, I had never come across a scenario where MTU was causing problems. Until today!

Symptoms:

- No Outlook 2010 Connectivity to Exchange 2010 via Site-to-Site VPN.
- Creation of Outlook profile hangs and eventually fails at "Logon to Server" stage.
- HTTPS connectivity, OWA for instance, works fine.
- Outlook.exe /rpcdiag shows a successful connection of type "Directory" but nothing more
- Investigation shows that all routing is configured correctly

Cause:

A device on the path between the Outlook client and the Exchange environment is functioning with a lower MTU than the default. Usually "Path MTU Discovery" would ensure that packets are transmitted at the lower MTU, but this is failing, and connections time out, seemingly entering a black hole.

Related Devices and Network Diagram:

CheckPoint Firewall
Load Balancer
Microsoft Exchange 2010
"Third party Managed Office" Router/Firewall

Resolution:

Enable MSS clamping on the CheckPoint firewalls creating the site-to-site VPN. Upon failure of Path MTU Discovery, the firewalls rewrite the "Maximum Segment Size" advertised for the connection, forcing all devices to transmit at an artificially lower size, thus increasing the chances of all packets reaching their destination rather than being dropped.

To fix this, add the following lines to the respective config files on each CheckPoint Gateway:

Add this line to $PPKDIR/boot/modules/simkern.conf file and reboot:
sim_clamp_vpn_mss=1 

Add this line to $FWDIR/boot/modules/fwkern.conf file and reboot:
fw_clamp_vpn_mss=1


The Evidence:

Firstly, what is MTU? You may have read, seen or heard it mentioned over the years when discussing your broadband package with your ISP, or when configuring a home broadband router. MTU stands for "Maximum Transmission Unit" and defines the largest packet that can be transmitted on a network. The default MTU on many devices is 1500 bytes, because this value is defined in the IEEE Ethernet standard. You will also often see 576 bytes, the traditional conservative value for internet links (it is the minimum datagram size every IPv4 host must be able to accept). MTU values have largely become meaningless for everyday internet users, because modern operating systems like Windows 7/8 are able to work out the appropriate size themselves.

The general idea is that too large an MTU size will mean more retransmissions if the packet encounters a router that can't handle a packet that big: the packet is broken down into smaller chunks and re-transmitted. Too small an MTU size means relatively more header overhead, and more acknowledgements that have to be sent and handled, compared to the actual useful data in the packet. Large MTU sizes can be useful in a private network where large packets are sent frequently: if all of your routers, switches and hosts support "Jumbo Frames" (an MTU of 9000) then you can send large packets very quickly!
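You can test whether a path genuinely handles jumbo frames end to end with the same ping trick used later in this post: an MTU of 9000 minus the 28-byte ICMP/IP header leaves an 8972-byte payload (Windows syntax; 10.0.0.1 is a stand-in target):

ping -f -l 8972 10.0.0.1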

All of the above is true for MTU when the Transmission Control Protocol (TCP) is being used; other protocols have varying implementations of MTU handling. On top of this, vendors often implement the TCP/IP stack slightly differently even across their own devices, meaning that the MTU value can differ between two devices with even a slight firmware version difference. And often this is the reason why you might run into MTU problems.

So, to give you some context for the strange issue I experienced today: the company I work for has recently opened a new branch office, with a two-way site-to-site VPN connection back to the existing head office using CheckPoint firewalls. It's one of these fully kitted-out offices where you pay rent to the management company and they make the office liveable and usable; therefore, all connections pass through the managed office network equipment. The new office is part of the existing domain, and has been set up with its own Active Directory Site and Domain Controller. Microsoft Exchange 2010 is hosted at the Head Office, and there are a handful of users in the new office that need to use Outlook 2010 to connect to Exchange. There are two Exchange Client Access Servers, with a KEMP Loadmaster 3600 HA deployment sat in front, load balancing connections between the two. Finally, there is a database availability group containing two Exchange Mailbox Servers behind the CAS servers.

I'd been looking into an issue, present since the initial setup, where Outlook 2010 doesn't properly connect from the new branch office. The behaviour we see when creating the first profile on a workstation is that the wizard freezes after it has looked up the user's account, at the "Log onto server" stage. Running Outlook.exe /rpcdiag, I can see that a "Directory" connection has been made over port 59667, but I would expect to see a "Mail" connection over port 59666 also, and this never appears. Eventually, the wizard times out saying the Exchange server is unavailable.

Initially, I suspected a routing issue. I went through the entire network route with a fine-toothed comb and could not find anywhere that traffic might be routing incorrectly. Management of the CheckPoint firewalls is outsourced to a third party, so I also raised this with them, asking for their top CheckPoint guru to take a look at it (what this guy doesn't know about firewalls isn't worth knowing!). He was stumped too! So after back-tracking and going over the route multiple times, I was convinced that this was not a network routing issue, and decided to take a different approach.

I knew that some sort of connection was possible, as I could see Outlook connecting on port 59667, and it was able to look up my account in the directory. At some point, however, packets were failing or being dropped. We confirmed that the firewall was allowing ALL traffic between the two sites. So finally, I decided a packet capture between the load balancer and the client workstations was needed to see if it could give me any insight into the issue.

 I took a packet capture between the KEMP Load Balancer and a client at the new branch site by doing the following:
  1. Log into KEMP Load Balancer Management Interface.
  2. Go to System Configuration > Logging Options > System Log Files.
  3. Press Debug Options.
  4. Under "TCP Dump", select Interface All, set address to be that of the client workstation.
  5. Click Start.
  6. Attempt the Outlook connection by launching Outlook and stepping through the new profile wizard. The wizard will freeze at the "log onto server" stage.
  7. Back in Kemp, click Stop, then Download.
  8. Open the downloaded pcap file in Wireshark.

Within the packet capture, I could see that the load balancer and the client workstation were talking to each other: lots of SYNs, SYN-ACKs and ACKs. Everything looked pretty normal. That was until I got towards the end of the capture and noticed some black-filled entries (black usually signifies bad in Wireshark!) and some ICMP packets flying around. ICMP stands for "Internet Control Message Protocol", and is basically a protocol that devices can use to tell each other useful information; for example, the message you receive with each successful ping is an ICMP message.

The ICMP messages I found stated "Destination Unreachable (Fragmentation Needed)". I did some research into this ICMP message and found that this message is sent if a packet sent is larger than the MTU, but also has a flag set that says "Do Not Fragment". What usually happens if a packet is larger than the MTU of a device is that the packet is "fragmented" or broken down into chunks smaller than the MTU and is re-transmitted, then reassembled when they reach the destination. If a packet has a flag of "do not fragment", rather than doing this, the device with the MTU smaller than the packet size will just drop the packet! So this is how I knew that we had an MTU issue.

I wanted to be certain that what I was seeing was indeed the cause of our connectivity issue and not just a red herring. I used the following ping command, which allows me to send ICMP packets of a size of my choosing (the normal ping command sends only 32-byte payloads). As mentioned earlier, by default Windows and many other devices have an MTU of 1500. What I guessed we were experiencing was a hop along the journey where the MTU was lower than this default, and indeed lower than the packet size being sent. It made sense in theory, but I needed to prove it. I sent a ping packet with a payload of 1472 bytes (because each packet gets a 28-byte header added, and 1472+28=1500, the MTU) from the Outlook client to the load balancer by running the following from a command prompt:

ping -f -l 1472 10.0.0.1

Here, the -f flag creates the packet with the "Do not fragment" flag set that I mentioned earlier and the -l flag (that’s a lower-case L) allows me to define the packet size.

 What I got back was our trusty ICMP message:

"Reply from 10.0.0.1: Packet needs to be fragmented but DF set.".

Instantly, this backed up my theory that there was a device between the branch office and Head Office with an MTU lower than 1500. So next, I began reducing the packet size by 10 bytes at a time until my ping succeeded. Eventually I found that the maximum payload I could send was 1410 bytes. Therefore, there is a device on the path with a maximum MTU of 1438 (1410+28).
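Rather than stepping down by hand, a quick loop from cmd can sweep the sizes for you (a sketch; step down in tens, then narrow with smaller steps once you bracket the failure point):

for /L %s in (1472,-10,1400) do @ping -f -l %s -n 1 10.0.0.1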

So I was certain that there was definitely a device with a lower MTU somewhere on our journey, but I still didn't know whether this was what was causing our issue. After all, if a device does have a lower MTU, it's supposed to send an ICMP message back to the source, and the source should break the packet into smaller chunks and re-transmit, right?

To prove I wasn't being led on a wild goose chase, I needed to manipulate the packet size being sent when Outlook 2010 tried to connect, so that it was within the restricted MTU of 1438 rather than 1500 in the first place. If my theory was correct, Outlook would then connect without any issues. After some research I found that in Windows, I could run the following commands in an elevated command prompt to force the MTU of the NIC in the computer:

netsh interface ipv4 show interfaces

This gave me the following output:

Idx     Met         MTU          State                Name
---  ----------  ----------  ------------  ---------------------------
  1          50  4294967295  connected     Loopback Pseudo-Interface 1
 12          20        1500  connected     Local Area Connection

As you can see, the MTU for our NIC is the default 1500. I ran the following to force this to 1438 until next reboot:

netsh interface ipv4 set subinterface "12" mtu=1438 store=active

I then tried Outlook again and it connected perfectly. Point proven, but how do we fix it? Everything I'd been reading told me that packets should be fragmented and re-transmitted in the event that they hit a device with a lower MTU than the packet size. So why isn't a fundamental part of the TCP/IP protocol suite working as described?

Well, I looked further into what should happen, and found that the process is called "Path MTU Discovery". Essentially, the sender uses ICMP messages from hops along the packet's journey to discover the lowest MTU on the route, and therefore what size packets should be sent for successful transmission. So something wasn't able to successfully complete PMTUD. Looking at our network diagram, it is likely to be one of these:

KEMP Load Balancer
CheckPoint Firewall at Head Office
Managed Office Firewall/Router
CheckPoint Firewall at Branch Office

As I said earlier, our firewalls are managed by a third party, so whilst I investigated the load balancer and got in touch with the Managed Office regarding their equipment, I reported my findings to the third party so that they could investigate the CheckPoint Firewalls.

Shortly after, I received a response from our firewall support. They confirmed that the firewalls at both Head Office and the Branch Office had an MTU of 1500. I had confirmed that the load balancer was also 1500, so this meant that a device within the Managed Office setup was functioning with a lower MTU. A potential fix would be to get the Managed Office to find this device and change it. Chances are, though, there is a reason why it is set with a lower MTU, so this may not be an option!

The firewall guys also confirmed that both firewalls support Path MTU Discovery, that it was enabled, and that the necessary packet telling the source (in this case the load balancer) to re-transmit with a lower MTU should be being sent. Basically, this should work out of the box.

As part of the Path MTU Discovery process, a device works out the size of the actual useful data in a packet by taking the MTU and subtracting around 40 bytes for headers. So if the transmitting MTU is 1500, its first guess for the useful data size will be 1460 (1500-40). The device then sends some data at this size. If the packet is too large, it expects to receive an ICMP message from the device that can't handle it, and it will send again at a smaller size. It keeps doing this until it doesn't receive an ICMP packet, and then the actual data is transmitted at the correct MTU.

Very strange issues can occur if these ICMP packets don't arrive successfully. As you can imagine, if the device performing PMTUD doesn't receive an ICMP packet to say that packet is too large, it will tell devices to transmit at an MTU that is too large! The device with the lower MTU will then drop the packet, and the connection will disappear into a black hole and time-out.

So how do you fix this, you ask? How can we ensure our data will be sent and received, even if the intermediary ICMP packets don't arrive? Well, CheckPoint firewalls (and I'm sure those of other vendors) are capable of something called MSS Clamping, which appears to have been available since R77.20. This is, in essence, a "fudging" of the MSS value: telling devices on the journey that the Maximum Segment Size is artificially lower than it actually is. Doing this transmits data at a lower effective MTU, and therefore the packets are more likely to reach their destination. MSS clamping will notify the client and server to use smaller packets where they don't seem to understand the PMTUD packets.

On each CheckPoint Gateway, you want to add the following lines to some config files:

Add this line to $PPKDIR/boot/modules/simkern.conf file and reboot:
sim_clamp_vpn_mss=1 

Add this line to $FWDIR/boot/modules/fwkern.conf file and reboot:
fw_clamp_vpn_mss=1
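
On a Gaia box in expert mode, appending those lines amounts to something like this (a sketch; back up the files first, and create them if they don't already exist):

echo 'sim_clamp_vpn_mss=1' >> $PPKDIR/boot/modules/simkern.conf
echo 'fw_clamp_vpn_mss=1' >> $FWDIR/boot/modules/fwkern.conf
reboot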

Upon enabling the above parameters, Outlook sprang into life for all workstations at the Branch Office. So the root cause seems to be Path MTU Discovery not working correctly, probably due to multi-vendor devices talking to one another across the VPN tunnel between the Branch Office and Head Office. And the resolution is to configure MSS clamping on a device or devices on the journey, which ensures that if PMTUD fails to set the correct MTU, this is controlled artificially so that packets still reach their destination.