Mathieu Chateau's blog in plain english: All you need to know on NLB

28 July 2009

All you need to know on NLB

NLB (Network Load Balancing) from Microsoft has the advantage of shipping with the OS. As its name states, it spreads the load across several nodes that are members of the farm (cluster). It's quick and easy to set up, or at least it looks that way, but there are many things to check if you want it to do more than just appear to work...

Network impact

NLB can work in two modes:

Unicast
Multicast (with or without IGMP)

Which one to pick? It depends! Things that drive the choice:

Which application will be used through the farm? Does it support both modes? For example, ISA 2006 only supported unicast until Service Pack 1 (a hotfix was available, but not widely known).
How many network cards do the nodes have? Unicast requires at least 2 interfaces to follow best practice.
Do the nodes need to communicate with each other?
Is multicast filtering enabled on the switches? It prevents flooding the network.
Some switches (Cisco, for example) do not tolerate seeing the same MAC address on the network from every node. You then have to make your switch behave like a hub, sending all packets to all farm members.
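
To make the "same MAC address" point concrete, here is a small sketch of how the cluster MAC address is derived from the cluster IP in each mode, as I understand Microsoft's documentation (02-BF plus the four IP octets in unicast, 03-BF plus the four IP octets in multicast, 01-00-5E-7F plus the last two IP octets with IGMP). The function name and the mode labels are mine, and the cluster IP is just an example; always compare with what NLB Manager shows for your own farm.

```python
import ipaddress


def nlb_cluster_mac(cluster_ip: str, mode: str) -> str:
    """Return the default NLB cluster MAC for a cluster IPv4 address.

    mode is one of "unicast", "multicast" or "igmp" (labels chosen for
    this sketch, not NLB setting names).
    """
    octets = ipaddress.IPv4Address(cluster_ip).packed  # 4 raw bytes
    if mode == "unicast":
        raw = bytes([0x02, 0xBF]) + octets
    elif mode == "multicast":
        raw = bytes([0x03, 0xBF]) + octets
    elif mode == "igmp":
        # IGMP multicast only keeps the last two octets of the cluster IP
        raw = bytes([0x01, 0x00, 0x5E, 0x7F]) + octets[2:]
    else:
        raise ValueError(f"unknown NLB mode: {mode!r}")
    return "-".join(f"{b:02X}" for b in raw)


if __name__ == "__main__":
    for mode in ("unicast", "multicast", "igmp"):
        print(mode, nlb_cluster_mac("10.0.0.50", mode))
```

In unicast mode every node answers with that one 02-BF address, which is exactly what some switches dislike. In multicast mode each node keeps its own MAC and the 03-BF (or 01-00-5E) address is shared on top of it.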

Monitoring & availability

It is true that if one node drops off the network, the others will take over its load. But that only covers a full failure. If you have 2 nodes and just stop your business application on one of them, NLB will still send clients to it, and you have just lost half of your customers. NLB works at layer 3 (IP) and is not aware of anything above that layer, not even whether the TCP port is still being listened on. That's the pitfall of NLB. Microsoft included Sentinel in the resource kit: it could test a web page on each node and push the node out of the farm if the page was not working. ISA 2006 manages NLB directly and can push a node out if ISA goes mad. For everything else, it's your duty to fill the gap. If it's a web site running on IIS, you can change a key in the metabase, LoadBalancerCapabilities, to replace the 503 with a TCP reset, so the client reconnects and sends its request again, to another node.
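
The simplest way to see this gap from the outside is to test the TCP port on each node yourself. Below is a minimal sketch; the function name, the addresses and the port are examples of mine, not something NLB provides. The important assumption is that you probe each node's own (dedicated) IP address and not the cluster IP, because a request to the cluster IP only ever reaches whichever node NLB picks for you.

```python
import socket


def port_is_listening(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable host and timeout
        return False


if __name__ == "__main__":
    # Hypothetical dedicated IPs of the farm members
    for node in ("10.0.0.11", "10.0.0.12"):
        state = "listening" if port_is_listening(node, 80) else "NOT listening"
        print(node, state)
```

Run it from a machine outside the farm: with unicast and a single network card, the nodes cannot reach each other over the NLB adapter.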

To fill this gap, you can use your monitoring solution or a script looping over each node. The goal is to test each node from the application's point of view and push it out of the farm in case of error. Appliance load balancers (Alteon...) do the same, the industrial way. What you must watch out for:

You must check the nodes as often as possible, but without overloading them. The best approach is to build this monitoring need into the application, with a special web page that tests the application's components for us (database access...) and reports the result through a return code.
Your monitoring becomes "active" (it acts directly on production on its own).
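
The two points above can be sketched as follows. This is only an illustration under my own assumptions, not a finished tool: health_status is what the special page could compute, http_probe fetches it from one node (the /health path is hypothetical), and FarmWatcher only pushes a node out after several failed checks in a row, so a single slow answer does not empty the farm. The take_out and put_back actions are left as callbacks on purpose; on a real farm they could call the NLB control tool (wlbs/nlb drainstop and start, for example) or whatever your monitoring solution offers.

```python
import http.client
import urllib.error
import urllib.request
from typing import Callable, Dict, Iterable, List, Tuple


def health_status(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, List[str]]:
    """Run every component check; return (200, []) or (503, failed names)."""
    failed = []
    for name, check in checks.items():
        try:
            ok = bool(check())
        except Exception:
            ok = False  # a check that crashes counts as a failure
        if not ok:
            failed.append(name)
    return (503 if failed else 200), failed


def http_probe(node_ip: str, path: str = "/health", timeout: float = 2.0) -> int:
    """Fetch the health page on one node; return the HTTP code, 0 if unreachable."""
    try:
        with urllib.request.urlopen(f"http://{node_ip}{path}", timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code
    except (OSError, http.client.HTTPException):
        return 0


class FarmWatcher:
    """Push a node out after max_failures bad checks in a row, put it back when healthy."""

    def __init__(self, nodes: Iterable[str], probe: Callable[[str], int],
                 take_out: Callable[[str], None], put_back: Callable[[str], None],
                 max_failures: int = 3) -> None:
        self.probe = probe
        self.take_out = take_out
        self.put_back = put_back
        self.max_failures = max_failures
        self.failures = {node: 0 for node in nodes}
        self.out = set()

    def run_once(self) -> None:
        for node in self.failures:
            if self.probe(node) == 200:
                self.failures[node] = 0
                if node in self.out:
                    self.out.discard(node)
                    self.put_back(node)
            else:
                self.failures[node] += 1
                if self.failures[node] >= self.max_failures and node not in self.out:
                    self.out.add(node)
                    self.take_out(node)


if __name__ == "__main__":
    watcher = FarmWatcher(["10.0.0.11", "10.0.0.12"], http_probe,
                          take_out=lambda n: print("would push out", n),
                          put_back=lambda n: print("would put back", n))
    watcher.run_once()  # call this from a scheduler or a loop with a sleep
```

The interval between two run_once calls and the max_failures threshold are the two knobs for "as often as possible, but without overloading".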

Microsoft's monitoring solution, SCOM, is interesting here since you can act on triggers (event log, files...).

NLB versus MSCS?

An MSCS cluster is meant to be active/passive. At any time, resources are owned by only one node, which must be able to handle the full load. The good thing is that it can manage data shared across nodes, and it monitors resources (state of Windows services...). There again, it doesn't cover every case, especially when the application is up but no longer answering requests (database access lost...).

Other solutions?

I have already set up Safekit from Evidian on Windows. Not bad, but application checks are still up to you (how could it be otherwise?)
Load balancer appliances (F5, Alteon...). As great as they are expensive...
Stay with only one node?

KB/Articles:

IIS Responses to Load-Balanced Application Pool Behaviors

NLB Operations Affect All Network Adapters on the Server

Unicast NLB nodes cannot communicate over an NLB-enabled network adaptor in Windows Server 2003

The “NLB troubleshooting overview for Windows Server 2003” article is available

How to deploy a Secure Socket Tunneling Protocol (SSTP)-based VPN server that uses Network Load Balancing (NLB) in Windows Server 2008

An update enables multicast operations for ISA Server integrated NLB

Windows Server 2008 Hyper-V virtual machines generate a Stop error when NLB is configured or when the NLB cluster does not converge as expected

Terminal Services Client Cannot Connect to NLB Cluster TCP/IP Address

The NLB WMI Provider Generates a Lot of Error Entries in the Wbemcore.log File

How NLB Hosts Converge When Connected to a Layer 2 Switch

Windows Server 2003-based NLB nodes in an NLB cluster cannot communicate with each other over an NLB network adapter