Friday, November 20, 2009

The most common design flaw that affects server uptime and creates a troubleshooting bottleneck in IT enterprises

The most important thing for an IT enterprise is to ensure that its servers remain up and running, and one of the most common causes of server downtime is the blue screen, more commonly known as the Blue Screen of Death. I have seen hundreds of big enterprises with a poor design behind huge mission-critical servers running applications that need to be highly available. Sometimes these servers are highly available cluster nodes and sometimes they are single standalone servers, but the famous Blue Screen of Death can strike any of them at any time. So what happens if a server starts blue screening? What should we do? It is commonly said that 90% of blue screens occur due to buggy drivers and the other 10% due to hardware issues. So as a first step you can check what the stop code is and what that stop code refers to. You can update the drivers and BIOS, but you need to be careful of compatibility issues. The next step is to grab a memory dump, so that after analyzing it you can prepare a conclusive action plan to resolve the issue. That’s where the design flaw appears.



Most IT enterprises don’t plan for this scenario, which leads to disrupted server uptime, and sometimes the after-effects of this bad design create a troubleshooting bottleneck. Microsoft operating systems offer the IT administrator three kinds of memory dump: small, kernel, and full.


A Small Memory Dump is much smaller than the other two kinds of kernel-mode crash dump files. It is exactly 64 KB in size on a 32-bit machine and 128 KB on a 64-bit machine, and requires only 64 KB/128 KB of pagefile space on the boot drive. This dump file includes the following:


1. The bug check message and parameters, as well as other blue-screen data.
2. The processor context (PRCB) for the processor that crashed.
3. The process information and kernel context (EPROCESS) for the process that crashed.
4. The thread information and kernel context (ETHREAD) for the thread that crashed.
5. The kernel-mode call stack for the thread that crashed. If this is longer than 16 KB, only the topmost 16 KB will be included.
6. A list of loaded drivers.
7. A list of loaded modules and unloaded modules.
8. The debugger data block. This contains basic debugging information about the system.

This kind of dump file can be useful when space is greatly limited. However, due to the limited amount of information included, errors that were not directly caused by the thread executing at the time of the crash may not be discovered by an analysis of this file. If a second bug check occurs and a second Small Memory Dump file is created, the previous file will be preserved. Each additional file will be given a distinct name, which contains the date of the crash encoded in the filename. For example, mini022900-01.dmp is the first memory dump file generated on February 29, 2000. A list of all Small Memory Dump files is kept in the directory %SystemRoot%\Minidump.
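To make the naming scheme concrete, here is a minimal Python sketch that decodes a filename like the one above. The helper name `parse_minidump_name` is hypothetical, and the pattern (`mini`, then MMDDYY, then a two-digit sequence number) is an assumption taken only from the mini022900-01.dmp example; other Windows versions may name their minidumps differently.

```python
import re
from datetime import date, datetime

# Assumed layout, based on the example "mini022900-01.dmp":
# "mini" + MMDDYY + "-" + two-digit sequence number for that day + ".dmp"
_MINIDUMP_NAME = re.compile(r"^mini(\d{6})-(\d{2})\.dmp$", re.IGNORECASE)


def parse_minidump_name(filename):
    """Return (crash_date, sequence) for a minidump filename, or None if it does not match."""
    match = _MINIDUMP_NAME.match(filename)
    if match is None:
        return None
    try:
        # %y maps two-digit years 00-68 to 2000-2068 and 69-99 to 1969-1999.
        crash_date = datetime.strptime(match.group(1), "%m%d%y").date()
    except ValueError:
        # Six digits that do not form a valid calendar date.
        return None
    return crash_date, int(match.group(2))


print(parse_minidump_name("mini022900-01.dmp"))
# (datetime.date(2000, 2, 29), 1)
```

Something like this can help sort the contents of %SystemRoot%\Minidump by crash date when a server has blue screened several times.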

Unfortunately, stack traces reported by WinDbg from these dumps, especially ones involving 3rd-party components, are usually incomplete and sometimes not even correct. They can also point to stable drivers when the system failure happened after slowly accumulated corruption caused by some intermediate driver or a combination of drivers. In other words, small memory dumps are helpful but not always reliable enough to reach a conclusion.

Kernel dumps almost always capture the relevant information required in case of a blue screen. They do not contain user-mode data, but that is not required most of the time. Because kernel-mode memory on a 32-bit machine can be at most 2 GB + 1 MB, it is easy to get a kernel dump on 32-bit machines. The real problem lies with 64-bit operating systems. Customers generally don’t create a big enough page file on the C drive, or do not have enough free space on the boot drive, i.e. C. The same thing happens with full memory dumps, which are required when your server hard hangs or freezes.

Here is a public Microsoft article which explains that even if you point the dump to the D or E drive, you still need free space on the boot volume (C), or at least a page file on C of at least RAM + 1 MB for a kernel dump and 1.5 × RAM for a full dump. (The article is not very clear at the beginning and can confuse readers, but it does help.) Although you can change the path for the location of the dump file using Control Panel, Windows always writes the debugging information to the pagefile on the %SYSTEMROOT% partition first, and then moves the dump file to the path specified. A kernel dump is not always as big as your RAM; however, you can't predict its exact size, because it depends on the amount of kernel-mode memory in use by the operating system and drivers, and this becomes more complex in a 64-bit environment.

Please review the following articles to plan your C boot drive and page file size on servers in your Enterprise.

886429 What to consider when you configure a new location for memory dump files in Windows Server 2003
http://support.microsoft.com/default.aspx?scid=kb;EN-US;886429


141468 Additional Pagefile Created Setting Up Memory Dump
http://support.microsoft.com/default.aspx?scid=kb;EN-US;141468


Another article on how to determine the appropriate page file size for 64-bit versions of Windows Server 2003 or Windows XP
http://support.microsoft.com/kb/889654

For business-critical 64-bit servers where business processes require the server to capture physical memory dumps for analysis, the traditional model is that the page file should be at least the size of physical RAM plus 1 MB, or 1.5 times the default physical RAM. This makes sure that the free disk space of the operating system partition is large enough to hold the OS, hotfixes, installed applications, installed services, a dump file, and the page file. On a server that has 32 GB of memory, drive C may have to be at least 86 GB to 90 GB. This is 32 GB for the memory dump, 48 GB for the page file (1.5 times the physical memory), 4 GB for the operating system, and 2 to 4 GB for the applications, the installed services, the temp files, and so on. Remember that a driver or kernel-mode service leak could consume all free physical RAM. Therefore, a Windows Server 2003 x64 SP1-based server in 64-bit mode with 32 GB of RAM could have a 32 GB kernel memory dump file, where you would expect only a 1 to 2 GB dump file in 32-bit mode. This behavior occurs because of the greatly increased memory pools.
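The sizing arithmetic above can be sketched in a few lines of Python. The function name `boot_volume_budget_gb` and its parameters are hypothetical; the defaults (1.5 × RAM page file, 4 GB for the OS, 2 to 4 GB for applications) are simply the figures quoted in the paragraph, not a general recommendation.

```python
def boot_volume_budget_gb(ram_gb, os_gb=4, apps_gb=(2, 4), pagefile_factor=1.5):
    """Return the (low, high) estimate in GB for drive C, using the figures quoted above."""
    dump_gb = ram_gb                         # worst case: the dump is as large as RAM
    pagefile_gb = ram_gb * pagefile_factor   # page file sized at 1.5 x physical memory
    fixed_gb = dump_gb + pagefile_gb + os_gb
    # Applications, services and temp files are given as a range, so return a range.
    return fixed_gb + apps_gb[0], fixed_gb + apps_gb[1]


low, high = boot_volume_budget_gb(32)
print(low, high)  # 86.0 88.0
```

For 32 GB of RAM the components add up to 86 to 88 GB; the 90 GB upper figure quoted above leaves a little extra headroom on top of that sum. Rerunning the same calculation for a server with far more RAM shows quickly why the boot volume has to be planned before the server goes into production.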

130536 Windows does not save memory dump file after a crash
http://support.microsoft.com/default.aspx?scid=kb;EN-US;130536

So if you are already stuck with this issue, or if your IT enterprise has servers configured in a way that dumps cannot be captured, you can either increase the free space on the boot volume C:\ (something not supported by Microsoft) to follow the articles mentioned above, or reduce the RAM by using the /maxmem switch in boot.ini (reducing RAM won’t always be a feasible option in production environments). Another option is to try a live debug by engaging Microsoft Customer Support; however, the customer needs to set the machine up for live debugging.

None of what I said above applies to Windows Server 2008, which has a new dedicated dump file feature: http://support.microsoft.com/kb/957517

In Windows Vista and Windows Server 2008, the paging file does not have to be on the same partition as the partition on which the operating system is installed. To put a paging file on another partition, you must create a new registry entry named DedicatedDumpFile. You can also define the size of the paging file by using a new registry entry that is named DumpFileSize. By using the DedicatedDumpFile registry entry in Windows Server 2008 and in Windows Vista, a user can configure a registry setting to store a dump file in a location that is not on the startup volume.

Location: HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\CrashControl
Name: DedicatedDumpFile
Type: REG_SZ
Value: A dedicated dump file together with a full path

Location: HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\CrashControl
Name: DumpFileSize
Type: REG_DWORD
Value: The dump file size in megabytes.
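As an illustration of the two registry values described above, the following Python sketch builds the text of a .reg file that sets them. The function name `dedicated_dump_reg`, the example path D:\DedicatedDumpFile.sys and the 32768 MB size are all hypothetical choices for this example; pick a path and size that suit your own server, and test on a non-production machine first.

```python
def dedicated_dump_reg(dump_path, size_mb):
    """Build .reg file text that sets DedicatedDumpFile (REG_SZ) and DumpFileSize (REG_DWORD)."""
    if not 0 <= size_mb <= 0xFFFFFFFF:
        raise ValueError("DumpFileSize must fit in a 32-bit DWORD")
    # Backslashes inside string values are doubled in .reg files.
    escaped_path = dump_path.replace("\\", "\\\\")
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        r"[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\CrashControl]",
        f'"DedicatedDumpFile"="{escaped_path}"',
        # DWORD values are written as eight hexadecimal digits; the size is in megabytes.
        f'"DumpFileSize"=dword:{size_mb:08x}',
        "",
    ]
    return "\r\n".join(lines)


# Hypothetical example: a 32 GB (32768 MB) dedicated dump file on drive D.
print(dedicated_dump_reg(r"D:\DedicatedDumpFile.sys", 32768))
```

The sketch only produces the text; saving it to a file and importing it is left to the administrator.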

Please review the following articles to plan dump captures on servers in your Enterprise.


How to generate a kernel or a complete memory dump file in Windows Server 2008
http://support.microsoft.com/kb/969028


Dedicated dump files are unexpectedly truncated to 4 GB on a computer that is running Windows Server 2008 or Windows Vista and that has more than 4 GB of physical memory
http://support.microsoft.com/kb/950858

I hope you find this a pleasant read and that it helps. Many thanks for your time, and stay tuned to the blog for more interesting upcoming topics.

GAURAV ANAND

1 comment:

  1. Further consider that many sysadmins who get to the point you discuss still do not realize the impact of writing such large kernel memory dumps on restart after a crash. A 32GB memory dump takes quite a long time to write, and frankly 32GB is not that much these days. Consider the potential size required for a kernel-only dump on a 64-bit server with, say, 256GB of RAM; now consider how long it will take to finish writing the memory dump while the server is restarting. In smaller shops without a clustered setup, this downtime cannot be afforded.
