Debugging RunUO Hangs

Discussion in 'RunUO Development' started by Chris, Jul 7, 2013.

  1. Chris

    Chris Renaissance Staff
    Renaissance Staff

    Joined:
    May 14, 2012
    Messages:
    3,385
    Likes Received:
    6,195
    Original Thread by Derrick, Archived from JoinUO Forums
    http://www.joinuo.com/forums/viewtopic.php?f=27&t=681
    ------

    Crashes happen, and are generally well handled on RunUO, with exception logging the cause of a crash can be determined rather quickly (aside from stack overflows). Hangs on the other hand can be much more difficult to diagnose, and are most often caused by infinite loops. I'm going to try to document what I've learned from using WinDbg to help determine to cause of a a server "hang". WinDbg can also be used to find the source of a stack overflow which i will try to more briefly address.

    This is a work in progress and I'm going to have to publish it in multiple parts. I'm not an author nor do I aspire to be one, so please go easy on the potentially confusing format of this; I'll try to work it over a bit once it's done.

    "WinDbg is a multipurpose debugger for Microsoft Windows, distributed on the web by Microsoft." [wiki]

    You won't likely know when a hang is going to occur, which is the problem with this type of thing to begin with. The first part of this article includes steps to prepare for the debug session before it's ever expected to occur.

    Part 1: Prerequisites for Debugging
    • WinDbg, available from Microsoft as Debugging Tools from the Windows SDK
    • The "Symbols" for your server core (Usually RunUO.exe), and Scripts.CS.dll; These would be named RunUO.pdb and Scripts.CS.pdb (This is the part that needs to be done ahead of time)
    • The running process in the hung state, or a crash dump file as described above which will be generated when the server is "hung"
    • Access to the Server desktop, or if working with a dump file another PC with the exact same .Net framework and processor architecture can be used.

    Generating Symbols:
    These symbols should already be present at the time of the hang; it is not reliable to use symbols generated after the fact, although it's possible that they could work.
    For RunUO.exe:

    • The RunUO distro does not include symbols for RunUO.exe. In order to debug the core you will need to generate these symbols. This is not a primer on compiling the core; but to generate symbols for a non-debug assembly, use the compiler option "/debug:pdbonly". If you are compiling in Visual Studio, this can be set in the project properties page: Build, Advanced, Debug info:
    [​IMG]

    For Scripts:

    • When you start the RunUO server with the command line option /debug, Scripts are compiled as a debug build, and will include Scripts.CS.pdb in the /Scripts/Output folder, alongside Scripts.CS.dll. But it's not recommended to use a Debug compile of scripts in a live shard environment, so to get RunUO to give you the Symbols file for a non debug run, you will need to modify ScriptCompiler.cs in the RunUO core.
      In Server\ScriptCompiler.cs, around Line 99, find:
      Code:
      			return (sb == null ? null : sb.ToString());
      
      replace with:
      Code:
                  if (!Core.Debug)
                      AppendDefine(ref sb, "/debug:pdbonly");   // always add a pdb
      
      			return (sb == null ? null : sb.ToString());
      

    Creating a Dump File:
    Once the server is Hung my immediate temptation is to restart it immediately. Before you do that, create a dump file.
    Dump files can be generated at any time for a running application from the Task Manager:
    [​IMG]
    (in these examples we are working with a RunUO core executable called Server.T2A.exe, however the standard name for the RunUO server is RunUO.exe)
    WinDbg can both attach to a running process, or analyze a dump file, but to reduce server downtime, the creation of a dump file is probably preferred.

    Once you have created the dump file, you'll want to Move or copy this file, along with your RunUO.exe/pdb and Scripts.dll/pdb to a safe location. The Dump will be otherwise deleted by windows eventually. If running in a Windows Server enviornment, this DMP is usually in it's own folder.
    [​IMG]
    Copy The DMP file, RunUO.exe, RunUO.pdb, and all .dll's and pdbs from Scripts\Output\ usually this will just be Scripts.CS.dll/pdb to another folder.
    Optionally, you may wish to include other files in here, such as any other non Framework user defined assemblies that are used by your server. These are listed in Data\Assemblies.cfg

    At this point we should have the minimal set of files needed for a successful Debug session:
    [​IMG][/quote]
    Last edited: Aug 18, 2014
  2. Chris

    Chris Renaissance Staff
    Renaissance Staff

    Joined:
    May 14, 2012
    Messages:
    3,385
    Likes Received:
    6,195
    Part 2: A test case
    To set up a simple test scenario, I added the Following to EventSink_Speech in Scripts\Misc\Keywords.cs:
    Code:
            public static void EventSink_Speech(SpeechEventArgs args)
            {
                if (args.Speech == "hang") while (true) { }
    ...
    
    This will cause a server hang whenever anyone says "hang" in game.
    Once executed, it will be immediately apparent after taking two steps in game, and checking TaskManager, that things are not good.
    [​IMG]
    In the above example from a Quad Core workstation, 25% of the total CPU resource is being consumed (one core); this is the main timer thread in a loop "while (true) { }" On a Dual Core, this would be 50% on this screen, on a 8 core, it'd be 18%, etc...

    At this point, a Crash Dump should be generated and along with the Core and Scripts assemblies and symbols copied to a new location. After which you can restart the shard (if in production and desired). The .DMP file will include the memory image of the process, which can be quite huge; these files do however compress well.

    Fire Up WinDbg, and Open Crash Dump...
    [​IMG]
    Open your copied Dump File.
    [​IMG]
    Issue the following three commands:

    • !sym noisy
      .symfix
      .reload

    The "noisy" option allows you to see all the symbol load messages, which may be helpful in diagnosing missing symbols. .symfix configures WinDbg do download needed symbols for .Net framework assemblies from the Microsoft symbol server, and .reload will download and load those symbols.

    Note that WinDbg will become busy, and it will not respond to commands in this state, if you don't notice the *BUSY* indicator things can be confusing so I've noted that indicator below with an arrow
    [​IMG]

    WinDbg is designed to debug "native" code, .Net applications are "managed code". The "SOS" extension allows us to debug managed code in WinDbg
    Enter the follwing command to load this extension:
    • .loadby sos clr
    Now run "!threads", this will likely initiate some downloading of symbols, but if successful you'll see a list of the active threads when the Dump File was captured.
    If running the RunUO server from the commandline, the main runuo timer loop is usually running in thread 0, however if running as a service, it may be running in thread 1 or higher. Also, the number of active threads will vary, in the example above there are 6 threads running, You will likely have at least two or three.

    To find out which thread is the timer loop, we need to look at the stacks. You can use the following command to dump the stacks from all threads:
    • ~* e !clrstack
    The following shows the first four threads from this dump:
    If you aren't familiar with looking at stacks this may look a little baffling. The key to this is that in a stack, as the name might imply, the newest stuff is on the top. As we look at thread zero, the first line of the stack is the function that was executing at the time of the Dump, the next line is the function that invoked that function and so on, all the way back to the initial invocation of the processes in the core, Server.Core.Main(System.String[]) .

    We can see from this dump that the thread that is running our scripts code is thread 0, and that the actual function that thread is in, is Server.Misc.Keywords.EventSink_Speech(Server.SpeechEventArgs) [d:\RunUO\SecondAge\Scripts\Misc\Keywords.cs @ 21]
    which is this line in the source code:
    Code:
                if (args.Speech == "hang") while (true) { }
    
    This is all we need to know in order to find this problem. usually the originating bug will be quite a bit more obscure, however as you can see this can be invaluable for hunting down code bugs, especially ones which don't occur very often and which only occur on a live server.
  3. Chris

    Chris Renaissance Staff
    Renaissance Staff

    Joined:
    May 14, 2012
    Messages:
    3,385
    Likes Received:
    6,195
    Part 3: A real case
    A crash hang on the production server itself is best diagnosed on the server itself, because all the necessary files are sure to be there. Officially WinDbg requires that the Processor architecture be the same on the debugger as on the machine that generated the Dump File. The installed framework must also be the same, or you'll get an error message similar to this:
    WinDbg does not have much of a system resource footprint, as such there shouldn't be any issue with loading up a large dump file on the server, even while RunUO is running. In this example, a 7GB dump file is being analyzed.
    This information isn't very helpful, but I'm just running it to see that it looks rational, and that we do actually have the image loaded.

    And then the CLR Stack; we've got 27 threads runing in this example, this Dump was captured at peak hours, so the stacks below are snipped after the first 7
    Above "Server.Items.TillerMan.OnSingleClick" jumped right out at me, and we can see that the thread of interest is this one:
    In this real example the clr stack has narrowed down the issue for me to something beginning with single clicking the tillerman. This result is not as clear as the contrived example in Part 2, but it significantly narrows down where to begin looking. Prior to obtaining this information there were absolutely no clues as to what was causing the shard to hang every couple weeks.

    In the end, the problem was found to be a bug in a string parser which was in a custom core modification, triggered by specific and uncommon user input.
    Is worth noting that this was not the top line of the stack. It may be more common than not that the problem lies down the stack a ways. Starting at a place deeper in the stack that was known to be solid (Server.Items.TillerMan.OnSingleClick(Server.Mobile)), and working up the stack until some dubious code is found.

    In this exercise we now have a pretty good idea of where the problem might lie, but just as an example, there is a way to get even more detailed information on this stack.

    If we want to see some really excessive details on the call stack, we can switch to the thread thread of interest and run !dumpstack. Our prompt in WinDbg tells us which thread our commands are going to operate on, currently 0. I don't know the console command to switch threads so I'll bring up the threads window:
    [​IMG]

    Double clicking on 004:ce4 will switch to this thread, and your prompt will then read 0:004>
    the command !dumpstack will then give a complete stack trace for that thread, including both user and framework code:
    It's a little excessive, the !clrstack is usually sufficient.
  4. Chris

    Chris Renaissance Staff
    Renaissance Staff

    Joined:
    May 14, 2012
    Messages:
    3,385
    Likes Received:
    6,195
    Part 4: A Stack Overflow
    With most shard crashes a log file is written to the working directory for the RunUO server, and RunUO can even be configured to Email you these logs. This is not however the case with a stack overflow. Stack overflows are usually caused by recursion. We can simulate one with the following code:
    Code:
            public static void EventSink_Speech(SpeechEventArgs args)
            {
                EventSink_Speech(args);
    ...
    
    This will cause a stack overflow exception as soon as a players says anything in game. Stack overflow exceptions cannot be gracefully handled and therefore there is little trace of what happened.
    Fortunately, Windows will (or can) create a dump file for you when this occurs using "Windows Error Reporting". For information on enabling WER if it has been disabled or for other options, see this page on MSDN: http://msdn.microsoft.com/en-us/library/bb513638(v=vs.85).aspx
    After the shard crashes with a stack overflow error you should have an entry indicating the fault in the Windows Application Log. These examples are from a Win7 workstation, but Windows Server is similar.[​IMG]
    The next entry should be a WER entry, and will give you the location of the "report" that was saved:
    [​IMG]
    Browse to this folder and copy out the files:
    [​IMG]
    Load the .hdmp into WinDbg.

    • .loadby sos clr
    You will get a warning about this being a "minidump" but it still contains the information we are looking for.

    It's likely in this situation that you will already be on the correct thread, as this was certainly the last thread executing. Try:
    • !clrstack
    This will likely yield a very tall stack and may take a minute of so to print out, you will see that the stack is full of calls to
    taking us directly to the problem.

    If you weren't on the right thread, you can use
    • ~* e !clrstack
    to show the stacks from all the threads,

Share This Page