Fix for BizTalk ESB Toolkit 2.0 Error 115004 in ALL.Exceptions Send Port

On occasion we have had messages suspend on the ALL.Exceptions send port with the error:

Error 115004: An unexpected error occurred while attempting to retrieve the System.Exception object from the ESB Fault Message.

The source of the error is the pipeline component Microsoft.Practices.ESB.ExceptionHandling.Pipelines.ESBFaultProcessor, part of the ESB Toolkit’s custom pipeline on the ALL.Exceptions port.

Some suspicions and hunting through code using .NET Reflector led to an explanation.

The ExceptionMgmt.CreateFaultMessage() method, which is used to create a fault message in an orchestration exception handler, automatically locates and stores the exception object that was previously thrown.  It stores the exception by binary-serializing it, Base64 encoding it and storing it in a property on the first message part of the fault message.  Later on, the ESBFaultProcessor pipeline component attempts to de-serialize the exception.

The trouble arises when the thrown exception contains a non-serializable inner exception more than one level deep.  The method ExceptionMgmt.IsExceptionSerializable() only checks the root exception and the first InnerException.  If a non-serializable exception happens to be nested further, the code does not detect it.  As a result, the ESBFaultProcessor fails while attempting to de-serialize it.
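A check that covers the whole chain could look roughly like the sketch below. It is not the Toolkit's code: the class and method names are made up for illustration, and it only performs a structural test (the [Serializable] marker plus the serialization constructor) on every level of the InnerException chain. An exception whose Data or custom fields hold non-serializable objects could still fail at run time.

```csharp
using System;
using System.Reflection;
using System.Runtime.Serialization;

// Hypothetical example type: not marked [Serializable] and without a
// serialization constructor, like the AbortException described in this post.
public class NotSerializableDemoException : Exception
{
    public NotSerializableDemoException(string message) : base(message) { }
}

public static class ExceptionChainChecker
{
    // Returns true only if every exception in the InnerException chain is
    // marked [Serializable] and declares the
    // (SerializationInfo, StreamingContext) constructor.
    public static bool IsChainSerializable(Exception ex)
    {
        for (Exception current = ex; current != null; current = current.InnerException)
        {
            Type type = current.GetType();

            if (!type.IsSerializable)
            {
                return false;
            }

            // Constructors are not inherited, so this only finds a
            // constructor declared on the exception type itself.
            ConstructorInfo ctor = type.GetConstructor(
                BindingFlags.Instance | BindingFlags.Public | BindingFlags.NonPublic,
                null,
                new Type[] { typeof(SerializationInfo), typeof(StreamingContext) },
                null);

            if (ctor == null)
            {
                return false;
            }
        }

        return true;
    }
}
```

With this approach a non-serializable exception three levels down is detected, because the loop does not stop after the first InnerException.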

In our case, we are pulling a flat file down from a web service and disassembling it inside of an orchestration using the XLANGPipelineManager class.  If there is a problem with the file format, an XLANGPipelineManagerException is thrown.  It contains an InnerException of XmlException, which in turn contains an InnerException of Microsoft.BizTalk.ParsingEngine.AbortException – which has no serialization constructor.

To solve this issue, I wrote a short helper method in C#.  I call it immediately after ExceptionMgmt.CreateFaultMessage() and pass it the newly created fault message and the caught exception’s Message property value.  It checks whether the stored exception can be de-serialized, and if not, replaces the stored exception with a special exception class.  This is the same thing that would have happened had the IsExceptionSerializable() method correctly detected the situation.

I submitted this bug to Microsoft Connect.

To use this code, you’ll need a C# class library with references to:

  • Microsoft.Practices.ESB.ExceptionHandling
  • Microsoft.Practices.ESB.ExceptionHandling.Schemas.Faults
  • Microsoft.XLANGs.BaseTypes

For convenience, I added a few using directives at the top, including two namespace aliases.

using System;
using Microsoft.XLANGs.BaseTypes;
using ExceptionHandling = Microsoft.Practices.ESB.ExceptionHandling;
using ExceptionHandlingSchemas = Microsoft.Practices.ESB.ExceptionHandling.Schemas.Property;

namespace BizTalkHelpers
{
    public static class OrchestrationHelper
    {
        /// <summary>
        /// Work around a bug in the BizTalk ESB Toolkit 2.0 related to
        /// non-serializable exceptions. When
        /// ExceptionMgmt.CreateFaultMessage() creates a message, it
        /// automatically locates and stores the caught exception. If the
        /// exception contains an InnerException more than one level deep
        /// that is not serializable, the ESBFaultProcessor pipeline
        /// component will later fail when it attempts to deserialize the
        /// exception, resulting in the error:
        /// Error 115004: An unexpected error occurred while attempting to
        /// retrieve the System.Exception object from the ESB Fault Message.
        /// </summary>
        /// <param name="msg">
        /// Message created by ExceptionMgmt.CreateFaultMessage()</param>
        /// <param name="exceptionMsg">
        /// Message property value of the caught exception</param>
        public static void FixNonSerializableExceptionInFaultMsg(
            XLANGMessage msg, string exceptionMsg)
        {
            // Incoming msg must have been created by
            // ExceptionMgmt.CreateFaultMessage()
            XLANGPart p = msg[0];

            if (p == null)
            {
              return;
            }

            // Extract the Base64-encoded string representation of the
            // exception serialized by CreateFaultMessage().
            string str =
              p.GetPartProperty(
                typeof(ExceptionHandlingSchemas.SystemException)) as string;

            if (str == null)
            {
              return;
            }

            try
            {
              ExceptionHandling.Formatter.DeserializeObject<Exception>(str);
            }
            catch (Exception)
            {
              // If an exception is not serializable, the correct behavior
              // is to store a serialized instance of
              // SetExceptionNonSerializableException.
              ExceptionHandling.SetExceptionNonSerializableException ex =
                new ExceptionHandling.SetExceptionNonSerializableException(
                  0x1c13e, new object[] { exceptionMsg });

              p.SetPartProperty(
                typeof(ExceptionHandlingSchemas.SystemException),
                ExceptionHandling.Formatter.SerializeObject<
                ExceptionHandling.SetExceptionNonSerializableException>(ex));
            }
        }
    }
}

An Optimization for the BizTalk ESB Toolkit 2.0 Portal Faults Page

While debugging the issues described in my previous post, I looked at how the ESB.Exceptions.Service’s GetFaults() method was implemented.  In our case, we had stack traces inside the Description text, so the size of the data returned for each fault was quite large.  Multiplied by thousands of faults, this is why we overflowed the default setting for maxItemsInObjectGraph.

However, this raised an important question: why was the value for Description (and many other fields) being returned from the service when the web pages never show it?

The answer?  The service’s GetFaults() method returns every column from the Fault table, including potentially large values like ExceptionStackTrace, ExceptionMessage and Description.  These fields are never used by the ESB Portal, so this behavior only serves to slow down the queries and cause issues like that described in my last post!

I modified the GetFaults() method’s LINQ query to select only the columns used in the portal:

select new
{
    f.Application,
    f.DateTime,
    f.ErrorType,
    f.FailureCategory,
    f.FaultCode,
    f.FaultGenerator,
    f.FaultID,
    f.FaultSeverity,
    f.InsertMessagesFlag,
    f.MachineName,
    f.Scope,
    f.ServiceName
};

I then created the actual Fault objects just before returning from the method:

List<Fault> faults = new List<Fault>();

foreach (var fault in result)
{
    Fault f = new Fault()
    {
        Application = fault.Application,
        DateTime = fault.DateTime,
        ErrorType = fault.ErrorType,
        FailureCategory = fault.FailureCategory,
        FaultCode = fault.FaultCode,
        FaultGenerator = fault.FaultGenerator,
        FaultID = fault.FaultID,
        FaultSeverity = fault.FaultSeverity,
        InsertMessagesFlag = fault.InsertMessagesFlag,
        MachineName = fault.MachineName,
        Scope = fault.Scope,
        ServiceName = fault.ServiceName
    };
    
    faults.Add(f);
}

return faults;

This avoids the expense of SQL Server selecting many large data values that are never used, and can greatly reduce the amount of data that must be serialized and de-serialized across the service boundary.

This change provided a very noticeable boost in performance on the Faults page when searching, filtering and moving between pages.

BizTalk ESB Toolkit 2.0 Portal Timeouts and (401) Unauthorized Errors

The Problem

During application testing in our recently-built test and newly-built production BizTalk 2009 environments, we started having problems with the ESB Portal throwing a System.TimeoutException or a (401) Unauthorized error.  This was happening with increasing frequency on the portal home page and the Faults page.  On the home page, the problem seemed to be localized to the Faults pane.

When we saw the (401) Unauthorized errors, they contained a detail message like this:

MessageSecurityException: The HTTP request is unauthorized with client authentication scheme ‘Negotiate’. The authentication header received from the server was ‘Negotiate,NTLM’.

De-selecting some of the BizTalk applications in My Settings seemed to decrease but not eliminate the problem.  We had already checked and re-checked virtual directory authentication and application pool settings, etc.  Needless to say, everyone was tired of being unable to reliably view faults through the portal.

Debugging

A couple of issues complicated the debugging process, both related to the portal pulling fault data from a web service – specifically the ESB.Exceptions.Service.

First, the ESB.Exceptions.Service uses the webHttp (in other words, REST) binding introduced in .NET 3.5.  REST is fine for certain applications, but it also lacks many features of SOAP.  The one that stands out in particular here is REST’s lack of a fault communication protocol.  SOAP has a well-defined structure and protocol for faults, so from the client side it’s easy to identify and obtain information about a service call failure.  With REST, you’ll probably end up with a 400 Bad Request error and be left to guess what happened.

In other words, one can’t really trust the error messages arising from calls to the ESB.Exceptions.Service.

Second, the ESB.Exceptions.Service does not have built-in exception logging.  [In another post I’ll have a simple solution for that.]  Combined with REST’s lack of a fault protocol, any exception that occurs inside the service is essentially lost and obscured.

One of our first debugging steps was to run SQL Profiler on the EsbExceptionDb and see which queries were taking so long.  To our great surprise, when we refreshed the Faults page in the portal we saw in Profiler the same query running over and over, dozens or hundreds of times!

Fortunately, I was able to obtain permissions to our test EsbExceptionDb, which had over 10,000 faults in it, and run the portal and WCF services on my development machine.  Sure enough, I kept hitting a breakpoint inside the ESB.Exceptions.Service GetFaults() method over and over until the client timed out.  However, there were no loops in the code to explain that behavior!

Next, I turned on full WCF diagnostics for the ESB.Exceptions.Service, including message logging, using the WCF Service Configuration Editor.  Using the Service Trace Viewer tool, I indeed saw the same service call happening again and again – but the trace also captured an error at the end of each call cycle.

The error was a failure serializing the service method’s response back to XML.  The service call was actually completing successfully (which I had also observed in the debugger).  Once WCF took control again to send the response back to the client, it failed.  Instead of just dying, it continuously re-executed the service method!  This could be a bug in WCF 3.5 SP1.

Problem Solved

The solution to the WCF re-execution problem was increasing the maxItemsInObjectGraph setting.  On the service side, I did this by opening ESB.Exceptions.Service’s web.config, locating the <serviceBehaviors> section, and adding the line <dataContractSerializer maxItemsInObjectGraph="2147483647" /> to the existing “ExceptionServiceBehavior” behavior.
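For reference, the service-side change would look something like the fragment below. Only the behavior name and the dataContractSerializer line come from the description above. The other elements already present in the behavior are not reproduced here, so treat the surrounding layout as an assumption.

```xml
<behaviors>
  <serviceBehaviors>
    <behavior name="ExceptionServiceBehavior">
      <!-- existing child elements of this behavior stay as they are -->
      <dataContractSerializer maxItemsInObjectGraph="2147483647" />
    </behavior>
  </serviceBehaviors>
</behaviors>
```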

With that simple configuration change, the service call now returned promptly and the portal displayed a matching error about being unable to de-serialize the data.  As with the service, I needed to increase the maxItemsInObjectGraph setting.  I opened the portal’s web.config, located the <endpointBehaviors> section, and added the line <dataContractSerializer maxItemsInObjectGraph="2147483647" /> to the existing “bamservice” behavior.  The error message didn’t change!  I eventually discovered that the <dataContractSerializer> element must be placed before the <webHttp /> element.
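On the portal side, the endpoint behavior ends up looking roughly like this. The fragment is a sketch showing only the two elements discussed here. Any other children of the “bamservice” behavior are omitted.

```xml
<behaviors>
  <endpointBehaviors>
    <behavior name="bamservice">
      <!-- must appear before <webHttp />, otherwise the setting had no effect -->
      <dataContractSerializer maxItemsInObjectGraph="2147483647" />
      <webHttp />
    </behavior>
  </endpointBehaviors>
</behaviors>
```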

The portal now displayed the home page and Faults page properly, and the timeout and unauthorized errors disappeared.

Race Condition in BizTalk ESB Toolkit 2.0 Exception Notification Service

We are using the ESB Exception Notification (aka ESB.AlertService) Windows service in conjunction with the ESB Portal website.  On occasion, we have a problem where the service indefinitely sends out duplicate emails for the same alert.  In the server’s Application Event Log, we see the error: “An exception of type ‘System.Data.StrongTypingException’ occurred and was caught.”  The log entry also includes “The value for column ‘To’ in table ‘AlertEmail’ is DBNull.”

We are allowing the service to pull user email addresses from Active Directory by configuring the LDAP Root under Alert Queue Options to LDAP://DC=company, DC=com.  With Active Directory you don’t need to specify a server name in your LDAP path.  Just point to the domain itself and Windows will figure out which domain controller to contact.

The vast majority of the rows in AlertEmail contain the correct email address in the To column, but every once in a while there is a NULL.  Looking at the service code (QueueGenerator.cs), we can see that the email address in CustomEmail is always used first, if one was provided when the alert subscription was created.  We do not set this value, so the code next attempts to pull the email address from Active Directory using the GetEmailAddress() method (ActiveDirectoryHelper.cs).

In order to reduce the number of AD queries, the code caches email addresses using the Enterprise Library caching block.  The cached entries expire after a configurable interval, which defaults to 1 minute.  If the username is already in the cache, then the corresponding email address is returned.  Otherwise, the code looks up the username in AD, grabs the email address and caches it.  The lookup code throws an exception if it doesn’t get back a valid email address, so it doesn’t explain how we got a NULL email address.

The problematic code is the cache lookup:

if (CacheMgr.Contains(name))
{
  Trace.WriteLine("Reusing email address for " + name + " from cache.");
  return (string)CacheMgr.GetData(name);
}

This is a classic race condition.  The code checks to see if the username is in the cache, then runs a Trace.WriteLine(), then asks for the cached data associated with the username.  In the time between the Contains() and the GetData() calls, the cached data can expire and drop out of the cache, in which case GetData() will return null.  Most of the time it gets lucky and the data is still cached.  This probably explains how we sometimes get NULL values in the database.

The fix is simple because GetData() returns null when the requested data is not in the cache:

string cachedEmail = (string)CacheMgr.GetData(name);

if (!string.IsNullOrEmpty(cachedEmail))
{
    Trace.WriteLine("Reusing email address for " + name + " from cache.");
    return cachedEmail;
}

The new version of the code eliminates the race condition and should prevent us from ever seeing NULL values in the database.
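The difference between the two patterns can be reproduced without the Enterprise Library. The sketch below uses a hypothetical FakeExpiringCache with a hand-advanced clock as a stand-in for the caching block. All class and method names, the sample address, and the "(looked up in AD)" placeholder are invented for the demonstration.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the caching block: an entry expires once the
// manually advanced clock reaches its deadline.
public class FakeExpiringCache
{
    private readonly Dictionary<string, object> values = new Dictionary<string, object>();
    private readonly Dictionary<string, int> deadlines = new Dictionary<string, int>();

    public int Now; // simulated clock, advanced by the demo

    public void Add(string key, object value, int timeToLive)
    {
        values[key] = value;
        deadlines[key] = Now + timeToLive;
    }

    public bool Contains(string key)
    {
        return GetData(key) != null;
    }

    // Returns null when the key is missing or its entry has expired.
    public object GetData(string key)
    {
        int deadline;
        if (!deadlines.TryGetValue(key, out deadline) || Now >= deadline)
        {
            return null;
        }
        return values[key];
    }
}

public static class RaceDemo
{
    // Check-then-act: the entry can expire between Contains() and GetData().
    public static string CheckThenGet(FakeExpiringCache cache, string name)
    {
        if (cache.Contains(name))
        {
            cache.Now++; // simulate time passing between the two calls
            return (string)cache.GetData(name);
        }
        return "(looked up in AD)";
    }

    // Single read: one call decides both presence and value.
    public static string GetOnce(FakeExpiringCache cache, string name)
    {
        string cached = (string)cache.GetData(name);
        if (!string.IsNullOrEmpty(cached))
        {
            return cached;
        }
        return "(looked up in AD)";
    }
}
```

If an entry is added with a time-to-live of one tick, CheckThenGet() passes the Contains() test and then returns null, which is the value that ends up in the To column. GetOnce() either returns the cached address or falls through to the lookup path, and never returns null.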

I also created a bug report on Microsoft Connect.

Fix for BizTalk ESB Toolkit 2.0 Portal Message Viewer Error About BizTalkMsgBoxDb.dbo.ProcessHeartbeats

When we recently configured the ESB Portal website, we encountered a number of permissions-related issues.  Our initial experience was the same as that of many others who have discovered that the Portal’s included permissions script is inadequate.  Once we granted additional permissions to the existing database roles the permission errors cleared up – but we couldn’t overcome one last error: Invalid object name ‘BizTalkMsgBoxDb.dbo.ProcessHeartbeats’.

As most of you know, Microsoft decided not to ship the source code for the ESB Toolkit 2.0 aside from the Management Portal “sample”.  In order to diagnose this error, I pulled out Red Gate’s .NET Reflector and started digging through disassembled code.  The source of this particular issue lies in the ESB.BizTalkOperationsService.

In our environment, as in most high-performance BizTalk installations, the message box database is on a different SQL Server instance than the other BizTalk databases.  In a great oversight, the BizTalkOperationsService was hard-coded to expect the message box database to be present on the same server as the management database.  The operations service attempts to run this SQL query on the server that hosts the management database: SELECT 1 FROM BizTalkMsgBoxDb.dbo.ProcessHeartbeats with (nolock) where uidProcessID='{0}'.

You’ll note another potential issue here: the message box database name is hard-coded in the query.  That has also caused trouble for people.

To solve this problem, I first used .NET Reflector to re-create Visual Studio 2008 projects for the ESB.BizTalkOperationsService ASMX web service and Microsoft.Practices.ESB.BizTalkOperations.dll class library.  Once the projects were cleaned up and building successfully, I modified the code to query the management database for the primary message box database name and server using the existing stored procedure adm_MessageBox_Enum.  With that information in hand, I updated the code to create a connection string to the message box database and execute the ProcessHeartbeats query there.  I also removed the hard-coded database name.
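The general shape of that change can be sketched as follows. This is not the code from the download. The class and method names are hypothetical, the server and database names are assumed to have been read from adm_MessageBox_Enum already (its result columns are not shown in this post), and the sketch assumes Windows authentication and a uniqueidentifier uidProcessID column.

```csharp
using System;
using System.Data;
using System.Data.SqlClient;

public static class MessageBoxHeartbeat
{
    // Builds a connection string for the message box from values that are
    // assumed to have been read from adm_MessageBox_Enum beforehand.
    public static string BuildConnectionString(string server, string database)
    {
        SqlConnectionStringBuilder builder = new SqlConnectionStringBuilder();
        builder.DataSource = server;
        builder.InitialCatalog = database;
        builder.IntegratedSecurity = true; // assumption: Windows authentication
        return builder.ConnectionString;
    }

    // Runs the heartbeat check directly against the message box database,
    // so the query no longer needs a hard-coded database name.
    public static bool HasHeartbeat(string connectionString, Guid processId)
    {
        const string sql =
            "SELECT 1 FROM dbo.ProcessHeartbeats WITH (NOLOCK) " +
            "WHERE uidProcessID = @processId";

        using (SqlConnection connection = new SqlConnection(connectionString))
        using (SqlCommand command = new SqlCommand(sql, connection))
        {
            command.Parameters.Add("@processId", SqlDbType.UniqueIdentifier).Value = processId;
            connection.Open();

            // ExecuteScalar() returns null when the query yields no rows.
            return command.ExecuteScalar() != null;
        }
    }
}
```

Using a parameter instead of formatting the process ID into the SQL text is a choice made for this sketch. The original query builds the string with a '{0}' placeholder.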

I tested my version of the BizTalkOperationsService using the ESB.BizTalkOperations.Test.Client included with the Toolkit source code and verified that everything still worked as expected.

Since this was a fairly time-consuming issue to fix and it is a problem that should affect a good percentage of the installations out there, I decided to post my updated service and source code (download link at the end of this post).  I cannot make any guarantees about the correctness of the code, so consider it as-is and use at your own risk.  (That said, I believe that it works just fine.)

Let’s hope that Microsoft reconsiders its unfortunate decision not to ship source code.

ESB.BizTalkOperationsService.zip
