07 May 2017

a case of fixing SQL deadlocks without any code/schema changes

I work on a high volume message processing app built on .NET, C#, MS SQL Server and IBM WebSphere MQ. The application processes messages to the tune of 200 messages a second. During LnP testing, deadlocks were happening in a few scenarios and we fixed all of them.

Recently there was a need to load data from the same MS SQL database tables, which are being updated by the HIGH VOLUME message flow 24x7. The new data load tool should run daily, pick up the changes that happened over the last business day based on when each message was last modified, and send them to another interfacing system.

The new tool was developed as a .NET console app and scheduled to run every day using Windows Task Scheduler.

The tool ran for some days and we were monitoring the logs. Unfortunately, the deadlock issues were back at the forefront. When the tool ran, the deadlocks were seen only in the new tool's data access (DA) layer logic, not in the original application that updates the same tables. This is because it is easier for SQL Server to roll back a SELECT-only transaction than to roll back an Insert/Update/Delete transaction. So the SELECT-only queries used by the new tool were chosen as the DEADLOCK VICTIM every time.

Transaction (Process ID) was deadlocked on lock resources with another process and has been chosen as the deadlock victim

With the above introduction (actually a bit long), let's get started and justify the blog post title.

After analyzing the logs produced by the tool, we learned the following:

  • Deadlocks were happening while loading records that were being updated by the main processing app at the same time.
  • Analyzing the SELECT-only queries used to load the data in batches, we found no easy ways to optimize them to prevent deadlocks. The queries were already optimized. No low-hanging fruit.
  • The tool selected the changes that happened over the last 24 hours, basing its logic on each record's last-modified time.
  • After analyzing the records' last-modified date-times, we found that:
    1. Most of the records were heavily updated within the first hour after they were created.
    2. Almost all of the records were modified only during the first 5-7 hours after their creation time. This was the important clue to fixing the deadlock issue we were facing.
  • So instead of selecting the records modified in the last 24 hours, we selected a new 24-hour date-time range whose MAX date-time is 8 hours earlier than the current date-time. In this new date range, only a few records were still being updated. After this change to the selection date-time range, the SELECT-only queries started to succeed ALL THE TIME without being chosen as deadlock victims (a sketch follows below).
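
To illustrate the shifted window, here is a minimal sketch of the selection logic in C#. The table and column names (Messages, LastModifiedUtc) and the connection string are placeholders, not the actual schema of our app.

using System;
using System.Data.SqlClient;

class DailyChangeLoader
{
    static void Main()
    {
        // Shift the 24-hour window back by 8 hours: by then almost no
        // records in the window are still being updated by the main app.
        DateTime windowEnd = DateTime.UtcNow.AddHours(-8);
        DateTime windowStart = windowEnd.AddHours(-24);

        const string sql =
            "SELECT MessageId, Payload, LastModifiedUtc " +
            "FROM Messages " +
            "WHERE LastModifiedUtc >= @start AND LastModifiedUtc < @end";

        using (var conn = new SqlConnection("<connection string>"))
        using (var cmd = new SqlCommand(sql, conn))
        {
            cmd.Parameters.AddWithValue("@start", windowStart);
            cmd.Parameters.AddWithValue("@end", windowEnd);
            conn.Open();
            using (var reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                {
                    // send each changed record to the interfacing system ...
                }
            }
        }
    }
}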

The deadlock issues were resolved without any code or database schema changes. These complex deadlock issues were fixed by taking an altogether different approach to the problem.

so, THINK OUTSIDE THE BOX!!!

References

Stack Overflow Q&A on cause-of-a-process-being-a-deadlock-victim

MSDN article on Detecting and Ending Deadlocks

04 June 2016

supporting high volume HttpWebRequests

If your .NET app is seeing timeouts while making a huge volume of HTTP web requests, stop blaming the NETWORK bandwidth between the client and server and consider the points below:

  • How many concurrent HTTP web requests are allowed by the .NET Framework?
  • The KeepAlive setting on HttpWebRequest
  • Handling the "connection that was expected to be kept alive was closed by the server" error
  • Pay attention to the HTTP error rate

Let's dive in to understand more:

Configuring Maximum Concurrent HTTP requests

By default, your application running on the .NET Framework is allowed to MAKE only a certain number of concurrent HTTP web requests; I think the default is around a MAX of 2 to 4. You can easily increase this limit to a safe number that your application needs to support the expected load.

Below are the config section settings you need in your application/web config file.

<system.net>
  <connectionManagement>
    <add address="*" maxconnection="80"/>
  </connectionManagement>
</system.net>
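
If you prefer to set the limit in code, ServicePointManager offers the programmatic equivalent. A minimal sketch, assuming the value 80 simply mirrors the maxconnection setting above; it must run before the application issues its first request.

using System.Net;

static class HttpStartup
{
    // Call once at application startup, before the first HTTP request is made;
    // connections created afterwards pick up the new default limit.
    public static void Configure()
    {
        ServicePointManager.DefaultConnectionLimit = 80; // mirrors maxconnection="80"
    }
}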

Before increasing the MAX CONNECTION COUNT in PROD, load test the same setting in TEST. You need to ensure it doesn't overload the HTTP web server you are hitting. If your server is not able to handle the requests, you will still get exceptions (maybe "Server is too busy", etc.).

Setting KeepAlive on HttpWebRequest

By setting KeepAlive = true on the HttpWebRequest object, you tell the framework to use a persistent HTTP connection between the client and the server.

With a persistent connection, the connection need not be created afresh and torn down for each HTTP web request you make. However, the KeepAlive setting may not always be honored if the PROXY between the client and server does not support persistent connections, or if the HTTP web server has disabled PERSISTENT HTTP connections.
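
As a minimal sketch (the URL is only a placeholder), a single request over a persistent connection looks like this; KeepAlive is true by default on HttpWebRequest and is set explicitly here just for clarity.

using System;
using System.IO;
using System.Net;

class KeepAliveDemo
{
    static void Main()
    {
        var request = (HttpWebRequest)WebRequest.Create("http://example.com/api/status");
        request.Method = "GET";
        request.KeepAlive = true; // ask for a persistent connection that later requests can reuse

        using (var response = (HttpWebResponse)request.GetResponse())
        using (var reader = new StreamReader(response.GetResponseStream()))
        {
            Console.WriteLine(reader.ReadToEnd());
        }
    }
}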

Handling the "connection that was expected to be kept alive was closed by the server" error

  • When your application issues a burst of HTTP requests in a short span of time, it will try to CREATE as many HTTP connections as your application requests. (Of course, the MAX concurrent connections are limited by the maxconnection setting.)
  • After the activity slows down, your application might not need so many connections, so many of the created HTTP connections now sit idle.
  • When another burst of HTTP requests is made after some time, the client might reuse an HTTP connection from the pool, thinking the server still has that connection open at its end. If for some reason the HTTP server has closed that connection, your client application is bound to get the error "connection that was expected to be kept alive was closed by the server".
  • To fix this error you need to set System.Net.ServicePointManager.MaxServicePointIdleTime to a safe value at the HTTP client end. The MaxServicePointIdleTime at the client end should be less than the idle timeout configured at the HTTP web server end.
  • System.Net.ServicePointManager.MaxServicePointIdleTime = 8000; // 8 sec

    If your HTTP server's keep-alive idle timeout is set to 10 sec, then setting MaxServicePointIdleTime to 8 sec at the client side will reduce the probability of these connection-closed errors.

    If this reduced error probability is still not acceptable, then you should consider retrying the same HTTP web request upon a timeout error, so that the failed HTTP web request gets a second chance to succeed (see the sketch below).
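
Putting these points together, here is a minimal sketch of the client-side settings plus a single retry. The 8-second idle time assumes the server's keep-alive timeout is longer, as described above; the class name, URL and retry count are placeholders.

using System;
using System.IO;
using System.Net;

class ResilientHttpClient
{
    static ResilientHttpClient()
    {
        // Drop pooled connections that sit idle for more than 8 sec, so we
        // never reuse a connection the server has already closed at its end.
        ServicePointManager.MaxServicePointIdleTime = 8000; // 8 sec
    }

    public static string Get(string url)
    {
        const int maxAttempts = 2; // one retry for connection-closed / timeout errors
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                var request = (HttpWebRequest)WebRequest.Create(url);
                request.KeepAlive = true;
                using (var response = (HttpWebResponse)request.GetResponse())
                using (var reader = new StreamReader(response.GetResponseStream()))
                {
                    return reader.ReadToEnd();
                }
            }
            catch (WebException) when (attempt < maxAttempts)
            {
                // Retry once; the new attempt opens a fresh connection.
            }
        }
    }
}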

Pay attention to the HTTP error rate

Exceptions are costly, so they need to be used wisely. Don't design your HTTP APIs to throw exceptions for normal business errors if clients have nothing to act upon. When HTTP API clients are going to ignore the exceptions and do nothing about them, that probably isn't a proper case for throwing an exception, especially if you want to support high volume HTTP request processing. This is not true for HTTP APIs alone; it is true for APIs over any other protocol as well.

If for some reason the HTTP APIs have to throw exceptions and the exception rate is high, then the calling HTTP client app logic should be careful not to re-throw the exception up the call stack. Just handle the exception in the place where it is first caught and return an error code from then on.
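
For example, here is a minimal sketch of handling the error once at the point where it is caught and translating it into an error code instead of re-throwing it up the call stack; the type and method names are hypothetical.

using System;
using System.Net;

enum FetchStatus { Ok, NotFound, Failed }

static class QuoteClient
{
    // Handle the exception where it is first caught and translate it into a
    // status code; callers check the status instead of catching again.
    public static FetchStatus TryGetQuote(string url, out string quote)
    {
        quote = null;
        try
        {
            using (var client = new WebClient())
            {
                quote = client.DownloadString(url);
                return FetchStatus.Ok;
            }
        }
        catch (WebException ex)
        {
            var response = ex.Response as HttpWebResponse;
            return (response != null && response.StatusCode == HttpStatusCode.NotFound)
                ? FetchStatus.NotFound
                : FetchStatus.Failed;
        }
    }
}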

In a high volume HTTP requests scenario, if the exception rate is high you end up burning more CPU, which deprives other apps/processes of CPU.

More info

MSDN blog on Understanding MaxServicePointIdleTime and DefaultConnectionLimit