Nightmare on TSQL Street – #TSQL2sday
Thank you to Nick Haslam ( blog | twitter ) for providing the topic for T-SQL Tuesday #35: What is your most horrifying discovery from your work with SQL Server? Nick, I have never even heard of Soylent Green, let alone watched it. So, I’ll be interested to see if other posters have seen the movie. I’m more of a Nightmare on Elm Street kinda guy, for obvious reasons – and no, I’m not related to Freddy Krueger or SQL MVP Ted Krueger ( blog | twitter ).
Nightmare on TSQL Street: The Case of the Missing Cache – Based on a True Story
This is a cautionary tale about why Operations needs to understand a new application as well as the team that developed it does. The non-linear narrative is used for extra dramatic effect.
Scene One – The Discovery
The handsome Development DBA (aren’t they all?) was exhausted from fighting with Operations about getting read-only access to the Production servers to help with maintenance issues. The system was still in its infancy and there weren’t many devices connected yet. We see our friendly neighborhood DBA taking care of some long overdue exploration in the production databases to make sure that the system was working as planned. You should picture this as a teenager who just walked off into the woods alone for no good reason. As a matter of fact, this DBA was also wearing headphones and had no idea what was about to sneak up on him. The DBA had really wanted access to the DMVs, because they give him a glimpse into how the business application is affecting SQL Server (The Beast). So, he did what any DBA in a horror movie would do: he checked the top sprocs by execution count. I believe the code is from Glenn Berry’s ( blog | twitter ) Diagnostic Information Queries:
SELECT TOP 100
    qt.text AS 'SP Name',
    qs.execution_count AS 'Execution Count',
    qs.execution_count / DATEDIFF(Second, qs.creation_time, GETDATE()) AS 'Calls/Second',
    qs.total_worker_time / qs.execution_count AS 'AvgWorkerTime',
    qs.total_worker_time AS 'TotalWorkerTime',
    qs.total_elapsed_time / qs.execution_count AS 'AvgElapsedTime',
    qs.max_logical_reads,
    qs.max_logical_writes,
    qs.total_physical_reads,
    DATEDIFF(Minute, qs.creation_time, GETDATE()) AS 'Age in Cache'
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS qt
WHERE qt.dbid = DB_ID() -- Filter by current database
ORDER BY qs.execution_count DESC
[You see the DBA’s face go pale as Scene One ends]
Scene Two – The Flashback
Once upon a time, there was an application that was to have hundreds of thousands of devices connected to a central repository. These devices were known to have various issues from time to time, and there was a business requirement to report those issues back to the central repository where corrective action could be taken. There was also a business requirement to send a heartbeat, even when there were no issues, just to know that the device hadn’t gone offline. The Development DBA insisted that the application developers tread lightly on the database where possible, and they did. In a proof of concept completed before this application began, we had validated our architecture, which included caching to reduce the strain on the central server. The device would hit the cache, and as long as the status hadn’t changed, there was no reason to update the central repository. The application could hit the cache to get the latest information, and The Beast would store changes in history.
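To make the intended pattern concrete, here is a minimal sketch of the “write only on change” behavior the design called for. The dbo.DeviceStatus table, the procedure name, and the parameters are hypothetical stand-ins rather than the actual schema, and in the real system the unchanged-status check was supposed to live in the application’s caching tier, not in the database:

-- Hypothetical sketch only: the central repository should be touched only
-- when a device's status actually changes. Names are illustrative.
CREATE TABLE dbo.DeviceStatus
(
    DeviceId      INT         NOT NULL PRIMARY KEY,
    CurrentStatus VARCHAR(20) NOT NULL,
    LastChanged   DATETIME2   NOT NULL
);
GO

CREATE PROCEDURE dbo.ReportDeviceStatus
    @DeviceId INT,
    @Status   VARCHAR(20)
AS
BEGIN
    SET NOCOUNT ON;

    -- Update only when the status actually changed; an unchanged heartbeat
    -- should be answered from the caching tier and never reach The Beast.
    UPDATE dbo.DeviceStatus
    SET    CurrentStatus = @Status,
           LastChanged   = SYSDATETIME()
    WHERE  DeviceId = @DeviceId
      AND  CurrentStatus <> @Status;

    -- First report from a device we have never seen before.
    IF NOT EXISTS (SELECT 1 FROM dbo.DeviceStatus WHERE DeviceId = @DeviceId)
        INSERT dbo.DeviceStatus (DeviceId, CurrentStatus, LastChanged)
        VALUES (@DeviceId, @Status, SYSDATETIME());
END;
GO

The point of the sketch is simply this: an unchanged heartbeat should never translate into a write against The Beast.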
Scene Three – The Lesson
The camera moves off the DBA’s pale face and zooms in on his monitor. Where he had expected to see maybe 10K executions, he saw 10M executions (that’s three extra zeroes, folks). Imagine a denial of service attack – and the calls are coming from inside your own house!! This was a big system, and the parts had all been tested individually and, to a large degree, comprehensively. But it was especially challenging to test those devices, and the testing certainly didn’t include database monitoring for edge cases. The lesson learned is that either the Development DBA needs to take on some of the Operational responsibilities, or Operations needs to be much more involved in the development of the application. There’s just too much to ‘hand off’ once the application ships. The Beast was a large server, so it could absorb those 10M executions, and the Operational DBA wouldn’t have known that the caching tier should have absorbed the large majority of those hits. This story had a happy ending, because it was spotted before too many devices were connected in the wild, but it could have been very bad news.
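One small, concrete way to share that responsibility is a scheduled check built on the same DMV the DBA used above, comparing actual call rates against what the design says they should be. This is only a sketch, and the 100 calls-per-second ceiling is an assumption for illustration, not a number from the actual system:

-- Illustrative sketch: flag any cached plan whose call rate exceeds an
-- expected ceiling. The threshold below is an assumption for demonstration.
DECLARE @MaxExpectedCallsPerSecond INT = 100;

SELECT  qt.text AS [Query Text],
        qs.execution_count,
        qs.execution_count
            / NULLIF(DATEDIFF(SECOND, qs.creation_time, GETDATE()), 0) AS [Calls/Second]
FROM    sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS qt
WHERE   qt.dbid = DB_ID()
  AND   qs.execution_count
            / NULLIF(DATEDIFF(SECOND, qs.creation_time, GETDATE()), 0)
            > @MaxExpectedCallsPerSecond
ORDER BY [Calls/Second] DESC;

Had something like this been running against The Beast from day one, the missing cache would have shown up as an alert instead of a pale-faced DBA.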