Rules of Engagement: Polyglot Persistence

This is space! 'Course, we're just in the beginning part of space, we-we haven't even got to *outer* space yet! – Armageddon (1998)

In the movie Armageddon, Oscar (played by Owen Wilson) is anxious and excited at the same time, and says this as he is flying off to stop the asteroid from hitting Earth. We don't do anything quite that exciting, but to me, NoSQL is in that beginning part of space. And as somebody who has been working with relational databases for almost 20 years now, I'm both anxious and excited about exploring NoSQL.

Polyglot Persistence

Relational databases have been the backbone of our systems for many years and have stood the test of time. Rules of Engagement has just explored the tip of the NoSQL iceberg. You might be asking yourself: 'How do I pick the right NoSQL database for my next project? That seems very risky.' It doesn't have to be. Polyglot Persistence is just a fancy way of saying 'it's OK to mix and match data stores' for an application. A relational database is going to be the core, if not all, of the vast majority of systems that are built. However, consider using multiple data stores when it makes sense.


While write-only may be a slight exaggeration, there is one essential 'table' that nearly every system I have ever worked on has written to, but is rarely ever read. In many systems this table can contain more rows than the largest business table. Can you guess which table I'm referring to? I've seen bad code write tens of millions of rows a day to this table in a production system. It's the application log, of course. I completely agree that this data needs to be stored centrally, especially for a distributed system. But, now that I have a better understanding of NoSQL, I can see how it would likely be more cost-effective than storing the logged information in my fully licensed SQL database with expensive SAN storage. Log tables typically aren't even normalized in the first place. Below are the normalized tables that you get out of the box with Microsoft's Logging Application Block 5.0. Other logging schemas will be very similar.


These tables will be used not only to log exceptions, but any other event that developers decide to log; typically breadcrumbs so support can do a post mortem when a serious issue is found. This ends up being quite a bit of database I/O. The real kickers here are the Message and FormattedMessage columns: they actually contain a full stack trace when a system exception has been logged.
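To make the contrast concrete, here is a rough sketch of what one of those log events might look like as a single, denormalized document instead of rows spread across normalized tables. The field names are illustrative, loosely modeled on the Logging Application Block columns, not any particular product's schema:

```python
from datetime import datetime, timezone

# One log event as a single self-contained document. In the normalized
# relational schema, the same event is spread across several tables
# (the log row, its categories, and so on) joined by surrogate keys.
log_entry = {
    "timestamp": datetime(2012, 5, 1, 14, 30, tzinfo=timezone.utc).isoformat(),
    "severity": "Error",
    "machine_name": "WEB01",
    "categories": ["Checkout", "Payment"],  # no join table needed
    "message": "NullReferenceException in SubmitOrder",
    "formatted_message": "System.NullReferenceException: ...",  # full stack trace lives here
}

print(log_entry["severity"])          # → Error
print(len(log_entry["categories"]))   # → 2
```

Because the document is written once and read rarely, there is no real payoff for the joins; the whole event can travel together.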

Given what you know about NoSQL now, what type of data store might you use instead of SQL Server for this data? We’ll eliminate the graph database as an option. A traditional key-value store won’t work because if you do want to query this data, you aren’t likely to know the key to the log record you are looking for. This data is written once, and never updated. I think both document databases and column stores are well suited for storing log data, but I would lean towards a document database. I really like the concept of a capped collection on log data.
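A capped collection is a fixed-size, insertion-ordered collection where the oldest documents are evicted automatically once the cap is reached, which is exactly the behavior you want for rolling application logs (in MongoDB, for example, you create one with `capped: true` and a size in bytes). As a sketch of just the semantics, not any particular database's API, Python's `deque` with a `maxlen` models it nicely:

```python
from collections import deque


class CappedLog:
    """Toy model of a capped collection: fixed size, insertion order preserved,
    oldest entries evicted automatically once the cap is reached."""

    def __init__(self, max_entries):
        self._entries = deque(maxlen=max_entries)

    def insert(self, doc):
        self._entries.append(doc)  # silently drops the oldest entry when full

    def all(self):
        return list(self._entries)


log = CappedLog(max_entries=3)
for i in range(5):
    log.insert({"seq": i, "message": f"event {i}"})

# Only the 3 newest entries survive; 0 and 1 were evicted.
print([d["seq"] for d in log.all()])  # → [2, 3, 4]
```

With log data, that eviction is a feature: the retention policy is baked into the collection itself, so nobody has to write a purge job.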


Another issue that we frequently need to address is data archival. What happens when you make a non-trivial schema change to a table? It often results in creating a 'temporary table' (not to be confused with a temp table) with the new structure, copying all of the data from the old table to the new, and then renaming the 'temporary table' to the original name. If this is a large table, it can take a considerable amount of time to complete this operation. Personally, I would not be opposed to having the older, infrequently accessed data available in a NoSQL store, where the database isn't even aware that the 'schema' has changed. The burden is on the application.
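When the schema lives in the application rather than the database, a non-trivial change becomes a read-side concern: new code tolerates old documents instead of rewriting the whole table up front. Here is a hedged sketch of that pattern (the field names and version scheme are hypothetical, just to show the shape of the idea):

```python
def read_customer(doc):
    """Normalize a stored document to the current shape, whatever version wrote it.

    Old documents are upgraded lazily on read; no bulk copy-and-rename of a
    large table is needed when the shape changes.
    """
    version = doc.get("schema_version", 1)
    if version == 1:
        # v1 stored a single combined 'name' field.
        first, _, last = doc["name"].partition(" ")
        return {"first_name": first, "last_name": last}
    # v2 already stores the split fields.
    return {"first_name": doc["first_name"], "last_name": doc["last_name"]}


old_doc = {"name": "Ada Lovelace"}  # written before the schema change
new_doc = {"schema_version": 2, "first_name": "Ada", "last_name": "Lovelace"}

print(read_customer(old_doc) == read_customer(new_doc))  # → True
```

The trade-off is real: the database stays online and untouched, but every reader must understand every version still in the store.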

Another consideration is sheer volume – what is the retention policy? SAN storage can be really expensive, forcing some difficult decisions. Or does it? Most NoSQL solutions were written to scale horizontally on commodity hardware. If we are already using a document database in our polyglot persistence solution, it would work well enough for storing historical data on inexpensive disks. If we don't already have a document database in our solution, it might be time for someone to point out the elephant in the room.


How many of you knew that HBase is a very capable column store, and that Active Archive is one of the most common use cases for Hadoop?


This is my favorite example, because it is the first time that I suggested using NoSQL in a solution. It is quite common for data to be initially stored in an OLTP database and later moved to a Data Mart or Warehouse. Other times, the data is never needed in the OLTP database, and gets imported directly into the OLAP database.

In this case, there was a rather complex scheduling algorithm that ran in a long-running workflow and generated a large amount of data that would go directly into the Data Mart. And, the users needed to be able to preview these schedules via an SSRS report. So, the data needed to be available as soon as possible.

And, after previewing the schedules, they might need to change them a few times. But they really only care about the final schedule per day; there is no need to track the history of these schedule changes. At this point, I'm thinking that sounds like a lot of overhead that the system doesn't need, and scalability is going to suffer. Although our first instinct (as data professionals) is to store any data in SQL Server, there was really no need to have these schedules in SQL Server until a final decision had been made, other than needing to be able to preview them.

We had already been using a key-value store as an application cache to quickly and effectively serve up data in the application. So, the data in question was already put into 'buckets', and easily accessed with the right keys. But, how were we going to write a report against a data source that doesn't have an ODBC driver? Web services. It took some trial and error on the SSRS side, but we were able to create a report whose data came entirely from a key-value store and performed reasonably. Even the input parameters' drop-down lists were populated from web service calls.
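The 'buckets with the right keys' pattern boils down to composite keys that the workflow and the report's web service layer can both construct deterministically. Here is a minimal sketch with a plain dict standing in for the key-value store; the key format and names are illustrative, not the actual system's:

```python
from datetime import date

cache = {}  # stand-in for the actual key-value store


def schedule_key(resource_id, day):
    # Deterministic composite key: anyone who knows the resource and the
    # date can construct the key directly, without querying for it.
    return f"schedule:{resource_id}:{day.isoformat()}"


# The workflow writes the proposed schedule...
cache[schedule_key(42, date(2012, 6, 1))] = {"shift": "06:00-14:00", "final": False}

# ...and the web service behind the SSRS report reads it back by key.
doc = cache[schedule_key(42, date(2012, 6, 1))]
print(doc["shift"])  # → 06:00-14:00
```

This is why a plain key-value store was enough: the report never needed to search the data, only to fetch it by keys it could already compute.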

1000 Words

