A wide variety of Database Performance Monitoring (DPM) tools are available to help organizations gain visibility over their data estates. They play a particularly important role given the cost and consequences of downtime, and of its similarly frustrating counterpart, slowtime. Calculating the overall cost of application downtime is far from an exact science, but Gartner provides a useful starting point, putting the average at $5,600 per minute. While this figure varies with the size and nature of the organization involved, even small companies can see costs approaching $100,000 per hour of downtime. At the other end of the scale, larger businesses can expect the downtime bill to escalate beyond $1 million per hour.
Then there’s the issue of slowtime, which incurs only around one-fifth the hourly cost of downtime but occurs roughly ten times as often. There are plenty of additional factors to consider, but the point is that downtime and slowtime costs can add up very quickly for any company. The speed with which these incidents are identified, root-caused, resolved, and ultimately avoided therefore has tremendous value.
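The figures above combine in a perhaps surprising way: because slowtime happens so much more often, its total bill can exceed that of outright downtime. A back-of-the-envelope sketch (all numbers are illustrative estimates drawn from the ratios above, not measurements):

```python
# Back-of-the-envelope incident cost model using the figures above.
# Constants and function names are illustrative, not from any standard model.

DOWNTIME_COST_PER_MIN = 5_600   # Gartner's average, USD per minute
SLOWTIME_COST_RATIO = 1 / 5     # slowtime costs ~1/5 as much per hour...
SLOWTIME_FREQUENCY_RATIO = 10   # ...but happens ~10x as often

def annual_incident_cost(downtime_hours_per_year: float) -> dict:
    """Estimate yearly downtime and slowtime cost for a given amount of downtime."""
    downtime_cost = downtime_hours_per_year * 60 * DOWNTIME_COST_PER_MIN
    slowtime_hours = downtime_hours_per_year * SLOWTIME_FREQUENCY_RATIO
    slowtime_cost = slowtime_hours * 60 * DOWNTIME_COST_PER_MIN * SLOWTIME_COST_RATIO
    return {
        "downtime": downtime_cost,
        "slowtime": slowtime_cost,
        "total": downtime_cost + slowtime_cost,
    }

# An organization suffering just 2 hours of downtime a year:
costs = annual_incident_cost(2)
```

Under these assumptions, two hours of annual downtime costs $672,000, while the accompanying slowtime costs twice that again, which is why both belong in any ROI calculation.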
Indeed, minimizing downtime and slowtime should be a major indicator of ROI for any database performance monitoring solution. Fundamentally, DPM tools should focus on getting users out of a reactive firefighting stance – a situation familiar to many IT teams – and into a more proactive mode. This enables them to get ahead of downtime and slowtime issues and mitigate them before they impact end users and business performance.
But, while many DPM tools appear comprehensive and authoritative, users always need to be mindful of their overall performance and the ROI they deliver. For example, are current DPM tools contributing to more issues than they solve? Do they offer sufficient detail to help address and prevent performance and reliability issues? Can users call on live support from experienced engineers, and are the tools hosted in the right execution venue, be that on-premises or within an appropriate cloud environment?
The lack of detail in today’s DPM tools takes a few different forms. The most obvious involves counter-based metrics such as CPU and IO: some products and in-house solutions only capture snapshots of this data every several minutes. Often this is because collection is so onerous that admins don’t dare sample any more frequently for fear of over-burdening the monitored server.
Additional limitations may involve query-level details, where only the Top N queries are ever collected or shown, regardless of the level of activity on the server. Some tools also rank queries by their waits alone, rather than by the actual resource consumption of the request, which offers a much better chance of identifying the root cause.
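The distinction matters because a query with high waits is often a victim of blocking, while the real culprit is quietly burning CPU and IO. A minimal sketch of the two rankings, using made-up data structures rather than any vendor’s API (the field names loosely mirror counters like those in SQL Server’s `sys.dm_exec_query_stats`):

```python
from dataclasses import dataclass

@dataclass
class QueryStats:
    """Illustrative per-query counters, loosely modeled on SQL Server DMV output."""
    text: str
    wait_ms: float      # time spent waiting (blocked, latched, etc.)
    cpu_ms: float       # worker time actually consumed
    logical_reads: int  # pages touched

def top_by_waits(stats, n=3):
    """The wait-centric view: blocked 'victim' queries float to the top."""
    return sorted(stats, key=lambda q: q.wait_ms, reverse=True)[:n]

def top_by_resource_use(stats, n=3):
    """Rank by what each request actually consumed. The combined score here
    is a crude illustration (units are mixed deliberately for simplicity)."""
    return sorted(stats, key=lambda q: q.cpu_ms + q.logical_reads, reverse=True)[:n]

# A blocked query vs. the heavy scan that is the real root cause:
victim = QueryStats("SELECT ... FROM Orders", wait_ms=9_000, cpu_ms=15, logical_reads=40)
culprit = QueryStats("SELECT ... FROM OrderHistory", wait_ms=120, cpu_ms=4_800, logical_reads=2_000_000)
```

A waits-only ranking surfaces the victim; ranking by resource consumption surfaces the culprit.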
Even smaller, simpler database estates are dynamic environments where one size does not fit all and complications inevitably arise. That’s why it’s vital that tools are backed by responsive, expert support engineers who can ensure the tools are running optimally and that users are getting the most value from them.
This becomes particularly evident when downtime hits. If an organization is struggling to get pertinent insight from their DPM product, they need to be able to turn, without delay, to an expert who can help get them back on track. Industry professionals will be familiar with the experience of DPM users who end up with entire spreadsheets of open support tickets that go unanswered for months on end. Ultimately, what use is a monitoring tool if users can never get the support they need to keep it running properly?
With all that in mind, what does ‘good’ look like? Let’s start with scalability – most DPM tools on the market share the same limitation in how many servers a single installation of the product can monitor. Many start to choke somewhere in the neighborhood of 200 to 300 monitored SQL Servers, whereas users really need tools that can support environments of 1,000 servers or more with a single SQL Server database.
Alerts are another important component. A true monitoring system – as opposed to some products, which would be more accurately described as analysis tools – will not only offer template-based alerts, where users can change the alerting threshold, but also more configurable alerts. These allow DBAs to choose when they want to be alerted, avoiding times when a typical alert would just be unwarranted noise.
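The idea behind configurable alerting can be sketched in a few lines: a rule combines a threshold with an active window, so that an expected nightly maintenance spike doesn’t page anyone. This is a minimal illustration with invented names, not the alerting model of any particular product:

```python
from datetime import datetime, time

class CpuAlertRule:
    """Hypothetical alert rule: fire only when a threshold is breached
    *and* the clock falls inside the DBA-chosen active window."""

    def __init__(self, threshold_pct: float,
                 active_start: time = time(7, 0),
                 active_end: time = time(22, 0)):
        self.threshold_pct = threshold_pct
        self.active_start = active_start   # e.g. suppress during nightly index maintenance
        self.active_end = active_end

    def should_alert(self, cpu_pct: float, at: datetime) -> bool:
        in_window = self.active_start <= at.time() <= self.active_end
        return in_window and cpu_pct >= self.threshold_pct

# Alert on sustained CPU above 90%, but only between 07:00 and 22:00:
rule = CpuAlertRule(threshold_pct=90)
```

With this rule, a 95% CPU reading at 10:00 triggers a page, while the same reading at 02:00 – during a planned maintenance window – is silently suppressed.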
This kind of flexibility also applies to metrics and their analysis. For example, a developer might want to look only at the queries that are running, a DBA might prioritize wait statistics, a SAN admin the disks, and a VMware admin the hypervisor. In reality, all of these things and more need to be taken into consideration when troubleshooting a database performance problem, and they need to be presented in a way that makes them easy to visualize. DPM tools shouldn’t require users to look at memory on one screen, IO on another, and wait statistics on yet another. This information should be visible in one place to help correlate the root cause of the problem at hand.
And let’s not underestimate the impact of user experience, performance, and reliability on DBA happiness, stress levels, and ultimately the business risk of losing talent. Working in the DBA and related fields can be extremely challenging, with many constantly reacting to issues in “firefighter mode.” But with proper visibility and detail, it’s possible to adopt a more proactive stance and free up time for more strategic initiatives. Without the proper tools and methodology, however, the situation rarely gets better.
DPM tools should be regularly evaluated for technical effectiveness and business impact. And by gathering the experiences and opinions of the people using them on a daily basis, organizations can more effectively assess how well these important technology investments are delivering ROI. In doing so, they can also focus more clearly on the positive impact of minimizing the twin challenges of downtime and slowtime.
Steven Wright, Director of Solutions Engineering, SentryOne