Modern software infrastructure is getting more complex. Microservices, Kubernetes, ephemeral pods: these patterns are seeing ever higher adoption. To understand what is going on in your infrastructure, it is now essential to use some form of monitoring tool.
In this post, we will analyze popular application monitoring tools and how they compare among themselves. We have only included tools that serve most of the popular languages. So tools like Scout APM, Skylight, AppSignal are not included here as they support only 2-3 mainstream languages.
The vendors we will analyze in this post are the ones you are most likely to hear about when you are thinking of buying an APM tool as an SMB. This is based solely on our observation of different Reddit and Slack discussions and not on any affiliation with these vendors. From our understanding, vendors like AppDynamics & Dynatrace are more focused on enterprise customers.
The tools which we will analyze in detail in this post are:
- New Relic
Too many ways of pricing
The first thing that strikes you when you start comparing APM vendors is that there is no consistency in the parameters on which pricing is based. While Instana prices only on the number of hosts, HoneyComb prices only on the number of events sent to their servers.
A few key parameters on which pricing is currently based:
- Price per host
- Price per mn spans or events
- Price per mn traces
- Price per service
- Price per user
Vendors are either pricing based on one of the above parameters or a linear combination of them.
Below is a summary:
We will discuss the pricing of each vendor in more detail below. One aspect to note here is that, although pricing doesn't depend on it, vendors have very different default retention periods, ranging from 7 days to 60 days. If you want a longer retention period, most vendors are happy to negotiate a custom one for you. But then you have to talk to their account team. And who wants to do that! So, examine this factor carefully when you are evaluating.
Here we are only evaluating APM and Distributed tracing pricing for vendors. We are not analyzing their pricing for Infra monitoring, Network monitoring, etc. Also, we are only evaluating based on the public prices quoted on their webpage. We understand that many vendors give their customers custom pricing once you talk to their account manager. We think that the key parameters of their pricing won't change even in a custom quote.
Datadog bases its pricing on the number of hosts and the number of analyzed spans. If you are using Fargate, it has separate pricing for that, but we will not get into it.
Things to note here:
- Infra monitoring - If you want to use APM for DataDog, you install a Datadog agent in each of the hosts (VMs) you are monitoring. Though the pricing here says 31 USD/month, for each host you are monitoring using APM you also need to pay the infrastructure monitoring cost which comes to 15 USD/month. So, effectively per host pricing is 46 USD/month.
- Spot instances - Of course, the number of hosts an engineering team runs is not constant. It changes with external traffic, tests, etc. DataDog bills on the 99th percentile of your host count. A month has ~750 hrs, and 1% of that is ~8 hrs. So, they will charge you for the highest host count that you ran for more than ~8 hrs in a month.
- Custom Metric pricing - What is not very clearly mentioned on their main pricing page, and is often the cause of surprise for customers is how they charge custom metrics. Custom Metrics is priced at USD 5 per 100 custom metrics per month. So, if you are sending business metrics or other custom metrics to Datadog, be very careful of the number of metrics you send them as this can blow up fast.
- Tracing data retention - The standard retention period for tracing data in DataDog is 15 days. If you want to retain it for 30 days instead, the price shoots up from USD 1.7/mn spans to USD 3/mn spans.
- Filtering Analyzed Spans - DataDog provides the capability to filter analyzed spans for your services. This will reduce cost, but it will also reduce the granularity of tracing. The analysis below assumes that 100% of spans are analyzed.
From DataDog's website
You can run an estimate on the number of Analyzed Spans that would be generated from your services with the Analyzed Span Estimator. After ingestion, you can filter Analyzed Spans from 100% to a lower percentage on a service-by-service level under APM settings. This reduces billable Analyzed Spans.
Archived on 27 April 2020 - Archived Page
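To make the 99th percentile rule concrete, here is a rough Python sketch. It assumes billing works off hourly host counts with the top 1% of hours discarded, which is a simplified reading of the rule described above and not DataDog's exact algorithm. The function names are made up for illustration, and the 46 USD figure is the effective per-host price worked out in the notes above.

```python
# Effective per-host price from the notes above: 31 USD APM + 15 USD infra monitoring.
EFFECTIVE_HOST_PRICE_USD = 31 + 15


def billable_hosts(hourly_host_counts):
    """Host count left after discarding the top 1% of hours (assumed billing rule)."""
    counts = sorted(hourly_host_counts)
    if not counts:
        return 0
    # Integer arithmetic avoids float rounding: keep the lowest 99% of hours.
    kept_hours = max(1, len(counts) * 99 // 100)
    return counts[kept_hours - 1]


def monthly_host_cost(hourly_host_counts):
    """Monthly host charge in USD, excluding analyzed spans and custom metrics."""
    return billable_hosts(hourly_host_counts) * EFFECTIVE_HOST_PRICE_USD


# A 750-hour month: 10 hosts normally, scaled up to 40 hosts for a short burst.
short_burst = [10] * 742 + [40] * 8  # 8-hour burst falls inside the discarded 1%
long_burst = [10] * 741 + [40] * 9   # 9-hour burst does not

print(monthly_host_cost(short_burst))  # billed for 10 hosts
print(monthly_host_cost(long_burst))   # billed for 40 hosts
```

Under this assumption, a scale-up that lasts one hour too long gets billed as if it ran for the whole month, so it is worth checking how long your autoscaling bursts typically last.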
New Relic prices its APM solution according to the number of hosts (VMs) you want to monitor. Their current pricing page is very opaque. They show two plans - Pro & Essential.
Archived on 27 April 2020 - Archived Page
Don't get fooled by the starting price they show in each of the plans. The pricing shown is for a t2.micro instance running 750 hrs per month when billed annually.
Now, that's too many ifs and buts to consider. Why would they show t2.micro pricing - who runs them for any production infra? I think, what they are trying to do here is to get you started with a free trial and then assign you an account manager to negotiate the pricing etc.
If you are someone like me, why would you want to talk to an account manager to get the pricing of their service? And, that is what pisses me off about New Relic's current pricing. I think they still price based on the number of hosts you run. But the details are not clear. How much would it cost you if you run 5 c5.xlarge? Not clear from their pricing page.
We license by the number of hosts—physical servers and/or virtual machines—on which those containers run.
They followed a compute unit based pricing model earlier. They would charge you based on the number of hosts, and each VM will cost based on the RAM and compute power of the VM. So, I am assuming their pricing would still consider RAM & compute power of VMs to price them.
Based on my discussions with people, New Relic pricing is still broadly based on compute + RAM cost though with some more flexibility now. Here's the old pricing calculator, which would give you some sense of the pricing.
The issue with host-based pricing is that your cost depends a lot on how your infra is set up. If you have 10 micro-services, each running independently on a couple of hosts, you will pay New Relic much more than if you run the whole infra on a few big machines.
But if you use bigger VMs to host your services, you increase the risk of a single VM going down and taking a huge chunk of your infra with it. Compare this to independent microservices running on different nodes, where the impact of a single node failing is much lower.
Many people find New Relic very costly. To reduce the bill, they only monitor "important" hosts with New Relic. For example, say your infra has 40 VMs, but the critical web server which is most likely to fail runs on only 3 big VMs. If you want to reduce the money you pay to New Relic, you install it only on these 3 VMs. Your cost will be much lower than running it on all 40 nodes.
This practice has its tradeoffs though. You only get to monitor & trace what's happening on these 3 machines. The rest of the infrastructure is a black box to you. New Relic won't be able to provide any visibility into it.
A few things to note about New Relic:
- New Relic Essential plan doesn't include distributed tracing. Distributed tracing is only present in the Pro plan.
- In the Pro plan, New Relic limits tracing to 10K transaction events/minute. A transaction event is essentially a span in distributed tracing parlance. So, if you take 25 spans generated per request, this implies ~166 events/s or ~7 requests per second. So, if your system is serving more than 7 RPS, your traces will be sampled.
- Retention - New Relic has different retention periods for metrics & traces, which makes sense as storing traces is much more costly than storing metrics. Its Pro plan gives 30 days metric retention & 8 days events (traces) retention. The Essentials plan gives 3 days retention for both metrics & traces.
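The sampling threshold in the Pro plan note above is simple arithmetic, sketched below in Python. The 10K events/minute limit and the 25 spans per request are the figures from this post; the function name is made up for illustration, and your own spans-per-request number may differ a lot.

```python
def max_unsampled_rps(events_per_minute_limit, spans_per_request):
    """Requests per second you can serve before the event limit forces sampling."""
    events_per_second = events_per_minute_limit / 60
    return events_per_second / spans_per_request


# Figures from the post: 10K transaction events/minute, 25 spans per request.
print(f"{max_unsampled_rps(10_000, 25):.2f} requests/s")

# A leaner service emitting 10 spans per request gets more headroom.
print(f"{max_unsampled_rps(10_000, 10):.2f} requests/s")
```

The fewer spans each request generates, the more traffic you can serve before sampling kicks in.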
Instana charges flat 75 USD per host per month. This plan includes End User Monitoring (EUM) and infrastructure monitoring.
If you compare with Datadog, Datadog's Infra monitoring costs 15 USD per month and EUM costs 15 USD per month per 10K sessions. Anyways, using Datadog's APM product also adds infra monitoring to your subscription which takes just the per host cost to 46 USD per month. If you add EUM, that becomes 61 USD per month per host.
Pricing plan archived on 27 April 2020 - Archived Page
Instana claims that its flat per host pricing is the simplest model - but if you have a large number of hosts in your infra, this plan would be very costly.
Here's a video by Instana which claims that their pricing is the simplest - though higher - as it doesn't increase with time.
We think that this is a tall claim to make, as an increase in requests per second generally also increases the number of hosts. With the arrival of technologies like Kubernetes, people prefer horizontal scaling (adding more machines) to vertical scaling (making machines bigger).
Some points to keep in mind about Instana's pricing:
- Their plan is only billed annually. 75 USD/month/host implies 900 USD/host per annum. So, even if you have 10 hosts in your infrastructure, that amounts to 9000 USD ticket size to be paid in one go - which is not a small amount for SMBs.
- Instana's plan includes Infra monitoring & End User Monitoring
- Retention - They provide only 7 days of default retention, with some metrics (anomalies?) stored for a year. You would have to negotiate a longer retention period by shelling out a higher price.
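Since the plan is billed annually, the number that matters for budgeting is the upfront yearly amount. Here is a small Python sketch of that calculation, using the 75 USD/host/month list price quoted above; the function name and the host counts in the loop are only illustrative.

```python
INSTANA_USD_PER_HOST_PER_MONTH = 75  # list price quoted in this post


def annual_upfront_usd(hosts, usd_per_host_month=INSTANA_USD_PER_HOST_PER_MONTH):
    """Yearly amount due in one go for a per-host plan that is billed annually."""
    return hosts * usd_per_host_month * 12


for hosts in (10, 30, 100):
    print(f"{hosts} hosts -> USD {annual_upfront_usd(hosts)} per year")
```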
LightStep's pricing is very different from that of other vendors. They don't charge on the number of hosts or the amount of tracing data but on the number of services monitored and the number of users using LightStep.
This is very curious as we think the number of services depends a lot on your architecture. If you are using a micro-services architecture, you can have 50+ lightweight micro-services to handle only 10 requests per second.
They add a pretty steep USD 199 for each extra microservice in their Pro plan. We think that if you have fewer than 8 micro-services but a huge load, LightStep would be the most cost-effective vendor.
Archived on 27 April 2020 - Archived Page
Some points about LightStep pricing:
- Their starter plan has limited event data and only 3 services, so we think it would be only useful if you are just testing things out.
- LightStep has recently (April 2020) made infra metrics free
- If you have more than 8 services or more than 10 users, LightStep may prove very costly for you.
- Default retention for LightStep is 28 days, which is a good time period.
HoneyComb has recently announced a complete shift to event-based pricing. This fits well with their overall positioning of sending high-cardinality data with each event and slicing and dicing it in different ways. An event can be up to 100 kb (kilobits) with a max of 2000 columns. That's only ~12 KB, but it seems a good enough size for each event.
The way they define events 👇
An event contains information about what it took for a service to perform a unit of work. For example a log entry or trace span.
Pricing page archived on 27 April 2020 - Archived Page
Notes on HoneyComb pricing
- With event-based pricing, we think HoneyComb has the right pricing model to attract SMBs to try them, as they only need to pay for what they use.
- HoneyComb is positioned more as a debugging & observability tool. They give you good capability to pass high cardinality events and visualize them in different ways. But there is no default metrics dashboard. You have to create visualizations based on what you are looking for.
- Default retention is 60 days which should be good for most of the business use cases.
- At ~0.8 USD/mn events(spans), HoneyComb is much cheaper than DataDog which charges 3 USD/mn spans, that too only for 30 days retention. Also, there is no host-based pricing in HoneyComb, which removes dependency on the way your infrastructure is organized. You only pay for the amount of data you analyze.
Epsagon started its journey primarily as a serverless monitoring system but is slowly opening up to generic APM use cases as well. They charge based on the number of traces.
They differ from HoneyComb & Datadog as they price on the number of traces and not on the number of spans/events. A trace can have multiple spans within it - typically ~10-30 spans per trace.
Pricing page archived on 27 April 2020 - Archived Page. Per month charge for annual plans
For Epsagon, this is how they define traces
One trace will be sent for every container run or serverless function execution.
Up to 100K traces per month, free of charge, with full product functionality.
Epsagon's price seems to be on the higher side compared to HoneyComb & DataDog. It has a default retention period of 7 days, which is not very generous.
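Because Epsagon prices per trace while HoneyComb and DataDog price per span/event, it helps to put them on the same unit before comparing. Below is a rough Python sketch that converts the per-span prices quoted earlier in this post into an effective price per mn traces. The function name is made up for illustration, and Epsagon's own per-trace price is not restated in this post, so you would need to plug it in from their pricing page.

```python
def usd_per_mn_traces(usd_per_mn_spans, spans_per_trace):
    """Effective price per mn traces for a vendor that bills per mn spans/events."""
    return usd_per_mn_spans * spans_per_trace


# Per-span list prices quoted earlier in this post (USD per mn spans/events).
per_span_prices = {
    "HoneyComb": 0.8,
    "DataDog (15-day retention)": 1.7,
    "DataDog (30-day retention)": 3.0,
}

# The post puts a typical trace at ~10-30 spans; 25 is used in the scenarios.
for spans_per_trace in (10, 25, 30):
    for vendor, price in per_span_prices.items():
        effective = usd_per_mn_traces(price, spans_per_trace)
        print(f"{spans_per_trace} spans/trace | {vendor}: USD {effective:.2f} per mn traces")
```

If a vendor's per-trace price is below the effective figure printed for your typical spans-per-trace count, per-trace billing works out cheaper for you, and vice versa.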
Now that we have gone through the pricing details and nuances of each vendor, let's compare what each vendor would cost in different customer scenarios. I am using two scenarios here, one of an SMB customer & one of a medium-sized business. The scenarios are certainly simplistic and don't take into account many nuances related to architecture (monolith vs micro-services) and service type. Nevertheless, they give us some benchmarks for comparing different vendors.
PS: I am not comparing New Relic below, as their current pricing page doesn't give enough information on how their service will be priced for granular scenarios below.
Scenario 1 - Small Business | Assuming 10 hosts serving 10 rps with 5 services, 25 spans per trace, 1 trace per request
As you can see here, if we assume pro-rata USD 1/mn event pricing for HoneyComb, they are the most cost-effective. With only 10 hosts, even Instana is cost-effective. DataDog is costlier as it has both host & span cost in its pricing.
Scenario 2 - Medium Scale | Assuming 30 hosts serving 100 rps with 8 services, 25 spans per trace, 1 trace per request
For a medium scale business, if it has only 8 services, LightStep becomes very cost-effective as it prices only based on the number of services and users. DataDog & Epsagon are much costlier as their trace pricing is much higher, which is the major chunk of the cost for each vendor (except Instana)
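The data volumes behind the two scenarios can be reproduced with a few lines of Python. This is a rough sketch that uses only the list prices quoted earlier in this post (46 USD effective per host plus USD 1.7/mn spans for DataDog, 75 USD per host for Instana, and the pro-rata USD 1/mn events assumed for HoneyComb). It assumes a 30-day month, leaves out LightStep and Epsagon because their plan prices are not restated in the text, and all function names are made up for illustration.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # assumption: a 30-day month


def monthly_mn_spans(rps, spans_per_trace=25, traces_per_request=1):
    """Spans generated per month, in millions."""
    return rps * traces_per_request * spans_per_trace * SECONDS_PER_MONTH / 1_000_000


def datadog_usd(hosts, mn_spans):
    # 46 USD/host (APM + infra monitoring) plus 1.7 USD per mn analyzed spans.
    return hosts * 46 + mn_spans * 1.7


def instana_usd(hosts):
    # Flat 75 USD per host per month.
    return hosts * 75


def honeycomb_usd(mn_events):
    # Pro-rata 1 USD per mn events, as assumed in the comparison above.
    return mn_events * 1.0


scenarios = {
    "Scenario 1 (10 hosts, 10 rps)": {"hosts": 10, "rps": 10},
    "Scenario 2 (30 hosts, 100 rps)": {"hosts": 30, "rps": 100},
}

for name, s in scenarios.items():
    spans = monthly_mn_spans(s["rps"])
    print(name, f"-> {spans:.0f} mn spans/month")
    print(f"  DataDog:   USD {datadog_usd(s['hosts'], spans):.0f}")
    print(f"  Instana:   USD {instana_usd(s['hosts']):.0f}")
    print(f"  HoneyComb: USD {honeycomb_usd(spans):.0f}")
```

Changing the hosts, rps, or spans per trace to match your own setup is the quickest way to see which pricing parameter dominates your bill.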
- The APM suitable for you depends a lot on your architecture - Monolith vs Microservices. APMs charge on a variety of factors like the number of VMs, number of events sent, number of services, etc.
- There are lots of details around retention period, pricing for custom metrics, and limitations on the number of services and hosts, which one should be careful about when deciding which APM tool to use.
- One thing which we didn't discuss here is the monitoring of dev & staging environments. None of the vendors currently have any differential pricing for monitoring dev and staging, even though it helps if developers can find issues in staging itself rather than encountering them in production. Most companies don't have frameworks to simulate real-life loads in staging.
We have not evaluated language-specific or open-source APM solutions in this post. Here are some of the language-specific APMs for those who want to check them out.
Language Specific APMs
|Tool Name|Languages Supported|
|---|---|
|Scout APM|Ruby, Python|
The other possible way to get application metrics & observability is to host an open-source APM & distributed tracing solution yourself. Of course, this would also involve developer time from your team, so an accurate comparison would need to evaluate the total cost of ownership of setting up an open-source system.
We will write about how much it would cost you to host your own Jaeger and store tracing data in Cassandra.
Do let us know how you found this blog and how we can improve it. Write to us at hello at apmtools dot io