Personal Update: Going full time on Grafana

It’s been a year since I started at Raintank (soon to be renamed GrafanaLabs). During that time, I have worked on lots of interesting stuff.

I’ve learnt a lot. The biggest change was the switch from Windows to Linux. I have worked on projects in Node and Go, both new to me. It’s been intensive, but now it is going to get even more intensive!

As one of my colleagues in the Stockholm office, Carl, is away on parental leave for a few months, we decided that now was a good time for me to start focusing on Grafana full time.

It’s a bit mad starting to work on Grafana due to the sheer volume of issues, comments on issues, PRs, questions on the forum, questions on StackOverflow, questions on our public Slack channel, comments on Twitter and questions on the #grafana channel on IRC. Torkel even answers questions on Google+!

It’s very obvious from the outside that Grafana is a really popular project. It has just under 15 000 stars on GitHub and a look at the GitHub pulse shows a lot of activity – during the last month there have been 58 active pull requests and 252 active issues.


But what those stats do not show is the number of comments on issues, a lot of them on closed issues. Grafana currently has 900 open issues (570 of those are feature requests), but if you count closed issues and pull requests then there are more than 7000. Carl and Torkel have closed tons of issues and pull requests over the last 12 months, but they have also answered tons of follow-up questions on closed issues. Since I started writing this blog post a few minutes ago, 12… 13… 14 notifications from GitHub for Grafana issues have landed in my Gmail.

It’s crazy that just two full time people have kept up this furious tempo. You’re machines, Torkel and Carl!

The Grafana community is still growing rapidly and it is noticeable that merging pull requests and answering issues generates more pull requests and feature requests. Hopefully we’ll be growing our team soon so that we can work on more of those 500+ feature requests! The future looks very exciting (and busy).

Graphite and Grafana – How to calculate Percentage of Total/Percent Distribution

When working with Grafana and Graphite, it is quite common that I need to calculate the percentage of a total from Graphite time series. There are a few variations on this that are solved in different ways.

SingleStat

[Screenshot: a SingleStat panel showing disk usage]

With the SingleStat panel in Grafana, you need to reduce a time series down to one number. For example, to calculate the available memory percentage for a group of servers, we need to sum the available memory for all servers, sum the total memory for all servers, and then divide the available-memory sum by the total-memory sum.

The way to do this in Grafana is to create two queries, #A for the total and #B for the subtotal, and then divide #B by #A. Graphite has a function, divideSeries, that you can use for this. Then hide #A (you can see that it is grayed out below) and use #B for the SingleStat value.

The divideSeries function can be used in a Graph panel too, as long as the divisor is a single time series (for example, it will work for the sum of all servers but not when grouped by server).
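As a sketch, the query pair might look like this (the metric paths are hypothetical):

#A: sumSeries(servers.*.memory.total)
#B: divideSeries(sumSeries(servers.*.memory.available), #A)

Here #B references #A using Grafana’s series reference feature, so only #B needs to be visible in the panel.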

[Screenshot: the two queries, with #A hidden and grayed out]

Graph Multiple Percentage of Totals

[Screenshot: a graph of disk usage percentage per node]

Sometimes I want to graph the percentage of total grouped by server/node, e.g. disk usage percentage per server. In this case, divideSeries will not work. It cannot take multiple time series and divide them against each other (Prometheus has vector matching, but Graphite unfortunately does not have anything quite as smooth). One way to solve this is to use a different Graphite function called reduceSeries.

[Screenshot: query to calculate subtotals for multiple time series]
[Screenshot: the same query, zoomed in on the end]

In the example, there are two values, capacity (the total) and usage (the subtotal). First, a groupByNode function is applied; this returns a list with the two values for each server (e.g. minion-2.capacity and minion-2.usage). The mapSeries and reduceSeries functions take this list and, for each server, apply the asPercent reduce function to the two values. The result is a list of percentage totals per server.
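Since the screenshots are not reproduced here, here is a simplified sketch of the technique with hypothetical metric paths (skipping the groupByNode renaming step, which only tidies up the series names):

reduceSeries(mapSeries(servers.minion-*.disk.{usage,capacity}, 1), 'asPercent', 3, 'usage', 'capacity')

mapSeries groups the series by node 1 (the server name), and reduceSeries then applies asPercent to the usage and capacity values (node 3) within each group.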

The reduceSeries function can also apply two other reduce functions: a diff function and a divide function.

Same result with the asPercent Function

In the query above, the values (usage and capacity) are in the same namespace; if that is not the case, then the reduceSeries technique will be difficult or will not work at all. Another function worth checking out is the asPercent function, which might work better in some cases. The example below uses the same two-query technique that we used for divideSeries, but it works with multiple time series!
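A sketch with the same hypothetical metric paths as above:

asPercent(servers.minion-*.disk.usage, servers.minion-*.disk.capacity)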

[Screenshot: the asPercent query]

I learned these three techniques by looking at Grafana dashboards built by some of the Graphite experts that work with me at Raintank/Grafana Labs. They were new to me, so I hope they will help others too.

Profiling Golang Programs on Kubernetes

Recently I needed to profile a Go application running inside a Kubernetes pod using net/http/pprof. I got stuck for a while trying to figure out how to copy the profile file from a pod but there is an easier way.

net/http/pprof – A Short Intro

First, a little about profiling in Go. net/http/pprof is a package for profiling live Go applications and exposing the profiling data via HTTP. An application needs to be instrumented before it can be profiled, and there are plenty of articles online that describe that process in detail.
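For a typical service, the instrumentation can be as small as a blank import plus an HTTP listener. A minimal sketch (the port is an arbitrary choice):

package main

import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
    // Expose the profiling endpoints; the real application logic runs elsewhere.
    log.Println(http.ListenAndServe("localhost:6060", nil))
}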

Once you have instrumented your application, you just need to be able to access it from outside of a Kubernetes cluster.

Kubernetes and Pprof

The easiest way to get at the application in the pod is to use port forwarding with kubectl.

kubectl port-forward pod-123ab -n a-namespace 6060

The HTTP endpoint will now be available as a local port.

You can now generate a CPU profile with curl and pipe the data to a file (7200 seconds is two hours):

curl "http://127.0.0.1:6060/debug/pprof/profile?seconds=7200" > cpu.pprof

It is also possible to send the data directly to the pprof tool. The pprof tool (not the same thing as the net/http/pprof package) is a tool for generating a PDF or SVG analysis from the profile data.

To save the pprof data with the pprof tool you can use the interactive mode:

go tool pprof http://localhost:6060/debug/pprof/profile

By default, the fetched profile will be saved as a gzipped file (.pb.gz) in the pprof subdirectory of your home directory. Exit interactive mode by typing exit.

Congratulations, you now have the raw profile from your application from inside a Kubernetes pod!

Some bonus information on the pprof tool

To generate an analysis, you will need the binary file for your Go application.

Here is how to pipe the profile data in directly:

go tool pprof --pdf your-binary-file http://localhost:6060/debug/pprof/profile > profile.pdf
go tool pprof --svg your-binary-file http://localhost:6060/debug/pprof/profile > profile.svg

You can do a lot more with net/http/pprof and the pprof tool.

Memory profile for in-use space:

go tool pprof --pdf your-binary-file http://localhost:6060/debug/pprof/heap > in-use-space.pdf

Memory profile for allocated objects:

1. Generate the data:

go tool pprof http://localhost:6060/debug/pprof/heap

2. Exit interactive mode by typing exit.
3. Analyse the data:

go tool pprof -alloc_objects --svg your-binary-file /home/username/pprof/pprof.localhost:6060.alloc_objects.alloc_space.003.pb.gz > alloc-objects.svg

There are also switches for in-use object counts (-inuse_objects) and allocated space (-alloc_space).
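For example, a sketch for in-use object counts, using the same hypothetical binary and port as above:

go tool pprof -inuse_objects --svg your-binary-file http://localhost:6060/debug/pprof/heap > in-use-objects.svg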

The pprof tool’s interactive mode has lots of other nifty commands, like top10. Read more about that on the official Go blog.

Review of Release It!

Release It! Design and Deploy Production-Ready Software by Michael T. Nygard, published by The Pragmatic Programmers.
ISBN: 978-0-9787-3921-8

Introduction

Release It! is a book I have had on my reading list for a few years. I started a new job at Tradera/eBay Sweden in June last year and Release It! felt more relevant to my work than ever before. It’s the first time in my career that I am working on a public site with a lot of traffic (well, a lot for Sweden anyway). Downtime can mean a lot of lost revenue for my company. As I’m on call every second month for a week at a time, downtime also really affects how much sleep I get during that week. So perfect timing to read a book about pragmatic architecture.

System downtime can both cost huge amounts of money and seriously damage the reputation of a company. The incident that springs to mind when I think about damage to a company’s reputation is the three-day BlackBerry outage in 2011. It was the beginning of the end for a previously very successful company. Release It! is all about building a pragmatic architecture and infrastructure to avoid one of these doomsday scenarios for your company.

The book is organised into four parts and roars into action with the first: Stability, or how to keep your system up 24 hours a day.

The Exception That Grounded An Airline

The author Michael Nygard has worked on several large, distributed systems and uses his war stories to illustrate anti-patterns.

The first story is of a check-in system for an airline that crashes early in the morning – peak time for check-ins. First, he describes the process of figuring out which servers were the problem (they had no errors in their logs). They then took the decision to restart them all. This restart took three hours and caused so many flight delays that it made the TV news. The post-mortem involved thread dumps and decompiling production binaries (there was bad blood between the developers and operations, so they weren’t too keen to give him the source code). The bug turned out to be an uncaught SQL exception that was only ever triggered during a database failover.

I love this way of illustrating technical concepts. The stories are great and make the book a very easy read despite it being quite technical. And the post-mortems after the incidents are like CSI for programmers.

The lesson learned in this case was that it is unlikely that you can catch all the bugs, but you CAN stop them from spreading and taking down the whole system. This was a simple bug, but one unlikely ever to be caught in testing. Who tests that code works during database failovers and has that setup in their test environment?

However, by failing fast and using patterns like timeouts or circuit breakers you can stop the problem spreading through the whole system and crashing everything.

Cascading Failures vs Fail Fast

[Photo: a circuit breaker at a high-voltage test bay, Berlin]

After setting the scene, the author then introduces a group of Stability anti-patterns. They have great names like Cascading Failures and Attacks of Self-Denial. These anti-patterns represent all the ways you can design an unstable system.

To counteract these anti-patterns, there is a group of Stability patterns. These include some very useful patterns, like the practice of using timeouts so that calls don’t hang forever. Fail Fast counteracts the Cascading Failure anti-pattern. Instead of an error being propagated from subsystem to subsystem, crashing each one, the idea is to fail as fast as you can so that only the subsystem where the error occurred is affected.

The most interesting pattern for me was the Circuit Breaker pattern. I haven’t used it yet in a production system but it got me looking at a .NET implementation called Polly.
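For a flavour of what that looks like, here is a minimal sketch using Polly’s synchronous circuit breaker (the policy settings and CallRemoteService are my own placeholders, not from the book):

using System;
using Polly;

// Open the circuit after 3 consecutive failures and keep it open for 30 seconds.
// While the circuit is open, calls fail immediately instead of hanging on a broken dependency.
var breaker = Policy
    .Handle<TimeoutException>()
    .CircuitBreaker(exceptionsAllowedBeforeBreaking: 3, durationOfBreak: TimeSpan.FromSeconds(30));

// CallRemoteService is a placeholder for the integration point being protected.
breaker.Execute(() => CallRemoteService());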

Anyone else used it in a production system?

Capacity

The second part of the book introduces the capacity theme with a story about the launch of a retailer’s new website. They deployed the site at 9 a.m. and by 9:30 a.m. it had crashed and burned. Total nightmare. One of the sysadmins had a 36-hour shift setting up and deploying new app servers.

Again the author introduces a group of anti-patterns to watch out for and some patterns to counteract them. This time the anti-patterns are more obvious and are mostly some sort of waste of resources. They can be seemingly insignificant anti-patterns like not creating indexes for lookup columns in a database table, session timeouts that are too long (a session held in memory uses valuable memory), sending too much data over the wire or cookies that are too large. For a small system these are not a problem, but if you multiply them by thousands of concurrent users then a lot more servers will be needed.

Michael Nygard makes a very good point here that most developers have less than 10 years of experience and most systems that are built do not have capacity problems; the combination of these two statistics means that very few developers have experience of working on large systems. This also means that the same mistakes are repeatedly made when designing large systems.

I personally still have a lot to learn here. I will be digging deeper into load testing and some of the more technical details of how our servers are configured. It’s easy to forget about how the code that I write affects the production environment.

General Design Issues

This part of the book is a collection of disparate design issues that can be important for large systems. Networking, security, load balancing, SLAs, building a QA environment and other topics are covered here. Interesting stuff, if not as compelling as the first two parts of the book.

Operations

The last part of the book kicks off with another great war story about a site suffering performance problems on Black Friday (the day after Thanksgiving in the US when every shop in the country has a sale). This time the problem was caused by an external system run by another company that had taken down two of its four servers for maintenance (on the busiest day of the year!). They were able to save the situation by reconfiguring a component not to call the external servers. This was possible due to good component design that allowed them to just restart the component instead of the server (it would have taken 6 hours to restart all the servers). The site was back working within a few minutes.

Transparency

So how did they know so fast what was wrong in this case? They had built the system to be transparent. Most systems are black boxes that do not allow any sort of insight into how healthy they are. To make a system transparent, it has to be designed to reveal its current state, and a monitoring infrastructure is needed to create trends that can be easily analyzed by a human.

For example, at Tradera/eBay Sweden we can see, for our frontend servers, the number of errors per minute, page response times (both client and server side) and the number of 404 and 500 HTTP responses per minute (as well as lots of other measurements). For each server we can see memory usage, the number of connections and a host of other performance indicators. By just glancing at our dashboard of live statistics, I can see if we are suffering any performance issues on the frontend. Without this help it would be extremely difficult to keep the system running when things go wrong. As you can see, this theme is close to my heart!

I did get a bit lost in the technical details of this section, however. It is a really interesting theme, but the technical details are either JVM-specific or have not aged well. A lot has happened in this space since the book was written in 2007.

Verdict

If you work with a system that is deployed on more than ten servers, then this book is a must-read. There is a difference between just writing code and keeping a system up 24 hours a day. If you are passionate about DevOps, are on call regularly or care about creating stable systems, then this book is for you.


FluentMigrator – Setting the collation on a column

I got a question via a Twitter DM about how to set the collation for a column in a FluentMigrator migration. I gave a quick answer there but here is a longer answer for anyone stuck on this.

Using AsCustom

There is no explicit support for setting the collation per column in FluentMigrator, but it can be done with database-specific SQL. When creating a column with the WithColumn expression, you can use AsCustom to set the column type with any valid SQL. This will be specific to one database and will not be translated to the correct form for other databases.

Here is an example for MS SQL Server where the “test” column’s collation is set to Latin1_General_CI_AS using the COLLATE clause:

public override void Up()
{
    Create.Table("Order")
        .WithColumn("Id").AsInt32().NotNullable().Identity().PrimaryKey()
        .WithColumn("test").AsCustom("varchar(10) COLLATE Latin1_General_CI_AS");
}

Supporting non-standard SQL for multiple databases

FluentMigrator supports a lot of databases and has fluent expressions that support most of the common scenarios for all of them. However, these databases have loads of different features like collation that are not standard ANSI SQL. FluentMigrator will never be able to support all of these. Occasionally you will need to handle edge cases like these for multiple databases.

The way to do this is using FluentMigrator’s IfDatabase. This is quite simple. Just add IfDatabase before the create expression:

IfDatabase("sqlserver").Create.Table("Order")
    .WithColumn("Id").AsInt32().NotNullable().Identity().PrimaryKey()
    .WithColumn("test").AsCustom("varchar(10)COLLATE Latin1_General_CI_AS");

IfDatabase("postgres").Create.Table("Order")
    .WithColumn("Id").AsInt32().NotNullable().Identity().PrimaryKey()
    .WithColumn("test").AsCustom("text collate \"en_US\"");

And now you can write migrations with non-standard SQL for multiple databases in the same migration.

How to catch JavaScript Errors with window.onerror (even on Chrome and Firefox)

I’m working on a new (mostly greenfield) responsive website for eBay Sweden that has a fair amount of JavaScript and is viewed in lots of different browsers (mobile, tablet, desktop). Naturally we want to log our JavaScript exceptions and their stacktraces, just like we log server-side exceptions. It is impossible to test every combination of device and browser, so we rely on logging to find the edge cases we miss in our testing.

The way we handle our JavaScript exceptions is to:

  1. Catch the exception.
  2. Collect data about the user agent, context etc.
  3. Save it to our logs by sending an AJAX request with the data and the exception information (sketched below).
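In code, steps 2 and 3 might look something like this sketch (the /api/client-errors endpoint and the payload shape are my own placeholders):

function reportError(errorMsg, url, lineNumber, column, stack) {
    var payload = {
        message: errorMsg,
        script: url,
        line: lineNumber,
        column: column,
        stack: stack,
        userAgent: navigator.userAgent, // which browser/device hit the error
        page: window.location.href
    };
    var xhr = new XMLHttpRequest();
    xhr.open('POST', '/api/client-errors', true); // hypothetical logging endpoint
    xhr.setRequestHeader('Content-Type', 'application/json');
    xhr.send(JSON.stringify(payload));
}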

I can finally log JS Exceptions!

We decided to use window.onerror which is a DOM event handler that acts like a global try..catch. This is great for catching unexpected exceptions i.e. the ones that never occur while testing.

It is very simple to get started with: you just have to override the handler like this:

window.onerror = function (errorMsg, url, lineNumber) {
    alert('Error: ' + errorMsg + ' Script: ' + url + ' Line: ' + lineNumber);
}

But It Was Too Good To Be True

If you test this on a local server (say IIS or nginx) then it should work fine. But it is not the same as a normal try..catch, so producing a stacktrace with a library like stacktrace.js will probably not work too well. The window.onerror handler does not have the same context and the context varies enormously from browser to browser.

Also, if you have minified your files then the line number is not very useful. For example:

Error: ‘a’ is undefined Script: build.js Line: 3

Variable ‘a’ is very hard to find when line 3 has 30000 characters of minified JavaScript.

Unfortunately, I do not have a solution for this for all browsers. This will get better over the next few months as a new standard for window.onerror has been agreed upon. It is already implemented for Chrome.

The new standard adds two parameters: a column number and an error object. Our window.onerror handler now looks like this:

window.onerror = function (errorMsg, url, lineNumber, column, errorObj) {
    alert('Error: ' + errorMsg + ' Script: ' + url + ' Line: ' + lineNumber
    + ' Column: ' + column + ' StackTrace: ' +  errorObj);
}

I made a little test project to see what level of support there is. It contains one test page with a button and two script files. The button triggers a function in one script file that creates an exception and the other script file contains the window.onerror handler.

As of 2014-01-18 the results were:

  • Firefox 26.0 returns the first three parameters (hopefully this will be implemented soon)
  • Internet Explorer 10 will return a column number but no error object. However, by using arguments.callee.caller you can get a stacktrace.
  • Chrome 32.0.1700.76 (for desktop) returns all five parameters
  • Chrome for Android (version 32) returns all five parameters
  • Safari for iOS (6 and 7) returns the first three parameters (here is the Webkit issue)

So for older mobiles you will never be able to get decent error logging, but as browsers implement this and people switch to newer phones it will get better. Not perfect, but this would still be a good start if not for…

Chrome and Firefox Break It Totally For CDNs

It is actually even worse if you use a CDN for your script files. window.onerror won’t work at all in this case. If you have tried this then you have probably visited this StackOverflow page:

Cryptic “Script Error.” reported in Javascript in Chrome and Firefox

Firefox and Chrome will, for security reasons, not report exceptions from scripts of a different origin. All you get is a cryptic “Script error.” and nothing else. Totally useless.

Solving the Cryptic “Script Error.”

The latest versions of Chrome (see point 3 in the linked post) and Firefox have now provided a way to allow reporting of exceptions from CDNs.

There are two steps to get it working:

  • Set the Access-Control-Allow-Origin header (for example, Access-Control-Allow-Origin: *) on the CDN responses for your script files.
  • Use the new crossorigin attribute on the script tag. Here is an example from our website at work:

<script crossorigin="anonymous" src="//static.tradera.com/touchweb/static/output/script/344c8698.build.js"></script>

The crossorigin attribute has two possible values.

  • anonymous means no user credentials are needed to access the file.
  • use-credentials if you are using the Access-Control-Allow-Credentials header when serving the external JavaScript file from the CDN.

Now it should work. There is one caveat here: if you have set the crossorigin attribute on the script tag then you MUST set the Access-Control-Allow-Origin header, otherwise the browser will discard the script for not having CORS enabled.

At the moment, we filter out the exceptions for iOS Safari and older Androids by checking if the error message contains “Script error.”.

window.onerror = function (errorMsg, url, lineNumber, column, errorObj) {
    if (errorMsg.indexOf('Script error.') > -1) {
        return;
    }
    // otherwise, collect the data and send it to our logs as usual
};

Getting Closer

The state of play in January 2014 is that you can catch unexpected JavaScript exceptions in most desktop browsers, on Android phones with a recent version of Chrome for Android, and on Windows Phones. Hope this helps out other developers trying to track JavaScript errors on public sites where you can’t ask the user for help recreating errors. Roll on 2015 and better JavaScript error handling!

Git for Windows tip: opening Sublime Text from bash

I wrote a post about how to open Notepad++ from the bash prompt a few years ago, but I recently switched to Sublime Text 2 as my standard text editor and had to figure out how to do the same thing with Sublime. It was surprisingly difficult to find on Google.

1. Create a text file called subl (with no extension) with the following content:

#!/bin/sh
"C:\Program Files\Sublime Text 2\sublime_text.exe" "$@" &

2. Copy it into the C:\Program Files (x86)\Git\bin folder.

The first line indicates that this is a shell script. The first part of the second line is the path to the Sublime exe. The "$@" passes along any arguments so that you can use the script like this from the bash prompt (command line) to open a file:

subl text.txt

or

subl .

to open the current folder.

The last parameter & indicates that it should open Sublime in the background so that you can continue using the command prompt.