· 6 min read

Overview#

Recently we released a new feature that lets users run Pyroscope in pull mode. It allows you to pull profiling data from applications, and its various discovery mechanisms make it easy to integrate with systems like Kubernetes and start profiling all of your pods with minimal setup.

For Pyroscope, the difference between push and pull mode is that:

  • Push mode: The application sends a POST request with profiling data to the Pyroscope server, which returns a simple response
  • Pull mode: Pyroscope sends a GET request to targets (identified in a config file), and the targets return profiling data in the response

[Diagram: push vs. pull mode]
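For a Go application that exposes the standard net/http/pprof endpoints, the pull flow conceptually boils down to an HTTP scrape loop like the minimal sketch below. The target address, port, and timings are illustrative, and this is not Pyroscope's actual scraper code, which adds target discovery, tagging, and pprof transcoding on top:

package main

import (
    "fmt"
    "io"
    "net/http"
    "time"
)

// pullProfile GETs a CPU profile from a target's standard pprof endpoint.
func pullProfile(target string, seconds int) ([]byte, error) {
    url := fmt.Sprintf("http://%s/debug/pprof/profile?seconds=%d", target, seconds)
    resp, err := http.Get(url)
    if err != nil {
        return nil, err
    }
    defer resp.Body.Close()
    return io.ReadAll(resp.Body)
}

func main() {
    for {
        // The real server does this for every discovered target and then
        // transcodes and stores the returned pprof data.
        data, err := pullProfile("localhost:6060", 10)
        if err != nil {
            fmt.Println("scrape failed:", err)
        } else {
            fmt.Printf("pulled %d bytes of pprof data\n", len(data))
        }
        time.Sleep(10 * time.Second)
    }
}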

One of the major benefits of pull mode is creating meaningful tags. For example, in Kubernetes workflows you can tag profiles with popular metadata fields:

  • pod_name: Name of the pod
  • container_name: Name of the container
  • namespace: Namespace of the pod
  • service_name: Name of the service
  • etc

Then using Pyroscope's query language, FlameQL, you can filter profiles by those tags over time to see where the application is spending most of its time.

Early on, we had a user who ran Pyroscope in pull mode with about a thousand profiling targets and roughly 1 TB of raw profiling data per day, but their Pyroscope server was running into performance issues.

This was surprising to us because in push mode we've seen Pyroscope handle similar amounts of traffic without issues. Therefore, we suspected that the performance issue had to do with one of the key architectural differences between push mode and pull mode.

While pprof is a great format for representing profiling data, it's not optimal for storing it. So, we transcode pprof into an internal format optimized for storage in our custom storage engine.

  • Push mode: In push mode, pprof transcoding is done on each profiling target, so the load is distributed across many machines
  • Pull mode: In pull mode, pprof transcoding moved to the server side, so the load is concentrated on a single machine: the Pyroscope server

So, we immediately suspected that this performance regression might have something to do with the load increase on the pulling server.

[Diagram: pprof transcoding in push mode vs. pull mode]

Using profiling to spot performance issues in Go#

The Pyroscope server is written in Golang and continuously profiles itself, so when these kinds of issues happen we're usually able to find them quickly.

In this screenshot you can see that the two functions taking up a lot of time are FindFunction and FindLocation. These functions are called when Pyroscope transcodes pprof profiles into its internal flamegraph format. However, it seemed suspicious that such simple functions were consuming so much CPU time.

[Screenshot: Pyroscope flamegraph showing the performance issue]

FindFunction and FindLocation consume a significant amount of CPU

To understand what it actually takes to "find" a Location or a Function, we looked at how pprof objects are structured:

[Diagram: anatomy of a pprof profile]

The id fields in both the Location and Function arrays ascend in consecutive numerical order with each new object

Note that the location and function fields are actually arrays, containing the locations and functions respectively. The objects in these arrays are identified by IDs.
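In Go terms, a trimmed-down sketch of the generated pprof types involved here looks roughly like the following. Field lists are abbreviated and the comments are ours; see the pprof protobuf definition for the full set of fields:

type Profile struct {
    Sample   []*Sample   // stack traces and their measured values
    Location []*Location // searched by FindLocation
    Function []*Function // searched by FindFunction
    // ... plus mappings, the string table, etc.
}

type Sample struct {
    LocationId []uint64 // the Locations that make up this stack
    Value      []int64  // e.g. CPU samples
}

type Location struct {
    Id   uint64  // starts at 1 and ascends consecutively
    Line []*Line // each Line points at a Function by ID
}

type Line struct {
    FunctionId uint64
    Line       int64
}

type Function struct {
    Id       uint64 // starts at 1 and ascends consecutively
    Name     int64  // index into the profile's string table
    Filename int64  // also a string-table index
}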

The FindFunction and FindLocation functions are almost identical: both search through their respective arrays for objects by ID.

func FindFunction(x *Profile, fid uint64) (*Function, bool) {
    // this sort.Search function is the expensive part
    idx := sort.Search(len(x.Function), func(i int) bool {
        return x.Function[i].Id >= fid
    })
    if idx < len(x.Function) {
        if f := x.Function[idx]; f.Id == fid {
            return f, true
        }
    }
    return nil, false
}

And if you look closer at the functions, they already seem fairly optimized: they use sort.Search, Golang's standard library implementation of the binary search algorithm. We initially assumed binary search would be the fastest option here, because it's typically the fastest way to find an element in a sorted array.

However, the flamegraph showed that this was the bottleneck slowing down the whole system.

Performance Optimization #1: Caching the data in a hash map#

In our first attempt at fixing the issue we tried to use caching. Instead of performing the binary search every time we needed to find a function, we cached the data in a hash map.

That did improve performance a little, but we only traded one relatively expensive operation (binary search, shown in green below) for another (map lookups, shown in red) that used slightly less CPU but was still expensive.
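As a rough sketch of the idea (not the exact code we shipped, reusing the pprof types sketched above), the cache trades the per-lookup binary search for a one-time pass that builds a map keyed by ID:

// Build the lookup table once per profile...
func buildFunctionCache(x *Profile) map[uint64]*Function {
    byID := make(map[uint64]*Function, len(x.Function))
    for _, f := range x.Function {
        byID[f.Id] = f
    }
    return byID
}

// ...then each lookup becomes a single map access instead of a sort.Search.
func findFunctionCached(byID map[uint64]*Function, fid uint64) (*Function, bool) {
    f, ok := byID[fid]
    return f, ok
}

Map accesses are O(1) on average, but each one still hashes the key, which is why runtime.mapaccess2_fast64 shows up in the flamegraph below.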

[Flamegraph: CPU usage after optimization #1]

The green nodes are functions where this optimization decreased CPU usage and the red nodes are functions where it added CPU usage. While we removed FindFunction and FindLocation, we added runtime.mapaccess2_fast64.

Performance Optimization #2: Eliminating the need for binary search#

As I mentioned earlier, objects in the function and location arrays were sorted by ID. Upon closer inspection, we discovered that not only were the arrays sorted, but the IDs also started at 1 and ascended in consecutive numerical order (1, 2, 3, 4, 5). So if you want the object with ID 10, you simply look at the object at position 9.

So, although we initially thought binary search was the fastest way to find functions and locations in their respective arrays, it turned out we could eliminate the search altogether by indexing directly into the array at position ID - 1.
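Here is a sketch of that final approach, again using the pprof types from above rather than the literal shipped code. The lookup becomes a bounds check plus an array access, with the ID comparison kept as a cheap sanity check:

func findFunctionByIndex(x *Profile, fid uint64) (*Function, bool) {
    idx := int(fid) - 1 // IDs start at 1, so ID n lives at index n-1
    if idx >= 0 && idx < len(x.Function) {
        if f := x.Function[idx]; f.Id == fid {
            return f, true
        }
    }
    return nil, false
}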

This completely removed the performance overhead caused by the FindFunction and FindLocation functions.

[Flamegraph: CPU usage after optimization #2]

The green nodes are functions where this optimization decreased CPU usage and the red nodes are functions where it added CPU usage. We removed FindFunction and FindLocation and also removed runtime.mapaccess2_fast64.

What happens when you profile a profiler?#

In retrospect, we probably should have started by looking at the specification for a pprof object before assuming that binary search was the best way to find objects. But we wouldn't even have known to look at the specification until we started profiling our own code.

It's just the nature of developing software that as complexity increases over time, more performance issues sneak their way into the codebase.

However, continuous profiling lets you spot these performance issues and understand which parts of your code are consuming the most resources.

If you'd like to learn more about how to get started with Pyroscope and learn where your code's bottlenecks are, check out the Pyroscope documentation.


· 5 min read

Continuous Profiling for Golang applications#

Profiling a Golang Rideshare App with Pyroscope#

[Diagram: example architecture]

Note: For documentation on Pyroscope's golang integration visit our website for golang push mode or golang pull mode

Background#

In this example we show a simplified, basic use case of Pyroscope. We simulate a "ride share" company which has three endpoints found in main.go (a sketch of the handler wiring follows this list):

  • /bike : calls the OrderBike(search_radius) function to order a bike
  • /car : calls the OrderCar(search_radius) function to order a car
  • /scooter : calls the OrderScooter(search_radius) function to order a scooter
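As a hypothetical sketch of the wiring in main.go (the handler bodies, port, and search radii here are illustrative rather than the exact example code), each route simply hands off to its order function:

func main() {
    // pyroscope.Start(...) initialization (shown below) also lives in main()
    http.HandleFunc("/bike", func(w http.ResponseWriter, r *http.Request) {
        OrderBike(1)
        w.Write([]byte("<h1>Bike ordered</h1>"))
    })
    http.HandleFunc("/car", func(w http.ResponseWriter, r *http.Request) {
        OrderCar(3)
        w.Write([]byte("<h1>Car ordered</h1>"))
    })
    http.HandleFunc("/scooter", func(w http.ResponseWriter, r *http.Request) {
        OrderScooter(2)
        w.Write([]byte("<h1>Scooter ordered</h1>"))
    })
    log.Fatal(http.ListenAndServe(":5000", nil))
}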

We also simulate running 3 distinct servers in 3 different regions (via docker-compose.yml)

  • us-east-1
  • us-west-1
  • eu-west-1

One of the most useful capabilities of Pyroscope is the ability to tag your data in a way that is meaningful to you. In this case, we have two natural divisions, and so we "tag" our data to represent those:

  • region: statically tags the region of the server running the code
  • vehicle: dynamically tags the endpoint

Tagging static region#

Tagging something static, like the region, can be done in the initialization code in the main() function:

    pyroscope.Start(pyroscope.Config{
        ApplicationName: "ride-sharing-app",
        ServerAddress:   serverAddress,
        Logger:          pyroscope.StandardLogger,
        Tags:            map[string]string{"region": os.Getenv("REGION")},
    })

Tagging dynamically within functions#

Tagging something more dynamic, like the vehicle tag, can be done inside our utility FindNearestVehicle() function using the pyroscope.TagWrapper() function:

func FindNearestVehicle(search_radius int64, vehicle string) {
    pyroscope.TagWrapper(context.Background(), pyroscope.Labels("vehicle", vehicle), func(ctx context.Context) {
        // Mock "doing work" to find a vehicle
        var i int64 = 0
        start_time := time.Now().Unix()
        for (time.Now().Unix() - start_time) < search_radius {
            i++
        }
    })
}

What this block does is:

  1. Add the label pyroscope.Labels("vehicle", vehicle)
  2. Execute the wrapped function (the mock work that finds the nearest vehicle)
  3. Remove the pyroscope.Labels("vehicle", vehicle) label behind the scenes once the wrapped function returns
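To tie this back to the endpoints, here is a hypothetical sketch of how the order functions can pass their own vehicle label down to FindNearestVehicle (the bodies in the actual example repository may differ):

// Each order function tags its search with the matching vehicle label
// before doing the (mock) work of finding the nearest vehicle.
func OrderBike(search_radius int64)    { FindNearestVehicle(search_radius, "bike") }
func OrderCar(search_radius int64)     { FindNearestVehicle(search_radius, "car") }
func OrderScooter(search_radius int64) { FindNearestVehicle(search_radius, "scooter") }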

Resulting flamegraph / performance results from the example#

Running the example#

To run the example run the following commands:

# Pull latest pyroscope image:
docker pull pyroscope/pyroscope:latest

# Run the example project:
docker-compose up --build

# Reset the database (if needed):
# docker-compose down

This example runs all the code mentioned above and also sends some mock load to the 3 servers and their respective 3 endpoints. If you select our application, ride-sharing-app.cpu, from the dropdown, you should see a flamegraph like the one below. After giving the flamegraph 20-30 seconds to update and clicking the refresh button, we see our 3 functions at the bottom of the flamegraph taking up CPU resources proportional to the size of their respective search_radius parameters.

Where's the performance bottleneck?#

[Flamegraph: ride-sharing-app.cpu]

The first step when analyzing a profile output from your application is to take note of the largest node, which is where your application is spending the most resources. In this case, it happens to be the OrderCar function.

The benefit of using the Pyroscope package is that now we can investigate further into why the OrderCar() function is problematic. Tagging both region and vehicle allows us to test two good hypotheses:

  • Something is wrong with the /car endpoint code
  • Something is wrong with one of our regions

To analyze this we can select one or more tags from the "Select Tag" dropdown:

[Screenshot: the "Select Tag" dropdown]

Narrowing in on the Issue Using Tags#

Knowing there is an issue with the OrderCar() function, we select that tag first. Then, after inspecting multiple region tags, it becomes clear from the timeline that there is an issue with the us-west-1 region, which alternates between high-CPU and low-CPU periods.

We can also see that the mutexLock() function is consuming almost 70% of CPU resources during this time period.

[Flamegraph: narrowing in on the issue using tags]

Comparing two time periods#

Using Pyroscope's "comparison view" we can actually select two different time ranges from the timeline to compare the resulting flamegraphs. The pink section on the left timeline results in the left flamegraph, and the blue section on the right represents the right flamegraph.

When we select a period of low-CPU utilization and a period of high-CPU utilization, we can see clearly different behavior in the mutexLock() function: it takes 33% of CPU during low-CPU times and 71% of CPU during high-CPU times.

[Screenshot: comparison view]

Visualizing Diff Between Two Flamegraphs#

While the difference in this case is stark enough to see in the comparison view, sometimes the diff between the two flamegraphs is better visualized with them overlaid on each other. Without changing any parameters, we can simply select the diff view tab and see the difference represented in a color-coded diff flamegraph.

[Screenshot: diff view]

More use cases#

We have been beta testing this feature with several different companies; here are some of the ways we've seen them tag their performance data:

  • Tagging Kubernetes attributes
  • Tagging controllers
  • Tagging regions
  • Tagging jobs from a queue
  • Tagging commits
  • Tagging staging / production environments
  • Tagging different parts of their testing suites
  • Etc...

Future Roadmap#

We would love for you to try out this example and see how you can adapt it to your Golang application. While this example focused on CPU debugging, Golang also provides memory profiling. Continuous profiling has become an increasingly popular tool for monitoring and debugging performance issues (arguably the fourth pillar of observability).

We'd like to keep improving our Golang integrations, so we'd love to hear what features you would like to see.

· 5 min read

Pyroscope Rideshare Example#

[Diagram: example architecture]

Note: For documentation on the Pyroscope pip package see our python docs

Background#

In this example we show a simplified, basic use case of Pyroscope. We simulate a "ride share" company which has three endpoints found in server.py:

  • /bike : calls the order_bike(search_radius) function to order a bike
  • /car : calls the order_car(search_radius) function to order a car
  • /scooter : calls the order_scooter(search_radius) function to order a scooter

We also simulate running 3 distinct servers in 3 different regions (via docker-compose.yml)

  • us-east-1
  • us-west-1
  • eu-west-1

One of the most useful capabilities of Pyroscope is the ability to tag your data in a way that is meaningful to you. In this case, we have two natural divisions, and so we "tag" our data to represent those:

  • region: statically tags the region of the server running the code
  • vehicle: dynamically tags the endpoint (similar to how one might tag a controller in Rails)

Tagging static region#

Tagging something static, like the region, can be done in the initialization code in the config.tags variable:

import os
import pyroscope

pyroscope.configure(
    app_name       = "ride-sharing-app",
    server_address = "http://pyroscope:4040",
    tags           = {
        "region":   f'{os.getenv("REGION")}', # Tags the region based on the environment variable
    })

Tagging dynamically within functions#

Tagging something more dynamic, like the vehicle tag, can be done inside our utility find_nearest_vehicle() function using a with pyroscope.tag_wrapper() block:

def find_nearest_vehicle(n, vehicle):
    with pyroscope.tag_wrapper({ "vehicle": vehicle }):
        i = 0
        start_time = time.time()
        while time.time() - start_time < n:
            i += 1

What this block does is:

  1. Add the tag { "vehicle": "car" }
  2. Execute the code wrapped inside find_nearest_vehicle()
  3. Remove the { "vehicle": "car" } tag behind the scenes once the block completes

Resulting flamegraph / performance results from the example#

Running the example#

To run the example run the following commands:

# Pull latest pyroscope image:
docker pull pyroscope/pyroscope:latest

# Run the example project:
docker-compose up --build

# Reset the database (if needed):
# docker-compose down

This example runs all the code mentioned above and also sends some mock load to the 3 servers and their respective 3 endpoints. If you select our application, ride-sharing-app.cpu, from the dropdown, you should see a flamegraph like the one below. After giving the flamegraph 20-30 seconds to update and clicking the refresh button, we see our 3 functions at the bottom of the flamegraph taking up CPU resources proportional to the size of their respective search_radius parameters.

Where's the performance bottleneck?#

[Flamegraph: ride-sharing-app.cpu]

The first step when analyzing a profile output from your application is to take note of the largest node, which is where your application is spending the most resources. In this case, it happens to be the order_car function.

The benefit of using the Pyroscope package is that now we can investigate further into why the order_car() function is problematic. Tagging both region and vehicle allows us to test two good hypotheses:

  • Something is wrong with the /car endpoint code
  • Something is wrong with one of our regions

To analyze this we can select one or more tags from the "Select Tag" dropdown:

[Screenshot: the "Select Tag" dropdown]

Narrowing in on the Issue Using Tags#

Knowing there is an issue with the order_car() function, we select that tag first. Then, after inspecting multiple region tags, it becomes clear from the timeline that there is an issue with the us-west-1 region, which alternates between high-CPU and low-CPU periods.

We can also see that the mutex_lock() function is consuming almost 70% of CPU resources during this time period.

[Flamegraph: narrowing in on the issue using tags]

Comparing two time periods#

Using Pyroscope's "comparison view" we can actually select two different time ranges from the timeline to compare the resulting flamegraphs. The pink section on the left timeline results in the left flamegraph, and the blue section on the right represents the right flamegraph.

When we select a period of low-CPU utilization and a period of high-CPU utilization, we can see clearly different behavior in the mutex_lock() function: it takes 51% of CPU during low-CPU times and 78% of CPU during high-CPU times.

[Screenshot: comparison view]

Visualizing Diff Between Two Flamegraphs#

While the difference in this case is stark enough to see in the comparison view, sometimes the diff between the two flamegraphs is better visualized with them overlaid on each other. Without changing any parameters, we can simply select the diff view tab and see the difference represented in a color-coded diff flamegraph.

[Screenshot: diff view]

More use cases#

We have been beta testing this feature with several different companies; here are some of the ways we've seen them tag their performance data:

  • Tagging controllers
  • Tagging regions
  • Tagging jobs from a redis / sidekiq / rabbitmq queue
  • Tagging commits
  • Tagging staging / production environments
  • Tagging different parts of their testing suites
  • Etc...

Future Roadmap#

We would love for you to try out this example and see how you can adapt it to your Python application. Continuous profiling has become an increasingly popular tool for monitoring and debugging performance issues (arguably the fourth pillar of observability).

We'd like to keep improving this pip package by adding things like integrations with popular tools, memory profiling, etc., and we'd love to hear what features you would like to see.

· 5 min read

Pyroscope Rideshare Example#

[Diagram: example architecture]

Note: For documentation on the Pyroscope ruby gem see our ruby docs

Background#

In this example we show a simplified, basic use case of Pyroscope. We simulate a "ride share" company which has three endpoints found in server.rb:

  • /bike : calls the order_bike(search_radius) function to order a bike
  • /car : calls the order_car(search_radius) function to order a car
  • /scooter : calls the order_scooter(search_radius) function to order a scooter

We also simulate running 3 distinct servers in 3 different regions (via docker-compose.yml)

  • us-east-1
  • us-west-1
  • eu-west-1

One of the most useful capabilities of Pyroscope is the ability to tag your data in a way that is meaningful to you. In this case, we have two natural divisions, and so we "tag" our data to represent those:

  • region: statically tags the region of the server running the code
  • vehicle: dynamically tags the endpoint (similar to how one might tag a controller in Rails)

Tagging static region#

Tagging something static, like the region, can be done in the initialization code in the config.tags variable:

require "pyroscope"Pyroscope.configure do |config|  config.app_name = "ride-sharing-app"  config.server_address = "http://pyroscope:4040"  config.tags = {    "region": ENV["REGION"],                     # Tags the region based of the environment variable  }end

Tagging dynamically within functions#

Tagging something more dynamic, like the vehicle tag, can be done inside our utility find_nearest_vehicle() function using a Pyroscope.tag_wrapper block:

def find_nearest_vehicle(n, vehicle)
  Pyroscope.tag_wrapper({ "vehicle" => vehicle }) do
    # ...code to find nearest vehicle
  end
end

What this block does is:

  1. Add the tag { "vehicle" => "car" }
  2. Execute the code wrapped inside find_nearest_vehicle()
  3. Remove the { "vehicle" => "car" } tag behind the scenes once the block completes

Resulting flamegraph / performance results from the example#

Running the example#

To run the example run the following commands:

# Pull latest pyroscope image:
docker pull pyroscope/pyroscope:latest

# Run the example project:
docker-compose up --build

# Reset the database (if needed):
# docker-compose down

This example runs all the code mentioned above and also sends some mock load to the 3 servers and their respective 3 endpoints. If you select our application, ride-sharing-app.cpu, from the dropdown, you should see a flamegraph like the one below. After giving the flamegraph 20-30 seconds to update and clicking the refresh button, we see our 3 functions at the bottom of the flamegraph taking up CPU resources proportional to the size of their respective search_radius parameters.

Where's the performance bottleneck?#

[Flamegraph: ride-sharing-app.cpu]

The first step when analyzing a profile output from your application is to take note of the largest node, which is where your application is spending the most resources. In this case, it happens to be the order_car function.

The benefit of using the Pyroscope package is that now we can investigate further into why the order_car() function is problematic. Tagging both region and vehicle allows us to test two good hypotheses:

  • Something is wrong with the /car endpoint code
  • Something is wrong with one of our regions

To analyze this we can select one or more tags from the "Select Tag" dropdown:

[Screenshot: the "Select Tag" dropdown]

Narrowing in on the Issue Using Tags#

Knowing there is an issue with the order_car() function, we select that tag first. Then, after inspecting multiple region tags, it becomes clear from the timeline that there is an issue with the us-west-1 region, which alternates between high-CPU and low-CPU periods.

We can also see that the mutex_lock() function is consuming almost 70% of CPU resources during this time period.

[Flamegraph: narrowing in on the issue using tags]

Comparing two time periods#

Using Pyroscope's "comparison view" we can actually select two different time ranges from the timeline to compare the resulting flamegraphs. The pink section on the left timeline results in the left flamegraph and the blue section on the right represents the right flamegraph.

When we select a period of low-CPU utilization and a period of high-CPU utilization, we can see clearly different behavior in the mutex_lock() function: it takes 23% of CPU during low-CPU times and 70% of CPU during high-CPU times.

[Screenshot: comparison view]

Visualizing Diff Between Two Flamegraphs#

While the difference in this case is stark enough to see in the comparison view, sometimes the diff between the two flamegraphs is better visualized with them overlaid on each other. Without changing any parameters, we can simply select the diff view tab and see the difference represented in a color-coded diff flamegraph.

[Screenshot: diff view]

More use cases#

We have been beta testing this feature with several different companies; here are some of the ways we've seen them tag their performance data:

  • Tagging controllers
  • Tagging regions
  • Tagging jobs from a redis or sidekiq queue
  • Tagging commits
  • Tagging staging / production environments
  • Tagging different parts of their testing suites
  • Etc...

Future Roadmap#

We would love for you to try out this example and see how you can adapt it to your Ruby application. Continuous profiling has become an increasingly popular tool for monitoring and debugging performance issues (arguably the fourth pillar of observability).

We'd like to keep improving this gem by adding things like integrations with popular tools, memory profiling, etc., and we'd love to hear what features you would like to see.

· 4 min read

Using flame graphs to get to the root of the problem#

I know from personal experience that debugging performance issues on Python servers can be incredibly frustrating. Usually, increased traffic or a transient bug would cause end users to report that something was wrong.

More often than not, it was impossible to exactly replicate the conditions under which the bug occurred, so I was stuck trying to figure out which part of our code or infrastructure was responsible for the performance issue on our server.

This article explains how to use flame graphs to continuously profile your code and reveal exactly which lines are responsible for those pesky performance issues.

Why you should care about CPU performance#

CPU utilization is a metric of application performance commonly used by companies that run their software in the cloud (e.g. on AWS or Google Cloud).

In fact, Netflix performance architect Brendan Gregg mentioned that decreasing CPU usage by even 1% is seen as an enormous improvement because of the resource savings that occur at that scale. However, smaller companies can see similar benefits when improving performance because, regardless of size, CPU utilization is often directly correlated with two very important facets of running software:

  1. How much money you're spending on servers - The more CPU resources you need, the more it costs to run servers
  2. End-user experience - The more load placed on your server's CPUs, the slower your website or server becomes

So when you see a graph of CPU utilization that looks like this:

[Graph: CPU utilization spiking to 100%]

During the period of 100% CPU utilization, you can assume:

  • End users are having a frustrating experience (i.e. the app or website is loading slowly)
  • Server costs will increase after you provision new servers to handle the additional load

The question is: which part of the code is responsible for the increase in CPU utilization? That's where flame graphs come in!

How to use flame graphs to debug performance issues (and save $66,000 on servers)#

Let's say the flame graph below represents the timespan that corresponds with the "incident" in the picture above where CPU usage spiked. During this spike, the server's CPUs were spending:

  • 75% of time in foo()
  • 25% of time in bar()
  • $100,000 on server costs

[Flamegraph: 75% of time in foo(), 25% in bar()]

You can think of a flame graph like a super detailed pie chart, where:

  • The width of the flame graph represents 100% of the time range
  • Each node represents a function
  • The biggest nodes are taking up most of the CPU resources
  • Each node is called by the node above it

In this case, foo() is taking up 75% of the total time range, so we can improve foo() and the functions it calls in order to decrease our CPU usage (and save $$).

Creating a flame graph and table with Pyroscope#

To recreate this example with actual code, we'll use Pyroscope, an open-source continuous profiler built specifically for debugging performance issues. To simulate the server doing work, I've created a work(duration) function that spins for the duration passed in. This way, we can replicate foo() taking 75% of the time and bar() taking 25% of the time, producing this flame graph from the code below:

[Flamegraph: mockup generated from the code below]

# where each iteration simulates CPU time
def work(n):
    i = 0
    while i < n:
        i += 1

# This would simulate a CPU running for 7.5 seconds
def foo():
    work(75000)

# This would simulate a CPU running for 2.5 seconds
def bar():
    work(25000)

Then, let's say you optimize your code to decrease foo()'s work from 75000 to 8000 units, but leave all other portions of the code the same. The new code and flame graph would look like this:

[Flamegraph: after optimizing foo()]

# This would simulate a CPU running for 0.8 seconds
def foo():
    # work(75000)
    work(8000)

# This would simulate a CPU running for 2.5 seconds
def bar():
    work(25000)

Improving foo() saved us $66,000#

Thanks to the flame graphs, we were able to immediately identify foo() as the bottleneck in our code. After optimizing it, we significantly decreased our CPU utilization.

[Graph: CPU utilization after the optimization]

This means your total CPU utilization decreased by 66%. If you were paying $100,000 for your servers, you could now handle the same load for just $34,000.