Macbook and GPE storm — How not to feel bad about your own skills?

Paulo Almeida
Jan 1, 2020

Let me guess: you tried installing Linux on your MacBook Pro and realised that the CPU fan was running way too often and that one CPU core was consistently over 90% utilisation even though not many processes were running. You googled it for a few minutes and stumbled upon a thread on some forum in which a random person said “Just disable the gpe06 interruption and it will work out”, as shown in the screenshot below:

You might have seen something akin to it

Although you were happy to have your CPU core back, now you somehow feel like an impostor, since you wouldn’t even know where to begin if you had to debug this problem on your own. If that’s how you feel, then you’re in the right place.

You are not alone!

As frustrating as it can be, all of us feel that way. Most of us would probably like to have half of that person’s knowledge, and yet it feels so distant and far-fetched from our reality that we just end up feeling bad about ourselves.

So this is what I propose: we will debug it together and come to the same conclusion so that, instead of perceiving it as “I’m not good enough”, you will now see it as “hey, it can be fun to find out what’s wrong”. Sounds good?

Disclaimer

I’m using Fedora 31, so if you’re using a Linux distribution with a different package manager, please make sure you adapt the commands accordingly.

Let the debugging commence!

So far we know that the problem affects the CPU and that this causes the CPU fan to be switched on. Having said that, let’s take a look at it using the htop utility.

# dnf install htop
$ htop

The previous screenshot alone gives us the impression that a good portion of CPUs 0 and 1 is being used. However, we can’t seem to find the culprit yet. Maybe we could get a better idea if we knew what this red colour means?

A good description can be found if we use the good old “Help” button which many of us seem to have forgotten after Google was invented.

Alright, that seems to be some processing happening in kernel space rather than a misbehaving user-space process. Another interesting thing we found here is that by pressing “K” we can hide/show kernel threads. Let’s try it.

Okay. We found a kworker which is consuming a significant amount of CPU time. Although we may not understand much about the kernel, we can infer a few things for now.

  1. It has to do with something called “kacpi_notify”.
  2. Although the CPU consumption is high, the process has its status set as idle.
  3. We don’t know where the CPU consumption is coming from.

Given that our “suspect” lives on the kernel side and that we are using a non-tainted Linux kernel, we can download the source code and wrap our heads around it. For more information about tainted kernels, take a look here: https://www.kernel.org/doc/html/v5.3/admin-guide/tainted-kernels.html
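If you are not sure whether your own kernel is tainted, the page linked above explains that the current taint state can be read from /proc/sys/kernel/tainted, where 0 means not tainted. Here is a minimal sketch of that check. The function name kernel_taint_status and its optional path argument are my own inventions for illustration (the argument only exists so the function can be tried on a sample file).

```shell
# Report whether the running kernel is tainted.
# $1 (optional, hypothetical helper argument): file to read instead of the real one.
kernel_taint_status() {
    taint_file="${1:-/proc/sys/kernel/tainted}"
    value=$(cat "$taint_file") || return 1
    if [ "$value" -eq 0 ]; then
        echo "not tainted"
    else
        # A non-zero value is a bit field; the linked page explains each bit.
        echo "tainted (flags value: $value)"
    fi
}

# Usage on a live system:
#   kernel_taint_status
```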

In order to better understand what is happening let’s take a look at the Linux kernel’s source code and see if we can find where that process (kacpi_notify) is defined.

Here is where you can find it: https://github.com/torvalds/linux/blob/v5.3/drivers/acpi/osl.c#L1725-L1726

Navigating through the definition of the alloc_workqueue function takes us to a file called kernel/workqueue.c.

An interesting thing we found at the top of the file is a link to the official documentation, which will likely point us in the right direction.

Voilà! We found it!

Now we can take a look at the docs and see if anything stands out that might explain why all of this is happening… https://www.kernel.org/doc/Documentation/core-api/workqueue.rst

This can also be a great opportunity to learn a bit more about workqueues and why they were implemented in the first place.

Not many lines away from the top, we found an interesting statement:

That at least explains why the process was idle, so we can cross that one out. That also tells us that what we saw in htop wasn’t accurate enough to pinpoint where that CPU consumption was coming from.

Going back to htop, we can enable the Detailed CPU Time option by pressing F2, as shown below.

Let’s see what it looks like now.

Great! Now that we have enabled the Detailed CPU Time option, the CPU load went from red to yellow and, given the description obtained when pressing the Help button, this means that most of that processing comes from IRQs (interrupt requests). That gives us a hint as to why the process was shown as idle even though the CPU load was relatively high.
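If you would like to double-check the IRQ share without htop, the per-CPU lines of /proc/stat carry the same kind of information: after the cpuN label come the user, nice, system, idle, iowait, irq, softirq and steal counters. The sketch below simply prints the irq and softirq counters next to the total for each CPU. The function name irq_time_per_cpu and its optional path argument are hypothetical, added only for illustration.

```shell
# Print the irq and softirq tick counters for every CPU listed in /proc/stat.
# $1 (optional, hypothetical helper argument): file to read instead of /proc/stat.
irq_time_per_cpu() {
    stat_file="${1:-/proc/stat}"
    awk '/^cpu[0-9]+ / {
        total = 0
        # Fields 2..9 are user, nice, system, idle, iowait, irq, softirq, steal.
        for (i = 2; i <= 9; i++) total += $i
        printf "%s irq=%d softirq=%d total=%d\n", $1, $7, $8, total
    }' "$stat_file"
}

# Usage on a live system:
#   irq_time_per_cpu
```

Keep in mind that these counters are cumulative since boot, so to see what is happening right now you would take two readings a few seconds apart and compare them.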

Just knowing that it comes from IRQ doesn’t solve much of the problem

This is true. That’s why we will have to jump head first into the OS in order to validate whether or not our theory is right. Before we do this, though, I need to tell you one thing: it’s beyond the scope of this article to teach you things like kernel space vs user space or how IRQs are handled by the OS. Instead, I will give you the pleasure of researching them yourself :) It will be fun anyway ;)

For now, all you need to know is:

  1. Interrupts (IRQs) are handled by the kernel.
  2. You can’t access kernel memory directly from user space.

Making the kernel “cough up” the info we want

The kernel is so vast and comprehensive that it’s sometimes hard to find what we want (unless we really know where to look). So, to reduce the search area, let us define the following:

The Linux kernel exposes a few ways of interacting with it such as:

  1. filesystem /proc (procfs) and /sys (sysfs)
  2. syscalls
  3. devices (Block devices, Character devices, Network devices, Miscellaneous devices)

Having said that, a file called /proc/interrupts can give us a hint.
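Since /proc/interrupts lists every interrupt line on the machine, it can help to filter it down to the row we care about. The sketch below assumes the ACPI row is the one whose last column reads “acpi” (that is how it usually shows up on x86 machines, but check your own output) and adds up its per-CPU counters. The function name acpi_irq_counts and its optional path argument are hypothetical.

```shell
# Sum the per-CPU counters of the row in /proc/interrupts whose name is "acpi".
# $1 (optional, hypothetical helper argument): file to read instead of /proc/interrupts.
acpi_irq_counts() {
    interrupts_file="${1:-/proc/interrupts}"
    awk '
        NR == 1 { ncpu = NF; next }   # header row: one column per CPU
        $NF == "acpi" {
            irq = $1
            sub(/:$/, "", irq)        # "9:" becomes "9"
            total = 0
            for (i = 2; i <= ncpu + 1; i++) total += $i
            printf "IRQ %s acpi total=%d\n", irq, total
        }' "$interrupts_file"
}

# Usage on a live system:
#   acpi_irq_counts
```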

So far, this matches the info we found in htop, which is a good sign. Our next logical step is to take a look at the ACPI driver’s source code and see what else it exports to user space through the kernel communication mechanisms that might help us.

That will lead us to a file called drivers/acpi/sysfs.c (https://github.com/torvalds/linux/blob/v5.3/drivers/acpi/sysfs.c)

Right off the bat we found an interesting comment within the conditional compilation directive. We may use this to increase the verbosity and finally know what is causing it. On the other hand, given the number of interrupts we saw in the /proc/interrupts file, I’m not sure we should use it as our plan A. Let’s keep this as a card up our sleeve in case we can’t find anything else.

After reading a good portion of it we found some helpful bits like these:

Root folder of the interrupt files available via sysfs https://github.com/torvalds/linux/blob/v5.3/drivers/acpi/sysfs.c#L556-L559
sysfs files being created https://github.com/torvalds/linux/blob/v5.3/drivers/acpi/sysfs.c#L913-L919
Values that are accepted through sysfs for these files in the counter_set method https://github.com/torvalds/linux/blob/v5.3/drivers/acpi/sysfs.c#L733-L804

Don’t feel bad if you don’t know straight away what those pieces of code do. Here are some references you can read later:

  1. http://man7.org/linux/man-pages/man5/sysfs.5.html
  2. https://www.kernel.org/doc/Documentation/filesystems/sysfs.txt (this will probably give you the insight into most of what acpi/sysfs.c source code is doing there)

The most helpful piece of code I found was the one shown below, which tells us precisely what we can expect if we read a file within the /sys/firmware/acpi/interrupts/* space. Basically, it will show us whether or not the event (fixed or general purpose) is enabled and also how many times it was triggered by the hardware. This is exactly what we were looking for!

counter_show function https://github.com/torvalds/linux/blob/v5.3/drivers/acpi/sysfs.c#L676-L726

Knowing what we know now, we can not only read those files and find out which one is the culprit but also write a valid/accepted value to it so that we can “address” the issue.

$ grep . /sys/firmware/acpi/interrupts/*
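With dozens of files in that directory, the noisy one can be hard to spot by eye. As a rough sketch, the helper below ranks the files by their counter, relying on what counter_show told us: the count is the first thing printed in each file. The function name top_acpi_interrupts and both of its arguments are hypothetical, added only for illustration.

```shell
# List the ACPI interrupt files with the highest counters, biggest first.
# $1 (optional): directory to scan, defaults to the real sysfs location.
# $2 (optional): how many entries to print, defaults to 5.
top_acpi_interrupts() {
    dir="${1:-/sys/firmware/acpi/interrupts}"
    limit="${2:-5}"
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        # The counter is the first whitespace-separated field of the file.
        count=$(awk 'NR == 1 { print $1 }' "$f")
        [ -n "$count" ] || continue
        printf '%s %s\n' "$count" "${f##*/}"
    done | sort -rn | head -n "$limit"
}

# Usage on a live system:
#   top_acpi_interrupts
```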

Great! Now we can see the gpe06 count is substantially bigger than the other ones. So let’s try to disable it by writing “disable” to the sysfs file (now we know that this will be handled by the counter_set method) and see the results on the htop utility.

# echo disable > /sys/firmware/acpi/interrupts/gpe06
htop after disabling gpe06
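If you prefer to see what changed rather than trusting the fan noise, you could wrap the write in a tiny helper that prints the file before and after. This is only a sketch: the function name disable_acpi_gpe is hypothetical, it must run as root on the real sysfs file, and it does nothing beyond the echo shown above.

```shell
# Write "disable" to an ACPI interrupt file and show its content before and after.
# $1: path to the sysfs file, e.g. /sys/firmware/acpi/interrupts/gpe06
disable_acpi_gpe() {
    gpe_file="$1"
    if [ ! -w "$gpe_file" ]; then
        echo "cannot write to $gpe_file (are you root?)" >&2
        return 1
    fi
    echo "before: $(cat "$gpe_file")"
    echo disable > "$gpe_file" || return 1
    echo "after:  $(cat "$gpe_file")"
}

# Usage on a live system (as root):
#   disable_acpi_gpe /sys/firmware/acpi/interrupts/gpe06
```

Since this only changes the state of the running kernel, expect to repeat it after a reboot.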

Conclusion

You made it! Congratulations!!!!! That wasn’t that hard, was it?

I hope that you found the process of figuring out the problem, by connecting the dots between the OS behaviour and the kernel source code, satisfying and, above all, fun.

Here is a list of additional references to enhance your understanding of the ACPI driver and of the underlying causes of the issue.

  1. This thread is an eye opener of why this problem exists :: https://gitlab.freedesktop.org/drm/intel/issues/30
  2. The ACPI specification (You don’t need to read all of 1192 pages) :: https://uefi.org/sites/default/files/resources/ACPI_6_2.pdf
  3. The kernel thread about this issue :: https://bugzilla.kernel.org/show_bug.cgi?id=117481
  4. Linux on the Mac — state of the union :: https://lwn.net/Articles/707616/

Bonus round

A cool thing about reading the source instead of just the posts made by other people is that you can find code comments that point you to a different approach to addressing (in our case, “addressing”) the issue.

An extremely intriguing comment about what kernel developers thought when creating the sysfs integration for acpi. https://github.com/torvalds/linux/blob/v5.3/drivers/acpi/sysfs.c#L806-L821

Having said that, instead of telling the kernel via sysfs that we want to disable the gpe06, we can specify it as a boot parameter :)

This Git commit is also an eye opener when it comes to how this can be achieved in the Linux kernel. https://github.com/torvalds/linux/commit/9c4aa1eecb48cfac18ed5e3aca9d9ae58fbafc11
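As far as I can tell, the boot parameter in question is acpi_mask_gpe, which takes the GPE number as a byte, so please treat the sketch below as an assumption and confirm it against the kernel-parameters documentation of your kernel version. The helper gpe_mask_boot_arg is a hypothetical name, and it only builds the argument string from the hex digits in the sysfs file name.

```shell
# Build the kernel command-line argument for masking a GPE at boot.
# $1: the hex digits from the sysfs file name, e.g. "06" for gpe06.
# Assumption: acpi_mask_gpe is the parameter meant here.
gpe_mask_boot_arg() {
    printf 'acpi_mask_gpe=0x%s\n' "$1"
}

# On Fedora, the argument could then be added to every installed kernel
# with grubby (as root):
#   grubby --update-kernel=ALL --args="$(gpe_mask_boot_arg 06)"
```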

I hope you have enjoyed it.

Paulo Almeida