Debugging by testing assumptions

Posted on February 23, 2019 by Tommy Ku

⚠

This post is more than 2 years old, it may contain outdated information

When troubleshooting an issue introduced not by a simple and apparent cause, it could take from from 10 minutes all the way to hours and days, leaving ourselves scratching our heads and nearly suffering an emotional breakdown. However, it needs not be the case had we had a healthy mindset when troubleshooting the issue.

It works on my machine — most common excuse used by programmers

At the beginning of troubleshooting, be an error message or something not working the way we expect it does, we can take this assumption, and start from here.

Ask yourself one question: What are the key differences between my system and the system I am troubleshooting on?

As we know, computer systems nowadays are constructed as layers upon layers of abstractions, so to troubleshoot we need to focus on each layers as if we are peeling an onion.

Starting from runtime level, the key differences may be:

versions of the runtime
versions of the dependencies
application state (e.g. uninitialized database, config files)

On layers below the application there are:

external dependencies (e.g. database, library, network and external APIs)
system level restrictions/configuration (e.g. firewall, RAM/CPU limitation, user permissions)

Then there are hardware, which is rather difficult to troubleshoot unless we have replacement parts to see which is working/not working.

We should test our assumptions that every layers of the system works, until we figured out which doesn’t. To test the assumptions, we perform test on separate layers to ensure that they work, and move on to the next layer.

We should test our assumptions that every layers of the system works, until we figured out which doesn’t.

Here comes a real-world example of me trying to trouble a mysterious issue in my program — not a bug because the issue ended up to be non-code related.

I upgraded my OS from Ubuntu 16.04 to Ubuntu 18.04. With it comes PHP 7.2 which is incompatible with the framework Laravel 5.5 that my application depended on. Luckily, there is an PPA ppa:ondrej/php that adds php7.0 and related packages to Ubuntu 18.

Now here comes a problem. My application looks at some .lock files stored in system /tmp folder to determine it’s state. That worked on my Ubuntu 16.04 setup and my development environment on Docker. However, it stopped working after the OS upgrade.

What gives?

My original thought was that the view cache was not updated, so I cleared the caches in Laravel by running php artisan cache:clear. Nope that doesn’t work. What about actually running the code? I fired up a PHP command prompt with php artisan tinker and ran my line var_dump(file_exists('/tmp/something.lock')) and sure enough I got bool(true). By that time I realized that I was stepping into the WTF land of troubleshooting.

It was supposed to be an one-hour deployment, now it has passed the 1.5 hour mark and I still hasn’t figured out what happened. One signature sign of you’re looking into the wrong problem is that there aren’t many StackOverflow questions about the topic because that means nobody was unfortunate enought to stumble upon the same problem.

Here is where we begin testing our assumptions. We have tested using the PHP interpreter to see that file_exists works. Now we need to move one layer up (or down depending on which way you look): Apache.

I will skip the dirty details and jump right to the conclusion. I did something similar with that php artisan tinker step by creating a separate PHP script inside my /var/www/html where my PHP pages reside.

Thing is, whenever I run file_exists by invoking a script from Apache, it fails to recognize the file inside /tmp, so I can tell that something is wrong with Apache. Further investigation by running scandir tells me that from Apache’s prespective, /tmp is an empty folder.

But why? systemd - why php can not see /tmp files - Unix & Linux Stack Exchange

When you ask the right question, it’s often easier to find the answer. And that’s why I believe troubleshooting requires testing the assumptions. By testing assumptions you can drill down layer by layer until you located where the culprit is. Once the defect has been identified, it’s much easier to figure out a solution.

Consider this approach a linear search — obviously there exists the binary search variant called half-splitting, which I eventually had to employ on another issue that was closely related to two pieces of hardware talking to each other.

You could also look at...

A self-hosting lifecycle
How is it like to self-host something? This post describes the process of self-hosting from the initial thought process, to patching and housekeeping, all the way to the end-of-life of a self-hosted service
Leaving EventXtra with 26 lessons learned
I switched from GNOME to i3wm
Ubuntu 18.04 ditched Unity in favor of GNOME as the default window manager, so I switched to i3wm to desperately save some pixels on my small laptop screen

This post first appeared on Tommy Ku's blog. Permalink: https://blog.tommyku.com/blog/debugging-by-testing-assumptions/

About the author

Tommy Ku, a Hong Kong-based Software Engineer experienced developing PHP and Java-based web solutions and passionate in Web technology.

Also a hobbyist digital and film photographer.