Background
Two nights ago, the online system suddenly broke down, and after immediately opening the online error log, I only got a bunch of meaningless program call stack (traceback) output, so the team members were caught in a long and blind problem solving process. The problem was fortunately solved, but I couldn’t figure out why the call stack printed in the log was meaningless when, as a rule of thumb, it should print the call stack during the exception generation process. After subsequent source code analysis and troubleshooting, I realized it was actually due to an old piece of code in the project using monkey patch, which is what this article is trying to discuss.
What is a monkey patch
Monkey patching is a type of programming used to modify (add, change, delete, etc.) the behavior of system software at runtime. There is a wide range of monkey patch applications in dynamic languages, such as Ruby’s open class feature that supports extending class definitions and even replacing method implementations at runtime, Python’s methods or functions that can be replaced at runtime, and other languages like JavaScript that can also apply monkey patches.
Monkey patching is a double-edged sword
Monkey patches are flexible enough to allow complete separation of the patch code from the application code, while allowing the application code to be invoked in a way that remains constant. From the perspective of the application code, it calls the original defined method or function of a module; from the perspective of the called method or function, the presence of the monkey patch is transparent to it, as shown in the following demo in Python.
Let’s start with a minimalist example and say hello to a wonderful world.
If the above script is executed, the result will be
This is easy. Next, let’s say we do a monkey patch: we expand the original greet
behavior and now print the current time in addition to the message.
|
|
Run it and get the following results.
We get the expected result!
From the code analysis, we added a new function greet_with_time
, which calls the original greet
function, then prints the current time, and finally replaces the greet
function by assigning the function to a variable. The final call to the greet
function does not require any changes, thus achieving the same call to the greet
function but with a very different behavior.
The above demo is just a simplified code for the sake of space, the monkey patch code in a real project is always in a separate module or file. Imagine a large, complex project where your code is flooded with monkey patches, it would be a disastrous challenge to analyze the behavior of the system and troubleshoot problems.
Now that we have some understanding of monkey patches, let’s look at an example I encountered in a real project.
A bunch of meaningless stack information
I reproduced locally the exception we encountered as I mentioned at the beginning, and here are the stack messages consistent with the online environment.
|
|
From this stack message, it actually prints the stack of the Logger.log
function called, in which there is no code at all to see the word .insert
, which has nothing to do with AttributeError("'long' object has no attribute 'insert'",)
. Such a stack of information is basically the same with and without. So, I went on to explore more through the editor and the source code.
The first thing I did was to use the stack above to analyze what went wrong, so I looked at the code at celery/worker/job.py:504-511
first.
|
|
The logger.log
method is called here (the source of the logger can be analyzed in Celery’s code, but is not the focus of this article, so I won’t expand on it) and two important messages are passed in via the context
object: exception
and traceback
. Further reading of the logger.log
source code confirmed that the core of this piece of log printing relies on a call to the traceback.print_exception
function.
So, I went back to the code at celery/worker/job.py:504-511
and inserted two codes that print error stack information before logger.log
.
|
|
After restarting celery and executing the asynchronous task, the first error stack is identical to the one I posted earlier, which is understandable since the call to the print_exception
function here is the core implementation in logger.log
. The call to format_exception
gives me the real meaningful error stack information.
|
|
Now it’s clear, this is where the exception in this code is really coming from!
But then the question arises, why do print_exce ption
and format_exception
give different stack information? I went to the official documentation with a lot of questions, but the confusion got worse.
traceback.format_exception(etype, value, tb[, limit]) Format a stack trace and the exception information. The arguments have the same meaning as the corresponding arguments to print_exception(). The return value is a list of strings, each ending in a newline and some containing internal newlines. When these lines are concatenated and printed, exactly the same text is printed as does print_exception().
Focus on the last sentence, the official Python documentation says that both functions output the same error stack (EXACTLY the same text)!
Spot the monkey patch
Actually, the real process of troubleshooting the problem took me a lot of time. I hadn’t been thinking about the monkey patch, but I finally found the culprit on the subway on my way out to a friend’s dinner date, when I looked at the project code on my company’s GitLab on my phone.
From the patch code, the patch directly overwrites the original code, and the implementation also directly and brutally ignores the several exception message parameters passed in! That’s why it’s such a big oops, with unrelated exception stack information! (╯‵□′)╯︵┻┻
Tips for troubleshooting monkey patches
There are certainly pros and cons to this type of programming technique, monkey patches, and using them necessarily requires extra caution. However, it is not necessary to stay away from them, the focus is on mastering the necessary troubleshooting skills, and I will go back to this lesson to find some methods that may help.
1. Check the method or function information through the function or method itself properties
As you know, all Python objects have a bunch of built-in properties, and functions are no exception.
A quick look shows that the real code of this function is actually the patch code of the project!
2. Inspecting with the inspect package
Python provides a lot of toolkits, and inspect is naturally one of them, and it can be used to do runtime checks on almost any type.
|
|
In short, if you encounter code behavior does not match the expected but can not correspond with the official documentation or official source code, then it is possible that the dependent method or function was monkey patched, and the quickest way to confirm the monkey patch is to check the actual definition of the function or method called in the first place, that is, to apply the above method!
Off-topic remarks
I’ve encountered the monkey patch trap when doing Ruby development, and there is a similar approach in Ruby.