Virtual Memory
The first thing we need to have a clear idea of what we normally call memory is that what we are talking about is actually virtual memory, not manipulation of real physical memory. On a bare metal machine without an operating system installed, the memory we are dealing with is physical memory. As computers evolved, a program called an “operating system” was installed on the computer, which took over all the hardware resources and provided us with a layer of abstraction. We, as software developers, only need to engage with the abstraction layer of the operating system, which simplifies many of the problems caused by hardware inconsistencies.
Virtual memory is the abstraction layer that the operating system provides to us for the actual physical memory. Most modern operating systems are multitasking systems, and each of these “tasks”, generally called processes, has its own separate process address space. In other words, each process looks at virtual memory as if it owns all the space, i.e. as if it were the only process on the entire system. If it wants to communicate with other processes, it can do so using sockets, for example. This brings us to the cleverness of the design of operating systems and computer networks. Sockets are designed as a way for two processes to communicate, whether they are two processes on the same machine or two processes on two machines in a network, using this same abstract concept to communicate, shielding a lot of concrete details.
Compilation models for C and C++
Here we focus on the C case. C++ follows the C model without major differences.
For the C compiler, it can compile a separate .c
file independently, which is produced as an .o
file. Finally we link them into an executable file by means of a linker.
And in the middle of this arises the question, if each .c
file is compiled separately, how does it know the information of other files? Let’s say a function in file A needs to call another function in file B.
To give a concrete example.
Here a function in bar.c
references a function in foo.c
, so how do we tell the compiler about the function foo
? The final design is to pretend we have this function .
And as more of this information is declared, we can put all of this information into a separate file and include
it in, forming the form of a header file.
ELF file format
The source file is compiled by the compiler to form an object file, which is a binary file with only 0’s and 1’s. Different operating systems have different formats for object files, and only the ELF format used on Linux will be discussed below.
The Object file we compile from a .c
file needs to have at least two parts of information in it.
- Code, which represents the binary sequence of instructions.
- Data, which is a binary sequence representing global data.
By design, we put information with the same attributes together in what is called a segment. And since the data in ELF files are binary, there is no way to distinguish them directly. To solve this, we separate out another area called the segment table, which records the mapping of each segment’s name to the area it occupies. In order to find the segment table, we use a fixed-length area at the beginning of an ELF file as the header, where each offset represents a fixed meaning. Using this, we can record the start position of the segment table in the header, so we can get all the information about the segments.
Here are some of the more important segments.
segment name | role |
---|---|
.text | code |
.data | Initialized global and local static variables |
.bss | Uninitialized global and local static variables |
.rodata | Read-only data, such as string constants and global const variables |
We can view the information with the objdump and readelf tools. As an example.
Compile with the following command.
|
|
View file header.
|
|
View information about each segment.
|
|
View the code segment.
|
|
Links and symbols
The essence of linking is to put together a single Object file. If the variables and functions in each Object file were independent, this would be a very simple process. However, in practice it is often the case that this file references a function in another file, and we must solve this problem. If a function in foo.c
calls an external function bar()
, the jmp
instruction we generate must point to the correct address.
So one of the main responsibilities of the connector is to determine the actual address of each symbol. For this purpose, each ELF file maintains a symbol table that represents a mapping of all symbols’ names and its related information (addresses, types, etc.).
As an example.
We check it for all symbols.
T
represents symbols defined in the code segment, D
represents symbols defined in the data segment, and U
represents undefined symbols that need to be processed by the linker later.
C++ Name Mangling
Since function overloading exists in C++, multiple functions can be renamed. If this is not handled, then symbolic conflicts can occur. For this, the C++ compiler adds the function arguments to the symbolic name as well. Different ABI’s have different rules. For the Itanium ABI used in Linux, the above example is compiled as C++ code with the following symbols.
The detailed rules can be found in the Itanium ABI documentation, where it is sufficient to know that name mangling exists. C++ also provides a way to link C++ code without mangle symbols in order to make C code link correctly.
Weak symbols and strong symbols
The types of symbols are classified as strong symbol and weak symbol. If there are two symbols with the same name, take the definition of strong symbol, multiple strong symbols with the same name will conflict and report an error, multiple weak symbols with the same name is undefined behavior.
The compiler defaults functions and initialized global variables to strong symbols, and uninitialized global variables to weak symbols; GCC also provides an extension to define any symbol as a weak symbol.
|
|
Weak symbols are useful for libraries. Weak symbols defined by libraries can be overridden by strong symbols defined by the user, allowing programs to use custom library functions.
For example, the definition of operator new in libcxx.
Users can provide their own operator new functions, overriding the self-contained library functions.
Static links
A static library can be seen as a collection of Object files. A static library can be created with the following command.
In fact, ar
itself is a tool for packaging multiple files together, nothing unusual.
The details of linking static libraries with other object files into executable files belong to the internal implementation of the connector, so we won’t discuss it too much here, but the general idea is to merge all the .o
files referenced in the static library (.a
) together, which is the reason why the statically linked program is so large.
Global objects in C++
We usually think of C and C++ programs as starting with main
, but before that we need to initialize the process environment, call the global object constructor, and many other things in order for the program to run properly. ELF also defines two special segments for constructing and destructing global objects.
.init
holds the initialization code for the process, and the constructor calls for the global object..fini
holds the termination code of the process, and the destructor calls of the global object.
Dynamic linking
The idea of static linking is simple and effective, but its biggest problem is that it is too large. For this reason, dynamic linking was created.
The basic idea of dynamic linking is that the composition of the link is pushed back from before the original program is loaded to when it is loaded.
Location-independent code
One of the main goals of shared libraries is to allow multiple processes to use the same library code, thus saving hard disk and memory. A simple idea is to pre-allocate a dedicated address space and always load the shared library into this location. However, this is not a good use of the address space and also tends to cause a lot of memory fragmentation. To solve these problems, modern operating systems use a technique called location-independent code.
Data References
The compiler creates a table at the beginning of the data segment called the Global Offset Table. This table is an array, and each table entry is the address of the referenced global data target (function or global variable). And because the distance between a data segment and a code segment is always a constant, we can find the corresponding table entry by using the instruction pointer (%rip) and an offset.
As an example.
The equivalent C code is as follows.
code reference
Suppose a program calls a function defined by a shared library. The compiler cannot predict the actual address of this function because it can be loaded to an arbitrary location. To solve this problem, we use a technique called Lazy Binding, which defers the binding of the address until the first call.
Lazy Binding is implemented through a combination of a Global Offset Table and a Procedurev Linkage Table. The Procedure Linkage Table PLT is also an array that acts like a proxy whose table entries are responsible for calling the actual called function, and the GOT, as mentioned above, holds the actual address of the called function (note that it needs to be resolved at runtime).
As an example.
- When a user program first calls a function
foo
that is defined in a shared library, it enters the correspondingPLT
table entry. ThisPLT
table entry is ajmp
instruction that jumps to the address defined in the correspondingGOT
table entry. - Since this is the first call, the
GOT
table entry is not the real address of the functionfoo
, but the next address of the previousjmp
instruction. The instruction at this address is apush
instruction, representing the ID of the functionfoo
, followed by anotherjmp
instruction, this time jumping to PLT[0], which is the call to the dynamic connector. - The dynamic connector finds the actual location of the function
foo
and rewrites its correspondingGOT
table entry. - From then on, all subsequent calls will jump to the corresponding location after step 1.
Symbol Visibility
According to the original linking model, all symbols in a shared library should be externally publicly accessible. But there are obviously many problems with this design, the loss of good encapsulation is only one, but most importantly, when the number of symbols is very large, as in some C++ libraries that abuse templates (yes I’m talking about boost), it can cause a significant increase in loading time.
The compiler provides a command line argument for this purpose, which makes all symbols invisible externally by default.
|
|
Then we can mark the symbols we need to export with one of the GCC provided extensions.
This reduces the size of the dynamic library (by reducing the symbol table) and reduces symbol collisions.