Regardless of the field, building is usually far harder than destroying. Constructing a building may take years; demolishing it, seconds. At the root of this is entropy: construction is an entropy-reducing process, so it requires an input of energy from outside the system. One class of creation problem seems even more special: the problem of origins. How did the first life come into being? How did mankind originate? Questions like these have never ceased to fascinate us.
Today we are going to explore process creation in Linux. Most of you probably know about fork, and know that processes divide like cells. But consider the following: What has to be prepared before the division can happen? In this so-called division, what exactly is split and what is inherited? How is the "genetic material" passed on? And how did the very first process come about?
Let’s explore these questions together!
What is a process
Definition and Background
Before discussing any issue we need to define it clearly, otherwise the discussion will seem pointless. Similarly, since we are talking about processes, we need to know what a process is first.
This is how processes are usually defined in textbooks: a process is an instance of a program in execution. You can think of a process as the collection of data structures that describe how far a program's execution has progressed.
This definition is simple, but it does not help much with understanding the underlying machinery. The process is really a concept that arrived with multiprogramming in operating systems. Early computers did not support multitasking: only one program could run at a time, which was far too inefficient. From the Intel 80286/80386 onwards, CPUs began to support protected mode, and operating systems began to support multitasking on top of it. In the single-program era the CPU and memory were all yours, to use as you pleased. Not so in the multiprogramming era: system resources (CPU, memory, etc.) must be shared, and without management the whole system descends into chaos.
So the concept of a process was abstracted. It serves as the entity to which system resources (CPU time, memory, etc.) are allocated, and it provides isolation and protection for access to shared resources.
Processes are like humans: they are created, have their own life cycle, and may be in any of several states. Likewise, there are kinship and non-kinship relationships between processes.
Process Descriptors
In order to manage processes, the kernel must describe the state of each process precisely. This is where the process descriptor comes in. task_struct is a very complex structure containing all the information related to a process. Even in the early days of Linux 0.11 it already contained many fields: the basic process state, memory-related information, file-system-related information, the local descriptor table ldt, and the task state segment tss. The ldt holds the descriptors of the process's code and data segments, and the tss records the CPU register values used when the process is switched.
The structure of tss_struct is as follows; as you can see, most of it is register state.
Note: The code and narrative in this article are based on Linux 0.11
Process 0
Process 0 is like Adam, the ancestor of all processes, and Linus plays God: he statically pre-set the data structures of process 0, then breathed life into it during the kernel initialization phase, and process 0 came alive.
As you can see below, the task_struct of process 0 is pre-set in the static variable init_task via the INIT_TASK macro. From this we can also see that the page holding the task_struct doubles as the kernel stack, with the stack pointer starting at the end of the page.
The macro INIT_TASK pre-sets the fields of the task_struct, including ldt and tss. You can work out the meaning of each initial value by referring to the task_struct fields shown earlier. A few special ones deserve mention:
- ldt is the local descriptor table. `0x9f, 0xc0fa00` and `0x9f, 0xc0f200` are the code segment and data segment descriptors of process 0, respectively, with base address 0x0, limit 640K, G=1, D=1, DPL=3, P=1, and TYPE 0x0a and 0x02. They cover the same memory as the kernel code and data segments, but at user privilege.
- tss is the task state segment:
  - The value of its second field, `PAGE_SIZE+(long)&init_task`, is the kernel stack pointer esp0, which points to the end of the page holding init_task. The ss0 that follows is initialized to 0x10, the kernel data segment selector, making it the kernel stack segment.
  - `(long)&pg_dir` initializes cr3 to the address of the page directory.
  - The next 10 registers, including the user-mode eip and esp, are initialized to 0.
  - All 6 segment registers are initialized to 0x17, the user-mode data segment selector.
  - `_LDT(0)` is the LDT selector of process 0.
Most of the process information has now been set statically, but process 0 is still lifeless. Let's see how the kernel breathes air into it to bring it to life.
The following is the kernel's main function. Note that main is not the entry point of the kernel image: the boot code runs first, and main is reached only after the system has entered protected mode and completed some initialization. At the very beginning the system uses a temporary stack; after entering protected mode the stack segment is set to the kernel data segment (0x10) and esp points to the end of the user_stack array, which is still the stack in use when main is first entered.
Part of the initialization of process 0 is done in the sched_init() function. First, the TSS and LDT descriptor entries for process 0 are set up in the global descriptor table GDT, then loaded into the tr and ldtr registers respectively. This is the only time the kernel loads the LDT descriptor explicitly; from then on, the CPU loads a process's LDT automatically, according to the LDT selector in its TSS, whenever a task switch occurs.
The next call, move_to_user_mode(), is the final breath of life. It accomplishes two historic tasks: moving from kernel mode to user mode, and starting process 0. It works by manually simulating an interrupt: specific values are pushed onto the stack.
- Push the user-mode stack segment selector: 0x17, the data/stack segment in the user-mode local table (previously ss was 0x10).
- Push the user-mode stack pointer, which inherits the value of esp just before move_to_user_mode executes.
- Push the status register eflags.
- Push the user-mode code segment selector: 0x0f, the code segment in the user-mode local table (previously cs was 0x08).
- Push the offset address, which is the location of label 1 below.
Finally, by executing iret, the system moves from kernel mode to user mode, and process 0 comes to life. Its kernel stack was set statically in the init_task structure, and its user stack segment and code segment were likewise set statically in the ldt of init_task; since their base address is 0, they still cover the kernel code and data, but the privilege level is now user level. The user-mode stack pointer inherits the earlier esp, so it still points to the original location, and eip was manually set to the instruction following iret. So after iret, execution simply continues.
The task of process 0 is simple: fork process 1, then spin in an endless loop. Process 0 runs only when no other process is runnable, which is why it is also called the idle process.
fork and system calls
Process 1 has a slightly special status: it is forked from process 0, so its code and data segments are the same as those of process 0. All subsequent processes are children of process 1, and they all exec a new program after being forked, so their code and data segments are no longer the kernel's code and data segments.
Next, let’s explore how fork is divided into two, starting with the fork system call definition.
_syscall0 is a macro; the kernel defines a family of such macros, where the trailing number indicates the number of arguments. Let's look at the definition of _syscall0.
These macros expand to functions built around inline assembly containing a single int $0x80 instruction, where 0x80 is the interrupt vector for system calls. There is one output operand and one input operand. The local variable __res is bound to the eax register as the output operand and receives the return value. The input operand __NR_fork is the system call number — every system call has its own number — and it is also bound to eax.
When the int instruction executes, the CPU looks up the corresponding interrupt descriptor in the IDT. Because the system call gate is installed as a system gate (a trap gate with descriptor privilege level DPL 3), it can be invoked from user mode. From the gate descriptor the CPU obtains the selector of the segment containing the interrupt routine and the offset within that segment, and jumps there. For system calls, that means jumping to the system_call: label below.
In this way the process enters kernel space through the system call. The CPU's interrupt sequence, triggered by the int instruction, automatically pushes the original user-mode stack segment ss, stack pointer esp, status register eflags, user code segment cs, and user eip onto the stack (note: the stack here is the kernel stack). Why does the CPU push these registers? Because it has to remember the way back: just as a function call pushes a return address, so must an interrupt. In addition, kernel mode and user mode use separate stacks, so the user-mode ss and esp must be saved as well. Thus, on first entering kernel space, the kernel stack pointer is at esp0 in the figure below.
The code at the beginning of system_call is common to all system calls. It first checks whether the system call number in eax is out of range; if it is valid, the segment registers and general-purpose registers are pushed onto the stack, then ds and es are reset to the kernel data segment and fs to the user data segment. Finally, the corresponding handler is looked up in the system call table by its number and invoked.
The kernel stack during the fork system call:
For the fork system call, control reaches sys_fork, at which point the kernel stack pointer is at esp1. The find_empty_process function merely obtains a free process number (note: not a pid); if none is free, it jumps straight to label 1 and returns. Then all the remaining programmable registers are pushed onto the stack, and finally the new process number is pushed.
Next, the core function copy_process is called; on entry the kernel stack pointer is at esp2. The function takes many arguments, which are in fact the registers pushed earlier. Because the C calling convention on x86 pushes arguments from right to left, nr, pushed last, becomes the first argument. So many arguments are needed because the parent must pass the values of all these registers on to the child, while most other information can be reached through the process descriptor. The function first allocates a free page to serve as the task_struct of the new process, then stores the pointer in the task array.
The next step is to initialize the process descriptor of the child process p. The first line below copies the parent's process descriptor wholesale into the child, so any field not explicitly modified afterwards defaults to the parent's value. But the child is, after all, a separate process, so the necessary fields are then adjusted: pid, parent, time slice, signals, timer, times, and so on.
The next step is to initialize the child's tss. Note that the kernel stack again points to the end of the page holding the task_struct: p->tss.esp0 = PAGE_SIZE + (long) p;. The kernel stack segment selector ss0 is 0x10, the kernel data segment. The remaining register values are copied from the parent, with one exception: eax. The child's eax is set to 0, whereas the parent's eax (the eventual return value) is last_pid. This eax value is precisely the final return value of fork, which is why fork returns 0 in the child and the child's pid in the parent.
p->tss.ldt = _LDT(nr) sets the LDT selector of the new process so that its LDT can be found when the process is switched in.
The copy_mem() function then copies the memory.
Looking at its concrete implementation, we can see that it does not copy the memory pages themselves, only the page tables, and only the page tables of the data segment. (All processes in Linux 0.11 share one page directory; the code segment and data segment of a process have the same base address, and the data segment is never shorter than the code segment, so copying the data segment's page tables covers both.)
(Linux 0.11 supports at most 64 simultaneous processes; each process's linear base address is its process number * 64MB, i.e. nr * 0x4000000 in the code above.)
So far only the ldt selector in the tss has been set; the LDT table itself is still a copy of the parent's, so the user code and data segment descriptors have not been updated. Here they are rewritten with the new base address.
This is only the first piece of the copy-on-write puzzle, not the complete copy-on-write mechanism, so I won't expand on it here for reasons of space. (Perhaps a separate article later.)
Let's go ahead and finish the rest of the copy_process() function.
The next step is to increment the reference counts of all files opened by the process and of the three i-nodes pwd, root, and executable. Then the child's TSS and LDT descriptor entries are installed in the GDT. Finally, the process state is set to runnable and the child's pid is returned.
The preceding described the way into the system call; what follows is the way back out. Note that the eax on the stack at this point holds fork's return value. After the specific system call handler finishes, scheduling may occur, so whether the parent or the child runs first after fork is not determined.
If the call came from user mode, signal handling is also performed on the way out, but that is not our focus today and will not be expanded here. Finally, all the registers pushed earlier are restored and iret returns to user space. At this point the kernel stack is back to its original clean state; cs/eip/ss/esp are restored to their values before the int instruction, and the process continues executing right after the user-mode fork().
Summary
To briefly review: creating a process is mainly a matter of initializing the process descriptor structure, including the ldt and tss inside it. Memory is "copied" by copying only the page tables, thanks to copy-on-write (in Linux 0.11 all processes share the page directory). The process's LDT and TSS descriptors are also installed in the GDT global descriptor table.
Process 0, the ancestor of all processes, is special: its process descriptor structure is set statically, its LDT and TSS descriptors are installed in the GDT during the initialization phase, and tr and ldtr are loaded manually. Finally it is moved to user mode by manually simulating an interrupt, which sets its stack pointer and eip.
Subsequent processes are created by fork, and most of the information in the process descriptor is copied directly from the parent. To keep the child's state consistent with the parent's, all the programmable registers are passed as arguments to copy_process. Still, as an independent process, the child differs from the parent in a few places, notably the kernel stack pointer, the local descriptor table LDT, and the return value in eax.
The questions raised at the beginning should all be answered now.