This article describes the boot process and the initialization process of the Linux kernel. What happens from the time you press power until you finally log in to the terminal?

Note: This article is based on Linux 0.11, and the kernel source code can be downloaded from here. 0.11 is simple enough, but the basic functionality of the kernel is close enough to the current version to make it a good choice for learning how the kernel works as a whole.

Pre-requisite knowledge

Before reading this article, it is recommended to have the most basic understanding of CPU protection mode, segmentation paging mechanism, GDT/LDT/IDT and other concepts. Here is only a brief introduction.

  • 32-bit protection mode: supports multitasking, supports 4GB physical memory, supports virtual memory, supports segmentation and paging mechanisms, and supports privilege level. The specific protection mechanisms provided can be found in 80386 Protection Mechanisms
  • Segmentation and paging mechanism: The protection mode provides a virtual memory mechanism where the segment base value, limit length and some other protection parameters are stored in the descriptor. Each process uses an independent virtual address. Linear addresses can be obtained by the base address of the segment descriptor, which can then be converted into actual physical addresses according to the page directory and page table of the paging mechanism.
  • GDT: Global descriptor table, which holds the data segment and code segment descriptors of the kernel, as well as the local table and status segment descriptors of each task. gdtr addresses can be located through the gdtr register.
  • IDT: Interrupt descriptor table, indicates the code information to be called when an interrupt occurs, somewhat similar to the interrupt vector table, but contains more information. idt address can be located through the idtr register.
  • LDT: Local descriptor table of the task, item 0 is not used, item 1 and 2 are the code segment and data stack segment descriptors of the task respectively. ldt address can be located through the ldtr register.
  • TSS: Task status segment, which mainly stores two types of information. One type is dynamic information updated when the task switches: e.g. general registers, segment registers, EFLAGS, EIP, TSS selector of the previous executed task. One type is static information level: LDT selector of the task, page directory base address register CR3 (PDBR), stack pointer of privilege level 0 to 2, debug trace bits, I/O bitmap base address. tss address can be located by tr register.
  • Descriptor table entry: 8 bytes, including the maximum length limit of segment (16 bits), linear base address of segment (32 bits), privilege level of segment, whether segment is in memory, read/write privilege, and some other flags for protected mode operation.
  • Segment selector: 2 bytes, bits 0 to 1 indicate the requested privilege level, Linux uses only two levels, 0 for system level 3 for user level, bit 2 is used for global (0) or local (1) descriptor table, bits 3 to 15 are the index of descriptor table entries.
  • Interrupt descriptor table entry: 8 bytes, called gate descriptors, 0 to 1 and 6 to 7 bytes are the in-segment offset of the segment where the handler is located, 2 to 3 bytes are the segment selector of the segment where the handler is located, and 4 to 5B are some flags. Contains 3 kinds of gate descriptors: interrupt gate DPL privilege level is 0, off interrupt, trap gate DPL is 0, on interrupt, system gate (actually also trap gate) however DPL is 3, on interrupt, so it can be called from user state.
  • Page table entry: 4 bytes, low 0 to 11 bits store some flags, such as bit 0 whether in memory, bit 1 read/write flag bit, bit 2 normal user or super user, bit 6 whether dirty page, high 12 to 31 bits are page frame address.

Disk boot (bootsect.s)

  1. When the computer boots up, the x86 CPU will automatically enter real mode and start executing the program at address 0xffff0, which is usually the ROM-BIOS address. The BIOS will perform hardware system detection and create an interrupt vector table at physical address 0, i.e., the entry address of the interrupt routine provided by the BIOS will be registered in the interrupt vector table.

    Memory address space allocation for 8086 CPU

    1
    2
    
    0x0                    0x9ffff 0xa0000     0xbffff 0xc0000  0xfffff
    |中断向量表 主储存器地址空间RAM   |     显存地址空间     | 各类ROM地址空间 |
    
  2. The BIOS then reads the first sector of the boot disk (the boot block, Linux 0.11 corresponding to the bootsect.s code) into memory at 0x7c00 and jumps there to continue execution

  3. From here on, the kernel takes over control of the CPU. The boot block is limited in size, so it does a simple job of moving the boot block itself to memory at 0x90000 and then reading the rest of the kernel program into memory. The four sectors corresponding to setup.s are read at 0x90200, i.e. immediately after the boot block. The main part of the kernel, system, is then read at 0x10000, at which point the memory is laid out as follows.

    1
    2
    
    0x0         0x7c00      0x10000            0x90000 0x90200      0xa0000
    |中断向量表| |引导块|     |system        |    |引导块  |setup.s|   |    
    
  4. After reading, jump between paragraphs to 0x90200 and start executing the setup.s code.

System setup (setup.s)

  • The main task of the setup.s program is to use ROM BIOS interrupts to get information about the system and put it into the appropriate location in memory (starting at 0x90000) to provide it for later use by the operating system in protected mode. This is because in protected mode, the operating system will set up its own interrupt descriptor table to handle interrupts and will no longer use the interrupt vector table in real mode.
  • The system information read includes display-related, memory information, hard disk information, root file system device number, etc. This information will be used by kernel-related programs.
  • The next step is to prepare for entering protected mode. After turning off interrupts, move the system module from 0x10000 to 0x8ffff (512KB) in its entirety to 0x00000. Here are two questions to think about: Why didn’t the previous bootsect.s read the system directly to 0x00000? Why do we need to turn off the interrupt here? The first question is because we still need to use the BIOS interrupts when reading the system module from disk, so we cannot overwrite the interrupt vector table at 0x0. The second problem is that since our system module is about to overwrite the interrupt vector table, if an interrupt occurs in real mode, we will get an invalid interrupt vector and the system will crash. Likewise, after entering protected mode, the interrupt descriptor entries may be invalid until the initialization is done.
  • Then it is load the interrupt descriptor table register (idtr) and global descriptor table register (gdtr), then turn on the A20 address line, program the two interrupt control chips 8259A, set its interrupt number to 0x20-0x2f. temporary idt table base address 0, limit length 0, temporary gdt table base address 512+gdt, limit length 2048, i.e. 256 items.
  • Finally, the lmsw instruction sets the CPU’s machine status word (also known as control register CR0), thus entering protected mode and jumping to the beginning of the code segment, i.e., the head.s at the very beginning of the system module to continue execution. (The value of the segment register in protected mode no longer indicates the segment’s base address, but rather the segment selector.)

The layout of the memory at this point is as follows.

1
2
3
0x0                             0x90000 0x90200
|head.s|main.c|kernel|mm|lib|...|系统参数|setup.s|临时gdt|...|
\----------system-----------/

The system is moved to the beginning of 0x0, and 0x90000 holds the system information (the previous boot block bootsect.s has been overwritten). Item 0 in the temporary gdt is not used, item 1 is the system code segment descriptor and item 2 is the system data segment descriptor, both of which have a base address of 0.

Startup program (head.s)

From head.s onwards, the kernel is now running in full protected mode. However, before starting the execution of the C code, there are some tasks that need to be done in assembly, mainly including

  1. reset each segment register to the corresponding segment selector (code segment 0x08, data segment 0x10);
  2. set the formal interrupt descriptor table idt, 256 items, initialize each gate descriptor to point to an error-only dumb interrupt program;
  3. set the formal global descriptor table gdt, 256 entries. Item 0 of the global table is NULL, item 1 is the system code segment descriptor, item 2 is the system data segment descriptor, and item 3 is the system segment not used, followed by the TSS descriptor and LDT descriptor reserved for each task. Each local descriptor table LDT contains three descriptors, the 0th one is not used, the 1st one is task code segment, and the 2nd one is task data/stack segment;
  4. Then set the stack segment bit kernel data segment (0x10), the stack pointer esp points to the end of the user_stack array, reserving 4KB space for stack use. (Previously, the top of stack location used temporarily in bootsect.s was at 0x9000:0xff00.);
  5. check if the A20 address line is really on;
  6. test whether the machine contains a math coprocessor chip, and set the corresponding flag bit in control register CR0;
  7. next is the paging mechanism related settings, the first 5 pages of memory from 0x0 to zero, set the first page for the page directory, the remaining 4 pages for the page table, and then the first 4 items in the page directory and all page table items point to the corresponding physical memory address, set the last 3 bits of the flag, indicating the page exists and the user can read and write ;
  8. Set the value of the page directory base address register cr3 to point to the page directory table. 9;
  9. Now you can start the paging mechanism by setting bit 31 of the control register cr0. 10;
  10. Finally, use the ret instruction to pop up the pre-stacked main function entry address and start running the main function;

Note: In Linux 0.11, all processes use the same page directory table, but each has its own page table. The maximum number of processes supported is 64, and each process has 64MB of virtual address space, which can be converted to a linear address by multiplying the logical address by the task number, which is exactly 4GB.

At this point the memory layout is as follows.

1
2
0x0                       0x5000
|页目录(4K)|pg0|pg1|pg2|pg3|软盘缓冲区(1K)|head.s部分|idt(2K)|gdt(2K)|main.c|

Kernel initialization (main.c)

The code of this program is not long, but includes all the work of kernel initialization, the main initialization process is as follows (at this time is still in the state of interrupt off):

  1. First set the root file system device number and memory global variables using the system parameters obtained by the setup.s program to set the system’s root file system device number and some memory global variables. These memory global variables specify the start address of the main memory, the memory capacity of the system, and the end address of the memory used as cache memory. Linux 0.11 kernel supports up to 16 MB of physical memory by default, which is distributed as follows.

    1
    2
    3
    
    内核模块 /    高速缓冲区          \ 虚拟盘        主内存区
    |        |    |显存和BIOS ROM|    |      |                      |
    0        end   640K          1MB  4MB    4.5MB                  16MB
    

    The cache is to be deducted from the video memory and BIOS ROM parts, and if a virtual disk (RAMDISK) is defined, the main memory is reduced appropriately. The kernel program can freely access the data in the cache, while the main memory is managed and allocated by the mm module through a paging mechanism.

  2. Then comes the initialization of each part of the system: memory, trapdoor, block device, character device, tty, boot time, scheduler (tr and ldtr with task 0 loaded, clock interrupt gate and system call system gate), buffers, hard disk, floppy drive.

    • ramdisk virtual disk: Determine rd processing function, memory start address and length, and zero out the entire virtual disk.
    • mem memory: initialize mem_map, main memory page is 0, rest is USED.
    • trap trapdoor: Set trapdoor and system gate. (Selector 0x0008 indicates system level, global table, item 2, i.e. kernel code segment cs.)
    • blk block device: initialize request queue
    • tty character device: includes rs_init and con_init. rs_init sets the serial port rs1 and rs2 interrupt gates, initializes serial ports rs1 and rs2 (rs232), allows IRQ3 and IRQ4 on the main 8259A chip. con_init initializes console interrupts, reads the information saved in setup.s, determines the current display type and sets all relevant parameters. Set the keyboard interrupt trapdoor and reset the keyboard.
    • time time: read time from CMOS, initialize startup time startup_time
    • sched scheduling: set the tss and ldt descriptors of task0 in GDT (init_task has been statically set), clear the task array and its descriptor table entries in GDT, clear the NT bit of the flag register (no task switching at iret), load the TSS of task0 into task register tr, load the ldt descriptor of task0 in GDT into ldtr selector of task 0 in GDT to ldtr (only explicitly load this time, subsequent tasks CPU automatically load according to the LDT entry in TSS). Initialize 8253 timer, set timer interrupt gate, set system call system gate.
    • buffer cache: initialize the cache, the low end of the front buffer is the buffer header structure, the high end is the buffer block (1KB each), and the first buffer header points to the last buffer block. The head of the free chain table points to the first buffer header, and the first buffer header and the last one are connected to form a ring chain. Initialize the hash table to be empty.
    • hd hard disk: set up hard disk request processing function, interrupt gate, and allow hard disk controller to send interrupt request signal.
    • floppy floppy drive: Set floppy request processing function, interrupt gate, and allow floppy controller to send interrupt request signal.
  3. OK, all initialization is done, interrupts on. (Interrupts are turned off from before the system module is moved in setup.s until here)

  4. next to do magic, move_to_user_mode() to complete two important historical tasks: one is start the first task (process 0), and the second is switch the privileged level to the user state (previously it has been working in the kernel state)

    we look at this key code.

     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    12
    13
    14
    
    #define move_to_user_mode() \
    __asm__ ("movl %%esp,%%eax\n\t" \
        "pushl $0x17\n\t" \
        "pushl %%eax\n\t" \
        "pushfl\n\t" \
        "pushl $0x0f\n\t" \
        "pushl $1f\n\t" \
        "iret\n" \
        "1:\tmovl $0x17,%%eax\n\t" \
        "movw %%ax,%%ds\n\t" \
        "movw %%ax,%%es\n\t" \
        "movw %%ax,%%fs\n\t" \
        "movw %%ax,%%gs" \
        :::"ax")
    

    The first few instructions manually simulate the stack situation during an interrupt, so let’s explain each one of them.

    1. save the stack pointer esp to eax;
    2. put the task0 stack segment selector (0x17) on the stack (previously ss was 0x10);
    3. stack the value of the stack pointer in eax;
    4. add the flag register to the stack;
    5. put the task0 segment selector (0x0f) on the stack (previously cs was 0x08);
    6. put the offset address of marker 1 on the stack;
    7. Next execute iret, which is a landmark moment. The offset address, segment selector, and flag register are off the stack, and the next jump is to task0 for execution. Since task0’s descriptor privilege level is 3 (i.e. user state), the stack pointer and stack segment selector that were on the stack before are also popped, i.e. the previous kernel stack is restored. This completes the manual startup of task 0, which is the ancestor of all processes, but it is special in that its data and code segments are mapped directly to the kernel code and data space, i.e., starting from 0 and limited to 640 KB. its kernel state stack is located at the end of the page where its task data structure is located, and the kernel state stack pointer is set manually in its initialized task data structure. Its user state stack is set by the first two instructions, esp still points to the original location, but ss has become 0x17, the data segment in the user state local table;
    8. the next four instructions are all set segment registers, pointing to the data segment of the local table. Where 0x17 is the data/stack segment selector for task0, the last two bits of 00010111 indicate a privilege level of 3, bit 2 indicates the local table, the high 15 bits indicate the index, and item 2 of the local table is the data segment and stack segment descriptor (0 is not used, 1 is the code segment). Similarly, 0x0f is the code segment selector;
  5. Task 0 own thing has been settled, the next natural thing is to start making children.

    1
    2
    3
    4
    
    if (!fork()) {      /* we count on this going ok */
        init();
    }
    for(;;) pause();
    

    This is the last few lines of the main function, which forks a child process (i.e. process1) to execute init() and process 0 loops through pause(). Process 0 is executed whenever the system has no other executable processes.

    It is worth mentioning that in order to ensure that the new process created by fork does not have the extra information of process 0, process 0 is required not to use the user state stack before creating the first new process, i.e. process 0 is required not to call functions, so here fork() and pause() are executed using the gcc function inline.

  6. Finally, let’s look at the init() function executed by process;

    1. It first executes the setup() function, which mainly initializes the hard disk partition, and finally executes mount_root() to install the root file system: initialize the superblock array, read the superblock and root i-node structure information on the root device, set the number of root i-node references, and use it as the i-node of the current working directory pwd and root directory root of process 1. Count and display the number of free blocks according to the logical block bitmap, and count and display the number of free i-nodes according to the i-node bitmap;
    2. read and write to open terminal console /dev/tty0 with file handle 0, and copy to generate handles 1 and 2;
    3. Next, fork process 2 and execute the /etc/rc script with the /bin/sh program. 4;
    4. If process 2 fails or ends, it enters a big dead loop: create a new process, close the legacy handle, create a session and set the process group number, open the terminal console /dev/tty0 with handles 0/1/2 read/write, and finally /bin/sh with execve(). At this point the user is ready to start operating the terminal;