Background
We usually encounter a lot of file system anomalies after a system crash. We usually fix them with the fsck tool, today we will learn what fsck does and how it does it.
Workload example
Suppose there is a workload that appends a single block of data to an existing file. The append is done by opening the file, calling lseek()
to move the file offset to the end of the file, and then issuing a single 4KB write to the file before closing it.
It is assumed that a standard simple file system structure is used on the disk, consisting of an inode bitmap (only 8 bits, one per inode), a databitmap (also 8 bits, one per data block), inodes (8 in total, numbered 0 to 7, spread over 4 blocks), and data blocks (8 in total, numbered 0 to 7). numbered 0 to 7). The following is a diagram of the file system.
Looking at the structure in the diagram, you can see that an inode is allocated (inode number 2), which is marked in the inode bitmap, and a single allocated data block (data block 4) is also marked in the data bitmap. inode is denoted as I [v1], as it is the first version of this inode. It will be updated soon (due to the above workload). Take a look at this simplified inode again. in I[v1], you can see
In this simplified inode, the file has a size of 1 (it has one block located in it), the first direct pointer points to block 4 (the first data block of the file, Da), and all the other 3 direct pointers are set to null (indicating that they are not used). Of course, the real inode has more fields.
When appending to a file, a new data block is added to it, so 3 structures on disk must be updated: the inode (which must point to the new block and has a larger size due to the append), the new data block Db and a new version of the data bitmap (called B[v2]) indicating that the new data block has been allocated. As a result, there are 3 blocks in the system’s memory that must be written to disk. The updated inode (inode version 2, or simply I [v2]) now looks like this
The updated data bitmap (B[v2]) now looks like this: 00001100. Finally, there is the data block (Db), which is just the content that the user puts into the file. Hopefully the final disk image of the file system will look like this
To achieve this transformation, the file system must perform 3 separate writes to the disk, for the inode (I[v2]), the bitmap (B[v2]) and the data block (Db). Note that these writes do not usually occur immediately when the user issues the write() system call. The dirty inode, bitmap and new data first exist in memory (page cache, page cache, or buffer cache, buffer cache) for a while.
Then, when the file system finally decides to write them to disk (say 5s or 30s), the file system issues the necessary write requests to disk. Unfortunately, crashes can occur which interfere with these updates to the disk. In particular, if a crash occurs after one or two of these writes have completed, rather than all 3, the file system may be in an abnormal state.
Crash scenario
Imagine that only one write succeeds. There are therefore 3 possible outcomes.
- Only the data block (Db) is written to disk. In this case, the data is on the disk, but there is no inode pointing to it and no bitmap indicating that the block has been allocated. Therefore, it is as if the write never happened. From the point of view of file system crash consistency, this situation is not a problem at all.
- Only the updated inode (I[v2]) is written to disk. In this case, the inode points to the disk address (5), where Db is about to be written, but Db has not yet been written. Therefore, if the pointer is trusted, then rubbish data (the old contents of disk address 5) will be read from the disk.
In addition, a new problem was encountered, which is called file-system inconsistency. The bitmap on the disk shows that block 5 has not been allocated yet, but the inode says it has. This disagreement in the file-system data structure is a file-system data structure inconsistency. To use the file system, this problem must be resolved in some way.
- Only the updated bitmap (B [v2]) is written to disk. In this case, the bitmap indicates that block 5 has been allocated, but there is no inode pointing to it. thus the file system is once again inconsistent. If not fixed, this write will lead to a space leak, as the file system will never use block 5.
There are 3 other crash scenarios in this 3 write attempts to disk. In these scenarios, two writes succeed and the last one fails.
- inode (I[v2]) and bitmap (B[v2]) were written to disk, but no data (Db) was written. In this case, the file system metadata is perfectly consistent: inode has a pointer to block 5, and the bitmap indicates that 5 is in use, so from the point of view of the file system metadata, everything looks fine. But there’s a problem: 5 is rubbish in 5 again.
- The inode (I[v2]) and the data block (Db) are written, but not the bitmap (B[v2]). In this case the inode points to the correct data on the disk, but again there is an inconsistency between the inode and the older version of the bitmap (B1). Therefore, the problem needs to be solved again before the file system can be used.
- The bitmap (B[v2]) and the data block (Db) were written, but not the inode (I[v2]). In this case, there is again an inconsistency between the inode and the data bitmap. However, even if the block is written and the bitmap indicates its use, it is not known which file it belongs to, as there is no inode pointing to the block.
From these crash scenarios, you can see the many problems that can occur with a disk filesystem image as a result of a crash. There are many problems: there may be inconsistencies in the file system data structure. There may be space leaks, rubbish data may be returned to the user, and so on.
The ideal approach is to move the file system from a consistent state (before the file is appended), atomically, to another state. (atomically) to another state (after inodes, bitmaps and new data blocks have been written to disk). Unfortunately, this is not easy to do, as the disk is only committed to one write at a time, and crashes or power failures can occur between these updates. Refer to this general problem as the crash-consistency problem, which can also be referred to as the consistent-update problem.
fsck
Early file systems took a simple approach to handling crash consistency. Basically, they decided to let inconsistencies happen and then fix them (on reboot). A typical example of this lazy approach can be found in one tool: fsck. fsck is a UNIX tool for finding these inconsistencies and fixing them. Similar tools for checking and repairing disk partitions exist on different systems. Please note that this approach will not solve all problems. For example, consider the case above, where the file system appears to be consistent, but the inode points to rubbish data. The only real goal is to ensure that the filesystem metadata is internally consistent.
The tool fsck runs in a number of stages, it runs before the filesystem is mounted and available (fsck assumes that no other filesystem activity is going on at the time of running). Once completed, the filesystem on disk should be consistent and therefore accessible to the user. The following is a basic summary of fsck.
- Superblocks: fsck first checks that the superblocks make sense, mainly by performing soundness checks, e.g. to ensure that the file system size is larger than the number of blocks allocated. Usually, the aim of these soundness checks is to find a suspicious (conflicting) superblock. In this case, the system (or administrator) can decide to use an alternate copy of the superblock.
- Free blocks: Next, fsck scans the inode, indirect blocks, double indirect blocks, etc. for blocks currently allocated in the file system. It uses this knowledge to generate the correct version of the allocation bitmap. Thus, if there is any inconsistency between the bitmap and the inode, it is resolved by trusting the information within the inode. The same type of check is performed on all inodes to ensure that all inodes that look like they are in use, are marked in the inode bitmap.
- inode status: Check each inode for corruption or other problems. For example, fsck ensures that each allocated inode has a valid type field (i.e. regular file, directory, symbolic link, etc.). If an inode field is faulty and not easily repairable, the inode is considered suspect and is cleared by fsck, and the inode bitmap is updated accordingly.
- inode links: fsck also verifies the link count for each allocated inode. As you may recall, the link count indicates the number of different directories that contain references (i.e. links) to this particular file. To verify the link count, fsck scans the entire directory tree, starting with the root directory, and constructs its own link count for each file and directory in the filesystem. If the newly calculated count does not match the count found in the inode, corrective action must be taken, usually by repairing the count in the inode. If an allocated inode is found but no directory refers to it, it is moved to the lost + found directory.
- Duplicates: fsck also checks for duplicate pointers, i.e. cases where two different inodes refer to the same block. If an inode is clearly bad, it may be cleared. Alternatively, the block pointed to can be copied, thus providing each inode with its own copy as required.
- Bad blocks: When scanning the list of all pointers, bad block pointers are also checked. A pointer is considered “bad” if it apparently points to a pointer that is outside its valid range, for example, if its address points to a block larger than the partition size. In this case, fsck cannot do anything too intelligent. It just removes (clears) the pointer from the inode or indirect block.
- Directory checking: fsck does not know the contents of user files. However, directories contain information in a specific format created by the file system itself. Therefore, fsck performs an additional integrity check on the contents of each directory, making sure that “.” and “…” are preceding entries, that each inode referenced in the directory entry is assigned, and that no directory is referenced more than once in the entire hierarchy.
Summary
As mentioned above, building fsck that works effectively requires complex knowledge of the filesystem. Ensuring that such code works properly in all cases can be challenging. However, fsck (and similar methods) have a larger and perhaps more fundamental problem: they are too slow.
For very large volumes, scanning the entire disk to find all allocated blocks and read the entire directory tree can take minutes or hours. As disk capacity grows and RAID becomes more popular, the performance of fsck becomes prohibitive. At a higher level, the basic premise of fsck seems a little unreasonable. Consider the example above, where only 3 blocks are written to the disk. It is very expensive to scan the entire disk and only fix the problems that occur during the update of 3 blocks.
This situation is similar to putting your keys on the floor of your bedroom and then searching every room, starting from the basement, to perform a “search the whole house for keys” recovery algorithm. It works, but it’s wasteful.