Basics of Linux FileSystem

Simplified introduction to various concepts of linux file system

Snehasish Roy
8 min readDec 10, 2023
Photo by Mr Cup / Fabien Barral on Unsplash

While working on creating a directory-like hierarchy for an application at work, I stumbled upon a series of questions that left me puzzled

How does Linux handle directories and files?

What happens when you try to read a 10GB of file? How does Linux maintain the list of blocks to read from?

How is Linux able to list down thousands of files associated with a directory in seconds? Why is an empty directory always 4 KB?

In this blog, I will try to answer these questions in a simplified manner.

How is a file stored on the Disk?

Image Source

Disk provides a block storage. It only deals with blocks and provides a facility to read and write blocks. If you ask it to read the contents of my cool movie.mkv it will get confused but if you ask it to return the content of the block number 20693, it will happily return you a series of 0's and 1's.

What is a block?

Block is the smallest unit of bytes that the disk can read/write in one go. If the block size is 4 KB, it can contain max 4 KB of data. If you modify a single bit of that block, the disk will need to update the entire block as it cannot modify anything smaller than the block size.

If the block size is too high, it can lead to wasted space because even if you want to write one byte, you will occupy one entire block. Too small a block size can also be problematic, as the data will now be spread across too many blocks, causing random seeks to be slow.

How to maintain a list of blocks associated with a file?

Enter FileSystem! A filesystem e.g. FAT, NTFS, ext4, and ZFS provides abstraction over the underlying disks — it maintains a list of blocks a file is associated with, so an end user can read a file seamlessly. How does it do that?

The simplest way would be to only assign a contiguous set of blocks to a file i.e. if you want to write a 4 MB of file, the filesystem will ensure that you only get 1024 contiguous blocks of 4 KB each. However, this approach isn’t practical because of fragmentation. The files can get modified which can increase its length. In that case, we would have to defragment (compact) the entire disk, and then proceed with the update — which can be time-consuming.

The second approach would be to use a linked list. A block will link to the next block of the same file. This approach suffers from slow performance due to random seek. If we want to jump to the 10th block of a file, we would have to read the first 9 blocks, which can be slow.

The third approach is to maintain a file allocation table. This strategy was used in the FAT file system.

When you initialize a filesystem, it creates a fixed-size index based on the number of blocks. When you create a file, that spans across multiple blocks, it keeps on updating the index of the current block to the next block.

This looks very similar to the linked list approach but the caveat is that this index is kept in memory. If you have to jump to the 100th block, you still need to traverse 99 array indices, but since it’s in memory, random accesses are much faster.

The fourth approach is maintaining a file indexing structure. If the file spans across 2000 blocks, the easiest way would be to keep track of all those 2000 blocks in a list-like data structure — but this would be very memory-expensive.

What if we could introduce some kind of tiered indexes? For small-sized files, we can maintain a 1:1 mapping of blocks but for bigger files, we can maintain a single level of indirection and for even larger files a double/triple layer of indirection can suffice.

In the diagram below, 10 blocks keeps a direct mapping of 10 data blocks i.e. for a file with size <= 40 KB (10 * 4 KB), we can perform random operation in one go.

For files larger than 40 KB, we have multiple levels of indirection. Assuming our block size is 4 KB, and a block number can be represented by an integer (32 bits i.e. 4 Bytes), a block can have 1024 such integers (4 KB / 4 Bytes). With a single level of indirection, we can keep track of 1024 other blocks = 1024 * 4 KB = 4096 KB = 4 MB.

As we add multiple levels of indirection, we can keep track of files with sizes of multiple GB’s — all with a fixed set of memory. Isn’t that cool !

Image Source

A file system maintains this information in Inode or Index Nodes. Every file has one unique Inode associated with it. Consider the below file of size 4.4 GB It's split across 9163680 such blocks and those blocks are mapped to the INode number 2098469.

P.S. Note that the file name is not associated with the Inode because we can have multiple names of the same file using hard links.

How to maintain directories?

Directories allow us to maintain files in a hierarchical structure. So far, we have understood how to represent files in a file system, so let’s deep dive into how we can model directories too.

In Linux, everything is a file. So directories can also be thought of as a file. So directories contain essentially a list of file names present under it.

When you create a new/empty directory, it still takes 4 KB of space — which is equivalent to the block size. As and when the files keep on increasing, so does the size of the directory itself. (Note that when I talk about the size of the directory, it does not mean the size of the files contained inside it).

In the below screenshot, I have created 256 0-byte files in the directory, post which the size of the directory increases to 20480 bytes = 5 blocks. Why? To maintain the list of file names present under it.

In the original Unix implementation, 16 bytes were reserved for one directory entry, 14 bytes for the file name and 2 bytes for the Inode number, resulting in a filesystem that supports file names up to 2¹⁴ (16,384). Now, this has changed and varies across file system implementations like ext4 and ZFS.

Image Source

How to trace pathnames?

With the internal representation of directories in place, let’s evaluate the scenario when you perform ls -l /home/movies/action/Interstellar.mkv

You start your search from the root directory. Go through the directory entries of the root directory present at Inode 20030 and see if any file name matches home. If yes, go to the Inode of 56969. Recursively, repeat the search until you find your target file which is present at Inode 60025. From the Inode, the file system can easily find the list of blocks the file is present in, and you can view the contents of that file.

In the above example, we performed a linear search i.e. we searched the directories one by one. But what if, the directories contain millions of files? In that case, the search will be extremely slow. To optimize for large directories, filesystems maintain an external index (similar to the one used by databases) to optimize the search e.g. ext4 uses Hash Tree implementation to maintain a directory index.

Bonus Section

If you are with me so far, thank you. Hope you learned something new today. In this section, I will list down random learnings that I found during the research of this article.

How to output to another terminal window?

In Linux, everything is a file, even the disk, sockets and processes. When you run a terminal, the input and output to it are also modeled as a file. Open two terminal windows, find their device ID using tty and use echo command to output to a remote terminal window. So cool!

How to find the list of free blocks and free Inodes?

Whenever you want to write data to a file, you need an available block? How to quickly find an available block?

Enter the humble BitMap. Filesystems commonly maintain a bitmap to track free blocks. Bitmaps consume only 1 bit for 1 block and hence can be easily kept in memory to perform quick operations.

ZFS — the cool kid in town

I am not going to write in detail about the ZFS but it’s the cool kid in town with a completely new architecture like inbuilt snapshots (the ability to version documents as and when they change), has an inbuilt cache layer (ARC which supposedly provides a better hit ratio than LRU cache — the default used in page cache), uses Copy on Write semantics and provides inbuilt checksums using Merkle Trees.

References

--

--

Snehasish Roy

Building core platform team at PhonePe, Ex Google, Ex Amazon