Introduction to Linux File Systems

This is mostly described in Stevens, Chapter 4. Read it. Now.

Layout of the file system:

                  Each physical drive can be divided into several partitions

                  Each partition can contain one file system

                  Each file system contains:

1.             boot block(s);

2.             superblock;

3.             inode list;

4.             data blocks.

                  A boot block may contain the bootstrap code that is read into the machine upon booting.

                  A superblock describes the state of the file system: its size, the number of inodes and data blocks, how many of each are free, and so on.

                  The inode list is an array of inodes ("index nodes", sometimes glossed as "information nodes"), loosely analogous to the FAT (File Allocation Table) system in MS-DOS.

                  Data blocks start at the end of the inode list and contain file data and directory blocks.

 

The term file system can mean the file system on a single disk partition, or it can mean the entire collection of devices on a system. In this second sense it is held together by the directory structure.

The directory "tree" usually spans many disks and/or partitions by means of mount points. For example, in Red Hat Linux, there are pre-defined mount points for floppy disks and CD-ROMs at /mnt/floppy and /mnt/cdrom. See also /etc/fstab and /etc/mtab.


Some Linux-supported File Systems

minix       is the filesystem used in the Minix operating system, the first to run under Linux. It has a number of shortcomings, mostly in being small.

ext           is an elaborate extension of the minix filesystem. It has been completely superseded by ext2.

ext2         is the disk filesystem used by Linux for both hard drives and floppies. ext2, designed as an extension to ext, has in its turn generated a successor, ext3.

ext3         adds journaling to ext2. Of the file systems listed here, it is generally credited with a good combination of performance (in terms of speed and CPU usage) and data security, the latter due to its journaling feature.

xiafs        was designed as a stable, safe file system by extending minix. It's no longer actively supported and is rarely used.

msdos      is the filesystem used by MS-DOS and Windows. msdos filenames are limited to the 8 + 3 form. It's especially good for floppies that you move back and forth between Linux and DOS or Windows machines.

umsdos    extends msdos by adding long filenames, ownership, permissions, and special files while remaining compatible with MS-DOS and Windows.

vfat         extends msdos to be compatible with Microsoft Windows' support for long filenames (a good choice for dual-boot).

proc        is a pseudo-filesystem which is used as an interface to kernel data. Its files do not use disk space. See proc(5).

iso9660   is a CD-ROM filesystem type conforming to the ISO 9660 standard. The Linux driver also supports High Sierra (the precursor to ISO 9660) and the Rock Ridge extensions.

nfs           is a network filesystem used to access remote disks.

smb         is a network filesystem used by Windows.

ncpfs       is a network filesystem that supports the NCP protocol, used by Novell NetWare.


Partition Structure

Boot Block(s)

Blocks on a Linux (and often a Unix) filesystem are typically 1024 bytes in length, but may be longer or shorter. The block size is normally a power of 2 (1024 is 2 to the 10th power). Some systems use 512 bytes (2 to the 9th), and 2048 and 4096 are also seen.

The first few blocks on any partition may hold a boot program, a short program for loading the kernel of the operating system and launching it. Often, on a Linux system, this is LILO or GRUB, either of which allows booting of multiple operating systems. It's quite simple (and common) to have a multiple boot environment for both Linux and a flavour of Windows.

Superblock

The boot blocks are followed by the superblock, which contains information about the geometry of the physical disk, the layout of the partition, number of inodes and data blocks, and much more.

Inode Blocks

Disk space allocation is managed by the inodes (index nodes, sometimes glossed as "information nodes"), which are created by the mkfs(8) (make filesystem) command. Inodes cannot be manipulated directly, but they are changed by many commands, such as touch(1) and cp(1), and by system calls, such as open(2) and unlink(2); they can be read with stat(2). Both chmod(1) and chmod(2) change access permissions.

Data Blocks

This is where the file data itself is stored. Since a directory is simply a specially formatted file, directories are also contained in the data blocks. An allocated data block can belong to one and only one file in the system. If a data block is not allocated to a file, it is free and available for the system to allocate when needed.

 


Structure of the super block

struct ext2_super_block {

__u32  s_inodes_count;        /* Inodes count */

__u32  s_blocks_count;        /* Blocks count */

__u32  s_r_blocks_count;      /* Reserved blocks count */

__u32  s_free_blocks_count;   /* Free blocks count */

__u32  s_free_inodes_count;   /* Free inodes count */

__u32  s_first_data_block;    /* First Data Block */

__u32  s_log_block_size;      /* Block size */

__s32  s_log_frag_size;       /* Fragment size */

__u32  s_blocks_per_group;    /* # Blocks per group */

__u32  s_frags_per_group;     /* # Fragments per group */

__u32  s_inodes_per_group;    /* # Inodes per group */

__u32  s_mtime;               /* Mount time */

__u32  s_wtime;               /* Write time */

__u16  s_mnt_count;           /* Mount count */

__s16  s_max_mnt_count;       /* Maximal mount count */

__u16  s_magic;               /* Magic signature */

__u16  s_state;               /* File system state */

__u16  s_errors;              /* Behaviour if error detected */

__u16  s_minor_rev_level;     /* minor revision level */

__u32  s_lastcheck;           /* time of last check */

__u32  s_checkinterval;       /* max. time between checks */

__u32  s_creator_os;          /* OS */

__u32  s_rev_level;           /* Revision level */

__u16  s_def_resuid;          /* Def uid for reserved blocks */

__u16  s_def_resgid;          /* Def gid for reserved blocks */

__u32  s_first_ino;           /* First non-reserved inode */

__u16  s_inode_size;          /* size of inode structure */

__u16  s_block_group_nr;      /* block grp # of this s'block */

__u32  s_feature_compat;      /* compatible feature set */

__u32  s_feature_incompat;    /* incompatible feature set */

__u32  s_feature_ro_compat;   /* readonly-compat feature set */

__u8   s_uuid[16];            /* 128-bit uuid for volume */

char   s_volume_name[16];     /* volume name */

char   s_last_mounted[64];    /* dir where last mounted */

[some compression stuff]

__u16  s_padding1;            /* Padding for alignment */

[some journaling stuff]

__u32  s_reserved[197];       /* Padding to end of the block */

};

 

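Most of this structure, which is defined fully in /usr/include/ext2fs/ext2_fs.h, is used only by the operating system and by file system utilities. Ordinary users and programs cannot read or modify it directly; doing so requires access to the raw device, which is normally restricted to root.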
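
For a privileged program that has read the raw superblock into memory, interpreting a few of the fields above could look roughly like the sketch below. It assumes the usual ext2 on-disk conventions: fields stored little-endian, the superblock starting 1024 bytes into the partition, a magic number of 0xEF53 in s_magic, and a block size of 1024 shifted left by s_log_block_size. The names sb_summary, le16, le32 and sb_decode are made up for this example and are not part of any header.

```c
#include <stdint.h>

#define EXT2_MAGIC 0xEF53u      /* expected value of s_magic */

/* Hypothetical summary type, for this sketch only. */
struct sb_summary {
    uint32_t inodes_count;      /* s_inodes_count */
    uint32_t blocks_count;      /* s_blocks_count */
    uint32_t block_size;        /* bytes, derived from s_log_block_size */
    uint16_t magic;             /* s_magic */
};

/* Read little-endian integers byte by byte, so host byte order does not matter. */
static uint16_t le16(const unsigned char *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t le32(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/*
 * sb points at the first byte of the superblock.  The byte offsets follow
 * from the field order in struct ext2_super_block: s_log_block_size is the
 * seventh 32-bit field (offset 24) and s_magic sits at offset 56.
 * Returns 0 on success, -1 if the buffer does not look like ext2.
 */
static int sb_decode(const unsigned char *sb, struct sb_summary *out)
{
    uint32_t log_size = le32(sb + 24);

    out->magic = le16(sb + 56);
    if (out->magic != EXT2_MAGIC || log_size > 20)  /* refuse absurd shifts */
        return -1;
    out->inodes_count = le32(sb + 0);
    out->blocks_count = le32(sb + 4);
    out->block_size = 1024u << log_size;
    return 0;
}
```

With s_log_block_size equal to 2, for instance, the block size works out to 4096 bytes.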

 


Structure of an inode on the disk

Each file (a unique collection of data blocks) has exactly one inode, which completely defines the file except for its name(s). The filenames are actually links in the directory structure to the inode for the file.

In your programs, the representation of the inode to use is struct stat, as filled in by stat(2). It can be seen in /usr/include/bits/stat.h or on page 73 in Stevens.

struct ext2_inode {

__u16  i_mode;      /* File mode */

__u16  i_uid;       /* Low bits of Uid */

__u32  i_size;      /* Size in bytes */

__u32  i_atime;     /* Access time */

__u32  i_ctime;     /* Inode change time */

__u32  i_mtime;     /* Modification time */

__u32  i_dtime;     /* Deletion Time */

__u16  i_gid;       /* Low bits of Gid */

__u16  i_links_count;   /* Links count */

__u32  i_blocks;    /* Blocks count */

__u32  i_flags;     /* File flags */

union {

   . . .

} osd1;             /* OS dependent 1 */

__u32  i_block[EXT2_N_BLOCKS];/* Data */

__u32  i_generation;/* Version (NFS) */

   [ACLs and stuff]

union {

   . . .

} osd2;             /* OS dependent 2 */

};


Directory entry

In the directory entry, use only the d_name field, for example when searching through a directory for a filename. Pass the filename (prefixed with the directory's path if it is not the current directory) to stat(2) or lstat(2) to get the struct stat information.

struct dirent {

long         d_ino;    /* don't use */

__kernel_off_t d_off; /* don't use */

unsigned short d_reclen; /* don't */

char         d_name[256];/* OK */

};
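
The advice above, take only d_name from the entry and ask lstat(2) for everything else, might be sketched as follows. The function name count_subdirs and the fixed 4096-byte path buffer are choices made for this example only.

```c
#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/*
 * Count the subdirectories of dirpath, not counting "." and "..".
 * Only d_name is taken from each directory entry; the file type comes
 * from lstat(2).  Returns -1 if the directory cannot be opened.
 */
static int count_subdirs(const char *dirpath)
{
    DIR *dp = opendir(dirpath);
    struct dirent *dirp;
    struct stat sb;
    char path[4096];
    int count = 0;

    if (dp == NULL)
        return -1;
    while ((dirp = readdir(dp)) != NULL) {
        int n;

        if (strcmp(dirp->d_name, ".") == 0 || strcmp(dirp->d_name, "..") == 0)
            continue;
        /* d_name is relative to dirpath, so build the full name first */
        n = snprintf(path, sizeof path, "%s/%s", dirpath, dirp->d_name);
        if (n < 0 || (size_t)n >= sizeof path)
            continue;               /* name too long for this sketch */
        if (lstat(path, &sb) == 0 && S_ISDIR(sb.st_mode))
            count++;
    }
    closedir(dp);
    return count;
}
```

Because lstat(2) is used rather than stat(2), a symbolic link that points at a directory is not counted as a directory.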

Special inode numbers

There are several pre-defined inodes with special purposes. Because of the way the superblock and inodes are defined, you do not need to know these unless you are creating or modifying filesystem software.

EXT2_BAD_INO           1  Bad blocks inode

EXT2_ROOT_INO          2  Root inode

EXT2_ACL_IDX_INO       3  ACL index inode

EXT2_ACL_DATA_INO      4  ACL data inode

EXT2_BOOT_LOADER_INO   5  Boot loader inode

EXT2_UNDEL_DIR_INO     6  Undelete dir inode

EXT2_RESIZE_INO        7  Res grp desc inode

EXT2_JOURNAL_INO       8  Journal inode

First non-reserved inode for ext2 filesystems

EXT2_GOOD_OLD_FIRST_INO    11


Inode Contents

Disk inodes contain the following information:

        last file modification time

        last file access time

        last inode modification time

Access Permissions

Directories

Two categories of Users


Linking files

In Linux and Unix, a data file is a bunch of data blocks on a disk, managed by an inode. Its name is stored only in a directory, and the same file can be named in many directories. This is the concept of linking, as discussed in Stevens sections 4.14 through 4.17. "Hard" links are made with the ln(1) command or the link(2) system call; "soft" (symbolic) links are made with ln -s or the symlink(2) system call.

Original file

System Prompt: ls -l

-rw-r--r-- 1 allisor staff 0 D/T abc

Create hard link

System Prompt: ln abc h-abc

System Prompt: ls -l

-rw-r--r-- 2 allisor staff 0 D/T abc

-rw-r--r-- 2 allisor staff 0 D/T h-abc

Create symbolic (soft) link

System Prompt: ln -s abc s-abc

System Prompt: ls -l

-rw-r--r-- 2 allisor staff 0 D/T abc

lrwxr-xr-x 1 allisor staff 3 D/T s-abc -> abc

-rw-r--r-- 2 allisor staff 0 D/T h-abc

Examine inodes

System Prompt: ls -i1

25263 abc

25263 h-abc

25265 s-abc

Remove original file

System Prompt: rm abc

System Prompt: ls -l

-rw-r--r-- 1 allisor staff 0 D/T h-abc

lrwxr-xr-x 1 allisor staff 3 D/T s-abc -> abc
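
The shell session above can be repeated with the system calls themselves. The sketch below does so in a scratch directory under /tmp and checks each observation: same inode for the hard link, link count of 2, a separate inode for the symbolic link, and a dangling symbolic link once abc is removed. The function name link_demo and the scratch directory name are invented for this example.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Returns 0 if every step behaved as in the session above, -1 otherwise. */
static int link_demo(void)
{
    char dir[] = "/tmp/linkdemoXXXXXX";
    char abc[64], habc[64], sabc[64];
    struct stat a, h, s;
    int fd, ok = -1;

    if (mkdtemp(dir) == NULL)
        return -1;
    snprintf(abc, sizeof abc, "%s/abc", dir);
    snprintf(habc, sizeof habc, "%s/h-abc", dir);
    snprintf(sabc, sizeof sabc, "%s/s-abc", dir);

    fd = open(abc, O_WRONLY | O_CREAT | O_EXCL, 0644);  /* original file */
    if (fd != -1) {
        close(fd);
        if (link(abc, habc) == 0 &&                 /* ln abc h-abc */
            symlink("abc", sabc) == 0 &&            /* ln -s abc s-abc */
            stat(abc, &a) == 0 && stat(habc, &h) == 0 &&
            lstat(sabc, &s) == 0 &&
            a.st_ino == h.st_ino &&                 /* hard link: same inode */
            a.st_nlink == 2 &&                      /* two names, one file */
            S_ISLNK(s.st_mode) &&
            s.st_ino != a.st_ino &&                 /* symlink: its own inode */
            unlink(abc) == 0 &&                     /* rm abc */
            stat(habc, &h) == 0 && h.st_nlink == 1 &&   /* file still there */
            stat(sabc, &s) == -1 && errno == ENOENT)    /* link now dangles */
            ok = 0;
    }
    /* clean up; calls for names that are already gone fail harmlessly */
    unlink(abc);
    unlink(habc);
    unlink(sabc);
    rmdir(dir);
    return ok;
}
```

Note that stat(2) follows the symbolic link and so fails once abc is gone, while lstat(2) on s-abc would still succeed.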


Directory names

Directory names from the current directory

/*
 * Modified from the Irix (SGI) man pages for the directory entry functions.
 * List the entry names from the current directory.
 */
#include <stdio.h>
#include <stdlib.h>
#include <dirent.h>

int main( void )
{
    DIR *dp;
    struct dirent *dirp;

    if ( ( dp = opendir( "." ) ) == NULL )
    {
        perror( "Can't open current dir" );   /* message, colon, reason */
        exit( EXIT_FAILURE );
    }
    while ( ( dirp = readdir( dp ) ) != NULL )
    {
        printf( "%s\n", dirp->d_name );
    }
    closedir( dp );
    return 0;                                 /* 0 reports success */
}


Directory names from argv

/*
 * Modified from Program 1.1 on page 4 in Stevens.
 * List the entry names from the directory name given as a command-line argument.
 */
#include <stdio.h>
#include <stdlib.h>
#include <dirent.h>

int main( int argc, char *argv[] )
{
    DIR *dp;
    struct dirent *dirp;

    if ( argc != 2 )
    {
        fprintf( stderr, "usage: %s dirname\n", argv[0] );
        exit( EXIT_FAILURE );
    }
    if ( ( dp = opendir( argv[1] ) ) == NULL )
    {
        perror( argv[1] );      /* prints the name, a colon, and the reason */
        exit( EXIT_FAILURE );
    }
    while ( ( dirp = readdir( dp ) ) != NULL )
    {
        printf( "%s\n", dirp->d_name );
    }
    closedir( dp );
    return 0;                   /* 0 reports success */
}


Some useful system calls

stat(2) Stevens, section 4.2, page 73

#include <sys/types.h>

#include <sys/stat.h>

#include <unistd.h>

 

int stat( const char *path, struct stat *buf);

int fstat( int fdes, struct stat *buf );

int lstat(const char *path, struct stat *buf);

Purpose: To retrieve the file statistics (inode info) for the file at path (stat), for the open(2) file fdes (fstat), or, when path names a symbolic link, for the link itself rather than the file it points to (lstat).

Returns: On success, returns 0 and fills buf with the inode information. On failure, returns -1 and sets errno to indicate the failure.

struct stat      [some padding removed]

{

__dev_t st_dev;     /* Device.  */

__ino_t st_ino;     /* File serial no. */

__mode_t st_mode;   /* File mode  */

__nlink_t st_nlink;/* Link count */

__uid_t st_uid;     /* file owner uid */

__gid_t st_gid;     /* file owner gid */

__dev_t st_rdev;    /* Dev no, if device */

__off_t st_size;    /* filesize in bytes */

__blksize_t st_blksize;/* Best blksize */

__blkcnt_t st_blocks;  /* Blocks alloc */

__time_t st_atime;  /* last access time */

__time_t st_mtime;  /* last mod time */

__time_t st_ctime;  /* last change time */

};


access(2) Stevens, section 4.7, page 82

#include <unistd.h>

 

int access(const char *pathname, int mode);

Purpose: To see if the process would be allowed to read, write, or execute the file named in pathname, or to test whether the file exists. The check is made with the process's real (not effective) user and group IDs. If it's a symbolic link, permissions of the linked file are tested. mode is either F_OK (existence only) or a mask consisting of one or more of R_OK, W_OK, and X_OK.

Returns: If all requested permissions are granted 0 is returned. On error (at least one permission is denied, or some other error occurred), -1 is returned, and errno is set appropriately.
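
A rough sketch of how the flags might be used: test for existence first, then test each permission separately and collect the ones that are granted. The function name access_mask is made up for this example.

```c
#include <unistd.h>

/*
 * Return -1 if pathname does not exist (or cannot be reached); otherwise
 * return the OR of whichever of R_OK, W_OK and X_OK the process is granted.
 */
static int access_mask(const char *pathname)
{
    int granted = 0;

    if (access(pathname, F_OK) == -1)
        return -1;
    if (access(pathname, R_OK) == 0)
        granted |= R_OK;
    if (access(pathname, W_OK) == 0)
        granted |= W_OK;
    if (access(pathname, X_OK) == 0)
        granted |= X_OK;
    return granted;
}
```

Keep in mind that the answer can be out of date by the time you act on it; a later open(2) can still fail and must be checked.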

chmod(2) Stevens, section 4.9, page 85

#include <sys/types.h>

#include <sys/stat.h>

 

int chmod(const char *pathname, mode_t mode );

int fchmod( int fdes, mode_t mode );

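Purpose: Sets the permissions of the file named by pathname, or of the open(2) file identified by fdes. If you write mode as an octal constant in C, it must begin with a zero (for example, 0644); otherwise the compiler reads it as decimal. Pre-defined flags such as S_IRUSR and S_IWUSR are also available and can be ORed together.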

Returns: On success, 0. On failure, -1 and sets errno to indicate the failure.
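
To see both ways of writing mode, the sketch below creates a scratch file, sets its permissions, and reads them back with stat(2). The function name chmod_roundtrip and the scratch file name are invented for this example.

```c
#include <stdlib.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Create a scratch file, chmod(2) it to mode, and return the permission
 * bits that stat(2) reports afterwards, or -1 on any failure.
 */
static int chmod_roundtrip(mode_t mode)
{
    char path[] = "/tmp/chmoddemoXXXXXX";
    struct stat sb;
    int result = -1;
    int fd = mkstemp(path);         /* creates the file and opens it */

    if (fd == -1)
        return -1;
    close(fd);
    if (chmod(path, mode) == 0 && stat(path, &sb) == 0)
        result = (int)(sb.st_mode & 07777);   /* drop the file-type bits */
    unlink(path);
    return result;
}
```

chmod_roundtrip(0640) and chmod_roundtrip(S_IRUSR | S_IWUSR | S_IRGRP) both give back 0640; unlike open(2), chmod(2) is not affected by the umask.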

unlink(2) Stevens, section 4.15, page 95

#include <unistd.h>

 

int unlink( const char *pathname );

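Purpose: Removes the directory entry named by pathname, that is, deletes one link between a file name and an inode, and decrements the inode's link count. The file's data blocks are freed only when the link count reaches zero and no process still has the file open.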

Returns: On success, 0. On failure, -1 and sets errno to indicate the failure.
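
Because unlink(2) removes a name rather than the data, a file that is still open stays usable after its last name is gone. The sketch below tries to show this under that assumption; the function name unlink_open_file_demo and the scratch file name are invented, and some network file systems may behave differently.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Returns 0 if, after the only name of an open file was unlinked, the name
 * was gone but the data could still be read through the open descriptor.
 */
static int unlink_open_file_demo(void)
{
    char path[] = "/tmp/unlinkdemoXXXXXX";
    char buf[6] = "";
    struct stat sb;
    int ok = -1;
    int fd = mkstemp(path);             /* creates the file and opens it */

    if (fd == -1)
        return -1;
    if (write(fd, "hello", 5) == 5 &&
        unlink(path) == 0 &&                        /* remove the only name */
        stat(path, &sb) == -1 && errno == ENOENT && /* the name is gone ... */
        lseek(fd, 0, SEEK_SET) == 0 &&
        read(fd, buf, 5) == 5 &&                    /* ... the data is not */
        strcmp(buf, "hello") == 0)
        ok = 0;
    close(fd);                  /* now the blocks can really be freed */
    unlink(path);               /* harmless if the name is already gone */
    return ok;
}
```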


perror(3) Stevens, section 1.7, page 14

#include <stdio.h>

 

void perror(const char *s);

Purpose: perror() translates error codes into human-readable form, writing a description of the latest error from a system function to stderr. The argument string is printed first, then a colon and a blank, then the system error text and a newline. The argument should carry diagnostic information such as the name of the failing function and its location.

The error number is taken from extern errno, set when errors occur but not cleared. Save errno if you plan to use it and the failing call is not immediately followed by a call to perror(3).
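
A minimal sketch of the save-errno advice, with the diagnostic naming the failing function. The function name try_open is invented for this example.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Try to open path read-only.  On success return 0.  On failure report
 * the error with perror(3) and return the errno value that open(2) set.
 */
static int try_open(const char *path)
{
    char msg[256];
    int saved;
    int fd = open(path, O_RDONLY);

    if (fd == -1) {
        saved = errno;      /* save it before any other call can change it */
        snprintf(msg, sizeof msg, "try_open: open(%s)", path);
        errno = saved;      /* restore, in case snprintf disturbed it */
        perror(msg);        /* msg, colon-blank, system error text */
        return saved;
    }
    close(fd);
    return 0;
}
```

For a path that does not exist, try_open returns ENOENT and the line written to stderr ends with the text listed for ENOENT in the table below.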

Some errno values (see /usr/include/asm/errno.h and other variations of errno.h in /usr/include for more):

ENOENT       No such file or directory

ENOEXEC      Exec format error

ECHILD       No child processes

ENOMEM       Out of memory

EACCES       Permission denied

EEXIST       File exists

ENOTDIR      Not a directory

EISDIR       Is a directory

EINVAL       Invalid argument

EPIPE        Broken pipe

ENOTEMPTY    Directory not empty

ENOSPC       No space left on device

ESPIPE       Illegal seek

EROFS        Read-only file system

EMLINK       Too many links

EDOM         Math argument out of domain

ERANGE       Math result not representable


Some Time and Date Routines

All require #include <time.h>, and are described in Stevens section 6.9 on page 155. They work with the various time fields, including those in the inodes, and all are based on time_t and the count of seconds from the Epoch: 00:00:00 UTC 1 January 1970.

You can convert a time in time_t to a broken-down time in the tm structure with gmtime(3) or localtime(3), and convert a struct tm time to time_t with mktime(3). Either form can be converted to a standard 26-byte string (the 26 bytes include the trailing newline and the end-of-string) by asctime(3) for a struct tm or ctime(3) for a time_t, or you can put a broken-down time into a string using your own format with strftime(3). You can get the current time with time(2).

char *asctime(const struct tm *timeptr);

char *ctime(const time_t *timep);

struct tm *gmtime(const time_t *timep);

struct tm *localtime(const time_t *timep);

time_t mktime(struct tm *timeptr);

 

struct tm

{

int    tm_sec;      /* seconds */

int    tm_min;      /* minutes */

int    tm_hour;     /* hours */

int    tm_mday;     /* day of the month, 1-31 */

int    tm_mon;      /* month, 0-11 (January is 0) */

int    tm_year;     /* years since 1900 */

int    tm_wday;     /* day of the week, 0-6 (Sunday is 0) */

int    tm_yday;     /* day in the year, 0-365 */

int    tm_isdst;    /* daylight saving */

};
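
The conversions described above might be combined as in the following sketch: time_t to struct tm with gmtime(3), struct tm to text with strftime(3), and a round trip through localtime(3) and mktime(3). The function names format_utc and roundtrip_local and the chosen format string are assumptions of this example.

```c
#include <stddef.h>
#include <time.h>

/*
 * Write t into buf as "YYYY-MM-DD HH:MM:SS" in UTC.  Returns the number
 * of characters written (not counting the end-of-string), or 0 on failure.
 */
static size_t format_utc(time_t t, char *buf, size_t size)
{
    struct tm *tmp = gmtime(&t);    /* time_t -> broken-down time */

    if (tmp == NULL)
        return 0;
    return strftime(buf, size, "%Y-%m-%d %H:%M:%S", tmp);
}

/* Convert t to local broken-down time and back again with mktime(3). */
static time_t roundtrip_local(time_t t)
{
    struct tm *tmp = localtime(&t);
    struct tm copy;

    if (tmp == NULL)
        return (time_t)-1;
    copy = *tmp;        /* localtime returns a pointer to static storage */
    return mktime(&copy);
}
```

format_utc(0, buf, sizeof buf) produces "1970-01-01 00:00:00", the Epoch itself, and roundtrip_local() should hand back the value it was given.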

If time_t is a 32-bit signed int, what is the last time it can represent, counting from the Epoch?
If it's an unsigned integer? What if it has a 64-bit value?
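
One way to check your answers is to hand the largest possible count of seconds to gmtime(3). The sketch below assumes it is compiled on a system whose own time_t is 64 bits wide, so that both 32-bit limits fit; the function name last_second_utc is invented for this example.

```c
#include <stdint.h>
#include <stdio.h>
#include <time.h>

/*
 * Format the UTC date and time that lies max_seconds after the Epoch.
 * Returns 0 on success, -1 if this system's time_t or struct tm cannot
 * represent the value.
 */
static int last_second_utc(int64_t max_seconds, char *buf, size_t size)
{
    time_t t = (time_t)max_seconds;
    struct tm *tmp;

    if ((int64_t)t != max_seconds)
        return -1;              /* time_t is too narrow on this system */
    tmp = gmtime(&t);
    if (tmp == NULL)
        return -1;              /* year does not fit in tm_year */
    if (strftime(buf, size, "%Y-%m-%d %H:%M:%S UTC", tmp) == 0)
        return -1;
    return 0;
}
```

Passing INT32_MAX gives 2038-01-19 03:14:07 UTC, and passing UINT32_MAX gives 2106-02-07 06:28:15 UTC. For a signed 64-bit count the limit is roughly 292 billion years away, far beyond what tm_year can hold, so the function is expected to return -1 for INT64_MAX.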