2024-01-18发表2024-06-11更新MIT6.S081 / xv6book43 分钟读完 (大约6499个字)

MIT6.S081 xv6book chapter8

这八章讲述了xv6的文件系统，这个文件系统的实现很简单，有许多可以优化的地方，但也有很多复杂的地方。从磁盘组织、缓存、日志、Inode、目录、文件名与文件描述符。其实可以简单分为三部分：

底层存储（磁盘组织、磁盘缓存、Inode、Directory）
持久层（日志事务）
用户层（文件名、文件描述符）

本节融合了lec14和lec15的内容，收获颇丰。

前言

文件系统的有趣性：

抽象：如何对硬件的抽象
崩溃恢复——文件系统的持久性
文件组织——如何在磁盘上排布文件系统
性能——读取磁盘通常比较慢，如何取得性能的提升

文件系统分层：

最底层是磁盘，也就是一些实际保存数据的存储设备，正是这些设备提供了持久化存储。
在这之上是buffer cache或者说block cache，这些cache可以避免频繁的读写磁盘。这里我们将磁盘中的数据保存在了内存中。
为了保证持久性，再往上通常会有一个logging层。

longging层之上有inode cache，这主要是为了同步（synchronization）。
再往上就是inode本身了。它实现了read/write。
再往上，就是文件名，和文件描述符操作。

文件究竟维护了怎样的数据结构呢？核心的数据结构就是inode和file descriptor。后者主要与用户进程进行交互。

最重要的就是Inode，它代表了一个文件。文件名只是一个link，Inode之间通过编号进行区分。

inode必须有一个link count来跟踪指向这个inode的文件名的数量。一个文件（inode）只能在link count为0的时候被删除。
实际中还有一个openfd count，也就是当前打开了文件的文件描述符计数。一个文件只能在这两个计数器都为0的时候才能被删除。
文件描述符必然自己悄悄维护了对于文件的offset。

磁盘组织

屏蔽不同磁盘的区别，我们可以将磁盘看成block或setor的数组，每个block大小固定，这样我们就能通过块号读取一个块的内容。

block0要么没有用，要么被用作boot sector来启动操作系统。
block1通常被称为super block，它描述了文件系统。它可能包含磁盘上有多少个block共同构成了文件系统这样的信息。
在XV6中，log从block2开始，到block32结束。实际上log的大小可能不同，这里在super block中会定义log就是30个block。
接下来在block32到block45之间，XV6存储了inode。我之前说过多个inode会打包存在一个block中，一个inode是64字节。
之后是bitmap block，这是我们构建文件系统的默认方法，它只占据一个block。它记录了数据block是否空闲。
之后就全是数据block了，数据block存储了文件的内容和目录的内容。

On-disk inode structure

Inode代表一个文件，那Inode究竟是怎么组织的呢？

Inode有两种形式，一种是在磁盘上，一种是在内存上。稍后将会解释为什么会有两种Inode模型，以及这两种Inode之间的同步。

// On-disk inode structure
struct dinode {
  short type;           // File type， files, directories, and special files (devices) or zero
  short major;          // Major device number (T_DEVICE only)
  short minor;          // Minor device number (T_DEVICE only)
  short nlink;          // Number of links to inode in file system
  uint size;            // Size of file (bytes)
  uint addrs[NDIRECT+1];   // Data block addresses #define NDIRECT 12
};

The type field distinguishes between files, directories, and special files (devices). A type of zero indicates that an on- disk inode is free.
The nlink field counts the number of directory entries that refer to this inode, in order to recognize when the on-disk inode and its data blocks should be freed.
The addrs array records the block numbers of the disk blocks holding the file’s content.The last entry in the addrs array gives the address of the indirect block.

dinode 最令人不解的通常就是addrs数组了，它其实类似多级页表。首先，addrs数组存储了所有文件内容所在块的块号。它分为两级索引，第一级索引是直接映射，意思是我得到的块号就是文件内容真实的块号。第二级索引是addrs[12]的内容，它得到块号指向一个数据块，这个数据块才真正指向了文件内容块号。所以在xv6中，文件大小最多是（12 + 256）* block size 这么大。

当dinode的type字段为T_DIR时，表明这个dinode是目录。目录的数据内容存的是目录条目，包含2字节的inode编号和14字节的文件名。

struct dirent {
  ushort inum; 
  char name[DIRSIZ]; // #define DIRSIZ 14
};

看完On-disk inode structure我们就知道了，文件内容是放在data区，Inode用来索引这些文件，并区分文件类型。这是通常的设计，文件的元信息和文件内容分离存储。

in-memory copy of an inode

// in-memory copy of an inode
struct inode {
  uint dev;           // Device number
  uint inum;          // Inode number
  int ref;            // Reference count 指向inode的指针个数
  struct sleeplock lock; // protects everything below here
  int valid;          // inode has been read from disk?

  short type;         // copy of disk inode
  short major;
  short minor;
  short nlink;  //指向inode的目录项个数
  uint size;
  uint addrs[NDIRECT+1];
};

struct inode is the in-memory copy of a struct dinode on disk. The kernel stores an inode in memory only if there are C pointers referring to that inode

为什么会有in-memory copy of an inode 的存在？其实就是方便操作系统管理inode，提供更多关于inode的信息，比如引用计数等。

Buffer Cache

文件

创建文件的过程其实就是创建inode，然后写inode的内容，最后同步到磁盘。通过inode代表文件。

// Allocate an inode on device dev.
// Mark it as allocated by giving it type.
// Returns an unlocked but allocated and referenced inode.
struct inode* ialloc(uint dev, short type) {
  int inum;
  struct buf *bp;
  struct dinode *dip;

  for(inum = 1; inum < sb.ninodes; inum++){ // sb == super block
    bp = bread(dev, IBLOCK(inum, sb)); // 读取缓存
    dip = (struct dinode*)bp->data + inum%IPB; // 地址+偏移
    if(dip->type == 0){  // a free inode
      memset(dip, 0, sizeof(*dip));
      dip->type = type;
      log_write(bp);   // mark it allocated on the disk
      brelse(bp);
      return iget(dev, inum);
    }
    brelse(bp);
  }
  panic("ialloc: no inodes");
}

superblock记录了inode的数量，遍历所有创建好的inode，然后查看是否是空闲的inode（其type字段为0），然后使用它。

重点看这两行：

1 2	bp = bread(dev, IBLOCK(inum, sb)); dip = (struct dinode*)bp->data + inum%IPB;

// Inodes per block.
#define IPB           (BSIZE / sizeof(struct dinode))
// Block containing inode i
#define IBLOCK(i, sb)     ((i) / IPB + sb.inodestart)

通过bread函数，我们读取了特定设备dev上的第IBLOCK(inum, sb)个block的内容到buf上。通过宏定义我们能够知道IBLOCK(inum, sb)就是第inum 个 inode所在的块号，inum%IPB则是inode在块内的偏移量。

bread函数将读取磁盘的过程作了一层封装，我们再看bread函数:

// Return a locked buf with the contents of the indicated block.
struct buf* bread(uint dev, uint blockno){
  struct buf *b;

  b = bget(dev, blockno);
  if(!b->valid) { // 如果没在cache中找到，就读取磁盘
    virtio_disk_rw(b, 0);
    b->valid = 1;
  }
  return b;
}

可以看到bread其实是先调用bget

// Look through buffer cache for block on device dev.
// If not found, allocate a buffer.
// In either case, return locked buffer.
static struct buf* bget(uint dev, uint blockno){
  struct buf *b;
  acquire(&bcache.lock);

  // Is the block already cached?
  for(b = bcache.head.next; b != &bcache.head; b = b->next){
    if(b->dev == dev && b->blockno == blockno){
      b->refcnt++; // 增加引用计数
      release(&bcache.lock);
      acquiresleep(&b->lock);
      return b; // returns the locked buffer
    }
  }

  // Not cached. 回收一个buf
  // Recycle the least recently used (LRU) unused buffer.
  for(b = bcache.head.prev; b != &bcache.head; b = b->prev){
    if(b->refcnt == 0) {
      b->dev = dev;
      b->blockno = blockno;
      b->valid = 0;
      b->refcnt = 1;
      release(&bcache.lock);
      acquiresleep(&b->lock);
      return b;
    }
  }
  panic("bget: no buffers");
}

bget读取的是bcache（缓存），如果缓存中没有，返回一个加锁的buf（containing a copy of a block which can be read or modified in memory)，如果buf失效，最后才从磁盘读。

bcache

struct buf {
  int valid;   // has data been read from disk?
  int disk;    // does disk "own" buf?
  uint dev;
  uint blockno;
  struct sleeplock lock;
  uint refcnt;
  struct buf *prev; // LRU cache list
  struct buf *next;
  uchar data[BSIZE];
};
struct {
  struct spinlock lock;
  struct buf buf[NBUF]; // 30
  struct buf head;
} bcache;

bcache就是个大小为30的LRU链表，链表的元素为buf。buf具有引用计数，表明有多少个线程正在使用它。blockno块号的内容就存在data里。

The buffer cache has two jobs: (1) synchronize access to disk blocks to ensure that only one copy of a block is in memory and that only one kernel thread at a time uses that copy; (2) cache popular blocks so that they don’t need to be re-read from the slow disk.

buffer cache作为缓存能有效的减少访问磁盘的时间，同时buffer cache必须保证一个磁盘block之对应一个buf，一个buf只能被一个线程使用，否则就会出问题。

如果buffer cache中有两份block 33的cache将会出现问题。假设一个进程要更新inode19，另一个进程要更新inode20。如果它们都在处理block 33的cache，并且cache有两份，那么第一个进程可能持有一份cache并先将inode19写回到磁盘中，而另一个进程持有另一份cache会将inode20写回到磁盘中，并将inode19的更新覆盖掉。所以一个block只能在buffer cache中出现一次。

从bget返回加锁的buf这点来看，xv6并不支持多个线程同时读写一块block。第二个获取相同block的线程需要等待第一个线程释放，acquiresleep确保线程能够睡眠。

Logging Layer

logging的目标：实现原子的系统调用，快速恢复和高性能。

持久化的基本思想

xv6 系统调用不直接写入磁盘上的文件系统数据结构。相反，它会在磁盘上的日志中放置它希望进行的所有磁盘写入的描述。一旦系统调用记录了其所有写入操作，它就会将一条特殊的提交记录写入磁盘，指示该日志包含完整的操作。此时，系统调用将写入复制到磁盘文件系统数据结构中。完成这些写入后，系统调用将擦除磁盘上的日志。

简单来说：就是系统调用先写log区，log区写完后写一条提交记录。提交记录写完，再将log区的block搬回到data区，搬运完之后擦除磁盘的日志。

遵循write ahead rule，也就是说在写入commit记录之前，你需要确保所有的写操作都在log中。

日志设计

// Contents of the header block, used for both the on-disk header block
// and to keep track in memory of logged block# before commit.
struct logheader {
  int n; //  log blocks
  int block[LOGSIZE];
};

struct log {
  struct spinlock lock;
  int start;       // log信息在磁盘上的位置（开始的block块的索引号)
  int size;        // log区的总的block块的数目。
  int outstanding; // 当前正在使用LOG机制的文件系统调用数目(目的是别超过了LOG系统总容量)
  int committing;  // 当前是不是正处于LOG的提交中,也就是正在写LOG进入磁盘呢
  int dev;
  struct logheader lh;  // 磁盘logheader在内存中的一份映射
};
struct log log;

log主要由两部分组成，log header和log block。header里有一个数组追踪每个log block。

日志基本调用流程

begin_op(); // 第一步
...
bp = bread(...);
bp->data[...] = ...;
log_write(bp); // 中间步骤，记录写的buf块
...
end_op(); //最后一部

begin_op表示开启一次日志记录，这里暂时不揭示内容。

bread读缓存然后对data进行修改，此时需要调用log_write来记录修改到log。让我们来看看：

// Caller has modified b->data and is done with the buffer.
// Record the block number and pin in the cache by increasing refcnt.
// commit()/write_log() will do the disk write.
void log_write(struct buf *b){
  int i;

  acquire(&log.lock);
  if (log.lh.n >= LOGSIZE || log.lh.n >= log.size - 1)
    panic("too big a transaction");
  if (log.outstanding < 1)
    panic("log_write outside of trans");

  for (i = 0; i < log.lh.n; i++) {
    if (log.lh.block[i] == b->blockno)   // log absorption
      break;
  }
  log.lh.block[i] = b->blockno;
  if (i == log.lh.n) {  // Add new block to log?
    bpin(b); 
    log.lh.n++;
  }
  release(&log.lock);
}

当在内存中修改了一个磁盘数据后，如果想从缓存真正写入的磁盘上，有bwirte()函数。但是，xv6支持了log机制，使用log_write()代替了bwrite()函数来完成这个工作。该函数其实只完成：在logheader中记录要写入的block块。(如果已经记录过，就无需再次添加，称之为log absorbtion)。这里注意，还调用bpin让这个buf的引用计数增一，避免这个脏块在回写前被bget给回收掉。

真正的写磁盘的动作在end_op()函数中来完成。

void end_op(void) {
  int do_commit = 0;

  acquire(&log.lock);
  log.outstanding -= 1;
  if(log.committing)
    panic("log.committing");
  if(log.outstanding == 0){
    do_commit = 1;
    log.committing = 1;
  } else {
    // begin_op() may be waiting for log space,
    // and decrementing log.outstanding has decreased
    // the amount of reserved space.
    wakeup(&log);
  }
  release(&log.lock);

  if(do_commit){
    // call commit w/o holding locks, since not allowed
    // to sleep with locks.
    commit();
    acquire(&log.lock);
    log.committing = 0;
    wakeup(&log);
    release(&log.lock);
  }
}

代码最前端有一些复杂情况处理，直接跳过，看commit函数。

static void commit(){
  if (log.lh.n > 0) {
    write_log();     // Write modified blocks from cache to log
    write_head();    // Write header to disk -- the real commit
    install_trans(0); // Now install writes to home locations
    log.lh.n = 0;
    write_head();    // Erase the transaction from the log
  }
}
static void write_log(int dev){
    int tail;

    for (tail = 0; tail < log[dev].lh.n; tail++) {
        struct buf *to = bread(dev, log[dev].start+tail+1); // log block
        struct buf *from = bread(dev, log[dev].lh.block[tail]); // cache block
        memmove(to->data, from->data, BSIZE);
        bwrite(to);  // write the log
        brelse(from);
        brelse(to);
    }
}
static void write_head(int dev) {
    struct buf *buf = bread(dev, log[dev].start);
    struct logheader *hb = (struct logheader *) (buf->data);
    int i;
    hb->n = log[dev].lh.n;
    for (i = 0; i < log[dev].lh.n; i++) {
        hb->block[i] = log[dev].lh.block[i];
    }
    bwrite(buf);
    brelse(buf);
}

write_log将缓冲块写入到磁盘的日志区，完成脏块的持久化。

write_head则将logheader进行持久化。

install_trans负责使用日志覆盖掉对应的盘块。该方法读取logheader.blocks数组，将每个对应的日志区的盘块安装到文件区。

当成功用日志覆盖掉对应的盘块后，系统已经无需再继续持有这些日志，因此会再次调用write_head修改日志区的logheader来清除日志，并设定log[dev].lh.n = 0。

日志挑战

第一个挑战是bcache不能撤回已经在日志transaction中的buf，这么做会破坏原子性。所以有一个buf_pin操作会将buf固定在cache中（实际上就是增加一个引用计数）

第二个挑战是文件系统操作受限于log大小。这里意思是说xv6的log区大小为30，这就限制一次系统调用最多写的block数量。如果写入的block数超过了30，那么一个写操作会被分割成多个小一些的写操作。

int max = ((MAXOPBLOCKS-1-1-2) / 2) * BSIZE;
int i = 0;
while(i < n){
  int n1 = n - i;
  if(n1 > max)
    n1 = max;

第三个挑战就是并发操作。

log的并发还挺有意思的，除了加锁保护外，还限制于log的空间大小。

比如，现在有两个并发的transaction，其中t0在log的前半部分，t1在log的后半部分，可是用完了log空间但一个transcation都没完成。这样就陷入了死锁，任何一个transaction 都期待另一个提交释放空间，但一个都不能提交。

所以说还要保证多个并发transaction也适配log的大小。当我们还没有完成一个文件系统操作时，我们必须在确保可能写入的总的log数小于log区域的大小的前提下，才允许另一个文件系统操作开始。

XV6通过限制并发文件系统操作的个数来实现这一点。在begin_op中，我们会检查当前有多少个文件系统操作正在进行。如果有太多正在进行的文件系统操作，我们会通过sleep停止当前文件系统操作的运行，并等待所有其他所有的文件系统操作都执行完并commit之后再唤醒。这里的其他所有文件系统操作都会一起commit。有的时候这被称为group commit，因为这里将多个操作像一个大的transaction一样提交了，这里的多个操作要么全部发生了，要么全部没有发生。

// called at the start of each FS system call.
void begin_op(void){
  acquire(&log.lock);
  while(1){
    if(log.committing){
      sleep(&log, &log.lock);
    } else if(log.lh.n + (log.outstanding+1)*MAXOPBLOCKS > LOGSIZE){
      // this op might exhaust log space; wait for commit.
      sleep(&log, &log.lock);
    } else {
      log.outstanding += 1;
      release(&log.lock);
      break;
    }
  }
}

首先，如果log正在commit过程中，那么就等到log提交完成，因为我们不能在install log的过程中写log；其次，如果当前操作是允许并发的操作个数的后一个，那么当前操作可能会超过log区域的大小，我们也需要sleep并等待所有之前的操作结束；最后，如果当前操作可以继续执行，需要将log的outstanding字段加1，最后再退出函数并执行文件系统操作。

void end_op(void) {
  int do_commit = 0;

  acquire(&log.lock);
  log.outstanding -= 1;
  if(log.committing)
    panic("log.committing");
  if(log.outstanding == 0){
    do_commit = 1;
    log.committing = 1;
  } else {
    wakeup(&log);
  }
  release(&log.lock);
  ...
}

最后再看end_op的前半部分。在最开始首先会对log的outstanding字段减1，因为一个transaction正在结束；其次检查committing状态，当前不可能在committing状态，所以如果是的话会触发panic；如果当前操作是整个并发操作的最后一个的话（log.outstanding == 0），接下来立刻就会执行commit；如果当前操作不是整个并发操作的最后一个的话，我们需要唤醒在begin_op中sleep的操作，让它们检查是不是能运行。

所以，即使是XV6中这样一个简单的文件系统，也有一些复杂性和挑战。

崩溃恢复过程

崩溃恢复的过程也很简单，只需要读log里未提交的日志就行。那什么是真正的提交节点呢？在commit函数中已经指示了，write head就是那么一个节点。

static void commit(){
  if (log.lh.n > 0) {
    write_log();     // Write modified blocks from cache to log
    write_head();    // Write header to disk -- the real commit
    install_trans(0); // Now install writes to home locations
    log.lh.n = 0;
    write_head();    // Erase the transaction from the log
  }
}

write_head将n不为0的head写入log区。下次开机启动时，扫描log区的head，查看其n字段是否为0.

static void recover_from_log(void){
  read_head();
  install_trans(1); // if committed, copy from log to disk
  log.lh.n = 0;
  write_head(); // clear the log
}

Path name 命名空间

我们知道，inode代表文件，但是为了便于使用，我们还管理一个命令空间。命名空间是以”/“为起点的树，每个目录下有”.”代表自身，有”..”代表上一级目录。怎么实现呢？

一般来说，根目录是固定在磁盘上某个块上的，比如1号块。遍历路径逻辑就是从根目录开始读取目录条目，然后逐个查找。

1 2	if(*path == '/') ip = iget(ROOTDEV, ROOTINO);

xv6没有对目录查找进行优化，只是简单的线性查找。

static struct inode* namex(char *path, int nameiparent, char *name){
  struct inode *ip, *next;

  if(*path == '/')
    ip = iget(ROOTDEV, ROOTINO);
  else
    ip = idup(myproc()->cwd);

  while((path = skipelem(path, name)) != 0){// skiplem相当于将path分为两部分，name是前一个部分，path是后一部分
    ilock(ip);
    if(ip->type != T_DIR){
      iunlockput(ip);
      return 0;
    }
    if(nameiparent && *path == '\0'){
      // Stop one level early.
      iunlock(ip);
      return ip;
    }
    if((next = dirlookup(ip, name, 0)) == 0){
      iunlockput(ip);
      return 0;
    }
    iunlockput(ip); // avoid 死锁
    ip = next;  // 通过name目录找到当前的path
  }
  if(nameiparent){
    iput(ip);
    return 0;
  }
  return ip;
}

File descriptor layer

Unix 界面的一个很酷的方面是 Unix 中的大多数资源都表示为文件，包括控制台、管道等设备，当然还有真实文件。文件描述符层是实现这种通用性的层。

具体来说，就是用一个file结构体来包裹文件、管道、设备。

struct file { //万物皆文件，就是这里了
  enum { FD_NONE, FD_PIPE, FD_INODE, FD_DEVICE } type;
  int ref; // reference count
  char readable;
  char writable;
  struct pipe *pipe; // FD_PIPE
  struct inode *ip;  // FD_INODE and FD_DEVICE
  uint off;          // FD_INODE, io offset
  short major;       // FD_DEVICE
};

进程有一个打开文件的数组，数组下标就是文件描述符。

struct proc {
	...
  struct file *ofile[NOFILE];  // Open files
  ...
};

当我们使用文件描述符进行读写时，实际上是通过文件描述符作为索引得到file结构体，再通过file结构体进行相应type的读写。

ftable

// file.c
struct file*    filealloc(void);
void            fileclose(struct file*);
struct file*    filedup(struct file*);
void            fileinit(void);
int             fileread(struct file*, uint64, int n);
int             filestat(struct file*, uint64 addr);
int             filewrite(struct file*, uint64, int n);

xv6对于file结构体的使用，也采用了一个池化的思想。

struct {
  struct spinlock lock;
  struct file file[NFILE];
} ftable;

// Allocate a file structure.
struct file* filealloc(void){
  struct file *f;

  acquire(&ftable.lock);
  for(f = ftable.file; f < ftable.file + NFILE; f++){
    if(f->ref == 0){
      f->ref = 1;
      release(&ftable.lock);
      return f;
    }
  }
  release(&ftable.lock);
  return 0;
}

让我们看看打开文件的过程：open系统调用，如果传入的是 O_CREATE 标志，说明要新创建文件。

uint64 sys_open(void){
  char path[MAXPATH];
  int fd, omode;
  struct file *f;
  struct inode *ip;
  int n;

  if((n = argstr(0, path, MAXPATH)) < 0 || argint(1, &omode) < 0)
    return -1;

  begin_op();

  if(omode & O_CREATE){
    ip = create(path, T_FILE, 0, 0);
    if(ip == 0){
      end_op();
      return -1;
    }
  } else { //否则就通过name打开相应的inode
    if((ip = namei(path)) == 0){
      end_op();
      return -1;
    }
    ilock(ip);
    if(ip->type == T_DIR && omode != O_RDONLY){
      iunlockput(ip);
      end_op();
      return -1;
    }
  }

  if(ip->type == T_DEVICE && (ip->major < 0 || ip->major >= NDEV)){
    iunlockput(ip);
    end_op();
    return -1;
  }

  if((f = filealloc()) == 0 || (fd = fdalloc(f)) < 0){
    if(f) // 这里可能分配的file结构体，所以还需要释放它
      fileclose(f);
    iunlockput(ip);
    end_op();
    return -1;
  }

  if(ip->type == T_DEVICE){
    f->type = FD_DEVICE;
    f->major = ip->major;
  } else {
    f->type = FD_INODE;
    f->off = 0;
  }
  f->ip = ip;
  f->readable = !(omode & O_WRONLY);
  f->writable = (omode & O_WRONLY) || (omode & O_RDWR);

  if((omode & O_TRUNC) && ip->type == T_FILE){
    itrunc(ip);
  }

  iunlock(ip);
  end_op();

  return fd;
}

再看看create函数，其实就是找到父级目录，然后分配inode结构体，增加目录项。

//creates a new name for a new inode
static struct inode*
create(char *path, short type, short major, short minor){
  struct inode *ip, *dp;
  char name[DIRSIZ];
  // find path‘s parent，return inode
  if((dp = nameiparent(path, name)) == 0)
    return 0;

  ilock(dp);
 // look in directory first，file类型在dp中找到name是合理的
  if((ip = dirlookup(dp, name, 0)) != 0){ // Look for a directory entry in a directory.
    iunlockput(dp);
    ilock(ip);
    if(type == T_FILE && (ip->type == T_FILE || ip->type == T_DEVICE))
      return ip;
    iunlockput(ip);
    return 0;
  }
 // then create it
  if((ip = ialloc(dp->dev, type)) == 0)
    panic("create: ialloc");

  ilock(ip);
  ip->major = major;
  ip->minor = minor;
  ip->nlink = 1;
  iupdate(ip);

  if(type == T_DIR){  // Create . and .. entries.
    dp->nlink++;  // for ".."
    iupdate(dp);
    // No ip->nlink++ for ".": avoid cyclic ref count.
    if(dirlink(ip, ".", ip->inum) < 0 || dirlink(ip, "..", dp->inum) < 0)
      panic("create dots");
  }

  if(dirlink(dp, name, ip->inum) < 0)
    panic("create: dirlink");

  iunlockput(dp);

  return ip;
}

再回到sys_open，这里才是分配fd的地方：先分配file结构体，然后分配fd文件描述符。

1	if((f = filealloc()) == 0 \|\| (fd = fdalloc(f)) < 0){

分配文件描述符的过程就是遍历文件的openfiletable，找到未使用的下标。这样我们就能理解一个进程打开的文件描述符是有限的这句话。

// Allocate a file descriptor for the given file.
// Takes over file reference from caller on success.
static int fdalloc(struct file *f){
  int fd;
  struct proc *p = myproc(); // 遍历进程的open file table，find an unused slot

  for(fd = 0; fd < NOFILE; fd++){
    if(p->ofile[fd] == 0){
      p->ofile[fd] = f;
      return fd;
    }
  }
  return -1;
}

总结

通过这一章的学习，xv6的架构基本已经完成。我们学习了文件块在磁盘上的组织，读取磁盘块时的LRU缓存，为文件系统操作系统持久化的日志块，为用户提供友好的命名空间以及文件描述符，这都是unix很精髓的一部分。

文件系统的学习，主要有以下几个思想吧：

文件系统的设计（磁盘块具体存储分布、Inode文件、file统一文件）
缓存池化思想（bcache、itable、ftable）
日志持久化思想（transaction）

当然，文件系统还需应对并发处理，这里讲得内容很少。其中一个点是磁盘读取时用了sleeplock而不是spinlock，这是因为磁盘读取通常是一个持续时间较长的操作。更多关于加锁控制，比如ilock加锁，namex不加锁等等有具体的原因，这点也需要继续深入。

同时xv6的很多实现都可以继续优化，比如目录项查找以及日志性能等。这些下一章会讲。

推荐阅读：https://www.cnblogs.com/KatyuMarisaBlog/p/14385792.html

MIT6.S081 xv6book chapter8

https://xyz.desirer233.fun/2024/01/18/MIT6.S081/book/chapter8/

作者

Desirer

发布于

2024-01-18

更新于

2024-06-11

MIT6.S081 xv6book chapter8

前言

磁盘组织

On-disk inode structure

in-memory copy of an inode

Buffer Cache

文件

bcache

Logging Layer

持久化的基本思想

日志设计

日志基本调用流程

日志挑战

崩溃恢复过程

Path name 命名空间

File descriptor layer

ftable

总结

作者

发布于

更新于

许可协议

目录

分类