The Linux file system modules
Nezer J. Zaidenberg
Administration
  In the 29.1 recitation I will publish the ex. 1 and ex. 2 questions, and the ex. 2 solution.
  Students who have not yet submitted ex. 2 must do so before 29.1.
  All students who submitted homework must schedule an oral exam before 29.1, or they will fail the homework!
  Students who cannot meet the 29.1 deadline for a good reason should inform me. We will work something out.
Administration 2
  You should submit ex. 3 before the test, or request an extension before the test.
  If you do not request an extension (by sending me an email with your team members' IDs), we will publish your final grade after the exam.
  Please send the requests to [email protected]
  We will not accept requests after the exam, and if you have sent a request you must still submit the exercise.
Administration 3
  A review session (shiur hazara) will be held one day before the exam, at noon. (The room will be announced.)
  I will answer all your questions and go over the questions we asked in HW 1 and 2, as well as issues raised in the lectures and in the file system exercise.
Back to the file system
What we should know
  What a file system is
  How the VFS calls file-system-specific functions via a “virtual table” (“inheritance in C”)
  How to operate (start/stop) VMware
  How to write simple (hello world) modules
  How to write file system modules that register a file system and read the super-block
  How to debug using printk and /var/log/messages
What next
  Successful mount
  Successful ls
  Successful open/touch
  Successful read/write
  Successful mkdir/remove dir
  Successful mmap/munmap
  List of functions to implement
  List of kernel functions we can use
A word of caution…
  So as not to show all my cards, I have taken code from three different sources:
    My uxfs
    Minix
    Ext2
  This way you can still think about ex. 3 without being handed all the code. But beware: not everything is done exactly the same way in all file systems.
  You will also see examples of how “inheritance” is implemented in the Linux file system layer. (Think of a “generic file system” from which uxfs, minix and ext2 inherit.)
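The “virtual table” idea can be sketched in plain userland C. This is only an illustration, not kernel code: toy_fs_ops, toy_describe, the two toy file systems and the block sizes are names and values made up here. The real VFS uses operation structures such as struct super_operations and struct inode_operations in the same spirit.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical "virtual table": one function pointer per operation. */
struct toy_fs_ops {
    const char *(*name)(void);
    int (*block_size)(void);
};

/* "Base class" behaviour shared by every toy file system. */
static int generic_block_size(void) { return 1024; }

static const char *uxfs_name(void)  { return "uxfs"; }
static const char *minix_name(void) { return "minix"; }
static int uxfs_block_size(void)    { return 512; }

/* uxfs overrides block_size; minix "inherits" the generic one. */
static const struct toy_fs_ops uxfs_ops  = { uxfs_name,  uxfs_block_size };
static const struct toy_fs_ops minix_ops = { minix_name, generic_block_size };

/* The generic layer never names a file system; it only calls via the table. */
static int toy_describe(const struct toy_fs_ops *ops, char *out, size_t len)
{
    return snprintf(out, len, "%s:%d", ops->name(), ops->block_size());
}
```

Calling toy_describe(&uxfs_ops, ...) and toy_describe(&minix_ops, ...) runs the same generic code but reaches different functions, which is what the VFS does when it calls into a mounted file system.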
Working with block devices
References
  P. 348 (scanning for a uxfs file system), UNIX Filesystems: very simplified
  Chapter 15.2, Understanding the Linux Kernel, 3rd edition: much more than we need
Buffer head: bread basics
  When we access a block on a block device we call the bread() function.
  bread() reads a block from the block device and returns a buffer_head object (this object can later be accessed for the data).
  Each call to bread() must be followed by a call to brelse(), which releases the buffer.
  A second call to bread() before brelse() has been called will cause the operation to block.
  sb_bread() is a wrapper around bread():
    sb_bread(sb, block (== offset)) == bread(sb->s_dev, block, sb->s_blocksize)
  We will use sb_bread() in most code samples (brelse() still applies).
Buffer head: writing and reading
  To write a buffer head we mark it as dirty using mark_buffer_dirty(struct buffer_head *).
  Dirty buffers are periodically written to disk (or written on brelse).
  To access the data that was read, we read the b_data member of struct buffer_head.
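The read, modify, mark dirty, release cycle can be tried out in userland with a toy model. Everything below is hypothetical (toy_disk, struct toy_buffer, toy_bread, toy_mark_dirty, toy_brelse, toy_sync); it only mimics the shape of the kernel calls and says nothing about how the real buffer cache is implemented. In this sketch dirty buffers reach the “disk” only when toy_sync() runs, standing in for the periodic write-back.

```c
#include <string.h>

#define TOY_BLOCK_SIZE 16
#define TOY_NBLOCKS    4

/* The "disk": a flat array addressed by block number. */
static char toy_disk[TOY_NBLOCKS * TOY_BLOCK_SIZE];

/* Hypothetical stand-in for struct buffer_head. */
struct toy_buffer {
    int  refcount;                /* raised by toy_bread(), dropped by toy_brelse() */
    int  dirty;                   /* set by toy_mark_dirty() */
    char b_data[TOY_BLOCK_SIZE];  /* in-memory copy of the block */
};

static struct toy_buffer toy_cache[TOY_NBLOCKS];

/* Return the buffer for block `blocknr`, or NULL if it is out of range. */
static struct toy_buffer *toy_bread(int blocknr)
{
    struct toy_buffer *bh;

    if (blocknr < 0 || blocknr >= TOY_NBLOCKS)
        return NULL;
    bh = &toy_cache[blocknr];
    if (!bh->dirty)   /* a dirty buffer is newer than the disk copy */
        memcpy(bh->b_data, &toy_disk[blocknr * TOY_BLOCK_SIZE], TOY_BLOCK_SIZE);
    bh->refcount++;
    return bh;
}

static void toy_mark_dirty(struct toy_buffer *bh) { bh->dirty = 1; }

static void toy_brelse(struct toy_buffer *bh)
{
    if (bh && bh->refcount > 0)
        bh->refcount--;
}

/* Write every dirty buffer back; returns how many blocks were written. */
static int toy_sync(void)
{
    int i, written = 0;

    for (i = 0; i < TOY_NBLOCKS; i++) {
        if (toy_cache[i].dirty) {
            memcpy(&toy_disk[i * TOY_BLOCK_SIZE], toy_cache[i].b_data, TOY_BLOCK_SIZE);
            toy_cache[i].dirty = 0;
            written++;
        }
    }
    return written;
}
```

Note that the block number is turned into a byte offset by multiplying with the block size, which is the “block == offset” relation from the sb_bread() slide.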
Examples: ux_put_super + ux_write_super
void ux_put_super(struct super_block *s)
{
    struct ux_fs *fs = (struct ux_fs *) s->s_fs_info;
    struct buffer_head *bh = fs->u_sbh;   /* saved before fs is freed */

    printk(KERN_ERR "scipio : ux_put_super %s %d\n", __FILE__, __LINE__);
    kfree(fs);
    brelse(bh);
}

void ux_write_super(struct super_block *sb)
{
    struct ux_fs *fs = (struct ux_fs *) sb->s_fs_info;
    struct buffer_head *bh = fs->u_sbh;

    printk(KERN_ERR "Scipio write super was called %s %d\n", __FILE__, __LINE__);
    lock_kernel();
    printk(KERN_ERR "Scipio write super after lock kernel %s %d\n", __FILE__, __LINE__);
    /* Only a writable mount may dirty the super-block buffer. */
    if (!(sb->s_flags & MS_RDONLY)) {
        mark_buffer_dirty(bh);
    }
    sb->s_dirt = 0;
    printk(KERN_ERR "Scipio write super before unlock kernel %s %d\n", __FILE__, __LINE__);
    unlock_kernel();
    printk(KERN_ERR "Scipio write super after unlock kernel %s %d\n", __FILE__, __LINE__);
}
Completing the mount operation
And an initial discussion on locking
So what does mount(1) check after mounting?
  The file system mount(1) operation also reads the root inode, verifying that the mount was indeed successful and that a directory was written.
  Some of you have demonstrated a mount that fails with a “not a directory” message.
  For mount(1) to complete successfully we need the XX_iget implementation.
  (The kernel knows which inode is the root inode to read because of the d_alloc_root function.)
ux_iget(): my iget (porting the book)
struct inode *ux_iget(struct super_block *sb, unsigned long ino)
{
    struct buffer_head *bh;
    struct ux_inode *di;
    int block;
    struct inode *inode;

    printk(KERN_ERR "scipio : ux_iget was called %s %d\n", __FILE__, __LINE__);
    inode = iget_locked(sb, ino);
    if (!inode) {
        printk(KERN_ERR "scipio : ux_iget iget_locked failed %s %d\n", __FILE__, __LINE__);
        return ERR_PTR(-ENOMEM);
    }
    /* Found in the inode cache and already initialised: nothing to read. */
    if (!(inode->i_state & I_NEW))
        return inode;

    if (ino < UX_ROOT_INO || ino > UX_MAXFILES) {
        printk("uxfs: Bad inode number %lu\n", ino);
        printk(KERN_ERR "scipio : ux_iget bad inode number %lu, %s %d\n", ino, __FILE__, __LINE__);
        goto ux_iget_error;
    }

    /* Note that for simplicity, there is only one inode per block! */
    block = UX_INODE_BLOCK + ino;
    bh = sb_bread(inode->i_sb, block);
    if (!bh) {
        printk(KERN_ERR "scipio : ux_iget problem with sb_bread on inode %lu %s %d\n", ino, __FILE__, __LINE__);
        goto ux_iget_error;
    }

    di = (struct ux_inode *)(bh->b_data);
    inode->i_mode = di->i_mode;
    /* Pick the operation tables ("virtual tables") by file type. */
    if (di->i_mode & S_IFDIR) {
        inode->i_mode |= S_IFDIR;
        inode->i_op = &ux_dir_inops;
        inode->i_fop = &ux_dir_operations;
    } else if (di->i_mode & S_IFREG) {
        inode->i_mode |= S_IFREG;
        inode->i_op = &ux_file_inops;
        inode->i_fop = &ux_file_operations;
        inode->i_mapping->a_ops = &ux_aops;
    }
    inode->i_uid = di->i_uid;
    inode->i_gid = di->i_gid;
    inode->i_nlink = di->i_nlink;
    inode->i_size = di->i_size;
    inode->i_blocks = di->i_blocks;
    inode->i_atime.tv_sec = di->i_atime;
    inode->i_mtime.tv_sec = di->i_mtime;
    inode->i_ctime.tv_sec = di->i_ctime;
    inode->i_atime.tv_nsec = 0;
    inode->i_mtime.tv_nsec = 0;
    inode->i_ctime.tv_nsec = 0;
    /*
     * Kept as on the slide. Caution: i_private is a single void *, so
     * copying a whole struct ux_inode over it overruns the field unless
     * the structure is no larger than a pointer. Allocating a separate
     * struct ux_inode and pointing i_private at it avoids that.
     */
    memcpy(&inode->i_private, di, sizeof(struct ux_inode));
    brelse(bh);
    unlock_new_inode(inode);
    printk(KERN_ERR "scipio : ux_iget before return %s %d\n", __FILE__, __LINE__);
    return inode;

ux_iget_error:
    printk(KERN_ERR "scipio : ux_iget had error %s %d\n", __FILE__, __LINE__);
    iget_failed(inode);
    return ERR_PTR(-EINVAL);
}
The new iget_locked()
New way
  Each file system has an fs_iget() which calls iget_locked().
  iget_locked() searches for the inode in the inode cache (shared memory). If it is there, it is returned. If not, a new inode marked I_NEW is returned and the file system reads it from disk.
  (Naturally, all shared memory operations are locked.)
Old way
  The iget() method
  Each fs had read_inode()
  Disappeared: 2.6.25 (not so very long ago!)
  Problems: with style and locking
For more information: http://kerneltrap.org/Linus/Removing_iget_and_read_inode
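The lookup-or-create pattern behind iget_locked() can be sketched in userland as follows. This is a single-threaded toy under stated assumptions: struct toy_inode, toy_iget_locked, toy_fs_iget, TOY_I_NEW and the fixed-size array cache are all invented for the illustration, and the locking that the real inode cache needs is left out entirely.

```c
#include <stddef.h>

#define TOY_I_NEW   1      /* inode was just allocated, not yet filled in */
#define TOY_NINODES 8

struct toy_inode {
    unsigned long i_ino;
    int i_state;
    int i_count;           /* usage count */
    int in_cache;
};

static struct toy_inode toy_icache[TOY_NINODES];
static int toy_disk_reads; /* counts simulated disk reads */

/* Look the inode up in the cache; hand out a fresh TOY_I_NEW one on a miss. */
static struct toy_inode *toy_iget_locked(unsigned long ino)
{
    size_t i;
    struct toy_inode *free_slot = NULL;

    for (i = 0; i < TOY_NINODES; i++) {
        if (toy_icache[i].in_cache && toy_icache[i].i_ino == ino) {
            toy_icache[i].i_count++;
            return &toy_icache[i];          /* cache hit */
        }
        if (!toy_icache[i].in_cache && !free_slot)
            free_slot = &toy_icache[i];
    }
    if (!free_slot)
        return NULL;                        /* cache full: "out of memory" */
    free_slot->in_cache = 1;
    free_slot->i_ino = ino;
    free_slot->i_state = TOY_I_NEW;
    free_slot->i_count = 1;
    return free_slot;
}

/* File system side: fill a new inode "from disk" exactly once. */
static struct toy_inode *toy_fs_iget(unsigned long ino)
{
    struct toy_inode *inode = toy_iget_locked(ino);

    if (!inode)
        return NULL;
    if (!(inode->i_state & TOY_I_NEW))
        return inode;                       /* already initialised */
    toy_disk_reads++;                       /* stands in for sb_bread() and the field copies */
    inode->i_state &= ~TOY_I_NEW;           /* stands in for unlock_new_inode() */
    return inode;
}
```

Asking for the same inode number twice returns the same object and touches the “disk” only once, which is the point of going through the cache.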
Some more kernel operations
  printk: we know it already
  kmalloc/kfree: same as the non-kernel malloc/free (kmalloc takes an extra parameter, e.g. GFP_KERNEL) (more on this…)
  kzalloc = kmalloc + set memory to zero. kcalloc = like the normal calloc.
  Most strXXX and memXXX functions are usable in the kernel the same as in user mode (though the implementation is built into the kernel and does not come from a library).
Complete kernel API reference: http://www.gelato.unsw.edu.au/~dsw/public-files/kernel-docs/kernel-api/index.html
Just a word of caution
  The Linux kernel is an evolving beast, with APIs coming and going with practically no attempt at backward compatibility.
  Examples: iget() and read_inode() were removed in kernel 2.6.25 in favour of iget_locked(), while kzalloc was added in 2.6.14 (and doesn’t appear in the API reference).
  The kernel progresses via emails and posts on mailing lists, and everything is documented. When in doubt, ask Google.
Reading an inode from disk, minix style
fs/minix/bitmap.c
struct minix_inode *
minix_V1_raw_inode(struct super_block *sb, ino_t ino, struct buffer_head **bh)
{
    int block;
    struct minix_sb_info *sbi = minix_sb(sb);
    struct minix_inode *p;

    if (!ino || ino > sbi->s_ninodes) {
        printk("Bad inode number on dev %s: %ld is out of range\n",
               sb->s_id, (long)ino);
        return NULL;
    }
    ino--;   /* inode numbers start at 1, table slots at 0 */
    block = 2 + sbi->s_imap_blocks + sbi->s_zmap_blocks +
            ino / MINIX_INODES_PER_BLOCK;
    *bh = sb_bread(sb, block);
    if (!*bh) {
        printk("Unable to read inode block\n");
        return NULL;
    }
    p = (void *)(*bh)->b_data;
    return p + ino % MINIX_INODES_PER_BLOCK;
}
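The block arithmetic in minix_V1_raw_inode() can be checked on its own. The helper below is a userland sketch; toy_locate_inode and struct toy_inode_pos are hypothetical names, and the layout in the comment (two leading blocks, then the two bitmaps, then the inode table) is read off the formula in the minix code rather than taken from any specification.

```c
/* Where an on-disk inode lives, given a minix-like layout. */
struct toy_inode_pos {
    unsigned long block;   /* block number to hand to sb_bread() */
    unsigned long slot;    /* index of the inode inside that block */
};

/*
 * Assumed layout: 2 leading blocks (boot block and super-block),
 * then imap_blocks of inode bitmap, zmap_blocks of zone bitmap,
 * then the inode table. Inode numbers start at 1.
 * Returns 0 on success, -1 for an inode number out of range.
 */
static int toy_locate_inode(unsigned long ino, unsigned long ninodes,
                            unsigned long imap_blocks, unsigned long zmap_blocks,
                            unsigned long inodes_per_block,
                            struct toy_inode_pos *pos)
{
    if (ino == 0 || ino > ninodes || inodes_per_block == 0)
        return -1;
    ino--;   /* first inode sits in slot 0 */
    pos->block = 2 + imap_blocks + zmap_blocks + ino / inodes_per_block;
    pos->slot  = ino % inodes_per_block;
    return 0;
}
```

For example, with one block for each bitmap and 32 inodes per block (illustrative values), inode 1 is slot 0 of block 4, inode 32 is slot 31 of block 4, and inode 33 is slot 0 of block 5.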
Writing an inode
  Is done via a call to iput(). (This will also call your routines.)
  iput() marks the inode as used one less time.
  When the usage count reaches zero, the inode is written to disk and freed.
  iget()/iget_locked() increase the usage count by 1.
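The usage counting can be sketched in a few lines of userland C. The names here (struct toy_ref_inode, toy_ref_get, toy_ref_put, toy_write_inode) are hypothetical and the model is deliberately crude: it only shows that the write-back hook runs on the last put, not how the kernel decides when an inode really goes to disk.

```c
/* Hypothetical in-memory inode with a usage count. */
struct toy_ref_inode {
    int i_count;   /* how many users hold this inode */
    int dirty;     /* in-memory copy differs from the disk copy */
    int writes;    /* times the write hook ran (for the demo) */
    int freed;     /* set once the last user is gone */
};

/* Stands in for the file system's write_inode routine. */
static void toy_write_inode(struct toy_ref_inode *inode)
{
    inode->writes++;
    inode->dirty = 0;
}

/* Like iget()/iget_locked(): one more user. */
static void toy_ref_get(struct toy_ref_inode *inode)
{
    inode->i_count++;
}

/* Like iput(): one user less; the last one triggers write-back and free. */
static void toy_ref_put(struct toy_ref_inode *inode)
{
    if (inode->i_count == 0)
        return;                 /* unbalanced put: ignored in this toy */
    if (--inode->i_count > 0)
        return;                 /* still in use elsewhere */
    if (inode->dirty)
        toy_write_inode(inode);
    inode->freed = 1;
}
```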
write_inode (from minix)
fs/minix/inode.c
static int minix_write_inode(struct inode *inode, int wait)
{
    brelse(minix_update_inode(inode));
    return 0;
}
Still minix: fs/minix/inode.c
static struct buffer_head *minix_update_inode(struct inode *inode)
{
    if (INODE_VERSION(inode) == MINIX_V1)
        return V1_minix_update_inode(inode);
    else
        return V2_minix_update_inode(inode);
}
More from minix: fs/minix/inode.c
static struct buffer_head *V1_minix_update_inode(struct inode *inode)
{
    struct buffer_head *bh;
    struct minix_inode *raw_inode;
    struct minix_inode_info *minix_inode = minix_i(inode);
    int i;

    raw_inode = minix_V1_raw_inode(inode->i_sb, inode->i_ino, &bh);
    if (!raw_inode)
        return NULL;
    …
    mark_buffer_dirty(bh);
    return bh;
}
Creating new files
  When we call touch, for example, we need to allocate a new inode.
  We allocate a Linux inode and also a file system inode that the Linux inode points to.
  Please note: alloc_inode is a new method (it does not appear in the UNIX Filesystems book). Do not confuse it with Pate’s ux_ialloc(), which finds a free inode.
How ext2 allocates an inode
static struct inode *ext2_alloc_inode(struct super_block *sb)
{
    struct ext2_inode_info *ei;
    ei = (struct ext2_inode_info *)kmem_cache_alloc(ext2_inode_cachep, GFP_KERNEL);
    if (!ei)
        return NULL;
    // scipio : I removed some #ifdef
    ei->i_block_alloc_info = NULL;
    ei->vfs_inode.i_version = 1;
    return &ei->vfs_inode;
}
For those who find it weird: ext2.h
struct ext2_inode_info {
    __le32 i_data[15];
    __u32 i_flags;
    __u32 i_faddr;
    …
    struct mutex truncate_mutex;
    struct inode vfs_inode;
    struct list_head i_orphan;   /* unlinked but open inodes */
};
Explaining
  struct inode is embedded in ext2_inode_info, so using simple pointer arithmetic one can find the correct pointer. That is done via the function
    static inline struct ext2_inode_info *EXT2_I(struct inode *inode)
  (Though it may be more correct to say that an ext2_inode “is an” inode rather than “has an” inode, kernel developers are more interested in speed and memory locality than in OOP. I have implemented it with two mallocs and it also works.)
  Speed is of the utmost importance to kernel developers (but I would be most willing to explain code lines).
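The pointer arithmetic can be reproduced in userland with offsetof. The structures and the TOY_I function below are hypothetical stand-ins for struct inode, struct ext2_inode_info and EXT2_I(); this is a sketch of the technique, not the ext2 source.

```c
#include <stddef.h>

/* Stand-in for the generic VFS inode. */
struct toy_vfs_inode {
    unsigned long i_ino;
};

/* Stand-in for the file-system-specific inode: the generic inode is embedded. */
struct toy_fs_inode_info {
    unsigned int i_flags;
    struct toy_vfs_inode vfs_inode;   /* embedded, not pointed to */
};

/* Step back from the embedded member to the structure that contains it. */
static struct toy_fs_inode_info *TOY_I(struct toy_vfs_inode *inode)
{
    return (struct toy_fs_inode_info *)
        ((char *)inode - offsetof(struct toy_fs_inode_info, vfs_inode));
}
```

Because both parts live in one allocation, a single allocation and a single free cover the generic and the specific inode, and going from one to the other costs only a subtraction.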
Get block/put block
  Works roughly the same as with inodes, but via a different data structure.
  (Blocks are read using sb_bread() and put using brelse() after we mark the block as dirty.)
  We may want to do our own locking (especially on SMP systems).
Kernel spinlocks and the BKL
  A kernel spinlock is a mutual exclusion lock: once the lock is obtained, nobody else can obtain it (the other caller waits until it is released).
  Previous versions of Linux had the “Big Kernel Lock” (acronym: BKL). It is a single global lock, so taking it locks out the entire kernel (even unrelated parts). Unlike an ordinary spinlock, the BKL behaves like a recursive mutex: its holder may take it again.
  This lock is being phased out…
  But for simplicity and improved stability it may be a good idea to put the body of each of your functions after a lock_kernel() call. (The BKL is released with unlock_kernel().)
Example in kernel code (from fs/ext2/inode.c)
BKL
    lock_kernel();
    ext2_update_dynamic_rev(sb);
    EXT2_SET_RO_COMPAT_FEATURE(sb,
        EXT2_FEATURE_RO_COMPAT_LARGE_FILE);
    unlock_kernel();
    ext2_write_super(sb);
Reading directories
  Reading directories is identical to reading inodes as far as inodes are concerned.
  Reading directories requires a directory operations struct with different functions than the file operations.
  When reading directories one has to fill a dirent structure (note that this is why the dirent structure has an inode number, which we never used in user space).
List of useful directory functions
  d_alloc_root (p. 349, UNIX Filesystems): allocates the root dentry, from the root inode, for the kernel to read
  filldir (p. 353, UNIX Filesystems): copies directory content to user space
  d_XXX (see the kernel API): manipulate the kernel directory cache; each does exactly what its name implies
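How a readdir routine hands entries to a fill callback can be sketched in userland. The callback type toy_filldir_t, the on-“disk” struct toy_dirent and the other names are hypothetical, and the real filldir callback takes more arguments than this toy one; the sketch only shows the control flow: walk the entries, skip unused slots, stop when the caller’s buffer is full, and remember the position for the next call.

```c
/* Hypothetical on-"disk" directory entry: inode number plus name. */
struct toy_dirent {
    unsigned long d_ino;      /* 0 marks an unused slot */
    char d_name[28];
};

/* Hypothetical fill callback: returns nonzero when the caller has no room left. */
typedef int (*toy_filldir_t)(void *ctx, const char *name, unsigned long ino);

/* Emit entries starting at *pos; returns how many entries were handed over. */
static int toy_readdir(const struct toy_dirent *entries, int nentries,
                       int *pos, toy_filldir_t fill, void *ctx)
{
    int emitted = 0;

    while (*pos < nentries) {
        const struct toy_dirent *de = &entries[*pos];

        if (de->d_ino != 0) {
            if (fill(ctx, de->d_name, de->d_ino) != 0)
                break;        /* buffer full: resume here next time */
            emitted++;
        }
        (*pos)++;
    }
    return emitted;
}

/* A tiny caller-side buffer with room for two entries. */
struct toy_listing {
    unsigned long inos[2];
    int count;
};

static int toy_collect(void *ctx, const char *name, unsigned long ino)
{
    struct toy_listing *l = (struct toy_listing *)ctx;

    (void)name;               /* the name is not stored in this toy */
    if (l->count == 2)
        return -1;
    l->inos[l->count++] = ino;
    return 0;
}
```

A caller keeps invoking toy_readdir() with the same position variable until it returns 0, much as user space keeps reading a directory until no more entries come back.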
NOW WHAT
  You should be able to create a file system (using a userland mkfs).
  You should be able to create a file system that supports reading and writing files and directories. (You have all the APIs and the kernel examples. Feel free to examine other file systems.)
  You should be able to dig into mmap and links on your own… but I’ll cover that in the next lecture.
Some more implementation hints
  It may be a good idea to turn SELinux off while working on the kernel:
    echo 0 > /selinux/enforce
  It may be an even better idea to make SELinux not start at all, or run in permissive mode. Edit /etc/selinux/config and set either
    SELINUX=disabled (SELinux is off)
  or
    SELINUX=permissive (only generates warnings)
  http://www.geocities.com/ravikiran_uvs/articles/rkfs.html is a helpful (Beej-like) manual on how to write file system kernel modules; it may be worth your time.
It’s never too late to start digging into the kernel!