Thursday, April 30, 2009

The Linux Kernel Linked List

Prescript:
In addition to explaining yet another ingredient of Linux kernel, this post implicitly covers some useful C programming techniques. A reader with the thought “...perhaps I can use this in my xyz program...” can apply these techniques to everyday programming.
---

Now that we’ve seen an important data structure of the Linux kernel, it’s best to visit one of the most fundamental building blocks of all kernel data structures, the linked lists. Oh yes, we are all pretty much familiar with this classical data structure model, but what’s so great about it to dedicate one whole post on the subject? Well yes, it is that same old concept of sequentially putting related data together in a list, but here, the point to notice is that, in what a simple yet elegant way the kernel implements and uses this data structure.

In fact, it is this apt implementation, which enables the kernel manage its data much more efficiently, than achieved by the usual approaches. Truly, it is this beautiful and artistic implementation of the kernel linked list, for which I actually fell in love with Linux kernel code as a whole. If you don’t find it as interesting as I did, then you may like this small piece of code for its lucidity and pertinence, or may be, learn one smarter way of implementing lists.

To represent a single “node” of the linked list, the kernel defines a struct list_head (in linux/list.h) which just contains two pointers, next and previous, and nothing else (What about ‘data’ part? Read on...). This struct is put inside the bigger struct, which is needed to be stored in a list. Yes, you read it correct; the linked list node list_head is embedded in a larger structure, which actually contains the data to be stored in the list. Read the following illustration to understand what does this mean.


The Way We did It Always!
Ever since our data structure class, we have been seeing the data being part of the linked list structure, i.e., the usual approach of list implementation as follows:

/* declaration*/
struct info { int i; char c; float f };
struct list { struct list* pPrev; struct info info_instance; struct list* pNext; };
struct list list_instance1, list_ instance2, list_ instance3;
/* form a 3-element circular doubly linked list */
list_ instance1.pPrev = list_ instance3; list_ instance1.pNext = list_ instance2;
list_ instance2.pPrev = list_ instance1; list_ instance2.pNext = list_ instance3;
list_ instance3.pPrev = list_ instance2; list_ instance3.pNext = list_ instance1;
/* access the next-element-in-the-list’s char c element */
list_ instance2.pNext -> info_instance.c == list_ instance3.c


The Way The Linux Kernel Does It!
But kernel lists are implemented using a different approach.
/* declaration */
struct list_head { struct list_head* pPrev; struct list_head* pNext; };
struct info { int i; char c; float f; struct list_head list_instance; };
struct info info_instance1, info_instance2, info_instance3;
/* form a 3-element circular doubly linked list */
info_instance1.list_instance .pPrev = info_instance3.list_instance; info_instance1.list_instance.pNext = info_instance2.list_instance;
info_instance2.list_instance.pPrev = info_instance1.list_ instance; info_instance2.list_instance.pNext = info_instance3.list_instance;
info_instance3.list_instance.pPrev = info_instance2.list_ instance;
info_instance3.list_instance.pNext = info_instance1.list_instance;
/* access the next-element-in-the-list’s char c element */
info_instance2.list_instance.pNext ...??? == info_instance3.c;



Note that the pPrev and pNext pointers point to list_instance element of the struct info, not the struct info itself. Then how to access the actual data part of struct info? That’s easy, use the following statement-




(struct info*) ((char*) (info_instance2.list_instance.pNext) – (unsigned long)(&(struct info*)0 -> list_instance ))

Oh no no, it’s not that complicated as it looks at first glance. Let’s decipher it in two steps.

Cast an Element of a Structure out to Containing Structure
1) The generic statement (unsigned long) (& (TYPE *) 0 -> MEMBER) returns the byte offset of element MEMBER of a TYPE within the TYPE. Hence in our example, the statement

a. (unsigned long) (&(struct info*)0 ->i) returns 0
b. (unsigned long) (&(struct info*)0 ->c) returns 4
c. (unsigned long) (&(struct info*)0 ->f) returns 5
d. (unsigned long) (&(struct info*)0 -> list_instance) returns 9



This statement is #define’d as macro offsetof(TYPE,MEMBER) in linux/stddef.h
2) On a little-endian machine as Intel-x86, a multibyte object is allocated from lower memory address to higher memory address. Thus the statement ((char*) PTR_TO_MEMBER) – offsetof (TYPE, MEMBER) returns the address of the structure object containing the element MEMBER, which is pointed to by PTR_TO_MEMBER. For a big-endian machine, offset would be added. Thus, all of the following statements return the address of info_instance2:
a. (struct info*) ((char*) &(info_instance2.i) – offsetof (struct info, i))
b. (struct info*) ((char*) &(info_instance2.c) – offsetof (struct info, c))
c. (struct info*) ((char*) &(info_instance2.f) – offsetof (struct info, f))
d. (struct info*) ((char*) &(info_instance2.list_instance) – offsetof (struct info, list_instance))

Therefore in our example, since info_instance2.list_instance.pNext points to info_instance3.list_instance, the statement
(struct info*) ((char*) (info_instance2.list_instance.pNext) – offsetof (struct info, list_instance))
returns the address of object info_instance3 of type struct info.
That’s all! I have already said, it’s easy, isn’t it?



So that we don’t have to write this complicated statement every time, it is #define'd as macro container_of in linux/kernel.h.



Food for Thought- 1:
What we just discussed forms the second statement (line #238) of container_of. Then what is the role of first statement here? Why is ptr assigned to __mptr first, but not directly used?
Wrong Reason: For pointer manipulation in next statement, __mptr needs to be converted to the appropriate type before it is used. Huh! Anyway before being used, __mptr is cast to char in the next statement, then previous statement is just an extra cast?
Correct Reason: Since container_of is a macro and not a function, its parameter list doesn't have the type information of arguments received from the call; ptr is just a value of 'unknown' type. The first line of the macro acts as a type declaration in function parameter list, and causes the compiler to throw a warning if the ptr doesn’t actually point to member. It is very useful to detect any anomaly in the code, where mistakenly a wrong value for ptr or member or both has been passed. (Experiment with this code by removing first statement of container_of.)
Food for Thought- 2:
Why don’t we define container_of a function (and the typeof statement will be removed)?
Because, we actually want ptr to be of ‘unknown’ type. Otherwise in the function declaration, what type do we declare ptr as, since that will only be known at run time!

** The same container_of macro was introduced in Minix 2.0.4
with the name structof in src/kernel/const.h. Pointing at the
obscurity of the macro, author of the release notes
sarcastically commented, “Don't worry, it's only used in a few drivers.” :D

What Do You Win If You Use This List?
We see struct list_head embedded in every other kernel data structure. One great advantage of this approach is that if we want to put info_instance1 on two (or any number for that matter) different lists, we don’t have to create its multiple copies. That is to say, the same instance of struct info can become part of multiple linked lists at the same time without actually duplicating it. We just have to embed those many number of list_head elements in struct info, as many linked lists we want an instance of struct info to be present in. For a real example, task_struct (the kernel data structure which maintains a single process) has four list_head elements as its data members, viz., run_list, tasks, children, and sibling. Each instance of list_head helps a particular task_struct instance to be a part of four lists simultaneously-
1) run_list makes it a part of processor run-queue list of ready processes,
2) tasks makes it a part of the universal list of all the task_struct objects in the system,
3) children is a pointer to the list of it’s children processes’ task_struct objects, and
4) sibling links it with the other task_struct objects having the same parent.

Phew! One task_struct is on four lists at the same time. Did you think of this type of list implementation anytime in your data structures class? Well I didn’t!!

Where Do You Find It?
The Linux linked list implementation is given in a single file linux/list.h. This file is full of handy macros for list manipulation, like add/delete a node, list traversal etc. Having read this far, you will be able to understand these easily. If you don't have Linux source code (and don't wish to download it either) but yet want to get this file, click here.

You can use this implementation of linked list in general purpose or professional programming and grab those extra cookies which you have been missing by using the ordinary list implementation. Just remove the unwanted code from linux/list.h (which will mostly be the macros with _rcu prefix) and copy it to your code. After the introduction of struct list_head in kernel 2.1.45, the kernel code got rid of many redundant and inefficient lines of code. May you achieve the same results.

As we see in the snapshot of linux/kernel.h, just after the definition of macro container_of, another macro typecheck has been defined. This is an important macro, and as the macro header says, it is useful only at compile-time (???). I have yet to figure out its role exactly. Would like to hear from anyone who understands it.


References:
I addition to those listed, http://kernelnewbies.org/FAQ/LinkedLists

No comments: