Thursday, April 30, 2009

The Linux Kernel Linked List

Prescript:
In addition to explaining yet another ingredient of the Linux kernel, this post implicitly covers some useful C programming techniques. A reader with the thought “...perhaps I can use this in my xyz program...” can apply these techniques to everyday programming.
---

Now that we’ve seen an important data structure of the Linux kernel, it’s time to visit one of the most fundamental building blocks of all kernel data structures: the linked list. Oh yes, we are all pretty familiar with this classical data structure, so what’s so great about it that it deserves a whole post? It is indeed the same old concept of putting related data together in a sequential list; the point to notice is the simple yet elegant way in which the kernel implements and uses it.

In fact, it is this apt implementation that enables the kernel to manage its data far more efficiently than the usual approaches allow. Truly, it was this beautiful and artistic implementation of the kernel linked list that made me fall in love with the Linux kernel code as a whole. Even if you don’t find it as fascinating as I did, you may still appreciate this small piece of code for its lucidity and pertinence, or simply learn a smarter way of implementing lists.

To represent a single “node” of the linked list, the kernel defines a struct list_head (in linux/list.h) which contains just two pointers, next and previous, and nothing else (What about the ‘data’ part? Read on...). This struct is put inside the bigger struct that needs to be stored on a list. Yes, you read that correctly: the linked list node list_head is embedded in a larger structure, which actually contains the data to be stored in the list. Read the following illustration to understand what this means.


The Way We Did It Always!
Ever since our data structures class, we have seen the data being part of the linked list structure itself, i.e., the usual list implementation looks like this:

/* declaration */
struct info { int i; char c; float f; };
struct list { struct list *pPrev; struct info info_instance; struct list *pNext; };
struct list list_instance1, list_instance2, list_instance3;
/* form a 3-element circular doubly linked list */
list_instance1.pPrev = &list_instance3; list_instance1.pNext = &list_instance2;
list_instance2.pPrev = &list_instance1; list_instance2.pNext = &list_instance3;
list_instance3.pPrev = &list_instance2; list_instance3.pNext = &list_instance1;
/* access the next-element-in-the-list’s char c member */
list_instance2.pNext->info_instance.c == list_instance3.info_instance.c


The Way The Linux Kernel Does It!
But kernel lists are implemented using a different approach.
/* declaration */
struct list_head { struct list_head *pPrev; struct list_head *pNext; };
struct info { int i; char c; float f; struct list_head list_instance; };
struct info info_instance1, info_instance2, info_instance3;
/* form a 3-element circular doubly linked list */
info_instance1.list_instance.pPrev = &info_instance3.list_instance; info_instance1.list_instance.pNext = &info_instance2.list_instance;
info_instance2.list_instance.pPrev = &info_instance1.list_instance; info_instance2.list_instance.pNext = &info_instance3.list_instance;
info_instance3.list_instance.pPrev = &info_instance2.list_instance;
info_instance3.list_instance.pNext = &info_instance1.list_instance;
/* access the next-element-in-the-list’s char c member */
info_instance2.list_instance.pNext ...??? == info_instance3.c;



Note that the pPrev and pNext pointers point to the list_instance element of a struct info, not to the struct info itself. So how do we access the actual data part of struct info? That’s easy: use the following expression-




(struct info *)((char *)(info_instance2.list_instance.pNext) - (unsigned long)(&((struct info *)0)->list_instance))

Oh no no, it’s not as complicated as it looks at first glance. Let’s decipher it in two steps.

Cast a Member of a Structure out to the Containing Structure
1) The generic expression (unsigned long)(&((TYPE *)0)->MEMBER) returns the byte offset of member MEMBER within TYPE. Hence in our example (on i386, where the compiler pads the struct so that every member stays naturally aligned):

a. (unsigned long)(&((struct info *)0)->i) returns 0
b. (unsigned long)(&((struct info *)0)->c) returns 4
c. (unsigned long)(&((struct info *)0)->f) returns 8 (three padding bytes follow c, keeping f 4-byte aligned)
d. (unsigned long)(&((struct info *)0)->list_instance) returns 12



This expression is #define'd as the macro offsetof(TYPE, MEMBER) in linux/stddef.h.
2) Note that this offset arithmetic has nothing to do with endianness: the C standard guarantees that structure members are laid out at increasing addresses in declaration order, on little- and big-endian machines alike. Thus the expression ((char *)PTR_TO_MEMBER) - offsetof(TYPE, MEMBER) returns the address of the structure object containing the member pointed to by PTR_TO_MEMBER. All of the following expressions return the address of info_instance2:
a. (struct info *)((char *)&(info_instance2.i) - offsetof(struct info, i))
b. (struct info *)((char *)&(info_instance2.c) - offsetof(struct info, c))
c. (struct info *)((char *)&(info_instance2.f) - offsetof(struct info, f))
d. (struct info *)((char *)&(info_instance2.list_instance) - offsetof(struct info, list_instance))

Therefore in our example, since info_instance2.list_instance.pNext points to info_instance3.list_instance, the expression
(struct info *)((char *)(info_instance2.list_instance.pNext) - offsetof(struct info, list_instance))
returns the address of the object info_instance3 of type struct info.
That’s all! As I said, it’s easy, isn’t it?



So that we don’t have to write this complicated expression every time, it is #define'd as the macro container_of in linux/kernel.h.



Food for Thought- 1:
What we just discussed forms the second statement (line #238) of container_of. Then what is the role of the first statement here? Why is ptr first assigned to __mptr instead of being used directly?
Wrong Reason: For the pointer arithmetic in the next statement, __mptr needs to be converted to the appropriate type before it is used. Huh! But __mptr is cast to char * in the next statement anyway, so isn’t the first statement just an extra cast?
Correct Reason: Since container_of is a macro and not a function, its parameter list carries no type information about the arguments it receives; ptr is just a value of ‘unknown’ type. The first line of the macro acts like a type declaration in a function parameter list, and makes the compiler emit a warning if ptr doesn’t actually point to a member of the stated type. It is very useful for detecting anomalies in the code, where a wrong value for ptr or member (or both) has been passed by mistake. (Experiment with this code by removing the first statement of container_of.)
Food for Thought- 2:
Why don’t we define container_of as a function (dropping the typeof statement)?
Because we actually want ptr to be of ‘unknown’ type. In a function declaration we would have to commit to one concrete type for ptr, yet the containing type differs from one call site to the next; only a macro can stay generic over types like this.

** The same container_of macro was introduced in Minix 2.0.4
with the name structof in src/kernel/const.h. Pointing at the
obscurity of the macro, author of the release notes
sarcastically commented, “Don't worry, it's only used in a few drivers.” :D

What Do You Win If You Use This List?
We see struct list_head embedded in every other kernel data structure. One great advantage of this approach is that if we want to put info_instance1 on two (or any number of) different lists, we don’t have to create multiple copies of it. That is to say, the same instance of struct info can be part of multiple linked lists at the same time, without ever being duplicated. We simply embed one list_head element in struct info for each list we want its instances to be able to join. For a real example, task_struct (the kernel data structure which maintains a single process) has four list_head elements as data members, viz., run_list, tasks, children, and sibling. Each list_head lets a particular task_struct instance be part of one of four lists simultaneously-
1) run_list makes it a part of processor run-queue list of ready processes,
2) tasks makes it a part of the universal list of all the task_struct objects in the system,
3) children is the head of the list of its child processes’ task_struct objects, and
4) sibling links it with the other task_struct objects having the same parent.

Phew! One task_struct is on four lists at the same time. Did you ever see this type of list implementation in your data structures class? Well, I didn’t!!

Where Do You Find It?
The Linux linked list implementation lives in a single file, linux/list.h. This file is full of handy macros for list manipulation: adding/deleting a node, list traversal, and so on. Having read this far, you will be able to understand them easily. If you don't have the Linux source code (and don't wish to download it either) but still want this file, click here.

You can use this implementation of linked list in general-purpose or professional programming and grab those extra cookies you have been missing with the ordinary list implementation. Just remove the unwanted code from linux/list.h (mostly the macros with the _rcu suffix) and copy it into your code. After struct list_head was introduced in kernel 2.1.45, the kernel code shed many redundant and inefficient lines of code. May you achieve the same results.

As we see in the snapshot of linux/kernel.h, just after the definition of the macro container_of, another macro, typecheck, is defined. This is an important macro and, as its header comment says, it is useful only at compile time (???). I have yet to figure out its exact role. I would like to hear from anyone who understands it.


References:
In addition to those listed, http://kernelnewbies.org/FAQ/LinkedLists

Tuesday, December 9, 2008

Do You Know Your Limits!?

To continue our discussion of process resource limits, let’s see some real figures. Try putting the following printk() statements in the FirstModuleSource.c posted here.

printk(KERN_INFO "Process Name: %s\n", current->comm);
printk(KERN_INFO "current limit- # open files: %lu\n", current->rlim[RLIMIT_NOFILE].rlim_cur);
printk(KERN_INFO "maximum limit- # open files: %lu\n", current->rlim[RLIMIT_NOFILE].rlim_max);

These statements print the soft and hard limits (on the maximum number of files the process can open) imposed on the current process (which is insmod). On similarly printing the limits of other resources, we see that all soft and hard limit values come from the array INIT_RLIMITS (see #line 24 in the snapshot of resource.h posted here).


To see how the resource limits of a kernel process compare with those of an ordinary user process, we shall print the resource limits of a user process by adding the following lines of code to FirstModuleSource.c:

int some_pid = 3630; /* reader should change this value appropriately */
struct task_struct *task = find_task_by_pid(some_pid);
printk(KERN_INFO "Process Name: %s\n", task->comm);
printk(KERN_INFO "current limit- # open files: %lu\n", task->rlim[RLIMIT_NOFILE].rlim_cur);
printk(KERN_INFO "maximum limit- # open files: %lu\n", task->rlim[RLIMIT_NOFILE].rlim_max);

Here, 3630 is the pid of the shell command tail, which I got by running tail and looking it up in the output of ps. The find_task_by_pid() kernel function returns the process descriptor belonging to some_pid. We can specify the pid of any user process running in the background, but that process must stay alive while the module is inserted.

On my i386/RHEL-4 machine, the two sets of statements given above produce the following output, respectively:

Nov 15 21:27:32 localhost kernel: Process Name: insmod
Nov 15 21:27:32 localhost kernel: current limit- # open files: 1024
Nov 15 21:27:32 localhost kernel: maximum limit- # open files: 1024

Nov 15 21:27:32 localhost kernel: Process Name: tail
Nov 15 21:27:32 localhost kernel: current limit- # open files: 1024
Nov 15 21:27:32 localhost kernel: maximum limit- # open files: 1024

On similarly printing the limits of other resources and analyzing the output, we see that the resource limits of every user and kernel process are initialized with the same set of values, specified by the array INIT_RLIMITS in resource.h. This small experiment suggests that, by default, the kernel imposes the same resource limits on user- and kernel-mode processes alike. To verify this further, I changed some values in INIT_RLIMITS, recompiled the kernel, and ran the same program again. The changed values were reflected as-is in the output.

How are Limits Assigned?
Unless a user is assigned custom limits by the superuser, all the processes belonging to that user are assigned default limits as follows.

The first Linux process created during system bootup is the swapper process. After kernel initialization, swapper creates another process, init, which then takes full control of all system activities and creates more processes. swapper's data structures are initialized through init_task.h. It is the only process whose data structures are statically allocated like this; data structures for all other processes are allocated dynamically.

As we see in the snapshot here, the INIT_TASK macro initializes the task_struct of swapper. At #line 80, the rlim field of task_struct is initialized with the INIT_RLIMITS structure defined in resource.h (see last post), and #line 82 sets the name of the process.
Since every process inherits the properties of its parent when it is created, init receives the same values as swapper for all of its fields, including rlim. Further, when init creates more processes, these values are inherited by every child process, and by their children, and so on. This chain is broken only when
  1. a process executes the setrlimit() system call, or
  2. the superuser has configured custom values for a user.
We've already covered the first point in the last post. For information on the second, do a `man limits.conf' and open your /etc/security/limits.conf. Custom values for a user/group are specified in this configuration file by the system superuser.

Therefore, if a particular limit for a resource is set for user X in limits.conf, that limit shall be assigned to all the processes created by user X, at the time of process creation.

** Try the shell command `ulimit -a' to see all soft limits for the current user. `ulimit -s unlimited' sets the stack-size soft limit to the maximum for all processes belonging to the current session of the current user. (The current session is nothing but the shell in which ulimit is entered; changes made by ulimit do not affect other shells.) If a user/resource doesn't appear in limits.conf, ulimit displays the default limits (i.e., the values of the INIT_RLIMITS array) for that user/resource.

Shweta is going to share some findings on this very soon... on the same page.
-------------------------------------------------------------

Shweta added:

The above explanation can be verified with the help of a small and quick experiment, which should make this behavior of resource-limit assignment clear. The experiment has two parts:

First, clear the entries belonging to user X from limits.conf (if any), then log in to your machine as user X and write a user program that prints the value of RLIMIT_NOFILE using getrlimit(). Also check this value using 'ulimit'. Both outputs should show the value specified in the INIT_RLIMITS array. Now change the value of RLIMIT_NOFILE using setrlimit(), fork() a new process in the same program, and print the value of RLIMIT_NOFILE for the child process using getrlimit(). It should output the value changed in the parent process by setrlimit().

Now, for the second part of the experiment, make an entry for RLIMIT_NOFILE for user X in limits.conf and repeat part 1. This time, wherever the previous run printed values from the INIT_RLIMITS array, it will print the values specified in limits.conf.

Saturday, November 22, 2008

Process Resource Limits

Having seen the task_struct structure representing a process in the kernel, in this post we shall discuss one very important field of task_struct, named rlim (#line 460 in the snapshot of sched.h in the last post). A process cannot help itself to system resources without bound: the Linux kernel imposes limits on the amount of each resource a process is allowed to use, so that no single process can starve the others by consuming too much. There are RLIM_NLIMITS types of limited resources, and the kernel specifies limits for all of them; in kernel 2.6, the value of RLIM_NLIMITS is 13. See the snapshot of the file include/asm-i386/resource.h below, which contains the macros for all 13 resource types.

All Linux processes are assigned the same resource-limit values when they are created. After creation, however, a process can increase or decrease its limits. The resource limits of a process are stored in the rlim field of the process descriptor: a 13-element array of type struct rlimit, defined in the file include/linux/resource.h. (Snapshot below)

The rlimit structure contains two fields specifying the current and maximum limits of a resource (also called the soft and hard limits, respectively) that a process may use. Using the getrlimit()/setrlimit() system calls, any process can raise or lower its soft limits up to the corresponding hard limits. The hard limits of a process can only be raised if the process has superuser privileges; therefore, a “normal” process may use setrlimit() only to irreversibly lower its hard limits.

If a process tries to consume a resource beyond its assigned soft limit, the kernel either signals the process, or the system call underlying the attempt (e.g. malloc, fopen) fails with errno set to an appropriate value. For example, the signals SIGXCPU, SIGXFSZ, and SIGSEGV are delivered when a process exceeds RLIMIT_CPU, RLIMIT_FSIZE, and RLIMIT_STACK, respectively.

This was the general background; in the next post we will practically see the actual limits imposed on Linux processes, and how these limits are assigned.