Tuesday, July 29, 2008

Kernel Process Data-Structures

In the simplest words, a process is the "execution context of a running program". The execution context is the collection of all the data structures, registers, memory addresses and other resources that the process uses while executing a program. Therefore, a program by itself is not a process. Rather, a process is a collection of data structures that fully describes how far the execution of a program has progressed. In other words, a process is an active program, i.e., an in-memory representation of a program stored on the hard disk. An important point to note is that two or more processes executing the same program can co-exist, and a single process can execute two or more programs in its life span.
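To make the last point concrete, here is a minimal user-space sketch, assuming a POSIX system. The helper name run_child_exit_code is made up for this illustration. After fork(), two processes are executing the same program; each has its own execution context.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child that runs the same program image as its parent and exits
 * with `code`. Returns the child's exit status as seen by the parent,
 * or -1 on failure. (run_child_exit_code is a hypothetical helper name.)
 */
int run_child_exit_code(int code)
{
    int status;
    pid_t child = fork();      /* two processes now execute one program */

    if (child < 0)
        return -1;             /* fork failed */
    if (child == 0)
        _exit(code);           /* child: exit immediately with `code` */

    /* parent: wait for that specific child and decode its exit status */
    if (waitpid(child, &status, 0) != child)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The opposite case, one process executing more than one program, is what the exec family of calls does: the process keeps its pid but replaces the program it is running.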


Running and managing a process is a complex activity for the kernel. In this post (and probably the next few) I try to make a note of how beautifully the Linux kernel performs this task. Specifically, in this post I note how the Linux kernel keeps track of the process currently scheduled to run on the CPU.


Data Structure to Hold Entire Information About a Process:

The kernel identifies each process by a unique process descriptor, represented by struct task_struct, which contains all the information about that specific process. All the process descriptors are kept in a circular doubly linked list.

The following screenshot shows a (partial) definition of the task_struct structure, defined in linux/sched.h. I have removed many fields that are not directly relevant to our discussion.
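A circular doubly linked list can be sketched in plain user-space C as below. This is only an illustration: struct toy_task and its helpers are hypothetical names, and the real kernel links descriptors through a generic list structure embedded in task_struct rather than through bare next/prev pointers.

```c
#include <stddef.h>

/* Hypothetical, much-reduced stand-in for a process descriptor. */
struct toy_task {
    int pid;
    struct toy_task *next;   /* next descriptor in the ring */
    struct toy_task *prev;   /* previous descriptor in the ring */
};

/* Make `t` a ring of one: it is its own neighbour in both directions. */
void toy_task_init(struct toy_task *t, int pid)
{
    t->pid = pid;
    t->next = t;
    t->prev = t;
}

/* Link `t` into the ring just before `head`, i.e. at the tail. */
void toy_task_add_tail(struct toy_task *head, struct toy_task *t)
{
    t->prev = head->prev;
    t->next = head;
    head->prev->next = t;
    head->prev = t;
}

/* Walk the ring once, starting at `head`, and count the descriptors. */
int toy_task_count(const struct toy_task *head)
{
    int n = 1;
    const struct toy_task *p;

    for (p = head->next; p != head; p = p->next)
        n++;
    return n;
}
```

Because the list is circular, a walk can start from any descriptor and ends when it comes back to the starting one; there is no NULL terminator to test for.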

getpid() Does Not Return Process id ?!?

But I had always been using it to get the pid of the calling process without knowing this fact!!! Indeed, getpid() returns the tgid (thread group id), which is the pid of the leader of the thread group (usually the first process created within that group). Since Linux doesn't support multithreading directly, each process normally executes as a single thread in its own thread group. Therefore, a single-threaded process is naturally its own thread group leader, and its pid and tgid are equal. The difference shows up only when some library (e.g. POSIX threads) is used to create a multithreaded process, in which case getpid() returns the value of the calling thread's tgid field, i.e. the same number for every thread in the group. By the way, in Linux, threads are called lightweight processes and they also have pids, since, as we can see, there is nothing called a thread id in task_struct.
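The relationship between the two fields can be modelled in a few lines of user-space C. This is a toy model under my own naming (struct toy_task_ids, toy_getpid and friends are hypothetical), not kernel code; it only mirrors the rule described above.

```c
/* Hypothetical model holding only the two identifiers discussed above. */
struct toy_task_ids {
    int pid;    /* unique per task, including each thread */
    int tgid;   /* pid of the thread group leader */
};

/* A new single-threaded process leads its own thread group. */
struct toy_task_ids toy_new_process(int pid)
{
    struct toy_task_ids t = { pid, pid };
    return t;
}

/* A new thread gets its own pid but inherits the leader's tgid. */
struct toy_task_ids toy_new_thread(const struct toy_task_ids *leader, int pid)
{
    struct toy_task_ids t = { pid, leader->tgid };
    return t;
}

/* What getpid() reports for a task in this model: the tgid. */
int toy_getpid(const struct toy_task_ids *t)
{
    return t->tgid;
}
```

In this model a leader created with pid 100 and a thread created with pid 101 both report 100 from toy_getpid(), while their pid fields stay different.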

Food for Thought:

1. What if I need to determine the pid of a light weight process, which is not a group leader?


In a uniprocessor system, only a single process can run at a time. The pointer to the task_struct of the process currently running on the CPU is returned by the macro current. Thus, the statement


printk(KERN_ALERT "Process Name: %s\n" "PID: %d\n", current->comm, current->pid);


prints the name and pid of the currently executing process.


How Does the current Macro Determine the Currently Running task_struct?

Other than task_struct, there is one more data structure associated with a process, thread_info, which contains the low-level information the kernel needs while the process is running on a CPU. The task_struct and thread_info structures each contain a field pointing to the other, named thread_info and task respectively (line #439 in the task_struct snapshot). Like task_struct, thread_info is a dedicated structure for each process: it is set up together with the kernel stack of the process when the process is created, and the two pointers are filled in at that time.

For each process running in kernel mode, the kernel uses the following union spanning 8 KB:

union thread_union {
    struct thread_info thread_info;
    unsigned long stack[2048]; /* (long: 4 bytes) x 2K = 8 KB */
};
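The sizes involved can be checked with a small user-space sketch. The names below (toy_thread_info, toy_thread_union, TOY_THREAD_SIZE) are stand-ins I made up, and the stack is declared as a byte array so that the union is 8 KB regardless of how wide long is on the machine compiling it.

```c
#include <stddef.h>

#define TOY_THREAD_SIZE 8192   /* 8 KB, as in the text */

/* Hypothetical stand-in: the real thread_info has more fields. */
struct toy_thread_info {
    void *task;   /* would point back to the task_struct */
    int cpu;
};

union toy_thread_union {
    struct toy_thread_info thread_info;    /* sits at the lowest address */
    unsigned char stack[TOY_THREAD_SIZE];  /* the whole 8 KB area */
};

/* Bytes left for the stack once thread_info has taken the bottom. */
size_t toy_stack_room(void)
{
    return sizeof(union toy_thread_union) - sizeof(struct toy_thread_info);
}
```

On x86 the stack grows downwards, so it starts at the top of the 8 KB area and grows towards thread_info at the bottom; toy_stack_room() is how far it can grow before it would start overwriting thread_info.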


Food for Thought:

2. What happens if I reverse the order of the two declarations above in thread_union?

3. What is wrong in following code (courtesy of Kernel Trap mailing list http://kerneltrap.org/node/5835)

char *get_sp() { asm("movl %esp, %eax"); }

int main(void){
    struct thread_info *current_thread_info;
    struct task_struct *current_task;

    current_thread_info = (struct thread_info *)((unsigned long)get_sp()&~0x1FFF);
    current_task = (struct task_struct *)(current_thread_info->task);
    printf("my pid is: %d\n",(unsigned int)current_task->pid);
}

The (8 KB) thread_union of a process lives in kernel memory, in the kernel data segment (selected by the macro __KERNEL_DS), and it is what the process uses as its stack as soon as it switches to kernel mode. User mode processes cannot access the kernel data segment.


Since, at any given time in kernel mode, the esp register points to a memory location lying somewhere within this 8 KB area (an offset addressed by 13 bits), and the area is aligned on an 8 KB boundary, the starting address of thread_union can be obtained by masking out the 13 least significant bits of the esp register. Furthermore, since thread_info is stored at the start of thread_union, the obtained value is a pointer to thread_info. This masking operation is performed by the kernel routine current_thread_info(), which looks like the following:


movl $0xffffe000,%ecx /* store the mask */

andl %esp,%ecx /* mask the esp, ecx points to thread_info*/


Having acquired the address of thread_info structure, getting the pointer to task_struct is trivial. Hence, the current macro is:


current = current_thread_info()->task


This is how the current macro points to the task_struct of the process currently running on CPU.
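The same masking trick can be tried out in user space. The sketch below is an imitation under assumed names (toy_info, toy_info_from_sp, toy_current, toy_alloc_stack_area are all hypothetical): it allocates an 8 KB area aligned on an 8 KB boundary, and recovers the start of the area from any address inside it.

```c
#include <stdint.h>
#include <stdlib.h>

#define TOY_AREA_SIZE 8192u   /* 8 KB => the low 13 bits are the offset */

struct toy_info {
    void *task;   /* hypothetical back-pointer to a process descriptor */
};

/* Clear the low 13 bits of an address inside the area to get its base. */
struct toy_info *toy_info_from_sp(uintptr_t sp)
{
    return (struct toy_info *)(sp & ~(uintptr_t)(TOY_AREA_SIZE - 1));
}

/* The analogue of `current`: follow the task pointer stored at the base. */
void *toy_current(uintptr_t sp)
{
    return toy_info_from_sp(sp)->task;
}

/* Allocate an 8 KB area aligned on an 8 KB boundary; NULL on failure. */
void *toy_alloc_stack_area(void)
{
    return aligned_alloc(TOY_AREA_SIZE, TOY_AREA_SIZE);
}
```

The alignment is what makes this work: if the area did not start on an 8 KB boundary, clearing 13 bits of an inner address would land somewhere other than its start.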

Monday, July 21, 2008

There is Something About printk()

In the example myFirstModuleSource.c of Writing the First Kernel Module, I used printk() just as another name for printf(). In this post I note the differences between the two, and the way the Linux kernel uses printk().


At first glance, the only difference between printk() and printf() is the way the two functions are called. However, this single difference is behind all the mysteries of how printk() displays kernel output. I call it a “mystery” because we don't see printk() output in the place where (as naive kernel learners) we expect it, i.e., on the console. Actually, all output of printk() is logged into the file /var/log/messages. After executing our program (inserting/removing modules), we have to check /var/log/messages to see the output. Let's have a look at how printk() is called, and then see how to get its output where we want it, besides /var/log/messages.


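printk() is called with one extra item compared with printf(), a log priority written directly in front of the format string (note that there is no comma between the two):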


printk(KERN_log_priority "hello world\n");


Here, log_priority is one of the eight values (predefined in linux/kernel.h, similar to /usr/include/sys/syslog.h) EMERG, ALERT, CRIT, ERR, WARNING, NOTICE, INFO, DEBUG (in order of decreasing priority). See lines #31 to #38 in the snapshot of linux/kernel.h below. In the example of myFirstModuleSource.c, I used printk() without mentioning any log priority. Whenever the log level is not specified explicitly in a printk() call, default_message_loglevel is assigned to the message (line #43 in the snapshot below).
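In kernels of this generation the KERN_ macros are simply short string literals of the form "<n>", which the C compiler glues onto the front of the format string. The user-space sketch below imitates how such a prefix could be recognised. The TOY_ names and toy_parse_level are hypothetical, and the default of 4 is an assumption taken from the value my own machine reports.

```c
#include <stddef.h>

/* Modelled on the 2.6-era definitions: a level is a "<n>" string prefix. */
#define TOY_KERN_ALERT    "<1>"
#define TOY_KERN_INFO     "<6>"
#define TOY_DEFAULT_LEVEL 4     /* assumed default, i.e. WARNING */

/*
 * Return the log level encoded at the start of msg, or TOY_DEFAULT_LEVEL
 * when no "<0>".."<7>" prefix is present. If text is not NULL, *text is
 * set to the message body after any prefix.
 */
int toy_parse_level(const char *msg, const char **text)
{
    int level = TOY_DEFAULT_LEVEL;

    if (msg[0] == '<' && msg[1] >= '0' && msg[1] <= '7' && msg[2] == '>') {
        level = msg[1] - '0';
        msg += 3;               /* skip over the three prefix characters */
    }
    if (text != NULL)
        *text = msg;
    return level;
}
```

A call such as toy_parse_level(TOY_KERN_ALERT "hello world\n", &body) works because adjacent string literals are concatenated at compile time, which is also why no comma appears after the priority in a printk() call.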




The kernel treats messages differently according to their priority. Different rules are followed depending upon whether the system is in one of the six text console modes (Ctrl+Alt+F1 to Ctrl+Alt+F6) or in GUI mode.


Logging in Console Mode and Getting the printk() Output on the Console:

The file /proc/sys/kernel/printk contains four integer values. E.g., my RHEL-4 machine has the values 6, 4, 1, 7. These integers are the current console log level, the default message log level, the minimum allowed console log level, and the boot-time default console log level, respectively. These are the values in lines #42 to #45 in linux/kernel.h. Any message whose level is numerically lower (i.e. whose priority is higher) than the current console log level (the first integer in the file) is displayed on the console. The rest are only logged into /var/log/messages.


The values in the file /proc/sys/kernel/printk can be changed according to the requirements. For instance, changing the first integer (the current console log level) to 8 causes messages of any priority to be printed on the console. Similarly, changing the second value changes the priority assigned to messages that do not specify one. The third and fourth values are generally not changed.
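The rule can be written down as a short user-space sketch. The structure and function names here are my own (hypothetical), and the parser works on a string so that it can be tried with the sample values without touching the real file.

```c
#include <stdio.h>

/* The four values of /proc/sys/kernel/printk, in file order. */
struct toy_printk_levels {
    int console;          /* current console log level */
    int default_msg;      /* level given to messages without one */
    int min_console;      /* lowest value the console level may take */
    int default_console;  /* boot-time default console level */
};

/* Parse text such as "6 4 1 7"; returns 0 on success, -1 otherwise. */
int toy_parse_printk(const char *text, struct toy_printk_levels *out)
{
    int n = sscanf(text, "%d %d %d %d", &out->console, &out->default_msg,
                   &out->min_console, &out->default_console);
    return n == 4 ? 0 : -1;
}

/* A message reaches the console only if its level is numerically lower. */
int toy_shown_on_console(int msg_level, const struct toy_printk_levels *lv)
{
    return msg_level < lv->console;
}
```

With the values 6, 4, 1, 7, a WARNING (4) message is shown on the console while an INFO (6) message is not; raising the first value to 8 lets even DEBUG (7) through.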


Logging When in GUI Mode:

In this case, logging is done according to the rules defined in the /etc/syslog.conf file (snapshot below). This file contains two columns, one for the type of message (in the form facility.priority), and the other for the place where the corresponding log message should go. The facility specifies the subsystem that produced the message and the priority specifies its severity.

For instance, EMERG messages are logged everywhere (all the terminals and log files), no matter which subsystem produced them (line #16). All the messages from the MAIL subsystem are logged into /var/log/maillog (line #12), and INFO messages from any subsystem are logged into /var/log/messages (line #7). See the syslog.conf manual page for more details.
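The facility.priority matching can be sketched in user-space C as follows. This is a simplified model under hypothetical names (toy_priority_level, toy_selector_matches): it handles a single facility, the "*" wildcard, and the rule that a named priority also covers everything more severe, and it ignores the rest of the real syslog.conf syntax.

```c
#include <string.h>

/* Priority names in decreasing severity; the index is the numeric level. */
static const char *const toy_priorities[] = {
    "emerg", "alert", "crit", "err", "warning", "notice", "info", "debug"
};

/* Map a priority name to 0..7, or return -1 if it is not recognised. */
int toy_priority_level(const char *name)
{
    int i;

    for (i = 0; i < 8; i++)
        if (strcmp(name, toy_priorities[i]) == 0)
            return i;
    return -1;
}

/*
 * Does a "facility.priority" selector match a message? Simplified rules:
 * "*" matches any facility or any priority, and a named priority matches
 * messages of that severity or higher (numerically lower or equal).
 */
int toy_selector_matches(const char *selector, const char *facility, int level)
{
    const char *dot = strchr(selector, '.');
    size_t flen;
    int wanted;

    if (dot == NULL)
        return 0;                      /* malformed selector */
    flen = (size_t)(dot - selector);
    if (!(flen == 1 && selector[0] == '*') &&
        !(strlen(facility) == flen && strncmp(selector, facility, flen) == 0))
        return 0;                      /* facility does not match */
    if (strcmp(dot + 1, "*") == 0)
        return 1;                      /* any priority */
    wanted = toy_priority_level(dot + 1);
    return wanted >= 0 && level <= wanted;
}
```

Under these rules a selector of *.info matches a kernel message of level ERR (3) but not one of level DEBUG (7), and mail.* matches only messages whose facility is mail.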


Getting printk() Messages Displayed on the Terminal in GUI Mode:

This is the point where I am stuck right now. One way to do this is to assign one terminal, say /dev/pts/3, in syslog.conf at line #7, so that all INFO (and higher) messages will go there. But this solution is no better than looking for output in /var/log/messages. My requirement is to see the printk() messages on whichever terminal I am using at that time. The klogd manual page suggests starting the klogd daemon with the -c switch to change the current console log level as needed. I changed it to 8, but could not see any change in the way messages were being logged. This did not solve the problem either.


Thoughts and suggestions are welcome. If someone has done this, I would appreciate some guidance.


Saturday, July 19, 2008

Compiling and Inserting the First Kernel Module

Let’s now see how to compile the module program shown in the previous post. We shall write a Makefile to make the compilation procedure simpler. Below is a ‘template’ of a Makefile, which I use to compile my modules. The description follows.

########################################################################################

#Build as a loadable module
obj-m += ‘module_name_1’.o ‘module_name_2’.o
‘module_name_1’-objs := “space separated list of object files needed by ‘module_name_1’.o”
‘module_name_2’-objs := “space separated list of object files needed by ‘module_name_2’.o”

#Location of the current linux kernel source directory
SRC=/lib/modules/`uname -r`/build

#Working Directory
PWD=`pwd`

default:
	make -C ${SRC} M=${PWD} modules

clean:
	rm -f ${‘module_name’-objs} ‘module_name’.o ‘module_name’.ko ‘module_name’.mod.o
	rm -f ‘module_name’.mod.c

################################################################################

obj-m tells the kernel build system which module object files have to be created by the make command. As many modules as we want to create can be specified here. After that, for each module in the obj-m list, a list of object files has to be specified, which together link into that particular module’s object file.

Under /lib/modules, there is one subdirectory for each kernel installed on your system, named after that particular kernel. Each of these directories contains the build files (source skeleton, module object files) of the corresponding kernel image. The shell command uname -r prints the name of the currently running kernel. Therefore, the SRC variable stores the path to the skeleton of the source code of the currently running kernel.

This path is specified to let make read the kernel's top level Makefile, which defines the rules to make the target, i.e. modules. This variable can alternatively/additionally be passed on the command line, as an argument to make. As is the case with Makefiles, a SRC specified on the command line takes precedence over the one specified in the Makefile.
The -C switch makes make change the current directory to $SRC (the kernel source directory) to find the kernel's top level Makefile. M=dir specifies the directory where the module to be built is present. M=dir modules instructs make to build all the modules in the directory dir that are listed in the variable obj-m.

The myFirstModule Makefile:


#####################################################################################

#Build as a loadable module
obj-m += myFirstModule.o
myFirstModule-objs:= myFirstModuleSource.o

#Location of the linux source directory
SRC=/lib/modules/`uname -r`/build

#Working Directory
PWD=`pwd`

default:
	make -C ${SRC} M=${PWD} modules

clean:
	rm -f ${myFirstModule-objs} myFirstModule.o myFirstModule.ko myFirstModule.mod.o
	rm -f myFirstModule.mod.c

###############################################################################


Running make:
[shweta@localhost modules]# make
make -C /lib/modules/`uname -r`/build M=`pwd` modules
make[1]: Entering directory `/usr/src/kernels/2.6.9-42.EL-smp-i686'
CC [M] /home/Shweta/wikalk/modules/myFirstModuleSource.o
LD [M] /home/Shweta/wikalk/modules/myFirstModule.o
Building modules, stage 2.
MODPOST
CC /home/Shweta/wikalk/modules/myFirstModule.mod.o
LD [M] /home/Shweta/wikalk/modules/myFirstModule.ko
make[1]: Leaving directory `/usr/src/kernels/2.6.9-42.EL-smp-i686'


Inserting (or linking) the module:
There are two shell commands to insert a module into the kernel, insmod and modprobe. Both load a module; the difference lies in how they search for the module binary to load (modprobe also loads any other modules that the requested one depends on).

insmod requires the path of the module file as argument:
[shweta@localhost modules]# insmod /root/wikalk/modules/myFirstModule.ko
modprobe requires just the module name as argument:

[shweta@localhost modules]# modprobe myFirstModule
modprobe searches for the module name under the default path /lib/modules/`uname -r`. If no module with the given name is found, an error is displayed.
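The directory in question can also be computed from a program. Here is a minimal sketch, assuming a POSIX system; toy_modules_dir is a hypothetical helper name, and uname() is the system call behind the uname -r command.

```c
#include <stdio.h>
#include <sys/utsname.h>

/*
 * Write "/lib/modules/<release>" into buf, where <release> is what
 * `uname -r` prints. Returns 0 on success, -1 on error or truncation.
 */
int toy_modules_dir(char *buf, size_t size)
{
    struct utsname u;
    int n;

    if (uname(&u) != 0)
        return -1;
    n = snprintf(buf, size, "/lib/modules/%s", u.release);
    return (n < 0 || (size_t)n >= size) ? -1 : 0;
}
```

Appending /build to the result gives the same path that the SRC variable in the Makefile above holds.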

So now, our myFirstModule is inserted/removed as follows:
[shweta@localhost modules]# insmod myFirstModule.ko
Hey! myFirstModule is in the kernel now.

[shweta@localhost modules]# rmmod myFirstModule
myFirstModule is removed from kernel
myFirstModule was in kernel for 143 seconds.

Some Points to Remember:

  • After inserting a module, it stays in the kernel until it is explicitly unloaded or the system is shut down. It is better to explicitly unload a module once it is no longer required, otherwise it will keep consuming system resources (memory, CPU...) for no use.
  • If the return statement is missing in init() function, the compilation succeeds without any errors or warnings. But insmod gives “error inserting myFirstModule.ko

Tuesday, July 15, 2008

Writing the First Kernel Module

Having read the bare essential theory, we are ready to be introduced to the amateur beauty of a module. Well, at least I found it beautiful. Let’s find out how you feel...

A Linux module is just a C program. However, writing a module requires a lot more attention, skill, awareness (and so on ...) than is generally required for a normal C program, which runs in user-space. Since a kernel module runs in kernel-space, errors must be handled very intelligently, as even the smallest problem may result in a system crash.

Now, have a glance at the code below and then read the following text. There is nothing in this program that anyone acquainted with C can't understand, except the absence of the main() function.

/***********************************************/

/* myFirstModuleSource.c */

#include <linux/module.h> /* macros for init(), exit() functions */
#include <linux/time.h>

/* kernel data structures to represent time */
struct timespec moduleLoadTime, moduleUnloadTime;

int myFirstModuleInit(void) /* mandatory syntax for an init function */
{
    printk("Hey! myFirstModule is in the kernel now.\n"); /* No, this ain't a typo error for printf() */
    moduleLoadTime = current_kernel_time(); /* kernel routine to determine current timestamp */
    return 0;
}

int calculateDifference(int one, int two) /* a normal C function */
{
    return (one-two);
}

void myFirstModuleExit(void) /* mandatory syntax for an exit function */
{
    int moduleLifespan;

    printk("myFirstModule is removed from kernel\n");
    moduleUnloadTime = current_kernel_time(); /* kernel routine to get current timestamp */
    moduleLifespan = calculateDifference(moduleUnloadTime.tv_sec,moduleLoadTime.tv_sec); /* a C function call */
    printk("myFirstModule was in kernel for %d seconds.\n",moduleLifespan);
}

module_init(myFirstModuleInit);
module_exit(myFirstModuleExit);

/***********************************************/

The above module is merely a "hello world" module, with the added functionality of displaying the duration for which it remained linked into the kernel.


No Main()'s Land.

For a user space C program, the main() function acts as the entry point, which tells the system where to start the execution from. For a kernel module, it is an init() function. The init function is executed only once, when the module is linked into the kernel (usually by the insmod shell command). However, unlike a user space program, a kernel module requires an additional function, the exit() function, which is executed when the module is removed from the kernel (usually using the rmmod shell command). Therefore, running a kernel module requires at least two functions:

1. An init function to load the module

2. An exit function to unload the module

A programmer can specify any function to be the init or exit function of a module. It is just that the name of that particular function has to be registered with the kernel using the macros module_init() and module_exit() (as done in the last two lines of the above code). But the signature cannot be altered: every init function must take no arguments and return an int, and every exit function must take no arguments and return nothing, as shown.
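The idea of registering two functions with fixed signatures can be imitated in user space with function pointers. Everything below is a toy under made-up names (struct toy_module, toy_insmod, toy_rmmod and the sample callbacks); it only mimics the contract and is not how the kernel's module loader is implemented.

```c
#include <stddef.h>

/* Hypothetical record of what module_init()/module_exit() register. */
struct toy_module {
    int  (*init)(void);   /* must return 0 on success */
    void (*exit)(void);   /* may be NULL: the module then cannot be unloaded */
    int  loaded;
};

int toy_calls;            /* counts callback invocations, for illustration */

/* Sample callbacks with the two mandatory signatures. */
int  toy_init_ok(void)  { toy_calls++; return 0; }
int  toy_init_bad(void) { return -1; }
void toy_exit(void)     { toy_calls++; }

/* Mimic insertion: run init once; a non-zero return aborts the load. */
int toy_insmod(struct toy_module *m)
{
    int err = m->init();

    if (err != 0)
        return err;
    m->loaded = 1;
    return 0;
}

/* Mimic removal: refuse if nothing is loaded or no exit was registered. */
int toy_rmmod(struct toy_module *m)
{
    if (!m->loaded || m->exit == NULL)
        return -1;
    m->exit();
    m->loaded = 0;
    return 0;
}
```

The function names are free, but the pointer types in struct toy_module are not, which is the same constraint the kernel puts on the functions handed to module_init() and module_exit().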


printk()??? A typo error?

Not really!! How could I make the same spelling mistake at three places?

The Linux kernel is not linked against the standard C library libc (or any user space library, for that matter), which is where printf() lives. Therefore, it has no access to printf(). But (thank God) it has its own output function, printk(). Well, there is a lot to say about printk(), which I plan to tell in another post. For now, just consider it an avatar of our old friend printf(). But always remember to put a '\n' at the end of the format string in every printk() call, and to look for all printk() output in /var/log/messages.


Some points to remember:

  • Each module must have an init() function, but exit() is not mandatory. However, if there is no exit() registered, the module is permanently linked into the kernel. It is only removed on reboot.
  • All the clean up activities should be done in the exit() function, to keep the module clean and safe.
  • Floating point arithmetic should be avoided in kernel modules: the kernel does not save and restore the floating point state for code running in kernel mode, and it has no math library to link against.
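On the last point, integer arithmetic is usually enough. As a hedged illustration, here is a user-space sketch that computes an elapsed time with millisecond resolution using integers only. struct toy_time and both helpers are hypothetical names; the fields merely mirror the tv_sec and tv_nsec members of a timestamp.

```c
/* Hypothetical integer-only timestamp, mirroring tv_sec and tv_nsec. */
struct toy_time {
    long sec;
    long nsec;   /* 0 .. 999999999 */
};

/* Elapsed whole milliseconds from start to end, using integers only. */
long toy_elapsed_ms(struct toy_time start, struct toy_time end)
{
    long sec = end.sec - start.sec;
    long nsec = end.nsec - start.nsec;

    if (nsec < 0) {          /* borrow one second */
        sec -= 1;
        nsec += 1000000000L;
    }
    return sec * 1000L + nsec / 1000000L;
}

/* Split milliseconds into whole seconds and a 0..999 remainder. */
void toy_split_ms(long ms, long *whole, long *frac)
{
    *whole = ms / 1000L;
    *frac = ms % 1000L;
}
```

The two parts from toy_split_ms() can be printed with a format such as "%ld.%03ld", which gives a fractional-looking result without a single floating point operation.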

Module writing is over now. In the next post I shall tell how I compiled the above program and inserted it into the kernel.