By Piyush Shukla in Computers — Feb 7, 2022

Linux : Implementing pthreads!

Linux does not have a concept for threads, yet it is possible to implement features of POSIX threads just by using the clone() syscall

Multi threading is required across a myriad of applications for parallel executing processes on a hardware CPU.

Processes in Linux are called tasks. A task is a struct(task_struct) maintained by the kernel and the whole collection of tasks are stored in 2 ways

in the form of a hashed map, to support functions that find a task using a PID.
in the form of a circular doubly linked list, to traverse over all the tasks in the system.

Struct `task_struct`

struct task_struct {

    volatile long state;
    void ∗stack;
    unsigned int flags;

    int prio, static_prio;

    struct list_head tasks;

    struct mm_struct ∗mm, ∗active_mm;

    pid_t pid;
    pid_t tgid;

    struct task_struct ∗real_parent;

    char comm[TASK_COMM_LEN];

    struct thread_struct thread;

    struct files_struct ∗files;

    ...

};

A little about the POSIX Thread APIs

POSIX API Standard mandates an interface by which system issues parallel executions or threads.

A POSIX Process holds all the process releated information and all POSIX Threads(pthreads) are enclosed inside it.
Killing a pthread would result in killing the process along with all the threads in it.
A Signal dispatched to the process is delivered to the first thread which is not blocking that signal. If no such thread is found, then the signal waits for some time, untill some thread unblocks it to be handled, it is then delivered to that thread. Signals are handled at thread level, however, all signal handlers are defined globally by the process.

The Story in Linux so far...(History, not present)

Linux only used to conduct executions in 1:1 execution model, meaning that one kernel thread is created for every user thread in the system. This is the simplest kind of execution model where you do not have to manage memory and internal scheduling of threads.

Every process is a task from the kernel's POV. So this imposes problems in implementing the POSIX threads APIs . For example, even if we use the linux task model to simulate pthreads, we will have to deal with the following issues:

No way of storing information which task belongs to which process
Killing a task just kills that task and not the whole task group which we calling the process
Signals are not handled according to the POSIX rules. Only a single task receives that signal and if it blocks it then it is blocked forever, instead of waiting for some other thread to handle it.

Clearly, changes were required inside the kernel to facilitate these issues.

Working of clone() syscall

When either a clone(), fork(), or vfork() system call is issued, the kernel invokes the do_fork() function, which executes the following steps:

1. If the clone_pid flag is specified, the do_fork( ) function checks whether the PID of the parent process is not 0; if so, it returns an error code. Only the swapper process is allowed to set clone_pid; this is required when initializing a multiprocessor system.

2. The alloc_task_struct( ) function is invoked to get a new 8 KB union task_union memory area to store the process descriptor and the Kernel Mode stack of the new process.

3. The function follows the current pointer to obtain the parent process descriptor and copies it into the new process descriptor in the memory area just allocated.

4. A few checks occur to make sure the user has the resources necessary to start a new process. First, the function checks whether current->rlim[RLiMiT_NPROC] .rlim_cur is smaller than or equal to the current number of processes owned by the user. If so, an error code is returned, unless the process has root privileges. The function gets the current number of processes owned by the user from a per-user data structure named user_struct. This data structure can be found through a pointer in the user field of the process descriptor.

5. The function checks that the number of processes is smaller than the value of the max_threads variable. The initial value of this variable depends on the amount of RAM in the system. The general rule is that the space taken by all process descriptors and Kernel Mode stacks cannot exceed 1/8 of the physical memory. However, the system administrator may change this value by writing in the /proc/sys/kernel/threads-max file.

6. If the parent process uses any kernel modules, the function increments the corresponding reference counters. Each kernel module has its own reference counter, which ensures that the module will not be unloaded while it is being used.

7. The function then updates some of the flags included in the flags field that have been copied from the parent process:

It clears the pf_superpriv flag, which indicates whether the process has used any of its superuser privileges.
It clears the pf_usedfpu flag.
It sets the pf_forknoexec flag, which indicates that the child process has not yet issued an execve( ) system call.

8. Now the function has taken almost everything that it can use from the parent process; the rest of its activities focus on setting up new resources in the child and letting the kernel know that this new process has been born. First, the function invokes the get_pid() function to obtain a new PID, which will be assigned to the child process (unless the clone_pid flag is set).

9. The function then updates all the process descriptor fields that cannot be inherited from the parent process, such as the fields that specify the process parenthood relationships.

10. Unless specified differently by the flags parameter, it invokes copy_files(), copy_fs(), copy_sighand(), and copy_mm() to create new data structures and copy into them the values of the corresponding parent process data structures.

11. The do_fork() function invokes copy_thread() to initialize the Kernel Mode stack of the child process with the values contained in the CPU registers when the clone() call was issued (these values have been saved in the Kernel Mode stack of the parent). However, the function forces the value 0 into the field corresponding to the eax register. The thread.esp field in the descriptor of the child process is initialized with the base address of the child's Kernel Mode stack, and the address of an assembly language function (`ret_from_fork( )`) is stored in the thread.eip field. The copy_thread() function also invokes unlazy_fpu() on the parent and duplicates the contents of the thread.i387 field.

12. If either clone_thread or clone_parent is set, the function copies the value of the p_opptr and p_pptr fields of the parent into the corresponding fields of the child. The parent of the child thus appears as the parent of the current process. Otherwise, the function stores the process descriptor address of current into the p_opptr and p_pptr fields of the child.

13. If the clone_ptrace flag is not set, the function sets the ptrace field in the child process descriptor to 0. This field stores a few flags used when a process is being traced by another process. Even if the current process is being traced, the child will not.

14. Conversely, if the clone_ptrace flag is set, the function checks whether the parent process is being traced because in this case, the child should be traced too. Therefore, if pt_ptraced is set in current->ptrace, the function copies the current->p_pptr field into the corresponding field of the child.

15. The do_fork() function checks the value of clone_thread. If the flag is set, the function inserts the child in the thread group of the parent and copies in the tgid field the value of the parent's tgid; otherwise, the function sets the tgid field to the value of the pid field.

16. The function uses the set_links macro to insert the new process descriptor in the process list.

17. The function invokes hash_pid( ) to insert the new process descriptor in the pidhash hash table.

18. The function increments the values of nr_threads and current->user->processes.

19. If the child is being traced, the function sends a sigstop signal to it so that the debugger has a chance to look at it before it starts the execution.

20. It invokes wake_up_process() to set the state field of the child process descriptor to task_running and to insert the child in the runqueue list.

21. If the clone_vfork flag is specified, the function inserts the parent process in a wait queue and suspends it until the child releases its memory address space (that is, until the child either terminates or executes a new program).

22. The do_fork() function returns the PID of the child, which is eventually read by the parent process in User Mode.

Changes to the kernel...

Thread Id(tid) and thread group id(tgid) are added to task_struct. Tgid is equal to the first tid in a thread group.
getpid() now returns tgid of the running thread.
gettid() returns the tid of the thread.
A flag CLONE_THREAD added to be used in clone() syscall to create a thread like evironment for the child process. The child gets the same memory map as the address space but the stack is seperate along with a seperate mask for signals.
A monitor program keeps the signals waiting in a queue in order to be delivered to a thread which has not masked the signal or has just became ready to process it.(using pthread_sigmask()).

So how does it all work?

In order to create a thread now, call clone() along with the flag CLONE_THREAD creates a new process/task with the tgid equal to the parent's tgid. Signal mask can be set before creating the thread by setting its own mask and then creating the thread as the child process copies the parent's signal mask. exit() function exits the process by killing all the threads, internally it uses exit_group() function. To kill a single task use pthread_exit().

Code Example

Following code example is present on eliben Github page:

#define _GNU_SOURCE
#include 
#include 
#include 
#include 
#include 
#include 
#include 

static int child_func(void* arg) {
  char* buf = (char*)arg;
  printf("Child sees buf = \"%s\"\n", buf);
  strcpy(buf, "hello from another thread");
  return 0;
}

int main(int argc, char** argv) {
  // Allocate stack for child task.
  const int STACK_SIZE = 65536;
  char* stack = malloc(STACK_SIZE);
  if (!stack) {
    perror("malloc");
    exit(1);
  }

  unsigned long flags = 0;

  char buf[100];
  strcpy(buf, "hello from main thread");
  if (clone(child_func, stack + STACK_SIZE, flags | CLONE_VM | CLONE_THREAD, buf) == -1) {
    perror("clone");
    exit(1);
  }

  int status;
  if (wait(&status) == -1) {
    perror("wait");
    exit(1);
  }

  printf("Child exited with status %d. buf = \"%s\"\n", status, buf);
  return 0;
}

Note: kernel threads are different than pthreads. To create, use kernl_thread().