Process Creation, Process Switching, and System Calls

Goal

Explain how an operating system creates kernel processes, how kernel processes create user processes, how a user process invokes a system call and returns from it to user mode, how kernel processes are switched, and how user processes are switched. Along the way, compare how three hobby/teaching systems, HariboteOS, XV6, and uCore, handle each of these.

This article draws mainly on:

  1. Operating Systems: Design and Implementation (hereafter b1), i.e. the Minix 3 implementation
  2. Linux Kernel Development (hereafter b2)

This article is a patchwork stitched together from these sources.

What Is a Process

A process is a program in the midst of execution, together with its related resources. (b2, p. 20)

Linux Process Management

Every process has a process control block (PCB) maintained in the kernel that records process-related information; the Linux kernel's PCB is the task_struct structure.

Processes vs. Threads in Linux

Linux implements threads in an unusual way: from the kernel's point of view, there is no such concept as a thread. Linux implements all threads as processes, with no special scheduling algorithm or data structure to represent a thread. Instead, a thread is merely a process that shares certain resources with other processes. Each thread has its own task_struct, so to the kernel it looks just like an ordinary process (one that happens to share resources, such as its address space, with certain other processes).

Put simply: there is essentially no difference, except that a thread can share its parent's address space, open files, and signal handlers.

Kernel Processes vs. User Processes in Linux

(In short: kernel threads run only in kernel mode and have no user address space, i.e. their tsk->mm is NULL; see the sections on kernel-process creation and context switching below.)

The fork System Call

Basic Usage

#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>

int main(void)
{
    pid_t fpid;
    int count = 0;
    fpid = fork();
    if (fpid < 0)
            printf("error in fork!\n");
    else if (fpid == 0) // fpid == 0: we are in the child
    {
            printf("it's a child process, my process id is %d\n", getpid());
            count++;
    }
    else // fpid > 0: we are in the parent
    {
            printf("it's a parent process, my process id is %d\n", getpid());
            count++;
    }
    printf("count:%d\n", count);
    return 0;
}

fork() returns twice: the parent gets the child's PID (or a negative value on failure), while the child gets 0. So fpid == 0 tells the code it is running inside the newly created child, and a positive fpid identifies the child from the parent's side. Each process then increments its own copy of count, so both print count:1.

The int 0x80 software interrupt is a deliberate trap into the kernel; on classic 32-bit x86 Linux it is the standard gateway through which user programs invoke kernel code (later CPUs added the faster sysenter/syscall instructions for the same purpose).

How It Works

First, the _syscall0 macro issues software interrupt 0x80, moving the CPU from privilege level 3 to privilege level 0, after which the kernel dispatches to the sys_fork function. (This subsection describes the classic Linux 0.1x-era implementation; current kernels are covered in the next section.)
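
For reference, here is roughly what that macro looks like in early Linux (quoted from memory of Linux 0.11's include/unistd.h, so treat the details as approximate): user-space fork() is declared as _syscall0(int, fork), which expands to a function that loads the syscall number into eax, fires int 0x80, and returns the result from eax.

#define _syscall0(type,name) \
type name(void) \
{ \
    long __res; \
    __asm__ volatile ("int $0x80" \
        : "=a" (__res) \
        : "0" (__NR_##name)); \
    if (__res >= 0) \
        return (type) __res; \
    errno = -__res; \
    return -1; \
}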

sys_fork first calls find_empty_process() to claim a free slot in the task table and obtain a process number (a negative return value means no slot is available). It then calls copy_process(), which:

  • creates a task_struct for the new process and copies the parent's task_struct into it
  • allocates page tables for the new process and copies the parent's page tables into them
  • shares the parent's open files
  • sets up the new process's GDT entry
  • marks the new process runnable so that it joins the scheduling rotation

The Global Descriptor Table (GDT) is the x86 structure that tells the CPU how memory is segmented.
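
For a concrete picture of a GDT entry, Haribote OS (compared later in this article) declares the 8-byte descriptor like this; the base and limit are scattered across the fields for historical compatibility reasons:

struct SEGMENT_DESCRIPTOR {
    short limit_low, base_low;
    char base_mid, access_right;
    char limit_high, base_high;
};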

The clone System Call

Basic Usage

// sample.c
#define _GNU_SOURCE     /* needed for the glibc clone() wrapper */
#include <stdio.h>
#include <sched.h>
#include <stdlib.h>
#include <signal.h>
#include <sys/wait.h>

int fn(void *arg)
{
   printf("\nINFO: This code is running under child process.\n");

   int i = 0;

   int n = atoi((char *)arg);

   for ( i = 1 ; i <= 10 ; i++ )
      printf("%d * %d = %d\n", n, i, (n*i));

   printf("\n");

   return 0;
}

int main(int argc, char *argv[])
{
   printf("Hello, World!\n");

   if ( argc < 2 ) {
      printf("Usage: %s <number>\n", argv[0]);
      exit(EXIT_FAILURE);
   }

   /* the parent allocates the child's stack itself */
   void *pchild_stack = malloc(1024 * 1024);
   if ( pchild_stack == NULL ) {
      printf("ERROR: Unable to allocate memory.\n");
      exit(EXIT_FAILURE);
   }

   /* the x86 stack grows downward, so pass the top of the buffer */
   int pid = clone(fn, pchild_stack + (1024 * 1024), SIGCHLD, argv[1]);
   if ( pid < 0 ) {
        printf("ERROR: Unable to create the child process.\n");
        exit(EXIT_FAILURE);
   }

   wait(NULL);

   free(pchild_stack);

   printf("INFO: Child process terminated.\n");
   return 0;
}
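
To try it (assuming gcc on Linux): compile with gcc sample.c -o sample and run ./sample 5. The parent blocks in wait() until the child has printed the multiplication table.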

As the example shows, with clone the caller allocates and frees the child's stack itself.

How It Works, and How User Processes/Threads Are Created

kernel/fork.c

#ifdef __ARCH_WANT_SYS_FORK
SYSCALL_DEFINE0(fork)
{
#ifdef CONFIG_MMU
	struct kernel_clone_args args = {
		.exit_signal = SIGCHLD,
	};

	return kernel_clone(&args);
#else
	/* can not support in nommu mode */
	return -EINVAL;
#endif
}
#endif

#ifdef __ARCH_WANT_SYS_VFORK
SYSCALL_DEFINE0(vfork)
{
	struct kernel_clone_args args = {
		.flags		= CLONE_VFORK | CLONE_VM,
		.exit_signal	= SIGCHLD,
	};

	return kernel_clone(&args);
}
#endif

As the code shows, clone, fork, and vfork all bottom out in kernel_clone; older articles saying they call _do_fork are out of date, as that function has since been renamed to kernel_clone.

The source of kernel_clone is as follows:

/*
 *  Ok, this is the main fork-routine.
 *
 * It copies the process, and if successful kick-starts
 * it and waits for it to finish using the VM if required.
 *
 * args->exit_signal is expected to be checked for sanity by the caller.
 */
pid_t kernel_clone(struct kernel_clone_args *args)
{
	u64 clone_flags = args->flags;
	struct completion vfork;
	struct pid *pid;
	struct task_struct *p;
	int trace = 0;
	pid_t nr;

	/*
	 * For legacy clone() calls, CLONE_PIDFD uses the parent_tid argument
	 * to return the pidfd. Hence, CLONE_PIDFD and CLONE_PARENT_SETTID are
	 * mutually exclusive. With clone3() CLONE_PIDFD has grown a separate
	 * field in struct clone_args and it still doesn't make sense to have
	 * them both point at the same memory location. Performing this check
	 * here has the advantage that we don't need to have a separate helper
	 * to check for legacy clone().
	 */
	if ((args->flags & CLONE_PIDFD) &&
	    (args->flags & CLONE_PARENT_SETTID) &&
	    (args->pidfd == args->parent_tid))
		return -EINVAL;

	/*
	 * Determine whether and which event to report to ptracer.  When
	 * called from kernel_thread or CLONE_UNTRACED is explicitly
	 * requested, no event is reported; otherwise, report if the event
	 * for the type of forking is enabled.
	 */
	if (!(clone_flags & CLONE_UNTRACED)) {
		if (clone_flags & CLONE_VFORK)
			trace = PTRACE_EVENT_VFORK;
		else if (args->exit_signal != SIGCHLD)
			trace = PTRACE_EVENT_CLONE;
		else
			trace = PTRACE_EVENT_FORK;

		if (likely(!ptrace_event_enabled(current, trace)))
			trace = 0;
	}

	p = copy_process(NULL, trace, NUMA_NO_NODE, args);
	add_latent_entropy();

	if (IS_ERR(p))
		return PTR_ERR(p);

	/*
	 * Do this prior waking up the new thread - the thread pointer
	 * might get invalid after that point, if the thread exits quickly.
	 */
	trace_sched_process_fork(current, p);

	pid = get_task_pid(p, PIDTYPE_PID);
	nr = pid_vnr(pid);

	if (clone_flags & CLONE_PARENT_SETTID)
		put_user(nr, args->parent_tid);

	if (clone_flags & CLONE_VFORK) {
		p->vfork_done = &vfork;
		init_completion(&vfork);
		get_task_struct(p);
	}

	wake_up_new_task(p);

	/* forking complete and child started to run, tell ptracer */
	if (unlikely(trace))
		ptrace_event_pid(trace, pid);

	if (clone_flags & CLONE_VFORK) {
		if (!wait_for_vfork_done(p, &vfork))
			ptrace_event_pid(PTRACE_EVENT_VFORK_DONE, pid);
	}

	put_pid(pid);
	return nr;
}

Roughly, it proceeds as follows:

  1. Validate the arguments; e.g. CLONE_PIDFD and CLONE_PARENT_SETTID are mutually exclusive because both use the parent_tid slot
  2. Call p = copy_process() to duplicate the current process into a child
  3. Call add_latent_entropy(), a hook provided by the latent_entropy GCC plugin that mixes extra entropy into the random pool, to avoid repeated random numbers (see reference 5)
  4. Fire the trace_sched_process_fork tracepoint to report the fork to tracers (this does not wake the child)
  5. pid = get_task_pid(p, PIDTYPE_PID) takes a reference on the child's struct pid
  6. pid_vnr() converts it into the child's PID as seen from the current PID namespace
  7. put_user(nr, args->parent_tid) stores the child's ID for the parent if CLONE_PARENT_SETTID was requested
  8. wake_up_new_task(p) actually wakes the new child
  9. put_pid() drops the reference taken in step 5

fs_struct records a process's filesystem context: a user (reference) count, the umask, and the root and working directories:

include/linux/fs_struct.h

struct fs_struct {
    int users;
    spinlock_t lock;
    seqcount_t seq;
    int umask;
    int in_exec;
    struct path root, pwd;
};

The files field of the process descriptor points to the process's files_struct structure:

include/linux/fdtable.h

struct files_struct {
    /*
     * read mostly part
     */
    atomic_t count;
    bool resize_in_progress;
    wait_queue_head_t resize_wait;
    struct fdtable __rcu *fdt;
    struct fdtable fdtab;
    /*
     * written part on a separate cache line in SMP
     */
    spinlock_t file_lock ____cacheline_aligned_in_smp;
    int next_fd;
    unsigned long close_on_exec_init[1];
    unsigned long open_fds_init[1];
    struct file __rcu * fd_array[NR_OPEN_DEFAULT];
};
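
The fdt/fd_array split is what keeps descriptor lookup cheap: a file descriptor is just an index into the current fdtable. A minimal sketch of the lookup, modeled on the kernel's fd-lookup helpers (simplified; the real code runs under RCU and handles table resizing):

/* sketch: resolve an fd to a struct file (simplified from the kernel's helpers) */
static struct file *lookup_fd(struct files_struct *files, unsigned int fd)
{
    struct fdtable *fdt = rcu_dereference(files->fdt);

    if (fd >= fdt->max_fds)
        return NULL;
    return rcu_dereference(fdt->fd[fd]);
}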

Everything Is clone

Creating a thread:

clone(CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND, 0);

An ordinary fork:

clone(SIGCHLD, 0);

vfork:

clone(CLONE_VFORK | CLONE_VM | SIGCHLD, 0);

With fork(), the child gets its own copy of the parent's data and code segments (copy-on-write on modern kernels).

With vfork(), the child shares the parent's data segment, which is what CLONE_VM requests.
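
A quick way to see that difference is to let the child write a global and have the parent read it afterwards. A sketch using the glibc clone() wrapper (the function and variable names are invented for illustration; error handling trimmed):

#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>

static int shared = 0;

static int child(void *arg)
{
    shared = 42;    /* visible to the parent only when the VM is shared */
    return 0;
}

int main(void)
{
    char *stack = malloc(64 * 1024);

    /* fork-style: the child writes to its own copy-on-write page */
    if (clone(child, stack + 64 * 1024, SIGCHLD, NULL) > 0) {
        wait(NULL);
        printf("after fork-style clone: shared = %d\n", shared);  /* prints 0 */
    }

    /* CLONE_VM: parent and child share one address space */
    if (clone(child, stack + 64 * 1024, CLONE_VM | SIGCHLD, NULL) > 0) {
        wait(NULL);
        printf("after CLONE_VM clone: shared = %d\n", shared);    /* prints 42 */
    }

    free(stack);
    return 0;
}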

Creating Kernel Processes

Kernel threads run only in kernel mode. The Linux kernel itself can be viewed as one big service process; since Linux draws no hard line between processes and threads and these tasks all live in kernel space, they are conventionally called kernel threads.

The relevant functions:

  • kernel_thread
  • kthread_create
  • kthread_run

kthread_run() calls kthread_create(). kthread_create() does the preparatory work: it queues a creation request onto a list, and the kthreadd thread picks requests off that list and calls kernel_thread() to create the thread.

kernel_thread() is what actually creates the thread; kthread_run()/kthread_create() only prepare the request and hand it off.

All kernel threads are descendants of kthreadd (PID 2), and kthreadd itself is spawned by the kernel (PID 0) during boot. We can list kernel threads with ps -ef; they are the entries shown in square brackets:

UID          PID    PPID  C STIME TTY          TIME CMD
root           1       0  0 Nov17 ?        00:00:02 /sbin/init
root           2       0  0 Nov17 ?        00:00:00 [kthreadd]
root           3       2  0 Nov17 ?        00:00:00 [rcu_gp]
root           4       2  0 Nov17 ?        00:00:00 [rcu_par_gp]
root           6       2  0 Nov17 ?        00:00:00 [kworker/0:0H-events_highpri]
root           8       2  0 Nov17 ?        00:00:00 [mm_percpu_wq]
root           9       2  0 Nov17 ?        00:00:00 [rcu_tasks_rude_]
root          10       2  0 Nov17 ?        00:00:00 [rcu_tasks_trace]
root          11       2  0 Nov17 ?        00:00:13 [ksoftirqd/0]
root          12       2  0 Nov17 ?        00:00:43 [rcu_sched]
root          13       2  0 Nov17 ?        00:00:00 [migration/0]
root          15       2  0 Nov17 ?        00:00:00 [cpuhp/0]
root          17       2  0 Nov17 ?        00:00:00 [kdevtmpfs]
root          18       2  0 Nov17 ?        00:00:00 [netns]
root          19       2  0 Nov17 ?        00:00:00 [kauditd]
root          20       2  0 Nov17 ?        00:00:00 [khungtaskd]
root          21       2  0 Nov17 ?        00:00:00 [oom_reaper]
root          22       2  0 Nov17 ?        00:00:00 [writeback]
root          23       2  0 Nov17 ?        00:00:04 [kcompactd0]
root          24       2  0 Nov17 ?        00:00:00 [ksmd]
root          25       2  0 Nov17 ?        00:00:06 [khugepaged]
root          43       2  0 Nov17 ?        00:00:00 [kintegrityd]
root          44       2  0 Nov17 ?        00:00:00 [kblockd]
root          45       2  0 Nov17 ?        00:00:00 [blkcg_punt_bio]
root          46       2  0 Nov17 ?        00:00:00 [edac-poller]
root          47       2  0 Nov17 ?        00:00:00 [devfreq_wq]
root          48       2  0 Nov17 ?        00:00:03 [kworker/0:1H-kblockd]
root          49       2  0 Nov17 ?        00:00:04 [kswapd0]
root          50       2  0 Nov17 ?        00:00:00 [kthrotld]
root          51       2  0 Nov17 ?        00:00:00 [irq/24-pciehp]
root          52       2  0 Nov17 ?        00:00:00 [irq/25-pciehp]
root          53       2  0 Nov17 ?        00:00:00 [irq/26-pciehp]
root          54       2  0 Nov17 ?        00:00:00 [irq/27-pciehp]
root          55       2  0 Nov17 ?        00:00:00 [irq/28-pciehp]
root          56       2  0 Nov17 ?        00:00:00 [irq/29-pciehp]
root          57       2  0 Nov17 ?        00:00:00 [irq/30-pciehp]
root          58       2  0 Nov17 ?        00:00:00 [irq/31-pciehp]
root          59       2  0 Nov17 ?        00:00:00 [irq/32-pciehp]
root          60       2  0 Nov17 ?        00:00:00 [irq/33-pciehp]
root          61       2  0 Nov17 ?        00:00:00 [irq/34-pciehp]
root          62       2  0 Nov17 ?        00:00:00 [irq/35-pciehp]
root          63       2  0 Nov17 ?        00:00:00 [irq/36-pciehp]
root          64       2  0 Nov17 ?        00:00:00 [irq/37-pciehp]
root          65       2  0 Nov17 ?        00:00:00 [irq/38-pciehp]
root          66       2  0 Nov17 ?        00:00:00 [irq/39-pciehp]
root          67       2  0 Nov17 ?        00:00:00 [irq/40-pciehp]
root          68       2  0 Nov17 ?        00:00:00 [irq/41-pciehp]
root          69       2  0 Nov17 ?        00:00:00 [irq/42-pciehp]
root          70       2  0 Nov17 ?        00:00:00 [irq/43-pciehp]
root          71       2  0 Nov17 ?        00:00:00 [irq/44-pciehp]
root          72       2  0 Nov17 ?        00:00:00 [irq/45-pciehp]
root          73       2  0 Nov17 ?        00:00:00 [irq/46-pciehp]
root          74       2  0 Nov17 ?        00:00:00 [irq/47-pciehp]
root          75       2  0 Nov17 ?        00:00:00 [irq/48-pciehp]
root          76       2  0 Nov17 ?        00:00:00 [irq/49-pciehp]
root          77       2  0 Nov17 ?        00:00:00 [irq/50-pciehp]
root          78       2  0 Nov17 ?        00:00:00 [irq/51-pciehp]
root          79       2  0 Nov17 ?        00:00:00 [irq/52-pciehp]
root          80       2  0 Nov17 ?        00:00:00 [irq/53-pciehp]
root          81       2  0 Nov17 ?        00:00:00 [irq/54-pciehp]
root          82       2  0 Nov17 ?        00:00:00 [irq/55-pciehp]
root          83       2  0 Nov17 ?        00:00:00 [acpi_thermal_pm]
root          84       2  0 Nov17 ?        00:00:00 [ipv6_addrconf]
root          94       2  0 Nov17 ?        00:00:00 [kstrp]
root          97       2  0 Nov17 ?        00:00:00 [zswap-shrink]
root          98       2  0 Nov17 ?        00:00:00 [kworker/u257:0-hci0]
root         137       2  0 Nov17 ?        00:00:00 [mpt_poll_0]
root         139       2  0 Nov17 ?        00:00:00 [mpt/0]
root         142       2  0 Nov17 ?        00:00:00 [ata_sff]
root         143       2  0 Nov17 ?        00:00:00 [scsi_eh_0]
root         144       2  0 Nov17 ?        00:00:00 [scsi_tmf_0]
root         145       2  0 Nov17 ?        00:00:00 [scsi_eh_1]
root         146       2  0 Nov17 ?        00:00:00 [scsi_tmf_1]
root         152       2  0 Nov17 ?        00:00:00 [scsi_eh_2]
root         153       2  0 Nov17 ?        00:00:00 [scsi_tmf_2]
root         158       2  0 Nov17 ?        00:00:00 [kdmflush]
root         160       2  0 Nov17 ?        00:00:00 [kdmflush]
root         197       2  0 Nov17 ?        00:00:16 [jbd2/dm-0-8]
root         198       2  0 Nov17 ?        00:00:00 [ext4-rsv-conver]
root         235       1  0 Nov17 ?        00:00:00 /lib/systemd/systemd-journald
root         267       1  0 Nov17 ?        00:00:00 /lib/systemd/systemd-udevd
root         337       2  0 Nov17 ?        00:00:00 [cryptd]
root         345       2  0 Nov17 ?        00:00:10 [irq/16-vmwgfx]
root         348       2  0 Nov17 ?        00:00:00 [ttm_swap]
root         353       2  0 Nov17 ?        00:00:00 [card0-crtc0]
root         357       2  0 Nov17 ?        00:00:00 [card0-crtc1]
root         360       2  0 Nov17 ?        00:00:00 [card0-crtc2]
root         365       2  0 Nov17 ?        00:00:00 [card0-crtc3]
root         369       2  0 Nov17 ?        00:00:00 [card0-crtc4]
root         372       2  0 Nov17 ?        00:00:00 [card0-crtc5]
root         377       2  0 Nov17 ?        00:00:00 [card0-crtc6]
root         380       2  0 Nov17 ?        00:00:00 [card0-crtc7]
root         488       2  0 Nov17 ?        00:00:00 [kworker/u257:2-hci0]
root         490       2  0 Nov17 ?        00:00:00 [ext4-rsv-conver]
systemd+     500       1  0 Nov17 ?        00:00:00 /lib/systemd/systemd-timesyncd
root         509       1  0 Nov17 ?        00:00:00 /usr/bin/VGAuthService
root         510       1  0 Nov17 ?        00:02:01 /usr/bin/vmtoolsd
root         525       1  0 Nov17 ?        00:00:00 /usr/sbin/cron -f
message+     526       1  0 Nov17 ?        00:00:00 /usr/bin/dbus-daemon --system --address=sys
root         535       1  0 Nov17 ?        00:00:00 /usr/sbin/rsyslogd -n -iNONE
root         540       1  0 Nov17 ?        00:00:00 /lib/systemd/systemd-logind
root         589       1  0 Nov17 tty1     00:00:00 /bin/login -p --
root         599       1  0 Nov17 ?        00:00:00 sshd: /usr/sbin/sshd -D [listener] 0 of 10-
root         686       1  0 Nov17 ?        00:01:08 /www/server/panel/pyenv/bin/python /www/ser
root         703       1  0 Nov17 ?        00:00:00 /www/server/panel/pyenv/bin/python /www/ser
root       35997       1  0 Nov17 ?        00:00:00 /lib/systemd/systemd --user
root       35998   35997  0 Nov17 ?        00:00:00 (sd-pam)
root       36017     589  0 Nov17 tty1     00:00:00 -bash
root      558961       2  0 12:49 ?        00:00:32 [kworker/0:1-events_power_efficient]
root      576562       2  0 17:42 ?        00:00:00 [kworker/u256:3-ext4-rsv-conversion]

kernel_thread is used to create kernel threads. During boot, rest_init() calls it directly:

static noinline void __ref rest_init(void)
{
	int pid;

	rcu_scheduler_starting();
	/*
	 * We need to spawn init first so that it obtains pid 1, however
	 * the init task will end up wanting to create kthreads, which, if
	 * we schedule it before we create kthreadd, will OOPS.
	 */
	kernel_thread(kernel_init, NULL, CLONE_FS);
	numa_default_policy();
	pid = kernel_thread(kthreadd, NULL, CLONE_FS | CLONE_FILES);
	rcu_read_lock();
	kthreadd_task = find_task_by_pid_ns(pid, &init_pid_ns);
	rcu_read_unlock();
	complete(&kthreadd_done);

	/*
	 * The boot idle thread must execute schedule()
	 * at least once to get things moving:
	 */
	init_idle_bootup_task(current);
	schedule_preempt_disabled();
	/* Call into cpu_idle with preempt disabled */
	cpu_startup_entry(CPUHP_ONLINE);
}

The code above calls kernel_thread to start the kernel_init process (which becomes PID 1, init) and the kthreadd process (PID 2).

kthreadd is a resident kernel thread that watches kthread_create_list for new requests:

/* kthreadd routine (kthread.c) */
int kthreadd(void *unused)
{
	struct task_struct *tsk = current;

	/* Setup a clean context for our children to inherit. */
	set_task_comm(tsk, "kthreadd");
	ignore_signals(tsk);
	set_cpus_allowed_ptr(tsk, cpu_all_mask);
	set_mems_allowed(node_states[N_MEMORY]);

	current->flags |= PF_NOFREEZE;

	for (;;) {
		set_current_state(TASK_INTERRUPTIBLE);
		if (list_empty(&kthread_create_list))
			schedule();
		__set_current_state(TASK_RUNNING);

		spin_lock(&kthread_create_lock);
		while (!list_empty(&kthread_create_list)) {
			struct kthread_create_info *create;

			create = list_entry(kthread_create_list.next,
					    struct kthread_create_info, list);
			list_del_init(&create->list);
			spin_unlock(&kthread_create_lock);

			create_kthread(create); /* creates kernel threads with attributes enqueued */

			spin_lock(&kthread_create_lock);
		}
		spin_unlock(&kthread_create_lock);
	}

	return 0;
}

kthread_create in turn calls kthread_create_on_node(), which creates the kernel thread on the current NUMA node.

Non-Uniform Memory Access (NUMA) is a memory architecture for multiprocessor machines in which memory access time depends on where the memory sits relative to the processor: a processor reaches its own local memory faster than non-local memory. The shared memory is physically distributed, and the collection of all of it forms the global address space.

A NUMA node is a CPU/memory pair. Typically a CPU socket together with its closest memory banks forms one NUMA node. Whenever a CPU needs memory belonging to another NUMA node, it cannot reach that memory directly but must go through the CPU that owns it.

struct task_struct *kthread_create_on_node(int (*threadfn)(void *data),
					   void *data,
					   int node,
					   const char namefmt[], ...);

/**
 * kthread_create - create a kthread on the current node
 * @threadfn: the function to run in the thread
 * @data: data pointer for @threadfn()
 * @namefmt: printf-style format string for the thread name
 * @...: arguments for @namefmt.
 *
 * This macro will create a kthread on the current node, leaving it in
 * the stopped state. This is just a helper for
 * kthread_create_on_node();
 * see the documentation there for more details.
 */
#define kthread_create(threadfn, data, namefmt, arg...) \
	kthread_create_on_node(threadfn, data, NUMA_NO_NODE, namefmt, ##arg)

struct task_struct *kthread_create_on_cpu(int (*threadfn)(void *data),
					  void *data,
					  unsigned int cpu,
					  const char *namefmt);

/**
 * kthread_run - create and wake a thread.
 * @threadfn: the function to run until signal_pending(current).
 * @data: data ptr for @threadfn.
 * @namefmt: printf-style name for the thread.
 *
 * Description: Convenient wrapper for kthread_create() followed by
 * wake_up_process(). Returns the kthread or ERR_PTR(-ENOMEM).
 */
#define kthread_run(threadfn, data, namefmt, ...)			    \
({									    \
	struct task_struct *__k						    \
		= kthread_create(threadfn, data, namefmt, ## __VA_ARGS__); \
	if (!IS_ERR(__k))						    \
		wake_up_process(__k);					    \
	__k;								    \
})

kthread_create_on_node() packs the details of the kthread to be created (received as its arguments) into a kthread_create_info structure and queues it at the tail of kthread_create_list. It then wakes kthreadd and waits for the thread creation to complete:

/* kernel/kthread.c */
static struct task_struct *__kthread_create_on_node(int (*threadfn)(void *data),
						    void *data, int node,
						    const char namefmt[],
						    va_list args)
{
	DECLARE_COMPLETION_ONSTACK(done);
	struct task_struct *task;
	struct kthread_create_info *create = kmalloc(sizeof(*create),
						     GFP_KERNEL);

	if (!create)
		return ERR_PTR(-ENOMEM);
	create->threadfn = threadfn;
	create->data = data;
	create->node = node;
	create->done = &done;

	spin_lock(&kthread_create_lock);
	list_add_tail(&create->list, &kthread_create_list);
	spin_unlock(&kthread_create_lock);

	wake_up_process(kthreadd_task);
	/*
	 * Wait for completion in killable state, for I might be chosen by
	 * the OOM killer while kthreadd is trying to allocate memory for
	 * new kernel thread.
	 */
	if (unlikely(wait_for_completion_killable(&done))) {
		/*
		 * If I was SIGKILLed before kthreadd (or new kernel thread)
		 * calls complete(), leave the cleanup of this structure to
		 * that thread.
		 */
		if (xchg(&create->done, NULL))
			return ERR_PTR(-EINTR);
		/*
		 * kthreadd (or new kernel thread) will call complete()
		 * shortly.
		 */
		wait_for_completion(&done); /* wakeup on completion of thread creation */
	}
	...
	...
	...
}

struct task_struct *kthread_create_on_node(int (*threadfn)(void *data),
					   void *data, int node,
					   const char namefmt[],
					   ...)
{
	struct task_struct *task;
	va_list args;

	va_start(args, namefmt);
	task = __kthread_create_on_node(threadfn, data, node, namefmt, args);
	va_end(args);

	return task;
}

kthreadd calls the create_kthread() routine to start a kernel thread from the data queued on the list. This routine creates the thread and signals completion:

/* kernel/kthread.c */
static void create_kthread(struct kthread_create_info *create)
{
	int pid;

#ifdef CONFIG_NUMA
	current->pref_node_fork = create->node;
#endif

	/* We want our own signal handler (we take no signals by default). */
	pid = kernel_thread(kthread, create, CLONE_FS | CLONE_FILES | SIGCHLD);
	if (pid < 0) {
		/* If user was SIGKILLed, I release the structure. */
		struct completion *done = xchg(&create->done, NULL);

		if (!done) {
			kfree(create);
			return;
		}
		create->result = ERR_PTR(pid);
		complete(done); /* signal completion of thread creation */
	}
}

Ultimately, then, the creation still bottoms out in the fork machinery, i.e. the kernel_clone function (formerly _do_fork) in today's Linux sources.
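
Putting the kthread API together, this is the shape of a typical user of kthread_run, e.g. inside a driver (a minimal sketch; the module and thread names are invented for illustration):

#include <linux/delay.h>
#include <linux/err.h>
#include <linux/kthread.h>
#include <linux/module.h>

static struct task_struct *worker;

/* runs as a kernel thread until kthread_stop() is called */
static int worker_fn(void *data)
{
	while (!kthread_should_stop()) {
		/* ... periodic work goes here ... */
		msleep(1000);
	}
	return 0;
}

static int __init demo_init(void)
{
	worker = kthread_run(worker_fn, NULL, "demo-worker");
	return IS_ERR(worker) ? PTR_ERR(worker) : 0;
}

static void __exit demo_exit(void)
{
	kthread_stop(worker);	/* wakes the thread and waits for it to exit */
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");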


Linux Process Context Switching

The relevant code is in kernel/sched/core.c (linux/core.c at master · torvalds/linux on github.com), centered on the context_switch function. Two things have to be switched: the memory (address-space) state and the register state.

Four switch combinations are distinguished:

  • kernel -> kernel lazy + transfer active
  • user -> kernel lazy + mmgrab() active
  • kernel -> user switch + mmdrop() active
  • user -> user switch

/*
 * context_switch - switch to the new MM and the new thread's register state.
 */
static __always_inline struct rq *
context_switch(struct rq *rq, struct task_struct *prev,
	       struct task_struct *next, struct rq_flags *rf)
{
	prepare_task_switch(rq, prev, next);

	/*
	 * For paravirt, this is coupled with an exit in switch_to to
	 * combine the page table reload and the switch backend into
	 * one hypercall.
	 */
	arch_start_context_switch(prev);

	/*
	 * kernel -> kernel   lazy + transfer active
	 *   user -> kernel   lazy + mmgrab() active
	 *
	 * kernel ->   user   switch + mmdrop() active
	 *   user ->   user   switch
	 */
	if (!next->mm) {                                // to kernel
		enter_lazy_tlb(prev->active_mm, next);

		next->active_mm = prev->active_mm;
		if (prev->mm)                           // from user
			mmgrab(prev->active_mm);
		else
			prev->active_mm = NULL;
	} else {                                        // to user
		membarrier_switch_mm(rq, prev->active_mm, next->mm);
		/*
		 * sys_membarrier() requires an smp_mb() between setting
		 * rq->curr / membarrier_switch_mm() and returning to userspace.
		 *
		 * The below provides this either through switch_mm(), or in
		 * case 'prev->active_mm == next->mm' through
		 * finish_task_switch()'s mmdrop().
		 */
		switch_mm_irqs_off(prev->active_mm, next->mm, next);

		if (!prev->mm) {                        // from kernel
			/* will mmdrop() in finish_task_switch(). */
			rq->prev_mm = prev->active_mm;
			prev->active_mm = NULL;
		}
	}

	rq->clock_update_flags &= ~(RQCF_ACT_SKIP|RQCF_REQ_SKIP);

	prepare_lock_switch(rq, next, rf);

	/* Here we just switch the register state and the stack. */
	switch_to(prev, next, prev);
	barrier();

	return finish_task_switch(prev);
}

mmgrab(prev->active_mm) bumps the reference count on the previous task's active_mm so that it is not freed while still borrowed.

switch_mm_irqs_off is what switches the address space.

A kernel thread's tsk->mm is always NULL; this is how the code tells whether a task is a kernel thread or a user task.

Note next->active_mm = prev->active_mm;: when switching to a kernel thread, it borrows the previous task's active_mm, which avoids frequent, unnecessary address-space switches.

switch_to swaps the previous process's processor state for the next one's. That means saving and restoring:

  • the stack
  • the registers

Implementing Preemptive Switching

Without preemption, the kernel would have to wait for each process to call schedule() explicitly. Linux instead uses the need_resched flag to mark that a reschedule should be forced; the flag lives in every process's thread_info structure.
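
Conceptually, the kernel tests that flag on every return to user space and at in-kernel preemption points; the pattern reduces to something like this (a sketch, not the literal entry-path code):

/* sketch: the reschedule check on the way back to user space */
if (test_thread_flag(TIF_NEED_RESCHED))
	schedule();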

XV6 Processes

PCB

struct proc {
  uint sz;                     // Size of process memory (bytes)
  pde_t* pgdir;                // Page table
  char *kstack;                // Bottom of kernel stack for this process
  enum procstate state;        // Process state
  volatile int pid;            // Process ID
  struct proc *parent;         // Parent process
  struct trapframe *tf;        // Trap frame for current syscall
  struct context *context;     // swtch() here to run process
  void *chan;                  // If non-zero, sleeping on chan
  int killed;                  // If non-zero, has been killed
  struct file *ofile[NOFILE];  // Open files
  struct inode *cwd;           // Current working directory
  char name[16];               // Process name (debugging)
};

ptable

struct {
  struct spinlock lock;
  struct proc proc[NPROC];
} ptable;

The process table.

Creating User Processes

xv6-public/proc.c at master · mit-pdos/xv6-public (github.com)

// Create a new process copying p as the parent.
// Sets up stack to return as if from system call.
// Caller must set state of returned proc to RUNNABLE.
int
fork(void)
{
  int i, pid;
  struct proc *np;
  struct proc *curproc = myproc();

  // Allocate process.
  if((np = allocproc()) == 0){
    return -1;
  }

  // Copy process state from proc.
  if((np->pgdir = copyuvm(curproc->pgdir, curproc->sz)) == 0){
    kfree(np->kstack);
    np->kstack = 0;
    np->state = UNUSED;
    return -1;
  }
  np->sz = curproc->sz;
  np->parent = curproc;
  *np->tf = *curproc->tf;

  // Clear %eax so that fork returns 0 in the child.
  np->tf->eax = 0;

  for(i = 0; i < NOFILE; i++)
    if(curproc->ofile[i])
      np->ofile[i] = filedup(curproc->ofile[i]);
  np->cwd = idup(curproc->cwd);

  safestrcpy(np->name, curproc->name, sizeof(curproc->name));

  pid = np->pid;

  acquire(&ptable.lock);

  np->state = RUNNABLE;

  release(&ptable.lock);

  return pid;
}

As the code shows, process creation in XV6 is very simple: allocproc grabs a slot, copyuvm duplicates the page table, some housekeeping follows (duplicating file references, setting state, and so on), and the child's PID is returned.
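
For completeness, allocproc scans ptable for an UNUSED slot, then carves a trap frame and a context out of the new kernel stack so that the child starts in forkret and "returns" through trapret (abridged from xv6's proc.c):

static struct proc*
allocproc(void)
{
  struct proc *p;
  char *sp;

  acquire(&ptable.lock);
  for(p = ptable.proc; p < &ptable.proc[NPROC]; p++)
    if(p->state == UNUSED)
      goto found;
  release(&ptable.lock);
  return 0;

found:
  p->state = EMBRYO;
  p->pid = nextpid++;
  release(&ptable.lock);

  // Allocate kernel stack.
  if((p->kstack = kalloc()) == 0){
    p->state = UNUSED;
    return 0;
  }
  sp = p->kstack + KSTACKSIZE;

  // Leave room for trap frame.
  sp -= sizeof *p->tf;
  p->tf = (struct trapframe*)sp;

  // Set up new context to start executing at forkret, which returns to trapret.
  sp -= 4;
  *(uint*)sp = (uint)trapret;

  sp -= sizeof *p->context;
  p->context = (struct context*)sp;
  memset(p->context, 0, sizeof *p->context);
  p->context->eip = (uint)forkret;

  return p;
}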

Creating Kernel Processes

XV6 appears to have no separate notion of kernel processes; sys_fork is just a thin wrapper over the ordinary fork:

xv6-public/sysproc.c at master · mit-pdos/xv6-public (github.com)

int
sys_fork(void)
{
  return fork();
}

Scheduling and Switching

//PAGEBREAK: 42
// Per-CPU process scheduler.
// Each CPU calls scheduler() after setting itself up.
// Scheduler never returns.  It loops, doing:
//  - choose a process to run
//  - swtch to start running that process
//  - eventually that process transfers control
//      via swtch back to the scheduler.
void
scheduler(void)
{
  struct proc *p;
  struct cpu *c = mycpu();
  c->proc = 0;
  
  for(;;){
    // Enable interrupts on this processor.
    sti();

    // Loop over process table looking for process to run.
    acquire(&ptable.lock);
    for(p = ptable.proc; p < &ptable.proc[NPROC]; p++){
      if(p->state != RUNNABLE)
        continue;

      // Switch to chosen process.  It is the process's job
      // to release ptable.lock and then reacquire it
      // before jumping back to us.
      c->proc = p;
      switchuvm(p);
      p->state = RUNNING;

      swtch(&(c->scheduler), p->context);
      switchkvm();

      // Process is done running for now.
      // It should have changed its p->state before coming back.
      c->proc = 0;
    }
    release(&ptable.lock);

  }
}

switchuvm() switches to the chosen process's page table (and hardware task state).

switchkvm() switches back to the kernel page table.

swtch is an assembly routine defined in swtch.S. It saves the current context and resumes the target's, and is used here to jump between the per-CPU scheduler context (c->scheduler) and the chosen process's context.
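
What swtch has to save is tiny. From xv6's proc.h: only the callee-saved registers need explicit saving (the caller-saved ones are already on the stack per the x86 calling convention), and eip is saved implicitly by the call to swtch:

struct context {
  uint edi;
  uint esi;
  uint ebx;
  uint ebp;
  uint eip;
};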

uCore Processes

PCB

uCore's PCB is called proc_struct.

Below is the alloc_proc function; both kernel threads and user processes are allocated here.

// alloc_proc - alloc a proc_struct and init all fields of proc_struct
static struct proc_struct *
alloc_proc(void) {
    struct proc_struct *proc = kmalloc(sizeof(struct proc_struct));
    if (proc != NULL) {
    //LAB4:EXERCISE1 YOUR CODE
    /*
     * below fields in proc_struct need to be initialized
     *       enum proc_state state;                      // Process state
     *       int pid;                                    // Process ID
     *       int runs;                                   // the running times of Proces
     *       uintptr_t kstack;                           // Process kernel stack
     *       volatile bool need_resched;                 // bool value: need to be rescheduled to release CPU?
     *       struct proc_struct *parent;                 // the parent process
     *       struct mm_struct *mm;                       // Process's memory management field
     *       struct context context;                     // Switch here to run process
     *       struct trapframe *tf;                       // Trap frame for current interrupt
     *       uintptr_t cr3;                              // CR3 register: the base addr of Page Directroy Table(PDT)
     *       uint32_t flags;                             // Process flag
     *       char name[PROC_NAME_LEN + 1];               // Process name
     */
     //LAB5 YOUR CODE : (update LAB4 steps)
    /*
     * below fields(add in LAB5) in proc_struct need to be initialized	
     *       uint32_t wait_state;                        // waiting state
     *       struct proc_struct *cptr, *yptr, *optr;     // relations between processes
	 */
     //LAB6 YOUR CODE : (update LAB5 steps)
    /*
     * below fields(add in LAB6) in proc_struct need to be initialized
     *     struct run_queue *rq;                       // running queue contains Process
     *     list_entry_t run_link;                      // the entry linked in run queue
     *     int time_slice;                             // time slice for occupying the CPU
     *     skew_heap_entry_t lab6_run_pool;            // FOR LAB6 ONLY: the entry in the run pool
     *     uint32_t lab6_stride;                       // FOR LAB6 ONLY: the current stride of the process
     *     uint32_t lab6_priority;                     // FOR LAB6 ONLY: the priority of process, set by lab6_set_priority(uint32_t)
     */
    //LAB8:EXERCISE2 YOUR CODE HINT:need add some code to init fs in proc_struct, ...
    }
    return proc;
}
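
One common way to fill in the LAB4/LAB5 initialization, using the fields listed in the comments above (a reference sketch, not the official answer; boot_cr3 is uCore's kernel page-directory base):

    proc->state = PROC_UNINIT;   // not yet fully initialized
    proc->pid = -1;              // no pid assigned yet
    proc->runs = 0;
    proc->kstack = 0;
    proc->need_resched = 0;
    proc->parent = NULL;
    proc->mm = NULL;             // kernel threads have no user mm
    memset(&(proc->context), 0, sizeof(struct context));
    proc->tf = NULL;
    proc->cr3 = boot_cr3;        // kernel page directory by default
    proc->flags = 0;
    memset(proc->name, 0, PROC_NAME_LEN + 1);
    proc->wait_state = 0;                        // LAB5
    proc->cptr = proc->yptr = proc->optr = NULL; // LAB5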

General Process Creation

The core work is done by do_fork(), much as in Linux:

/* do_fork -     parent process for a new child process
 * @clone_flags: used to guide how to clone the child process
 * @stack:       the parent's user stack pointer. if stack==0, It means to fork a kernel thread.
 * @tf:          the trapframe info, which will be copied to child process's proc->tf
 */
int
do_fork(uint32_t clone_flags, uintptr_t stack, struct trapframe *tf) {
    int ret = -E_NO_FREE_PROC;
    struct proc_struct *proc;
    if (nr_process >= MAX_PROCESS) {
        goto fork_out;
    }
    ret = -E_NO_MEM;
    //LAB4:EXERCISE2 YOUR CODE
    //LAB8:EXERCISE2 YOUR CODE  HINT:how to copy the fs in parent's proc_struct?
    /*
     * Some Useful MACROs, Functions and DEFINEs, you can use them in below implementation.
     * MACROs or Functions:
     *   alloc_proc:   create a proc struct and init fields (lab4:exercise1)
     *   setup_kstack: alloc pages with size KSTACKPAGE as process kernel stack
     *   copy_mm:      process "proc" duplicate OR share process "current"'s mm according clone_flags
     *                 if clone_flags & CLONE_VM, then "share" ; else "duplicate"
     *   copy_thread:  setup the trapframe on the  process's kernel stack top and
     *                 setup the kernel entry point and stack of process
     *   hash_proc:    add proc into proc hash_list
     *   get_pid:      alloc a unique pid for process
     *   wakeup_proc:  set proc->state = PROC_RUNNABLE
     * VARIABLES:
     *   proc_list:    the process set's list
     *   nr_process:   the number of process set
     */

    //    1. call alloc_proc to allocate a proc_struct
    //    2. call setup_kstack to allocate a kernel stack for child process
    //    3. call copy_mm to dup OR share mm according clone_flag
    //    4. call copy_thread to setup tf & context in proc_struct
    //    5. insert proc_struct into hash_list && proc_list
    //    6. call wakeup_proc to make the new child process RUNNABLE
    //    7. set ret vaule using child proc's pid

	//LAB5 YOUR CODE : (update LAB4 steps)
   /* Some Functions
    *    set_links:  set the relation links of process.  ALSO SEE: remove_links:  lean the relation links of process 
    *    -------------------
	*    update step 1: set child proc's parent to current process, make sure current process's wait_state is 0
	*    update step 5: insert proc_struct into hash_list && proc_list, set the relation links of process
    */
	
fork_out:
    return ret;

bad_fork_cleanup_fs:  //for LAB8
    put_files(proc);
bad_fork_cleanup_kstack:
    put_kstack(proc);
bad_fork_cleanup_proc:
    kfree(proc);
    goto fork_out;
}
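
Following the numbered steps in the comments, a LAB4/LAB5-style reference implementation looks roughly like this (a sketch, not the official answer):

    if ((proc = alloc_proc()) == NULL) {
        goto fork_out;
    }
    proc->parent = current;                 // LAB5: record parentage
    assert(current->wait_state == 0);

    if (setup_kstack(proc) != 0) {
        goto bad_fork_cleanup_proc;
    }
    if (copy_mm(clone_flags, proc) != 0) {
        goto bad_fork_cleanup_kstack;
    }
    copy_thread(proc, stack, tf);

    bool intr_flag;
    local_intr_save(intr_flag);
    {
        proc->pid = get_pid();
        hash_proc(proc);
        set_links(proc);                    // LAB5: insert into lists + set links
    }
    local_intr_restore(intr_flag);

    wakeup_proc(proc);
    ret = proc->pid;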

Creating Kernel Processes

Implemented via kernel_thread:

int
kernel_thread(int (*fn)(void *), void *arg, uint32_t clone_flags)
{
    struct trapframe tf;
    memset(&tf, 0, sizeof(struct trapframe));
    tf.tf_cs = KERNEL_CS;
    tf.tf_ds = tf.tf_es = tf.tf_ss = KERNEL_DS;
    tf.tf_regs.reg_ebx = (uint32_t)fn;   // entry function for the new thread
    tf.tf_regs.reg_edx = (uint32_t)arg;  // its argument
    tf.tf_eip = (uint32_t)kernel_thread_entry;
    return do_fork(clone_flags | CLONE_VM, 0, &tf);
}

Fundamentally this also rests on do_fork; passing stack == 0 is what marks the request as a kernel thread (see the comment on do_fork above).

Creating User Threads

int
thread(int (*fn)(void *), void *arg, thread_t *tidp) {
    if (fn == NULL || tidp == NULL) {
        return -E_INVAL;
    }
    int ret;
    uintptr_t stack = 0;
    if ((ret = mmap(&stack, THREAD_STACKSIZE, MMAP_WRITE | MMAP_STACK)) != 0) {
        return ret;
    }
    assert(stack != 0);

    if ((ret = clone(CLONE_VM | CLONE_THREAD, stack + THREAD_STACKSIZE, fn, arg)) < 0) {
        munmap(stack, THREAD_STACKSIZE);
        return ret;
    }

    tidp->pid = ret;
    tidp->stack = (void *)stack;
    return 0;
}

Scheduling

uCore likewise uses a need_resched flag.

The scheduling loop:

void
cpu_idle(void) {
    while (1) {
        if (current->need_resched) {
            schedule();
        }
    }
}

The scheduler is essentially a simplified version of Linux's.

Haribote OS Processes

Creating Tasks

Haribote's process management is very rudimentary; it does not even distinguish user mode from kernel mode.

30daysOS/mtask.c at develop · kamaboko123/30daysOS (github.com)


struct TASK *task_init(struct MEMMAN *memman){
    int i;
    struct TASK *task;
    struct TASK *idle;
    struct SEGMENT_DESCRIPTOR *gdt = (struct SEGMENT_DESCRIPTOR *) ADR_GDT;
    taskctl = (struct TASKCTL *) memman_alloc_4k(memman, sizeof(struct TASKCTL));
    for(i = 0; i < MAX_TASKS; i++){
        taskctl->tasks0[i].flags = 0;
        // write into the TSS's ldtr to tell the CPU which LDT this task uses
        taskctl->tasks0[i].tss.ldtr = (TASK_GDT0 + MAX_TASKS + i) * 8;
        taskctl->tasks0[i].sel = (TASK_GDT0 + i) * 8;
        
        // descriptor for the TSS (104 bytes)
        set_segmdesc(gdt + TASK_GDT0 + i, 103, (int)&taskctl->tasks0[i].tss, AR_TSS32);
        
        // descriptor for the LDT (16 bytes: two 8-byte entries, same layout as segment descriptors),
        // recorded in each TASK structure's ldt field
        set_segmdesc(gdt + TASK_GDT0 + MAX_TASKS + i, 15, (int)&taskctl->tasks0[i].ldt, AR_LDT);
    }
    for(i = 0; i < MAX_TASKLEVELS; i++){
        taskctl->level[i].running = 0;
        taskctl->level[i].now = 0;
    }
    
    task = task_alloc();
    task->flags = 2; // running
    task->priority = 2; // 0.02 seconds
    task->level = 0;
    task_add(task);
    task_switchsub();
    load_tr(task->sel);
    task_timer = timer_alloc();
    timer_settime(task_timer, task->priority);
    
    // each task does cli before going to sleep, but interrupts must stay enabled
    // so that it can be woken up (by writes into its FIFO);
    // task_alloc sets IF=1 in eflags, so interrupts are allowed, and the idle task
    // can then write to a FIFO and thereby wake the other tasks
    idle = task_alloc();
    idle->tss.esp = memman_alloc_4k(memman, 64 * 1024) + 64 * 1024;
    idle->tss.eip = (int) &task_idle;
    idle->tss.es = 1 * 8;
    idle->tss.cs = 2 * 8;
    idle->tss.ss = 1 * 8;
    idle->tss.ds = 1 * 8;
    idle->tss.fs = 1 * 8;
    idle->tss.gs = 1 * 8;
    task_run(idle, MAX_TASKLEVELS - 1, 1);
    
    return task;
}

Simple and crude: at most 1000 tasks (MAX_TASKS).

Tasks are created with task_alloc and registered with task_add.

void task_add(struct TASK *task){
    // append the task to its level's run list
    struct TASKLEVEL *tl = &taskctl->level[task->level];
    tl->tasks[tl->running] = task;
    tl->running++;
    task->flags = 2; // running
}

The idea is straightforward: append the task to its task level. tl->running is a counter that doubles as the index of the next free slot.

Scheduling

void task_switch(void){
    struct TASKLEVEL *tl = &taskctl->level[taskctl->now_lv];
    struct TASK *new_task;
    struct TASK *now_task = tl->tasks[tl->now];
    
    tl->now++;
    if(tl->now == tl->running){
        tl->now = 0;
    }
    
    // if the active level needs re-evaluating, call task_switchsub to update it
    if(taskctl->lv_change != 0){
        task_switchsub();
        tl = &taskctl->level[taskctl->now_lv];
    }
    
    new_task = tl->tasks[tl->now];
    timer_settime(task_timer, new_task->priority);
    
    if(new_task != now_task){
        farjmp(0, new_task->sel);
    }
}

void task_switchsub(void){
    int i;
    
    // find the highest level that has running tasks
    for(i = 0; i < MAX_TASKLEVELS; i++){
        if(taskctl->level[i].running > 0) break;
    }
    
    taskctl->now_lv = i;
    taskctl->lv_change = 0;
}

Each task's time slice is simply its priority value, as set by timer_settime(task_timer, new_task->priority).

Linux System Calls

  1. Handler names begin with sys_
  2. Handlers are defined with the SYSCALL_DEFINEn(name, ...) macros, where n is the number of parameters
  3. Every system call has a unique system call number

sys.c - kernel/sys.c - Linux source code (v3.10) - Bootlin

/**
 * sys_getpid - return the thread group id of the current process
 *
 * Note, despite the name, this returns the tgid not the pid.  The tgid and
 * the pid are identical unless CLONE_THREAD was specified on clone() in
 * which case the tgid is the same in all threads of the same group.
 *
 * This is SMP safe as current->tgid does not change.
 */
SYSCALL_DEFINE0(getpid)
{
	return task_tgid_vnr(current);
}

SYSCALL_DEFINE0(getpid) expands to asmlinkage long sys_getpid(void).

How Does a User Program Invoke a System Call?

A user program cannot execute system-call code itself; it can only have the kernel execute it on its behalf. That is, it notifies the kernel, and the kernel returns the result.

How the Kernel Is Notified

The kernel is notified via the int 0x80 software interrupt or the sysenter instruction. The user program writes the system call number into the agreed-upon register (eax on x86); the syscall handler reads it from there and dispatches to the corresponding system call:

call *sys_call_table(,%rax, 8)

How System Call Arguments Reach the Kernel

The first few arguments are passed in registers as well (ebx, ecx, edx, esi, and edi on 32-bit x86); if there are more, they are placed in memory and a single register holds a pointer to them. On x86 the return value comes back in eax.
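
A minimal sketch of issuing a system call by hand with int 0x80 (assuming a 32-bit x86 Linux environment, where __NR_getpid is 20):

#include <stdio.h>

int main(void)
{
    long pid;

    /* syscall number goes in eax; the return value comes back in eax */
    __asm__ volatile ("int $0x80"
                      : "=a" (pid)
                      : "a" (20));  /* 20 == __NR_getpid on i386 */

    printf("pid = %ld\n", pid);
    return 0;
}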

References

  1. Linux 内核线程及普通进程总结 - kk Blog —— 通用基础 (abcdxyzk.github.io)
  2. C Programming: Creating a child process using “clone” - CodeSteps
  3. fork 系统调用过程分析 - De4dCr0w's Blog (https://de4dcr0w.github.io/分析fork系统调用.html)
  4. Linux 文件系统之 VFS(六) (mudongliang.github.io)
  5. 初探 Linux kernel 亂數產生器 - random generator - SZ Lin with Cybersecurity & Embedded Linux (https://szlin.me/2016/10/05/初探-linux-kernel-亂數產生器-random-generator/)
  6. Linux 下的进程类别(内核线程、轻量级进程和用户进程)以及其创建方式 - Linux 进程的管理与调度(四) - 明明是悟空 - 博客园 (cnblogs.com)
  7. Kernel threads | Mastering Linux Kernel Development (packtpub.com)
  8. 深入理解 Linux 内核之内核线程(下) - 知乎 (zhihu.com)
  9. 《操作系统》xv6 阅读报告之进程调度 | kinami's (https://www.kinami.cc/2021/04/14/《操作系统》xv6阅读报告之进程调度/)
  10. 设计关键数据结构 - 进程控制块 | uCore Lab Documents (gitbooks.io)
  11. 进程切换过程 - ucore_os_docs (gitbooks.io)