Process creation, switching, and system calls
Goal
Explain how an operating system creates kernel processes, how a kernel process creates user processes, how a user process invokes a system call and returns from it, how kernel processes are switched, and how user processes are switched; then compare the approaches of three hobby systems: HariboteOS, XV6, and uCore.
Main references
- Operating Systems: Design and Implementation (hereafter b1), for its Minix 3 material
- Linux Kernel Development (hereafter b2)
This article is stitched together from these sources.
What is a process
A process is a program in the midst of execution, together with its related resources. (b2, p. 20)
Linux process management
Every process is tracked in the kernel by a process control block (PCB) that holds its bookkeeping information; the Linux kernel's PCB is the task_struct structure.
Differences between processes and threads in Linux
Linux implements threads in a distinctive way. From the kernel's point of view there is no concept of a thread: Linux implements all threads as ordinary processes. The kernel has no special scheduling algorithm or special data structure to represent a thread. Instead, a thread is simply a process that shares certain resources with other processes. Each thread has its very own task_struct, so to the kernel it looks like a normal process (one that happens to share some resources, such as its address space, with others).
In short: essentially no difference, except that a thread shares with its parent the address space, open files, and signal handlers.
Differences between kernel processes and user processes in Linux
Both are represented by a task_struct. The practical difference, used repeatedly below, is that a kernel thread runs only in kernel space and owns no user address space (its tsk->mm is NULL), while a user process has its own mm and alternates between user and kernel mode.
The fork system call
Basic usage
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

int main()
{
pid_t fpid;
int count = 0;
fpid = fork();
if(fpid < 0)
printf("error in fork!\n");
else if (fpid == 0) // fpid == 0: we are in the child process
{
printf("it's a child process,my process id is %d\n",getpid());
count++;
}
else // fpid > 0: we are in the parent process
{
printf("it's a parent process,my process id is %d\n",getpid());
count ++;
}
printf("count:%d\n",count);
return 0;
}
fork() returns twice. In the parent it returns the child's PID, so fpid effectively points at the child. If fpid == 0, this process owns no child from the call; it is itself the newly created child, owned by another process.
Interrupts
int 0x80 is how a program deliberately enters the kernel. On classic 32-bit x86, this software interrupt was the standard way for user code to reach kernel code (the sysenter fast path, mentioned later, is the other).
Implementation
First, syscall0 triggers the software interrupt 0x80, and the CPU moves from privilege level 3 to level 0. The sys_fork function then runs. sys_fork calls find_empty_process() to claim a free slot for the process and obtain a process number (a negative return value means no free slot is available), and then copy_process() does the following:
- allocates a task_struct for the new process and copies the old process's task_struct into it
- allocates page tables for the new process and copies the old process's page tables over
- shares the old process's open files
- sets up the new process's GDT entry
- marks the new process runnable so it joins the scheduling rotation
(This describes the classic early-Linux flow; the modern path is covered below.)
The Global Descriptor Table (GDT) is the x86 structure that tells the CPU how memory is segmented.
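For orientation, each GDT entry is an 8-byte segment descriptor. A minimal C rendering of the standard x86 layout (the field names are mine; the bit layout is the architectural one):

#include <stdint.h>

/* One 8-byte x86 segment descriptor, as stored in the GDT. */
struct gdt_entry {
    uint16_t limit_low;   /* segment limit, bits 0-15 */
    uint16_t base_low;    /* base address, bits 0-15 */
    uint8_t  base_mid;    /* base address, bits 16-23 */
    uint8_t  access;      /* present bit, privilege level (DPL), type */
    uint8_t  granularity; /* limit bits 16-19 plus flags */
    uint8_t  base_high;   /* base address, bits 24-31 */
} __attribute__((packed));

Setting the new process's GDT entry, as in the last step above, means filling in descriptors like this for the new task (in early Linux, its TSS and LDT descriptors).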
The clone system call
Basic usage
// sample.c
#define _GNU_SOURCE /* required for clone() from <sched.h> */
#include <stdio.h>
#include <sched.h>
#include <signal.h> /* SIGCHLD */
#include <stdlib.h>
#include <sys/wait.h>
int fn(void *arg)
{
printf("\nINFO: This code is running under child process.\n");
int i = 0;
int n = atoi((char *)arg);
for ( i = 1 ; i <= 10 ; i++ )
printf("%d * %d = %d\n", n, i, (n*i));
printf("\n");
return 0;
}
int main(int argc, char *argv[])
{
if ( argc < 2 ) {
printf("Usage: %s <multiplier>\n", argv[0]);
exit(EXIT_FAILURE);
}
printf("Hello, World!\n");
void *pchild_stack = malloc(1024 * 1024);
if ( pchild_stack == NULL ) {
printf("ERROR: Unable to allocate memory.\n");
exit(EXIT_FAILURE);
}
int pid = clone(fn, pchild_stack + (1024 * 1024), SIGCHLD, argv[1]);
if ( pid < 0 ) {
printf("ERROR: Unable to create the child process.\n");
exit(EXIT_FAILURE);
}
wait(NULL);
free(pchild_stack);
printf("INFO: Child process terminated.\n");
}
As the example shows, with clone the caller allocates and frees the child's stack itself. The pointer passed is the high end of the buffer, because the x86 stack grows downward.
Implementation, and how user processes/threads are created
#ifdef __ARCH_WANT_SYS_FORK
SYSCALL_DEFINE0(fork)
{
#ifdef CONFIG_MMU
struct kernel_clone_args args = {
.exit_signal = SIGCHLD,
};
return kernel_clone(&args);
#else
/* can not support in nommu mode */
return -EINVAL;
#endif
}
#endif
#ifdef __ARCH_WANT_SYS_VFORK
SYSCALL_DEFINE0(vfork)
{
struct kernel_clone_args args = {
.flags = CLONE_VFORK | CLONE_VM,
.exit_signal = SIGCHLD,
};
return kernel_clone(&args);
}
#endif
As the code shows, clone, fork, and vfork all bottom out in kernel_clone; the claim seen online that they call _do_fork is out of date (that function was renamed to kernel_clone). Its source:
/*
* Ok, this is the main fork-routine.
*
* It copies the process, and if successful kick-starts
* it and waits for it to finish using the VM if required.
*
* args->exit_signal is expected to be checked for sanity by the caller.
*/
pid_t kernel_clone(struct kernel_clone_args *args)
{
u64 clone_flags = args->flags;
struct completion vfork;
struct pid *pid;
struct task_struct *p;
int trace = 0;
pid_t nr;
/*
* For legacy clone() calls, CLONE_PIDFD uses the parent_tid argument
* to return the pidfd. Hence, CLONE_PIDFD and CLONE_PARENT_SETTID are
* mutually exclusive. With clone3() CLONE_PIDFD has grown a separate
* field in struct clone_args and it still doesn't make sense to have
* them both point at the same memory location. Performing this check
* here has the advantage that we don't need to have a separate helper
* to check for legacy clone().
*/
if ((args->flags & CLONE_PIDFD) &&
(args->flags & CLONE_PARENT_SETTID) &&
(args->pidfd == args->parent_tid))
return -EINVAL;
/*
* Determine whether and which event to report to ptracer. When
* called from kernel_thread or CLONE_UNTRACED is explicitly
* requested, no event is reported; otherwise, report if the event
* for the type of forking is enabled.
*/
if (!(clone_flags & CLONE_UNTRACED)) {
if (clone_flags & CLONE_VFORK)
trace = PTRACE_EVENT_VFORK;
else if (args->exit_signal != SIGCHLD)
trace = PTRACE_EVENT_CLONE;
else
trace = PTRACE_EVENT_FORK;
if (likely(!ptrace_event_enabled(current, trace)))
trace = 0;
}
p = copy_process(NULL, trace, NUMA_NO_NODE, args);
add_latent_entropy();
if (IS_ERR(p))
return PTR_ERR(p);
/*
* Do this prior waking up the new thread - the thread pointer
* might get invalid after that point, if the thread exits quickly.
*/
trace_sched_process_fork(current, p);
pid = get_task_pid(p, PIDTYPE_PID);
nr = pid_vnr(pid);
if (clone_flags & CLONE_PARENT_SETTID)
put_user(nr, args->parent_tid);
if (clone_flags & CLONE_VFORK) {
p->vfork_done = &vfork;
init_completion(&vfork);
get_task_struct(p);
}
wake_up_new_task(p);
/* forking complete and child started to run, tell ptracer */
if (unlikely(trace))
ptrace_event_pid(trace, pid);
if (clone_flags & CLONE_VFORK) {
if (!wait_for_vfork_done(p, &vfork))
ptrace_event_pid(PTRACE_EVENT_VFORK_DONE, pid);
}
put_pid(pid);
return nr;
}
Roughly, kernel_clone:
- checks the arguments (CLONE_PIDFD and CLONE_PARENT_SETTID are mutually exclusive);
- calls p = copy_process() to duplicate the current task into the child;
- calls add_latent_entropy(), provided by the latent-entropy GCC plugin, to keep random numbers from repeating (see reference 5);
- calls trace_sched_process_fork(current, p), the fork tracepoint;
- obtains the child's pid with pid = get_task_pid(p, PIDTYPE_PID), and pid_vnr() converts it to the pid number visible in the caller's namespace;
- put_user(nr, args->parent_tid) stores the child's pid for the parent when CLONE_PARENT_SETTID is set;
- wake_up_new_task(p) wakes the child;
- put_pid() drops the reference count and releases the pid.
fs_struct records the process's filesystem context (a user count, its umask, and its root and working directories):
include/linux/fs_struct.h
struct fs_struct {
int users;
spinlock_t lock;
seqcount_t seq;
int umask;
int in_exec;
struct path root, pwd;
};
The files field of the process descriptor points to the process's files_struct structure:
include/linux/fdtable.h
struct files_struct {
/*
* read mostly part
*/
atomic_t count;
bool resize_in_progress;
wait_queue_head_t resize_wait;
struct fdtable __rcu *fdt;
struct fdtable fdtab;
/*
* written part on a separate cache line in SMP
*/
spinlock_t file_lock ____cacheline_aligned_in_smp;
int next_fd;
unsigned long close_on_exec_init[1];
unsigned long open_fds_init[1];
struct file __rcu * fd_array[NR_OPEN_DEFAULT];
};
Everything is clone
Creating a thread:
clone(CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND, 0);
A plain fork:
clone(SIGCHLD, 0);
vfork:
clone(CLONE_VFORK | CLONE_VM | SIGCHLD, 0);
fork(): the child gets copies of the parent's data and code segments.
vfork(): the child shares the parent's data segment.
A small demo of the thread flag combination appears below.
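As a quick check that CLONE_VM really shares the address space, here is a small demo of my own (not from the references): the child writes a global variable and the parent observes the write.

#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>

static int shared = 0; /* lives in the parent's data segment */

static int child_fn(void *arg)
{
    shared = 42; /* with CLONE_VM this writes the parent's memory */
    return 0;
}

int main(void)
{
    char *stack = malloc(1024 * 1024);
    if (stack == NULL)
        return 1;
    /* The "thread" flag combination quoted above, plus SIGCHLD so the
     * parent can wait for the child. */
    int pid = clone(child_fn, stack + 1024 * 1024,
                    CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND | SIGCHLD,
                    NULL);
    if (pid < 0)
        return 1;
    waitpid(pid, NULL, 0);
    printf("shared = %d\n", shared); /* prints 42: memory was shared */
    free(stack);
    return 0;
}

With a plain fork() the child would write its own copy-on-write page, and the parent would still print 0.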
Creating kernel processes
Kernel threads run only in kernel mode. The Linux kernel itself can be viewed as one service process. Since Linux barely distinguishes processes from threads, and these tasks all run in kernel space, they are usually called kernel threads.
Relevant functions:
- kernel_thread
- kthread_create
- kthread_run
kthread_run() calls kthread_create(); kthread_create() appends a request to a list, and the kthreadd thread consumes that list and calls kernel_thread() to create the thread.
kernel_thread() is where the thread is really created.
kthread_run()/kthread_create(): do the special preparation, then let kernel_thread() create the thread.
All kernel threads are children of kthreadd (PID 2), which is itself created by the kernel (PID 0) during boot. We can list them with ps -ef (the listing below is ps -ef output; kernel threads appear in square brackets):
UID PID PPID C STIME TTY TIME CMD
root 1 0 0 Nov17 ? 00:00:02 /sbin/init
root 2 0 0 Nov17 ? 00:00:00 [kthreadd]
root 3 2 0 Nov17 ? 00:00:00 [rcu_gp]
root 4 2 0 Nov17 ? 00:00:00 [rcu_par_gp]
root 6 2 0 Nov17 ? 00:00:00 [kworker/0:0H-events_highpri]
root 8 2 0 Nov17 ? 00:00:00 [mm_percpu_wq]
root 9 2 0 Nov17 ? 00:00:00 [rcu_tasks_rude_]
root 10 2 0 Nov17 ? 00:00:00 [rcu_tasks_trace]
root 11 2 0 Nov17 ? 00:00:13 [ksoftirqd/0]
root 12 2 0 Nov17 ? 00:00:43 [rcu_sched]
root 13 2 0 Nov17 ? 00:00:00 [migration/0]
root 15 2 0 Nov17 ? 00:00:00 [cpuhp/0]
root 17 2 0 Nov17 ? 00:00:00 [kdevtmpfs]
root 18 2 0 Nov17 ? 00:00:00 [netns]
root 19 2 0 Nov17 ? 00:00:00 [kauditd]
root 20 2 0 Nov17 ? 00:00:00 [khungtaskd]
root 21 2 0 Nov17 ? 00:00:00 [oom_reaper]
root 22 2 0 Nov17 ? 00:00:00 [writeback]
root 23 2 0 Nov17 ? 00:00:04 [kcompactd0]
root 24 2 0 Nov17 ? 00:00:00 [ksmd]
root 25 2 0 Nov17 ? 00:00:06 [khugepaged]
root 43 2 0 Nov17 ? 00:00:00 [kintegrityd]
root 44 2 0 Nov17 ? 00:00:00 [kblockd]
root 45 2 0 Nov17 ? 00:00:00 [blkcg_punt_bio]
root 46 2 0 Nov17 ? 00:00:00 [edac-poller]
root 47 2 0 Nov17 ? 00:00:00 [devfreq_wq]
root 48 2 0 Nov17 ? 00:00:03 [kworker/0:1H-kblockd]
root 49 2 0 Nov17 ? 00:00:04 [kswapd0]
root 50 2 0 Nov17 ? 00:00:00 [kthrotld]
root 51 2 0 Nov17 ? 00:00:00 [irq/24-pciehp]
root 52 2 0 Nov17 ? 00:00:00 [irq/25-pciehp]
root 53 2 0 Nov17 ? 00:00:00 [irq/26-pciehp]
root 54 2 0 Nov17 ? 00:00:00 [irq/27-pciehp]
root 55 2 0 Nov17 ? 00:00:00 [irq/28-pciehp]
root 56 2 0 Nov17 ? 00:00:00 [irq/29-pciehp]
root 57 2 0 Nov17 ? 00:00:00 [irq/30-pciehp]
root 58 2 0 Nov17 ? 00:00:00 [irq/31-pciehp]
root 59 2 0 Nov17 ? 00:00:00 [irq/32-pciehp]
root 60 2 0 Nov17 ? 00:00:00 [irq/33-pciehp]
root 61 2 0 Nov17 ? 00:00:00 [irq/34-pciehp]
root 62 2 0 Nov17 ? 00:00:00 [irq/35-pciehp]
root 63 2 0 Nov17 ? 00:00:00 [irq/36-pciehp]
root 64 2 0 Nov17 ? 00:00:00 [irq/37-pciehp]
root 65 2 0 Nov17 ? 00:00:00 [irq/38-pciehp]
root 66 2 0 Nov17 ? 00:00:00 [irq/39-pciehp]
root 67 2 0 Nov17 ? 00:00:00 [irq/40-pciehp]
root 68 2 0 Nov17 ? 00:00:00 [irq/41-pciehp]
root 69 2 0 Nov17 ? 00:00:00 [irq/42-pciehp]
root 70 2 0 Nov17 ? 00:00:00 [irq/43-pciehp]
root 71 2 0 Nov17 ? 00:00:00 [irq/44-pciehp]
root 72 2 0 Nov17 ? 00:00:00 [irq/45-pciehp]
root 73 2 0 Nov17 ? 00:00:00 [irq/46-pciehp]
root 74 2 0 Nov17 ? 00:00:00 [irq/47-pciehp]
root 75 2 0 Nov17 ? 00:00:00 [irq/48-pciehp]
root 76 2 0 Nov17 ? 00:00:00 [irq/49-pciehp]
root 77 2 0 Nov17 ? 00:00:00 [irq/50-pciehp]
root 78 2 0 Nov17 ? 00:00:00 [irq/51-pciehp]
root 79 2 0 Nov17 ? 00:00:00 [irq/52-pciehp]
root 80 2 0 Nov17 ? 00:00:00 [irq/53-pciehp]
root 81 2 0 Nov17 ? 00:00:00 [irq/54-pciehp]
root 82 2 0 Nov17 ? 00:00:00 [irq/55-pciehp]
root 83 2 0 Nov17 ? 00:00:00 [acpi_thermal_pm]
root 84 2 0 Nov17 ? 00:00:00 [ipv6_addrconf]
root 94 2 0 Nov17 ? 00:00:00 [kstrp]
root 97 2 0 Nov17 ? 00:00:00 [zswap-shrink]
root 98 2 0 Nov17 ? 00:00:00 [kworker/u257:0-hci0]
root 137 2 0 Nov17 ? 00:00:00 [mpt_poll_0]
root 139 2 0 Nov17 ? 00:00:00 [mpt/0]
root 142 2 0 Nov17 ? 00:00:00 [ata_sff]
root 143 2 0 Nov17 ? 00:00:00 [scsi_eh_0]
root 144 2 0 Nov17 ? 00:00:00 [scsi_tmf_0]
root 145 2 0 Nov17 ? 00:00:00 [scsi_eh_1]
root 146 2 0 Nov17 ? 00:00:00 [scsi_tmf_1]
root 152 2 0 Nov17 ? 00:00:00 [scsi_eh_2]
root 153 2 0 Nov17 ? 00:00:00 [scsi_tmf_2]
root 158 2 0 Nov17 ? 00:00:00 [kdmflush]
root 160 2 0 Nov17 ? 00:00:00 [kdmflush]
root 197 2 0 Nov17 ? 00:00:16 [jbd2/dm-0-8]
root 198 2 0 Nov17 ? 00:00:00 [ext4-rsv-conver]
root 235 1 0 Nov17 ? 00:00:00 /lib/systemd/systemd-journald
root 267 1 0 Nov17 ? 00:00:00 /lib/systemd/systemd-udevd
root 337 2 0 Nov17 ? 00:00:00 [cryptd]
root 345 2 0 Nov17 ? 00:00:10 [irq/16-vmwgfx]
root 348 2 0 Nov17 ? 00:00:00 [ttm_swap]
root 353 2 0 Nov17 ? 00:00:00 [card0-crtc0]
root 357 2 0 Nov17 ? 00:00:00 [card0-crtc1]
root 360 2 0 Nov17 ? 00:00:00 [card0-crtc2]
root 365 2 0 Nov17 ? 00:00:00 [card0-crtc3]
root 369 2 0 Nov17 ? 00:00:00 [card0-crtc4]
root 372 2 0 Nov17 ? 00:00:00 [card0-crtc5]
root 377 2 0 Nov17 ? 00:00:00 [card0-crtc6]
root 380 2 0 Nov17 ? 00:00:00 [card0-crtc7]
root 488 2 0 Nov17 ? 00:00:00 [kworker/u257:2-hci0]
root 490 2 0 Nov17 ? 00:00:00 [ext4-rsv-conver]
systemd+ 500 1 0 Nov17 ? 00:00:00 /lib/systemd/systemd-timesyncd
root 509 1 0 Nov17 ? 00:00:00 /usr/bin/VGAuthService
root 510 1 0 Nov17 ? 00:02:01 /usr/bin/vmtoolsd
root 525 1 0 Nov17 ? 00:00:00 /usr/sbin/cron -f
message+ 526 1 0 Nov17 ? 00:00:00 /usr/bin/dbus-daemon --system --address=sys
root 535 1 0 Nov17 ? 00:00:00 /usr/sbin/rsyslogd -n -iNONE
root 540 1 0 Nov17 ? 00:00:00 /lib/systemd/systemd-logind
root 589 1 0 Nov17 tty1 00:00:00 /bin/login -p --
root 599 1 0 Nov17 ? 00:00:00 sshd: /usr/sbin/sshd -D [listener] 0 of 10-
root 686 1 0 Nov17 ? 00:01:08 /www/server/panel/pyenv/bin/python /www/ser
root 703 1 0 Nov17 ? 00:00:00 /www/server/panel/pyenv/bin/python /www/ser
root 35997 1 0 Nov17 ? 00:00:00 /lib/systemd/systemd --user
root 35998 35997 0 Nov17 ? 00:00:00 (sd-pam)
root 36017 589 0 Nov17 tty1 00:00:00 -bash
root 558961 2 0 12:49 ? 00:00:32 [kworker/0:1-events_power_efficient]
root 576562 2 0 17:42 ? 00:00:00 [kworker/u256:3-ext4-rsv-conversion]
kernel_thread is used to create kernel threads.
static noinline void __ref rest_init(void)
{
int pid;
rcu_scheduler_starting();
/*
* We need to spawn init first so that it obtains pid 1, however
* the init task will end up wanting to create kthreads, which, if
* we schedule it before we create kthreadd, will OOPS.
*/
kernel_thread(kernel_init, NULL, CLONE_FS);
numa_default_policy();
pid = kernel_thread(kthreadd, NULL, CLONE_FS | CLONE_FILES);
rcu_read_lock();
kthreadd_task = find_task_by_pid_ns(pid, &init_pid_ns);
rcu_read_unlock();
complete(&kthreadd_done);
/*
* The boot idle thread must execute schedule()
* at least once to get things moving:
*/
init_idle_bootup_task(current);
schedule_preempt_disabled();
/* Call into cpu_idle with preempt disabled */
cpu_startup_entry(CPUHP_ONLINE);
}
The code above starts the kernel_init process and the kthreadd process by calling kernel_thread.
kthreadd is a resident thread that watches kthread_create_list for changes:
/*kthreadd routine(kthread.c) */
int kthreadd(void *unused)
{
struct task_struct *tsk = current;
/* Setup a clean context for our children to inherit. */
set_task_comm(tsk, "kthreadd");
ignore_signals(tsk);
set_cpus_allowed_ptr(tsk, cpu_all_mask);
set_mems_allowed(node_states[N_MEMORY]);
current->flags |= PF_NOFREEZE;
for (;;) {
set_current_state(TASK_INTERRUPTIBLE);
if (list_empty(&kthread_create_list))
schedule();
__set_current_state(TASK_RUNNING);
spin_lock(&kthread_create_lock);
while (!list_empty(&kthread_create_list)) {
struct kthread_create_info *create;
create = list_entry(kthread_create_list.next,
struct kthread_create_info, list);
list_del_init(&create->list);
spin_unlock(&kthread_create_lock);
create_kthread(create); /* creates kernel threads with attributes enqueued */
spin_lock(&kthread_create_lock);
}
spin_unlock(&kthread_create_lock);
}
return 0;
}
kthread_create in turn calls kthread_create_on_node(), which creates the kernel thread on the current NUMA node.
Non-uniform memory access (NUMA) is a memory architecture for multiprocessor machines in which access time depends on where the memory sits relative to the processor: a CPU reaches its own local memory somewhat faster than non-local memory. The shared memory is physically distributed, and the collection of all of it forms the global address space.
A NUMA node is a CPU/memory pair; typically a CPU socket and its closest memory bank form one node. Whenever a CPU needs memory belonging to another NUMA node, it cannot touch it directly but must go through the CPU that owns that memory.
struct task_struct *kthread_create_on_node(int (*threadfn)(void *data),
void *data,
int node,
const char namefmt[], ...);
/**
* kthread_create - create a kthread on the current node
* @threadfn: the function to run in the thread
* @data: data pointer for @threadfn()
* @namefmt: printf-style format string for the thread name
* @...: arguments for @namefmt.
*
* This macro will create a kthread on the current node, leaving it in
* the stopped state. This is just a helper for
* kthread_create_on_node();
* see the documentation there for more details.
*/
#define kthread_create(threadfn, data, namefmt, arg...) \
	kthread_create_on_node(threadfn, data, NUMA_NO_NODE, namefmt, ##arg)
struct task_struct *kthread_create_on_cpu(int (*threadfn)(void *data),
void *data,
unsigned int cpu,
const char *namefmt);
/**
* kthread_run - create and wake a thread.
* @threadfn: the function to run until signal_pending(current).
* @data: data ptr for @threadfn.
* @namefmt: printf-style name for the thread.
*
* Description: Convenient wrapper for kthread_create() followed by
* wake_up_process(). Returns the kthread or ERR_PTR(-ENOMEM).
*/
#define kthread_run(threadfn, data, namefmt, ...)			   \
({									   \
	struct task_struct *__k						   \
		= kthread_create(threadfn, data, namefmt, ## __VA_ARGS__); \
	if (!IS_ERR(__k))						   \
		wake_up_process(__k);					   \
	__k;								   \
})
kthread_create_on_node() packs the details of the kthread to be created (received as its arguments) into a kthread_create_info structure and queues it at the tail of kthread_create_list. It then wakes kthreadd and waits for the thread creation to complete:
/* kernel/kthread.c */
static struct task_struct *__kthread_create_on_node(int (*threadfn)(void *data),
void *data, int node,
const char namefmt[],
va_list args)
{
DECLARE_COMPLETION_ONSTACK(done);
struct task_struct *task;
struct kthread_create_info *create = kmalloc(sizeof(*create),
GFP_KERNEL);
if (!create)
return ERR_PTR(-ENOMEM);
create->threadfn = threadfn;
create->data = data;
create->node = node;
create->done = &done;
spin_lock(&kthread_create_lock);
list_add_tail(&create->list, &kthread_create_list);
spin_unlock(&kthread_create_lock);
wake_up_process(kthreadd_task);
/*
* Wait for completion in killable state, for I might be chosen by
* the OOM killer while kthreadd is trying to allocate memory for
* new kernel thread.
*/
if (unlikely(wait_for_completion_killable(&done))) {
/*
* If I was SIGKILLed before kthreadd (or new kernel thread)
* calls complete(), leave the cleanup of this structure to
* that thread.
*/
if (xchg(&create->done, NULL))
return ERR_PTR(-EINTR);
/*
* kthreadd (or new kernel thread) will call complete()
* shortly.
*/
wait_for_completion(&done); // wakeup on completion of thread creation.
}
...
...
...
}
struct task_struct *kthread_create_on_node(int (*threadfn)(void *data),
void *data, int node,
const char namefmt[],
...)
{
struct task_struct *task;
va_list args;
va_start(args, namefmt);
task = __kthread_create_on_node(threadfn, data, node, namefmt, args);
va_end(args);
return task;
}
kthreadd calls the create_kthread() routine to start a kernel thread from the data queued on the list. This routine creates the thread and signals completion:
/* kernel/kthread.c */
static void create_kthread(struct kthread_create_info *create)
{
int pid;
#ifdef CONFIG_NUMA
current->pref_node_fork = create->node;
#endif
/* We want our own signal handler (we take no signals by default). */
pid = kernel_thread(kthread, create, CLONE_FS | CLONE_FILES |
SIGCHLD);
if (pid < 0) {
/* If user was SIGKILLed, I release the structure. */
struct completion *done = xchg(&create->done, NULL);
if (!done) {
kfree(create);
return;
}
create->result = ERR_PTR(pid);
complete(done); /* signal completion of thread creation */
}
}
Ultimately, then, kernel thread creation bottoms out in the same routine as fork: the kernel_clone function in current Linux sources, discussed above.
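As a usage sketch, here is a minimal, hypothetical module of my own; kthread_run, kthread_should_stop, and kthread_stop are the real in-kernel APIs:

#include <linux/module.h>
#include <linux/kthread.h>
#include <linux/delay.h>

static struct task_struct *worker;

static int worker_fn(void *data)
{
    /* Loop until someone calls kthread_stop() on us. */
    while (!kthread_should_stop()) {
        pr_info("demo_worker: tick\n");
        msleep(1000);
    }
    return 0;
}

static int __init demo_init(void)
{
    /* kthread_run = kthread_create + wake_up_process, as shown above. */
    worker = kthread_run(worker_fn, NULL, "demo_worker");
    return IS_ERR(worker) ? PTR_ERR(worker) : 0;
}

static void __exit demo_exit(void)
{
    kthread_stop(worker); /* wakes the thread and waits for it to exit */
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");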
Linux process context switching
The relevant code is in kernel/sched/core.c (linux/core.c at master · torvalds/linux, github.com) and centers on the context_switch function. A switch mainly swaps the memory (address space) and the register state.
There are four kinds of transition:
- kernel -> kernel: lazy + transfer active
- user -> kernel: lazy + mmgrab() active
- kernel -> user: switch + mmdrop() active
- user -> user: switch
/*
* context_switch - switch to the new MM and the new thread's register state.
*/
static __always_inline struct rq *
context_switch(struct rq *rq, struct task_struct *prev,
struct task_struct *next, struct rq_flags *rf)
{
prepare_task_switch(rq, prev, next);
/*
* For paravirt, this is coupled with an exit in switch_to to
* combine the page table reload and the switch backend into
* one hypercall.
*/
arch_start_context_switch(prev);
/*
* kernel -> kernel lazy + transfer active
* user -> kernel lazy + mmgrab() active
*
* kernel -> user switch + mmdrop() active
* user -> user switch
*/
if (!next->mm) { // to kernel
enter_lazy_tlb(prev->active_mm, next);
next->active_mm = prev->active_mm;
if (prev->mm) // from user
mmgrab(prev->active_mm);
else
prev->active_mm = NULL;
} else { // to user
membarrier_switch_mm(rq, prev->active_mm, next->mm);
/*
* sys_membarrier() requires an smp_mb() between setting
* rq->curr / membarrier_switch_mm() and returning to userspace.
*
* The below provides this either through switch_mm(), or in
* case 'prev->active_mm == next->mm' through
* finish_task_switch()'s mmdrop().
*/
switch_mm_irqs_off(prev->active_mm, next->mm, next);
if (!prev->mm) { // from kernel
/* will mmdrop() in finish_task_switch(). */
rq->prev_mm = prev->active_mm;
prev->active_mm = NULL;
}
}
rq->clock_update_flags &= ~(RQCF_ACT_SKIP|RQCF_REQ_SKIP);
prepare_lock_switch(rq, next, rf);
/* Here we just switch the register state and the stack. */
switch_to(prev, next, prev);
barrier();
return finish_task_switch(prev);
}
- mmgrab(prev->active_mm) raises the reference count on the previous task's active_mm by one, so it cannot be freed while borrowed.
- switch_mm_irqs_off is what actually switches the address space.
- A kernel thread's tsk->mm is always NULL; the kernel uses this to tell kernel threads apart from user tasks.
- Note the line next->active_mm = prev->active_mm: an incoming kernel thread borrows the previous task's active_mm, which avoids frequent, unnecessary address-space switches.
switch_to moves the processor state of the previous process over to the next one, saving and restoring:
- the stack
- the registers
Implementing preemptive switching
Without preemption, a switch can only happen when a process explicitly calls schedule(). Linux uses the need_resched flag to indicate that a forced reschedule is needed; the flag lives in every process's thread_info structure.
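A sketch of the cooperative side of this mechanism: long-running kernel loops call cond_resched(), a real helper that checks need_resched and calls schedule() when it is set. The surrounding function and its work unit below are hypothetical:

#include <linux/sched.h>

static void long_running_work(void)
{
    int i;
    for (i = 0; i < 1000000; i++) {
        /* ... one unit of work (hypothetical) ... */
        cond_resched(); /* yield here if need_resched was set, e.g. by the timer tick */
    }
}

Beyond this voluntary yielding, the kernel also checks need_resched when returning to user space and, on preemptible kernels, when returning from an interrupt to kernel space; that is what makes the reschedule forced rather than voluntary.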
XV6 processes
PCB
struct proc {
  uint sz;                     // Size of process memory (bytes)
  pde_t* pgdir;                // Page table pointer
  char *kstack;                // Bottom of kernel stack for this process
  enum procstate state;        // Process state
  volatile int pid;            // Process ID
  struct proc *parent;         // Parent process
  struct trapframe *tf;        // Trap frame for current syscall
  struct context *context;     // swtch() here to run process
  void *chan;                  // If non-zero, sleeping on chan
  int killed;                  // If non-zero, has been killed
  struct file *ofile[NOFILE];  // Open files
  struct inode *cwd;           // Current working directory
  char name[16];               // Process name (debugging)
};
ptable
struct {
struct spinlock lock;
struct proc proc[NPROC];
} ptable;
This is the global process table.
Creating user processes
xv6-public/proc.c at master · mit-pdos/xv6-public (github.com)
// Create a new process copying p as the parent.
// Sets up stack to return as if from system call.
// Caller must set state of returned proc to RUNNABLE.
int
fork(void)
{
int i, pid;
struct proc *np;
struct proc *curproc = myproc();
// Allocate process.
if((np = allocproc()) == 0){
return -1;
}
// Copy process state from proc.
if((np->pgdir = copyuvm(curproc->pgdir, curproc->sz)) == 0){
kfree(np->kstack);
np->kstack = 0;
np->state = UNUSED;
return -1;
}
np->sz = curproc->sz;
np->parent = curproc;
*np->tf = *curproc->tf;
// Clear %eax so that fork returns 0 in the child.
np->tf->eax = 0;
for(i = 0; i < NOFILE; i++)
if(curproc->ofile[i])
np->ofile[i] = filedup(curproc->ofile[i]);
np->cwd = idup(curproc->cwd);
safestrcpy(np->name, curproc->name, sizeof(curproc->name));
pid = np->pid;
acquire(&ptable.lock);
np->state = RUNNABLE;
release(&ptable.lock);
return pid;
}
As the code shows, process creation in XV6 is very simple: allocproc grabs a slot, copyuvm duplicates the page table, some housekeeping follows (duplicating file references, setting state flags, and so on), and finally the child's PID is returned.
Creating kernel processes
There seem to be no kernel processes as such; sys_fork is just an ordinary fork with a thin system call wrapper around it.
xv6-public/sysproc.c at master · mit-pdos/xv6-public (github.com)
int
sys_fork(void)
{
return fork();
}
Scheduling and switching
//PAGEBREAK: 42
// Per-CPU process scheduler.
// Each CPU calls scheduler() after setting itself up.
// Scheduler never returns. It loops, doing:
// - choose a process to run
// - swtch to start running that process
// - eventually that process transfers control
// via swtch back to the scheduler.
void
scheduler(void)
{
struct proc *p;
struct cpu *c = mycpu();
c->proc = 0;
for(;;){
// Enable interrupts on this processor.
sti();
// Loop over process table looking for process to run.
acquire(&ptable.lock);
for(p = ptable.proc; p < &ptable.proc[NPROC]; p++){
if(p->state != RUNNABLE)
continue;
// Switch to chosen process. It is the process's job
// to release ptable.lock and then reacquire it
// before jumping back to us.
c->proc = p;
switchuvm(p);
p->state = RUNNING;
swtch(&(c->scheduler), p->context);
switchkvm();
// Process is done running for now.
// It should have changed its p->state before coming back.
c->proc = 0;
}
release(&ptable.lock);
}
}
switchuvm() switches to the chosen process's page table.
switchkvm() switches back to the kernel page table.
swtch is a function written in assembly (in swtch.S) that saves the current process's context and then resumes the context of the process being switched to (cpu->scheduler here).
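swtch only needs to save the callee-saved registers plus the stack pointer; the calling convention has already spilled everything else. For reference, xv6-public's swtch.S is essentially the following (quoted from memory, comments abridged):

# void swtch(struct context **old, struct context *new);
.globl swtch
swtch:
  movl 4(%esp), %eax   # old
  movl 8(%esp), %edx   # new

  # Save old callee-saved registers
  pushl %ebp
  pushl %ebx
  pushl %esi
  pushl %edi

  # Switch stacks
  movl %esp, (%eax)
  movl %edx, %esp

  # Load new callee-saved registers
  popl %edi
  popl %esi
  popl %ebx
  popl %ebp
  ret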
uCore processes
PCB
uCore's PCB is called proc_struct.
Below is the alloc_proc function; both kernel threads and user processes are allocated here.
// alloc_proc - alloc a proc_struct and init all fields of proc_struct
static struct proc_struct *
alloc_proc(void) {
struct proc_struct *proc = kmalloc(sizeof(struct proc_struct));
if (proc != NULL) {
//LAB4:EXERCISE1 YOUR CODE
/*
* below fields in proc_struct need to be initialized
* enum proc_state state; // Process state
* int pid; // Process ID
* int runs; // the running times of Proces
* uintptr_t kstack; // Process kernel stack
* volatile bool need_resched; // bool value: need to be rescheduled to release CPU?
* struct proc_struct *parent; // the parent process
* struct mm_struct *mm; // Process's memory management field
* struct context context; // Switch here to run process
* struct trapframe *tf; // Trap frame for current interrupt
* uintptr_t cr3; // CR3 register: the base addr of Page Directroy Table(PDT)
* uint32_t flags; // Process flag
* char name[PROC_NAME_LEN + 1]; // Process name
*/
//LAB5 YOUR CODE : (update LAB4 steps)
/*
* below fields(add in LAB5) in proc_struct need to be initialized
* uint32_t wait_state; // waiting state
* struct proc_struct *cptr, *yptr, *optr; // relations between processes
*/
//LAB6 YOUR CODE : (update LAB5 steps)
/*
* below fields(add in LAB6) in proc_struct need to be initialized
* struct run_queue *rq; // running queue contains Process
* list_entry_t run_link; // the entry linked in run queue
* int time_slice; // time slice for occupying the CPU
* skew_heap_entry_t lab6_run_pool; // FOR LAB6 ONLY: the entry in the run pool
* uint32_t lab6_stride; // FOR LAB6 ONLY: the current stride of the process
* uint32_t lab6_priority; // FOR LAB6 ONLY: the priority of process, set by lab6_set_priority(uint32_t)
*/
//LAB8:EXERCISE2 YOUR CODE HINT:need add some code to init fs in proc_struct, ...
}
return proc;
}
Generic process creation
The core work is done by do_fork(), much like Linux.
/* do_fork - parent process for a new child process
* @clone_flags: used to guide how to clone the child process
* @stack: the parent's user stack pointer. if stack==0, It means to fork a kernel thread.
* @tf: the trapframe info, which will be copied to child process's proc->tf
*/
int
do_fork(uint32_t clone_flags, uintptr_t stack, struct trapframe *tf) {
int ret = -E_NO_FREE_PROC;
struct proc_struct *proc;
if (nr_process >= MAX_PROCESS) {
goto fork_out;
}
ret = -E_NO_MEM;
//LAB4:EXERCISE2 YOUR CODE
//LAB8:EXERCISE2 YOUR CODE HINT:how to copy the fs in parent's proc_struct?
/*
* Some Useful MACROs, Functions and DEFINEs, you can use them in below implementation.
* MACROs or Functions:
* alloc_proc: create a proc struct and init fields (lab4:exercise1)
* setup_kstack: alloc pages with size KSTACKPAGE as process kernel stack
* copy_mm: process "proc" duplicate OR share process "current"'s mm according clone_flags
* if clone_flags & CLONE_VM, then "share" ; else "duplicate"
* copy_thread: setup the trapframe on the process's kernel stack top and
* setup the kernel entry point and stack of process
* hash_proc: add proc into proc hash_list
* get_pid: alloc a unique pid for process
* wakeup_proc: set proc->state = PROC_RUNNABLE
* VARIABLES:
* proc_list: the process set's list
* nr_process: the number of process set
*/
// 1. call alloc_proc to allocate a proc_struct
// 2. call setup_kstack to allocate a kernel stack for child process
// 3. call copy_mm to dup OR share mm according clone_flag
// 4. call copy_thread to setup tf & context in proc_struct
// 5. insert proc_struct into hash_list && proc_list
// 6. call wakeup_proc to make the new child process RUNNABLE
// 7. set ret vaule using child proc's pid
//LAB5 YOUR CODE : (update LAB4 steps)
/* Some Functions
* set_links: set the relation links of process. ALSO SEE: remove_links: lean the relation links of process
* -------------------
* update step 1: set child proc's parent to current process, make sure current process's wait_state is 0
* update step 5: insert proc_struct into hash_list && proc_list, set the relation links of process
*/
fork_out:
return ret;
bad_fork_cleanup_fs: //for LAB8
put_files(proc);
bad_fork_cleanup_kstack:
put_kstack(proc);
bad_fork_cleanup_proc:
kfree(proc);
goto fork_out;
}
Creating kernel processes
Implemented via kernel_thread:
int
kernel_thread(int (*fn)(void *), void *arg, uint32_t clone_flags)
{
    struct trapframe tf;
    memset(&tf, 0, sizeof(struct trapframe));
    tf.tf_cs = KERNEL_CS;
    tf.tf_ds = tf.tf_es = tf.tf_ss = KERNEL_DS;
tf.tf_regs.reg_ebx = (uint32_t)fn;
tf.tf_regs.reg_edx = (uint32_t)arg;
tf.tf_eip = (uint32_t)kernel_thread_entry;
return do_fork(clone_flags | CLONE_VM, 0, &tf);
}
At bottom it is implemented with do_fork.
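For instance, uCore's own boot code uses kernel_thread to create the first kernel thread, roughly like this (as in lab4's proc_init; the exact argument string may differ between lab versions):

int pid = kernel_thread(init_main, "Hello world!!", 0);
if (pid <= 0) {
    panic("create init_main failed.\n");
}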
Creating user threads
int
thread(int (*fn)(void *), void *arg, thread_t *tidp) {
if (fn == NULL || tidp == NULL) {
return -E_INVAL;
}
int ret;
uintptr_t stack = 0;
if ((ret = mmap(&stack, THREAD_STACKSIZE, MMAP_WRITE | MMAP_STACK)) != 0) {
return ret;
}
assert(stack != 0);
if ((ret = clone(CLONE_VM | CLONE_THREAD, stack + THREAD_STACKSIZE, fn, arg)) < 0) {
munmap(stack, THREAD_STACKSIZE);
return ret;
}
tidp->pid = ret;
tidp->stack = (void *)stack;
return 0;
}
Scheduling
uCore likewise uses a need_resched flag.
The scheduling loop:
void
cpu_idle(void) {
    while (1) {
        if (current->need_resched) {
            schedule();
        }
    }
}
The scheduler is basically a simplified version of Linux's.
Haribote OS processes
Creating tasks
Haribote's process management is extremely bare-bones; it does not distinguish user mode from kernel mode.
30daysOS/mtask.c at develop · kamaboko123/30daysOS (github.com)
struct TASK *task_init(struct MEMMAN *memman){
int i;
struct TASK *task;
struct TASK *idle;
struct SEGMENT_DESCRIPTOR *gdt = (struct SEGMENT_DESCRIPTOR *) ADR_GDT;
taskctl = (struct TASKCTL *) memman_alloc_4k(memman, sizeof(struct TASKCTL));
for(i = 0; i < MAX_TASKS; i++){
taskctl->tasks0[i].flags = 0;
//write this task's LDT selector into the TSS's ldtr, so the CPU knows which LDT the task uses
taskctl->tasks0[i].tss.ldtr = (TASK_GDT0 + MAX_TASKS + i) * 8;
taskctl->tasks0[i].sel = (TASK_GDT0 + i) * 8;
//descriptor for the TSS (104 bytes)
set_segmdesc(gdt + TASK_GDT0 + i, 103, (int)&taskctl->tasks0[i].tss, AR_TSS32);
//descriptor for the LDT (16 bytes): like segment descriptors each entry is 8 bytes, so two entries
//recorded in each TASK structure's ldt field
set_segmdesc(gdt + TASK_GDT0 + MAX_TASKS + i, 15, (int)&taskctl->tasks0[i].ldt, AR_LDT);
}
for(i = 0; i < MAX_TASKLEVELS; i++){
taskctl->level[i].running = 0;
taskctl->level[i].now = 0;
}
task = task_alloc();
task->flags = 2; //running
task->priority = 2; //time slice: 0.02 s
task->level = 0;
task_add(task);
task_switchsub();
load_tr(task->sel);
task_timer = timer_alloc();
timer_settime(task_timer, task->priority);
//each task executes cli before sleeping, but interrupt delivery must stay enabled so it can be woken (by FIFO writes)
//task_alloc sets IF=1 in eflags, so interrupts are permitted
//the idle task can write to the FIFO, which is presumably how other tasks get woken?
idle = task_alloc();
idle->tss.esp = memman_alloc_4k(memman, 64 * 1024) + 64 * 1024;
idle->tss.eip = (int) &task_idle;
idle->tss.es = 1* 8;
idle->tss.cs = 2* 8;
idle->tss.ss = 1* 8;
idle->tss.ds = 1* 8;
idle->tss.fs = 1* 8;
idle->tss.gs = 1* 8;
task_run(idle, MAX_TASKLEVELS -1, 1);
return task;
}
Crude and simple: at most 1000 tasks.
Tasks are created with task_alloc and registered with task_add:
void task_add(struct TASK *task){
// add the task to the given level
struct TASKLEVEL *tl = &taskctl->level[task->level];
tl->tasks[tl->running] = task;
tl->running++;
task->flags = 2;
}
The mechanism is simple: append the task to the corresponding task level; see the structure sketch below. tl->running is a count that doubles as the index of the next free slot.
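For context, the bookkeeping structures used above look roughly like this in the book's source (reconstructed from memory; field order may differ):

struct TASKLEVEL {
    int running;                 // number of runnable tasks in this level
    int now;                     // index of the task currently running
    struct TASK *tasks[MAX_TASKS_LV];
};

struct TASKCTL {
    int now_lv;                  // the level currently active
    char lv_change;              // whether to re-pick the level at the next switch
    struct TASKLEVEL level[MAX_TASKLEVELS];
    struct TASK tasks0[MAX_TASKS];
};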
Scheduling
void task_switch(void){
struct TASKLEVEL *tl = &taskctl->level[taskctl->now_lv];
struct TASK *new_task;
struct TASK *now_task = tl->tasks[tl->now];
tl->now++;
if(tl->now == tl->running){
tl->now = 0;
}
// if the level bookkeeping needs updating, call task_switchsub to refresh it
if(taskctl->lv_change != 0){
task_switchsub();
tl = &taskctl->level[taskctl->now_lv];
}
new_task = tl->tasks[tl->now];
timer_settime(task_timer, new_task->priority);
if(new_task != now_task){
farjmp(0, new_task->sel);
}
}
void task_switchsub(void){
int i;
// find the highest level that has running tasks
for(i = 0; i < MAX_TASKLEVELS; i++){
if(taskctl->level[i].running > 0) break;
}
taskctl->now_lv = i;
taskctl->lv_change = 0;
}
The length of time each task runs is exactly its priority value.
Linux system calls:
- are named with a sys_ prefix
- are defined via the macro SYSCALL_DEFINEn(name, ...), where n is the number of parameters
- each has a unique system call number
sys.c - kernel/sys.c - Linux source code (v3.10) - Bootlin
/**
* sys_getpid - return the thread group id of the current process
*
* Note, despite the name, this returns the tgid not the pid. The tgid and
* the pid are identical unless CLONE_THREAD was specified on clone() in
* which case the tgid is the same in all threads of the same group.
*
* This is SMP safe as current->tgid does not change.
*/
SYSCALL_DEFINE0(getpid)
{
return task_tgid_vnr(current);
}
SYSCALL_DEFINE0(getpid) expands to asmlinkage long sys_getpid(void).
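For a call that takes arguments, the SYSCALL_DEFINEn macros also record each parameter's type. For example, close(2) is defined in fs/open.c along these lines (body elided here):

SYSCALL_DEFINE1(close, unsigned int, fd)
{
        /* ... releases the file descriptor ... */
}

which expands to approximately asmlinkage long sys_close(unsigned int fd).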
How does a user program execute a system call?
It cannot execute one itself; it can only have the kernel execute it on its behalf: notify the kernel, and the kernel returns the result.
How the kernel is notified
The notification goes through the int 0x80 software interrupt, or the sysenter instruction. The user program puts the system call number into an agreed register (eax on x86); the syscall handler reads it there and dispatches the corresponding call:
call *sys_call_table(,%rax, 8)
How system call arguments get into the kernel
Arguments travel the same way: on x86 the first few are placed in the ebx, ecx, edx, esi, and edi registers, and if there are more, they are put in memory with a single register holding a pointer to that block. The return value likewise comes back in a register (eax on x86).
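To make the whole path concrete, here is a small 32-bit sketch of my own that issues write(2) through int 0x80 directly: the call number goes in eax (__NR_write is 4 on i386), the arguments in ebx/ecx/edx, and the return value comes back in eax. Build with gcc -m32; this applies to i386 only.

int main(void)
{
    const char msg[] = "hello via int 0x80\n";
    long ret;
    asm volatile ("int $0x80"
                  : "=a" (ret)             /* return value arrives in eax */
                  : "0" (4),               /* eax: __NR_write on i386 */
                    "b" (1),               /* ebx: fd = stdout */
                    "c" (msg),             /* ecx: buffer */
                    "d" (sizeof(msg) - 1)  /* edx: count */
                  : "memory");
    return ret < 0;
}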
References
1. Linux 内核线程及普通进程总结 - kk Blog (abcdxyzk.github.io)
2. C Programming: Creating a child process using "clone" - CodeSteps
3. fork 系统调用过程分析 - De4dCr0w's Blog (https://de4dcr0w.github.io/分析fork系统调用.html)
4. Linux 文件系统之 VFS(六) (mudongliang.github.io)
5. 初探 Linux kernel 亂數產生器 - random generator - SZ Lin with Cybersecurity & Embedded Linux (https://szlin.me/2016/10/05/初探-linux-kernel-亂數產生器-random-generator/)
6. Linux 下的进程类别(内核线程、轻量级进程和用户进程)以及其创建方式 - Linux 进程的管理与调度(四) - 明明是悟空 - 博客园 (cnblogs.com)
7. Kernel threads - Mastering Linux Kernel Development (packtpub.com)
8. 深入理解 Linux 内核之内核线程(下) - 知乎 (zhihu.com)
9. 《操作系统》xv6 阅读报告之进程调度 - kinami's (https://www.kinami.cc/2021/04/14/《操作系统》xv6阅读报告之进程调度/)
10. 设计关键数据结构 - 进程控制块 - uCore Lab Documents (gitbooks.io)
11. 进程切换过程 - ucore_os_docs (gitbooks.io)