Message-ID: <89b892c0-7dd9-3a98-99c8-6db83c5a7200@redhat.com>
Date: Mon, 12 Dec 2022 09:51:06 +0800
Subject: Re: [PATCH v3 4/5] pid: mark pids associated with group leader tasks
From: Ian Kent
To: Brian Foster, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
 linux-kernel@vger.kernel.org
Cc: onestero@redhat.com, willy@infradead.org, ebiederm@redhat.com
In-Reply-To: <20221202171620.509140-5-bfoster@redhat.com>
References: <20221202171620.509140-1-bfoster@redhat.com>
 <20221202171620.509140-5-bfoster@redhat.com>

On 3/12/22 01:16, Brian Foster wrote:
> Searching the pid_namespace for group leader tasks is a fairly
> inefficient operation. Listing the root directory of a procfs mount
> performs a linear scan of allocated pids, checking each entry for an
> associated PIDTYPE_TGID task to determine whether to populate a
> directory entry. This can cause a significant increase in readdir()
> syscall latency when run in namespaces that might have one or more
> processes with significant thread counts.
>
> To facilitate improved TGID pid searches, mark the ids of pid
> entries that are likely to have an associated PIDTYPE_TGID task.
> To keep the code simple and avoid having to maintain synchronization
> between mark state and post-fork pid-task association changes, the
> mark is applied to all pids allocated for tasks cloned without
> CLONE_THREAD.
>
> This means that it is possible for a pid to remain marked in the
> xarray after being disassociated from the group leader task. For
> example, a process that does a setsid() followed by fork() and
> exit() (to daemonize) will remain associated with the original pid
> for the session, but link with the child pid as the group leader.
> OTOH, the only place other than fork() where a tgid association
> occurs is in the exec() path, which kills all other tasks in the
> group and associates the current task with the preexisting leader
> pid. Therefore, the semantics of the mark are that false positives
> (marked pids without PIDTYPE_TGID tasks) are possible, but false
> negatives (unmarked pids with PIDTYPE_TGID tasks) should never
> occur.
>
> This is an effective optimization because false positives are fairly
> uncommon and don't add overhead (i.e. we already have to check
> pid_task() for marked entries), but still filters out thread pids
> that are guaranteed not to have TGID task association.
>
> Mark entries in the pid allocation path when the caller specifies
> that the pid associates with a new thread group. Since false
> negatives are not allowed, warn in the event that a PIDTYPE_TGID
> task is ever attached to an unmarked pid. Finally, create a helper
> to implement the task search based on the mark semantics defined
> above (based on search logic currently implemented by next_tgid() in
> procfs).

Yes, this is the tricky bit, but the analysis sounds thorough, so it
should work for all cases ...
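
To make the mark semantics above concrete, here is a minimal userspace
sketch in C. It is not kernel code and not part of the patch: the
model_* names, the fixed-size table and the single has_tgid_task flag
are all hypothetical stand-ins for the xarray entry, TGID_MARK and
pid_task(pid, PIDTYPE_TGID). It also collapses alloc_pid() and
attach_pid() into one step, which the kernel does not do. The only point
is that the mark is set once at allocation and never cleared, so a
marked slot can lose its leader (false positive) while an unmarked slot
never gains one (no false negative).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_MAX_PIDS 64	/* hypothetical table size, not a kernel limit */

/* Hypothetical stand-in for one pid entry in the namespace xarray. */
struct model_pid {
	bool allocated;		/* slot holds a pid */
	bool tgid_mark;		/* models TGID_MARK, set only at allocation */
	bool has_tgid_task;	/* models pid_task(pid, PIDTYPE_TGID) != NULL */
};

static struct model_pid model_table[MODEL_MAX_PIDS];

/*
 * Models allocation for a new task: the mark is applied only when the
 * task is not cloned with CLONE_THREAD (group_leader == true).
 * Assumption: the leader task is attached in the same step.
 */
static void model_alloc(int nr, bool group_leader)
{
	assert(nr >= 0 && nr < MODEL_MAX_PIDS);
	model_table[nr] = (struct model_pid){
		.allocated = true,
		.tgid_mark = group_leader,
		.has_tgid_task = group_leader,
	};
}

/*
 * Models the leader task going away while the pid stays allocated
 * (for example, still in use as a session id). The mark is left alone.
 */
static void model_detach_leader(int nr)
{
	assert(nr >= 0 && nr < MODEL_MAX_PIDS);
	model_table[nr].has_tgid_task = false;
}

/* False positive: marked, but no group leader task attached. */
static bool model_is_false_positive(int nr)
{
	assert(nr >= 0 && nr < MODEL_MAX_PIDS);
	return model_table[nr].allocated && model_table[nr].tgid_mark &&
	       !model_table[nr].has_tgid_task;
}

/* False negative: unmarked, yet a group leader task is attached. */
static bool model_is_false_negative(int nr)
{
	assert(nr >= 0 && nr < MODEL_MAX_PIDS);
	return model_table[nr].allocated && !model_table[nr].tgid_mark &&
	       model_table[nr].has_tgid_task;
}
```

In this model, allocating slot 10 as a group leader and then calling
model_detach_leader(10) leaves a false positive, while a slot allocated
with group_leader == false can never become a false negative because
nothing in the model attaches a leader to an unmarked slot.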
>
> Signed-off-by: Brian Foster

Reviewed-by: Ian Kent

> ---
>  include/linux/pid.h |  3 ++-
>  kernel/fork.c       |  2 +-
>  kernel/pid.c        | 44 +++++++++++++++++++++++++++++++++++++++++++-
>  3 files changed, 46 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/pid.h b/include/linux/pid.h
> index 343abf22092e..64caf21be256 100644
> --- a/include/linux/pid.h
> +++ b/include/linux/pid.h
> @@ -132,9 +132,10 @@ extern struct pid *find_vpid(int nr);
>   */
>  extern struct pid *find_get_pid(int nr);
>  extern struct pid *find_ge_pid(int nr, struct pid_namespace *);
> +struct task_struct *find_get_tgid_task(int *id, struct pid_namespace *);
>
>  extern struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
> -			     size_t set_tid_size);
> +			     size_t set_tid_size, bool group_leader);
>  extern void free_pid(struct pid *pid);
>  extern void disable_pid_allocation(struct pid_namespace *ns);
>
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 08969f5aa38d..1cf2644c642e 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -2267,7 +2267,7 @@ static __latent_entropy struct task_struct *copy_process(
>
>  	if (pid != &init_struct_pid) {
>  		pid = alloc_pid(p->nsproxy->pid_ns_for_children, args->set_tid,
> -				args->set_tid_size);
> +				args->set_tid_size, !(clone_flags & CLONE_THREAD));
>  		if (IS_ERR(pid)) {
>  			retval = PTR_ERR(pid);
>  			goto bad_fork_cleanup_thread;
> diff --git a/kernel/pid.c b/kernel/pid.c
> index 53db06f9882d..d65f74c6186c 100644
> --- a/kernel/pid.c
> +++ b/kernel/pid.c
> @@ -66,6 +66,9 @@ int pid_max = PID_MAX_DEFAULT;
>  int pid_max_min = RESERVED_PIDS + 1;
>  int pid_max_max = PID_MAX_LIMIT;
>
> +/* MARK_0 used by XA_FREE_MARK */
> +#define TGID_MARK XA_MARK_1
> +
>  struct pid_namespace init_pid_ns = {
>  	.ns.count = REFCOUNT_INIT(2),
>  	.xa = XARRAY_INIT(init_pid_ns.xa, PID_XA_FLAGS),
> @@ -137,7 +140,7 @@ void free_pid(struct pid *pid)
>  }
>
>  struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
> -		      size_t set_tid_size)
> +		      size_t set_tid_size, bool group_leader)
>  {
>  	struct pid *pid;
>  	enum pid_type type;
> @@ -257,6 +260,8 @@ struct pid *alloc_pid(struct pid_namespace *ns, pid_t *set_tid,
>
>  		/* Make the PID visible to find_pid_ns. */
>  		__xa_store(&tmp->xa, upid->nr, pid, 0);
> +		if (group_leader)
> +			__xa_set_mark(&tmp->xa, upid->nr, TGID_MARK);
>  		tmp->pid_allocated++;
>  		xa_unlock_irq(&tmp->xa);
>  	}
> @@ -314,6 +319,11 @@ static struct pid **task_pid_ptr(struct task_struct *task, enum pid_type type)
>  void attach_pid(struct task_struct *task, enum pid_type type)
>  {
>  	struct pid *pid = *task_pid_ptr(task, type);
> +	struct pid_namespace *pid_ns = ns_of_pid(pid);
> +	pid_t pid_nr = pid_nr_ns(pid, pid_ns);
> +
> +	WARN_ON(type == PIDTYPE_TGID &&
> +		!xa_get_mark(&pid_ns->xa, pid_nr, TGID_MARK));
>  	hlist_add_head_rcu(&task->pid_links[type], &pid->tasks[type]);
>  }
>
> @@ -506,6 +516,38 @@ struct pid *find_ge_pid(int nr, struct pid_namespace *ns)
>  }
>  EXPORT_SYMBOL_GPL(find_ge_pid);
>
> +/*
> + * Used by proc to find the first thread group leader task with an id greater
> + * than or equal to *id.
> + *
> + * Use the xarray mark as a hint to find the next best pid. The mark does not
> + * guarantee a linked group leader task exists, so retry until a suitable entry
> + * is found.
> + */
> +struct task_struct *find_get_tgid_task(int *id, struct pid_namespace *ns)
> +{
> +	struct pid *pid;
> +	struct task_struct *t;
> +	unsigned long nr = *id;
> +
> +	rcu_read_lock();
> +	do {
> +		pid = xa_find(&ns->xa, &nr, ULONG_MAX, TGID_MARK);
> +		if (!pid) {
> +			rcu_read_unlock();
> +			return NULL;
> +		}
> +		t = pid_task(pid, PIDTYPE_TGID);
> +		nr++;
> +	} while (!t);
> +
> +	*id = pid_nr_ns(pid, ns);
> +	get_task_struct(t);
> +	rcu_read_unlock();
> +
> +	return t;
> +}
> +
>  struct pid *pidfd_get_pid(unsigned int fd, unsigned int *flags)
>  {
>  	struct fd f;
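
The retry loop in find_get_tgid_task() can be tried out in userspace
with a rough model like the one below. This is only a sketch under
stated assumptions: sketch_slot, sketch_find_marked() and
sketch_find_tgid() are hypothetical names, a plain array stands in for
the namespace xarray, and there is no RCU or reference counting. The
shape follows the quoted helper: find the next marked index at or above
the cursor, check whether a leader task is really there, and otherwise
step past the entry and retry.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_MAX_PIDS 64	/* hypothetical table size */

/* Hypothetical stand-in for one xarray entry. */
struct sketch_slot {
	bool marked;	/* models TGID_MARK on the entry */
	bool has_task;	/* models pid_task(pid, PIDTYPE_TGID) != NULL */
};

/*
 * Rough model of a marked xa_find(): locate the first marked index at or
 * above *nr and store it back into *nr. Returns false if none exists.
 */
static bool sketch_find_marked(const struct sketch_slot *tbl, size_t *nr)
{
	for (size_t i = *nr; i < SKETCH_MAX_PIDS; i++) {
		if (tbl[i].marked) {
			*nr = i;
			return true;
		}
	}
	return false;
}

/*
 * Return the first id >= start whose slot is marked and has a leader
 * task, or -1 if there is none. Marked slots without a task (false
 * positives) are skipped by advancing the cursor and retrying.
 */
static int sketch_find_tgid(const struct sketch_slot *tbl, int start)
{
	size_t nr = start < 0 ? 0 : (size_t)start;
	size_t found;

	assert(tbl != NULL);

	do {
		if (!sketch_find_marked(tbl, &nr))
			return -1;
		found = nr;
		nr++;	/* resume after this entry if it has no task */
	} while (!tbl[found].has_task);

	return (int)found;
}
```

With slot 3 marked but taskless and slot 7 marked with a task, a search
from 0 skips 3 and lands on 7, and unmarked slots in between are never
inspected for a task at all. That skipping is where the readdir()
saving described in the commit message comes from.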