linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Zhongkun He <hezhongkun.hzk@bytedance.com>
To: Michal Hocko <mhocko@suse.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	corbet@lwn.net, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	linux-api@vger.kernel.org, linux-doc@vger.kernel.org
Subject: Re: [External] Re: [PATCH v2] mm: add new syscall pidfd_set_mempolicy().
Date: Mon, 14 Nov 2022 23:12:00 +0800	[thread overview]
Message-ID: <3a3b4f5b-14d1-27d8-7727-cf23da90988f@bytedance.com> (raw)
In-Reply-To: <Y3IqLzvduM6HqPJV@dhcp22.suse.cz>

Sorry,michal. I dont know if my expression is accurate.
> 
> We shouldn't really rely on mmap_sem for this IMO. 

  Yes, We should rely on mmap_sem for vma->vm_policy,but not for
  process context policy(task->mempolicy).

> There is alloc_lock
> (aka task lock) that makes sure the policy is stable so that caller can
> atomically take a reference and hold on the policy. And we do not do
> that consistently and this should be fixed.

I saw some explanations in the doc("numa_memory_policy.rst") and
comments(mempolcy.h) why not use locks and reference in page
allocation:

In process context there is no locking because only the process accesses
its own state.

During run-time "usage" of the policy, we attempt to minimize atomic
operations on the reference count, as this can lead to cache lines
bouncing between cpus and NUMA nodes.  "Usage" here means one of
the following:
1) querying of the policy, either by the task itself [using
the get_mempolicy() API discussed below] or by another task using
the /proc/<pid>/numa_maps interface.

2) examination of the policy to determine the policy mode and
associated node or node lists, if any, for page allocation.
This is considered a "hot path".  Note that for MPOL_BIND, the
"usage" extends across the entire allocation process, which may
sleep during page reclaimation, because the BIND policy nodemask
is used, by reference, to filter ineligible nodes.

> E.g. just looking at some
> random places like allowed_mems_nr (relying on get_task_policy) is
> completely lockless and some paths (like fadvise) do not use any of the
> explicit (alloc_lock) or implicit (mmap_lock) locking. That means that
> the task_work based approach cannot really work in this case, right?

The task_work based approach (mpol_put_async()) allows mempolicy release
to be transferred from the pidfd_set_mempolicy() context to the
target process context.The old mempolicy droped by pidfd_set_mempolicy()
will be freed by task_work(mpol_free_async) whenever the target task
exit to user mode. At this point task->mempolicy will not be used,
thus avoiding race conditions.

pidfd process context:
void mpol_put_async()
{.....
         init_task_work(&p->w.cb_head, "mpol_free_async");
         if (!task_work_add(task, &p->w.cb_head, TWA_SIGNAL_NO_IPI))
                 return;}

target task:
/* there is no lock and mempolicy may about to be freed by
  * pidfd_set_mempolicy().
  */
pol = get_task_policy()
page = __alloc_pages(..pol)
.....
exit_to_user_mode
	task_work_run()
	/* It's safe to free old mempolicy
  	* dropped by pidfd_set_mempolicy() at this time.*/
		mpol_free_async()

> Playing more tricks will not really help long term. So while your patch
> tries to workaround the current state of the art I do not think we
> really want that. As stated previously, I would much rather see proper
> reference counting instead. I thought we have agreed this would be the
> first approach unless the resulting performance is really bad. Have you
> concluded that to be the case? I do not see any numbers or notes in the
> changelog.

I have tried it, but I found that using task_work to release the
old policy on the target process can solve the problem, but I'm
not sure. My expression may not be very clear, Looking forward to
your reply.

Thanks.




  parent reply	other threads:[~2022-11-14 15:12 UTC|newest]

Thread overview: 21+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-11-11  8:40 Zhongkun He
2022-11-11 19:27 ` Andrew Morton
2022-11-13 16:41   ` [External] " Zhongkun He
2022-11-14 11:44     ` Michal Hocko
2022-11-14 11:46       ` Michal Hocko
2022-11-14 17:52         ` Michal Hocko
2022-11-14 15:12       ` Zhongkun He [this message]
2022-11-14 18:12         ` Michal Hocko
2022-11-15  7:39           ` Zhongkun He
2022-11-16 11:28             ` Zhongkun He
2022-11-16 14:57               ` Michal Hocko
2022-11-17  7:19                 ` Zhongkun He
2022-11-21 14:38                   ` Michal Hocko
2022-11-22  8:33                     ` Zhongkun He
2022-11-22  8:40                       ` Michal Hocko
2022-11-14  9:24   ` Zhongkun He
2022-11-12  2:09 ` kernel test robot
2022-11-16  7:04 ` Huang, Ying
2022-11-16  9:38   ` [External] " Zhongkun He
2022-11-16  9:44     ` Michal Hocko
2022-11-17  6:29       ` Huang, Ying

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=3a3b4f5b-14d1-27d8-7727-cf23da90988f@bytedance.com \
    --to=hezhongkun.hzk@bytedance.com \
    --cc=akpm@linux-foundation.org \
    --cc=corbet@lwn.net \
    --cc=linux-api@vger.kernel.org \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mhocko@suse.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox