From: Barry Song <21cnbao@gmail.com>
Date: Tue, 10 Jan 2023 05:28:55 +0800
Subject: Re: [PATCH v7 2/2] arm64: support batched/deferred tlb shootdown during page reclamation
To: Catalin Marinas, Nadav Amit, Mel Gorman
Cc: Yicong Yang, akpm@linux-foundation.org, linux-mm@kvack.org, linux-arm-kernel@lists.infradead.org, x86@kernel.org, will@kernel.org, anshuman.khandual@arm.com, linux-doc@vger.kernel.org, corbet@lwn.net, peterz@infradead.org, arnd@arndb.de, punit.agrawal@bytedance.com, linux-kernel@vger.kernel.org, darren@os.amperecomputing.com, yangyicong@hisilicon.com, huzhanyuan@oppo.com, lipeifeng@oppo.com, zhangshiming@oppo.com, guojian@oppo.com, realmz6@gmail.com, linux-mips@vger.kernel.org, openrisc@lists.librecores.org, linuxppc-dev@lists.ozlabs.org, linux-riscv@lists.infradead.org, linux-s390@vger.kernel.org, wangkefeng.wang@huawei.com, xhao@linux.alibaba.com, prime.zeng@hisilicon.com, Barry Song
References: <20221117082648.47526-1-yangyicong@huawei.com> <20221117082648.47526-3-yangyicong@huawei.com>

On Tue, Jan 10, 2023 at 1:19 AM Catalin Marinas wrote:
>
> On Sun, Jan 08, 2023 at 06:48:41PM +0800, Barry Song wrote:
> > On Fri, Jan 6, 2023 at 2:15 AM Catalin Marinas wrote:
> > > On Thu, Nov 17, 2022 at 04:26:48PM +0800, Yicong Yang wrote:
> > > > It is tested on 4,8,128 CPU platforms and shows to be beneficial on
> > > > large systems but may not have improvement on small systems like on
> > > > a 4 CPU platform. So make ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH depends
> > > > on CONFIG_EXPERT for this stage and make this disabled on systems
> > > > with less than 8 CPUs. User can modify this threshold according to
> > > > their own platforms by CONFIG_NR_CPUS_FOR_BATCHED_TLB.
> > >
> > > What's the overhead of such batching on systems with 4 or fewer CPUs? If
> > > it isn't noticeable, I'd rather have it always on than some number
> > > chosen on whichever SoC you tested.
> >
> > On the one hand, tlb flush is cheap on a small system, so batching tlb
> > flushes helps only marginally.
>
> Yes, it probably won't help on small systems but I don't like config
> options choosing the threshold, which may be different from system to
> system even if they have the same number of CPUs. A run-time tunable
> would be a better option.
>
> > On the other hand, since we have batched the tlb flush, new PTEs might be
> > invisible to others before the final broadcast is done and Ack-ed.
>
> The new PTEs could indeed be invisible at the TLB level but not at the
> memory (page table) level since this is done under the PTL IIUC.
>
> > thus, there is a risk someone else might do mprotect or similar things
> > on those deferred pages which will ask for read-modify-write on those
> > deferred PTEs. in this case, mm will do an explicit flush by
> > flush_tlb_batched_pending which is not required if the tlb flush is not
> > deferred.
>
> I don't fully understand why it's needed, or at least why it would be
> needed on arm64. At the end of an mprotect(), we have the final PTEs in
> place and we just need to issue a TLBI for that range.
> change_pte_range() for example has a tlb_flush_pte_range() if the PTE
> was present and that won't be done lazily. If there are other TLBIs
> pending for the same range, they'll be done later though likely
> unnecessarily but still cheaper than issuing a flush_tlb_mm().

Thanks! I'd like to ask for some comments from Nadav and Mel on the x86
side. Revisiting the code of flush_tlb_batched_pending() suggests we still
have races even under the PTL:

/*
 * Reclaim unmaps pages under the PTL but do not flush the TLB prior to
 * releasing the PTL if TLB flushes are batched. It's possible for a parallel
 * operation such as mprotect or munmap to race between reclaim unmapping
 * the page and flushing the page. If this race occurs, it potentially allows
 * access to data via a stale TLB entry. Tracking all mm's that have TLB
 * batching in flight would be expensive during reclaim so instead track
 * whether TLB batching occurred in the past and if so then do a flush here
 * if required. This will cost one additional flush per reclaim cycle paid
 * by the first operation at risk such as mprotect and mumap.
 *
 * This must be called under the PTL so that an access to tlb_flush_batched
 * that is potentially a "reclaim vs mprotect/munmap/etc" race will synchronise
 * via the PTL.
 */
void flush_tlb_batched_pending(struct mm_struct *mm)
{
        ...
}

Going by Catalin's comment, this seems over-cautious: we can make sure
others see the updated TLB once mprotect() and munmap() return, as both
flush the TLB themselves.
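To make the question concrete, the only sequence I can reconstruct from the
original batched-flush commits looks roughly like this (my sketch, so please
correct the details if I got the window wrong):

  CPU 0 (reclaim)                       CPU 1 (same mm)
  ---------------                       ---------------
                                        userspace accesses the page via a
                                        writable PTE (entry cached in TLB)
  try_to_unmap_one()
    ptep_get_and_clear()  /* under PTL */
    set_tlb_ubc_flush_pending()
    /* TLB flush deferred */
                                        mprotect(addr, PROT_READ)
                                          change_pte_range()  /* under PTL */
                                          /* PTE already cleared by reclaim,
                                             so nothing is flushed for it */
                                        mprotect() returns
                                        userspace can still write through
                                        the stale writable TLB entry
  try_to_unmap_flush()
    /* stale entry finally invalidated */

If so, the TLB flush done by mprotect() itself doesn't close the window,
because change_pte_range() skips PTEs that reclaim has already cleared. Is
this the case the comment is guarding against, or is there a subtler one?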
We can also make sure mprotect() sees the updated PTEs in memory from
reclaim, even though the old translation may still be visible at the TLB
level. Hi Mel, Nadav, would you please help clarify the exact sequence in
which this race happens?

>
> > void flush_tlb_batched_pending(struct mm_struct *mm)
> > {
> >         int batch = atomic_read(&mm->tlb_flush_batched);
> >         int pending = batch & TLB_FLUSH_BATCH_PENDING_MASK;
> >         int flushed = batch >> TLB_FLUSH_BATCH_FLUSHED_SHIFT;
> >
> >         if (pending != flushed) {
> >                 flush_tlb_mm(mm);
> >                 /*
> >                  * If the new TLB flushing is pending during flushing, leave
> >                  * mm->tlb_flush_batched as is, to avoid losing flushing.
> >                  */
> >                 atomic_cmpxchg(&mm->tlb_flush_batched, batch,
> >                                pending | (pending << TLB_FLUSH_BATCH_FLUSHED_SHIFT));
> >         }
> > }
>
> I guess this works on x86 better as it avoids the IPIs if this flush
> already happened. But on arm64 we already issued the TLBI, we just
> didn't wait for it to complete via a DSB.
>
> > I believe Anshuman has contributed many points on this in those previous
> > discussions.
>
> Yeah, I should re-read the old threads.
>
> --
> Catalin

Thanks
Barry
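P.S. For completeness on the arm64 side: as I read this series, the batched
flush boils down to roughly the sketch below (simplified from v7 of the
patch; the helper names may differ in later revisions). It shows why, as
Catalin says, the TLBI has already been issued by the time
flush_tlb_batched_pending() runs and only the DSB is still outstanding:

static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
                                        struct mm_struct *mm,
                                        unsigned long uaddr)
{
        /* Issue a per-page TLBI right away, but don't wait for completion */
        __flush_tlb_page_nosync(mm, uaddr);
}

static inline void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
{
        /* Wait for all previously issued TLBIs to complete */
        dsb(ish);
}

So finishing a "pending" batched flush on arm64 should be cheap: a DSB is
enough, and no TLBI needs to be re-issued.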