From: Barry Song <21cnbao@gmail.com>
Date: Wed, 25 Oct 2023 14:27:42 +0800
Subject: Re: [PATCH] arm64: mm: drop tlb flush operation when clearing the access bit
To: Yu Zhao
Cc: Alistair Popple, Baolin Wang, catalin.marinas@arm.com, will@kernel.org,
 akpm@linux-foundation.org, v-songbaohua@oppo.com, linux-mm@kvack.org,
 linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org
References: <87y1frqz2u.fsf@nvdebian.thelocal> <87ttqfqw8f.fsf@nvdebian.thelocal>
 <87bkcn1j5k.fsf@nvdebian.thelocal>

On Wed, Oct 25, 2023 at 2:17 PM Yu Zhao wrote:
>
> On Tue, Oct 24, 2023 at 9:21 PM Alistair Popple wrote:
> >
> >
> > Baolin Wang writes:
> >
> > > On 10/25/2023 9:58 AM, Alistair Popple wrote:
> > >> Barry Song <21cnbao@gmail.com> writes:
> > >>
> > >>> On Wed, Oct 25, 2023 at 9:18 AM Alistair Popple wrote:
> > >>>>
> > >>>>
> > >>>> Barry Song <21cnbao@gmail.com> writes:
> > >>>>
> > >>>>> On Wed, Oct 25, 2023 at 7:16 AM Barry Song <21cnbao@gmail.com> wrote:
> > >>>>>>
> > >>>>>> On Tue, Oct 24, 2023 at 8:57 PM Baolin Wang wrote:
> > >> [...]
> > >>
> > >>>>>> (A). Constant flush cost vs. (B). very very occasional reclaimed
> > >>>>>> hot page, B might be a correct choice.
> > >>>>>
> > >>>>> Plus, I doubt B is really going to happen. as after a page is
> > >>>>> promoted to the head of lru list or new generation, it needs a long
> > >>>>> time to slide back to the inactive list tail or to the candidate
> > >>>>> generation of mglru. the time should have been large enough for tlb
> > >>>>> to be flushed. If the page is really hot, the hardware will get
> > >>>>> second, third, fourth etc opportunity to set an access flag in the
> > >>>>> long time in which the page is re-moved to the tail as the page can
> > >>>>> be accessed multiple times if it is really hot.
> > >>>>
> > >>>> This might not be true if you have external hardware sharing the page
> > >>>> tables with software through either HMM or hardware supported ATS
> > >>>> though.
> > >>>>
> > >>>> In those cases I think it's much more likely hardware can still be
> > >>>> accessing the page even after a context switch on the CPU say. So
> > >>>> those pages will tend to get reclaimed even though hardware is still
> > >>>> actively using them which would be quite expensive and I guess could
> > >>>> lead to thrashing as each page is reclaimed and then immediately
> > >>>> faulted back in.
> > >
> > > That's possible, but the chance should be relatively low. At least on
> > > x86, I have not heard of this issue.
> >
> > Personally I've never seen any x86 system that shares page tables with
> > external devices, other than with HMM. More on that below.
> >
> > >>> i am not quite sure i got your point. has the external hardware
> > >>> sharing cpu's pagetable the ability to set access flag in page table
> > >>> entries by itself? if yes, I don't see how our approach will hurt as
> > >>> folio_referenced can notify the hardware driver and the driver can
> > >>> flush its own tlb. If no, i don't see either as the external hardware
> > >>> can't set access flags, that means we have ignored its reference and
> > >>> only know cpu's access even in the current mainline code. so we are
> > >>> not getting worse.
> > >>>
> > >>> so the external hardware can also see cpu's TLB? or cpu's tlb flush
> > >>> can also broadcast to external hardware, then external hardware sees
> > >>> the cleared access flag, thus, it can set access flag in page table
> > >>> when the hardware accesses it? If this is the case, I feel what you
> > >>> said is true.
> > >> Perhaps it would help if I gave a concrete example. Take for example
> > >> the ARM SMMU. It has its own TLB. Invalidating this TLB is done in one
> > >> of two ways depending on the specific HW implementation.
> > >> If broadcast TLB maintenance (BTM) is supported it will snoop CPU TLB
> > >> invalidations. If BTM is not supported it relies on SW to explicitly
> > >> forward TLB invalidations via MMU notifiers.
> > >
> > > On our ARM64 hardware, we rely on BTM to maintain TLB coherency.
> >
> > Lucky you :-)
> >
> > ARM64 SMMU architecture specification supports the possibility of both,
> > as does the driver. Not that I think whether or not BTM is supported has
> > much relevance to this issue.
> >
> > >> Now consider the case where some external device is accessing mappings
> > >> via the SMMU. The access flag will be cached in the SMMU TLB. If we
> > >> clear the access flag without a TLB invalidate the access flag in the
> > >> CPU page table will not get updated because it's already set in the
> > >> SMMU TLB.
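
A toy user-space model of the case described just above, in case it helps.
It is not kernel or SMMU code: struct pte, struct tlb_entry, device_access(),
clear_young() and hot_page_looks_idle() are all made-up names, and a
single-entry TLB that is never evicted is a big simplification. It only shows
that, once the access flag is cached in the device TLB, clearing it in the
page table without an invalidate leaves a page the device keeps using looking
idle:

```c
#include <stdbool.h>

/* Toy model only: one page table entry and one device-side TLB entry. */
struct pte {
	bool young;		/* access flag in the page table */
};

struct tlb_entry {
	bool valid;		/* translation is cached */
	bool cached_young;	/* access flag was set when it was cached */
};

/*
 * Device access. On a TLB hit with the access flag already cached, the
 * page table is not touched. Otherwise a walk happens and the flag is set
 * (by hardware, or by software via a minor fault; the model does not care).
 */
static void device_access(struct pte *pte, struct tlb_entry *tlb)
{
	if (tlb->valid && tlb->cached_young)
		return;

	pte->young = true;
	tlb->valid = true;
	tlb->cached_young = true;
}

/* Clear the access flag, optionally invalidating the device TLB entry. */
static bool clear_young(struct pte *pte, struct tlb_entry *tlb, bool flush)
{
	bool was_young = pte->young;

	pte->young = false;
	if (flush)
		tlb->valid = false;

	return was_young;
}

/* Returns 1 if a page the device keeps using looks idle after aging. */
int hot_page_looks_idle(bool flush)
{
	struct pte pte = { .young = false };
	struct tlb_entry tlb = { .valid = false, .cached_young = false };

	device_access(&pte, &tlb);	/* first touch sets the flag */
	clear_young(&pte, &tlb, flush);	/* reclaim scan ages the page */
	device_access(&pte, &tlb);	/* device is still using the page */

	return !pte.young;
}
```

In this model hot_page_looks_idle(false) returns 1 and
hot_page_looks_idle(true) returns 0. How long a stale entry survives on real
hardware depends on TLB capacity and replacement, which the model ignores.
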
> > >> As an aside access flag updates happen in one of two ways. If the SMMU
> > >> HW supports hardware translation table updates (HTTU) then hardware
> > >> will manage updating access/dirty flags as required. If this is not
> > >> supported then SW is relied on to update these flags which in practice
> > >> means taking a minor fault. But I don't think that is relevant here -
> > >> in either case without a TLB invalidate neither of those things will
> > >> happen.
> > >> I suppose drivers could implement the clear_flush_young() MMU notifier
> > >> callback (none do at the moment AFAICT) but then won't that just lead
> > >> to the opposite problem - that every page ever used by an external
> > >> device remains active and unavailable for reclaim because the access
> > >> flag never gets cleared? I suppose they could do the flush then which
> > >> would ensure
> > >
> > > Yes, I think so too. The reason there is currently no problem, perhaps
> > > I think, there are no actual use cases at the moment? At least on our
> > > Alibaba's fleet, SMMU and MMU do not share page tables now.
> >
> > We have systems that do.
>
> Just curious: do those systems run the Linux kernel? If so, are pages
> shared with SMMU pinned? If not, then how are IO PFs handled after
> pages are reclaimed?

The I/O PF path calls handle_mm_fault(vma, prm->addr, fault_flags, NULL),
so in the end it runs the same code as a CPU PF to get the page back.

Years ago we proposed a pinning solution, but there was a lot of pushback:
https://lore.kernel.org/linux-mm/1612685884-19514-1-git-send-email-wangzhou1@hisilicon.com/

Thanks
Barry