From: Barry Song <21cnbao@gmail.com>
Date: Fri, 27 Oct 2023 07:48:07 +0800
Subject: Re: [PATCH] arm64: mm: drop tlb flush operation when clearing the access bit
To: Alistair Popple
Cc: Yu Zhao, Baolin Wang, catalin.marinas@arm.com, will@kernel.org,
 akpm@linux-foundation.org, v-songbaohua@oppo.com, linux-mm@kvack.org,
 linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org
In-Reply-To: <877cnb0zyk.fsf@nvdebian.thelocal>
References: <87y1frqz2u.fsf@nvdebian.thelocal> <87ttqfqw8f.fsf@nvdebian.thelocal>
 <87bkcn1j5k.fsf@nvdebian.thelocal> <877cnb0zyk.fsf@nvdebian.thelocal>

On Wed, Oct 25, 2023 at 6:16 PM Alistair Popple wrote:
>
>
> Barry Song <21cnbao@gmail.com> writes:
>
> > On Wed, Oct 25, 2023 at 2:17 PM Yu Zhao wrote:
> >>
> >> On Tue, Oct 24, 2023 at 9:21 PM Alistair Popple
> >> wrote:
> >> >
> >> >
> >> > Baolin Wang writes:
> >> >
> >> > > On 10/25/2023 9:58 AM, Alistair Popple wrote:
> >> > >> Barry Song <21cnbao@gmail.com> writes:
> >> > >>
> >> > >>> On Wed, Oct 25, 2023 at 9:18 AM Alistair Popple wrote:
> >> > >>>>
> >> > >>>>
> >> > >>>> Barry Song <21cnbao@gmail.com> writes:
> >> > >>>>
> >> > >>>>> On Wed, Oct 25, 2023 at 7:16 AM Barry Song <21cnbao@gmail.com> wrote:
> >> > >>>>>>
> >> > >>>>>> On Tue, Oct 24, 2023 at 8:57 PM Baolin Wang
> >> > >>>>>> wrote:
> >> > >> [...]
> >> > >>
> >> > >>>>>> (A). Constant flush cost vs. (B). very very occasional reclaimed hot
> >> > >>>>>> page, B might
> >> > >>>>>> be a correct choice.
> >> > >>>>>
> >> > >>>>> Plus, I doubt B is really going to happen. as after a page is promoted to
> >> > >>>>> the head of lru list or new generation, it needs a long time to slide back
> >> > >>>>> to the inactive list tail or to the candidate generation of mglru. the time
> >> > >>>>> should have been large enough for tlb to be flushed. If the page is really
> >> > >>>>> hot, the hardware will get second, third, fourth etc opportunity to set an
> >> > >>>>> access flag in the long time in which the page is re-moved to the tail
> >> > >>>>> as the page can be accessed multiple times if it is really hot.
> >> > >>>>
> >> > >>>> This might not be true if you have external hardware sharing the page
> >> > >>>> tables with software through either HMM or hardware supported ATS
> >> > >>>> though.
> >> > >>>>
> >> > >>>> In those cases I think it's much more likely hardware can still be
> >> > >>>> accessing the page even after a context switch on the CPU say. So those
> >> > >>>> pages will tend to get reclaimed even though hardware is still actively
> >> > >>>> using them which would be quite expensive and I guess could lead to
> >> > >>>> thrashing as each page is reclaimed and then immediately faulted back
> >> > >>>> in.
> >> > >
> >> > > That's possible, but the chance should be relatively low. At least on
> >> > > x86, I have not heard of this issue.
> >> >
> >> > Personally I've never seen any x86 system that shares page tables with
> >> > external devices, other than with HMM. More on that below.
> >> >
> >> > >>> i am not quite sure i got your point. has the external hardware sharing cpu's
> >> > >>> pagetable the ability to set access flag in page table entries by
> >> > >>> itself? if yes,
> >> > >>> I don't see how our approach will hurt as folio_referenced can notify the
> >> > >>> hardware driver and the driver can flush its own tlb. If no, i don't see
> >> > >>> either as the external hardware can't set access flags, that means we
> >> > >>> have ignored its reference and only knows cpu's access even in the current
> >> > >>> mainline code. so we are not getting worse.
> >> > >>>
> >> > >>> so the external hardware can also see cpu's TLB? or cpu's tlb flush can
> >> > >>> also broadcast to external hardware, then external hardware sees the
> >> > >>> cleared access flag, thus, it can set access flag in page table when the
> >> > >>> hardware access it? If this is the case, I feel what you said is true.
> >> > >> Perhaps it would help if I gave a concrete example. Take for example
> >> > >> the
> >> > >> ARM SMMU. It has it's own TLB. Invalidating this TLB is done in one of
> >> > >> two ways depending on the specific HW implementation.
> >> > >> If broadcast TLB maintenance (BTM) is supported it will snoop CPU
> >> > >> TLB
> >> > >> invalidations. If BTM is not supported it relies on SW to explicitly
> >> > >> forward TLB invalidations via MMU notifiers.
> >> > >
> >> > > On our ARM64 hardware, we rely on BTM to maintain TLB coherency.
> >> >
> >> > Lucky you :-)
> >> >
> >> > ARM64 SMMU architecture specification supports the possibilty of both,
> >> > as does the driver.
> >> > Not that I think whether or not BTM is supported has
> >> > much relevance to this issue.
> >> >
> >> > >> Now consider the case where some external device is accessing mappings
> >> > >> via the SMMU. The access flag will be cached in the SMMU TLB. If we
> >> > >> clear the access flag without a TLB invalidate the access flag in the
> >> > >> CPU page table will not get updated because it's already set in the SMMU
> >> > >> TLB.
> >> > >> As an aside access flag updates happen in one of two ways. If the
> >> > >> SMMU
> >> > >> HW supports hardware translation table updates (HTTU) then hardware will
> >> > >> manage updating access/dirty flags as required. If this is not supported
> >> > >> then SW is relied on to update these flags which in practice means
> >> > >> taking a minor fault. But I don't think that is relevant here - in
> >> > >> either case without a TLB invalidate neither of those things will
> >> > >> happen.
> >> > >> I suppose drivers could implement the clear_flush_young() MMU
> >> > >> notifier
> >> > >> callback (none do at the moment AFAICT) but then won't that just lead to
> >> > >> the opposite problem - that every page ever used by an external device
> >> > >> remains active and unavailable for reclaim because the access flag never
> >> > >> gets cleared? I suppose they could do the flush then which would ensure
> >> > >
> >> > > Yes, I think so too. The reason there is currently no problem, perhaps
> >> > > I think, there are no actual use cases at the moment? At least on our
> >> > > Alibaba's fleet, SMMU and MMU do not share page tables now.
> >> >
> >> > We have systems that do.
> >>
> >> Just curious: do those systems run the Linux kernel? If so, are pages
> >> shared with SMMU pinned? If not, then how are IO PFs handled after
> >> pages are reclaimed?
>
> Yes, these systems all run Linux.
> Pages shared with SMMU aren't pinned
> and fault handling works as Barry notes below - a driver is notified of
> a fault and calls handle_mm_fault() in response.
>
> > it will call handle_mm_fault(vma, prm->addr, fault_flags, NULL); in
> > I/O PF, so finally
> > it runs the same codes to get page back just like CPU's PF.
> >
> > years ago, we recommended a pin solution, but obviously there were lots of
> > push backs:
> > https://lore.kernel.org/linux-mm/1612685884-19514-1-git-send-email-wangzhou1@hisilicon.com/
>
> Right. Having to pin pages defeats the whole point of having hardware
> that can handle page faults.

Right. But sometimes I feel that the purpose and the technology serving
it are contradictory. People want userspace DMA to deliver high bandwidth
and performance by dropping user/kernel copying and the complicated
syscalls that manage those buffers, but I/O PF somehow defeats all the
efforts to improve performance.

AFAIK, I/O PF is much slower than a CPU PF because it depends on
interrupts and event queues in hardware. Even a minor page fault due to
page migration or demand paging can significantly slow down userspace
DMA. So in the end people have to depend on mlock-like and pre-paging
approaches if performance is the key concern.

Thanks
Barry
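
P.S. To make the "mlock-like and pre-paging" idea concrete, here is a
minimal userspace sketch, not taken from any patch in this thread. The
helper names (round_up_to_page, dma_buf_alloc_resident,
dma_buf_free_resident) are hypothetical. It assumes a Linux system where
MAP_POPULATE is available and RLIMIT_MEMLOCK allows locking the buffer.
Note that mlock() only keeps the pages resident; as far as I know it
does not stop page migration, so a device may still take an I/O PF in
that case.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: round a byte count up to a whole number of pages. */
size_t round_up_to_page(size_t len)
{
	long page = sysconf(_SC_PAGESIZE);

	assert(page > 0);
	return (len + (size_t)page - 1) / (size_t)page * (size_t)page;
}

/*
 * Hypothetical helper: allocate an anonymous buffer that is pre-faulted
 * (MAP_POPULATE) and locked in RAM (mlock), so that neither demand paging
 * nor reclaim should cause a fault when a device later touches it.
 * Returns NULL on failure, e.g. when RLIMIT_MEMLOCK is too small.
 */
void *dma_buf_alloc_resident(size_t len)
{
	void *buf;

	if (len == 0)
		return NULL;
	len = round_up_to_page(len);

	buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
	if (buf == MAP_FAILED)
		return NULL;

	if (mlock(buf, len) != 0) {
		munmap(buf, len);
		return NULL;
	}
	return buf;
}

/* Hypothetical helper: undo dma_buf_alloc_resident() for the same len. */
void dma_buf_free_resident(void *buf, size_t len)
{
	if (buf == NULL || len == 0)
		return;
	len = round_up_to_page(len);
	munlock(buf, len);
	munmap(buf, len);
}
```

A caller would allocate the buffer once up front, hand its address to
the device, and free it only after the DMA has completed. The cost is
exactly the one discussed above: the memory is no longer reclaimable
while it is locked.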