From: Yu Zhao
Date: Wed, 25 Oct 2023 12:22:37 -0600
Subject: Re: [PATCH] arm64: mm: drop tlb flush operation when clearing the access bit
To: Alistair Popple
Cc: Barry Song <21cnbao@gmail.com>, Baolin Wang, catalin.marinas@arm.com, will@kernel.org, akpm@linux-foundation.org, v-songbaohua@oppo.com, linux-mm@kvack.org, linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org
In-Reply-To: <877cnb0zyk.fsf@nvdebian.thelocal>
References: <87y1frqz2u.fsf@nvdebian.thelocal> <87ttqfqw8f.fsf@nvdebian.thelocal> <87bkcn1j5k.fsf@nvdebian.thelocal> <877cnb0zyk.fsf@nvdebian.thelocal>

On Wed, Oct 25, 2023 at 4:16 AM Alistair Popple wrote:
>
>
> Barry Song <21cnbao@gmail.com> writes:
>
> > On Wed, Oct 25, 2023 at 2:17 PM Yu Zhao wrote:
> >>
> >> On Tue, Oct 24, 2023 at 9:21 PM Alistair Popple wrote:
> >> >
> >> >
> >> > Baolin Wang writes:
> >> >
> >> > > On 10/25/2023 9:58 AM, Alistair Popple wrote:
> >> > >> Barry Song <21cnbao@gmail.com> writes:
> >> > >>
> >> > >>> On Wed, Oct 25, 2023 at 9:18 AM Alistair Popple wrote:
> >> > >>>>
> >> > >>>>
> >> > >>>> Barry Song <21cnbao@gmail.com> writes:
> >> > >>>>
> >> > >>>>> On Wed, Oct 25, 2023 at 7:16 AM Barry Song <21cnbao@gmail.com> wrote:
> >> > >>>>>>
> >> > >>>>>> On Tue, Oct 24, 2023 at 8:57 PM Baolin Wang wrote:
> >> > >> [...]
> >> > >>
> >> > >>>>>> (A). Constant flush cost vs. (B). very very occasional reclaimed hot
> >> > >>>>>> page, B might be a correct choice.
> >> > >>>>>
> >> > >>>>> Plus, I doubt B is really going to happen. as after a page is promoted to
> >> > >>>>> the head of lru list or new generation, it needs a long time to slide back
> >> > >>>>> to the inactive list tail or to the candidate generation of mglru. the time
> >> > >>>>> should have been large enough for tlb to be flushed. If the page is really
> >> > >>>>> hot, the hardware will get second, third, fourth etc opportunity to set an
> >> > >>>>> access flag in the long time in which the page is re-moved to the tail
> >> > >>>>> as the page can be accessed multiple times if it is really hot.
> >> > >>>>
> >> > >>>> This might not be true if you have external hardware sharing the page
> >> > >>>> tables with software through either HMM or hardware supported ATS
> >> > >>>> though.
> >> > >>>>
> >> > >>>> In those cases I think it's much more likely hardware can still be
> >> > >>>> accessing the page even after a context switch on the CPU say. So those
> >> > >>>> pages will tend to get reclaimed even though hardware is still actively
> >> > >>>> using them which would be quite expensive and I guess could lead to
> >> > >>>> thrashing as each page is reclaimed and then immediately faulted back
> >> > >>>> in.
> >> > >
> >> > > That's possible, but the chance should be relatively low. At least on
> >> > > x86, I have not heard of this issue.
> >> >
> >> > Personally I've never seen any x86 system that shares page tables with
> >> > external devices, other than with HMM. More on that below.
> >> >
> >> > >>> i am not quite sure i got your point. has the external hardware sharing cpu's
> >> > >>> pagetable the ability to set access flag in page table entries by
> >> > >>> itself? if yes,
> >> > >>> I don't see how our approach will hurt as folio_referenced can notify the
> >> > >>> hardware driver and the driver can flush its own tlb. If no, i don't see
> >> > >>> either as the external hardware can't set access flags, that means we
> >> > >>> have ignored its reference and only knows cpu's access even in the current
> >> > >>> mainline code. so we are not getting worse.
> >> > >>>
> >> > >>> so the external hardware can also see cpu's TLB? or cpu's tlb flush can
> >> > >>> also broadcast to external hardware, then external hardware sees the
> >> > >>> cleared access flag, thus, it can set access flag in page table when the
> >> > >>> hardware access it? If this is the case, I feel what you said is true.
> >> > >> Perhaps it would help if I gave a concrete example. Take for example
> >> > >> the ARM SMMU. It has its own TLB. Invalidating this TLB is done in one of
> >> > >> two ways depending on the specific HW implementation.
> >> > >> If broadcast TLB maintenance (BTM) is supported it will snoop CPU
> >> > >> TLB invalidations. If BTM is not supported it relies on SW to explicitly
> >> > >> forward TLB invalidations via MMU notifiers.
> >> > >
> >> > > On our ARM64 hardware, we rely on BTM to maintain TLB coherency.
> >> >
> >> > Lucky you :-)
> >> >
> >> > ARM64 SMMU architecture specification supports the possibility of both,
> >> > as does the driver. Not that I think whether or not BTM is supported has
> >> > much relevance to this issue.
> >> >
> >> > >> Now consider the case where some external device is accessing mappings
> >> > >> via the SMMU. The access flag will be cached in the SMMU TLB. If we
> >> > >> clear the access flag without a TLB invalidate the access flag in the
> >> > >> CPU page table will not get updated because it's already set in the SMMU
> >> > >> TLB.
> >> > >> As an aside access flag updates happen in one of two ways. If the
> >> > >> SMMU HW supports hardware translation table updates (HTTU) then hardware will
> >> > >> manage updating access/dirty flags as required. If this is not supported
> >> > >> then SW is relied on to update these flags which in practice means
> >> > >> taking a minor fault. But I don't think that is relevant here - in
> >> > >> either case without a TLB invalidate neither of those things will
> >> > >> happen.
> >> > >> I suppose drivers could implement the clear_flush_young() MMU
> >> > >> notifier callback (none do at the moment AFAICT) but then won't that just lead to
> >> > >> the opposite problem - that every page ever used by an external device
> >> > >> remains active and unavailable for reclaim because the access flag never
> >> > >> gets cleared? I suppose they could do the flush then which would ensure
> >> > >
> >> > > Yes, I think so too. The reason there is currently no problem, perhaps
> >> > > I think, there are no actual use cases at the moment? At least on our
> >> > > Alibaba's fleet, SMMU and MMU do not share page tables now.
> >> >
> >> > We have systems that do.
> >>
> >> Just curious: do those systems run the Linux kernel? If so, are pages
> >> shared with SMMU pinned? If not, then how are IO PFs handled after
> >> pages are reclaimed?
>
> Yes, these systems all run Linux. Pages shared with SMMU aren't pinned
> and fault handling works as Barry notes below - a driver is notified of
> a fault and calls handle_mm_fault() in response.
>
> > it will call handle_mm_fault(vma, prm->addr, fault_flags, NULL); in
> > I/O PF, so finally
> > it runs the same codes to get page back just like CPU's PF.
> >
> > years ago, we recommended a pin solution, but obviously there were lots of
> > push backs:
> > https://lore.kernel.org/linux-mm/1612685884-19514-1-git-send-email-wangzhou1@hisilicon.com/
>
> Right. Having to pin pages defeats the whole point of having hardware
> that can handle page faults.

Thanks.

How would a DMA transaction be retried after the kernel resolves an IO
PF? I.e., does the h/w (PCIe spec, etc.) support auto retries or is
the s/w responsible for doing so? At least when I worked on the PCI
subsystem, I didn't know any device that was capable of doing auto
retries.

(Pasha and I will have a talk on IOMMU at the coming LPC, so this
might be an interesting intersection between IOMMU and MM to discuss.)