From: Barry Song <21cnbao@gmail.com>
Date: Mon, 3 Jun 2024 21:01:24 +1200
Subject: Re: [PATCH v3 1/6] mm: memory: extend finish_fault() to support large folio
To: Baolin Wang
Cc: akpm@linux-foundation.org, hughd@google.com, willy@infradead.org,
 david@redhat.com, wangkefeng.wang@huawei.com, ying.huang@intel.com,
 ryan.roberts@arm.com, shy828301@gmail.com, ziy@nvidia.com,
 ioworker0@gmail.com, da.gomez@samsung.com, p.raghav@samsung.com,
 linux-mm@kvack.org, linux-kernel@vger.kernel.org

On Mon, Jun 3, 2024 at 8:58 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Mon, Jun 3, 2024 at 8:29 PM Baolin Wang wrote:
> >
> > On 2024/6/3 13:28, Barry Song wrote:
> > > On Thu, May 30, 2024 at 2:04 PM Baolin Wang wrote:
> > >>
> > >> Add large folio mapping establishment support for finish_fault() as a
> > >> preparation, to support multi-size THP allocation of anonymous shmem
> > >> pages in the following patches.
> > >>
> > >> Signed-off-by: Baolin Wang
> > >> ---
> > >>  mm/memory.c | 58 ++++++++++++++++++++++++++++++++++++++++++++----------
> > >>  1 file changed, 48 insertions(+), 10 deletions(-)
> > >>
> > >> diff --git a/mm/memory.c b/mm/memory.c
> > >> index eef4e482c0c2..435187ff7ea4 100644
> > >> --- a/mm/memory.c
> > >> +++ b/mm/memory.c
> > >> @@ -4831,9 +4831,12 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
> > >>  {
> > >>         struct vm_area_struct *vma = vmf->vma;
> > >>         struct page *page;
> > >> +       struct folio *folio;
> > >>         vm_fault_t ret;
> > >>         bool is_cow = (vmf->flags & FAULT_FLAG_WRITE) &&
> > >>                       !(vma->vm_flags & VM_SHARED);
> > >> +       int type, nr_pages, i;
> > >> +       unsigned long addr = vmf->address;
> > >>
> > >>         /* Did we COW the page? */
> > >>         if (is_cow)
> > >> @@ -4864,24 +4867,59 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
> > >>                 return VM_FAULT_OOM;
> > >>         }
> > >>
> > >> +       folio = page_folio(page);
> > >> +       nr_pages = folio_nr_pages(folio);
> > >> +
> > >> +       /*
> > >> +        * Using per-page fault to maintain the uffd semantics, and same
> > >> +        * approach also applies to non-anonymous-shmem faults to avoid
> > >> +        * inflating the RSS of the process.
> > >
> > > I don't feel the comment explains the root cause.
> > > For non-shmem, anyway we have allocated the memory? Avoiding inflating
> > > RSS seems not so useful as we have occupied the memory. The memory footprint
> >
> > This is also to keep the same behavior as before for non-anon-shmem, and
> > will be discussed in the future.
>
> OK.
>
> > > is what we really care about. So we want to rely on read-ahead hints of
> > > subpage to determine read-ahead size? That is why we don't map nr_pages
> > > for non-shmem files though we can potentially reduce nr_pages - 1 page
> > > faults?
> >
> > IMHO, there are 2 cases for non-anon-shmem:
> > (1) read mmap() faults: we can rely on the 'fault_around_bytes'
> > interface to determine what size of mapping to build.
> > (2) writable mmap() faults: I want to keep the same behavior as before
> > (per-page fault), but we can talk about this when I send new patches to
> > use mTHP to control large folio allocation for writable mmap().
>
> OK.
>
> > >> +        */
> > >> +       if (!vma_is_anon_shmem(vma) || unlikely(userfaultfd_armed(vma))) {
> > >> +               nr_pages = 1;
> > >> +       } else if (nr_pages > 1) {
> > >> +               pgoff_t idx = folio_page_idx(folio, page);
> > >> +               /* The page offset of vmf->address within the VMA. */
> > >> +               pgoff_t vma_off = vmf->pgoff - vmf->vma->vm_pgoff;
> > >> +
> > >> +               /*
> > >> +                * Fallback to per-page fault in case the folio size in page
> > >> +                * cache beyond the VMA limits.
> > >> +                */
> > >> +               if (unlikely(vma_off < idx ||
> > >> +                            vma_off + (nr_pages - idx) > vma_pages(vma))) {
> > >> +                       nr_pages = 1;
> > >> +               } else {
> > >> +                       /* Now we can set mappings for the whole large folio. */
> > >> +                       addr = vmf->address - idx * PAGE_SIZE;
> > >> +                       page = &folio->page;
> > >> +               }
> > >> +       }
> > >> +
> > >>         vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
> > >> -                                      vmf->address, &vmf->ptl);
> > >> +                                      addr, &vmf->ptl);
> > >>         if (!vmf->pte)
> > >>                 return VM_FAULT_NOPAGE;
> > >>
> > >>         /* Re-check under ptl */
> > >> -       if (likely(!vmf_pte_changed(vmf))) {
> > >> -               struct folio *folio = page_folio(page);
> > >> -               int type = is_cow ? MM_ANONPAGES : mm_counter_file(folio);
> > >> -
> > >> -               set_pte_range(vmf, folio, page, 1, vmf->address);
> > >> -               add_mm_counter(vma->vm_mm, type, 1);
> > >> -               ret = 0;
> > >> -       } else {
> > >> -               update_mmu_tlb(vma, vmf->address, vmf->pte);
> > >> +       if (nr_pages == 1 && unlikely(vmf_pte_changed(vmf))) {
> > >> +               update_mmu_tlb(vma, addr, vmf->pte);
> > >>                 ret = VM_FAULT_NOPAGE;
> > >> +               goto unlock;
> > >> +       } else if (nr_pages > 1 && !pte_range_none(vmf->pte, nr_pages)) {
> > >
> > > In what case we can't use !pte_range_none(vmf->pte, 1) for nr_pages == 1
> > > and then unify the code for nr_pages == 1 and nr_pages > 1?
> > >
> > > It seems this has been discussed before, but I forget the reason.
> >
> > IIUC, this is for the uffd case, which is not a none pte entry.
>
> Is it possible to have a COW case for shmem? For example, if someone
> maps a shmem file as read-only and then writes to it, would that
> prevent the use of pte_range_none?

Sorry, I meant PRIVATE, not READ-ONLY.

> Furthermore, if we encounter a large folio in shmem while reading, does
> it necessarily mean we can map the entire folio? Is it possible for some
> processes to only map part of large folios? For instance, if process A
> allocates large folios and process B maps only part of this shmem file,
> or partially unmaps a large folio, how would that be handled?
>
> Apologies for not debugging this thoroughly, but these two corner cases
> seem worth considering. If these scenarios have already been addressed,
> please disregard my comments.
>
> Thanks
> Barry