From: Barry Song <21cnbao@gmail.com>
Date: Mon, 3 Jun 2024 20:58:26 +1200
Subject: Re: [PATCH v3 1/6] mm: memory: extend finish_fault() to support large folio
To: Baolin Wang
Cc: akpm@linux-foundation.org, hughd@google.com, willy@infradead.org,
 david@redhat.com, wangkefeng.wang@huawei.com, ying.huang@intel.com,
 ryan.roberts@arm.com, shy828301@gmail.com, ziy@nvidia.com,
 ioworker0@gmail.com, da.gomez@samsung.com, p.raghav@samsung.com,
 linux-mm@kvack.org, linux-kernel@vger.kernel.org
On Mon, Jun 3, 2024 at 8:29 PM Baolin Wang wrote:
>
> On 2024/6/3 13:28, Barry Song wrote:
> > On Thu, May 30, 2024 at 2:04 PM Baolin Wang wrote:
> >>
> >> Add large folio mapping establishment support for finish_fault() as a preparation,
> >> to support multi-size THP allocation of anonymous shmem pages in the following
> >> patches.
> >>
> >> Signed-off-by: Baolin Wang
> >> ---
> >>  mm/memory.c | 58 ++++++++++++++++++++++++++++++++++++++++++++----------
> >>  1 file changed, 48 insertions(+), 10 deletions(-)
> >>
> >> diff --git a/mm/memory.c b/mm/memory.c
> >> index eef4e482c0c2..435187ff7ea4 100644
> >> --- a/mm/memory.c
> >> +++ b/mm/memory.c
> >> @@ -4831,9 +4831,12 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
> >>  {
> >>         struct vm_area_struct *vma = vmf->vma;
> >>         struct page *page;
> >> +       struct folio *folio;
> >>         vm_fault_t ret;
> >>         bool is_cow = (vmf->flags & FAULT_FLAG_WRITE) &&
> >>                       !(vma->vm_flags & VM_SHARED);
> >> +       int type, nr_pages, i;
> >> +       unsigned long addr = vmf->address;
> >>
> >>         /* Did we COW the page? */
> >>         if (is_cow)
> >> @@ -4864,24 +4867,59 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
> >>                 return VM_FAULT_OOM;
> >>         }
> >>
> >> +       folio = page_folio(page);
> >> +       nr_pages = folio_nr_pages(folio);
> >> +
> >> +       /*
> >> +        * Using per-page fault to maintain the uffd semantics, and same
> >> +        * approach also applies to non-anonymous-shmem faults to avoid
> >> +        * inflating the RSS of the process.
> >
> > I don't feel the comment explains the root cause. For non-shmem, we
> > have already allocated the memory anyway, so avoiding RSS inflation
> > seems of little use once the memory is occupied; the memory footprint
> > is what we really care about.
>
> This is also to keep the same behavior as before for non-anon-shmem, and
> will be discussed in the future.

OK.

> > So do we want to rely on the read-ahead hints of each subpage to
> > determine the read-ahead size? Is that why we don't map nr_pages for
> > non-shmem files, even though we could potentially avoid nr_pages - 1
> > page faults?
>
> IMHO, there are two cases for non-anon-shmem:
> (1) read mmap() faults: we can rely on the 'fault_around_bytes'
> interface to determine what size of mapping to build.
> (2) writable mmap() faults: I want to keep the same behavior as before
> (per-page fault), but we can talk about this when I send new patches to
> use mTHP to control large folio allocation for writable mmap().

OK.

> >> +        */
> >> +       if (!vma_is_anon_shmem(vma) || unlikely(userfaultfd_armed(vma))) {
> >> +               nr_pages = 1;
> >> +       } else if (nr_pages > 1) {
> >> +               pgoff_t idx = folio_page_idx(folio, page);
> >> +               /* The page offset of vmf->address within the VMA. */
> >> +               pgoff_t vma_off = vmf->pgoff - vmf->vma->vm_pgoff;
> >> +
> >> +               /*
> >> +                * Fallback to per-page fault in case the folio size in page
> >> +                * cache beyond the VMA limits.
> >> +                */
> >> +               if (unlikely(vma_off < idx ||
> >> +                            vma_off + (nr_pages - idx) > vma_pages(vma))) {
> >> +                       nr_pages = 1;
> >> +               } else {
> >> +                       /* Now we can set mappings for the whole large folio. */
> >> +                       addr = vmf->address - idx * PAGE_SIZE;
> >> +                       page = &folio->page;
> >> +               }
> >> +       }
> >> +
> >>         vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
> >> -                                      vmf->address, &vmf->ptl);
> >> +                                      addr, &vmf->ptl);
> >>         if (!vmf->pte)
> >>                 return VM_FAULT_NOPAGE;
> >>
> >>         /* Re-check under ptl */
> >> -       if (likely(!vmf_pte_changed(vmf))) {
> >> -               struct folio *folio = page_folio(page);
> >> -               int type = is_cow ? MM_ANONPAGES : mm_counter_file(folio);
> >> -
> >> -               set_pte_range(vmf, folio, page, 1, vmf->address);
> >> -               add_mm_counter(vma->vm_mm, type, 1);
> >> -               ret = 0;
> >> -       } else {
> >> -               update_mmu_tlb(vma, vmf->address, vmf->pte);
> >> +       if (nr_pages == 1 && unlikely(vmf_pte_changed(vmf))) {
> >> +               update_mmu_tlb(vma, addr, vmf->pte);
> >>                 ret = VM_FAULT_NOPAGE;
> >> +               goto unlock;
> >> +       } else if (nr_pages > 1 && !pte_range_none(vmf->pte, nr_pages)) {
> >
> > In what case can't we use !pte_range_none(vmf->pte, 1) for nr_pages == 1
> > and then unify the code for nr_pages == 1 and nr_pages > 1?
> >
> > It seems this has been discussed before, but I forget the reason.
>
> IIUC, this is for the uffd case, which is not a none pte entry.
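To convince myself, I wrote a minimal userspace model of what I
understand pte_range_none() to check (pte values are modeled as raw
integers, so this is a sketch of the semantics only, not the
mm/memory.c implementation):

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Model: a pte is "none" when its raw value is 0. A uffd pte marker
     * is a non-present but non-zero entry, so it is not none. */
    typedef uint64_t pte_t;

    static bool pte_none(pte_t pte)
    {
            return pte == 0;
    }

    /* The whole range must still be none; a single populated pte (or a
     * single uffd marker) rejects mapping the large folio. */
    static bool pte_range_none(const pte_t *pte, int nr_pages)
    {
            for (int i = 0; i < nr_pages; i++)
                    if (!pte_none(pte[i]))
                            return false;
            return true;
    }

    int main(void)
    {
            pte_t ptes[4] = { 0, 0, 5, 0 }; /* entry 2 models a marker */

            printf("%d\n", pte_range_none(ptes, 4)); /* 0: fall back */
            printf("%d\n", pte_range_none(ptes, 2)); /* 1: still none */
            return 0;
    }

So a marker entry is not none, while vmf_pte_changed() (if I read it
correctly) compares against vmf->orig_pte when FAULT_FLAG_ORIG_PTE_VALID
is set and therefore tolerates it; I guess that is why the nr_pages == 1
path keeps vmf_pte_changed().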
Is it possible to have a COW case for shmem? For example, if someone
maps a shmem file as read-only and then writes to it, would that
prevent the use of pte_range_none?

Furthermore, if we encounter a large folio in shmem while reading, does
that necessarily mean we can map the entire folio? Is it possible for
some processes to map only part of a large folio? For instance, if
process A allocates large folios and process B maps only part of this
shmem file, or partially unmaps a large folio, how would that be
handled?

Apologies for not debugging this thoroughly, but these two corner cases
seem worth considering. If these scenarios have already been addressed,
please disregard my comments.

Thanks
Barry
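P.S. While re-reading the hunk above, I found it helpful to recompute
the VMA bounds check in userspace. folio_fits_vma() below is a made-up
name for illustration; it only mirrors the arithmetic of the fallback
test in the patch:

    #include <stdbool.h>
    #include <stdio.h>

    /*
     * The folio's first page would map at VMA page offset vma_off - idx
     * and the page after its last one at vma_off + (nr_pages - idx);
     * both ends must stay inside the VMA, otherwise finish_fault()
     * falls back to mapping a single page.
     */
    static bool folio_fits_vma(unsigned long vma_off, unsigned long idx,
                               unsigned long nr_pages,
                               unsigned long nr_vma_pages)
    {
            return vma_off >= idx &&
                   vma_off + (nr_pages - idx) <= nr_vma_pages;
    }

    int main(void)
    {
            /* Fault at VMA offset 2 hits page 3 of a 4-page folio in a
             * 16-page VMA: the folio would start before the VMA. */
            printf("%d\n", folio_fits_vma(2, 3, 4, 16)); /* 0 */

            /* Fault at VMA offset 8 on the same folio: fits entirely,
             * so all four pages can be mapped in one go. */
            printf("%d\n", folio_fits_vma(8, 3, 4, 16)); /* 1 */
            return 0;
    }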