From mboxrd@z Thu Jan  1 00:00:00 1970
From: Yang Shi <shy828301@gmail.com>
Date: Wed, 28 Jun 2023 19:13:05 -0700
Subject: Re: [PATCH v1 10/10] mm: Allocate large folios for anonymous memory
To: Ryan Roberts
Cc: Andrew Morton, "Matthew Wilcox (Oracle)", "Kirill A. Shutemov",
 Yin Fengwei, David Hildenbrand, Yu Zhao, Catalin Marinas, Will Deacon,
 Geert Uytterhoeven, Christian Borntraeger, Sven Schnelle, Thomas Gleixner,
 Ingo Molnar, Borislav Petkov, Dave Hansen, "H. Peter Anvin",
 linux-kernel@vger.kernel.org, linux-mm@kvack.org,
 linux-alpha@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
 linux-ia64@vger.kernel.org, linux-m68k@lists.linux-m68k.org,
 linux-s390@vger.kernel.org
References: <20230626171430.3167004-1-ryan.roberts@arm.com>
 <20230626171430.3167004-11-ryan.roberts@arm.com>
In-Reply-To: <20230626171430.3167004-11-ryan.roberts@arm.com>
Content-Type: text/plain; charset="UTF-8"
On Mon, Jun 26, 2023 at 10:15 AM Ryan Roberts wrote:
>
> With all of the enabler patches in place, modify the anonymous memory
> write allocation path so that it opportunistically attempts to allocate
> a large folio up to `max_anon_folio_order()` size (This value is
> ultimately configured by the architecture). This reduces the number of
> page faults, reduces the size of (e.g. LRU) lists, and generally
> improves performance by batching what were per-page operations into
> per-(large)-folio operations.
>
> If CONFIG_LARGE_ANON_FOLIO is not enabled (the default) then
> `max_anon_folio_order()` always returns 0, meaning we get the existing
> allocation behaviour.
>
> Signed-off-by: Ryan Roberts
> ---
>  mm/memory.c | 159 +++++++++++++++++++++++++++++++++++++++++++++++-----
>  1 file changed, 144 insertions(+), 15 deletions(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index a8f7e2b28d7a..d23c44cc5092 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3161,6 +3161,90 @@ static inline int max_anon_folio_order(struct vm_area_struct *vma)
>          return CONFIG_LARGE_ANON_FOLIO_NOTHP_ORDER_MAX;
>  }
>
> +/*
> + * Returns index of first pte that is not none, or nr if all are none.
> + */
> +static inline int check_ptes_none(pte_t *pte, int nr)
> +{
> +        int i;
> +
> +        for (i = 0; i < nr; i++) {
> +                if (!pte_none(ptep_get(pte++)))
> +                        return i;
> +        }
> +
> +        return nr;
> +}
> +
> +static int calc_anon_folio_order_alloc(struct vm_fault *vmf, int order)
> +{
> +        /*
> +         * The aim here is to determine what size of folio we should allocate
> +         * for this fault.
> +         * Factors include:
> +         * - Order must not be higher than `order` upon entry
> +         * - Folio must be naturally aligned within VA space
> +         * - Folio must not breach boundaries of vma
> +         * - Folio must be fully contained inside one pmd entry
> +         * - Folio must not overlap any non-none ptes
> +         *
> +         * Additionally, we do not allow order-1 since this breaks assumptions
> +         * elsewhere in the mm; THP pages must be at least order-2 (since they
> +         * store state up to the 3rd struct page subpage), and these pages must
> +         * be THP in order to correctly use pre-existing THP infrastructure such
> +         * as folio_split().
> +         *
> +         * As a consequence of relying on the THP infrastructure, if the system
> +         * does not support THP, we always fallback to order-0.
> +         *
> +         * Note that the caller may or may not choose to lock the pte. If
> +         * unlocked, the calculation should be considered an estimate that will
> +         * need to be validated under the lock.
> +         */
> +
> +        struct vm_area_struct *vma = vmf->vma;
> +        int nr;
> +        unsigned long addr;
> +        pte_t *pte;
> +        pte_t *first_set = NULL;
> +        int ret;
> +
> +        if (has_transparent_hugepage()) {
> +                order = min(order, PMD_SHIFT - PAGE_SHIFT);
> +
> +                for (; order > 1; order--) {
> +                        nr = 1 << order;
> +                        addr = ALIGN_DOWN(vmf->address, nr << PAGE_SHIFT);
> +                        pte = vmf->pte - ((vmf->address - addr) >> PAGE_SHIFT);
> +
> +                        /* Check vma bounds. */
> +                        if (addr < vma->vm_start ||
> +                            addr + (nr << PAGE_SHIFT) > vma->vm_end)
> +                                continue;
> +
> +                        /* Ptes covered by order already known to be none. */
> +                        if (pte + nr <= first_set)
> +                                break;
> +
> +                        /* Already found set pte in range covered by order. */
> +                        if (pte <= first_set)
> +                                continue;
> +
> +                        /* Need to check if all the ptes are none.
> +                         */
> +                        ret = check_ptes_none(pte, nr);
> +                        if (ret == nr)
> +                                break;
> +
> +                        first_set = pte + ret;
> +                }
> +
> +                if (order == 1)
> +                        order = 0;
> +        } else
> +                order = 0;
> +
> +        return order;
> +}
> +
>  /*
>   * Handle write page faults for pages that can be reused in the current vma
>   *
> @@ -4201,6 +4285,9 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>          struct folio *folio;
>          vm_fault_t ret = 0;
>          pte_t entry;
> +        unsigned long addr;
> +        int order = uffd_wp ? 0 : max_anon_folio_order(vma);
> +        int pgcount = BIT(order);
>
>          /* File mapping without ->vm_ops ? */
>          if (vma->vm_flags & VM_SHARED)
> @@ -4242,24 +4329,44 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>                  pte_unmap_unlock(vmf->pte, vmf->ptl);
>                  return handle_userfault(vmf, VM_UFFD_MISSING);
>          }
> -        goto setpte;
> +        if (uffd_wp)
> +                entry = pte_mkuffd_wp(entry);
> +        set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
> +
> +        /* No need to invalidate - it was non-present before */
> +        update_mmu_cache(vma, vmf->address, vmf->pte);
> +        goto unlock;
>  }
>
> -        /* Allocate our own private page. */
> +retry:
> +        /*
> +         * Estimate the folio order to allocate. We are not under the ptl here
> +         * so this estimate needs to be re-checked later once we have the lock.
> +         */
> +        vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
> +        order = calc_anon_folio_order_alloc(vmf, order);
> +        pte_unmap(vmf->pte);
> +
> +        /* Allocate our own private folio. */
>          if (unlikely(anon_vma_prepare(vma)))
>                  goto oom;
> -        folio = vma_alloc_zeroed_movable_folio(vma, vmf->address, 0, 0);
> +        folio = try_vma_alloc_movable_folio(vma, vmf->address, order, true);
>          if (!folio)
>                  goto oom;
>
> +        /* We may have been granted less than we asked for.
> +         */
> +        order = folio_order(folio);
> +        pgcount = BIT(order);
> +        addr = ALIGN_DOWN(vmf->address, pgcount << PAGE_SHIFT);
> +
>          if (mem_cgroup_charge(folio, vma->vm_mm, GFP_KERNEL))
>                  goto oom_free_page;
>          folio_throttle_swaprate(folio, GFP_KERNEL);
>
>          /*
>           * The memory barrier inside __folio_mark_uptodate makes sure that
> -         * preceding stores to the page contents become visible before
> -         * the set_pte_at() write.
> +         * preceding stores to the folio contents become visible before
> +         * the set_ptes() write.
>           */
>          __folio_mark_uptodate(folio);
>
> @@ -4268,11 +4375,31 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>          if (vma->vm_flags & VM_WRITE)
>                  entry = pte_mkwrite(pte_mkdirty(entry));
>
> -        vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
> -                                       &vmf->ptl);
> -        if (vmf_pte_changed(vmf)) {
> -                update_mmu_tlb(vma, vmf->address, vmf->pte);
> -                goto release;
> +        vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, addr, &vmf->ptl);
> +
> +        /*
> +         * Ensure our estimate above is still correct; we could have raced with
> +         * another thread to service a fault in the region.
> +         */
> +        if (order == 0) {
> +                if (vmf_pte_changed(vmf)) {
> +                        update_mmu_tlb(vma, vmf->address, vmf->pte);
> +                        goto release;
> +                }
> +        } else if (check_ptes_none(vmf->pte, pgcount) != pgcount) {
> +                pte_t *pte = vmf->pte + ((vmf->address - addr) >> PAGE_SHIFT);
> +
> +                /* If faulting pte was allocated by another, exit early. */
> +                if (!pte_none(ptep_get(pte))) {
> +                        update_mmu_tlb(vma, vmf->address, pte);
> +                        goto release;
> +                }
> +
> +                /* Else try again, with a lower order. */
> +                pte_unmap_unlock(vmf->pte, vmf->ptl);
> +                folio_put(folio);
> +                order--;
> +                goto retry;

I'm not sure whether this extra fallback logic is worth it or not. Do
you have any benchmark data or is it just an arbitrary design choice?
If it is just an arbitrary design choice, I'd prefer the simplest
approach: just exit the page fault handler, as in the order-0 case, IMHO.

>  }
>
>          ret = check_stable_address_space(vma->vm_mm);
> @@ -4286,16 +4413,18 @@ static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
>                  return handle_userfault(vmf, VM_UFFD_MISSING);
>          }
>
> -        inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
> -        folio_add_new_anon_rmap(folio, vma, vmf->address);
> +        folio_ref_add(folio, pgcount - 1);
> +
> +        add_mm_counter(vma->vm_mm, MM_ANONPAGES, pgcount);
> +        folio_add_new_anon_rmap_range(folio, &folio->page, pgcount, vma, addr);
>          folio_add_lru_vma(folio, vma);
> -setpte:
> +
>          if (uffd_wp)
>                  entry = pte_mkuffd_wp(entry);
> -        set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
> +        set_ptes(vma->vm_mm, addr, vmf->pte, entry, pgcount);
>
>          /* No need to invalidate - it was non-present before */
> -        update_mmu_cache(vma, vmf->address, vmf->pte);
> +        update_mmu_cache_range(vma, addr, vmf->pte, pgcount);
>  unlock:
>          pte_unmap_unlock(vmf->pte, vmf->ptl);
>          return ret;
> --
> 2.25.1
>
>