Subject: Re: [PATCH v6] hugetlb: don't delete vma_lock in hugetlb MADV_DONTNEED processing
From: Nadav Amit <nadav.amit@gmail.com>
Date: Mon, 31 Oct 2022 17:54:01 -0700
To: Mike Kravetz
Cc: Linux-MM, kernel list, Naoya Horiguchi, David Hildenbrand, Axel Rasmussen, Mina Almasry, Peter Xu, Rik van Riel, Vlastimil Babka, Matthew Wilcox, Andrew Morton, Wei Chen, "# 5.10+"
In-Reply-To: <20221031223440.285187-1-mike.kravetz@oracle.com>
References: <20221031223440.285187-1-mike.kravetz@oracle.com>
Message-Id: <8D2D2F0F-9A53-42B5-8A9D-936E06E4A4E9@gmail.com>
On Oct 31, 2022, at 3:34 PM, Mike Kravetz wrote:

> madvise(MADV_DONTNEED) ends up calling zap_page_range() to clear the page
> tables associated with the address range.  For hugetlb vmas,
> zap_page_range will call __unmap_hugepage_range_final.  However,
> __unmap_hugepage_range_final assumes the passed vma is about to be removed
> and deletes the vma_lock to prevent pmd sharing as the vma is on the way
> out.  In the case of madvise(MADV_DONTNEED) the vma remains, but the
> missing vma_lock prevents pmd sharing and could potentially lead to issues
> with truncation/fault races.
>
[snip]
> index 978c17df053e..517c8cc8ccb9 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3464,4 +3464,7 @@ madvise_set_anon_name(struct mm_struct *mm, unsigned long start,
>   */
>  #define ZAP_FLAG_DROP_MARKER	((__force zap_flags_t) BIT(0))
>
> +/* Set in unmap_vmas() to indicate an unmap call.  Only used by hugetlb */
> +#define ZAP_FLAG_UNMAP		((__force zap_flags_t) BIT(1))

PeterZ wants to add a ZAP_FLAG_FORCE_FLUSH that would be set in
zap_pte_range().  I am not sure whether you would want to combine the two,
but at least be aware of the potential conflict.

https://lore.kernel.org/all/Y1f7YvKuwOl1XEwU@hirez.programming.kicks-ass.net/

[snip]

> +#ifdef CONFIG_ADVISE_SYSCALLS
> +/*
> + * Similar setup as in zap_page_range().  madvise(MADV_DONTNEED) can not call
> + * zap_page_range for hugetlb vmas as __unmap_hugepage_range_final will delete
> + * the associated vma_lock.
> + */
> +void clear_hugetlb_page_range(struct vm_area_struct *vma, unsigned long start,
> +		unsigned long end)
> +{
> +	struct mmu_notifier_range range;
> +	struct mmu_gather tlb;
> +
> +	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm,
> +				start, end);
> +	adjust_range_if_pmd_sharing_possible(vma, &range.start, &range.end);
> +	tlb_gather_mmu(&tlb, vma->vm_mm);
> +	update_hiwater_rss(vma->vm_mm);
> +	mmu_notifier_invalidate_range_start(&range);
> +
> +	__unmap_hugepage_range_locking(&tlb, vma, start, end, NULL, 0);
> +
> +	mmu_notifier_invalidate_range_end(&range);
> +	tlb_finish_mmu(&tlb);
> +}
> +#endif

I hate ifdefs.  And the second definition of clear_hugetlb_page_range() is
confusing, since it does not have an ifdef at all.  How about moving the
ifdefs into the function, as is done in io_madvise_prep()?  I think that
would be less confusing.

[snip]

> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1671,7 +1671,7 @@ void unmap_vmas(struct mmu_gather *tlb, struct maple_tree *mt,
>  {
>  	struct mmu_notifier_range range;
>  	struct zap_details details = {
> -		.zap_flags = ZAP_FLAG_DROP_MARKER,
> +		.zap_flags = ZAP_FLAG_DROP_MARKER | ZAP_FLAG_UNMAP,
>  		/* Careful - we need to zap private pages too!
*/ > .even_cows =3D true, > }; > @@ -1704,15 +1704,21 @@ void zap_page_range(struct vm_area_struct = *vma, unsigned long start, > MA_STATE(mas, mt, vma->vm_end, vma->vm_end); >=20 > lru_add_drain(); > - mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, = vma->vm_mm, > - start, start + size); > tlb_gather_mmu(&tlb, vma->vm_mm); > update_hiwater_rss(vma->vm_mm); > - mmu_notifier_invalidate_range_start(&range); > do { > - unmap_single_vma(&tlb, vma, start, range.end, NULL); > + mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, = vma, > + vma->vm_mm, > + max(start, vma->vm_start), > + min(start + size, vma->vm_end)); > + if (is_vm_hugetlb_page(vma)) > + adjust_range_if_pmd_sharing_possible(vma, > + &range.start, > + &range.end); > + mmu_notifier_invalidate_range_start(&range); > + unmap_single_vma(&tlb, vma, start, start + size, NULL); Is there a reason that you wouldn=E2=80=99t use range.start and = range.end here? At least for consistency.