From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 3ADAFD3C543 for ; Sat, 9 Nov 2024 03:09:24 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 6E5D76B00A3; Fri, 8 Nov 2024 22:09:23 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id 695C76B00A4; Fri, 8 Nov 2024 22:09:23 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 55D2F6B00A5; Fri, 8 Nov 2024 22:09:23 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0014.hostedemail.com [216.40.44.14]) by kanga.kvack.org (Postfix) with ESMTP id 385646B00A3 for ; Fri, 8 Nov 2024 22:09:23 -0500 (EST) Received: from smtpin09.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay07.hostedemail.com (Postfix) with ESMTP id 89C2A161065 for ; Sat, 9 Nov 2024 03:09:22 +0000 (UTC) X-FDA: 82765073706.09.0015E12 Received: from mail-pf1-f171.google.com (mail-pf1-f171.google.com [209.85.210.171]) by imf29.hostedemail.com (Postfix) with ESMTP id 169CF120012 for ; Sat, 9 Nov 2024 03:08:28 +0000 (UTC) Authentication-Results: imf29.hostedemail.com; dkim=pass header.d=bytedance.com header.s=google header.b=NX1wuveH; dmarc=pass (policy=quarantine) header.from=bytedance.com; spf=pass (imf29.hostedemail.com: domain of zhengqi.arch@bytedance.com designates 209.85.210.171 as permitted sender) smtp.mailfrom=zhengqi.arch@bytedance.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1731121588; a=rsa-sha256; cv=none; b=sLmzD7XVt4w5XIuPv6dHIE1sugWwh4ykmNmgbRKw9XW2/RZESbDFeK3uTZBrpGNcs/93tB iATTiTdtg/0xkyWrUztucn68iOISXlv2ZxLuR5Su2n9cipXGmp1BirzT3btDSHRxl7VBct 9RVvZk8o5YdPUcpUpO7Cde+EJv8IyVI= ARC-Authentication-Results: i=1; imf29.hostedemail.com; dkim=pass header.d=bytedance.com header.s=google header.b=NX1wuveH; dmarc=pass (policy=quarantine) header.from=bytedance.com; spf=pass (imf29.hostedemail.com: domain of zhengqi.arch@bytedance.com designates 209.85.210.171 as permitted sender) smtp.mailfrom=zhengqi.arch@bytedance.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1731121588; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=Gox0zaors/Usl9NjGMbl7yQXcq9+RVKH7P6QUdPGdOk=; b=e4irlCpZrDuSd244TDm7KXJAeLJOpyS1Lzl+7obcnb8bt7sPCpedO06AMKw5fwKioQJk4V BLhdVZjeAYDK+72ymFUWtA8/AjJo8RXg2xJlCgn3e7Ld2LfPZIJfOMytz8Zi9jH5+/Jsow h2+iE/Rm1akGCBlbbBJvrs4IL0qkt0w= Received: by mail-pf1-f171.google.com with SMTP id d2e1a72fcca58-71e681bc315so2114509b3a.0 for ; Fri, 08 Nov 2024 19:09:19 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=bytedance.com; s=google; t=1731121758; x=1731726558; darn=kvack.org; h=content-transfer-encoding:in-reply-to:from:references:cc:to :content-language:subject:user-agent:mime-version:date:message-id :from:to:cc:subject:date:message-id:reply-to; bh=Gox0zaors/Usl9NjGMbl7yQXcq9+RVKH7P6QUdPGdOk=; b=NX1wuveHLhBNowa4Re5+UlmIx6y/GOH7/Gfc6G4uofaJj+gOw0kkRPXNqqbUIXXd8a +h6ddbGf2RE2//Iygo0hWHWReF5XahkYkj8ZfVgxRzF/ba9i2xBKs4zrZb0xS/aaEIf6 6Yr8nLOJO9HypGHCAHlEV8q4Om3uNbVT7m7EPtowmszJyasNa+pRJA4RzrnwtksbPK+q N2BlF4abLkZmL0HM2/qFZjXlqQCQpysDQQC8m6sRmKYGZLiYrqX7mpSNw/ad004cCrY+ bZHvF3SLH2HBu8tj6IeVud8Gl8KVFK0jMlSFB4VMZ80o4HnZ+/yYuN4hPH9EPJ2b/yi8 xm5g== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1731121758; x=1731726558; h=content-transfer-encoding:in-reply-to:from:references:cc:to :content-language:subject:user-agent:mime-version:date:message-id :x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=Gox0zaors/Usl9NjGMbl7yQXcq9+RVKH7P6QUdPGdOk=; b=lt+H1ESxtH4iDVE7olLiA77gCs2gC+ifyUlFyhb5pUNF2Ols3GX0X7G5SN6n1aYRrD s0lAqRZ0MCyo5uM41cYfs6CTdRjDd2SLA0z8/mDyFV12qXm7RlN1Fu+61ARDj59dcaWu jeHZrKhEhUZIL2/VGqJxxxkA8M+ONKafXTDfBRb6WwApMfJuX0LUVh8MShyGP0cWahdP O+MnixVRhdkLx7g4fWPzy6D/zJMkn9KDEGobdA1IuxTCt/CehrRZKi5pY3ViTSSXJWxD BCjytg0JWQmTCwLD04kivX6xuARKS/2DmDXydnKtdJto1OqhMHEnULRxiA+clIIGrVOB OqwQ== X-Forwarded-Encrypted: i=1; AJvYcCUjrviBE2z5L5oKpVhnODKHkOnPMjz55hB0D+N4gZ5pq95u4vbFD7W0vtE2jotP2jOKccfFv29zpw==@kvack.org X-Gm-Message-State: AOJu0YzU5jE5azqrI2hJp8BD0A5Z65EwRSSVl39Fg1qQGIXEfrOdH60+ hhsfyH9HpqtUv6lQF4wESoSP3O4uYwzppJdT77bc2pZq0IPHHPMkWGWChz31pzE= X-Google-Smtp-Source: AGHT+IHdWXqVIIaaZW1uhG5g6QJs/KSB3EfnuHdAfZjDvz3rUX095jDt/0SkT/bqCMEnbRcuDuZNsg== X-Received: by 2002:a05:6a00:3a18:b0:71e:5573:8dcd with SMTP id d2e1a72fcca58-72413f4c4eamr7905372b3a.2.1731121758351; Fri, 08 Nov 2024 19:09:18 -0800 (PST) Received: from [10.84.149.95] ([63.216.146.178]) by smtp.gmail.com with ESMTPSA id 41be03b00d2f7-7f41f5b3f8fsm4256894a12.19.2024.11.08.19.09.13 (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Fri, 08 Nov 2024 19:09:17 -0800 (PST) Message-ID: Date: Sat, 9 Nov 2024 11:07:56 +0800 MIME-Version: 1.0 User-Agent: Mozilla Thunderbird Subject: Re: [PATCH v2 5/7] mm: pgtable: try to reclaim empty PTE page in madvise(MADV_DONTNEED) Content-Language: en-US To: Jann Horn Cc: david@redhat.com, hughd@google.com, willy@infradead.org, mgorman@suse.de, muchun.song@linux.dev, vbabka@kernel.org, akpm@linux-foundation.org, zokeefe@google.com, rientjes@google.com, peterx@redhat.com, catalin.marinas@arm.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org, x86@kernel.org References: <5d371247-853d-49ca-a28c-689a709905d4@bytedance.com> From: Qi Zheng In-Reply-To: Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 8bit X-Stat-Signature: nqbgsea33ssywpduqb47jpzb5yue5hfj X-Rspamd-Queue-Id: 169CF120012 X-Rspamd-Server: rspam08 X-Rspam-User: X-HE-Tag: 1731121708-471209 X-HE-Meta: U2FsdGVkX19KaRsFTqEzXDrjP7OpvZV2h9FXM6fEK55y739IvReXP7yyxOAXEXrhLQFWQhdD2Nv5Kx5fJfzXuIV6ihSrd0mTLL5lQ5+aVkbF9II+g8INAhuCytB+d6FgXTQkQg2u5LN3Q7bc+C+0Nz1Mrmpy1F+voTgk/SOc7Ng2HVJZT5hc4hxs8Wz6BnhQBVzhQ4Bno5FxtjToJYTsgYPG4c1eTJGFQSn48ekVQsjEN7gPZzEoD1r82+ezB34KW3wqBBgEFrLrXkTMwvoNAAoecjamROLocd97kRZjaOwe1vfI+vOYiMChR5egkf+gSXQPLmpZmQRLA1otwd2Q0wMj/oZMkeuuOUOLDwntO/8o2Wg8FHdawVA2Bn3bLHslM2bKY3w+wMArxtd/EyLmKC2G7VGYhAoWdXZsfPmATAplMgdSTC+ap0fsrJNDczFJPV8QWBDklVKr6DfLG7aIZPNG+eKJeNCDwxZbTL3uy4uQe/LK+4et45zPLcsH1bTK29R1EWV9CQzZcHhcOw6MVKx/N0y4YXxi/BzaU3dZ31A2IBNalcpCjqxHSwxRdBK13N9mQk3vDd2ZRCX4A3e0hieqHuiqFcyVK/iYZArSxL7fxmYqBF8+AeqIWhZ+Q2h+h+V4MKUpV/fuW7ZZXO4cEOjYFryhNRJt12tcFXw1BgA8iT5EH6FSozkMSrKiq0Da+fZ6WNKXE0+GFpOmn3mhBUZM74mAtqRG1OE0F8BySHXIvXgTriOkBe9AT896mxdufAYahsFKBamsid1lpb2luORxgnhtTmYCoR6YL7gq03iYejswvcLdsrYQBIYAjq6Lqh6l/ViFJgxUr5OownaNrEa0JFfTs40yxw13AJrbtXjLKgDqxBBHy0ih2zmJW2mmTqBlMgk9n2jh53PaYhT8FOsbVR+5WdmpgmOTL6M1vWmGSZ/tXdgX+rRgGYHS3BGOWdzUKdOrfMrU1jNgYj0 4c+2FWqX IROStLm4wlcbuzH7sgfBRFT13oDHrfUrJTjo9MJxjY3V5QOP2x5X2w2UH98Bb7rkp9kWa28Iee/+8QseOx2HxrDx1ej1qjyP43wM8vjC2Bx/rfqkiG/LKr7xBgusO7eU2jSmNwQFvbVsFyumdhPpMRio60plqtNx4CCIIwY0WZ0NAPEWyC/kS63NQP9m6OwdlNdyI5fjX01tcc7U+l9TJ94qOs5tYncl3JdRZyMB2JVMCkeyxUFomSpdEPmIrjm0+7D8mwdL6xtRl8gT8NqGvKxdLJ5o2wKENxnTBV8ieZRHH7P8J9C9OFmPyy4lKXnhYWqv9vP+0gs7Ho04WF9m3Ii513vyzjj0SFUUn3fZKCxwqy23ADeaDuHfAgE83TqlE8JF2wXYVLZlKtRg8g3bEc5BDAIBFmR/FJb3d+PRRwFJ3zm4aVGbGIaXfrQ== X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On 2024/11/9 02:04, Jann Horn wrote: > On Fri, Nov 8, 2024 at 8:13 AM Qi Zheng wrote: >> On 2024/11/8 07:35, Jann Horn wrote: >>> On Thu, Oct 31, 2024 at 9:14 AM Qi Zheng wrote: >>>> As a first step, this commit aims to synchronously free the empty PTE >>>> pages in madvise(MADV_DONTNEED) case. We will detect and free empty PTE >>>> pages in zap_pte_range(), and will add zap_details.reclaim_pt to exclude >>>> cases other than madvise(MADV_DONTNEED). >>>> >>>> Once an empty PTE is detected, we first try to hold the pmd lock within >>>> the pte lock. If successful, we clear the pmd entry directly (fast path). >>>> Otherwise, we wait until the pte lock is released, then re-hold the pmd >>>> and pte locks and loop PTRS_PER_PTE times to check pte_none() to re-detect >>>> whether the PTE page is empty and free it (slow path). >>> >>> How does this interact with move_pages_pte()? I am looking at your >>> series applied on top of next-20241106, and it looks to me like >>> move_pages_pte() uses pte_offset_map_rw_nolock() and assumes that the >>> PMD entry can't change. You can clearly see this assumption at the >>> WARN_ON_ONCE(pmd_none(*dst_pmd)). And if we race the wrong way, I >> >> In move_pages_pte(), the following conditions may indeed be triggered: >> >> /* Sanity checks before the operation */ >> if (WARN_ON_ONCE(pmd_none(*dst_pmd)) || WARN_ON_ONCE(pmd_none(*src_pmd)) || >> WARN_ON_ONCE(pmd_trans_huge(*dst_pmd)) || >> WARN_ON_ONCE(pmd_trans_huge(*src_pmd))) { >> err = -EINVAL; >> goto out; >> } >> >> But maybe we can just remove the WARN_ON_ONCE(), because... >> >>> think for example move_present_pte() can end up moving a present PTE >>> into a page table that has already been scheduled for RCU freeing. >> >> ...this situation is impossible to happen. Before performing move >> operation, the pte_same() check will be performed after holding the >> pte lock, which can ensure that the PTE page is stable: >> >> CPU 0 CPU 1 >> >> zap_pte_range >> >> orig_src_pte = ptep_get(src_pte); >> >> pmd_lock >> pte_lock >> check if all PTEs are pte_none() >> --> clear pmd entry >> unlock pte >> unlock pmd >> >> src_pte_lock >> pte_same(orig_src_pte, ptep_get(src_pte)) >> --> return false and will skip the move op > > Yes, that works for the source PTE. But what about the destination? > > Operations on the destination PTE in move_pages_pte() are, when moving > a normal present source PTE pointing to an order-0 page, and assuming > that the optimistic folio_trylock(src_folio) and > anon_vma_trylock_write(src_anon_vma) succeed: > > dst_pte = pte_offset_map_rw_nolock(mm, dst_pmd, dst_addr, > &dummy_pmdval, &dst_ptl) > [check that dst_pte is non-NULL] > some racy WARN_ON_ONCE() checks > spin_lock(dst_ptl); > orig_dst_pte = ptep_get(dst_pte); > spin_unlock(dst_ptl); > [bail if orig_dst_pte isn't none] > double_pt_lock(dst_ptl, src_ptl) > [check pte_same(ptep_get(dst_pte), orig_dst_pte)] > > and then we're past the point of no return. Note that there is a > pte_same() check against orig_dst_pte, but pte_none(orig_dst_pte) is > intentionally pte_none(), so the pte_same() check does not guarantee > that the destination page table is still linked in. OK, now I got what you mean. This is indeed a problem. In this case, it is still necessary to recheck pmd_same() to ensure the stability of dst_pte page. Will fix it. > >>>> diff --git a/mm/memory.c b/mm/memory.c >>>> index 002aa4f454fa0..c4a8c18fbcfd7 100644 >>>> --- a/mm/memory.c >>>> +++ b/mm/memory.c >>>> @@ -1436,7 +1436,7 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma) >>>> static inline bool should_zap_cows(struct zap_details *details) >>>> { >>>> /* By default, zap all pages */ >>>> - if (!details) >>>> + if (!details || details->reclaim_pt) >>>> return true; >>>> >>>> /* Or, we zap COWed pages only if the caller wants to */ >>> >>> This looks hacky - when we have a "details" object, its ->even_cows >>> member is supposed to indicate whether COW pages should be zapped. So >>> please instead set .even_cows=true in madvise_dontneed_single_vma(). >> >> But the details->reclaim_pt should continue to be set, right? Because >> we need to use .reclaim_pt to indicate whether the empty PTE page >> should be reclaimed. > > Yeah, you should set both. OK. Thanks!