From: David Hildenbrand
Organization: Red Hat
To: "Yin, Fengwei", Yu Zhao, Zi Yan
Cc: Minchan Kim, linux-mm@kvack.org, linux-kernel@vger.kernel.org, akpm@linux-foundation.org, willy@infradead.org, ryan.roberts@arm.com, shy828301@gmail.com, "Vishal Moola (Oracle)"
Subject: Re: [RFC PATCH] madvise: make madvise_cold_or_pageout_pte_range() support large folio
Date: Fri, 14 Jul 2023 11:25:04 +0200
Message-ID: <36cfe140-5685-bea7-d267-4a61f21aeb79@redhat.com>
In-Reply-To: <2cbf457e-389e-cd45-1262-879513a4cf41@intel.com>
References: <20230713150558.200545-1-fengwei.yin@intel.com> <8547495c-9051-faab-a47d-1962f2e0b1da@intel.com> <2cbf457e-389e-cd45-1262-879513a4cf41@intel.com>

On 14.07.23 10:34, Yin, Fengwei wrote:

[...]

>>> Sounds good?
>>
>> Adding to the discussion, currently the COW selftest always skips a
>> PTE-mapped THP.
> You always have a very good summary of the situation. Thanks a lot for
> adding the following information.
>
> Add Zi Yan, as this is still about the mapcount of the folio.
>

Thanks, I thought he would have already been CCed!

>>
>> For example:
>>
>> # [INFO] Anonymous memory tests in private mappings
>> # [RUN] Basic COW after fork() ... with base page
>> ok 1 No leak from parent into child
>> # [RUN] Basic COW after fork() ... with swapped out base page
>> ok 2 No leak from parent into child
>> # [RUN] Basic COW after fork() ... with THP
>> ok 3 No leak from parent into child
>> # [RUN] Basic COW after fork() ... with swapped-out THP
>> ok 4 No leak from parent into child
>> # [RUN] Basic COW after fork() ... with PTE-mapped THP
>> ok 5 No leak from parent into child
>> # [RUN] Basic COW after fork() ... with swapped-out, PTE-mapped THP
>> ok 6 # SKIP MADV_PAGEOUT did not work, is swap enabled?
>> ...
>>
>> The commit that introduced that change is:
>>
>> commit 07e8c82b5eff8ef34b74210eacb8d9c4a2886b82
>> Author: Vishal Moola (Oracle)
>> Date:   Wed Dec 21 10:08:46 2022 -0800
>>
>>     madvise: convert madvise_cold_or_pageout_pte_range() to use folios
>>
>>     This change removes a number of calls to compound_head(), and saves
>>     1729 bytes of kernel text.
>>
>> folio_mapcount(folio) is wrong, because it never works on a PTE-mapped
>> THP (well, unless only a single subpage is still mapped ...).
>>
>> page_mapcount(folio) was wrong, because it ignored all other subpages,
>> but at least it worked in some cases.
>>
>> folio_estimated_sharers(folio) is similarly wrong, like page_mapcount(),
>> as it's essentially a page_mapcount() of the first subpage.
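
To make that concrete, here is a tiny userspace toy model (made-up names,
not the kernel implementation) of such a first-subpage check on a
PTE-mapped THP: once the parent and the child each only map a different
part of the folio, the first subpage has a mapcount of 1, so a
folio_estimated_sharers()-style check reports a single sharer although
two processes still map the folio:

#include <stdio.h>

#define NR_SUBPAGES 512 /* a 2 MiB THP with 4 KiB base pages */

struct toy_folio {
        int subpage_mapcount[NR_SUBPAGES];
};

/* Mimics the idea of folio_estimated_sharers(): look at subpage 0 only. */
static int toy_estimated_sharers(const struct toy_folio *folio)
{
        return folio->subpage_mapcount[0];
}

int main(void)
{
        struct toy_folio folio = { { 0 } };
        int i;

        /*
         * Parent maps the first half, the child maps the second half:
         * every subpage has exactly one mapping, but from two
         * different processes.
         */
        for (i = 0; i < NR_SUBPAGES; i++)
                folio.subpage_mapcount[i] = 1;

        /* Prints 1, i.e. "not shared", despite two processes mapping it. */
        printf("estimated sharers: %d\n", toy_estimated_sharers(&folio));
        return 0;
}

In this toy setup a folio_mapcount()-style sum would report 512, which
also cannot distinguish "one process maps everything" from "two processes
map half each".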
>>
>> (ignoring that a lockless mapcount-based check is always kind-of
>> unreliable, but that's mostly acceptable for these kinds of things)
>>
>> So, unfortunately, page_mapcount() / folio_estimated_sharers() is the
>> best we can do for now, but they miss detecting some cases of sharing
>> of the folio -- false negatives when detecting sharing.
>>
>> Ideally we want something like folio_maybe_mapped_shared() and to get
>> rid of folio_estimated_sharers(): rather than trying to guess the exact
>> number, simply work towards an answer that tells us "yep, this may be
>> mapped by multiple sharers" vs. "no, this is definitely not mapped by
>> multiple sharers".
>>
> So you want an accurate number. My understanding is that it's required
> for the COW case.

For COW we have to take a look at the mapcount vs. the refcount.

With an order-0 page that's straightforward:

(a) Has that page never been shared (PageAnonExclusive)? Then I am the
    exclusive owner and can reuse.

(b) I am mapping the page and it cannot get unmapped concurrently due to
    the PT lock (which implies mapcount > 0, refcount > 0). Is the
    reference I am holding in fact the only reference to the page
    (refcount == 1, implying mapcount == 1)? Then I am the exclusive
    owner and can reuse.

Note that we don't have to perform any mapcount checks, because it's
implied by our page table mapping and the refcount. A toy sketch of that
order-0 decision is further down.

What I want to achieve is the same for PTE-mapped THP, without scanning
page tables to detect if I am holding all references to the folio. That
is:

(1) total_mapcount() == refcount AND
(2) I am responsible for all these mappings AND
(3) subpages cannot get unmapped / shared concurrently

To make that work reliably, we might need some synchronization,
especially when multiple page tables are involved.

I previously raised tracking the "creator" of the anon page. I think we
can do better, but still have to prototype it.

[...]

>>
>> While it's better than what we have right now:
>>
>> (a) It's racy. Well, it's always been racy with concurrent (un)mapping
>>     and splitting. But maybe we can do better.
>>
>> (b) folio_total_mapcount() is currently expensive.
>>
>> (c) There are still false negatives even without races.
>>
>> For anon pages, we could scan all subpages and test if they are
>> PageAnonExclusive, but that's also not really what we want to do here.
> I was wondering whether we could identify the cases as:
> - A rough estimated mapcount is enough. In this case, we can use the
>   current folio_estimated_sharers() for now. For the long term, I am
>   with Zi Yan's proposal: maintain a total_mapcount and just use
>   total_mapcount > folio_nr_pages() as the estimate.
>
>   The madvise/migration cases are identified as this type.
>
> - Some level of accuracy is needed. Use the estimated mapcount to filter
>   out the obviously shared case first, as the estimated mapcount is
>   correct for the shared case. Then use some heavier operations (check
>   whether the pages of an anon folio are PageAnonExclusive, or use pvmw)
>   to get a more accurate number.
>
>   COW is identified as this type.

I want to tackle both (at least for anon pages) using the same mechanism,
especially to cover the case "total_mapcount <= folio_nr_pages()".

For total_mapcount > folio_nr_pages(), it's easy.

In any case, we want an atomic total mapcount, I think.

>
>>
>> I have some idea to handle anon pages better to avoid any page table
>> walk or subpage scan, improving (a), (b) and (c). It might work for
>> pagecache pages with some more work, but it's a bit more complicated
>> with the scheme I have in mind.
>>
> Great.
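
Coming back to the order-0 reuse decision above, a minimal sketch as a
userspace toy model (made-up names and a simplified page structure, not
the actual kernel COW code), assuming the PT lock is held so our own
mapping cannot go away concurrently:

#include <stdbool.h>

/* Simplified stand-in for struct page; names are made up. */
struct toy_page {
        int refcount;
        bool anon_exclusive;    /* models PageAnonExclusive */
};

/*
 * Order-0 COW reuse decision as described above; the PT lock guarantees
 * our own mapping (mapcount >= 1, refcount >= 1) stays put.
 */
static bool toy_can_reuse_order0(const struct toy_page *page)
{
        /* (a) The page was never shared: exclusive owner, reuse. */
        if (page->anon_exclusive)
                return true;

        /*
         * (b) Our reference is the only one left (refcount == 1 implies
         * mapcount == 1 here): nobody else can see the page, reuse.
         */
        return page->refcount == 1;
}

/*
 * The folio-wide analogue wanted for a PTE-mapped THP would be:
 * total_mapcount() == refcount, all of these mappings are ours, and none
 * of them can get unmapped/shared concurrently.
 */
int main(void)
{
        struct toy_page page = { .refcount = 1, .anon_exclusive = false };

        return toy_can_reuse_order0(&page) ? 0 : 1;
}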
>
>> First step would be replacing folio->_nr_pages_mapped by
>> folio->_total_mapcount. While we could eventually have
>> folio->_total_mapcount in addition to folio->_nr_pages_mapped, I'm not
>> sure if we want to go down that path.
>>
> I saw Zi Yan shared the same proposal.
>
>>
>> That would make folio_total_mapcount() extremely fast (I'm working on a
>> prototype). The downsides are that
>>
>> (a) We have to make NR_ANON_MAPPED/NR_FILE_MAPPED accounting less
>>     precise. Easiest way to handle it: as soon as a single subpage is
>>     mapped, account the whole folio as mapped. After all, it's
>>     consuming memory, so who cares?
>>
>> (b) We have to find a different way to decide when to put an anonymous
>>     folio on the deferred split queue in page_remove_rmap(). Some cases
>>     are nasty to handle: PTE-mapped THP that are shared between a
>>     parent and a child.
>>
> It's nasty because it's partially mapped in the parent and partially
> mapped in the child?

Thanks. I thought about this a lot already, but let me dedicate some time
here to write it down.

There are two scenarios to consider: do we still want to use the subpage
mapcount or not?

When still considering the subpage mapcount, it gets easier:

(1) We're unmapping a single subpage, the compound_mapcount == 0 and the
    total_mapcount > 0. If the subpage mapcount is now 0, add the folio
    to the deferred split queue.

(2) We're unmapping a complete folio (PMD mapping / compound), the
    compound_mapcount is 0 and the total_mapcount > 0.

    (a) If the total mapcount < folio_nr_pages(), add it to the deferred
        split queue.

    (b) If the total mapcount >= folio_nr_pages(), we have to scan all
        subpage mapcounts. If any subpage mapcount == 0, add it to the
        deferred split queue.

(2)(b) is a bit nasty. It would happen when we fork() with a PMD-mapped
THP, the parent splits the THP due to COW, and then our child unmaps or
splits the PMD-mapped THP (the unmap easily happening during exec()).
Fortunately, we'd only scan once, when unmapping the PMD.

Getting rid of the subpage mapcount usage in (1) would mean that we have
to do exactly what we do in (2). But then we'd need to handle (2)(b)
differently as well.

So, for (2)(b) we would either need some other heuristic, or we add the
folio to the deferred split queue more frequently and let the split
itself detect, using an rmap walk, "well, every subpage is still mapped,
let's abort the split". A toy sketch of this decision is appended below.

-- 
Cheers,

David / dhildenb
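
As an addendum, a toy userspace model of the deferred-split decision in
scenarios (1) and (2) above (made-up names and structure, not
page_remove_rmap() or the real mapcount handling):

#include <stdbool.h>

#define NR_SUBPAGES 512

struct toy_folio {
        int compound_mapcount;
        int total_mapcount;
        int subpage_mapcount[NR_SUBPAGES];
};

/* Scenario (1): a single subpage was just unmapped. */
static bool queue_after_pte_unmap(const struct toy_folio *f, int subpage)
{
        return f->compound_mapcount == 0 &&
               f->total_mapcount > 0 &&
               f->subpage_mapcount[subpage] == 0;
}

/* Scenario (2): the PMD (compound) mapping was just removed. */
static bool queue_after_pmd_unmap(const struct toy_folio *f)
{
        int i;

        if (f->compound_mapcount != 0 || f->total_mapcount == 0)
                return false;

        /* (2)(a) Fewer mappings than subpages: something is unmapped. */
        if (f->total_mapcount < NR_SUBPAGES)
                return true;

        /* (2)(b) Otherwise scan for a subpage that is no longer mapped. */
        for (i = 0; i < NR_SUBPAGES; i++)
                if (f->subpage_mapcount[i] == 0)
                        return true;
        return false;
}

int main(void)
{
        struct toy_folio folio = { .total_mapcount = 1 };

        /* Only subpage 0 is still mapped by somebody. */
        folio.subpage_mapcount[0] = 1;

        /* Both checks queue the folio for a deferred split here. */
        return (queue_after_pte_unmap(&folio, 5) &&
                queue_after_pmd_unmap(&folio)) ? 0 : 1;
}

The scan in (2)(b) is exactly the part that would need a different
heuristic, or the rmap walk at split time, once the subpage mapcounts are
gone.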