From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id C9C6FC77B70 for ; Mon, 17 Apr 2023 14:05:32 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 6974A6B0071; Mon, 17 Apr 2023 10:05:32 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 621E86B0072; Mon, 17 Apr 2023 10:05:32 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 4C14F8E0001; Mon, 17 Apr 2023 10:05:32 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0011.hostedemail.com [216.40.44.11]) by kanga.kvack.org (Postfix) with ESMTP id 3977E6B0071 for ; Mon, 17 Apr 2023 10:05:32 -0400 (EDT) Received: from smtpin17.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay04.hostedemail.com (Postfix) with ESMTP id 112991A051E for ; Mon, 17 Apr 2023 14:05:32 +0000 (UTC) X-FDA: 80691055704.17.F638F86 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) by imf19.hostedemail.com (Postfix) with ESMTP id 9C64D1A002C for ; Mon, 17 Apr 2023 14:05:29 +0000 (UTC) Authentication-Results: imf19.hostedemail.com; dkim=pass header.d=redhat.com header.s=mimecast20190719 header.b=FHSzUtrV; spf=pass (imf19.hostedemail.com: domain of david@redhat.com designates 170.10.133.124 as permitted sender) smtp.mailfrom=david@redhat.com; dmarc=pass (policy=none) header.from=redhat.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1681740329; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=BFUcsP3DP036sGfZtDUXKQ3wNaL2wIKVlyoKsP6xiis=; b=5vVVScBYC2qt9w06ZM1IfDTN98dfwyPdAhvWGpmeaLV5LhleIn8VYhPXGmrE4xh8939bQN KFaqPEv69IdP+Rs0Os7mu6pxJKcHURUt3lEVhxykhofYgAmjP04nM1otXMUd13u4OPJ0Zx Pn8XFtHNNWrm+qw81henOYfUBmKXWyU= ARC-Authentication-Results: i=1; imf19.hostedemail.com; dkim=pass header.d=redhat.com header.s=mimecast20190719 header.b=FHSzUtrV; spf=pass (imf19.hostedemail.com: domain of david@redhat.com designates 170.10.133.124 as permitted sender) smtp.mailfrom=david@redhat.com; dmarc=pass (policy=none) header.from=redhat.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1681740329; a=rsa-sha256; cv=none; b=u3WD6uSWVf7wQcBiRmS3KGTlbYUoQOTm5ZSGrMC3+B3VhtyBpoXiiGd0RkAiOGH3L+CFFv gVQSv03PaaWtcgYuuePuf9L9DY20VXtI0lGScvdoJBTqT9yE3hifUkqTrCbr0ds4iGlti7 CCtfmSIvW19J5016KwQZB3lS2AEdUQk= DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1681740328; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=BFUcsP3DP036sGfZtDUXKQ3wNaL2wIKVlyoKsP6xiis=; b=FHSzUtrVBMEZiJG71adgcyBYxpWzspM8ApCrsA8TzenScflDb6brZaICd+5jtwOd7xEJ3g ih0N/VzHNdkXBNLFpla0Kx/H2U+E95t7fh6PDFF1wwIv38iAsQxmpBJODp0gDQKr9zkSds x+ZFIqZev5KzOxuiIi0Zl1JpDtLxgZk= Received: from mail-wm1-f70.google.com (mail-wm1-f70.google.com [209.85.128.70]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-638-1GH6sPSoOkWoNlS4FF44rw-1; Mon, 17 Apr 2023 10:05:27 -0400 X-MC-Unique: 1GH6sPSoOkWoNlS4FF44rw-1 Received: by mail-wm1-f70.google.com with SMTP id n19-20020a05600c501300b003f16fda66d9so635870wmr.1 for ; Mon, 17 Apr 2023 07:05:27 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1681740326; x=1684332326; h=content-transfer-encoding:in-reply-to:subject:organization:from :references:cc:to:content-language:user-agent:mime-version:date :message-id:x-gm-message-state:from:to:cc:subject:date:message-id :reply-to; bh=BFUcsP3DP036sGfZtDUXKQ3wNaL2wIKVlyoKsP6xiis=; b=IYxdIxwh2M4ApFn0RD4p6TwMAR3hJmnNiM/c6lj8QncYZ9dL7P3I0FwgRenckZ/dHK AMW7abg4EwV4tYqx1XcDonPAcBqWO8tdhSgiJ6TEmBJm8bxSiu0Kkgs95ExCrmEJVIkt 6jtE8BzWHDcl+PwuWdAxM16n2dN6RhGguIB3/jn9GHoBGf4iI0nEp/YyxAhE05LVn5JT wxMBxjpugfdC2j+8TR75yYyRgtNv5xayIUpDuOSTtgxl52UGs9TWNyzVkklwVtlHSCM1 vM+1adkvO5qGmgm9DjULrKUngixPZpyPsLVnjJrgymNouwDU29bsi94VUwyjk5PiTnTC XUJg== X-Gm-Message-State: AAQBX9dBvQ4+b5LUOWuHpcf1znRxdPE95yloSpj+19qLny0Mc96GBdxt zdnkjyzGwwTvD9D3JzwLy/SsogJTSsi/LVEqMPJWmOsENZbXp4TarVoYeokAWeJrbWJgPU3wp9y XwGqrXPrOCQA= X-Received: by 2002:a05:600c:231a:b0:3f0:4e04:b8f8 with SMTP id 26-20020a05600c231a00b003f04e04b8f8mr10144972wmo.39.1681740326240; Mon, 17 Apr 2023 07:05:26 -0700 (PDT) X-Google-Smtp-Source: AKy350bBHtcBGlOlsZSeAmTFxxs+CBcC+T5mDQPmhXEyuz12dgtLvxJVzrcz3ZRQkVesA/HeuYk4MA== X-Received: by 2002:a05:600c:231a:b0:3f0:4e04:b8f8 with SMTP id 26-20020a05600c231a00b003f04e04b8f8mr10144949wmo.39.1681740325750; Mon, 17 Apr 2023 07:05:25 -0700 (PDT) Received: from ?IPV6:2003:cb:c700:fc00:db07:68a9:6af5:ecdf? (p200300cbc700fc00db0768a96af5ecdf.dip0.t-ipconnect.de. [2003:cb:c700:fc00:db07:68a9:6af5:ecdf]) by smtp.gmail.com with ESMTPSA id f24-20020a1cc918000000b003ee63fe5203sm12006460wmb.36.2023.04.17.07.05.24 (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Mon, 17 Apr 2023 07:05:25 -0700 (PDT) Message-ID: <568b5b73-f0e9-c385-f628-93e45825fb7b@redhat.com> Date: Mon, 17 Apr 2023 16:05:23 +0200 MIME-Version: 1.0 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Thunderbird/102.9.1 To: Ryan Roberts , Andrew Morton , "Matthew Wilcox (Oracle)" , Yu Zhao , "Yin, Fengwei" Cc: linux-mm@kvack.org, linux-arm-kernel@lists.infradead.org References: <20230414130303.2345383-1-ryan.roberts@arm.com> <13969045-4e47-ae5d-73f4-dad40fe631be@arm.com> From: David Hildenbrand Organization: Red Hat Subject: Re: [RFC v2 PATCH 00/17] variable-order, large folios for anonymous memory In-Reply-To: <13969045-4e47-ae5d-73f4-dad40fe631be@arm.com> X-Mimecast-Spam-Score: 0 X-Mimecast-Originator: redhat.com Content-Language: en-US Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit X-Rspamd-Queue-Id: 9C64D1A002C X-Stat-Signature: qwy11omacfkinyqunkfri9e1t7zhmxsn X-Rspam-User: X-Rspamd-Server: rspam08 X-HE-Tag: 1681740329-153321 X-HE-Meta: U2FsdGVkX1+OkzzbpXX6Unpk1nLPD1FoJlnivyRpGUfApR/QWSSSNUFG6LmQH1br/doyIrLRotgVUblGeEpQK0ttjuJe/HTz1Ct170/gzhAXnV7CIo06xg2iF6yFE9nexhcqBJR5FrXKUj23Ix9Xn+lyijBwwUy74mhBrGmeXgmGxSQ/lPK1PYY4qjRVH9VlY5pksuoWHOfbNpEypTO+2PLUDrRRVNxdVsx7vcK5YbLcG1BXSzb/wBpKXsvPnjMdYPWFKsCwoLLiMtod3kcd9p0pYV9bfKNthyauvYgbDeB8H1LF6IPxCEbyaauDjLp6WQqrDYCyLCYZdvQyLsaXhkUDeSNjETNXeAnsK5V+n2Mo9ZFyCZzUUhfhpBKS05/nBX+QEHB9XLvQjpNnrRHr6/5gvOrITojLzXd495wG8e8Gp+rwaSB4+i0+MYUUm6n9UQaGvtoLAs6BaEZDmhEsvaqOIXcCSbqArdhJWy5Vo1Rnk/gpAUombYNJsEa+ZnyaliGuvjr9LCyrc4MDi8g21mCxEfSXA2iTPRfqUf9DZ9RM9h2LAA/FAbXzumxUG+mS7s/ATXc58HomeTASZd6NysGjRL+lOIRrD4JkXxmjemMlsMUY2F0SThStctfzxJ/rHtyf1hVNk/nq6DsL6ENsoAUJRCBQssj0rV7F5kUxxzH1AvqTbkAWsPHA2HrbR0LxrcPn1v4BSFNyorXHTo7g+OGAyu3X2GTBhiehBTgU8v7MbeT6Fbc32d3suUq/gM/WLrHTmBq8SEIb+Gj7078h2a6HovkR+2yKooRgggYgphfdyxkdmaxG1Y6KaByHQNUght9YpePHffpbv+V5+QwjMbeNOih31p1VzvOLBkXkeIQKquOv3/nYBsIUPGc2mfCcCmxkEgf3TjykEOIzzXNF7z2gJvZuoszfZ2o8yLH7GcEg85JxVHatZVKxQ8mlPUdi9kTxXW/WdzbBJMpgTvq W56PExau 3nql64MuwXRqXb53TC1QH9sw1E8isxaot189yZinEK6qOGQOlYLv5s+jdEVf9SRbdBNnGhWUxmL6XPK1NmAJFxhlVAzG+I7xdktH8hHnAmhO7T4GVrIE2xk2Zqvf1OI1PxL0z3L8dWQwbXPqTY7NxUBIYr/KPhUdXhyYTa74uOLxbtVGvAWuhfQgCyswYeBqUbda4n390IO3vsRxECzRveF0NKhoWNXRaNQ5eAqlWSLQUY74pBP3FZB47jkHDB/0HQk2h3bg1pW/GYvuusy9dRJl3wO2h1T2KVVuhrmZiOLVooZQtbtXXmynkUj1ceGHSdUKAs4hwE1BysNdPHBVNQjoe2uaNxIKO4I65GfhF+ANQMMUs3sbVSEPAZfcQH/sMKljJ6hQyFwrlx8s0TWrcOwgZLg== X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: [...] >> Just a note (that you maybe already know) that we have to be a bit careful in >> the wp_copy path with replacing sub-pages that are marked exclusive. > > Ahh, no I wasn't aware of this - thanks for taking the time to explain it. I > think I have a bug. > > (I'm guessing the GUP fast path assumes that if it sees an exclusive page then > that page can't go away? And if I then wp_copy it, I'm breaking the assumption? > But surely user space can munmap it at any time and the consequences are > similar? It's probably clear that I don't know much about the GUP implementation > details...) If GUP finds a read-only PTE pointing at an exclusive subpage, it assumes that this page cannot randomly be replaced by core MM due to COW. See gup_must_unshare(). So it can go ahead and pin the page. As long as user space doesn't do something stupid with the mapping (MADV_DONTNEED, munmap()) the pinned page must correspond to the mapped page. If GUP finds a writeable PTE, it assumes that this page cannot randomly be replaced by core MM due to COW -- because writable implies exclusive. See, for example the VM_BUG_ON_PAGE() in follow_page_pte(). So, similarly, GUP can simply go ahead and pin the page. GUP-fast runs lockless, not even taking the PT locks. It syncs against concurrent fork() using a special seqlock, and essentially unpins whatever it temporarily pinned when it detects that fork() was running concurrently. But it might result in some pages temporarily being flagged as "maybe pinned". In other cases (!fork()), GUP-fast synchronizes against concurrent sharing (KSM) or unmapping (migration, swapout) that implies clearing of the PG_anon_flag of the subpage by first unmapping the PTE and conditionally remapping it. See mm/ksm.c:write_protect_page() as an example for the sharing side (especially: if page_try_share_anon_rmap() fails because the page may be pinned). Long story short: replacing a r-o "maybe shared" (!exclusive) PTE is easy. Replacing an exclusive PTE (including writable PTEs) requires some work to sync with GUP-fast and goes rather in the "maybe just don't bother" terriroty. > > My current patch always prefers reuse over copy, and expands the size of the > reuse to the biggest set of pages that are all exclusive (determined either by > the presence of the anon_exclusive flag or from the refcount), and covered by > the same folio (and a few other bounds constraints - see > calc_anon_folio_range_reuse()). > > If I determine I must copy (because the "anchor" page is not exclusive), then I > determine the size of the copy region based on a few constraints (see > calc_anon_folio_order_copy()). But I think you are saying that no pages in that > region are allowed to have the anon_exclusive flag set? In which case, this > could be fixed by adding that check in the function. Yes, changing a PTE that points at an anonymous subpage that has the "exclusive" flag set requires more care. > >> >> Currently, we always only replace a single shared anon (sub)page by a fresh >> exclusive base-page during a write-fault/unsharing. As the sub-page is already >> marked "maybe shared", it cannot get pinned concurrently and everybody is happy. > > When you say, "maybe shared" is that determined by the absence of the > "exclusive" flag? Yes. Semantics of PG_anon_exclusive are "exclusive" vs. "maybe shared". Once "maybe shared", we must only go back to "exclusive (set the flag) if we are sure that there are no other references to the page. > >> >> If you now decide to replace more subpages, you have to be careful that none of >> them are still exclusive -- because they could get pinned concurrently and >> replacing them would result in memory corruptions. >> >> There are scenarios (most prominently: MADV_WIPEONFORK), but also failed partial >> fork() that could result in something like that. > > Are there any test cases that stress the kernel in this area that I could use to > validate my fix? tools/testing/selftests/mm/cow.c does excessive tests (including some MADV_DONTFORK -- that's what I actually meant -- and partial mremap tests), but mostly focuses on ordinary base pages (order-0), THP, and hugetlb. We don't have any "GUP-fast racing with fork()" tests or similar yet (tests that rely on races are not a good candidate for selftests). We might want to extend tools/testing/selftests/mm/cow.c to test for some of the cases you extend. We may also change the detection of THP (I think, by luck, it would currently also test your patches to some degree set the way it tests for THP) if (!pagemap_is_populated(pagemap_fd, mem + pagesize)) { ksft_test_result_skip("Did not get a THP populated\n"); goto munmap; } Would have to be, for example, if (!pagemap_is_populated(pagemap_fd, mem + thpsize - pagesize)) { ksft_test_result_skip("Did not get a THP populated\n"); goto munmap; } Because we touch the first PTE in a PMD and want to test if core-mm gave us a full THP (last PTE also populated). Extending the tests to cover other anon THP sizes could work by aligning a VMA to THP/2 size (so we're sure we don't get a full THP), and then testing if we get more PTEs populated -> your code active. > >> >> Further, we have to be a bit careful regarding replacing ranges that are backed >> by different anon pages (for example, due to fork() deciding to copy some >> sub-pages of a PTE-mapped folio instead of sharing all sub-pages). > > I don't understand this statement; do you mean "different anon _folios_"? I am > scanning the page table to expand the region that I reuse/copy and as part of > that scan, make sure that I only cover a single folio. So I think I conform here > - the scan would give up once it gets to the hole. During fork(), what could happen (temporary detection of pinned page resulting in a copy) is something weird like: PTE 0: subpage0 of anon page #1 (maybe shared) PTE 1: subpage1 of anon page #1 (maybe shared PTE 2: anon page #2 (exclusive) PTE 3: subpage2 of anon page #1 (maybe shared Of course, any combination of above. Further, with mremap() we might get completely crazy layouts, randomly mapping sub-pages of anon pages, mixed with other sub-pages or base-page folios. Maybe it's all handled already by your code, just pointing out which kind of mess we might get :) > >> >> >> So what should be safe is replacing all sub-pages of a folio that are marked >> "maybe shared" by a new folio under PT lock. However, I wonder if it's really >> worth the complexity. For THP we were happy so far to *not* optimize this, >> implying that maybe we shouldn't worry about optimizing the fork() case for now >> that heavily. > > I don't have the exact numbers to hand, but I'm pretty sure I remember enabling > large copies was contributing a measurable amount to the performance > improvement. (Certainly, the zero-page copy case, is definitely a big > contributer). I don't have access to the HW at the moment but can rerun later > with and without to double check. In which test exactly? Some micro-benchmark? > >> >> >> One optimization once could think of instead (that I raised previously in other >> context) is the detection of exclusivity after fork()+exit in the child (IOW, >> only the parent continues to exist). Once PG_anon_exclusive was cleared for all >> sub-pages of the THP-mapped folio during fork(), we'd always decide to copy >> instead of reuse (because page_count() > 1, as the folio is PTE mapped). >> Scanning the surrounding page table if it makes sense (e.g., page_count() <= >> folio_nr_pages()), to test if all page references are from the current process >> would allow for reusing the folio (setting PG_anon_exclusive) for the sub-pages. >> The smaller the folio order, the cheaper this "scan surrounding PTEs" scan is. >> For THP, which are usually PMD-mapped even after fork()+exit, we didn't add this >> optimization. > > Yes, I have already implemented this in my series; see patch 10. Oh, good! That's the most important part. -- Thanks, David / dhildenb