Date: Fri, 31 Jan 2025 18:13:19 +0100
From: Simona Vetter
To: David Hildenbrand
Cc: linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org,
    dri-devel@lists.freedesktop.org, linux-mm@kvack.org,
    nouveau@lists.freedesktop.org, Andrew Morton, Jérôme Glisse,
    Jonathan Corbet, Alex Shi, Yanteng Si, Karol Herbst, Lyude Paul,
    Danilo Krummrich, David Airlie, Simona Vetter, "Liam R. Howlett",
    Lorenzo Stoakes, Vlastimil Babka, Jann Horn, Pasha Tatashin, Peter Xu,
    Alistair Popple, Jason Gunthorpe
Subject: Re: [PATCH v1 12/12] mm/rmap: keep mapcount untouched for
    device-exclusive entries
References: <20250129115411.2077152-1-david@redhat.com>
    <20250129115411.2077152-13-david@redhat.com>
    <887df26d-b8bb-48df-af2f-21b220ef22e6@redhat.com>

On Thu, Jan 30, 2025 at 04:43:08PM +0100, David Hildenbrand wrote:
> > > Assume you have a THP (or any mTHP today). You can easily trigger the
> > > scenario that folio_mapcount() != 0 with active device-exclusive entries,
> > > and you start doing rmap walks and stumble over these device-exclusive
> > > entries and *not* handle them properly. Note that more and more systems are
> > > configured to just give you THP unless you explicitly opted-out using
> > > MADV_NOHUGEPAGE early.
> > >
> > > Note that b756a3b5e7ea added that hunk that still walks these
> > > device-exclusive entries in rmap code, but didn't actually update the rmap
> > > walkers:
> > >
> > > @@ -102,7 +104,8 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw)
> > >
> > >                 /* Handle un-addressable ZONE_DEVICE memory */
> > >                 entry = pte_to_swp_entry(*pvmw->pte);
> > > -               if (!is_device_private_entry(entry))
> > > +               if (!is_device_private_entry(entry) &&
> > > +                   !is_device_exclusive_entry(entry))
> > >                         return false;
> > >
> > >                 pfn = swp_offset(entry);
> > >
> > > That was the right thing to do, because they resemble PROT_NONE entries (and
> > > not migration entries or anything else that doesn't hold a folio reference).
> >
> > Yeah I got that part. What I meant is that doubling down on this needs a
> > full audit and cannot rely on "we already have device private entries
> > going through these paths for much longer", which was the impression I
> > got. I guess it worked, thanks for doing that below :-)
>
> I know I know, I shouldn't have touched it ... :)
>
> So yeah, I'll spend some extra work on sorting out the other cases.

Thanks :-)

> > And at least from my very rough understanding of mm, at least around all
> > this gpu stuff, tracking device exclusive mappings like real cpu mappings
> > makes sense, they do indeed act like PROT_NONE with some magic to restore
> > access on fault.
> >
> > I do wonder a bit though what else is all not properly tracked because
> > they should be like prot_none except aren't. I guess we'll find those as we
> > hit them :-/
>
> Likely a lot of stuff. But more in a "entry gets ignored -- functionality
> not implemented, move along" way, because all page table walkers have to
> care about !pte_present() already; it's just RMAP code that so far never
> required it.

I think it'd be good to include a terse summary of this in the commit
messages - I'd expect this is code other gpu folks will need to crawl
through in the future again, and I had no idea where I should even start
looking to figure this out.

> > [...]
> >
> > > > If thp constantly reassembles a pmd entry because hey all the
> > > > memory is contig and userspace allocated a chunk of memory to place
> > > > atomics that alternate between cpu and gpu nicely separated by 4k pages,
> > > > then we'll thrash around invalidating ptes to no end. So might be more
> > > > fallout here.
> > >
> > > khugepaged will back off once it sees an exclusive entry, so collapsing
> > > could only happen once everything is non-exclusive. See
> > > __collapse_huge_page_isolate() as an example.
> >
> > Ah ok. I think it might be good to add that to the commit message, so that
> > people who don't understand mm deeply (like me) aren't worried when they
> > stumble over this change in the future again when digging around.
>
> Will do, thanks for raising that concern!
> > >
> > > It's really only page_vma_mapped_walk() callers that are affected by this
> > > change, not any other page table walkers.
> >
> > I guess my mm understanding is just not up to that, but I couldn't figure
> > out why just looking at page_vma_mapped_walk() only is good enough?
>
> See above: these never had to handle !pte_present() before -- in contrast
> to the other page table walkers.
>
> So nothing bad happens when these page table walkers traverse these PTEs,
> it's just that the functionality will usually not be implemented.
>
> Take MADV_PAGEOUT as an example: madvise_cold_or_pageout_pte_range() will
> simply skip "!pte_present()", because it wouldn't know what to do in that
> case.
>
> Of course, there could be page table walkers that check all cases and bail
> out if they find something unexpected: do_swap_page() cannot make forward
> progress and will inject a VM_FAULT_SIGBUS if it doesn't recognize the
> entry.
> But these are rather rare.

Yeah this all makes sense to me now. Thanks a lot for your explanation,
I'll try to pay it back by trying to review the next version of the series
a bit.

> We could enlighten selected page table walkers to handle device-exclusive
> where it really makes sense later.

I think rmap for eviction/migration is really the big one that obviously
should be fixed. All the other cases I could think of I think just end up
in handle_mm_fault() to sort out the situation and then retry.

Cheers, Sima
-- 
Simona Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
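
For anyone digging through this thread later: the distinction discussed above
(generic page table walkers skip anything that is not present, while the rmap
side has to treat device-exclusive entries as real mappings) can be sketched
as a userspace toy model. This is only an illustration under loud assumptions:
every name below (toy_pte, toy_generic_walk(), toy_rmap_check(),
toy_rmap_walk()) is made up for this sketch and none of it is kernel code or
a kernel API. It also simplifies heavily, e.g. migration entries are just
reported as "no match".

```c
/* Userspace toy model only; none of these names are kernel APIs. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical PTE kinds, loosely mirroring the cases in the thread. */
enum toy_pte_kind {
	TOY_NONE,             /* nothing mapped */
	TOY_PRESENT,          /* ordinary present mapping */
	TOY_MIGRATION,        /* non-present, holds no folio reference */
	TOY_DEVICE_PRIVATE,   /* non-present, un-addressable device memory */
	TOY_DEVICE_EXCLUSIVE, /* non-present, PROT_NONE-like, restored on fault */
};

struct toy_pte {
	enum toy_pte_kind kind;
	unsigned long pfn;
};

/*
 * Generic walker in the spirit of the MADV_PAGEOUT example: anything that
 * is not present is skipped, so a device-exclusive entry is just ignored.
 */
static size_t toy_generic_walk(const struct toy_pte *ptes, size_t n)
{
	size_t handled = 0;

	assert(ptes != NULL);
	for (size_t i = 0; i < n; i++) {
		if (ptes[i].kind != TOY_PRESENT)
			continue; /* "functionality not implemented, move along" */
		handled++;
	}
	return handled;
}

/*
 * rmap-style check, loosely modelled on the quoted check_pte() hunk:
 * device-private and device-exclusive entries both count as mappings
 * of the pfn, even though they are not present.
 */
static bool toy_rmap_check(const struct toy_pte *pte, unsigned long pfn)
{
	switch (pte->kind) {
	case TOY_PRESENT:
	case TOY_DEVICE_PRIVATE:
	case TOY_DEVICE_EXCLUSIVE:
		return pte->pfn == pfn;
	default:
		return false;
	}
}

/* Count how many entries an rmap-style walk would have to handle. */
static size_t toy_rmap_walk(const struct toy_pte *ptes, size_t n,
			    unsigned long pfn)
{
	size_t found = 0;

	assert(ptes != NULL);
	for (size_t i = 0; i < n; i++)
		if (toy_rmap_check(&ptes[i], pfn))
			found++;
	return found;
}
```

With a table holding one present and one device-exclusive entry for the same
pfn, toy_generic_walk() handles only the present one, while toy_rmap_walk()
reports both. That is the gap the rmap walkers need to cope with.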