From: David Hildenbrand <david@redhat.com>
Date: Mon, 29 May 2023 11:25:39 +0200
Subject: Re: [PATCH v4] mm, compaction: Skip all non-migratable pages during scan
To: Matthew Wilcox, John Hubbard
Cc: Khalid Aziz, Steven Sistare, akpm@linux-foundation.org,
 ying.huang@intel.com, mgorman@techsingularity.net,
 baolin.wang@linux.alibaba.com, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org, Khalid Aziz
Message-ID: <4d035744-271d-1ca3-a440-f8b1573eec96@redhat.com>
References: <60367660-f4a3-06dc-4d17-4dbdc733ef74@oracle.com>
 <846b770c-9f63-90a2-0435-ec82484e3f74@nvidia.com>
 <9821bd9c-7c30-8f0c-68e4-6b1d312bc032@nvidia.com>
Organization: Red Hat
On 29.05.23 02:31, Matthew Wilcox wrote:
> On Sun, May 28, 2023 at 04:49:52PM -0700, John Hubbard wrote:
>> On 5/26/23 20:18, Matthew Wilcox wrote:
>>> On Fri, May 26, 2023 at 07:11:05PM -0700, John Hubbard wrote:
>>>>> So any user with 1024 processes can fragment physical memory? :/
>>>>>
>>>>> Sorry, I'd like to minimize the usage of folio_maybe_dma_pinned().
>>>>
>>>> I was actually thinking that we should minimize any more cases of
>>>> fragile mapcount and refcount comparison, which then leads to
>>>> Matthew's approach here!
>>>
>>> I was wondering if we shouldn't make folio_maybe_dma_pinned() a little
>>> more accurate. eg:
>>>
>>> 	if (folio_test_large(folio))
>>> 		return atomic_read(&folio->_pincount) > 0;
>>> 	return (unsigned)(folio_ref_count(folio) - folio_mapcount(folio)) >=
>>> 		GUP_PIN_COUNTING_BIAS;
>>
>> I'm trying to figure out what might be wrong with that, but it seems
>> OK. We must have talked about this earlier, but I recall vaguely that
>> there was not a lot of concern about the case of a page being mapped
>> > 1024 times. Because pinned or not, it's likely to be effectively
>> locked into memory due to LRU effects. As mentioned here, too.
>
> That was my point of view, but David convinced me that a hostile process
> can effectively lock its own memory into place.

1) My opinion on this optimization

Before I start going into detail, let me first phrase my opinion so we
are on the same page:

"a tiny fraction of Linux installations makes heavy use of long-term
pinning -- the *single* mechanism that completely *destroys* the whole
purpose of memory compaction -- and the users complain about memory
compaction overhead. So we are talking about optimizing for that by
eventually harming *everybody else*."

Ehm ... I'm all for reasonable optimization, but not at any price.
We don't care about a handful of long-term pinned pages in the system;
this is really about vfio long-term pinning a significant amount of
system RAM, and we only care about shmem here.

*maybe* there is an issue with page migration when we have many page
mappings, but:
(a) it's a separate issue, to be dealt with separately, not buried
    into such checks
(b) it's unclear how many page mappings are "too many"; the magic
    number 1024 is just a random number
(c) it needs much finer control (hostile processes).

2) mapcount vs. pagecount

Now, doing these mapcount vs. pagecount checks is perfectly reasonable
(see mm/ksm.c) as long as we know what we're doing. For example, we
have to make sure that a possible compound page cannot get split
concurrently (e.g., hold a reference). It's been used forever; I
consider it stable.

I completely agree that we should be careful with such mapcount vs.
pagecount checks, and if we can use something better, let's use
something *better*.

3) page_maybe_dma_pinned()

Now, why do I dislike bringing up page_maybe_dma_pinned() [IOW, why is
it not better]? Besides it ignoring FOLL_GET for now (that might be
fixed at some point): I think we agree that order-0 pages are the
problem, because we get guaranteed false positives with many mappings
(not just on speculative page pinnings).

For these order-0 pages, it's perfectly reasonable to check
page_maybe_dma_pinned() *as long as* we know the number of mappings is
very small.

I don't consider anon pages the issue here: we barely get 1024 mappings
(not even with KSM), and it's much harder to exploit because you cannot
simply get new mappings via mmap(), only via fork(). In fact, we could
easily optimize for order-0 anon pages if we needed to: I have a patch
lying around; it just wasn't really worth it for now, because there is
only a single relevant page_maybe_dma_pinned() call in vmscan that
could benefit:

https://github.com/davidhildenbrand/linux/commit/0575860d064694d4e2f307b2c20a880a6a7b59ab

We cannot do the same for pagecache pages, so we would possibly
introduce harm by carelessly checking page_maybe_dma_pinned() on pages
with many mappings.

4) folio_maybe_dma_longterm_pinned() ?

I wondered yesterday whether we'd want something like
folio_maybe_dma_longterm_pinned() here. Essentially using what we
learned about long-term pinning of fs pages:

(1) ZONE_MOVABLE, MIGRATE_CMA -> "return false;"
(2) If !anon, !hugetlb, !shmem -> "return false;"
(3) "return folio_maybe_dma_pinned();"

Yes, the above would easily create false positives for short-term
pinned pages (anon/hugetlb/shmem), but would never create false
positives for any other page (shared library, ...).

We would use it in the following way:

bool skip_folio_in_isolation(struct folio *folio)
{
	/*
	 * Avoid skipping pages that are short-term pinned; the pin
	 * might go away any moment and we'll succeed to migrate.
	 *
	 * We get false positives for short-term pinned anon, shmem and
	 * hugetlb pages for now, but such short-term pins are transient.
	 */
	if (!folio_maybe_dma_longterm_pinned(folio))
		return false;
	/*
	 * order-0 pages with many mappings can easily be confused
	 * for pinned pages and this could be exploited by
	 * malicious user-space to cause fragmentation. This is only
	 * an optimization, so if a page (especially shmem) is mapped
	 * many times, we'll rather try migrating it instead of
	 * accidentally skipping it all the time.
	 */
	return folio_order(folio) != 0 || total_mappings <= 32;
}

Someone long-term pins a shmem page with many mappings? Too bad, we
don't optimize for that and still try migrating it.
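For illustration, the helper itself could look roughly like the sketch
below. This is only a sketch of rules (1)-(3) under my assumptions, not
an existing kernel API: the name is made up, and the zone/CMA tests
would need more care in practice (e.g., hugetlb folios may sit in CMA).

static inline bool folio_maybe_dma_longterm_pinned(struct folio *folio)
{
	/* (1) Long-term pins are not taken on movable/CMA memory. */
	if (folio_is_zone_movable(folio) ||
	    is_migrate_cma_page(&folio->page))
		return false;

	/* (2) Only anon, hugetlb and shmem can be long-term pinned. */
	if (!folio_test_anon(folio) && !folio_test_hugetlb(folio)) {
		struct address_space *mapping = folio_mapping(folio);

		if (!mapping || !shmem_mapping(mapping))
			return false;
	}

	/* (3) Fall back to the (short-term inclusive) pin check. */
	return folio_maybe_dma_pinned(folio);
}

Short-term pins on anon/hugetlb/shmem would still read as "maybe
pinned" here, matching the false positives discussed above.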
BUT, I am still confused whether we want to check here for "any
additional references" -- which is what the mapcount vs. refcount
comparison gives us -- or for folio_maybe_dma_longterm_pinned().

Of course, we could similarly write a variant of
skip_folio_in_isolation():

bool skip_folio_in_isolation(struct folio *folio)
{
	/*
	 * If a page is not pinned, try migrating it. Note that this
	 * does not consider any FOLL_GET used for DMA yet.
	 */
	if (!folio_maybe_dma_pinned(folio))
		return false;
	/*
	 * order-0 pages with many mappings can easily be confused
	 * for pinned pages and this could be exploited by
	 * malicious user-space to cause fragmentation. This is only
	 * an optimization, so if a page is mapped
	 * many times, we'll rather try migrating it instead of
	 * accidentally skipping it all the time.
	 */
	return folio_order(folio) != 0 || total_mappings <= 32;
}

As long as FOLL_GET is still used for DMA, the mapcount vs. pagecount
checks might be better ... but it depends on whether we care about
short-term or long-term pinned pages here.

>> Anyway, sure.
>>
>> A detail:
>>
>> The unsigned cast, I'm not sure that helps or solves anything, right?
>> That is, other than bugs, is it possible to get refcount < mapcount?

BUG IMHO.

>>
>> And if it's only due to bugs, then the casting, again, isn't likely to
>> going to mitigate the fallout from whatever mess the bug caused.
>
> I wasn't thinking too hard about the cast. If the caller has the folio
> lock, I don't think it's possible for refcount < mapcount. This caller
> has a refcount, but doesn't hold the lock, so it is possible for them
> to read mapcount first, then have both mapcount and refcount
> decremented and see refcount < mapcount.
>
> I don't think it matters too much. We don't hold the folio lock, so
> it might transition from pinned to unpinned as much as a refcount
> might be decremented or a mapcount incremented. What's important is
> that a hostile process can't prevent memory from being moved
> indefinitely.
>
> David, have I missed something else?

What I learned from staring at the code in
mm/ksm.c:write_protect_page() for too long a while ago is that:

(1) Mapping a page first increments the refcount, then the mapcount
(2) Unmapping a page first decrements the mapcount, then the refcount

So the refcount is supposed to always be at least as large as the
mapcount, especially if you take a snapshot of both (read the mapcount
first, then the refcount).

A hostile process wouldn't be able to block compaction here forever:
even if we accidentally made the wrong call once when deciding whether
to isolate a page, it would work on the next attempt. That's the
difference to page_maybe_dma_pinned(), which can be made to
consistently block compaction.

[sorry for the lengthy mail]

-- 
Thanks,

David / dhildenb