From mboxrd@z Thu Jan  1 00:00:00 1970
To: Nadav Amit <nadav.amit@gmail.com>
Cc: Andrew Morton <akpm@linux-foundation.org>, Linux-MM <linux-mm@kvack.org>,
 Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
 Peter Xu <peterx@redhat.com>, Andrea Arcangeli <aarcange@redhat.com>,
 Minchan Kim <minchan@kernel.org>, Colin Cross <ccross@google.com>,
 Suren Baghdasarya <surenb@google.com>,
 Mike Rapoport <rppt@linux.vnet.ibm.com>
References: <20210926161259.238054-1-namit@vmware.com>
 <7ce823c8-cfbf-cc59-9fc7-9aa3a79740c3@redhat.com>
 <6E8A03DD-175F-4A21-BCD7-383D61344521@gmail.com>
 <2753a311-4d5f-8bc5-ce6f-10063e3c6167@redhat.com>
 <AE756194-07D4-4467-92CA-9E986140D85D@gmail.com>
 <f47970f5-faa7-9d5f-f07a-9399e4626eda@redhat.com>
 <9DE833C8-515F-4427-9867-E5BF9AD380FB@gmail.com>
From: David Hildenbrand <david@redhat.com>
Organization: Red Hat
Subject: Re: [RFC PATCH 0/8] mm/madvise: support
 process_madvise(MADV_DONTNEED)
Message-ID: <9b53a85c-83f4-4548-c3b5-c65bd8737670@redhat.com>
Date: Tue, 28 Sep 2021 10:53:05 +0200
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101
 Thunderbird/78.11.0
MIME-Version: 1.0
In-Reply-To: <9DE833C8-515F-4427-9867-E5BF9AD380FB@gmail.com>
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Language: en-US
List-ID: <linux-mm.kvack.org>

>>
>> Again, thanks for the details. I guess this should basically work,
>> although it involves a lot of complexity (read: all flavors of uffd
>> on other processes). And I am not so sure about performance aspects.
>> "Performance is not as bad as you think" doesn't sound like the words
>> you would want to hear from a car dealer ;) So there has to be another
>> big benefit to do such user space swapping.
>
> There is some complexity, indeed. Worse, there are some quirks of UFFD
> that make life hard for no reason and some uffd and iouring bugs.
>
> As for my sales pitch - I agree that I am not the best car dealer… :(

:)

> When I say performance is not bad, I mean that the core operations of
> page-fault handling, prefetch and reclaim do not induce high overhead
> *after* the improvements I sent or mentioned.
>
> The benefit of doing so from userspace is that you have full control
> over the reclaim/prefetch policies, so you may be able to make better
> decisions.
>
> Some workloads have predictable access patterns (see for instance "MAGE:
> Nearly Zero-Cost Virtual Memory for Secure Computation", OSDI'21). You may
> be able to handle such access patterns without requiring intrusive changes
> to the workload.

Thanks for the pointer.

And my question would be if something like DAMON would actually be what
you want.

>
>
>>
>>> I am aware that there are some caveats, as zapping the memory does not
>>> guarantee that the memory would be freed since it might be pinned for a
>>> variety of reasons. That's the reason I mentioned the processes have "some
>>> level of cooperation" with the manager. It is not intended to deal with
>>> adversaries or uncommon corner cases (e.g., processes that use UFFD for
>>> their own reasons).
>>
>> It's not only long-term pinnings. Pages could have been de-duplicated
>> (COW after fork, KSM, shared zeropage). Further, you'll most probably
>> lose any kind of "aging" ("accessed") information on pages, or how
>> would you track that?
>
> I know it’s not just long-term pinnings. That’s what “variety of reasons”
> stood for. ;-)
>
> Aging is a tool for certain types of reclamation policies. Some do not
> require it (e.g., random). You can also have compiler/application-guided
> reclamation policies. If you are really into “aging”, you may be able
> to use PEBS or other CPU facilities to track it.
>
> Anyhow, the access-bit by itself is not such a great solution to track
> aging. Setting it can induce overheads of >500 cycles from my (and
> others') experience.

Well, I'm certainly no expert on that; I would assume it's relevant in
corner cases only: if your application accesses all its memory
permanently, a swap setup is already "broken". If you have plenty of old
memory (VMs, databases, ...) it should work reasonably well. But yeah,
detecting the working set size is a hard problem, and "accessed"
bits can be sub-optimal.

After all, that's what the Linux kernel has been relying on for a long
time ... and IIRC it might be extended by multiple "aging" queues soon.

>
>>
>> Although I can see that this might work, I do wonder if it's a use case
>> worth supporting. As Michal correctly raised, we already have other
>> infrastructure in place to trigger swapin/swapout. I recall that also
>> damon wants to let you write advanced policies for that by monitoring
>> actual access characteristics.
>
> Hints, as those that Michal mentioned, prevent the efficient use of
> userfaultfd. Using MADV_PAGEOUT will not trigger another uffd event
> when the page is brought back from swap. So using
> MADV_PAGEOUT/MADV_WILLNEED does not allow you to have a custom
> prefetch policy, for instance. It would also require you to live
> with the kernel reclamation/IO stack for better and worse.

Would more uffd (or similar) events help?

>
> As for DAMON, I am not very familiar with it, but from what I remember
> it seemed to look in a similar direction. IMHO it is more intrusive
> and less configurable (although it can have the advantage of better
> integration with various kernel mechanisms). I was wondering for a
> second why you give me such a hard time for a pretty straight-forward
> extension for process_madvise(), but then I remembered that DAMON got
> into the kernel after >30 versions, so I’ll shut up about that. ;-)

It took ... quite a long time, indeed :)

>
>>
>>> Putting aside my use-case (which I am sure people would be glad to
>>> criticize), I can imagine debuggers or emulators may also find use for
>>> similar schemes (although I do not have concrete use-cases for them).
>>
>> I'd be curious about use cases for debuggers/emulators. Especially for
>> emulators I'd guess it makes more sense to just do it within the
>> process. And for debuggers, I'm having a hard time why it would make
>> sense to throw away a page instead of just overwriting it with $PATTERN
>> (e.g., 0). But I'm sure people can be creative :)
>
> I have some more vague ideas, but I am afraid that you will keep
> saying that it makes more sense to handle such events from within
> a process. I am not sure that this is true. Even for the emulators
> that we discuss, the emulated program might run in a different
> address space (for sandboxing). You may be able to avoid the need
> for remote-UFFD and get away with the current non-cooperative
> UFFD, but zapping the memory (for atomic updates) would still
> require process_madvise(MADV_DONTNEED) [putting aside various
> ptrace solutions].
>
> Anyhow, David, I really appreciate your feedback. And you make
> strong points about issues I encounter. Yet, eventually, I think
> that the main question in this discussion is whether enabling
> process_madvise(MADV_DONTNEED) is any different - from a security
> point of view - than process_vm_writev(), not to mention ptrace.
> If not, then the same security guards should suffice, I would
> argue.
>

You raise an excellent point (and it should have been part of your
initial sales pitch): how does it differ from process_vm_writev()?

I can say that it differs in that you can break applications in
more extreme ways. Let me give you two examples:

1. long-term pinnings: you raised this yourself; this can break an
application silently and there is barely a safe way your tooling could
handle it.

2. pagemap: applications can depend on the populated (present | swap)
information in the pagemap for correctness. For example, there was
recently a discussion to use pagemap information to speed up live
migration of VMs, by skipping migration of !populated pages. There is
currently no way your tooling can fake that. In comparison, ordinary
swapping in the kernel can handle it.
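
To make the pagemap example concrete, here is a minimal sketch of how a
consumer might derive "populated" from /proc/<pid>/pagemap entries. The bit
positions (63 = present, 62 = swapped) are the ones documented for the
pagemap interface; the helper names are made up for illustration and are
not taken from any existing tool. A page the kernel swapped out still has
the swap bit set, while a page that was zapped and is handled by a
user-space manager would presumably show neither bit.

```python
import mmap
import struct

PM_PRESENT = 1 << 63   # page is present in RAM
PM_SWAP = 1 << 62      # page is swapped out
ENTRY_SIZE = 8         # one 64-bit entry per virtual page


def is_populated(entry):
    """Return True if the entry is present in RAM or in swap."""
    return bool(entry & (PM_PRESENT | PM_SWAP))


def populated_flags(raw):
    """Decode raw pagemap bytes into one populated flag per page."""
    count = len(raw) // ENTRY_SIZE
    # pagemap entries are 64-bit values in native byte order
    entries = struct.unpack("=%dQ" % count, raw[:count * ENTRY_SIZE])
    return [is_populated(e) for e in entries]


def read_pagemap(pid, vaddr, npages):
    """Hypothetical helper: read raw entries for npages starting at vaddr."""
    with open("/proc/%d/pagemap" % pid, "rb") as f:
        f.seek((vaddr // mmap.PAGESIZE) * ENTRY_SIZE)
        return f.read(npages * ENTRY_SIZE)
```

A live-migration tool following the idea above would skip every page whose
flag is False, which is exactly where a zapped-but-still-needed page would
be lost.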

Is it easy to break an application with process_vm_writev()? Yes. When
talking about dynamic debugging, it's expected that you break the target
already -- or the target is already broken. Is it easier to break an
application with process_madvise(MADV_DONTNEED)? I'd say yes, especially
when implementing something way beyond debugging as you describe.
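
For readers less familiar with the semantics being discussed, a minimal
sketch of what MADV_DONTNEED does to private anonymous memory on Linux.
It runs within a single process using plain madvise(); the remote
process_madvise() variant proposed in this series is not shown. The
function name is made up for illustration.

```python
import mmap


def zap_private_page():
    """Write to a private anonymous page, zap it, return (before, after)."""
    # Private anonymous mapping; fileno=-1 means "not backed by a file".
    buf = mmap.mmap(-1, mmap.PAGESIZE,
                    flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS,
                    prot=mmap.PROT_READ | mmap.PROT_WRITE)
    try:
        payload = b"application state"
        buf[:len(payload)] = payload
        before = bytes(buf[:len(payload)])
        # Linux behavior: the page is dropped and the next access
        # faults in a fresh zero-filled page.
        buf.madvise(mmap.MADV_DONTNEED)
        after = bytes(buf[:len(payload)])
        return before, after
    finally:
        buf.close()
```

Unlike an explicit write, nothing in the mapping's owner observes the
change: the data simply reads back as zeros the next time it is touched.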


I'm giving you "a hard time" for the reason Michal raised: we discussed
this in the past already at least two times IIRC and "it is a free
ticket to all sorts of hard to debug problem" in our opinion; especially
when we mess around in other processes' address spaces for purposes
other than debugging.

I'm not the person to ack/nack this, I'm just asking the questions :)

-- 
Thanks,

David / dhildenb