From: Nadav Amit
Message-Id: <595A6581-86CF-4372-98AF-532DF65186C6@gmail.com>
Subject: Re: [RFC PATCH 0/8] mm/madvise: support process_madvise(MADV_DONTNEED)
Date: Wed, 13 Oct 2021 08:47:11 -0700
To: Peter Xu
Cc: Michal Hocko, David Hildenbrand, Andrew Morton, Linux-MM,
 Linux Kernel Mailing List, Andrea Arcangeli, Minchan Kim, Colin Cross,
 Suren Baghdasarya, Mike Rapoport
References: <20210926161259.238054-1-namit@vmware.com>
 <7ce823c8-cfbf-cc59-9fc7-9aa3a79740c3@redhat.com>
 <6E8A03DD-175F-4A21-BCD7-383D61344521@gmail.com>
 <2753a311-4d5f-8bc5-ce6f-10063e3c6167@redhat.com>
 <0FC3F99A-9F77-484A-899B-EDCBEFBFAC5D@gmail.com>

> On Oct 12, 2021, at 4:14 PM, Peter Xu wrote:
>
> On Wed, Sep 29, 2021 at 11:31:25AM -0700, Nadav Amit wrote:
>>
>>
>>> On Sep 29, 2021, at 12:52 AM,
>>> Michal Hocko wrote:
>>>
>>> On Mon 27-09-21 12:12:46, Nadav Amit wrote:
>>>>
>>>>> On Sep 27, 2021, at 5:16 AM, Michal Hocko wrote:
>>>>>
>>>>> On Mon 27-09-21 05:00:11, Nadav Amit wrote:
>>>>> [...]
>>>>>> The manager is notified on memory regions that it should monitor
>>>>>> (through PTRACE/LD_PRELOAD/explicit-API). It then monitors these regions
>>>>>> using the remote-userfaultfd that you saw on the second thread. When it wants
>>>>>> to reclaim (anonymous) memory, it:
>>>>>>
>>>>>> 1. Uses UFFD-WP to protect that memory (and for this matter I got a vectored
>>>>>>    UFFD-WP to do so efficiently, a patch which I did not send yet).
>>>>>> 2. Calls process_vm_readv() to read that memory of that process.
>>>>>> 3. Write it back to “swap”.
>>>>>> 4. Calls process_madvise(MADV_DONTNEED) to zap it.
>>>>>
>>>>> Why cannot you use MADV_PAGEOUT/MADV_COLD for this usecase?
>>>>
>>>> Providing hints to the kernel takes you so far to a certain extent.
>>>> The kernel does not want to (for a good reason) to be completely
>>>> configurable when it comes to reclaim and prefetch policies. Doing
>>>> so from userspace allows you to be fully configurable.
>>>
>>> I am sorry but I do not follow. Your scenario is describing a user
>>> space driven reclaim. Something that MADV_{COLD,PAGEOUT} have been
>>> designed for. What are you missing in the existing functionality?
>>
>> Using MADV_COLD/MADV_PAGEOUT does not allow userspace to control
>> many aspects of paging out memory:
>>
>> 1. Writeback: writeback ahead of time, dynamic clustering, etc.
>> 2. Batching (regardless, MADV_PAGEOUT does pretty bad batching job
>>    on non-contiguous memory).
>> 3. No guarantee the page is actually reclaimed (e.g., writeback)
>>    and the time it takes place.
>> 4. I/O stack for swapping - you must use kernel I/O stack (FUSE
>>    as non-performant as it is cannot be used for swap AFAIK).
>> 5. Other operations (e.g., locking, working set tracking) that
>>    might not be necessary or interfere.
>>
>> In addition, the use of MADV_COLD/MADV_PAGEOUT prevents the use
>> of userfaultfd to trap page-faults and react accordingly, so you
>> are also prevented from:
>>
>> 6. Having your own custom prefetching policy in response to #PF.
>>
>> There are additional use-cases I can try to formalize in which
>> MADV_COLD/MADV_PAGEOUT is insufficient. But the main difference
>> is pretty clear, I think: one is a hint that only applied to
>> page reclamation. The other enables the direct control of
>> userspace over (almost) all aspects of paging.
>>
>> As I suggested before, if it is preferred, this can be a UFFD
>> IOCTL instead of process_madvise() behavior, thereby lowering
>> the risk of a misuse.
>
> (Sorry to join so late..)
>
> Yeah I'm wondering whether that could add one extra layer of security. But as
> you mentioned, we've already have process_vm_writev(), then it's indeed not
> strong reason to reject process_madvise(DONTNEED) too, it seems.
>
> Not sure whether you're aware of the umap project from LLNL:
>
> https://github.com/LLNL/umap
>
> From what I can tell, that's really doing very similar thing as what you
> proposed here, but it's just a local version of things. IOW in umap the
> DONTNEED can be done locally with madvise() already in the umap maintained
> threads. That close the need to introduce the new process_madvise() interface
> and it's definitely safer as it's per-mm and per-task.
>
> I think you mentioned above that the tracee program will need to cooperate in
> this case, I'm wondering whether some solution like umap would be fine too as
> that also requires cooperation of the tracee program, it's just that the
> cooperation may be slightly more than your solution but frankly I think that's
> still trivial and before I understand the details of your solution I can't
> really tell..
>
> E.g. for a program to use umap, I think it needs to replace mmap() to umap()
> where we want the buffers to be managed by umap library rather than the kernel,
> then link against the umap library should work. If the remote solution you're
> proposing requires similar (or even more complicated) cooperation, then it'll
> be controversial whether that can be done per-mm just like how umap designed
> and used. So IMHO it'll be great to share more details on those parts if umap
> cannot satisfy the current need - IMHO it satisfies all the features you
> described on fully customized pageout and page faulting in, it's just done in a
> single mm.

Thanks for your feedback, Peter.

I am familiar with umap, perhaps not enough, but I am aware.

From my experience, the existing interfaces are not sufficient if you
are looking for a high-performance (low-overhead) solution for multiple
processes. The level of cooperation that I mentioned is something that I
raised preemptively to avoid unnecessary discussion, but I believe the
cooperation issues can be resolved (I have just deferred handling them).

Specifically for performance, several new kernel features are needed,
for instance, support for io_uring with async operations, a vectored
UFFDIO_WRITEPROTECT(V) which batches TLB flushes across VMAs, and a
vectored madvise(). Even if we talk in the context of a single mm, I
cannot see umap being performant for low-latency devices without those
facilities.

Anyhow, I take your feedback and I will resend the patch for enabling
MADV_DONTNEED with other patches once I am done. As for the TLB batching
itself, I think it has independent value - but I am not going to
argue about it now if there is pushback against it.
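
For readers following the thread, steps 2-4 of the reclaim flow quoted
above (read the memory, write it to a backing store, zap it) can be
sketched in a single address space, i.e. the "local version" that Peter
describes. This is only an illustration under several assumptions, not
the code from the patch series: it is written in Python rather than C,
it calls plain madvise(MADV_DONTNEED) on the caller's own private
anonymous mapping instead of the proposed remote
process_madvise(MADV_DONTNEED), the "swap" is just a dict, and the
UFFD-WP protection of step 1 is omitted. The helper names evict(),
restore() and demo() are hypothetical. It needs Linux and Python 3.8+.

```python
import mmap

PAGE = mmap.PAGESIZE


def evict(region, offset, length, store):
    """Copy a page-aligned range into `store`, then zap it."""
    # Stand-in for steps 2 and 3: read the memory and keep a "swap" copy.
    store[(offset, length)] = region[offset:offset + length]
    # Stand-in for step 4: drop the pages. On Linux, private anonymous
    # memory reads back as zeros after MADV_DONTNEED.
    region.madvise(mmap.MADV_DONTNEED, offset, length)


def restore(region, offset, length, store):
    """Bring the saved copy back, as a fault handler would on access."""
    region[offset:offset + length] = store.pop((offset, length))


def demo():
    # MAP_PRIVATE matters: Python's default is MAP_SHARED, where
    # MADV_DONTNEED does not discard the contents.
    region = mmap.mmap(-1, 4 * PAGE,
                       flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS,
                       prot=mmap.PROT_READ | mmap.PROT_WRITE)
    store = {}
    payload = b"\xab" * PAGE
    region[PAGE:2 * PAGE] = payload

    evict(region, PAGE, PAGE, store)
    zapped = region[PAGE:2 * PAGE] == bytes(PAGE)

    restore(region, PAGE, PAGE, store)
    restored = region[PAGE:2 * PAGE] == payload

    region.close()
    return zapped, restored
```

Calling demo() returns a pair of booleans: whether the evicted page read
back as zeros, and whether restore() brought the original bytes back.
The sketch issues one madvise() call per range, which is the per-call
cost that a vectored madvise() is meant to amortize.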