Date: Thu, 14 Oct 2021 07:09:54 +0800
From: Peter Xu
To: Nadav Amit
Cc: Michal Hocko, David Hildenbrand, Andrew Morton, Linux-MM,
 Linux Kernel Mailing List, Andrea Arcangeli, Minchan Kim, Colin Cross,
 Suren Baghdasarya, Mike Rapoport
Subject: Re: [RFC PATCH 0/8] mm/madvise: support process_madvise(MADV_DONTNEED)
In-Reply-To: <595A6581-86CF-4372-98AF-532DF65186C6@gmail.com>

On Wed, Oct 13, 2021 at 08:47:11AM -0700, Nadav Amit wrote:
>
> > On Oct 12, 2021, at 4:14 PM, Peter Xu wrote:
> >
> > On Wed, Sep 29, 2021 at 11:31:25AM -0700, Nadav Amit wrote:
> >>
> >>> On Sep 29, 2021, at 12:52 AM, Michal Hocko wrote:
> >>>
> >>> On Mon 27-09-21 12:12:46, Nadav Amit wrote:
> >>>>
> >>>>> On Sep 27, 2021, at 5:16 AM, Michal Hocko wrote:
> >>>>>
> >>>>> On Mon 27-09-21 05:00:11, Nadav Amit wrote:
> >>>>> [...]
> >>>>>> The manager is notified on memory regions that it should monitor
> >>>>>> (through PTRACE/LD_PRELOAD/explicit-API). It then monitors these regions
> >>>>>> using the remote-userfaultfd that you saw on the second thread. When it wants
> >>>>>> to reclaim (anonymous) memory, it:
> >>>>>>
> >>>>>> 1. Uses UFFD-WP to protect that memory (and for this matter I got a vectored
> >>>>>> UFFD-WP to do so efficiently, a patch which I did not send yet).
> >>>>>> 2. Calls process_vm_readv() to read that memory of that process.
> >>>>>> 3. Write it back to “swap”.
> >>>>>> 4. Calls process_madvise(MADV_DONTNEED) to zap it.
> >>>>>
> >>>>> Why cannot you use MADV_PAGEOUT/MADV_COLD for this usecase?
> >>>>
> >>>> Providing hints to the kernel takes you so far to a certain extent.
> >>>> The kernel does not want to (for a good reason) to be completely
> >>>> configurable when it comes to reclaim and prefetch policies. Doing
> >>>> so from userspace allows you to be fully configurable.
> >>>
> >>> I am sorry but I do not follow. Your scenario is describing a user
> >>> space driven reclaim. Something that MADV_{COLD,PAGEOUT} have been
> >>> designed for. What are you missing in the existing functionality?
> >>
> >> Using MADV_COLD/MADV_PAGEOUT does not allow userspace to control
> >> many aspects of paging out memory:
> >>
> >> 1. Writeback: writeback ahead of time, dynamic clustering, etc.
> >> 2. Batching (regardless, MADV_PAGEOUT does pretty bad batching job
> >> on non-contiguous memory).
> >> 3. No guarantee the page is actually reclaimed (e.g., writeback)
> >> and the time it takes place.
> >> 4. I/O stack for swapping - you must use kernel I/O stack (FUSE
> >> as non-performant as it is cannot be used for swap AFAIK).
> >> 5. Other operations (e.g., locking, working set tracking) that
> >> might not be necessary or interfere.
> >>
> >> In addition, the use of MADV_COLD/MADV_PAGEOUT prevents the use
> >> of userfaultfd to trap page-faults and react accordingly, so you
> >> are also prevented from:
> >>
> >> 6. Having your own custom prefetching policy in response to #PF.
> >>
> >> There are additional use-cases I can try to formalize in which
> >> MADV_COLD/MADV_PAGEOUT is insufficient. But the main difference
> >> is pretty clear, I think: one is a hint that only applied to
> >> page reclamation. The other enables the direct control of
> >> userspace over (almost) all aspects of paging.
> >>
> >> As I suggested before, if it is preferred, this can be a UFFD
> >> IOCTL instead of process_madvise() behavior, thereby lowering
> >> the risk of a misuse.
> >
> > (Sorry to join so late..)
> >
> > Yeah I'm wondering whether that could add one extra layer of security. But as
> > you mentioned, we've already have process_vm_writev(), then it's indeed not
> > strong reason to reject process_madvise(DONTNEED) too, it seems.
> >
> > Not sure whether you're aware of the umap project from LLNL:
> >
> > https://github.com/LLNL/umap
> >
> > From what I can tell, that's really doing very similar thing as what you
> > proposed here, but it's just a local version of things. IOW in umap the
> > DONTNEED can be done locally with madvise() already in the umap maintained
> > threads. That close the need to introduce the new process_madvise() interface
> > and it's definitely safer as it's per-mm and per-task.
> >
> > I think you mentioned above that the tracee program will need to cooperate in
> > this case, I'm wondering whether some solution like umap would be fine too as
> > that also requires cooperation of the tracee program, it's just that the
> > cooperation may be slightly more than your solution but frankly I think that's
> > still trivial and before I understand the details of your solution I can't
> > really tell..
> >
> > E.g. for a program to use umap, I think it needs to replace mmap() to umap()
> > where we want the buffers to be managed by umap library rather than the kernel,
> > then link against the umap library should work. If the remote solution you're
> > proposing requires similar (or even more complicated) cooperation, then it'll
> > be controversial whether that can be done per-mm just like how umap designed
> > and used. So IMHO it'll be great to share more details on those parts if umap
> > cannot satisfy the current need - IMHO it satisfies all the features you
> > described on fully customized pageout and page faulting in, it's just done in a
> > single mm.
>
> Thanks for you feedback, Peter.
>
> I am familiar with umap, perhaps not enough, but I am aware.
>
> From my experience, the existing interfaces are not sufficient if you look
> for high performance (low overhead) solution for multiple processes. The
> level of cooperation that I mentioned is something that I mentioned
> preemptively to avoid unnecessary discussion, but I believe they can be
> resolved (I have just deferred handling them).
>
> Specifically for performance, several new kernel features are needed, for
> instance, support for iouring with async operations, a vectored
> UFFDIO_WRITEPROTECT(V) which batches TLB flushes across VMAs and a
> vectored madvise(). Even if we talk on the context of a single mm, I
> cannot see umap being performant for low latency devices without those
> facilities.
>
> Anyhow, I take your feedback and I will resend the patch for enabling
> MADV_DONTNEED with other patches once I am done. As for the TLB batching
> itself, I think it has an independent value - but I am not going to
> argue about it now if there is a pushback against it.

Fair enough.

Yes, my comment was mostly about whether a remote interface is needed or
whether we can still do it locally (frankly, I have always wanted some remote
interface to manipulate uffd, but anything like that should still require
some justification for sure).

I totally agree that the rest of your work sounds reasonable, both the TLB
optimizations (especially the two points Andrea mentioned: reducing TLB
flushes for huge pmd change protection and for pte promotions) and the
vectored interfaces, and they're definitely separate issues from this one.

Thanks,

-- 
Peter Xu