Date: Wed, 13 Oct 2021 07:14:17 +0800
From: Peter Xu
To: Nadav Amit
Cc: Michal Hocko, David Hildenbrand, Andrew Morton, Linux-MM,
 Linux Kernel Mailing List, Andrea Arcangeli, Minchan Kim, Colin Cross,
 Suren Baghdasarya, Mike Rapoport
Subject: Re: [RFC PATCH 0/8] mm/madvise: support process_madvise(MADV_DONTNEED)

On Wed, Sep 29, 2021 at 11:31:25AM -0700, Nadav Amit wrote:
>
>
> > On Sep 29, 2021, at 12:52 AM, Michal Hocko wrote:
> >
> > On Mon 27-09-21 12:12:46, Nadav Amit wrote:
> >>
> >>> On Sep 27, 2021, at 5:16 AM, Michal Hocko wrote:
> >>>
> >>> On Mon 27-09-21 05:00:11, Nadav Amit wrote:
> >>> [...]
> >>>> The manager is notified on memory regions that it should monitor
> >>>> (through PTRACE/LD_PRELOAD/explicit-API).
> >>>> It then monitors these regions
> >>>> using the remote-userfaultfd that you saw on the second thread. When it wants
> >>>> to reclaim (anonymous) memory, it:
> >>>>
> >>>> 1. Uses UFFD-WP to protect that memory (and for this matter I got a vectored
> >>>>    UFFD-WP to do so efficiently, a patch which I did not send yet).
> >>>> 2. Calls process_vm_readv() to read that memory of that process.
> >>>> 3. Write it back to “swap”.
> >>>> 4. Calls process_madvise(MADV_DONTNEED) to zap it.
> >>>
> >>> Why cannot you use MADV_PAGEOUT/MADV_COLD for this usecase?
> >>
> >> Providing hints to the kernel takes you so far to a certain extent.
> >> The kernel does not want to (for a good reason) to be completely
> >> configurable when it comes to reclaim and prefetch policies. Doing
> >> so from userspace allows you to be fully configurable.
> >
> > I am sorry but I do not follow. Your scenario is describing a user
> > space driven reclaim. Something that MADV_{COLD,PAGEOUT} have been
> > designed for. What are you missing in the existing functionality?
>
> Using MADV_COLD/MADV_PAGEOUT does not allow userspace to control
> many aspects of paging out memory:
>
> 1. Writeback: writeback ahead of time, dynamic clustering, etc.
> 2. Batching (regardless, MADV_PAGEOUT does pretty bad batching job
>    on non-contiguous memory).
> 3. No guarantee the page is actually reclaimed (e.g., writeback)
>    and the time it takes place.
> 4. I/O stack for swapping - you must use kernel I/O stack (FUSE
>    as non-performant as it is cannot be used for swap AFAIK).
> 5. Other operations (e.g., locking, working set tracking) that
>    might not be necessary or interfere.
>
> In addition, the use of MADV_COLD/MADV_PAGEOUT prevents the use
> of userfaultfd to trap page-faults and react accordingly, so you
> are also prevented from:
>
> 6. Having your own custom prefetching policy in response to #PF.
>
> There are additional use-cases I can try to formalize in which
> MADV_COLD/MADV_PAGEOUT is insufficient. But the main difference
> is pretty clear, I think: one is a hint that only applied to
> page reclamation. The other enables the direct control of
> userspace over (almost) all aspects of paging.
>
> As I suggested before, if it is preferred, this can be a UFFD
> IOCTL instead of process_madvise() behavior, thereby lowering
> the risk of a misuse.

(Sorry to join so late..)

Yeah, I was wondering whether that could add one extra layer of security. But
as you mentioned, we already have process_vm_writev(), so that is indeed not a
strong reason to reject process_madvise(DONTNEED) either, it seems.

Not sure whether you're aware of the umap project from LLNL:

https://github.com/LLNL/umap

From what I can tell, it does something very similar to what you proposed
here, but as a local version. IOW, in umap the DONTNEED can already be done
locally with madvise() in the umap-maintained threads. That removes the need
to introduce the new process_madvise() interface, and it's definitely safer as
it's per-mm and per-task.

I think you mentioned above that the tracee program will need to cooperate in
this case. I'm wondering whether some solution like umap would be fine too, as
that also requires cooperation of the tracee program. The cooperation may be
slightly more than in your solution, but frankly I think that's still trivial,
and before I understand the details of your solution I can't really tell.

E.g. for a program to use umap, I think it needs to replace mmap() with umap()
wherever we want the buffers to be managed by the umap library rather than the
kernel; then linking against the umap library should work. If the remote
solution you're proposing requires similar (or even more complicated)
cooperation, then it becomes debatable whether that could be done per-mm, just
like how umap is designed and used.
So IMHO it would be great to share more details on those parts if umap cannot
satisfy the current need. IMHO it satisfies all the features you described for
fully customized pageout and page faulting; it's just done in a single mm.

Thanks,

-- 
Peter Xu
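
For readers following the thread, the local (single-mm) variant of steps 2-4
from the quoted list can be sketched in C roughly as below. This is a minimal
sketch under stated assumptions, not umap's code and not the RFC's:
evict_range_local() and restore_range_local() are hypothetical helper names, a
malloc() buffer stands in for user-space "swap", the range is assumed to be
page-aligned private anonymous memory, and the userfaultfd write-protect and
fault-handling steps are left out entirely.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of umap or the kernel API).
 * Saves `len` bytes at `addr` into a heap buffer that stands in for
 * user-space "swap", then zaps the range in the caller's own mm with
 * madvise(MADV_DONTNEED).
 *
 * Assumes `addr` is page-aligned, `len` is a non-zero multiple of the
 * page size, and the range is private anonymous memory.
 * Returns the saved copy, or NULL on failure (the range is then intact).
 */
void *evict_range_local(void *addr, size_t len)
{
    long page = sysconf(_SC_PAGESIZE);

    if (page <= 0 || len == 0 ||
        (uintptr_t)addr % (uintptr_t)page != 0 ||
        len % (size_t)page != 0)
        return NULL;

    void *saved = malloc(len);
    if (saved == NULL)
        return NULL;

    /* Steps 2 and 3 of the quoted flow, done locally: read the memory
     * and "write it back" to the stand-in backing store. */
    memcpy(saved, addr, len);

    /* Step 4, done locally: drop the pages. For private anonymous
     * memory the next access sees zero-filled pages. */
    if (madvise(addr, len, MADV_DONTNEED) != 0) {
        free(saved);
        return NULL;
    }
    return saved;
}

/*
 * Hypothetical helper: copies the saved bytes back into the range and
 * releases the stand-in backing store.
 */
void restore_range_local(void *addr, void *saved, size_t len)
{
    assert(saved != NULL);
    memcpy(addr, saved, len);
    free(saved);
}
```

After evict_range_local() succeeds, a private anonymous range reads back as
zero-filled pages until restore_range_local() copies the saved bytes in again.
In a umap-style design the restore would presumably be driven by a userfaultfd
handler instead of being called directly.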