From mboxrd@z Thu Jan  1 00:00:00 1970
From: Suren Baghdasaryan <surenb@google.com>
Date: Tue, 10 Oct 2023 01:49:36 +0000
Subject: Re: [PATCH v3 2/3] userfaultfd: UFFDIO_MOVE uABI
To: Lokesh Gidra
Cc: David Hildenbrand, akpm@linux-foundation.org, viro@zeniv.linux.org.uk,
 brauner@kernel.org, shuah@kernel.org, aarcange@redhat.com,
 peterx@redhat.com, hughd@google.com, mhocko@suse.com,
 axelrasmussen@google.com, rppt@kernel.org, willy@infradead.org,
 Liam.Howlett@oracle.com, jannh@google.com, zhangpeng362@huawei.com,
 bgeffon@google.com, kaleshsingh@google.com, ngeoffray@google.com,
 jdduke@google.com, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
 linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org,
 kernel-team@android.com
References: <20231009064230.2952396-1-surenb@google.com>
 <20231009064230.2952396-3-surenb@google.com>
 <214b78ed-3842-5ba1-fa9c-9fa719fca129@redhat.com>
 <478697aa-f55c-375a-6888-3abb343c6d9d@redhat.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="UTF-8"

On Mon, Oct 9, 2023 at 5:57 PM Lokesh Gidra wrote:
>
> On Mon, Oct 9, 2023 at 9:29 AM Lokesh Gidra wrote:
> >
> > On Mon, Oct 9, 2023 at 5:24 PM David Hildenbrand wrote:
> > >
> > > On 09.10.23 18:21, Suren Baghdasaryan wrote:
> > > > On Mon, Oct 9, 2023 at 7:38 AM David Hildenbrand wrote:
> > > >>
> > > >> On 09.10.23 08:42, Suren Baghdasaryan wrote:
> > > >>> From: Andrea Arcangeli
> > > >>>
> > > >>> Implement the uABI of UFFDIO_MOVE ioctl.
> > > >>> UFFDIO_COPY performs ~20% better than UFFDIO_MOVE when the application
> > > >>> needs pages to be allocated [1]. However, with UFFDIO_MOVE, if pages are
> > > >>> available (in userspace) for recycling, as is usually the case in heap
> > > >>> compaction algorithms, then we can avoid the page allocation and memcpy
> > > >>> (done by UFFDIO_COPY). Also, since the pages are recycled in
> > > >>> userspace, we avoid the need to release (via madvise) the pages back to
> > > >>> the kernel [2].
> > > >>> We see over 40% reduction (on a Google Pixel 6 device) in the compacting
> > > >>> thread's completion time by using UFFDIO_MOVE vs. UFFDIO_COPY. This was
> > > >>> measured using a benchmark that emulates a heap compaction implementation
> > > >>> using userfaultfd (to allow concurrent accesses by application threads).
> > > >>> More details of the use case are explained in [2].
> > > >>> Furthermore, UFFDIO_MOVE enables moving swapped-out pages without
> > > >>> touching them within the same vma. Today this can only be done by
> > > >>> mremap; however, that forces splitting the vma.
> > > >>>
> > > >>> [1] https://lore.kernel.org/all/1425575884-2574-1-git-send-email-aarcange@redhat.com/
> > > >>> [2] https://lore.kernel.org/linux-mm/CA+EESO4uO84SSnBhArH4HvLNhaUQ5nZKNKXqxRCyjniNVjp0Aw@mail.gmail.com/
> > > >>>
> > > >>> Update for the ioctl_userfaultfd(2) manpage:
> > > >>>
> > > >>>    UFFDIO_MOVE
> > > >>>        (Since Linux xxx) Move a contiguous memory chunk into the
> > > >>>        userfault registered range and optionally wake up the blocked
> > > >>>        thread.
> > > >>>        The source and destination addresses and the number of
> > > >>>        bytes to move are specified by the src, dst, and len fields of
> > > >>>        the uffdio_move structure pointed to by argp:
> > > >>>
> > > >>>            struct uffdio_move {
> > > >>>                __u64 dst;   /* Destination of move */
> > > >>>                __u64 src;   /* Source of move */
> > > >>>                __u64 len;   /* Number of bytes to move */
> > > >>>                __u64 mode;  /* Flags controlling behavior of move */
> > > >>>                __s64 move;  /* Number of bytes moved, or negated error */
> > > >>>            };
> > > >>>
> > > >>>        The following value may be bitwise ORed in mode to change the
> > > >>>        behavior of the UFFDIO_MOVE operation:
> > > >>>
> > > >>>        UFFDIO_MOVE_MODE_DONTWAKE
> > > >>>               Do not wake up the thread that waits for page-fault
> > > >>>               resolution.
> > > >>>
> > > >>>        UFFDIO_MOVE_MODE_ALLOW_SRC_HOLES
> > > >>>               Allow holes in the source virtual range that is being moved.
> > > >>>               When not specified, the holes will result in an ENOENT error.
> > > >>>               When specified, the holes will be accounted as successfully
> > > >>>               moved memory. This is mostly useful to move hugepage-aligned
> > > >>>               virtual regions without knowing if there are transparent
> > > >>>               hugepages in the regions or not, while preventing the risk of
> > > >>>               having to split the hugepage during the operation.
> > > >>>
> > > >>>        The move field is used by the kernel to return the number of
> > > >>>        bytes that were actually moved, or an error (a negated errno-
> > > >>>        style value). If the value returned in move doesn't match the
> > > >>>        value that was specified in len, the operation fails with the
> > > >>>        error EAGAIN. The move field is output-only; it is not read by
> > > >>>        the UFFDIO_MOVE operation.
> > > >>>
> > > >>>        The operation may fail for various reasons.
> > > >>>        Usually, remapping of
> > > >>>        pages that are not exclusive to the given process fails; once KSM
> > > >>>        has deduplicated pages, or fork() has COW-shared pages with child
> > > >>>        processes, they are no longer exclusive. Further, the
> > > >>>        kernel might only perform lightweight checks for detecting whether
> > > >>>        the pages are exclusive, and return -EBUSY in case that check fails.
> > > >>>        To make the operation more likely to succeed, KSM should be
> > > >>>        disabled, fork() should be avoided, or MADV_DONTFORK should be
> > > >>>        configured for the source VMA before fork().
> > > >>>
> > > >>>        This ioctl(2) operation returns 0 on success. In this case, the
> > > >>>        entire area was moved. On error, -1 is returned and errno is
> > > >>>        set to indicate the error. Possible errors include:
> > > >>>
> > > >>>        EAGAIN The number of bytes moved (i.e., the value returned in
> > > >>>               the move field) does not equal the value that was
> > > >>>               specified in the len field.
> > > >>>
> > > >>>        EINVAL Either dst or len was not a multiple of the system page
> > > >>>               size, or the range specified by src and len or dst and len
> > > >>>               was invalid.
> > > >>>
> > > >>>        EINVAL An invalid bit was specified in the mode field.
> > > >>>
> > > >>>        ENOENT The source virtual memory range has unmapped holes and
> > > >>>               UFFDIO_MOVE_MODE_ALLOW_SRC_HOLES is not set.
> > > >>>
> > > >>>        EEXIST The destination virtual memory range is fully or partially
> > > >>>               mapped.
> > > >>>
> > > >>>        EBUSY  The pages in the source virtual memory range are not
> > > >>>               exclusive to the process. The kernel might only perform
> > > >>>               lightweight checks for detecting whether the pages are
> > > >>>               exclusive.
> > > >>>               To make the operation more likely to succeed,
> > > >>>               KSM should be disabled, fork() should be avoided, or
> > > >>>               MADV_DONTFORK should be configured for the source virtual
> > > >>>               memory area before fork().
> > > >>>
> > > >>>        ENOMEM Allocating memory needed for the operation failed.
> > > >>>
> > > >>>        ESRCH  The faulting process has exited at the time of a
> > > >>>               UFFDIO_MOVE operation.
> > > >>>
> > > >>
> > > >> A general comment simply because I realized that just now: does anything
> > > >> speak against limiting the operations now to a single MM?
> > > >>
> > > >> The use cases I heard so far don't need it. If ever required, we could
> > > >> consider extending it.
> > > >>
> > > >> Let's reduce complexity and KIS unless really required.
> > > >
> > > > Let me check if there are use cases that require moves between MMs.
> > > > Andrea seems to have put considerable effort into making it work between
> > > > MMs and it would be a pity to lose that. I can send a follow-up patch
> > > > to recover that functionality, and even if it does not get merged, it
> > > > can be used in the future as a reference. But first let me check if we
> > > > can drop it.
> >
> > For the compaction use case that we have, it's fine to limit it to a
> > single MM. However, for general use I think Peter will have a better
> > idea.
> >
> > > Yes, that sounds reasonable. Unless the big important use cases require
> > > moving pages between processes, let's leave that as future work for now.
> > >
> > > --
> > > Cheers,
> > >
> > > David / dhildenb
> > >
>
> While going through mremap's move_page_tables code, which is pretty
> similar to what we do here, I noticed that the cache is flushed as well,
> whereas we are not doing that here. Is that OK? I'm not an MM expert by
> any means, so it's a question rather than a comment :)

Good question. I'll have to look closer into it. Unfortunately I'll be
travelling starting tomorrow and will be back next week.
Will try my best to answer questions in a timely manner, but that depends on
my connection and availability. Thanks!