References: <20231009064230.2952396-1-surenb@google.com>
 <20231009064230.2952396-3-surenb@google.com>
 <214b78ed-3842-5ba1-fa9c-9fa719fca129@redhat.com>
 <478697aa-f55c-375a-6888-3abb343c6d9d@redhat.com>
From: Lokesh Gidra
Date: Mon, 9 Oct 2023 10:56:50 -0700
Subject: Re: [PATCH v3 2/3] userfaultfd: UFFDIO_MOVE uABI
To: David Hildenbrand
Cc: Suren Baghdasaryan, akpm@linux-foundation.org, viro@zeniv.linux.org.uk,
 brauner@kernel.org, shuah@kernel.org, aarcange@redhat.com, peterx@redhat.com,
 hughd@google.com, mhocko@suse.com, axelrasmussen@google.com, rppt@kernel.org,
 willy@infradead.org, Liam.Howlett@oracle.com, jannh@google.com,
 zhangpeng362@huawei.com, bgeffon@google.com, kaleshsingh@google.com,
 ngeoffray@google.com, jdduke@google.com, linux-mm@kvack.org,
 linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
 linux-kselftest@vger.kernel.org, kernel-team@android.com

On Mon, Oct 9, 2023 at 9:29 AM Lokesh Gidra wrote:
>
> On Mon, Oct 9, 2023 at 5:24 PM David Hildenbrand wrote:
> >
> > On 09.10.23 18:21, Suren Baghdasaryan wrote:
> > > On Mon, Oct 9, 2023 at 7:38 AM David Hildenbrand wrote:
> > >>
> > >> On 09.10.23 08:42, Suren Baghdasaryan wrote:
> > >>> From: Andrea Arcangeli
> > >>>
> > >>> Implement the uABI of UFFDIO_MOVE ioctl.
> > >>> UFFDIO_COPY performs ~20% better than UFFDIO_MOVE when the application
> > >>> needs pages to be allocated [1]. However, with UFFDIO_MOVE, if pages are
> > >>> available (in userspace) for recycling, as is usually the case in heap
> > >>> compaction algorithms, then we can avoid the page allocation and memcpy
> > >>> (done by UFFDIO_COPY). Also, since the pages are recycled in the
> > >>> userspace, we avoid the need to release (via madvise) the pages back to
> > >>> the kernel [2].
> > >>> We see over 40% reduction (on a Google pixel 6 device) in the compacting
> > >>> thread’s completion time by using UFFDIO_MOVE vs. UFFDIO_COPY. This was
> > >>> measured using a benchmark that emulates a heap compaction implementation
> > >>> using userfaultfd (to allow concurrent accesses by application threads).
> > >>> More details of the usecase are explained in [2].
> > >>> Furthermore, UFFDIO_MOVE enables moving swapped-out pages without
> > >>> touching them within the same vma. Today, it can only be done by mremap,
> > >>> however it forces splitting the vma.
> > >>>
> > >>> [1] https://lore.kernel.org/all/1425575884-2574-1-git-send-email-aarcange@redhat.com/
> > >>> [2] https://lore.kernel.org/linux-mm/CA+EESO4uO84SSnBhArH4HvLNhaUQ5nZKNKXqxRCyjniNVjp0Aw@mail.gmail.com/
> > >>>
> > >>> Update for the ioctl_userfaultfd(2) manpage:
> > >>>
> > >>>    UFFDIO_MOVE
> > >>>        (Since Linux xxx) Move a continuous memory chunk into the
> > >>>        userfault registered range and optionally wake up the blocked
> > >>>        thread. The source and destination addresses and the number of
> > >>>        bytes to move are specified by the src, dst, and len fields of
> > >>>        the uffdio_move structure pointed to by argp:
> > >>>
> > >>>            struct uffdio_move {
> > >>>                __u64 dst;    /* Destination of move */
> > >>>                __u64 src;    /* Source of move */
> > >>>                __u64 len;    /* Number of bytes to move */
> > >>>                __u64 mode;   /* Flags controlling behavior of move */
> > >>>                __s64 move;   /* Number of bytes moved, or negated error */
> > >>>            };
> > >>>
> > >>>        The following value may be bitwise ORed in mode to change the
> > >>>        behavior of the UFFDIO_MOVE operation:
> > >>>
> > >>>        UFFDIO_MOVE_MODE_DONTWAKE
> > >>>               Do not wake up the thread that waits for page-fault
> > >>>               resolution
> > >>>
> > >>>        UFFDIO_MOVE_MODE_ALLOW_SRC_HOLES
> > >>>               Allow holes in the source virtual range that is being moved.
> > >>>               When not specified, the holes will result in ENOENT error.
> > >>>               When specified, the holes will be accounted as successfully
> > >>>               moved memory. This is mostly useful to move hugepage aligned
> > >>>               virtual regions without knowing if there are transparent
> > >>>               hugepages in the regions or not, but preventing the risk of
> > >>>               having to split the hugepage during the operation.
> > >>>
> > >>>        The move field is used by the kernel to return the number of
> > >>>        bytes that was actually moved, or an error (a negated errno-
> > >>>        style value). If the value returned in move doesn't match the
> > >>>        value that was specified in len, the operation fails with the
> > >>>        error EAGAIN. The move field is output-only; it is not read by
> > >>>        the UFFDIO_MOVE operation.
> > >>>
> > >>>        The operation may fail for various reasons. Usually, remapping of
> > >>>        pages that are not exclusive to the given process fail; once KSM
> > >>>        might deduplicate pages or fork() COW-shares pages during fork()
> > >>>        with child processes, they are no longer exclusive. Further, the
> > >>>        kernel might only perform lightweight checks for detecting whether
> > >>>        the pages are exclusive, and return -EBUSY in case that check fails.
> > >>>        To make the operation more likely to succeed, KSM should be
> > >>>        disabled, fork() should be avoided or MADV_DONTFORK should be
> > >>>        configured for the source VMA before fork().
> > >>>
> > >>>        This ioctl(2) operation returns 0 on success. In this case, the
> > >>>        entire area was moved. On error, -1 is returned and errno is
> > >>>        set to indicate the error. Possible errors include:
> > >>>
> > >>>        EAGAIN The number of bytes moved (i.e., the value returned in
> > >>>               the move field) does not equal the value that was
> > >>>               specified in the len field.
> > >>>
> > >>>        EINVAL Either dst or len was not a multiple of the system page
> > >>>               size, or the range specified by src and len or dst and len
> > >>>               was invalid.
> > >>>
> > >>>        EINVAL An invalid bit was specified in the mode field.
> > >>>
> > >>>        ENOENT
> > >>>               The source virtual memory range has unmapped holes and
> > >>>               UFFDIO_MOVE_MODE_ALLOW_SRC_HOLES is not set.
> > >>>
> > >>>        EEXIST
> > >>>               The destination virtual memory range is fully or partially
> > >>>               mapped.
> > >>>
> > >>>        EBUSY
> > >>>               The pages in the source virtual memory range are not
> > >>>               exclusive to the process. The kernel might only perform
> > >>>               lightweight checks for detecting whether the pages are
> > >>>               exclusive. To make the operation more likely to succeed,
> > >>>               KSM should be disabled, fork() should be avoided or
> > >>>               MADV_DONTFORK should be configured for the source virtual
> > >>>               memory area before fork().
> > >>>
> > >>>        ENOMEM Allocating memory needed for the operation failed.
> > >>>
> > >>>        ESRCH
> > >>>               The faulting process has exited at the time of a
> > >>>               UFFDIO_MOVE operation.
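
As a side note on the return convention in the manpage text quoted above, here is a minimal userspace sketch of how a caller might interpret the result and retry after a partial move. The struct is a local mirror of the quoted uffdio_move layout, and the helper names (move_args_aligned, move_progress, advance_after_partial) are hypothetical; they are not part of the patch or of any kernel header. The sketch makes no ioctl call and only models the semantics as documented in the quoted text.

```c
#include <errno.h>
#include <stdint.h>

/* Local mirror of the uffdio_move layout quoted above (an assumption made
 * so the sketch builds without headers that define UFFDIO_MOVE). */
struct uffdio_move_sketch {
    uint64_t dst;   /* Destination of move */
    uint64_t src;   /* Source of move */
    uint64_t len;   /* Number of bytes to move */
    uint64_t mode;  /* Flags controlling behavior of move */
    int64_t  move;  /* Number of bytes moved, or negated error */
};

/* Hypothetical pre-check mirroring the documented EINVAL rule: dst and len
 * must be multiples of the page size. Returns 1 if aligned, 0 otherwise. */
int move_args_aligned(const struct uffdio_move_sketch *m, uint64_t page_size)
{
    if (page_size == 0)
        return 0;
    return (m->dst % page_size == 0) && (m->len % page_size == 0);
}

/* Hypothetical helper: map (ioctl return value, errno, move field) to the
 * number of bytes already moved, or to a negated errno on failure.
 * A return of 0 from the ioctl means the whole range was moved; EAGAIN
 * means a partial move whose size is reported in the move field. */
int64_t move_progress(int ioctl_ret, int err,
                      const struct uffdio_move_sketch *m)
{
    if (ioctl_ret == 0)
        return (int64_t)m->len;
    if (err == EAGAIN && m->move >= 0)
        return m->move;
    return -(int64_t)err;
}

/* Skip the bytes already moved so the remaining range can be reissued. */
void advance_after_partial(struct uffdio_move_sketch *m, uint64_t moved)
{
    m->dst += moved;
    m->src += moved;
    m->len -= moved;
    m->move = 0;
}
```

A caller would presumably issue the ioctl, pass its return value and errno to move_progress(), and on a non-negative result smaller than len call advance_after_partial() before retrying; a negative result would be treated as a hard error.
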
> > >>>
> > >>
> > >> A general comment simply because I realized that just now: does anything
> > >> speak against limiting the operations now to a single MM?
> > >>
> > >> The use cases I heard so far don't need it. If ever required, we could
> > >> consider extending it.
> > >>
> > >> Let's reduce complexity and KIS unless really required.
> > >
> > > Let me check if there are use cases that require moves between MMs.
> > > Andrea seems to have put considerable effort to make it work between
> > > MMs and it would be a pity to lose that. I can send a follow-up patch
> > > to recover that functionality and even if it does not get merged, it
> > > can be used in the future as a reference. But first let me check if we
> > > can drop it.
>
> For the compaction use case that we have it's fine to limit it to
> single MM. However, for general use I think Peter will have a better
> idea.
> >
> > Yes, that sounds reasonable. Unless the big important use cases requires
> > moving pages between processes, let's leave that as future work for now.
> >
> > --
> > Cheers,
> >
> > David / dhildenb
> >

While going through mremap's move_page_tables code, which is pretty
similar to what we do here, I noticed that cache is flushed as well,
whereas we are not doing that here. Is that OK? I'm not a MM expert by
any means, so it's a question rather than a comment :)