References: <20231009064230.2952396-1-surenb@google.com>
 <20231009064230.2952396-3-surenb@google.com>
 <214b78ed-3842-5ba1-fa9c-9fa719fca129@redhat.com>
 <478697aa-f55c-375a-6888-3abb343c6d9d@redhat.com>
In-Reply-To: <478697aa-f55c-375a-6888-3abb343c6d9d@redhat.com>
From: Lokesh Gidra
Date: Mon, 9 Oct 2023 17:29:08 +0100
Subject: Re: [PATCH v3 2/3] userfaultfd: UFFDIO_MOVE uABI
To: David Hildenbrand
Cc: Suren Baghdasaryan, akpm@linux-foundation.org, viro@zeniv.linux.org.uk,
 brauner@kernel.org, shuah@kernel.org, aarcange@redhat.com, peterx@redhat.com,
 hughd@google.com, mhocko@suse.com, axelrasmussen@google.com, rppt@kernel.org,
 willy@infradead.org, Liam.Howlett@oracle.com, jannh@google.com,
 zhangpeng362@huawei.com, bgeffon@google.com, kaleshsingh@google.com,
 ngeoffray@google.com, jdduke@google.com, linux-mm@kvack.org,
 linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
 linux-kselftest@vger.kernel.org, kernel-team@android.com

On Mon, Oct 9, 2023 at 5:24 PM David Hildenbrand wrote:
>
> On 09.10.23 18:21, Suren Baghdasaryan wrote:
> > On Mon, Oct 9, 2023 at 7:38 AM David Hildenbrand wrote:
> >>
> >> On 09.10.23 08:42, Suren Baghdasaryan wrote:
> >>> From: Andrea Arcangeli
> >>>
> >>> Implement the uABI of UFFDIO_MOVE ioctl.
> >>> UFFDIO_COPY performs ~20% better than UFFDIO_MOVE when the application
> >>> needs pages to be allocated [1].
> >>> However, with UFFDIO_MOVE, if pages are
> >>> available (in userspace) for recycling, as is usually the case in heap
> >>> compaction algorithms, then we can avoid the page allocation and memcpy
> >>> (done by UFFDIO_COPY). Also, since the pages are recycled in the
> >>> userspace, we avoid the need to release (via madvise) the pages back to
> >>> the kernel [2].
> >>> We see over 40% reduction (on a Google pixel 6 device) in the compacting
> >>> thread’s completion time by using UFFDIO_MOVE vs. UFFDIO_COPY. This was
> >>> measured using a benchmark that emulates a heap compaction implementation
> >>> using userfaultfd (to allow concurrent accesses by application threads).
> >>> More details of the usecase are explained in [2].
> >>> Furthermore, UFFDIO_MOVE enables moving swapped-out pages without
> >>> touching them within the same vma. Today, it can only be done by mremap,
> >>> however it forces splitting the vma.
> >>>
> >>> [1] https://lore.kernel.org/all/1425575884-2574-1-git-send-email-aarcange@redhat.com/
> >>> [2] https://lore.kernel.org/linux-mm/CA+EESO4uO84SSnBhArH4HvLNhaUQ5nZKNKXqxRCyjniNVjp0Aw@mail.gmail.com/
> >>>
> >>> Update for the ioctl_userfaultfd(2) manpage:
> >>>
> >>>      UFFDIO_MOVE
> >>>          (Since Linux xxx) Move a continuous memory chunk into the
> >>>          userfault registered range and optionally wake up the blocked
> >>>          thread.
> >>>          The source and destination addresses and the number of
> >>>          bytes to move are specified by the src, dst, and len fields of
> >>>          the uffdio_move structure pointed to by argp:
> >>>
> >>>              struct uffdio_move {
> >>>                  __u64 dst;    /* Destination of move */
> >>>                  __u64 src;    /* Source of move */
> >>>                  __u64 len;    /* Number of bytes to move */
> >>>                  __u64 mode;   /* Flags controlling behavior of move */
> >>>                  __s64 move;   /* Number of bytes moved, or negated error */
> >>>              };
> >>>
> >>>          The following value may be bitwise ORed in mode to change the
> >>>          behavior of the UFFDIO_MOVE operation:
> >>>
> >>>          UFFDIO_MOVE_MODE_DONTWAKE
> >>>                 Do not wake up the thread that waits for page-fault
> >>>                 resolution
> >>>
> >>>          UFFDIO_MOVE_MODE_ALLOW_SRC_HOLES
> >>>                 Allow holes in the source virtual range that is being moved.
> >>>                 When not specified, the holes will result in ENOENT error.
> >>>                 When specified, the holes will be accounted as successfully
> >>>                 moved memory. This is mostly useful to move hugepage aligned
> >>>                 virtual regions without knowing if there are transparent
> >>>                 hugepages in the regions or not, but preventing the risk of
> >>>                 having to split the hugepage during the operation.
> >>>
> >>>          The move field is used by the kernel to return the number of
> >>>          bytes that was actually moved, or an error (a negated errno-
> >>>          style value). If the value returned in move doesn't match the
> >>>          value that was specified in len, the operation fails with the
> >>>          error EAGAIN. The move field is output-only; it is not read by
> >>>          the UFFDIO_MOVE operation.
> >>>
> >>>          The operation may fail for various reasons. Usually, remapping of
> >>>          pages that are not exclusive to the given process fail; once KSM
> >>>          might deduplicate pages or fork() COW-shares pages during fork()
> >>>          with child processes, they are no longer exclusive.
> >>>          Further, the
> >>>          kernel might only perform lightweight checks for detecting whether
> >>>          the pages are exclusive, and return -EBUSY in case that check fails.
> >>>          To make the operation more likely to succeed, KSM should be
> >>>          disabled, fork() should be avoided or MADV_DONTFORK should be
> >>>          configured for the source VMA before fork().
> >>>
> >>>          This ioctl(2) operation returns 0 on success. In this case, the
> >>>          entire area was moved. On error, -1 is returned and errno is
> >>>          set to indicate the error. Possible errors include:
> >>>
> >>>          EAGAIN The number of bytes moved (i.e., the value returned in
> >>>                 the move field) does not equal the value that was
> >>>                 specified in the len field.
> >>>
> >>>          EINVAL Either dst or len was not a multiple of the system page
> >>>                 size, or the range specified by src and len or dst and len
> >>>                 was invalid.
> >>>
> >>>          EINVAL An invalid bit was specified in the mode field.
> >>>
> >>>          ENOENT
> >>>                 The source virtual memory range has unmapped holes and
> >>>                 UFFDIO_MOVE_MODE_ALLOW_SRC_HOLES is not set.
> >>>
> >>>          EEXIST
> >>>                 The destination virtual memory range is fully or partially
> >>>                 mapped.
> >>>
> >>>          EBUSY
> >>>                 The pages in the source virtual memory range are not
> >>>                 exclusive to the process. The kernel might only perform
> >>>                 lightweight checks for detecting whether the pages are
> >>>                 exclusive. To make the operation more likely to succeed,
> >>>                 KSM should be disabled, fork() should be avoided or
> >>>                 MADV_DONTFORK should be configured for the source virtual
> >>>                 memory area before fork().
> >>>
> >>>          ENOMEM Allocating memory needed for the operation failed.
> >>>
> >>>          ESRCH
> >>>                 The faulting process has exited at the time of a
> >>>                 UFFDIO_MOVE operation.
> >>>
> >>
> >> A general comment simply because I realized that just now: does anything
> >> speak against limiting the operations now to a single MM?
> >>
> >> The use cases I heard so far don't need it.
> >> If ever required, we could
> >> consider extending it.
> >>
> >> Let's reduce complexity and KIS unless really required.
> >
> > Let me check if there are use cases that require moves between MMs.
> > Andrea seems to have put considerable effort to make it work between
> > MMs and it would be a pity to lose that. I can send a follow-up patch
> > to recover that functionality and even if it does not get merged, it
> > can be used in the future as a reference. But first let me check if we
> > can drop it.

For the compaction use case that we have, it's fine to limit it to a
single MM. However, for general use, I think Peter will have a better idea.

>
> Yes, that sounds reasonable. Unless the big important use cases requires
> moving pages between processes, let's leave that as future work for now.
>
> --
> Cheers,
>
> David / dhildenb
>
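
For reference, the EAGAIN semantics in the quoted man page draft (the move
field reports how many bytes were actually moved) suggest that a caller may
want to retry the remainder after a partial move. Below is a minimal sketch
of such a loop, not taken from the patch. Everything named here is an
assumption made for illustration: struct move_req is a local mirror of the
quoted struct uffdio_move, move_fn is a hypothetical stand-in for
ioctl(uffd, UFFDIO_MOVE, &req), and the two fake_* functions only simulate
kernel behaviour so the loop can be exercised without userfaultfd support.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Local mirror of the struct uffdio_move layout quoted above (assumption:
 * real code would include <linux/userfaultfd.h> instead). */
struct move_req {
    uint64_t dst;   /* Destination of move */
    uint64_t src;   /* Source of move */
    uint64_t len;   /* Number of bytes to move */
    uint64_t mode;  /* Flags controlling behavior of move */
    int64_t  move;  /* Number of bytes moved, or negated error */
};

/* Hypothetical stand-in for ioctl(uffd, UFFDIO_MOVE, req): returns 0 on
 * success, or -1 with errno set, and fills in req->move. */
typedef int (*move_fn)(struct move_req *req);

/* Retry after partial moves. Returns 0 once the whole range has been moved,
 * or -1 with errno left as set by the failing call. */
int move_all(move_fn do_move, uint64_t dst, uint64_t src,
             uint64_t len, uint64_t mode)
{
    assert(do_move != NULL);
    while (len > 0) {
        struct move_req req = {
            .dst = dst, .src = src, .len = len, .mode = mode, .move = 0,
        };
        if (do_move(&req) == 0)
            return 0;   /* the entire remaining area was moved */
        /* Only EAGAIN with forward progress is worth retrying. */
        if (errno != EAGAIN || req.move <= 0 || (uint64_t)req.move > len)
            return -1;
        /* Partial move: skip what was already moved and try again. */
        dst += (uint64_t)req.move;
        src += (uint64_t)req.move;
        len -= (uint64_t)req.move;
    }
    return 0;
}

/* Illustration-only fakes, not kernel behaviour. */
#define CHUNK UINT64_C(4096)
unsigned fake_calls;

/* Moves at most CHUNK bytes per call, reporting EAGAIN when more remains. */
int fake_partial_move(struct move_req *req)
{
    fake_calls++;
    if (req->len <= CHUNK) {
        req->move = (int64_t)req->len;
        return 0;
    }
    req->move = (int64_t)CHUNK;
    errno = EAGAIN;
    return -1;
}

/* Always fails as if the source pages were not exclusive. */
int fake_busy_move(struct move_req *req)
{
    req->move = -EBUSY;
    errno = EBUSY;
    return -1;
}
```

With the fakes above, move_all(fake_partial_move, dst, src, 3 * CHUNK, 0)
completes after three calls, while fake_busy_move makes move_all give up
immediately with errno set to EBUSY. Whether retrying on EAGAIN is the right
policy for a real compaction thread is a design choice this sketch does not
settle.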