From: "Zach O'Keefe"
Date: Mon, 23 May 2022 17:18:32 -0700
Subject: [RFC] mm: MADV_COLLAPSE semantics
To: Alex Shi, David Hildenbrand, David Rientjes, Matthew Wilcox, Michal Hocko, Peter Xu, Song Liu, Yang Shi, linux-mm@kvack.org, rongwei.wang@linux.alibaba.com
Cc: Andrea Arcangeli, Axel Rasmussen, Hugh Dickins, "Kirill A.
 Shutemov", Minchan Kim, SeongJae Park, Pasha Tatashin

Hey All,

I'm sending this out before the v6 of "mm: userspace hugepage collapse"
to align on and finalize the semantics of the proposed MADV_COLLAPSE
madvise(2) mode.

Background:

So far, thanks to everyone's input, we've aligned on:

- MADV_COLLAPSE specifies its own hugepage allocation semantics (it
  allows direct reclaim/compaction).
- MADV_COLLAPSE ignores khugepaged heuristics
  (/sys/kernel/mm/transparent_hugepage/khugepaged/max_pte_* and
  young/referenced page requirements).

In terms of THP _eligibility_, in v5 it was proposed that MADV_COLLAPSE
follow existing THP eligibility semantics
(/sys/kernel/mm/transparent_hugepage/enabled + the VMA flags of the VMA
being collapsed)[1]. However, Rongwei Wang kindly pointed out that the
usability of process_madvise(MADV_COLLAPSE) on a system in "madvise" THP
mode was limited. I agreed to include process_madvise(2) support for
MADV_[NO]HUGEPAGE in v6, but following a discussion with David H., I
think that was a mistake. Namely, as David kindly pointed out, there
exist programs that don't work with THP and have good reason to disable
it. The example provided was postcopy live migration in QEMU, which
explicitly disables THP right before faulting in any pages.

Idea:

MADV_COLLAPSE should respect VM_NOHUGEPAGE and "never" THP mode, but
otherwise would attempt to collapse.

Why?

If someone(*), somewhere told us not to use THPs, then don't override
that decision. Otherwise, this is an explicit, safe(**) request made on
behalf of ourselves, or by a CAP_SYS_ADMIN process, and shouldn't be
blocked by interfaces meant to guide the "transparent" part of THPs.

Other options considered:

I considered variations of setting VM_HUGEPAGE only if calling on behalf
of self or if VM_NOHUGEPAGE is not set. However, I didn't like this
because there isn't a way to undo the operation: if we supported
process_madvise(MADV_NOHUGEPAGE), we would have to let the application
clear VM_NOHUGEPAGE, because outside processes can't/shouldn't. That
would require some *new* madvise mode like MADV_CLEARHUGEPAGE (which
would fail if called on behalf of another process with VM_NOHUGEPAGE
set) to clear VM_[NO]HUGEPAGE.

A possible downside to the proposed approach is that, if in "madvise"
THP mode and collapsing a VMA not marked VM_HUGEPAGE, it's now the
caller's responsibility to monitor and recollapse this memory back into
THPs. However, in practice this likely means an explicit MADV_DONTNEED
(please let me know if there are other important cases here), and
presumably it's the caller's job to do the monitoring anyway.

Thanks again for taking the time to read / provide input here. I think
this is the last point to clear up before releasing a v6 that should
hopefully have all the functionality we need.

Best,
Zach

---

(*) If we could verify that "never" THP mode was used _only_ for
debugging, then I'd actually opt to ignore "never" in MADV_COLLAPSE.
It's the last dependency MADV_COLLAPSE has on the sysfs THP interface,
and dropping it would provide a convenient way to test/debug
MADV_COLLAPSE with khugepaged / at-fault disabled.

(**) I suppose there could exist applications that see THP "madvise"
mode, never call MADV_HUGEPAGE, and so assume THPs will never be found.

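To make the proposed rule concrete, here is a minimal sketch of the
eligibility check next to a simplified model of the v5 behaviour. It is
not kernel code: the enum, the MODEL_VM_* flag values, and both function
names are hypothetical stand-ins chosen for illustration, and the v5
model only covers the sysfs "enabled" setting plus the two VMA flags.

```c
#include <stdbool.h>

/* Hypothetical model of /sys/kernel/mm/transparent_hugepage/enabled. */
enum thp_mode { THP_NEVER, THP_MADVISE, THP_ALWAYS };

/* Hypothetical stand-ins for the VMA flags; the values are illustrative. */
#define MODEL_VM_HUGEPAGE   0x1u
#define MODEL_VM_NOHUGEPAGE 0x2u

/*
 * Proposed rule: refuse only when THP was explicitly disabled, either
 * system-wide ("never") or on this VMA (VM_NOHUGEPAGE). Otherwise the
 * collapse is attempted, even in "madvise" mode without VM_HUGEPAGE.
 */
static bool madv_collapse_allowed(enum thp_mode mode, unsigned int vma_flags)
{
    if (mode == THP_NEVER)
        return false;
    if (vma_flags & MODEL_VM_NOHUGEPAGE)
        return false;
    return true;
}

/*
 * Simplified model of the v5 behaviour (existing THP eligibility), for
 * contrast: "madvise" mode additionally requires VM_HUGEPAGE on the VMA.
 */
static bool thp_eligible_v5(enum thp_mode mode, unsigned int vma_flags)
{
    if (vma_flags & MODEL_VM_NOHUGEPAGE)
        return false;
    switch (mode) {
    case THP_ALWAYS:
        return true;
    case THP_MADVISE:
        return (vma_flags & MODEL_VM_HUGEPAGE) != 0;
    default:
        return false;
    }
}
```

The two functions differ only for a VMA without VM_HUGEPAGE on a system
in "madvise" mode, which is the case Rongwei raised: the v5 model
refuses it, while the proposed rule allows the collapse.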
[1] https://lore.kernel.org/linux-mm/20220504214437.2850685-1-zokeefe@google.com/
[2] https://lore.kernel.org/linux-mm/502a3ced-f3c6-7117-3b24-d80d204d66ee@linux.alibaba.com/