From: "Zach O'Keefe"
Date: Mon, 21 Mar 2022 08:46:35 -0700
Subject: Re: [RFC PATCH 00/14] mm: userspace hugepage collapse
To: Michal Hocko
Cc: Alex Shi, David Hildenbrand, David Rientjes, Pasha Tatashin,
 SeongJae Park, Song Liu, Vlastimil Babka, Zi Yan, linux-mm@kvack.org,
 Andrea Arcangeli, Andrew Morton, Arnd Bergmann, Axel Rasmussen,
 Chris Kennelly, Chris Zankel, Helge Deller, Hugh Dickins,
 Ivan Kokshaysky, "James E.J. Bottomley", Jens Axboe,
 "Kirill A. Shutemov", Matthew Wilcox, Matt Turner, Max Filippov,
 Miaohe Lin, Minchan Kim, Patrick Xia, Pavel Begunkov, Peter Xu,
 Thomas Bogendoerfer, Yang Shi
References: <20220308213417.1407042-1-zokeefe@google.com>

Hey Michal, thanks for taking the time to review / comment.

On Mon, Mar 21, 2022 at 7:38 AM Michal Hocko wrote:
>
> [ Removed Richard Henderson from the CC list as the delivery fails for
> his address]

Thank you :)

> On Tue 08-03-22 13:34:03, Zach O'Keefe wrote:
> > Introduction
> > --------------------------------
> >
> > This series provides a mechanism for userspace to induce a collapse of
> > eligible ranges of memory into transparent hugepages in process context,
> > thus permitting users to more tightly control their own hugepage
> > utilization policy at their own expense.
> >
> > This idea was previously introduced by David Rientjes, and thanks to
> > everyone for your patience while I prepared these patches resulting from
> > that discussion[1].
> >
> > [1] https://lore.kernel.org/all/C8C89F13-3F04-456B-BA76-DE2C378D30BF@nvidia.com/
> >
> > Interface
> > --------------------------------
> >
> > The proposed interface adds a new madvise(2) mode, MADV_COLLAPSE, and
> > leverages the new process_madvise(2) call.
> >
> > (*) process_madvise(2)
> >
> > Performs a synchronous collapse of the native pages mapped by
> > the list of iovecs into transparent hugepages. The default gfp
> > flags used will be the same as those used at-fault for the VMA
> > region(s) covered.
>
> Could you expand on reasoning here? The default allocation mode for #PF
> is rather light. Madvised will try harder. The reasoning is that we want
> to make stalls due to #PF as small as possible and only try harder for
> madvised areas (also a subject of configuration). Wouldn't it make more
> sense to try harder for an explicit calls like madvise?
>

The reasoning is that the user has presumably configured the system/vmas
to tell the kernel how badly they want thps, so this call aligns with
current expectations. I.e. a user who goes to the trouble of trying to
fault in a thp at a given memory address likely wants that thp just as
badly as the same user MADV_COLLAPSE'ing the same memory. If this is
not the case, then the MADV_F_COLLAPSE_DEFRAG flag could be used to
explicitly request that the kernel try harder, as you mention.

> > When multiple VMA regions are spanned, if
> > faulting-in memory from any VMA would permit synchronous
> > compaction and reclaim, then all hugepage allocations required
> > to satisfy the request may enter compaction and reclaim.
>
> I am not sure I follow here. Let's have a memory range spanning two
> vmas, one with MADV_HUGEPAGE.

I think you are rightly confused here, since the code doesn't currently
match this description - thanks for pointing it out. The idea* was that,
in the case you provided, the gfp flags used for all thp allocations
would match those used for a MADV_HUGEPAGE vma, under current system
settings. IOW, we treat the semantics of the collapse for the entire
range uniformly (aside from MADV_NOHUGEPAGE, as per earlier
discussions).
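
To make the two-vma case concrete, here is a toy userspace model of the
intended semantics. It is not kernel code and not what the series
currently does: struct toy_vma, fault_may_direct_reclaim() and
range_may_direct_reclaim() are made-up names for illustration, and the
enum only assumes the documented meaning of the
transparent_hugepage/defrag settings.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Mirrors the values accepted by transparent_hugepage/defrag in sysfs. */
enum thp_defrag {
	DEFRAG_ALWAYS,
	DEFRAG_DEFER,
	DEFRAG_DEFER_MADVISE,
	DEFRAG_MADVISE,
	DEFRAG_NEVER,
};

/* Hypothetical stand-in for a vma: only records MADV_HUGEPAGE. */
struct toy_vma {
	bool madv_hugepage;
};

/* Would a thp fault in this vma be allowed to reclaim/compact directly? */
bool fault_may_direct_reclaim(enum thp_defrag defrag, const struct toy_vma *vma)
{
	assert(vma != NULL);

	switch (defrag) {
	case DEFRAG_ALWAYS:
		return true;
	case DEFRAG_DEFER_MADVISE:
	case DEFRAG_MADVISE:
		return vma->madv_hugepage;
	default:
		/* "defer" only wakes background threads; "never" does nothing. */
		return false;
	}
}

/*
 * Uniform policy described in the cover letter: if any vma in the range
 * would permit direct reclaim at fault time, every allocation may use it.
 */
bool range_may_direct_reclaim(enum thp_defrag defrag,
			      const struct toy_vma *vmas, size_t nr)
{
	for (size_t i = 0; i < nr; i++)
		if (fault_may_direct_reclaim(defrag, &vmas[i]))
			return true;
	return false;
}
```

With defrag set to "madvise" and a range of two vmas of which only the
first has MADV_HUGEPAGE, range_may_direct_reclaim() is true for the whole
range, whereas fault_may_direct_reclaim() on the second vma alone is
false. That gap is the difference between the uniform semantics and a
per-vma evaluation.
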
So, for example, if transparent_hugepage/enabled was set to "always" and
transparent_hugepage/defrag was set to "madvise", then all allocations
could enter direct reclaim. There are two reasons for this. First, the
user has already told us that entering direct reclaim is tolerable for
this syscall, and they can wait. Second, MADV_COLLAPSE might otherwise
yield confusing results: some ranges might get backed by thps, while
others might not. Also, a single MADV_HUGEPAGE vma early in the range
might trigger enough reclaim/compaction for subsequent
non-MADV_HUGEPAGE allocations to succeed where they otherwise might not
have.

However, the code and this description disagree, since madvise
decomposes a call spanning multiple vmas into successive
madvise_vma_behavior() calls, each over a single vma, with no state
shared between them. If the motivation above is sufficient, then this
could be added.

>
> > Diverging from the at-fault semantics, VM_NOHUGEPAGE is ignored
> > by default, as the user is explicitly requesting this action.
> > Define two flags to control collapse semantics, passed through
> > process_madvise(2)’s optional flags parameter:
>
> This part is discussed later in the thread.
>
> >
> > MADV_F_COLLAPSE_LIMITS
> >
> > If supplied, collapse respects pte collapse limits set via
> > sysfs:
> > /transparent_hugepage/khugepaged/max_ptes_[none|swap|shared].
> > Required if calling on behalf of another process and not
> > CAP_SYS_ADMIN.
> >
> > MADV_F_COLLAPSE_DEFRAG
> >
> > If supplied, permit synchronous compaction and reclaim,
> > regardless of VMA flags.
>
> Why do we need this?

Do you mean MADV_F_COLLAPSE_DEFRAG specifically, or both?

* MADV_F_COLLAPSE_LIMITS is included because we'd like some form of
  inter-process protection when collapsing memory in another process's
  address space (which a malicious program could otherwise exploit to
  cause oom conditions in another memcg hierarchy, for example), but we
  still want privileged (CAP_SYS_ADMIN) users to be able to optimize
  thp utilization as they wish.

* MADV_F_COLLAPSE_DEFRAG is useful as mentioned above, where we want to
  explicitly tell the kernel to try harder to back this memory with
  thps, regardless of the current system/vma configuration.

Note that, used together, these flags make it possible to implement the
exact behavior of khugepaged through MADV_COLLAPSE.

> --
> Michal Hocko
> SUSE Labs