From: Lance Yang
Date: Fri, 19 Jan 2024 10:37:14 +0800
Subject: Re: [PATCH v2 1/1] mm/madvise: add MADV_F_COLLAPSE_LIGHT to process_madvise()
To: Yang Shi
Cc: "Zach O'Keefe", Michal Hocko, akpm@linux-foundation.org, david@redhat.com, songmuchun@bytedance.com, peterx@redhat.com, mknyszek@google.com, minchan@kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, linux-api@vger.kernel.org
References: <20240118120347.61817-1-ioworker0@gmail.com>

Hey Yang,

Thanks for taking the time to review!

On Fri, Jan 19, 2024 at 3:00 AM Yang Shi wrote:
>
> On Thu, Jan 18, 2024 at 6:59 AM Zach O'Keefe wrote:
> >
> > On Thu, Jan 18, 2024 at 5:43 AM Michal Hocko wrote:
> > >
> > > Dang, forgot to cc linux-api...
> > >
> > > On Thu 18-01-24 14:40:19, Michal Hocko wrote:
> > > > On Thu 18-01-24 20:03:46, Lance Yang wrote:
> > > > [...]
> > > >
> > > > before we discuss the semantic, let's focus on the usecase.
> > > >
> > > > > Use Cases
> > > > >
> > > > > An immediate user of this new functionality is the Go runtime heap allocator
> > > > > that manages memory in hugepage-sized chunks. In the past, whether it was a
> > > > > newly allocated chunk through mmap() or a reused chunk released by
> > > > > madvise(MADV_DONTNEED), the allocator attempted to eagerly back memory with
> > > > > huge pages using madvise(MADV_HUGEPAGE)[2] and madvise(MADV_COLLAPSE)[3]
> > > > > respectively. However, both approaches resulted in performance issues; for
> > > > > both scenarios, there could be entries into direct reclaim and/or compaction,
> > > > > leading to unpredictable stalls[4]. Now, the allocator can confidently use
> > > > > process_madvise(MADV_F_COLLAPSE_LIGHT) to attempt the allocation of huge pages.
> >
> > Aside: The thought was a MADV_F_COLLAPSE_LIGHT _flag_; so it'd be
> > process_madvise(..., MADV_COLLAPSE, MADV_F_COLLAPSE_LIGHT)
> >
> > > > IIUC the primary reason is the cost of the huge page allocation which
> > > > can be really high if the memory is heavily fragmented and it is called
> > > > synchronously from the process directly, correct? Can that be worked
> > > > around by process_madvise and performing the operation from a different
> > > > context? Are there any other reasons to have a different mode?
> > > >
> > > > I mean I can think of a more relaxed (opportunistic) MADV_COLLAPSE -
> > > > e.g. non blocking one to make sure that the caller doesn't really block
> > > > on resource contention (be it locks or memory availability) because that
> > > > matches our non-blocking interface in other areas but having a LIGHT
> > > > operation sounds really vague and the exact semantic would be
> > > > implementation specific and might change over time. Non-blocking has a
> > > > clear semantic but it is not really clear whether that is what you
> > > > really need/want.
> >
> > IIUC, usecase from Go is unbounded latency due to sync compaction in a
> > context where the latency is unacceptable. Working w/ them to
> > understand how things can be improved -- it's possible the changes can
> > occur entirely on their side, w/o any additional kernel support.
> >
> > The non-blocking case awkwardly sits between MADV_COLLAPSE today, and
> > khugepaged; esp when common case is that the allocation can probably
> > be satisfied in fast path.
> >
> > The suggestion for something like "LIGHT" was intentionally vague
> > because it could allow for other optimizations / changes down the
> > line, as you point out. I think that might be a win, vs tying to a
> > specific optimization (e.g. like a MADV_F_COLLAPSE_NODEFRAG). But I
> > could be alone on that front, given the design of
> > /sys/kernel/mm/transparent_hugepage.
>
> Per the description Go marks the address spaces with MADV_HUGEPAGE. It
> means the application really wants to have huge page back the address
> space so kernel will try as hard as possible to get huge page. This is
> the default behavior of MADV_HUGEPAGE. If they don't want to enter
> direct reclaim, they can configure the defrag mode to "defer", which
> means no direct reclaim and wakeup kswapd and kcompactd, and rely on
> khugepaged to install huge page later on. But this mode is not
> supported by khugepaged defrag, so MADV_COLLAPSE may not support it
> (IIRC MADV_COLLAPSE uses khugepaged defrag mode).
> Or they can just not call MADV_HUGEPAGE and leave the decision to the
> users, IIRC Java does so (specifying a flag to indicate use huge page or
> not by the users).

Thanks for the insights into how Go uses MADV_HUGEPAGE and into the defrag
mode options. As you point out, the "defer" mode is not supported by
khugepaged defrag, so MADV_COLLAPSE may not support it. That leaves a gap
for applications that want a lighter-weight alternative to MADV_HUGEPAGE.

MADV_F_COLLAPSE_LIGHT aims to fill this gap with a more flexible and
opportunistic approach. It targets latency-sensitive applications that
want the performance benefits of huge pages but prefer to avoid direct
reclaim and compaction, and it gives those users more control over the
behavior without the constraints of the existing configurations.

In cloud-native environments, it's hard for users to know the THP
configuration of every node in a cluster, let alone have fine-grained
control over it. Disabling huge pages altogether out of concern about
direct reclaim and compaction would be regrettable, because users would
lose the opportunity to experiment with huge pages at all. Relying solely
on MADV_HUGEPAGE, on the other hand, carries the risk of unpredictable
stalls, so users are left with a trade-off to weigh carefully.

MADV_F_COLLAPSE_LIGHT would give users in such environments one more
choice, letting them better balance performance requirements against
resource management.

Thanks again for your review and your suggestion!
Lance

>
> >
> > But circling back, I agree w/ you that the first order of business is to
> > iron out a real usecase.
> > As of right now, it's not clear something like this is required or
> > helpful.
> >
> > Thanks,
> > Zach
> >
> >
> > > >
> > > > > [1] https://github.com/torvalds/linux/commit/7d8faaf155454f8798ec56404faca29a82689c77
> > > > > [2] https://github.com/golang/go/commit/8fa9e3beee8b0e6baa7333740996181268b60a3a
> > > > > [3] https://github.com/golang/go/commit/9f9bb26880388c5bead158e9eca3be4b3a9bd2af
> > > > > [4] https://github.com/golang/go/issues/63334
> > > > >
> > > > > [v1] https://lore.kernel.org/lkml/20240117050217.43610-1-ioworker0@gmail.com/
> > > > --
> > > > Michal Hocko
> > > > SUSE Labs
> > >
> > > --
> > > Michal Hocko
> > > SUSE Labs