From: Yang Shi
Date: Thu, 18 Jan 2024 11:00:21 -0800
Subject: Re: [PATCH v2 1/1] mm/madvise: add MADV_F_COLLAPSE_LIGHT to process_madvise()
To: "Zach O'Keefe"
Cc: Michal Hocko, Lance Yang, akpm@linux-foundation.org, david@redhat.com,
    songmuchun@bytedance.com, peterx@redhat.com, mknyszek@google.com,
    minchan@kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
    linux-api@vger.kernel.org
References: <20240118120347.61817-1-ioworker0@gmail.com>

On Thu, Jan 18, 2024 at 6:59 AM Zach O'Keefe wrote:
>
> On Thu, Jan 18, 2024 at 5:43 AM Michal Hocko wrote:
> >
> > Dang, forgot to cc linux-api...
> >
> > On Thu 18-01-24 14:40:19, Michal Hocko wrote:
> > > On Thu 18-01-24 20:03:46, Lance Yang wrote:
> > > [...]
> > >
> > > before we discuss the semantic, let's focus on the usecase.
> > >
> > > > Use Cases
> > > >
> > > > An immediate user of this new functionality is the Go runtime heap allocator
> > > > that manages memory in hugepage-sized chunks. In the past, whether it was a
> > > > newly allocated chunk through mmap() or a reused chunk released by
> > > > madvise(MADV_DONTNEED), the allocator attempted to eagerly back memory with
> > > > huge pages using madvise(MADV_HUGEPAGE)[2] and madvise(MADV_COLLAPSE)[3]
> > > > respectively. However, both approaches resulted in performance issues; for
> > > > both scenarios, there could be entries into direct reclaim and/or compaction,
> > > > leading to unpredictable stalls[4]. Now, the allocator can confidently use
> > > > process_madvise(MADV_F_COLLAPSE_LIGHT) to attempt the allocation of huge pages.
>
> Aside: The thought was a MADV_F_COLLAPSE_LIGHT _flag_; so it'd be
> process_madvise(..., MADV_COLLAPSE, MADV_F_COLLAPSE_LIGHT)
>
> > > IIUC the primary reason is the cost of the huge page allocation which
> > > can be really high if the memory is heavily fragmented and it is called
> > > synchronously from the process directly, correct? Can that be worked
> > > around by process_madvise and performing the operation from a different
> > > context? Are there any other reasons to have a different mode?
> > >
> > > I mean I can think of a more relaxed (opportunistic) MADV_COLLAPSE -
> > > e.g. non blocking one to make sure that the caller doesn't really block
> > > on resource contention (be it locks or memory availability) because that
> > > matches our non-blocking interface in other areas but having a LIGHT
> > > operation sounds really vague and the exact semantic would be
> > > implementation specific and might change over time.
> > > Non-blocking has a
> > > clear semantic but it is not really clear whether that is what you
> > > really need/want.
>
> IIUC, usecase from Go is unbounded latency due to sync compaction in a
> context where the latency is unacceptable. Working w/ them to
> understand how things can be improved -- it's possible the changes can
> occur entirely on their side, w/o any additional kernel support.
>
> The non-blocking case awkwardly sits between MADV_COLLAPSE today, and
> khugepaged; esp when common case is that the allocation can probably
> be satisfied in fast path.
>
> The suggestion for something like "LIGHT" was intentionally vague
> because it could allow for other optimizations / changes down the
> line, as you point out. I think that might be a win, vs tying to a
> specific optimization (e.g. like a MADV_F_COLLAPSE_NODEFRAG). But I
> could be alone on that front, given the design of
> /sys/kernel/mm/transparent_hugepage.

Per the description, Go marks the address space with MADV_HUGEPAGE.
That means the application really wants huge pages to back the address
space, so the kernel will try as hard as possible to get a huge page.
This is the default behavior of MADV_HUGEPAGE.

If they don't want to enter direct reclaim, they can set the defrag
mode to "defer", which means no direct reclaim: the kernel wakes up
kswapd and kcompactd and relies on khugepaged to install the huge page
later on. But this mode is not supported by khugepaged defrag, so
MADV_COLLAPSE may not support it (IIRC MADV_COLLAPSE uses the
khugepaged defrag mode).

Or they can just not call MADV_HUGEPAGE and leave the decision to the
users. IIRC Java does so (users specify a flag to indicate whether to
use huge pages).

>
> But circling back, I agree w/ you that the first order of business is to
> iron out a real usecase. As of right now, it's not clear something
> like this is required or helpful.
>
> Thanks,
> Zach
>
> > >
> > > > [1] https://github.com/torvalds/linux/commit/7d8faaf155454f8798ec56404faca29a82689c77
> > > > [2] https://github.com/golang/go/commit/8fa9e3beee8b0e6baa7333740996181268b60a3a
> > > > [3] https://github.com/golang/go/commit/9f9bb26880388c5bead158e9eca3be4b3a9bd2af
> > > > [4] https://github.com/golang/go/issues/63334
> > > >
> > > > [v1] https://lore.kernel.org/lkml/20240117050217.43610-1-ioworker0@gmail.com/
> > > --
> > > Michal Hocko
> > > SUSE Labs
> >
> > --
> > Michal Hocko
> > SUSE Labs
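
For reference, the defrag settings mentioned in the reply can be
inspected from userspace. The following is only a minimal sketch, not
part of the patch or of any runtime: it assumes the usual sysfs layout
(/sys/kernel/mm/transparent_hugepage/defrag holding a bracketed choice,
khugepaged/defrag holding 0 or 1), and the helper names active_mode,
may_stall_on_fault and read_setting are made up for illustration. The
stall rule is a simplified reading of the kernel admin guide and covers
page faults only, not MADV_COLLAPSE.

```python
from pathlib import Path
from typing import Optional

# Standard sysfs location of the THP knobs on Linux.
THP_DIR = Path("/sys/kernel/mm/transparent_hugepage")


def active_mode(sysfs_text: str) -> str:
    """Return the bracketed token of a multi-choice sysfs file.

    Input looks like "always defer defer+madvise [madvise] never".
    Single-value files (e.g. "1") are returned stripped.
    """
    for token in sysfs_text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return sysfs_text.strip()


def may_stall_on_fault(mode: str, has_madv_hugepage: bool) -> bool:
    """Simplified rule: can a THP page fault enter direct reclaim/compaction?

    has_madv_hugepage says whether the region was marked with
    madvise(MADV_HUGEPAGE).
    """
    if mode == "always":
        return True
    if mode in ("madvise", "defer+madvise"):
        return has_madv_hugepage
    # "defer" only wakes kswapd/kcompactd; "never" does no defrag work.
    return False


def read_setting(path: Path) -> Optional[str]:
    """Read one sysfs file; None if it is missing or unreadable."""
    try:
        return active_mode(path.read_text())
    except OSError:
        return None


if __name__ == "__main__":
    fault_mode = read_setting(THP_DIR / "defrag")
    khugepaged = read_setting(THP_DIR / "khugepaged" / "defrag")
    print("page-fault defrag mode:", fault_mode)
    print("khugepaged defrag (0 or 1):", khugepaged)
    if fault_mode is not None:
        print("MADV_HUGEPAGE fault may stall:",
              may_stall_on_fault(fault_mode, True))
```

With the "defer" mode, may_stall_on_fault() returns False even for a
region marked with MADV_HUGEPAGE, which matches the suggestion above;
the khugepaged knob is a plain on/off switch and has no "defer" choice.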