From: Lance Yang
Date: Thu, 18 Jan 2024 09:51:32 +0800
Subject: Re: [PATCH v1 1/2] mm/madvise: introduce MADV_TRY_COLLAPSE for attempted synchronous hugepage collapse
To: David Hildenbrand
Cc: "Zach O'Keefe", akpm@linux-foundation.org, songmuchun@bytedance.com, linux-kernel@vger.kernel.org, Yang Shi, Peter Xu, Michael Knyszek, Minchan Kim, Michal Hocko, linux-mm@kvack.org
References: <20240117050217.43610-1-ioworker0@gmail.com> <22b24ce9-d143-4b5f-87da-bf68e4fa46d3@redhat.com>
In-Reply-To: <22b24ce9-d143-4b5f-87da-bf68e4fa46d3@redhat.com>

Hey David,

Thanks for taking the time to review!

On Thu, Jan 18, 2024 at 02:41, David Hildenbrand wrote:
>
> On 17.01.24 18:10, Zach O'Keefe wrote:
> > [+linux-mm & others]
> >
> > On Tue, Jan 16, 2024 at 9:02 PM Lance Yang wrote:
> >>
> >> This idea was inspired by MADV_COLLAPSE introduced by Zach O'Keefe[1].
> >>
> >> Introduce a new madvise mode, MADV_TRY_COLLAPSE, that allows users to
> >> make a least-effort attempt at a synchronous collapse of memory at
> >> their own expense.
> >>
> >> The only difference from MADV_COLLAPSE is that the new hugepage allocation
> >> avoids direct reclaim and/or compaction, quickly failing on allocation errors.
> >>
> >> The benefits of this approach are:
> >>
> >> * CPU is charged to the process that wants to spend the cycles for the THP
> >> * Avoid unpredictable timing of khugepaged collapse
> >> * Prevent unpredictable stalls caused by direct reclaim and/or compaction
> >>
> >> Semantics
> >>
> >> This call is independent of the system-wide THP sysfs settings, but will
> >> fail for memory marked VM_NOHUGEPAGE.  If the ranges provided span
> >> multiple VMAs, the semantics of the collapse over each VMA is independent
> >> from the others.  This implies a hugepage cannot cross a VMA boundary.  If
> >> collapse of a given hugepage-aligned/sized region fails, the operation may
> >> continue to attempt collapsing the remainder of memory specified.
> >>
> >> The memory ranges provided must be page-aligned, but are not required to
> >> be hugepage-aligned.  If the memory ranges are not hugepage-aligned, the
> >> start/end of the range will be clamped to the first/last hugepage-aligned
> >> address covered by said range.  The memory ranges must span at least one
> >> hugepage-sized region.
> >>
> >> All non-resident pages covered by the range will first be
> >> swapped/faulted-in, before being internally copied onto a freshly
> >> allocated hugepage.
> >> Unmapped pages will have their data directly
> >> initialized to 0 in the new hugepage.  However, for every eligible
> >> hugepage-aligned/sized region to be collapsed, at least one page must
> >> currently be backed by memory (a PMD covering the address range must
> >> already exist).
> >>
> >> Allocation for the new hugepage will not enter direct reclaim and/or
> >> compaction, quickly failing if allocation fails. When the system has
> >> multiple NUMA nodes, the hugepage will be allocated from the node providing
> >> the most native pages. This operation operates on the current state of the
> >> specified process and makes no persistent changes or guarantees on how pages
> >> will be mapped, constructed, or faulted in the future.
> >>
> >> Return Value
> >>
> >> If all hugepage-sized/aligned regions covered by the provided range were
> >> either successfully collapsed, or were already PMD-mapped THPs, this
> >> operation will be deemed successful.  On success, madvise(2) returns 0.
> >> Else, -1 is returned and errno is set to indicate the error for the
> >> most-recently attempted hugepage collapse.  Note that many failures might
> >> have occurred, since the operation may continue to collapse in the event a
> >> single hugepage-sized/aligned region fails.
> >>
> >>         ENOMEM  Memory allocation failed or VMA not found
> >>         EBUSY   Memcg charging failed
> >>         EAGAIN  Required resource temporarily unavailable.  Trying again
> >>                 might succeed.
> >>         EINVAL  Other error: No PMD found, subpage doesn't have Present
> >>                 bit set, "Special" page not backed by struct page, VMA
> >>                 incorrectly sized, address not page-aligned, ...
> >>
> >> Use Cases
> >>
> >> An immediate user of this new functionality is the Go runtime heap allocator
> >> that manages memory in hugepage-sized chunks.
> >> In the past, whether it was a
> >> newly allocated chunk through mmap() or a reused chunk released by
> >> madvise(MADV_DONTNEED), the allocator attempted to eagerly back memory with
> >> huge pages using madvise(MADV_HUGEPAGE)[2] and madvise(MADV_COLLAPSE)[3]
> >> respectively. However, both approaches resulted in performance issues; for
> >> both scenarios, there could be entries into direct reclaim and/or compaction,
> >> leading to unpredictable stalls[4]. Now, the allocator can confidently use
> >> madvise(MADV_TRY_COLLAPSE) to attempt the allocation of huge pages.
> >>
> >> [1] https://github.com/torvalds/linux/commit/7d8faaf155454f8798ec56404faca29a82689c77
> >> [2] https://github.com/golang/go/commit/8fa9e3beee8b0e6baa7333740996181268b60a3a
> >> [3] https://github.com/golang/go/commit/9f9bb26880388c5bead158e9eca3be4b3a9bd2af
> >> [4] https://github.com/golang/go/issues/63334
> >
> > Thanks for the patch, Lance, and thanks for providing the links above,
> > referring to issues Go has seen.
> >
> > I've reached out to the Go team to try and understand their use case,
> > and how we could help. It's not immediately clear whether a
> > lighter-weight MADV_COLLAPSE is the answer, but it could turn out to
> > be.
> >
> > That said, with respect to the implementation, should a need for a
> > lighter-weight MADV_COLLAPSE be warranted, I'd personally like to see
> > process_madvise(2) be the "v2" of madvise(2), where we can start
> > leveraging the forward-facing flags argument for these different
> > advice flavors. We'd need to safely revert v5.10 commit a68a0262abdaa
> > ("mm/madvise: remove racy mm ownership check") so that
> > process_madvise(2) can always operate on self. IIRC, this was ~ the
> > plan we landed on during MADV_COLLAPSE dev discussions (i.e. pick a
> > sane default, and implement options in flags down the line).
>
> +1, using process_madvise() would likely be the right approach.

Thanks for your suggestion!
I completely agree :)

Lance

>
> --
> Cheers,
>
> David / dhildenb
>
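
For reference, the range clamping described under "Semantics" in the quoted
patch text can be sketched in userspace C roughly as below. This is only an
illustration, not code from the patch or the kernel: clamp_to_hugepages() is
a hypothetical helper name, and the 2 MiB hugepage size is an assumption
(PMD-sized THP on x86-64 with 4 KiB base pages). The start is rounded up and
the end rounded down, and a range that does not cover at least one whole
hugepage-sized region yields zero regions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 2 MiB PMD-sized hugepages (x86-64 with 4 KiB base pages). */
#define HPAGE_SIZE ((uintptr_t)2 << 20)
#define HPAGE_MASK (~(HPAGE_SIZE - 1))

/*
 * Hypothetical helper, for illustration only (not a kernel function).
 *
 * Clamp [start, start + len) to the first/last hugepage-aligned addresses
 * it covers: round the start up and the end down. Stores the clamped
 * bounds in *hstart and *hend and returns the number of whole
 * hugepage-sized regions between them, or 0 if the range covers none.
 */
static size_t clamp_to_hugepages(uintptr_t start, size_t len,
                                 uintptr_t *hstart, uintptr_t *hend)
{
    /* Guard against address wrap-around in start + len. */
    assert(len <= UINTPTR_MAX - start);

    uintptr_t end = start + len;

    /* A start within HPAGE_SIZE of the top of the address space cannot
     * be rounded up without overflow, and covers no whole hugepage. */
    if (start > UINTPTR_MAX - (HPAGE_SIZE - 1)) {
        *hstart = *hend = end & HPAGE_MASK;
        return 0;
    }

    *hstart = (start + HPAGE_SIZE - 1) & HPAGE_MASK; /* round up */
    *hend = end & HPAGE_MASK;                        /* round down */

    /* Rounding may cross over when the range is shorter than a hugepage. */
    if (*hend <= *hstart)
        return 0;

    return (size_t)((*hend - *hstart) / HPAGE_SIZE);
}
```

With these assumptions, a range starting at 0x1ff000 with length 0x401000
clamps to [0x200000, 0x600000) and covers two hugepage-sized regions, while
a 2 MiB range starting at 0x201000 covers none, since it straddles a
hugepage boundary without containing a whole aligned region.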