From: Yang Shi
Date: Mon, 5 Feb 2024 11:41:02 -0800
Subject: Re: [PATCH 1/1] mm/khugepaged: skip copying lazyfree pages on collapse
To: Lance Yang
Cc: Michal Hocko, David Hildenbrand, akpm@linux-foundation.org,
    zokeefe@google.com, songmuchun@bytedance.com, peterx@redhat.com,
    minchan@kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org
References: <20240201125226.28372-1-ioworker0@gmail.com>

On Fri, Feb 2, 2024 at 8:17 PM Lance Yang wrote:
>
> Hey Michal, David, Yang,
>
> I sincerely appreciate your time!
>
> I still have two questions that are perplexing me.
>
> First question:
> Given that khugepaged doesn't treat MADV_FREE
> pages as pte_none, why skip the 2M block when all
> the pages within the range are old and unreferenced,
> but won't skip if the partial range is MADV_FREE,
> even if it's not redirtied? Why make this distinction?
> Would it not be more straightforward to maintain
> if either all were skipped or not?

It is just a heuristic in the code, and possibly a somewhat arbitrary
choice. It could be controlled in a more fine-grained way if we really
see some workloads benefit.

>
> Second question:
> Does copying lazyfree pages (not redirtied) to the
> new huge page during khugepaged collapse
> undermine the semantics of MADV_FREE?
> Users mark pages as lazyfree with MADV_FREE,
> expecting these pages to be eventually reclaimed.
> Even without subsequent writes, these pages will
> no longer be reclaimed, even if memory pressure
> occurs.

Yeah, it just means khugepaged wins the race against page reclaim. I
suppose delayed freeing is one of the design goals of MADV_FREE, and
the risk is that the pages may never end up being freed. If you want
immediate freeing or more deterministic behavior, you should use
MADV_DONTNEED or munmap IIUC.

>
> BR,
> Lance
>
> On Sat, Feb 3, 2024 at 1:42 AM Yang Shi wrote:
> >
> > On Fri, Feb 2, 2024 at 6:53 AM Lance Yang wrote:
> > >
> > > How about blocking khugepaged from
> > > collapsing lazyfree pages? This way,
> > > is it not better to keep the semantics
> > > of MADV_FREE?
> > >
> > > What do you think?
> >
> > First of all, khugepaged doesn't treat MADV_FREE pages as pte_none
> > IIUC. khugepaged does skip the 2M block if all the pages in the
> > range are old and unreferenced in hpage_collapse_scan_pmd(), then
> > repeats the check in collapse_huge_page().
> >
> > And MADV_FREE pages are just old and unreferenced. This is actually
> > what your first test case does. The whole 2M range is a MADV_FREE
> > range, so the pages are skipped by khugepaged.
> >
> > But if the partial range is MADV_FREE, khugepaged won't skip them.
> > This is what your second test case does.
> >
> > Secondly, I think it depends on the semantics of MADV_FREE,
> > particularly how to treat the redirtied pages. TBH I'm always confused
> > by the semantics. For example, the page contained "abcd", then it was
> > MADV_FREE'ed, then it was written again with "1234" after "abcd". So
> > should the user expect to see "abcd1234" or "00001234"?
> >
> > I suppose it should be "abcd1234" since MADV_FREE pages are still
> > valid and available; if I'm wrong please feel free to correct me. If
> > so we should always copy MADV_FREE pages in khugepaged regardless of
> > whether they are redirtied or not, otherwise it may incur data
> > corruption. If we don't copy, then the follow-up redirty after
> > collapse to the hugepage may return "00001234", right?
> >
> > The current behavior is copying the page.
> >
> > >
> > > Thanks,
> > > Lance
> > >
> > > On Fri, Feb 2, 2024 at 10:42 PM Michal Hocko wrote:
> > > >
> > > > On Fri 02-02-24 21:46:45, Lance Yang wrote:
> > > > > Here is a part from the man page explaining
> > > > > the MADV_FREE semantics:
> > > > >
> > > > > The kernel can thus free these pages, but the
> > > > > freeing could be delayed until memory pressure
> > > > > occurs. For each of the pages that has been
> > > > > marked to be freed but has not yet been freed,
> > > > > the free operation will be canceled if the caller
> > > > > writes into the page. If there is no subsequent
> > > > > write, the kernel can free the pages at any time.
> > > > >
> > > > > IIUC, if there is no subsequent write, lazyfree
> > > > > pages will eventually be reclaimed.
> > > >
> > > > If there is no memory pressure then this might not
> > > > ever happen. User cannot make any assumption about
> > > > their content once madvise call has been done. The
> > > > content has to be considered lost.
> > > > Sure the userspace
> > > > might have means to tell those pages from zero pages
> > > > and recheck after the write but that is about it.
> > > >
> > > > > khugepaged
> > > > > treats lazyfree pages the same as pte_none,
> > > > > avoiding copying them to the new huge page
> > > > > during collapse. It seems that lazyfree pages
> > > > > are reclaimed before khugepaged collapses them.
> > > > > This aligns with user expectations.
> > > > >
> > > > > However, IMO, if the content of MADV_FREE pages
> > > > > remains valid during collapse, then khugepaged
> > > > > treating lazyfree pages the same as pte_none
> > > > > might not be suitable.
> > > >
> > > > Why?
> > > >
> > > > Unless I am missing something (which is possible of
> > > > course) I do not really see why dropping the content
> > > > of those pages and replacing them with a THP is any
> > > > different from reclaiming those pages and then faulting
> > > > in a non-THP zero page.
> > > >
> > > > Now, if khugepaged reused the original content of MADV_FREE
> > > > pages that would be a slightly different story. I can
> > > > see why users would expect zero pages to back madvised
> > > > area.
> > > > --
> > > > Michal Hocko
> > > > SUSE Labs
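
For anyone who wants to poke at the "abcd1234" versus "00001234" question
quoted above from userspace, here is a minimal sketch. It assumes Linux
(4.5 or later for MADV_FREE) and Python 3.8+, and the helper name
redirty_after_advice is made up for illustration. It says nothing about
what khugepaged itself does; it only shows the two outcomes a caller
should be prepared for after a redirty.

```python
import mmap


def redirty_after_advice(advice: int, first: bytes = b"abcd",
                         second: bytes = b"1234") -> bytes:
    """Write `first`, apply madvise(advice), write `second` right after it,
    then read both regions back."""
    # MADV_FREE only applies to private anonymous memory, so map exactly that.
    buf = mmap.mmap(-1, mmap.PAGESIZE,
                    flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    try:
        end = len(first) + len(second)
        buf[:len(first)] = first       # dirty the page
        buf.madvise(advice)            # advise the kernel about the whole mapping
        buf[len(first):end] = second   # redirty the same page
        return buf[:end]
    finally:
        buf.close()


if __name__ == "__main__":
    # MADV_DONTNEED drops the page right away, so the old bytes read as zeros.
    print("MADV_DONTNEED:", redirty_after_advice(mmap.MADV_DONTNEED))
    # MADV_FREE: old bytes survive unless reclaim got to the page first.
    print("MADV_FREE:    ", redirty_after_advice(mmap.MADV_FREE))
```

With MADV_DONTNEED the first four bytes come back as zeros. With MADV_FREE
the result is either b"abcd1234" (the write cancelled the lazy free) or four
zero bytes followed by b"1234" (reclaim won the race), and on an idle machine
the former is by far the more likely outcome.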