From: Barry Song <21cnbao@gmail.com>
Date: Tue, 27 Feb 2024 20:54:36 +1300
Subject: Re: [PATCH 1/1] mm/madvise: enhance lazyfreeing with mTHP in madvise_free
To: Yin Fengwei
Cc: Ryan Roberts, Lance Yang, David Hildenbrand, akpm@linux-foundation.org,
 linux-kernel@vger.kernel.org, linux-mm@kvack.org, mhocko@suse.com,
 minchan@kernel.org, peterx@redhat.com, shy828301@gmail.com,
 songmuchun@bytedance.com, wangkefeng.wang@huawei.com, zokeefe@google.com
In-Reply-To: <009e5633-decb-4c21-b5fc-58984fbade96@intel.com>
References: <20240226083714.26187-1-ioworker0@gmail.com>
 <9bcf5141-7376-441e-bbe3-779956ef28b9@redhat.com>
 <318be511-06de-423e-8216-af869f27f849@arm.com>
 <19758162-be5f-4dc4-b316-77b0115d12ce@intel.com>
 <3c56d7b8-b76d-4210-b431-ee6431775ba7@intel.com>
 <6ea0020a-8f4b-44d1-a3b2-7c2905d32772@intel.com>
 <009e5633-decb-4c21-b5fc-58984fbade96@intel.com>

On Tue, Feb 27, 2024 at 8:42 PM Yin Fengwei wrote:
>
>
>
> On 2/27/24 15:21, Barry Song wrote:
> > On Tue, Feb 27, 2024 at 8:11 PM Barry Song <21cnbao@gmail.com> wrote:
> >>
> >> On Tue, Feb 27, 2024 at 8:02 PM Yin Fengwei wrote:
> >>>
> >>>
> >>>
> >>> On 2/27/24 14:40, Barry Song wrote:
> >>>> On Tue, Feb 27, 2024 at 7:14 PM Yin Fengwei wrote:
> >>>>>
> >>>>>
> >>>>>
> >>>>> On 2/27/24 10:17, Barry Song wrote:
> >>>>>>> Like if we hit folio which is partially mapped to the range, don't split it but
> >>>>>>> just unmap the mapping part from the range. Let page reclaim decide whether
> >>>>>>> split the large folio or not (If it's not mapped to any other range,it will be
> >>>>>>> freed as whole large folio. If part of it still mapped to other range,page reclaim
> >>>>>>> can decide whether to split it or ignore it for current reclaim cycle).
> >>>>>> Yes, we can. but we still have to play the ptes check game to avoid adding
> >>>>>> folios multiple times to reclaim the list.
> >>>>>>
> >>>>>> I don't see too much difference between splitting in madvise and splitting
> >>>>>> in vmscan. as our real purpose is avoiding splitting entirely mapped
> >>>>>> large folios. for partial mapped large folios, if we split in madvise, then
> >>>>>> we don't need to play the game of skipping folios while iterating PTEs.
> >>>>>> if we don't split in madvise, we have to make sure the large folio is only
> >>>>>> added in reclaimed list one time by checking if PTEs belong to the
> >>>>>> previous added folio.
> >>>>>
> >>>>> If the partial mapped large folio is unmapped from the range, the related PTE
> >>>>> become none. How could the folio be added to reclaimed list multiple times?
> >>>>
> >>>> in case we have 16 PTEs in a large folio.
> >>>> PTE0 present
> >>>> PTE1 present
> >>>> PTE2 present
> >>>> PTE3 none
> >>>> PTE4 present
> >>>> PTE5 none
> >>>> PTE6 present
> >>>> ....
> >>>> the current code is scanning PTE one by one.
> >>>> while scanning PTE0, we have added the folio. then PTE1, PTE2, PTE4, PTE6...
> >>> No. Before detect the folio is fully mapped to the range, we can't add folio
> >>> to reclaim list because the partial mapped folio shouldn't be added.
> >>> We can only scan PTE15 and know it's fully mapped.
> >>
> >> you never know PTE15 is the last one mapping to the large folio, PTE15 can
> >> be mapping to a completely different folio with PTE0.
> >>
> >>>
> >>> So, when scanning PTE0, we will not add folio. Then when hit PTE3, we know
> >>> this is a partial mapped large folio. We will unmap it. Then all 16 PTEs
> >>> become none.
> >>
> >> I don't understand why all 16PTEs become none as we set PTEs to none.
> >> we set PTEs to swap entries till try_to_unmap_one called by vmscan.
> >>
> >>>
> >>> If the large folio is fully mapped, the folio will be added to reclaim list
> >>> after scan PTE15 and know it's fully mapped.
> >>
> >> our approach is calling pte_batch_pte while meeting the first pte, if
> >> pte_batch_pte = 16,
> >> then we add this folio to reclaim_list and skip the left 15 PTEs.
> >
> > Let's compare two different implementation, for partial mapped large folio
> > with 8 PTEs as below,
> >
> > PTE0 present for large folio1
> > PTE1 present for large folio1
> > PTE2 present for another folio2
> > PTE3 present for another folio3
> > PTE4 present for large folio1
> > PTE5 present for large folio1
> > PTE6 present for another folio4
> > PTE7 present for another folio5
> >
> > If we don't split in madvise(depend on vmscan to split after adding
> > folio1), we will have
> Let me clarify something here:
>
> I prefer that we don't split large folio here. Instead, we unmap the
> large folio from this VMA range (I think you missed the unmap operation
> I mentioned).

I don't understand why we would unmap here: this is MADV_PAGEOUT, not an
unmap. Unmapping totally changes the semantics. Would you like to show
pseudo code?

For MADV_PAGEOUT, the last step of swap-out is writing swap entries to
replace the PTEs which are present. I don't understand how an unmap can
be involved in this process.

>
> The intention is trying best to avoid splitting the large folio.
> If the folio is only partially mapped to this VMA range, it's likely it
> will be reclaimed as whole large folio. Which brings benefit for lru
> and zone lock contention comparing to splitting large folio.

That also brings negative side effects such as redundant I/O. For example,
if you have only one subpage left in a large folio, pageout will still
write nr_pages subpages to swap and then immediately free them in swap.

>
> The thing I am not sure is unmapping from specific VMA range is not
> available and whether it's worthy to add it.

I think we might be able to write some complex code that adds folio1,
folio2, folio3, folio4 and folio5 from the above example to reclaim_list
while avoiding splitting folio1, but I really don't understand how unmap
would work.

>
> > to make sure folio1, folio2, folio3, folio4, folio5 are added to
> > reclaim_list by doing a complex
> > game while scanning these 8 PTEs.
> >
> > if we split in madvise, they become:
> >
> > PTE0 present for large folioA - splitted from folio 1
> > PTE1 present for large folioB - splitted from folio 1
> > PTE2 present for another folio2
> > PTE3 present for another folio3
> > PTE4 present for large folioC - splitted from folio 1
> > PTE5 present for large folioD - splitted from folio 1
> > PTE6 present for another folio4
> > PTE7 present for another folio5
> >
> > we simply add the above 8 folios into reclaim_list one by one.
> >
> > I would vote for splitting for partial mapped large folio in madvise.
> >

Thanks
Barry
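
For reference, the two options for the 8-PTE example above can be modelled
in a few lines of userspace Python. This is only a rough sketch under stated
assumptions, not kernel code: the folio names come from the example, while
NR_PAGES (folio1 is assumed to be a 16-page large folio, the others single
pages), scan_without_split(), scan_with_split() and pages_queued() are
hypothetical names made up for illustration. A set stands in for whatever
check would avoid adding folio1 twice.

```python
# Toy userspace model of the 8-PTE example; not kernel code.
# Folio names are from the example; NR_PAGES and all helpers are hypothetical.

# PTE index -> folio mapped by that PTE.
PTES = ["folio1", "folio1", "folio2", "folio3",
        "folio1", "folio1", "folio4", "folio5"]

# Assumed folio sizes in pages: folio1 is a 16-page large folio,
# the others are treated as single pages.
NR_PAGES = {"folio1": 16, "folio2": 1, "folio3": 1, "folio4": 1, "folio5": 1}


def scan_without_split(ptes):
    """Queue each folio once, skipping PTEs of an already queued folio."""
    reclaim_list = []
    queued = set()
    for folio in ptes:
        if folio in queued:
            continue  # bookkeeping needed to avoid a double add
        queued.add(folio)
        reclaim_list.append(folio)
    return reclaim_list


def scan_with_split(ptes):
    """Split large folios first, then queue one small folio per PTE."""
    reclaim_list = []
    for idx, folio in enumerate(ptes):
        if NR_PAGES[folio] > 1:
            # After the split this PTE maps its own single-page folio.
            reclaim_list.append(f"{folio}.pte{idx}")
        else:
            reclaim_list.append(folio)
    return reclaim_list


def pages_queued(reclaim_list):
    """Pages handed to reclaim; split pieces count as one page each."""
    return sum(NR_PAGES.get(folio, 1) for folio in reclaim_list)


no_split = scan_without_split(PTES)
split = scan_with_split(PTES)
print(no_split, pages_queued(no_split))
print(split, pages_queued(split))
```

Under these assumptions the no-split scan queues 5 folios covering 20 pages,
while the split scan queues 8 single-page folios. In this model, remembering
only the previously added folio would not be enough, because folio2 and
folio3 sit between the two runs of folio1.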