Subject: Re: [PATCH v23 1/4] mm: add MAP_DROPPABLE for designating always lazily freeable mappings
To: Jason A. Donenfeld
Cc: Linus Torvalds, Greg Kroah-Hartman, Adhemerval Zanella Netto, Carlos O'Donell, Florian Weimer, Arnd Bergmann, Jann Horn, Christian Brauner, David Hildenbrand
References: <20240712014009.281406-1-Jason@zx2c4.com> <20240712014009.281406-2-Jason@zx2c4.com>
From: Miaohe Lin
Date: Thu, 12 Sep 2024 11:11:11 +0800
In-Reply-To: <20240712014009.281406-2-Jason@zx2c4.com>

On 2024/7/12 9:40, Jason A. Donenfeld wrote:
> The vDSO getrandom() implementation works with a buffer allocated with a
> new system call that has certain requirements:
>
> - It shouldn't be written to core dumps.
>   * Easy: VM_DONTDUMP.
> - It should be zeroed on fork.
>   * Easy: VM_WIPEONFORK.
>
> - It shouldn't be written to swap.
>   * Uh-oh: mlock is rlimited.
>   * Uh-oh: mlock isn't inherited by forks.
>
> - It shouldn't reserve actual memory, but it also shouldn't crash when
>   page faulting in memory if none is available
>   * Uh-oh: VM_NORESERVE means segfaults.
>
> It turns out that the vDSO getrandom() function has three really nice
> characteristics that we can exploit to solve this problem:
>
> 1) Due to being wiped during fork(), the vDSO code is already robust to
>    having the contents of the pages it reads zeroed out midway through
>    the function's execution.
>
> 2) In the absolute worst case of whatever contingency we're coding for,
>    we have the option to fallback to the getrandom() syscall, and
>    everything is fine.
>
> 3) The buffers the function uses are only ever useful for a maximum of
>    60 seconds -- a sort of cache, rather than a long term allocation.
>
> These characteristics mean that we can introduce VM_DROPPABLE, which
> has the following semantics:
>
> a) It never is written out to swap.
> b) Under memory pressure, mm can just drop the pages (so that they're
>    zero when read back again).
> c) It is inherited by fork.
> d) It doesn't count against the mlock budget, since nothing is locked.
> e) If there's not enough memory to service a page fault, it's not fatal,
>    and no signal is sent.
>
> This way, allocations used by vDSO getrandom() can use:
>
>     VM_DROPPABLE | VM_DONTDUMP | VM_WIPEONFORK | VM_NORESERVE
>
> And there will be no problem with OOMing, crashing on overcommitment,
> using memory when not in use, not wiping on fork(), coredumps, or
> writing out to swap.
>
> In order to let vDSO getrandom() use this, expose these via mmap(2) as
> MAP_DROPPABLE.
>
> Note that this involves removing the MADV_FREE special case from
> sort_folio(), which according to Yu Zhao is unnecessary and will simply
> result in an extra call to shrink_folio_list() in the worst case. The
> chunk removed reenables the swapbacked flag, which we don't want for
> VM_DROPPABLE, and we can't conditionalize it here because there isn't a
> vma reference available.
>
> Finally, the provided self test ensures that this is working as desired.
>
> Cc: linux-mm@kvack.org
> Acked-by: David Hildenbrand
> Signed-off-by: Jason A. Donenfeld
> ---

...

> diff --git a/mm/memory.c b/mm/memory.c
> index d10e616d7389..18fe893ce96d 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -5690,6 +5690,10 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
>
>  	lru_gen_exit_fault();
>
> +	/* If the mapping is droppable, then errors due to OOM aren't fatal. */
> +	if (vma->vm_flags & VM_DROPPABLE)
> +		ret &= ~VM_FAULT_OOM;
> +

Sorry for jumping in here. I am confused about this code in
handle_mm_fault(). Since VM_FAULT_OOM is simply dropped, will the page
fault be re-triggered soon afterwards? If so, when OOM is disabled or
fails to make forward progress, will the page fault be re-triggered
again and again because no memory is available? I might be missing
something.

Thanks.
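
The concern above can be sketched as a toy user-space model. This is
not kernel code and makes no claim about what the kernel actually does
after the fault returns. Every name below (model_vma, model_do_fault(),
model_handle_mm_fault(), model_access()) and both constants are
hypothetical stand-ins. The only part taken from the patch is the
masking step "ret &= ~VM_FAULT_OOM" for droppable mappings. The model
assumes that a fault which returns without mapping a page causes the
access to be retried.

```c
#include <stdbool.h>

/* Illustrative stand-ins, NOT the kernel's definitions. */
#define MODEL_FAULT_OOM 0x1u	/* plays the role of VM_FAULT_OOM */
#define MODEL_DROPPABLE 0x1u	/* plays the role of VM_DROPPABLE */

struct model_vma {
	unsigned int flags;
};

/* Hypothetical allocator: maps a page only if one is free. */
static unsigned int model_do_fault(int *free_pages, bool *mapped)
{
	if (*free_pages <= 0) {
		*mapped = false;
		return MODEL_FAULT_OOM;
	}
	(*free_pages)--;
	*mapped = true;
	return 0;
}

/* Mirrors the quoted hunk: OOM is masked out for droppable mappings. */
static unsigned int model_handle_mm_fault(const struct model_vma *vma,
					  int *free_pages, bool *mapped)
{
	unsigned int ret = model_do_fault(free_pages, mapped);

	if (vma->flags & MODEL_DROPPABLE)
		ret &= ~MODEL_FAULT_OOM;
	return ret;
}

/*
 * Retry the faulting access up to max_attempts times (assumed behavior).
 * A page becomes free once reclaim_after attempts have failed; a
 * negative reclaim_after means memory never becomes available.
 *
 * Returns the attempt number that succeeded, -1 if OOM was reported
 * to the caller, or 0 if the access was still retrying at the cap.
 */
int model_access(const struct model_vma *vma, int reclaim_after,
		 int max_attempts)
{
	int free_pages = 0;

	for (int attempt = 1; attempt <= max_attempts; attempt++) {
		bool mapped = false;
		unsigned int ret;

		if (reclaim_after >= 0 && attempt > reclaim_after)
			free_pages = 1;

		ret = model_handle_mm_fault(vma, &free_pages, &mapped);
		if (ret & MODEL_FAULT_OOM)
			return -1;
		if (mapped)
			return attempt;
	}
	return 0;
}
```

In this model, a droppable mapping with reclaim_after = 3 succeeds on
the fourth attempt. With reclaim_after = -1 it is still retrying when
the cap is reached, which is the "again and again" case asked about.
A non-droppable mapping with no memory reports OOM on the first attempt.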