From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <1b5ce269-bb54-2f65-6919-8b2bb453c09a@redhat.com>
Date: Tue, 27 Jun 2023 12:18:58 +0200
From: David Hildenbrand <david@redhat.com>
Organization: Red Hat
To: Lorenzo Stoakes
Cc: Vlastimil Babka, linux-mm@kvack.org, linux-kernel@vger.kernel.org, Andrew Morton, Mike Rapoport, "Liam R. Howlett"
Subject: Re: [PATCH] mm/mprotect: allow unfaulted VMAs to be unaccounted on mprotect()
References: <20230626204612.106165-1-lstoakes@gmail.com> <074fc253-beb4-f7be-14a1-ee5f4745c15b@suse.cz> <1832a772-93b4-70ad-3a2c-d8da104ce8dd@redhat.com> <40cd965f-ba4f-44d8-8e7c-d6267b91a9b3@lucifer.local> <57c677d1-9809-966e-5254-f01f59eff7bc@redhat.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
[...]

>>
>> Yeah, and that needs time and you have to motivate me :)
>>
>
> Beer? ;)

Oh, that always works :)

>
>>> Well the motivator for the initial investigation was rppt playing with
>>> R[WO]X (this came from an #mm irc conversation), however in his case he
>>> will be mapping pages between the two.
>>
>> And that's the scenario I think we care about in practice (actually
>> accessing memory).

[...]

>>> In real-use scenarios, yes fuzzers are a thing, but what comes to mind
>>> more immediately is a process that maps a big chunk of virtual memory
>>> PROT_NONE and uses that as part of an internal allocator.
>>>
>>> If the process then allocates memory from this chunk (mprotect() ->
>>> PROT_READ | PROT_WRITE), which then gets freed without being used
>>> (mprotect() -> PROT_NONE), we hit the issue. For OVERCOMMIT_NEVER this
>>> could become quite an issue, more so than the VMA fragmentation.
>>
>> Using mprotect() when allocating/freeing memory in an allocator is
>> already horribly harmful for performance (well, and the #VMAs), so I
>> don't think that scenario is relevant in practice.
>
> Chrome, for instance, maintains vast memory ranges as PROT_NONE.
> I've not dug into what they're doing, but surely to make use of them
> they'd need to mprotect() or mmap()/mremap() (which maybe is what the
> intent is).

I suspect they are doing something similar to glibc (and some other
allocators like jemalloc IIRC), because they want to minimize the #VMAs.

>
> But fair point. However I can't imagine m[re]map'ing like this would be
> cheap either, as you're doing the same kind of expensive operations, so
> the general _approach_ seems like it's used in some way in practice.

Usually people access memory and don't play mprotect() games for fun :)

>
>>
>> What some allocators (IIRC even glibc) do is reserve a bigger area with
>> PROT_NONE and grow the accessible part slowly on demand, discarding
>> freed memory using MADV_DONTNEED. So you essentially end up with two
>> VMAs -- one completely accessible, one completely inaccessible.
>>
>> They don't use mprotect() because:
>> (a) It's bad for performance
>> (b) It might increase the #VMAs
>>
>> There is efence, but I remember it simply does mmap()+munmap() and runs
>> into VMA limits easily just by relying on a lot of mappings.
>>
>>
>>>
>>> In addition, I think a user simply doing the artificial test above
>>> would find the split remaining quite confusing, and somebody debugging
>>> some code like this would equally wonder why it happened, so there is
>>> benefit in clarity too (they of course observing the VMA fragmentation
>>> from the perspective of /proc/$pid/[s]maps).
>>
>> My answer would have been "memory gets committed the first time we allow
>> write access, and that wasn't the case for all memory in that range".
>>
>>
>> Now, take your example above and touch the memory.
>>
>>
>> ptr = mmap(NULL, page_size * 3, PROT_READ, MAP_ANON | MAP_PRIVATE, -1, 0);
>> mprotect(ptr + page_size, page_size, PROT_READ | PROT_WRITE);
>> *(ptr + page_size) = 1;
>> mprotect(ptr + page_size, page_size, PROT_READ);
>>
>>
>> And we'll not merge the VMAs.
>>
>> Which, at least to me, makes existing handling more consistent.
>
> Indeed, but I don't think it's currently consistent at all.
>
> The 'correct' solution would be to:-
>
> 1. account for the block when it becomes writable
> 2. unaccount for any pages not used when it becomes unwritable
>

I've been messing with something related (but slightly different) for a
while now in my mind, and I'm not at the point where I can talk about my
work/idea yet. But because I've been messing with it, I can comment on
some existing oddities. Just imagine:

* userfaultfd() can place anon pages even in PROT_NONE areas
* ptrace can place anon pages in PROT_READ areas
* "fun" like the forbidden shared zeropage on s390x in some VMAs can place
  anon pages into PROT_READ areas

It's all far from "correct" when talking about memory accounting. But it
seems to get the job done for the most part, for now.

> However, since we can't go from vma -> folios for anon pages without some
> extreme effort, this is not feasible.
>
> Therefore the existing code hacks it and just keeps things accounted.
>
> The patch reduces the hacking so we get halfway to the correct approach.
>
> So before: "if you ever make this read/write, we account it forever"
> After: "if you ever make this read/write and USE IT, we account it forever"

"USE" is probably the wrong word. Maybe "MODIFIED", but there are other
cases (MADV_POPULATE_WRITE).

> To me it is more consistent. Of course this is subjective...

You made the conditional more complicated to make it consistent, won't
argue with that :)

>>
>> And users could rightfully wonder "why isn't it getting merged". And the
>> answer would be the same: "memory gets committed the first time we allow
>> write access, and that wasn't the case for all memory in that range".
>
> Yes indeed, a bigger answer is that we don't have fine-grained accounting
> for pages for anon_vma.

Yes, VM_ACCOUNT is all-or-nothing, which makes a lot of sense in many
cases (not in all, though).
[...]

>>
>>>>> So in practice programs will likely do the PROT_WRITE in order to
>>>>> actually populate the area, so this won't trigger as I commented
>>>>> above. But it can still help in some cases and is cheap to do, so:
>>>>
>>>> IMHO we should much rather look into getting hugetlb ranges merged. My
>>>> recollection is that we'll never end up merging hugetlb VMAs once
>>>> split.
>>>
>>> I'm not sure how that's relevant to fragmented non-hugetlb VMAs though?
>>
>> It's a VMA merging issue that can be hit in practice, so I raised it.
>>
>>
>> No strong opinion from my side, just my 2 cents reading the patch
>> description and wondering "why do we even invest time thinking about
>> this case" -- and eventually make handling less consistent IMHO (see
>> above).
>
> Hmm it seems like you have quite a strong opinion :P but this is why I
> cc-d you, as you are a great scrutiniser.

I might make it sound like a strong opinion (because I am challenging the
motivation), but there is no nak :)

>
> Yeah, the time investment was just by accident, the patch was originally
> a throwaway thing to prove the point :]
>
> I very much appreciate your time though! And I owe you at least one beer
> now.
>
> I would ask that while you might question the value, whether you think it
> so harmful as not to go in, so Andrew can know whether this debate =
> don't take?
>
> An Ack-with-meh would be fine. But also if you want to nak, it's also
> fine. I will buy you the beer either way ;)

It's more a "no nak" -- I don't see the real benefit but I also don't see
the harm (as long as VMA locking is not an issue). If others see the
benefit, great, so I'll let them decide.

-- 
Cheers,

David / dhildenb