From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <3cc8f142-a69d-ae84-6a33-50bdc9aade21@redhat.com>
Date: Tue, 24 Jan 2023 19:35:35 +0100
From: David Hildenbrand <david@redhat.com>
Organization: Red Hat
To: Matthew Wilcox, linux-mm@kvack.org
Cc: Vishal Moola, Hugh Dickins, Rik van Riel, "Yin, Fengwei"
Subject: Re: Folio mapcount
Content-Type: text/plain; charset=UTF-8; format=flowed
On 24.01.23 19:13, Matthew Wilcox wrote:
> Once we get to the part of the folio journey where we have
> one-pointer-per-page, we can't afford to maintain per-page state.
> Currently we maintain a per-page mapcount, and that will have to go.
> We can maintain extra state for a multi-page folio, but it has to be a
> constant amount of extra state no matter how many pages are in the folio.
> 
> My proposal is that we maintain a single mapcount per folio, and its
> definition is the number of (vma, page table) tuples which have a
> reference to any pages in this folio.
> 
> I think there's a good performance win and simplification to be had
> here, so I think it's worth doing for 6.4.
> 
> Examples
> --------
> 
> In the simple and common case where every page in a folio is mapped
> once by a single vma and single page table, mapcount would be 1 [1].
> If the folio is mapped across a page table boundary by a single VMA,
> after we take a page fault on it in one page table, it gets a mapcount
> of 1. After taking a page fault on it in the other page table, its
> mapcount increases to 2.
> 
> For a PMD-sized THP naturally aligned, mapcount is 1. Splitting the
> PMD into PTEs would not change the mapcount; the folio remains order-9
> but it still has a reference from only one page table (a different page
> table, but still just one).
> 
> Implementation sketch
> ---------------------
> 
> When we take a page fault, we can/should map every page in the folio
> that fits in this VMA and this page table. We do this at present in
> filemap_map_pages() by looping over each page in the folio and calling
> do_set_pte() on each. We should have a:
> 
> 	do_set_pte_range(vmf, folio, addr, first_page, n);
> 
> and then change the API to page_add_new_anon_rmap() / page_add_file_rmap()
> to pass in (folio, first, n) instead of page. That gives us one call to
> page_add_*_rmap() per (vma, page table) tuple.
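To make the proposed accounting rule concrete, here is a minimal
userspace toy model of "one count per (vma, page table) tuple". This is
deliberately not kernel code; struct folio_sim, sim_add_mapping() and
sim_remove_mapping() are made-up names used purely for illustration:

#include <assert.h>
#include <stdio.h>

struct folio_sim {
	int order;	/* folio spans 1 << order pages */
	int mapcount;	/* one count per (vma, page table) tuple */
};

/* A (vma, page table) tuple maps n pages of the folio: one increment. */
static void sim_add_mapping(struct folio_sim *folio, int n)
{
	(void)n;	/* how many pages the tuple maps does not matter */
	folio->mapcount++;
}

/* The same tuple unmaps all of its pages of the folio: one decrement. */
static void sim_remove_mapping(struct folio_sim *folio)
{
	folio->mapcount--;
}

int main(void)
{
	/* PMD-sized THP mapped by one VMA in a single page table. */
	struct folio_sim thp = { .order = 9, .mapcount = 0 };
	/* Small folio straddling a page table boundary in one VMA. */
	struct folio_sim straddler = { .order = 2, .mapcount = 0 };

	sim_add_mapping(&thp, 512);
	assert(thp.mapcount == 1);
	/*
	 * Splitting the PMD into PTEs would not change the count: the
	 * folio is still referenced from exactly one page table.
	 */

	sim_add_mapping(&straddler, 2);	/* fault in the first page table */
	sim_add_mapping(&straddler, 2);	/* fault in the second page table */
	assert(straddler.mapcount == 2);

	sim_remove_mapping(&straddler);
	sim_remove_mapping(&straddler);
	assert(straddler.mapcount == 0);

	printf("toy mapcount model: all checks passed\n");
	return 0;
}

The only point of the model is that the increment/decrement happens once
per (vma, page table) tuple, which is why do_set_pte_range() plus a
(folio, first, n) rmap API would translate into a single mapcount update
per fault, no matter how many pages of the folio get mapped.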
> 
> In try_to_unmap_one(), page_vma_mapped_walk() currently calls us for
> each pfn. We'll want a function like
> page_vma_mapped_walk_skip_to_end_of_ptable()
> in order to persuade it to only call us once or twice if the folio
> is mapped across a page table boundary.
> 
> Concerns
> --------
> 
> We'll have to be careful to always zap all the PTEs for a given (vma,
> pt) tuple at the same time, otherwise mapcount will get out of sync
> (eg map three pages, unmap two; we shouldn't decrement the mapcount,
> but I don't think we can know that). But does this ever happen? I think
> we always unmap the entire folio, like in try_to_unmap_one().

Not sure about file THP, but for anon ... it's very common to partially
MADV_DONTNEED anon THP. Or to have a wild mixture of two (or more) anon
THP fragments after fork() when COW'ing on the PTE-mapped THP ...

> 
> I haven't got my head around SetPageAnonExclusive() yet. I think it can
> be a per-folio bit, but handling a folio split across two page tables
> may be tricky.

I tried hard (very hard!) to make that work, but reality caught up. The
history of why that handling is required goes back to the old days: we
had per-subpage refcounts, then per-subpage mapcounts, and now only a
single bit to get COW handling right.

There are very (very!) ugly corner cases of partial mremap, partial
MADV_WILLNEED ... some are included in the cow selftest for that reason.

One bit per subpage is certainly "not perfect", but it is not the end of
the world for now: 512/8 -> 64 bytes for a 2 MiB folio ...

For now I would focus on the mapcount ... that will be a challenge on
its own and a bigger improvement :P

-- 
Thanks,

David / dhildenb