From: David Hildenbrand
Organization: Red Hat
Date: Mon, 16 Jun 2025 22:24:05 +0200
Subject: Re: [PATCH 00/11] mm/mremap: introduce more mergeable mremap via MREMAP_RELOCATE_ANON
To: Lorenzo Stoakes, Andrew Morton
Cc: Vlastimil Babka, Jann Horn, "Liam R . Howlett", Suren Baghdasaryan,
    Matthew Wilcox, Pedro Falcato, Rik van Riel, Harry Yoo, Zi Yan,
    Baolin Wang, Nico Pache, Ryan Roberts, Dev Jain, Jakub Matena,
    Wei Yang, Barry Song, linux-mm@kvack.org, linux-kernel@vger.kernel.org
Message-ID: <7e51e1e2-7272-48d5-9457-40ab87ad7694@redhat.com>

Hi Lorenzo,

as discussed offline, there is a lot going on and this is rather ... a
lot of code+complexity for something that is more of a corner case. :)

Corner case as in: only select user space will benefit from this, which
is really a shame.

After your presentation at LSF/MM, I thought about this further, and I
was wondering whether:

(a) We cannot make this semi-automatic, avoiding flags.

(b) We cannot simplify further by limiting it to the common+easy cases
    first.

I think you already did (b) to some degree as part of this non-RFC,
which is great.

So before digging into the details, let's discuss the high-level problem
briefly.

I think there are three parts to it:

(1) Detecting whether it is safe to adjust the folio->index (small
    folios)

(2) Performance implications of doing so

(3) Detecting whether it is safe to adjust the folio->index (large
    PTE-mapped folios)

Regarding (1), if we simply track whether a folio was ever used for
COW-sharing, it would be very easy: and not only for present folios, but
for any anon folios that are referenced by swap/migration entries.
Skimming over patch #1, I think you apply a similar logic, which is good.

Regarding (2), it would apply when we mremap() anon VMAs and they happen
to reside next to other anon VMAs. Which workloads are we concerned
about harming by implementing this optimization? I recall that the most
common use case for mremap() is actually for file mappings, but I might
be wrong. In any case, we could just have a different way to enable this
optimization than for each and every mremap() invocation in a process.

Regarding (3), if we were to split large folios that cross VMA
boundaries during mremap(), it would be simpler.

How is it handled in this series if a large folio crosses VMA boundaries?
(a) try splitting or (b) fail (not transparent to the user :( ).

> This also creates a difference in behaviour, often surprising to users,
> between mappings which are faulted and those which are not - as for the
> latter we adjust vma->vm_pgoff upon mremap() to aid mergeability.
>
> This is problematic firstly because this proliferates kernel allocations
> that are pure memory pressure - unreclaimable and unmovable -
> i.e. vm_area_struct, anon_vma, anon_vma_chain objects that need not exist.
>
> Secondly, mremap() exhibits an implicit uAPI in that it does not permit
> remaps which span multiple VMAs (though it does permit remaps that
> constitute a part of a single VMA).

If I mremap() to create a hole and mremap() it back, I would expect the
hole to be closed again automatically, without special flags. Well, we
both know this is not the case :)

>
> This means that a user must concern themselves with whether merges succeed
> or not should they wish to use mremap() in such a way which causes multiple
> mremap() calls to be performed upon mappings.

Right.

>
> This series provides users with an option to accept the overhead of
> actually updating the VMA and underlying folios via the
> MREMAP_RELOCATE_ANON flag.

Okay. I wish we could avoid this flag ...

>
> If MREMAP_RELOCATE_ANON is specified, but an ordinary merge would result in
> the mremap() succeeding, then no attempt is made at relocation of folios as
> this is not required.

Makes sense. This is the existing behavior then.

>
> Even if no merge is possible upon moving of the region, vma->vm_pgoff and
> folio->index fields are appropriately updated in order that subsequent
> mremap() or mprotect() calls will succeed in merging.

By looking at the surrounding VMAs or simply by trying to always keep
the folio->index corresponding to the address in the VMA? (just as if
mremap() never happened, I assume?)

>
> This flag falls back to the ordinary means of mremap() should the operation
> not be feasible.
> It also transparently undoes the operation, carefully
> holding rmap locks such that no racing rmap operation encounters incorrect
> or missing VMAs.

I absolutely dislike this undo operation, really. :(

I hope we can find a way to just detect early whether this optimization
would work.

Which exact error cases can you run into that require undoing?

I assume:

(a) cow-shared anon folio (can detect early)

(b) large folios crossing VMAs (TBD)

(c) KSM folios? Probably we could move them, I *think* we would have to
    update the ksm_rmap_item. Alternatively, we could indicate if a VMA
    had any KSM folios and give up early in the first version.

(d) GUP pins: I think we could allow that ... folio_maybe_dma_pinned()
    is racy either way (GUP-fast!). To deal with GUP-fast we would have
    to play different games ...

Anything else?

>
> In addition, the MREMAP_MUST_RELOCATE_ANON flag is supplied in case the
> user needs to know whether or not the operation succeeded - this flag is
> identical to MREMAP_RELOCATE_ANON, only if the operation cannot succeed,
> the mremap() fails with -EFAULT.

How would an app deal with these errors? Do you have a user in mind that
could do something sensible based on this error?

I'm having a hard time imagining that :)

>
> Note that no-op mremap() operations (such as an unpopulated range, or a
> merge that would trivially succeed already) will succeed under
> MREMAP_MUST_RELOCATE_ANON.
>
> mremap() already walks page tables, so it isn't an order of magnitude
> increase in workload, but constitutes the need to walk to page table leaf
> level and manipulate folios.

Only for anon VMAs, though. Do you have some numbers on how bad it is?

I mean, mremap() is already a pretty invasive/expensive operation ... :)

... which is why people started using uffdio_move instead, to avoid the
heavy-weight locks.

>
> The operations all succeed under THP and in general are compatible with
> underlying large folios of any size.
> In fact, the larger the folio, the
> more efficient the operation is.

Yes.

>
> Performance testing indicates that time taken using MREMAP_RELOCATE_ANON is
> on the same order of magnitude of ordinary mremap() operations, with both
> exhibiting time to the proportion of the mapping which is populated.
>
> Of course, mremap() operations that are entirely aligned are significantly
> faster as they need only move a VMA and a smaller number of higher order
> page tables, but this is unavoidable.
>
> Previous efforts in this area
> =============================
>
> An approach addressing this issue was previously suggested by Jakub Matena
> in a series posted a few years ago in [0] (and discussed in a masters
> thesis).
>
> However this was a more general effort which attempted to always make
> anonymous mappings more mergeable, and therefore was not quite ready for
> the upstream limelight. In addition, large folio work which has occurred
> since requires us to carefully consider and account for this.
>
> This series is more conservative and targeted (one must specify a flag to
> get this behaviour) and additionally goes to great efforts to handle large
> folios and account all of the nitty gritty locking concerns that might
> arise in current kernel code.
>
> Thanks goes out to Jakub for his efforts however, and hopefully this effort
> to take a slightly different approach to the same problem is pleasing to
> him regardless :)
>
> [0]:https://lore.kernel.org/all/20220311174602.288010-1-matenajakub@gmail.com/
>
> Use-cases
> =========
>
> * ZGC is a concurrent GC shipped with OpenJDK. A prototype is being worked
>   upon which makes use of extensive mremap() operations to perform
>   defragmentation of objects, taking advantage of the plentiful available
>   virtual address space in a 64-bit system.
>
>   In instances where one VMA is faulted in and another not, merging is not
>   possible, which leads to significant, unreclaimable, kernel metadata
>   overhead and contention on the vm.max_map_count limit.
>
>   This series eliminates the issue entirely.
> * It was indicated that Android similarly moves memory around and
>   encounters the very same issues as ZGC.

Isn't Android using uffdio_move?

> * SUSE indicate they have encountered similar issues as pertains to an
>   internal client.
>
> Past approaches
> ===============
>
> In discussions at LSF/MM/BPF it was suggested that we could make this an
> madvise() operation, however at this point it will be too late to correctly
> perform the merge, requiring an unmap/remap which would be egregious.
>
> It was further suggested that we simply defer the operation to the point at
> which an mremap() is attempted on multiple immediately adjacent VMAs (that
> is - to allow VMA fragmentation up until the point where it might cause
> perceptible issues with uAPI).
>
> This is problematic in that in the first instance - you accrue
> fragmentation, and only if you were to try to move the fragmented objects
> again would you resolve it.
>
> Additionally you would not be able to handle the mprotect() case, and you'd
> have the same issue as the madvise() approach in that you'd need to
> essentially re-map each VMA.
>
> Additionally it would become non-trivial to correctly merge the VMAs - if
> there were more than 3, we would need to invent a new merging mechanism
> specifically for this, hold locks carefully over each to avoid them
> disappearing from beneath us and introduce a great deal of non-optional
> complexity.
>
> While imperfect, the mremap flag approach seems the least invasive most
> workable solution (until further rework of the anon_vma mechanism can be
> achieved!)

Well, at that point we already have these new flags ...
:(

>
>  include/linux/rmap.h      |   4 +
>  include/uapi/linux/mman.h |   8 +-
>  mm/internal.h             |   1 +
>  mm/mremap.c               | 719 ++++++-
>  mm/vma.c                  |  77 +-
>  mm/vma.h                  |  36 +-

~ +40% LOC in mm/mremap.c :(

--
Cheers,

David / dhildenb
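
For reference, a minimal userspace sketch of the hole-and-back case
discussed above (mremap() the middle of a faulted-in anon mapping away,
then mremap() it back). It assumes Linux with /proc/self/maps available;
count_vmas() and vmas_after_hole_and_back() are hypothetical helper names
made up for this illustration and are not part of the series. It only
counts how many VMAs cover the original range afterwards. The result
depends on the kernel, so no expected value is claimed here.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Count the /proc/self/maps entries that overlap [start, start + len). */
static long count_vmas(const void *start, size_t len)
{
	uintptr_t lo = (uintptr_t)start, hi = lo + len;
	unsigned long s, e;
	char line[512];
	long n = 0;
	FILE *f = fopen("/proc/self/maps", "r");

	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f))
		if (sscanf(line, "%lx-%lx", &s, &e) == 2 && s < hi && e > lo)
			n++;
	fclose(f);
	return n;
}

/*
 * Hypothetical helper: map three faulted-in anonymous pages, move the
 * middle page away (creating a hole), move it back, and return how many
 * VMAs now cover the original range. Returns -1 on error.
 */
static long vmas_after_hole_and_back(void)
{
	size_t pg = (size_t)sysconf(_SC_PAGESIZE);
	char *base, *scratch, *moved;
	long n;

	base = mmap(NULL, 3 * pg, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (base == MAP_FAILED)
		return -1;
	memset(base, 1, 3 * pg);	/* fault in all three pages */

	/* Reserve a destination address for the move. */
	scratch = mmap(NULL, pg, PROT_NONE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (scratch == MAP_FAILED) {
		munmap(base, 3 * pg);
		return -1;
	}

	/* Move the middle page out; MREMAP_FIXED replaces the reservation. */
	moved = mremap(base + pg, pg, pg, MREMAP_MAYMOVE | MREMAP_FIXED,
		       scratch);
	if (moved == MAP_FAILED) {
		munmap(scratch, pg);
		munmap(base, 3 * pg);
		return -1;
	}
	assert(moved == scratch);

	/* Move it back into the hole at its original address. */
	if (mremap(moved, pg, pg, MREMAP_MAYMOVE | MREMAP_FIXED,
		   base + pg) == MAP_FAILED) {
		munmap(moved, pg);
		munmap(base, 3 * pg);
		return -1;
	}

	n = count_vmas(base, 3 * pg);
	munmap(base, 3 * pg);
	return n;
}
```

Calling vmas_after_hole_and_back() from a small test program and printing
the result, with and without the series applied, would show whether the
hole actually gets closed again.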