From: David Hildenbrand
Date: Thu, 29 Feb 2024 16:15:18 +0100
Subject: Re: [LSF/MM/BPF TOPIC] Sharing page tables across processes (mshare)
To: Matthew Wilcox
Cc: Khalid Aziz, lsf-pc@lists.linux-foundation.org, "linux-mm@kvack.org"
Message-ID: <0a63c084-7507-42fd-9201-ab2ca8d38f6e@redhat.com>
Organization: Red Hat

On 29.02.24 15:12, Matthew Wilcox wrote:
> On Thu, Feb 29, 2024 at 10:21:26AM +0100, David Hildenbrand wrote:
>> On 28.02.24 23:56, Khalid Aziz wrote:
>>> Threads of a process share address space and page tables that allows for
>>> two key advantages:
>>>
>>> 1. Amount of memory required for PTEs to map physical pages stays low
>>> even when large number of threads share the same pages since PTEs are
>>> shared across threads.
>>>
>>> 2. Page protection attributes are shared across threads and a change
>>> of attributes applies immediately to every thread without any overhead
>>> of coordinating protection bit changes across threads.
>>>
>>> These advantages no longer apply when unrelated processes share pages.
>>> Large database applications can easily comprise of 1000s of processes
>>> that share 100s of GB of pages. In cases like this, amount of memory
>>> consumed by page tables can exceed the size of actual shared data.
>>> On a database server with 300GB SGA, a system crash was seen with
>>> out-of-memory condition when 1500+ clients tried to share this SGA even
>>> though the system had 512GB of memory. On this server, in the worst case
>>> scenario of all 1500 processes mapping every page from SGA would have
>>> required 878GB+ for just the PTEs.
>>>
>>> I have sent proposals and patches to solve this problem by adding a
>>> mechanism to the kernel for processes to use to opt into sharing
>>> page tables with other processes. We have had discussions on original
>>> proposal and subsequent refinements but we have not converged on a
>>> solution. As systems with multi-TB memory and in-memory databases
>>> are becoming more and more common, this is becoming a significant issue.
>>> An interactive discussion can help us reach a consensus on how to
>>> solve this.
>>
>> Hi,
>>
>> I was hoping for a follow-up to my previous comments from ~4 months ago [1],
>> so one problem of "not converging" might be "no follow-up discussion".
>>
>> Ideally, this session would not focus on mshare as previously discussed at
>> LSF/MM, but take a step back and discuss requirements and possible
>> adjustments to the original concept to get something possibly cleaner.
>
> I think the concept is clean.
> Your concept doesn't fit our use case!

Which one exactly are you talking about in particular?

I raised various alternatives/modifications for discussion, learning
what works and what doesn't work on the way.
(I never understood why protection on the pagecache level wouldn't work
for your use case, but let's put that aside).

In my last mail, I had the following:

"
It's been a while, but I remember that the feedback in the room was
primarily that:
(a) the original mshare approach/implementation had a very dangerous
    smell to it. Rerouting mmap/mprotect/... is just absolutely nasty.
(b) that pure page table sharing itself might be itself a reasonable
    optimization worth having.

I still think generic page table sharing (as a pure optimization) can be
something reasonable to have, and can help existing use cases without
the need to modify any software (well, except maybe give a hint that it
might be reasonable).

As said, I see value in some fd-thingy that can be mmaped, but is
internally assembled from other fds (using protect ioctls, not mmap)
with sub-protection (using protect ioctls, not mprotect). The ioctls
would be minimal and clearly specified. Most madvise()/uffd/... would
simply fail when seeing a VMA that mmaps such a fd thingy. No rerouting
of mmap, munmap, mprotect, ...

Under the hood, one can use a MM to manage all that and share page
tables. But it would be an implementation detail.
"

So I do think original mshare could be done "less scary" [1] by exposing
a different, well-defined and restricted interface to manage the
"content" of mshare.

I have a lot more in mind that I could describe, but it doesn't make
sense to do so if it won't solve your use case.

In my world it would end up cleaner, and naive me would have thought
that you would enjoy something close to original mshare, just a bit less
scary :)

> So essentially what you're asking for is for us to do a lot of work
> which doesn't solve our problem. You can imagine our lack of enthusiasm
> for this.

I recall that implementing generic page table sharing is a lot of work
that Oracle isn't interested in doing; fair enough, I understood that.
Really, the amount of work is unclear if we don't talk about the actual
solution.

I cannot really do more than offer help like I did: "I'm happy to
discuss further. In a bi-weekly MM meeting, off-list or here.".

But if my comments are so unreasonable that they are not even worth
discussing, I likely wouldn't be of any help in another mshare session.

[1] https://lwn.net/Articles/895217/

-- 
Cheers,

David / dhildenb
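
For reference, the "878GB+ for just the PTEs" figure in the quoted proposal can be reproduced with a rough back-of-the-envelope calculation. The sketch below is not from the thread; it assumes 4 KiB base pages, 8-byte PTEs (as on x86-64), binary units (1 GB = 2**30 bytes), and it ignores upper-level page tables. The function name `pte_bytes` is made up for illustration.

```python
# Rough estimate of last-level page table memory when every process maps
# every page of a shared region without sharing page tables.
# Assumptions (not stated in the thread): 4 KiB pages, 8-byte PTEs,
# 1 GB = 2**30 bytes, upper-level page tables ignored.
PAGE_SIZE = 4 * 1024
PTE_SIZE = 8
GIB = 2**30


def pte_bytes(mapped_bytes: int, processes: int) -> int:
    """Bytes of PTEs needed if each process has its own PTE per mapped page."""
    pages = mapped_bytes // PAGE_SIZE
    return pages * PTE_SIZE * processes


# 300GB SGA mapped in full by 1500 processes, as in the quoted scenario.
total = pte_bytes(300 * GIB, 1500)
print(f"{total / GIB:.1f} GiB of PTEs")  # prints "878.9 GiB of PTEs"
```

Under these assumptions each process needs about 600 MiB of PTEs for the 300GB mapping, so the per-process cost is small but it scales linearly with the number of processes, which is why the total ends up larger than the shared data itself.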