Date: Mon, 25 Apr 2022 16:07:03 +0200
To: Jason Gunthorpe
Cc: Sean Christopherson, Andy Lutomirski, Chao Peng, kvm list,
 Linux Kernel Mailing List, linux-mm@kvack.org,
 linux-fsdevel@vger.kernel.org, Linux API, qemu-devel@nongnu.org,
 Paolo Bonzini, Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li,
 Jim Mattson, Joerg Roedel, Thomas Gleixner, Ingo Molnar,
 Borislav Petkov, the arch/x86 maintainers, "H. Peter Anvin",
 Hugh Dickins, Jeff Layton, "J . Bruce Fields", Andrew Morton,
 Mike Rapoport, Steven Price, "Maciej S . Szmigiero", Vlastimil Babka,
 Vishal Annapurve, Yu Zhang, "Kirill A. Shutemov", "Nakajima, Jun",
 Dave Hansen, Andi Kleen
References: <20220310140911.50924-1-chao.p.peng@linux.intel.com>
 <20220310140911.50924-5-chao.p.peng@linux.intel.com>
 <02e18c90-196e-409e-b2ac-822aceea8891@www.fastmail.com>
 <7ab689e7-e04d-5693-f899-d2d785b09892@redhat.com>
 <20220412143636.GG64706@ziepe.ca>
 <1686fd2d-d9c3-ec12-32df-8c4c5ae26b08@redhat.com>
 <20220413175208.GI64706@ziepe.ca>
From: David Hildenbrand
Organization: Red Hat
Subject: Re: [PATCH v5 04/13] mm/shmem: Restrict MFD_INACCESSIBLE memory
 against RLIMIT_MEMLOCK
In-Reply-To: <20220413175208.GI64706@ziepe.ca>

On 13.04.22 19:52, Jason Gunthorpe wrote:
> On Wed, Apr 13, 2022 at 06:24:56PM +0200, David Hildenbrand wrote:
>> On 12.04.22 16:36, Jason Gunthorpe wrote:
>>> On Fri, Apr 08, 2022 at 08:54:02PM +0200, David Hildenbrand wrote:
>>>
>>>> RLIMIT_MEMLOCK was the obvious candidate, but as we discovered in the
>>>> past already with secretmem, it's not 100% that good of a fit (unmovable
>>>> is worse than mlocked). But it gets the job done for now at least.
>>>
>>> No, it doesn't. There are too many different interpretations how
>>> MEMLOCK is supposed to work
>>>
>>> eg VFIO accounts per-process so hostile users can just fork to go past
>>> it.
>>>
>>> RDMA is per-process but uses a different counter, so you can double up
>>>
>>> iouring is per-user and uses a 3rd counter, so it can triple up on
>>> the above two
>>
>> Thanks for that summary, very helpful.
>
> I kicked off a big discussion when I suggested to change vfio to use
> the same as io_uring
>
> We may still end up trying it, but the major concern is that libvirt
> sets the RLIMIT_MEMLOCK and if we touch anything here - including
> fixing RDMA, or anything really, it becomes a uAPI break for libvirt..
>

Okay, so we have to introduce a second mechanism, avoid RLIMIT_MEMLOCK
for new unmovable memory, and then eventually phase out RLIMIT_MEMLOCK
usage for existing unmovable memory consumers (which, as you say, will
be difficult).

>>>> So I'm open for alternative to limit the amount of unmovable memory we
>>>> might allocate for user space, and then we could convert secretmem as well.
>>>
>>> I think it has to be cgroup based considering where we are now :\
>>
>> Most probably. I think the important lessons we learned are that
>>
>> * mlocked != unmovable.
>> * RLIMIT_MEMLOCK should most probably never have been abused for
>>   unmovable memory (especially, long-term pinning)
>
> The trouble is I'm not sure how anything can correctly/meaningfully
> set a limit.
>
> Consider qemu where we might have 3 different things all pinning the
> same page (rdma, iouring, vfio) - should the cgroup give 3x the limit?
> What use is that really?

I think you're tackling a related problem: we double-account
unmovable/mlocked memory because we lack a way to track that a page is
already pinned by the same user/cgroup/whatever. Not easy to solve.

The problem also becomes interesting if iouring with fixed buffers
doesn't work on guest RAM, but on some other QEMU buffers.

>
> IMHO there are only two meaningful scenarios - either you are unpriv
> and limited to a very small number for your user/cgroup - or you are
> priv and you can do whatever you want.
>
> The idea we can fine tune this to exactly the right amount for a
> workload does not seem realistic and ends up exporting internal kernel
> decisions into a uAPI..

IMHO, there are three use cases:

* App that conditionally uses selected mechanisms that end up requiring
  unmovable, long-term allocations. Secretmem, iouring, rdma. We want
  some sane, small default. Apps have a backup path in case any such
  mechanism fails because we're out of allowed unmovable resources.
* App that relies on a selected mechanism that ends up requiring
  unmovable, long-term allocations. E.g., vfio with known memory
  consumption, such as the VM size. It's fairly easy to come up with
  the right value.
* App that relies on multiple mechanisms that end up requiring
  unmovable, long-term allocations. QEMU with rdma, iouring, vfio, ...
  I agree that coming up with something good is problematic.

Then, there are privileged/unprivileged apps.
There might be admins that just don't care. There might be admins that
even want to set some limit instead of configuring "unlimited" for QEMU.

Long story short, it should be an admin choice what to configure,
especially:

* What the default is for random apps
* What the maximum is for selected apps
* Which apps don't have a maximum

-- 
Thanks,

David / dhildenb