Date: Wed, 24 Nov 2021 17:43:58 +0100
From: David Hildenbrand <david@redhat.com>
Organization: Red Hat
To: Jason Gunthorpe
Cc: Vlastimil Babka, Jens Axboe, Andrew Dona-Couch, Andrew Morton,
 Drew DeVault, Ammar Faizi, linux-kernel@vger.kernel.org,
 linux-api@vger.kernel.org, io_uring Mailing List, Pavel Begunkov,
 linux-mm@kvack.org
Subject: Re: [PATCH] Increase default MLOCK_LIMIT to 8 MiB
Message-ID: <63294e63-cf82-1f59-5ea8-e996662e6393@redhat.com>
In-Reply-To: <20211124153405.GJ5112@ziepe.ca>
References: <20211123170056.GC5112@ziepe.ca>
 <20211123235953.GF5112@ziepe.ca>
 <2adca04f-92e1-5f99-6094-5fac66a22a77@redhat.com>
 <20211124132353.GG5112@ziepe.ca>
 <20211124132842.GH5112@ziepe.ca>
 <20211124134812.GI5112@ziepe.ca>
 <2cdbebb9-4c57-7839-71ab-166cae168c74@redhat.com>
 <20211124153405.GJ5112@ziepe.ca>

On 24.11.21 16:34, Jason Gunthorpe wrote:
> On Wed, Nov 24, 2021 at 03:14:00PM +0100, David Hildenbrand wrote:
>
>> I'm not aware of any where you can fragment 50% of all pageblocks in the
>> system as an unprivileged user essentially consuming almost no memory
>> and essentially staying inside well-defined memlock limits. But sure if
>> there are "many" people will be able to come up with at least one
>> comparable thing. I'll be happy to learn.
>
> If the concern is that THP's can be DOS'd then any avenue that renders
> the system out of THPs is a DOS attack vector. Including all the
> normal workloads that people run and already complain that THPs get
> exhausted.
>
> A hostile userspace can only quicken this process.

We can fragment not only THP but also smaller compound pages just as
easily, with less impact though (well, as long as people want more than
0.1% per user ...).

We want to make more extensive use of THP; the whole folio work is
about using THP. Some people are even working on increasing MAX_ORDER
and introducing gigantic THP. And here we have mechanisms available to
unprivileged users that can sabotage the very thing at its core
extremely easily. Personally, I think this is very bad, but that's just
my humble opinion.

>
>> My position that FOLL_LONGTERM for unprivileged users is a strong no-go
>> stands as it is.
>
> As this basically excludes long standing pre-existing things like
> RDMA, XDP, io_uring, and more I don't think this can be the general
> answer for mm, sorry.

Let's think about options to restrict FOLL_LONGTERM usage:

One option would be to add toggle(s) (e.g., kernel cmdline options) to
make the relevant mechanisms (or even FOLL_LONGTERM itself) privileged.
The admin can opt in if unprivileged users should have that capability,
and a distro might override the default and set it to "on". I'm not
completely happy about that. (A rough sketch of such a gate is appended
at the end of this mail.)

Another option would be to not account FOLL_LONGTERM against
RLIMIT_MEMLOCK, but instead against something that explicitly matches
its differing semantics. We could have one limit for privileged and one
for unprivileged users. The default in the kernel could be 0, but an
admin/system can override it to opt in, and a distro might apply
different rules. Yes, we're back to the original question about limits,
but now with the thought that FOLL_LONGTERM really is different from
mlock and potentially more dangerous.

At the same time, we should eventually work on proper alternatives with
mmu notifiers (and possibly without any such limits) where possible and
required. (I assume that's hardly possible for RDMA because of the way
the hardware works.)

Just some ideas; I'm open for alternatives. I know that this is bad for
the cases where we want it to "just work" for unprivileged users but
cannot even have alternative implementations.

>
> Sure, lets stop now since I don't think we can agree.

Don't get me wrong, I really should be working on other stuff, so I
have limited brain capacity and time :) OTOH I'm willing to help at
least discuss alternatives.

Let's think about realistic alternatives to keep FOLL_LONGTERM working
for any user (alternatives that would tackle the extreme fragmentation
issue at least, ignoring e.g., other fragmentation we can trigger with
FOLL_LONGTERM or ZONE_MOVABLE/MIGRATE_CMA):

The nasty thing really is splitting a compound page and then pinning
some of its pages, even if it's pinning the complete compound range.
Ideally, we'd defer any action to the time we actually FOLL_LONGTERM
pin a page.

a) I think we cannot migrate pages when splitting the PMD (e.g., on
unmap, MADV_DONTNEED, swap?, page compaction?). User space can just pin
the compound page to block migration.

b) We might migrate pages when splitting the compound page. In
split_huge_page_to_list() we know that nobody is pinning the page. I
did not check if it's possible, and there might be cases where it's not
immediately clear whether it is (e.g., inside shrink_page_list()). It
would mean that we migrate pages essentially any time we split a
compound page, because there could be someone FOLL_LONGTERM pinning the
page later. Usually we'd expect page compaction to fix this up on
actual demand. I'd call this sub-optimal.

c) We migrate any time someone FOLL_LONGTERM pins a page and the page
is not pinned yet -- because it might have been a split compound page.
I think we can agree that that's not an option :)

d) We remember if a page was part of a compound page and was not freed
yet. If we FOLL_LONGTERM pin such a page, we migrate it. Unfortunately,
we're short on pageflags for anon pages, I think.

Hm, alternatives?

-- 
Thanks,

David / dhildenb
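
To make the "toggle" idea above a bit more concrete, here is a minimal
sketch of what a gate for unprivileged FOLL_LONGTERM could look like.
This is purely illustrative and not existing kernel code: the sysctl
and the helper name are made up; only capable(CAP_IPC_LOCK) is an
existing interface.

#include <linux/capability.h>
#include <linux/types.h>

/* Hypothetical knob, default off; an admin or a distro could flip it. */
static int sysctl_unprivileged_longterm_pin;

/*
 * Hypothetical helper the FOLL_LONGTERM path could consult before
 * taking a long-term pin on behalf of the caller.
 */
static bool longterm_pin_allowed(void)
{
	/* Privileged users keep today's behaviour. */
	if (capable(CAP_IPC_LOCK))
		return true;

	/* Unprivileged users need the explicit opt-in. */
	return sysctl_unprivileged_longterm_pin != 0;
}

The "separate limit" variant discussed above would replace the boolean
with a per-user byte counter that gets charged when the long-term pin
is taken and uncharged when it is dropped.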