From: David Hildenbrand
Organization: Red Hat
To: Jason Gunthorpe
Cc: Vlastimil Babka, Jens Axboe, Andrew Dona-Couch, Andrew Morton,
    Drew DeVault, Ammar Faizi, linux-kernel@vger.kernel.org,
    linux-api@vger.kernel.org, io_uring Mailing List, Pavel Begunkov,
    linux-mm@kvack.org
Date: Wed, 24 Nov 2021 20:09:42 +0100
Subject: Re: [PATCH] Increase default MLOCK_LIMIT to 8 MiB
In-Reply-To: <20211124183544.GL5112@ziepe.ca>

On 24.11.21 19:35, Jason Gunthorpe wrote:
> On Wed, Nov 24, 2021 at 05:43:58PM +0100, David Hildenbrand wrote:
>> On 24.11.21 16:34, Jason Gunthorpe wrote:
>>> On Wed, Nov 24, 2021 at 03:14:00PM +0100, David Hildenbrand wrote:
>>>
>>>> I'm not aware of any where you can fragment 50% of all pageblocks in the
>>>> system as an unprivileged user essentially consuming almost no memory
>>>> and essentially staying inside well-defined memlock limits. But sure if
>>>> there are "many" people will be able to come up with at least one
>>>> comparable thing. I'll be happy to learn.
>>>
>>> If the concern is that THP's can be DOS'd then any avenue that renders
>>> the system out of THPs is a DOS attack vector. Including all the
>>> normal workloads that people run and already complain that THPs get
>>> exhausted.
>>>
>>> A hostile userspace can only quicken this process.
>>
>> We can not only fragment THP but also easily smaller compound pages,
>> with less impact though (well, as long as people want more than 0.1% per
>> user ...).
>
> My point is as long as userspace can drive this fragmentation, by any
> means, we can never have DOS proof higher order pages, so lets not
> worry so much about one of many ways to create fragmentation.
>

That would be giving up on compound pages (hugetlbfs, THP, ...) on any
current Linux system that does not use ZONE_MOVABLE -- which is not
something I am willing to buy into, just like our customers ;)

See my other mail: the upstream version of my reproducer essentially
shows what FOLL_LONGTERM is currently doing wrong with pageblocks. And
at least to me that's an interesting insight :)

I agree that the more extreme scenarios I can construct are a secondary
concern. But my upstream reproducer just highlights what can easily
happen in reality.

>>>> My position that FOLL_LONGTERM for unprivileged users is a strong no-go
>>>> stands as it is.
>>>
>>> As this basically excludes long standing pre-existing things like
>>> RDMA, XDP, io_uring, and more I don't think this can be the general
>>> answer for mm, sorry.
>>
>> Let's think about options to restrict FOLL_LONGTERM usage:
>
> Which gives me the view that we should be talking about how to make
> high order pages completely DOS proof, not about FOLL_LONGTERM.

Sure, one step at a time ;)

>
> To me that is exactly what ZONE_MOVABLE strives to achieve, and I
> think anyone who cares about QOS around THP must include ZONE_MOVABLE
> in their solution.

100% yes.

>
> In all of this I am thinking back to the discussion about the 1GB THP
> proposal which was resoundly shot down on the grounds that 2MB THP
> *doesn't work* today due to the existing fragmentation problems.

The point that "2MB THP" doesn't work is just wrong. Pageblocks do their
job very well, but we can end up in corner-case situations where more
and more pageblocks get fragmented. And people are constantly
improving these corner cases (e.g. proactive compaction).

Usually you have to allocate *a lot* of memory and put the system under
extreme memory pressure, such that unmovable allocations spill into
movable pageblocks and the other way around.

The thing about my reproducer is that it does that without any memory
pressure, and that is the BIG difference to everything else we have in
that regard. You can have an idle 1TiB system running my reproducer and
it will fragment half of all pageblocks in the system while mlocking
~ 1GiB. And that highlights the real issue IMHO.

The 1 GB THP project is still going on BTW.

>
>> Another option would be not accounting FOLL_LONGTERM as RLIMIT_MEMLOCK,
>> but instead as something that explicitly matches the differing
>> semantics.
>
> Also a good idea, someone who cares about this should really put
> pinned pages into the cgroup machinery (with correct accounting!)
>
>> At the same time, eventually work on proper alternatives with mmu
>> notifiers (and possibly without the any such limits) where possible
>> and required.
>
> mmu_notifiers is also bad, it just offends a different group of MM
> concerns :)

Yeah, I know, locking nightmare.

>
> Something like io_ring is registering a bulk amount of memory and then
> doing some potentially long operations against it.

The individual operations it performs are comparable to O_DIRECT, I
think -- but I'm no expert.

>
> So to use a mmu_notifier scheme you'd have to block the mmu_notifier
> invalidate_range_start until all the operations touching the memory
> finish (and suspend new operations at the same time!).
>
> Blocking the notifier like this locks up the migration/etc threads
> completely, and is destructive to the OOM reclaim.
>
> At least with a pinned page those threads don't even try to touch it
> instead of getting stuck up.

Yes, if only we'd be pinning for a restricted amount of time ...

-- 
Thanks,

David / dhildenb
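
For reference, the "idle 1TiB system, half of all pageblocks, ~ 1GiB
mlocked" numbers can be sanity-checked with a few lines of C. This is
only a back-of-the-envelope sketch, not the reproducer. It assumes
values that are not stated in this mail: 4 KiB base pages, 2 MiB
pageblocks (the usual x86-64 configuration), and exactly one long-term
pinned base page per affected pageblock. The helper names are made up
for illustration.

```c
#include <stdint.h>

/* Assumed sizes (not taken from the mail): typical x86-64 defaults. */
#define BASE_PAGE_SIZE  (4ULL << 10)   /* 4 KiB base page */
#define PAGEBLOCK_SIZE  (2ULL << 20)   /* 2 MiB pageblock */

/* Number of pageblocks covering mem_bytes of RAM. */
static uint64_t pageblocks_in(uint64_t mem_bytes)
{
    return mem_bytes / PAGEBLOCK_SIZE;
}

/*
 * Bytes that must be pinned to place one unmovable base page into each
 * of nr_pageblocks pageblocks (assumption: one pinned page per block).
 */
static uint64_t pinned_bytes_for(uint64_t nr_pageblocks)
{
    return nr_pageblocks * BASE_PAGE_SIZE;
}
```

Under these assumptions, pageblocks_in(1ULL << 40) yields 524288
pageblocks for 1 TiB, and pinned_bytes_for() applied to half of them
(262144) yields 1 GiB, which is consistent with the figures quoted
above.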