From: David Hildenbrand
Organization: Red Hat
To: Jason Gunthorpe
Cc: Vlastimil Babka, Jens Axboe, Andrew Dona-Couch, Andrew Morton,
 Drew DeVault, Ammar Faizi, linux-kernel@vger.kernel.org,
 linux-api@vger.kernel.org, io_uring Mailing List, Pavel Begunkov,
 linux-mm@kvack.org
Subject: Re: [PATCH] Increase default MLOCK_LIMIT to 8 MiB
Date: Tue, 30 Nov 2021 16:52:34 +0100
Message-ID: <8f82eacb-c6ad-807c-7e13-cd369e91a43d@redhat.com>
In-Reply-To: <20211124231133.GM5112@ziepe.ca>
References: <20211124132353.GG5112@ziepe.ca>
 <20211124132842.GH5112@ziepe.ca>
 <20211124134812.GI5112@ziepe.ca>
 <2cdbebb9-4c57-7839-71ab-166cae168c74@redhat.com>
 <20211124153405.GJ5112@ziepe.ca>
 <63294e63-cf82-1f59-5ea8-e996662e6393@redhat.com>
 <20211124183544.GL5112@ziepe.ca>
 <20211124231133.GM5112@ziepe.ca>

(sorry, was busy working on other stuff)

>> That would be giving up on compound pages (hugetlbfs, THP, ...)
>> on any current Linux system that does not use ZONE_MOVABLE -- which is
>> not something I am not willing to buy into, just like our customers ;)
>
> So we have ZONE_MOVABLE but users won't use it?

It's mostly used in the memory hot(un)plug context and we'll see growing
usage there in the near future (mostly due to dax/kmem, virtio-mem). One
has to be very careful how to size ZONE_MOVABLE, though, and it's
incompatible with various use cases (even huge pages on some
architectures are not movable and cannot be placed on ZONE_MOVABLE ...).
That's why we barely see it getting used automatically outside of the
memory hot(un)plug context or when explicitly set up by the admin for a
well-tuned system.

>
> Then why is the solution to push the same kinds of restrictions as
> ZONE_MOVABLE on to ZONE_NORMAL?

On any zone except ZONE_DEVICE to be precise. Defragmentation is one of
the main reasons we have pageblocks after all -- besides CMA and page
isolation. If we didn't care about defragmentation we could just squash
MIGRATE_MOVABLE, MIGRATE_UNMOVABLE, MIGRATE_RECLAIMABLE into a single
type. But after all that's the only thing that provides us with THP in
most setups out there.

Note that some people (IIRC Mel) even proposed to remove ZONE_MOVABLE
and instead have "sticky" MIGRATE_MOVABLE pageblocks, meaning
MIGRATE_MOVABLE pageblocks that cannot be converted to a different type
or stolen from -- which would mimic the same thing as the pageblocks we
essentially have in ZONE_MOVABLE.

>
>> See my other mail, the upstream version of my reproducer essentially
>> shows what FOLL_LONGTERM is currently doing wrong with pageblocks. And
>> at least to me that's an interesting insight :)
>
> Hmm. To your reproducer it would be nice if we could cgroup control
> the # of page blocks a cgroup has pinned. Focusing on # pages pinned
> is clearly the wrong metric, I suggested the whole compound earlier,
> but your point about the entire page block being ruined makes sense
> too.

# pages pinned is part of the story, but yes, "pinned something inside a
pageblock" is a better metric.

I would think that this might be complicated to track, though ...
especially once we have multiple cgroups pinning inside a single
pageblock. Hm ...

>
> It means pinned pages will have be migrated to already ruined page
> blocks the cgroup owns, which is a more controlled version of the
> FOLL_LONGTERM migration you have been thinking about.

MIGRATE_UNMOVABLE pageblocks are already ruined. But we'd need some way
to manage/charge pageblocks per cgroup I guess? That sounds very
interesting.

>
> This would effectively limit the fragmentation a hostile process group
> can create. If we further treated unmovable cgroup charged kernel
> allocations as 'pinned' and routed them to the pinned page blocks it
> start to look really interesting. Kill the cgroup, get all your THPs
> back? Fragmentation cannot extend past the cgroup?

So essentially any accounted unmovable kernel allocation (e.g., page
tables, secretmem, ...) would try to be placed on a MIGRATE_UNMOVABLE
pageblock "charged" to the respective cgroup?

>
> ie there are lots of batch workloads that could be interesting there -
> wrap the batch in a cgroup, run it, then kill everything and since the
> cgroup gives some lifetime clustering to the allocator you get a lot
> less fragmentation when the batch is finished, so the next batch gets
> more THPs, etc.
>
> There is also sort of an interesting optimization opportunity - many
> FOLL_LONGTERM users would be happy to spend more time pinning to get
> nice contiguous memory ranges. Might help convince people that the
> extra pin time for migrations is worthwhile.

Indeed. And fortunately, huge page users (heavily used in the vfio
context and for VMs) wouldn't be affected, because they only pin huge
pages and there is nothing to migrate then (well, excluding MIGRATE_CMA
and ZONE_MOVABLE, which we already have, of course).

>
>>> Something like io_ring is registering a bulk amount of memory and then
>>> doing some potentially long operations against it.
>>
>> The individual operations it performs are comparable to O_DIRECT I think
>
> Yes, and O_DIRECT can take 10s's of seconds in troubled cases with IO
> timeouts and things.
>

I might be wrong about O_DIRECT semantics, though. Staring at
fs/io_uring.c I don't really have a clue how they are getting used. I
assume they are getting used for DMA directly.

-- 
Thanks,

David / dhildenb