Date: Tue, 18 Aug 2020 08:15:43 -0700
From: Minchan Kim
To: David Hildenbrand
Cc: Andrew Morton, linux-mm, Joonsoo Kim, Vlastimil Babka, John Dias,
 Suren Baghdasaryan, pullip.cho@samsung.com, Chris Goldsworthy
Subject: Re: [RFC 0/7] Support high-order page bulk allocation
Message-ID: <20200818151543.GE3852332@google.com>
References: <20200814173131.2803002-1-minchan@kernel.org>
 <4e2bd095-b693-9fed-40e0-ab538ec09aaa@redhat.com>
 <20200817152706.GB3852332@google.com>
 <20200817163018.GC3852332@google.com>
 <20200817233442.GD3852332@google.com>
 <7c07e8cf-6adc-92be-d819-d60a389559d8@redhat.com>
In-Reply-To: <7c07e8cf-6adc-92be-d819-d60a389559d8@redhat.com>

On Tue, Aug 18, 2020 at 09:49:24AM +0200, David Hildenbrand wrote:
> On 18.08.20 01:34, Minchan Kim wrote:
> > On Mon, Aug 17, 2020 at 06:44:50PM +0200, David Hildenbrand wrote:
> >> On 17.08.20 18:30, Minchan Kim wrote:
> >>> On Mon, Aug 17, 2020 at 05:45:59PM +0200, David Hildenbrand wrote:
> >>>> On 17.08.20 17:27, Minchan Kim wrote:
> >>>>> On Sun, Aug 16, 2020 at 02:31:22PM +0200, David Hildenbrand wrote:
> >>>>>> On 14.08.20 19:31, Minchan Kim wrote:
> >>>>>>> There is special HW that requires bulk allocation of
> >>>>>>> high-order pages. For example, 4800 * order-4 pages.
> >>>>>>>
> >>>>>>> To meet the requirement, one option is to use a CMA area, because
> >>>>>>> the page allocator with compaction easily fails to meet the
> >>>>>>> requirement under memory pressure and is too slow for 4800
> >>>>>>> calls. However, CMA also has the following drawback:
> >>>>>>>
> >>>>>>> * 4800 order-4 cma_alloc() calls are too slow
> >>>>>>>
> >>>>>>> To avoid the slowness, we could try to allocate 300M of contiguous
> >>>>>>> memory once and then split it into order-4 chunks.
> >>>>>>> The problem with this approach is that the CMA allocation fails if
> >>>>>>> any page in that range cannot be migrated out, which happens easily
> >>>>>>> with fs writes under memory pressure.
> >>>>>>
> >>>>>> Why not choose a value in between? Like trying to allocate MAX_ORDER - 1
> >>>>>> chunks and splitting them. That would already heavily reduce the call frequency.
> >>>>>
> >>>>> I think you meant this:
> >>>>>
> >>>>> alloc_pages(GFP_KERNEL|__GFP_NOWARN, MAX_ORDER - 1)
> >>>>>
> >>>>> It would work if the system has lots of non-fragmented free memory.
> >>>>> However, once memory is fragmented, it doesn't work. That's why we have
> >>>>> easily seen even order-4 allocation failures in the field, and that's why
> >>>>> CMA was there.
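For illustration, the allocate-and-split idea above might look roughly like
the sketch below. This is a minimal sketch under stated assumptions, not
code from the patch set: the helper name and the chunk bookkeeping are made
up, while alloc_pages() and split_page() are existing kernel APIs.
split_page() turns a non-compound high-order page into order-0 pages that
stay physically contiguous, so every 16 consecutive pages can serve as one
order-4-sized chunk.

#include <linux/gfp.h>
#include <linux/mm.h>

#define CHUNK_ORDER	4	/* 16 pages = 64K per chunk */

/* Hypothetical helper: gather nr_chunks physically contiguous
 * order-4-sized chunks by splitting MAX_ORDER - 1 allocations. */
static int collect_order4_chunks(struct page **chunks, int nr_chunks)
{
	int filled = 0;

	while (filled < nr_chunks) {
		struct page *page;
		int i, nr;

		page = alloc_pages(GFP_KERNEL | __GFP_NOWARN, MAX_ORDER - 1);
		if (!page)
			return filled;	/* fragmented; caller must fall back */

		/* Non-compound page, so split it into order-0 pages. */
		split_page(page, MAX_ORDER - 1);

		nr = 1 << (MAX_ORDER - 1 - CHUNK_ORDER);
		for (i = 0; i < nr && filled < nr_chunks; i++)
			chunks[filled++] = page + (i << CHUNK_ORDER);
		/* Leftover sub-chunks, if any, would need freeing (omitted). */
	}
	return filled;
}

As the reply above notes, this only works while free memory is
unfragmented; once MAX_ORDER - 1 blocks are exhausted, the loop fails long
before individual order-4 allocations would.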
> >>>>>
> >>>>> CMA has more logic to isolate the memory during allocation/freeing,
> >>>>> as well as fragmentation avoidance, so it has less chance of being
> >>>>> stolen from by others and a higher success ratio. That's why I want
> >>>>> this API to be used with CMA or the movable zone.
> >>>>
> >>>> I was talking about doing MAX_ORDER - 1 CMA allocations instead of one
> >>>> big 300M allocation. As you correctly note, memory placed into CMA
> >>>> should be movable, except for (short/long) term pinnings. In these
> >>>> cases, doing allocations smaller than 300M and splitting them up should
> >>>> be good enough to reduce the call frequency, no?
> >>>
> >>> I should have written that. The 300M I mentioned is really the minimum size.
> >>> In some scenarios, we need way bigger than 300M, up to several GB.
> >>> Furthermore, the demand will increase in the near future.
> >>
> >> And what will the driver do with that data besides providing it to the
> >> device? Can it be mapped to user space? I think we really need more
> >> information / the actual user.
> >>
> >>>>
> >>>>>
> >>>>> One use case is that a device can set up an exclusive CMA area when
> >>>>> the system boots. When the device needs 4800 * order-4 pages, it could
> >>>>> call this bulk API against the area so that it is effectively
> >>>>> guaranteed to allocate enough memory fast.
> >>>>
> >>>> Just wondering
> >>>>
> >>>> a) Why does it have to be fast?
> >>>
> >>> That's because it's related to application latency, which ends up
> >>> making the user feel bad.
> >>
> >> Okay, but in theory, your device-needs are very similar to
> >> application-needs, besides you requiring order-4 pages, correct? Similar
> >> to an application that starts up and pins 300M (or more), just with
> >> order-4 pages.
> >
> > Yes.
> >
> >>
> >> I don't quite get yet why you need a range allocator for that. Because
> >> you intend to use CMA?
> >
> > Yes; with CMA, it could be better guaranteed and fast enough with a
> > little tweaking. Currently, CMA is too slow due to the IPI overheads
> > below:
> >
> > 1. set_migratetype_isolate() does drain_all_pages() for every pageblock.
> > 2. __alloc_contig_migrate_range() does migrate_prep().
> > 3. alloc_contig_range() does lru_add_drain_all().
> >
> > Thus, if we increase the call frequency as you suggest, the setup
> > overhead also scales up with the size. Such overhead makes sense when a
> > caller requests big contiguous memory, but it's too much for normal
> > high-order allocations.
> >
> > Maybe we could optimize those call sites to reduce or remove the
> > frequency of those IPI calls in a smarter way, but that would end up
> > trading success ratio against speed.
> >
> > Another concern with the existing CMA API is that it tries to make the
> > allocation succeed at the cost of latency, for example by waiting for
> > page writeback.
> >
> > That's where this new API semantic comes from, as a compromise: I
> > believe we need some way to separate the original CMA alloc (biased
> > toward being guaranteed, but slower) from this new API (biased toward
> > being fast, but less guaranteed).
> >
> > Is there any way to do this without tweaking the existing CMA API?
>
> Let me try to summarize:
>
> 1. Your driver needs a lot of order-4 pages, and it needs them fast,
> because of observable lag/delay in an application. The pages will be
> unmovable by the driver.
>
> 2. Your idea is to use CMA, as that avoids unmovable allocations,
> theoretically allowing you to allocate all memory. But you don't
> actually want a large contiguous memory area.
>
> 3. Doing a whole bunch of order-4 cma allocations is slow.
>
> 4. Doing a single large cma allocation and splitting it manually in the
> caller can fail easily due to temporary page pinnings.
>
>
> Regarding 4., [1] comes to mind, which has the same issues with
> temporary page pinnings and solves it by simply retrying. Yeah, there
> will be some lag, but maybe it's overall faster than doing separate
> order-4 cma allocations?

Thanks for the pointer. However, temporary pinning is not the only reason
CMA can fail. Historically, there are various potential problems that turn
a "temporary" condition into a "non-temporary" one, like page writeback or
indirect dependencies between objects.

> In general, proactive compaction [2] comes to mind, does that help?

I think it makes sense if such high-order allocations are dominant in the
system workload, because the TLB benefit would outweigh the cost of the
frequent migration overhead. However, that's not our use case.

> [1] https://lore.kernel.org/r/1596682582-29139-2-git-send-email-cgoldswo@codeaurora.org/
> [2] https://nitingupta.dev/post/proactive-compaction/

I understand the pfn handling in the API is not pretty, but the concept
makes sense to me: go through the *migratable area* and try hard to gather
as many pages of the requested order as possible. It looks like a
GFP_NORETRY version of kmem_cache_alloc_bulk(). How about this?

int cma_alloc(struct cma *cma, int order, unsigned int nr_elem, struct page **pages);
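For illustration, a caller of the proposed bulk interface might look like
the sketch below. Only the signature is taken from the proposal above; the
device's CMA handle, the element count, and the return-value semantic
(assumed here to be the number of elements actually allocated, which the
proposal leaves unspecified) are all hypothetical.

#include <linux/cma.h>
#include <linux/mm.h>

#define NR_ELEM	4800	/* example count from the cover letter */

static struct page *chunks[NR_ELEM];

/* Hypothetical driver path; cam_cma would be the exclusive CMA area
 * the device set up at boot. */
static int cam_alloc_buffers(struct cma *cam_cma)
{
	/* Assumed semantic: best effort, returns how many order-4
	 * elements were actually placed into chunks[]. */
	int got = cma_alloc(cam_cma, 4, NR_ELEM, chunks);

	if (got < NR_ELEM) {
		/* Fast path could not fill the request; free what we
		 * got and fall back to a slower path (omitted). */
		return -ENOMEM;
	}
	return 0;
}

The point of the new semantic shows up in the error path: unlike the
existing cma_alloc(), the bulk variant is expected to return quickly with a
partial result rather than stall on things like page writeback to
guarantee success.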