Date: Tue, 18 Aug 2020 08:15:43 -0700
From: Minchan Kim
To: David Hildenbrand
Cc: Andrew Morton, linux-mm, Joonsoo Kim, Vlastimil Babka, John Dias,
 Suren Baghdasaryan, pullip.cho@samsung.com, Chris Goldsworthy
Subject: Re: [RFC 0/7] Support high-order page bulk allocation
Message-ID: <20200818151543.GE3852332@google.com>
References: <20200814173131.2803002-1-minchan@kernel.org>
 <4e2bd095-b693-9fed-40e0-ab538ec09aaa@redhat.com>
 <20200817152706.GB3852332@google.com>
 <20200817163018.GC3852332@google.com>
 <20200817233442.GD3852332@google.com>
 <7c07e8cf-6adc-92be-d819-d60a389559d8@redhat.com>
In-Reply-To: <7c07e8cf-6adc-92be-d819-d60a389559d8@redhat.com>

On Tue, Aug 18, 2020 at 09:49:24AM +0200, David Hildenbrand wrote:
> On 18.08.20 01:34, Minchan Kim wrote:
> > On Mon, Aug 17, 2020 at 06:44:50PM +0200, David Hildenbrand wrote:
> >> On 17.08.20 18:30, Minchan Kim wrote:
> >>> On Mon, Aug 17, 2020 at 05:45:59PM +0200, David Hildenbrand wrote:
> >>>> On 17.08.20 17:27, Minchan Kim wrote:
> >>>>> On Sun, Aug 16, 2020 at 02:31:22PM +0200, David Hildenbrand wrote:
> >>>>>> On 14.08.20 19:31, Minchan Kim wrote:
> >>>>>>> There is special HW that requires bulk allocation of
> >>>>>>> high-order pages. For example, 4800 * order-4 pages.
> >>>>>>>
> >>>>>>> To meet the requirement, one option is to use a CMA area, because
> >>>>>>> the page allocator with compaction easily fails to meet the
> >>>>>>> requirement under memory pressure and is too slow for 4800
> >>>>>>> calls. However, CMA also has the following drawback:
> >>>>>>>
> >>>>>>> * 4800 order-4 cma_alloc() calls are too slow
> >>>>>>>
> >>>>>>> To avoid the slowness, we could try to allocate 300M of contiguous
> >>>>>>> memory once and then split it into order-4 chunks.
> >>>>>>> The problem with this approach is that the CMA allocation fails if
> >>>>>>> any page in that range cannot be migrated out, which happens easily
> >>>>>>> with fs writes under memory pressure.
> >>>>>>
> >>>>>> Why not choose a value in between? Like trying to allocate MAX_ORDER - 1
> >>>>>> chunks and splitting them. That would already heavily reduce the call frequency.
> >>>>>
> >>>>> I think you meant this:
> >>>>>
> >>>>> alloc_pages(GFP_KERNEL|__GFP_NOWARN, MAX_ORDER - 1)
> >>>>>
> >>>>> It would work if the system has lots of non-fragmented free memory.
> >>>>> However, once memory is fragmented, it doesn't work. That's why we have
> >>>>> easily seen even order-4 allocation failures in the field, and that's why
> >>>>> CMA was there.
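For illustration, the allocate-and-split idea above might look roughly like
the sketch below. This is a minimal sketch under stated assumptions, not
code from the patch set: the helper name and the chunk bookkeeping are made
up, while alloc_pages() and split_page() are existing kernel APIs.
split_page() turns a non-compound high-order page into order-0 pages that
stay physically contiguous, so every 16 consecutive pages can serve as one
order-4-sized chunk.

#include <linux/gfp.h>
#include <linux/mm.h>

#define CHUNK_ORDER	4	/* 16 pages = 64K per chunk */

/* Hypothetical helper: gather nr_chunks physically contiguous
 * order-4-sized chunks by splitting MAX_ORDER - 1 allocations. */
static int collect_order4_chunks(struct page **chunks, int nr_chunks)
{
	int filled = 0;

	while (filled < nr_chunks) {
		struct page *page;
		int i, nr;

		page = alloc_pages(GFP_KERNEL | __GFP_NOWARN, MAX_ORDER - 1);
		if (!page)
			return filled;	/* fragmented; caller must fall back */

		/* Non-compound page, so split it into order-0 pages. */
		split_page(page, MAX_ORDER - 1);

		nr = 1 << (MAX_ORDER - 1 - CHUNK_ORDER);
		for (i = 0; i < nr && filled < nr_chunks; i++)
			chunks[filled++] = page + (i << CHUNK_ORDER);
		/* Leftover sub-chunks, if any, would need freeing (omitted). */
	}
	return filled;
}

As the reply above notes, this only works while free memory is
unfragmented; once MAX_ORDER - 1 blocks are exhausted, the loop fails long
before individual order-4 allocations would.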
> >>>>>
> >>>>> CMA has more logic to isolate the memory during allocation/freeing,
> >>>>> as well as fragmentation avoidance, so it has less chance of being
> >>>>> stolen from by others and a higher success ratio. That's why I want
> >>>>> this API to be used with CMA or the movable zone.
> >>>>
> >>>> I was talking about doing MAX_ORDER - 1 CMA allocations instead of one
> >>>> big 300M allocation. As you correctly note, memory placed into CMA
> >>>> should be movable, except for (short/long) term pinnings. In these
> >>>> cases, doing allocations smaller than 300M and splitting them up should
> >>>> be good enough to reduce the call frequency, no?
> >>>
> >>> I should have written that. The 300M I mentioned is really the minimum size.
> >>> In some scenarios, we need way bigger than 300M, up to several GB.
> >>> Furthermore, the demand will increase in the near future.
> >>
> >> And what will the driver do with that data besides providing it to the
> >> device? Can it be mapped to user space? I think we really need more
> >> information / the actual user.
> >>
> >>>>
> >>>>>
> >>>>> One use case is that a device can set up an exclusive CMA area when
> >>>>> the system boots. When the device needs 4800 * order-4 pages, it could
> >>>>> call this bulk API against the area so that it is effectively
> >>>>> guaranteed to allocate enough memory fast.
> >>>>
> >>>> Just wondering
> >>>>
> >>>> a) Why does it have to be fast?
> >>>
> >>> That's because it's related to application latency, which ends up
> >>> making the user feel bad.
> >>
> >> Okay, but in theory, your device-needs are very similar to
> >> application-needs, besides you requiring order-4 pages, correct? Similar
> >> to an application that starts up and pins 300M (or more), just with
> >> order-4 pages.
> >
> > Yes.
> >
> >>
> >> I don't quite get yet why you need a range allocator for that. Because
> >> you intend to use CMA?
> >
> > Yes; with CMA, it could be better guaranteed and fast enough with a
> > little tweaking. Currently, CMA is too slow due to the IPI overheads
> > below:
> >
> > 1. set_migratetype_isolate() does drain_all_pages() for every pageblock.
> > 2. __alloc_contig_migrate_range() does migrate_prep().
> > 3. alloc_contig_range() does lru_add_drain_all().
> >
> > Thus, if we increase the call frequency as you suggest, the setup
> > overhead also scales up with the size. Such overhead makes sense when a
> > caller requests big contiguous memory, but it's too much for normal
> > high-order allocations.
> >
> > Maybe we could optimize those call sites to reduce or remove the
> > frequency of those IPI calls in a smarter way, but that would end up
> > trading success ratio against speed.
> >
> > Another concern with the existing CMA API is that it tries to make the
> > allocation succeed at the cost of latency, for example by waiting for
> > page writeback.
> >
> > That's where this new API semantic comes from, as a compromise: I
> > believe we need some way to separate the original CMA alloc (biased
> > toward being guaranteed, but slower) from this new API (biased toward
> > being fast, but less guaranteed).
> >
> > Is there any way to do this without tweaking the existing CMA API?
>
> Let me try to summarize:
>
> 1. Your driver needs a lot of order-4 pages, and it needs them fast,
> because of observable lag/delay in an application. The pages will be
> unmovable by the driver.
>
> 2. Your idea is to use CMA, as that avoids unmovable allocations,
> theoretically allowing you to allocate all memory. But you don't
> actually want a large contiguous memory area.
>
> 3. Doing a whole bunch of order-4 cma allocations is slow.
>
> 4. Doing a single large cma allocation and splitting it manually in the
> caller can fail easily due to temporary page pinnings.
>
>
> Regarding 4., [1] comes to mind, which has the same issues with
> temporary page pinnings and solves it by simply retrying. Yeah, there
> will be some lag, but maybe it's overall faster than doing separate
> order-4 cma allocations?

Thanks for the pointer. However, temporary pinning is not the only reason
CMA can fail. Historically, there are various potential problems that turn
a "temporary" condition into a "non-temporary" one, like page writeback or
indirect dependencies between objects.

> In general, proactive compaction [2] comes to mind, does that help?

I think it makes sense if such high-order allocations are dominant in the
system workload, because the TLB benefit would outweigh the cost of the
frequent migration overhead. However, that's not our use case.

> [1] https://lore.kernel.org/r/1596682582-29139-2-git-send-email-cgoldswo@codeaurora.org/
> [2] https://nitingupta.dev/post/proactive-compaction/

I understand the pfn handling in the API is not pretty, but the concept
makes sense to me: go through the *migratable area* and try hard to gather
as many pages of the requested order as possible. It looks like a
GFP_NORETRY version of kmem_cache_alloc_bulk(). How about this?

int cma_alloc(struct cma *cma, int order, unsigned int nr_elem, struct page **pages);
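For illustration, a caller of the proposed bulk interface might look like
the sketch below. Only the signature is taken from the proposal above; the
device's CMA handle, the element count, and the return-value semantic
(assumed here to be the number of elements actually allocated, which the
proposal leaves unspecified) are all hypothetical.

#include <linux/cma.h>
#include <linux/mm.h>

#define NR_ELEM	4800	/* example count from the cover letter */

static struct page *chunks[NR_ELEM];

/* Hypothetical driver path; cam_cma would be the exclusive CMA area
 * the device set up at boot. */
static int cam_alloc_buffers(struct cma *cam_cma)
{
	/* Assumed semantic: best effort, returns how many order-4
	 * elements were actually placed into chunks[]. */
	int got = cma_alloc(cam_cma, 4, NR_ELEM, chunks);

	if (got < NR_ELEM) {
		/* Fast path could not fill the request; free what we
		 * got and fall back to a slower path (omitted). */
		return -ENOMEM;
	}
	return 0;
}

The point of the new semantic shows up in the error path: unlike the
existing cma_alloc(), the bulk variant is expected to return quickly with a
partial result rather than stall on things like page writeback to
guarantee success.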