Date: Mon, 17 Aug 2020 16:34:42 -0700
From: Minchan Kim
To: David Hildenbrand
Cc: Andrew Morton, linux-mm, Joonsoo Kim, Vlastimil Babka, John Dias,
 Suren Baghdasaryan, pullip.cho@samsung.com
Subject: Re: [RFC 0/7] Support high-order page bulk allocation
Message-ID: <20200817233442.GD3852332@google.com>
References: <20200814173131.2803002-1-minchan@kernel.org>
 <4e2bd095-b693-9fed-40e0-ab538ec09aaa@redhat.com>
 <20200817152706.GB3852332@google.com>
 <20200817163018.GC3852332@google.com>

On Mon, Aug 17, 2020 at 06:44:50PM +0200, David Hildenbrand wrote:
> On 17.08.20 18:30, Minchan Kim wrote:
> > On Mon, Aug 17, 2020 at 05:45:59PM +0200, David Hildenbrand wrote:
> >> On 17.08.20 17:27, Minchan Kim wrote:
> >>> On Sun, Aug 16, 2020 at 02:31:22PM +0200, David Hildenbrand wrote:
> >>>> On 14.08.20 19:31, Minchan Kim wrote:
> >>>>> There is a need for special HW that requires bulk allocation of
> >>>>> high-order pages, for example, 4800 order-4 pages.
> >>>>>
> >>>>> To meet the requirement, an option is using a CMA area, because
> >>>>> the page allocator with compaction under memory pressure easily
> >>>>> fails to meet the requirement and is too slow for 4800
> >>>>> iterations.
> >>>>> However, CMA also has the following drawback:
> >>>>>
> >>>>> * 4800 order-4 cma_alloc calls are too slow
> >>>>>
> >>>>> To avoid the slowness, we could try to allocate 300M of contiguous
> >>>>> memory once and then split it into order-4 chunks.
> >>>>> The problem with this approach is that the CMA allocation fails if
> >>>>> one of the pages in the range couldn't migrate out, which happens
> >>>>> easily with fs writes under memory pressure.
> >>>>
> >>>> Why not choose a value in between? Like try to allocate MAX_ORDER - 1
> >>>> chunks and split them. That would already heavily reduce the call
> >>>> frequency.
> >>>
> >>> I think you meant this:
> >>>
> >>> alloc_pages(GFP_KERNEL|__GFP_NOWARN, MAX_ORDER - 1)
> >>>
> >>> It would work if the system has lots of non-fragmented free memory.
> >>> However, once it is fragmented, it doesn't work. That's why we have
> >>> easily seen even order-4 allocation failures in the field, and that's
> >>> why CMA was there.
> >>>
> >>> CMA has more logic to isolate the memory during allocation/freeing, as
> >>> well as fragmentation avoidance, so that it has less chance to be
> >>> stolen from by others and a higher success ratio. That's why I want
> >>> this API to be used with CMA or the movable zone.
> >>
> >> I was talking about doing MAX_ORDER - 1 CMA allocations instead of one
> >> big 300M allocation. As you correctly note, memory placed into CMA
> >> should be movable, except for (short/long) term pinnings. In these
> >> cases, doing allocations smaller than 300M and splitting them up should
> >> be good enough to reduce the call frequency, no?
> >
> > I should have written that. The 300M I mentioned is really the minimum
> > size. In some scenarios, we need way bigger than 300M, up to several GB.
> > Furthermore, the demand would increase in the near future.
>
> And what will the driver do with that data besides providing it to the
> device? Can it be mapped to user space? I think we really need more
> information / the actual user.
>
> >>
> >>>
> >>> A usecase is a device that sets an exclusive CMA area up when the
> >>> system boots. When the device needs 4800 order-4 pages, it could
> >>> call this bulk allocation against the area so that it is effectively
> >>> guaranteed to allocate enough, fast.
> >>
> >> Just wondering
> >>
> >> a) Why does it have to be fast?
> >
> > That's because it's related to application latency, which ends up
> > making the user feel bad.
>
> Okay, but in theory, your device-needs are very similar to
> application-needs, besides you requiring order-4 pages, correct? Similar
> to an application that starts up and pins 300M (or more), just with
> order-4 pages.

Yes.

> I don't get quite yet why you need a range allocator for that. Because
> you intend to use CMA?

Yes; with CMA, it could be more guaranteed and fast enough with a little
tweaking. Currently, CMA is too slow due to the IPI overheads below:

1. set_migratetype_isolate does drain_all_pages for every pageblock.
2. __alloc_contig_migrate_range does migrate_prep.
3. alloc_contig_range does lru_add_drain_all.

Thus, if we increase the call frequency as you suggest, the setup
overhead also scales up with the number of calls. Such overhead makes
sense when the caller requests one big contiguous chunk of memory, but
it's too much for normal high-order allocations.

Maybe we could optimize those call sites to reduce or remove the
frequency of those IPI calls in a smarter way, but that would end up
having to trade success ratio against speed.

Another concern with using the existing CMA API is that it tries to
make the allocation successful at the cost of latency, for example by
waiting for page writeback.

That's where this new semantic API comes from, as a compromise: I
believe we need some way to separate the original CMA alloc (biased
toward being guaranteed, but slower) from this new API (biased toward
being fast, but less guaranteed).

Is there any way to do this without tweaking the existing CMA API?

> >
> >> b) Why does it need that many order-4 pages?
> >
> > It's a HW requirement.
> > I couldn't say much about that.
>
> Hm.
>
> >
> >> c) How dynamic is the device need at runtime?
> >
> > Whenever the application is launched. It depends on the user's usage
> > pattern.
> >
> >> d) Would it be reasonable in your setup to mark a CMA region in a way
> >> such that it will never be used for other (movable) allocations,
> >
> > I don't get your point. If we don't want the area to be used up by
> > other movable allocations, why should we use it as CMA in the first
> > place? It sounds like reserved memory that just wastes the memory.
>
> Right, it's just very hard to get what you are trying to achieve without
> the actual user at hand.
>
> For example, will the pages you allocate be movable? Does the device
> allow for that? If not, then the MOVABLE zone is usually not valid
> (similar to gigantic pages not being allocated from the MOVABLE zone).
> So you're stuck with the NORMAL zone or CMA. Especially for the NORMAL
> zone, alloc_contig_range() is currently not prepared to properly handle
> sub-MAX_ORDER - 1 ranges. If any involved pageblock contains an
> unmovable page, the allocation will fail (see pageblock isolation /
> has_unmovable_pages()). So CMA would be your only option.

Those pages are not migratable, so I agree that CMA would be the only
option here.