Subject: Re: [RFC PATCH 1/3] mm: support hugetlb free page reporting
From: Liang Li
Date: Wed, 23 Dec 2020 11:57:31 +0800
To: Mike Kravetz
Cc: Alexander Duyck, Mel Gorman, Andrew Morton, Andrea Arcangeli,
    Dan Williams, "Michael S. Tsirkin", David Hildenbrand, Jason Wang,
    Dave Hansen, Michal Hocko, Liang Li, linux-mm, LKML,
    virtualization@lists.linux-foundation.org, qemu-devel@nongnu.org
In-Reply-To: <63318bf1-21ea-7202-e060-b4b2517c684e@oracle.com>

> On 12/21/20 11:46 PM, Liang Li wrote:
> > Free page reporting only supports buddy pages; it can't report the
> > free pages reserved for hugetlbfs. On the other hand, hugetlbfs
> > is a good choice for a system with a huge amount of RAM, because it
> > can help to reduce memory management overhead and improve system
> > performance.
> > This patch adds support for reporting hugepages on the free list
> > of hugetlb. It can be used by the virtio_balloon driver for memory
> > overcommit and for pre-zeroing free pages to speed up memory population.
>
> My apologies, as I do not follow the virtio_balloon driver. Comments
> below are from the hugetlb perspective.

Any comments are welcome.

> > static struct page *dequeue_huge_page_node_exact(struct hstate *h, int nid)
> > @@ -5531,6 +5537,29 @@ follow_huge_pgd(struct mm_struct *mm, unsigned long address, pgd_t *pgd, int fla
> > 	return pte_page(*(pte_t *)pgd) + ((address & ~PGDIR_MASK) >> PAGE_SHIFT);
> > }
> >
> > +bool isolate_free_huge_page(struct page *page, struct hstate *h, int nid)
>
> Looks like this always returns true. Should it be type void?

Will change in the next revision.
> > +{
> > +	bool ret = true;
> > +
> > +	VM_BUG_ON_PAGE(!PageHead(page), page);
> > +
> > +	list_move(&page->lru, &h->hugepage_activelist);
> > +	set_page_refcounted(page);
> > +	h->free_huge_pages--;
> > +	h->free_huge_pages_node[nid]--;
> > +
> > +	return ret;
> > +}
> > +
> > ...
> > +static void
> > +hugepage_reporting_drain(struct page_reporting_dev_info *prdev,
> > +			 struct hstate *h, struct scatterlist *sgl,
> > +			 unsigned int nents, bool reported)
> > +{
> > +	struct scatterlist *sg = sgl;
> > +
> > +	/*
> > +	 * Drain the now reported pages back into their respective
> > +	 * free lists/areas. We assume at least one page is populated.
> > +	 */
> > +	do {
> > +		struct page *page = sg_page(sg);
> > +
> > +		putback_isolate_huge_page(h, page);
> > +
> > +		/* If the pages were not reported due to error skip flagging */
> > +		if (!reported)
> > +			continue;
> > +
> > +		__SetPageReported(page);
> > +	} while ((sg = sg_next(sg)));
> > +
> > +	/* reinitialize scatterlist now that it is empty */
> > +	sg_init_table(sgl, nents);
> > +}
> > +
> > +/*
> > + * The page reporting cycle consists of 4 stages, fill, report, drain, and
> > + * idle. We will cycle through the first 3 stages until we cannot obtain a
> > + * full scatterlist of pages, in that case we will switch to idle.
> > + */
>
> As mentioned, I am not familiar with virtio_balloon and the overall design.
> So, some of this does not make sense to me.
>
> > +static int
> > +hugepage_reporting_cycle(struct page_reporting_dev_info *prdev,
> > +			 struct hstate *h, unsigned int nid,
> > +			 struct scatterlist *sgl, unsigned int *offset)
> > +{
> > +	struct list_head *list = &h->hugepage_freelists[nid];
> > +	unsigned int page_len = PAGE_SIZE << h->order;
> > +	struct page *page, *next;
> > +	long budget;
> > +	int ret = 0, scan_cnt = 0;
> > +
> > +	/*
> > +	 * Perform early check, if free area is empty there is
> > +	 * nothing to process so we can skip this free_list.
> > +	 */
> > +	if (list_empty(list))
> > +		return ret;
>
> Do note that not all entries on the hugetlb free lists are free. Reserved
> entries are also on the free list. The actual number of free entries is
> 'h->free_huge_pages - h->resv_huge_pages'.
> Is the intention to process reserved pages as well as free pages?

Yes, reserved pages are treated as 'free pages' here.

> > +
> > +	spin_lock_irq(&hugetlb_lock);
> > +
> > +	if (huge_page_order(h) > MAX_ORDER)
> > +		budget = HUGEPAGE_REPORTING_CAPACITY;
> > +	else
> > +		budget = HUGEPAGE_REPORTING_CAPACITY * 32;
> > +
> > +	/* loop through free list adding unreported pages to sg list */
> > +	list_for_each_entry_safe(page, next, list, lru) {
> > +		/* We are going to skip over the reported pages. */
> > +		if (PageReported(page)) {
> > +			if (++scan_cnt >= MAX_SCAN_NUM) {
> > +				ret = scan_cnt;
> > +				break;
> > +			}
> > +			continue;
> > +		}
> > +
> > +		/*
> > +		 * If we fully consumed our budget then update our
> > +		 * state to indicate that we are requesting additional
> > +		 * processing and exit this list.
> > +		 */
> > +		if (budget < 0) {
> > +			atomic_set(&prdev->state, PAGE_REPORTING_REQUESTED);
> > +			next = page;
> > +			break;
> > +		}
> > +
> > +		/* Attempt to pull page from list and place in scatterlist */
> > +		if (*offset) {
> > +			isolate_free_huge_page(page, h, nid);
>
> Once a hugetlb page is isolated, it can not be used and applications that
> depend on hugetlb pages can start to fail.
> I assume that is acceptable/expected behavior. Correct?
> On some systems, hugetlb pages are a precious resource and the sysadmin
> carefully configures the number needed by applications. Removing a hugetlb
> page (even for a very short period of time) could cause serious application
> failure.

That's true, especially for 1GB pages. Any suggestions? Perhaps make the
hugepage allocator aware of this situation and retry?

> My apologies if that is a stupid question. I really have no knowledge of
> this area.
> --
> Mike Kravetz

Thanks for your comments, Mike.

Liang