From: Michal Hocko <mhocko@kernel.org>
To: Wei Wang <wei.w.wang@intel.com>
Cc: linux-kernel@vger.kernel.org,
virtualization@lists.linux-foundation.org, kvm@vger.kernel.org,
linux-mm@kvack.org, mst@redhat.com, mawilcox@microsoft.com,
akpm@linux-foundation.org, virtio-dev@lists.oasis-open.org,
david@redhat.com, cornelia.huck@de.ibm.com,
mgorman@techsingularity.net, aarcange@redhat.com,
amit.shah@redhat.com, pbonzini@redhat.com,
liliang.opensource@gmail.com, yang.zhang.wz@gmail.com,
quan.xu@aliyun.com
Subject: Re: [PATCH v13 4/5] mm: support reporting free page blocks
Date: Thu, 3 Aug 2017 15:50:47 +0200
Message-ID: <20170803135047.GV12521@dhcp22.suse.cz>
In-Reply-To: <59832265.1040805@intel.com>

On Thu 03-08-17 21:17:25, Wei Wang wrote:
> On 08/03/2017 08:41 PM, Michal Hocko wrote:
> >On Thu 03-08-17 20:11:58, Wei Wang wrote:
> >>On 08/03/2017 07:28 PM, Michal Hocko wrote:
> >>>On Thu 03-08-17 19:27:19, Wei Wang wrote:
> >>>>On 08/03/2017 06:44 PM, Michal Hocko wrote:
> >>>>>On Thu 03-08-17 18:42:15, Wei Wang wrote:
> >>>>>>On 08/03/2017 05:11 PM, Michal Hocko wrote:
> >>>>>>>On Thu 03-08-17 14:38:18, Wei Wang wrote:
> >>>>>[...]
> >>>>>>>>+static int report_free_page_block(struct zone *zone, unsigned int order,
> >>>>>>>>+ unsigned int migratetype, struct page **page)
> >>>>>>>This is just too ugly and wrong actually. Never provide struct page
> >>>>>>>pointers outside of the zone->lock. What I had in mind was to simply
> >>>>>>>walk the free lists of the suitable order and call the callback for
> >>>>>>>each one. Something as simple as:
> >>>>>>>
> >>>>>>> for (i = 0; i < MAX_NR_ZONES; i++) {
> >>>>>>> 	struct zone *zone = &pgdat->node_zones[i];
> >>>>>>>
> >>>>>>> 	if (!populated_zone(zone))
> >>>>>>> 		continue;
> >>>>>>> 	spin_lock_irqsave(&zone->lock, flags);
> >>>>>>> 	for (order = min_order; order < MAX_ORDER; ++order) {
> >>>>>>> 		struct free_area *free_area = &zone->free_area[order];
> >>>>>>> 		enum migratetype mt;
> >>>>>>> 		struct page *page;
> >>>>>>>
> >>>>>>> 		if (!free_area->nr_free)
> >>>>>>> 			continue;
> >>>>>>>
> >>>>>>> 		for_each_migratetype_order(order, mt) {
> >>>>>>> 			list_for_each_entry(page,
> >>>>>>> 					&free_area->free_list[mt], lru) {
> >>>>>>> 				pfn = page_to_pfn(page);
> >>>>>>> 				visit(opaque2, pfn, 1 << order);
> >>>>>>> 			}
> >>>>>>> 		}
> >>>>>>> 	}
> >>>>>>>
> >>>>>>> 	spin_unlock_irqrestore(&zone->lock, flags);
> >>>>>>> }
> >>>>>>>
> >>>>>>>[...]
> >>>>>>I think the above would hold the lock for too long. That's why we
> >>>>>>prefer to take one free page block at a time; taking them one by one
> >>>>>>also doesn't make a difference in terms of the performance we need.
> >>>>>I think you should start with the simple approach and improve
> >>>>>incrementally if this turns out to be suboptimal. I really detest taking
> >>>>>struct pages outside of the lock. You never know what might happen after
> >>>>>the lock is dropped. E.g. can you race with memory hotremove?
> >>>>The caller won't use the pages returned from the function, so I think
> >>>>there shouldn't be an issue or a race even if the returned pages get used
> >>>>(i.e. are not free anymore) or simply go away due to hotremove.
> >>>No, this is just too error prone. Consider that the struct page pointer
> >>>itself could get invalid in the meantime. Please always keep robustness in
> >>>mind first. Optimizations are nice, but it is not even clear whether the
> >>>simple variant will cause any problems.
> >>
> >>how about this:
> >>
> >>for_each_populated_zone(zone) {
> >>    for_each_migratetype_order_descend(min_order, order, type) {
> >>        do {
> >> =>         spin_lock_irqsave(&zone->lock, flags);
> >>            ret = report_free_page_block(zone, order, type,
> >>                                         &page);
> >>            if (!ret) {
> >>                pfn = page_to_pfn(page);
> >>                nr_pages = 1 << order;
> >>                visit(opaque1, pfn, nr_pages);
> >>            }
> >> =>         spin_unlock_irqrestore(&zone->lock, flags);
> >>        } while (!ret);
> >>    }
> >>}
> >>
> >>In this way, we can still keep the lock granularity at one free page block
> >>while operating on the struct page under the lock.
> >How can you continue iteration of free_list after the lock has been
> >dropped?
>
> report_free_page_block() handles all the possible cases after the lock is
> dropped. For example, if the previously reported page is no longer on the
> free list, the first node of the free list of this order is given instead.
> This works because page allocation takes page blocks from the head of the
> list towards the end. For example, with blocks 1,2,3,4,5,6 on the list: if
> the previously reported free block is 2, and when we pass 2 to the report
> function to get the next page block we find that 1, 2 and 3 have all gone,
> it will report 4, which is now the head of the free list.
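
For reference, the resume behaviour described above might look roughly like
the sketch below. This is a simplified reconstruction for illustration, not
the code as posted: the caller is assumed to hold zone->lock around each
call (as in the proposed loop), and the exact checks in the real patch may
differ.

/*
 * Simplified reconstruction of the resume logic described above.
 * The caller holds zone->lock.  *page is the block reported by the
 * previous call (NULL on the first call).  Returns 0 and updates *page
 * with the next free block of this order/migratetype, or -EAGAIN when
 * there is nothing (more) to report.
 */
static int report_free_page_block(struct zone *zone, unsigned int order,
				  unsigned int migratetype,
				  struct page **page)
{
	struct list_head *list =
		&zone->free_area[order].free_list[migratetype];
	struct page *prev = *page;

	if (list_empty(list))
		return -EAGAIN;

	/* First call: start from the head of the free list. */
	if (!prev) {
		*page = list_first_entry(list, struct page, lru);
		return 0;
	}

	/*
	 * If the previously reported block is still a free block of the
	 * expected order, continue from its successor ...
	 */
	if (PageBuddy(prev) && page_order(prev) == order) {
		if (list_is_last(&prev->lru, list))
			return -EAGAIN;
		*page = list_next_entry(prev, lru);
		return 0;
	}

	/*
	 * ... otherwise the block was allocated or merged while the lock
	 * was dropped.  Allocation consumes the list from its head, so
	 * restarting from the current head only skips blocks that are no
	 * longer free.
	 */
	*page = list_first_entry(list, struct page, lru);
	return 0;
}

Note that this still dereferences the stale struct page pointer passed in
from the previous iteration, which is exactly the robustness concern raised
above.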
As I've said earlier: start simple and optimize incrementally, with some
numbers to justify the more subtle code.
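
For comparison, the simple variant sketched earlier in this thread could be
wrapped into a self-contained helper along the lines below. This is an
illustrative sketch only: the helper name walk_free_mem_blocks and the
report() callback are placeholders, not an existing kernel interface. The
point is that the callback only ever receives a pfn and a block size, and
no struct page pointer leaves the section covered by zone->lock.

/*
 * Illustrative sketch: walk the free lists of all populated zones and
 * report each free block as (pfn, nr_pages).  The callback is invoked
 * with zone->lock held, so struct page never escapes the locked section.
 */
static void walk_free_mem_blocks(void *opaque, unsigned int min_order,
				 void (*report)(void *opaque,
						unsigned long pfn,
						unsigned long nr_pages))
{
	struct zone *zone;
	struct page *page;
	unsigned long flags;
	unsigned int order;
	int mt;

	for_each_populated_zone(zone) {
		/* The lock is held across the walk of one whole zone. */
		spin_lock_irqsave(&zone->lock, flags);
		for (order = min_order; order < MAX_ORDER; order++) {
			struct free_area *area = &zone->free_area[order];

			if (!area->nr_free)
				continue;

			for (mt = 0; mt < MIGRATE_TYPES; mt++) {
				list_for_each_entry(page,
						    &area->free_list[mt],
						    lru)
					report(opaque, page_to_pfn(page),
					       1UL << order);
			}
		}
		spin_unlock_irqrestore(&zone->lock, flags);
	}
}

Whether holding the lock for a whole zone walk is actually a problem is the
kind of thing that should be answered with numbers before moving to the more
complicated resume scheme.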
--
Michal Hocko
SUSE Labs