From: "Huang\, Ying" <ying.huang@intel.com>
To: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Huang, Ying" <ying.huang@intel.com>,
Anshuman Khandual <khandual@linux.vnet.ibm.com>,
Andrew Morton <akpm@linux-foundation.org>,
tim.c.chen@intel.com, dave.hansen@intel.com,
andi.kleen@intel.com, aaron.lu@intel.com, linux-mm@kvack.org,
linux-kernel@vger.kernel.org, Hugh Dickins <hughd@google.com>,
Shaohua Li <shli@kernel.org>, Minchan Kim <minchan@kernel.org>,
Rik van Riel <riel@redhat.com>,
"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Subject: Re: [PATCH -v3 01/10] mm, swap: Make swap cluster size same of THP size on x86_64
Date: Fri, 23 Sep 2016 16:47:22 +0800 [thread overview]
Message-ID: <87oa3ft339.fsf@yhuang-dev.intel.com> (raw)
In-Reply-To: <20160922192509.GA6054@cmpxchg.org> (Johannes Weiner's message of "Thu, 22 Sep 2016 15:25:09 -0400")
Johannes Weiner <hannes@cmpxchg.org> writes:
> Hi Ying,
>
> On Tue, Sep 20, 2016 at 10:01:30AM +0800, Huang, Ying wrote:
>> It appears all patches other than [10/10] in the series are used by
>> the last patch [10/10], directly or indirectly, and without [10/10]
>> they don't make much sense. So you suggest I use one large patch?
>> Something like below? Does that help you to review?
>
> I find this version a lot easier to review, thank you.
>
>> As the first step, in this patch, splitting the huge page is delayed
>> from almost the first step of swapping out until after allocating the
>> swap space for the THP and adding the THP into the swap cache. This
>> reduces lock acquisition/release for the locks used in swap cache
>> management.
>
> I agree that that's a fine goal for this patch series. We can worry
> about 2MB IO submissions later on.
>
>> @@ -503,6 +503,19 @@ config FRONTSWAP
>>
>> If unsure, say Y to enable frontswap.
>>
>> +config ARCH_USES_THP_SWAP_CLUSTER
>> + bool
>> + default n
>> +
>> +config THP_SWAP_CLUSTER
>> + bool
>> + depends on SWAP && TRANSPARENT_HUGEPAGE && ARCH_USES_THP_SWAP_CLUSTER
>> + default y
>> + help
>> +	  Use one swap cluster to hold the contents of a swapped-out
>> +	  THP (Transparent Huge Page).  The size of the swap cluster
>> +	  will be the same as that of the THP.
>
> Making swap space allocation and swapcache handling THP-native is not
> dependent on the architecture, it's generic VM code. Can you please
> just define the cluster size depending on CONFIG_TRANSPARENT_HUGEPAGE?
>
>> @@ -196,7 +196,11 @@ static void discard_swap_cluster(struct
>> }
>> }
>>
>> +#ifdef CONFIG_THP_SWAP_CLUSTER
>> +#define SWAPFILE_CLUSTER (HPAGE_SIZE / PAGE_SIZE)
>> +#else
>> #define SWAPFILE_CLUSTER 256
>> +#endif
>> #define LATENCY_LIMIT 256
>
> I.e. this?
>
> #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> #define SWAPFILE_CLUSTER HPAGE_PMD_NR
> #else
> #define SWAPFILE_CLUSTER 256
> #endif
I made the value of SWAPFILE_CLUSTER depend on the architecture because
I don't know whether it is good to enable the THP swap optimization for
every architecture. For example, on MIPS the huge page size can be as
large as 1 << (16 + 16 - 3) == 512M, and I doubt it is reasonable to
make the swap cluster that big. So I think it may be better to let each
architecture decide when to enable the THP swap optimization.
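
To make the numbers concrete, here is a rough user-space sketch of the
arithmetic (illustration only, not kernel code; the MIPS shifts follow
the 1 << (16 + 16 - 3) estimate above):

	#include <stdio.h>

	int main(void)
	{
		/* x86_64: 4KB base pages, 2MB PMD-level huge pages */
		unsigned long x86 = (1UL << 21) >> 12;   /* 512 slots */
		/* MIPS with 64KB pages: HPAGE_SHIFT = 29 -> 512MB */
		unsigned long mips = (1UL << 29) >> 16;  /* 8192 slots */

		printf("x86_64: %lu slots = %lu MB per cluster\n",
		       x86, (x86 << 12) >> 20);
		printf("MIPS:   %lu slots = %lu MB per cluster\n",
		       mips, (mips << 16) >> 20);
		return 0;
	}

That is, each swap cluster would cover 2MB of swap space on x86_64 but
512MB on such a MIPS configuration, which seems too coarse a unit for
swap space allocation.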
>> @@ -18,6 +18,13 @@ struct swap_cgroup {
>> };
>> #define SC_PER_PAGE (PAGE_SIZE/sizeof(struct swap_cgroup))
>>
>> +struct swap_cgroup_iter {
>> + struct swap_cgroup_ctrl *ctrl;
>> + struct swap_cgroup *sc;
>> + swp_entry_t entry;
>> + unsigned long flags;
>> +};
>> +
>> /*
>> * SwapCgroup implements "lookup" and "exchange" operations.
>> * In typical usage, this swap_cgroup is accessed via memcg's charge/uncharge
>> @@ -75,6 +82,35 @@ static struct swap_cgroup *lookup_swap_c
>> return sc + offset % SC_PER_PAGE;
>> }
>>
>> +static void swap_cgroup_iter_init(struct swap_cgroup_iter *iter,
>> + swp_entry_t ent)
>> +{
>> + iter->entry = ent;
>> + iter->sc = lookup_swap_cgroup(ent, &iter->ctrl);
>> + spin_lock_irqsave(&iter->ctrl->lock, iter->flags);
>> +}
>> +
>> +static void swap_cgroup_iter_exit(struct swap_cgroup_iter *iter)
>> +{
>> + spin_unlock_irqrestore(&iter->ctrl->lock, iter->flags);
>> +}
>> +
>> +/*
>> + * swap_cgroup is stored in a kind of discontinuous array.  That is,
>> + * entries are contiguous within one page, but not across page
>> + * boundaries.  And there is one lock for each page.
>> + */
>> +static void swap_cgroup_iter_advance(struct swap_cgroup_iter *iter)
>> +{
>> + iter->sc++;
>> + iter->entry.val++;
>> + if (!(((unsigned long)iter->sc) & PAGE_MASK)) {
>> + spin_unlock_irqrestore(&iter->ctrl->lock, iter->flags);
>> + iter->sc = lookup_swap_cgroup(iter->entry, &iter->ctrl);
>> + spin_lock_irqsave(&iter->ctrl->lock, iter->flags);
>> + }
>> +}
>> +
>> /**
>> * swap_cgroup_cmpxchg - cmpxchg mem_cgroup's id for this swp_entry.
>> * @ent: swap entry to be cmpxchged
>> @@ -87,45 +123,49 @@ static struct swap_cgroup *lookup_swap_c
>> unsigned short swap_cgroup_cmpxchg(swp_entry_t ent,
>> unsigned short old, unsigned short new)
>> {
>> - struct swap_cgroup_ctrl *ctrl;
>> - struct swap_cgroup *sc;
>> - unsigned long flags;
>> + struct swap_cgroup_iter iter;
>> unsigned short retval;
>>
>> - sc = lookup_swap_cgroup(ent, &ctrl);
>> + swap_cgroup_iter_init(&iter, ent);
>>
>> - spin_lock_irqsave(&ctrl->lock, flags);
>> - retval = sc->id;
>> + retval = iter.sc->id;
>> if (retval == old)
>> - sc->id = new;
>> + iter.sc->id = new;
>> else
>> retval = 0;
>> - spin_unlock_irqrestore(&ctrl->lock, flags);
>> +
>> + swap_cgroup_iter_exit(&iter);
>> return retval;
>> }
>>
>> /**
>> - * swap_cgroup_record - record mem_cgroup for this swp_entry.
>> - * @ent: swap entry to be recorded into
>> + * swap_cgroup_record - record mem_cgroup for a set of swap entries
>> + * @ent: the first swap entry to be recorded into
>> * @id: mem_cgroup to be recorded
>> + * @nr_ents: number of swap entries to be recorded
>> *
>> * Returns old value at success, 0 at failure.
>> * (Of course, old value can be 0.)
>> */
>> -unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id)
>> +unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id,
>> + unsigned int nr_ents)
>> {
>> - struct swap_cgroup_ctrl *ctrl;
>> - struct swap_cgroup *sc;
>> + struct swap_cgroup_iter iter;
>> unsigned short old;
>> - unsigned long flags;
>>
>> - sc = lookup_swap_cgroup(ent, &ctrl);
>> + swap_cgroup_iter_init(&iter, ent);
>>
>> - spin_lock_irqsave(&ctrl->lock, flags);
>> - old = sc->id;
>> - sc->id = id;
>> - spin_unlock_irqrestore(&ctrl->lock, flags);
>> + old = iter.sc->id;
>> + for (;;) {
>> + VM_BUG_ON(iter.sc->id != old);
>> + iter.sc->id = id;
>> + nr_ents--;
>> + if (!nr_ents)
>> + break;
>> + swap_cgroup_iter_advance(&iter);
>> + }
>>
>> + swap_cgroup_iter_exit(&iter);
>> return old;
>> }
>
> The iterator seems overkill for one real user, and it's undesirable in
> the single-slot access from swap_cgroup_cmpxchg(). How about something
> like the following?
>
> static struct swap_cgroup *lookup_swap_cgroup(struct swap_cgroup_ctrl *ctrl,
> 					      pgoff_t offset)
> {
> 	struct swap_cgroup *sc;
>
> 	sc = page_address(ctrl->map[offset / SC_PER_PAGE]);
> 	return sc + offset % SC_PER_PAGE;
> }
>
> unsigned short swap_cgroup_cmpxchg(swp_entry_t ent,
> unsigned short old, unsigned short new)
> {
> struct swap_cgroup_ctrl *ctrl;
> struct swap_cgroup *sc;
> unsigned long flags;
> unsigned short retval;
> 	pgoff_t off = swp_offset(ent);
>
> 	ctrl = &swap_cgroup_ctrl[swp_type(ent)];
> 	sc = lookup_swap_cgroup(ctrl, off);
>
> spin_lock_irqsave(&ctrl->lock, flags);
> retval = sc->id;
> if (retval == old)
> sc->id = new;
> else
> retval = 0;
> spin_unlock_irqrestore(&ctrl->lock, flags);
>
> return retval;
> }
>
> unsigned short swap_cgroup_record(swp_entry_t ent, unsigned short id,
> unsigned int nr_entries)
> {
> struct swap_cgroup_ctrl *ctrl;
> struct swap_cgroup *sc;
> unsigned short old;
> 	unsigned long flags;
> 	pgoff_t offset = swp_offset(ent);
> 	pgoff_t end = offset + nr_entries;
>
> 	ctrl = &swap_cgroup_ctrl[swp_type(ent)];
> 	sc = lookup_swap_cgroup(ctrl, offset);
>
> spin_lock_irqsave(&ctrl->lock, flags);
> old = sc->id;
> while (offset != end) {
> sc->id = id;
> offset++;
> if (offset % SC_PER_PAGE)
> sc++;
> else
> sc = lookup_swap_cgroup(ctrl, offset);
> }
> spin_unlock_irqrestore(&ctrl->lock, flags);
>
> return old;
> }
Yes, you are right. I misunderstood the locking semantics of the swap
cgroup before. Thanks for pointing that out. I will change it in the
next version.
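
For reference, a sketch of how a THP swap-out path might call the
batched swap_cgroup_record() (the helper thp_swap_charge() is made up
for illustration; mem_cgroup_id() and HPAGE_PMD_NR are existing kernel
names):

	static void thp_swap_charge(swp_entry_t first, struct mem_cgroup *memcg)
	{
		unsigned short oldid;

		/* One locked pass covers all HPAGE_PMD_NR sub-entries. */
		oldid = swap_cgroup_record(first, mem_cgroup_id(memcg),
					   HPAGE_PMD_NR);
		VM_BUG_ON(oldid);	/* the entries should be unowned */
	}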
>> @@ -145,20 +162,66 @@ void __delete_from_swap_cache(struct pag
>>
>> entry.val = page_private(page);
>> address_space = swap_address_space(entry);
>> - radix_tree_delete(&address_space->page_tree, page_private(page));
>> - set_page_private(page, 0);
>> ClearPageSwapCache(page);
>> - address_space->nrpages--;
>> - __dec_node_page_state(page, NR_FILE_PAGES);
>> - INC_CACHE_INFO(del_total);
>> + for (i = 0; i < nr; i++) {
>> + struct page *cur_page = page + i;
>> +
>> + radix_tree_delete(&address_space->page_tree,
>> + page_private(cur_page));
>> + set_page_private(cur_page, 0);
>> + }
>> + address_space->nrpages -= nr;
>> + __mod_node_page_state(page_pgdat(page), NR_FILE_PAGES, -nr);
>> + ADD_CACHE_INFO(del_total, nr);
>> +}
>> +
>> +#ifdef CONFIG_THP_SWAP_CLUSTER
>> +int add_to_swap_trans_huge(struct page *page, struct list_head *list)
>> +{
>> + swp_entry_t entry;
>> + int ret = 0;
>> +
>> + /* cannot split, which may be needed during swap in, skip it */
>> + if (!can_split_huge_page(page))
>> + return -EBUSY;
>> + /* fallback to split huge page firstly if no PMD map */
>> + if (!compound_mapcount(page))
>> + return 0;
>
> The can_split_huge_page() (and maybe also the compound_mapcount())
> optimizations look like they could be split out into separate followup
> patches. They're definitely nice to have, but don't seem necessary to
> make this patch minimally complete.
Yes. Will change this.
>> @@ -168,11 +231,23 @@ int add_to_swap(struct page *page, struc
>> VM_BUG_ON_PAGE(!PageLocked(page), page);
>> VM_BUG_ON_PAGE(!PageUptodate(page), page);
>>
>> + if (unlikely(PageTransHuge(page))) {
>> + err = add_to_swap_trans_huge(page, list);
>> + switch (err) {
>> + case 1:
>> + return 1;
>> + case 0:
>> + /* fallback to split firstly if return 0 */
>> + break;
>> + default:
>> + return 0;
>> + }
>> + }
>> entry = get_swap_page();
>> if (!entry.val)
>> return 0;
>>
>> - if (mem_cgroup_try_charge_swap(page, entry)) {
>> + if (mem_cgroup_try_charge_swap(page, entry, 1)) {
>> swapcache_free(entry);
>> return 0;
>> }
>
> Instead of duplicating the control flow at such a high level -
> add_to_swap() and add_to_swap_trans_huge() are basically identical -
> it's better to push down the THP handling as low as possible:
>
> Pass the page to get_swap_page(), and then decide in there whether
> it's THP and you need to allocate a single entry or a cluster.
>
> And mem_cgroup_try_charge_swap() already gets the page. Again, check
> in there how much swap to charge based on the passed page instead of
> passing the same information twice.
>
> Doing that will change the structure of the patch too much to review
> the paths below in their current form. I'll have a closer look in the
> next version.
The original swap code allocates one swap slot and charges one swap
entry in the swap cgroup for a THP. We will continue to use that code
path if we fail to allocate a swap cluster for the THP. Although it is
possible to change the original logic to split the THP before
allocating the swap slot and charging the swap cgroup, I don't think
that belongs in this patchset, and whether it would be an improvement
is questionable.
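
In other words, the original path remains the fallback. Roughly (a
hypothetical sketch, not the posted code; add_to_swap_one_page() is a
placeholder name for the existing single-page logic):

	int add_to_swap(struct page *page, struct list_head *list)
	{
		if (PageTransHuge(page)) {
			int err = add_to_swap_trans_huge(page, list);

			if (err == 1)	/* whole-THP path succeeded */
				return 1;
			if (err < 0)	/* e.g. -EBUSY: give up */
				return 0;
			/* err == 0: split, then take the original
			 * single-page path below. */
			if (split_huge_page_to_list(page, list))
				return 0;
		}
		return add_to_swap_one_page(page);	/* original path */
	}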
Best Regards,
Huang, Ying