Hello People,

[ For the ones in linux-mm that are receiving this for the first time,
  this is a follow up of http://thread.gmane.org/gmane.linux.kernel.containers/21295 ]

Here is a new, a bit more mature version of my previous RFC. Now I
Request For More Comments from you guys on this new version of the patch.

Highlights:

* Although I do intend to experiment with more scenarios (suggestions
  welcome), there does not seem to be a (huge) performance hit with this
  patch applied, at least in a basic latency benchmark. That indicates
  that even if we can demonstrate a performance hit, it won't be too hard
  to optimize it away (famous last words?)

  Since the patch touches both rcv and snd sides, I benchmarked it with
  netperf against localhost. Command line: netperf -t TCP_RR -H localhost.

  Without the patch
  =================
  Local /Remote
  Socket Size   Request  Resp.   Elapsed  Trans.
  Send   Recv   Size     Size    Time     Rate
  bytes  Bytes  bytes    bytes   secs.    per sec

  16384  87380  1        1       10.00    26996.35
  16384  87380

  With the patch
  ==============
  Local /Remote
  Socket Size   Request  Resp.   Elapsed  Trans.
  Send   Recv   Size     Size    Time     Rate
  bytes  Bytes  bytes    bytes   secs.    per sec

  16384  87380  1        1       10.00    27291.86
  16384  87380

  As you can see, the rate is a bit higher, but still within a one percent
  range, meaning it is basically unchanged. I will benchmark it with
  various levels of cgroup nesting in my next submission so we can have a
  better idea of its impact when enabled.

* As nicely pointed out by Kamezawa, I dropped the sockets cgroup and
  introduced a kmem cgroup. After careful consideration, I decided not to
  reuse the memcg. Basically, my impression is that memcg is concerned
  with user objects, with their page granularity and swap attributes.
  Because kernel objects are entirely different, I prefer to group them
  separately here.

* Only tcp ipv4 is converted, because it is basically the one in which
  memory pressure thresholds are really put to use. I plan to touch the
  other protocols in the next submission.

* As with other sysctls, the sysctl controlling tcp memory pressure
  behaviour was made per-netns, but it will show cgroup data for the
  current cgroup. The cgroup control file, however, will only set a
  maximum value. The pressure thresholds are not the business of the box
  administrator, but rather of the container's: anything goes, provided
  none of the 3 values goes over the maximum. (A small illustrative
  sketch of this constraint follows at the end of this mail.)

Comments welcome
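To make the last bullet more concrete, here is a minimal userspace sketch
of the rule I have in mind. This is not the patch code: the struct and
function names below are made up purely for illustration, and the real
patch expresses this check inside the kernel's sysctl/cgroup handlers.
The only point being shown is that a container may set whatever tcp_mem
triple it likes, as long as none of the three values exceeds the maximum
imposed through the cgroup control file.

/*
 * Illustrative sketch only (not from the patch): a container's proposed
 * tcp_mem-style triple is accepted if and only if none of its three
 * values goes over the cgroup-imposed maximum.
 */
#include <stdio.h>
#include <stdbool.h>

/* tcp_mem-style triple: low, pressure, high (in pages) */
struct tcp_mem_triple {
	unsigned long low;
	unsigned long pressure;
	unsigned long high;
};

/*
 * Return true if the proposed per-container values are acceptable,
 * i.e. none of them exceeds the cgroup maximum.
 */
static bool tcp_mem_within_cgroup_max(const struct tcp_mem_triple *prop,
				      unsigned long cgroup_max)
{
	return prop->low <= cgroup_max &&
	       prop->pressure <= cgroup_max &&
	       prop->high <= cgroup_max;
}

int main(void)
{
	unsigned long cgroup_max = 196608;	/* cap set by the box admin */
	struct tcp_mem_triple ok  = { 4096, 8192, 16384 };	/* all below the cap */
	struct tcp_mem_triple bad = { 4096, 8192, 262144 };	/* high value over the cap */

	printf("ok triple accepted?  %d\n", tcp_mem_within_cgroup_max(&ok, cgroup_max));
	printf("bad triple accepted? %d\n", tcp_mem_within_cgroup_max(&bad, cgroup_max));
	return 0;
}

Compiled with gcc and run, this prints 1 for the first triple and 0 for
the second, which is exactly the behaviour the cgroup file is meant to
enforce on writes to the per-netns sysctl.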