From 503c358cf1925853195ee39ec437e51138bbb7df Mon Sep 17 00:00:00 2001
From: Vladimir Davydov
Date: Thu, 12 Feb 2015 14:58:47 -0800
Subject: list_lru: introduce list_lru_shrink_{count,walk}

Kmem accounting of memcg is currently unusable, because it lacks slab
shrinker support.  That means when we hit the limit we will get ENOMEM
without any chance to recover.  What we should do then is to call
shrink_slab, which would reclaim old inode/dentry caches from this
cgroup.  This is what this patch set is intended to do.

Basically, it does two things.  First, it introduces the notion of a
per-memcg slab shrinker.  A shrinker that wants to reclaim objects per
cgroup should mark itself as SHRINKER_MEMCG_AWARE.  Then it will be
passed the memory cgroup to scan from in shrink_control->memcg.  For
such shrinkers shrink_slab iterates over the whole cgroup subtree under
the target cgroup and calls the shrinker for each kmem-active memory
cgroup.

Secondly, this patch set makes the list_lru structure per-memcg.  It's
done transparently to list_lru users - all they have to do is tell
list_lru_init that they want a memcg-aware list_lru.  Then the list_lru
will automatically distribute objects among per-memcg lists based on
which cgroup the object is accounted to.  This way, to make FS shrinkers
(icache, dcache) memcg-aware, we only need to make them use a
memcg-aware list_lru, and this is what this patch set does.

As before, this patch set only enables per-memcg kmem reclaim when the
pressure comes from memory.limit, not from memory.kmem.limit.  Handling
memory.kmem.limit is going to be tricky due to GFP_NOFS allocations, and
it is still unclear whether we will have this knob in the unified
hierarchy.

This patch (of 9):

NUMA aware slab shrinkers use the list_lru structure to distribute
objects coming from different NUMA nodes to different lists.  Whenever
such a shrinker needs to count or scan objects from a particular node,
it issues commands like this:

        count = list_lru_count_node(lru, sc->nid);
        freed = list_lru_walk_node(lru, sc->nid, isolate_func,
                                   isolate_arg, &sc->nr_to_scan);

where sc is an instance of the shrink_control structure passed to it
from vmscan.

To simplify this, let's add special list_lru functions to be used by
shrinkers, list_lru_shrink_count() and list_lru_shrink_walk(), which
consolidate the nid and nr_to_scan arguments in the shrink_control
structure.

This will also allow us to avoid patching shrinkers that use list_lru
when we make shrink_slab() per-memcg - all we will have to do is extend
the shrink_control structure to include the target memcg and make
list_lru_shrink_{count,walk} handle this appropriately.
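The helpers themselves can be thin wrappers around the per-node
primitives shown above.  A minimal sketch, assuming the existing
list_lru_count_node()/list_lru_walk_node() signatures (the exact
placement in the headers may differ):

        static inline unsigned long
        list_lru_shrink_count(struct list_lru *lru, struct shrink_control *sc)
        {
                return list_lru_count_node(lru, sc->nid);
        }

        static inline unsigned long
        list_lru_shrink_walk(struct list_lru *lru, struct shrink_control *sc,
                             list_lru_walk_cb isolate, void *cb_arg)
        {
                /* nr_to_scan is decremented in place as objects are walked */
                return list_lru_walk_node(lru, sc->nid, isolate, cb_arg,
                                          &sc->nr_to_scan);
        }

Since shrinkers now hand the whole shrink_control through, adding a
target memcg to it later only requires touching these two wrappers, not
every caller.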
Signed-off-by: Vladimir Davydov
Suggested-by: Dave Chinner
Cc: Johannes Weiner
Cc: Michal Hocko
Cc: Greg Thelen
Cc: Glauber Costa
Cc: Alexander Viro
Cc: Christoph Lameter
Cc: Pekka Enberg
Cc: David Rientjes
Cc: Joonsoo Kim
Cc: Tejun Heo
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 fs/super.c | 24 +++++++++++-------------
 1 file changed, 11 insertions(+), 13 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index eae088f6aaae..4554ac257647 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -77,8 +77,8 @@ static unsigned long super_cache_scan(struct shrinker *shrink,
 	if (sb->s_op->nr_cached_objects)
 		fs_objects = sb->s_op->nr_cached_objects(sb, sc->nid);
 
-	inodes = list_lru_count_node(&sb->s_inode_lru, sc->nid);
-	dentries = list_lru_count_node(&sb->s_dentry_lru, sc->nid);
+	inodes = list_lru_shrink_count(&sb->s_inode_lru, sc);
+	dentries = list_lru_shrink_count(&sb->s_dentry_lru, sc);
 	total_objects = dentries + inodes + fs_objects + 1;
 	if (!total_objects)
 		total_objects = 1;
@@ -86,20 +86,20 @@ static unsigned long super_cache_scan(struct shrinker *shrink,
 	/* proportion the scan between the caches */
 	dentries = mult_frac(sc->nr_to_scan, dentries, total_objects);
 	inodes = mult_frac(sc->nr_to_scan, inodes, total_objects);
+	fs_objects = mult_frac(sc->nr_to_scan, fs_objects, total_objects);
 
 	/*
 	 * prune the dcache first as the icache is pinned by it, then
 	 * prune the icache, followed by the filesystem specific caches
 	 */
-	freed = prune_dcache_sb(sb, dentries, sc->nid);
-	freed += prune_icache_sb(sb, inodes, sc->nid);
+	sc->nr_to_scan = dentries;
+	freed = prune_dcache_sb(sb, sc);
+	sc->nr_to_scan = inodes;
+	freed += prune_icache_sb(sb, sc);
 
-	if (fs_objects) {
-		fs_objects = mult_frac(sc->nr_to_scan, fs_objects,
-				       total_objects);
+	if (fs_objects)
 		freed += sb->s_op->free_cached_objects(sb, fs_objects,
 						       sc->nid);
-	}
 
 	drop_super(sb);
 	return freed;
@@ -118,17 +118,15 @@ static unsigned long super_cache_count(struct shrinker *shrink,
 	 * scalability bottleneck. The counts could get updated
 	 * between super_cache_count and super_cache_scan anyway.
 	 * Call to super_cache_count with shrinker_rwsem held
-	 * ensures the safety of call to list_lru_count_node() and
+	 * ensures the safety of call to list_lru_shrink_count() and
 	 * s_op->nr_cached_objects().
 	 */
 	if (sb->s_op && sb->s_op->nr_cached_objects)
 		total_objects = sb->s_op->nr_cached_objects(sb, sc->nid);
 
-	total_objects += list_lru_count_node(&sb->s_dentry_lru,
-					     sc->nid);
-	total_objects += list_lru_count_node(&sb->s_inode_lru,
-					     sc->nid);
+	total_objects += list_lru_shrink_count(&sb->s_dentry_lru, sc);
+	total_objects += list_lru_shrink_count(&sb->s_inode_lru, sc);
 
 	total_objects = vfs_pressure_ratio(total_objects);
 	return total_objects;
--
cgit v1.2.3

From 4101b624352fddb5ed72e7a1b6f8be8cffaa20fa Mon Sep 17 00:00:00 2001
From: Vladimir Davydov
Date: Thu, 12 Feb 2015 14:58:51 -0800
Subject: fs: consolidate {nr,free}_cached_objects args in shrink_control

We are going to make FS shrinkers memcg-aware.  To achieve that, we will
have to pass the memcg to scan to the nr_cached_objects and
free_cached_objects VFS methods, which currently take only the NUMA node
to scan.  Since the shrink_control structure already holds the node, and
the memcg to scan will be added to it when we introduce memcg-aware
vmscan, let us consolidate the methods' arguments in this structure to
keep things clean.
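Under the new contract a filesystem's hooks read whatever they need from
sc instead of growing extra parameters.  A hypothetical example (the
myfs_* names and helpers are illustrative only, not part of this patch):

        static long myfs_nr_cached_objects(struct super_block *sb,
                                           struct shrink_control *sc)
        {
                /* report only objects cached on the node being shrunk */
                return myfs_count_reclaimable(sb, sc->nid);
        }

        static long myfs_free_cached_objects(struct super_block *sb,
                                             struct shrink_control *sc)
        {
                /* sc->nr_to_scan is set by super_cache_scan() before the call */
                return myfs_reclaim(sb, sc->nid, sc->nr_to_scan);
        }

XFS, converted below, ignores the node entirely - which is exactly what
makes the struct-based interface convenient: each filesystem picks out
only the fields it cares about.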
Signed-off-by: Vladimir Davydov
Suggested-by: Dave Chinner
Cc: Johannes Weiner
Cc: Michal Hocko
Cc: Greg Thelen
Cc: Glauber Costa
Cc: Alexander Viro
Cc: Christoph Lameter
Cc: Pekka Enberg
Cc: David Rientjes
Cc: Joonsoo Kim
Cc: Tejun Heo
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 fs/super.c         | 12 ++++++------
 fs/xfs/xfs_super.c |  7 +++----
 include/linux/fs.h |  6 ++++--
 3 files changed, 13 insertions(+), 12 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index 4554ac257647..a2b735a42e74 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -75,7 +75,7 @@ static unsigned long super_cache_scan(struct shrinker *shrink,
 		return SHRINK_STOP;
 
 	if (sb->s_op->nr_cached_objects)
-		fs_objects = sb->s_op->nr_cached_objects(sb, sc->nid);
+		fs_objects = sb->s_op->nr_cached_objects(sb, sc);
 
 	inodes = list_lru_shrink_count(&sb->s_inode_lru, sc);
 	dentries = list_lru_shrink_count(&sb->s_dentry_lru, sc);
@@ -97,9 +97,10 @@ static unsigned long super_cache_scan(struct shrinker *shrink,
 	sc->nr_to_scan = inodes;
 	freed += prune_icache_sb(sb, sc);
 
-	if (fs_objects)
-		freed += sb->s_op->free_cached_objects(sb, fs_objects,
-						       sc->nid);
+	if (fs_objects) {
+		sc->nr_to_scan = fs_objects;
+		freed += sb->s_op->free_cached_objects(sb, sc);
+	}
 
 	drop_super(sb);
 	return freed;
@@ -122,8 +123,7 @@ static unsigned long super_cache_count(struct shrinker *shrink,
 	 * s_op->nr_cached_objects().
 	 */
 	if (sb->s_op && sb->s_op->nr_cached_objects)
-		total_objects = sb->s_op->nr_cached_objects(sb,
-							    sc->nid);
+		total_objects = sb->s_op->nr_cached_objects(sb, sc);
 
 	total_objects += list_lru_shrink_count(&sb->s_dentry_lru, sc);
 	total_objects += list_lru_shrink_count(&sb->s_inode_lru, sc);

diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index f2449fd86926..8fcc4ccc5c79 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1537,7 +1537,7 @@ xfs_fs_mount(
 static long
 xfs_fs_nr_cached_objects(
 	struct super_block	*sb,
-	int			nid)
+	struct shrink_control	*sc)
 {
 	return xfs_reclaim_inodes_count(XFS_M(sb));
 }
@@ -1545,10 +1545,9 @@ xfs_fs_nr_cached_objects(
 static long
 xfs_fs_free_cached_objects(
 	struct super_block	*sb,
-	long			nr_to_scan,
-	int			nid)
+	struct shrink_control	*sc)
 {
-	return xfs_reclaim_inodes_nr(XFS_M(sb), nr_to_scan);
+	return xfs_reclaim_inodes_nr(XFS_M(sb), sc->nr_to_scan);
 }
 
 static const struct super_operations xfs_super_operations = {

diff --git a/include/linux/fs.h b/include/linux/fs.h
index cdcb1e9d9613..a20d6586e517 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1635,8 +1635,10 @@ struct super_operations {
 	struct dquot **(*get_dquots)(struct inode *);
 #endif
 	int (*bdev_try_to_free_page)(struct super_block*, struct page*, gfp_t);
-	long (*nr_cached_objects)(struct super_block *, int);
-	long (*free_cached_objects)(struct super_block *, long, int);
+	long (*nr_cached_objects)(struct super_block *,
+				  struct shrink_control *);
+	long (*free_cached_objects)(struct super_block *,
+				    struct shrink_control *);
 };
 
 /*
--
cgit v1.2.3

From c0a5b560938a0f2fd2fbf66ddc446c7c2b41383a Mon Sep 17 00:00:00 2001
From: Vladimir Davydov
Date: Thu, 12 Feb 2015 14:59:07 -0800
Subject: list_lru: organize all list_lrus to list

To make list_lru memcg aware, we need all list_lrus to be kept on a list
protected by a mutex, so that we could sleep while walking over the
list.  Therefore after this change list_lru_destroy may sleep.
Fortunately, there is only one user that calls it from an atomic context
- it's put_super - and we can easily fix it by calling list_lru_destroy
before put_super in deactivate_locked_super - we no longer need the lrus
by that time anyway.

Another point that should be noted is that list_lru_destroy is allowed
to be called on an uninitialized zeroed-out object, in which case it is
a no-op.  Before this patch this was guaranteed by kfree, but now we
need an explicit check there.

Signed-off-by: Vladimir Davydov
Cc: Dave Chinner
Cc: Johannes Weiner
Cc: Michal Hocko
Cc: Greg Thelen
Cc: Glauber Costa
Cc: Alexander Viro
Cc: Christoph Lameter
Cc: Pekka Enberg
Cc: David Rientjes
Cc: Joonsoo Kim
Cc: Tejun Heo
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 fs/super.c               |  8 ++++++++
 include/linux/list_lru.h |  3 +++
 mm/list_lru.c            | 34 ++++++++++++++++++++++++++++++++++
 3 files changed, 45 insertions(+)

diff --git a/fs/super.c b/fs/super.c
index a2b735a42e74..b027849d92d2 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -282,6 +282,14 @@ void deactivate_locked_super(struct super_block *s)
 		unregister_shrinker(&s->s_shrink);
 		fs->kill_sb(s);
 
+		/*
+		 * Since list_lru_destroy() may sleep, we cannot call it from
+		 * put_super(), where we hold the sb_lock. Therefore we destroy
+		 * the lru lists right now.
+		 */
+		list_lru_destroy(&s->s_dentry_lru);
+		list_lru_destroy(&s->s_inode_lru);
+
 		put_filesystem(fs);
 		put_super(s);
 	} else {

diff --git a/include/linux/list_lru.h b/include/linux/list_lru.h
index 53c1d6b78270..ee9486ac0621 100644
--- a/include/linux/list_lru.h
+++ b/include/linux/list_lru.h
@@ -31,6 +31,9 @@ struct list_lru_node {
 
 struct list_lru {
 	struct list_lru_node *node;
+#ifdef CONFIG_MEMCG_KMEM
+	struct list_head	list;
+#endif
 };
 
 void list_lru_destroy(struct list_lru *lru);

diff --git a/mm/list_lru.c b/mm/list_lru.c
index 07e198c77888..a9021cb3ccde 100644
--- a/mm/list_lru.c
+++ b/mm/list_lru.c
@@ -9,6 +9,34 @@
 #include <linux/mm.h>
 #include <linux/list_lru.h>
 #include <linux/slab.h>
+#include <linux/mutex.h>
+
+#ifdef CONFIG_MEMCG_KMEM
+static LIST_HEAD(list_lrus);
+static DEFINE_MUTEX(list_lrus_mutex);
+
+static void list_lru_register(struct list_lru *lru)
+{
+	mutex_lock(&list_lrus_mutex);
+	list_add(&lru->list, &list_lrus);
+	mutex_unlock(&list_lrus_mutex);
+}
+
+static void list_lru_unregister(struct list_lru *lru)
+{
+	mutex_lock(&list_lrus_mutex);
+	list_del(&lru->list);
+	mutex_unlock(&list_lrus_mutex);
+}
+#else
+static void list_lru_register(struct list_lru *lru)
+{
+}
+
+static void list_lru_unregister(struct list_lru *lru)
+{
+}
+#endif /* CONFIG_MEMCG_KMEM */
 
 bool list_lru_add(struct list_lru *lru, struct list_head *item)
 {
@@ -137,12 +165,18 @@ int list_lru_init_key(struct list_lru *lru, struct lock_class_key *key)
 		INIT_LIST_HEAD(&lru->node[i].list);
 		lru->node[i].nr_items = 0;
 	}
+	list_lru_register(lru);
 	return 0;
 }
 EXPORT_SYMBOL_GPL(list_lru_init_key);
 
 void list_lru_destroy(struct list_lru *lru)
 {
+	/* Already destroyed or not yet initialized? */
+	if (!lru->node)
+		return;
+	list_lru_unregister(lru);
 	kfree(lru->node);
+	lru->node = NULL;
 }
 EXPORT_SYMBOL_GPL(list_lru_destroy);
--
cgit v1.2.3

From 2acb60a0461fec8a4516686c913a295ce872697f Mon Sep 17 00:00:00 2001
From: Vladimir Davydov
Date: Thu, 12 Feb 2015 14:59:14 -0800
Subject: fs: make shrinker memcg aware

Now, to make any list_lru-based shrinker memcg aware, we only need to
initialize its list_lru as memcg aware.  Let's do it for the general FS
shrinker (super_block::s_shrink), as shown in the hunks below.
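Any other list_lru-based shrinker can follow the same recipe: switch the
lru initialization and add the flag.  A hedged sketch (the foo_* names
are illustrative, not from this patch):

        /* memcg-aware variant of list_lru_init(), introduced earlier
         * in this series */
        if (list_lru_init_memcg(&foo_lru))
                goto fail;

        foo_shrinker.flags = SHRINKER_NUMA_AWARE | SHRINKER_MEMCG_AWARE;
        register_shrinker(&foo_shrinker);

With that, vmscan will invoke the shrinker once per kmem-active memcg,
and the lru will transparently keep each memcg's objects on its own
per-memcg list.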
There are other FS-specific shrinkers that use list_lru for storing
objects, such as the XFS and GFS2 dquot cache shrinkers, but since they
reclaim objects that are shared among different cgroups, there is no
point making them memcg aware.  It's a big question whether we should
account them to memcg at all.

Signed-off-by: Vladimir Davydov
Cc: Dave Chinner
Cc: Johannes Weiner
Cc: Michal Hocko
Cc: Greg Thelen
Cc: Glauber Costa
Cc: Alexander Viro
Cc: Christoph Lameter
Cc: Pekka Enberg
Cc: David Rientjes
Cc: Joonsoo Kim
Cc: Tejun Heo
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 fs/super.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index b027849d92d2..482b4071f4de 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -189,9 +189,9 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags)
 	INIT_HLIST_BL_HEAD(&s->s_anon);
 	INIT_LIST_HEAD(&s->s_inodes);
 
-	if (list_lru_init(&s->s_dentry_lru))
+	if (list_lru_init_memcg(&s->s_dentry_lru))
 		goto fail;
-	if (list_lru_init(&s->s_inode_lru))
+	if (list_lru_init_memcg(&s->s_inode_lru))
 		goto fail;
 
 	init_rwsem(&s->s_umount);
@@ -227,7 +227,7 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags)
 	s->s_shrink.scan_objects = super_cache_scan;
 	s->s_shrink.count_objects = super_cache_count;
 	s->s_shrink.batch = 1024;
-	s->s_shrink.flags = SHRINKER_NUMA_AWARE;
+	s->s_shrink.flags = SHRINKER_NUMA_AWARE | SHRINKER_MEMCG_AWARE;
 	return s;
 
 fail:
--
cgit v1.2.3

From 49e7e7ff8d551b5b1e2f8da8497b9058cfa25672 Mon Sep 17 00:00:00 2001
From: Vladimir Davydov
Date: Thu, 12 Feb 2015 14:59:17 -0800
Subject: fs: shrinker: always scan at least one object of each type

In super_cache_scan() we divide the number of objects of a particular
type by the total number of objects in order to distribute pressure
among the caches.  As a result, in some corner cases we can get
nr_to_scan=0 even if there are some objects to reclaim, e.g.
dentries=1, inodes=1, fs_objects=1, nr_to_scan=1/3=0.

This is unacceptable for per-memcg kmem accounting, because it means
that some objects may never get reclaimed after the memcg dies,
preventing it from being freed.

This patch therefore ensures that super_cache_scan() will scan at least
one object of each type if there are any.

[akpm@linux-foundation.org: add comment]
Signed-off-by: Vladimir Davydov
Cc: Johannes Weiner
Cc: Michal Hocko
Cc: Alexander Viro
Cc: Dave Chinner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
---
 fs/super.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index 482b4071f4de..a89d6250bab8 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -91,14 +91,17 @@ static unsigned long super_cache_scan(struct shrinker *shrink,
 	/*
 	 * prune the dcache first as the icache is pinned by it, then
 	 * prune the icache, followed by the filesystem specific caches
+	 *
+	 * Ensure that we always scan at least one object - memcg kmem
+	 * accounting uses this to fully empty the caches.
 	 */
-	sc->nr_to_scan = dentries;
+	sc->nr_to_scan = dentries + 1;
 	freed = prune_dcache_sb(sb, sc);
-	sc->nr_to_scan = inodes;
+	sc->nr_to_scan = inodes + 1;
 	freed += prune_icache_sb(sb, sc);
 
 	if (fs_objects) {
-		sc->nr_to_scan = fs_objects;
+		sc->nr_to_scan = fs_objects + 1;
 		freed += sb->s_op->free_cached_objects(sb, sc);
 	}
 
--
cgit v1.2.3
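The corner case comes from integer division truncating toward zero.  A
simplified model of the proportioning (plain C, with mult_frac()
expanded by hand; the real code also adds 1 to total_objects to avoid
dividing by zero):

        unsigned long nr_to_scan = 1;                   /* from vmscan */
        unsigned long dentries = 1, inodes = 1, fs_objects = 1;
        unsigned long total_objects = dentries + inodes + fs_objects;

        /* mult_frac(nr_to_scan, dentries, total_objects): 1 * 1 / 3 == 0 */
        unsigned long dentry_scan = nr_to_scan * dentries / total_objects;

        /* post-patch: "+ 1" guarantees progress for each cache type */
        dentry_scan += 1;

Without the "+ 1", a dying memcg holding a single dentry, inode, or
fs-private object could be asked to scan zero objects on every shrink
invocation, so its caches - and the memcg itself - would never be freed.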