The Saga Of The Crashed NAS

2016-12-11

Update: This problem is actually not fixed. Shortly after publishing this, it crashed again with an out of memory error.

Update 2: I’ve disabled quotas and snapshots on my NAS. If this solves the problem, I’ll hold on to it until I’m better prepared to upgrade.

This is a bit of a post mortem and retrospective of the issue I had with my home NAS box. It had crashed with an out of memory error and rebooting did not solve the problem. The situation was looking grim when I started researching as all indicators pointed to “HOPE YOU HAVE BACKUPS, YOU GOTTA RESET TO FACTORY.”

I have a Netgear ReadyNAS RN104. It’s an older model with an ARM Cortex A9 single core and 512 MB of RAM. It has 4 drive bays, which I have filled with 3 TB NAS optimized drives. I’m using the built-in XRAID, as Netgear calls it, which does some interesting things with md and btrfs, so I have a bit over 8 TB of storage on the NAS box.

Yesterday morning, when I went to start working on a small project, I needed to use one of the services I host at home, which depends on the NAS. It wasn’t working. It took about 10 minutes to figure out that the NAS was the culprit, so I turned the box to where I could see the status display and found an out of memory error. I almost lost hope, as there is next to no information on actually solving this issue. There’s a huge number of things that can cause an out of memory condition in a Linux-based environment. Eventually, I gave up searching for answers on the internet and decided to put my skills, developed over more than a decade, to work and see what I could do to solve this.

The first thing I decided to do was to see if the system was reachable after boot but before the out of memory condition. I power cycled it and tested ssh during the boot process every few seconds. Eventually, once it moved on from 41% booted, I could get an ssh connection. I was able to poke around and do some basic recon, and I watched system resources until the out of memory condition occurred. I only got about 2-5 minutes of access each time I did this, and I did this for many rounds.
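The "test ssh every few seconds" step can be automated. What follows is a minimal sketch, not what I actually ran: it only checks that the TCP port accepts a connection, and the function name, default interval, and timeout are all assumptions made for illustration.

```python
import socket
import time


def wait_for_port(host, port=22, interval=2.0, timeout=300.0):
    """Poll host:port until a TCP connection succeeds or timeout elapses.

    Returns True as soon as the port accepts a connection, False otherwise.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means sshd is at least listening.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Refused, unreachable, or timed out: wait and try again.
            time.sleep(interval)
    return False


# Example use ("nas.local" is a placeholder hostname):
#   if wait_for_port("nas.local"):
#       print("port 22 is open, connect now")
```

With only a 2-5 minute window per boot, knowing the moment the port opens is worth more than it sounds.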

I could not find a process that was consuming the memory. This told me the kernel was doing it, or something in kernel space. I would get at least one secondary ssh session running with the top command so I could watch memory and gauge my time. It almost never exhausted swap unless I was doing something to cause that, such as the one time I was trying to build more swap. I managed to get 2 GB of swap added via a swap file, but this only confirmed that swap was not the problem.
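I watched memory with top, but the same system-wide numbers live in /proc/meminfo on any Linux box. Here is a small sketch of pulling out the two figures that mattered here, free RAM and swap in use. The helper names are made up for this example.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer kB values."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Key: value" line
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def memory_summary(info):
    """Return (free_ram_kb, used_swap_kb) from a parsed meminfo dict."""
    # MemAvailable is missing on kernels older than 3.14; fall back to MemFree.
    free_ram = info.get("MemAvailable", info.get("MemFree", 0))
    used_swap = info.get("SwapTotal", 0) - info.get("SwapFree", 0)
    return free_ram, used_swap


# Example use on the NAS itself:
#   with open("/proc/meminfo") as f:
#       print(memory_summary(parse_meminfo(f.read())))
```

If free RAM collapses while swap use stays near zero and no process in top is growing, that points at memory that cannot be swapped out, which is consistent with what I saw.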

The next thing I decided to try was booting to Technical Support mode. This is the mode Netgear support will tell you to boot to so they can connect to it remotely and check the health of the system. Unfortunately, this mode is useless if you want to troubleshoot a system that has runtime issues, not hardware issues. So, moving on.

The next thing I decided to try was OS Reinstall mode. There’s not much information on how this mode differs from Factory Reset, but I did find one comment that stated it simply just overwrites the OS install on the box to reset settings and scripts. When I ran it, it appeared to run a modified version of the upgrade. It didn’t reset my accounts, volumes, share settings, etc. In fact, the only thing I could find that it changed for me was the root password. This didn’t stop the out of memory condition, so moving on.

At this point, I want to investigate btrfs. I believe it was ReadyNAS OS 6.5 that introduced some new btrfs features, such as quotas, so my suspicion starts going in this direction.

My first thought was to investigate snapshots. Snapshot management had been an issue in the past, but never with out of memory conditions. I tried to see if I could get the snapshots slowly pruned out in the 2-5 minute windows I had. Unfortunately, this seemed impossible. Something was tying up storage and memory resources and I couldn’t complete any of the snapshot purges. I decided I’m just going to set this aside and see if there’s something else to poke at.

Still focused on btrfs, I decided to see if I could disable quotas. This has been a very problematic feature. When the initial release using btrfs quotas was sent out, many NAS boxes were getting hung up with 100% CPU usage caused by this new system. There wasn’t much warning about this feature and its implications. If you managed to upgrade from a version prior to its implementation but after the 100% CPU usage issue, you still had about a day of scanning to deal with, causing slow performance during that time. On the slower RN104, this was a major problem.

So I looked up the commands to disable quotas. Fortunately, these commands executed quickly. I was able to disable quotas on all of the shares before the crash. In fact, I looked down at the box and noticed the information display had blanked, indicating it had completed boot without crashing.
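I didn’t write down the exact commands, so treat this as a sketch under assumptions: it uses the stock `btrfs quota disable <path>` subcommand from btrfs-progs, and the share paths in the usage comment are placeholders, not my real layout. It has to run as root on the NAS.

```python
import subprocess


def quota_disable_commands(paths):
    """Build one `btrfs quota disable <path>` argv list per path."""
    return [["btrfs", "quota", "disable", path] for path in paths]


def run_all(commands, runner=subprocess.run):
    """Run each command in order; check=True stops at the first failure."""
    for argv in commands:
        runner(argv, check=True)


# Example use (hypothetical share paths):
#   run_all(quota_disable_commands(["/data/share1", "/data/share2"]))
```

Building the command list up front means everything can be pasted or launched in one go, which matters when the box is only up for a few minutes.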

SUCCESS! I don’t have to rebuild my data!

If you weren’t able to derive what was happening, the btrfs quota management was attempting to allocate more memory than is available as physical memory…say, 520 MB. No amount of swap will help this condition, as the process needs all of that memory available and active at once, so no part of it can be swapped. This is really kind of an oops on Netgear’s part, as they likely tested this code on their current generation hardware but not their legacy hardware. However, it appears they, or someone else upstream, fixed this.

I was on ReadyNAS OS 6.5.2 and the latest was 6.6.0. The new version was a major upgrade, with a new base OS, new kernel, and in turn, new btrfs. Since all of my services that depended on the NAS were already disrupted, I decided to push the upgrade.

Once the upgrade was done, I wanted to re-enable quotas because this is how ReadyNAS OS manages space consumption. I decided to try one share at a time. However, the moment I executed the rescan, the ReadyNAS platform took over and just rebuilt all the quotas.
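For what it’s worth, the one-share-at-a-time plan would have looked roughly like the sketch below. It assumes the standard `btrfs quota enable` and `btrfs quota rescan -w` subcommands, and the free-memory threshold and function names are invented for the example. As noted, the ReadyNAS platform rebuilt everything at once before a loop like this could have mattered.

```python
def rescan_one_at_a_time(paths, run, free_kb, min_free_kb=50_000):
    """Enable and rescan quotas path by path, stopping if memory runs low.

    run(argv) executes one command; free_kb() returns free memory in kB.
    Both are passed in so the loop can be tried without touching a real NAS.
    Returns the list of paths that were fully processed.
    """
    done = []
    for path in paths:
        if free_kb() < min_free_kb:
            break  # back off before the box runs out of memory again
        run(["btrfs", "quota", "enable", path])
        # -w makes the rescan command wait until the rescan has finished.
        run(["btrfs", "quota", "rescan", "-w", path])
        done.append(path)
    return done


# Example use (hypothetical paths; needs root on the NAS):
#   import subprocess
#   rescan_one_at_a_time(["/data/share1"],
#                        run=lambda argv: subprocess.run(argv, check=True),
#                        free_kb=lambda: 100_000)
```

The point of checking memory before each share is to have a chance to stop while ssh still works, rather than finding out from the status display.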

I watched this thing closely as it did this. I was worried the flaw was simply that I have an old model and it’s time to upgrade, as the lowest end model Netgear sells now has 4 times the RAM. However, the memory usage remained sane and it never crashed. I now have a nice stable NAS.

I did manage to get the things done that I intended to get done earlier in the morning. But this took several hours away from doing other things. Despite this, losing several hours to fix it is better than spending a week or more rebuilding my data.

So there’s a few take-aways from this.

That last one is important just in general, but the interesting thing about this box on my network is this is my backups. It also hosts various other things. There’s probably less than 10% of data that can’t be rebuilt from other sources. It provides mass storage for some of my network services that are mostly hosted inside my network for convenience. Worst case scenario is I might lose my image library, but I’m honestly not sure how much of a huge loss that would be.