Monday, July 14, 2008

Why you have to reboot your router


This Slashdot post asks "Why Do We Have To Restart Routers?", referring to home routers like Linksys, D-Link, and Apple AirPort. It claims "It seems like routers, purpose-built with an embedded OS, should be the most stable devices on my network". The assumption that purpose-built/embedded OSes are more robust is wrong; the opposite is true. Purpose-built/embedded operating systems are fragile.

The cause of this is "confirmation-bias". Engineering is broken into multiple teams: one team writes the code, another team ("quality assurance" or "QA") tests it. However, QA suffers from a critical-thinking defect: testers look for tests that the product can pass, and avoid tests where the product will fail.

In cognitive-science/critical-thinking, this is known as "confirmation-bias". The Wikipedia page on the subject has an excellent example, which I'll paraphrase here.

A guy named Peter Wason did a study where he gave subjects a triplet of numbers, such as [2 4 6], and told them that it conformed to a particular rule. The subjects were asked to discover the hidden rule: they were to pick their own triplets, and be told "yes" or "no" whether each one conformed.

What Wason found is that people chose triplets that tried to "confirm" their theory rather than "falsify" it. If they were guessing "all even numbers", they would choose a triplet like [6 8 2] to prove their theory rather than a triplet like [3 4 5] to disprove it. Those using a confirmation-bias usually failed to find the correct theory, whereas those using a falsification-bias quickly found it.
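To make the two strategies concrete, here's a minimal sketch of the task in Python (the hidden rule in Wason's actual study was simply "any ascending sequence", which is the rule assumed below):

    # Toy simulation of Wason's 2-4-6 task.
    # Hidden rule (as in the original study): the numbers simply ascend.
    def conforms(a, b, c):
        return a < b < c

    # Confirmation-bias probes: triplets chosen so the "all even" theory passes.
    for t in [(2, 4, 6), (6, 8, 10), (20, 40, 60)]:
        print(t, conforms(*t))            # all True; the theory looks "confirmed"

    # Falsification-bias probe: a triplet chosen to break the "all even" theory.
    print((3, 4, 5), conforms(3, 4, 5))   # True; "all even" is disproved

Each "yes" on an even triplet teaches you nothing; the single "yes" on [3 4 5] kills the wrong theory immediately.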

Falsification is what makes science "science". When testing theories, scientists design tests to prove their theory false. A solid scientific theory is not something proven true, but something that we tried hard, yet failed, to prove false. This is why things like "Astrology" are not scientific - while astrology appears plausible from a confirmation-bias point-of-view, it utterly fails to hold up when approached with a falsification-bias. Conversely, scientific theories like Newton's Mechanics or Einstein's Relativity hold up quite well against attempts to disprove them.

When testing home-routers, the QA department is given a list of features the router must pass. For example, the router must support typical web surfing for a number of desktops behind it, so QA might test to make sure that the device can handle 100 concurrent connections. However, QA does not test what happens when devices try a thousand, ten thousand, or a million connections. They run the confirmation-bias test of "can it handle 100 connections?", not the falsification-bias test of "how many connections can it handle, and what happens when it reaches its maximum?".
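A falsification-bias version of that test is simple: keep opening connections until the device misbehaves, then record how it fails. Here's a rough sketch, assuming a throwaway server somewhere beyond the router (the hostname is a placeholder; a real rig would pace the connections and monitor the router independently):

    import socket

    # Hypothetical host on the far side of the router.
    OUTSIDE_HOST = ("test-server.example.com", 80)

    # Open connections *through* the router to fill its connection-tracking
    # table, and keep going until something gives.
    conns = []
    try:
        while True:
            conns.append(socket.create_connection(OUTSIDE_HOST, timeout=5))
            if len(conns) % 100 == 0:
                print(f"{len(conns)} connections open")
    except OSError as e:
        # The interesting result isn't the number, it's the failure mode:
        # does the router refuse new connections gracefully, hang, or reboot?
        print(f"failed after {len(conns)} connections: {e}")
    finally:
        for s in conns:
            s.close()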

Because of this, the first time somebody misconfigures BitTorrent to use too many connections, the router crashes. Likewise, internal processes within the router often crash and silently restart without being visible from the outside - yet the router still passes QA tests, because the testers aren't looking for that. Anything unusual the user does is likely to cause a crash.

Fuzzing, or sending intentionally corrupted traffic, is an example of falsification QA testing that is never done. In Errata Security's testing, most home routers will crash if you fuzz them over WiFi. Their web-interfaces will crash if you fuzz their HTTP management port. They will crash if you fuzz TCP/IP on their management port. They will also crash in their "stateful-inspection" logic if you fuzz the traffic going through them. Errata Security has fuzzed many typical home gateways, and NONE of them have passed well-known, public fuzzing suites. We therefore conclude that fuzzing isn't part of their QA process.
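The fuzzers we use are well-known public suites, but the idea is simple enough to sketch. A toy mutation fuzzer for a router's HTTP management port might look like this (the address is a placeholder, and real suites generate far smarter malformed inputs than random byte-flips):

    import random
    import socket

    TARGET = ("192.168.1.1", 80)  # hypothetical router management interface

    def mutate(data, rate=0.05):
        # Flip random bytes in an otherwise valid request.
        out = bytearray(data)
        for i in range(len(out)):
            if random.random() < rate:
                out[i] = random.randrange(256)
        return bytes(out)

    base = b"GET / HTTP/1.1\r\nHost: 192.168.1.1\r\n\r\n"

    for case in range(1000):
        try:
            with socket.create_connection(TARGET, timeout=2) as s:
                s.sendall(mutate(base))
                s.recv(1024)
        except OSError:
            # No response may mean the device (or just its web server) has
            # crashed: exactly the failure a confirmation-bias test plan
            # never looks for.
            print(f"no response at case {case}; check whether the device is still up")
            break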

Well-known operating-systems like Linux and Windows don't have this problem. First of all, they have a lot of eyes on the outside trying to break them. At the same time, they have internal QA people with a falsification-bias, trying to create tests that break them.

By contrast, expensive enterprise products (like switches) have the same confirmation-bias as cheap home products. We crashed an enterprise switch the other day just by running Nessus against it. I've tested products from all the high-end vendors, and they all demonstrate confirmation-bias in their testing. The only products that seem to have a falsification-bias in their QA process are the major operating systems and a few of the major applications.

Confirmation-bias is why you hire pen-testers, code-auditors, and security consultants: you are hiring people to falsify your assumptions, not confirm them. This is the secret reason pentesters look so smart. It's not that they are actually smarter than your security team; it's that they have a falsification-bias rather than a confirmation-bias.

Indeed, it's the top-down (from management) confirmation-bias that is at the heart of cybersecurity woes. Everyone is trying to prove they are secure; anyone trying to show where they are not secure is suppressed, called a "trouble maker" or "not a team player".

4 comments:

Ryan Russell said...

Well, that's just insulting to those of us who do QA (or did, until recently).

The reason more functional QA testing gets done than penetration testing is time.

Doesn't mean I didn't do some, though. I have found more security holes in our product before it shipped than everyone else in the world put together.

Robert Graham said...

It's not the QA engineers per se, it's management.

Who allocates time and priorities? Management. If you don't have enough time, it's because management didn't give you time.

Did management reward you for finding security holes? Or did they express their displeasure?

Ryan Russell said...

Depends on when in the cycle I found them. ;)

Yes, of course management sets the funding, and the rest flows from there. But I wholeheartedly disagree that the problem is that QA doesn't know or doesn't care. That's like saying all programmers don't know how to write secure code. They suffer from the same budget problem.

If you want my opinion why home routers need to be rebooted, it's often because some table filled up or something is leaking memory. You can either blame that on programmers not doing a good enough job writing code for a constrained environment, or the product manager for speccing too constrained an environment. Both are budget-related.
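What I mean, as a toy sketch (names and numbers hypothetical): a connection-tracking table with a capacity fixed at design time and nothing that evicts dead entries:

    # Toy NAT/connection-tracking table: fixed size, entries never expire.
    TABLE_SIZE = 1024               # hypothetical limit chosen at design time
    table = {}

    def track(flow):
        if flow in table:
            return True
        if len(table) >= TABLE_SIZE:
            return False            # table full: new connections silently fail
        table[flow] = "ESTABLISHED" # note: nothing ever removes dead flows
        return True

Once enough dead flows accumulate, every new connection fails until the table is cleared, i.e. until you reboot.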

david said...

Do you test firmwares like dd-wrt, tomato, openwrt, etc.?