Criteria for a Lab to Certify Software

Software companies do not make secure software because consumers do not demand it. This shibboleth of the IT industry is beginning to be challenged. It is dawning on consumers that security matters as computers at work and home are sickened with viruses, as the press gives more play to cybersecurity stories, and as software companies work to market security to consumers. Now, consumers are beginning to want to translate their developing, yet vague, understanding that security is important into actionable steps, such as buying more secure software products. This, in turn, will encourage companies to produce more secure software. Unfortunately, consumers have no good way to tell how secure software products are.

Security is an intangible beast. When a consumer goes to buy a new release of Office, he can see that the software now offers him a handy little paper clip to guide him through that letter he is writing. He understands that Microsoft has made tangible changes, perhaps worth paying for, or perhaps not, to the software. Security does not lend itself to such neat packaging: Microsoft can say it puts more resources into security, but should the consumer trust this means the software is more secure? How does the security of Microsoft's products compare to that of other companies?

To help the consumer answer these questions, one idea that has been floated is to create an independent lab to test and rate software on security. A rating system would allow consumers to vote with their dollars for more secure software. It could also pave the way to assigning legal liability to companies for software defects. The point of rating software would be to tip a software company’s cost-benefit analysis in favor of providing security, be that via the threat of liability—increasing the cost the company incurs for selling insecure software—or via increased consumer demand for secure software—increasing the benefit the company receives for making its products secure.

A certifying lab seems to be an ingenious solution. The consumers who buy the software and the lawyers who would assign liability do not know what secure software looks like so they will trust the software security experts running the lab to tell them. The trouble is, those software security experts do not know what security looks like either.

Why is software security so hard? Quite simply, software security experts do not know whether a piece of software is secure because it is generally impossible to ascertain whether software has a given property, for all but the simplest classes of properties. Security, needless to say, is not one of those simple properties. Why is this true? Consider first what security vulnerabilities are. Security vulnerabilities are typically errors—bugs—made by the designers or implementers of a piece of software. These bugs, many of them very minor mistakes, live in a huge sea of code, millions of lines long for much commercial software, that is set up in untold numbers of different environments, with different configurations, different inputs and different interactions with other software. This creates an exponential explosion in the number of things we’re asking the software to do without making a mistake.
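
To make the scale concrete, here is a small illustrative C program (the number of options is arbitrary and chosen purely for illustration, not taken from any real product) showing how quickly independent yes/no configuration options multiply into distinct environments the software must handle correctly:

    #include <stdio.h>

    /* Toy illustration of the combinatorial explosion: n independent yes/no
       configuration options give 2^n distinct environments in which the same
       code has to behave correctly. The option counts are arbitrary. */
    int main(void)
    {
        unsigned long long combinations = 1;
        for (int options = 1; options <= 30; options++) {
            combinations *= 2;
            if (options == 10 || options == 20 || options == 30)
                printf("%2d options -> %llu possible configurations\n",
                       options, combinations);
        }
        return 0;
    }

And real software faces far more than yes/no switches: different operating systems, inputs and neighboring programs multiply the space still further.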

So where do bugs come from? Bugs are either features that the software’s designers planned for the software but that did not make it into the final product, or undesired features that the designers did not ask for but that became part of the final product anyway. It’s very easy for a programmer to introduce the latter type of error, which is where most security vulnerabilities sneak in. Perhaps Roger the developer writes perfect code—an impossibility in and of itself, since humans make mistakes. Now Joe next door writes another piece of code that uses Roger’s code in a way Roger did not expect, and this can introduce bugs. Or Roger’s code, written on Roger’s computer with its unique configuration, works fine there but not on Joe’s machine next door.
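
A minimal sketch in C, with hypothetical names invented for illustration (this is not anyone’s actual code), shows how that kind of unexpected reuse becomes a classic security bug, a buffer overflow:

    #include <stdio.h>
    #include <string.h>

    /* Roger's helper: copies a username into a fixed-size buffer. Roger only
       ever tested it with short names typed at a keyboard, so he never added
       a length check. */
    void format_greeting(const char *username, char *out)
    {
        char name[16];
        strcpy(name, username);               /* assumes the input fits in 16 bytes */
        sprintf(out, "Hello, %s!", name);
    }

    /* Joe's code reuses Roger's helper, but feeds it data read from the
       network, which an attacker can make far longer than Roger expected. */
    int main(void)
    {
        char greeting[64];
        const char *network_input = "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA";
        format_greeting(network_input, greeting);   /* overflows 'name' */
        printf("%s\n", greeting);
        return 0;
    }

Neither programmer did anything obviously careless in isolation; the vulnerability appears only in the combination of Roger’s assumption and Joe’s use.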

If you can’t avoid making errors, can you catch them? The best we can do is test the software, and testing will never catch all errors since you need to test for an unlimited number of possible things the software might do. If you want to keep your cat fenced up in your backyard, you have to anticipate all possible ways the wily cat might escape. You can climb on the roof and guess whether the cat would be willing to jump off it over the fence. You can crouch down and guess whether a cat can slink under the fence. And still, the cat will probably escape via the route you overlooked. It is very hard to prove that something is not going to happen, that the cat cannot possibly escape the yard, that a piece of software has no vulnerabilities.

Two other problems make building secure software difficult and set it apart from other complicated engineering tasks, such as building a bridge. First, there is often little correlation between the magnitude of an error in the software and the problems that the error will cause. If you forget to drill in a bolt when building a bridge, the bridge is not likely to collapse. However, if you forget to type a line of code, or even a single character, you may well compromise the integrity of the entire program.
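
A hypothetical C fragment, invented for illustration, shows how a single mistyped character can undo an entire security check; here an assignment (=) where a comparison (==) was intended makes every caller an administrator:

    #include <stdbool.h>
    #include <stdio.h>
    #include <string.h>

    /* Hypothetical privilege check: the programmer meant to write 'admin == 1'
       but typed 'admin = 1'. The assignment evaluates to 1 (true), so the test
       always succeeds and every user is treated as an administrator. */
    bool is_admin(const char *user)
    {
        int admin = 0;

        if (strcmp(user, "root") == 0)
            admin = 1;

        if (admin = 1)        /* BUG: one missing '=' grants admin rights to everyone */
            return true;

        return false;
    }

    int main(void)
    {
        /* Prints "yes" even though "guest" should never be an administrator. */
        printf("guest is admin? %s\n", is_admin("guest") ? "yes" : "no");
        return 0;
    }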

The second problem is that software is constantly being probed for vulnerabilities by attackers because spectacular attacks can be carried out while incurring little cost; they require minimal skill and effort, and the risk of detection is quite slim. A bridge’s vulnerabilities are not easy to exploit—it would be a hassle to try to blow up a bridge—and so the payoff for trying is not worth it to most.

Trying to measure security

Software’s complexity makes it difficult not only to avoid making mistakes but also to devise foolproof ways to detect those mistakes: to measure how many of them there are, and how consequential they are. However, computer scientists have devised a few methods for taking educated guesses at how secure a piece of software is.

Evaluating security can be attempted by looking at both the design of the software and the implementation of that design. Experts can study the plan, or specification, for the software to see if security features, such as encryption and access control, have been included. They can also look at the software’s threat model: its evaluation of what threats the software might face and how it will handle them. On the implementation side, the experts can evaluate how closely the software matches its specification, which is a very difficult thing to do well. They can also look at the process by which the developers created the software. Many computer scientists believe that the more closely a software development process mirrors the formal step-by-step exactitude of a civil engineering process, the more secure the code should be. So the experts can check whether the development team documented its source code so others can understand what the code does. They can see whether a developer’s code has been reviewed by the developer’s peers. They can evaluate whether the software was developed according to a reasonable schedule and not rushed to market. They can look at how well the software was tested. The hope, then, is that if the developers take measures like these, the code is likely to have fewer mistakes and thus will be more secure.

The experts might also test the software themselves. They might run programs to check the source code for types of vulnerabilities that we know occur frequently, such as buffer overflows. Or they might employ human testers to probe the code for vulnerabilities. Finally, the experts could follow the software after it has shipped, tracking how often patches to fix vulnerabilities are released, how severe the problems those patches fix are, and how long the company takes to release a patch once a vulnerability is found.
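
As a rough sketch of the idea behind such automated checks (a toy, not any lab’s actual tool), a scanner might simply flag calls to C library functions that are frequent sources of buffer overflows and leave every flagged line for a human to review:

    #include <stdio.h>
    #include <string.h>

    /* Toy source scanner: reads a C file and flags lines that call functions
       commonly associated with buffer overflows. Real tools parse the code and
       track how data flows through it; this naive substring match will also
       flag matches inside comments and strings, i.e. false positives. */
    int main(int argc, char **argv)
    {
        const char *risky[] = { "gets(", "strcpy(", "strcat(", "sprintf(" };
        char line[1024];
        int lineno = 0, findings = 0;

        if (argc < 2) {
            fprintf(stderr, "usage: %s file.c\n", argv[0]);
            return 1;
        }
        FILE *f = fopen(argv[1], "r");
        if (f == NULL) {
            perror("fopen");
            return 1;
        }
        while (fgets(line, sizeof line, f) != NULL) {
            lineno++;
            for (int i = 0; i < 4; i++) {
                if (strstr(line, risky[i]) != NULL) {
                    printf("%s:%d: possible overflow risk (%s...)\n",
                           argv[1], lineno, risky[i]);
                    findings++;
                }
            }
        }
        fclose(f);
        printf("%d potential issue(s) flagged for human review\n", findings);
        return 0;
    }

The gap between this sketch and a production analysis tool is exactly where the evaluation effort, and the flood of false positives discussed later, comes from.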

Creating a certifying lab: Requirements

For a lab to inspire confidence in its ratings, the lab must meet several requirements.

1. Reasonable tests for measuring security

The first requirement for an independent lab would be to decide which of the above methods it would use to evaluate security. The lab’s method must produce a result that enough people believe has merit. Some security experts argue that the methods we currently use to evaluate software aren’t good enough yet, and so there are no reasonable tests for measuring security. Talk of a lab, they say, is premature: Even if we can find some security vulnerabilities—which we can, using these methods—there will always be another flaw that is overlooked. Therefore, any attempt to certify a piece of software as “secure” is essentially meaningless, since such certification cannot guarantee with any degree of certainty that the software will not fall victim to a devastating attack. Others argue that since we know how to find some security vulnerabilities, certifying software based on whether or not it has these findable flaws is better than doing nothing.

Assuming for the moment that we agree with the computer scientists who believe doing something is better than doing nothing, which of these methods should the lab use? Ideally, if the only consideration were to do the best job possible in estimating whether the software is secure, the lab would look at all of them. But in the real world, the lab will face constraints such as the time and money it takes to evaluate the software, and access to proprietary information such as source code and development processes.

Our current best attempt at certifying software security works by employing experts to look at documentation about the software in order to evaluate its design and implementation. This process places special emphasis on security features such as encryption, access control and authentication. Software companies submit their products to labs that use this scheme, called the Common Criteria, to evaluate their software. A good analogy for how this works is to consider a house with a door and a lock. The Common Criteria try to examine how good that lock is: Is it the right lock? Is it installed properly? Critics of the Common Criteria say a major shortcoming of this method is that it relies too heavily on documentation of the design and implementation and essentially ignores the source code itself, which can be full of vulnerabilities. Figuring out whether the lock on your door is a good one is hardly useful if the bad guys are poking holes through the walls of your house, which is effectively what flawed code lets attackers do. Another problem is that the Common Criteria look at the software in a very limited environment—Windows, for example, was evaluated for use on a computer with no network connections and no removable media—an environment too limited to be meaningful in any general sense.

Another approach to devising methods to measure software security is under development at the Cylab at Carnegie Mellon University, in partnership with leading IT companies. The Cylab is working on a plan to plug holes in the walls of the house by running automated checking tools—programs that run against a piece of software’s source code much like spell-checkers—to catch three common types of security vulnerabilities, including buffer overflows. According to Larry Maccherone, Manager of Software Assurance Initiatives at the Cylab, 85 percent of the vulnerabilities cataloged in the CERT database, a database run by CMU’s Software Engineering Institute and one of the more comprehensive collections of known security vulnerabilities, are of the kind that the Cylab’s tools will test for. The flaw with this method for evaluating software is that it looks at only part of the problem: In this case, we are ignoring the locks on the doors.

2. A meaningful rating system

Once the lab decides which methods it will use to evaluate the software’s security, it needs to devise a way to convey its findings to consumers. Rating the software is an obvious solution. Any rating system must say how secure the lab thinks the software is, but at the same time not give users a false sense of security, so to speak. The lab should make clear that the ratings are simply an attempt to measure security and can never guarantee the software is secure. In addition, the descriptors the rating system uses to describe how secure a piece of software is must make sense to the average consumer and to those assigning liability, and at the same time map to metrics intelligible to computer scientists. A simple solution is to pass or fail the software, as Maccherone proposes. However, translating what passing means, and offering the appropriate caveats described above, is no mean feat. For the average user, you might try to explain that the tools the Cylab would use to scan the source code are intended to find the kinds of vulnerabilities that make up 85 percent of the vulnerabilities in the CERT database, and that different pieces of software contain certain numbers of these vulnerabilities. However, as Gary McGraw, Chief Technology Officer of Cigital, points out, the average user, much like Gary Larson’s cartoon dogs, would hear “Blah blah blah CERT database.” Confused, the user might then ask: “Does this mean my computer is going to turn into a spam-sending zombie or not?” This is a perfectly fair, and unanswerable, question. Eugene Spafford, Professor of Computer Science at Purdue University, adds that it would be difficult to present meaningful information even to sophisticated users like system administrators.

Hal Varian, Professor of Economics at Berkeley, suggests ratings might go beyond a general pass/fail system and note that the software is “certified for home use” or “certified for business use.” This scheme, too, would confront the same problem: What does it mean for a piece of software to be certified for home or business use, and how do you convey that meaning to consumers? Maccherone worries that it would be difficult to differentiate between certifications for different types of use at the code level.

The Common Criteria, the primary certification system in use today, makes no attempt to make its ratings meaningful to the average consumer: The ratings are largely intended for use by government agencies. Vendors who want to sell to certain parts of the U.S. government must ensure their products are certified by the Common Criteria. Companies do tout their Common Criteria certification in marketing literature, but this likely means nothing to most consumers.

3. Educated consumers

Consumers, then, would need to know about and value the ratings. Press attention, the endorsement of the ratings by leading industry figures and computer scientists, and more consumer education on why security is important would all help here.

4. Critical mass of participating companies

A critical mass of companies would need to participate in the certification process for it to be useful. There are various ways to convince companies to participate. Industry leaders might take the initiative and participate in creating and deploying the ratings for the benefit of the industry as a whole, as the leading IT companies partnered with the Cylab intend to do. According to Maccherone, industry giants like Microsoft have every interest in promoting industry-wide security evaluations. Big vendors want to see the security problem solved, and they would like to be involved in helping develop standards for measuring security rather than have those standards handed to them.

Alternatively, government policy might mandate that certain government agencies only use software that has been certified, which is how the Common Criteria certifying process works.

5. Costs in time and money

It must not be prohibitively expensive for a company to submit its software for review, so that the smaller players in the industry can afford to be included. This is a major complaint about the Common Criteria labs: The process is too expensive, which bars all but the biggest companies from having their software evaluated.

As new software releases come with some regularity, the review process must take place quickly enough for the software to have time on the market before its next version comes out. Critics fault the Common Criteria for being too slow as well.

The Cylab, because it relies on automated tools, might well deliver speedier and cheaper results than a process like the Common Criteria, which depends on human experts. Then again, a major problem with automated tools is that each of their findings needs to be evaluated by a human, since the number of false positives they produce is huge. Spafford noted that in one test he performed, the tools uncovered hundreds of false positives in ten thousand lines of code. Windows is 60 million lines of code long. How much would handling these false positives increase the lab’s costs in time and money?
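
A back-of-envelope extrapolation, treating Spafford’s “hundreds” as roughly 100 to 500 false positives per ten thousand lines and assuming, purely for illustration, that the rate scales linearly with code size, suggests the scale of the review burden:

    #include <stdio.h>

    /* Back-of-envelope: if a tool reports N false positives per 10,000 lines,
       how many flags would a 60-million-line code base produce? The per-10k
       rates are rough readings of "hundreds"; linear scaling is an assumption,
       not a measurement. */
    int main(void)
    {
        const double lines_of_code = 60e6;                 /* roughly Windows-sized */
        const double rates_per_10k[] = { 100.0, 300.0, 500.0 };

        for (int i = 0; i < 3; i++) {
            double total = rates_per_10k[i] * (lines_of_code / 10000.0);
            printf("%3.0f per 10k lines -> about %.1f million flags to review\n",
                   rates_per_10k[i], total / 1e6);
        }
        return 0;
    }

Even at the low end, that is hundreds of thousands of findings for humans to sift through.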

6. Accountability and independence

Finally, the lab must be accountable and independent. It needs to be held responsible for its scoring process, and it must be able to evaluate software without regard for whose software it is evaluating and who is funding the evaluation. Jonathan Shapiro, Assistant Professor of Computer Science at Johns Hopkins University, points out that the Common Criteria labs are not as independent and accountable as they could be: He says that companies play the labs off against each other for favorable treatment. While the Cylab’s reliance on tools might increase its ability to be independent—tools give impartial results—again, humans need to be involved to evaluate the tools’ findings. Processes with human intervention must be structured so that they maintain independence and accountability.

Is a lab worth doing now?

So would it be more useful to have a lab that can test for some flaws than to do nothing? Would widely publicizing the Common Criteria ratings—assuming they are meaningful, which many experts doubt—or hyping the Cylab’s certification to consumers when it goes live inspire companies to make certifiable software? It seems it would, at least as long as consumers perceived their computers to be “safer” than they were before buying certified software. However, if consumers began to feel their computers were no more secure for having bought certified software, the certification system would quickly fall apart. (How consumers might come to perceive this is another question entirely—most consumers probably believe their computers are pretty secure right now anyway.) It is for this reason that we may have just one shot at making a workable lab. Perhaps, then, we should wait until we can be reasonably confident that certification means a piece of software really is more secure than its uncertified rival.

Rather than encouraging the creation of a lab now, then, policy-making effort is probably best spent on funding security research so we can come up with metrics meaningful enough for a lab to use. Spafford points out that research into security metrics, as well as into security more broadly, is woefully under-funded. He believes we need to educate those who hold the purse strings that security is about more than anti-virus software and patches. Security is also about designing the software you are about to build to be secure, not just cleaning up after faulty software. Funding for research into better tools, new languages, and new architectures is probably the best contribution policy-makers can make toward improving software security.

Interviews:

Katzke, Stuart W., 2004, Senior Research Scientist, National Institute of Standards and Technology [Interview] November 12.
Maccherone, Larry, 2004, Manager of Software Assurance Initiatives, Cylab, Carnegie Mellon University [Interviews] November 11 and November 18.
Maurer, Stephen, 2004, Lecturer, Goldman School of Public Policy, University of California, Berkeley [Email exchanges] November 10-20.
McGraw, Gary, 2004, Chief Technology Officer, Cigital Corporation [Interview] November 30.
Razmov, Valentin, 2004, Graduate Student in Computer Science, University of Washington [Interview] November 4.
Shapiro, Jonathan, 2004, Assistant Professor of Computer Science, Johns Hopkins University [Interview] November 17.
Spafford, Eugene, 2004, Professor of Computer Science, Purdue University [Interview] December 2.
Varian, Hal, 2004, Professor of Economics, Business, and Information Systems, University of California, Berkeley [Email exchange] November 15.
Wing, Jeannette, 2004, Chair, Department of Computer Science, Carnegie Mellon University [Interview] November 12.