Acidus wrote: If only WASC or OWASP or somebody had some guidelines for evaluating web scanner results :-).
Sensitivity is a great measurement for scanner evaluation. Were you able to read this thread on the webappsec-l mailing-list? I also make reference to the Brian Chess Metric, available in this presentation from MetriCon 1.0 - http://www.securitymetrics.org/content/attach/Welcome_blogentry_010806_1/software_chess.ppt

The Web Application Security Scanner Evaluation Criteria (WASSEC) is a set of guidelines for evaluating web application security scanners on how well and how completely they identify web application vulnerabilities. It will cover things like crawling, parsing, session handling, the types of vulnerabilities detected, and the information reported about those vulnerabilities.
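Since sensitivity keeps coming up as the evaluation measure, here is a minimal sketch of how sensitivity and its counterpart specificity could be computed for a scanner run against a test application with a known set of seeded vulnerabilities. The counts below are hypothetical placeholders, not benchmark data.

```python
# Minimal sketch: computing sensitivity/specificity for a scanner run
# against a test app with a known list of seeded vulnerabilities.
# The counts below are hypothetical placeholders, not real benchmark data.

def sensitivity(true_positives, false_negatives):
    """Sensitivity (recall): fraction of real vulnerabilities the scanner found."""
    return true_positives / (true_positives + false_negatives)

def specificity(true_negatives, false_positives):
    """Specificity: fraction of non-vulnerable test cases correctly left unflagged."""
    return true_negatives / (true_negatives + false_positives)

# Example: the scanner found 42 of 60 seeded vulnerabilities and raised
# 15 false alarms across 200 non-vulnerable test cases.
tp, fn = 42, 60 - 42
fp, tn = 15, 200 - 15

print(f"sensitivity = {sensitivity(tp, fn):.2f}")   # how few false negatives
print(f"specificity = {specificity(tn, fp):.2f}")   # how few false positives
```

A high-sensitivity tool misses little but flags a lot; a high-specificity tool flags little that is wrong. That trade-off is what the rest of this discussion keeps circling back to.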
Nah, they're just writing up some definitions. It will never get very advanced. In addition, I am concerned about the web application security industry - an industry filled with gifted security experts and practitioners who embraced Suto's whitepaper warmly, without questioning its results or the methodology by which it was conducted for a single moment.
This is B.S. Jeremiah Grossman and RSnake (Robert Hansen) both quoted Larry Suto's paper - but Dave Aitel, Matthew Wollenweber, Charlie Miller, J.M. Seitz, Adam Muntner, several others, and I were quick to jump in and complain about Larry's results. I'm talking about people with real-world experience who don't have any ties to IBM, HP, Cenzic, or any other web application vulnerability scanner vendor. Did you see the reaction to the Larry Suto paper on the wassec-l, securitymetrics-l, samate-l, fuzzing-l, and dailydave-l mailing-lists? Go back and check your work. Unfortunately, it was only recently discussed on the webappsec-l mailing-list, via a link to Ory's paper.

Suto, having good intentions, published what he thought was in the best interest of the industry; my biggest complaint to him was that his experimental methodology was never fully disclosed to the public, and therefore could never be confirmed nor rebutted. On the other hand, one would expect security experts to use a little more judgment when reading technical whitepapers and to be skeptical of results from experiments that are not well documented. Putting numbers into a table doesn't make them meaningful.
I do find it surprising, but you have to realize that there may be more at work here. While Larry Suto may not have had a bias towards NTObjectives or against web application scanners, I think that both Jeremiah Grossman and RSnake do. Jeremiah works for a web application scanner service company that competes with IBM, HP, Cenzic, and Veracode's web application security services - but does not compete with Hailstorm, Acunetix, NTOSpider, WebInspect/DevInspect/QAInspect, or AppScan. AFAIK, RSnake uses his XSSFuzz tool, Burp Suite, TamperData, and WhiteAcid's XSS Assistant... because there was no mention of web application vulnerability scanners in their book, "XSS Attacks", except to note that web application vulnerability scanners do not find persistent or DOM-based XSS very well compared to manual methods.

I think Dave Aitel pointed me to Larry's work a few days before Jeremiah or RSnake had posted anything about it. His interesting take (and possibly his reason for sharing it) was that a few people would learn a valuable lesson from the Larry Suto paper. What did we learn from Larry's paper? We learned that Fortify Software's presentations at BH-Federal-2007 (Chess/Kureha) and BH-US-2007 (Irongeek) were correct: web application vulnerability scanners are really automated fault-injection tools, and are thus limited to the same coverage problems that fuzz testing or fault-injection testing tools have.

What problems do automated fuzz testing and fault-injection have? Well, they are "ad-hoc" security-testing methods limited primarily to semantic security-related bugs - mostly buffer overflows, integer vulnerabilities, format string vulnerabilities, input/output conditions that cause errors, interface problems, and issues with protocol, file, or language parsers. This is still very useful, but it is not the entire picture of security-related bugs in the universe. Some conditions or security properties can only be tested using formal, semi-formal, or informal methods (versus ad-hoc methods such as automated fuzz and fault-injection testing). For example, state transitions are usually best tested using methods outside of ad-hoc testing. While automated fuzz and fault-injection testing may find problems with object reuse (heap, stack, disk, memory/filesystem issues, etc.) or basic covert channels (problems with inter-process communications... network stacks, i.e. protocol loading/unloading, but additionally disk channels, i.e. file loading/unloading, language parsers, i.e. HTML/JS/XML sources/sinks, and database loading/unloading, i.e. SQL sources/sinks) - they often do not take into account advanced covert-channel attacks such as timing channels, attacks against the TCB or security perimeter, or a lack of (or attacks against) properly issued trusted paths or reference monitors (i.e. a secure kernel).

The result is that, of the code "covered" by automated fuzz or fault-injection testing, one can only really achieve about 30% coverage without the source code (i.e. "smart" fuzz testing) or metrics from binary or bytecode analysis (i.e. reverse engineering, hit-tracing/process-stalking with tools such as OllyDbg or PaiMei, etc.). However, all is not lost, because semantic errors usually cluster in about 10% of the code, so covering 30% of an application may in fact be more coverage than is necessary.
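To make the "scanners are automated fault-injection tools" point concrete, here is a minimal sketch of what such a tool is essentially doing under the hood: replaying a request with malformed parameter values and matching the response against error signatures. The target URL, parameter name, payloads, and signatures are all hypothetical; real scanners add crawling, session handling, and far larger payload sets.

```python
# Minimal sketch of ad-hoc fault injection against a single web parameter.
# The URL, parameter name, payloads, and error signatures are hypothetical.
import urllib.parse
import urllib.request

TARGET = "http://testapp.example/search"   # hypothetical target
PARAM = "q"

# A tiny payload set; real tools use thousands of mutations.
PAYLOADS = ["'", "\"><script>alert(1)</script>", "%00", "A" * 4096]

# Naive signatures suggesting the input reached a parser unescaped.
SIGNATURES = ["SQL syntax", "ODBC", "Traceback", "<script>alert(1)</script>"]

for payload in PAYLOADS:
    query = urllib.parse.urlencode({PARAM: payload})
    try:
        with urllib.request.urlopen(f"{TARGET}?{query}", timeout=5) as resp:
            body = resp.read().decode("utf-8", errors="replace")
    except Exception as exc:
        print(f"[!] {payload!r} caused a transport/server error: {exc}")
        continue
    hits = [s for s in SIGNATURES if s in body]
    if hits:
        print(f"[+] {payload!r} triggered signatures: {hits}")
```

Note that this only exercises whichever code paths the crafted requests happen to reach and only recognizes faults it has signatures for, which is exactly the coverage limitation described above.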
If the goal is to find the low-hanging fruit in a penetration test, then a scanner or manual method has probably succeeded at 10% coverage much faster than it would by covering 30% of the application, which would take a lot more time. The point of finding low-hanging fruit in a penetration test is not only to prove that the application/software is insecure, but to provide focus areas for secure design review and secure code review that get to the "root cause" of the problem (i.e. input/output validation/encoding and proper canonicalization in the case of XSS, and proper/complete use of parameterized queries in the case of SQLi). Ideally, software development teams will take note of the root-cause issues that penetration testing gives them and develop their own tools and processes to add automated fuzz and fault-injection testing techniques to their secure design reviews, secure manual code reviews, secure automated static analysis checks, and their build servers. The best and cheapest way to do this is to utilize continuous-prevention development, which is the core of my suggested CPSL secure development practice.

All of this has been well known since the 1970s, as testing techniques have not really improved since then. Some people have been misusing code coverage techniques for almost 40 years now, but fortunately there is now literature we can point people towards so that they can understand. One of them is: [PDF] How to Misuse Code Coverage. Be sure to read this guide before complaining about the results of your web application vulnerability scanner.

Ideally, you'd be using a hybrid tool such as AppScan DE or DevInspect, or possibly QAInspect or AppScan Tester, instead of a zero-knowledge black-box tool. Romain Gaucher has partially implemented PHP-SAT along with his Grabber tool in an extension called Crystal that combines both tool methods into a hybrid. Expect other hybrid methods to become more popular over time, although it would be nice to include code CRAP metrics (i.e. code coverage and cyclomatic complexity) along with this sort of testing, just to improve the technique. However, these metrics shouldn't be used to measure the success of these testing methods, for all of the reasons mentioned (plus what is available in the paper on How to Misuse Code Coverage). So, in reality, lower code coverage could be better than higher code coverage, especially since you waste less time finding the same number of bugs.

Also - if you read anything about the binary classifier "sensitivity", you'll note that a higher number of false positives is desirable in these types of tools, since it lowers the possibility of false negatives. By combining that with the binary classifier "specificity" of static source-code|bytecode|binary analysis (e.g. Fortify SCA, FindBugs, Veracode SecurityReview), you can work towards a "Gold Standard" test (see the Wikipedia entry on Sensitivity (tests) above).
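As a rough illustration of that "Gold Standard" idea, here is a minimal sketch that correlates findings from a high-sensitivity dynamic scan with findings from a high-specificity static scan. The finding keys and data are hypothetical, and real tools need much fuzzier matching between runtime entry points and source locations.

```python
# Minimal sketch: correlating dynamic (black-box) and static findings to
# approximate a "gold standard" verdict. The dynamic tool is tuned for
# sensitivity (few false negatives), the static tool for specificity
# (few false positives). Finding formats and data below are hypothetical.

# Dynamic scanner findings: (vulnerability class, entry point)
dast_findings = {
    ("xss", "/search?q="),
    ("sqli", "/login?user="),
    ("xss", "/profile?name="),      # possibly a false positive
}

# Static analysis findings mapped to the same entry points
sast_findings = {
    ("xss", "/search?q="),
    ("sqli", "/login?user="),
    ("sqli", "/report?id="),        # code path the crawler never reached
}

confirmed = dast_findings & sast_findings          # flagged by both: highest confidence
needs_manual_review = (dast_findings | sast_findings) - confirmed

print("Confirmed (treat as true positives):")
for vuln in sorted(confirmed):
    print(f"  {vuln}")

print("Needs manual review (flagged by one tool only):")
for vuln in sorted(needs_manual_review):
    print(f"  {vuln}")
```

Findings flagged by both tools are strong candidates for true positives, while the symmetric difference is where manual review (and the root-cause analysis discussed above) should be spent.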