Making Software: TDD doesn’t work; Architecting is mixed; scripting is great

November 1, 2010 at 9:16 am 64 comments

I got my copy of Making Software by Oram and Wilson on Friday, and have read just over half of it already. I’m really enjoying it! I’ve always enjoyed empirical data on programming (Whoa! My teenagers would be rolling their eyes at such a geeky statement!), and this book is chock-full of it. Some of the most interesting chapters for me so far:

Herraiz and Hassan have a great study comparing all the various complexity metrics to plain ole counting lines of code. The correlation is astounding. Sure, sure — counting lines of code is a rough measure. But when the correlation with McCabe’s or Halstead’s metrics is 0.97, why go to all that trouble?
Barry Boehm does an even-handed (even if technically WAY complicated) analysis of whether up-front architecting is really worth it, or whether agile methods (with architecting distributed throughout) are good enough. The tradeoff is between having to do rework later and spending more expensive effort up front than is worthwhile. His answer is that it depends on the size of the code and the riskiness of the effort. The cost of NOT architecting is between 18% and 92% of the total cost, depending on the project. I think he’s trying to make the case that agile methods are too expensive and that Kent Beck’s arguments for agile methods are wrong.
The chapter on “How effective is Test-Driven Development?” was really enlightening for me. There are just no convincing data that building tests before building code is worth it! There aren’t measurable benefits in quality, for a variety of measures.
Whitecraft and Williams’ chapter on “Why aren’t more women in computer science?” is terrific, even-handed and sobering. They take seriously the question, “Is there value in getting more women into computing?” They come up with the answer that it’s worth trying to get more women in, but the claims about the value of diversity just aren’t supported yet. In 2003, Norway passed a law that all public firms had to have at least 40% women on their boards. A 2009 analysis showed that the companies’ value dropped, but that was because they brought on lots of younger and inexperienced women to meet the requirement. Will it help in the long run? Still to be determined. A really interesting datum on why women left computing: Unlike men, women who excel in math also tend to excel verbally, and thus have more options available to them than men who are great at math but can’t string two words together. And if you have more choices, why take one that has less social value?
The chapter I was most excited about reading, and which was ultimately most disappointing, was McConnell’s on “What does 10x mean? Measuring variations in programmer productivity.” Is there really an order-of-magnitude difference between the best and worst programmers? He goes through a bunch of studies (some of individuals, some of teams; some on initial code writing, some on debugging), but in not enough detail to feel convincing or enlightening to me. In the end, there’s evidence for a 10-fold difference, but some of the results are as high as 25:1 and as low as 3.5:1. What’s most interesting for me is what’s not there: There’s no evidence supporting the three or five orders-of-magnitude difference that I’ve heard people talk about. The data says 10:1 is about the most we can argue.
Prechelt’s chapter on comparing programming languages was fascinating. In general, scripting languages won big, but he points out that a good programmer in a non-scripting language can often beat out a mediocre programmer in a scripting language. Choose for the programmer, not the language, he argues. But if you consider this result with the complexity and Boehm results earlier, there’s a strong argument for teaching scripting languages. Scripting languages have fewer lines-of-code per solution, which leads to fewer errors and less complexity. Start students in scripting languages, and they have a leg up for using those languages later.
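
The lines-of-code versus complexity comparison mentioned above can be tried on a toy scale. The following is only a rough sketch, not the method of Herraiz and Hassan: the sample functions are made up for illustration, and "complexity" here is an assumed simplification of McCabe's metric (one plus the number of branching nodes found by Python's `ast` module).

```python
import ast
import math

# Hypothetical sample code, invented for this illustration only.
SAMPLE = '''
def constant():
    return 1

def absolute(x):
    if x < 0:
        return -x
    return x

def clamp(x, lo, hi):
    if x < lo:
        return lo
    if x > hi:
        return hi
    return x

def count_even(items):
    total = 0
    for item in items:
        if item % 2 == 0:
            total += 1
    return total

def first_negative(items):
    index = 0
    while index < len(items):
        if items[index] < 0:
            return index
        index += 1
    return -1
'''

# Node types counted as decision points (a simplification of McCabe's metric).
BRANCH_NODES = (ast.If, ast.For, ast.While, ast.IfExp, ast.ExceptHandler, ast.BoolOp)


def metrics(source):
    """Return (lines of code, simplified complexity) for each function."""
    results = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef):
            loc = node.end_lineno - node.lineno + 1
            branches = sum(isinstance(n, BRANCH_NODES) for n in ast.walk(node))
            results.append((loc, 1 + branches))
    return results


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


locs, complexities = zip(*metrics(SAMPLE))
r = pearson(locs, complexities)
print(f"Pearson r between LOC and simplified complexity: {r:.2f}")
```

Five tiny functions prove nothing about real code bases; the sketch only shows the kind of measurement being compared, and why longer functions tend to accumulate more branches.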

I can’t take the big book with me to China, so I won’t get to finish it for a few weeks, but I can already recommend it. Really fascinating to see where there’s real data, and where there’s not.


64 Comments

1. John "Z-Bo" Zabroski | November 1, 2010 at 10:35 am

Did Victor Basili contribute a chapter?

If you want a great resource on Empirical Software Engineering, then I suspect “Foundations of Empirical Software Engineering: The Legacy of Victor R. Basili” will be more profitable for your personal edification. I haven’t read Making Software, of course.

Certainly does suck that O’Reilly’s website doesn’t list all the names of the chapter contributors, and Amazon doesn’t allow a preview of the book.

Also, the danger in some studies is when the authors try to use empirical software engineering to extrapolate. You simply can’t do that. What you can do is adopt something like Total Quality Management and software processes internally to continuously improve product quality. If you aren’t measuring things like mean-time-to-failure of 24×7 operation, then you’re not that interesting of a data point.
- 2. Mark Guzdial | November 1, 2010 at 11:35 am
  
  Yes, he did contribute a chapter. My group jealously guards our set of the “Empirical studies of programmers” workshop proceedings, still the best work I’ve read on studying programmers, especially novices/students.
  - 3. John "Z-Bo" Zabroski | November 1, 2010 at 12:07 pm
    
    DBLP doesn’t appear to list those workshop proceedings in your bibliography. I also have to profess to not knowing this workshop specifically, although I may have read a famous paper from it here and there.
    
    What do you recommend I read?
4. John "Z-Bo" Zabroski | November 1, 2010 at 10:37 am

Also, MS Research published a study a few years ago showing that TDD projects took longer to complete but had higher product quality / customer satisfaction. Was that mentioned in that chapter?
- 5. Mark Guzdial | November 1, 2010 at 11:33 am
  
  I don’t recall. They definitely looked at other studies, and they explored the trade-off between quality and time to write tests. They didn’t find much quality improvement due to tests.
  - 6. John "Z-Bo" Zabroski | November 1, 2010 at 12:12 pm
    
    This is the study I am referring to (paper linked below the Channel 9 video interviewing the PI): http://channel9.msdn.com/Blogs/Peli/Experimental-study-about-Test-Driven-Development
    - 7. Lucas Layman | November 3, 2010 at 8:14 am
      
      Hi John,
      
      As one of the authors of the section on TDD, I can confirm that we did include the study that you linked.
      
      To clarify Mark’s point on the lack of convincing data:
      there is no /consistent/ effect on external quality in the studies that we examined, which were conducted in a variety of settings and on projects of varying scope. Some data suggests that TDD improves quality, some data suggests that it decreases quality, and some data is inconclusive.
      
      The real problem is that it is very difficult to understand why TDD did or did not work because of the quality of reporting in many studies. Nachi’s studies were some of the better-reported studies, but most TDD studies do not have enough information to make a confident comparison and evaluation of the TDD experiment with your own environment.
      
      As a side note, we are preparing a scientific article on our TDD review which will provide more detail than the book chapter. The “Making Software” chapter was written to be palatable to a wide audience.
      
      Regards,
      Lucas
      - 8. Michael Foord | November 4, 2010 at 10:24 am
        
        So failing to find evidence is not evidence of failure…
        
        9. Mark Guzdial | November 4, 2010 at 6:49 pm
        
        But since there is no such thing as evidence of failure, a reasonable person can make a reasonable judgement based on available efforts. One can check with Bigfoot or the Loch Ness Monster for further verification on this point.
      - 10. John "Z-Bo" Zabroski | November 4, 2010 at 10:37 pm
        
        It sounds to me, from knowledge of real-world studies, that in industry TDD ends up lengthening the project but improves overall quality. See Steve Freeman’s comments.
        
        The problem with these studies as I’ve seen them is they do a very poor job of sensitivity analysis. In addition to lack of rigor by academics, too many practitioners tend to misinterpret these studies (usually the most common mistake I see is unwarranted extrapolation). The camp who practices TDD and uses it as a core part of the software development lifecycle in very large scale software (VLSS) is the only study that matters. TDD wasn’t designed in a lab, so why test it in a lab setting? This is Steve’s point, and the Microsoft study is not about lab settings.
        
        By the way, you may not be aware of this — and I would not be surprised, since I often encounter academics who know nothing practical about what they study — but Steve Freeman is the co-author of a book on TDD: “Growing Object-Oriented Software, Guided By Tests”. In this book, Freeman and Nat Pryce explain why simply using something like Bertrand Meyer’s Design By Contract formalism is not enough. Freeman and Pryce have also written at least one paper I can think of about this: “Mock Roles, Not Objects”. Apparently academics are not familiar with this paper, since I’ve seen Meyer quoted in trade press web sites like Artima.com saying that design-by-contract automated tests are essentially always superior to unit tests written manually by the developer.
        
        Also, interestingly, there have been studies involving only students where the authors show that code quality increases with test count, regardless of whether the group tested first or tested last. However, the test-driven (test-first) group of students wrote more tests in the same amount of time (from IEEE Transactions on Software Engineering, “On the Effectiveness of the Test-First Approach to Programming”).
    - 11. Hakan Erdogmus | November 3, 2010 at 4:01 pm
      
      I am one of the authors of the TDD chapter. Ours was a review of some 30+ existing studies of TDD. Yes, Nachi’s experiments (the MSR study mentioned above) were included. BTW, I wouldn’t have concluded that “tdd doesn’t work” based on our data. Rather, I would have conservatively concluded: there is moderate evidence supporting TDD’s external quality and test-support benefits, but there is no consistent evidence in one direction or the other regarding its productivity effects.
      
      As one of the other posts states, there are so many contextual factors in such studies, and those probably explain the high variation in results. In some cases, certain key context factors (like, did the subjects really know how to apply TDD or were they experienced enough?) may dominate the control variable (the prescription of TDD as the assigned process).
      - 12. Mark Guzdial | November 4, 2010 at 12:39 am
        
        To Hakan and Lucas: Thank you for responding here! Yes, I took liberties to claim that “TDD doesn’t work” — those are my words, not yours. It’s impossible to prove that something doesn’t work. However, when many studies can’t find a consistent benefit, it is reasonable for an individual observer to conclude that it may not “work” (where work is “provide greater benefit than cost”) for one’s own circumstance. I believe that you show that it would be reasonable to conclude that TDD doesn’t work as expected.
        
        Cheers,
        Mark
      - 13. Steve Freeman | November 4, 2010 at 7:25 pm
        
        (trying again). Reading through the studies that aren’t behind academic paywalls, it looks like the “in the field” studies report more benefits than the “in the lab” studies. This isn’t very surprising since, in my anecdotal experience, the value kicks in with scale and many of the lab studies are conducted with inexperienced programmers. The other limitation with lab studies is they cannot address some of the benefits of TDD outside the raw state of the code, such as safety of ongoing maintenance and (more subtle) the focus on which features to build.
        
        Personally, I don’t think that TDD is well-enough understood yet in the field to have been proven either way (how many professional developers even understand OO?). Most academic material I’ve seen is even worse, since it’s not often taught from deep experience.
        
        This might mean that TDD is too hard to do well, but then we used to think the same about meta-programming and the kids just do it.
        
        14. Mark Guzdial | November 5, 2010 at 7:21 pm
        
        Could “in the field” studies be suffering from a Hawthorne effect?
        
        If the point of test-driven design is to (a) specify intent early with a test and (b) catch mistakes early to avoid rework, the value should be greater for novices, shouldn’t it?
      - 15. John "Z-Bo" Zabroski | November 5, 2010 at 8:08 pm
        
        Mark,
        
        Why are you relying on intuition in a thread about empirical software engineering? Doesn’t that just seem odd to you?
        
        At any rate, the value is not necessarily greater for novices, because novices tend to be poor at knowing how to test well. Even trained software engineers fail the classic test in Glenford Myers’ The Art of Software Testing, that goes like this:
        
        A program is given that receives three inputs (numbers) defining the lengths of a triangle’s sides. The program then classifies the triangle into one of these categories: Equilateral, Scalene, Isosceles, or Invalid.
        How many test cases can you build for testing this program?
        The more heterogeneous the cases the better; the more effective, better yet.
        
        Bob Binder, in his book Testing Object-Oriented Systems, actually expands on this famous example in the first chapter of his book, arguing in 1999 that a great example from 20 years earlier did not have as good test coverage as it could have had. He then enumerates every test case.
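
        A minimal sketch of what such a set of test cases might look like, in Python. The function `classify_triangle` and the particular cases are illustrative assumptions, not Myers’ or Binder’s actual enumeration; the point is only that a good set covers each category, permutations of the sides, and the invalid boundaries.

        ```python
        def classify_triangle(a, b, c):
            """Classify a triangle by its side lengths (hypothetical helper)."""
            sides = sorted((a, b, c))
            # Non-positive sides or a violated triangle inequality are invalid.
            if sides[0] <= 0 or sides[0] + sides[1] <= sides[2]:
                return "Invalid"
            if a == b == c:
                return "Equilateral"
            if a == b or b == c or a == c:
                return "Isosceles"
            return "Scalene"


        # Heterogeneous cases: one per category, plus permutations and boundaries.
        CASES = [
            ((3, 3, 3), "Equilateral"),
            ((3, 4, 5), "Scalene"),
            ((3, 3, 4), "Isosceles"),
            ((3, 4, 3), "Isosceles"),   # equal sides in other positions
            ((4, 3, 3), "Isosceles"),
            ((1, 2, 3), "Invalid"),     # degenerate: two sides sum to the third
            ((1, 2, 4), "Invalid"),     # two sides sum to less than the third
            ((4, 1, 2), "Invalid"),     # same, with the longest side first
            ((0, 0, 0), "Invalid"),     # all sides zero
            ((-3, 4, 5), "Invalid"),    # negative side
        ]

        for sides, expected in CASES:
            assert classify_triangle(*sides) == expected, (sides, expected)
        ```

        Even this short list goes beyond the "one test per category" that many people stop at, and it still omits non-numeric input, a wrong number of arguments, and very large or floating-point values.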
        
        Beyond this history lesson, novices tend not to work on projects of the same scale as experienced developers. As you should know, very few undergraduate programs require students to interact with a large (e.g. open source) code base. As you should also know, Bertrand Meyer’s A Touch of Class manifesto is an exception to the rule. And Bertrand teaches Design-by-Contract and heuristics for editing the code base, such as the Open/Closed Principle and Command/Query Separation.
        
        Aside from that, even experienced software developers cannot simply wish away the complexity of a problem by writing tests. The hardest part about writing software is the problem domain, and knowing how to model things at the right level of abstraction. Ron Jeffries is a prominent TDD advocate who was outclassed by Peter Norvig at writing a Sudoku Solver simply because Norvig is one of the foremost experts on AI and Jeffries is not. Jeffries’ code contained a lot of goldplating that didn’t actually relate to the problem domain, but he sure did have a lot of tests.
        
        In another situation, Jeffries wrote an entire book on the premise “How can we write an application if we have no idea how to write it?” His book Extreme Programming Adventures in C# is wonderful, because he starts off knowing very little about the language, very little about WinForms GUI toolkit, and (supposedly) very little about how to write a text editor. The idea is that as you read the book you are essentially doing XP with one of its creators. The idea is to take a leap of faith and see what it truly looks like to do no design up front whatsoever, and just wing the project together, learning as you go. In this sense, it is an extreme reaction to Big Design Up Front (BDUF) software.
        
        In the text editing example, it is also easy to point out that the author should have at least read the seminal thesis on emacs, The Craft of Text Editing. But that’s not the point. The real point is how do you explore a problem domain you know nothing about. You don’t even know what keywords to search for on Google Scholar, ACM Digital Library, etc. (I remember this in college, when I tried writing a spellchecker without ever hearing of the Levenshtein distance. If I knew how to phrase my problem, as a model of the similarity between two strings, I might have found it immediately, but I didn’t. I also hadn’t heard of Golding and Roth’s context-sensitive spell checker. Back in 2002, search engines weren’t very good at helping you find this information, and Wikipedia was only launched January 15th, 2001!)
        
        The big idea I took away from Jeffries was how effective he was at coming up to speed on C#, WinForms, and the problem domain at the same time just by using TDD. He didn’t even use a fancy unit testing++ library like Test-NG or Pex, he just created something simple himself. — If he had combined this with knowledge of the problem domain, he might have been even more effective, but only as much as eliminating accidental complexities allows.
        
        16. Mark Guzdial | November 5, 2010 at 8:34 pm
        
        Sorry, John — I don’t see where I’m relying on intuition. You should also look at the Boehm chapter, which explains why Big Design Up Front can lead to a halving of the costs of not doing that.
      - 17. John "Z-Bo" Zabroski | November 5, 2010 at 11:06 pm
        
        TDD and BDUF aren’t necessarily mutually exclusive, but two sides of the same coin.
        
        With TDD, it is predominantly useful to drive requirements gathering on projects with ill-defined requirements. What I find most striking from the Ron Jeffries examples I gave you above is that in each one, he was outclassed by someone who had better requirements.
        
        When you have all the requirements from the stakeholders and can use that to drive the design process, then you will get great results, and so I am not surprised to hear Boehm’s stats. Nothing is better than understanding the problem domain, but in most truly great software I’ve seen, it has taken at least 10 years over multiple iterations to get it “right” and dominate the market… They usually start off knowing what the core feature of the product is, and that is about it.
        
        To what extent did Boehm measure feature arbitrage and the project manager’s commitment to slashing bonus features that were just bells and whistles? To what extent did the project experience requirements turbulence?
        
        If the point of Big Design Up Front is to (a) specify intent early with a requirement analysis and (b) catch design mistakes early to avoid rework in implementation, the value should be greater for novices, shouldn’t it?
        
        18. Mark Guzdial | November 10, 2010 at 9:59 am
        
        No, that’s not Boehm’s argument. He says that when there is greater risk, when you don’t know all the requirements up front, it’s more important to do more “BDUF.” Boehm’s chapter does point out that it’s a trade-off: Too much BDUF is cost without additional benefit. I do recommend the chapter — I really got a new insight into how Boehm views the world, and it’s much more nuanced and objective than I’d appreciated previously.
      - 19. Steve Freeman | November 6, 2010 at 6:28 am
        
        Of course a Hawthorne effect is possible, except for the criminally neglected Braithwaite study which looked at open source projects. And the MS study looked at sufficiently large projects that I would expect the effect to wear off.
        
        As for the effect on newbies, TDD is not pixie dust. It takes skill and experience to learn and, by definition, the lab studies imply that one team was inadequately trained. In fact, a team in one of the Finnish studies said that they needed to learn more.
        
        I think the only conclusion from this discussion is that the data just isn’t good enough to draw such a strong conclusion.
        
        20. Mark Guzdial | November 10, 2010 at 9:53 am
        
        Steve, I completely agree that TDD is not pixie dust, and any technique requires practice to develop. I’m looking at TDD as a learning scientist — what is the mechanism by which TDD should be providing advantage? As I understand TDD, its advantages come from forcing clarity by writing tests, and avoiding rework because tests fail early. Those mechanisms should start showing advantage immediately, as soon as a student starts developing code. If advantages don’t show up until much later, then maybe there are other mechanisms going on.
      - 21. John "Z-Bo" Zabroski | November 10, 2010 at 10:20 am
        
        If you want to demonstrate to students the real-world advantages of using TDD, then use real-world open source projects written in a TDD fashion (or at least with pre-existing, quality unit tests).
        
        TDD is really about methodology and having a stable engineering process for iterating and incrementing a design.
      - 22. John "Z-Bo" Zabroski | November 10, 2010 at 2:55 pm
        
        Why don’t you ask people like Steve Freeman and Nat Pryce for their feedback rather than scheming by your lonesome? Granted Steve might only care about colleagues his age, but you’ll never know until you ask!
      - 23. Steve Freeman | November 10, 2010 at 3:49 pm
        
        @Mark it depends what you’re looking for. My recall of the lab studies is that they set the functionality quite tightly, so there’s little opportunity for discovery and validation. Then they look at the quality of the code and, unsurprisingly with small projects and inexperienced teams, there wasn’t much effect. I remember one early study where maintainability was defined in terms of documentation rather than flexible code.
        
        I’m also unreasonably suspicious of some university programming teaching, especially something new such as TDD where there might not be deep experience. I’ve seen some unimpressive course material online.
        
        There are exceptions, of course. The University of Sheffield, for example, has had an XP lab for years that builds real systems for non-profits with the 4th years guiding 3rd years. They claim to have some very good experiences.
24. Alan Kay | November 1, 2010 at 10:39 am

But what are they measuring?

I’m willing to go along with the statistics as they present them, but what does this actually mean about what we should be doing in computing and in programming?

I.e. if you measure a bad practice well, but say this is the way the practice is rather than how poorly it is being done, you are actually holding back progress and providing bogus evidence for bad ideas.

Cheers,

Alan
- 25. Mark Guzdial | November 1, 2010 at 11:32 am
  
  They’re clearly measuring current practice, Alan, along a variety of dimensions. Some of the chapters point to the need for something better, like Boehm’s chapter and the chapter on languages. But it’s true — you can only measure empirically what’s here, not what could be there.
  Cheers,
  Mark
  - 26. John "Z-Bo" Zabroski | November 1, 2010 at 12:14 pm
    
    In some cases you can’t even re-evaluate what was empirically measured. Ask the authors for their data set and they won’t even reply to your e-mails, let alone tell you that they can’t provide it to you for competitive reasons.
    
    This makes a lot of empirical software engineering research seem like a bit of a sham.
    
    There were some papers recently that investigated Conway’s Law with respect to distributed teams, and I requested the data set to see if I could find some more interesting results using the same exact data set, and I never got a reply, after asking three times spaced over multiple months.
    - 27. Mark Guzdial | November 1, 2010 at 12:59 pm
      
      There’s a great chapter in “Making Software” on Conway’s Law and Corollary that shows that it applies even to open-source software.
    - 28. Greg Wilson | November 2, 2010 at 3:43 pm
      
      John,
      Are you generalizing from one example? Or do you have enough other experiences of this kind to justify your statement about the shamfulness of “a lot” of empirical SE research? (And as a control on the experiment, have you measured how often profs of any kind respond to email inquiries at all? 🙂)
      Greg
      - 29. John "Z-Bo" Zabroski | November 2, 2010 at 3:55 pm
        
        I don’t know what “enough other experiences” is to you, but it is not just one example. Maybe I come across rude and abrasive in my e-mails, since I am basically challenging their results and questioning how effectively they ran their numbers through the machine. I think all test data should be published with all papers. It is not like this is genome research and there are intellectual patents on the discoveries. We’re all just trying to become better engineers across the board from these studies.
        
        I e-mail academics regularly. I think you and I have exchanged conversations before, but that might have been on a blog you wrote. Usually they are poor about follow-up conversations, and end up continuing the conversation three months down the road, saying, “oh, hey, by the way, I didn’t forget, and I talked to XYZ and they said ___. Just thought you’d like to know.” (paraphrasing)
  - 30. Steve Freeman | November 6, 2010 at 6:50 am
    
    I think there’s a deeper problem here. TDD is a practice that’s swept through a generation of top quality programmers. While this might result from Beck’s mass hypnosis, that seems unlikely.
    
    There’s an opportunity here that academia is completely missing to understand the value that these people are clearly finding in TDD–rather than trying to hold back the tide. Perhaps there’s a better way altogether.
    
    I have a hypothesis that deep TDD is a different cognitive process from “straight” programming. I’d love to find someone who could investigate that.
- 31. John "Z-Bo" Zabroski | November 1, 2010 at 12:18 pm
  
  Roger Sessions had a blog post within the past year where he deconstructed the famous Standish Report, arguing that it doesn’t tell us anything meaningful. This sort of deconstruction of what is being measured is valuable thought-stuff: http://simplearchitectures.blogspot.com/2009/10/problem-with-standish.html
  
  He also has other blog posts on the same issue, since he appears to want to use his blog as a platform for grinding the issue and sharpening his argument.
32. Greg Wilson | November 2, 2010 at 3:40 pm

A minor correction: you refer to “Making Software by Oram and Wilson”, but we’re just the editors—not to underplay the effort involved in herding cats or anything 🙂
33. Steve Freeman | November 2, 2010 at 5:34 pm

I’m fascinated by the clash between academic research on TDD (“2 groups of 3rd year students wrote a one-week project”) and the committed adoption by most of the best programmers I know. How could so many first-class people have been fooled by a technique that “doesn’t work”? Sounds like a good research topic.

In the meantime, Keith Braithwaite came up with some very interesting results about the difference between code with and without tests (http://www.keithbraithwaite.demon.co.uk/professional/presentations/). Like a physics result, it doesn’t have a good explanation yet (although I have a hunch), but it deserves more attention.
- 34. John "Z-Bo" Zabroski | November 4, 2010 at 10:50 pm
  
  Steve,
  
  Have you seen the paper published by Muller and Hofer: “The Effect of Experience on the Test-Driven Development Process”? Muller also has a separate paper where he shows that students have fewer errors when they reused code if they did test-first.
  
  Hot Button/Tangent: What I gather from all this is that we really need IDEs with microrefactorings/”semantic editing” of source code.
36. Michael Foord | November 5, 2010 at 8:13 am

“But since there is no such thing as evidence of failure”

Don’t be daft, you even cited some of the studies. The conclusion of the book was that the evidence is mixed. There is *some* good evidence that TDD does work – see the Microsoft study, Muller and Hofer, and so on.

Doesn’t suit your agenda though. 🙂
- 37. Mark Guzdial | November 5, 2010 at 9:10 am
  
  I’m sorry, Michael — I didn’t speak clearly. You can have evidence of failure, obviously. You can’t prove failure. The next trial could work. Just as it is possible to still find Bigfoot or the Loch Ness Monster. Nobody can prove they don’t exist. But at some point, reasonable people can point to the preponderance of evidence. Similarly, someone can look at the “mixed” results and decide that, no, TDD probably doesn’t work — not when quality improvements are so uncertain, and the cost is certain.
  - 38. John "Z-Bo" Zabroski | November 5, 2010 at 9:19 am
    
    *is* the cost certain?
    - 39. Mark Guzdial | November 5, 2010 at 9:21 am
      
      Until tests write themselves, yes — it costs something (time) to write tests.
      - 40. John "Z-Bo" Zabroski | November 6, 2010 at 1:18 am
        
        That’s not what I meant. Everything has a cost. I meant: is the cost certain to be ineffective? Look at the context in which you made your statement: “not when quality improvements are so uncertain, and the cost is certain.”
        
        You are making a relative comparison here. Costs do exist, sure, but here specifically you are saying costs increase unnecessarily.
        
        Now, at least one study I mentioned already in this thread has shown that if you do TDD instead of test-last, you write more tests and improve quality, in the same amount of time. So is the relative cost you are talking about certain? Well, based on that one study, no.
  - 41. Michael Foord | November 5, 2010 at 10:01 am
    
    Right, but given there is some *good* evidence of success (just not conclusive on the balance of all the studies that purport to look at TDD) you are misrepresenting the state of the evidence grossly.
    - 42. Steve Freeman (PhD) | November 5, 2010 at 12:26 pm
      
      Our worlds are so far apart… It is so much easier to maintain a system with a good test suite, that I now view it as unprofessional to ship without (my current client has a classic example). It is very hard to retrofit tests to an existing system (and even more expensive) rather than discover where the hooks need to be as you build, so I can’t see the point of not writing tests up front.
      
      The consensus for “real” projects seems to be about 15% overhead for a massive improvement in reliability, which seems like a reasonable payoff–unless you’re hoping to make money on fix requests. It’s only 15% for twice the code, since much of the effort in writing tests is actually deciding the detail of what the production code will do. This, in my experience, leads to systems with more focussed functionality than those written from specs–which the lab studies don’t address. The deeper payoff is much less stress for the development team which I, for one, prefer.
      
      At a recent workshop I went to, the academics were commenting on the hostility from some groups in academia to practices such as TDD (consider the actual contents of the “Alarming results” paper, which does not justify the title). Their view was that the practitioners are well ahead of the researchers (wouldn’t be the first time 🙂).
      Reply
      - 43. Hakan Erdogmus | November 12, 2010 at 3:38 pm
        
        I totally agree with Steve here (and I am one of the authors of the chapter on TDD). I also consider it unprofessional to ship software without a solid scaffolding of unit tests, regardless of how they were developed, and retrofitting tests is immensely difficult.
        
        Re empirical evidence on TDD, people look at empirical results and they usually see what they want to see. Overall, I see sufficient evidence for systematic quality gains (perhaps barring some special contexts), while not enough evidence for a consistent productivity hit. This is the subtle undertone of our chapter, although it’s deliberately left open to interpretation by the reader. Put in perspective, this result is still much better than most other approaches in software development, with the possible exception of code inspections/reviews. Quality is an important lever, not only for the sake of quality itself, but also for long-term productivity, that is when the cost of rework is taken into account. While not scientifically conclusive, in my interpretation, the results are still promising considering that nothing in our industry is what it’s initially made out to be by zealots.
        
        I also agree with Steve on the “alarming results” paper. Contrary to what the title suggests, the results are not that alarming at all. I don’t know what they were thinking. I guess, what they were trying to say is that “TDD is not the best thing since sliced bread”. So be it.
        Reply
      - 44. Steve Freeman | November 13, 2010 at 11:38 am
        
        @Hakan Nice summary. Thanks
        Reply
    - 45. Mark Guzdial | November 5, 2010 at 7:19 pm
      
      Not all studies are created equal. Mistakes might be made. If you have two studies with two different results, one has to choose what to believe.
      
      I teach TDD. We use TDD in our development of JES. As a Smalltalk programmer, unit testing is part of our standard practice. These results are surprising to me. Trying to be objective, I’m wondering if it’s still worthwhile to teach TDD when and where I do.
      Reply
      - 46. John "Z-Bo" Zabroski | November 6, 2010 at 1:12 am
        
        I would probably put more effort into improving the structure of the website you linked to, rather than switching how you teach things. It is a nightmare trying to find anything on that page.
        
        If you were to switch things and try a new approach, you would instead waste time working on new material, rather than improving a pre-existing product.
        
        It would be interesting to know, though, if students who were proselytized TDD in college as the One True Way were more likely to choose Test Engineer positions at software companies, as compared with their peers.
        Reply
        
      - 47. Mark Guzdial | November 10, 2010 at 9:54 am
        
        The website is written for teachers, John. That’s the audience. We test with teachers, and they can find what they need to teach their courses.
        Reply
      - 48. John "Z-Bo" Zabroski | November 10, 2010 at 10:28 am
        
        I went to a college where the largest major was Child Study, and the second largest was basically Secondary Education minor with some major competency.
        
        The thing teachers want the most (at least at the K-12 level) is to borrow snippets of ideas for lesson plans from other teachers. At the collegiate level, I am not sure how many professors even know what a lesson plan looks like; their planning is probably closer to whatever their personal taste is. My friend does napkin notes to structure things.
        
        By the way, have you audited other courses that teach programming using media computation, especially ones that aren’t based on your specific approach?
        Reply
        
      - 49. Mark Guzdial | November 10, 2010 at 11:40 am
        
        We had a great session at last year’s SIGCSE Symposium, where seven teachers presented their “media computation” based/inspired courses, all of which were not using our textbooks. Really fascinating.
        Reply
  - 50. Steve Freeman | November 6, 2010 at 7:04 am
    
    I believe the consensus is that TDD costs about 15% more for a massive improvement in reliability. This seems like a reasonable trade-off for many teams (my current client is in test-free integration hell).
    
    The maintenance costs of a system without a proper test suite are so high, that I now think it’s unprofessional to ship that way (unless I charge for fix requests :). Given that, we can write the tests at the end–which is even more expensive than writing them first. Or, we can write the tests as we go along and use them to help us understand the domain and to make the system testable. The first delivery is the smallest part of the cost.
    
    So what else am I expected to do?
    Reply
51. Raoul Duke | April 6, 2011 at 1:57 pm

Dang, people, could you tear into something more than just the TDD chapter? Why are people so quick to get into some big tussle over TDD all the time, but not the zillions of other important things to consider?
Reply
52. Agile Brazil 2011, presentation Luca Bastos | Blog da Concrete | August 24, 2011 at 1:27 pm

[…] Productivity. It is quite controversial and prompted reactions such as that of Prof. Mark Guzdial in his commentary on the book. It also makes one curious about where that number came from […]
Reply
53. NYTimes takes on Cognitive Tutors: What can we really prove with studies? « Computing Education Blog | October 11, 2011 at 9:45 am

[…] the whole “What Works Clearinghouse” make any sense at all. I raised this question in my piece for Greg’s Making Software. We have studies where Media Computation has worked well in terms of impacting […]
Reply
54. Steve Freeman | January 12, 2012 at 7:58 am

What in your experience has led you to these strong conclusions?
Reply
- 55. Steve Freeman | January 12, 2012 at 8:42 am
  
  We’re in agreement at some level, because unthinking use of _any_ technique or technology is unlikely to be successful. I’ve been through enough cycles now to see everything overpitched, rejected, then (finally) understood.
  
  You might get your point across better with a softer tone.
  Reply
- 56. Steve Freeman | January 12, 2012 at 8:47 am
  
  I think you’ll find that everything works (well, almost) in the right context. ORMs can be a nightmare, but they can also make life easier on lower performance projects (i.e. almost everything else).
  
  I’ve seen many teams crippled by adopting inappropriate technology, and I’ve also seen the opposite, where teams have crippled themselves by writing their own versions of well-understood frameworks.
  
  I’ve also done many rescue projects. Some of these had weak testing, some had TDD done really badly. Neither case is particularly easy to deal with and all had weak modularisation.
  Reply
  - 57. John Zabroski | January 16, 2012 at 12:55 am
    
    Terra,
    
    If you think the problem is too many Comp Sci students becoming programmers, then professional colleges likely would not help. There are only 10,000 USA Comp Sci graduates year over year, well below the demand. Moreover, if we break this demand down further into industries (e.g., scientific/numeric computing, compilers, military, avionics, enterprise/line-of-business, embedded systems, etc.) we will find that some are particularly short of formally trained labor.
    
    The labor pool comes from various areas. For example, a hedge fund might hire a Finance major who is “good with computers,” and an embedded systems company might hire an electrical engineer whose only programming class was a required FORTRAN course. That engineer may have taken it only because the electrical engineering department was combined with the physics department, and the senior-most professors couldn’t create a separate EE-targeted programming course, so they just told EE students to take the one designed for experimental physicists.
    Reply
- 58. John Zabroski | January 16, 2012 at 1:04 am
  
  There is too much ambiguity in this anecdote to draw any useful conclusions.
  
  If your critique is that the ORMs were generating inferior SQL to what a DBA could write, then I would love to see examples, as I collect them. Please use my blog to contact me and we will discuss offline.
  
  Ultimately, your report is way too vague to be useful. For example, “high concurrency websites” conveys no information about what is going on. What are the read-write skews? What is the budget? What techniques were being used in the middle tier (ORM layer) to address performance? Are the queries limiting themselves to pulling only the data they need? In other words, what is the business requirement for querying a lot of data? Are there temporal constraints you could use to optimize things? Oftentimes what I see when programmers re-write things is that they actually address these important issues the second time around but attribute the success to using a different technology rather than attention to detail 😦
  Reply
59. Steve Freeman | January 12, 2012 at 8:52 am

The world is slowly gathering real stats on TDD. As I wrote above, it appears to be a bit more expensive (up to 15%), but with a much reduced bug count. The top developers I know who use it do so because it helps clarify what to build and because it gives them predictability of delivery (in one case tied to the F1 racing season).

It looks like you’ve seen a number of bad implementations. I’ve been luckier and worked with better examples where it really does pay off.

I haven’t yet seen a technique that has not been oversold and generally misunderstood–I still have to teach OO fundamentals. (actually, a lot of coders could do with more structured programming). I don’t see why TDD should be different. That doesn’t make it wrong.
Reply
- 60. John Zabroski | January 16, 2012 at 1:11 am
  
  Nobody who uses TDD would count the number of classes to decide how long the project would take. Count the user stories! Who cares how many classes you have! This is not a measuring contest where we pull out rulers. Unit test features, not classes!!!!!!!!!!!
  
  I am not really sure why mocking behaviors would increase development time. Are all your edit-compile-debug cycles instantaneous? Are you seriously calling your whole service stack for every time you want to verify the behavior of something?
  Reply
61. John Zabroski | January 16, 2012 at 12:42 am

“The real biggy though is that TDD does not cover visual UI defects and usability so you need manual testers anyway, so what are you really hoping to achieve?”

This is not true at all. Most modern GUI frameworks allow TDD for visual UI defects. Likewise, creating user stories and assigning tests to them greatly helps in discovering whether problem domain abstractions will work or not. In the problem domain of user interface widgets, adding a new user control is a good example. How do I design a search widget? I need to think about what states the widget will support, such as live vs. delayed search previews. I need to think about UI hints, such as “cue banners” to indicate what to type in the box. I need to think about the focus behavior of the control and what should cause it to lose focus. I need to think about the tab stop behavior. All of these concerns need to be addressed at the unit testing level.

Moreover, unit tests highlight use cases. Suppose I am developing a novel collection viewing system that supports pluggable layouts such as traditional data grids, hierarchical data grids, card layouts and presentation carousels, etc. Most competitors in this marketplace offer some “data bound” functionality where the programmer can pass in the object and the view component (e.g., grid) will automatically try to display it. How general can we make this? For example, why does a Grid have to know anything about how to even read an object? Shouldn’t the Grid instead deal solely with a Reader abstraction over a Store? OK, that’s generic. What about performant? How do we then test scaling with 1,000s of items to display? How long does it take to load? What is the maximum number of objects it can render without running out of memory? How long does it take, and can we produce a graph to understand its behavior as a function of its key input (# of objects)? There are tons of GUI frameworks that do none of this basic testing and guess what, they are all out of existence.
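
A minimal sketch of the Grid/Reader/Store idea in Python, offered only as an illustration. Every name here (`Reader`, `DictStoreReader`, `Grid`, the text rendering) is a hypothetical stand-in and not taken from any real GUI framework; the point is that the Grid can be unit tested without knowing how objects are read.

```python
from typing import Iterable, List, Protocol, Sequence


class Reader(Protocol):
    """Hypothetical abstraction: exposes a store as rows of display strings."""

    def columns(self) -> Sequence[str]: ...
    def rows(self) -> Iterable[Sequence[str]]: ...


class DictStoreReader:
    """Reads a store that is a list of dicts; the Grid never sees the dicts."""

    def __init__(self, store: List[dict], columns: Sequence[str]) -> None:
        self._store = store
        self._columns = list(columns)

    def columns(self) -> Sequence[str]:
        return self._columns

    def rows(self) -> Iterable[Sequence[str]]:
        for item in self._store:
            # Missing keys render as empty cells instead of raising.
            yield [str(item.get(name, "")) for name in self._columns]


class Grid:
    """Lays out rows as text lines; knows only the Reader interface."""

    def __init__(self, reader: Reader) -> None:
        self._reader = reader

    def render(self) -> List[str]:
        lines = [" | ".join(self._reader.columns())]
        lines.extend(" | ".join(row) for row in self._reader.rows())
        return lines


def test_grid_renders_via_reader() -> None:
    store = [{"name": "Ada", "lang": "Python"}, {"name": "Kent"}]
    grid = Grid(DictStoreReader(store, ["name", "lang"]))
    assert grid.render() == ["name | lang", "Ada | Python", "Kent | "]


def test_grid_scales_to_many_items() -> None:
    store = [{"name": f"item{i}"} for i in range(10_000)]
    grid = Grid(DictStoreReader(store, ["name"]))
    assert len(grid.render()) == 10_001  # header plus one line per item
```

The second test only checks that a large store renders completely; timing and memory measurements of the kind described above would need a benchmark harness on top of it.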

I mean, would you use a language where the built-in collection sorting algorithm wasn’t thoroughly tested and backed up with various data sets that test for various data distribution peculiarities and read/write patterns?

Further, when composing any modern pluggable user interface, TDD is essential because every new class or object configuration introduced into the system is a continuous integration test. The line between unit testing and integration testing becomes blurred, due to the inherently higher-order behavior exhibited by these applications. Since they generalize the traditional GoF Observer pattern by decoupling the Subject and Observer with a go-between Registry object, it is vital to ensure registry mappings work as expected! This IS test-driven development!
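
As a rough sketch of what testing registry mappings could look like, assuming a tiny hypothetical `Registry` class in Python (the class, its methods, and the event names are invented for illustration and do not come from any specific framework):

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List

Handler = Callable[[str], None]


class Registry:
    """Hypothetical go-between: maps event names to handlers so that
    publishers and subscribers never hold references to each other."""

    def __init__(self) -> None:
        self._handlers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def register(self, event: str, handler: Handler) -> None:
        self._handlers[event].append(handler)

    def publish(self, event: str, payload: str) -> int:
        """Deliver payload to every handler for event; return how many ran."""
        handlers = self._handlers.get(event, [])
        for handler in handlers:
            handler(payload)
        return len(handlers)


def test_registered_handler_receives_payload() -> None:
    registry = Registry()
    seen: List[str] = []
    registry.register("search.changed", seen.append)
    assert registry.publish("search.changed", "tdd") == 1
    assert seen == ["tdd"]


def test_unmapped_event_reaches_nobody() -> None:
    registry = Registry()
    seen: List[str] = []
    registry.register("search.changed", seen.append)
    assert registry.publish("search.cleared", "") == 0
    assert seen == []
```

Each newly plugged-in component adds mappings like these, so tests at this level start to double as small integration tests.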

Finally, with technologies like Coded UI and Visual Studio 2010 Premium, automation testing has never been easier. Any barriers are political, such as your organization not being able to adopt better tools due to tremendous licensing and staffing investments in older products. Ideally QA and development work hand in hand, and any defect is able to be reproduced by the engineer assigned the defect simply by writing an automation script where the test fails. Fixing the bug is as simple as getting the test to pass.

Please, stop generalizing from your narrow set of experiences.
Reply
62. Mark Guzdial | January 21, 2012 at 10:10 am

I used moderator’s privilege and removed the last four comments to this post. The messages were more about personal attacks than contributing content to the discussion. I welcome discussion and even confrontation of issues here, but not attacks on the individual or their character.
Reply
- 63. John "Z-Bo" Zabroski | January 21, 2012 at 2:46 pm
  
  If you want to repost your original comments, then do so but clean them up. Mark also deleted one of my replies. If you do not have access to the deleted comments, email me and I can FWD them to you so that you can clean them up.
  
  You raised some good points, but dominated your opinion with attacks. For example, you pointed out the chicken-and-egg problem with doing TDD for graphics. At some point, there is bootstrapping involved where you can’t test the final product until the basic graphics subsystem works. But I actually have experience here and could comment in-depth on what works well and how it saves time – including real world examples such as the Mono Moonlight port of Silverlight. Yet, not really up for discussing this given your feedback.
  Reply
- 64. Mark Guzdial | January 21, 2012 at 6:06 pm
  
  After confirming via email, I have deleted all of Terra’s comments at his request.
  Reply


Computing Ed Research – Guzdial's Take