creating a database of all programming courses: Bad news for researchers

February 18, 2013 at 1:46 am 15 comments

One of the challenges of being a social science researcher today is anonymizing your data.  In our project at Georgia Tech, we are asking the Human Subjects Review Board about sharing a data set that we created in an anonymized form. There’s some question as to whether it was sufficiently anonymized.  “African-American, female, 15 years old, Hall County” feels pretty anonymous, but when you add “in a computer class” or “taking a Girl Scout animation workshop,” we’re now talking about a very small number of people.  It might be possible for someone to identify the student.  The Board is trying to decide if that’s anonymous enough.  The dataset becomes much more anonymous if we leave out the school district or county, but if we do, we lose our connection to socioeconomic status, school district performance on tests and high school completion rates, and other factors that do matter to our analyses.  If we provide those other variables and not the district, it becomes possible to triangulate the county.
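To make the "very small number of people" concern concrete, here is a minimal sketch of the kind of check a reviewer might run: count how many records share each combination of attributes, and look at the smallest group. The records below are entirely made up for illustration, and `smallest_group` is a hypothetical helper, not something from our project.

```python
from collections import Counter

# Made-up records for illustration only: (race, gender, age, county, activity).
records = [
    ("African-American", "female", 15, "Hall", "CS class"),
    ("African-American", "female", 15, "Hall", "art class"),
    ("African-American", "female", 15, "Hall", "art class"),
    ("White", "male", 16, "Hall", "CS class"),
    ("White", "male", 16, "Hall", "CS class"),
]

def smallest_group(rows, columns):
    """Size of the smallest group of rows that share the same values in `columns`."""
    counts = Counter(tuple(row[i] for i in columns) for row in rows)
    return min(counts.values())

# Demographics plus county only, then the same columns plus the activity.
k_without_activity = smallest_group(records, [0, 1, 2, 3])
k_with_activity = smallest_group(records, [0, 1, 2, 3, 4])
print(k_without_activity, k_with_activity)
```

In this toy data the smallest group shrinks from two people to one as soon as the activity column is included, which is the situation the Board is weighing.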

I recently received this message from CSTA: Code.org is a recently launched public 501c3 nonprofit dedicated to the vision that every student in every school should have the opportunity to learn to code. (see recent features on TechCrunch and in the NY Times)

We’re building a database of every classroom that teaches computer programming – whether in elementary school, college, full-time, or afterschool/summer. However, we’re only looking to list classrooms that teach programming (as opposed to computer usage/literacy)

You can search the existing database and if your class/course/university isn’t listed, PLEASE submit it at

(your private contact info won’t be shared, spammed, or sold. But the other info is listed publicly to help students and teachers find the class)

I checked with our external evaluators, and they’re worried.  With even “female, 15 years old, Hall County, in a CS class,” you can go look up in Code.org’s database where all the CS classes are in Hall County.  This wouldn’t be a big deal if there were hundreds of classes, but there aren’t.  The number of programming classes for most counties and school districts will be countable on your fingers. (You probably won’t need your toes). This is a huge deal for researchers.  We won’t be able to share our data and collaborate on analyses anywhere near the same way, not without breaking federal guidelines about anonymizing participants’ data.
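The evaluators' worry can be sketched as a simple lookup. Everything here is hypothetical: the directory contents, the school names, and the `candidate_schools` helper are assumptions made up for illustration, not the actual structure of any real database.

```python
# Hypothetical public directory: county -> schools that list a programming class.
public_directory = {
    "Hall": ["School A"],
    "Other County": ["School B", "School C", "School D"],
}

# Hypothetical "anonymized" research record, with no name and no school.
record = {"gender": "female", "age": 15, "county": "Hall", "in_cs_class": True}

def candidate_schools(rec, directory):
    """Schools the participant could attend, given only the public directory."""
    if not rec["in_cs_class"]:
        return None  # the directory says nothing about students outside CS classes
    return directory.get(rec["county"], [])

schools = candidate_schools(record, public_directory)
print(schools)
```

When a county lists only one or two programming classes, the record that looked anonymous now points at a single classroom, and age and gender do the rest.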

It’s not clear that the data Code.org is assembling about schools is useful.  Sure, there’s a high school in your school district offering programming.  You may be able to send your kid there rather than your local school, but it depends on the policies of your school district.  Let’s say that you learn that there’s an elementary school the next county over that’s offering programming.  You definitely can’t enroll your kids there without moving.  I don’t see how getting schools to make public where they are offering computer science helps us promote computer science.  (I’m also wondering who vets and maintains this database; a spot check shows several errors already.)

I exchanged email about this blog with Hadi Partovi, the founder of Code.org.  His argument is that if there are computer courses in every school, his database is not a problem, and that’s where he wants to get to.  I agree with the goal, but the database creates a short-term stumbling block.  It prevents the researchers from studying and improving computing education while it’s growing.

I love the idea of making sure that people can find learning opportunities.  I believe that this national database of programming courses creates more privacy and research ethics problems than it solves.


15 Comments

  • 1. Don Davis  |  February 18, 2013 at 8:08 am

    That is a little awkward. It might have been awkward before, but less obviously so…
    As far as the stated aims of Code.org… A few problems – some fundamental prerequisites to enrolling in CS are: knowing that there is such a thing, and thinking that such a thing is interesting/useful/beneficial, and from what I’ve seen from their plans I’m not convinced.
    Could this directory be conceivably interesting? Possibly – it might be used to generate a heat map of where CS courses are offered. This, in turn, (possibly overlaid with a school district avg./median income map) could potentially bolster awareness of the paucity and distribution of CS classes nationally and by state.
    [This is not to say that potentially useful side effects of such a problematic plan justify throwing a wrench into the work of experienced / informed researchers.]

  • 2. alfredtwo  |  February 18, 2013 at 1:47 pm

    I’m seeing a lot of well meaning industry types who want to help but seem to think that either no one else is working on getting more students into CS or that no one else is doing it “right.” They do not reach out to current practitioners to learn what is and is not working as they “have a better idea” without doing a lot of research.

    • 3. Mike Zamansky  |  February 18, 2013 at 6:41 pm

      Well said Alfred. But the same can be said about education in general – just look at the “reform” movement.

  • 4. Anonymous  |  February 18, 2013 at 3:22 pm

    I think this problem is perhaps larger than made out. This endeavor has been “blessed” by several organizations, and will be supported by what I understand is a high-quality video with extremely high-level luminaries extolling the virtues of Code.org. But the purpose of the database isn’t clear. Will this data be made generally available?

  • 5. gasstationwithoutpumps  |  February 18, 2013 at 4:54 pm

    Anonymizing people in studies is getting very difficult. People have been identified just from supposedly anonymous DNA data and small amounts of information (like year of birth), because relatives of theirs had put other DNA information in genealogy databases.

    With the amount of information on the web (especially in social media) I think that the anonymity constraints of traditional sociological research will soon become impossible to meet in most cases, not just in CS education research.

    I actually see more uses for the database than for the sociological research: the parent mailing lists I’m on often have people asking about what is available from the schools when they are planning a job-related move. Extracting what courses are offered from the hundreds of badly-maintained school web sites is nearly impossible. An accurate database of programming courses would be consulted by many of them, if they knew of its existence. (There is the question whether Code.org is dedicating any resources to making their database accurate; relying on the course providers updating the site guarantees that it will be so inaccurate and out-of-date that it will be useless.)

    Of course, the database wouldn’t do me much good, as I’m not moving and our local high school has no programming courses. We did get one course from a private school and we refused admission to a charter school that had an AP CS course (after 4 years on the waiting list) as no longer being a good fit for my son. We ended up with home-school and on-line courses (the Art of Problem Solving second programming course is not too bad for an introduction to Java after Python).

  • 6. Joseph R. Justice  |  February 24, 2013 at 8:06 pm

    (I’m sure y’all already know the following, but on the off chance you haven’t yet *realized* you know the following…)

    It seems to me that, if the ability of one entity to conduct sufficiently anonymous research depends on a second independent entity’s *not* creating and possibly distributing a database of what appears to me to be freely available public information (e.g. information that is not covered by a security classification and that does not require a subpoena or warrant to access or is considered to be a trade secret or is otherwise illegal to acquire, but that only requires sufficient time and effort to assemble), then the first entity has already lost their battle to conduct anonymous research.

    The problem here is that the anonymity sought by the first entity is akin to the sort of security thought of by those in the field of (computer) security as “security by obscurity”, which those in that field generally do not consider as providing real security at all. This “anonymity by obscurity” lasts only as long as the necessary obscurity is maintained; once it is gone, so is the anonymity depending on it.

    Unfortunately, in this day, obscurity which is not legally enforced (and sometimes / often not even then) is relatively easy to eliminate, at least if someone sees a reason to do so and has the resources to make it happen.

    I suppose that, in this particular case, it might be possible to legally outlaw the collection and correlation of the information Code.org seeks to collect. Whether this would be desirable or even feasible to achieve, and whether it would be effective if it (the outlawing of the collection and correlation of this information) was sought and achieved, is a topic for another discussion.

    Short of that, however… It seems to me that the first entity has two available options: (1) Not conduct their research in the first place (but perhaps instead conduct different research whose anonymity does not require obscurity to maintain); or (2) Accept that, for the research they wish to conduct, they cannot maintain the level of anonymity they desire (or are perhaps even legally required) to achieve, and decide what they are going to do about that.

    For the latter case, possible options I can think of off the top of my head include: (1) changing the applicable laws and institutional policies to relax them and/or recognize the changed realities of this day in terms of the ability to collect, assemble, and correlate information that it was previously not feasible (or perhaps even possible) to collect and correlate; and (2) get prior informed consent from the potential subjects of the research to *not* conduct anonymous research and the potential future effects of this upon them (which I fully agree may be difficult and expensive to achieve, and which might lead to a self-selection bias). I’m sure there’s other possible options as well which I’m overlooking. Again however, this is a topic for another discussion.

    Hope this is of some use, interest. Thanks for your time. Be well.

    Joseph R. Justice, jayarejay /

    • 7. Mark Guzdial  |  February 24, 2013 at 8:47 pm

      There is one other piece that’s necessary — for teachers, principals, and parents to divulge information on what their school is offering. The information is not publicly available now (i.e., not all schools post the list of all their classes on the web). So, “sufficient time and effort to assemble” and cooperation of informants is what’s necessary to create the database. Aren’t all secrets essentially protected by that triplet?

      • 8. gasstationwithoutpumps  |  February 25, 2013 at 9:21 am

        I find it difficult to believe that a supporter of computing education could seriously argue that schools ought to keep secret that they teach computer science.

        Remember that the goal is to improve computing education, not just to do research in computing education. For that matter, someone trying to study the spread of computing education would need a database like Code.org’s for their research. When do the needs of one researcher trump those of another?

        Secrecy in education research has led to some serious problems in the credibility of the research; look at the firestorm around Jo Boaler’s research, for example.

        I think that education researchers may need to look at the fundamental paradigms of the profession, rather than reflexively defending the secrecy that is the current norm.

        • 9. Don Davis (@gnu_don)  |  February 25, 2013 at 9:44 am

          “Cooperation of informants” seemed less about secrecy than time, knowledge, and access. It’s not that schools keep such things a secret (why would they? Schools like to be recognized. AP programs – including CS – improve that prestige), but rather that district (and more commonly teacher) webpages and whatnot where this information is harvested are woefully out of date. How often will Code.org refresh their data? How often is it expected that the schools will refresh the data? Are the schools responsible for contributing the data?
          A side question – can the list of schools offering CS programs be determined to a great extent from the AP CS testers?

          • 10. Mark Guzdial  |  February 25, 2013 at 10:37 am

            We’ve tried that, Don. Unfortunately, it doesn’t work. There are more schools teaching some form of computer science without teaching AP CS (see many of the schools). And there are schools that pass the audit (which College Board will tell us about), but don’t seem to teach the course, or at least, don’t have any test-takers (which College Board will also tell us about).

            Code.org can’t hold schools responsible for anything.

            Schools, at least here in the Atlanta area, are quite secretive about the courses they offer. Seriously, secretive. I have been prevented from even talking to CS teachers and principals several times now. I do respect that the research offices in school districts are playing an important protection role. Code.org is making an end-run around them.

          • 11. gasstationwithoutpumps  |  February 25, 2013 at 4:14 pm

            I know one private school that has a decent CS course, but it is Scheme-based, not Java-based, so they probably have few AP CS exam takers. I suspect that a few other schools have CS programs that aren’t detectable by AP CS exams, and many AP CS exam takers didn’t learn their CS in school.

        • 12. Mark Guzdial  |  February 25, 2013 at 10:32 am

          You may very well be right. I’m bound by the federal regulations regarding education research. I wish they’d change, and I know that the Obama Administration has floated proposals to change them. But right now, I have to be very strict in protecting anonymity.

          It’s a very complicated picture. For our work with ECEP, I would love to have the data that Code.org is collecting. However, making it publicly available means that I can’t share the data that we’re collecting about individuals with other research teams, because identity can then be inferred using their database.

      • 13. Joseph R. Justice  |  February 25, 2013 at 7:13 pm

        I’ll agree that it is necessary for teachers, principals, parents, et al to be willing to divulge information on what is available / has been available in the past / may or will be available in the future.

        The difference here is that, unlike information such as (for instance) individual health records or individual income tax filings or individual national census filings or individual telephone call records, each of which has legal protections of various sorts obligating the possessor of the information to *not* divulge it except to certain specific sets of individuals under certain specific circumstances lest they render themselves vulnerable to various civil and criminal penalties, the information being sought here (what sorts of specific courses in a specific category or field are available at a school, and what other resources related to this category or field are made available under the umbrella of the school) doesn’t sound to me as if it is legally protected from being divulged. There may be an institutional policy at a given institution against sharing the information, or a gatekeeper of the information may choose not to cooperate (for whatever reason they see fit to), but it doesn’t seem to me in general that it’d be a crime for a principal to say “Yes, we offer courses in X, Y, and Z” to any random person who might inquire.

        Think of it this way — if a journalist, or a random person off the street with no specific credentials, permissions, or authorizations, could get the information upon request (or by filing a Freedom Of Information Act request or in worst case lawsuit), then it’s public. (Just because it’s not easily accessible, e.g. on the web, *now* doesn’t mean it couldn’t be *made* easily accessible if someone chose to make the effort to do so.)

        And, because I believe the information Code.org is seeking to collect and correlate and (perhaps) publish (based on what I’ve read above) is information of this type, e.g. information that any random person can obtain and disseminate if they are persistent enough and want it badly enough, then for my purposes it’s public information. And, anonymity that’s based on this information not being public, and not *becoming* public, where there’s no technical and/or legal mechanism in place to prevent its becoming public, is “anonymity by obscurity” as I described above, and thus doomed to eventual failure.

        So… If the academic research someone wants to conduct will (potentially) not be sufficiently anonymous if Code.org collects and analyzes (and perhaps even distributes) the sorts of information it appears to want to collect, then the researcher in question has already lost their battle to keep their research sufficiently anonymous, and has to decide if they will not conduct their research at all, or if instead they will relax their standards for anonymity (which may involve changing institutional policies, applicable laws and regulations, etc) to a level that can be achieved taking into account what Code.org (or anyone else at all) is doing or can do independently of what the researcher is doing, or if they will do some other third possibility I haven’t considered.

        Thanks for responding. I hope you consider my counter response as useful and applicable as I think yours was. And I hope people in general find this of use and interest. Thanks for your time. Be well.

        Joseph R Justice, jayarejay /

      • 14. Gary Stager, Ph.D. (@garystager)  |  February 27, 2013 at 5:39 pm

        Surely, a course catalog is not confidential, whether it is posted on the Web or not.

        The larger problems with Code.Org are that its leading advocates are the very same people working tirelessly to dismantle public education; that there is little discussion about sound content or pedagogical issues; that they are ahistorical and believe they just discovered the idea of kids programming when some of us have been advocates for 30-40 years; and that learning computer science is secondary to ginned-up concerns about economic competitiveness…

        Newark Mayor Cory Booker celebrates Code.Org. However, he remains wholly ignorant of the fact that his city used to lead the nation in teaching Logo programming and robotics to urban children as a vehicle for intellectual liberation and antidote to the sorts of curricular standardization that his administration has advanced. The sort of “reform” agenda Booker has advanced (TFA, charters, endless testing, zero-tolerance, privatization, merit pay) makes the conditions for teaching things like computer science impossible in his city’s schools.

  • […] had posted this blog piece back in January, but then was asked to take it down.  There were concerns that the data were not anonymized enough to guarantee participant anonymity.  Tom McKlin did a great job of working with the Human Subjects Review board here at Georgia Tech, […]

