Code.org creating a database of all programming courses: Bad news for researchers
One of the challenges of being a social science researcher today is anonymizing your data. In our project at Georgia Tech, we are asking the Human Subjects Review Board about sharing, in anonymized form, a data set that we created. There’s some question as to whether it was sufficiently anonymized. “African-American, female, 15 years old, Hall County” feels pretty anonymous, but when you add “in a computer class” or “taking a Girl Scout animation workshop,” we’re now talking about a very small number of people. It might be possible for someone to identify the student. The Board is trying to decide if that’s anonymous enough. The data set becomes much more anonymous if we leave out the school district or county, but if we do, we lose our connection to socioeconomic status, school district performance on tests and high school completion rates, and other factors that do matter to our analyses. If we provide those other variables and not the district, it becomes possible to triangulate the county.
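The shrinking-group effect described above can be made concrete with a small sketch. It counts how many records share each combination of identifying attributes and reports the smallest group (the idea behind k-anonymity). Every record below is invented for illustration, and the field names and the `smallest_group` helper are hypothetical; none of it comes from the actual data set.

```python
from collections import Counter

# Hypothetical toy records; every value here is invented for illustration.
records = [
    {"race": "African-American", "sex": "F", "age": 15, "county": "Hall", "activity": "computer class"},
    {"race": "African-American", "sex": "F", "age": 15, "county": "Hall", "activity": "none"},
    {"race": "African-American", "sex": "F", "age": 15, "county": "Hall", "activity": "none"},
    {"race": "White", "sex": "M", "age": 16, "county": "Hall", "activity": "none"},
    {"race": "White", "sex": "M", "age": 16, "county": "Hall", "activity": "none"},
]


def smallest_group(rows, quasi_identifiers):
    """Return the size of the smallest group of rows that share
    the same values for all of the given quasi-identifier fields."""
    counts = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(counts.values())


# Without the activity column, every person hides among at least one other.
k_basic = smallest_group(records, ["race", "sex", "age", "county"])

# Adding the activity column isolates one record completely.
k_with_activity = smallest_group(records, ["race", "sex", "age", "county", "activity"])

print(k_basic, k_with_activity)
```

A smallest group of size 1 means that at least one participant is unique on the released attributes, which is the situation a review board has to worry about.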
I recently received this message from CSTA:
Code.org is a recently launched public 501c3 nonprofit dedicated to the vision that every student in every school should have the opportunity to learn to code. (see recent features on TechCrunch and in the NY Times)
We’re building a database of every classroom that teaches computer programming – whether in elementary school, college, full-time, or afterschool/summer. However, we’re only looking to list classrooms that teach programming (as opposed to computer usage/literacy)
You can search the existing database and if your class/course/university isn’t listed, PLEASE submit it at http://www.code.org.
(your private contact info won’t be shared, spammed, or sold. But the other info is listed publicly to help students and teachers find the class)
I checked with our external evaluators, and they’re worried. With even “female, 15 years old, Hall County, in a CS class,” you can go look up in Code.org where all the CS classes are in Hall County. This wouldn’t be a big deal if there were hundreds of classes, but there aren’t. The number of programming classes for most counties and school districts will be countable on your fingers. (You probably won’t need your toes.) This is a huge deal for researchers. We won’t be able to share our data and collaborate on analyses in anything like the way we do now, not without breaking federal guidelines about anonymizing participants’ data.
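The lookup the evaluators are worried about amounts to joining a released research record against a public class listing. Here is a minimal sketch of that join, assuming a directory with just a county and a school name per entry. The directory entries, the school names, and the `candidate_schools` helper are all hypothetical and do not reflect the real Code.org data or any API.

```python
# Hypothetical public directory of programming classes (invented entries).
class_directory = [
    {"county": "Hall", "school": "School A"},
    {"county": "Hall", "school": "School B"},
    {"county": "Other", "school": "School C"},
]

# A single "anonymized" research record, also invented.
released_row = {"sex": "F", "age": 15, "county": "Hall", "in_cs_class": True}


def candidate_schools(directory, county):
    """Return the sorted, de-duplicated schools in the directory
    that list a programming class in the given county."""
    return sorted({entry["school"] for entry in directory if entry["county"] == county})


# If the record says the student is in a CS class, the directory narrows
# the possible schools to the few that offer one in that county.
candidates = []
if released_row["in_cs_class"]:
    candidates = candidate_schools(class_directory, released_row["county"])

print(len(candidates), candidates)
```

With only a handful of candidate schools, age and sex may be enough to single out one student, even though no name ever appears in the research data.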
It’s not clear that the data Code.org is assembling about schools is useful at all. Sure, say there’s a high school in your school district offering programming. You may be able to send your kid there rather than to your local school, but that depends on the policies of your school district. Let’s say that you learn that there’s an elementary school the next county over that’s offering programming. You definitely can’t enroll your kids there without moving. I don’t see how getting schools to make public where they are offering computer science helps us promote computer science. (I’m also wondering who vets and maintains this database; a spot check shows several errors already.)
I exchanged email about this blog post with Hadi Partovi, the founder of Code.org. His argument is that if there are computer courses in every school, his database is not a problem, and that’s where he wants to get to. I agree with the goal, but the database creates a short-term stumbling block. It prevents researchers from studying and improving computing education while it’s growing.
I love the idea of making sure that people can find learning opportunities. But I believe that this national database of programming courses creates more privacy and research ethics problems than it solves.