Posts tagged ‘evaluation’

How do we test the cultural assumptions of our assessments?

I’m teaching a course on user interface software development for about 260 students this semester. We just had a midterm where I felt I bobbled one of the assessment questions because I made cultural assumptions. I’m wondering how I could have avoided that.

I’m a big fan of multiple choice, fill-in-the-blank, and Parsons problems on my assessments. I use my Parsons problem generator a lot (see link here). For example, on this midterm, students had to arrange the scrambled parts of an HTML file to achieve a given DOM tree, and there were two JavaScript programs (using constructors and prototypes) that they had to unscramble.
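To give a sense of the kind of JavaScript involved, here is a small illustrative sketch (not the actual exam code) in the constructor-and-prototype style the question targeted:

```javascript
// Illustrative only: not the exam question. A constructor plus a method on
// its prototype, the pattern students had to reassemble into working order.
function Counter(label) {
  this.label = label;  // instance state set by the constructor
  this.count = 0;
}

// Methods on the prototype are shared by every Counter instance.
Counter.prototype.increment = function () {
  this.count = this.count + 1;
  return this.label + ": " + this.count;
};

var clicks = new Counter("Clicks");
console.log(clicks.increment()); // "Clicks: 1"
console.log(clicks.increment()); // "Clicks: 2"
```

In a Parsons problem, lines like these are presented shuffled, and students drag them into an order that produces the intended behavior.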

I typically ask some definitional questions about user interfaces at the start, about ideas like signifiers, affordances, learned associations, and metaphors. Like Dan Garcia (see his CS-Ed Podcast episode), I believe in starting the exam with some easy questions to buoy confidence. They’re typically worth only a couple of points, and I try to make the distractors fun. Here’s an example:

Since we had watched a nice video in lecture of Don Norman explaining “Norman doors,” I was pretty sure that anyone who actually attended lecture that day would know that the answer was the first one in the list. Still, maybe a half-dozen students chose the second item.

Here’s the one that bothered me much more.

I meant for the answer to be the first item on the list. In fact, almost the exact words were on the midterm exam review, so that students who studied the review guide would know immediately what we wanted. (I do know that working memory doesn’t actually store more for experts — I made a simplification to make the definition easier to keep in mind.)

Perhaps a dozen students chose the second item: “Familiarity breeds contempt. Experts’ contempt for their user interfaces allows them to use them without a sense of cognitive overload.” I had several students ask me during the exam, “What’s contempt?” I realized that many of my students didn’t know the word or the famous phrase (which dates back to Chaucer).

Then one student actually wrote on his exam, “I’m assuming that contempt means learned contentment.” If you make that assumption, the item doesn’t sound ridiculous: “Familiarity breeds learned contentment. Experts’ learned contentment for their user interfaces allows them to use them without a sense of cognitive overload.”

I had accidentally created an assessment that expected a particular cultural context. The midterm was developed over several weeks, and reviewed by my co-instructor, graduate student instructor, five undergraduate assistants, and three undergraduate graders. We’re a pretty diverse bunch. We had found and fixed perhaps a dozen errors in the exam during the development period. We’d never noticed this problem.

I’m not sure how I could have avoided this mistake. How does one remain aware of one’s own cultural assumptions? I’m thinking of the McLuhan quote: “I don’t know who discovered water, but it wasn’t a fish.” I feel bad for the students who got this problem wrong because they didn’t know the quote or the meaning of the word “contempt.” What do you think? How might I have discovered the cultural assumptions in my assessment?

March 16, 2020 at 1:57 pm 15 comments

BDSI – A New Validated Assessment for Basic Data Structures: Guest Blog Post from Leo Porter and colleagues

Leo Porter, Michael Clancy, Cynthia Lee, Soohyun Nam Liao, Cynthia Taylor, Kevin C. Webb, and Daniel Zingaro have developed a new concept inventory that they are making available to instructors and researchers. They have written this guest blog post to describe their new instrument and explain why you should use it. I’m grateful for their contribution!

We recently published a Concept Inventory for Basic Data Structures at ICER 2019 [1] and hope it will be of use to you in your classes and/or research.

The BDSI is a validated instrument for measuring student knowledge of basic data structures concepts [1].  To validate the BDSI, we engaged faculty at a diverse set of institutions to decide on topics, help with question design, and ensure the questions are valued by instructors.  We also conducted over one hundred interviews with students to identify common misconceptions and to ensure students properly interpret the questions.  Lastly, we ran pilots of the instrument at seven different institutions and performed a statistical evaluation to ensure that the questions are interpreted as intended and that they discriminate well between students of different abilities.

What Our Assessment Measures

The BDSI measures student performance on Basic Data Structure concepts commonly found in a CS2 course.  To arrive at the topics and content of the exam, we worked with fifteen faculty at thirteen different institutions to ensure broad applicability.  The resulting topics on the CI include: Interfaces, Array-Based Lists, Linked-Lists, and Binary Search Trees. If you are curious about the learning goals or want more details on the process we used in arriving at these goals, please see our SIGCSE 2018 publication [2].

Why Validated Assessments are Great for Instructors

Suppose you want to know how well your students understand various topics in your CS2 course.  How could you figure out how much your students are learning relative to other schools? You could, perhaps, get a final exam from another school and use it in your class to compare results, but that exam may not be a good fit for your course.  Moreover, you may find flaws in some of the questions and wonder whether students interpret them properly. Instead, you can use a validated assessment. The advantage of a validated assessment is that there is general agreement that it measures what you want to measure and that it accurately captures student thinking.  As such, you can compare your findings to results from other schools that have used the instrument, to determine whether your students are learning particular topics better or worse than cohorts at similar institutions.

Why Validated Assessments are Great for Researchers

As CS researchers, we often experiment with new ways to teach courses.  For example, many people use Media Computation or Peer Instruction (PI), two complementary pedagogical approaches developed over the past several decades.  It’s important to establish whether these changes are helping our students. Do more students pass? Do fewer students withdraw? Do more students continue studying CS?  Does it boost outcomes for under-represented groups? Answering these questions using a variety of courses can give us insight into whether what we do corresponds with our expectations.

One important question is: using our new approach, do students learn more than before?  Unfortunately, answering this is complicated by the lack of standardized, validated assessments.  If students score 5% higher on an exam when studying with PI vs. not studying with PI, all we know is that PI students did better on that exam.  But exams are designed by one instructor, for one course at one institution, not for the purposes of cross-institution, cross-cohort comparisons.  They are not validated. They do not take into account the perspectives of other CS experts. When students answer a question on an exam correctly, we assume that it’s because they know the material; when they answer incorrectly, we assume it’s because they don’t know the material.  But we don’t know: maybe the exam contains incidental cues that subtly influence how students respond.

A Concept Inventory (CI) solves these problems.  Its rigorous design process leads to an assessment that can be used across schools and cohorts, and can be used to validly compare teaching approaches.

How to Obtain the BDSI

The BDSI is available via the Google group.  If you’re interested in using it, please join the group and add a post with your name, institution, and how you plan to use the BDSI.

How to Use the BDSI

The BDSI is designed to be given as a post-test after students have completed the covered material.  Because the BDSI was validated as a full instrument, it is important to use the entire assessment, and not alter or remove any of the questions.  We ask that instructors not make copies of the assessment available to students after giving the BDSI, to try to avoid the questions becoming public.  We likewise recommend giving participation credit, but not correctness credit, to students for taking the BDSI, to avoid incentivizing cheating.  We have found giving the BDSI as part of a final review session, collecting the assessment from students, and then going over the answers to be a successful methodology for having students take it. 

Want to Learn More?

If you’re interested in learning more about how to build a CI, please come to our talk at SIGCSE 2020 (from 3:45-4:10pm on Thursday, March 12th) or read our paper [3].  If you are interested in learning more about how to use validated assessments, please come to our Birds of a Feather session on “Using Validated Assessments to Learn About Your Students” at SIGCSE 2020 (5:30-6:20pm on Thursday, March 12th) or our tutorial on using the BDSI at CCSC-SW 2020 (March 20-21).

References:

[1] Leo Porter, Daniel Zingaro, Soohyun Nam Liao, Cynthia Taylor, Kevin C. Webb, Cynthia Lee, and Michael Clancy. 2019. BDSI: A Validated Concept Inventory for Basic Data Structures. In Proceedings of the 2019 ACM Conference on International Computing Education Research (ICER ’19).

[2] Leo Porter, Daniel Zingaro, Cynthia Lee, Cynthia Taylor, Kevin C. Webb, and Michael Clancy. 2018. Developing Course-Level Learning Goals for Basic Data Structures in CS2. In Proceedings of the 49th ACM Technical Symposium on Computer Science Education (SIGCSE ’18).

[3] Cynthia Taylor, Michael Clancy, Kevin C. Webb, Daniel Zingaro, Cynthia Lee, and Leo Porter. 2020. The Practical Details of Building a CS Concept Inventory. In Proceedings of the 51st ACM Technical Symposium on Computer Science Education (SIGCSE ’20).

February 24, 2020 at 7:00 am Leave a comment

Attending the amazing 2017 Computing at School conference #CASConf17

On June 17, Barbara and I attended the Computing at School conference in Birmingham, England (which I wrote about here).  The slides from my talk are below. I highly recommend the summary from Duncan Hull, which I quote at the bottom.

CAS was a terrifically fun event, packed with 300 attendees. I underestimated the length of my talk (I tend to talk too fast), so instead of a brief Q&A, almost half the session ended up being Q&A. Interacting with the audience to answer teachers’ questions was more fun (and hopefully more useful and entertaining) than me talking for longer. The session was well received, based on the tweets I read. In fact, Twitter is probably the best way to get a sense of the whole day: see the hashtag #CASConf17. (I’m going to try to embed some tweets with pictures below.)

Barbara’s two workshops on Media Computation in Python using our ebooks went over really well.

I enjoyed my interactions all day long. I was asked about research results in just about every conversation — the CAS teachers are eager to see what computing education research can offer them.  I met several computing education research PhD students, which was particularly exciting and fun. England takes computing education research seriously.

Miles Berry demonstrated Project Quantum by having participants answer questions from the database.  That was an engaging and fascinating interactive presentation.

Linda Liukas gave a terrific closing keynote. She views the world from a perspective that reminded me of Mitchel Resnick’s Lifelong Kindergarten and Seymour Papert’s playfulness. I was inspired.

The session that most made me think was from Peter Kemp, on the report that he and his co-authors have just completed on the state of computing education in England. That one deserves a separate blog post – coming Wednesday.

Check out Duncan’s summary of the conference:

The Computing At School (CAS) conference is an annual event for educators, mostly primary and secondary school teachers from the public and private sector in the UK. Now in its ninth year, it attracts over 300 delegates from across the UK and beyond to the University of Birmingham, see the brochure for details. One of the purposes of the conference is to give teachers new ideas to use in their classrooms to teach Computer Science and Computational Thinking. I went along for my first time (*blushes*) seeking ideas to use in an after school Code Club (ages 7-10) I’ve been running for a few years and also for approaches that undergraduate students in Computer Science (age 20+) at the University of Manchester could use in their final year Computer Science Education projects that I supervise. So here are nine ideas (in random brain dump order) I’ll be putting to immediate use in clubs, classrooms, labs and lecture theatres:

Source: Nine ideas for teaching Computing at School from the 2017 CAS conference | O’Really?

My talk slides:

July 10, 2017 at 7:00 am 1 comment

SIGCSE 2016 Preview: Miranda Parker replicated the FCS1

I’ve been waiting a long time to write this post, though I do so even now with some trepidation.

In 2010, Allison Elliott Tew completed her dissertation on building FCS1, the first language-independent and validated measure of introductory computer science knowledge (see this post summarizing the work). The FCS1 was a significant accomplishment, but it didn’t get used much. Allison had concerns about the test becoming freely available and no longer useful as a research instrument.

Miranda Parker joined our group and replicated the FCS1. She created an isomorphic test (which we’re calling SCS1, for Secondary CS1 instrument — it comes after the first). She then followed a rigorous process for replicating a validated instrument, including think-aloud protocols to check usability (do the problems read as she meant them?), a large-scale counter-balanced study using both tests, and both correlational and item-response theory (IRT) analyses. Her results support that SCS1 is effectively identical to FCS1, but they also point out the weaknesses of both tests and why we need more and better assessments.
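For readers who haven’t run into IRT before, a common model in this kind of analysis is the two-parameter logistic (2PL) model, which estimates a difficulty and a discrimination parameter for each item. The sketch below only illustrates the idea; it is not the actual analysis from this study, and the parameter values are invented.

```javascript
// Two-parameter logistic (2PL) IRT model: the probability that a student with
// ability theta answers an item correctly, given the item's discrimination (a)
// and difficulty (b). Values below are invented, not taken from the SCS1 study.
function probabilityCorrect(theta, a, b) {
  return 1 / (1 + Math.exp(-a * (theta - b)));
}

// A highly discriminating item separates students near its difficulty sharply.
console.log(probabilityCorrect(-1.0, 2.0, 0.0).toFixed(2)); // low ability:  0.12
console.log(probabilityCorrect( 0.0, 2.0, 0.0).toFixed(2)); // at difficulty: 0.50
console.log(probabilityCorrect( 1.0, 2.0, 0.0).toFixed(2)); // high ability:  0.88
```

Roughly speaking, an IRT analysis fits parameters like these to student responses; items that barely discriminate between ability levels are candidates for revision.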

(Note: I’m complaining in this paragraph — some readers might just want to skip it.) This was the first time anyone had ever replicated a validated CS research instrument, so the process itself is a significant result. SIGCSE reviewers did not agree. The Associate Chair’s comment on our rejected paper said, “Two reviewers had concerns about appropriateness of this paper for SIGCSE: #XXX because it didn’t directly address improved learning, and #YYY because replicating the FCS1 wasn’t deemed to be as noteworthy as the original work.” Apparently, an assessment tool doesn’t improve learning, and a first-ever replication is not publishable.

Miranda was hesitant to release SCS1 for use (e.g., posting on my blog, sending emails to CSEd-Research email lists) until the result was peer-reviewed. This is a disadvantage my students have suffered for having an advisor who blogs: some reviewers have rejected my students’ papers because my blogging made it discoverable who did the research, and thus our papers couldn’t be sufficiently anonymized to meet those reviewers’ standards. So, I haven’t talked about SCS1, despite my pleasure and pride in Miranda’s accomplishment.

I’m posting this now because Miranda does have a poster on SCS1 at the SIGCSE 2016 Technical Symposium. Come see her at the 3-5 pm Poster Session on Friday. Miranda had a major success in her first year as a PhD student, and the research community now has a new validated research instrument.

Here’s the trepidation part: her paper on the replication process was just rejected from ITICSE. There’s no Associate Chair for ITICSE, so there’s no meta-review that gives the overall reasons.  One reviewer raised some concerns about the statistics, which we’ll have to investigate.  Another reviewer strongly disagrees with the idea of a replication, much like the #YYY reviewer at SIGCSE. One reviewer complained that this paper was awfully similar to a paper by Elliott Tew and Guzdial, so maybe it shouldn’t be published.  I’m not sure how we convince SIGCSE and ITICSE reviewers that replication is important and something that most STEM disciplines are calling for more of. (Particularly aggravating point: because FCS1 is not freely available, the reviewer doesn’t believe that FCS1 is “valid, consistent, and reliable” without inspecting it — as if you could tell those characteristics just by looking at the test.)

I’m talking about SCS1 now because her poster was accepted, so she has a publication on it.  We really want to publish her process and, in particular, the insights we now have about both instruments.  We’ll have to wait to publish that — and I hope the reviewers of the next conference don’t give us grief because I talked about the result here.

Contact Miranda at scs1assessment@gmail.com for access to the test.

March 2, 2016 at 8:00 am 9 comments

CMU launches initiative to improve student learning with technology

Interesting results, and nice to hear that the new initiative will be named for Herb Simon.

The Science of Learning Center, known as LearnLab, has already collected more than 500,000 hours’ worth of student data since it initially received funding from the National Science Foundation about nine years ago, its director Ken Koedinger said. That number translates to about 200 million times when students of a variety of age groups and subject areas have clicked on a graph, typed an equation or solved a puzzle.

The center collects studies conducted on data gathered from technology-enhanced courses in algebra, chemistry, Chinese, English as a second language, French, geometry and physics in an open wiki.

One such study showed that students performed better in algebra if asked to explain what they learned in their own words, for example. In another study, physics students who took time answering reflection questions performed better on tests than their peers.

via Carnegie Mellon U. launches initiative to improve student learning with technology | Inside Higher Ed.

January 3, 2014 at 1:03 am Leave a comment

Entrepreneurial MOOCs to teach CS: Different values, different evaluation

Lisa Kaczmarczyk wrote a blog post about a group of private, for-profit companies teaching CS that visited the ACM Education Council meeting on Nov. 2.  I quote below the section where the Ed Council asked tough questions about evaluation.  I wonder whether these private education efforts mean the same thing by “evaluation” as academic and research folks do.  The two have different goals and different value systems. Learning for all in public education is very different from a privatized MOOC, where it’s perfectly okay for only 1-10% of students to complete.

Of course there was controversy; members of the Ed Council asked all of the panelists some tough questions. One recurrent theme had to do with how they know what they are doing works. Evaluation how? what kind? what makes sense? what is practical? is an ongoing challenge in any pedagogical setting and when you are talking about a startup as 3 out of the 4 companies on the panel were in the fast paced world of high tech – its tricky. Some panelists addressed this question better than others. Needless to say I spent quite a bit of time on this – it was one of the longer topics of discussion over dinner at my table.

Neil Fraser from Google’s Blockly project said some things that were unquestionably controversial. The one that really got me was when he said several times, and with followup detail that one of the things they had learned was to ignore user feedback. I can’t remember his exact words after that but the idea seemed to be that users didn’t know what was best for them. Coming on the heels of earlier comments that were less than tactful about computing degree programs and their graduates … I have to give Neil credit for having the guts to share his views.

via Interdisciplinary Computing Blog: Entrepreneurial MOOCs at the ACM Ed Council Meeting.

November 12, 2013 at 1:07 am Leave a comment

Say Goodbye to Myers-Briggs, the Fad That Won’t Die

Once in our Learning Sciences seminar, we all took the Myers-Briggs test on day 1 of the semester, and again at the end.  Almost everybody’s score changed.  So, why do people still use it as some kind of reliable test of personality?

A test is reliable if it produces the same results from different sources. If you think your leg is broken, you can be more confident when two different radiologists diagnose a fracture. In personality testing, reliability means getting consistent results over time, or similar scores when rated by multiple people who know me well. As my inconsistent scores foreshadowed, the MBTI does poorly on reliability. Research shows “that as many as three-quarters of test takers achieve a different personality type when tested again,” writes Annie Murphy Paul in The Cult of Personality Testing, “and the sixteen distinctive types described by the Myers-Briggs have no scientific basis whatsoever.” In a recent article, Roman Krznaric adds that “if you retake the test after only a five-week gap, there’s around a 50% chance that you will fall into a different personality category.”

via Say Goodbye to MBTI, the Fad That Won’t Die | LinkedIn.

November 5, 2013 at 1:53 am 5 comments
