Advancing Computer Science Education With a Review of Concept Inventories

Full paper: Taking Stock of Concept Inventories in Computing Education: A Systematic Literature Review.

Murtaza Ali
12 min read · Nov 27, 2023

Murtaza Ali, Sourojit Ghosh, Prerna Rao, Raveena Dhegaskar, Sophia Jawort, Alix Medler, Mengqi Shi, and Sayamindu Dasgupta of the University of Washington. Full paper linked here.

Teaching introductory computer science classes has long posed a challenge for educators: How can students with vastly differing backgrounds all be taught effectively in a single class? This is a broad question which contains many smaller problems that need solving.

One such problem pertains to determining what content students struggle with the most. Approached in an unstructured manner, this problem could easily take up nearly all of a teacher’s time and resources, especially in a large class. Luckily, instruments and tools have been developed to streamline this task.

One of those tools — the concept inventory — is the focus of our recent work. We conducted a systematic literature review — essentially a highly structured process which involves reading, analyzing, and evaluating all of the academic literature surrounding a particular topic — of concept inventories in computing education.

What is a Concept Inventory (CI)?

Formally, a concept inventory, or CI, is a standardized set of assessment questions used to measure student understanding of a topic. CIs are often structured as multiple-choice tests, where the incorrect answers to a question correspond to specific conceptual errors. For example, if a question about loops has an answer choice which assumes programming languages start counting at 1 (rather than 0), teachers can precisely identify what students who pick that answer misunderstand.

As a result, student performance on a CI goes beyond simply measuring high-level understanding. It informs teachers about the specific misconceptions students hold. This is the defining feature of a concept inventory.
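To make the loop example above concrete, here is a hypothetical CI-style item of our own invention (not drawn from any published inventory), written in Python, in which each distractor maps to a specific misconception:

```python
# Hypothetical CI-style item: "What does this code print?"
for i in range(3):
    print(i)

# (a) 0, 1, 2     <- correct
# (b) 1, 2, 3     <- misconception: counting starts at 1
# (c) 0, 1, 2, 3  <- misconception: range(3) includes its endpoint
# (d) 1, 2        <- misconception: the loop starts at 1 and stops before 3
```

An instructor reviewing class-wide responses can read the frequency of each distractor as a direct estimate of how common the corresponding misconception is.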

Concept inventories first emerged in the early 1990s, when three physics educators — David Hestenes, Malcolm Wells, and Gregg Swackhamer — published the Force Concept Inventory (FCI). The FCI is widely known in education fields as the first CI and is held as a gold standard both in substance and impact. Hestenes and his collaborators developed the tool in response to concerning findings that many introductory physics students at the time came into courses with serious misunderstandings about Newtonian force, and — crucially — that these courses were doing little to change those false beliefs.

The FCI was designed to pinpoint the particular misconceptions individual students had so that instructors could directly respond to them. In the years that followed, the FCI led to a revolution within introductory physics education, precipitating immense positive change in the classroom and among students.

Since then, educators in other disciplines — including chemistry, astronomy, biology, mathematics, statistics, and geosciences — have worked to develop CIs of their own. In 2011, the first ever computer science CI — the Foundational CS1 Assessment (FCS1) — was published by Allison Elliott Tew and Mark Guzdial.

Research Questions

We set out to conduct a systematic literature review aimed at answering the following questions:

  1. What do we know about CIs for computer science today?
  2. Have researchers been able to tackle the 4 challenges outlined by Taylor et al. in their previous literature review of computer science CIs?

A bit of context for our second research question: In 2014, Cynthia Taylor and her team of researchers cataloged existing CIs for computer science in a similar literature review. At the time, the field was still quite young within computer science, with only 6 total CIs, 4 of which were still in progress. In light of this, the authors identified 4 challenges specific to building CIs for computer science:

  1. Pre-test limitations
  2. Programming language dependence and designing robust CIs
  3. Difficulty assessing skills
  4. CIs vs. comprehensive assessments

These are discussed in more detail below.

Considering that nearly a decade has passed since this last literature review, we felt there was a pressing need for a follow-up study which evaluated the state of computer science CIs today. Our work reports on 65 papers spanning CI work in computer science education, analyzed with a focus on extracting information pertaining to the questions above.

Curating the Literature

In collecting the literature for our systematic review, we wanted to ensure we covered relevant work from both engineering and education.

With this in mind, we selected databases hosted on Engineering Village and EBSCOhost. Searching a database is similar to conducting a Google search, but restricted to academic writing. We searched for papers related to concept inventories and computer science education, eventually finalizing a list of 65 papers to analyze.

Conducting the Review

We began by reading the papers in their entirety and taking extensive notes. After this stage, the research team met and consolidated their notes into a series of 12 narrow questions intended to delve deeper into the overarching research questions. These questions focused on the specific topics and intended audiences of existing CIs, the techniques used to develop them, how the CIs were or are used, and whether the paper in question addressed any of Taylor et al.’s challenges. The exact questions are omitted here; they are available on page 7 of the research paper.

These questions were then used to conduct a thematic and quantitative analysis which allowed us to solidify our findings.

Research Question 1: What Do We Know?

For convenience, here is a summary of our most significant findings regarding this research question:

  • There are 33 total computer science CIs in existence. 12 of these are validated.
  • There is an almost exclusive focus on undergraduate education.
  • Existing CIs cover a wide range of topics and languages — see the tables below.
  • The standard process for developing a CI remains popular, but some researchers are exploring new techniques.

Let’s dive into some more detail.

Who, what, and how?
Among the 65 papers in our literature review, 62 dealt with an undergraduate setting. To some extent, this is expected. Many computer science education researchers are also undergraduate professors of computer science or otherwise interact with undergraduate computer science students; using them as a study population is natural.

However, that does not change the fact that expansion is necessary. One of the non-undergraduate papers we read explicitly noted a “significant need for a more general middle-grades CS concept inventory assessment.” Based on our review, this need applies to high school and graduate school settings as well.

That said, current CI work in computer science does cover a wide range of topics. We identified 12 unique topic areas for CIs, as well as a broad collection of programming languages used for building CIs. The two tables below display the breakdown of topic areas and programming languages, respectively. Note that the number column references all studies that dealt with a given topic, so the total is greater than the number of CIs.

[Table: breakdown of CI topic areas and the number of studies covering each.]
[Table: breakdown of programming languages and the number of studies covering each.]

How are researchers building CIs?
Generally, building a CI (for any topic) involves four phases:

  1. Choosing the concepts through a structured expert consensus process (formally called a Delphi process).
  2. Identifying misconceptions about those concepts by interviewing students while they talk through their approach to relevant problems (think-aloud interviews) and analyzing past exams for common errors.
  3. Writing the CI questions.
  4. Validating the CI by checking the test items for both validity (Are the results accurate?) and internal reliability (Are the results consistent?).

Most of the papers in our review aimed to produce completed CIs (or at least work that could eventually contribute to a completed CI), and thus utilized established techniques for the various aspects of the CI-building process. The Delphi process was extremely common for concept identification; think-aloud interviews and exam analyses were the primary tools for discerning misconceptions; and Classical Test Theory and Item Response Theory (two statistical frameworks for evaluating test items) were the standard approaches to validation.
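As a rough illustration of the kind of item statistics Classical Test Theory involves, here is a minimal Python sketch (our own example, not code from any of the reviewed papers) that computes item difficulty, corrected item-total discrimination, and Cronbach’s alpha for a dichotomously scored CI:

```python
import numpy as np

def ctt_item_analysis(responses):
    """Classical Test Theory statistics for a dichotomously scored CI.

    responses: (n_students, n_items) array, 1 = correct, 0 = incorrect.
    """
    responses = np.asarray(responses, dtype=float)
    n_items = responses.shape[1]
    totals = responses.sum(axis=1)

    # Item difficulty: proportion of students answering each item correctly.
    difficulty = responses.mean(axis=0)

    # Item discrimination: correlation of each item with the total score
    # computed without that item (corrected point-biserial correlation).
    discrimination = np.array([
        np.corrcoef(responses[:, j], totals - responses[:, j])[0, 1]
        for j in range(n_items)
    ])

    # Cronbach's alpha: internal reliability of the instrument as a whole.
    item_vars = responses.var(axis=0, ddof=1)
    total_var = totals.var(ddof=1)
    alpha = (n_items / (n_items - 1)) * (1 - item_vars.sum() / total_var)

    return difficulty, discrimination, alpha

# Made-up scores for 5 students on a 4-item inventory.
scores = [
    [1, 0, 1, 1],
    [1, 1, 1, 0],
    [0, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]
difficulty, discrimination, alpha = ctt_item_analysis(scores)
print(difficulty, discrimination, alpha)
```

In practice, items that nearly everyone answers correctly or incorrectly, or that correlate poorly with the rest of the instrument, are typically flagged for revision before a CI is considered validated.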

However, one of our most interesting findings involved a small collection of researchers who explored alternative techniques for building CIs.

Some authors pointed out that carrying out the Delphi process can be cumbersome because the expert consensus on concepts still needs to be empirically validated after the fact. We identified 4 studies in our review which discussed alternative methods for choosing concepts and identifying misconceptions. 3 of these proposed a novel technological approach, and the 4th was one of the few that used crowdsourcing to identify misconceptions.

Two of these proposed using an algorithmic method based on coevolution and co-optimization theory. Such algorithms were originally developed to solve computer science optimization problems by applying biologically inspired techniques to “evolve” solutions rather than relying on traditional numerical methods. The first study developed an algorithm to produce a set of representative concepts; the second managed to produce a set of evolved problems with distractors directly corresponding to common student misconceptions.

The 3rd study developed a technique called Adaptive Tool-Driven Concept Generation, combining machine learning with classsourcing (crowdsourcing within a classroom) in order to automate the identification of misconceptions and the subsequent generation of questions.

Although such work made up a small proportion of the literature we studied, we draw special attention to it because it suggests a potentially fruitful area of future work that could greatly simplify the CI development process for computer science. Further details about these studies are available in Section 4.3 of our paper.

Research Question 2: Addressing Taylor et al.’s Challenges

Recall the 4 challenges for building computer science CIs identified by Taylor et al. in their 2014 literature review:

  1. Pre-test limitations
  2. Programming language dependence and designing robust CIs
  3. Difficulty assessing skills
  4. CIs vs. comprehensive assessments

To keep this article from reaching an egregious length, we will only discuss the second one in detail here — but we encourage you to check out Section 4.2 of the full paper if you are interested in learning more.

Programming language dependence and designing robust CIs
When the FCI was designed for physics education, it was almost naturally robust because Newtonian force is an extremely stable subject which can be tested purely conceptually. On the other hand, many computer science concepts are taught in a specific programming language. Furthermore, although programming language popularity may be stable for several years at a time, it certainly shifts across decades.

The reality that computer science is a young field with fluid content, coupled with the fact that courses with similar concepts are often taught in different programming languages, led Taylor et al. to raise an important challenge: How do we design long-lasting and comprehensive CIs for computer science? Our systematic review revealed promising findings with respect to this challenge.

The first solution involved using a pseudocode language to build CIs. This was the approach taken by the authors of the Basic Data Structures Inventory (BDSI), who augmented a previously existing pseudocode language (originally developed by the authors of the FCS1) to use in their CI. A later study administered the BDSI to 1,963 students with varying programming language backgrounds (Python, Java, or C) and reported an encouraging finding: Students rarely indicated any confusion with the pseudocode language, suggesting an effective correspondence with real programming languages.

As an alternative, one set of authors proposed porting existing CIs over to new languages. This is particularly interesting because it involves a new methodology in and of itself. Caceffo et al. published a method for taking an existing introductory programming CI, which they had developed in C and Python, and converting it into Java. The key advantage of this work was that it did not require starting from scratch when building the Java CI. This is a promising area of future work; if fruitful, such methods could eventually evolve into an efficient, general technique for transferring CIs between programming languages.

Other Challenges
As we mentioned above, Taylor et al. also discussed 3 other challenges, which we studied in our review but have omitted here for concision. In our paper, each of these is presented in the same manner as this one, with a detailed description of both its meaning and its potential solutions as per the literature.

Limitations

Though we meticulously ensured rigor in our work as best we could, there are of course a few limitations we acknowledge here.

Perhaps the most important is the set of keywords used in our database searches. We attempted to balance comprehensiveness with precision, but it is possible that we missed one or more relevant keywords and an associated set of literature. Notably, the set of papers in our review did not include the original papers for two of the foundational CIs in the field: the Foundational CS1 Assessment (FCS1) and its replicated version, the Second CS1 Assessment (SCS1). Manual examination of these papers revealed that our search keywords do not appear in any of the fields our search covered.

Looking into this further, we found that while the FCS1 drew from CI work in other fields and strongly influenced future computer science work, the authors do not actually refer to it as a concept inventory, but rather as a “concept assessment.” Taylor et al. mentioned this in their review as well. This fact, combined with the wide coverage of our search methods, gives us confidence that, this omission aside, our search was thorough.

Readers should also note that while we do discuss the FCS1 in our paper due to its prominence within the field, we do not include it in our formal CI counts in order to maintain the rigor of our work. Fortunately, we were able to include the SCS1 as a CI via a later paper, which conducted a follow-up study validating the SCS1 and did appear in our search.

Other limitations relate to our screening process. We did not explicitly screen out posters, panels, and extended abstracts, leading to the inclusion of two panels and one extended abstract that did not discuss CIs in a manner relevant to our work; nevertheless, their inclusion did not negatively impact our findings in any way.

Through our review, we determined that the use of CIs as an assessment method in computer science is growing and that CIs can be applied to a wide range of topics within the field. However, we acknowledge that our research takes a breadth-first approach to evaluating the field, and that depth-first work, such as Parker et al.’s evaluation of the specific impact of the FCS1 and SCS1, is also incredibly important. We hope that researchers undertake such work as a complement to this review in the near future.

Key Takeaways and the Future of CI Work

With 33 total CIs in existence, 12 of which are validated, and 10 of which were validated within the last nine years, there is considerable evidence that this is a robust research area which will continue to grow. These numbers are a significant improvement from 2014, when the counts of total and validated CIs stood at 6 and 2, respectively.

With respect to Taylor et al.’s challenges, our findings are promising: the challenges did arise naturally in subsequent CI work, and solutions have been developed in response. The foresight of the original authors should be acknowledged here.

Based on our findings, we make the following suggestions to future CI researchers:

  • Very little work has been done to develop CIs for K-12 students. With the growing interest in computing education for these student profiles, it is an excellent time to develop CIs for them.
  • While we identified the two techniques discussed above for ensuring robustness of CIs — using pseudocode or developing the CI in multiple languages — little work has been done in empirically determining the benefits and drawbacks. Rigorously evaluating these techniques, as well as developing new ones, may be apt.
  • The fourth challenge posed by Taylor et al. involves determining if CIs or comprehensive assessments are better suited for the task of evaluating teaching effectiveness in computer science. This challenge appears to be mostly unaddressed as of yet.
  • We identified 4 papers that presented novel methodologies for various aspects of the CI development process. Further evaluation of these techniques, as well as exploration of other techniques to improve the construction of CIs, is a potentially fruitful area of future work.
  • One of the above papers also proposed new classification categories for concepts: informatively easy and informatively hard. The authors argue such classifications provide more insight into student performance than traditional definitions of easy and hard and could be of great use in CI work.
  • As we mentioned in the limitations section above, a few relevant papers did not appear in our database search due to certain keywords being missing. We recommend future CI researchers in computer science utilize, at minimum, “computer science” and “concept inventory” as keywords to ensure their papers are effectively indexed.

Overall, we find that CI research in computing education is a promising field that has seen considerable growth over the last decade and will continue to grow quickly. In 2014, Taylor et al. expressed the following wish:

“Should the community embrace this challenge and develop CIs for a range of computer science courses, these CIs could usher in a new era of curricular and pedagogical innovation and evaluation.”

Dare we say this era has begun.


Murtaza Ali

PhD student at the University of Washington. Interested in human-computer interaction, data visualization, and computer science education.