
Spectra: A Data Engineering Platform

· 5 min read
Software Engineer

Spectra, originally Peer Review Insights and Student Metrics (PRISM), is a data engineering and data science project to support engineering student capstone programs.

In 2022, we migrated from manually collecting fewer than 40 data points per student team to automatically capturing over 8,000 data points across more than 60 teams per semester. After validating the approach at the end of 2022, we spent the first half of 2023 rewriting the system for better modularity, enabling rapid support for new use cases and analyses.

By mid-2023 we began automatically capturing more than 250,000 data points per year across 120 teams and automatically analyzing and flagging anomalies for instructor attention. This has enabled scalable data-driven instruction and intervention in a multi-course capstone program with a peak headcount of over 600 students at any given time.

Design

Spectra is designed to be modular so it can handle frequently evolving needs. Originally built to support student peer evaluations, it has grown to aggregate and analyze multiple student performance factors.

Spectra architecture

Data access was also a key concern. The first iteration used a collection of scripts stored in the same project directory as the data. Building the system as a modular command-line tool separated data pipeline development from the actual data going through that pipeline.

Because we work with student performance data, we decided it was critical to design the system in a way that essentially eliminated the risk of a new contributor accidentally exposing student data in a commit.

Diagram showing the separation between data and data tooling
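
One way a command-line tool can enforce this separation is to refuse any data directory that lives inside the code repository. The following is a minimal sketch of that idea, not Spectra's actual implementation; the helper name `resolve_data_dir` and the policy it applies are assumptions made for illustration.

```python
from pathlib import Path


def resolve_data_dir(data_dir: str, repo_root: str) -> Path:
    """Return the resolved data directory, refusing any path inside the repo.

    Hypothetical helper: the name and the policy are assumptions for illustration.
    """
    data_path = Path(data_dir).resolve()
    repo_path = Path(repo_root).resolve()
    # A data directory equal to, or nested under, the repository root could be
    # swept up by a careless `git add`, so reject it outright.
    if data_path == repo_path or repo_path in data_path.parents:
        raise ValueError(
            f"data directory {data_path} is inside the repository {repo_path}"
        )
    return data_path
```

With a check like this at startup, the tool fails fast when someone points it at a folder inside the repository, instead of relying on every contributor to maintain ignore rules correctly.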

With the high-level architecture outlined, let's look at each major segment of the data flow.

Data Ingestion

Data is ingested through a few methods. A custom survey app (Surveyor) was built to collect student peer evaluations. The ingestion engine is a Python package with several ingestor classes, each with the single responsibility of handling one data source. The primary one is the survey ingestor, which processes the survey files, but there are also ingestors for attendance data, Zoom logs, etc. Each ingestor does the bare minimum of transformation to get the data into an accepted format and writes it to disk at predefined locations.

Spectra ingestion architecture
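
To make the one-ingestor-per-source idea concrete, here is a minimal sketch of how such classes could be laid out. Everything in it is an assumption for illustration: the class names, the `output_subdir` layout, the JSON output format, and the `Student`/`Date` CSV columns are hypothetical and are not taken from the Spectra codebase.

```python
import csv
import json
from abc import ABC, abstractmethod
from pathlib import Path


class Ingestor(ABC):
    """Base class: each subclass handles exactly one data source."""

    # Subdirectory of the output root this ingestor writes to (hypothetical layout).
    output_subdir: str

    @abstractmethod
    def parse(self, source: Path) -> list[dict]:
        """Read one raw file and return records in the accepted format."""

    def ingest(self, source: Path, output_root: Path) -> Path:
        """Parse the source file and write the records to a predefined location."""
        records = self.parse(source)
        target_dir = output_root / self.output_subdir
        target_dir.mkdir(parents=True, exist_ok=True)
        target = target_dir / f"{source.stem}.json"
        target.write_text(json.dumps(records, indent=2))
        return target


class AttendanceIngestor(Ingestor):
    """Bare-minimum transformation: rename fields and strip whitespace."""

    output_subdir = "attendance"

    def parse(self, source: Path) -> list[dict]:
        with source.open(newline="") as handle:
            return [
                {"student": row["Student"].strip(), "date": row["Date"].strip()}
                for row in csv.DictReader(handle)
            ]
```

Supporting a new source then means adding one subclass with its own `parse` method, while the write-to-disk behavior stays in a single place.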

Surveyor is built as a client-side application that allows users to fill out peer evaluations and then encodes them into a custom binary blob that students can submit as an assignment via the learning management system (LMS). We took this approach for two reasons. First, we wanted to maximize user experience and minimize operational responsibility. Second, we wanted to offload authentication responsibility to the LMS.
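
The actual blob format is not described here, so the following is only a sketch of what a self-describing binary submission could look like: a small fixed header followed by compressed JSON. The `SRVY` signature, the version byte, and the function names are all hypothetical. Both directions are shown in Python so the round trip can be checked; in a real client-side app the encoding half would run in the browser.

```python
import json
import struct
import zlib

MAGIC = b"SRVY"  # hypothetical file signature
VERSION = 1      # hypothetical format version
# Header layout: 4-byte signature, 1-byte version, 4-byte payload length (big-endian).
HEADER = struct.Struct(">4sBI")


def encode_submission(answers: dict) -> bytes:
    """Serialize survey answers into a header plus compressed JSON payload."""
    payload = zlib.compress(json.dumps(answers, sort_keys=True).encode("utf-8"))
    return HEADER.pack(MAGIC, VERSION, len(payload)) + payload


def decode_submission(blob: bytes) -> dict:
    """Validate the header and recover the survey answers."""
    magic, version, length = HEADER.unpack_from(blob)
    if magic != MAGIC or version != VERSION:
        raise ValueError("not a recognized submission file")
    payload = blob[HEADER.size:HEADER.size + length]
    if len(payload) != length:
        raise ValueError("truncated submission file")
    return json.loads(zlib.decompress(payload).decode("utf-8"))
```

A signature and version byte let an ingestor reject unrelated uploads early and leave room for the format to change later without breaking older submissions.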

Indexing and Serving

The next stage of the data pipeline handles transformation, indexing, and serving data. Transformation is handled by a set of Python packages that are responsible for transforming the raw data into desired formats. Some are as simple as preparing CSV files for semi-manual analysis, while others prepare data for ingestion into a data store.

Spectra ingestion architecture

The primary indexer is OpenSearch (an alternative to Elasticsearch), which lets us build dynamic dashboards to display, filter, and drill into data. It also provides a queryable platform for downstream analytics.
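
As an illustration of what "queryable" can mean for downstream analytics, the sketch below builds a query DSL body that filters to one team and averages scores per rated student. The field names `team_id`, `ratee`, and `score` are hypothetical, and the sketch assumes the first two are mapped as keyword fields; the real index mapping is not shown in this post.

```python
def team_scores_query(team_id: str) -> dict:
    """Build a query body: average peer score per ratee for one team.

    Field names are assumptions for illustration, not the real index mapping.
    """
    return {
        "size": 0,  # only the aggregation is needed, not the individual hits
        "query": {
            "bool": {
                "filter": [
                    {"term": {"team_id": team_id}},
                ]
            }
        },
        "aggs": {
            "by_ratee": {
                "terms": {"field": "ratee"},
                "aggs": {"avg_score": {"avg": {"field": "score"}}},
            }
        },
    }
```

A body like this can be passed to a client's search call (for example, the `search` method of the `opensearch-py` client, with an index name of your choosing) or pasted into a dashboard's query console.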

DataOps: Analytics & Reports

The final stage in the data flow involves analytics and reporting. The analytics are handled by a collection of Python modules we call Analyzers. Each analyzer takes data either from disk or from OpenSearch and performs a specific analysis. In the case of peer surveys, analyzers compute individual and team statistics, identify anomalies, and flag them for instructor attention. Other analyzers compute attendance and correlate Zoom logs with attendance data to identify attempts to bypass attendance checks.
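
As a rough sketch of the attendance correlation described above, the function below flags students who passed the attendance check but show little or no time in the Zoom log. The function name, the input shapes, and the 30-minute threshold are assumptions for illustration; the real analyzers may use different signals.

```python
def flag_attendance_mismatches(
    checked_in: set[str],
    zoom_minutes: dict[str, float],
    min_minutes: float = 30.0,
) -> list[str]:
    """Return students marked present whose Zoom time is below the threshold.

    Students missing from the Zoom log are treated as having zero minutes.
    """
    return sorted(
        student
        for student in checked_in
        if zoom_minutes.get(student, 0.0) < min_minutes
    )
```

For example, with `checked_in = {"ada", "bo", "cy"}` and `zoom_minutes = {"ada": 55, "bo": 4}`, the function returns `["bo", "cy"]`: one student left almost immediately and one never appears in the log.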

Spectra ingestion architecture

Other analyzers are used to generate reports. These reports use data from earlier analyzers to generate consumable PDF reports that help instructors and faculty advisors identify team dynamics or performance issues early.

Data Analysis

One of the most common and useful visualizations our analyzers produce is a heatmap of average peer evaluation scores for a team. Students are asked to rate each other on a scale of 1-5 across several performance dimensions. The heatmap shows an aggregation of that data and allows instructors to quickly identify teams that have potential issues.

Heatmap of student performance data

For example, in the heatmap above, we can visually identify color variation in rows or columns. A column with a strong shift towards red is a strong indicator of an underperforming team member. In this case we can see that with Student C and Student F.

Similarly, rows with a strong shift towards red are a strong indicator of a team member who is dissatisfied with the team. We see that most notably with Students B and C. This can be an indicator of frustration that may lead to poor team cohesion.

This also lets us identify other flags quickly. For example, a row of all perfect scores is an indicator that the individual didn't put much effort into filling out the peer evaluation.
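
The three signals described above (low columns, low rows, and all-perfect rows) can also be computed directly from the rater-by-ratee matrix behind the heatmap. The sketch below is one way to do that under stated assumptions: `scores[rater][ratee]` holds the average score a rater gave a teammate, every rater rated at least one teammate, and red corresponds to low scores. The function name and return shape are hypothetical.

```python
from statistics import mean


def summarize_team(scores: dict[str, dict[str, float]], max_score: float = 5.0) -> dict:
    """Summarize a rater-by-ratee score matrix for one team."""
    raters = sorted(scores)
    # Row means: how generously each student rated the rest of the team.
    given = {rater: mean(scores[rater].values()) for rater in raters}
    # Column means: how each student was rated by teammates.
    ratees = sorted({ratee for row in scores.values() for ratee in row})
    received = {
        ratee: mean(row[ratee] for row in scores.values() if ratee in row)
        for ratee in ratees
    }
    # Rows made up entirely of the top score suggest a low-effort evaluation.
    all_perfect = [
        rater for rater in raters
        if all(value == max_score for value in scores[rater].values())
    ]
    return {"given": given, "received": received, "all_perfect_rows": all_perfect}
```

A low value in `received` corresponds to a red column, a low value in `given` to a red row, and `all_perfect_rows` lists the evaluations worth a second look.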

Closing Thoughts

A critical architectural decision was the use of modular single-responsibility components. This allows the tool to be easily extended to support new use cases. In practice, this is not always so simple. Clear interfaces between components are critical. When those interfaces need to change, we either have to retroactively change existing modules to support the new interface or design for backwards compatibility. This introduces the need for proactive management of technical debt.
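
One common way to design for backwards compatibility is to version the records that pass between components and upgrade older records on read. The sketch below illustrates the idea with an invented schema change; the `schema_version` field, the record layout, and the v1-to-v2 change are all hypothetical and are not drawn from Spectra.

```python
def upgrade_record(record: dict) -> dict:
    """Return the record in the current (hypothetical) schema, version 2.

    Records without a `schema_version` field are assumed to be version 1.
    """
    version = record.get("schema_version", 1)
    if version == 1:
        # Hypothetical change: v1 stored a single "score";
        # v2 stores a mapping of per-dimension scores.
        return {
            "schema_version": 2,
            "rater": record["rater"],
            "ratee": record["ratee"],
            "scores": {"overall": record["score"]},
        }
    if version == 2:
        return record
    raise ValueError(f"unsupported schema version: {version}")
```

With a shim like this at the boundary, downstream modules only ever see the current schema, and the cost of an interface change is concentrated in one function rather than spread across every consumer.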

Ultimately this approach solved manual bottlenecks in identifying performance issues quickly at scale. It established a system to identify issues in near real-time and convey them visually in a clear, rapid, and consumable way.