De-identifying Student Personally Identifying Information in Discussion Forum Posts With Large Language Models

Academic Paper

Oct 15

Academic Article

Academic article evaluating how large language models redact personally identifying information from MOOC discussion forum data.


Purpose/Abstract

This study evaluates the effectiveness of three large language models in redacting personally identifying information from discussion forum data in massive open online courses.

The authors examine GPT-4o, Llama 3.3 70B, and Llama 3.1 8B as tools for de-identifying student-generated forum posts.

The study focuses on a key data-use challenge in education research: protecting student privacy while making discussion forum data usable for research and analysis.

By comparing multiple large language models, the article contributes evidence about how AI tools may support de-identification workflows for educational data.

Citation

Zambrano, A. F., Singhal, S., Pankiewicz, M., Baker, R. S., Porter, C., & Liu, X. (2025). De-identifying student personally identifying information in discussion forum posts with large language models. Information and Learning Sciences, 126(5–6), 401–424. https://doi.org/10.1108/ILS-11-2024-0156

Areas researched: Data Use, AI

All Grades, Data Use

Michael Wiemeyer
