I am a PhD student in Machine Learning and Public Policy at Carnegie Mellon University, advised by Emma Strubell. Previously, I received my bachelor's degree in Computer Science with a minor in Math at the University of Illinois at Urbana-Champaign. At Illinois, I was part of the Text Information Management and Analysis Group and the Health and Social Media Group, working with ChengXiang Zhai and Dolores Albarracin. I was also a Data Science for Social Good Fellow at Imperial College London.
I develop methods to make large-scale text analysis accessible. I focus on reducing the high cost of structured information extraction, especially in settings where large-scale expert annotation is prohibitively expensive. A motivating example for my work is the inaccessibility of climate policy information. Much of what governments commit to, let alone implement, is buried in long, dense, heterogeneous documents that are difficult to analyze systematically. Robust structured information extraction can make climate policy legible and measurable at scale.
My thesis focuses on structure as a core modality for efficient, scalable knowledge work. Structured representations are richly expressive, compact, and reliably grounded in text. Crucially, structure is decomposable by definition, making it possible to distribute annotation effort across heterogeneous groups, including both language models and human annotators.
Contact: nmgandhi(at)cs.cmu.edu Google Scholar CV
Preprints
Task Decomposition for Efficient Annotation. Nupoor Gandhi, Emma Strubell. Under Review.
Peer-Reviewed Publications
Decomposing Unitization and Typing for Efficient and Consistent Span-Bound Concept Annotation. Nupoor Gandhi, Michael Bada, Emma Strubell. Findings of ACL 2026.
SynthTextEval: Synthetic Text Data Generation and Evaluation for High-Stakes Domains. Krithika Ramesh, Daniel Smolyak, Zihao Zhao, Nupoor Gandhi, Ritu Agarwal, Margrét Bjarnadóttir, Anjalie Field. Proceedings of EMNLP 2025 - Demo.
Beyond Text: Characterizing Domain Expert Needs in Document Research. Sireesh Gururaja, Nupoor Gandhi, Jeremiah Milbauer, Emma Strubell. Findings of ACL 2025.
Evaluating Differentially Private Synthetic Data Generation in High-Stakes Domains. Krithika Ramesh, Nupoor Gandhi, Pulkit Madaan, Lisa Bauer, Charith Peris, Anjalie Field. Findings of EMNLP 2024.
Challenges in End-to-End Policy Extraction from Climate Action Plans. Nupoor Gandhi, Tom Corringham, Emma Strubell. Proceedings of ClimateNLP Workshop at ACL 2024.
Annotating Mentions Alone Enables Efficient Domain Adaptation for Coreference Resolution. Nupoor Gandhi, Anjalie Field, Emma Strubell. Proceedings of ACL 2023. (Selected for Oral Presentation)
Examining risks of racial biases in NLP tools for child protective services. Anjalie Field, Amanda Coston, Nupoor Gandhi, Alexandra Chouldechova, Emily Putnam-Hornstein, David Steier and Yulia Tsvetkov. Proceedings of ACM FAccT 2023.
Improving Span Representation for Domain-adapted Coreference Resolution. Nupoor Gandhi, Anjalie Field, Yulia Tsvetkov. Proceedings of EMNLP 2021 Workshop on Computational Models of Reference, Anaphora and Coreference.
Predicting Opioid Overdose Crude Rates with Text-Based Twitter Features (Student Abstract). Nupoor Gandhi, Alex Morales, Sally Man-Pui Chan, Dolores Albarracin, and ChengXiang Zhai. Proceedings of AAAI 2020.
Multi-Attribute Topic Feature Construction for Social Media-based Prediction. Alex Morales, Nupoor Gandhi, Man-pui Sally Chan, Sophie Lohmann, Travis Sanchez, Kathleen A. Brady, Lyle Ungar, Dolores Albarracín, and ChengXiang Zhai. Proceedings of IEEE Big Data 2018.