Advisor: Professor Smita Krishnaswamy
Lab: Krishnaswamy Lab, Yale School of Medicine
Course: CPSC 490 Senior Thesis
The Goal
Imagine you're building an AI system that helps biologists analyze single-cell data. The AI needs to look at an embedding, a low-dimensional representation of high-dimensional biological data that is usually viewed as a scatter plot, and decide what analysis steps to take next. Should it run a clustering algorithm? Trajectory inference? Something else entirely?
Current systems rely on natural language. One AI agent might tell another "this data has clusters," but that description loses crucial information. How many clusters? How dense? How well-separated?
I wanted to know: Can we teach AI to recognize geometric structure directly, using quantitative metrics instead of words?
What I Built
A complete pipeline for geometric analysis of single-cell embeddings:
1. Data Pipeline — Curated 100 datasets from CELLxGENE Census, representing diverse biological systems with complete metadata. Built automated preprocessing using the ManyLatents framework.
2. Geometric Feature Extraction — Computed 53 quantitative metrics from 2D PHATE embeddings, including coordinate statistics, distance-based metrics, and spatial structure properties.
3. Web-Based Annotation Dashboard — Built a FastAPI application with a SQLite database for interactive labeling. Researchers could browse embeddings, classify structure types, and assess quality, all through a web interface. This enabled labeling of 91 datasets across two annotation sessions.
4. Structure Classifier — Trained a Support Vector Machine with an RBF kernel to predict structure categories from geometric features. Achieved 40.43% accuracy on 7 classes, about 2.8× the 14.3% expected from uniform random guessing.
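To make step 2 more concrete, here is a minimal sketch of how a handful of such features might be computed from a 2D embedding with NumPy and SciPy. The function name `geometric_features`, its parameters `k` and `bins`, and the exact definitions of each metric are assumptions for illustration; the formulas behind the 53 features in the thesis may differ.

```python
import numpy as np
from scipy.spatial import ConvexHull
from scipy.spatial.distance import pdist, squareform


def geometric_features(coords, k=10, bins=10):
    """Illustrative shape descriptors for an (n, 2) embedding (hypothetical definitions)."""
    coords = np.asarray(coords, dtype=float)
    n = coords.shape[0]

    # Pairwise Euclidean distances: condensed vector and full square matrix.
    condensed = pdist(coords)
    dist = squareform(condensed)

    # Density variation: coefficient of variation of the mean distance to the
    # k nearest neighbours (column 0 after sorting is the point itself).
    knn_mean = np.sort(dist, axis=1)[:, 1 : k + 1].mean(axis=1)
    density_variation = float(knn_mean.std() / knn_mean.mean())

    # Spatial entropy: Shannon entropy of a 2D occupancy histogram,
    # divided by log(number of cells) so the value lies in [0, 1].
    hist, _, _ = np.histogram2d(coords[:, 0], coords[:, 1], bins=bins)
    p = hist.ravel() / n
    p = p[p > 0]
    spatial_entropy = float(-(p * np.log(p)).sum() / np.log(bins * bins))

    # Hull compactness: 4*pi*area / perimeter^2 of the convex hull (1.0 for a
    # circle). For 2D input, ConvexHull.volume is the area and
    # ConvexHull.area is the perimeter.
    hull = ConvexHull(coords)
    hull_compactness = float(4.0 * np.pi * hull.volume / hull.area**2)

    # PCA elongation ratio: largest over smallest eigenvalue of the 2x2
    # covariance matrix (eigvalsh returns eigenvalues in ascending order).
    eigvals = np.linalg.eigvalsh(np.cov(coords, rowvar=False))
    pca_elongation = float(eigvals[-1] / max(eigvals[0], 1e-12))

    return {
        "mean_pairwise_distance": float(condensed.mean()),
        "density_variation": density_variation,
        "spatial_entropy": spatial_entropy,
        "hull_compactness": hull_compactness,
        "pca_elongation": pca_elongation,
    }


# Usage on synthetic points standing in for a PHATE embedding.
rng = np.random.default_rng(0)
round_blob = rng.normal(size=(300, 2))
stretched = round_blob * np.array([10.0, 1.0])  # elongated along the x axis
print(geometric_features(round_blob))
print(geometric_features(stretched))
```

Each embedding reduces to a fixed-length numeric vector, which is what lets a standard classifier consume it in step 4.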
Structure Taxonomy
I developed a taxonomy of 7 geometric structures that appear in single-cell embeddings:
- Clusters — Well-separated cell populations with high density variation
- Multi-branch — Multiple connected branches (differentiation pathways)
- Horseshoe — U-shaped continuous trajectory (developmental progressions)
- Bifurcation — Single branching point (cell fate decision points)
- Simple trajectory — Linear progression without branches
- Diffuse — No clear structure (noise or heterogeneous populations)
- Cyclic — Circular structure (cell cycle, periodic processes)
Tools & Techniques
Python & Scientific Stack: NumPy, pandas, scikit-learn for data processing and machine learning. PHATE for dimensionality reduction. ManyLatents for the automated preprocessing pipeline.
Machine Learning: Support Vector Machine with an RBF kernel, evaluated with leave-one-out cross-validation (each dataset is held out once while the model trains on the rest). Feature importance analysis to identify which geometric metrics were most predictive.
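The evaluation setup can be sketched with scikit-learn roughly as follows. This is a sketch on synthetic data, not the thesis code: the 91 rows and 53 columns only echo the numbers mentioned above, and the feature scaling step and the `C` and `gamma` values are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the real feature matrix (assumption: 91 x 53, 7 classes).
X, y = make_classification(
    n_samples=91,
    n_features=53,
    n_informative=10,
    n_redundant=0,
    n_classes=7,
    n_clusters_per_class=1,
    random_state=0,
)

# Scaling sits inside the pipeline so it is re-fit on each training fold and
# the held-out sample never influences the scaler.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))

# Leave-one-out: every sample is predicted by a model trained on all the others.
y_pred = cross_val_predict(model, X, y, cv=LeaveOneOut())

accuracy = float(np.mean(y_pred == y))
chance = 1.0 / len(np.unique(y))  # uniform random guessing over 7 classes
print(f"LOOCV accuracy: {accuracy:.4f}  (uniform chance: {chance:.4f})")
```

Leave-one-out is a natural choice when there are only a few dozen labeled examples, since each fold trains on nearly all of the data.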
Web Development: FastAPI backend with a SQLite database, deployed on Yale's HPC infrastructure. Web-based job submission via the SLURM scheduler with real-time monitoring. Interactive gallery for browsing and annotating embeddings.
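As a rough illustration of the storage side of such a dashboard, the sketch below uses Python's built-in `sqlite3` module. The table name, column names, 1 to 5 quality scale, and helper functions are all hypothetical; the actual schema is not described here, and the FastAPI layer is left out.

```python
import sqlite3

# Label vocabulary taken from the structure taxonomy above (identifier spelling is an assumption).
STRUCTURE_TYPES = (
    "clusters",
    "multi_branch",
    "horseshoe",
    "bifurcation",
    "simple_trajectory",
    "diffuse",
    "cyclic",
)


def init_db(conn):
    """Create a hypothetical annotations table that only accepts known labels."""
    allowed = ", ".join(f"'{name}'" for name in STRUCTURE_TYPES)
    conn.execute(
        f"""
        CREATE TABLE IF NOT EXISTS annotations (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            dataset_id TEXT NOT NULL,
            structure_type TEXT NOT NULL CHECK (structure_type IN ({allowed})),
            quality INTEGER CHECK (quality BETWEEN 1 AND 5),
            session INTEGER NOT NULL,
            created_at TEXT DEFAULT CURRENT_TIMESTAMP
        )
        """
    )


def add_annotation(conn, dataset_id, structure_type, quality, session):
    """Insert one label; the context manager commits or rolls back the transaction."""
    with conn:
        conn.execute(
            "INSERT INTO annotations (dataset_id, structure_type, quality, session) "
            "VALUES (?, ?, ?, ?)",
            (dataset_id, structure_type, quality, session),
        )


def label_counts(conn):
    """Return a {structure_type: count} dictionary over all annotations."""
    rows = conn.execute(
        "SELECT structure_type, COUNT(*) FROM annotations GROUP BY structure_type"
    ).fetchall()
    return dict(rows)


# Usage with an in-memory database and made-up dataset identifiers.
conn = sqlite3.connect(":memory:")
init_db(conn)
add_annotation(conn, "dataset_001", "clusters", 4, 1)
add_annotation(conn, "dataset_002", "cyclic", 5, 2)
print(label_counts(conn))
```

Putting the label vocabulary in a `CHECK` constraint means a typo in a label is rejected at write time instead of surfacing later as an eighth class during training.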
Most Predictive Features: Spatial entropy, density variation, hull compactness, pairwise distances, and PCA elongation ratio—these geometric properties proved most informative for structure classification.
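An RBF-kernel SVM has no built-in feature weights, so a ranking like this has to come from a model-agnostic method. The write-up does not say which method was used; the sketch below shows permutation importance from scikit-learn as one plausible option, run on synthetic data with generic feature names, so its scores say nothing about the real features.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data with placeholder feature names (not the thesis features).
n_features = 6
feature_names = [f"feature_{i}" for i in range(n_features)]
X, y = make_classification(
    n_samples=200,
    n_features=n_features,
    n_informative=3,
    n_redundant=0,
    n_classes=3,
    n_clusters_per_class=1,
    random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

model = make_pipeline(StandardScaler(), SVC(kernel="rbf")).fit(X_train, y_train)

# Shuffle one column at a time on held-out data and record how far the
# accuracy drops; a larger drop means the model relied more on that feature.
result = permutation_importance(model, X_test, y_test, n_repeats=20, random_state=0)

ranking = np.argsort(result.importances_mean)[::-1]
for idx in ranking:
    mean = result.importances_mean[idx]
    std = result.importances_std[idx]
    print(f"{feature_names[idx]:<12} {mean:+.3f} +/- {std:.3f}")
```

With strongly correlated features, permutation importance can understate each one individually, which is worth keeping in mind when reading a ranking of geometric metrics that overlap in what they measure.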
What I Learned
Geometric analysis works. Quantitative metrics can capture structure that humans perceive qualitatively. An accuracy of 40.43% on 7 classes may seem modest, but uniform random guessing would score 1/7 ≈ 14.3%, so the classifier reaches about 2.8× that baseline. That is good evidence the concept is viable.
Feature engineering matters. The most predictive features weren't what I expected. Spatial entropy and density variation outperformed more complex topological metrics. Simple, well-chosen features often beat sophisticated ones.
Infrastructure is underrated. I spent more time building the annotation dashboard and data pipeline than the actual classifier. But that infrastructure made everything else possible. Good tooling multiplies your research velocity.
Domain expertise is irreplaceable. Working with Professor Krishnaswamy and her lab taught me more about geometric deep learning than any paper could. The gap between "knowing the math" and "knowing what to try" is bridged by mentorship.
This is a stepping stone. The classifier isn't production-ready, but it demonstrates that geometry-informed AI agents are possible. Future work could use deep learning on raw embeddings, incorporate temporal dynamics, or extend to other data modalities.
Looking Forward
Multi-agent systems for biological analysis are coming. When they arrive, they'll need more than natural language to communicate about data structure. Geometric metrics—quantitative, precise, machine-readable—could be that common language.
This thesis was one step toward that future.