Technical Details – Pharma Email Analyzer

The Pharma Email Analyzer is powered by a transformer-based language model fine-tuned for regulatory entity extraction and post-classification flagging in pharmaceutical communications.

🔍 Model Selection

The starting point is a pretrained BERT-based encoder from Hugging Face, favoring checkpoints pretrained on biomedical or scientific text (e.g., BioBERT or SciBERT). A general-domain model such as distilbert-base-uncased is a lighter alternative, but it is not pretrained on domain text. The domain-specific models align well with pharma language, including acronyms, compound names, and regulatory phrasing.

Key selection criteria included:

Robust tokenization of domain-specific terms
Proven performance on NER and classification tasks
Lightweight enough for real-time inference in Flask
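
One way to make the "robust tokenization" criterion concrete is to count how many subword pieces each domain term is split into: fewer pieces usually means the vocabulary covers the term better. The sketch below implements the greedy longest-match-first splitting used by WordPiece tokenizers. The two vocabularies and the term list are toy values invented for illustration, not taken from any real checkpoint. In practice the same comparison would be run with each candidate's actual tokenizer.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first WordPiece split of a single lowercase word."""
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation pieces carry the ## prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


def fragmentation(terms, vocab):
    """Average number of subword pieces per term (lower is better)."""
    return sum(len(wordpiece(t.lower(), vocab)) for t in terms) / len(terms)


# Toy vocabularies, invented for illustration only.
GENERAL_VOCAB = {"im", "##mun", "##o", "##gen", "##ic", "##ity",
                 "bio", "##equ", "##ival", "##ence"}
DOMAIN_VOCAB = GENERAL_VOCAB | {"immuno", "##genicity", "bioequivalence"}

TERMS = ["immunogenicity", "bioequivalence"]

if __name__ == "__main__":
    print(wordpiece("immunogenicity", GENERAL_VOCAB))
    print(wordpiece("immunogenicity", DOMAIN_VOCAB))
    print(fragmentation(TERMS, GENERAL_VOCAB), fragmentation(TERMS, DOMAIN_VOCAB))
```

With these toy vocabularies the general one splits the two terms into more pieces on average than the domain one, which is the kind of gap this criterion looks for.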

🏗️ Fine-Tuning Pipeline

The model was fine-tuned on a curated dataset of anonymized regulatory emails, annotated with:

Named Entities: product names, submission types, dates, teams (e.g., CMC, RA), and dependencies
Post-Classification Flags: urgency indicators, missing approvals, timeline blockers

Training involved:

NER head using token-level BIO tagging
Flagging head using sentence-level multi-label classification
Stratified sampling to balance rare entity types and edge-case flags
Evaluation on held-out emails to ensure generalization across therapeutic areas
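
To illustrate the two label formats above, here is a minimal sketch of how one annotated email sentence could be turned into training targets: BIO tags per token for the NER head and a multi-hot vector for the flagging head. The regex tokenizer, the label names (TEAM, SUBMISSION, PRODUCT), the flag list, and the product name "Zentrax" are all assumptions made for the example. The real pipeline would also need to align tags to the model's subword tokens.

```python
import re

# Hypothetical flag set for the sentence-level multi-label head.
FLAGS = ["urgent", "missing_approval", "timeline_blocker"]


def tokenize(text):
    """Split into words and punctuation, keeping (token, start, end) offsets."""
    return [(m.group(), m.start(), m.end())
            for m in re.finditer(r"\w+|[^\w\s]", text)]


def span(text, phrase, label):
    """Build a (start, end, label) character span for the first match of phrase."""
    start = text.index(phrase)
    return (start, start + len(phrase), label)


def bio_tags(text, spans):
    """Convert character-level spans into one BIO tag per token."""
    tokens = tokenize(text)
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        inside = False
        for i, (_, t_start, t_end) in enumerate(tokens):
            if t_start >= start and t_end <= end:
                # First token of the span gets B-, the rest get I-.
                tags[i] = ("I-" if inside else "B-") + label
                inside = True
    return tokens, tags


def multi_hot(active_flags):
    """Encode a set of active flags as a 0/1 vector ordered like FLAGS."""
    return [1 if flag in active_flags else 0 for flag in FLAGS]


text = "CMC must review the Type II variation for Zentrax."
spans = [
    span(text, "CMC", "TEAM"),
    span(text, "Type II variation", "SUBMISSION"),
    span(text, "Zentrax", "PRODUCT"),
]
tokens, tags = bio_tags(text, spans)
flag_vector = multi_hot({"urgent", "timeline_blocker"})

if __name__ == "__main__":
    for (token, _, _), tag in zip(tokens, tags):
        print(f"{token:10s} {tag}")
    print(flag_vector)
```

Keeping the character offsets next to each token makes it possible to map predictions back to the original email text later on.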

⚙️ Inference and Post-Processing

At runtime, the app:

Accepts raw email text via a Flask interface
Tokenizes the text and feeds it into the fine-tuned model
Extracts structured entities with character offsets
Applies rule-based logic to flag missing dependencies, overdue dates, or ambiguous ownership
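
The last two steps can be sketched as follows: merge token-level BIO tags into entities with character offsets, then run simple rules over those entities. This is an illustration under stated assumptions, not the app's actual rule set. The tags are hard-coded as a stand-in for model output, and the label names, the ISO date format, and the three rules (a past date is overdue, a submission with no DEPENDENCY entity is missing a dependency, anything other than exactly one TEAM is ambiguous ownership) are hypothetical.

```python
import re
from datetime import date


def tokenize(text):
    """Split into words and punctuation, keeping (token, start, end) offsets."""
    return [(m.group(), m.start(), m.end())
            for m in re.finditer(r"\w+|[^\w\s]", text)]


def decode_entities(tokens, tags, text):
    """Merge B-/I- tagged tokens into entities with character offsets."""
    entities, current = [], None
    for (_, start, end), tag in zip(tokens, tags):
        label = tag[2:]
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-")
            and (current is None or current["label"] != label)
        )
        if starts_new:
            if current:
                entities.append(current)
            current = {"label": label, "start": start, "end": end}
        elif tag.startswith("I-"):
            current["end"] = end  # extend the open entity
        else:  # "O" closes any open entity
            if current:
                entities.append(current)
            current = None
    if current:
        entities.append(current)
    for entity in entities:
        entity["text"] = text[entity["start"]:entity["end"]]
    return entities


def rule_flags(entities, today):
    """Apply hypothetical post-processing rules to extracted entities."""
    flags = []
    labels = [e["label"] for e in entities]
    for e in entities:
        if e["label"] == "DATE":
            try:
                due = date.fromisoformat(e["text"])  # assumes YYYY-MM-DD
            except ValueError:
                continue  # unparseable dates are skipped, not flagged
            if due < today:
                flags.append(f"overdue_date:{e['text']}")
    if "SUBMISSION" in labels and "DEPENDENCY" not in labels:
        flags.append("missing_dependency")
    if labels.count("TEAM") != 1:
        flags.append("ambiguous_ownership")
    return flags


text = "RA and CMC owe the Type II variation by 2024-03-15."
tokens = tokenize(text)
# Stand-in for model predictions, one tag per token.
tags = ["B-TEAM", "O", "B-TEAM", "O", "O",
        "B-SUBMISSION", "I-SUBMISSION", "I-SUBMISSION", "O",
        "B-DATE", "I-DATE", "I-DATE", "I-DATE", "I-DATE", "O"]
entities = decode_entities(tokens, tags, text)
flags = rule_flags(entities, today=date(2024, 4, 1))

if __name__ == "__main__":
    for e in entities:
        print(e)
    print(flags)
```

Passing `today` in explicitly rather than calling `date.today()` inside the rule keeps the overdue check deterministic and easy to test.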

The output is rendered in a user-friendly format, with session-based history and example prompts to guide exploration.