Patient-Driven Genomic Analysis: A Replicable Framework

What This Project Demonstrated

Bill Paseman's case is a proof of concept for patient-led precision medicine research: a model in which the patient, not the institution, drives deep multi-omic investigation of their own condition. Over the course of this project (spanning a 2018 hackathon and a 2026 AI analysis), the following was accomplished without institutional support:

Identified the primary tumor driver (PDGFRA triple-positive) that standard clinical workup never captured
Refuted the standard p1RCC treatment assumption (MET-driven) using the patient's own data
Ruled out 8+ hereditary cancer syndromes definitively
Established that a first-line targeted therapy (belzutifan) is contraindicated, based on a somatic deletion that never appeared in the clinical record
Generated a treatment prioritization plan ready for deployment at recurrence
Produced a machine-readable patient profile designed to be shared with any AI or clinician

Foundation: Know What You're Working With

Collect your raw data before anything else. Without raw files, analysis is impossible.

Obtain Your Raw Genomic Data

The single most important step. You have a legal right to request your genomic data files under HIPAA in the US and GDPR in Europe. Don't accept only a PDF report — request the underlying VCF, BAM/CRAM, or counts files.

Data Type	What to Ask For	Who Has It
Germline WGS / WES	VCF files (.vcf or .vcf.gz), BAM/CRAM alignment files	Your genetics lab, Nebula Genomics, Dante Labs, Genome Medical, research biobanks
Tumor sequencing	Somatic VCF, RNA-seq counts, CNV files	Your hospital's pathology or molecular tumor board
Clinical genomic panel	PDF report + raw VCF	FoundationOne CDx, Tempus xT, Caris MI Profile, Guardant
Microarray / SNP chip	Raw CEL files or VCF	23andMe, AncestryDNA (limited utility for rare disease)

Key Insight from Bill's Case

The 2018 hackathon used BGI whole blood sequencing (germline) and tumor RNA-seq + somatic variant calling from archived nephrectomy tissue. Archived surgical tissue is often stored in formalin-fixed paraffin-embedded (FFPE) blocks for 10+ years. You can request that these be re-sequenced — most patients don't know this option exists.

Assemble Your Complete Clinical Records

Collect all pathology reports, operative notes, radiology reports, lab results, and genetics consult notes. Most patients dramatically underestimate how much data they have scattered across institutions.

Use patient portals (for example MyChart, the portal built on the Epic records system) to download structured data
Submit written records requests to each institution you've been seen at
Organize chronologically — date of diagnosis is your anchor point
Flag discrepancies between institutions (imaging reports often disagree on measurements)

Build Your Patient Profile

Create a single structured document that captures your complete medical story.

Create a Structured Patient Profile Document

The patient profile is a structured document (plain text, JSON, or Word) that puts your entire medical story in one place. It serves two purposes: (1) it gives AI systems the context they need to reason correctly about your case, and (2) it becomes your permanent research record, independent of any EHR.

What to include:

Demographics and your role as a patient-researcher
Primary diagnoses with staging, dates, institutions, and current status
Comorbidities with ICD codes and current management
Treatment history with outcomes
Germline genetics findings (variants, zygosity, population frequency, clinical significance)
Somatic tumor findings (if available)
Labs and imaging summaries
Open clinical questions you want answered

The file bill_paseman_patient_profile2.json from this project is a working template you can adapt.
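If you are starting from scratch, the checklist above can be turned into a skeleton that you fill in section by section. This is only a minimal sketch: the section names below are hypothetical and are not taken from bill_paseman_patient_profile2.json, so rename them to match whatever template you adopt.

```python
import json

# Hypothetical section names mirroring the checklist above;
# they are illustrative, not the schema of the project's JSON file.
REQUIRED_SECTIONS = [
    "demographics",
    "primary_diagnoses",
    "comorbidities",
    "treatment_history",
    "germline_findings",
    "somatic_findings",
    "labs_and_imaging",
    "open_questions",
]


def new_profile():
    """Return an empty profile with every required section present."""
    profile = {section: [] for section in REQUIRED_SECTIONS}
    profile["demographics"] = {"role": "patient-researcher"}
    return profile


def missing_sections(profile):
    """List required sections that are absent or still empty."""
    return [s for s in REQUIRED_SECTIONS if not profile.get(s)]


profile = new_profile()
profile["open_questions"].append("[a clinical question you want answered]")

print(json.dumps(profile, indent=2))
print("Still to fill in:", missing_sections(profile))
```

Running the check each time you add records gives you a simple to-do list of the sections that are still empty.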

Add an LLM Orientation Prompt at the Very Top

This is the most important design lesson from this project. When LLMs receive large medical documents without a grounding frame, they hallucinate — misidentifying conditions, confusing comorbidities with primary diagnoses, and suggesting treatments already ruled out by the data.

Your orientation prompt should explicitly state:

Who you are and what your actual diagnoses are (with current status — NED, active, surveillance)
What the document IS and IS NOT
What has been definitively ruled out (syndromes, treatments)
How to navigate the document for different types of questions

Without This Step

In testing, LLMs given the patient profile without an orientation prompt consistently hallucinated additional cancers, described the patient as having active metastatic disease, and suggested treatments (like belzutifan) that the genomic data explicitly contraindicated. The orientation prompt reduced these errors dramatically.
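One way to keep the orientation prompt consistent with the rest of the profile is to generate it from the same structured fields. The sketch below is an assumption about how that could look, not the prompt used in this project; the function name, its parameters, and the bracketed placeholder values are all hypothetical.

```python
def build_orientation_prompt(who, diagnoses, document_is, document_is_not,
                             ruled_out, navigation):
    """Assemble a plain-text orientation block for the top of a profile.

    diagnoses:  list of (name, status) pairs, e.g. status "NED"
    ruled_out:  list of syndromes or treatments the data excludes
    navigation: dict mapping a question type to the section to read
    """
    lines = [
        "READ THIS FIRST (orientation for AI systems)",
        f"Patient: {who}",
        "Confirmed diagnoses and current status:",
    ]
    lines += [f"  - {name} (status: {status})" for name, status in diagnoses]
    lines.append(f"This document IS: {document_is}")
    lines.append(f"This document IS NOT: {document_is_not}")
    lines.append("Definitively ruled out (do not suggest or assume):")
    lines += [f"  - {item}" for item in ruled_out]
    lines.append("How to navigate:")
    lines += [f"  - For {topic}, read: {section}"
              for topic, section in navigation.items()]
    return "\n".join(lines)


# Placeholder values only; replace with your own record.
prompt = build_orientation_prompt(
    who="[name], patient-researcher",
    diagnoses=[("[primary diagnosis]", "NED")],
    document_is="a patient-maintained research record",
    document_is_not="a record of active metastatic disease",
    ruled_out=["[hereditary syndrome]", "[contraindicated treatment]"],
    navigation={"treatment questions": "somatic_findings"},
)
print(prompt)
```

Placing the generated text as the first field of the profile means every AI system reads the grounding frame before any raw findings.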

Annotate Every Finding with Clinical Significance

Don't just list data — annotate each finding with what it means. For every variant, lab result, or genomic finding, include:

A clinical_significance field in plain English (HIGH / MEDIUM / LOW)
Evidence quality label (confirmed, plausible, ruled_out, unknown)
Cross-references to related findings in other sections
What it means for treatment decisions

This transforms raw data into a reasoning document that any clinician or AI can act on.
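The four annotation fields can be enforced with a small record type so that no finding enters the profile without them. This is a sketch under assumptions: the class name and every field name except clinical_significance are hypothetical, and the allowed values are simply the labels listed above.

```python
from dataclasses import dataclass, field, asdict

SIGNIFICANCE_LEVELS = {"HIGH", "MEDIUM", "LOW"}
EVIDENCE_LABELS = {"confirmed", "plausible", "ruled_out", "unknown"}


@dataclass
class AnnotatedFinding:
    # Field names other than clinical_significance are illustrative.
    finding: str
    clinical_significance: str      # HIGH / MEDIUM / LOW
    significance_note: str          # plain-English explanation
    evidence_quality: str           # confirmed / plausible / ruled_out / unknown
    treatment_implication: str
    cross_references: list = field(default_factory=list)

    def __post_init__(self):
        # Reject labels outside the agreed vocabulary.
        if self.clinical_significance not in SIGNIFICANCE_LEVELS:
            raise ValueError(f"bad significance: {self.clinical_significance}")
        if self.evidence_quality not in EVIDENCE_LABELS:
            raise ValueError(f"bad evidence label: {self.evidence_quality}")


example = AnnotatedFinding(
    finding="[variant, lab result, or genomic finding]",
    clinical_significance="HIGH",
    significance_note="[what it means in plain English]",
    evidence_quality="plausible",
    treatment_implication="[what it changes about treatment decisions]",
    cross_references=["[related section]"],
)
print(asdict(example))
```

Because asdict returns a plain dictionary, each annotated finding can be written straight into a JSON profile.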

Germline Analysis

Determine whether your condition is hereditary and identify constitutional risk factors.

Identify Disease-Relevant Germline Genes

For your specific condition, research which hereditary syndromes are associated with it and which genes to check. Resources:

ClinVar (clinvar.ncbi.nlm.nih.gov) — searchable database of variants and clinical significance
OMIM (omim.org) — genetic disease reference encyclopedia
NCCN Guidelines — gene panels recommended for specific cancer types (free registration)
Your condition's patient advocacy organization — many maintain curated gene lists

Example: Bill's Gene List

p1RCC → check MET, FH, BAP1; Meningioma → check NF2, SMARCE1, SMARCB1, LZTR1, SUFU, PTCH1; Both together + family history concern → check Lynch syndrome MMR genes (MLH1, MSH2, MSH6, PMS2).

Query Your Germline VCF for Target Genes

If you have a VCF file, you or a bioinformatician can filter it for your target genes. In this project we used command-line tools (zcat + grep) and Python scripts on annotated CSV files from BGI.

For each variant found, assess:

Population frequency — variants >1% frequency are almost never clinically significant
SIFT and PolyPhen2 scores — functional impact predictors
ClinVar status — is it classified as Pathogenic, Likely Pathogenic, VUS, or Benign?
Variant type — truncating (frameshift, stop-gain, splice) variants are higher concern than missense
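The assessment criteria above can be applied mechanically as a first pass before a human looks at anything. The sketch below assumes each variant has already been parsed into a dictionary (for example, one row of an annotated CSV); the key names, the consequence labels, and the example values are hypothetical and made up for illustration. The output is a review bucket, not a clinical classification.

```python
# Consequence labels treated as truncating. These names are assumptions;
# match them to whatever your annotation tool actually emits.
TRUNCATING = {"frameshift", "stop_gained", "splice_site"}


def triage_variant(variant, target_genes, common_freq=0.01):
    """Sort one annotated variant into a review bucket (not a diagnosis)."""
    if variant["gene"] not in target_genes:
        return "off_target"
    clinvar = variant.get("clinvar", "").lower()
    if clinvar in {"pathogenic", "likely pathogenic"}:
        return "review_with_genetic_counselor"
    if clinvar in {"benign", "likely benign"}:
        return "likely_benign"
    # Variants above ~1% population frequency are rarely significant.
    if variant["pop_freq"] > common_freq:
        return "likely_benign"
    # Rare truncating variants in a target gene deserve expert review.
    if variant["consequence"] in TRUNCATING:
        return "review_with_genetic_counselor"
    return "uncertain"


target_genes = {"MET", "FH", "BAP1", "NF2"}

# Made-up rows for illustration only.
rows = [
    {"gene": "MET", "pop_freq": 0.12, "consequence": "missense"},
    {"gene": "NF2", "pop_freq": 0.0, "consequence": "frameshift"},
    {"gene": "FH", "pop_freq": 0.0001, "consequence": "missense"},
]
for row in rows:
    print(row["gene"], triage_variant(row, target_genes))
```

SIFT and PolyPhen2 scores are left out of this sketch on purpose; they are better read alongside the variant by a reviewer than reduced to a single cutoff.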

Interpret Germline Findings with Care

The single most important judgment call in germline analysis is distinguishing a pathogenic variant from a benign polymorphism. Common polymorphisms (population frequency above roughly 1–5%) with benign functional predictions are almost never clinically significant, even in disease-associated genes.

When in Doubt

Request a genetic counselor review of any variant you're uncertain about before drawing clinical conclusions. A variant in BRCA1 is not the same as a pathogenic BRCA1 variant. This distinction matters enormously for family cascade testing decisions.

Somatic Tumor Analysis

Requires tumor sequencing data — RNA-seq, WES/WGS, or a clinical genomic panel.

Identify Differentially Expressed Genes (RNA-seq)

If you have RNA-seq from tumor vs. matched normal tissue, calculate fold-changes for each gene. Prioritize:

Genes strongly UP in tumor — potential drivers or therapeutic targets
Genes strongly DOWN in tumor — tumor suppressors lost
Genes completely absent in normal but present in tumor — potential fusions or aberrant activation

Tools: DESeq2, edgeR (bioinformatics software). Or ask an LLM to analyze a processed counts table if you can provide one.
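For intuition, the fold-change calculation itself is simple. The sketch below assumes the counts are already normalized for sequencing depth and uses a pseudocount of 1 to avoid dividing by zero; both are simplifying assumptions, and the gene names and numbers are invented. It does not replace DESeq2 or edgeR, which also model variance and report significance.

```python
import math


def log2_fold_change(tumor, normal, pseudocount=1.0):
    """log2(tumor / normal), with a pseudocount so zero counts are safe."""
    return math.log2((tumor + pseudocount) / (normal + pseudocount))


def rank_genes(counts, min_abs_log2fc=2.0):
    """Return (gene, log2fc, direction) for strongly changed genes,
    sorted so the largest absolute change comes first."""
    rows = []
    for gene, (tumor, normal) in counts.items():
        lfc = log2_fold_change(tumor, normal)
        if abs(lfc) >= min_abs_log2fc:
            rows.append((gene, lfc, "UP" if lfc > 0 else "DOWN"))
    return sorted(rows, key=lambda r: abs(r[1]), reverse=True)


# Invented, depth-normalized counts: gene -> (tumor, normal).
counts = {
    "GENE_A": (1023, 15),   # strongly up in tumor
    "GENE_B": (3, 511),     # strongly down in tumor
    "GENE_C": (100, 100),   # unchanged, filtered out
}
for gene, lfc, direction in rank_genes(counts):
    print(f"{gene}\t{lfc:+.1f}\t{direction}")
```

A log2 fold-change of +1 means a doubling and -1 a halving, so the default cutoff of 2.0 keeps only genes that changed at least fourfold in either direction.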

Cross-Reference RNA Findings with DNA

The most powerful signals are convergent — the same gene showing abnormality at multiple levels simultaneously. This is how you separate noise from signal:

Pattern	Interpretation	Confidence
RNA overexpressed + DNA amplified + somatic mutations	Triple-positive driver — highest actionability	HIGHEST
RNA silenced + DNA deleted	Confirmed loss-of-function	HIGH
RNA silenced + DNA amplified	Paradox — investigate epigenetic silencing (promoter methylation)	REQUIRES INVESTIGATION
RNA signal only, DNA normal	Possible fusion or expression dysregulation — confirm with additional testing	MEDIUM
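The table can be encoded as a lookup so that every gene in your shortlist is labeled the same way. This is a minimal sketch; the function name and the status strings ("up", "down", "amplified", and so on) are hypothetical, and any combination the table does not cover is returned as unclassified rather than guessed.

```python
def classify_convergence(rna, dna, somatic_mutations=False):
    """Map per-gene RNA and DNA status onto the patterns in the table.

    rna: "up", "down", or "normal"
    dna: "amplified", "deleted", or "normal"
    Returns (interpretation, confidence).
    """
    if rna == "up" and dna == "amplified" and somatic_mutations:
        return ("triple-positive driver", "HIGHEST")
    if rna == "down" and dna == "deleted":
        return ("confirmed loss-of-function", "HIGH")
    if rna == "down" and dna == "amplified":
        return ("paradox: investigate epigenetic silencing",
                "REQUIRES INVESTIGATION")
    if rna != "normal" and dna == "normal":
        return ("possible fusion or expression dysregulation", "MEDIUM")
    return ("pattern not covered by the table", "UNCLASSIFIED")


# The MET pattern described in this project: amplified DNA, silenced RNA.
print(classify_convergence(rna="down", dna="amplified"))
```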

Perform Somatic Copy Number Variant (CNV) Analysis

Compare tumor copy number against a germline baseline (ideally matched blood or normal tissue). Always confirm that a CNV is somatic (absent from the germline sample) before calling it a tumor finding.

Copy ratio >1.5x = amplification (potential driver or target)
Copy ratio <0.8x = deletion (potential tumor suppressor loss)
Copy ratio ~1.0 = normal diploid
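The thresholds and the somatic check can be combined in a few lines. This sketch assumes you already have a copy ratio per gene or segment for both the tumor and the germline sample; the function names are hypothetical, and the cutoffs are the ones listed above.

```python
def call_copy_state(ratio, amp_cutoff=1.5, del_cutoff=0.8):
    """Apply the copy-ratio thresholds to a single sample."""
    if ratio > amp_cutoff:
        return "amplification"
    if ratio < del_cutoff:
        return "deletion"
    return "normal"


def classify_cnv(tumor_ratio, germline_ratio=1.0):
    """Label a tumor CNV as somatic only if the germline lacks it."""
    tumor_call = call_copy_state(tumor_ratio)
    if tumor_call == "normal":
        return "normal"
    if call_copy_state(germline_ratio) == tumor_call:
        # Present in blood/normal tissue too: constitutional, not a tumor event.
        return f"germline {tumor_call}"
    return f"somatic {tumor_call}"


print(classify_cnv(2.1, 1.0))   # tumor-only gain
print(classify_cnv(0.5, 0.5))   # same loss seen in germline
print(classify_cnv(1.05, 1.0))  # within normal range
```

Ratios between 0.8 and 1.5 are treated as normal here; values close to either cutoff are worth a second look rather than a hard call.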

Run Mutational Signature Analysis (SBS)

Mutational signatures reveal the mechanism that created the tumor's mutations — aging, APOBEC activity, MMR deficiency, tobacco exposure, UV, etc. This has direct clinical implications:

Signature	Mechanism	Clinical Implication
SBS1 + SBS5	Normal aging / clock-like	Expected in all cancers; level should match patient age
SBS2 + SBS13 (APOBEC)	Cytidine deaminase activity	Often associated with CDKN2A loss; cellular stress response
SBS6/15/21/26	MMR deficiency	Lynch syndrome candidate; checkpoint inhibitors (pembrolizumab) may work well
SBS4	Tobacco / carcinogen exposure	Common in lung cancer
SBS7	UV radiation	Melanoma signature

Tools: SigProfilerExtractor (Python), COSMIC Mutational Signatures database (cancer.sanger.ac.uk/signatures/).
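Signature tools work from a catalog of single-base substitutions grouped by their flanking bases (the trinucleotide context), reported with the pyrimidine (C or T) as the reference base. The sketch below shows only that first bookkeeping step; it is not how SigProfilerExtractor is implemented, the function names are hypothetical, and the example mutations are invented.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def sbs96_channel(five_prime, ref, alt, three_prime):
    """Return the trinucleotide channel label, e.g. 'A[C>T]G'.

    If the reference base is a purine (A or G), the mutation is
    re-expressed on the opposite strand so the reference is C or T.
    """
    if ref == alt:
        raise ValueError("ref and alt must differ")
    if ref in ("A", "G"):
        ref, alt = COMPLEMENT[ref], COMPLEMENT[alt]
        # Reverse-complementing swaps and complements the flanking bases.
        five_prime, three_prime = COMPLEMENT[three_prime], COMPLEMENT[five_prime]
    return f"{five_prime}[{ref}>{alt}]{three_prime}"


def build_catalog(snvs):
    """Count mutations per channel from (5' base, ref, alt, 3' base) tuples."""
    return Counter(sbs96_channel(*snv) for snv in snvs)


# Invented somatic SNVs with their flanking bases.
snvs = [("T", "C", "T", "G"), ("C", "G", "A", "A"), ("A", "T", "G", "C")]
for channel, n in sorted(build_catalog(snvs).items()):
    print(channel, n)
```

The first two example mutations land in the same channel because the second is the first one seen from the opposite strand. Fitting the resulting counts to COSMIC signatures is the part that needs a dedicated tool.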

Calculate Microsatellite Instability (MSI)

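Count somatic indels per megabase from your somatic indel VCF as a research-grade proxy for MSI status (clinical MSI calls come from a validated PCR, IHC, or sequencing assay). The result indicates whether checkpoint inhibitor monotherapy is worth raising with your oncologist: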

Category	Indels/Mb	Clinical Implication
MSS (Microsatellite Stable)	<2	Checkpoint monotherapy unlikely to work; combination approaches needed
MSI-L (Low)	2–10	Intermediate; clinical significance varies by tumor type
MSI-H (High)	>10	Lynch syndrome candidate; pembrolizumab/nivolumab often highly effective
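The count and the category lookup can be sketched as follows. The code assumes a plain-text VCF with standard tab-separated columns (REF in column 4, ALT in column 5) and does not handle symbolic alleles; callable_mb, the size of the region that was actually sequenced in megabases, is a value you have to supply from your sequencing report. The example records are invented.

```python
def count_indels(vcf_lines):
    """Count records where any ALT allele differs in length from REF."""
    n = 0
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        ref, alts = fields[3], fields[4].split(",")
        if any(len(alt) != len(ref) for alt in alts):
            n += 1
    return n


def msi_category(indel_count, callable_mb):
    """Return (indels per Mb, category) using the table's cutoffs."""
    if callable_mb <= 0:
        raise ValueError("callable_mb must be positive")
    rate = indel_count / callable_mb
    if rate < 2:
        return rate, "MSS"
    if rate <= 10:
        return rate, "MSI-L"
    return rate, "MSI-H"


# Invented records for illustration.
vcf_lines = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t100\t.\tA\tAT\t.\tPASS\t.",    # insertion
    "1\t200\t.\tCTG\tC\t.\tPASS\t.",   # deletion
    "1\t300\t.\tG\tA\t.\tPASS\t.",     # SNV, not counted
]
print(count_indels(vcf_lines))
print(msi_category(indel_count=30, callable_mb=30.0))
```

For a gzip-compressed file, the same function accepts the lines from gzip.open(path, "rt").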

AI-Assisted Interpretation

Use multiple AI systems as a virtual tumor board. Diversity of AI opinion catches errors.

Use Multiple AI Systems — The Surowiecki Principle

No single AI model has the full picture. Send your patient profile to multiple LLMs and compare their analyses. Diversity of AI opinion catches errors and reveals where the evidence is genuinely uncertain vs. where there is consensus.

The Wisdom of Crowds (Surowiecki) requires four conditions: diversity of sources, independence of models, decentralization (no single authority), and aggregation of outputs. Apply all four to your AI tumor board.

AI systems to consider: Claude, GPT-4o, Gemini, Perplexity (for real-time literature search).
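The aggregation step can be as simple as tallying where the models agree. This sketch assumes you have already asked each model the same question in a separate session (to keep the answers independent) and reduced each reply to a short, comparable answer by hand; the function name and the model labels are hypothetical.

```python
from collections import Counter


def summarize_board(answers):
    """Summarize agreement across models.

    answers: dict mapping a model label to its normalized answer.
    A consensus is reported only when a strict majority agrees.
    """
    if not answers:
        raise ValueError("no answers to aggregate")
    counts = Counter(answers.values())
    top_answer, votes = counts.most_common(1)[0]
    has_majority = votes > len(answers) / 2
    return {
        "consensus": top_answer if has_majority else None,
        "agreement": votes / len(answers),
        "dissenting": sorted(m for m, a in answers.items() if a != top_answer),
    }


# Placeholder labels and answers.
answers = {
    "model_a": "option X",
    "model_b": "option X",
    "model_c": "option Y",
}
print(summarize_board(answers))
```

The dissenting list is often the most useful output: a lone disagreement is a prompt to ask that model for its evidence, not a vote to be discarded.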

Without the Orientation Prompt

The LLM orientation prompt at the top of your patient profile (see "Add an LLM Orientation Prompt at the Very Top") is what separates useful AI responses from hallucinated ones. Without it, most LLMs will misidentify your conditions, confuse comorbidities with primary diagnoses, or suggest treatments your own data has already ruled out.

Conduct Structured AI Tumor Board Sessions

Frame your questions specifically rather than asking "what do I have?" or "what should I do?":

"Given the somatic findings in the genetics section, what would be the first-line treatment at recurrence?"
"Are there clinical trials for [condition] that match my molecular profile?"
"What does the co-occurrence of [condition A] and [condition B] suggest about underlying mechanisms?"
"What are the limitations of this analysis and what data would change the conclusions?"

Always ask the AI to label each claim with an evidence grade and include citations (PMID or DOI format).

Challenge the AI's Conclusions

Ask: "What alternative interpretations exist?" and "What would need to be true for your conclusion to be wrong?"

Example from This Project

The MET paradox (amplified DNA, silenced RNA) was only discovered by pushing past the standard p1RCC assumption that MET is the primary driver. The AI initially accepted the textbook model until challenged with the actual RNA expression data, which showed 307-fold silencing. Questioning default assumptions is often where the most actionable insights are found.

Closing the Loop with the Medical System

Research-grade data must be validated before driving clinical decisions.

Get CLIA-Certified Validation of Key Findings

Research-grade data (hackathons, direct-to-consumer sequencing) cannot directly drive clinical decisions. To get your findings into the medical record and usable by treating physicians, the most practical path is to order a CLIA-certified clinical genomic panel on archived tumor tissue or circulating tumor DNA.

Panel	Type	Typical Cost	Notes
FoundationOne CDx	Tissue (FoundationOne Liquid CDx is the separate blood-based test)	~$5,800 (often covered by insurance)	Comprehensive solid tumor panel; FDA-approved companion diagnostic
Tempus xT	Tissue + RNA	~$3,000–$5,000	Includes RNA fusion analysis; strong in rare cancers
Caris MI Profile	Tissue	~$3,000–$5,000	IHC + sequencing + expression; broad coverage
Guardant360	Liquid biopsy (blood)	~$2,000–$3,000	No tissue needed; may miss some findings present only in archived tissue

Critical Gap Identified in Bill's Case

All genomic analysis lived in a patient-maintained shadow record, invisible to any treating oncologist. A recurrence scenario without CLIA validation means a new oncologist would make treatment decisions knowing only basic pathology — not knowing PDGFRA is the top driver, MET is not, or belzutifan is contraindicated. This is the most dangerous gap in patient-led research.

Share Your Patient Profile with Your Care Team

Bring a printed or PDF version of your patient profile to appointments. Frame it as: "I've organized my medical history here — it may help you understand my full picture."

Physicians respond far better to organized, cited patient research than to unstructured verbal summaries. The HTML report format (bill_paseman_genomic_report.html) generated in this project is designed to be shareable with care teams and readable without technical expertise.

Connect with Patient Advocacy and Research Communities

Your condition's patient advocacy organization (often has research grants, biobank programs, and clinical trial connections)
Research hackathons — the model used in Bill's case: patient-organized computational analysis of personal genomic data with bioinformatics volunteers
Academic medical centers with rare disease or precision oncology programs
ClinicalTrials.gov — search your condition + molecular target (e.g., "papillary RCC PDGFRA")
MatchMiner, TrialSpark, Massive Bio — AI-assisted clinical trial matching services

Tools & Resources Reference

Tool / Resource	Purpose	Cost	Skill Level
BGI / Nebula / Dante Labs	Whole genome or exome sequencing	$300–$2,000	None (send sample, receive files)
Strelka2	Somatic variant calling (tumor vs. normal)	Free (open source)	Bioinformatics
SnpEff / VEP	Variant functional annotation	Free (open source)	Bioinformatics
DESeq2 / edgeR	RNA-seq differential expression	Free (R packages)	Bioinformatics / R
SigProfilerExtractor	COSMIC mutational signature extraction	Free (Python)	Python basics
Ensembl REST API	Variant annotation, trinucleotide context	Free	Python/API basics
ClinVar (clinvar.ncbi.nlm.nih.gov)	Variant clinical significance database	Free	None
OMIM (omim.org)	Genetic disease encyclopedia	Free	None
COSMIC (cancer.sanger.ac.uk)	Cancer somatic mutations + signatures database	Free	None
ClinicalTrials.gov	Clinical trial search by condition + target	Free	None
Claude / GPT-4o / Gemini	AI tumor board, interpretation, report generation	$20–$30/month	None (conversational)
FoundationOne CDx / Tempus xT	CLIA-certified clinical validation	$3,000–$6,000 (often insured)	None (physician orders)

If You Have No Bioinformatics Skills

You can still accomplish most of this framework by (1) obtaining your raw data files, (2) building your patient profile document in plain text or Word, and (3) uploading relevant files to an AI system and asking it to perform the analysis. Claude Code and similar AI tools can run Python analysis scripts on your data files directly. For complex analysis, contact university bioinformatics departments; many offer pro-bono or low-cost assistance for rare disease patients through programs like Rare Genomics Institute or local academic hackathon programs.

The Core Principle

"The patient, not the institution, is the only entity with continuous access to the full picture — across all providers, all time points, and all data types. Institutions see fragments. The patient sees the whole."

— AdvocateOS Framework, Bill Paseman

The most important insight from this project is not technical — it is organizational. The role of tools like the patient profile JSON, LLM tumor boards, and this analytical framework is to give the patient's comprehensive view the structure it needs to be medically actionable.

A patient who has organized their own genomic data, built a structured profile, run multi-vendor AI analysis, and obtained CLIA validation arrives at a clinical encounter not as a passive recipient of care — but as an informed collaborator with data their oncologist may not have access to anywhere else.

What This Framework Enables

Catch treatment targets the clinical system missed (PDGFRA in Bill's case)
Avoid treatments the data shows are contraindicated (belzutifan in Bill's case)
Rule out hereditary syndromes that don't apply — and stop pursuing them
Enter clinical encounters with a prioritized, evidence-graded treatment plan ready for discussion
Create a shadow record that survives institution changes, provider transitions, and gaps in the EHR