AI-Enabled Optimization of Early-Phase Clinical Trials
A structured response demonstrating how Aurelyn Trial | OS™ and the Aurelyn Clinical Engines™ directly address the FDA's Request for Information — across pilot design, evaluation metrics, and the trustworthy-AI principles of the NIST AI Risk Management Framework.
The comment period was extended 30 days from the original deadline; submissions are accepted through June 29, 2026 at regulations.gov under Docket No. FDA-2026-N-4390. This document is formatted to mirror the RFI's question taxonomy (Categories A & B) for direct, citable response.
Why Aurelyn answers this RFI
The FDA identifies eight ways AI may improve early-phase trials and asks how to structure a pilot and measure its success. Aurelyn Trial | OS™ was architected for exactly this problem space: a governed, regulator-ready operating system that turns model-informed decision support into auditable, ALCOA+ evidence.
Built for the eight use cases
Recruitment, dose escalation, safety monitoring, adaptive design, Phase 1→2 decisions, biomarker stratification, and endpoint validation are each handled by a named Clinical Engine — not bolt-ons, but the core architecture.
Trustworthy by construction
Every engine runs inside a Governance & Assurance layer mapped to all seven NIST AI RMF characteristics and the FDA's risk-based credibility-assessment framework — context-of-use, model risk, and lifecycle monitoring as first-class objects.
Measurable from day one
The platform emits the exact telemetry the RFI's Category B requests — cycle times, decision concordance, signal-detection latency, drift, and subgroup fairness — pre-instrumented so the pilot can be evaluated rigorously, not retrospectively reconstructed.
The Aurelyn Clinical Engines™
Aurelyn Trial | OS™ is the orchestration layer; the Clinical Engines™ are modular, independently validated capabilities. Each maps to one or more of the FDA's stated AI opportunities. A cross-cutting Governance & Assurance layer wraps every engine.
Cohort Intelligence Engine™
Site & patient feasibility modeling, eligibility-criteria optimization, and biomarker-based enrichment for small, hard-to-recruit early-phase populations.
Adaptive Dose Engine™
Model-informed dose finding (Bayesian logistic regression, mTPI, BOIN) and simulation of seamless and adaptive designs, aligned with FDA Project Optimus.
Safety Sentinel Engine™
Continuous AE/SAE/SUSAR signal detection, near-real-time pharmacovigilance triage, and automated narrative drafting with human adjudication.
Clinical Evidence Engine™
Go/no-go decision support, predictive Phase-2 success modeling, and endpoint/biomarker qualification analytics with calibrated uncertainty.
eTMF Intelligence Engine™
CDISC TMF Reference Model classification, ALCOA+ completeness scoring, and continuous inspection-readiness across the trial master file.
Governance & Assurance
Context-of-use registration, model-risk tiering, drift monitoring, immutable audit trails, model & system cards, and role-based human oversight.
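To make the interval-based designs named under the Adaptive Dose Engine concrete, the sketch below computes the escalation and de-escalation boundaries of the published BOIN design and applies them to a single dose level. It illustrates the general method only and is not Aurelyn's implementation: the function names, the 30% target toxicity rate, and the 0.6/1.4 multipliers (commonly cited defaults) are assumptions for the example, and the design's separate overdose-elimination rule is omitted.

```python
import math

def boin_boundaries(target, low_mult=0.6, high_mult=1.4):
    """Return (lambda_e, lambda_d) for a BOIN design.

    target is the target dose-limiting-toxicity (DLT) rate. low_mult and
    high_mult define the rates treated as clearly too low and clearly too
    toxic; 0.6 and 1.4 are assumed defaults, not values from the source.
    """
    phi, phi1, phi2 = target, low_mult * target, high_mult * target
    lambda_e = math.log((1 - phi1) / (1 - phi)) / math.log(
        phi * (1 - phi1) / (phi1 * (1 - phi)))
    lambda_d = math.log((1 - phi) / (1 - phi2)) / math.log(
        phi2 * (1 - phi) / (phi * (1 - phi2)))
    return lambda_e, lambda_d

def boin_decision(n_dlt, n_treated, target=0.30):
    """Compare the observed DLT rate at the current dose with the boundaries."""
    if n_treated <= 0:
        raise ValueError("n_treated must be positive")
    lambda_e, lambda_d = boin_boundaries(target)
    observed = n_dlt / n_treated
    if observed <= lambda_e:
        return "escalate"
    if observed >= lambda_d:
        return "de-escalate"
    return "stay"

if __name__ == "__main__":
    print(boin_boundaries(0.30))
    print(boin_decision(n_dlt=1, n_treated=6))
```

In a governed setting the recommendation would be advisory, with the investigator or safety committee recording the final dose decision.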
Answering the RFI, question by question
Below, every sub-question from RFI Categories A and B is reproduced and answered with the specific Aurelyn capability that addresses it, followed by two crosswalks.
Scope & Focus
a. Which trial types or issues benefit most from AI?
Aurelyn recommends anchoring the pilot in the highest-uncertainty, smallest-N contexts where model-informed methods deliver the greatest marginal value: first-in-human oncology dose escalation and rare-disease trials, with adaptive Phase 1b/2a designs as a secondary focus. These settings have the clearest decision points (dose, expansion, go/no-go) and the strongest existing precedent for quantitative methods.
b. Target specific therapeutic areas, or remain broadly applicable?
A tiered approach: anchor in oncology and rare disease for interpretable early signal, but require platform-agnostic architecture so methods and governance generalize. This protects learning velocity without over-fitting the pilot to one indication.
c. Should priority go to specific AI use cases?
Yes — prioritize the three with the clearest measurable endpoints and regulatory touchpoints: (1) dose optimization (aligned with Project Optimus), (2) safety-signal detection, and (3) recruitment & biomarker stratification. These produce the cleanest evidence for the Category B metrics.
Participant Selection
a. What criteria should FDA use to select sponsors, trials, or technologies?
Select on: a well-defined context-of-use, an assigned model-risk tier, demonstrated data readiness/ALCOA+ maturity, a pre-specified credibility-assessment plan, and evidence of a managed AI lifecycle (Good Machine Learning Practice). This mirrors the FDA's own draft-guidance framework and keeps selection objective.
b. How can the pilot ensure representation across size, capability, and therapeutic area?
Use stratified selection quotas across sponsor size (small/emerging biotech through large pharma), AI maturity, and therapeutic area. The chief barrier for smaller sponsors is infrastructure — so a low-infrastructure delivery model is essential to genuine representation.
Collaboration Models
a. Which partnerships are most effective?
A four-party sponsor–technology vendor–academic–FDA consortium, with an independent technology/assurance layer that no single sponsor owns. This separates the party that builds the model from the party that governs and validates it.
b. How can FDA facilitate pre-competitive collaboration and knowledge sharing?
Stand up a shared validation harness and benchmark datasets in secure enclaves, with federated evaluation so participants contribute to common metrics without exposing proprietary data or models.
c. What role should patient groups and investigators play in AI governance?
Embed both directly in the Govern function of the RMF: patient advisors weigh in on context-of-use, acceptable risk, and fairness; investigators provide the clinical-workflow reality check and serve as the human-in-the-loop for every consequential recommendation.
Operational Structure
a. What support should FDA provide?
Early regulatory engagement (a pre-pilot context-of-use agreement), technical guidance on credibility assessment and model-risk tiering, and a standing review cadence so participants aren't guessing at expectations mid-pilot.
b. What infrastructure is needed?
Secure, validated data environments (21 CFR Part 11 compliant), shared tooling for validation and monitoring, and immutable audit trails common across participants.
c. How can the pilot accommodate varying levels of AI maturity?
Adopt a tiered maturity model — from advisory/shadow-mode for low-maturity participants to integrated decision support for the most mature — so every sponsor contributes evidence at a level matched to its readiness.
Timeline & Milestones
a. What is an appropriate duration?
18–24 months — long enough to carry at least one cohort from first-in-human dosing through a Phase 2 initiation decision, while remaining short enough to inform a summer-2026-style expansion cycle.
b. What interim milestones or checkpoints should be included?
Recommended gates: (1) onboarding & context-of-use lock; (2) data-readiness gate; (3) mid-pilot safety & model-performance review; (4) model-drift checkpoint; (5) Phase 1→2 decision capture. Each gate has pre-registered pass/fail criteria.
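One way to make "pre-registered pass/fail criteria" operational is to store each gate's criterion as data before the pilot starts and evaluate it mechanically afterwards. The sketch below assumes this approach; the `Gate` class, the metric names, and every threshold are hypothetical placeholders, not values proposed in this response.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Metrics = Dict[str, float]

@dataclass(frozen=True)
class Gate:
    name: str
    criterion: str                      # human-readable text fixed at registration
    passes: Callable[[Metrics], bool]   # machine-checkable form of the same rule

# Hypothetical criteria; real thresholds would be agreed at context-of-use lock.
GATES: List[Gate] = [
    Gate("COU lock", "context of use registered and risk tier assigned",
         lambda m: m["cou_registered"] == 1 and m["risk_tier_assigned"] == 1),
    Gate("Data readiness", "ALCOA+ completeness score >= 0.95",
         lambda m: m["alcoa_score"] >= 0.95),
    Gate("Mid-pilot review", "AUROC >= 0.70 and no open safety findings",
         lambda m: m["auroc"] >= 0.70 and m["open_safety_findings"] == 0),
    Gate("Drift checkpoint", "population stability index < 0.25",
         lambda m: m["psi"] < 0.25),
    Gate("Phase 1->2 decision capture", "decision logged with provenance",
         lambda m: m["decision_logged"] == 1),
]

def evaluate_gates(metrics: Metrics) -> List[Tuple[str, bool]]:
    """Evaluate gates in order and stop at the first failure."""
    results = []
    for gate in GATES:
        ok = gate.passes(metrics)
        results.append((gate.name, ok))
        if not ok:
            break  # later gates are not assessed until this one passes
    return results
```

Freezing the dataclass and keeping the criterion text next to the check is a design choice that makes it harder for the rule to drift from what was registered.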
c. How should FDA balance rapid insight with rigorous evaluation?
Use a learn-and-confirm staging with pre-registered metrics: continuous telemetry provides rapid operational insight, while confirmatory conclusions are gated on pre-specified, locked endpoints to preserve rigor.
Knowledge Sharing
a. How should lessons learned be captured and disseminated?
Maintain a structured pilot registry with standardized context-of-use and credibility-assessment templates, culminating in a public summary report so the broader ecosystem inherits the learning.
b. What mechanisms promote transparency while protecting proprietary information?
Adopt tiered disclosure: public model cards and system cards describe intended use, performance, and limitations at the context-of-use level; deeper artifacts are shared confidentially with the regulator. This satisfies transparency without exposing trade secrets.
Trial Efficiency & Speed
a. How should efficiency improvements be measured?
Track cycle-time metrics against a pre-defined baseline: time-to-trial-initiation, time-to-first-patient-in, enrollment rate, and time-to-last-patient-last-visit — each compared to historical or concurrent non-AI benchmarks.
b. What metrics assess reductions from Phase 1 completion to Phase 2 initiation?
Measure the interval between Phase 1 completion and Phase 2 initiation, decomposed into data-lock, analysis, decision, and start-up sub-intervals so the source of any acceleration is attributable.
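The decomposition described above can be sketched as simple date arithmetic over five milestones. The milestone keys and the example dates are illustrative assumptions; a real pilot would take them from the trial's own timestamped records.

```python
from datetime import date

# Ordered milestones; each adjacent pair bounds one sub-interval.
MILESTONES = ["phase1_complete", "data_lock", "analysis_complete",
              "decision", "phase2_initiation"]
SUB_INTERVALS = ["data_lock", "analysis", "decision", "start_up"]

def decompose_transition(dates):
    """Return days spent in each sub-interval plus the total transition time."""
    days = {}
    for label, start, end in zip(SUB_INTERVALS, MILESTONES, MILESTONES[1:]):
        delta = (dates[end] - dates[start]).days
        if delta < 0:
            raise ValueError(f"{end} precedes {start}")
        days[label] = delta
    days["total"] = (dates[MILESTONES[-1]] - dates[MILESTONES[0]]).days
    return days

if __name__ == "__main__":
    example = {
        "phase1_complete": date(2026, 1, 5),
        "data_lock": date(2026, 2, 4),
        "analysis_complete": date(2026, 3, 6),
        "decision": date(2026, 3, 20),
        "phase2_initiation": date(2026, 5, 19),
    }
    print(decompose_transition(example))
```

Comparing each sub-interval against the same sub-interval in a non-AI benchmark shows where any acceleration actually occurred.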
c. How can screening, recruitment, and retention improvements be quantified?
Use screen-fail rate, screen-to-enroll ratio, time-to-enroll, and retention/dropout rate, stratified by site and subgroup to surface where AI enrichment actually helps.
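As a minimal sketch of these recruitment metrics, the snippet below computes them per site from three counts. It assumes, for simplicity, that every screened person who does not enroll counts as a screen failure; the function names and input layout are hypothetical.

```python
def recruitment_metrics(screened, enrolled, completed):
    """Compute recruitment and retention metrics from raw counts."""
    if screened <= 0 or enrolled <= 0:
        raise ValueError("screened and enrolled must be positive")
    return {
        # Simplifying assumption: all non-enrolled screened people are screen failures.
        "screen_fail_rate": (screened - enrolled) / screened,
        "screen_to_enroll_ratio": screened / enrolled,
        "retention_rate": completed / enrolled,
        "dropout_rate": (enrolled - completed) / enrolled,
    }

def metrics_by_site(counts):
    """counts maps a site id to a (screened, enrolled, completed) tuple."""
    return {site: recruitment_metrics(*c) for site, c in counts.items()}

if __name__ == "__main__":
    print(metrics_by_site({"site_a": (40, 10, 8), "site_b": (25, 10, 10)}))
```

The same function can be applied to subgroup counts instead of site counts to produce the subgroup stratification.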
Decision Quality
a. How can the quality and timeliness of go/no-go decisions be evaluated?
Combine decision latency, decision-reversal rate, and calibration (predicted vs. observed outcomes) across both FDA regulatory and sponsor-internal decision points.
b. What methods assess concordance between AI-supported and traditional decisions?
Run blinded parallel decisioning: AI-supported and traditional decisions are made independently, then compared for concordance (for example, percent agreement and Cohen's kappa) and, where ground truth emerges, scored against adjudicated outcomes for discrimination (AUROC) and probabilistic accuracy (Brier score).
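A minimal sketch of the two calculations follows: Cohen's kappa as a chance-corrected concordance measure between the two decision streams, and the Brier score for predicted probabilities against adjudicated outcomes. Kappa is one reasonable choice of concordance statistic rather than a method stated in the RFI, and the function names are assumptions.

```python
from collections import Counter

def cohen_kappa(decisions_a, decisions_b):
    """Chance-corrected agreement between two equally long decision lists."""
    if len(decisions_a) != len(decisions_b) or not decisions_a:
        raise ValueError("need two non-empty lists of equal length")
    n = len(decisions_a)
    observed = sum(a == b for a, b in zip(decisions_a, decisions_b)) / n
    count_a, count_b = Counter(decisions_a), Counter(decisions_b)
    labels = set(decisions_a) | set(decisions_b)
    expected = sum(count_a[k] * count_b[k] for k in labels) / n ** 2
    if expected == 1:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)

def brier_score(probabilities, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    if len(probabilities) != len(outcomes) or not probabilities:
        raise ValueError("need two non-empty lists of equal length")
    return sum((p - y) ** 2 for p, y in zip(probabilities, outcomes)) / len(outcomes)

if __name__ == "__main__":
    ai = ["go", "go", "no-go", "no-go"]
    traditional = ["go", "no-go", "no-go", "no-go"]
    print(cohen_kappa(ai, traditional))
    print(brier_score([0.8, 0.2], [1, 0]))
```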
c. How should reductions in late-stage failures be measured?
This requires longitudinal registry linkage: track downstream Phase 3 success conditional on AI-supported early decisions, acknowledging the long horizon and using survival/competing-risk framing rather than a simple rate.
Participant Safety & Data Integrity
a. What metrics evaluate detection and response time for safety signals?
Time-to-signal-detection, time-to-response, and the sensitivity/specificity and false-alarm rate of detection, measured against adjudicated safety events.
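These detection metrics could be computed along the following lines. The sketch assumes each adjudicated event carries either a detection delay in hours or `None` if it was never flagged, and it defines the false-alarm rate as the share of alerts that did not match an adjudicated event; that definition, like the function and parameter names, is an assumption that a pilot would need to fix in advance.

```python
from statistics import median

def _ratio(numerator, denominator):
    """Return numerator/denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None

def detection_metrics(delays_hours, false_alerts, quiet_windows):
    """Summarize signal-detection performance against adjudicated events.

    delays_hours: one entry per adjudicated event, hours from onset to alert,
                  or None when the event was never flagged.
    false_alerts: alerts that matched no adjudicated event.
    quiet_windows: monitoring windows with no event and no alert.
    """
    detected = [d for d in delays_hours if d is not None]
    tp, fn = len(detected), len(delays_hours) - len(detected)
    fp, tn = false_alerts, quiet_windows
    return {
        "sensitivity": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, tn + fp),
        "false_alarm_share": _ratio(fp, tp + fp),
        "median_hours_to_detection": median(detected) if detected else None,
    }

if __name__ == "__main__":
    print(detection_metrics([2, 6, None, 10], false_alerts=1, quiet_windows=19))
```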
b. How should impact on AE rates or protocol deviations be assessed?
Compare AE/SAE rate deltas and protocol-deviation frequency between AI-supported and comparator arms, controlling for population and exposure.
c. What measures assess data completeness, accuracy, and consistency?
ALCOA+ scorecards, query rates, and source-data-verification discrepancy rates provide an objective, auditable view of data integrity.
AI System Performance
a. What metrics evaluate accuracy, robustness, and generalizability?
Report discrimination (AUROC), calibration, subgroup performance, and external validation on held-out and independent data — generalizability is demonstrated, not assumed.
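For readers who want the two headline statistics spelled out, here is a dependency-free sketch. AUROC is computed as the share of positive/negative pairs ranked correctly, and calibration is summarized with a binned expected calibration error, which is one common summary among several. The function names and the bin count are assumptions.

```python
def auroc(scores, labels):
    """Probability that a random positive case scores above a random negative one."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    # Ties between a positive and a negative score count as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def expected_calibration_error(scores, labels, n_bins=5):
    """Weighted mean gap between predicted and observed event rates per bin."""
    bins = [[] for _ in range(n_bins)]
    for s, y in zip(scores, labels):
        index = min(int(s * n_bins), n_bins - 1)  # a score of 1.0 goes in the top bin
        bins[index].append((s, y))
    total = len(scores)
    error = 0.0
    for members in bins:
        if members:
            mean_score = sum(s for s, _ in members) / len(members)
            event_rate = sum(y for _, y in members) / len(members)
            error += len(members) / total * abs(mean_score - event_rate)
    return error

if __name__ == "__main__":
    s, y = [0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]
    print(auroc(s, y), expected_calibration_error(s, y, n_bins=2))
```

Running the same two functions on held-out, external, and per-subgroup data gives the comparison needed to show generalizability.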
b. How should stability over time and model drift be measured?
Monitor input drift (e.g., population stability index), performance-over-time, and a defined retraining cadence under change control, with alerting when drift breaches thresholds.
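The population stability index mentioned above compares the binned distribution of an input at baseline with its distribution in a later window. A minimal sketch follows; the 0.25 alert threshold is a widely used rule of thumb rather than a value from the RFI, and the function names are assumptions.

```python
import math

def population_stability_index(baseline_counts, current_counts, eps=1e-6):
    """PSI = sum over bins of (current - baseline) * ln(current / baseline).

    Both arguments are per-bin counts over the same bin edges. eps keeps
    empty bins from producing a division by zero or log of zero.
    """
    if len(baseline_counts) != len(current_counts):
        raise ValueError("both inputs must use the same bins")
    base_total, cur_total = sum(baseline_counts), sum(current_counts)
    psi = 0.0
    for base, cur in zip(baseline_counts, current_counts):
        base_share = max(base / base_total, eps)
        cur_share = max(cur / cur_total, eps)
        psi += (cur_share - base_share) * math.log(cur_share / base_share)
    return psi

def drift_alert(psi, threshold=0.25):
    """Return True when PSI meets or exceeds the assumed alert threshold."""
    return psi >= threshold

if __name__ == "__main__":
    value = population_stability_index([50, 50], [25, 75])
    print(value, drift_alert(value))
```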
c. How can performance be evaluated across populations, sites, and therapeutic areas?
Maintain stratified performance dashboards with fairness slices by demographic and clinical subgroup, by site, and by therapeutic area — surfacing heterogeneity rather than hiding it in an aggregate.
Trustworthiness (aligned with NIST AI RMF)
a. What evidence demonstrates AI systems are valid and reliable?
A context-of-use-scoped credibility assessment per the FDA draft guidance — analytical validation plus clinical validation, with credibility evidence proportional to model risk.
b. How should safety and risk mitigation be evaluated?
Through model-risk tiering, human-in-the-loop controls, fail-safe defaults, and override logging — with the residual-risk profile documented per context-of-use.
c. What metrics assess transparency and explainability, for both sponsor-built and proprietary systems?
Use model and system cards, explanation-fidelity measures, and a behavioral (black-box) testing harness that evaluates input/output behavior without requiring source access — making the same metrics applicable to proprietary third-party systems.
d. How should privacy protections and data governance be evaluated?
Assess de-identification, access controls, data lineage, and Part 11 compliance, with a documented data-governance plan per context-of-use.
e. What approaches assess fairness across demographic and clinical subgroups?
Subgroup performance-parity analysis and bias audits across demographic and clinical strata, with parity thresholds agreed at context-of-use registration.
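A parity check of this kind might look like the sketch below, which uses the true-positive rate as the performance measure and reports the largest gap between subgroups. The choice of metric, the 0.10 threshold, and the record layout are all assumptions; in practice they would be set at context-of-use registration.

```python
from collections import defaultdict

def true_positive_rate_by_subgroup(records):
    """records: iterable of (subgroup, flagged_by_model, actually_positive)."""
    hits, positives = defaultdict(int), defaultdict(int)
    for subgroup, flagged, actual in records:
        if actual:
            positives[subgroup] += 1
            hits[subgroup] += int(bool(flagged))
    # Subgroups with no positive cases are left out; their rate is undefined.
    return {g: hits[g] / positives[g] for g in positives}

def parity_gap(records, threshold=0.10):
    """Return (rates, gap, breach) where gap is max rate minus min rate."""
    rates = true_positive_rate_by_subgroup(records)
    if not rates:
        raise ValueError("no positive cases in any subgroup")
    gap = max(rates.values()) - min(rates.values())
    return rates, gap, gap > threshold

if __name__ == "__main__":
    data = ([("A", 1, 1)] * 3 + [("A", 0, 1), ("A", 0, 0)]
            + [("B", 1, 1)] * 2 + [("B", 0, 1)] * 2)
    print(parity_gap(data))
```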
Comparative Evaluation
a. What comparators are most appropriate?
A blend: historical controls, concurrent non-AI arms, and in-silico simulation / digital-twin benchmarks — triangulating rather than relying on any single comparator.
b. How should differences in design, complexity, or therapeutic area be accounted for?
Through covariate adjustment, stratification, matched comparisons, and simulation-based benchmarking so that observed differences are attributable to AI rather than to design heterogeneity.
Qualitative Outcomes
a. How can stakeholder trust be assessed?
Deploy validated trust and acceptance instruments (e.g., adapted technology-acceptance and trust-in-automation scales) for investigators, participants, and regulators at defined checkpoints.
b. What methods evaluate usability and workflow integration?
System Usability Scale scores, task-completion rates, and time-and-motion analysis of how recommendations enter the clinical workflow.
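For reference, the System Usability Scale is scored from ten items answered on a 1 to 5 scale: odd-numbered items contribute the response minus 1, even-numbered items contribute 5 minus the response, and the sum is multiplied by 2.5 to give a 0 to 100 score. A short sketch, with a hypothetical function name:

```python
def sus_score(responses):
    """Score one completed System Usability Scale questionnaire (0 to 100)."""
    if len(responses) != 10 or any(r not in (1, 2, 3, 4, 5) for r in responses):
        raise ValueError("SUS needs ten responses, each an integer from 1 to 5")
    total = 0
    for item_number, response in enumerate(responses, start=1):
        if item_number % 2 == 1:
            total += response - 1      # positively worded items
        else:
            total += 5 - response      # negatively worded items
    return total * 2.5

if __name__ == "__main__":
    print(sus_score([4, 2, 4, 2, 4, 2, 4, 2, 4, 2]))
```

Scores would typically be averaged per stakeholder group and checkpoint rather than interpreted individually.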
c. How should perceived value, scalability, and operational feasibility be measured?
Adoption metrics, net-promoter-style value scores, cost-per-decision, and scalability stress tests across sites and indications.
The RFI grounds trustworthy AI in the seven characteristics of the NIST AI Risk Management Framework. Aurelyn's Governance & Assurance layer maps a concrete control and a measurable signal to each — the answer to RFI question B.5 in matrix form.
| NIST AI RMF Characteristic | Aurelyn Control | Measurable Signal |
|---|---|---|
| Valid & Reliable | Context-of-use-scoped credibility assessment; analytical + clinical validation gate before integrated use. | AUROC, calibration, external-validation pass/fail |
| Safe | Model-risk tiering, fail-safe defaults, mandatory human-in-the-loop on consequential outputs. | Override rate, residual-risk profile, harm events |
| Secure & Resilient | Part 11-validated environment, access controls, drift & adversarial monitoring. | Drift index (PSI), incident count, uptime |
| Accountable & Transparent | Immutable audit trails, model & system cards, decision provenance. | Audit completeness, card coverage, traceability |
| Explainable & Interpretable | Explanation outputs per recommendation; black-box behavioral testing for proprietary models. | Explanation-fidelity score, user comprehension |
| Privacy-Enhanced | De-identification, data-lineage tracking, governed access by context-of-use. | Re-identification risk, lineage completeness |
| Fair — Harms Managed | Mandatory subgroup performance slicing; parity thresholds with governance review on breach. | Subgroup parity gap, bias-audit findings |
Aligned with NIST AI RMF 1.0 trustworthy-AI characteristics and the FDA draft guidance "Considerations for the Use of Artificial Intelligence to Support Regulatory Decision-Making for Drug and Biological Products." Aurelyn implements these as enforceable platform controls, not policy documents alone.
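To illustrate what a tamper-evident audit trail means in practice, the sketch below chains each log entry to the hash of the previous one, so that altering any earlier entry invalidates verification. This is a generic hash-chain pattern offered as an assumption about how such a control could work, not a description of Aurelyn's implementation; the entry fields are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash that precedes the first entry

def _digest(event, prev_hash):
    """Hash the event together with the previous entry's hash."""
    payload = json.dumps({"event": event, "prev_hash": prev_hash}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def append_entry(chain, event):
    """Append an event (a JSON-serializable dict) to the audit chain."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    entry = {"event": event, "prev_hash": prev_hash,
             "hash": _digest(event, prev_hash)}
    chain.append(entry)
    return entry

def verify_chain(chain):
    """Recompute every hash; return False if any entry was altered or reordered."""
    prev_hash = GENESIS
    for entry in chain:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != _digest(entry["event"], prev_hash):
            return False
        prev_hash = entry["hash"]
    return True

if __name__ == "__main__":
    log = []
    append_entry(log, {"actor": "investigator_1", "action": "override", "dose_level": 2})
    append_entry(log, {"actor": "system", "action": "recommendation", "dose_level": 3})
    print(verify_chain(log))
```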
Beyond the RFI itself, Aurelyn Trial | OS™ is engineered against the wider regulatory frame an early-phase AI pilot must satisfy. The crosswalk below shows where each framework is addressed in the platform.
| Framework | Relevance to the Pilot | Aurelyn Coverage |
|---|---|---|
| FDA Draft Guidance (AI / RDM): use of AI to support regulatory decision-making | Defines context-of-use and risk-based credibility assessment for AI evidence. | Native COU registration + 7-step credibility-assessment workflow. |
| NIST AI RMF 1.0 (Trust) | The RFI's explicit trustworthy-AI reference. | Seven characteristics implemented as enforceable controls (see the NIST crosswalk above). |
| 21 CFR Part 11 (Records) | Electronic records & signatures for any system in the trial record. | Validated environment, e-signatures, tamper-evident audit trails. |
| ICH E6(R3) (GCP) | Modern, risk-based GCP expectations including computerized systems. | Risk-based quality, oversight, and data-governance baked into workflows. |
| 21 CFR 312 / 50 (IND, Consent) | IND conduct and informed-consent integrity in early-phase trials. | Document intelligence over ICF/ISF; deviation & consent tracking. |
| GMLP (ML Lifecycle) | Good Machine Learning Practice for the AI development lifecycle. | Versioning, change control, monitored retraining, validation gates. |
| CDISC TMF Reference Model (Data) | Standardized trial-records structure for completeness scoring. | eTMF Intelligence Engine™ auto-classifies to the reference model. |
| EU AI Act (High-Risk AI) | Forward-compatibility for sponsors operating in the EU. | Risk-tiering, transparency, and human-oversight controls map to high-risk obligations. |
Framework names refer to the FDA's January 2025 draft guidance on AI in regulatory decision-making, NIST AI RMF 1.0, ICH E6(R3) Good Clinical Practice, the FDA/Health Canada/MHRA Good Machine Learning Practice guiding principles, the CDISC TMF Reference Model, and EU Regulation 2024/1689 (EU AI Act). Coverage reflects platform design intent for evaluation in the pilot.
What the pilot would actually measure
Aurelyn proposes these as the headline, pre-registered targets a pilot could test. The figures below are illustrative design targets and hypotheses — the platform's purpose is to measure them rigorously, not to assert them as proven outcomes.
[Figure: Cycle-time impact by decision point]
Evaluation coverage
Native, pre-instrumented telemetry maps to nearly every Category-B evaluation question — minimizing bespoke measurement scaffolding during the pilot.
Target figures represent platform design hypotheses to be tested under the pilot's pre-registered analysis plan; they are not claims of demonstrated clinical results. Actual effects depend on indication, sponsor maturity, and comparator design, and would be evaluated per RFI Category B.
A governed path from onboarding to evidence
How an Aurelyn-supported participant would move through the pilot, with the milestone gates recommended in answer A.5.
COU Lock
Context-of-use registered; model-risk tier assigned
Data Readiness
ALCOA+ gate; environment validated to Part 11
Shadow Mode
AI runs in parallel; concordance captured, no influence
Mid-Pilot Review
Safety, performance & drift checkpoint vs. pre-set criteria
Decision Capture
Phase 1→2 go/no-go logged with full provenance
Public Report
Model/system cards & lessons disseminated
Aurelyn AI Clinical seeks to participate
We welcome the opportunity to contribute to the FDA's pilot as a technology and assurance partner — and to submit this framework to Docket FDA-2026-N-4390. Aurelyn Trial | OS™ is ready to operationalize trustworthy AI in early-phase trials today.