FDA Docket FDA-2026-N-4390 · 91 FR 23100

AI-Enabled Optimization of Early-Phase Clinical Trials

A structured response demonstrating how Aurelyn Trial | OS™ and the Aurelyn Clinical Engines™ directly address the FDA's Request for Information — across pilot design, evaluation metrics, and the trustworthy-AI principles of the NIST AI Risk Management Framework.

Issued

Apr 29, 2026

Comments Due

Jun 29, 2026

Coordinating Office

Dep. Chief Medical Officer, OC

Centers

CDER · CBER · OCE

The comment period was extended 30 days from the original deadline; submissions are accepted through June 29, 2026 at regulations.gov under Docket No. FDA-2026-N-4390. This document is formatted to mirror the RFI's question taxonomy (Categories A & B) for direct, citable response.

01 — Positioning

Why Aurelyn answers this RFI

The FDA identifies eight ways AI may improve early-phase trials and asks how to structure a pilot and measure its success. Aurelyn Trial | OS™ was architected for exactly this problem space: a governed, regulator-ready operating system that turns model-informed decision support into auditable, ALCOA+ evidence.

Built for the eight use cases

Recruitment, dose escalation, safety monitoring, adaptive design, Phase 1→2 decisions, biomarker stratification, and endpoint validation are each handled by a named Clinical Engine — not bolt-ons, but the core architecture.

Trustworthy by construction

Every engine runs inside a Governance & Assurance layer mapped to all seven NIST AI RMF characteristics and the FDA's risk-based credibility-assessment framework — context-of-use, model risk, and lifecycle monitoring as first-class objects.

Measurable from day one

The platform emits the exact telemetry the RFI's Category B requests — cycle times, decision concordance, signal-detection latency, drift, and subgroup fairness — pre-instrumented so the pilot can be evaluated rigorously, not retrospectively reconstructed.

02 — Platform Architecture

The Aurelyn Clinical Engines™

Aurelyn Trial | OS™ is the orchestration layer; the Clinical Engines™ are modular, independently validated capabilities. Each maps to one or more of the FDA's stated AI opportunities. A cross-cutting Governance & Assurance layer wraps every engine.

FDA-stated opportunity → Aurelyn Engine mapping

⬡ Governance & Assurance Layer — NIST AI RMF · 21 CFR Part 11 · GMLP · Credibility Assessment

Engine 01

Cohort Intelligence Engine™

↳ Recruitment · Biomarker selection · Stratification

Site & patient feasibility modeling, eligibility-criteria optimization, and biomarker-based enrichment for small, hard-to-recruit early-phase populations.

Engine 02

Adaptive Dose Engine™

↳ Dose escalation · Adaptive design

Model-informed dose finding (Bayesian logistic regression, mTPI/BOIN), seamless and adaptive design simulation aligned with FDA Project Optimus.
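The interval-based rule named above can be sketched in a few lines. This is a minimal illustration of the published BOIN escalation and de-escalation boundaries, not Aurelyn's implementation: the function names, the 30% target toxicity rate, and the default 0.6x/1.4x interval multipliers are assumptions chosen for the example, and the overdose-elimination rule used in practice is omitted.

```python
import math

def boin_boundaries(target, low_mult=0.6, high_mult=1.4):
    """Return (lambda_e, lambda_d) for a target DLT rate.

    low_mult/high_mult define the sub-therapeutic and overly toxic
    rates (phi1, phi2); 0.6 and 1.4 are assumed defaults here.
    """
    phi, phi1, phi2 = target, low_mult * target, high_mult * target
    lam_e = math.log((1 - phi1) / (1 - phi)) / math.log(
        phi * (1 - phi1) / (phi1 * (1 - phi)))
    lam_d = math.log((1 - phi) / (1 - phi2)) / math.log(
        phi2 * (1 - phi) / (phi * (1 - phi2)))
    return lam_e, lam_d

def boin_decision(n_dlt, n_treated, target=0.30):
    """Compare the observed DLT rate at the current dose to the boundaries."""
    lam_e, lam_d = boin_boundaries(target)
    rate = n_dlt / n_treated
    if rate <= lam_e:
        return "escalate"
    if rate >= lam_d:
        return "de-escalate"
    return "stay"

if __name__ == "__main__":
    print(boin_boundaries(0.30))   # boundaries for a 30% target
    print(boin_decision(1, 3))     # 1 DLT among 3 treated
```

Because the boundaries depend only on the target rate, they can be fixed and registered before the first participant is dosed, which suits a pre-specified analysis plan.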

Engine 03

Safety Sentinel Engine™

↳ Safety monitoring

Continuous AE/SAE/SUSAR signal detection, near-real-time pharmacovigilance triage, and automated narrative drafting with human adjudication.

Engine 04

Clinical Evidence Engine™

↳ Phase 1→2 go/no-go · Endpoint validation

Go/no-go decision support, predictive Phase-2 success modeling, and endpoint/biomarker qualification analytics with calibrated uncertainty.

Engine 05

eTMF Intelligence Engine™

↳ Data integrity · Inspection readiness

CDISC TMF Reference Model classification, ALCOA+ completeness scoring, and continuous inspection-readiness across the trial master file.

Layer · Cross-cutting

Governance & Assurance

↳ Trustworthy AI · Validation · Audit

Context-of-use registration, model-risk tiering, drift monitoring, immutable audit trails, model & system cards, and role-based human oversight.

03 — Itemized Response

Answering the RFI, question by question

Below, every sub-question from RFI Categories A and B is reproduced and answered with the specific Aurelyn capability that addresses it. The design questions (Category A) come first, followed by the evaluation-metric questions (Category B) and the two crosswalks.

A.1

Scope & Focus

a. Which trial types or issues benefit most from AI?

Aurelyn recommends anchoring the pilot in the highest-uncertainty, smallest-N contexts where model-informed methods deliver the greatest marginal value: first-in-human oncology dose escalation and rare-disease trials, with adaptive Phase 1b/2a designs as a secondary focus. These settings have the clearest decision points (dose, expansion, go/no-go) and the strongest existing precedent for quantitative methods.

Aurelyn: Adaptive Dose Engine™ and Clinical Evidence Engine™ target precisely these decision points, where small samples make every datum decision-relevant.

b. Target specific therapeutic areas, or remain broadly applicable?

A tiered approach: anchor in oncology and rare disease for interpretable early signal, but require platform-agnostic architecture so methods and governance generalize. This protects learning velocity without over-fitting the pilot to one indication.

Aurelyn: Trial | OS™ is therapeutic-area-agnostic; engines are configured by context-of-use rather than hard-coded to an indication.

c. Should priority go to specific AI use cases?

Yes — prioritize the three with the clearest measurable endpoints and regulatory touchpoints: (1) dose optimization (aligned with Project Optimus), (2) safety-signal detection, and (3) recruitment & biomarker stratification. These produce the cleanest evidence for the Category B metrics.

Aurelyn: Engines 01–03 map one-to-one to these priorities and emit pre-defined evaluation telemetry for each.

A.2

Participant Selection

a. What criteria should FDA use to select sponsors, trials, or technologies?

Select on: a well-defined context-of-use, an assigned model-risk tier, demonstrated data readiness/ALCOA+ maturity, a pre-specified credibility-assessment plan, and evidence of a managed AI lifecycle (Good Machine Learning Practice). This mirrors the FDA's own draft-guidance framework and keeps selection objective.

Aurelyn: The Governance layer ships a Readiness Rubric that scores candidates on each criterion before enrollment.

b. How can the pilot ensure representation across size, capability, and therapeutic area?

Use stratified selection quotas across sponsor size (small/emerging biotech through large pharma), AI maturity, and therapeutic area. The chief barrier for smaller sponsors is infrastructure — so a low-infrastructure delivery model is essential to genuine representation.

Aurelyn: Delivered as managed SaaS with no/low-code configuration, lowering the entry barrier so emerging sponsors participate on equal footing.

A.3

Collaboration Models

a. Which partnerships are most effective?

A four-party sponsor–technology vendor–academic–FDA consortium, with an independent technology/assurance layer that no single sponsor owns. This separates the party that builds the model from the party that governs and validates it.

Aurelyn: Positioned as the neutral technology & assurance layer, interoperable with sponsor and third-party models alike.

b. How can FDA facilitate pre-competitive collaboration and knowledge sharing?

Stand up a shared validation harness and benchmark datasets in secure enclaves, with federated evaluation so participants contribute to common metrics without exposing proprietary data or models.

Aurelyn: Supports federated, behavioral (input/output) evaluation, so proprietary systems can be benchmarked without source-code disclosure.

c. What role should patient groups and investigators play in AI governance?

Embed both directly in the Govern function of the RMF: patient advisors weigh in on context-of-use, acceptable risk, and fairness; investigators provide the clinical-workflow reality check and serve as the human-in-the-loop for every consequential recommendation.

Aurelyn: Role-based oversight with investigator-in-the-loop checkpoints and a patient-advisory input field on each context-of-use record.

A.4

Operational Structure

a. What support should FDA provide?

Early regulatory engagement (a pre-pilot context-of-use agreement), technical guidance on credibility assessment and model-risk tiering, and a standing review cadence so participants aren't guessing at expectations mid-pilot.

Aurelyn: Generates regulator-facing COU dossiers and credibility-assessment packages aligned to the FDA draft guidance structure.

b. What infrastructure is needed?

Secure, validated data environments (21 CFR Part 11 compliant), shared tooling for validation and monitoring, and immutable audit trails common across participants.

Aurelyn: Ships a Part 11-validated environment with electronic-records/signatures controls and tamper-evident audit logging out of the box.

c. How can the pilot accommodate varying levels of AI maturity?

Adopt a tiered maturity model — from advisory/shadow-mode for low-maturity participants to integrated decision support for the most mature — so every sponsor contributes evidence at a level matched to its readiness.

Aurelyn: Configurable autonomy (shadow → recommend → integrated), set per engine and per context-of-use.

A.5

Timeline & Milestones

a. What is an appropriate duration?

18–24 months — long enough to carry at least one cohort from first-in-human dosing through a Phase 2 initiation decision, while remaining short enough to inform a summer-2026-style expansion cycle.

Aurelyn: Telemetry is captured continuously, so interim readouts are available well before the full duration elapses.

b. What interim milestones or checkpoints should be included?

Recommended gates: (1) onboarding & context-of-use lock; (2) data-readiness gate; (3) mid-pilot safety & model-performance review; (4) model-drift checkpoint; (5) Phase 1→2 decision capture. Each gate has pre-registered pass/fail criteria.

Aurelyn: Milestone dashboards auto-populate from platform events; nothing is reconstructed after the fact.

c. How should FDA balance rapid insight with rigorous evaluation?

Use a learn-and-confirm staging with pre-registered metrics: continuous telemetry provides rapid operational insight, while confirmatory conclusions are gated on pre-specified, locked endpoints to preserve rigor.

Aurelyn: Pre-registration of metrics and locked analysis plans are native objects, so rapid signals never contaminate confirmatory analysis.

A.6

Knowledge Sharing

a. How should lessons learned be captured and disseminated?

Maintain a structured pilot registry with standardized context-of-use and credibility-assessment templates, culminating in a public summary report so the broader ecosystem inherits the learning.

Aurelyn: Standardized, exportable COU and credibility templates make cross-participant synthesis straightforward.

b. What mechanisms promote transparency while protecting proprietary information?

Adopt tiered disclosure: public model cards and system cards describe intended use, performance, and limitations at the context-of-use level; deeper artifacts are shared confidentially with the regulator. This satisfies transparency without exposing trade secrets.

Aurelyn: Auto-generates regulator-facing transparency artifacts (model/system cards) that disclose behavior and limits, not source.

B.1

Trial Efficiency & Speed

a. How should efficiency improvements be measured?

Track cycle-time metrics against a pre-defined baseline: time-to-trial-initiation, time-to-first-patient-in, enrollment rate, and time-to-last-patient-last-visit — each compared to historical or concurrent non-AI benchmarks.

Aurelyn: Every milestone is timestamped at the platform level, so cycle times are computed automatically, not surveyed.

b. What metrics assess reductions from Phase 1 completion to Phase 2 initiation?

Measure the interval between Phase 1 completion and Phase 2 initiation, decomposed into data-lock, analysis, decision, and start-up sub-intervals so the source of any acceleration is attributable.

Aurelyn: Clinical Evidence Engine™ records decision timestamps and the evidence state at each, isolating decision latency from operational latency.
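As a minimal sketch of the sub-interval decomposition described in this answer, the example below splits the Phase 1→2 transition into data-lock, analysis, decision, and start-up intervals. The milestone names and dates are hypothetical placeholders, not platform fields or pilot data.

```python
from datetime import date

# Hypothetical milestone names, in chronological order.
MILESTONES = ["phase1_complete", "data_lock", "analysis_complete",
              "decision", "phase2_start"]
# One sub-interval per consecutive pair of milestones.
SUB_INTERVALS = ["data_lock", "analysis", "decision", "start_up"]

def decompose_transition(timestamps):
    """Return days spent in each sub-interval plus the total."""
    result = {}
    pairs = zip(MILESTONES, MILESTONES[1:])
    for name, (start, end) in zip(SUB_INTERVALS, pairs):
        result[name] = (timestamps[end] - timestamps[start]).days
    result["total"] = (timestamps[MILESTONES[-1]]
                       - timestamps[MILESTONES[0]]).days
    return result

if __name__ == "__main__":
    example = {
        "phase1_complete": date(2026, 1, 5),
        "data_lock": date(2026, 1, 26),
        "analysis_complete": date(2026, 2, 16),
        "decision": date(2026, 3, 2),
        "phase2_start": date(2026, 4, 13),
    }
    print(decompose_transition(example))
```

Reporting the four components alongside the total shows whether any acceleration came from faster decisions or from faster operations.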

c. How can screening, recruitment, and retention improvements be quantified?

Use screen-fail rate, screen-to-enroll ratio, time-to-enroll, and retention/dropout rate, stratified by site and subgroup to surface where AI enrichment actually helps.

Aurelyn: Cohort Intelligence Engine™ reports these metrics live and attributes deltas to specific eligibility/enrichment recommendations.

B.2

Decision Quality

a. How can the quality and timeliness of go/no-go decisions be evaluated?

Combine decision latency, decision-reversal rate, and calibration (predicted vs. observed outcomes) across both FDA regulatory and sponsor-internal decision points.

Aurelyn: Predictions are stored with calibrated uncertainty, so post-hoc calibration assessment (e.g., reliability curves, Brier scores) is computable directly.
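To make the calibration measures concrete, here is a small pure-Python sketch of a Brier score and the binned table behind a reliability curve. The function names, the five equal-width bins, and the sample values are assumptions for illustration only.

```python
def brier_score(probs, outcomes):
    """Mean squared difference between predicted probability and 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

def reliability_table(probs, outcomes, n_bins=5):
    """Rows of (bin index, count, mean predicted, observed rate).

    Equal-width bins on [0, 1]; empty bins are skipped.
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    rows = []
    for idx, items in enumerate(bins):
        if not items:
            continue
        mean_pred = sum(p for p, _ in items) / len(items)
        observed = sum(y for _, y in items) / len(items)
        rows.append((idx, len(items), mean_pred, observed))
    return rows

if __name__ == "__main__":
    # Illustrative predicted success probabilities and later observed outcomes.
    probs = [0.95, 0.85, 0.25, 0.15]
    outcomes = [1, 1, 0, 0]
    print(brier_score(probs, outcomes))
    for row in reliability_table(probs, outcomes):
        print(row)
```

A well-calibrated model shows mean predicted probability close to the observed rate in every populated bin; with early-phase sample sizes, the bins will often be sparse and should be read with that in mind.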

b. What methods assess concordance between AI-supported and traditional decisions?

Run blinded parallel decisioning: AI-supported and traditional decisions are made independently, then compared for concordance and, where ground truth emerges, for accuracy (AUROC, Brier) against adjudicated outcomes.

Aurelyn: Shadow-mode operation captures the AI recommendation without influencing the human decision, enabling clean concordance studies.
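One way to quantify concordance from paired shadow-mode decisions is raw agreement plus Cohen's kappa, which corrects for agreement expected by chance. Kappa is a choice made for this sketch; the RFI answer does not name a specific statistic, and the decision labels below are illustrative.

```python
from collections import Counter

def concordance(ai_decisions, human_decisions):
    """Return (observed agreement, Cohen's kappa) for paired decisions."""
    n = len(ai_decisions)
    observed = sum(a == h for a, h in zip(ai_decisions, human_decisions)) / n
    ai_counts = Counter(ai_decisions)
    human_counts = Counter(human_decisions)
    labels = set(ai_counts) | set(human_counts)
    # Agreement expected if the two raters decided independently.
    expected = sum(ai_counts[l] * human_counts[l] for l in labels) / (n * n)
    if expected == 1.0:
        return observed, 1.0  # both raters always gave the same single label
    return observed, (observed - expected) / (1 - expected)

if __name__ == "__main__":
    ai = ["go", "go", "no-go", "go", "no-go", "no-go", "go", "go"]
    human = ["go", "no-go", "no-go", "go", "no-go", "go", "go", "go"]
    agreement, kappa = concordance(ai, human)
    print(f"agreement={agreement:.2f} kappa={kappa:.2f}")
```

Concordance alone does not show which decision was right; that still requires the adjudicated outcomes described in the answer.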

c. How should reductions in late-stage failures be measured?

This requires longitudinal registry linkage: track downstream Phase 3 success conditional on AI-supported early decisions, acknowledging the long horizon and using survival/competing-risk framing rather than a simple rate.

Aurelyn: Decision provenance is retained end-to-end, so early decisions can later be linked to downstream outcomes for attribution analysis.

B.3

Participant Safety & Data Integrity

a. What metrics evaluate detection and response time for safety signals?

Time-to-signal-detection, time-to-response, and the sensitivity/specificity and false-alarm rate of detection, measured against adjudicated safety events.

Aurelyn: Safety Sentinel Engine™ logs detection and acknowledgement timestamps and the analytic basis for every signal.
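The detection metrics listed in this answer can be computed from adjudicated counts and logged timestamps, as in the sketch below. It assumes the false-alarm rate is defined as FP / (FP + TN); some programs instead report the share of raised alerts that were false, so the definition should be fixed in the analysis plan. All counts and timestamps are made up for illustration.

```python
from datetime import datetime
from statistics import median

def detection_metrics(tp, fp, fn, tn):
    """Accuracy of signal detection against adjudicated safety events."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "false_alarm_rate": fp / (fp + tn),  # assumed definition, see lead-in
    }

def median_latency_hours(events):
    """events: (event onset, signal detected) datetime pairs for true signals."""
    return median((detected - onset).total_seconds() / 3600
                  for onset, detected in events)

if __name__ == "__main__":
    print(detection_metrics(tp=18, fp=6, fn=2, tn=74))
    events = [
        (datetime(2026, 3, 1, 8), datetime(2026, 3, 1, 12)),
        (datetime(2026, 3, 2, 9), datetime(2026, 3, 2, 19)),
        (datetime(2026, 3, 3, 0), datetime(2026, 3, 3, 6)),
    ]
    print(median_latency_hours(events))
```

Time-to-response can be computed the same way by pairing detection timestamps with acknowledgement timestamps.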

b. How should impact on AE rates or protocol deviations be assessed?

Compare AE/SAE rate deltas and protocol-deviation frequency between AI-supported and comparator arms, controlling for population and exposure.

Aurelyn: Integrated deviation-management telemetry quantifies deviation frequency, type, and time-to-resolution continuously.

c. What measures assess data completeness, accuracy, and consistency?

ALCOA+ scorecards, query rates, and source-data-verification discrepancy rates provide an objective, auditable view of data integrity.

Aurelyn: eTMF Intelligence Engine™ produces a continuous ALCOA+ completeness and consistency score across the trial master file.

B.4

AI System Performance

a. What metrics evaluate accuracy, robustness, and generalizability?

Report discrimination (AUROC), calibration, subgroup performance, and external validation on held-out and independent data — generalizability is demonstrated, not assumed.

Aurelyn: Governance layer enforces an external-validation gate before any engine moves from shadow to integrated mode.

b. How should stability over time and model drift be measured?

Monitor input drift (e.g., population stability index), performance-over-time, and a defined retraining cadence under change control, with alerting when drift breaches thresholds.

Aurelyn: Continuous drift monitoring with automatic alerts and change-controlled retraining is a standing function of the platform.
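The population stability index mentioned in this answer compares the binned distribution of an input at baseline with its current distribution. The sketch below is a minimal pure-Python version; the bin edges, the small floor used to avoid log of zero, and the sample ages are assumptions for illustration, and the alert threshold would be whatever the pilot pre-registers.

```python
import math

def bin_proportions(values, edges, floor=1e-6):
    """Share of values per bin; bins are split at the given sorted edges."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[sum(v >= e for e in edges)] += 1
    # Floor empty bins so the log ratio below stays defined.
    return [max(c / len(values), floor) for c in counts]

def population_stability_index(baseline, current, edges):
    """PSI = sum over bins of (current - baseline) * ln(current / baseline)."""
    base = bin_proportions(baseline, edges)
    curr = bin_proportions(current, edges)
    return sum((c - b) * math.log(c / b) for b, c in zip(base, curr))

if __name__ == "__main__":
    edges = [35, 45, 55]                      # hypothetical age bands
    baseline_ages = [30, 40, 50, 60] * 25     # evenly spread at training time
    current_ages = [30] * 10 + [40] * 20 + [50] * 30 + [60] * 40
    print(population_stability_index(baseline_ages, current_ages, edges))
```

PSI is zero when the two distributions match and grows as enrollment shifts away from the population the model was validated on.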

c. How can performance be evaluated across populations, sites, and therapeutic areas?

Maintain stratified performance dashboards with fairness slices by demographic and clinical subgroup, by site, and by therapeutic area — surfacing heterogeneity rather than hiding it in an aggregate.

Aurelyn: Performance is always reported sliced; aggregate-only reporting is disabled by governance policy.

B.5

Trustworthiness (aligned with NIST AI RMF)

a. What evidence demonstrates AI systems are valid and reliable?

A context-of-use-scoped credibility assessment per the FDA draft guidance — analytical validation plus clinical validation, with credibility evidence proportional to model risk.

Aurelyn: The credibility-assessment workflow is built in, producing the seven-step risk-based package the draft guidance describes.

b. How should safety and risk mitigation be evaluated?

Through model-risk tiering, human-in-the-loop controls, fail-safe defaults, and override logging — with the residual-risk profile documented per context-of-use.

Aurelyn: Every consequential output requires a logged human decision; overrides and their rationale are retained for audit.

c. What metrics assess transparency and explainability, for both sponsor-built and proprietary systems?

Use model and system cards, explanation-fidelity measures, and a behavioral (black-box) testing harness that evaluates input/output behavior without requiring source access — making the same metrics applicable to proprietary third-party systems.

Aurelyn: The black-box evaluation harness lets the pilot hold proprietary and open systems to one transparency standard.

d. How should privacy protections and data governance be evaluated?

Assess de-identification, access controls, data lineage, and Part 11 compliance, with a documented data-governance plan per context-of-use.

Aurelyn: End-to-end data lineage and role-based access are enforced and auditable within the Part 11-validated environment.

e. What approaches assess fairness across demographic and clinical subgroups?

Subgroup performance-parity analysis and bias audits across demographic and clinical strata, with parity thresholds agreed at context-of-use registration.

Aurelyn: Fairness slices are a mandatory output; parity breaches trigger a governance review before deployment.
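A parity check of the kind described here can be sketched as a per-subgroup metric plus a gap test. The example uses sensitivity as the sliced metric and a 0.10 maximum gap; both are assumptions for illustration, since the actual metric and threshold would be agreed at context-of-use registration. The records are synthetic.

```python
from collections import defaultdict

def sensitivity_by_subgroup(records):
    """records: (subgroup, true label, predicted label) with 0/1 labels."""
    hits = defaultdict(int)
    positives = defaultdict(int)
    for group, y_true, y_pred in records:
        if y_true == 1:
            positives[group] += 1
            hits[group] += int(y_pred == 1)
    return {g: hits[g] / positives[g] for g in positives}

def parity_check(records, max_gap):
    """Return (per-group rates, largest gap, whether the gap breaches max_gap)."""
    rates = sensitivity_by_subgroup(records)
    gap = max(rates.values()) - min(rates.values())
    return rates, gap, gap > max_gap

if __name__ == "__main__":
    records = (
        [("A", 1, 1)] * 9 + [("A", 1, 0)] * 1 +   # subgroup A: 9 of 10 detected
        [("B", 1, 1)] * 7 + [("B", 1, 0)] * 3 +   # subgroup B: 7 of 10 detected
        [("A", 0, 0)] * 20 + [("B", 0, 0)] * 20   # negatives do not affect sensitivity
    )
    rates, gap, breach = parity_check(records, max_gap=0.10)
    print(rates, round(gap, 2), breach)
```

With small early-phase subgroups the per-group rates carry wide uncertainty, so a breach is better treated as a trigger for review than as a verdict.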

B.6

Comparative Evaluation

a. What comparators are most appropriate?

A blend: historical controls, concurrent non-AI arms, and in-silico simulation / digital-twin benchmarks — triangulating rather than relying on any single comparator.

Aurelyn: Adaptive Dose Engine™ includes trial-simulation tooling to generate in-silico comparators alongside empirical ones.

b. How should differences in design, complexity, or therapeutic area be accounted for?

Through covariate adjustment, stratification, matched comparisons, and simulation-based benchmarking so that observed differences are attributable to AI rather than to design heterogeneity.

Aurelyn: Captures design metadata per trial, enabling adjusted and matched cross-trial comparisons.

B.7

Qualitative Outcomes

a. How can stakeholder trust be assessed?

Deploy validated trust and acceptance instruments (e.g., adapted technology-acceptance and trust-in-automation scales) for investigators, participants, and regulators at defined checkpoints.

Aurelyn: Survey checkpoints can be embedded at milestone gates so trust is tracked longitudinally, not just at the end.

b. What methods evaluate usability and workflow integration?

System Usability Scale scores, task-completion rates, and time-and-motion analysis of how recommendations enter the clinical workflow.

Aurelyn: Workflow telemetry measures where and how recommendations are consumed, surfacing friction points objectively.

c. How should perceived value, scalability, and operational feasibility be measured?

Adoption metrics, net-promoter-style value scores, cost-per-decision, and scalability stress tests across sites and indications.

Aurelyn: Cost-per-decision and adoption are computed from usage telemetry; SaaS architecture supports scalability load-testing.

The RFI grounds trustworthy AI in the seven characteristics of the NIST AI Risk Management Framework. Aurelyn's Governance & Assurance layer maps a concrete control and a measurable signal to each — the answer to RFI question B.5 in matrix form.

NIST AI RMF Characteristic	Aurelyn Control	Measurable Signal
Valid & Reliable	Context-of-use-scoped credibility assessment; analytical + clinical validation gate before integrated use.	AUROC, calibration, external-validation pass/fail
Safe	Model-risk tiering, fail-safe defaults, mandatory human-in-the-loop on consequential outputs.	Override rate, residual-risk profile, harm events
Secure & Resilient	Part 11-validated environment, access controls, drift & adversarial monitoring.	Drift index (PSI), incident count, uptime
Accountable & Transparent	Immutable audit trails, model & system cards, decision provenance.	Audit completeness, card coverage, traceability
Explainable & Interpretable	Explanation outputs per recommendation; black-box behavioral testing for proprietary models.	Explanation-fidelity score, user comprehension
Privacy-Enhanced	De-identification, data-lineage tracking, governed access by context-of-use.	Re-identification risk, lineage completeness
Fair, with Harmful Bias Managed	Mandatory subgroup performance slicing; parity thresholds with governance review on breach.	Subgroup parity gap, bias-audit findings

Aligned with NIST AI RMF 1.0 trustworthy-AI characteristics and the FDA draft guidance "Considerations for the Use of Artificial Intelligence to Support Regulatory Decision-Making for Drug and Biological Products." Aurelyn implements these as enforceable platform controls, not policy documents alone.

Beyond the RFI itself, Aurelyn Trial | OS™ is engineered against the wider regulatory frame an early-phase AI pilot must satisfy. The crosswalk below shows where each framework is addressed in the platform.

Framework	Relevance to the Pilot	Aurelyn Coverage
FDA Draft Guidance (AI / RDM): Use of AI to support regulatory decision-making	Defines context-of-use and risk-based credibility assessment for AI evidence.	Native COU registration + 7-step credibility-assessment workflow.
NIST AI RMF 1.0 (Trust)	The RFI's explicit trustworthy-AI reference.	Seven characteristics implemented as enforceable controls (see the NIST crosswalk above).
21 CFR Part 11 (Records)	Electronic records & signatures for any system in the trial record.	Validated environment, e-signatures, tamper-evident audit trails.
ICH E6(R3) (GCP)	Modern, risk-based GCP expectations including computerized systems.	Risk-based quality, oversight, and data-governance baked into workflows.
21 CFR 312 / 50 (IND · Consent)	IND conduct and informed-consent integrity in early-phase trials.	Document intelligence over ICF/ISF; deviation & consent tracking.
GMLP (ML Lifecycle)	Good Machine Learning Practice for the AI development lifecycle.	Versioning, change control, monitored retraining, validation gates.
CDISC TMF Ref. Model (Data)	Standardized trial-records structure for completeness scoring.	eTMF Intelligence Engine™ auto-classifies to the reference model.
EU AI Act (High-Risk AI)	Forward-compatibility for sponsors operating in the EU.	Risk-tiering, transparency, and human-oversight controls map to high-risk obligations.

Framework names refer to the FDA's January 2025 draft guidance on AI in regulatory decision-making, NIST AI RMF 1.0, ICH E6(R3) Good Clinical Practice, the FDA/Health Canada/MHRA Good Machine Learning Practice guiding principles, the CDISC TMF Reference Model, and EU Regulation 2024/1689 (EU AI Act). Coverage reflects platform design intent for evaluation in the pilot.

04 — Evaluation Telemetry

What the pilot would actually measure

Aurelyn proposes these as the headline, pre-registered targets a pilot could test. The figures below are illustrative design targets and hypotheses — the platform's purpose is to measure them rigorously, not to assert them as proven outcomes.

↓30%

Target reduction in Phase 1→2 transition time

Clinical Evidence Engine™

↓40%

Target reduction in safety-signal detection latency

Safety Sentinel Engine™

↓25%

Target reduction in screen-fail rate via enrichment

Cohort Intelligence Engine™

100%

Decisions with logged human-in-the-loop & provenance

Governance & Assurance

Cycle-time impact by decision point

Baseline vs. AI-supported (illustrative target ranges; traditional baseline = 100%)

Patient screening & enrollment: −25% (target)

Dose-escalation decision: −35% (target)

Phase 1→2 go/no-go: −30% (target)

Safety-signal triage: −40% (target)

Evaluation coverage

Share of RFI Category-B questions with native telemetry

Native, pre-instrumented telemetry maps to nearly every Category-B evaluation question — minimizing bespoke measurement scaffolding during the pilot.

Target figures represent platform design hypotheses to be tested under the pilot's pre-registered analysis plan; they are not claims of demonstrated clinical results. Actual effects depend on indication, sponsor maturity, and comparator design, and would be evaluated per RFI Category B.

05 — Pilot Lifecycle

A governed path from onboarding to evidence

How an Aurelyn-supported participant would move through the pilot, with the milestone gates recommended in answer A.5.

COU Lock

Context-of-use registered; model-risk tier assigned

Data Readiness

ALCOA+ gate; environment validated to Part 11

Shadow Mode

AI runs in parallel; concordance captured, no influence

Mid-Pilot Review

Safety, performance & drift checkpoint vs. pre-set criteria

Decision Capture

Phase 1→2 go/no-go logged with full provenance

Public Report

Model/system cards & lessons disseminated

06 — Engage

Aurelyn AI Clinical seeks to participate

We welcome the opportunity to contribute to the FDA's pilot as a technology and assurance partner — and to submit this framework to Docket FDA-2026-N-4390. Aurelyn Trial | OS™ is ready to operationalize trustworthy AI in early-phase trials today.
