
Pharma's Data Fabric Was Always for Humans. AstraZeneca Just Defined What It Looks Like for AI Agents.

20.05.26

Something is shifting this month in how big pharma talks about AI, and I don’t think it has fully registered yet.

In Cancer Discovery, Richard Goodwin, Simon Barry, James Weatherall, Stefan Platz and Jorge Reis-Filho, all from AstraZeneca’s R&D and Enterprise AI leadership, published a piece titled “Enabling AI to Drive Innovation and Precision across Oncology R&D” [1]. It is an important position paper: the clearest articulation yet, from inside a top-five pharma, of what AI-era R&D actually requires as foundational infrastructure. It deserves to be read alongside the Genentech/Roche “Lab in the Loop” AI factory framing from Aviv Regev [2], GSK’s Onyx data engineering organisation that Kim Branson talks about, AbbVie’s ARCH knowledge platform, and the Pistoia Alliance’s new Agentic AI initiative seeded by Genentech [3]. Together, these aren’t separate stories. They are the same story, told from different organisations, about how big pharma R&D (and life sciences generally) is evolving.

Pharma has been talking about data fabrics for years; the term has been in pharma vocabulary since the late 2010s. What is new in this paper is not the fabric idea. It is the specific articulation of what a data fabric has to look like when the primary consumer is no longer a human analyst but an AI agent. The line I keep returning to is this one, in the section on foundational principles applying across all tiers of AI use: “semantic layers that support integration for model or agent consumption.” Ten words that, I think, define an important part of the next few years of pharma R&D infrastructure investment.

There is also a phrase later in the paper that is more direct: “In a world where AI capabilities become broadly accessible, it is the unique data fabric of each organisation and not the models themselves that will increasingly define its edge.” The authors are telling pharma leadership, plainly, that the model layer is commoditising and that the durable competitive asset is what each organisation does to make its proprietary data trustworthy, harmonised, traceable, and AI-callable. They propose extending FAIR to FAIR+T, with T for Trustworthy: lineage, quality, accountability, learnability. They describe a “Therapeutic Index engine” as the structural intellectual property of an AI-native pharma. They argue that the pace of R&D will be defined by the speed and ownership of the learning loop.

I want to talk about what this means, where I think the authors are correct, where I think the implementation reality is harder than the paper suggests, and why this moment matters for everyone working in life science R&D — including the scientists themselves, whose roles are about to change more than most realise.

A note before I go further. At Genestack we have spent more than a decade building exactly the kind of substrate the paper describes, for companies like those on the top pharma list. Divisions in these and other top pharma and life science organisations use our Open Data Manager as the harmonised hub for their multi-omics work; creating data infrastructure alongside these teams has shaped our thinking enormously. So I am not a neutral observer here. I am an implementer, reading the customer-side articulation of a problem we have been building toward, and recognising the architecture being described.

On the data fabric thesis, and what builds on top of it. The authors are right that proprietary data (harmonised, governed, traceable, reusable) is the durable edge, and right that this becomes more valuable, not less, as foundation models commoditise. They are also right, and the paper is explicit about this, to treat AI agents as current consumers of the substrate, not future ones. The Tier 3 framing of “semantic layers for model or agent consumption” is in the present tense, and it should be. What I would add from our deployments over the last 18 months is that agents are not only consumers of the data fabric: they are a forcing function on it. Pointing an AI agent at pharma data exposes every harmonisation gap, every ontology mismatch, every ambiguous column name, every silently versioned protocol. Every slow query, poorly designed schema or inefficient data structure becomes a multiplying bottleneck. An agent will confidently work for hours or days and produce wrong answers against ungoverned data, making errors that human analysts would have caught reflexively. This is happening right now, in every pharma that has tried to point an agent at its lakehouse and discovered that governance of the bytes, which is what Unity Catalog or Horizon Catalog provide, is not the same as governance of the meaning. The data fabric isn’t just an asset for AI; it is what makes the difference between an AI that helps and an AI that produces reasonable-looking, traceable falsehoods.

On FAIR+T. This is the proposal in the paper most likely to be adopted as industry vocabulary, and rightly so. FAIR has been a foundational achievement of the last decade — the Pistoia Alliance FAIR for Pharma community, in which several of our customers participate, has done extraordinary work to make it operational. The authors’ case for adding Trustworthy is correct, and I want to say specifically why I think the +T will stick where other proposed FAIR extensions have not. FAIR was designed primarily for findability and reuse by humans and deterministic computation. The consumer profile in 2026 is different: a non-deterministic agent that will compose its own queries, draw its own inferences, and generate its own conclusions. The +T dimensions — lineage, quality, accountability, learnability — are exactly the properties that distinguish data that an agent can responsibly act on from data that it cannot. Without lineage, an agent’s output cannot be audited. Without quality flags, it cannot weigh the inputs. Without accountability and proof of trustworthiness, no one can defend the answer to a regulator. Without learnability, the substrate doesn’t improve. The authors have given the field a label that captures something that needed a label. Let’s use it.
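To make the +T dimensions concrete, here is a minimal Python sketch of what carrying them on a record might look like. The class and field names (`TrustMetadata`, `quality_flag` and so on) are illustrative assumptions, not anything defined in the paper or in a FAIR specification. The point is only that each dimension becomes something an agent can check before it acts.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TrustMetadata:
    """Hypothetical +T annotations attached to one dataset or record."""
    lineage: List[str] = field(default_factory=list)      # upstream sources and processing steps
    quality_flag: Optional[str] = None                     # e.g. "pass" or "warn"; None means unassessed
    accountable_owner: Optional[str] = None                # who answers for this record
    corrections: List[str] = field(default_factory=list)  # learnability: feedback recorded so far


def missing_trust_dimensions(meta: TrustMetadata) -> List[str]:
    """List the +T dimensions that are absent, so an agent can refuse or caveat."""
    missing = []
    if not meta.lineage:
        missing.append("lineage")
    if meta.quality_flag is None:
        missing.append("quality")
    if not meta.accountable_owner:
        missing.append("accountability")
    return missing


def record_correction(meta: TrustMetadata, note: str) -> None:
    """Learnability: keep reviewer feedback with the record so the substrate improves."""
    meta.corrections.append(note)
```

An agent wrapper could call `missing_trust_dimensions` before using a record and either decline or attach a caveat to its answer when the list is not empty.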

On the architecture of agentic frameworks — a question the paper leaves open. The authors note in the Tier 3 section that current agentic frameworks combine specialised models with well-annotated data to solve multistep problems. That description fits the present state of the field, but it leaves open what is, in my view, the more urgent architectural question: should pharma operations be exposed to agents as pre-built tools, or as a composable schema plus rich skills that the agent reads and applies?

Our experience suggests both, in different proportions than people assume. Pre-built tools are the right answer wherever the value of the operation depends on it being executed the same way every time: regulatory submissions where named, validated operations are easier to defend than agent-composed SQL; standard analytical workflows that an organisation has agreed on and doesn't want re-derived on each invocation; dashboards and decision-support surfaces where stakeholders need to trust that the underlying computation isn't varying; statistical analyses where reproducibility is the whole point. These are not edge cases. They are a meaningful fraction of pharma data work, and underweighting them is a mistake.

But for the much larger space of exploratory, hypothesis-driven analysis, where the question is genuinely novel each time and what matters is composability against the data, the more durable architecture is a well-designed, ontology-mapped, agent-readable schema, exposed via a standard protocol (e.g. the Model Context Protocol), combined with a set of tested and agreed-upon skills (rich, machine-readable patterns of pharma analytical reasoning that the agent reads, internalises, and applies through composition). The agent composes the analysis; the skills constrain it to do so correctly. Pre-built tools alone have a combinatorial-explosion problem for this space: there are too many legitimate question patterns to ever enumerate them all in APIs. Schema plus skills handles the long tail of pharma hypothesis testing in a way that pure tool catalogues cannot.

The right architecture, we think, is both: a small, deliberate catalogue of named operations for the reproducibility cases, sitting alongside a rich schema-plus-skills layer for everything else. We have built this combination at Genestack and are testing it internally now, with the first customer-facing deployments coming in the next few months. Where exactly the line should sit between the two — what counts as a "nail it down" operation versus a "compose freely" operation — is a question the field is going to have to work through together over the next few years, and one we'd genuinely like to discuss with others working in this space.
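The split between the two modes can be sketched in a few lines of Python. Everything here is hypothetical: the operation name, the schema columns, the placeholder ontology identifiers and the `dispatch` function are illustrative only, and this is neither the Model Context Protocol nor any Genestack interface. It shows only the shape of the idea: a small registry of named operations that always run the same way, and a schema handed to the agent for everything else.

```python
from typing import Any, Callable, Dict, List

# Hypothetical catalogue of named, validated operations ("nail it down").
NAMED_OPERATIONS: Dict[str, Callable[..., Any]] = {}


def named_operation(name: str):
    """Register a function under a stable name so it runs identically every time."""
    def register(fn):
        NAMED_OPERATIONS[name] = fn
        return fn
    return register


@named_operation("mean_response")
def mean_response(values: List[float]) -> float:
    return sum(values) / len(values)


# Hypothetical agent-readable schema ("compose freely"). Ontology IDs are placeholders.
SCHEMA = {
    "tumour_response": {"type": "float", "ontology_term": "EXAMPLE:0001"},
    "cohort": {"type": "str", "ontology_term": "EXAMPLE:0002"},
}


def dispatch(request: Dict[str, Any]) -> Dict[str, Any]:
    """Run a named operation if one is requested; otherwise hand back the schema."""
    name = request.get("operation")
    if name in NAMED_OPERATIONS:
        result = NAMED_OPERATIONS[name](**request.get("args", {}))
        return {"mode": "named", "result": result}
    return {"mode": "compose", "schema": SCHEMA}
```

A request naming `mean_response` goes through the fixed code path; a request with no recognised operation gets the schema back, and the agent composes its own query against it. Whether an unknown operation name should fail loudly rather than fall through to composition is one of the design choices this sketch leaves open.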

On the Therapeutic Index engine — and why it matters more than it first appears. This is the most strategically interesting section of the paper. The authors describe TI as the connective layer of the entire design–make–test–learn loop — a multimodal learning system spanning mechanistic, pathological, PK/PD, safety, and multimodal datasets, becoming “structural intellectual property” that improves as new experiments accumulate.

What the paper does not say explicitly, but is worth saying, is that the TI engine is the Target Product Profile made executable. Every drug program in pharma already runs against a TPP — the document that captures what success looks like in terms of efficacy, safety, dosing, patient population, and differentiation from standard of care. Today, the TPP is a slide deck that gets updated quarterly by human judgement. What Goodwin and colleagues are describing is a TPP that learns — a continuously updated, data-grounded, multimodal estimate of whether the program is still tracking toward its target criteria, refreshed every time new experiments produce new data. The TPP captures intent; the TI engine continuously evaluates progress against that intent across every modality the program touches. That is an enormous conceptual upgrade and, I suspect, the right mental model for how AI changes pharma R&D operationally: not new tools layered on top of old artefacts, but the old artefacts becoming living systems.

It is also, of course, technically far harder than writing a slide deck. A TI engine of the kind that the authors describe is not just a multimodal data system with good traceability — it requires a semantic layer. It needs the meaning underneath: the harmonised schema that makes mechanistic, pathological, PK/PD, safety, and multimodal data comparable in the first place; the ontology mappings that let the engine know a "tumour response" in this study is the same concept as in another study; the curation history that records why a sample was assigned to a cohort; the agent-readable patterns of pharma analytical reasoning that turn the substrate into something that the agent can act on correctly. Without that semantic layer, a TI engine is a multimodal slide deck with extra compute.
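Setting the semantic layer aside for a moment, the “TPP that learns” idea reduces to a loop like the one in this minimal Python sketch. The criterion names and thresholds are invented for illustration and come from neither the paper nor any real program. The sketch also assumes the incoming values have already been harmonised, which is exactly the hard part.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Criterion:
    """One hypothetical TPP target: a named measure with an acceptable range."""
    name: str
    minimum: Optional[float] = None
    maximum: Optional[float] = None

    def status(self, value: Optional[float]) -> str:
        if value is None:
            return "no data"          # never let a missing value pass silently
        if self.minimum is not None and value < self.minimum:
            return "off track"
        if self.maximum is not None and value > self.maximum:
            return "off track"
        return "on track"


def evaluate(tpp: List[Criterion], latest: Dict[str, float]) -> Dict[str, str]:
    """Re-score every criterion against the most recent harmonised values."""
    return {c.name: c.status(latest.get(c.name)) for c in tpp}


# Illustrative values only.
tpp = [
    Criterion("response_rate", minimum=0.30),
    Criterion("grade3_ae_rate", maximum=0.10),
    Criterion("half_life_h", minimum=12.0),
]
latest = {"response_rate": 0.35, "grade3_ae_rate": 0.15}
report = evaluate(tpp, latest)
```

Re-running `evaluate` whenever new data lands is the “living” part. A criterion with no data is reported as such instead of counting as a pass.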

Two things follow that I think are worth stating plainly. The first is that the enterprise data platforms pharma has been consolidating on (Databricks, Snowflake, and Microsoft Fabric in particular) have done something that is genuinely important for the industry over the last five years. They have made it possible, for the first time, to bring the messy reality of pharma R&D data into a single governed plane at enterprise scale, often spanning multiple clouds and data sources within one organisation. The work these platforms have done on fine-grained access control, lineage tracking, cross-source governance, and compute at petabyte scale is the foundation that everything we are now talking about (agentic AI, learning loops, TI engines) gets to assume as table stakes. That foundation is necessary, and the engineering investment behind it should not be underestimated. It is also, on its own, not sufficient. What these platforms do not do, and were not built to do, is encode the pharma-specific meaning that turns governed bytes into agent-callable knowledge. That is the work of the semantic layer above, and it is genuinely additional work, not duplicated work.

The second thing worth saying is the converse: the semantic layer should be well integrated with the platform underneath, not parallel to it or in tension with it. Governance, lineage, and compute belong to the enterprise platform; that is where they should live. Meaning belongs to the semantic layer. Good infrastructure in 2026 looks like these two layers working as one system, each doing what it is best at, with the semantic layer aligning closely to the platform a customer has made their strategic centre. That is the integration design pharma needs, and it is one most current data stacks do not yet have in place.

On the part that has been mostly absent from public commentary, and that I think matters most. Underneath the technical argument, the paper is also a thoughtful statement about how scientific work is going to change. The authors are explicit: AI is “a new consumable” that requires annual budgeting like reagents; AI solutions are durational, not permanent; “experiments do not always become cheaper with AI; they become faster and more intelligent and yield more complete insights.” Roles will change. Domain experts will not be replaced — they will be augmented, with their judgement deployed where it matters most, while AI handles the parts of the workflow that were previously friction. New specialisations in AI oversight, validation, and interpretation will emerge. The relationship between bench scientists, data scientists, and AI agents is being renegotiated, quietly, in every pharma R&D organisation right now.

The companies that come out of this transition strongest will not be the ones with the most sophisticated models. They will be the ones whose people, processes, and data infrastructure are organised around the new reality: that the question worth asking is not “what can the AI do” but “what can our scientists, augmented by trustworthy AI working on our trustworthy data, do that no one else can.” The paper’s authors understand this. The strategic frame is correct.

For those of us building the infrastructure, the work for the next two years is clear. Make the substrate. Make it semantic. Make it agent-callable. Make it trustworthy. Make it loop-fast. And make it compatible with the data platforms pharma has already chosen.

We have just released a short video showing what some of this looks like in practice: bioinformaticians querying terabyte-scale, harmonised multi-omics data in seconds, and translational scientists working in lightweight apps over the same substrate. The agentic version is coming in a few weeks. I’ll share it here. In the meantime, if Goodwin and colleagues’ paper resonates and you are thinking about what this means for your own R&D infrastructure, I’d be glad to talk. The conversation the AZ team has opened is the right one. The implementation, as ever, is where the real work happens.

References 

1. Goodwin, R. J. A., Barry, S. T., Weatherall, J., Platz, S. J., & Reis-Filho, J. S. (2026). Enabling AI to Drive Innovation and Precision across Oncology R&D. Cancer Discovery, 16(5), 847–851. https://doi.org/10.1158/2159-8290.CD-26-0271

2. Frey, N. C., Hötzel, I., Stanton, S. D., Kelly, R., Alberstein, R. G., Makowski, E. K., Martinkus, K., Berenberg, D., Bevers, J., Bryson, T., Chan, P., Chen, Y., Czubaty, A., D’Souza, T., Dwyer, H., Dziewulska, A., Fairman, J. W., Goodman, A., Hofmann, J., … Gligorijević, V. (2025). Lab-in-the-loop therapeutic antibody design with deep learning. https://doi.org/10.1101/2025.02.19.639050

3. Taylor, H. (2025, September 4). Pistoia Alliance unveils agentic AI initiative and seeks industry funding to drive safe adoption. Pistoia Alliance. https://pistoiaalliance.org/ai/pistoia-alliance-unveils-agentic-ai-initiative-and-seeks-industry-funding-to-drive-safe-adoption/ 
