Jérémie

About the AIVC paper

2024-12-02T00:00:00+00:00

Reading the AIVC manuscript further, I felt like adding some more comments on this topic, which I did before on X too.

Being able to model a cell with large amounts of multimodal data and multi-scale AI models is something I have talked about before on X and in my recent manuscript.

To my mind there are two questions:

Do we have the right algorithm and compute?
Do we have the right data and scale?

For the first part, while this is just a guess, the abilities of recent AI models and the scale of their deployment in language, vision, and structural biology suggest that we might be able to model a cell with some $\epsilon$-accuracy using current software and hardware, e.g. a 10^13-parameter model trained and served on 10,000 GPUs.

For the second part it is harder to be as optimistic. A human body is composed of about 100 trillion cells and microbes, all shaped by their environment. The cell doesn’t, in fact, follow the simple diagram presented in the AIVC paper:
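As a back-of-the-envelope check on that scale, here is a minimal sketch of the memory needed per GPU just to hold the model. The byte counts are assumptions, not figures from the AIVC paper: 2 bytes per parameter for 16-bit inference weights, and roughly 16 bytes per parameter for mixed-precision training with an Adam-style optimizer (weights, gradients, and optimizer states). Activations and communication overhead are ignored.

```python
def per_gpu_gigabytes(n_params, bytes_per_param, n_gpus):
    """Memory per GPU (in GB) if the model state is sharded evenly across GPUs."""
    total_bytes = n_params * bytes_per_param
    return total_bytes / n_gpus / 1e9


N_PARAMS = 10**13   # model size quoted in the text
N_GPUS = 10_000     # cluster size quoted in the text

# Assumption: 16-bit weights only (inference).
inference_gb = per_gpu_gigabytes(N_PARAMS, 2, N_GPUS)

# Assumption: ~16 bytes per parameter for mixed-precision training state.
training_gb = per_gpu_gigabytes(N_PARAMS, 16, N_GPUS)

print(f"inference: {inference_gb:.1f} GB per GPU")
print(f"training:  {training_gb:.1f} GB per GPU")
```

Under these assumptions the sharded state fits comfortably in the memory of a current accelerator, which is consistent with the guess that compute is not the main bottleneck.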

But likely this diagram:

Proteins are not RNAs and not genes. While a gene stays the same, many different RNAs can come out of it. The same goes for proteins. And vice versa. Thus one cannot encode it all in one representation: they are different objects.

Is it just a modelling problem? No, and these apparent simplifications are not there by mistake. They are due to our measurement abilities or, actually, inabilities. We cannot measure these diverse proteins / RNAs or their interactions well… yet (#ExSeq, #DNApaint, …).

We can indeed use AI to model proteins, but we had known the theory of proteins and their elements, and had been able to measure and model their structures, for decades prior.

So to me the future of AIVC is not really in the hands of the AI community but in the hands of the bio community. How do we measure these things better so as to have at least a good picture of how they work?

What are large cell models?

2024-09-07T00:00:00+00:00

In my recent paper scPRINT I define the term Large Cell Model but what are they?

AI is everywhere nowadays, and in biology it is helping us enter a new era of what Jacob Kimmel describes as predictive biology.

In this new field, we use complex computational approaches together with high-throughput data gathering to generate predictive models of physical phenomena like protein folding, binding, regulation, response, etc. 🧬

scPRINT, like scGPT, Geneformer, scFoundation, scLM and more, is part of a breed of AI tools 🤖 that try to better model cell regulation and response using a large number of parameters and a large amount of data. In this context they are -large- -cell models-. 🦠

Ancestry Bias in CRISPR Screens

2024-06-21T00:00:00+00:00

🎉 Thrilled to share our paper on ancestry biases in CRISPR screens published in Nature Communications! This work, led by Sean Misek, reveals an important discovery about potential biases in CRISPR screening technology. You can read the full paper here: https://www.nature.com/articles/s41467-024-48957-z

How It Started

🧐 While working in the Cancer Data Science group, I was exploring ways to extract interesting new features from our CCLE omics dataset that might reveal relationships between cancer vulnerabilities and genetics.

💡 One unexplored avenue was computing not just the cancer mutations in our cell lines, but also the germline mutations. The reasoning was twofold: some mutations might have been wrongly categorized as somatic, and we know there exist many germline mutations that can lead to cancer.

🔒 However, this idea faced some initial resistance as team members were concerned about releasing potentially identifiable patient-related data.

The Discovery

🚀 Despite the pushback, I decided to implement the germline analysis pipeline and generate the data while awaiting the final decision on its release. Fortuitously, Sean Misek, a postdoc at the Broad, approached us with the same idea, wanting to use germline mutations for GWAS analysis.

🕵️‍♂️ Sean’s analysis revealed surprisingly strong associations, but with genes completely unrelated to cancer. The breakthrough came when we examined how CRISPR guides were overlapping with these high association SNPs. We discovered that these SNPs were preventing the guides from binding to their target regions, creating false non-essentiality results. Even more significantly, when examining ancestry data we had recently computed, the affected cell lines were predominantly from patients of African descent.
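The core check described here, finding guides whose target region contains a germline SNP, can be sketched in a few lines. This is a minimal illustration and not the pipeline from the paper: the tuple layouts, the function name `guides_hit_by_snps`, and the coordinates are all hypothetical, and intervals are treated as inclusive.

```python
from bisect import bisect_left
from collections import defaultdict


def guides_hit_by_snps(guides, snps):
    """Return names of guides whose [start, end] interval contains a SNP.

    guides: iterable of (name, chrom, start, end), coordinates inclusive.
    snps:   iterable of (chrom, position).
    """
    # Index SNP positions per chromosome and sort them for binary search.
    positions = defaultdict(list)
    for chrom, pos in snps:
        positions[chrom].append(pos)
    for chrom in positions:
        positions[chrom].sort()

    hit = []
    for name, chrom, start, end in guides:
        chrom_pos = positions.get(chrom, [])
        # First SNP at or after the guide start; it overlaps if it is <= end.
        i = bisect_left(chrom_pos, start)
        if i < len(chrom_pos) and chrom_pos[i] <= end:
            hit.append(name)
    return hit


# Hypothetical toy data.
guides = [("g1", "chr1", 100, 122), ("g2", "chr1", 200, 222), ("g3", "chr2", 100, 122)]
snps = [("chr1", 110), ("chr2", 50), ("chr1", 223)]
affected = guides_hit_by_snps(guides, snps)
print(affected)
```

In a real analysis the flagged guides would then be cross-referenced with per-cell-line genotypes, since a SNP only disrupts the guide in lines that actually carry the variant.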

Impact

🤩 What started as a search for new cancer targets led to the discovery of a significant bias in our CRISPR guide library affecting 1-2% of the guides. This finding has now been incorporated into our CRISPR dataset and has important implications for ensuring more equitable and accurate research outcomes.

This project represents a true scientific journey - starting from an initially contested idea, leading to unexpected results, and culminating in an important discovery that helps address representation issues in cancer research. 🌍

Managing GRNs and What They Mean

2024-06-17T00:00:00+00:00

In my recent paper scPRINT, I went into a deep dive on gene regulatory networks. Many findings are in the manuscript but I had some additional thoughts to share and wanted to present a tool I built to work with GRNs.

What are Gene Networks?

First, we mostly don’t use the term GRN but rather Gene Networks. This is because when GRNs are mentioned, they are almost always restricted to TF-gene connections. But gene expression / regulation is way more complex than this and driven by many more elements than just TFs. ♾️

Indeed, regulation happens with TFs but also with co-factors, cohesins and other proteins that transport, degrade, and splice the RNAs, shape the DNA, and work with TFs. Not only that, but it is also well known that TFs themselves can be influenced and activated by pathways comprising many genes less related to transcription regulation. 🦠

Finally, a new realization in biology is that regulation also happens through RNAs interacting with each other and with the DNA. So much is happening, but we are fixated on TF-genes… 🔍

The Problem with Simple GRNs

The big issue, in my opinion, is when these simple computationally inferred GRNs are used to simulate gene expression or are conflated with a model of the cell! 😲

The cell is very complex and TF-gene binary connections won’t cut it, by a mile. ⛔ This is to me a big problem in the GRN and GRN inference field. We will need to better define what GRNs are and maybe use another term to talk about the TF-gene constrained networks: like TFGN instead of GRN.

Given all this information -and the still unknown amount of non coding RNAs in human cells- we have to accept that a true Gene Regulatory Network should include all RNAs, non coding RNAs, DNA regions, proteins, and likely other molecular products. 👩‍👩‍👧‍👦

Finally, connections are not binary, nor are they simply weighted: molecular interactions are complex. They can form complexes, interact partially, and be conditional. We will also need better representations of connections, likely using embedding representations of vertices. 🔀

Where Do We Go From Here?

1. Theoretical Foundations 📋

We need good theoretical definitions that most people agree on, however murky things still are to us.

2. Better Tools and Infrastructure 🧰

We need to build tools, visualisations, data structures, and benchmarks that represent these definitions and concepts and help us interrogate the current and future datasets we might generate. This is what I tried to start with GRnnData and benGRN. But a lot more work is needed.

3. Advanced Sequencing Methods 📏

We need better sequencing methods than mRNA sequencing. Single cell is only a first step but we need to measure non coding and rare RNAs with much higher precision.

Who knows, it might be that a large-scale, high-depth, single-cell total-RNA-sequencing methodology with correct molecular amplification / reduction methods is all that we need to understand regulation… 🤔

GRnnData

In scRNAseq data analysis, we have been fortunate to get many great Python toolkits like anndata and scanpy, part of the scverse suite of tools, allowing us to work together with a common set of standards. 📜

However, working with gene network methods, I have seen various ways to store them throughout the different papers and benchmarks, often as some kind of tsv/csv/… file with some kind of gene-gene list. This lack of a standard made it quite hard for me and other members of the lab to work with gene networks 🙉

Interestingly, there is a possible standard for it! 🎉 AnnData contains the .varp field, which is made to store var-to-var (e.g. gene-to-gene) relationships. However, not many people use it…

I thus decided to formalize this usage by creating GRnnData: it is basically a way to import many different gene-gene formats into an AnnData file. 💁 But it also contains more bells and whistles to work with gene networks, like subsetting the network to some genes, extracting targets, and plotting the networks 💹.

But GRnnData can do more and integrates some utility functions doing things like clustering, centrality measures, enrichment and more. Go check GRnnData (https://github.com/cantinilab/GRnnData) if you work with gene networks!
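To illustrate the idea of turning a gene-gene list into a var-by-var matrix, here is a minimal pure-Python sketch. It does not use or reproduce the GRnnData API: the function names `edges_to_matrix` and `targets_of`, the gene names, and the weights are all hypothetical. The resulting square matrix, ordered like the genes of the dataset, is the kind of object that the `.varp` field of an AnnData is meant to hold.

```python
def edges_to_matrix(genes, edges):
    """Build a dense gene-by-gene matrix from (regulator, target, weight) edges.

    Rows are regulators and columns are targets, both in the order of `genes`.
    """
    index = {gene: i for i, gene in enumerate(genes)}
    n = len(genes)
    matrix = [[0.0] * n for _ in range(n)]
    for regulator, target, weight in edges:
        matrix[index[regulator]][index[target]] = weight
    return matrix


def targets_of(genes, matrix, regulator):
    """List the genes with a non-zero connection from `regulator`."""
    row = matrix[genes.index(regulator)]
    return [gene for gene, weight in zip(genes, row) if weight != 0]


# Hypothetical toy network, as it might be read from a tsv gene-gene list.
genes = ["GENEA", "GENEB", "GENEC"]
edges = [("GENEA", "GENEB", 0.8), ("GENEA", "GENEC", 0.1), ("GENEC", "GENEB", 0.5)]

network = edges_to_matrix(genes, edges)
print(targets_of(genes, network, "GENEA"))
```

A dense list of lists is only reasonable for toy sizes; for real networks over thousands of genes a sparse matrix would be the natural choice.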

AUPRC vs AP: Evaluating Binary Classification

2024-06-09T00:00:00+00:00

When analysing a classification task for my recent paper scPRINT, I fell into a fascinating “data-sciency” rabbit-hole about AUPRC, AUC and AP.

Understanding Classification Metrics

When evaluating a binary classification model that outputs probabilities or has internal regularization, we need ways to assess its performance across different decision thresholds.

ROC Curves

One common approach is plotting the ROC curve, which shows the tradeoff between true positive and false positive rates. The area under this curve (AUC/ROC-AUC/ROC/AUROC) is a popular metric:
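To make the quantity concrete, here is a minimal sketch (not the code used for scPRINT) that computes AUROC through its rank interpretation: the probability that a randomly chosen positive is scored above a randomly chosen negative, with ties counted as one half. The function name and toy data are illustrative.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
print(auroc(labels, scores))
```

The pairwise loop is quadratic in the number of samples, which is fine for illustration; a sort-based implementation is preferable on large edge lists.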

The Need for Precision

However, in some cases like predicting graph edges in sparse gene networks, precision becomes crucial. AUROC can miss important information about optimal cutoff points. 🚫

AUPRC and Its Challenges

This is where AUPRC (PR-AUC) becomes valuable, focusing on precision and recall:

However, AUPRC comes with its own challenges: ⚠️

Random precision baselines differ between tasks, making direct comparisons difficult
Jagged results require careful sampling of PR curves 😟
In gene network inference, many links lack prediction values, requiring special handling 🔥
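The first two points can be seen on a toy example. The sketch below (illustrative names and data, not the rAUPRC implementation) computes the precision expected from a random ranking, which is simply the fraction of positives, together with the raw precision-recall points obtained by walking down the ranked predictions. Ties between scores are not handled specially.

```python
def random_precision(labels):
    """Expected precision of a random ranking: the fraction of positives."""
    return sum(labels) / len(labels)


def pr_points(labels, scores):
    """(recall, precision) after each prediction, from highest to lowest score."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    total_positives = sum(labels)
    true_positives = 0
    points = []
    for rank, i in enumerate(order, start=1):
        true_positives += labels[i]
        points.append((true_positives / total_positives, true_positives / rank))
    return points


labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
baseline = random_precision(labels)
curve = pr_points(labels, scores)
print(baseline)
print(curve)

# A sparse task (1 true edge out of 100) has a much lower random baseline.
sparse_baseline = random_precision([1] + [0] * 99)
print(sparse_baseline)
```

The precision values jump up and down as recall increases, which is the jaggedness mentioned above, and the two baselines show why raw AUPRC values from tasks with different positive rates are not directly comparable.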

Introducing rAUPRC

To address these limitations, I developed rAUPRC, which includes:

Correction for random precision
Proper handling of incomplete predictions
A modified approach for drawing the curve from partial to complete recall

Average Precision (AP)

AP is computed as:
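One common definition, used for example by scikit-learn's `average_precision_score`, sums the precision over decision thresholds, weighted by the increase in recall. Here $P_n$ and $R_n$ denote precision and recall at the $n$-th threshold (this notation is assumed here, not taken from the original post):

$$\mathrm{AP} = \sum_n (R_n - R_{n-1})\, P_n$$

Unlike a trapezoidal AUPRC, this sum does not interpolate between consecutive points of the curve.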

While AP is often recommended for skewed classification tasks, my rAUPRC implementation showed more consistent results across different parameters in practice.

Enrichr, Prerank, GSEA or ssGSEA?

2024-02-19T00:00:00+00:00

A bioinformatician’s main tool for discovery has often been differential expression analysis. But between Enrichr, Prerank, GSEA, and ssGSEA, which tool should you use? Here is the quick reminder X-plainer:

Enrichr is when you just have a list of genes (can be small):

GENEA, GENEB, GENEC
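For a plain gene list, the usual question is whether the list overlaps a gene set more than expected by chance. A minimal sketch of that over-representation test, assuming a hypergeometric (one-sided Fisher) model, is given below. It illustrates the general technique rather than Enrichr's exact scoring, and the function name and numbers are hypothetical.

```python
from math import comb


def overrepresentation_pvalue(n_universe, n_set, n_list, n_overlap):
    """P(overlap >= n_overlap) when n_list genes are drawn at random.

    n_universe: number of genes in the background.
    n_set:      number of background genes belonging to the gene set.
    n_list:     size of the query gene list.
    n_overlap:  observed overlap between the list and the gene set.
    """
    total = comb(n_universe, n_list)
    upper = min(n_set, n_list)
    tail = sum(
        comb(n_set, k) * comb(n_universe - n_set, n_list - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical example: 3 query genes, all in a 3-gene set, out of 10 genes.
p_value = overrepresentation_pvalue(n_universe=10, n_set=3, n_list=3, n_overlap=3)
print(p_value)
```

In practice the p-values from many gene sets are then corrected for multiple testing.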

Prerank is when you have values that you can rank your list of genes from:

Gene	Value
GENEA	12
GENEB	8
GENEC	4
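With a ranked list like this one, the enrichment score is a running sum walked down the ranking: it increases at genes that belong to the gene set and decreases otherwise, and the score is the largest deviation from zero. The sketch below follows the classic weighted running-sum formulation under stated assumptions: the function name, the `power` parameter, and the toy gene sets are illustrative, and no permutation-based significance is computed.

```python
def enrichment_score(ranked_genes, values, gene_set, power=1.0):
    """Running-sum enrichment score over a list already sorted by value."""
    hits = [gene in gene_set for gene in ranked_genes]
    weights = [abs(v) ** power if hit else 0.0 for v, hit in zip(values, hits)]
    hit_total = sum(weights)
    n_miss = len(ranked_genes) - sum(hits)
    if hit_total == 0 or n_miss == 0:
        raise ValueError("need at least one hit with non-zero value and one miss")

    running, best = 0.0, 0.0
    for weight, hit in zip(weights, hits):
        # Step up (weighted) on a hit, step down (uniformly) on a miss.
        running += weight / hit_total if hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


# The ranked table above, with two hypothetical one-gene sets.
ranked_genes = ["GENEA", "GENEB", "GENEC"]
values = [12, 8, 4]
top_score = enrichment_score(ranked_genes, values, {"GENEA"})
bottom_score = enrichment_score(ranked_genes, values, {"GENEC"})
print(top_score, bottom_score)
```

A set concentrated at the top of the ranking gets a positive score and a set concentrated at the bottom gets a negative one.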

GSEA works best when you have a list of genes, with values associated with them across multiple conditions (and replicates):

Gene	C1	C2	C3	D1	D2	D3
GENEA	12	7	3	1	1	0
GENEB	8	0	6	8	1	1
GENEC	4	4	3	2	3	4

ssGSEA is when you have many values associated with each gene but no two specific conditions, and want to compare one against the rest:

Gene	A	B	C	D	E	F
GENEA	12	7	3	1	1	0
GENEB	8	0	6	8	1	1
GENEC	4	4	3	2	3	4

The PhD Decision, Celligner2 and Foundational Models

2023-10-02T00:00:00+00:00

This is an important moment for me, and it seems that many things led me to this point. 😊 I have decided to join Laura Cantini at Institut Pasteur and Gabriel Peyré at ENS ULM for a joint Ph.D. in Computational Biology and Applied Mathematics. 🧬📊 I will try to explain here my decision process for something that can appear as a very sudden change of direction and slightly crazy, given my current position at WhitelabGx.

You can learn more about what I did at WhitelabGx here but I don’t think it would explain much of my decision. I will try to explain here the reasons that led me to this decision and the many lessons I learned along the way. 📚🤔

First, as you might be able to see in this life story of mine, I was always into science and research. I was offered quite a few Ph.D. positions along the way: at ECE, University of Kent, EPFL, and INSERM. I applied to quite a few competitive places myself. But I always found some other direction that I was more inclined to follow. It always seemed like I was pushing it for later or just thinking I could do without it. In a way I did. I went to work at the Broad Institute, worked on many research projects, published as first author, and even started working on my own projects. I became a research scientist in companies and even went on to lead scientists myself during my time at Whitelab.

So research, although not my only passion, has always been very important to me.🧪🔬 It always seemed obvious that I would go on and do a Ph.D. at some point. But why did I wait? Why now? 🤔

The story

I was not looking for any other positions. I had in mind to stay at least a few years at WhitelabGx, see where it had gone in that time, and then make a decision about my future.

However, I was in contact with Laura Cantini, with whom I had had discussions about Ph.D. projects before joining Whitelab. At some point, Laura came back to me with this Ph.D. proposal. I spent the better part of a month in this tough dilemma, thinking about it in all possible ways to try and find a solution that I would not regret in the future. 🤯

I will try to explain below what led me to the decision to take this Ph.D. opportunity. But still, I did not want to let Whitelab down. Toward the end of my review period, I asked for it to be extended and explained the reason why. That gave Whitelab 2.5 months to prepare for my departure. Moreover, I proposed to keep an advisory role in the company during my Ph.D. and to spend a week or so preparing my substitute for the position. I also did not take any vacation between my time at Whitelab and the Ph.D. so as to give as much time to Whitelab as I could. 🤝📅

I believe that this goes to show my dedication and interest in the company, its mission, and the respect I have for my colleagues. 🙏🏢

the why

All the opportunities I had were very exciting to me. They were allowing me to learn in some of the best environments. At Broad Institute, I learned how to do research, biology, genomics, and cutting-edge tools and techniques.

At Whitelab I learned how people build companies, how to lead a team, and how to start a unit from scratch, with a validated plan and product design, and lay out a business strategy for it, by working with clients, the BD team, and the management.

Why now? Well, there is no perfect time. But I could summarize it into one reason:

The Ph.D. topic and group which I believe was a great fit for me.

Obviously not everything can be summarized into one reason and not everything was perfect otherwise:

Whitelab was maybe a bit too early stage for me? At least it is something I thought about sometimes (see the Whitelab post).
I thought a lot about open source in computational biology and having an impact, which I will be able to outline further in this post.
I also had a couple of disillusionments during job interviews with some high-profile companies a few years back (see leaving Broad).

However, while working at Whitelab and even before that, I started to notice a change in the community. AI is slowly becoming more and more powerful and solving real biological problems. I saw it from foundation models in structural biology -with equivariant models and protein language models-, to transcriptomics and DNA language models.🤖🧬

During my work on celligner2 -at Broad-, I could already envision how foundational models would help in many aspects of my problem and many others. Working on the Atlas at Whitelab, they were always on our minds as possible solutions to tackle some real-world scientific problems (see my Whitelab post). (wait for a potential paper from our team at Whitelab in “nature computational science perspectives”) I had so many projects in mind and so little time to try to tackle them.💡📝

I also could see the speed of development of new papers and the breadth of possible research endeavors available to computational biologists. So there was some FOMO there for sure and I knew that if an AI-heavy transcriptomic research project was coming my way I might not say no.📈📊

So when I received this proposal from Laura and Gabriel I was very excited. It felt like the perfect fit for me. It had it all plus lots of applied math and stats plus gene regulatory networks which I spent years studying with Max Pimkin.💯📈

issues and realizations

But this is not the only reason: Before joining Whitelab and moving to France, I had interviewed at a few companies: BenevolentAI, DeepMind, Isomorphic Labs, Owkin, and DeepLife, and applied to many more. I could see for many of them that my not having a Ph.D. was an issue.

Other companies, like Whitelab or Aqemia, were very clear that my experience was, to them, worth much more than what most Ph.D.s would bring.

But this really made me realize that life would be that much tougher without this diploma and that many doors would be closed or harder to open.🚪🔐

Finally, working in startup companies, I noticed their needs and challenges, some of them I believe specific to European startups:

The funding in Europe is often much smaller than in the US. This is problematic for biotech companies that need high CAPEX from the get-go to hire the best people, do experimental validation, build experimental facilities, and create a first pipeline. This creates a state of slower initial growth than for their American counterparts.💰💼

This might seem to some like a small problem but it can become soul-crushing if you have been in the same company for years, waiting for it to enter exponential growth. It also can create issues, as the companies might lack important positions or teams with expertise that are harder to come by in Europe or need high investments.

Another realization I have had in both companies I was at, is the open-sourcing scare.😱📦

Open-source and creating value

Knowing biologists and Business Developers, I can guess this is an issue in many if not most (non-GAFAM) companies. “Use as many open source software and tutorials as you want”, “but don’t open source anything we do, don’t talk about anything we do”.

In software, since it is something that can get copied and pasted so easily, scarcity has to be artificial. However, open source and open science are often beneficial as everyone can collaborate and build on top of each other’s work. This is the innovation and progress mechanism inherent to software development and research.🔍🌐

However, external pressures tend to push companies to keep their innovation private, for themselves.🤐🔬

I would argue that for projects in which the result is very far off or unclear, not open-sourcing can be problematic. If you are building a product like a lamp, say, you cannot cheat: will the lamp turn on? For how long? For something like Facebook, even, you get usage metrics from very early on. For biotechs, computational biology, and research, the clear-cut goal is often “a drug that passed some clinical trials”. This creates many intangible and not very engineering-like objectives along the way, like “Can you get people to believe in you?” or “How many partnerships do you enter?”.

In this context, you can lie. You can lie to the client but also to yourself. You can design a very nice-looking “car” that doesn’t drive at all. But since you are selling to people who don’t even really understand what a car is and what to do with it, it might be okay.🚗🤫 I am not saying that this is what we do at Whitelab at all! But it is an issue that all biotechs face and struggle with. I am not the first to mention it.

But this is why I think open source is so important: when you let people know how to use a car, try your car, the iteration cycle is much faster. 📚🚗

In this context, computational biologists should strive to incentivize environments where they can build things that will have a more direct impact than “helping make a drug”, through open source and through experimental validation.

This has been a big lesson learned for me too. Thinking more about impact, direct impact, and value creation. 🌍📈

the hard decision

Nothing was pushing me away from my current job, and I felt welcomed, useful, and that we were going forward in our grand vision (see the Whitelab post). Choosing the Ph.D. really felt like choosing the hard thing, the hard path.

It is a hard decision because of how it might be seen too. I so do not want people to see it as doing a U-turn, or as a start-over. I now have 5 years of experience in computational biology and I am afraid of hearing “you are just fresh out of your Ph.D. and we thus don’t think you are yet qualified for a lead role in our company.” from an HR manager after my Ph.D.📆👨‍💼 To me, it is a continuation of my career. Ph.D.s can be many things and I see mine as a kind of post-doc or a long sabbatical.

Moreover, staying for such a short period in companies is not seen well. For very sensible reasons. But startups are a special environment, especially so early-stage. I have stayed for 3.5 years at Broad. Much longer than most people in my position there. I have worked on my project, PiPle, for more than 5 years. I also think I have shown how much I cared for Whitelab and the personal dedication I have put in and continue to, in the company. 🌟🤝

Finally, I feel I have actually learned a lot more in these 10 months than I had in a long time. From structural biology and drug discovery to business and management, the startup ecosystem, and cell and gene therapies. This is really thanks to these fast-paced environments and all the people I have been lucky to work with.

the many lessons learned

With my previous experience in building a company, I would not have expected to learn that much from being in a startup. I think I did not realize how major a step going from a company of 1 to 5 people is, compared to 5 to 20, 20 to 100, etc. Each growth cycle is a full reinvention of the company culture and brings new challenges, and things get exponentially more complex for the C-levels (i.e. the executive team: CEO, CTO, CSO, CFO, HR manager, etc.).

being a great C-level in a startup

Here are the main points I learned for the C-levels:

The C-levels’ domain knowledge is key to success. 💼🧠🔑
The ability to create agile processes and to hire the right people at the right time is also key to success. 🔄👥⏰🔑
Fostering a culture of trust and respect is how you can keep the best people in your company. 🤝💼

It might seem obvious, but I really was shocked by the impact of C-levels on a startup’s culture. I was also shocked by the key “glue role” that C-levels need to play while the company grows from 1 to 100 people. They are the connection between people inside the company and the partners outside it.🚀🔗 It shows how much of an example they are for everyone.

But the example they set is a constant balancing act between everyone’s needs and goals. Within all this chaos they need to keep fostering a culture of trust and respect. This “keep cool” in most situations is as hard as the balancing act they constantly play. 🤹‍♀️🤯

Finally, their domain knowledge 💼🧠🔑 is key to success. This requires experience in their topic and in what their clients and the market want. All of this trickles down when they start hiring people: M.D.s, Ph.D.s, business developers, and engineers with strong experience in their field are of the utmost importance in biotech. Here again C-levels play a key role: who they are and the culture they have fostered allow them to recruit and keep the best people, especially when the company is still small and hasn’t reached strong brand recognition yet.

Keeping good people is super tough: you need to pay them well and make their work fulfilling, while always showing them you are on a high growth trajectory. Processes and agility all play a role too. 🔄👥⏰🔑 Moreover, it is only through these well thought-out processes and structures that everyone’s strengths and contributions can be channeled efficiently into the projects and products the company is building.

Thinking differently than in academia

Building a startup has many similarities with being a researcher, for example: you have ideas you need to sell through a powerful story. This is how you get people and funding. These ideas although incomplete have to convince people. You spend a lot of time finding money and building a team of experts and beginners.💡💰👥

But if it were that similar, every ex-PI would be a tremendous CEO. There are big differences too. The first few discussions with the BD team at Whitelab really made me understand them. First, you need to know at all times what you are selling and what people want, even though you might not have it yet. What counts is that you know you are able to have it and how you would do it.

Second, putting yourself out there: going to as many conferences and events as possible, with a solution to people’s problems. This needs to happen from the get-go. (Jérémie from the future here: I realize now that these 2 points are still things that many PIs would be familiar with haha)

Finally, your first product might be miserable, but if it solves someone’s problem, it’s okay. And thinking product first, even in comp bio, is not really a reflex. What is a product, when you see all these amazing databases, datasets, and tools? What does one bring? You don’t have to bring much initially; it just needs to be useful to people. What is a product? It is really an abstraction over the work that your team does. It has to be polished, output meaningful results, within deadlines, and it must have a given cost upfront… 🗂️📅📋

The future of cell and gene therapies

I joined Whitelab because I strongly believe in the future of genetic therapies. Covid opened our eyes to the power of mRNAs, while CAR-Ts have had a big impact on cancer therapies, and many more drugs are in the pipeline.

Especially given the issues that plague small molecules, it felt to me that we needed a more targeted approach for the medicine of the future.💉🧬🌐

But first, what are cell and gene therapies (CGTs)?

Cell and gene therapies (CGTs) represent a revolutionary approach to disease treatment, moving away from traditional methods towards more targeted and personalized solutions. These therapies work by altering the genetic material within a patient’s cells to fight or prevent disease. They can replace, remove or repair abnormal genes or introduce totally new genes to fight diseases.🧬💊🌡️

A very famous subcategory: mRNA therapies, such as those used in some COVID-19 vaccines, are a type of gene therapy that works by introducing a small piece of mRNA into the body. This mRNA carries some instructions and once inside our cells, it uses the body’s own machinery to execute it. The advantage of mRNA therapies is that they don’t alter the DNA in our cells and are not permanent, reducing the risk of long-term side effects.💉🦠🧬

Antisense oligonucleotides (ASOs) are another type of CGT that work by binding to the mRNA produced by certain genes and preventing them from being translated into proteins. This can be useful for diseases where a particular gene is overactive. ASOs can be designed to target almost any gene, making them a highly versatile tool in the fight against disease.💉🎯🧬

Overall, the main advantage of CGTs is their precision. They target the root cause of diseases at the genetic level, offering the potential for more effective and longer-lasting treatments with fewer side effects. This precision, combined with the rapid advancements in technology and our understanding of the human genome, is why many believe CGTs represent the future of medicine.💉🌍🔬

I believe that, thanks to genetic sequencing and machine learning, we are able to understand the root cause of a disease in terms of one or a few cell types and their dysregulation, often caused by one or a few genes.

Being able to target only these specific diseased cell types is within reach for CGTs but not for small molecules. Moreover, being specific about what we are doing, i.e. which genes we are triggering, is also something CGTs can do that most small molecules can’t.

It feels like a perfect CGT would allow us to create a novel drug in hours. From a laptop and a DNA printing machine. It would be the perfect interface between man and life.

My goals and objectives for the Ph.D.

A main mistake during one’s Ph.D. is to not see the time passing by. My goal for this Ph.D. is to be as product-first as I was at Whitelab. Delivering results quickly & improving until it is publishable.🏆📊📚

This mistake, thinking “Well I have 3 years…”, is what makes people go overboard and finish stressed and unprepared for what is next. Thus I plan to do mine in 2 years. And I will prepare everything around this idea. I will also start to think about what is next ASAP.

To do that best, one needs to use the Ph.D. as a chance to make connections with other labs (industry or academic). On my end I am thinking, for example, of Broad, the Theis lab, the Yosef lab, Valence labs, …🤝🔗

Moreover, a good piece of advice I have been given is to know what you want to do and what you don’t want to do. Know what you are here for. Learn to say no. And I learned to say no in the last 4 years. My goal is to work on large models & large datasets, mostly transcriptomics, while making sure to always go back to first principles and biology.🧬🧪🔍

I also know I want to make something useful, something that can be a stepping stone for others, something that has an impact on the community. I know that to do that you have to go the extra mile in terms of development and be honest with yourself about any shortcomings.

Finally, I have been very lucky to often become addicted to my work. I like working hard and I like challenges. But for this to happen, I need to keep enjoying what I am doing. I also wish to have no regrets about this decision. Thus my final goal is to enjoy it, as much as I can!😊🚀

Recap

Thus my motto for this Ph.D. will be:

do it in 2 years and be prepared
make as many connections as I can
maximize impact on the community: make something useful
enjoy it as much as possible

Ph.D. proposal

Link to the proposal

Previous proposals

I had a previous Ph.D. project proposal for a Ph.D. that I then postponed. I list it here too for historical purposes.

My Work at Whitelab and Creating a Team

2023-09-25T00:00:00+00:00

After a difficult end of the year, I decided I wanted to take a month or so to get my thoughts in order. To relax and find some closure with what had happened. ☮️

However the day following my departure I was already looking at companies and contacting labs. 📱 I was in this state where the only thing that could make me feel better was to look at the future and the possibilities. 💭

But it was still tough mentally. I was thinking about the option of going back to the Broad Institute. “Would it be hard? Should I go now? Would it be the easy way out? What about my apartment, my girlfriend? etc.” It felt like a way out, like cowardly going back “home” 🏠 after a failed attempt.

I had, fortunately, all the support I could need.

It actually only took a week or so before I got replies: one from DeepLife and one from WhitelabGx.

I ended up being offered positions. DeepLife was very early stage at the time and its interview process made me feel a bit uneasy about things. After my previous experience, this is something I was now much more sensitive to. I was not really able to see these before.

Whitelab, on the other hand, was very welcoming. They were excited about my profile and offered me a very rewarding position: I would be a team lead in Computational Biology 💪. Here I saw how valuable my previous experience was, as I felt I was now much more able to read between the lines during recruitment processes. Not what people say, but how they behave, what they offer, and the way they do it. Doing so, I could also see the tremendous difference in how companies can value any one person’s profile. While DeepLife was offering me an entry-level position, after 4.5 years of experience, Whitelab was offering me a team lead position.

In a way, recruitment feels like friendship sometimes. It is a matter of taste, of need but most of all serendipity.

On the surface, the position looked a lot like what I had been told at Aqemia (although not really offered). But here, the team was already there waiting for me and the objectives were clear. 🥇

the environment

Many things were green flags 🟩 for me. The company was really thinking of itself as a Biotech, the CSO was a biologist with experience. They had an HR manager. They also had a team of biologists and both the CEO / CSO were very honest, aware of what they didn’t know, and were very open to my suggestions and ideas.

The place itself was a brand new incubator 👀 called Future4Care. As nice as Broad and less than 10 minutes from my place.

Every member of the team was nice and welcoming. It was small, however: 20 people.

The goal of WhitelabGx is to build a platform to help develop, de-risk, and speed up the preclinical phase of novel cell and gene therapies (CGTs). The idea is to use ML and data science to drive the discovery, experimental design, and so on 📈.

Especially after getting experience in small molecule preclinical development, I could see the huge potential of CGTs in their precision compared to regular small molecules. I also believe strongly that many diseases will only be cured through the use of CGTs.

Disclaimer:

Why am I doing that?

It is simple: I strongly believe in openness and honesty and I think the world is a better place every time people speak freely and openly about subject matters that are important, like work.

I am talking here solely about my experience at WhitelabGx, with the management of 2023. This is just one point of view at one specific moment in time. I also have my own values and reference frame. Please do not use anyone’s past experience to judge a current company on this account and do your own due diligence.

I am not talking about anything related to IP, technology, or anything part of my non-disclosure agreement, I am only talking about my experience as I would tell it to my friends that are not in the industry.

Because of the nature of the subject and with respect for anyone currently working at WhitelabGx, I will keep this post private for now. If you are interested in reading it, please send out an email with an explanation of why you want to read it and who you are and I might send it to you.

But I might at least let you know about what WhitelabGx was very open about as of the end of 2023:

Whitelab has a culture of kindness and honesty.
Most people are staying in the company after their trial period.
Whitelab is largely composed of people with past experience in research, biotech, and biology.
Whitelab is very diverse: exactly half of the company is non-French, with people coming from all over the world. The culture is very American.

————- REDACTED ————-

send me an email to get the full story here

The Aqemia Story

2022-12-20T00:00:00+00:00

During the summer of 2022, I was feeling pretty desperate. I had just experienced 6 months of many failed attempts at securing positions at companies like Deepmind, BenevolentAI, Isomorphic Labs, and Owkin. I had gone through very long and tedious interview processes where, for some, it had felt like I was mainly just missing a Ph.D. Sometimes it felt like bad luck, bad timing.

But at the end of July, everything went very fast. I had gone through a couple of interviews at a young promising French startup called Aqemia and was quickly hired as a Research Scientist to start their target discovery team.

The pay was low compared to the US and what I was coming from (but I could have many things to say between US and Europe’s pay vs actual buying power and quality of life) and I negotiated stock options and a relocation bonus. I accepted the offer and 2 months later I was in Paris. (see more in my previous post)

I will now try to present the time I spent at Aqemia:

Disclaimer:

Why am I doing that?

It is simple: I strongly believe in openness and honesty and I think the world is a better place every time someone talks openly about subject matters that are important, like work.

I am talking here about my sole experience at Aqemia with the management team of 2022. This is just one point of view at one specific moment in time. I also have my own values and reference frame that are at play. Please do not use anyone’s experience to judge a current company on this account and do your due diligence.

Because of the nature of the subject and out of respect for anyone currently working at Aqemia, I will keep this post private for now. If you are interested in reading it, please send me an email and let me know why you want to read it and who you are. I might send it to you.

But I might at least let you know about what Aqemia was very open about as of the end of 2022:

Aqemia is a very fast-paced environment.
A lot of people -more than half- leave the company after a few months. (It is not clear if this is because of Aqemia’s decisions or because of their own) (this percentage seemed higher for people with previous experience in industry and biotech).
Aqemia is largely composed of young graduates from a small set of elite French engineering schools.
Aqemia is very diverse with people coming from all over the world.
Aqemia is a company that is building a platform to help develop, de-risk, and speed up the preclinical phase of small molecule drug discovery.

————- REDACTED ————-

send me an email to get the full story here

Leaving Broad

2022-09-01T00:00:00+00:00

Yes, I have decided to leave the Broad Institute.

“Are you crazy?” 🤯

Well, I think there are many good reasons for it and I will try to explain why. But, before I go into the why, let’s talk about the last 1.5 years at Broad and what I have been able to do. 🔍

3.5 years at the Broad

There have been quite a few changes and many successes in the past year. I have finished my main research project at DFCI with Maxim Pimkin and I am now 100% a Broadie. 🥳

My two main projects have been on what I liked to call CCLE3 and Celligner2.

DepMap Omics: CCLE3

At CDS, some people have come and gone. Most importantly, someone joined my team: Simone Zhang, who took over more engineering responsibilities from me and helped me on the depmap-omics project. 💪

I was also quickly promoted to Senior Associate Computational Biologist, the last rung of the ladder before becoming a Research Scientist at Broad. Simone’s arrival was also an opportunity given to me to show my ability to coach and mentor a junior scientist.

This allowed us to define a lot more avenues of research, which I packaged into a project called: CCLE3. 💡

The main question was: “What can we do with the data we have at the Broad, in depmap-omics, to help the community?”

I was proud to be given the responsibility of presenting this vision to the Cancer Program leadership in 2021. Here is the presentation I gave:

presentation

Mostly, one can see that it was centered on the idea of generating novel features from the omics data we already have, improving the metadata, and harmonizing our omics, i.e. having the same sequencing data across all our samples, at least for RNAseq & WGS. 🟩

I strongly believed -and still do- that these were the low-hanging fruits 🥝 for us, allowing us to discover new targets cheaply and squeeze more value out of our current database.

Let me present that in more detail.

new features

I believe that in the last 2 years, we have made an impact on the community by open-sourcing our toolkit and making our data reproducible and more generally available. 📖

However, of all the fancy Machine Learning and prediction tools we have tried and designed to better predict gene dependency & cancer targets, almost every time, nothing had a greater impact than generating a new genomic feature from our dataset. 🤯

I think this is a big “lesson learned”, for me at least: data comes first and domain knowledge comes second. I think this lesson is true not only for genomics but for all data science.

Let me present some of the new features we have developed:

Recreating & improving the microsatellite instabilities from DNAseq –> initially led to the WRN / MSI paper.
Associating features based on gene sets, gene proximity, etc. –> initially led to the 2022 VPS4A/VPS4B paper.
Getting info about viral infection of the tumor sample.
Getting at somatic & germline structural variants from WES and WGS and associating them with fusion variants in RNAseq –> led to a new paper with the Rameen Beroukhim lab.
Getting info about splicing QTLs using spliceAI on mutation data and associating them with transcript abundance –> paper in progress as of 2023, also with the Beroukhim lab.
Getting info about the global functional status of genes (based on CNV/SV/SNPs/INDELs, etc.) and their phasing and expression –> has already been shown to improve most of our dependency prediction data.
Germline mutations & their associations with dependency –> something I pushed for a lot, and which led to a couple of papers on ancestry bias with Sean Misek and to improvements of our Achilles algorithm.

The last project is one I was really proud 🥇 of, as it is something I had pushed for two years without much success, until Sean, a new research scientist in the Beroukhim lab, helped me convince the rest of the team that this would be a worthwhile endeavor.

During Broad’s 2021 Retreat, I was able to present this work to the whole Broad community. Here is the poster:

poster broad retreat

I am sad that, despite publishing a couple of first-author papers and being part of a dozen other collaborations, 🙄 I was not able to see even a hint of a CCLE3 paper come out of it. I hope it will be published someday, but I have been told that to be guaranteed a Nature paper, one needs more than new features and new predictions: one -apparently- also needs a crazy new sequencing dataset 🗻 (à la Broad).

metadata improvement

Metadata had always been one of our Achilles’ heels. We knew they were not great, but we also knew that no one had great metadata for their cell lines. What quickly convinced the group to start working on it was how much better the prediction algorithm worked with every improvement. Seeing that, we started running many parallel efforts to map duplicate samples (using genomics), similar-looking models (using transcriptomics), sex, age, etc. 📊

Moreover, we started merging our cell line annotations with Cellosaurus, which uncovered more issues in both databases, and merging our disease annotations with the NCIT, which made us realize that many more cancers than we thought were related in some way, or were simply improbable annotations. 🧭

This effort is also something I had pushed for, for a while, and it reused many of the improvements I had already put forward. This is also something I am proud of. I think it is a great example of how such a small, boring effort can have a big impact on science and therapeutic discovery.

next steps for sequencing

Moreover, since it is the Broad, in 2019 we had goals of going 10x in the coming years. However, we soon realized that there were not a lot of “good” models left 🪹 and that our capacity for creating them was very limited. We wanted more cell lines / 3D models but it was hard to get good ones.

Given that we have 12 different sequencing data types for 1700 lines but almost none of the lines have all of them 🧬 done, our first goal was to do WGS & RNAseq on everything. This represented more than 200 RNAseq and 1000 WGS runs, the reordering of 400 lines, and the likely drop of dozens of lines that do not exist anymore.

Why do that?

Because these are the low-hanging fruits for us. Without changing anything, without generating new models or changing pipelines and platforms, we just get a lot more information, allowing us to discover new targets cheaply and squeeze more value out of our current database. When your model can work on modalities A+B+C but only 300 of the 1700 samples have all of those modalities, you are missing out on a lot of potential. 🟩
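To make the coverage argument concrete, here is a minimal sketch of how one could count the samples usable by a model that needs several modalities at once. The availability table is made up and the names (`has_modality`, `complete_cases`) are hypothetical; this is not the depmap-omics code.

```python
import numpy as np


def complete_cases(has_modality: np.ndarray) -> int:
    """Count samples (rows) for which every modality (column) is available."""
    return int(has_modality.all(axis=1).sum())


# Hypothetical availability table: 6 samples x 3 modalities (A, B, C).
has_modality = np.array([
    [True,  True,  True],
    [True,  False, True],
    [True,  True,  False],
    [False, True,  True],
    [True,  True,  True],
    [True,  False, False],
])

usable = complete_cases(has_modality)
per_modality = has_modality.sum(axis=0)  # number of samples available per modality

print(f"{usable} / {len(has_modality)} samples have all modalities")
print("per-modality counts:", per_modality.tolist())
```

Each modality alone looks well covered, yet the model that needs all three can only use the intersection, which is exactly what filling in the missing sequencing would fix.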

But a second, even more ambitious objective came to us in the form of Jason Buenrostro, inventor of ATACseq & SHARE-seq. Jason wanted to apply SHARE-seq to all our cell lines, likely millions of cells, and then apply it post-perturbation 🔫. This would allow us to see the effect of a drug on the chromatin accessibility of a cell line and its expression at single-cell resolution. This was a very exciting project, and Paquita Vazquez tasked me with coordinating it on the CDS side.

However, this was not all. The other half of my time was spent on a pet project of mine: Celligner.

Celligner2 and the hint of foundational models

Celligner was initially developed by Allie Warren. Its goal is to align bulk RNAseq data from cancer models (CCLE) with data from patients’ tumors (TCGA). Comparing the two 🍎 / 🍊 to:

assess annotation quality.
assess model quality as its similarity to the patient’s tumor.
assess model coverage & diversity.

Initially, they don’t align well and one needs to apply an alignment method for those datasets to work together.

She did that by reusing the MNN tool from Seurat and adding in additional preprocessing using contrastive PCA, differential expression, etc.
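As an illustration of the mutual-nearest-neighbours (MNN) idea only, here is a toy sketch that finds pairs of points, one from each dataset, that are each other's closest match. The function name and the data are hypothetical, and this is not the Celligner code: the real pipeline works on expression profiles, adds the preprocessing mentioned above, and then uses such pairs to correct one dataset toward the other.

```python
import numpy as np


def mutual_nearest_neighbors(a: np.ndarray, b: np.ndarray, k: int = 1):
    """Return (i, j) pairs where a[i] and b[j] are in each other's k nearest neighbours."""
    # Pairwise Euclidean distances, shape (len(a), len(b)).
    dist = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    nn_a_to_b = np.argsort(dist, axis=1)[:, :k]    # for each a[i], its k closest rows of b
    nn_b_to_a = np.argsort(dist, axis=0)[:k, :].T  # for each b[j], its k closest rows of a
    return [
        (i, int(j))
        for i in range(len(a))
        for j in nn_a_to_b[i]
        if i in nn_b_to_a[j]
    ]


# Toy 2D "models" and "tumors"; the third point of each set has no mutual match.
models = np.array([[0.0, 0.0], [5.0, 5.0], [9.0, 0.0]])
tumors = np.array([[0.5, 0.0], [5.0, 5.5], [20.0, 20.0]])

pairs = mutual_nearest_neighbors(models, tumors, k=1)
print(pairs)
```

Requiring the match to be mutual is what keeps samples without a real counterpart in the other dataset from being forced into a pair.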

The project worked very well and led to a lot of excitement.

As Allie left, I took over the project and started working on a new version of the tool. I wanted to answer biologists’ needs. Initially by adding new features and making the tool more scalable.

But I soon started to think that we could do so much more by using a different approach.

My idea was to reuse VAE-based models like scVI and scArches and train them to make a model look like a tumor and vice versa, train them to predict features like tumor type, sex, and age, and use explainable AI (XAI) 🗨️ to understand what the model was learning and which gene sets and pathways explained these features.

We actually went pretty far with this and I started making early presentations for the team. 🧑‍🔬

presentation

I think I showed many interesting results that could have fit into a paper.

Senior Members were intrigued by 2 ideas:

using annotations during training as semi-supervision constraints to improve model quality.
using additional data like GTEx, which contains normal samples, as a way to also improve model quality. 💹

For the first part, the debate was mostly around: “What if the annotation is wrong?”, “what about overfitting to annotations?”, “it seems like using what you will test your model on”…

But I was able to show that, since the annotations were only one part of the loss function, this was not leading to that issue. When I deliberately introduced wrong annotations, the model was, up to a point, still able to create the right mapping (e.g. fake “lung models” 🫁 that were actually “breast models” were still aligning with breast cancer). Moreover, adding these annotations was in fact improving model quality, where quality is defined not by annotation but by many other metrics 📈 such as those in the scIB package.
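To illustrate what "only part of the loss function" means, here is a minimal sketch of a semi-supervised objective: a reconstruction term plus a weighted classification penalty that only applies to annotated samples. Everything in it is an assumption made for illustration: the function name, the `weight` value, and the use of mean squared error as a stand-in for the VAE's actual reconstruction and KL terms. It is not the Celligner2 loss.

```python
import numpy as np


def semi_supervised_loss(x, x_hat, class_probs, labels, weight=0.1):
    """Reconstruction error plus a weighted classification penalty on labelled samples.

    `labels` holds a class index per sample, or -1 when there is no annotation.
    """
    recon = float(np.mean((x - x_hat) ** 2))  # stand-in for the VAE reconstruction term
    idx = np.flatnonzero(labels >= 0)         # positions of annotated samples
    if idx.size == 0:
        return recon
    picked = class_probs[idx, labels[idx]]    # predicted probability of the annotated class
    cross_entropy = float(-np.mean(np.log(picked + 1e-12)))
    return recon + weight * cross_entropy


# Toy batch: 3 samples, 2 features, 2 classes; the last sample is unannotated.
x = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
x_hat = x + 0.5
class_probs = np.array([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]])

good_labels = np.array([0, 1, -1])
noisy_labels = np.array([1, 1, -1])  # first annotation deliberately wrong

print(semi_supervised_loss(x, x_hat, class_probs, good_labels))
print(semi_supervised_loss(x, x_hat, class_probs, noisy_labels))
```

With a small `weight`, a wrong annotation raises the loss a little, but the reconstruction term still dominates, which is one way to read the robustness to mislabeled models described above.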

For the second part, it just seemed nonsensical to some to use data that was not cancer: “What does it mean to align normals to cancer?”, “Is it really useful?”, “Why do you want to do this?”

I think here I had the first hint of what would soon become the foundation models for scRNAseq that came out the following year. I was seeing GTEx as pretraining, helping the neural network to also learn how genes are related and how tissues differ. I was seeing it as a way to make the model more robust and more generalizable. 🧠 The metrics showed that doing so improved the model on every downstream task. It was also interesting to see which cancers were mapping to their tissue of origin and which ones were completely diverging. 📉

The Celligner project was going well, the results were there. However many changes at CDS and Broad came together to slowly end it.

Leaving Broad

In my last year at Broad, 3 prominent colleagues left: CDS founder Aviad Tsherniak, CDS director James McFarland, and Javad Noorbakhsh. Losing several colleagues at once was a lot, but they were also the only 3 people pushing for Celligner and for bringing more AI into CDS.

Moreover, there was a change in the Cancer Program leadership itself. With Eric Lander, Broad’s Director, leaving for the White House 🇺🇸 (and then quickly leaving 🙊), Todd Golub became Broad’s new director and Bill Sellers, a world-renowned cancer scientist from Novartis who was heavily involved in CCLE and DepMap, became the new Cancer Program director ♋.

Bill wanted to streamline the objectives of the Cancer Program toward drug discovery, and he wanted to do that by focusing on a few key areas 🏹. This meant that some projects were going to be stopped and some others were going to be prioritized. So you can guess where this is going for Celligner… 🛑

The new projects were still very exciting, but other factors were very important in my decision too. I was missing my family, my father was ill, and I could see that life was moving fast for my siblings and my friends. I felt that I had to make a life decision: I made a choice and decided to move back to France. 🇫🇷

I won’t take you through covid’s time and my SO being away too (see 2 years at Broad) but everything played a role.

What’s next?

I also got a sense that I need to see drug development firsthand. I need to get back in the weeds of building a company and a project. I spent a year doing that before Broad, and such a fast-paced, product-first 📦 environment is also something that I am missing. I will try to see what I can do in the field of target discovery with this acquired knowledge.

In no time I got into an up-and-coming French startup called Aqemia. Let’s see what is next.

If you want to see what I presented to the companies 🏛️ I have applied to, here is my interview presentation, used in a couple of fancy companies’ interviews 😁.