Getting Started

rsynthbio is an R package that provides a convenient interface to the Synthesize Bio API, allowing users to generate realistic gene expression data based on specified biological conditions. This package enables researchers to easily access AI-generated transcriptomic data for various modalities including bulk RNA-seq and single-cell RNA-seq.

Basic Usage

Creating a Query

The first step in generating gene expression data is to create a query. The package provides a sample query that you can modify:

# Get a sample query
query <- get_valid_query()

# Inspect the query structure
str(query)

The query consists of:

output_modality: The type of gene expression data to generate (see get_valid_modalities)
mode: The prediction mode (e.g., “mean estimation” or “sample generation”)
inputs: A list of biological conditions to generate data for
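
To make the three fields concrete, the sketch below builds a list with the same top-level shape by hand and counts the samples it requests. The values used here (the modality name "bulk" and the metadata entries) are illustrative assumptions, not guaranteed-valid inputs; get_valid_query() remains the authoritative template.

```r
# Hand-built list mirroring the three top-level fields described above.
# All values are placeholders for illustration only.
example_query <- list(
  output_modality = "bulk",   # assumed modality name
  mode = "mean estimation",
  inputs = list(
    list(
      metadata = list(sex = "female", sample_type = "primary tissue"),
      num_samples = 5
    ),
    list(
      metadata = list(sex = "male", sample_type = "cell line"),
      num_samples = 2
    )
  )
)

# Total number of samples requested across all conditions
total_samples <- sum(vapply(
  example_query$inputs,
  function(x) x$num_samples,
  numeric(1)
))
total_samples
```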

We train our models with diverse multi-omics datasets. There are two model modes available today:

Mean estimation: These models create a distribution capturing the biological heterogeneity consistent with the supplied metadata. This distribution is then sampled to predict a gene expression distribution that captures measurement error. The mean of that distribution serves as the prediction.
Sample generation: This model works identically to the mean estimation approach, except that the final gene expression distribution is also sampled to generate realistic-looking synthetic data that captures the error associated with measurement.
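
The difference between the two modes can be illustrated with a toy simulation for a single gene. This is not the Synthesize Bio model: the log-normal "biological" distribution and the Poisson "measurement" distribution are assumptions chosen only to show where the extra sampling step happens.

```r
set.seed(42)

# Toy stand-in for one gene; NOT the actual model.
# Step 1: sample an expression level from a distribution representing
# biological heterogeneity consistent with the metadata.
biological_level <- rlnorm(1, meanlog = log(100), sdlog = 0.3)

# Step 2: measurement error is modeled here as Poisson noise around that
# level (an assumption made purely for illustration).

# Mean estimation: report the mean of the measurement distribution,
# which for a Poisson distribution is its rate parameter.
mean_estimate <- biological_level

# Sample generation: additionally draw one observation from that
# measurement distribution, so the output carries measurement noise.
generated_sample <- rpois(1, lambda = biological_level)

c(mean_estimate = mean_estimate, generated_sample = generated_sample)
```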

result <- predict_query(query)

The result is a list of two data frames: metadata and expression.

Modifying a Query

You can customize the query to fit your specific research needs:


# Adjust number of samples
query$inputs[[1]]$num_samples <- 10

# Add a new condition
query$inputs[[3]] <- list(
  metadata = list(
    sex = "male",
    sample_type = "primary tissue"
  ),
  num_samples = 3
)

The input metadata is a list of lists.
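
Assigning to a fixed index such as [[3]] assumes the query already holds exactly two conditions. A small helper that always appends at the end avoids that assumption. The function add_condition() below is hypothetical and not part of rsynthbio; it operates on a stand-in list so the example runs without the package.

```r
# Hypothetical helper (not part of rsynthbio): append one condition to a query
add_condition <- function(query, metadata, num_samples) {
  stopifnot(is.list(metadata), is.numeric(num_samples), num_samples >= 1)
  query$inputs[[length(query$inputs) + 1]] <- list(
    metadata = metadata,
    num_samples = num_samples
  )
  query
}

# Stand-in query so the example runs without the package or an API key
demo_query <- list(inputs = list(
  list(metadata = list(sex = "female"), num_samples = 5)
))

demo_query <- add_condition(
  demo_query,
  metadata = list(sex = "male", sample_type = "primary tissue"),
  num_samples = 3
)
length(demo_query$inputs)
```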

Here are the available metadata fields:

Biological:

age_years
cell_line_ontology_id
cell_type_ontology_id
developmental_stage
disease_ontology_id
ethnicity
genotype
race
sample_type (“cell line”, “organoid”, “other”, “primary cells”, “primary tissue”, “xenograft”)
sex (“male”, “female”)
tissue_ontology_id

Perturbational:

perturbation_dose
perturbation_ontology_id
perturbation_time
perturbation_type (“coculture”, “compound”, “control”, “crispr”, “genetic”, “infection”, “other”, “overexpression”, “peptide or biologic”, “shrna”, “sirna”)

Technical:

study (BioProject ID)
library_selection (e.g., “cDNA”, “polyA”, “Oligo-dT” - see https://ena-docs.readthedocs.io/en/latest/submit/reads/webin-cli.html#permitted-values-for-library-selection)
library_layout (“PAIRED”, “SINGLE”)
platform (“illumina”)

Acceptable Metadata Values

The following are the valid values or expected formats for selected metadata keys:

Metadata Field	Requirement / Example
`cell_line_ontology_id`	Requires a Cellosaurus ID.
`cell_type_ontology_id`	Requires a CL ID.
`disease_ontology_id`	Requires a MONDO ID.
`perturbation_ontology_id`	Must be a valid Ensembl gene ID (e.g., `ENSG00000156127`), ChEBI ID (e.g., `CHEBI:16681`), ChEMBL ID (e.g., `CHEMBL1234567`), or NCBI Taxonomy ID (e.g., `9606`).
`tissue_ontology_id`	Requires a UBERON ID.

To look up ontology terms, we recommend using the EMBL-EBI Ontology Lookup Service.

Models have a limited acceptable range of metadata input values. If you provide a value that is not in the acceptable range, the API will return an error.
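
Obvious formatting mistakes in ontology IDs can be caught locally before a query is sent. The sketch below uses regular expressions based on common prefix conventions for each ID type. The patterns and the helper check_id_formats() are assumptions for illustration; a value that passes may still fall outside the range a model accepts, so the API remains the final authority.

```r
# Assumed ID shapes based on common prefix conventions; not taken from the API.
id_patterns <- list(
  cell_line_ontology_id    = "^CVCL_[A-Z0-9]+$",
  cell_type_ontology_id    = "^CL:[0-9]+$",
  disease_ontology_id      = "^MONDO:[0-9]+$",
  tissue_ontology_id       = "^UBERON:[0-9]+$",
  perturbation_ontology_id = "^(ENSG[0-9]{11}|CHEBI:[0-9]+|CHEMBL[0-9]+|[0-9]+)$"
)

# Hypothetical helper: return the names of fields whose values do not
# match the assumed pattern. Fields without a pattern are ignored.
check_id_formats <- function(metadata) {
  fields <- intersect(names(metadata), names(id_patterns))
  ok <- vapply(
    fields,
    function(f) grepl(id_patterns[[f]], as.character(metadata[[f]])),
    logical(1)
  )
  fields[!ok]
}

# A DOID term in a field that requires a MONDO ID is flagged
check_id_formats(list(
  tissue_ontology_id = "UBERON:0002107",
  disease_ontology_id = "DOID:1234"
))
```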

Making Predictions

Once your query is ready, you can send it to the API to generate gene expression data.

# Request counts data (not log-CPM)
result <- predict_query(query, as_counts = TRUE)

To get the full API response rather than just the metadata and expression data frames, set raw_response = TRUE.
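
Because the API rejects metadata values outside the acceptable range, it can be useful to catch failures rather than let a script stop. The wrapper below is a minimal sketch that assumes a failed request surfaces as an ordinary R error. safe_predict() is hypothetical, and a stub function stands in for predict_query so the example runs offline.

```r
# Hypothetical wrapper: in real use, `predict_fn` would be predict_query.
safe_predict <- function(query, predict_fn, ...) {
  tryCatch(
    predict_fn(query, ...),
    error = function(e) {
      message("Prediction failed: ", conditionMessage(e))
      NULL
    }
  )
}

# Stub standing in for the API call so the example runs without credentials
failing_stub <- function(query, ...) stop("metadata value out of range")

demo_result <- safe_predict(list(), failing_stub, as_counts = TRUE)
is.null(demo_result)
```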

Working with Results

# Access metadata and expression matrices
metadata <- result$metadata
expression <- result$expression

# Check dimensions
dim(expression)

# View metadata sample
head(metadata)
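
When counts are requested, a common follow-up is converting them to log-scaled counts per million for plotting or comparison. The sketch below uses a small mock matrix and assumes samples are in rows and genes in columns; check dim(expression) against your own result before applying it. The log1p transform shown is one common convention and is not necessarily the one the API uses for its log-CPM output.

```r
# Mock counts with samples in rows and genes in columns (assumed layout)
counts <- matrix(
  c(10, 90,  0,
    50, 25, 25),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("sample_1", "sample_2"), c("GENE_A", "GENE_B", "GENE_C"))
)

# Counts per million: scale each sample (row) to a library size of 1e6
cpm <- counts / rowSums(counts) * 1e6

# log1p keeps zero counts finite
log_cpm <- log1p(cpm)
round(log_cpm, 2)
```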

You may want to process the data in chunks or save it for later use:

# Save results to RDS file
saveRDS(result, "synthesize_results.rds")

# Load previously saved results
result <- readRDS("synthesize_results.rds")

# Export as CSV
write.csv(result$expression, "expression_matrix.csv")
write.csv(result$metadata, "sample_metadata.csv")
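
For chunked processing, one option is to split the row indices into fixed-size groups and handle each group separately. The sketch below uses a mock data frame in place of result$expression and assumes samples are in rows; the chunk size and the per-sample total are arbitrary choices for illustration.

```r
# Mock expression table standing in for result$expression (samples in rows, assumed)
expression_demo <- data.frame(
  GENE_A = c(5, 3, 8, 1, 7),
  GENE_B = c(2, 9, 4, 6, 0)
)

chunk_size <- 2
row_ids <- seq_len(nrow(expression_demo))

# Group row indices into chunks of at most `chunk_size` samples
chunks <- split(row_ids, ceiling(row_ids / chunk_size))

# Process each chunk independently, here by computing per-sample totals
totals <- unlist(
  lapply(chunks, function(idx) rowSums(expression_demo[idx, , drop = FALSE])),
  use.names = FALSE
)
totals
```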

Custom Validation

You can validate your queries before sending them to the API:

# Validate structure
validate_query(query)

# Validate modality
validate_modality(query)
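
For a quick structural check that does not depend on the package at all, a few lines of base R can confirm that the top-level fields described earlier are present. check_query_shape() below is a hypothetical helper, not rsynthbio's validate_query(), and it checks far less than the package or the API does.

```r
# Hypothetical local check (not rsynthbio's validate_query): confirms the
# top-level fields described in this guide are present.
check_query_shape <- function(query) {
  required <- c("output_modality", "mode", "inputs")
  missing_fields <- setdiff(required, names(query))
  if (length(missing_fields) > 0) {
    stop("Query is missing: ", paste(missing_fields, collapse = ", "))
  }
  if (!is.list(query$inputs) || length(query$inputs) == 0) {
    stop("`inputs` must be a non-empty list of conditions")
  }
  invisible(TRUE)
}

# Passes silently for a well-formed stand-in query
check_query_shape(list(
  output_modality = "bulk",   # assumed modality name
  mode = "mean estimation",
  inputs = list(list(metadata = list(sex = "male"), num_samples = 1))
))
```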
