Convert raw PhIP-Seq output into a phip_data object
Source: R/convert-standard.R
convert_standard() ingests a "long" table of PhIP-Seq read
counts / enrichment statistics, optionally expands it to the full
sample_id x peptide_id grid, and registers the result in DuckDB.
The function returns a fully initialised phip_data object that can be
queried with the tidy API used throughout the package.
Usage
convert_standard(
  data_long_path,
  sample_id = NULL,
  peptide_id = NULL,
  subject_id = NULL,
  timepoint = NULL,
  exist = NULL,
  fold_change = NULL,
  counts_input = NULL,
  counts_hit = NULL,
  sample_id_from_filenames = FALSE,
  n_cores = 8,
  materialise_table = TRUE,
  auto_expand = FALSE,
  peptide_library = TRUE
)
Arguments
- data_long_path
Character scalar. File or directory containing the long-format PhIP-Seq data. Allowed extensions are .csv and .parquet. Directories are treated as partitions of a parquet set.
- sample_id, peptide_id, subject_id, timepoint, exist, fold_change, counts_input, counts_hit
Optional character strings. Supply these only if your column names differ from the defaults ("sample_id", "peptide_id", "subject_id", "timepoint", "exist", "fold_change", "counts_input", "counts_hit"). Each argument should contain the name of the column in the incoming data; NULL lets the default stand.
- sample_id_from_filenames
Logical. If TRUE and data_long_path is a directory of files (CSV or Parquet), automatically derive sample_id from each filename (basename without extension). Requires that no sample_id mapping is provided and that the input files do not already contain a sample_id column. Default: FALSE.
- n_cores
Integer >= 1. Number of CPU threads DuckDB may use while reading and writing files.
- materialise_table
Logical. If FALSE, the result is registered as a view; if TRUE, the table is fully materialised and stored on disk, trading higher load time and storage for faster repeated queries.
- auto_expand
Logical. If TRUE and the incoming data are not a complete Cartesian product of sample_id x peptide_id, missing combinations are generated:
  - Columns that are constant within each sample_id (metadata) are copied to the new rows.
  - Non-recyclable measurement columns (fold_change, exist, counts_input, counts_hit, etc.) are initialised to 0.
The expanded table replaces the original in place.
- peptide_library
Logical. If TRUE (default), convert_standard() will attempt to locate and attach the matching peptide-library metadata for downstream annotation. Set to FALSE to skip this step.
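The grid expansion described for auto_expand can be illustrated with a small base R sketch. This is not the package's implementation (which runs inside DuckDB); expand_to_grid() is a hypothetical helper, and it assumes that every column other than the keys and the measurement columns is constant within each sample_id.

```r
# Hypothetical helper, for illustration only: expand a long table to the
# full sample_id x peptide_id grid, as described for auto_expand = TRUE.
expand_to_grid <- function(df,
                           measure_cols = c("exist", "fold_change",
                                            "counts_input", "counts_hit")) {
  # Every sample/peptide combination a complete grid would contain
  grid <- expand.grid(
    sample_id  = unique(df$sample_id),
    peptide_id = unique(df$peptide_id),
    stringsAsFactors = FALSE
  )

  measure_cols <- intersect(measure_cols, names(df))
  meta_cols <- setdiff(names(df), c("sample_id", "peptide_id", measure_cols))

  # Assumption: metadata is constant per sample, so the first row of each
  # sample carries all the metadata that needs to be copied.
  sample_meta <- df[!duplicated(df$sample_id),
                    c("sample_id", meta_cols), drop = FALSE]

  # Mark observed rows so that only newly generated rows are zero-filled
  measures <- df[c("sample_id", "peptide_id", measure_cols)]
  measures$.present <- TRUE

  out <- merge(grid, measures, by = c("sample_id", "peptide_id"), all.x = TRUE)
  is_new <- is.na(out$.present)
  for (col in measure_cols) {
    out[[col]][is_new] <- 0
  }
  out$.present <- NULL

  # Copy the per-sample metadata onto all rows, including the new ones
  out <- merge(out, sample_meta, by = "sample_id")
  out[order(out$sample_id, out$peptide_id), , drop = FALSE]
}

toy <- data.frame(
  sample_id  = c("s1", "s1", "s2"),
  peptide_id = c("p1", "p2", "p1"),
  subject_id = c("A", "A", "B"),
  exist      = c(1, 0, 1),
  stringsAsFactors = FALSE
)

full <- expand_to_grid(toy)
nrow(full)  # 4 rows: the missing s2/p2 combination has been added
```

In the result, the generated s2/p2 row has exist set to 0 and inherits subject_id "B" from the other s2 rows.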
Value
An S3 object of class phip_data containing:
- data_long
The (possibly expanded) long-format table.
- peptide_library
Loaded peptide-library metadata (if peptide_library = TRUE).
- meta
List with DuckDB connection handles.
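Because the long-format table is registered in DuckDB, it can be queried lazily with dplyr verbs. The sketch below does not call convert_standard(); it builds a stand-in in-memory DuckDB table (named data_long purely for illustration) and assumes the DBI, duckdb, dplyr and dbplyr packages are installed. How the table is reached from a real phip_data object may differ.

```r
library(dplyr)

# Stand-in for the DuckDB connection held by a phip_data object
con <- DBI::dbConnect(duckdb::duckdb())  # in-memory database

toy <- data.frame(
  sample_id  = c("s1", "s1", "s2", "s2"),
  peptide_id = c("p1", "p2", "p1", "p2"),
  exist      = c(1, 0, 1, 1)
)
DBI::dbWriteTable(con, "data_long", toy)

# Nothing is pulled into R until collect(); the filter and the count are
# translated to SQL and executed by DuckDB.
hits_per_sample <- tbl(con, "data_long") %>%
  filter(exist == 1) %>%
  count(sample_id, name = "n_hits") %>%
  arrange(sample_id) %>%
  collect()

DBI::dbDisconnect(con, shutdown = TRUE)
```

Here hits_per_sample holds one row per sample with the number of peptides flagged exist == 1.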
Details
Paths are resolved to absolute form before any work begins, and explicit checks confirm existence as well as extension validity.
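The checks described above, together with the filename rule used by sample_id_from_filenames, might look roughly like the following in base R. Both helper names are hypothetical and the error messages are invented; the sketch only mirrors the documented behaviour and is not taken from the package source.

```r
# Hypothetical helper: resolve a path, then confirm that it exists and that
# the file (or every file in a directory) has an allowed extension.
check_input_path <- function(path, allowed_ext = c("csv", "parquet")) {
  stopifnot(is.character(path), length(path) == 1L)

  # Resolve to absolute form first, then verify existence explicitly
  path <- normalizePath(path, winslash = "/", mustWork = FALSE)
  if (!file.exists(path)) {
    stop("Path does not exist: ", path, call. = FALSE)
  }

  # For a directory, validate every file it contains
  files <- if (dir.exists(path)) {
    list.files(path, full.names = TRUE, recursive = TRUE)
  } else {
    path
  }
  bad <- files[!tolower(tools::file_ext(files)) %in% allowed_ext]
  if (length(bad) > 0L) {
    stop("Unsupported extension: ",
         paste(basename(bad), collapse = ", "), call. = FALSE)
  }
  path
}

# Hypothetical helper: sample_id as the basename without its extension
sample_id_from_file <- function(files) {
  tools::file_path_sans_ext(basename(files))
}

tmp <- tempfile(fileext = ".csv")
writeLines("sample_id,peptide_id,exist", tmp)
resolved <- check_input_path(tmp)

sample_id_from_file("/data/run1/sampleA.parquet")  # "sampleA"
```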
See also
create_data() for the object constructor.
dplyr::tbl() to query DuckDB tables lazily.
Examples
# Basic import, auto-detecting default column names
phip_obj <- convert_standard(
  data_long_path = get_example_path("phip_mixture"),
  n_cores = 4,
  materialise_table = TRUE
)
#> [13:45:24] INFO Constructing <phip_data> object
#> -> create_data()
#> [13:45:24] INFO Fetching peptide metadata library via get_peptide_library()
#> [13:45:24] INFO Retrieving peptide metadata into DuckDB cache
#> -> get_peptide_library(force_refresh = FALSE)
#> [13:45:24] INFO Opened DuckDB connection
#> - cache dir:
#> /home/runner/.cache/R/phiperio/peptide_meta/phip_cache.duckdb
#> - table: peptide_meta
#> [13:45:24] OK Using cached peptide_meta (fast path)
#> [13:45:24] OK Retrieving peptide metadata into DuckDB cache - done
#> -> elapsed: 0.053s
#> [13:45:24] OK Peptide metadata acquired
#> [13:45:24] INFO Validating <phip_data>
#> -> validate_phip_data()
#> [13:45:24] INFO Checking structural requirements (shape & mandatory columns)
#> [13:45:24] INFO Checking outcome family availability (exist / fold_change /
#> raw_counts)
#> [13:45:24] INFO Checking collisions with reserved names
#> - subject_id, sample_id, timepoint, peptide_id, exist,
#> fold_change, counts_input, counts_hit
#> [13:45:24] INFO Ensuring all columns are atomic (no list-cols)
#> [13:45:24] INFO Checking key uniqueness
#> [13:45:24] INFO Validating value ranges & types for outcomes
#> [13:45:24] INFO Assessing sparsity (NA/zero prevalence vs threshold)
#> - warn threshold: 50%
#> [13:45:24] INFO Checking peptide_id coverage against peptide_library
#> Warning: [13:45:24] WARN peptide_id not found in peptide_library (e.g. 10003)
#> -> peptide library coverage.
#> [13:45:24] INFO Checking full grid completeness (peptide * sample)
#> Warning: [13:45:24] WARN Counts table is not a full peptide * sample grid.
#> -> grid completeness
#> - observed rows: 78200
#> - expected rows: 156000.
#> Warning: [13:45:24] WARN Grid remains incomplete (auto_expand = FALSE).
#> -> grid completeness
#> - observed rows: 78200
#> - expected rows: 156000.
#> [13:45:24] OK Validating <phip_data> - done
#> -> elapsed: 0.662s
#> [13:45:24] OK Constructing <phip_data> object - done
#> -> elapsed: 0.717s
# Import a CSV and rename columns
tmp_csv <- tempfile(fileext = ".csv")
utils::write.csv(
  data.frame(
    sample = c("s1", "s1"),
    pep = c("p1", "p2"),
    exist = c(1, 0),
    stringsAsFactors = FALSE
  ),
  tmp_csv,
  row.names = FALSE
)
phip_mem <- convert_standard(
  data_long_path = tmp_csv,
  sample_id = "sample",
  peptide_id = "pep",
  peptide_library = FALSE,
  materialise_table = FALSE
)
#> Skipping ANALYZE - raw_combined is a view.
#> [13:45:24] INFO Constructing <phip_data> object
#> -> create_data()
#> [13:45:24] INFO Validating <phip_data>
#> -> validate_phip_data()
#> [13:45:24] INFO Checking structural requirements (shape & mandatory columns)
#> [13:45:24] INFO Checking outcome family availability (exist / fold_change /
#> raw_counts)
#> [13:45:24] INFO Checking collisions with reserved names
#> - subject_id, sample_id, timepoint, peptide_id, exist,
#> fold_change, counts_input, counts_hit
#> [13:45:24] INFO Ensuring all columns are atomic (no list-cols)
#> [13:45:24] INFO Checking key uniqueness
#> [13:45:24] INFO Validating value ranges & types for outcomes
#> [13:45:24] INFO Assessing sparsity (NA/zero prevalence vs threshold)
#> - warn threshold: 50%
#> [13:45:25] INFO Checking peptide_id coverage against peptide_library
#> [13:45:25] INFO Checking full grid completeness (peptide * sample)
#> [13:45:25] OK Counts table is a full peptide * sample grid
#> [13:45:25] OK Validating <phip_data> - done
#> -> elapsed: 0.279s
#> [13:45:25] OK Constructing <phip_data> object - done
#> -> elapsed: 0.28s