SplitParquetReader

class pyprophet.io.export.split_parquet.SplitParquetReader(config: ExportIOConfig)[source]

Bases: BaseSplitParquetReader

Class for reading and processing data from an OpenSWATH workflow split parquet file. Extends the base reader with export functionality.

__init__(config: ExportIOConfig)[source]

Initialize the reader with a given configuration.

Parameters: config (BaseIOConfig) – Configuration object containing input details and module-specific parameters for reading.

_add_peptide_data(data, con) → DataFrame[source]: Add peptide-level error rate data from split files.

_add_protein_data(data, con) → DataFrame[source]: Add protein identifier data from split files.

_add_protein_error_data(data, con) → DataFrame[source]: Add protein-level error rate data from split files.

_add_transition_data(data, con) → DataFrame[source]: Add transition-level quantification data from split files.

_augment_data(data, con) → DataFrame[source]: Apply common data augmentations to the base dataset.

_build_feature_vars_sql() → str[source]: Build SQL fragment for feature variables.

_check_alignment_file_exists() → bool[source]

Check if alignment parquet file exists for split parquet format.

For split parquet, the alignment file is at the parent directory level:

- infile is a directory containing *.oswpq subdirectories
- the alignment file is at infile/feature_alignment.parquet
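The layout described above can be checked with standard-library Python. This is a minimal sketch, not the library's implementation: the helper name `alignment_file_exists` is hypothetical, and only the `infile/feature_alignment.parquet` location is taken from the description.

```python
from pathlib import Path


def alignment_file_exists(infile: str) -> bool:
    """Return True if <infile>/feature_alignment.parquet exists.

    Hypothetical helper. Assumes `infile` is the split parquet directory
    that contains the *.oswpq subdirectories.
    """
    return (Path(infile) / "feature_alignment.parquet").exists()
```

Calling `alignment_file_exists("path/to/split_dir")` returns `False` when the directory has no `feature_alignment.parquet` entry, which is the case in which aligned features cannot be fetched.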

_fetch_alignment_features(con) → DataFrame[source]

Fetch aligned features with good alignment scores from alignment parquet file.

This method checks for an alignment parquet file and retrieves features that have been aligned across runs and pass the alignment quality threshold. Only features whose reference feature passes the MS2 QVALUE threshold are included.

Parameters: con – DuckDB connection
Returns: DataFrame with aligned feature IDs that pass quality threshold

_get_ms1_score_info() → tuple[str, str][source]: Get MS1 score information if available.

_get_precursor_files()[source]: Helper to get precursor files based on structure.

_get_transition_files()[source]: Helper to get transition files based on structure.

_has_peptide_protein_global_scores() → bool[source]: Check if files contain peptide and protein global scores.

_is_unscored_file() → bool[source]: Check if the files are unscored by verifying the presence of the ‘SCORE_’ columns.
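The prefix test described above can be illustrated on a plain list of column names. This is a sketch under assumptions: the function `is_unscored` and the example column names are hypothetical, and the real method reads the column names from the parquet files rather than taking them as an argument.

```python
from typing import Iterable


def is_unscored(columns: Iterable[str]) -> bool:
    """Return True when no column name starts with 'SCORE_'.

    Hypothetical stand-in for the check described above; it only looks
    at names and never touches the files themselves.
    """
    return not any(name.startswith("SCORE_") for name in columns)


# Illustrative column names only, not taken from a real file.
unscored_columns = ["FEATURE_ID", "VAR_EXAMPLE_A", "VAR_EXAMPLE_B"]
scored_columns = unscored_columns + ["SCORE_EXAMPLE"]
```

With these inputs, `is_unscored(unscored_columns)` is `True` and `is_unscored(scored_columns)` is `False`.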

_read_augmented_data(con) → DataFrame[source]: Read standard data augmented with IPF information from split files.

_read_for_export_scored_report(con) → DataFrame[source]: Lightweight reader that returns the minimal scored-report columns from split Parquet files.

_read_library_data(con) → DataFrame[source]: Read precursor data for library generation. This does not include all of the columns in the standard output.

_read_peptidoform_data(con) → DataFrame[source]: Read data with peptidoform IPF information from split files.

_read_standard_data(con) → DataFrame[source]: Read standard OpenSWATH data without IPF from split files, optionally including aligned features.

_read_unscored_data(con) → DataFrame[source]: Read unscored data from split Parquet files.

export_feature_scores(outfile: str, plot_callback)[source]

Export feature scores from split Parquet directory for plotting.

Detects if SCORE columns exist and adjusts behavior:

- If SCORE columns exist: applies RANK==1 filtering and plots SCORE + VAR_ columns
- If SCORE columns don’t exist: plots only VAR_ columns

Parameters:

outfile (str) – Path to the output PDF file.
plot_callback (callable) – Function to call for plotting each level’s data. Signature: plot_callback(df, outfile, level, append)
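The column selection and callback signature described above can be sketched in plain Python. Everything here is illustrative: `select_plot_columns` and `recording_callback` are hypothetical names, the `SCORE_` prefix is assumed from the scored-file check, and the level label passed in the usage note is made up. Only the four-argument signature `plot_callback(df, outfile, level, append)` comes from the documentation.

```python
from typing import Iterable, List, Tuple


def select_plot_columns(columns: Iterable[str]) -> Tuple[List[str], bool]:
    """Return (columns to plot, whether SCORE columns were found).

    SCORE_ columns come first when present, followed by VAR_ columns.
    When no SCORE_ columns exist, only VAR_ columns are returned.
    """
    columns = list(columns)
    score_cols = [c for c in columns if c.startswith("SCORE_")]
    var_cols = [c for c in columns if c.startswith("VAR_")]
    return score_cols + var_cols, bool(score_cols)


# Records each invocation so the call order can be inspected afterwards.
calls: List[Tuple[str, str, bool]] = []


def recording_callback(df, outfile, level, append):
    """Callback with the documented signature; stores its arguments
    instead of drawing anything."""
    calls.append((level, outfile, append))
```

A callback like `recording_callback` could be passed as `plot_callback` to see which levels are plotted and whether each call appends to the output file, for example `recording_callback(None, "scores.pdf", "example_level", False)`.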

read() → DataFrame[source]: Abstract method to be implemented by subclasses to read data from split parquet format for a specific algorithm.