File Formats

Feature Files

The input files for PyProphet are typically feature files generated by OpenSWATH. These files contain the precursor peak-group features extracted from the raw data, which PyProphet then uses for scoring and analysis. As of OpenSWATH v2.4.0, feature files are stored in the OpenSWATH SQLite format (.osw); OpenSWATH also supports tabular (.tsv) and XML (.featureXML) output. PyProphet v3.0.0 added support for converting the SQLite (.osw) format to Parquet, a columnar storage file format optimized for large datasets.

OpenSWATH SQLite Format (.osw)

OSW files are SQLite-based and have a flexible relational data structure. They contain all peptide query parameters of PQP files together with the features detected and quantified by OpenSwathWorkflow (tables feature, feature_ms1, feature_ms2 and feature_transition). More information on the feature tables and the schema is available in OpenMS’s documentation on OpenSWATH SQLite.
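
Because an .osw file is an SQLite database, it can be inspected with Python's standard sqlite3 module. The following is a minimal sketch rather than part of PyProphet: the helper name list_tables is hypothetical, and which tables are present depends on the OpenSWATH and PyProphet steps that have been run on the file.

```python
import sqlite3


def list_tables(osw_path):
    """Return the sorted names of all tables in an SQLite-based .osw file."""
    con = sqlite3.connect(osw_path)
    try:
        # sqlite_master is SQLite's built-in catalogue of schema objects.
        rows = con.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        con.close()
    return [name for (name,) in rows]
```

Calling list_tables on the path of an existing .osw file returns the table names, which can then be compared against the schema described in the OpenMS documentation.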

Parquet Format (.parquet)

Parquet is a columnar storage file format optimized for large datasets. It is designed to be efficient in terms of both storage space and query performance. PyProphet supports reading and writing Parquet files, and it can convert the OpenSWATH SQLite format (.osw) to a single Parquet file that holds both precursor and transition data. Within that file, the rows form two blocks, one for precursor data and one for transition data, and the columns that do not apply to a block are NULL, similar to the example table below:

| protein_id | peptide_id | ipf_peptide_id | precursor_id | modified_sequence | charge | RT | feature_id | prec_feat_var_1 | prec_feat_var_2 | prec_feat_var_3 | transition_id | transition_annotation | transition_feat_1 | transition_feat_2 | transition_feat_3 | precursor_score | transition_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1556 | 8044 | 8044 | 0 | .(UniMod:26)CDTDVPFQLK | 2 | 786 | 4564656 | 0.8251878619194031 | 0.9905699491500854 | 0.9867947697639465 | NULL | NULL | NULL | NULL | NULL | 98 | NULL |
| 0 | 13404 | 13404 | 21408 | .(UniMod:26)CDVVYTHGLQDWNVKPR | 3 | 547 | 2341534 | 0.7650477886199951 | 0.9925554990768433 | 0.6403021812438965 | NULL | NULL | NULL | NULL | NULL | 79 | NULL |
| 1634 | 0 | 0 | 16886 | FSWISTGGGASMELLEGK | 2 | 234 | 65687785 | 0.7152583599090576 | 0.812627375125885 | 0.6165676116943359 | NULL | NULL | NULL | NULL | NULL | 56 | NULL |
| 3455 | 6788 | 6788 | 3659 | ENADLIMVGATGLNTFER | 3 | 453 | 13245346 | 0.8531889319419861 | 0.15485289692878723 | 0.5889896154403687 | NULL | NULL | NULL | NULL | NULL | 32 | NULL |
| NULL | NULL | 8044 | 0 | NULL | NULL | NULL | 4564656 | NULL | NULL | NULL | 0 | y7 | 6.691071510314941 | 0.46852633357048035 | 0.6704034209251404 | NULL | 98 |
| NULL | NULL | 8044 | 0 | NULL | NULL | NULL | 4564656 | NULL | NULL | NULL | 2 | b7 | 4.816525459289551 | 0.3565627932548523 | 0.5738980174064636 | NULL | 86 |
| NULL | NULL | 8044 | 0 | NULL | NULL | NULL | 4564656 | NULL | NULL | NULL | 18 | y3 | 2.7247447967529297 | 0.6799249053001404 | 0.7191503047943115 | NULL | 67 |
| NULL | NULL | 13404 | 21408 | NULL | NULL | NULL | 2341534 | NULL | NULL | NULL | 45 | y5 | 4.299717426300049 | 0.45827046036720276 | 0.6673739552497864 | NULL | 45 |
| NULL | NULL | 13404 | 21408 | NULL | NULL | NULL | 2341534 | NULL | NULL | NULL | 98 | b3 | 4.548809051513672 | 0.7069618105888367 | 0.7448312044143677 | NULL | 34 |
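
The two-block layout can be illustrated with plain Python. The sketch below copies a few rows and a subset of columns from the example table; the helper names split_blocks and transitions_by_feature are hypothetical, and treating a non-NULL transition_id as the marker of a transition row is an assumption drawn from the example, not a documented rule.

```python
def split_blocks(rows):
    """Separate combined rows into precursor-level and transition-level rows.

    ASSUMPTION: a row belongs to the transition block when its transition_id
    is not None (NULL), as in the example table.
    """
    precursors = [r for r in rows if r["transition_id"] is None]
    transitions = [r for r in rows if r["transition_id"] is not None]
    return precursors, transitions


def transitions_by_feature(transitions):
    """Group transition annotations by the feature_id they belong to."""
    grouped = {}
    for r in transitions:
        grouped.setdefault(r["feature_id"], []).append(r["transition_annotation"])
    return grouped


# Rows copied from the example table (only a subset of the columns).
rows = [
    {"feature_id": 4564656, "modified_sequence": ".(UniMod:26)CDTDVPFQLK",
     "transition_id": None, "transition_annotation": None},
    {"feature_id": 2341534, "modified_sequence": ".(UniMod:26)CDVVYTHGLQDWNVKPR",
     "transition_id": None, "transition_annotation": None},
    {"feature_id": 4564656, "modified_sequence": None,
     "transition_id": 0, "transition_annotation": "y7"},
    {"feature_id": 4564656, "modified_sequence": None,
     "transition_id": 2, "transition_annotation": "b7"},
    {"feature_id": 2341534, "modified_sequence": None,
     "transition_id": 45, "transition_annotation": "y5"},
]

precursors, transitions = split_blocks(rows)
grouped = transitions_by_feature(transitions)
```

Note that the check uses `is None` rather than truthiness, because a transition_id of 0 is a valid identifier in the example. The shared feature_id is what links each transition row back to its precursor row.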

Split Parquet Format (.parquet / .oswpq / .oswpqd)

PyProphet also supports a split parquet format, which is useful for large datasets. This format writes the precursor and transition data to separate files, allowing for more efficient storage and processing. To produce it, pass the --split_transition_data option when converting from the OpenSWATH SQLite format to Parquet. The split parquet files are laid out as follows:

└── 📁 merged_runs.oswpq
    ├── 📄 precursors_features.parquet
    └── 📄 transition_features.parquet

To further split the data by run, use the --split_runs option. This creates separate parquet files for each run, which can be useful for large datasets with multiple runs. The resulting files are laid out as follows:

└── 📁 merged_runs.oswpqd
    ├── 📁 20200626_erli_phos_10.oswpq
    │   ├── 📄 precursors_features.parquet
    │   └── 📄 transition_features.parquet
    ├── 📁 20200626_erli_phos_101.oswpq
    │   ├── 📄 precursors_features.parquet
    │   └── 📄 transition_features.parquet
    ├── 📁 20200626_erli_phos_102.oswpq
    │   ├── 📄 precursors_features.parquet
    │   └── 📄 transition_features.parquet
    ├── 📁 20200626_erli_phos_103.oswpq
    │   ├── 📄 precursors_features.parquet
    │   └── 📄 transition_features.parquet
    ├── 📁 20200626_erli_phos_104.oswpq
    │   ├── 📄 precursors_features.parquet
    │   └── 📄 transition_features.parquet
    ├── 📁 20200626_erli_phos_105.oswpq
    │   ├── 📄 precursors_features.parquet
    │   └── 📄 transition_features.parquet
    ├── 📁 20200626_erli_phos_106.oswpq
    │   ├── 📄 precursors_features.parquet
    │   └── 📄 transition_features.parquet
    ├── 📁 20200626_erli_phos_107.oswpq
    │   ├── 📄 precursors_features.parquet
    │   └── 📄 transition_features.parquet
    ├── 📁 20200626_erli_phos_108.oswpq
    │   ├── 📄 precursors_features.parquet
    │   └── 📄 transition_features.parquet
    ├── 📁 20200626_erli_phos_109.oswpq
    │   ├── 📄 precursors_features.parquet
    │   └── 📄 transition_features.parquet
    │   ... (236 more run(s) collapsed)
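
A per-run directory like the one above can be traversed with the standard library before handing the individual files to a Parquet reader. This is a minimal sketch that assumes exactly the layout shown (one `<run>.oswpq` directory per run, each holding the two named parquet files); find_run_files is a hypothetical helper, not a PyProphet function.

```python
from pathlib import Path


def find_run_files(oswpqd_dir):
    """Map each run name to its precursor and transition parquet paths.

    ASSUMPTION: every run lives in a '<run>.oswpq' sub-directory containing
    'precursors_features.parquet' and 'transition_features.parquet'.
    """
    runs = {}
    for run_dir in sorted(Path(oswpqd_dir).glob("*.oswpq")):
        if not run_dir.is_dir():
            continue  # skip stray files that happen to end in .oswpq
        runs[run_dir.stem] = {
            "precursors": run_dir / "precursors_features.parquet",
            "transitions": run_dir / "transition_features.parquet",
        }
    return runs
```

The returned dictionary is keyed by run name (the directory name without the .oswpq suffix), so a single run can be loaded without touching the files of the others.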

Precursors Features Parquet Schema

Note that not all columns listed here are present in every parquet file.
Schema([
    ('PROTEIN_ID', Int64),
    ('PEPTIDE_ID', Int64),
    ('IPF_PEPTIDE_ID', Int64),
    ('PRECURSOR_ID', Int64),
    ('PROTEIN_ACCESSION', String),
    ('UNMODIFIED_SEQUENCE', String),
    ('MODIFIED_SEQUENCE', String),
    ('PRECURSOR_TRAML_ID', String),
    ('PRECURSOR_GROUP_LABEL', String),
    ('PRECURSOR_MZ', Float64),
    ('PRECURSOR_CHARGE', Int64),
    ('PRECURSOR_LIBRARY_INTENSITY', Float64),
    ('PRECURSOR_LIBRARY_RT', Float64),
    ('PRECURSOR_LIBRARY_DRIFT_TIME', Float64),
    ('GENE_ID', Int64),
    ('GENE_NAME', String),
    ('GENE_DECOY', Int64),
    ('PROTEIN_DECOY', Int64),
    ('PEPTIDE_DECOY', Int64),
    ('PRECURSOR_DECOY', Int64),
    ('RUN_ID', Int64),
    ('FILENAME', String),
    ('FEATURE_ID', Int64),
    ('EXP_RT', Float64),
    ('EXP_IM', Float64),
    ('NORM_RT', Float64),
    ('DELTA_RT', Float64),
    ('LEFT_WIDTH', Float64),
    ('RIGHT_WIDTH', Float64),
    ('FEATURE_MS1_AREA_INTENSITY', Float64),
    ('FEATURE_MS1_APEX_INTENSITY', Float64),
    ('FEATURE_MS1_EXP_IM', Float64),
    ('FEATURE_MS1_DELTA_IM', Float64),
    ('FEATURE_MS1_VAR_MASSDEV_SCORE', Float64),
    ('FEATURE_MS1_VAR_MI_SCORE', Float64),
    ('FEATURE_MS1_VAR_MI_CONTRAST_SCORE', Float64),
    ('FEATURE_MS1_VAR_MI_COMBINED_SCORE', Float64),
    ('FEATURE_MS1_VAR_ISOTOPE_CORRELATION_SCORE', Float64),
    ('FEATURE_MS1_VAR_ISOTOPE_OVERLAP_SCORE', Float64),
    ('FEATURE_MS1_VAR_IM_MS1_DELTA_SCORE', Float64),
    ('FEATURE_MS1_VAR_XCORR_COELUTION', Float64),
    ('FEATURE_MS1_VAR_XCORR_COELUTION_CONTRAST', Float64),
    ('FEATURE_MS1_VAR_XCORR_COELUTION_COMBINED', Float64),
    ('FEATURE_MS1_VAR_XCORR_SHAPE', Float64),
    ('FEATURE_MS1_VAR_XCORR_SHAPE_CONTRAST', Float64),
    ('FEATURE_MS1_VAR_XCORR_SHAPE_COMBINED', Float64),
    ('FEATURE_MS2_AREA_INTENSITY', Float64),
    ('FEATURE_MS2_TOTAL_AREA_INTENSITY', Float64),
    ('FEATURE_MS2_APEX_INTENSITY', Float64),
    ('FEATURE_MS2_EXP_IM', Float64),
    ('FEATURE_MS2_EXP_IM_LEFTWIDTH', Float64),
    ('FEATURE_MS2_EXP_IM_RIGHTWIDTH', Float64),
    ('FEATURE_MS2_DELTA_IM', Float64),
    ('FEATURE_MS2_TOTAL_MI', Float64),
    ('FEATURE_MS2_VAR_BSERIES_SCORE', Float64),
    ('FEATURE_MS2_VAR_DOTPROD_SCORE', Float64),
    ('FEATURE_MS2_VAR_INTENSITY_SCORE', Float64),
    ('FEATURE_MS2_VAR_ISOTOPE_CORRELATION_SCORE', Float64),
    ('FEATURE_MS2_VAR_ISOTOPE_OVERLAP_SCORE', Float64),
    ('FEATURE_MS2_VAR_LIBRARY_CORR', Float64),
    ('FEATURE_MS2_VAR_LIBRARY_DOTPROD', Float64),
    ('FEATURE_MS2_VAR_LIBRARY_MANHATTAN', Float64),
    ('FEATURE_MS2_VAR_LIBRARY_RMSD', Float64),
    ('FEATURE_MS2_VAR_LIBRARY_ROOTMEANSQUARE', Float64),
    ('FEATURE_MS2_VAR_LIBRARY_SANGLE', Float64),
    ('FEATURE_MS2_VAR_LOG_SN_SCORE', Float64),
    ('FEATURE_MS2_VAR_MANHATTAN_SCORE', Float64),
    ('FEATURE_MS2_VAR_MASSDEV_SCORE', Float64),
    ('FEATURE_MS2_VAR_MASSDEV_SCORE_WEIGHTED', Float64),
    ('FEATURE_MS2_VAR_MI_SCORE', Float64),
    ('FEATURE_MS2_VAR_MI_WEIGHTED_SCORE', Float64),
    ('FEATURE_MS2_VAR_MI_RATIO_SCORE', Float64),
    ('FEATURE_MS2_VAR_NORM_RT_SCORE', Float64),
    ('FEATURE_MS2_VAR_XCORR_COELUTION', Float64),
    ('FEATURE_MS2_VAR_XCORR_COELUTION_WEIGHTED', Float64),
    ('FEATURE_MS2_VAR_XCORR_SHAPE', Float64),
    ('FEATURE_MS2_VAR_XCORR_SHAPE_WEIGHTED', Float64),
    ('FEATURE_MS2_VAR_YSERIES_SCORE', Float64),
    ('FEATURE_MS2_VAR_ELUTION_MODEL_FIT_SCORE', Float64),
    ('FEATURE_MS2_VAR_IM_XCORR_SHAPE', Float64),
    ('FEATURE_MS2_VAR_IM_XCORR_COELUTION', Float64),
    ('FEATURE_MS2_VAR_IM_DELTA_SCORE', Float64),
    ('FEATURE_MS2_VAR_SONAR_LAG', Float64),
    ('FEATURE_MS2_VAR_SONAR_SHAPE', Float64),
    ('FEATURE_MS2_VAR_SONAR_LOG_SN', Float64),
    ('FEATURE_MS2_VAR_SONAR_LOG_DIFF', Float64),
    ('FEATURE_MS2_VAR_SONAR_LOG_TREND', Float64),
    ('FEATURE_MS2_VAR_SONAR_RSQ', Float64),
    ('SCORE_MS2_SCORE', Float64),
    ('SCORE_MS2_PEAK_GROUP_RANK', Float64),
    ('SCORE_MS2_P_VALUE', Float64),
    ('SCORE_MS2_Q_VALUE', Float64),
    ('SCORE_MS2_PEP', Float64),
    ('SCORE_IPF_PRECURSOR_PEAKGROUP_PEP', Float64),
    ('SCORE_IPF_QVALUE', Float64),
    ('SCORE_IPF_PEP', Float64),
    ('SCORE_PEPTIDE_RUN_SPECIFIC_SCORE', Float64),
    ('SCORE_PEPTIDE_RUN_SPECIFIC_P_VALUE', Float64),
    ('SCORE_PEPTIDE_RUN_SPECIFIC_Q_VALUE', Float64),
    ('SCORE_PEPTIDE_RUN_SPECIFIC_PEP', Float64),
    ('SCORE_PEPTIDE_EXPERIMENT_WIDE_SCORE', Float64),
    ('SCORE_PEPTIDE_EXPERIMENT_WIDE_P_VALUE', Float64),
    ('SCORE_PEPTIDE_EXPERIMENT_WIDE_Q_VALUE', Float64),
    ('SCORE_PEPTIDE_EXPERIMENT_WIDE_PEP', Float64),
    ('SCORE_PEPTIDE_GLOBAL_SCORE', Float64),
    ('SCORE_PEPTIDE_GLOBAL_P_VALUE', Float64),
    ('SCORE_PEPTIDE_GLOBAL_Q_VALUE', Float64),
    ('SCORE_PEPTIDE_GLOBAL_PEP', Float64),
    ('SCORE_PROTEIN_RUN_SPECIFIC_SCORE', Float64),
    ('SCORE_PROTEIN_RUN_SPECIFIC_P_VALUE', Float64),
    ('SCORE_PROTEIN_RUN_SPECIFIC_Q_VALUE', Float64),
    ('SCORE_PROTEIN_RUN_SPECIFIC_PEP', Float64),
    ('SCORE_PROTEIN_EXPERIMENT_WIDE_SCORE', Float64),
    ('SCORE_PROTEIN_EXPERIMENT_WIDE_P_VALUE', Float64),
    ('SCORE_PROTEIN_EXPERIMENT_WIDE_Q_VALUE', Float64),
    ('SCORE_PROTEIN_EXPERIMENT_WIDE_PEP', Float64),
    ('SCORE_PROTEIN_GLOBAL_SCORE', Float64),
    ('SCORE_PROTEIN_GLOBAL_P_VALUE', Float64),
    ('SCORE_PROTEIN_GLOBAL_Q_VALUE', Float64),
    ('SCORE_PROTEIN_GLOBAL_PEP', Float64)
])
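
Since not every column in the schema is present in every file, it can help to intersect the columns you want with the columns a given file actually has before reading it. The sketch below works on plain lists of names; the helpers select_available and missing_columns are hypothetical. The list of available names is assumed to come from a Parquet reader, for example `pyarrow.parquet.read_schema(path).names` if pyarrow is installed.

```python
def select_available(wanted, available):
    """Return the wanted column names that exist, preserving the wanted order."""
    present = set(available)
    return [name for name in wanted if name in present]


def missing_columns(wanted, available):
    """Return the wanted column names that are absent from the file."""
    present = set(available)
    return [name for name in wanted if name not in present]


# Hypothetical file without ion mobility or IPF columns.
available = ["PRECURSOR_ID", "FEATURE_ID", "EXP_RT", "SCORE_MS2_Q_VALUE"]
wanted = ["FEATURE_ID", "EXP_RT", "EXP_IM", "SCORE_MS2_Q_VALUE", "SCORE_IPF_QVALUE"]

usable = select_available(wanted, available)
absent = missing_columns(wanted, available)
```

Reporting the absent columns up front avoids a failure partway through an analysis when, for instance, a file was produced without ion mobility data or before IPF scoring.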


Transition Features Parquet Schema

Note that not all columns listed here are present in every parquet file.
Schema([
    ('RUN_ID', Int64),
    ('IPF_PEPTIDE_ID', Int64),
    ('PRECURSOR_ID', Int64),
    ('TRANSITION_ID', Int64),
    ('TRANSITION_TRAML_ID', String),
    ('PRODUCT_MZ', Float64),
    ('TRANSITION_CHARGE', Int64),
    ('TRANSITION_TYPE', String),
    ('TRANSITION_ORDINAL', Int64),
    ('ANNOTATION', String),
    ('TRANSITION_DETECTING', Int64),
    ('TRANSITION_LIBRARY_INTENSITY', Float64),
    ('TRANSITION_DECOY', Int64),
    ('FEATURE_ID', Int64),
    ('FEATURE_TRANSITION_AREA_INTENSITY', Float64),
    ('FEATURE_TRANSITION_TOTAL_AREA_INTENSITY', Float64),
    ('FEATURE_TRANSITION_APEX_RT', Float64),
    ('FEATURE_TRANSITION_APEX_INTENSITY', Float64),
    ('FEATURE_TRANSITION_RT_FWHM', Float64),
    ('FEATURE_TRANSITION_MASSERROR_PPM', Float64),
    ('FEATURE_TRANSITION_TOTAL_MI', Float64),
    ('FEATURE_TRANSITION_VAR_INTENSITY_SCORE', Float64),
    ('FEATURE_TRANSITION_VAR_INTENSITY_RATIO_SCORE', Float64),
    ('FEATURE_TRANSITION_VAR_LOG_INTENSITY', Float64),
    ('FEATURE_TRANSITION_VAR_XCORR_COELUTION', Float64),
    ('FEATURE_TRANSITION_VAR_XCORR_SHAPE', Float64),
    ('FEATURE_TRANSITION_VAR_LOG_SN_SCORE', Float64),
    ('FEATURE_TRANSITION_VAR_MASSDEV_SCORE', Float64),
    ('FEATURE_TRANSITION_VAR_MI_SCORE', Float64),
    ('FEATURE_TRANSITION_VAR_MI_RATIO_SCORE', Float64),
    ('FEATURE_TRANSITION_VAR_ISOTOPE_CORRELATION_SCORE', Float64),
    ('FEATURE_TRANSITION_VAR_ISOTOPE_OVERLAP_SCORE', Float64),
    ('FEATURE_TRANSITION_START_POSITION_AT_5', Float64),
    ('FEATURE_TRANSITION_END_POSITION_AT_5', Float64),
    ('FEATURE_TRANSITION_START_POSITION_AT_10', Float64),
    ('FEATURE_TRANSITION_END_POSITION_AT_10', Float64),
    ('FEATURE_TRANSITION_START_POSITION_AT_50', Float64),
    ('FEATURE_TRANSITION_END_POSITION_AT_50', Float64),
    ('FEATURE_TRANSITION_TOTAL_WIDTH', Float64),
    ('FEATURE_TRANSITION_TAILING_FACTOR', Float64),
    ('FEATURE_TRANSITION_ASYMMETRY_FACTOR', Float64),
    ('FEATURE_TRANSITION_SLOPE_OF_BASELINE', Float64),
    ('FEATURE_TRANSITION_BASELINE_DELTA_2_HEIGHT', Float64),
    ('FEATURE_TRANSITION_POINTS_ACROSS_BASELINE', Float64),
    ('FEATURE_TRANSITION_POINTS_ACROSS_HALF_HEIGHT', Float64),
    ('FEATURE_TRANSITION_EXP_IM', Float64),
    ('FEATURE_TRANSITION_EXP_IM_LEFTWIDTH', Float64),
    ('FEATURE_TRANSITION_EXP_IM_RIGHTWIDTH', Float64),
    ('FEATURE_TRANSITION_DELTA_IM', Float64),
    ('FEATURE_TRANSITION_VAR_IM_DELTA_SCORE', Float64),
    ('FEATURE_TRANSITION_VAR_IM_LOG_INTENSITY', Float64),
    ('FEATURE_TRANSITION_VAR_IM_XCORR_COELUTION_CONTRAST', Binary),
    ('FEATURE_TRANSITION_VAR_IM_XCORR_SHAPE_CONTRAST', Binary),
    ('FEATURE_TRANSITION_VAR_IM_XCORR_COELUTION_COMBINED', Binary),
    ('FEATURE_TRANSITION_VAR_IM_XCORR_SHAPE_COMBINED', Binary),
    ('SCORE_TRANSITION_SCORE', Float32),
    ('SCORE_TRANSITION_RANK', UInt32),
    ('SCORE_TRANSITION_P_VALUE', Float64),
    ('SCORE_TRANSITION_Q_VALUE', Float64),
    ('SCORE_TRANSITION_PEP', Float64)
])

Extracted Ion Chromatograms (XICs)

OpenSwathWorkflow can optionally output extracted ion chromatograms (XICs) for each precursor and transition. These XICs are stored in the SQLite-based sqMass format (.sqMass). They can be used for further analysis (chromatogram feature alignment using DIAlignR or ARYCAL) or for visualization of the data (using massdash).

SqMass Format (.sqMass)

More information on the sqMass format and its schema is available in OpenMS’s documentation on SqMass File.

Parquet Format (.parquet)

PyProphet supports converting sqMass files (.sqMass) to the Parquet format for more efficient storage. Use the export parquet subcommand to perform the conversion.

The Parquet schema used for the XICs is as follows:

Schema([
    ('PRECURSOR_ID', Int64),
    ('TRANSITION_ID', Int64),
    ('MODIFIED_SEQUENCE', String),
    ('PRECURSOR_CHARGE', Int64),
    ('PRODUCT_CHARGE', Int64),
    ('DETECTING_TRANSITION', Int64),
    ('PRECURSOR_DECOY', Int64),
    ('PRODUCT_DECOY', Int64),
    ('TRANSITION_ORDINAL', Int64),
    ('TRANSITION_TYPE', String),
    ('NATIVE_ID', String),
    ('RT_DATA', Binary),
    ('INTENSITY_DATA', Binary),
    ('RT_COMPRESSION', Int64),
    ('INTENSITY_COMPRESSION', Int64)
])

RT_COMPRESSION and INTENSITY_COMPRESSION store the compression type of the RT and intensity data, respectively. RT_DATA and INTENSITY_DATA are stored as binary data, which can be decoded with the decompression algorithm that matches the stored compression type.

The possible values and their decoding for the compression type are as follows:

Compression Types

| Value | Compression Type |
|---|---|
| 0 | No compression |
| 1 | zlib |
| 2 | np-linear |
| 3 | np-slof |
| 4 | np-pic |
| 5 | np-linear + zlib |
| 6 | np-slof + zlib |
| 7 | np-pic + zlib |
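
The compression codes can be turned into a small dispatch when decoding RT_DATA or INTENSITY_DATA. The sketch below uses only the standard library and therefore handles codes 0 and 1 only. It assumes that the uncompressed bytes are little-endian 64-bit floats, which should be verified against the sqMass documentation. The np-* codes presumably refer to MS-Numpress encodings and need a separate decoder, so they raise NotImplementedError here. The names COMPRESSION_NAMES and decode_array are hypothetical.

```python
import struct
import zlib

# Compression codes as listed in the table above.
COMPRESSION_NAMES = {
    0: "No compression",
    1: "zlib",
    2: "np-linear",
    3: "np-slof",
    4: "np-pic",
    5: "np-linear + zlib",
    6: "np-slof + zlib",
    7: "np-pic + zlib",
}


def decode_array(blob, compression):
    """Decode a binary array stored with compression code 0 or 1.

    ASSUMPTION: the uncompressed payload is a sequence of little-endian
    64-bit floats. Codes 2-7 require a numpress decoder and are not handled.
    """
    if compression not in COMPRESSION_NAMES:
        raise ValueError(f"unknown compression code: {compression}")
    if compression == 1:
        blob = zlib.decompress(blob)
    elif compression != 0:
        raise NotImplementedError(
            f"{COMPRESSION_NAMES[compression]} needs a numpress decoder"
        )
    if len(blob) % 8 != 0:
        raise ValueError("payload length is not a multiple of 8 bytes")
    count = len(blob) // 8
    return list(struct.unpack(f"<{count}d", blob))
```

Each row carries its own RT_COMPRESSION and INTENSITY_COMPRESSION values, so the code should be looked up per row and per array rather than assumed to be constant across a file.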