.. _file_formats:

File Formats
=========================

Feature Files
-------------

The input files for PyProphet are typically feature files generated by `OpenSWATH `_. These files contain the precursor peak-group features extracted from the raw data, which are then used for scoring and analysis in PyProphet. As of OpenSWATH v2.4.0, the feature files are stored in the OpenSWATH SQLite format (*.osw*); OpenSWATH also supports tabular (*.tsv*) and XML (*.featureXML*) output. PyProphet v3.0.0 added support for converting the SQLite (*.osw*) format to Parquet, a columnar storage file format that is optimized for large datasets.

OpenSWATH SQLite Format (*.osw*)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The SQLite-based OSW files have a flexible relational data structure. They contain all peptide query parameters of `PQP files `_ together with the features detected and quantified by OpenSwathWorkflow (feature, feature_ms1, feature_ms2 and feature_transition). More information on the feature tables and the schema is available in the OpenMS documentation on `OpenSWATH SQLite `_.

Parquet Format (*.parquet*)
^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. _parquet_format:

Parquet is a columnar storage file format that is optimized for use with large datasets. It is designed to be efficient in terms of both storage space and query performance. PyProphet supports reading and writing Parquet files, allowing users to work with large datasets more efficiently.

PyProphet can convert the OpenSWATH SQLite format (*.osw*) to a single parquet file (with both precursor and transition data).
The file stores precursor rows and transition rows as two separate blocks; columns that do not apply to a row are NULL, similar to the example table below:

+------------+------------+----------------+--------------+-------------------------------+--------+------+------------+--------------------+---------------------+--------------------+---------------+-----------------------+--------------------+---------------------+--------------------+-----------------+------------------+
| protein_id | peptide_id | ipf_peptide_id | precursor_id | modified_sequence             | charge | RT   | feature_id | prec_feat_var_1    | prec_feat_var_2     | prec_feat_var_3    | transition_id | transition_annotation | transition_feat_1  | transition_feat_2   | transition_feat_3  | precursor_score | transition_score |
+============+============+================+==============+===============================+========+======+============+====================+=====================+====================+===============+=======================+====================+=====================+====================+=================+==================+
| 1556       | 8044       | 8044           | 0            | .(UniMod:26)CDTDVPFQLK        | 2      | 786  | 4564656    | 0.8251878619194031 | 0.9905699491500854  | 0.9867947697639465 | NULL          | NULL                  | NULL               | NULL                | NULL               | 98              | NULL             |
+------------+------------+----------------+--------------+-------------------------------+--------+------+------------+--------------------+---------------------+--------------------+---------------+-----------------------+--------------------+---------------------+--------------------+-----------------+------------------+
| 0          | 13404      | 13404          | 21408        | .(UniMod:26)CDVVYTHGLQDWNVKPR | 3      | 547  | 2341534    | 0.7650477886199951 | 0.9925554990768433  | 0.6403021812438965 | NULL          | NULL                  | NULL               | NULL                | NULL               | 79              | NULL             |
+------------+------------+----------------+--------------+-------------------------------+--------+------+------------+--------------------+---------------------+--------------------+---------------+-----------------------+--------------------+---------------------+--------------------+-----------------+------------------+
| 1634       | 0          | 0              | 16886        | FSWISTGGGASMELLEGK            | 2      | 234  | 65687785   | 0.7152583599090576 | 0.812627375125885   | 0.6165676116943359 | NULL          | NULL                  | NULL               | NULL                | NULL               | 56              | NULL             |
+------------+------------+----------------+--------------+-------------------------------+--------+------+------------+--------------------+---------------------+--------------------+---------------+-----------------------+--------------------+---------------------+--------------------+-----------------+------------------+
| 3455       | 6788       | 6788           | 3659         | ENADLIMVGATGLNTFER            | 3      | 453  | 13245346   | 0.8531889319419861 | 0.15485289692878723 | 0.5889896154403687 | NULL          | NULL                  | NULL               | NULL                | NULL               | 32              | NULL             |
+------------+------------+----------------+--------------+-------------------------------+--------+------+------------+--------------------+---------------------+--------------------+---------------+-----------------------+--------------------+---------------------+--------------------+-----------------+------------------+
| NULL       | NULL       | 8044           | 0            | NULL                          | NULL   | NULL | 4564656    | NULL               | NULL                | NULL               | 0             | y7                    | 6.691071510314941  | 0.46852633357048035 | 0.6704034209251404 | NULL            | 98               |
+------------+------------+----------------+--------------+-------------------------------+--------+------+------------+--------------------+---------------------+--------------------+---------------+-----------------------+--------------------+---------------------+--------------------+-----------------+------------------+
| NULL       | NULL       | 8044           | 0            | NULL                          | NULL   | NULL | 4564656    | NULL               | NULL                | NULL               | 2             | b7                    | 4.816525459289551  | 0.3565627932548523  | 0.5738980174064636 | NULL            | 86               |
+------------+------------+----------------+--------------+-------------------------------+--------+------+------------+--------------------+---------------------+--------------------+---------------+-----------------------+--------------------+---------------------+--------------------+-----------------+------------------+
| NULL       | NULL       | 8044           | 0            | NULL                          | NULL   | NULL | 4564656    | NULL               | NULL                | NULL               | 18            | y3                    | 2.7247447967529297 | 0.6799249053001404  | 0.7191503047943115 | NULL            | 67               |
+------------+------------+----------------+--------------+-------------------------------+--------+------+------------+--------------------+---------------------+--------------------+---------------+-----------------------+--------------------+---------------------+--------------------+-----------------+------------------+
| NULL       | NULL       | 13404          | 21408        | NULL                          | NULL   | NULL | 2341534    | NULL               | NULL                | NULL               | 45            | y5                    | 4.299717426300049  | 0.45827046036720276 | 0.6673739552497864 | NULL            | 45               |
+------------+------------+----------------+--------------+-------------------------------+--------+------+------------+--------------------+---------------------+--------------------+---------------+-----------------------+--------------------+---------------------+--------------------+-----------------+------------------+
| NULL       | NULL       | 13404          | 21408        | NULL                          | NULL   | NULL | 2341534    | NULL               | NULL                | NULL               | 98            | b3                    | 4.548809051513672  | 0.7069618105888367  | 0.7448312044143677 | NULL            | 34               |
+------------+------------+----------------+--------------+-------------------------------+--------+------+------------+--------------------+---------------------+--------------------+---------------+-----------------------+--------------------+---------------------+--------------------+-----------------+------------------+

Split Parquet Format (*.parquet* / *.oswpq* / *.oswpqd*)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. _split_parquet_format:

PyProphet also supports a split parquet format, which is useful for large datasets. This format splits the precursor and transition data into separate files, allowing for more efficient storage and processing. The split parquet format can be used with the ``--split_transition_data`` option when converting from OpenSWATH SQLite format to Parquet format. The split parquet files are named as follows:

.. code-block:: text

    └── 📁 merged_runs.oswpq
        ├── 📄 precursors_features.parquet
        └── 📄 transition_features.parquet

To further split the data by run, you can use the ``--split_runs`` option. This creates a separate pair of parquet files for each run, which can be useful for large datasets with multiple runs. The per-run files are named as follows:

.. code-block:: text

    └── 📁 merged_runs.oswpqd
        ├── 📁 20200626_erli_phos_10.oswpq
        │   ├── 📄 precursors_features.parquet
        │   └── 📄 transition_features.parquet
        ├── 📁 20200626_erli_phos_101.oswpq
        │   ├── 📄 precursors_features.parquet
        │   └── 📄 transition_features.parquet
        ├── 📁 20200626_erli_phos_102.oswpq
        │   ├── 📄 precursors_features.parquet
        │   └── 📄 transition_features.parquet
        ├── 📁 20200626_erli_phos_103.oswpq
        │   ├── 📄 precursors_features.parquet
        │   └── 📄 transition_features.parquet
        ├── 📁 20200626_erli_phos_104.oswpq
        │   ├── 📄 precursors_features.parquet
        │   └── 📄 transition_features.parquet
        ├── 📁 20200626_erli_phos_105.oswpq
        │   ├── 📄 precursors_features.parquet
        │   └── 📄 transition_features.parquet
        ├── 📁 20200626_erli_phos_106.oswpq
        │   ├── 📄 precursors_features.parquet
        │   └── 📄 transition_features.parquet
        ├── 📁 20200626_erli_phos_107.oswpq
        │   ├── 📄 precursors_features.parquet
        │   └── 📄 transition_features.parquet
        ├── 📁 20200626_erli_phos_108.oswpq
        │   ├── 📄 precursors_features.parquet
        │   └── 📄 transition_features.parquet
        ├── 📁 20200626_erli_phos_109.oswpq
        │   ├── 📄 precursors_features.parquet
        │   └── 📄 transition_features.parquet
        │   ... (236 more run(s) collapsed)

Precursors Features Parquet Schema
**********************************

The Precursors Features Parquet schema is listed below. Note that not all columns listed here are in every parquet file.

.. code-block:: text

    Schema([
        ('PROTEIN_ID', Int64),
        ('PEPTIDE_ID', Int64),
        ('IPF_PEPTIDE_ID', Int64),
        ('PRECURSOR_ID', Int64),
        ('PROTEIN_ACCESSION', String),
        ('UNMODIFIED_SEQUENCE', String),
        ('MODIFIED_SEQUENCE', String),
        ('PRECURSOR_TRAML_ID', String),
        ('PRECURSOR_GROUP_LABEL', String),
        ('PRECURSOR_MZ', Float64),
        ('PRECURSOR_CHARGE', Int64),
        ('PRECURSOR_LIBRARY_INTENSITY', Float64),
        ('PRECURSOR_LIBRARY_RT', Float64),
        ('PRECURSOR_LIBRARY_DRIFT_TIME', Float64),
        ('GENE_ID', Int64),
        ('GENE_NAME', String),
        ('GENE_DECOY', Int64),
        ('PROTEIN_DECOY', Int64),
        ('PEPTIDE_DECOY', Int64),
        ('PRECURSOR_DECOY', Int64),
        ('RUN_ID', Int64),
        ('FILENAME', String),
        ('FEATURE_ID', Int64),
        ('EXP_RT', Float64),
        ('EXP_IM', Float64),
        ('NORM_RT', Float64),
        ('DELTA_RT', Float64),
        ('LEFT_WIDTH', Float64),
        ('RIGHT_WIDTH', Float64),
        ('FEATURE_MS1_AREA_INTENSITY', Float64),
        ('FEATURE_MS1_APEX_INTENSITY', Float64),
        ('FEATURE_MS1_EXP_IM', Float64),
        ('FEATURE_MS1_DELTA_IM', Float64),
        ('FEATURE_MS1_VAR_MASSDEV_SCORE', Float64),
        ('FEATURE_MS1_VAR_MI_SCORE', Float64),
        ('FEATURE_MS1_VAR_MI_CONTRAST_SCORE', Float64),
        ('FEATURE_MS1_VAR_MI_COMBINED_SCORE', Float64),
        ('FEATURE_MS1_VAR_ISOTOPE_CORRELATION_SCORE', Float64),
        ('FEATURE_MS1_VAR_ISOTOPE_OVERLAP_SCORE', Float64),
        ('FEATURE_MS1_VAR_IM_MS1_DELTA_SCORE', Float64),
        ('FEATURE_MS1_VAR_XCORR_COELUTION', Float64),
        ('FEATURE_MS1_VAR_XCORR_COELUTION_CONTRAST', Float64),
        ('FEATURE_MS1_VAR_XCORR_COELUTION_COMBINED', Float64),
        ('FEATURE_MS1_VAR_XCORR_SHAPE', Float64),
        ('FEATURE_MS1_VAR_XCORR_SHAPE_CONTRAST', Float64),
        ('FEATURE_MS1_VAR_XCORR_SHAPE_COMBINED', Float64),
        ('FEATURE_MS2_AREA_INTENSITY', Float64),
        ('FEATURE_MS2_TOTAL_AREA_INTENSITY', Float64),
        ('FEATURE_MS2_APEX_INTENSITY', Float64),
        ('FEATURE_MS2_EXP_IM', Float64),
        ('FEATURE_MS2_EXP_IM_LEFTWIDTH', Float64),
        ('FEATURE_MS2_EXP_IM_RIGHTWIDTH', Float64),
        ('FEATURE_MS2_DELTA_IM', Float64),
        ('FEATURE_MS2_TOTAL_MI', Float64),
        ('FEATURE_MS2_VAR_BSERIES_SCORE', Float64),
        ('FEATURE_MS2_VAR_DOTPROD_SCORE', Float64),
        ('FEATURE_MS2_VAR_INTENSITY_SCORE', Float64),
        ('FEATURE_MS2_VAR_ISOTOPE_CORRELATION_SCORE', Float64),
        ('FEATURE_MS2_VAR_ISOTOPE_OVERLAP_SCORE', Float64),
        ('FEATURE_MS2_VAR_LIBRARY_CORR', Float64),
        ('FEATURE_MS2_VAR_LIBRARY_DOTPROD', Float64),
        ('FEATURE_MS2_VAR_LIBRARY_MANHATTAN', Float64),
        ('FEATURE_MS2_VAR_LIBRARY_RMSD', Float64),
        ('FEATURE_MS2_VAR_LIBRARY_ROOTMEANSQUARE', Float64),
        ('FEATURE_MS2_VAR_LIBRARY_SANGLE', Float64),
        ('FEATURE_MS2_VAR_LOG_SN_SCORE', Float64),
        ('FEATURE_MS2_VAR_MANHATTAN_SCORE', Float64),
        ('FEATURE_MS2_VAR_MASSDEV_SCORE', Float64),
        ('FEATURE_MS2_VAR_MASSDEV_SCORE_WEIGHTED', Float64),
        ('FEATURE_MS2_VAR_MI_SCORE', Float64),
        ('FEATURE_MS2_VAR_MI_WEIGHTED_SCORE', Float64),
        ('FEATURE_MS2_VAR_MI_RATIO_SCORE', Float64),
        ('FEATURE_MS2_VAR_NORM_RT_SCORE', Float64),
        ('FEATURE_MS2_VAR_XCORR_COELUTION', Float64),
        ('FEATURE_MS2_VAR_XCORR_COELUTION_WEIGHTED', Float64),
        ('FEATURE_MS2_VAR_XCORR_SHAPE', Float64),
        ('FEATURE_MS2_VAR_XCORR_SHAPE_WEIGHTED', Float64),
        ('FEATURE_MS2_VAR_YSERIES_SCORE', Float64),
        ('FEATURE_MS2_VAR_ELUTION_MODEL_FIT_SCORE', Float64),
        ('FEATURE_MS2_VAR_IM_XCORR_SHAPE', Float64),
        ('FEATURE_MS2_VAR_IM_XCORR_COELUTION', Float64),
        ('FEATURE_MS2_VAR_IM_DELTA_SCORE', Float64),
        ('FEATURE_MS2_VAR_SONAR_LAG', Float64),
        ('FEATURE_MS2_VAR_SONAR_SHAPE', Float64),
        ('FEATURE_MS2_VAR_SONAR_LOG_SN', Float64),
        ('FEATURE_MS2_VAR_SONAR_LOG_DIFF', Float64),
        ('FEATURE_MS2_VAR_SONAR_LOG_TREND', Float64),
        ('FEATURE_MS2_VAR_SONAR_RSQ', Float64),
        ('SCORE_MS2_SCORE', Float64),
        ('SCORE_MS2_PEAK_GROUP_RANK', Float64),
        ('SCORE_MS2_P_VALUE', Float64),
        ('SCORE_MS2_Q_VALUE', Float64),
        ('SCORE_MS2_PEP', Float64),
        ('SCORE_IPF_PRECURSOR_PEAKGROUP_PEP', Float64),
        ('SCORE_IPF_QVALUE', Float64),
        ('SCORE_IPF_PEP', Float64),
        ('SCORE_PEPTIDE_RUN_SPECIFIC_SCORE', Float64),
        ('SCORE_PEPTIDE_RUN_SPECIFIC_P_VALUE', Float64),
        ('SCORE_PEPTIDE_RUN_SPECIFIC_Q_VALUE', Float64),
        ('SCORE_PEPTIDE_RUN_SPECIFIC_PEP', Float64),
        ('SCORE_PEPTIDE_EXPERIMENT_WIDE_SCORE', Float64),
        ('SCORE_PEPTIDE_EXPERIMENT_WIDE_P_VALUE', Float64),
        ('SCORE_PEPTIDE_EXPERIMENT_WIDE_Q_VALUE', Float64),
        ('SCORE_PEPTIDE_EXPERIMENT_WIDE_PEP', Float64),
        ('SCORE_PEPTIDE_GLOBAL_SCORE', Float64),
        ('SCORE_PEPTIDE_GLOBAL_P_VALUE', Float64),
        ('SCORE_PEPTIDE_GLOBAL_Q_VALUE', Float64),
        ('SCORE_PEPTIDE_GLOBAL_PEP', Float64),
        ('SCORE_PROTEIN_RUN_SPECIFIC_SCORE', Float64),
        ('SCORE_PROTEIN_RUN_SPECIFIC_P_VALUE', Float64),
        ('SCORE_PROTEIN_RUN_SPECIFIC_Q_VALUE', Float64),
        ('SCORE_PROTEIN_RUN_SPECIFIC_PEP', Float64),
        ('SCORE_PROTEIN_EXPERIMENT_WIDE_SCORE', Float64),
        ('SCORE_PROTEIN_EXPERIMENT_WIDE_P_VALUE', Float64),
        ('SCORE_PROTEIN_EXPERIMENT_WIDE_Q_VALUE', Float64),
        ('SCORE_PROTEIN_EXPERIMENT_WIDE_PEP', Float64),
        ('SCORE_PROTEIN_GLOBAL_SCORE', Float64),
        ('SCORE_PROTEIN_GLOBAL_P_VALUE', Float64),
        ('SCORE_PROTEIN_GLOBAL_Q_VALUE', Float64),
        ('SCORE_PROTEIN_GLOBAL_PEP', Float64)])

    
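As a rough illustration of how the split layout described earlier in this section can be traversed, the sketch below walks either a single ``.oswpq`` directory or a per-run ``.oswpqd`` directory and yields the two parquet paths for each run. The helper name ``iter_run_files`` is hypothetical and not part of PyProphet; only the file and directory names are taken from the directory trees in this section.

.. code-block:: python

    from pathlib import Path
    from typing import Iterator, Tuple

    # File names as they appear in the directory trees in this section.
    PRECURSOR_FILE = "precursors_features.parquet"
    TRANSITION_FILE = "transition_features.parquet"


    def iter_run_files(root) -> Iterator[Tuple[str, Path, Path]]:
        """Yield (name, precursor_path, transition_path) per .oswpq directory.

        Hypothetical helper: ``root`` may be a single ``.oswpq`` directory or
        an ``.oswpqd`` directory holding one ``.oswpq`` directory per run.
        """
        root = Path(root)
        if root.suffix == ".oswpq":
            run_dirs = [root]
        else:
            # Sort so that the runs come back in a stable order.
            run_dirs = sorted(
                p for p in root.iterdir() if p.is_dir() and p.suffix == ".oswpq"
            )
        for run_dir in run_dirs:
            yield run_dir.stem, run_dir / PRECURSOR_FILE, run_dir / TRANSITION_FILE

Each yielded path can then be handed to whichever Parquet reader is in use. Because not every column is present in every file, it is worth checking the file's own schema before selecting columns.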

Transition Features Parquet Schema
**********************************

The Transition Features Parquet schema is listed below. Note that not all columns listed here are in every parquet file.

.. code-block:: text

    Schema([
        ('RUN_ID', Int64),
        ('IPF_PEPTIDE_ID', Int64),
        ('PRECURSOR_ID', Int64),
        ('TRANSITION_ID', Int64),
        ('TRANSITION_TRAML_ID', String),
        ('PRODUCT_MZ', Float64),
        ('TRANSITION_CHARGE', Int64),
        ('TRANSITION_TYPE', String),
        ('TRANSITION_ORDINAL', Int64),
        ('ANNOTATION', String),
        ('TRANSITION_DETECTING', Int64),
        ('TRANSITION_LIBRARY_INTENSITY', Float64),
        ('TRANSITION_DECOY', Int64),
        ('FEATURE_ID', Int64),
        ('FEATURE_TRANSITION_AREA_INTENSITY', Float64),
        ('FEATURE_TRANSITION_TOTAL_AREA_INTENSITY', Float64),
        ('FEATURE_TRANSITION_APEX_RT', Float64),
        ('FEATURE_TRANSITION_APEX_INTENSITY', Float64),
        ('FEATURE_TRANSITION_RT_FWHM', Float64),
        ('FEATURE_TRANSITION_MASSERROR_PPM', Float64),
        ('FEATURE_TRANSITION_TOTAL_MI', Float64),
        ('FEATURE_TRANSITION_VAR_INTENSITY_SCORE', Float64),
        ('FEATURE_TRANSITION_VAR_INTENSITY_RATIO_SCORE', Float64),
        ('FEATURE_TRANSITION_VAR_LOG_INTENSITY', Float64),
        ('FEATURE_TRANSITION_VAR_XCORR_COELUTION', Float64),
        ('FEATURE_TRANSITION_VAR_XCORR_SHAPE', Float64),
        ('FEATURE_TRANSITION_VAR_LOG_SN_SCORE', Float64),
        ('FEATURE_TRANSITION_VAR_MASSDEV_SCORE', Float64),
        ('FEATURE_TRANSITION_VAR_MI_SCORE', Float64),
        ('FEATURE_TRANSITION_VAR_MI_RATIO_SCORE', Float64),
        ('FEATURE_TRANSITION_VAR_ISOTOPE_CORRELATION_SCORE', Float64),
        ('FEATURE_TRANSITION_VAR_ISOTOPE_OVERLAP_SCORE', Float64),
        ('FEATURE_TRANSITION_START_POSITION_AT_5', Float64),
        ('FEATURE_TRANSITION_END_POSITION_AT_5', Float64),
        ('FEATURE_TRANSITION_START_POSITION_AT_10', Float64),
        ('FEATURE_TRANSITION_END_POSITION_AT_10', Float64),
        ('FEATURE_TRANSITION_START_POSITION_AT_50', Float64),
        ('FEATURE_TRANSITION_END_POSITION_AT_50', Float64),
        ('FEATURE_TRANSITION_TOTAL_WIDTH', Float64),
        ('FEATURE_TRANSITION_TAILING_FACTOR', Float64),
        ('FEATURE_TRANSITION_ASYMMETRY_FACTOR', Float64),
        ('FEATURE_TRANSITION_SLOPE_OF_BASELINE', Float64),
        ('FEATURE_TRANSITION_BASELINE_DELTA_2_HEIGHT', Float64),
        ('FEATURE_TRANSITION_POINTS_ACROSS_BASELINE', Float64),
        ('FEATURE_TRANSITION_POINTS_ACROSS_HALF_HEIGHT', Float64),
        ('FEATURE_TRANSITION_EXP_IM', Float64),
        ('FEATURE_TRANSITION_EXP_IM_LEFTWIDTH', Float64),
        ('FEATURE_TRANSITION_EXP_IM_RIGHTWIDTH', Float64),
        ('FEATURE_TRANSITION_DELTA_IM', Float64),
        ('FEATURE_TRANSITION_VAR_IM_DELTA_SCORE', Float64),
        ('FEATURE_TRANSITION_VAR_IM_LOG_INTENSITY', Float64),
        ('FEATURE_TRANSITION_VAR_IM_XCORR_COELUTION_CONTRAST', Binary),
        ('FEATURE_TRANSITION_VAR_IM_XCORR_SHAPE_CONTRAST', Binary),
        ('FEATURE_TRANSITION_VAR_IM_XCORR_COELUTION_COMBINED', Binary),
        ('FEATURE_TRANSITION_VAR_IM_XCORR_SHAPE_COMBINED', Binary),
        ('SCORE_TRANSITION_SCORE', Float32),
        ('SCORE_TRANSITION_RANK', UInt32),
        ('SCORE_TRANSITION_P_VALUE', Float64),
        ('SCORE_TRANSITION_Q_VALUE', Float64),
        ('SCORE_TRANSITION_PEP', Float64)
    ])
    
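Both schemas carry a ``FEATURE_ID`` column, so transition-level rows can be related back to their precursor-level feature. The sketch below uses plain dictionaries and toy values mirroring the example table earlier in this section. It assumes ``FEATURE_ID`` is the intended join key, which the schemas suggest but do not state explicitly, and ``attach_transitions`` is a hypothetical helper rather than a PyProphet function.

.. code-block:: python

    from collections import defaultdict
    from typing import Dict, Iterable, List


    def attach_transitions(
        precursor_rows: Iterable[dict], transition_rows: Iterable[dict]
    ) -> List[dict]:
        """Return precursor rows with their transition rows under 'TRANSITIONS'."""
        by_feature: Dict[int, List[dict]] = defaultdict(list)
        for row in transition_rows:
            by_feature[row["FEATURE_ID"]].append(row)
        # Precursor features without any transition rows get an empty list.
        return [
            {**row, "TRANSITIONS": by_feature.get(row["FEATURE_ID"], [])}
            for row in precursor_rows
        ]


    # Toy rows mirroring the example table (only a subset of the columns).
    precursors = [
        {"FEATURE_ID": 4564656, "PRECURSOR_ID": 0},
        {"FEATURE_ID": 2341534, "PRECURSOR_ID": 21408},
        {"FEATURE_ID": 65687785, "PRECURSOR_ID": 16886},
    ]
    transitions = [
        {"FEATURE_ID": 4564656, "TRANSITION_ID": 0, "ANNOTATION": "y7"},
        {"FEATURE_ID": 4564656, "TRANSITION_ID": 2, "ANNOTATION": "b7"},
        {"FEATURE_ID": 4564656, "TRANSITION_ID": 18, "ANNOTATION": "y3"},
        {"FEATURE_ID": 2341534, "TRANSITION_ID": 45, "ANNOTATION": "y5"},
        {"FEATURE_ID": 2341534, "TRANSITION_ID": 98, "ANNOTATION": "b3"},
    ]
    linked = attach_transitions(precursors, transitions)

With a dataframe library the same relationship would typically be expressed as a join on ``FEATURE_ID`` between the two parquet files.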

Extracted Ion Chromatograms (XICs)
----------------------------------

OpenSwathWorkflow can optionally output extracted ion chromatograms (XICs) for each precursor and transition. These XICs are stored in the OpenSWATH SQLite format (*.sqMass*). The XICs can be used for further analysis (chromatogram feature alignment using `DIAlignR `_ or `ARYCAL `_) or visualization of the data (using `massdash `_).

SqMass Format (*.sqMass*)
^^^^^^^^^^^^^^^^^^^^^^^^^

More information on the sqMass format and its schema is available in the OpenMS documentation on `SqMass File `_.

Parquet Format (*.parquet*)
^^^^^^^^^^^^^^^^^^^^^^^^^^^

PyProphet supports converting the OpenSWATH SQLite format (*.sqMass*) to Parquet for more efficient storage, using the :program:`export parquet` subcommand. The Parquet schema used for the XICs is as follows:

.. code-block:: text

    Schema([
        ('PRECURSOR_ID', Int64),
        ('TRANSITION_ID', Int64),
        ('MODIFIED_SEQUENCE', String),
        ('PRECURSOR_CHARGE', Int64),
        ('PRODUCT_CHARGE', Int64),
        ('DETECTING_TRANSITION', Int64),
        ('PRECURSOR_DECOY', Int64),
        ('PRODUCT_DECOY', Int64),
        ('TRANSITION_ORDINAL', Int64),
        ('TRANSITION_TYPE', String),
        ('NATIVE_ID', String),
        ('RT_DATA', Binary),
        ('INTENSITY_DATA', Binary),
        ('RT_COMPRESSION', Int64),
        ('INTENSITY_COMPRESSION', Int64)
    ])

RT_COMPRESSION and INTENSITY_COMPRESSION store the compression type used for the RT and intensity data, respectively. RT_DATA and INTENSITY_DATA are stored as binary data, which can be decompressed using the corresponding compression algorithm. The possible compression type values and their meanings are as follows:

.. list-table:: Compression Types
   :header-rows: 1

   * - Value
     - Compression Type
   * - 0
     - No compression
   * - 1
     - zlib
   * - 2
     - np-linear
   * - 3
     - np-slof
   * - 4
     - np-pic
   * - 5
     - np-linear + zlib
   * - 6
     - np-slof + zlib
   * - 7
     - np-pic + zlib
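
The compression-type table can be turned into a small lookup. The sketch below only removes the zlib layer and reports which numpress scheme, if any, still has to be decoded. It assumes that in the combined modes (5 to 7) zlib is the outer layer, and it does not implement numpress decoding itself. ``COMPRESSION_TYPES`` and ``strip_zlib_layer`` are illustrative names, not PyProphet API.

.. code-block:: python

    import zlib
    from typing import NamedTuple, Optional, Tuple


    class Compression(NamedTuple):
        numpress: Optional[str]  # "linear", "slof", "pic", or None
        zlib_layer: bool         # True when the payload is zlib-compressed


    # Mirrors the compression-type table in this section.
    COMPRESSION_TYPES = {
        0: Compression(None, False),
        1: Compression(None, True),
        2: Compression("linear", False),
        3: Compression("slof", False),
        4: Compression("pic", False),
        5: Compression("linear", True),
        6: Compression("slof", True),
        7: Compression("pic", True),
    }


    def strip_zlib_layer(blob: bytes, code: int) -> Tuple[bytes, Optional[str]]:
        """Undo the zlib layer (if any) and name the remaining numpress scheme.

        Assumption: for codes 5 to 7, zlib wraps the numpress-encoded bytes.
        """
        try:
            compression = COMPRESSION_TYPES[code]
        except KeyError:
            raise ValueError(f"unknown compression code: {code}") from None
        payload = zlib.decompress(blob) if compression.zlib_layer else blob
        return payload, compression.numpress

When the returned scheme is not ``None``, the payload still needs a numpress decoder before it can be interpreted as RT or intensity values.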