                           SCOPPARSE documentation



CONTENTS

   1.0 SUMMARY
   2.0 INPUTS & OUTPUTS
   3.0 INPUT FILE FORMAT
   4.0 OUTPUT FILE FORMAT
   5.0 DATA FILES
   6.0 USAGE
   7.0 KNOWN BUGS & WARNINGS
   8.0 NOTES
   9.0 DESCRIPTION
   10.0 ALGORITHM
   11.0 RELATED APPLICATIONS
   12.0 DIAGNOSTIC ERROR MESSAGES
   13.0 AUTHORS
   14.0 REFERENCES

1.0 SUMMARY

   Generate DCF file from raw SCOP files

2.0 INPUTS & OUTPUTS

   SCOPPARSE parses the raw SCOP classification file (dir.cla.scop.txt)
   and description file (dir.des.scop.txt), which are available at, for
   example:
   http://scop.mrc-lmb.cam.ac.uk/scop/parse/dir.cla.scop.txt_1.57
   http://scop.mrc-lmb.cam.ac.uk/scop/parse/dir.des.scop.txt_1.57
   The format of these files is explained at:
   http://scop.mrc-lmb.cam.ac.uk/scop/release-notes-1.55.html
   SCOPPARSE writes the classification to a DCF file (EMBL-like format).
   The data are not changed; only the format in which they are held
   changes. The DCF file does not include domain sequence information.
   The input and output files are specified by the user.

3.0 INPUT FILE FORMAT

   Excerpts from the dir.cla.scop.txt (Figure 1) and dir.des.scop.txt
   (Figure 2) SCOP input files are shown below. The format of these
   files is explained on the SCOP website:
   http://scop.mrc-lmb.cam.ac.uk/scop/release-notes-1.55.html

  Input files for usage example

  File: scop.cla.raw

# dir.cla.scop.txt
# SCOP release 1.57 (January 2002)  [File format version 1.00]
# http://scop.mrc-lmb.cam.ac.uk/scop/
# Copyright (c) 1994-2002 the scop authors; see http://scop.mrc-lmb.cam.ac.uk/scop/lic/copy.html
d1cs4a_ 1cs4    A:      d.58.29.1       39418   cl=53931,cf=54861,sf=55073,fa=55074,dm=55077,sp=55078,px=39418
d1ii7a_ 1ii7    A:      d.159.1.4       62415   cl=53931,cf=56299,sf=56300,fa=64427,dm=64428,sp=64429,px=62415

  File: scop.des.raw

# dir.des.scop.txt
# SCOP release 1.57 (January 2002)  [File format version 1.00]
# http://scop.mrc-lmb.cam.ac.uk/scop/
# Copyright (c) 1994-2002 the scop authors; see http://scop.mrc-lmb.cam.ac.uk/scop/lic/copy.html
53931   cl      d       -       Alpha and beta proteins (a+b)
54861   cf      d.58    -       Ferredoxin-like
55073   sf      d.58.29 -       Adenylyl and guanylyl cyclase catalytic domain
55074   fa      d.58.29.1       -       Adenylyl and guanylyl cyclase catalytic domain
55077   dm      d.58.29.1       -       Adenylyl cyclase VC1, domain C1a
55078   sp      d.58.29.1       -       Dog (Canis familiaris)
39418   px      d.58.29.1       d1cs4a_ 1cs4 A:
56299   cf      d.159   -       Metallo-dependent phosphatases
56300   sf      d.159.1 -       Metallo-dependent phosphatases
64427   fa      d.159.1.4       -       DNA double-strand break repair nuclease
64428   dm      d.159.1.4       -       Mre11
64429   sp      d.159.1.4       -       Archaeon Pyrococcus furiosus
62415   px      d.159.1.4       d1ii7a_ 1ii7 A:
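
   As an illustration only (this is not part of SCOPPARSE), the layout
   of the two raw files shown above can be read with a short Python
   sketch. The names ClaRecord, parse_cla_line and parse_des_line are
   hypothetical, and the field meanings are inferred from the excerpts,
   assuming whitespace-separated columns and that comment lines starting
   with '#' have already been skipped.

```python
from typing import Dict, NamedTuple, Tuple


class ClaRecord(NamedTuple):
    """One data line of dir.cla.scop.txt (field names are assumptions)."""
    sid: str               # domain identifier, e.g. "d1cs4a_"
    pdb_id: str            # 4-character PDB code, e.g. "1cs4"
    region: str            # chain/region definition, e.g. "A:"
    sccs: str              # dotted classification string, e.g. "d.58.29.1"
    sunid: int             # sunid of the domain entry itself
    nodes: Dict[str, int]  # hierarchy level code (cl, cf, ...) -> sunid


def parse_cla_line(line: str) -> ClaRecord:
    """Split one non-comment classification line into its six columns."""
    sid, pdb_id, region, sccs, sunid, nodes = line.split()
    pairs = (item.split("=") for item in nodes.split(","))
    return ClaRecord(sid, pdb_id, region, sccs, int(sunid),
                     {level: int(value) for level, value in pairs})


def parse_des_line(line: str) -> Tuple[int, str, str, str, str]:
    """Split one description line into (sunid, level, sccs, sid, text)."""
    # Only the first four gaps are split, so the free-text description
    # in the last column keeps its internal spaces.
    sunid, level, sccs, sid, text = line.strip().split(None, 4)
    return int(sunid), level, sccs, sid, text
```

   Looking up each sunid from the classification line in the parsed
   description lines gives the text that appears in the CL, FO, SF, FA,
   DO and OS records of the DCF file.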

4.0 OUTPUT FILE FORMAT

   An example of the DCF output file is shown in Figure 3. The records
   used to describe an entry are as follows. Records (4) to (9) are used
   to describe the position of the domain in the SCOP hierarchy. Various
   other ADDITIONAL RECORDS may be present if the file is processed by
   other programs, e.g. DOMAINSEQS or DOMAINSSE.
     * (1) ID - Domain identifier code. This is a 7-character code that
       uniquely identifies the domain in SCOP. It corresponds to the
       first 7 characters of a line in the SCOP classification file,
       written in upper case. The first character is always 'D', the
       next four characters are the PDB identifier code, the sixth
       character is the PDB chain identifier to which the domain belongs
       (a '.' is given where the domain is composed of multiple chains,
       a '_' where a chain identifier was not specified in the PDB file)
       and the final character is the number of the domain in the chain
       (for chains comprising more than one domain) or '_' (the chain
       comprises a single domain only).
     * (2) EN - PDB identifier code. This is the 4-character PDB
       identifier code of the PDB entry containing the domain.
     * (3) TY - domain type. "CATH" or "SCOP" is given ("SCOP" for DCF
       files generated by using SCOPPARSE).
     * (4) SI - SCOP sunids. The integers preceding the codes CL, FO,
       SF, FA, DO, SO and DD are the SCOP sunids for Class, Fold,
       Superfamily, Family, Domain, Source and domain data respectively.
       These numbers uniquely identify the appropriate node in the SCOP
       parsable files.
     * (5) CL - Domain class. It is identical to the text given after
       'Class' in the SCOP classification file.
     * (6) FO - Domain fold. It is identical to the text given after
       'Fold' in the SCOP classification file.
     * (7) SF - Domain superfamily. It is identical to the text given
       after 'Superfamily' in the SCOP classification file.
     * (8) FA - Domain family. It is identical to the text given after
       'Family' in the SCOP classification file.
     * (9) DO - Domain name. It is identical to the text given after
       'Protein' in the SCOP classification file.
     * (10) OS - Source of the protein. It is identical to the text given
       after 'Species' in the SCOP classification file.
     * (11) DS - Sequence of the domain according to the PDB file. This
       sequence is taken from the domain clean coordinate file generated
       by DOMAINER. The DS record will only be present if the DCF file has
       been processed using DOMAINSEQS.
     * (12) NC - Number of chains comprising the domain, or number of
       segments from the same chain that make up the domain. NC is
       usually 1. If the number of chains is greater than 1, then the
       domain entry will have a section containing a CN and a CH record
       (see below) for each chain.
     * (13) CN - Chain number. The number given in brackets after this
       record indicates the start of the data for the relevant chain.
     * (14) CH - Domain definition. The character given before CHAIN is
       the PDB chain identifier (a '.' is given in cases where a chain
       identifier was not specified in the DCF file), the strings before
       START and END give the start and end positions respectively of the
       domain in the PDB file (a '.' is given in cases where a position
       was not specified). Note that the start and end positions refer to
       residue numbering given in the original PDB file and therefore must
       be treated as strings.
     * (15) XX - used for spacing.
     * (16) // - used to delimit records for a domain.
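
   The structure of the ID record described in (1) can be made concrete
   with a small Python sketch. The function name decode_domain_id and
   the keys of the returned dictionary are hypothetical; the character
   positions follow the description above.

```python
from typing import Dict, Optional, Union


def decode_domain_id(domain_id: str) -> Dict[str, Union[str, bool, Optional[str]]]:
    """Split a 7-character domain identifier such as 'D1CS4A_' into parts."""
    if len(domain_id) != 7 or domain_id[0].upper() != "D":
        raise ValueError("not a 7-character domain identifier: %r" % domain_id)
    chain = domain_id[5]    # sixth character: PDB chain identifier
    number = domain_id[6]   # seventh character: domain number or '_'
    return {
        "pdb_id": domain_id[1:5],
        "chain": chain,
        "multi_chain": chain == ".",        # domain spans several chains
        "chain_unspecified": chain == "_",  # no chain identifier in PDB file
        # '_' means the chain comprises a single domain only
        "domain_number": None if number == "_" else number,
    }
```

   For example, decode_domain_id("D1CS4A_") reports PDB entry 1CS4,
   chain A, and no domain number because the chain holds a single
   domain.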

  Output files for usage example

  File: all.scop

ID   D1CS4A_
XX
EN   1CS4
XX
TY   SCOP
XX
SI   53931 CL; 54861 FO; 55073 SF; 55074 FA; 55077 DO; 55078 SO; 39418 DD;
XX
CL   Alpha and beta proteins (a+b)
XX
FO   Ferredoxin-like
XX
SF   Adenylyl and guanylyl cyclase catalytic domain
XX
FA   Adenylyl and guanylyl cyclase catalytic domain
XX
DO   Adenylyl cyclase VC1, domain C1a
XX
OS   Dog (Canis familiaris)
XX
NC   1
XX
CN   [1]
XX
CH   A CHAIN; . START; . END;
//
ID   D1II7A_
XX
EN   1II7
XX
TY   SCOP
XX
SI   53931 CL; 56299 FO; 56300 SF; 64427 FA; 64428 DO; 64429 SO; 62415 DD;
XX
CL   Alpha and beta proteins (a+b)
XX
FO   Metallo-dependent phosphatases
XX
SF   Metallo-dependent phosphatases
XX
FA   DNA double-strand break repair nuclease
XX
DO   Mre11
XX
OS   Archaeon Pyrococcus furiosus
XX
NC   1
XX
CN   [1]
XX
CH   A CHAIN; . START; . END;
//

5.0 DATA FILES

   No data files are used.

6.0 USAGE

Generate DCF file from raw SCOP files.
Version: EMBOSS:6.6.0.0

   Standard (Mandatory) qualifiers:
  [-classfile]         infile     This option specifies the name of raw SCOP
                                  classification file dir.cla.scop.txt_X.XX
                                  (input). This is the raw SCOP classification
                                  file available at
                                  http://scop.mrc-lmb.cam.ac.uk/scop/parse/dir.cla.scop.txt_1.57.
  [-desinfile]         infile     This option specifies the name of raw SCOP
                                  description file dir.des.scop.txt_X.XX
                                  (input). This is the raw SCOP description
                                  file available at
                                  http://scop.mrc-lmb.cam.ac.uk/scop/parse/dir.des.scop.txt_1.57.
   -nosegments         boolean    [N] This option specifies whether to omit
                                  domains comprising of more than one segment.
                                  This is necessary if a continuous residue
                                  sequence is required.
   -nomultichain       boolean    [N] This option specifies whether to omit
                                  domains comprising segments from more than
                                  one chain. This is necessary if a continuous
                                  residue sequence is required.
  [-dcffile]           outfile    [test.scop] This option specifies the name
                                  of SCOP DCF file (domain classification
                                  file) (output). A 'domain classification
                                  file' contains classification and other data
                                  for domains from the SCOP or CATH
                                  databases. The file is generated by using
                                  DOMAINER and is in DCF format (EMBL-like).
                                  Domain sequence information can be added to
                                  the file by using DOMAINSEQS.

   Additional (Optional) qualifiers:
   -nominor            boolean    [N] This option specifies whether to omit
                                  domains from minor classes (defined as
                                  anything not in class 'All alpha proteins',
                                  'All beta proteins', 'Alpha and beta
                                  proteins (a/b)' or 'Alpha and beta proteins
                                  (a+b)'). This is necessary or appropriate
                                  for many analyses.

   Advanced (Unprompted) qualifiers: (none)
   Associated qualifiers:

   "-dcffile" associated qualifiers
   -odirectory3        string     Output directory

   General qualifiers:
   -auto               boolean    Turn off prompts
   -stdout             boolean    Write first file to standard output
   -filter             boolean    Read first file from standard input, write
                                  first file to standard output
   -options            boolean    Prompt for standard and additional values
   -debug              boolean    Write debug output to program.dbg
   -verbose            boolean    Report some/full command line options
   -help               boolean    Report command line options and exit. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose
   -warning            boolean    Report warnings
   -error              boolean    Report errors
   -fatal              boolean    Report fatal errors
   -die                boolean    Report dying program messages
   -version            boolean    Report version number and exit
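
   For scripted use, the qualifiers listed above can be assembled into
   an argument list. The Python sketch below is only an illustration:
   the helper name scopparse_command is hypothetical, the executable is
   assumed to be installed as "scopparse", and it is assumed that naming
   a boolean qualifier on the command line switches it on. The helper
   only builds the list; it does not run the program.

```python
from typing import List


def scopparse_command(classfile: str, desinfile: str, dcffile: str,
                      nosegments: bool = False,
                      nomultichain: bool = False,
                      nominor: bool = False) -> List[str]:
    """Build an argument list for SCOPPARSE from the documented qualifiers."""
    cmd = ["scopparse",                  # assumed executable name
           "-classfile", classfile,      # raw SCOP classification file
           "-desinfile", desinfile,      # raw SCOP description file
           "-dcffile", dcffile]          # DCF output file
    if nosegments:
        cmd.append("-nosegments")        # omit multi-segment domains
    if nomultichain:
        cmd.append("-nomultichain")      # omit multi-chain domains
    if nominor:
        cmd.append("-nominor")           # omit domains from minor classes
    cmd.append("-auto")                  # turn off prompts
    return cmd
```

   The resulting list could then be passed to subprocess.run(cmd,
   check=True), for example with the files scop.cla.raw, scop.des.raw
   and all.scop from the usage example.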


  6.1 COMMAND LINE ARGUMENTS

