                           SEQWORDS documentation



CONTENTS

   1.0 SUMMARY
   2.0 INPUTS & OUTPUTS
   3.0 INPUT FILE FORMAT
   4.0 OUTPUT FILE FORMAT
   5.0 DATA FILES
   6.0 USAGE
   7.0 KNOWN BUGS & WARNINGS
   8.0 NOTES
   9.0 DESCRIPTION
   10.0 ALGORITHM
   11.0 RELATED APPLICATIONS
   12.0 DIAGNOSTIC ERROR MESSAGES
   13.0 AUTHORS
   14.0 REFERENCES

1.0 SUMMARY

   Generates DHF files (domain hits files) of database hits (sequences)
   from Swissprot matching keywords from a keywords file. Generate DHF
   files from keyword search of UniProt

2.0 INPUTS & OUTPUTS

   SEQWORDS searches a swissprot-format sequence database with keywords
   taken from a keywords file, and writes a DHF file (domain hits file) of
   sequences whose swissprot entries contains at least one of the
   keywords. If an entry contains a keyword in a domain record of the
   feature table, then the sequence of the domain is written to the output
   file, otherwise the entire sequence is written. The user specifies the
   name of the swissprot-format sequence database (input), keywords file
   (input) and DHF file (output).

3.0 INPUT FILE FORMAT

   The keywords file (below) contains lists of keywords specific to a
   number of SCOP or CATH nodes, e.g. families and superfamilies. Each
   list of keywords is given after a block of SCOP or CATH classification
   records; for family-specific search terms, the block must contain a CL,
   FO, SF and an FA record (see below). For superfamily-specific terms,
   clearly only the CL, FO and SF should be specified. A single keyword
   must be given per line after the record 'TE'. Each block of SCOP or
   CATH classification records and search terms must be delimited by the
   record '//' (the file should also end with this record).
   It is possible to provide search terms above the level of superfamily,
   for example, fold and class-specific search terms for SCOP by using the
   CL and FO records only as appropriate. However, text searches of
   swissprot for members of scop folds and classes are unlikely to produce
   specific or meaningful results.

  Input files for usage example

  File: seqwords.terms

TY   SCOP
XX
CL   Alpha and beta proteins (a/b)
XX
FO   NAD(P)-binding Rossmann-fold domains
XX
SF   NAD(P)-binding Rossmann-fold domains
XX
FA   Lactate & malate dehydrogenases, N-terminal domain
XX
TE   NAD(P)-binding Rossmann-fold
TE   Lactate & malate dehydrogenases
TE   Lactate dehydrogenase
TE   Malate dehydrogenase
//

  File: seqwords.seq

ID   ACEA_ECOLI     STANDARD;      PRT;   434 AA.
AC   P05313;
DT   01-NOV-1988 (Rel. 09, Created)
DT   01-NOV-1988 (Rel. 09, Last sequence update)
DT   15-DEC-1998 (Rel. 37, Last annotation update)
DE   ISOCITRATE LYASE (EC 4.1.3.1) (ISOCITRASE) (ISOCITRATASE) (ICL).
GN   ACEA OR ICL.
OS   Escherichia coli.
OC   Bacteria; Proteobacteria; gamma subdivision; Enterobacteriaceae;
OC   Escherichia.
RN   [1]
RP   SEQUENCE FROM N.A.
RC   STRAIN=K12;
RX   MEDLINE; 89083515.
RA   Byrne C.R., Stokes H.W., Ward K.A.;
RT   "Nucleotide sequence of the aceB gene encoding malate synthase A in
RT   Escherichia coli.";
RL   Nucleic Acids Res. 16:10924-10924(1988).
RN   [2]
RP   SEQUENCE FROM N.A.
RC   STRAIN=K12;
RX   MEDLINE; 88262573.
RA   Rieul C., Bleicher F., Duclos B., Cortay J.-C., Cozzone A.J.;
RT   "Nucleotide sequence of the aceA gene coding for isocitrate lyase in
RT   Escherichia coli.";
RL   Nucleic Acids Res. 16:5689-5689(1988).
RN   [3]
RP   SEQUENCE FROM N.A.
RX   MEDLINE; 89008064.
RA   Matsuoka M., McFadden B.A.;
RT   "Isolation, hyperexpression, and sequencing of the aceA gene encoding
RT   isocitrate lyase in Escherichia coli.";
RL   J. Bacteriol. 170:4528-4536(1988).
RN   [4]
RP   SEQUENCE FROM N.A.
RC   STRAIN=K12 / MG1655;
RX   MEDLINE; 94089392.
RA   Blattner F.R., Burland V.D., Plunkett G. III, Sofia H.J.,
RA   Daniels D.L.;
RT   "Analysis of the Escherichia coli genome. IV. DNA sequence of the
RT   region from 89.2 to 92.8 minutes.";
RL   Nucleic Acids Res. 21:5408-5417(1993).
RN   [5]
RP   SEQUENCE OF 293-434 FROM N.A.
RX   MEDLINE; 88227861.
RA   Klumpp D.J., Plank D.W., Bowdin L.J., Stueland C.S., Chung T.,
RA   Laporte D.C.;
RT   "Nucleotide sequence of aceK, the gene encoding isocitrate
RT   dehydrogenase kinase/phosphatase.";
RL   J. Bacteriol. 170:2763-2769(1988).


  [Part of this file has been deleted for brevity]

FT   CONFLICT     70     70       A -> R (IN REF. 2).
FT   CONFLICT     80     80       A -> R (IN REF. 1 AND 2).
FT   CONFLICT    116    116       I -> N (IN REF. 2).
FT   CONFLICT    144    144       F -> L (IN REF. 1).
FT   CONFLICT    305    312       LGEEFVNK -> WAKSSLISN (IN REF. 2).
FT   CONFLICT    307    307       E -> Q (IN REF. 1).
FT   STRAND        2      6
FT   TURN          7      9
FT   HELIX        11     23
FT   TURN         26     27
FT   STRAND       28     33
FT   TURN         37     38
FT   HELIX        39     47
FT   TURN         48     48
FT   STRAND       53     58
FT   HELIX        64     67
FT   TURN         68     69
FT   STRAND       72     75
FT   TURN         83     84
FT   HELIX        87    108
FT   TURN        110    111
FT   STRAND      113    116
FT   HELIX       121    134
FT   TURN        135    136
FT   TURN        140    141
FT   STRAND      143    145
FT   HELIX       148    162
FT   TURN        163    163
FT   HELIX       166    168
FT   STRAND      173    175
FT   TURN        179    181
FT   STRAND      182    184
FT   HELIX       186    188
FT   TURN        190    191
FT   HELIX       196    217
FT   TURN        218    219
FT   HELIX       225    242
FT   TURN        243    244
FT   STRAND      248    255
FT   STRAND      263    271
FT   TURN        272    273
FT   STRAND      274    278
FT   HELIX       286    311
SQ   SEQUENCE   312 AA;  32337 MW;  17741A3B5AD068BA CRC64;
     MKVAVLGAAG GIGQALALLL KTQLPSGSEL SLYDIAPVTP GVAVDLSHIP TAVKIKGFSG
     EDATPALEGA DVVLISAGVA RKPGMDRSDL FNVNAGIVKN LVQQVAKTCP KACIGIITNP
     VNTTVAIAAE VLKKAGVYDK NKLFGVTTLD IIRSNTFVAE LKGKQPGEVE VPVIGGHSGV
     TILPLLSQVP GVSFTEQEVA DLTKRIQNAG TEVVEAKAGG GSATLSMGQA AARFGLSLVR
     ALQGEQGVVE CAYVEGDGQY ARFFSQPLLL GKNGVEERKS IGTLSAFEQN ALEGMLDTLK
     KDIALGEEFV NK
//

4.0 OUTPUT FILE FORMAT

   DHF file (domain hits file)
   The format of the DHF file (domain hits file) of hit sequences
   generated by SEQWORDS (Figure 1) is described fully in SEQSEARCH
   documentation and only summarised here. The file contains two lines per
   hit, the first is a description of the hit in 16 text tokens delimited
   by '^'. The second line contains the protein sequence. The first 4
   tokens refer to the hit (sequence) itself, the tokens are as follows:
     * (i) Accession number
     * (ii) Database code,
     * (iii - iv) Start and end positions of the hit relative to the full
       length sequence.

   The next 9 tokens refer to the domain family, superfamily etc for which
   the terms were defined (in the keywords file) and are as follows:
     * (v) Type of domain (one of 'SCOP' or 'CATH'),
     * (vi) SCOP or CATH domain identifier.
     * (vii) SCOP or CATH node unique identifier, e.g. SCOP Family Sunid.
     * (viii) Domain class. Textual description of the 'Class' (SCOP and
       CATH domains).
     * (ix) Domain architecture. Textual description of the 'Architecture'
       (CATH only).
     * (x) Domain topology. Textual description of the 'Topology' (CATH
       only).
     * (xi) Domain fold. Textual description of the 'Fold' (SCOP domains
       only).
     * (xii) Domain superfamily. Textual description of the 'Superfamily'
       (SCOP and CATH domains).
     * (xiii) Domain family. Textual description of the 'Fold' (SCOP
       only).

   The next 4 tokens refer to the hit, specifically, information about the
   search result as follows:
     * (xiv) Model type. The type of model that was used to generate the
       hit. For DHF files generated by using SEQWORDS a value of KEYWORD
       is given. Several other values are possible, however, see SEQSEARCH
       documentation.
     * (xv) SC - Score of hit from search algoritm (not written by
       SEQWORDS).
     * (xvi) P-value of hit (not written by SEQWORDS).
     * (xvii) E-value of hit (not written by SEQWORDS).

  Output files for usage example

  File: seqwords.dhf

> Q60150^.^1^312^SCOP^.^0^Alpha and beta proteins (a/b)^.^.^NAD(P)-binding Rossm
ann-fold domains^NAD(P)-binding Rossmann-fold domains^Lactate & malate dehydroge
nases, N-terminal domain^KEYWORD^0.00^0.000e+00^0.000e+00
MKVAVLGAAGGIGQALALLLKTQLPSGSELSLYDIAPVTPGVAVDLSHIPTAVKIKGFSGEDATPALEGADVVLISAGVA
RKPGMDRSDLFNVNAGIVKNLVQQVAKTCPKACIGIITNPVNTTVAIAAEVLKKAGVYDKNKLFGVTTLDIIRSNTFVAE
LKGKQPGEVEVPVIGGHSGVTILPLLSQVPGVSFTEQEVADLTKRIQNAGTEVVEAKAGGGSATLSMGQAAARFGLSLVR
ALQGEQGVVECAYVEGDGQYARFFSQPLLLGKNGVEERKSIGTLSAFEQNALEGMLDTLKKDIALGEEFVNK

5.0 DATA FILES

   SEQWORDS does not use a data file.

6.0 USAGE

Generate DHF files from keyword search of UniProt.
Version: EMBOSS:6.6.0.0

   Standard (Mandatory) qualifiers:
  [-keyfile]           infile     This option specifies the name of keywords
                                  file (input). This contains a list of
                                  keywords specific to a number of SCOP or
                                  CATH families and superfamilies used by
                                  SEQWORDS to search a sequence database.
  [-spfile]            infile     This option specifies the name of the
                                  sequence database (input) to search.
  [-outfile]           outfile    [test.hits] This option specifies the name
                                  of the DHF file (domain hits file) (output).
                                  A 'domain hits file' contains database hits
                                  (sequences) with domain classification
                                  information, in the DHF format (FASTA-like).
                                  The hits are relatives to a SCOP or CATH
                                  family (or other node in the structural
                                  hierarchies) and are found from a search of
                                  a sequence database. Files containing hits
                                  retrieved by PSIBLAST are generated by using
                                  SEQSEARCH, hits retrieved by a sparse
                                  protein signatare by using SIGSCAN or
                                  various types of HMM and profile by using
                                  LIBSCAN.

   Additional (Optional) qualifiers: (none)
   Advanced (Unprompted) qualifiers: (none)
   Associated qualifiers:

   "-outfile" associated qualifiers
   -odirectory3        string     Output directory

   General qualifiers:
   -auto               boolean    Turn off prompts
   -stdout             boolean    Write first file to standard output
   -filter             boolean    Read first file from standard input, write
                                  first file to standard output
   -options            boolean    Prompt for standard and additional values
   -debug              boolean    Write debug output to program.dbg
   -verbose            boolean    Report some/full command line options
   -help               boolean    Report command line options and exit. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose
   -warning            boolean    Report warnings
   -error              boolean    Report errors
   -fatal              boolean    Report fatal errors
   -die                boolean    Report dying program messages
   -version            boolean    Report version number and exit


  6.1 COMMAND LINE ARGUMENTS

