HOW TO CITE
Please acknowledge the Olivier-Van Stichelen lab at the Medical College of Wisconsin, Department of Biochemistry, when citing this database.
For the human and other O-GlcNAcome repositories, please cite:
Wulff-Fuentes E, Berendt RR, Massman L, Danner L, Malard F, Vora J, Kahsay R, and Olivier-Van Stichelen S,
The Human O-GlcNAcome Database and Meta-Analysis. Scientific Data 2021
8(1):25
For the software development of the O-GlcNAc Database and the Python package utilsovs, please cite:
Malard F, Wulff-Fuentes E, Berendt R, Didier G, and Olivier-Van Stichelen S,
Automatization and self-maintenance of the O-GlcNAcome catalogue: A Smart Scientific Database. Database (Oxford). 2021
2021.
Disclaimer
We automatically check and manually validate that all O-GlcNAcylation sites in the O-GlcNAc Database are consistent with the parent protein sequence.
For this quality control, the canonical protein sequence is used first. If a mismatch is detected, which happens for less than 1% of sites in the O-GlcNAc Database, the alternative protein sequences are then tested.
When present, those sites are specifically labeled on the website with a link toward the relevant isoform search result.
The terms canonical and alternative, as well as the related identifiers and sequences, were sourced from the UniProtKB database.
THE O-GLCNAC SCORE
The O-GlcNAc score (S) is a quantifier we developed to estimate how exhaustively each entry in the database is covered by the available O-GlcNAc literature.
S(x)~=~R(x)^{norm}+C(x)^{norm}+T(x)^{norm}+fA(x)^{norm}+lA(x)^{norm}+B(x)^{norm}
Briefly, the O-GlcNAc score of a given entry, S(x), is the sum of six normalized factors, each describing a particular aspect of the literature regarding that entry.
Each factor is a function that contributes a value in the [0,1] interval to the global score S, which therefore has a maximal theoretical value of six (scaled up to 100 in the Explore panel).
Those factors represent:
- R is the sum of all references for a given entry
- C is the sum of all citations toward those references
- T is the time span between the first and last O-GlcNAc publication
- fA and lA are the number of distinct first and last authors, respectively, in the references set
- B is a bonus parameter, which is down-weighted by the number of O-GlcNAc proteins found in a single reference and up-weighted by the number of citations toward a single reference
More details are available in the above-mentioned papers or on request.
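As a rough illustration of how the six terms combine, the sketch below sums factors that are assumed to be already normalized and rescales the result to 100. The helper name `oglcnac_score`, the example values and the linear rescaling (S / 6 x 100) are assumptions made for this example only; the actual normalization of each factor is described in the papers cited above and is not reproduced here.

```python
# Names of the six normalized factors that appear in S(x).
FACTORS = ("R", "C", "T", "fA", "lA", "B")


def oglcnac_score(normalized, scale_to=100.0):
    """Return (raw, scaled) scores from six factors already mapped to [0, 1].

    Hypothetical helper: the linear rescaling to `scale_to` is an assumption.
    """
    missing = [name for name in FACTORS if name not in normalized]
    if missing:
        raise KeyError(f"missing factors: {missing}")
    for name in FACTORS:
        if not 0.0 <= normalized[name] <= 1.0:
            raise ValueError(f"{name} must lie in the [0, 1] interval")
    # Raw score: plain sum of the six factors, so it lies in [0, 6].
    raw = sum(normalized[name] for name in FACTORS)
    # Assumed linear rescaling of the [0, 6] range onto [0, scale_to].
    return raw, raw / len(FACTORS) * scale_to


# Made-up factor values, for illustration only.
example = {"R": 0.5, "C": 0.25, "T": 1.0, "fA": 0.5, "lA": 0.5, "B": 0.25}
raw_score, scaled_score = oglcnac_score(example)
print(raw_score, scaled_score)
```

With these invented values, the raw score is half of the theoretical maximum, so the scaled score lands at the midpoint of the 0 to 100 range.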
PEPTIDE DIGEST TOOL
We provide full and partial digestion products for all the entries in the O-GlcNAc Database and for several common proteases.
This includes the averaged and monoisotopic mass of each peptide, in the absence or presence of O-GlcNAcylation or phosphorylation.
Below are the protease cleavage sites and the molecular weight information we use.
## Pattern matching uses regular expressions with the Python 3.7 re.finditer() method; a double look-behind preserves overlapping matches.
proteases = {
'Trypsin':['bovine',r'(?<=[KR])(?<=[A-Z])'], # Cut after K or R residues
'Chymotrypsin':['bovine',r'(?<=[YFW])(?<=[A-Z])'], # Cut after Y, F or W residues
'Arg-C':['mouse-submaxillary-gland',r'(?<=[R])(?<=[A-Z])'], # Cut after R residues
'Glu-C':['Staphylococcus-aureus',r'(?<=[E])(?<=[A-Z])'], # Cut after E residues
'Lys-C':['Lysobacter-enzymogenes',r'(?<=[K])(?<=[A-Z])'], # Cut after K residues
'Pepsin':['porcine',r'(?<=[GAVLIPMYFW])(?<=[A-Z])'], # Cut after G, A, V, L, I, P, M, Y, F, W residues
'Thermolysin':['Bacillus-thermo-proteolyticus',r'(?<=[LFIVMA])(?<=[A-Z])'], # Cut after L, F, I, V, M, A residues
'Elastase':['porcine',r'(?<=[AVSGLI])(?<=[A-Z])'] # Cut after A, V, S, G, L, I residues
}
## Dictionary of monoisotopic (index 0) and averaged (index 1) mass (see https://web.expasy.org/findmod/findmod_masses.html#AA)
monoisotopic_average_mass = {
'A':[71.03711,71.0788],
'R':[156.10111,156.1875],
'N':[114.04293,114.1038],
'D':[115.02694,115.0886],
'C':[103.00919,103.1388],
'E':[129.04259,129.1155],
'Q':[128.05858,128.1307],
'G':[57.02146,57.0519],
'H':[137.05891,137.1411],
'I':[113.08406,113.1594],
'L':[113.08406,113.1594],
'K':[128.09496,128.1741],
'M':[131.04049,131.1926],
'F':[147.06841,147.1766],
'P':[97.05276,97.1167],
'S':[87.03203,87.0782],
'T':[101.04768,101.1051],
'W':[186.07931,186.2132],
'Y':[163.06333,163.1760],
'V':[99.06841,99.1326],
'U':[150.953636,150.0388],
'O':[237.147727,237.3018],
'water':[18.01056,18.01524], # MW peptide = MW each residue + MW water
'oglcnac':[203.0794,203.1950],
'phospho':[79.9663,79.9799]
}
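A minimal sketch of how such a table can be used to compute a peptide mass is given below. It follows the rule noted in the table (sum of residue masses plus one water) and assumes that each modification simply adds its listed mass once per modified site. The helper name `peptide_mass`, its parameters and the reduced `MASSES` table (values copied from the dictionary above, restricted to the residues of the example peptide) are illustrative and are not taken from the database source code.

```python
# Subset of the mass table above: (monoisotopic, averaged) mass per entry.
MASSES = {
    'T': (101.04768, 101.1051),
    'I': (113.08406, 113.1594),
    'D': (115.02694, 115.0886),
    'E': (129.04259, 129.1155),
    'S': (87.03203, 87.0782),
    'water': (18.01056, 18.01524),
    'oglcnac': (203.0794, 203.1950),
    'phospho': (79.9663, 79.9799),
}


def peptide_mass(peptide, n_oglcnac=0, n_phospho=0, averaged=False):
    """Mass of a linear peptide, optionally carrying modifications.

    Hypothetical helper: modification masses are assumed to be additive.
    """
    idx = 1 if averaged else 0
    # Sum of residue masses plus one water for the free termini.
    mass = sum(MASSES[residue][idx] for residue in peptide) + MASSES['water'][idx]
    # Each modified site adds the listed mass once (assumption).
    mass += n_oglcnac * MASSES['oglcnac'][idx]
    mass += n_phospho * MASSES['phospho'][idx]
    return mass


unmodified = peptide_mass("TIDES")
glycosylated = peptide_mass("TIDES", n_oglcnac=1)
phosphorylated_avg = peptide_mass("TIDES", n_phospho=1, averaged=True)
print(round(unmodified, 4), round(glycosylated, 4), round(phosphorylated_avg, 4))
```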
The
total digestion products list contains the smallest peptides that could be obtained with a given protease (i.e. 100% cleavage at all sites).
The
partial digestion products list contains all possible combinations of adjacent peptides from the above-mentioned list.
## Example with Trypsin protease and the full length - imaginary - protein sequence:
ILIKEGLCNACRAWPEPTIDES
# Total digestion products list
ILIK
EGLCNACR
AWPEPTIDES
# Partial digestion products list
ILIK
EGLCNACR
AWPEPTIDES
ILIKEGLCNACR
EGLCNACRAWPEPTIDES
ILIKEGLCNACRAWPEPTIDES
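The total digestion list of this example can be reproduced with the Trypsin pattern listed earlier. The snippet below is a sketch under stated assumptions rather than the database code itself: the helper name `total_digestion` is hypothetical, and discarding a cut located at the very end of the sequence (to avoid an empty trailing peptide) is a choice made for this illustration.

```python
import re

# Trypsin pattern copied from the proteases table: cut after K or R.
TRYPSIN = r'(?<=[KR])(?<=[A-Z])'


def total_digestion(sequence, pattern):
    """Split a sequence at every zero-width match of a protease pattern."""
    # Each match is empty; its start is the index where the next peptide begins.
    cuts = [match.start() for match in re.finditer(pattern, sequence)]
    # Ignore a cut at the end of the sequence (it would yield an empty peptide).
    cuts = [cut for cut in cuts if 0 < cut < len(sequence)]
    bounds = [0] + cuts + [len(sequence)]
    return [sequence[a:b] for a, b in zip(bounds, bounds[1:])]


peptides = total_digestion("ILIKEGLCNACRAWPEPTIDES", TRYPSIN)
print(peptides)
```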
We use the code below to generate
partial digestion products from the
total digestion products list.
def partial_digestion_products(full_digestion_products, full_digestion_index, length_sequence):
    # Number of peptides upon full digestion
    length_full_fragment_set = len(full_digestion_products)
    # Output list for partial products
    partial_digestion_products = []
    # We start from each peptide in the full_digestion_products
    for start in range(length_full_fragment_set):
        # We extend one by one
        for end in range(start+1, length_full_fragment_set+1):
            # We compute the residue indexes to pass to the output list
            if len(full_digestion_index[start:end]) > 1:
                index_tuple = (full_digestion_index[start:end][0], full_digestion_index[start:end][-1])
            elif len(full_digestion_index[start:end]) == 1 and full_digestion_index[start:end][0] == full_digestion_index[-1]:
                index_tuple = (int(full_digestion_index[-1]), length_sequence-1)
            elif len(full_digestion_index[start:end]) == 1 and full_digestion_index[start:end][0] == full_digestion_index[0]:
                index_tuple = (0, int(full_digestion_index[0]))
            # We append the partial digestion product to the list with the residue indexes computed just above
            partial_digestion_products.append([''.join(full_digestion_products[start:end]), index_tuple])
    # We return the list of partial digestion products
    return partial_digestion_products
Due to limited server resources, please note that you may face timeout issues in partial digestion mode with poorly selective proteases and large proteins.
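To get a feel for why poorly selective proteases are costly, note that the nested loops above emit one product per contiguous run of peptides, that is n(n+1)/2 products for n peptides obtained upon total digestion. The sketch below counts them for a given pattern; the helper name `count_partial_products` is hypothetical, and the count says nothing about the actual server limits, which are not stated here.

```python
import re

# Patterns copied from the proteases table.
TRYPSIN = r'(?<=[KR])(?<=[A-Z])'
PEPSIN = r'(?<=[GAVLIPMYFW])(?<=[A-Z])'


def count_partial_products(sequence, pattern):
    """Number of contiguous peptide combinations for a protease pattern."""
    # Internal cleavage sites only; a cut at the sequence end adds no peptide.
    n_cuts = sum(
        1 for match in re.finditer(pattern, sequence)
        if 0 < match.start() < len(sequence)
    )
    n_peptides = n_cuts + 1
    # One product per (start, end) pair with start < end: n(n+1)/2.
    return n_peptides * (n_peptides + 1) // 2


sequence = "ILIKEGLCNACRAWPEPTIDES"
trypsin_count = count_partial_products(sequence, TRYPSIN)
pepsin_count = count_partial_products(sequence, PEPSIN)
print(trypsin_count, pepsin_count)
```

Because the count grows quadratically with the number of cleavage sites, a long protein digested with a protease that accepts many residues produces far more partial products than the same protein digested with Trypsin.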
For custom requests, the addition of a specific protease to the menu list, or any other comment, please email us.
TECHNOLOGY
The O-GlcNAc Database relies on the non-relational database management system MongoDB and uses the Django web framework for rendering.
Backend processes were all developed using the
Python programming language (v3.7) and the
pymongo library for database server-client interactions.
GNU/Linux Debian-based systems with Gunicorn (Python HTTP server) and Nginx (SSL/reverse proxy) were used for development and production of the O-GlcNAc Database.
The whole front-end HTML5 code of the O-GlcNAc Database was validated with no errors or warnings using the W3C markup validation service.
Please report any bug or malfunction you encounter when browsing our content to admin@oglcnac.com.
CHANGE LOGS & COMPATIBILITY
Compatibility of the O-GlcNAc Database version 1.3 and known bugs
- Fully functional with desktop Chrome/Firefox (GNU/Linux, MacOS, Windows), Edge (Windows), Safari (MacOS), mobile Chrome (Android) and Safari (iOS)
- Graphical rendering may encounter issues on some web browsers (Opera, Safari) although the code fully complies with W3C rules and standards
- Graphical rendering not optimal on mobile platforms
Jan-31-23: The O-GlcNAc Database version 1.3
- Implementation of NGL viewer for visualization of protein modification sites
- New display and information for the O-GlcNAc score in search results
- Cross-referencing GlyCosmos in search results
Apr-29-21: The O-GlcNAc Database version 1.2
- The Smart O-GlcNAc Database: Back-end routines for self-maintenance and semi-automatization of literature curation
- REST API Endpoint at https://oglcnac.mcw.edu/api/v1 (See documentation)
- Release of the Python package utilsovs (v0.9.1b) which contains utils derived from the O-GlcNAc Database source code
Feb-21-21: The O-GlcNAc Database version 1.1
- The O-GlcNAcome of major model organisms
- New design for explore, search results and references views
- Rewriting of the whole front-end code to fully comply with W3C standards for better compatibility
Dec-25-20: The O-GlcNAc Database version 1.0
- New O-GlcNAcylation tab with overview, statistics (authors, literature, proteins) and consensus sequences information
- Extensive sequence information with highlight of O-GlcNAc sites, phosphorylation sites and dual sites for each protein entry
- New "Advanced" search mode for multiple lines query to match one dataset with the O-GlcNAc Database content and to generate reports and view details
- Digest tool associated with each protein entry: total or partial digestion
- Explicit labelling of isoform sites with a link toward the relevant data
- Field specific search enabled
- New filter criteria in explore and references tabs
- All datasets or single entries available for download in many formats (CSV, JSON, PDF, XLSX, BIBTEX)
- Cross-referencing over the whole O-GlcNAc Database for improved ergonomics
Nov-16-20: Beta version
Nov-12-20: On the Web
Nov-8-20: Testing version
- End of development phase
- Deployment and evaluation