HOW TO CITE

Please acknowledge the Olivier-Van Stichelen lab at the Medical College of Wisconsin, Department of Biochemistry, when citing this database.

For the human and other O-GlcNAcome repositories, please cite:

Wulff-Fuentes E, Berendt RR, Massman L, Danner L, Malard F, Vora J, Kahsay R, and Olivier-Van Stichelen S, The Human O-GlcNAcome Database and Meta-Analysis. Scientific Data 2021 8(1):25

For the software development of the O-GlcNAc Database and the Python package utilsovs, please cite:

Malard F, Wulff-Fuentes E, Berendt R, Didier G, and Olivier-Van Stichelen S, Automatization and self-maintenance of the O-GlcNAcome catalogue: A Smart Scientific Database. Database (Oxford) 2021 2021

DISCLAIMER

We automatically check and manually validate that all O-GlcNAcylation sites in the O-GlcNAc Database are consistent with the parent protein sequence.

For this quality control, the canonical protein sequence is used first. If a mismatch is detected, which concerns less than 1% of sites in the O-GlcNAc Database, the alternative protein sequences are then tested.

When present, those sites are specifically labeled on the website with a link to the relevant isoform search result.

The terms canonical, alternative, related identifiers and sequences were sourced from the UniProtKB database.

THE O-GLCNAC SCORE

The O-GlcNAc score (S) is a quantifier we developed to estimate how exhaustively each entry in the database is covered by the available O-GlcNAc literature.

S(x) = R(x)^{norm} + C(x)^{norm} + T(x)^{norm} + fA(x)^{norm} + lA(x)^{norm} + B(x)^{norm}

Briefly, the O-GlcNAc score of a given entry, S(x), is the sum of normalized factors, each describing a particular aspect of the literature regarding that entry.

Each factor is a function that contributes a value in the [0,1] interval to the global score S, which therefore has a maximal theoretical value of six (scaled up to 100 in the Explore panel).

Those factors represent:
• R is the sum of all references for a given entry
• C is the sum of all citations toward those references
• T is the time span between the first and last O-GlcNAc publication
• fA and lA are the number of distinct first and last authors, respectively, in the references set
• B is a bonus parameter, which is down-weighted by the number of O-GlcNAc proteins found in a single reference and up-weighted by the number of citations toward a single reference
More details will be available in the above-mentioned papers or on request.
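
As an illustration only, the summation can be sketched as follows. The sketch assumes the six factors have already been normalized to the [0,1] interval (the normalization itself is described in the papers cited above, not here) and that the scaling to 100 is linear; the function names and the example values are hypothetical.

```python
# Hypothetical sketch: sum six already-normalized factors into S,
# then rescale linearly from [0, 6] to [0, 100] (linear scaling is an assumption).
FACTORS = ("R", "C", "T", "fA", "lA", "B")


def oglcnac_score(normalized):
    """Return S(x) as the sum of the six normalized factors."""
    for name in FACTORS:
        value = normalized[name]
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"factor {name} must lie in [0, 1], got {value}")
    return sum(normalized[name] for name in FACTORS)


def scaled_score(normalized):
    """Rescale S(x) so that the theoretical maximum of six maps to 100."""
    return 100.0 * oglcnac_score(normalized) / len(FACTORS)


# Made-up factor values for a single entry
example = {"R": 0.5, "C": 0.25, "T": 1.0, "fA": 0.5, "lA": 0.5, "B": 0.25}
print(oglcnac_score(example))  # 3.0
print(scaled_score(example))   # 50.0
```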

PEPTIDE DIGEST TOOL

We provide full and partial digestion products for all the entries in the O-GlcNAc Database and several common proteases.

This includes the average and monoisotopic masses of each peptide, in the absence or presence of O-GlcNAcylation or phosphorylation.

Below are the protease cleavage sites and the molecular weight information we use.

## Pattern matching using Regex and python3.7 re.finditer() method and double look-behind to preserve overlapping matches.
proteases = {
'Trypsin':['bovine',r'(?<=[KR])(?<=[A-Z])'], # Cut after K or R residues
'Chymotrypsin':['bovine',r'(?<=[YFW])(?<=[A-Z])'], # Cut after Y, F or W residues
'Arg-C':['mouse-submaxillary-gland',r'(?<=[R])(?<=[A-Z])'], # Cut after R residues
'Glu-C':['Staphylococcus-aureus',r'(?<=[E])(?<=[A-Z])'], # Cut after E residues
'Lys-C':['Lysobacter-enzymogenes',r'(?<=[K])(?<=[A-Z])'], # Cut after K residues
'Pepsin':['porcine',r'(?<=[GAVLIPMYFW])(?<=[A-Z])'], # Cut after G, A, V, L, I, P, M, Y, F, W residues
'Thermolysin':['Bacillus-thermo-proteolyticus',r'(?<=[LFIVMA])(?<=[A-Z])'], # Cut after L, F, I, V, M, A residues
'Elastase':['porcine',r'(?<=[AVSGLI])(?<=[A-Z])'] # Cut after A, V, S, G, L, I residues
}
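
For readers who want to try one of these patterns, here is a minimal sketch of how a look-behind pattern can be turned into a list of fragments with re.finditer(). The helper name total_digestion is hypothetical and this is not necessarily how the website performs the cut; only the Trypsin pattern is copied from the dictionary above.

```python
import re

# Trypsin pattern copied from the proteases dictionary above
TRYPSIN = r'(?<=[KR])(?<=[A-Z])'


def total_digestion(sequence, pattern):
    """Split a sequence at every zero-width match of a cleavage pattern."""
    # Each match is empty and sits right after the residue named in the look-behind
    cut_sites = [match.start() for match in re.finditer(pattern, sequence)]
    bounds = [0] + cut_sites + [len(sequence)]
    # Skip the empty fragment produced when the sequence ends on a cleavage residue
    return [sequence[a:b] for a, b in zip(bounds, bounds[1:]) if a < b]


print(total_digestion("ILIKEGLCNACRAWPEPTIDES", TRYPSIN))
# ['ILIK', 'EGLCNACR', 'AWPEPTIDES']
```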

## Dictionary of monoisotopic (index 0) and average (index 1) masses (see https://web.expasy.org/findmod/findmod_masses.html#AA)
monoisotopic_average_mass = {
'A':[71.03711,71.0788],
'R':[156.10111,156.1875],
'N':[114.04293,114.1038],
'D':[115.02694,115.0886],
'C':[103.00919,103.1388],
'E':[129.04259,129.1155],
'Q':[128.05858,128.1307],
'G':[57.02146,57.0519],
'H':[137.05891,137.1411],
'I':[113.08406,113.1594],
'L':[113.08406,113.1594],
'K':[128.09496,128.1741],
'M':[131.04049,131.1926],
'F':[147.06841,147.1766],
'P':[97.05276,97.1167],
'S':[87.03203,87.0782],
'T':[101.04768,101.1051],
'W':[186.07931,186.2132],
'Y':[163.06333,163.1760],
'V':[99.06841,99.1326],
'U':[150.953636,150.0388],
'O':[237.147727,237.3018],
'water':[18.01056,18.01524], # MW peptide = MW each residue + MW water
'oglcnac':[203.0794,203.1950],
'phospho':[79.9663,79.9799]
}
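
The rule stated in the comment on the water entry (peptide mass = sum of residue masses + one water) can be sketched as below. This is an illustrative helper, not the site's code: the function name peptide_mass and its parameters are hypothetical, only an excerpt of the table above is repeated to keep the example short, and modifications are assumed to add their listed mass once per modified site.

```python
# Excerpt of the mass table above: (monoisotopic, average)
MASSES = {
    'I': (113.08406, 113.1594),
    'L': (113.08406, 113.1594),
    'K': (128.09496, 128.1741),
    'water': (18.01056, 18.01524),
    'oglcnac': (203.0794, 203.1950),
    'phospho': (79.9663, 79.9799),
}


def peptide_mass(peptide, n_oglcnac=0, n_phospho=0, average=False):
    """Sum residue masses, add one water, then add any modification masses."""
    i = 1 if average else 0
    mass = sum(MASSES[residue][i] for residue in peptide) + MASSES['water'][i]
    # Assumption: each modification contributes its listed mass once per site
    mass += n_oglcnac * MASSES['oglcnac'][i]
    mass += n_phospho * MASSES['phospho'][i]
    return mass


print(round(peptide_mass("ILIK"), 4))               # 485.3577 (monoisotopic)
print(round(peptide_mass("ILIK", n_oglcnac=1), 4))  # 688.4371 (monoisotopic)
```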

The total digestion products list contains the smallest peptides that could be obtained with a given protease (i.e. 100% cleavage at all sites).

The partial digestion products list contains all possible combinations of adjacent peptides from the above-mentioned list.

## Example with Trypsin protease and the full length - imaginary - protein sequence:
ILIKEGLCNACRAWPEPTIDES

# Total digestion products list
ILIK
EGLCNACR
AWPEPTIDES

# Partial digestion products list
ILIK
EGLCNACR
AWPEPTIDES
ILIKEGLCNACR
EGLCNACRAWPEPTIDES
ILIKEGLCNACRAWPEPTIDES


We use the code below to generate partial digestion products from the total digestion products list.

def partial_digestion_products(full_digestion_products,full_digestion_index,length_sequence):
    # Number of peptides upon full digestion
    length_full_fragment_set = len(full_digestion_products)
    # Output list for partial products
    partial_digestion_products = []
    # We start from each peptide in the full_digestion_products
    for start in range(length_full_fragment_set):
        # We extend one by one
        for end in range(start+1, length_full_fragment_set+1):
            # We compute residues index to pass in the output list
            if len(full_digestion_index[start:end]) > 1:
                index_tuple = (full_digestion_index[start:end][0],full_digestion_index[start:end][-1])
            elif len(full_digestion_index[start:end]) == 1 and full_digestion_index[start:end][0] == full_digestion_index[-1]:
                index_tuple = (int(full_digestion_index[-1]),length_sequence-1)
            elif len(full_digestion_index[start:end]) == 1 and full_digestion_index[start:end][0] == full_digestion_index[0]:
                index_tuple = (0,int(full_digestion_index[0]))
            # We append the partial_digestion product to the list and we save the residue indexing computed just above
            partial_digestion_products.append([''.join(full_digestion_products[start:end]),index_tuple])
    # We get what we want
    return partial_digestion_products
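
To experiment with the idea without the full_digestion_index argument, whose exact layout is not described on this page, the following self-contained sketch enumerates every run of adjacent peptides and derives 0-based residue ranges from the fragment lengths instead. It is a simplified variant written for illustration, with a hypothetical function name, and it lists the products in a different order than the example above.

```python
def contiguous_products(fragments):
    """Return every run of adjacent fragments with its (first, last) residue index."""
    # Start position of each fragment in the parent sequence (0-based)
    starts = []
    position = 0
    for fragment in fragments:
        starts.append(position)
        position += len(fragment)

    products = []
    for first in range(len(fragments)):
        for last in range(first, len(fragments)):
            peptide = ''.join(fragments[first:last + 1])
            begin = starts[first]
            products.append((peptide, (begin, begin + len(peptide) - 1)))
    return products


for peptide, span in contiguous_products(["ILIK", "EGLCNACR", "AWPEPTIDES"]):
    print(peptide, span)
```

With n peptides in the total digestion list, this yields n(n+1)/2 products, which is why poorly selective proteases combined with large proteins become expensive in partial digestion mode.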

Due to limited server resources, please note that you may face timeout issues in partial digestion mode with poorly selective proteases and large proteins.

For custom requests, the addition of a specific protease to the menu list, or any other comment, please email us.

TECHNOLOGY

The O-GlcNAc Database relies on the non-relational database management system MongoDB and uses the Django web framework for rendering.

Backend processes were all developed using the Python programming language (v3.7) and the pymongo library for database server-client interactions.

GNU/Linux Debian-based systems with Gunicorn (Python HTTP server) and Nginx (SSL/reverse proxy) were used for development and production of the O-GlcNAc Database.

The whole front-end HTML 5.0 code of the O-GlcNAc Database was validated with no errors or warnings using the W3C markup validation service.

Please report any bug or malfunction you find when browsing our content to admin@oglcnac.com.

CHANGE LOGS & COMPATIBILITY

Compatibility of the O-GlcNAc Database version 1.2 and known bugs
• Fully functional with desktop Chrome/Firefox (GNU/Linux, MacOS, Windows), Edge (Windows), Safari (MacOS), mobile Chrome (Android) and Safari (iOS)
• Graphical rendering may encounter issues on some web browsers (Opera, Safari) although the code fully complies with W3C rules and standards
• Graphical rendering partially optimized for mobile platforms

Apr-29-21: The O-GlcNAc Database version 1.2
• The Smart O-GlcNAc Database: Back-end routines for self-maintenance and semi-automatization of literature curation
• REST API Endpoint at https://oglcnac.mcw.edu/api/v1 (See documentation)
• Release of the Python package utilsovs (v0.9.1b) which contains utils derived from the O-GlcNAc Database source code
Feb-21-21: The O-GlcNAc Database version 1.1
• The O-GlcNAcome of major model organisms
• New design for explore, search results and references views
• Rewriting of the whole front-end code to fully comply with W3C standards for better compatibility
Dec-25-20: The O-GlcNAc Database version 1.0
• New O-GlcNAcylation tab with overview, statistics (authors, literature, proteins) and consensus sequences information
• Extensive sequence information with highlight of O-GlcNAc sites, phosphorylation sites and dual sites for each protein entry
• New "Advanced" search mode for multi-line queries, to match a dataset against the O-GlcNAc Database content, generate reports and view details
• Digest tool associated with each protein entry: total or partial digestion
• Explicit isoform sites labelling with link toward relevant data: Example from search and upon click on link
• Field specific search enabled
• New filter criteria in explore and references tabs
• All datasets or single entries available for download in many formats (CSV, JSON, PDF, XLSX, BIBTEX)
• Cross-referencing over the whole O-GlcNAc Database for improved ergonomics
Nov-16-20: Beta version
• The Human O-GlcNAcome
Nov-12-20: On the Web
Nov-8-20: Testing version
• End of development phase
• Deployment and evaluation