SciDM Group

SciDM (Scientific Data Management) Group aims to bridge the gap between the frontiers of software technology and scientific research. Focused on genomics and computational biology, SciDM Group develops tools and technologies for efficient and cost-effective management and processing of terabytes of highly structured and heavily interconnected bioinformatics data.

SciDM Group is an informal, non-commercial association of enthusiastic computational biologists and software engineers. Most of the products developed by SciDM Group are open source or intended to become open source. We are open to collaboration and welcome comments and discussion.

Major projects developed by SciDM Group are:

SciDM DBMS - high performance zero-maintenance object-relational NoSQL database engine
EMBEDB - embedded data access library serving as a back-end for SciDM System (open source)
QSimScan - ultra-high speed DNA and protein sequence similarity search tool (open source)
Transcriptomics pipeline - integrated solution for EST and RNAseq data analysis

Well-established industrial IT solutions often do not suit the needs of researchers. Industrial technologies for data management and processing, perfectly tailored to serve businesses or banks, often prove weak and clumsy in scientific applications. In genomics, for example, a single analytical job typically requires retrieving and storing back hundreds of millions of objects such as short DNA sequences. At such volumes, the transaction-processing overhead inherent in industrial SQL DBMSs becomes unbearable.

SciDM Group provides an alternative: the SciDM Database Management System, a zero-maintenance object-relational NoSQL database engine optimized for large aggregate transactions. On typical bioinformatics tasks such as sequence assembly and clustering, SciDM operates thousands of times faster than conventional database engines. It also provides a rich and flexible data model, native mapping to an array of programming languages, and very efficient over-the-network access. For full details, please see the SciDM Whitepaper [PDF].

Scientific Data Manager (SciDM) is a DBMS designed for applications that need to:

operate quickly on heavy data sets over a network,
run on standard inexpensive hardware, free of maintenance,
use a light and easy-to-use programming interface,
work across multiple platforms and various programming languages.

SciDM provides:

- Unmatched performance:

retrieving 1,000,000 data objects in 0.9 seconds

updating 1,000,000 data objects in 2.1 seconds

creating 1,000,000 data objects in 37 seconds

deleting 1,000,000 data objects in 21.5 seconds

processing 1,000,000 select queries in 5.2 seconds

handling 10,000 client sessions simultaneously

All of the above over a network¹, running on standard hardware², with 540 GB of data pre-loaded³.

- Strong reliability: All features are tested in real-world applications that intensively operate with huge volumes of data. These applications include genomic data management, literature databases, document management and bug tracking.

- Storage limited only by hardware. A SciDM-managed database of 12.5 terabytes exists. The theoretical limits are 2⁴⁸ objects per storage, up to 2 billion data attributes per object, and up to 2⁶³ bytes per data attribute.

- Rich data management capabilities at unprecedented speed. SciDM uses an object-relational model for data representation and provides methods for storing, retrieving, and deleting data objects, for sequential and indexed access, and for dynamic data structure discovery. Along with traditional indexing by entire attribute contents, SciDM provides ‘context’-style indices on the individual words in stored texts.

- Structured framework for data processing formalized in terms of a particular application field. This makes SciDM highly suitable for research and prototyping. This also simplifies the development by removing the traditional data translation layer between application and DBMS, and by bringing the structured data directly into the processing modules.

- Interoperability over various platforms, operating systems and programming languages.

- Security model with object–level protection. The access rights are controlled individually for every object.

- Other features of SciDM include: a flexible object structure, allowing dynamic addition of new attributes to existing objects; integrated set management, allowing both persistence and server-side operations for arbitrary object sets; and data integrity support through locking and object subordination.
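The difference between indexing by entire attribute contents and ‘context’-style indexing by individual words, mentioned in the feature list above, can be illustrated with a small sketch. This is not SciDM code and does not use the SciDM API: the function names, the tokenization rule, and the sample records are all hypothetical, chosen only to show the general idea of a word-level (inverted) index.

```python
import re
from collections import defaultdict

# Hypothetical tokenizer: lowercase runs of letters and digits.
# How SciDM actually splits text into words is not described in the source.
def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def build_whole_value_index(objects):
    """Traditional index: the entire attribute value is the key."""
    index = defaultdict(set)
    for obj_id, text in objects.items():
        index[text].add(obj_id)
    return index

def build_word_index(objects):
    """Word-level index: every word in the text is a key."""
    index = defaultdict(set)
    for obj_id, text in objects.items():
        for word in tokenize(text):
            index[word].add(obj_id)
    return index

def find_all_words(word_index, query):
    """Return IDs of objects whose text contains every word of the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(word_index.get(words[0], set()))
    for word in words[1:]:
        result &= word_index.get(word, set())
    return result

# Made-up sample records (object ID -> text attribute).
records = {
    1: "DNA polymerase III subunit alpha",
    2: "RNA polymerase sigma factor",
    3: "DNA ligase",
}

value_index = build_whole_value_index(records)
word_index = build_word_index(records)

print(value_index.get("DNA ligase"))                 # exact-value lookup
print(find_all_words(word_index, "polymerase"))      # single-word lookup
print(find_all_words(word_index, "DNA polymerase"))  # all words must match
```

The whole-value index only answers queries for the complete attribute text, while the word index finds an object from any word it contains, at the cost of one index entry per distinct word per object.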

We are presently starting a company, SciDM Co., to commercialize the SciDM database engine. Please see scidm.com for details.

Footnotes:

1 TCP/IP over Gigabit Ethernet

2 AMD Athlon 7750 Dual-Core CPU (2.7 GHz), 4 GB DDR2 RAM, 5 x 1 TB SATA2 hard drives (RAID 5)

3 The particular table used for benchmarking already contained 20,000,000 records.
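For readers who prefer rates to elapsed times, the benchmark figures quoted above can be converted into operations per second and average time per operation. The sketch below uses only the numbers stated on this page; the dictionary layout and function names are illustrative, and the averages say nothing about how individual operations were batched in the actual benchmark.

```python
# (operation count, elapsed seconds), as quoted in the performance list above.
BENCHMARKS = {
    "retrieve": (1_000_000, 0.9),
    "update": (1_000_000, 2.1),
    "create": (1_000_000, 37.0),
    "delete": (1_000_000, 21.5),
    "select query": (1_000_000, 5.2),
}

def throughput(count, seconds):
    """Average operations per second."""
    return count / seconds

def latency_us(count, seconds):
    """Average time per operation, in microseconds."""
    return seconds / count * 1_000_000

for name, (count, seconds) in BENCHMARKS.items():
    rate = throughput(count, seconds)
    per_op = latency_us(count, seconds)
    print(f"{name:>12}: {rate:>12,.0f} ops/s, {per_op:6.1f} us per op")
```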

Denis Kaznadzey

Computational Biologist and Software Engineer

Denis graduated from Moscow State University in 1989 with a major in Biochemistry. In 1991, together with Vlad Novichkov and others, he founded Query Logic, Inc., a company that brought to market ImaGENE, the world's first hardware-accelerated DNA sequence analysis suite. Over the course of his career, Denis has led the development of large-scale bioinformatics systems for genome annotation, metabolic modeling, transcriptome assembly, and gene expression analysis, and has developed algorithms for solving various problems in computational biology. Denis sees "reverse-engineering of life" as his major mission and challenge.

Victor Joukov

Software Architect and Algorithm Designer

Physicist by education and software guru by talent, Victor started his career at ParaGraph, a high-tech software company in Russia, specializing in human-computer interfaces. Victor entered the world of Life Sciences in 1988, when he joined Quark Biotech's Bioinformatics Group as a Lead Software Architect. Victor has developed and delivered a variety of highly usable products and technologies for genome annotation, gene function prediction, analysis of biological networks, comparative genomics, metabolic engineering and other fields. Presently, as Senior Software Engineer at the National Institutes of Health, Victor is responsible for the development of the NCBI Sequence Viewer.

Vladimir Novichkov

Computer Scientist, Electrical Engineer.

Vladimir studied Physics at Moscow State University. In 1991 Vladimir, together with Denis Kaznadzey and others, co-founded Query Logic, Inc., which pioneered the design of a specialized hardware co-processor for DNA sequence comparison. While directing the microarray facility at UIC / Quark Biotech, he developed tools for microarray image analysis and information extraction. He also designed algorithms for ultra-fast DNA sequence comparison, which later inspired and became a foundation for the QSimScan technology. In later years Vladimir worked on hardware architecture and algorithms for networking, signal processing, and wireless channel coding, and co-authored several US patents and a conference paper. Vlad's current research interests include special-purpose processor microarchitectures and graph and supervised learning algorithms.