NoSQL Databases
by K. Yue
1. Introduction
- NoSQL:
- “Not Only SQL” or “Not SQL.” Relatively new kinds of non-relational database systems.
- A loose category of many diverse database systems with different data models.
- Some common NoSQL features:
- Non-relational
- Distributed
- High scalability
- High availability
- High performance and parallelism
- No or flexible schema
- Weaker support of ACID properties
- Oriented towards semi-structured or non-structured data.
- Weaker support of data integrity and constraints.
- Usually more object-oriented.
- More analysis-oriented than transaction-oriented.
- Simpler APIs for database interaction.
- Major non-relational database systems in the DB-Engines ranking, https://db-engines.com/en/ranking (rankings as of 9/1/2023):
- Document models: e.g. MongoDB (rank #5), Couchbase (32), Firebase Realtime Database (38), CouchDB (44), Realm (56)
- Key-value models: e.g. Redis (6), Memcached (33), etcd (52), Hazelcast (55), LevelDB (111).
- Wide-column models: e.g. Cassandra (12), Hive (17), HBase (26), Datastax Enterprise (60)
- Graph models: e.g. Neo4j (22)
- Some observations:
- Search engines (not really a data model): Elasticsearch (8; document model), Splunk (14; key-value), Solr (24; document model)
- Still dominated by relational databases.
- Many support multiple models.
- Read:
2. Advantages and Disadvantages
- Some advantages of NoSQL databases in general:
- Distributed
- High scalability: horizontal scalability, as opposed to vertical scalability.
- Horizontal scalability (scaling out): add more machines to the distributed system.
- Vertical scalability (scaling up): add more power to existing machines, or replace existing machines with more powerful ones.
- High availability
- High performance
- Flexibility: flexible schema or schemaless
- More object-oriented:
- Better abstraction model
- Better interoperability with programming languages. Less need for object-relational mapping.
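Horizontal scalability, as described above, is commonly achieved by partitioning (sharding) keys across machines. The following is a minimal sketch only, assuming hypothetical node names and a simple hash-modulo placement rule; it is not how any particular NoSQL product places data.

```python
import hashlib
from typing import Sequence

def shard_for(key: str, nodes: Sequence[str]) -> str:
    """Pick the node responsible for a key by hashing the key (illustrative scheme)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(nodes)
    return nodes[index]

nodes = ["node-a", "node-b", "node-c"]        # hypothetical machine names
keys = [f"user:{i}" for i in range(1000)]     # hypothetical keys

# Count how many keys land on each node.
placement = {node: 0 for node in nodes}
for key in keys:
    placement[shard_for(key, nodes)] += 1

# Scaling out: add a fourth machine and count how many keys change owner.
bigger = list(nodes) + ["node-d"]
moved = sum(shard_for(k, nodes) != shard_for(k, bigger) for k in keys)
```

With hash-modulo placement, changing the number of nodes reassigns a large share of the keys, which is one reason distributed stores often prefer schemes such as consistent hashing.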
- Some disadvantages of NoSQL databases in general:
- Weaker data integrity support
- Weaker transaction support
- Weaker theoretical and design methodology support
- Relative lack of standards
- Relative lack of tools and interoperability
3. ACID versus BASE Transaction Models
3.1 ACID
- Relational databases support the ACID properties to ensure data consistency and the integrity of transactions under concurrent access: e.g., http://en.wikipedia.org/wiki/ACID
- ACID properties (review):
- Atomicity: A transaction is an atomic unit of processing. It is either performed in its entirety or not performed at all.
- Consistency preservation: A correct execution of the transaction must take the database from one consistent state to another.
- Isolation: A transaction should not make its updates visible to other tasks and transactions until it is committed.
- Durability or permanency: Once a transaction changes the database and the changes are committed, these changes must never be lost because of subsequent failure.
- Supporting ACID limits other desirable features: scalability, availability, and performance.
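Atomicity can be seen in a few lines with SQLite, used here only as a convenient stand-in for a relational DBMS. This is a sketch under assumptions: the account table, the names, and the transfer function are hypothetical and chosen for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.execute("INSERT INTO account VALUES ('alice', 100), ('bob', 50)")
conn.commit()

def transfer(conn, src, dst, amount):
    """Move money between two accounts as one atomic transaction."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False

ok = transfer(conn, "alice", "bob", 30)       # both updates are committed
failed = transfer(conn, "alice", "bob", 500)  # the debit violates the CHECK constraint
balances = dict(conn.execute("SELECT name, balance FROM account"))
```

In the failed transfer, the credit to bob had already been executed when the debit was rejected; the rollback undoes it, so the transaction is performed either in its entirety or not at all.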
3.2 BASE
- To enhance scalability, availability, and performance, most NoSQL databases do not fully support ACID.
- NoSQL supports different transaction models.
- The Basically Available, Soft state, Eventual consistency (BASE) model for distributed databases is the most popular one.
- Basically available: data is basically available across nodes of the distributed database.
- Soft-state: There is no immediate consistency. As a result, different replicas may have different values across the distributed systems at a given time. Thus, the state of the database is soft.
- Eventual consistency: eventually, data replicas will have the same value across the distributed database.
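The three BASE ideas can be illustrated with a toy simulation. This is a sketch under assumptions, not a real replication protocol: the Replica and Cluster classes are hypothetical, replication messages are delivered only when sync() is called, and conflicts are resolved by a simple last-writer-wins rule.

```python
from collections import deque

class Replica:
    def __init__(self, name):
        self.name = name
        self.data = {}  # key -> (version, value)

    def apply(self, key, version, value):
        # Last-writer-wins: keep an update only if it is newer than the stored one.
        current = self.data.get(key)
        if current is None or version > current[0]:
            self.data[key] = (version, value)

    def read(self, key):
        entry = self.data.get(key)
        return None if entry is None else entry[1]

class Cluster:
    def __init__(self, names):
        self.replicas = [Replica(n) for n in names]
        self.pending = deque()  # replication messages not yet delivered
        self.clock = 0          # simple global version counter

    def write(self, replica_index, key, value):
        # The write is acknowledged after reaching one replica (basically available);
        # the other replicas receive it later.
        self.clock += 1
        self.replicas[replica_index].apply(key, self.clock, value)
        for i in range(len(self.replicas)):
            if i != replica_index:
                self.pending.append((i, key, self.clock, value))

    def sync(self):
        # Deliver all outstanding replication messages.
        while self.pending:
            i, key, version, value = self.pending.popleft()
            self.replicas[i].apply(key, version, value)

cluster = Cluster(["r0", "r1", "r2"])
cluster.write(0, "cart", ["book"])
before = [r.read("cart") for r in cluster.replicas]  # soft state: replicas disagree
cluster.sync()
after = [r.read("cart") for r in cluster.replicas]   # replicas have converged
```

Between the write and the sync, a client reading from r1 or r2 sees no cart at all; after the sync, every replica returns the same value.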
3.3 CAP Theorem
- Since NoSQL databases are mostly distributed, it is important to have some understanding of the famous CAP theorem for distributed data stores.
- See, for example, https://en.wikipedia.org/wiki/CAP_theorem.
- There are three desirable guarantees of distributed data stores, CAP:
- Consistency: every read returns the most recent write (or an error), so all nodes appear to hold the same value for the same data.
- Availability: every request to a working node returns a non-error response. (Note that the returned data may not be the most recent.)
- Partition tolerance: the database continues to operate in case of network partitions (one partition of the network cannot communicate with another partition because of message drops).
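- The CAP theorem states that a distributed data store can provide at most two of the three guarantees at the same time. Since network partitions cannot be ruled out in practice, the real choice during a partition is between consistency and availability.
- Different databases base their designs on prioritizing two out of the three C-A-P.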
- For a more detailed discussion, one may see: https://www.instaclustr.com/blog/cassandra-vs-mongodb/ (optional read):
- It contains a discussion of how Cassandra and MongoDB trade off the CAP guarantees.
- It also includes a discussion of a more refined CAP theorem: PACELC Theorem: "PACELC is summarized as follows: In the event of a partition failure, a distributed system must choose between Availability (A) and Consistency, else (E) when running normally it must choose between latency (L) or consistency (C)."
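The choice a system faces during a partition can be sketched with a toy two-replica store. Everything here is an assumption made for illustration (the TwoNodeStore class, the "CP" and "AP" modes, the partitioned flag); real systems use quorums and far more elaborate protocols.

```python
class Unavailable(Exception):
    """Raised when a node refuses a request in order to protect consistency."""

class TwoNodeStore:
    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.partitioned = False  # True when the two nodes cannot communicate
        self.nodes = [{}, {}]     # one dictionary per replica

    def write(self, node, key, value):
        other = 1 - node
        if not self.partitioned:
            # Normal operation: the write reaches both replicas.
            self.nodes[node][key] = value
            self.nodes[other][key] = value
            return
        if self.mode == "CP":
            # Prefer consistency: refuse a write that cannot be replicated.
            raise Unavailable("cannot reach the other replica")
        # Prefer availability: accept the write locally; the replicas now diverge.
        self.nodes[node][key] = value

    def read(self, node, key):
        return self.nodes[node].get(key)

def demo(mode):
    store = TwoNodeStore(mode)
    store.write(0, "x", 1)
    store.partitioned = True
    try:
        store.write(0, "x", 2)
        accepted = True
    except Unavailable:
        accepted = False
    return accepted, store.read(0, "x"), store.read(1, "x")

cp_result = demo("CP")
ap_result = demo("AP")
```

In "CP" mode the write during the partition is rejected and both replicas still agree; in "AP" mode the write is accepted and the two replicas return different values until the partition heals.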
4. Major NoSQL data models
4.1 Key-value model
- Data is stored as key-value pairs.
- Values can possibly be JavaScript Object Notation (JSON) strings, which store serialized objects.
- Some key-value databases support rich JSON queries.
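The key-value model can be sketched as an in-memory store whose values are JSON strings. The KeyValueStore class and the sample key are hypothetical; this is not the API of Redis or any other product.

```python
import json

class KeyValueStore:
    """A toy key-value store that keeps each value as a serialized JSON string."""

    def __init__(self):
        self._data = {}  # key -> JSON string

    def put(self, key, value):
        # Serialize the object; the store itself treats the value as opaque text.
        self._data[key] = json.dumps(value)

    def get(self, key):
        raw = self._data.get(key)
        return None if raw is None else json.loads(raw)

    def delete(self, key):
        self._data.pop(key, None)

store = KeyValueStore()
store.put("user:42", {"name": "Ada", "languages": ["SQL", "Cypher"]})
fetched = store.get("user:42")
missing = store.get("user:99")
```

Because the value is opaque to this store, the only access path is the key. Databases that support JSON queries add the ability to look inside the stored value.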
4.2 Document Model
- Document-oriented databases store data as documents.
- Documents can be considered as semi-structured data.
- Thus, XML databases can be considered as employing the document model.
- Modern document-oriented databases commonly employ JSON. E.g., MongoDB and CouchDB.
- The document model can be considered a subclass of the key-value model.
- The stored value is the document.
- The stored value can be manipulated by operations based on the selected document model (mostly JSON).
- MongoDB is likely the most popular document-oriented NoSQL DB. It will be covered in more detail in this class.
Example: In CouchDB, a key-value pair may be:
Key: "MBSEBaseModel~939c7672-5d2d-11ec-bf63-0242ac130002"
Value to be stored:
{
  "BCAssetId": "939c7672-5d2d-11ec-bf63-0242ac130002",
  "BCAssetType": "MBSEBaseModel",
  "BCAssetName": "Gateway-PPE-Base-Model",
  "BaseModelDesc": "PPE project's model.",
  "version": {
    "version": "2.1",
    "subversion": "4.6",
    "startTime": "2021-12-08T17:25:23+06:00"
  },
  "storage": {
    "isEncrypted": true,
    "EncrypMethod": "AES256",
    "EncrypKey": "q4t7w!z%C*F-JaNdRgUkXp2r5u8x/A?D",
    "useIPFS": true,
    "IPFSCid": "QmT5NvUtoM5nWFfrQdVrFtvGfKFmG7AHE8P34isapyhCxX",
    "IPFS_HashHead": "A0Xa",
    "payloadRaw": "raw PPE MBSE Base model description. V2.1.4.6."
  }
}
Note:
- CouchDB adds two fields, _id and _rev, automatically if they are not supplied.
- The field _rev is used for multi-version concurrency control (MVCC) to ensure 'eventual consistency.'
- CouchDB is usually classified as a document-oriented database, although it can also be viewed as a key-value store whose values are JSON documents. The boundary between the key-value model and the document model is not clear-cut.
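The role of _rev in concurrency control can be illustrated with a toy store that rejects updates based on a stale revision. This is a simplified sketch and not CouchDB's implementation: the RevisionedStore class is hypothetical, and the revision strings only imitate the "generation-hash" shape.

```python
import hashlib
import json

class Conflict(Exception):
    """Raised when an update is based on an outdated revision."""

class RevisionedStore:
    def __init__(self):
        self._docs = {}  # _id -> stored document, including _id and _rev

    def put(self, doc_id, body, rev=None):
        current = self._docs.get(doc_id)
        current_rev = current["_rev"] if current else None
        if rev != current_rev:
            # The caller did not see the latest version of the document.
            raise Conflict(f"expected revision {current_rev}, got {rev}")
        generation = 1 if current_rev is None else int(current_rev.split("-")[0]) + 1
        digest = hashlib.sha256(
            json.dumps(body, sort_keys=True).encode("utf-8")
        ).hexdigest()[:8]
        new_rev = f"{generation}-{digest}"
        self._docs[doc_id] = {**body, "_id": doc_id, "_rev": new_rev}
        return new_rev

    def get(self, doc_id):
        return self._docs.get(doc_id)

store = RevisionedStore()
rev1 = store.put("asset-1", {"BCAssetType": "MBSEBaseModel", "n": 1})
rev2 = store.put("asset-1", {"BCAssetType": "MBSEBaseModel", "n": 2}, rev=rev1)
try:
    # A second client still holds rev1 and tries to overwrite the document.
    store.put("asset-1", {"BCAssetType": "MBSEBaseModel", "n": 3}, rev=rev1)
    stale_write_rejected = False
except Conflict:
    stale_write_rejected = True
```

The client with the stale revision must fetch the current document and retry, so no update is silently overwritten.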
To query CouchDB, one may use many methods. Examples:
- CouchDB RESTful API: https://docs.couchdb.org/en/latest/api/index.html
- MapReduce-based views: https://docs.couchdb.org/en/latest/ddocs/views/index.html
- Mango query:
An example Mango query that returns all CouchDB key-value pairs for MBSEBaseModel.
{
  "selector": {
    "BCAssetType": { "$eq": "MBSEBaseModel" }
  }
}
See https://docs.couchdb.org/en/latest/api/database/find.html for more information about selector syntax.
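A Mango query such as the one above is sent to CouchDB by POSTing it to the database's _find endpoint. The sketch below uses only the Python standard library; the server address http://localhost:5984, the database name assets, and the helper names are assumptions, and a real server will usually also require authentication.

```python
import json
import urllib.request

def build_find_request(base_url, db_name, selector, limit=25):
    """Build (but do not send) a POST request for the /{db}/_find endpoint."""
    body = json.dumps({"selector": selector, "limit": limit}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{db_name}/_find",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def run_find(request):
    """Send the request and return the matching documents (needs a running server)."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)["docs"]

# Hypothetical server address and database name.
request = build_find_request(
    "http://localhost:5984",
    "assets",
    {"BCAssetType": {"$eq": "MBSEBaseModel"}},
)
```

Calling run_find(request) against a running CouchDB instance would return the list found under the "docs" key of the response.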
4.3 Wide-Column Model
- A columnar DBMS or column-oriented DBMS stores data tables grouped by columns, instead of grouped by rows (as in most relational DBMS).
- For example, related columns may be stored together in a file for faster performance.
- Benefits:
- Faster access for certain types of queries.
- Better chance for data compression.
- Disadvantages:
- Slower update.
- Slower access for certain types of queries.
- Read introductions to the wide-column model. Examples:
- In the wide-column model, data is stored as keys and columns. Each column contains a column name and a value.
- Thus, to get a data value, use (key, column-name).
- Cassandra is one of the most popular wide-column databases.
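The (key, column-name) lookup described above can be sketched with nested dictionaries. The WideColumnTable class and the sample rows are hypothetical and are not Cassandra's storage model or API.

```python
from collections import defaultdict

class WideColumnTable:
    """A toy wide-column table: each row key maps to its own set of named columns."""

    def __init__(self):
        self._rows = defaultdict(dict)  # row key -> {column name: value}

    def put(self, key, column, value):
        self._rows[key][column] = value

    def get(self, key, column):
        # A data value is addressed by (key, column-name).
        return self._rows.get(key, {}).get(column)

    def columns(self, key):
        return sorted(self._rows.get(key, {}))

table = WideColumnTable()
table.put("user:1", "name", "Ada")
table.put("user:1", "email", "ada@example.org")
table.put("user:2", "name", "Grace")
table.put("user:2", "last_login", "2023-09-01")

email = table.get("user:1", "email")
absent = table.get("user:2", "email")
```

The two rows hold different columns, which a fixed relational schema would have to model with NULLs or extra tables.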
4.4 Graph Model
- "A graph database stores nodes and relationships instead of tables, or documents."
- Quite object-oriented, using a directed graph model.
- Neo4j is the most popular graph database.
- Introduction: https://neo4j.com/developer/graph-database/.
- To start learning Neo4j, download and install Neo4j Desktop.
- Neo4j Query Language: Cypher, https://neo4j.com/developer/cypher/.
- Basic Cypher syntax: (nodes)-[:ARE_CONNECTED_TO]->(otherNodes).
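The pattern (nodes)-[:ARE_CONNECTED_TO]->(otherNodes) can be mimicked with a toy in-memory directed graph. The Graph class, the node identifiers, and the KNOWS relationship type are hypothetical; Neo4j itself is queried with Cypher, not with this code.

```python
class Graph:
    """A toy property graph with labeled nodes and typed, directed relationships."""

    def __init__(self):
        self.nodes = {}          # node id -> {"label": str, "props": dict}
        self.relationships = []  # (start id, relationship type, end id)

    def add_node(self, node_id, label, **props):
        self.nodes[node_id] = {"label": label, "props": props}

    def relate(self, start, rel_type, end):
        self.relationships.append((start, rel_type, end))

    def match(self, rel_type):
        # Roughly analogous to the Cypher pattern (a)-[:rel_type]->(b).
        return [(s, e) for s, t, e in self.relationships if t == rel_type]

graph = Graph()
graph.add_node("p1", "Person", name="Ada")
graph.add_node("p2", "Person", name="Grace")
graph.add_node("p3", "Person", name="Edgar")
graph.relate("p1", "KNOWS", "p2")
graph.relate("p2", "KNOWS", "p3")
graph.relate("p3", "WORKS_WITH", "p1")

pairs = graph.match("KNOWS")
names = [
    (graph.nodes[s]["props"]["name"], graph.nodes[e]["props"]["name"])
    for s, e in pairs
]
```

A comparable Cypher query on such data might read MATCH (a:Person)-[:KNOWS]->(b:Person) RETURN a.name, b.name. Relationships are directed, so Ada knowing Grace does not imply the reverse.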