NoSQL Databases
by K. Yue
1. Introduction
- NoSQL:
- “Not Only SQL” or “Not SQL.” Relatively new kinds of non-relational database systems.
- A loose category of many diverse database systems with different data models.
- Some common NoSQL features:
- Non-relational
- Distributed
- High scalability
- High availability
- High performance and parallelism
- No or flexible schema
- Weaker support of ACID properties
- Oriented towards semi-structured or non-structured data.
- Weaker support of data integrity and constraints.
- Usually more object-oriented.
- More analysis-oriented than transaction-oriented.
- Simpler APIs for database interaction.
- Major non-relational database systems in the DB-Engines ranking, https://db-engines.com/en/ranking (rankings as of 9/1/2023):
- Document models: e.g. MongoDB (rank #5), Couchbase (32), Firebase Realtime Database (38), CouchDB (44), Realm (56).
- Key-value models: e.g. Redis (6), Memcached (33), etcd (52), Hazelcast (55), LevelDB (111).
- Wide-column models: e.g. Cassandra (12), Hive (17), HBase (26), Datastax Enterprise (60).
- Graph models: e.g. Neo4j (22).
- Some observations:
- Search engines (not really a data model): Elasticsearch (8; document model), Splunk (14; key-value), Solr (24; document model).
- The ranking is still dominated by relational databases.
- Many DBs support multiple models.
- Read:
2. Advantages and Disadvantages
- Some advantages of NoSQL databases in general:
- Distributed
- High scalability: horizontal scalability, as opposed to vertical scalability.
- Horizontal scalability (scaling out): add more machines to the distributed system.
- Vertical scalability (scaling up): add more power to existing machines, or replace existing machines with more powerful ones.
- High availability
- High performance
- Flexibility: flexible schema or schemaless
- More object-oriented:
- Better abstraction model
- Better interoperability with programming languages: there is often no need for object-relational mapping.
- Some disadvantages of NoSQL databases in general:
- Weaker data integrity support
- Weaker transaction support
- Weaker theoretical and design methodology support
- Relative lack of standards
- Relative lack of tools and interoperability
- Complexity
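To make the idea of horizontal scalability listed under the advantages more concrete, here is a minimal sketch of spreading keys over a set of machines by hashing. The node names and the simple modulo placement are assumptions for illustration only; production systems typically use more elaborate placement schemes.

```python
import hashlib

def node_for_key(key: str, nodes: list[str]) -> str:
    """Pick a node for a key by hashing the key (simple modulo placement)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(nodes)
    return nodes[index]

def distribute(keys: list[str], nodes: list[str]) -> dict[str, list[str]]:
    """Return a mapping from each node to the keys placed on it."""
    placement = {node: [] for node in nodes}
    for key in keys:
        placement[node_for_key(key, nodes)].append(key)
    return placement

# Hypothetical workload: 1000 keys, first on three machines, then on four.
keys = [f"user:{i}" for i in range(1000)]
three_nodes = distribute(keys, ["node-a", "node-b", "node-c"])
four_nodes = distribute(keys, ["node-a", "node-b", "node-c", "node-d"])

for name, placement in (("3 nodes", three_nodes), ("4 nodes", four_nodes)):
    print(name, {node: len(assigned) for node, assigned in placement.items()})
```

Scaling out here means adding "node-d" to the list, so each machine holds a smaller share of the keys. Scaling up would instead keep the node list unchanged and make each machine more powerful.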
3. ACID versus BASE Transaction Models
3.1 ACID
- Relational databases support the ACID properties to ensure data consistency and integrity of transactions under concurrent access: e.g., http://en.wikipedia.org/wiki/ACID
- ACID properties (review):
- Atomicity: A transaction is an atomic unit of processing. It is either performed in its entirety or not performed at all.
- Consistency preservation: A correct execution of a transaction must take the database from one physically consistent state to another. This is known as physical consistency.
- Isolation: A transaction should not make its updates visible to other tasks and transactions until it is committed.
- Durability or permanency: Once a transaction changes the database and the changes are committed, these changes must never be lost because of subsequent failure.
- Supporting ACID limits other desirable features: scalability, availability, and performance.
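As an illustration of atomicity and consistency preservation, the following sketch uses Python's built-in sqlite3 module with an in-memory database. The account table, the names, and the amounts are made up for this example. The second transfer violates a CHECK constraint halfway through, so the whole transaction is rolled back.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO account VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

def transfer(conn, src, dst, amount):
    """Move money between accounts; both updates commit or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected a negative balance.
        return False

def balances(conn):
    """Return the current balances as a dict."""
    return dict(conn.execute("SELECT name, balance FROM account"))

first_ok = transfer(conn, "alice", "bob", 30)    # valid transfer
second_ok = transfer(conn, "alice", "bob", 500)  # would overdraw alice
print(first_ok, second_ok, balances(conn))
```

In the failed transfer, the credit to bob had already been executed when the debit was rejected. The rollback undoes it, so no partial update is ever visible.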
3.2 BASE
- To enhance scalability, availability, and performance, most NoSQL DBs do not fully support ACID.
- NoSQL supports different transaction models.
- The Basically Available, Soft state, Eventual consistency (BASE) model for distributed databases is the most popular one.
- Basically available: data is basically available across nodes of the distributed database, despite network failures.
- Soft-state: There is no immediate consistency. As a result, different replicas may have different values across the distributed systems at a given time. Thus, the state of the database is soft.
- Eventual consistency: eventually, data replicas will have the same value across the distributed database.
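The soft state and eventual consistency described above can be sketched with a toy simulation. The Replica class, the logical timestamps, and the last-write-wins rule are assumptions chosen for brevity. Real systems use a variety of replication and conflict-resolution strategies.

```python
class Replica:
    """One copy of the data; keeps the newest version it has seen per key."""

    def __init__(self, name):
        self.name = name
        self.data = {}  # key -> (logical_timestamp, value)

    def write(self, key, value, timestamp):
        current = self.data.get(key)
        # Last-write-wins: accept only versions newer than the stored one.
        if current is None or timestamp > current[0]:
            self.data[key] = (timestamp, value)

    def read(self, key):
        entry = self.data.get(key)
        return None if entry is None else entry[1]

def anti_entropy(replicas):
    """Let every replica push its entries to every other replica."""
    for source in replicas:
        for target in replicas:
            if source is not target:
                for key, (timestamp, value) in list(source.data.items()):
                    target.write(key, value, timestamp)

a, b, c = Replica("a"), Replica("b"), Replica("c")
replicas = [a, b, c]

a.write("cart", ["book"], timestamp=1)
anti_entropy(replicas)                          # all replicas agree on ["book"]

b.write("cart", ["book", "pen"], timestamp=2)   # only replica b has the update
stale_reads = [r.read("cart") for r in replicas]

anti_entropy(replicas)                          # replicas exchange updates
converged_reads = [r.read("cart") for r in replicas]
print(stale_reads, converged_reads)
```

Between the write to replica b and the next synchronization round, reads from a and c return the older value (soft state). After synchronization, all replicas return the same value (eventual consistency).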
3.3 CAP Theorem
- Since NoSQL databases are mostly distributed, it is important to have some understanding of the famous CAP theorem for distributed data stores.
- See, for example, https://en.wikipedia.org/wiki/CAP_theorem.
- There are three desirable guarantees of distributed data stores, CAP:
- Consistency: every read returns the most recent write for the same data, regardless of which node of the distributed system serves the request.
- Availability: every request receives a non-error response. (Note that the returned data may not reflect the most recent write.)
- Partition tolerance: the database continues to operate in case of network partitions (one partition of the network cannot communicate with another partition because of message drops).
- The CAP theorem states that any distributed database can provide at most two of the three guarantees at the same time.
- Different databases base their designs on prioritizing two of the three (C, A, P).
- For a more detailed discussion, one may see: https://www.instaclustr.com/blog/cassandra-vs-mongodb/ (optional read):
- It contains a discussion of how Cassandra and MongoDB trade off the CAP guarantees.
- It also includes a discussion of a more refined CAP theorem: PACELC Theorem: "PACELC is summarized as follows: In the event of a partition failure, a distributed system must choose between Availability (A) and Consistency (C), else (E) when running normally it must choose between latency (L) or consistency (C)."
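The trade-off during a network partition can be illustrated with a toy two-replica store. The TwoNodeStore class and its "CP" and "AP" modes are hypothetical names for this sketch and do not model any particular product.

```python
class Unavailable(Exception):
    """Raised when the store refuses a request to protect consistency."""

class TwoNodeStore:
    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.nodes = [{}, {}]      # two replicas of the same data
        self.partitioned = False   # True when the replicas cannot communicate

    def write(self, node, key, value):
        if not self.partitioned:
            for replica in self.nodes:   # normal case: update both replicas
                replica[key] = value
        elif self.mode == "CP":
            # Prefer consistency: refuse writes that cannot be replicated.
            raise Unavailable("cannot reach the other replica")
        else:
            # Prefer availability: accept the write locally; replicas diverge.
            self.nodes[node][key] = value

    def read(self, node, key):
        return self.nodes[node].get(key)

cp = TwoNodeStore("CP")
cp.write(0, "x", 1)
cp.partitioned = True
try:
    cp.write(0, "x", 2)
    cp_rejected = False
except Unavailable:
    cp_rejected = True
cp_reads = (cp.read(0, "x"), cp.read(1, "x"))   # both replicas still agree

ap = TwoNodeStore("AP")
ap.write(0, "x", 1)
ap.partitioned = True
ap.write(0, "x", 2)                             # accepted by node 0 only
ap_reads = (ap.read(0, "x"), ap.read(1, "x"))   # replicas now disagree
print(cp_rejected, cp_reads, ap_reads)
```

In "CP" mode the store gives up availability for writes during the partition but its replicas never disagree. In "AP" mode every request is answered, at the cost of replicas returning different values until the partition heals.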
4. Major NoSQL data models
4.1 Key-value model
- Data is stored as key-value pairs.
- Values can possibly be JavaScript Object Notation (JSON) strings, which store serialized objects.
- Some key-value databases support rich JSON queries.
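A minimal sketch of the key-value idea, using a Python dict as a stand-in for the database and JSON strings as the stored values. The key naming scheme and the sample object are assumptions for illustration.

```python
import json

store = {}  # stand-in for a key-value database: key -> JSON string

def put(key, obj):
    """Serialize an object to JSON and store it under the key."""
    store[key] = json.dumps(obj)

def get(key):
    """Fetch the JSON string for the key and deserialize it, or return None."""
    raw = store.get(key)
    return None if raw is None else json.loads(raw)

put("user:42", {"name": "Ada", "languages": ["en", "fr"]})
user = get("user:42")
print(type(store["user:42"]).__name__, user["name"])
```

From the point of view of a plain key-value store, the value is an opaque string. Looking up by key is direct, but finding all users with a given name would require scanning every value unless the database understands JSON.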
4.2 Document Model
- Document-oriented databases store data as documents.
- Documents can be considered as semi-structured data.
- Thus, XML databases can be considered as employing the document model.
- Modern document-oriented databases commonly employ JSON. E.g., MongoDB and CouchDB.
- The document model can be considered a subclass of the key-value model.
- The stored value of a key-value model can be a document.
- The stored value can be manipulated by operations based on the selected document model (mostly JSON).
- MongoDB is likely the most popular document-oriented NoSQL DB. It will be covered in more detail in this class.
Example: In CouchDB, a key-value pair may be:
Key: "MBSEBaseModel~939c7672-5d2d-11ec-bf63-0242ac130002"
Value to be stored:
{
  "BCAssetId": "939c7672-5d2d-11ec-bf63-0242ac130002",
  "BCAssetType": "MBSEBaseModel",
  "BCAssetName": "Gateway-PPE-Base-Model",
  "BaseModelDesc": "PPE project's model.",
  "version": {
    "version": "2.1",
    "subversion": "4.6",
    "startTime": "2021-12-08T17:25:23+06:00"
  },
  "storage": {
    "isEncrypted": true,
    "EncrypMethod": "AES256",
    "EncrypKey": "q4t7w!z%C*F-JaNdRgUkXp2r5u8x/A?D",
    "useIPFS": true,
    "IPFSCid": "QmT5NvUtoM5nWFfrQdVrFtvGfKFmG7AHE8P34isapyhCxX",
    "IPFS_HashHead": "A0Xa",
    "payloadRaw": "raw PPE MBSE Base model description. V2.1.4.6."
  }
}
Note:
- CouchDB adds two fields, _id and _rev, automatically if they are not supplied.
- The field _rev is used for multi-version concurrency control (MVCC) to ensure 'eventual consistency.'
- Some may consider CouchDB as a document-oriented database. The boundary between key-value model and document model is not clear cut.
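The role of a revision field in concurrency control can be sketched with a toy in-memory store. This is not CouchDB's API: the RevisionedStore class is hypothetical, and it uses integer revisions for simplicity, whereas CouchDB revisions are strings such as "1-" followed by a hash.

```python
class Conflict(Exception):
    """Raised when an update is based on an outdated revision."""

class RevisionedStore:
    def __init__(self):
        self._docs = {}  # _id -> stored document, including "_id" and "_rev"

    def put(self, doc_id, body, rev=None):
        """Create or update a document; the caller must supply the current revision."""
        current = self._docs.get(doc_id)
        current_rev = current["_rev"] if current else None
        if rev != current_rev:
            raise Conflict(f"expected revision {current_rev}, got {rev}")
        new_rev = 1 if current_rev is None else current_rev + 1
        self._docs[doc_id] = {**body, "_id": doc_id, "_rev": new_rev}
        return new_rev

    def get(self, doc_id):
        return dict(self._docs[doc_id])

db = RevisionedStore()
rev1 = db.put("model-1", {"BCAssetType": "MBSEBaseModel", "note": "draft"})
rev2 = db.put("model-1", {"BCAssetType": "MBSEBaseModel", "note": "final"}, rev=rev1)

try:
    # A second writer still holds rev1 and tries to overwrite the document.
    db.put("model-1", {"BCAssetType": "MBSEBaseModel", "note": "stale"}, rev=rev1)
    stale_write_rejected = False
except Conflict:
    stale_write_rejected = True
print(rev1, rev2, stale_write_rejected, db.get("model-1")["note"])
```

A writer whose copy of the document is out of date is rejected instead of silently overwriting a newer version. It must fetch the latest revision and retry.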
To query CouchDB, one may use many methods. Examples:
- CouchDB RESTful API: https://docs.couchdb.org/en/latest/api/index.html
- MapReduce-based views: https://docs.couchdb.org/en/latest/ddocs/views/index.html
- Mango query
An example Mango query that returns all CouchDB key-value pairs for MBSEBaseModel.
{
  "selector": {
    "BCAssetType": { "$eq": "MBSEBaseModel" }
  }
}
See https://docs.couchdb.org/en/latest/api/database/find.html for more information about selector syntax.
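As a sketch of how such a Mango query could be submitted from a program, the following builds an HTTP POST request for the _find endpoint using only the Python standard library. The base URL, the database name "assets", and the helper name build_find_request are assumptions for illustration. The request is only constructed here, not sent.

```python
import json
import urllib.request

def build_find_request(base_url, database, selector, limit=25):
    """Build (but do not send) a POST request for CouchDB's /{db}/_find endpoint."""
    body = json.dumps({"selector": selector, "limit": limit}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{database}/_find",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Hypothetical server location and database name.
request = build_find_request(
    "http://localhost:5984",
    "assets",
    {"BCAssetType": {"$eq": "MBSEBaseModel"}},
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Passing the request to urllib.request.urlopen would send it to a running server. Depending on the server configuration, credentials would also have to be supplied.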
4.3 Wide-Column Model
- A columnar DBMS or column-oriented DBMS stores data tables grouped by columns, instead of grouped by rows (as in most relational DBMS).
- For example, related columns may be stored together in a file for faster performance.
- Benefits:
- Faster access for certain types of queries.
- Better chance for data compression.
- Disadvantages:
- Slower update.
- Slower access for certain types of queries.
- Read introductions to the wide-column model. Examples:
- https://dandkim.com/wide-column-databases/
- In the wide-column model, data is stored as keys and columns. Each column contains a column name and a value.
- Thus, to get a data value, use (key, column-name).
- Cassandra is one of the most popular wide-column databases.
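The (key, column-name) lookup described above can be sketched with nested dictionaries. The WideColumnTable class and the sample rows are hypothetical and ignore everything a real wide-column database adds, such as partitioning, column families, and on-disk layout.

```python
from collections import defaultdict

class WideColumnTable:
    """Toy table in which every row key maps to its own set of named columns."""

    def __init__(self):
        self._rows = defaultdict(dict)  # row key -> {column name: value}

    def put(self, key, column, value):
        self._rows[key][column] = value

    def get(self, key, column, default=None):
        """Look up one value by (key, column-name)."""
        return self._rows.get(key, {}).get(column, default)

    def columns(self, key):
        """Return the column names stored for a row key, sorted by name."""
        return sorted(self._rows.get(key, {}))

table = WideColumnTable()
table.put("user:1", "name", "Ada")
table.put("user:1", "email", "ada@example.org")
table.put("user:2", "name", "Grace")
table.put("user:2", "last_login", "2023-09-01")

print(table.get("user:1", "email"))
print(table.columns("user:1"), table.columns("user:2"))
```

The two rows carry different columns, which is what makes the schema flexible: a column exists only for the rows that store a value for it.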
4.4 Graph Model
- "A graph database stores nodes and relationships instead of tables, or documents."
- Quite object-oriented, using a directed graph model.
- Neo4j is the most popular graph database.
- Introduction: https://neo4j.com/developer/graph-database/.
- To start learning Neo4j, download and install Neo4j Desktop.
- Neo4j Query Language: Cypher, https://neo4j.com/developer/cypher/.
- Basic Cypher syntax: (nodes)-[:ARE_CONNECTED_TO]->(otherNodes).
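To show what a pattern such as (nodes)-[:ARE_CONNECTED_TO]->(otherNodes) expresses, here is a small in-memory directed graph in Python. The Graph class, the node labels, and the ACTED_IN relationship are assumptions for illustration. This is a sketch of the data model, not of how Neo4j stores or queries data.

```python
class Graph:
    """Toy directed graph with labeled nodes and typed relationships."""

    def __init__(self):
        self.nodes = {}   # node id -> {"label": str, "props": dict}
        self.edges = []   # (source id, relationship type, target id)

    def add_node(self, node_id, label, **props):
        self.nodes[node_id] = {"label": label, "props": props}

    def relate(self, source, rel_type, target):
        self.edges.append((source, rel_type, target))

    def match(self, source_label, rel_type, target_label):
        """Rough analogue of: MATCH (a:SourceLabel)-[:REL_TYPE]->(b:TargetLabel) RETURN a, b"""
        return [
            (source, target)
            for source, rel, target in self.edges
            if rel == rel_type
            and self.nodes[source]["label"] == source_label
            and self.nodes[target]["label"] == target_label
        ]

g = Graph()
g.add_node("keanu", "Person", name="Keanu Reeves")
g.add_node("carrie", "Person", name="Carrie-Anne Moss")
g.add_node("matrix", "Movie", title="The Matrix")
g.relate("keanu", "ACTED_IN", "matrix")
g.relate("carrie", "ACTED_IN", "matrix")
g.relate("keanu", "KNOWS", "carrie")

actors = g.match("Person", "ACTED_IN", "Movie")
print(actors)
```

Relationships are stored as first-class data with a direction and a type, so a query follows edges directly instead of joining tables.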