PolarSPARC |
Hands-on MongoDB :: Part-5
Bhaskar S | 02/19/2021 (NEW) |
Introduction
In Part-1, we set up the high-availability 3-node cluster using Docker and got our hands dirty with MongoDB by using the mongo command-line interface.
In Part-2, we got to demonstrate the same set of operations on MongoDB using the programming language drivers for Java and Python.
In Part-3, we got to demonstrate the same set of operations on MongoDB using Spring Boot with Java.
In Part-4, we got to explore the more advanced operators for querying and updating MongoDB using the mongo command-line interface.
In this FINAL part, we will explore some miscellaneous topics around replica sets, schema validation, bulk data import, using indexes, etc., using the mongo command-line interface.
Hands-on with MongoDB
Replica Set
Typical enterprise production deployments imply multiple replicas of the data on different server(s) (running in different datacenters that are geographically dispersed) for fault-tolerance and resiliency purposes. In MongoDB, a replicated deployment is achieved by using a Replica Set. In Part-1 of this series, we set up a 3-node Replica Set using the official MongoDB docker image.
The following diagram illustrates the high-level view of the Replica Set (initial state of the nodes in the cluster):
The nodes in the Replica Set go through an election process to elect a leader (referred to as the Primary) which then replicates data ASYNCHRONOUSLY to the other follower nodes (referred to as the Secondary nodes) of the cluster.
The following diagram illustrates the high-level view of the Replica Set after the Primary election:
When a Replica Set is deployed, the writes always have to go through the Primary node as it is the one replicating to the Secondary node(s) in the cluster. Also, by default, reads are prohibited on the Secondary nodes because replication is asynchronous and a Secondary node may be behind the Primary (due to network latencies, for example). This is the reason why we encountered the error not master and slaveOk=false when we tried to perform any operation via a Secondary node.
Until now, we have always looked for and directly connected to the Primary node to perform operations. Given that there are 3 nodes in our Replica Set, is there a way to connect to the cluster so the interactive shell or the driver can automatically connect to the Primary ???
The answer is - YES !!!
To connect to our Replica Set mongodb-rs using the command-line interface mongo (using docker), execute the following command:
$ docker run --rm -it mongo:4.4.3 mongo "mongodb://192.168.1.53:5001,192.168.1.53:5002,192.168.1.53:5003/mydb?replicaSet=mongodb-rs"
The following will be the output:
MongoDB shell version v4.4.3
connecting to: mongodb://192.168.1.53:5001,192.168.1.53:5002,192.168.1.53:5003/mydb?compressors=disabled&gssapiServiceName=mongodb&replicaSet=mongodb-rs
Implicit session: session { "id" : UUID("f7f9457d-9e61-4f66-8796-0a34c8e11ddb") }
MongoDB server version: 4.4.3
Welcome to the MongoDB shell.
For interactive help, type "help".
For more comprehensive documentation, see
    https://docs.mongodb.com/
Questions? Try the MongoDB Developer Community Forums
    https://community.mongodb.com
---
The server generated these startup warnings when booting:
    2021-02-20T00:59:37.527+00:00: Using the XFS filesystem is strongly recommended with the WiredTiger storage engine. See http://dochub.mongodb.org/core/prodnotes-filesystem
    2021-02-20T00:59:38.776+00:00: Access control is not enabled for the database. Read and write access to data and configuration is unrestricted
    2021-02-20T00:59:38.776+00:00: You are running this process as the root user, which is not recommended
---
---
    Enable MongoDB's free cloud-based monitoring service, which will then receive and display
    metrics about your deployment (disk utilization, CPU, operation statistics, etc).
    The monitoring data will be available on a MongoDB website with a unique URL accessible to you
    and anyone you share the URL with. MongoDB may use this information to make product
    improvements and to suggest MongoDB products and deployment options to you.
    To enable free monitoring, run the following command: db.enableFreeMonitoring()
    To permanently disable this reminder, run the following command: db.disableFreeMonitoring()
---
mongodb-rs:PRIMARY>
BINGO !!! We have successfully connected to our Replica Set and, by default, the shell connects to the Primary node.
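The connection string used above follows a simple pattern: a comma-separated seed list of host:port pairs, the database name, and the replicaSet option. As a minimal sketch in plain JavaScript (the helper name buildReplicaSetUri is hypothetical and not part of the mongo shell or any MongoDB driver), the same string could be assembled as follows:

```javascript
// Hypothetical helper (not a MongoDB API): assembles a replica set connection URI
// from a seed list of host:port pairs, a database name, and the replica set name.
function buildReplicaSetUri(hosts, database, replicaSet) {
  const seedList = hosts.join(',');
  return `mongodb://${seedList}/${database}?replicaSet=${encodeURIComponent(replicaSet)}`;
}

// Same hosts, database, and replica set name as used in this article
const uri = buildReplicaSetUri(
  ['192.168.1.53:5001', '192.168.1.53:5002', '192.168.1.53:5003'],
  'mydb',
  'mongodb-rs'
);

console.log(uri);
```

Listing all the members in the seed list means the client can still discover the current Primary even if one of the listed nodes is down.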
Schema Validation
By default, there is no schema enforcement or validation on any document being inserted or updated in a MongoDB collection. What if we want to enforce some kind of a structure on the documents being inserted or updated into a MongoDB collection ???
This is where the MongoDB Schema Validation via the $jsonSchema operator comes in handy.
We will explicitly create a MongoDB collection called contacts that will require the mandatory text fields first, last, and an email sub-document with the mandatory text field personal and an optional text field work.
Assuming the command-line interactive MongoDB shell is running, execute the following command to create the collection with the schema validator:
mongodb-rs:PRIMARY> db.createCollection('contacts', {
  validator: {
    $jsonSchema: {
      bsonType: 'object',
      required: [ 'first', 'last', 'email' ],
      properties: {
        first: { bsonType: 'string', maxLength: 25, description: 'required and must be a string' },
        last: { bsonType: 'string', maxLength: 25, description: 'required and must be a string' },
        email: {
          bsonType: 'object',
          required: [ 'personal' ],
          properties: {
            // The double backslash keeps '\.' in the pattern so that the dot is matched literally
            personal: { bsonType: 'string', pattern: "^[A-Za-z0-9_]+@[A-Za-z0-9]+\\.[a-z]{2,3}$", maxLength: 50, description: 'required and must be a string' },
            work: { bsonType: 'string', pattern: "^[A-Za-z0-9_]+@[A-Za-z0-9]+\\.[a-z]{2,3}$", maxLength: 50, description: 'optional and must be a string' }
          }
        }
      }
    }
  }
})
The following will be the output:
{
  "ok" : 1,
  "$clusterTime" : {
    "clusterTime" : Timestamp(1613753308, 1),
    "signature" : {
      "hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="),
      "keyId" : NumberLong(0)
    }
  },
  "operationTime" : Timestamp(1613692881, 1)
}
The following are brief descriptions for some of the operator(s) used in the command above:
bsonType :: specifies the type of the document, sub-document, or a field in the document. The top level (the document) is of the type object. The other fields in the document, such as, first, last, personal, work are all of type string
required :: specifies an array of mandatory field name(s) that *MUST* be present in the document
properties :: a JSON document that specifies the schema definition for each field in the Mongo document
pattern :: specifies a regular expression to validate a field in the document
maxLength :: specifies the maximum length of the string value the field holds
description :: specifies an optional string that describes the schema definition for a field
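To make the keywords above more concrete, here is a rough sketch in plain JavaScript that mimics the required, maxLength, and pattern checks for our contacts documents. It is only an illustration: the actual validation is performed by the MongoDB server, the helper names isValidString and isValidContact are hypothetical, and the regular expression assumes the dot before the domain suffix is meant to be matched literally.

```javascript
// Illustrative only: mimics a few $jsonSchema keywords in plain JavaScript.
// The real validation is done by the MongoDB server, not by this code.
const EMAIL_PATTERN = /^[A-Za-z0-9_]+@[A-Za-z0-9]+\.[a-z]{2,3}$/;

// Checks a string field against the maxLength and (optional) pattern rules
function isValidString(value, maxLength, pattern) {
  if (typeof value !== 'string') return false;
  if (value.length > maxLength) return false;
  return pattern ? pattern.test(value) : true;
}

// Hypothetical helper: returns true if the document satisfies the contacts rules
function isValidContact(doc) {
  // 'first' and 'last' are required strings of at most 25 characters
  if (!isValidString(doc.first, 25) || !isValidString(doc.last, 25)) return false;
  // 'email' is a required sub-document
  const email = doc.email;
  if (email === null || typeof email !== 'object') return false;
  // 'personal' is required inside 'email'
  if (!isValidString(email.personal, 50, EMAIL_PATTERN)) return false;
  // 'work' is optional, but must follow the rules when present
  if ('work' in email && !isValidString(email.work, 50, EMAIL_PATTERN)) return false;
  return true;
}

console.log(isValidContact({ first: 'Alice', last: 'Thompson' }));
console.log(isValidContact({ first: 'Alice', last: 'Thompson', email: { personal: 'alice_thompson@home.io' } }));
```

The first document is rejected because the required email sub-document is missing, while the second one passes all the checks.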
Let us try to add a new invalid document (one that is missing the mandatory email field) to the contacts collection by executing the following command:
mongodb-rs:PRIMARY> db.contacts.insert({ first: "Alice", last: "Thompson" })
The following will be the output:
WriteResult({
  "nInserted" : 0,
  "writeError" : {
    "code" : 121,
    "errmsg" : "Document failed validation"
  }
})
Let us now try to add a new valid document to the contacts collection by executing the following command:
mongodb-rs:PRIMARY> db.contacts.insert({ first: "Alice", last: "Thompson", email: { personal: 'alice_thompson@home.io' } })
The following will be the output:
WriteResult({ "nInserted" : 1 })
We will now go ahead and drop the collection by executing the following command:
mongodb-rs:PRIMARY> db.contacts.drop()
The following will be the output:
true
AWESOME !!! We have successfully demonstrated schema validation on MongoDB collections.
Bulk Data Import
There will be times when we need to bulk load data into one or more MongoDB collections. How do we do that ??? This is where the mongoimport tool comes in handy.
We will use a sample dataset for the purposes of demonstration. The sample dataset will contain 100 randomly generated contact details to be added to the collection called contacts.
The sample dataset is stored in a file called contacts.json and contains an array of the contact documents in the JSON format.
The following are the contents of the file contacts.json (which will be placed in the directory $HOME/Downloads/DATA/mongodb):
[{ '_id': 'charlie_ng', 'first': 'Charlie', 'last': 'Ng', 'email': { 'personal': 'charlie_ng@pear.us' }, 'zip': '50016' }, { '_id': 'frank_nelson', 'first': 'Frank', 'last': 'Nelson', 'email': { 'personal': 'frank_nelson@raspberry.us' }, 'zip': '50011' }, { '_id': 'bob_ng', 'first': 'Bob', 'last': 'Ng', 'email': { 'personal': 'bob_ng@granola.us' }, 'zip': '50020' }, { '_id': 'david_lee', 'first': 'David', 'last': 'Lee', 'email': { 'personal': 'david_lee@fig.org' }, 'zip': '50013' }, { '_id': 'charlie_davidson', 'first': 'Charlie', 'last': 'Davidson', 'email': { 'personal': 'charlie_davidson@fig.org', 'work': 'charlie.davidson@orange.io' },'zip': '50020' }, { '_id': 'george_johnson', 'first': 'George', 'last': 'Johnson', 'email': { 'personal': 'george_johnson@lemon.io', 'work': 'george.johnson@purple.us' },'zip': '50014' }, { '_id': 'george_davidson', 'first': 'George', 'last': 'Davidson', 'email': { 'personal': 'george_davidson@clove.net' }, 'zip': '50019' }, { '_id': 'alice_johnson', 'first': 'Alice', 'last': 'Johnson', 'email': { 'personal': 'alice_johnson@pear.us' }, 'zip': '50019' }, { '_id': 'jack_baker', 'first': 'Jack', 'last': 'Baker', 'email': { 'personal': 'jack_baker@watermelon.io', 'work': 'jack.baker@green.org' },'zip': '50019' }, { '_id': 'jack_jones', 'first': 'Jack', 'last': 'Jones', 'email': { 'personal': 'jack_jones@watermelon.io', 'work': 'jack.jones@orange.io' },'zip': '50020' }, { '_id': 'frank_norman', 'first': 'Frank', 'last': 'Norman', 'email': { 'personal': 'frank_norman@jam.org', 'work': 'frank.norman@purple.us' },'zip': '50010' }, { '_id': 'bob_jones', 'first': 'Bob', 'last': 'Jones', 'email': { 'personal': 'bob_jones@lemon.io' }, 'zip': '50016' }, { '_id': 'jack_ng', 'first': 'Jack', 'last': 'Ng', 'email': { 'personal': 'jack_ng@watermelon.io' }, 'zip': '50010' }, { '_id': 'kelly_lee', 'first': 'Kelly', 'last': 'Lee', 'email': { 'personal': 'kelly_lee@dates.io' }, 'zip': '50014' }, { '_id': 'bob_baker', 'first': 'Bob', 'last': 'Baker', 
'email': { 'personal': 'bob_baker@watermelon.io', 'work': 'bob.baker@green.org' },'zip': '50018' }, { '_id': 'kelly_johnson', 'first': 'Kelly', 'last': 'Johnson', 'email': { 'personal': 'kelly_johnson@watermelon.io' }, 'zip': '50017' }, { '_id': 'alice_connor', 'first': 'Alice', 'last': 'Connor', 'email': { 'personal': 'alice_connor@pear.us', 'work': 'alice.connor@green.org' },'zip': '50011' }, { '_id': 'holly_davidson', 'first': 'Holly', 'last': 'Davidson', 'email': { 'personal': 'holly_davidson@banana.io' }, 'zip': '50010' }, { '_id': 'david_johnson', 'first': 'David', 'last': 'Johnson', 'email': { 'personal': 'david_johnson@dates.io' }, 'zip': '50012' }, { '_id': 'david_connor', 'first': 'David', 'last': 'Connor', 'email': { 'personal': 'david_connor@clove.net' }, 'zip': '50016' }, { '_id': 'alice_norman', 'first': 'Alice', 'last': 'Norman', 'email': { 'personal': 'alice_norman@lemon.io', 'work': 'alice.norman@purple.us' },'zip': '50015' }, { '_id': 'holly_norman', 'first': 'Holly', 'last': 'Norman', 'email': { 'personal': 'holly_norman@fig.org' }, 'zip': '50014' }, { '_id': 'eve_nelson', 'first': 'Eve', 'last': 'Nelson', 'email': { 'personal': 'eve_nelson@granola.us' }, 'zip': '50013' }, { '_id': 'alice_nelson', 'first': 'Alice', 'last': 'Nelson', 'email': { 'personal': 'alice_nelson@granola.us' }, 'zip': '50014' }, { '_id': 'alice_baker', 'first': 'Alice', 'last': 'Baker', 'email': { 'personal': 'alice_baker@raspberry.us', 'work': 'alice.baker@maroon.us' },'zip': '50016' }, { '_id': 'holly_baker', 'first': 'Holly', 'last': 'Baker', 'email': { 'personal': 'holly_baker@dates.io' }, 'zip': '50019' }, { '_id': 'alice_lee', 'first': 'Alice', 'last': 'Lee', 'email': { 'personal': 'alice_lee@raspberry.us' }, 'zip': '50016' }, { '_id': 'george_ng', 'first': 'George', 'last': 'Ng', 'email': { 'personal': 'george_ng@fig.org' }, 'zip': '50019' }, { '_id': 'kelly_norman', 'first': 'Kelly', 'last': 'Norman', 'email': { 'personal': 'kelly_norman@granola.us' }, 'zip': 
'50017' }, { '_id': 'charlie_nelson', 'first': 'Charlie', 'last': 'Nelson', 'email': { 'personal': 'charlie_nelson@watermelon.io' }, 'zip': '50011' }, { '_id': 'eve_davidson', 'first': 'Eve', 'last': 'Davidson', 'email': { 'personal': 'eve_davidson@clove.net' }, 'zip': '50013' }, { '_id': 'bob_connor', 'first': 'Bob', 'last': 'Connor', 'email': { 'personal': 'bob_connor@banana.io', 'work': 'bob.connor@purple.us' },'zip': '50010' }, { '_id': 'kelly_davidson', 'first': 'Kelly', 'last': 'Davidson', 'email': { 'personal': 'kelly_davidson@pear.us' }, 'zip': '50011' }, { '_id': 'eve_connor', 'first': 'Eve', 'last': 'Connor', 'email': { 'personal': 'eve_connor@raspberry.us' }, 'zip': '50012' }, { '_id': 'jack_thompson', 'first': 'Jack', 'last': 'Thompson', 'email': { 'personal': 'jack_thompson@dates.io', 'work': 'jack.thompson@cyan.net' },'zip': '50019' }, { '_id': 'george_connor', 'first': 'George', 'last': 'Connor', 'email': { 'personal': 'george_connor@pear.us' }, 'zip': '50020' }, { '_id': 'holly_jones', 'first': 'Holly', 'last': 'Jones', 'email': { 'personal': 'holly_jones@fig.org', 'work': 'holly.jones@orange.io' },'zip': '50010' }, { '_id': 'holly_johnson', 'first': 'Holly', 'last': 'Johnson', 'email': { 'personal': 'holly_johnson@pear.us', 'work': 'holly.johnson@blue.edu' },'zip': '50018' }, { '_id': 'kelly_jones', 'first': 'Kelly', 'last': 'Jones', 'email': { 'personal': 'kelly_jones@banana.io', 'work': 'kelly.jones@violet.io' },'zip': '50018' }, { '_id': 'bob_davidson', 'first': 'Bob', 'last': 'Davidson', 'email': { 'personal': 'bob_davidson@banana.io' }, 'zip': '50013' }, { '_id': 'kelly_thompson', 'first': 'Kelly', 'last': 'Thompson', 'email': { 'personal': 'kelly_thompson@dates.io', 'work': 'kelly.thompson@green.org' },'zip': '50016' }, { '_id': 'david_norman', 'first': 'David', 'last': 'Norman', 'email': { 'personal': 'david_norman@dates.io', 'work': 'david.norman@purple.us' },'zip': '50019' }, { '_id': 'eve_johnson', 'first': 'Eve', 'last': 'Johnson', 
'email': { 'personal': 'eve_johnson@raspberry.us' }, 'zip': '50020' }, { '_id': 'holly_lee', 'first': 'Holly', 'last': 'Lee', 'email': { 'personal': 'holly_lee@fig.org' }, 'zip': '50016' }, { '_id': 'alice_ng', 'first': 'Alice', 'last': 'Ng', 'email': { 'personal': 'alice_ng@jam.org', 'work': 'alice.ng@blue.edu' },'zip': '50017' }, { '_id': 'kelly_baker', 'first': 'Kelly', 'last': 'Baker', 'email': { 'personal': 'kelly_baker@clove.net', 'work': 'kelly.baker@orange.io' },'zip': '50020' }, { '_id': 'eve_thompson', 'first': 'Eve', 'last': 'Thompson', 'email': { 'personal': 'eve_thompson@pear.us', 'work': 'eve.thompson@violet.io' },'zip': '50014' }, { '_id': 'jack_nelson', 'first': 'Jack', 'last': 'Nelson', 'email': { 'personal': 'jack_nelson@jam.org' }, 'zip': '50012' }, { '_id': 'charlie_lee', 'first': 'Charlie', 'last': 'Lee', 'email': { 'personal': 'charlie_lee@pear.us', 'work': 'charlie.lee@pink.com' },'zip': '50016' }, { '_id': 'david_ng', 'first': 'David', 'last': 'Ng', 'email': { 'personal': 'david_ng@clove.net', 'work': 'david.ng@red.net' },'zip': '50012' }, { '_id': 'bob_johnson', 'first': 'Bob', 'last': 'Johnson', 'email': { 'personal': 'bob_johnson@clove.net' }, 'zip': '50014' }, { '_id': 'bob_norman', 'first': 'Bob', 'last': 'Norman', 'email': { 'personal': 'bob_norman@granola.us', 'work': 'bob.norman@violet.io' },'zip': '50012' }, { '_id': 'frank_jones', 'first': 'Frank', 'last': 'Jones', 'email': { 'personal': 'frank_jones@fig.org', 'work': 'frank.jones@pink.com' },'zip': '50010' }, { '_id': 'alice_jones', 'first': 'Alice', 'last': 'Jones', 'email': { 'personal': 'alice_jones@clove.net' }, 'zip': '50011' }, { '_id': 'frank_thompson', 'first': 'Frank', 'last': 'Thompson', 'email': { 'personal': 'frank_thompson@dates.io' }, 'zip': '50010' }, { '_id': 'george_lee', 'first': 'George', 'last': 'Lee', 'email': { 'personal': 'george_lee@lemon.io' }, 'zip': '50016' }, { '_id': 'david_thompson', 'first': 'David', 'last': 'Thompson', 'email': { 'personal': 
'david_thompson@clove.net', 'work': 'david.thompson@purple.us' },'zip': '50010' }, { '_id': 'george_baker', 'first': 'George', 'last': 'Baker', 'email': { 'personal': 'george_baker@banana.io' }, 'zip': '50020' }, { '_id': 'george_jones', 'first': 'George', 'last': 'Jones', 'email': { 'personal': 'george_jones@granola.us', 'work': 'george.jones@red.net' },'zip': '50016' }, { '_id': 'george_thompson', 'first': 'George', 'last': 'Thompson', 'email': { 'personal': 'george_thompson@pear.us', 'work': 'george.thompson@brown.io' },'zip': '50019' }, { '_id': 'charlie_baker', 'first': 'Charlie', 'last': 'Baker', 'email': { 'personal': 'charlie_baker@raspberry.us' }, 'zip': '50018' }, { '_id': 'george_nelson', 'first': 'George', 'last': 'Nelson', 'email': { 'personal': 'george_nelson@lemon.io' }, 'zip': '50011' }, { '_id': 'charlie_thompson', 'first': 'Charlie', 'last': 'Thompson', 'email': { 'personal': 'charlie_thompson@pear.us' }, 'zip': '50018' }, { '_id': 'frank_lee', 'first': 'Frank', 'last': 'Lee', 'email': { 'personal': 'frank_lee@jam.org', 'work': 'frank.lee@pink.com' },'zip': '50011' }, { '_id': 'david_davidson', 'first': 'David', 'last': 'Davidson', 'email': { 'personal': 'david_davidson@granola.us', 'work': 'david.davidson@blue.edu' },'zip': '50011' }, { '_id': 'holly_ng', 'first': 'Holly', 'last': 'Ng', 'email': { 'personal': 'holly_ng@dates.io', 'work': 'holly.ng@pink.com' },'zip': '50020' }, { '_id': 'charlie_johnson', 'first': 'Charlie', 'last': 'Johnson', 'email': { 'personal': 'charlie_johnson@watermelon.io', 'work': 'charlie.johnson@green.org' },'zip': '50014' }, { '_id': 'eve_ng', 'first': 'Eve', 'last': 'Ng', 'email': { 'personal': 'eve_ng@granola.us', 'work': 'eve.ng@red.net' },'zip': '50019' }, { '_id': 'george_norman', 'first': 'George', 'last': 'Norman', 'email': { 'personal': 'george_norman@lemon.io' }, 'zip': '50018' }, { '_id': 'bob_thompson', 'first': 'Bob', 'last': 'Thompson', 'email': { 'personal': 'bob_thompson@dates.io' }, 'zip': '50011' }, { 
'_id': 'jack_norman', 'first': 'Jack', 'last': 'Norman', 'email': { 'personal': 'jack_norman@pear.us' }, 'zip': '50020' }, { '_id': 'holly_thompson', 'first': 'Holly', 'last': 'Thompson', 'email': { 'personal': 'holly_thompson@clove.net' }, 'zip': '50019' }, { '_id': 'bob_lee', 'first': 'Bob', 'last': 'Lee', 'email': { 'personal': 'bob_lee@lemon.io', 'work': 'bob.lee@red.net' },'zip': '50014' }, { '_id': 'alice_thompson', 'first': 'Alice', 'last': 'Thompson', 'email': { 'personal': 'alice_thompson@jam.org' }, 'zip': '50012' }, { '_id': 'eve_norman', 'first': 'Eve', 'last': 'Norman', 'email': { 'personal': 'eve_norman@dates.io' }, 'zip': '50020' }, { '_id': 'holly_nelson', 'first': 'Holly', 'last': 'Nelson', 'email': { 'personal': 'holly_nelson@clove.net' }, 'zip': '50015' }, { '_id': 'charlie_connor', 'first': 'Charlie', 'last': 'Connor', 'email': { 'personal': 'charlie_connor@watermelon.io', 'work': 'charlie.connor@blue.edu' },'zip': '50018' }, { '_id': 'charlie_jones', 'first': 'Charlie', 'last': 'Jones', 'email': { 'personal': 'charlie_jones@fig.org' }, 'zip': '50014' }, { '_id': 'frank_davidson', 'first': 'Frank', 'last': 'Davidson', 'email': { 'personal': 'frank_davidson@raspberry.us', 'work': 'frank.davidson@brown.io' },'zip': '50015' }, { '_id': 'charlie_norman', 'first': 'Charlie', 'last': 'Norman', 'email': { 'personal': 'charlie_norman@dates.io' }, 'zip': '50013' }, { '_id': 'holly_connor', 'first': 'Holly', 'last': 'Connor', 'email': { 'personal': 'holly_connor@jam.org', 'work': 'holly.connor@green.org' },'zip': '50016' }, { '_id': 'jack_connor', 'first': 'Jack', 'last': 'Connor', 'email': { 'personal': 'jack_connor@dates.io' }, 'zip': '50020' }, { '_id': 'alice_davidson', 'first': 'Alice', 'last': 'Davidson', 'email': { 'personal': 'alice_davidson@raspberry.us', 'work': 'alice.davidson@red.net' },'zip': '50011' }, { '_id': 'jack_lee', 'first': 'Jack', 'last': 'Lee', 'email': { 'personal': 'jack_lee@clove.net' }, 'zip': '50013' }, { '_id': 
'frank_connor', 'first': 'Frank', 'last': 'Connor', 'email': { 'personal': 'frank_connor@banana.io', 'work': 'frank.connor@maroon.us' },'zip': '50020' }, { '_id': 'kelly_connor', 'first': 'Kelly', 'last': 'Connor', 'email': { 'personal': 'kelly_connor@clove.net' }, 'zip': '50016' }, { '_id': 'david_baker', 'first': 'David', 'last': 'Baker', 'email': { 'personal': 'david_baker@pear.us', 'work': 'david.baker@blue.edu' },'zip': '50019' }, { '_id': 'david_nelson', 'first': 'David', 'last': 'Nelson', 'email': { 'personal': 'david_nelson@lemon.io', 'work': 'david.nelson@brown.io' },'zip': '50018' }, { '_id': 'bob_nelson', 'first': 'Bob', 'last': 'Nelson', 'email': { 'personal': 'bob_nelson@granola.us', 'work': 'bob.nelson@green.org' },'zip': '50011' }, { '_id': 'frank_baker', 'first': 'Frank', 'last': 'Baker', 'email': { 'personal': 'frank_baker@lemon.io', 'work': 'frank.baker@pink.com' },'zip': '50017' }, { '_id': 'eve_jones', 'first': 'Eve', 'last': 'Jones', 'email': { 'personal': 'eve_jones@lemon.io' }, 'zip': '50010' }, { '_id': 'david_jones', 'first': 'David', 'last': 'Jones', 'email': { 'personal': 'david_jones@watermelon.io', 'work': 'david.jones@red.net' },'zip': '50019' }, { '_id': 'jack_johnson', 'first': 'Jack', 'last': 'Johnson', 'email': { 'personal': 'jack_johnson@lemon.io' }, 'zip': '50012' }, { '_id': 'jack_davidson', 'first': 'Jack', 'last': 'Davidson', 'email': { 'personal': 'jack_davidson@granola.us', 'work': 'jack.davidson@violet.io' },'zip': '50013' }, { '_id': 'eve_lee', 'first': 'Eve', 'last': 'Lee', 'email': { 'personal': 'eve_lee@granola.us', 'work': 'eve.lee@brown.io' },'zip': '50012' }, { '_id': 'frank_johnson', 'first': 'Frank', 'last': 'Johnson', 'email': { 'personal': 'frank_johnson@clove.net', 'work': 'frank.johnson@red.net' },'zip': '50013' }, { '_id': 'kelly_ng', 'first': 'Kelly', 'last': 'Ng', 'email': { 'personal': 'kelly_ng@fig.org' }, 'zip': '50016' }, { '_id': 'kelly_nelson', 'first': 'Kelly', 'last': 'Nelson', 'email': { 'personal': 
'kelly_nelson@lemon.io' }, 'zip': '50017' }, { '_id': 'eve_baker', 'first': 'Eve', 'last': 'Baker', 'email': { 'personal': 'eve_baker@jam.org' }, 'zip': '50010' }, { '_id': 'frank_ng', 'first': 'Frank', 'last': 'Ng', 'email': { 'personal': 'frank_ng@pear.us' }, 'zip': '50019' }]
To bulk load the sample documents into the collection contacts, execute the following command in a Terminal:
$ docker run --rm -it -v $HOME/Downloads/DATA/mongodb/contacts.json:/data/contacts.json mongo:4.4.3 mongoimport --uri "mongodb://192.168.1.53:5001,192.168.1.53:5002,192.168.1.53:5003/mydb?replicaSet=mongodb-rs" --collection contacts --file /data/contacts.json --drop --jsonArray
The following will be the typical output:
2021-02-20T01:31:37.947+0000 connected to: mongodb://192.168.1.53:5001,192.168.1.53:5002,192.168.1.53:5003/mydb?replicaSet=mongodb-rs
2021-02-20T01:31:37.948+0000 dropping: mydb.contacts
2021-02-20T01:31:37.950+0000 Failed: invalid JSON input
2021-02-20T01:31:37.950+0000 0 document(s) imported successfully. 0 document(s) failed to import.
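Hmm !!! What happened here ??? At first glance, the contents look like a valid JSON structure. However, the file uses single-quoted strings, which strict JSON does not allow, and by default mongoimport expects its input in the Extended JSON v2 format.
We are missing the --legacy flag, which makes mongoimport parse the input as the older (and more lenient) Extended JSON v1 format. This is the likely reason for the error.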
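The invalid JSON input failure can be reproduced outside MongoDB with any strict JSON parser. The following sketch uses the built-in JSON.parse of JavaScript (not mongoimport itself) on a shortened, single-document sample in the same style as contacts.json. It assumes the single-quoted strings are what a strict parser objects to, and the naive quote replacement is only safe because no value in the sample contains a quote character:

```javascript
// A shortened sample in the same single-quoted style as contacts.json
const sample = "[{ '_id': 'frank_ng', 'first': 'Frank', 'last': 'Ng', 'zip': '50019' }]";

// Returns true if the text is accepted by a strict JSON parser
function isStrictJson(text) {
  try {
    JSON.parse(text);
    return true;
  } catch (err) {
    return false;
  }
}

// Naive conversion: only safe when no value contains a quote character
const converted = sample.replace(/'/g, '"');

console.log(isStrictJson(sample));           // single-quoted input
console.log(isStrictJson(converted));        // double-quoted input
console.log(JSON.parse(converted)[0].last);  // field from the parsed document
```

An alternative to the --legacy flag would therefore be to rewrite the file with double-quoted strings before importing it.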
Once again, let us try to bulk load the sample documents into the collection contacts by executing the following command in the Terminal:
$ docker run --rm -it -v $HOME/Downloads/DATA/mongodb/contacts.json:/data/contacts.json mongo:4.4.3 mongoimport --uri "mongodb://192.168.1.53:5001,192.168.1.53:5002,192.168.1.53:5003/mydb?replicaSet=mongodb-rs" --collection contacts --file /data/contacts.json --drop --jsonArray --legacy
The following will be the typical output:
2021-02-20T01:34:08.751+0000 connected to: mongodb://192.168.1.53:5001,192.168.1.53:5002,192.168.1.53:5003/mydb?replicaSet=mongodb-rs
2021-02-20T01:34:08.751+0000 dropping: mydb.contacts
2021-02-20T01:34:08.784+0000 100 document(s) imported successfully. 0 document(s) failed to import.
To verify documents were loaded into the collection, execute the following command in the MongoDB interactive shell:
mongodb-rs:PRIMARY> db.contacts.count()
The following will be the output:
100
EXCELLENT !!! We have successfully demonstrated the bulk data loading into a MongoDB collection.
Using Indexes
To query all the documents from the collection contacts where the last field equals the value of Thompson, execute the following command:
mongodb-rs:PRIMARY> db.contacts.find({ last: 'Thompson' })
The following will be the typical output:
{ "_id" : "jack_thompson", "first" : "Jack", "last" : "Thompson", "email" : { "personal" : "jack_thompson@dates.io", "work" : "jack.thompson@cyan.net" }, "zip" : "50019" }
{ "_id" : "kelly_thompson", "first" : "Kelly", "last" : "Thompson", "email" : { "personal" : "kelly_thompson@dates.io", "work" : "kelly.thompson@green.org" }, "zip" : "50016" }
{ "_id" : "eve_thompson", "first" : "Eve", "last" : "Thompson", "email" : { "personal" : "eve_thompson@pear.us", "work" : "eve.thompson@violet.io" }, "zip" : "50014" }
{ "_id" : "frank_thompson", "first" : "Frank", "last" : "Thompson", "email" : { "personal" : "frank_thompson@dates.io" }, "zip" : "50010" }
{ "_id" : "david_thompson", "first" : "David", "last" : "Thompson", "email" : { "personal" : "david_thompson@clove.net", "work" : "david.thompson@purple.us" }, "zip" : "50010" }
{ "_id" : "george_thompson", "first" : "George", "last" : "Thompson", "email" : { "personal" : "george_thompson@pear.us", "work" : "george.thompson@brown.io" }, "zip" : "50019" }
{ "_id" : "charlie_thompson", "first" : "Charlie", "last" : "Thompson", "email" : { "personal" : "charlie_thompson@pear.us" }, "zip" : "50018" }
{ "_id" : "bob_thompson", "first" : "Bob", "last" : "Thompson", "email" : { "personal" : "bob_thompson@dates.io" }, "zip" : "50011" }
{ "_id" : "alice_thompson", "first" : "Alice", "last" : "Thompson", "email" : { "personal" : "alice_thompson@jam.org" }, "zip" : "50012" }
{ "_id" : "holly_thompson", "first" : "Holly", "last" : "Thompson", "email" : { "personal" : "holly_thompson@clove.net" }, "zip" : "50019" }
The results are returned instantly. How do we find out how the query performed ??? This is where the explain() method (invoked on the result of find()) comes in handy.
To run the explain() method on the query to fetch all the documents from the collection contacts where the last field equals the value of Thompson, execute the following command:
mongodb-rs:PRIMARY> db.contacts.find({ last: 'Thompson' }).explain()
The following will be the typical output:
{
  "queryPlanner" : {
    "plannerVersion" : 1,
    "namespace" : "mydb.contacts",
    "indexFilterSet" : false,
    "parsedQuery" : {
      "last" : {
        "$eq" : "Thompson"
      }
    },
    "queryHash" : "CB2688EC",
    "planCacheKey" : "CB2688EC",
    "winningPlan" : {
      "stage" : "COLLSCAN",
      "filter" : {
        "last" : {
          "$eq" : "Thompson"
        }
      },
      "direction" : "forward"
    },
    "rejectedPlans" : [ ]
  },
  "serverInfo" : {
    "host" : "c7ba1f94500a",
    "port" : 5002,
    "version" : "4.4.3",
    "gitVersion" : "913d6b62acfbb344dde1b116f4161360acd8fd13"
  },
  "ok" : 1,
  "$clusterTime" : {
    "clusterTime" : Timestamp(1613782747, 1),
    "signature" : {
      "hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="),
      "keyId" : NumberLong(0)
    }
  },
  "operationTime" : Timestamp(1613782747, 1)
}
The default behavior of the explain() method is to display the details of the winning plan selected by the query optimizer. From the output above, we see the winning plan was COLLSCAN, which implies a collection scan, meaning all the documents in the collection were scanned for the criteria.
To display the query plan as well as the execution information from the explain() method, one can specify 'executionStats' as the argument to the method.
To re-run the explain() method (with the 'executionStats' argument) to analyze the query from above, execute the following command:
mongodb-rs:PRIMARY> db.contacts.find({ last: 'Thompson' }).explain('executionStats')
The following will be the typical output:
{
  "queryPlanner" : {
    "plannerVersion" : 1,
    "namespace" : "mydb.contacts",
    "indexFilterSet" : false,
    "parsedQuery" : {
      "last" : {
        "$eq" : "Thompson"
      }
    },
    "winningPlan" : {
      "stage" : "COLLSCAN",
      "filter" : {
        "last" : {
          "$eq" : "Thompson"
        }
      },
      "direction" : "forward"
    },
    "rejectedPlans" : [ ]
  },
  "executionStats" : {
    "executionSuccess" : true,
    "nReturned" : 10,
    "executionTimeMillis" : 0,
    "totalKeysExamined" : 0,
    "totalDocsExamined" : 100,
    "executionStages" : {
      "stage" : "COLLSCAN",
      "filter" : {
        "last" : {
          "$eq" : "Thompson"
        }
      },
      "nReturned" : 10,
      "executionTimeMillisEstimate" : 0,
      "works" : 102,
      "advanced" : 10,
      "needTime" : 91,
      "needYield" : 0,
      "saveState" : 0,
      "restoreState" : 0,
      "isEOF" : 1,
      "direction" : "forward",
      "docsExamined" : 100
    }
  },
  "serverInfo" : {
    "host" : "c7ba1f94500a",
    "port" : 5002,
    "version" : "4.4.3",
    "gitVersion" : "913d6b62acfbb344dde1b116f4161360acd8fd13"
  },
  "ok" : 1,
  "$clusterTime" : {
    "clusterTime" : Timestamp(1613784227, 1),
    "signature" : {
      "hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="),
      "keyId" : NumberLong(0)
    }
  },
  "operationTime" : Timestamp(1613784227, 1)
}
From the output above, we get more insights on the execution under "executionStats". The total number of documents examined from the collection was 100 (to return just 10 documents), which is all the documents in the collection. What if we had millions of documents in the collection ???
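The handful of numbers that matter in the "executionStats" output can be pulled out with a few lines of JavaScript. The following is a minimal sketch that works on a trimmed, hand-copied version of the output above (the helper name summarizeExplain is hypothetical and not part of the mongo shell):

```javascript
// Trimmed, hand-copied version of the explain('executionStats') output above
const explainResult = {
  queryPlanner: { winningPlan: { stage: 'COLLSCAN' } },
  executionStats: { nReturned: 10, totalKeysExamined: 0, totalDocsExamined: 100 }
};

// Hypothetical helper: extracts the fields most useful for judging a query
function summarizeExplain(result) {
  const stats = result.executionStats;
  return {
    stage: result.queryPlanner.winningPlan.stage,
    returned: stats.nReturned,
    keysExamined: stats.totalKeysExamined,
    docsExamined: stats.totalDocsExamined,
    // Documents examined per document returned (assumes nReturned > 0)
    examinedPerReturned: stats.totalDocsExamined / stats.nReturned
  };
}

const summary = summarizeExplain(explainResult);
console.log(summary);
```

A ratio well above 1, as is the case here, suggests the server is reading far more documents than it returns, which is usually a hint that an index could help.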
To improve the performance, one can create an index on the desired field (the last field in our case).
To create an index on the last field (in an ascending order) for the collection contacts, execute the following command:
mongodb-rs:PRIMARY> db.contacts.createIndex({ last: 1 })
The following will be the typical output:
{
  "createdCollectionAutomatically" : false,
  "numIndexesBefore" : 1,
  "numIndexesAfter" : 2,
  "commitQuorum" : "votingMembers",
  "ok" : 1,
  "$clusterTime" : {
    "clusterTime" : Timestamp(1613784417, 7),
    "signature" : {
      "hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="),
      "keyId" : NumberLong(0)
    }
  },
  "operationTime" : Timestamp(1613784417, 7)
}
Now, re-run the explain() method (with the 'executionStats' argument) to analyze the query from above by executing the following command:
mongodb-rs:PRIMARY> db.contacts.find({ last: 'Thompson' }).explain('executionStats')
The following will be the typical output:
{
  "queryPlanner" : {
    "plannerVersion" : 1,
    "namespace" : "mydb.contacts",
    "indexFilterSet" : false,
    "parsedQuery" : {
      "last" : {
        "$eq" : "Thompson"
      }
    },
    "winningPlan" : {
      "stage" : "FETCH",
      "inputStage" : {
        "stage" : "IXSCAN",
        "keyPattern" : {
          "last" : 1
        },
        "indexName" : "last_1",
        "isMultiKey" : false,
        "multiKeyPaths" : {
          "last" : [ ]
        },
        "isUnique" : false,
        "isSparse" : false,
        "isPartial" : false,
        "indexVersion" : 2,
        "direction" : "forward",
        "indexBounds" : {
          "last" : [
            "[\"Thompson\", \"Thompson\"]"
          ]
        }
      }
    },
    "rejectedPlans" : [ ]
  },
  "executionStats" : {
    "executionSuccess" : true,
    "nReturned" : 10,
    "executionTimeMillis" : 1,
    "totalKeysExamined" : 10,
    "totalDocsExamined" : 10,
    "executionStages" : {
      "stage" : "FETCH",
      "nReturned" : 10,
      "executionTimeMillisEstimate" : 0,
      "works" : 11,
      "advanced" : 10,
      "needTime" : 0,
      "needYield" : 0,
      "saveState" : 0,
      "restoreState" : 0,
      "isEOF" : 1,
      "docsExamined" : 10,
      "alreadyHasObj" : 0,
      "inputStage" : {
        "stage" : "IXSCAN",
        "nReturned" : 10,
        "executionTimeMillisEstimate" : 0,
        "works" : 11,
        "advanced" : 10,
        "needTime" : 0,
        "needYield" : 0,
        "saveState" : 0,
        "restoreState" : 0,
        "isEOF" : 1,
        "keyPattern" : {
          "last" : 1
        },
        "indexName" : "last_1",
        "isMultiKey" : false,
        "multiKeyPaths" : {
          "last" : [ ]
        },
        "isUnique" : false,
        "isSparse" : false,
        "isPartial" : false,
        "indexVersion" : 2,
        "direction" : "forward",
        "indexBounds" : {
          "last" : [
            "[\"Thompson\", \"Thompson\"]"
          ]
        },
        "keysExamined" : 10,
        "seeks" : 1,
        "dupsTested" : 0,
        "dupsDropped" : 0
      }
    }
  },
  "serverInfo" : {
    "host" : "c7ba1f94500a",
    "port" : 5002,
    "version" : "4.4.3",
    "gitVersion" : "913d6b62acfbb344dde1b116f4161360acd8fd13"
  },
  "ok" : 1,
  "$clusterTime" : {
    "clusterTime" : Timestamp(1613784447, 1),
    "signature" : {
      "hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="),
      "keyId" : NumberLong(0)
    }
  },
  "operationTime" : Timestamp(1613784447, 1)
}
From the output above under "executionStats", we observe that the winning plan is now an index scan (IXSCAN) followed by a FETCH, and the total number of documents examined from the collection was only 10 (down from 100), resulting in better performance.
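The difference between the two plans can be illustrated with a toy model in plain JavaScript. This is only a sketch of the idea and not how MongoDB implements its indexes (it uses a Map instead of a B-tree, and the 100 documents are generated rather than imported), but it counts the documents examined in the same spirit as the explain() output:

```javascript
// Toy model (not MongoDB's B-tree index): 100 generated documents, 10 per last name
const lastNames = ['Baker', 'Connor', 'Davidson', 'Johnson', 'Jones',
                   'Lee', 'Nelson', 'Ng', 'Norman', 'Thompson'];
const docs = [];
for (let i = 0; i < 100; i++) {
  docs.push({ _id: i, last: lastNames[i % lastNames.length] });
}

// Collection scan: every document is examined against the criteria
function collectionScan(collection, last) {
  let examined = 0;
  const matches = collection.filter((doc) => {
    examined++;
    return doc.last === last;
  });
  return { matches, examined };
}

// Index: maps each last name to the documents that hold it
function buildIndex(collection) {
  const index = new Map();
  for (const doc of collection) {
    if (!index.has(doc.last)) index.set(doc.last, []);
    index.get(doc.last).push(doc);
  }
  return index;
}

// Index lookup: only the matching documents are examined
function indexScan(index, last) {
  const matches = index.get(last) || [];
  return { matches, examined: matches.length };
}

const index = buildIndex(docs);
console.log(collectionScan(docs, 'Thompson').examined);  // prints 100
console.log(indexScan(index, 'Thompson').examined);      // prints 10
```

Both approaches return the same 10 documents; the index simply avoids touching the other 90, and the gap widens as the collection grows.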
VOILA !!! We have successfully demonstrated the use of indexes on a MongoDB collection.