Objective
The objective of this exercise is to illustrate the concept of sharding (horizontal partitioning), a database technique for storing large data collections across multiple servers called “shards” (cf. image below). Sharding increases query performance since each server handles (small) parts of a data set.
Requirements
Creating and Populating a Database in MongoDB
In MongoDB a database is composed of a set of collections, each collection containing JSON documents. The following instructions guide you in creating a MongoDB database collection. The devil is in the details!
- Start a MongoDB Instance:
mkdir -p ~/db/shard1 # Folder containing the DB files mongod --shardsvr --dbpath ~/db/shard1 --port 27021
- Connect to the CouchDB:
mongo --host localhost:27021
- Create the database mydb and the collection cities:
use mydb db.createCollection("cities")
- Verify the existence of mydb and cities:
show dbs show collections
- Populate the database (requires a new terminal !!):
# Assumes that cities.txt is in HOME mongoimport --host localhost:27021 --db mydb --collection cities --file ~/cities.txt
- Query the database:
db.cities.find().pretty() db.cities.count()
Configuring a Sharded Cluster
MongoDB supports sharding through a sharded cluster composed of the following components (cf. image below):
- Shards: store the data.
- Query routers: direct operations from clients to the appropriate shard(s) and return results to clients.
- Config servers: store cluster’s metadata. The query router uses this metadata to target operations to specific shards.
The following instructions illustrate how to create a simple sharded cluster with 1 config server, 1 query router and 1 shard.
- Config server:
mkdir ~/db/configdb mongod --configsvr --dbpath ~/db/configdb --port 27020
- Query router (mongos instance) connected to the config server:
mongos --configdb localhost:27020 --port 27019
- Connect to the query router :
mongo --host localhost:27019
- Add MongoDB as a shard (the instance containing the mydb database):
use admin db.runCommand( { addShard: "localhost:27021", name: "shard1" } )
- Verify the cluster:
sh.status()
Sharding a Database
In MongoDB sharding is enabled on a per-basis collection. When enabled, mongo partitions and distributes data into the cluster based on a shard key (cf. sharding introduction). MongoDB uses two kinds of partitioning strategies (cf. images below):
- Range based: data is partitioned into ranges having [min, max] values determined by the shard key.
- Hash based: data is partitioned using a hash function.
Range based sharding example
- Connect to the query router and create the collection cities1 in database mydb:
use mydb db.createCollection("cities1") show collections
- Enable sharding on the collection mydb.cities1. Use state as shard key:
sh.enableSharding("mydb") sh.shardCollection("mydb.cities1", { "state": 1} )
- Verify the cluster state:
sh.status()
- Populate collection cities1 using (my)db.cities:
db.cities.find().forEach( function(d) { db.cities1.insert(d); } )
- Verify the cluster state:
sh.status()
Hash-based sharding example
- Connect to the query router and create the collection cities2 in database mydb:
use mydb db.createCollection("cities2") show collections
- Enable sharding on the collection mydb.cities2. Use state as shard key:
sh.enableSharding("mydb") sh.shardCollection("mydb.cities2", { "state": 1} )
- Verify the cluster state:
sh.status()
- Populate collection cities2 using (my)db.cities:
db.cities.find().forEach( function(d) { db.cities2.insert(d); } )
- Verify the cluster state:
sh.status()
Guiding the partitioning process using Tags
MongoDB also supports tagging a range of shard key values. These are some of the benefits of using tags:
- Isolate a subset of the data on a specific shard.
- Ensure that relevant data resides on shards that are geographically close to the user.
The next example illustrates tag-based partitioning on a cluster composed of 3-shards.
- Start 2 more MongoDB instances (future shards):
# Shard 2: requires a new terminal! mkdir -p ~/db/shard2 # Folders containing DB files mongod --shardsvr --dbpath ~/db/shard2 --port 27022 # Shard 3: requires a new terminal! mkdir -p ~/db/shard3 # Folders containing DB files mongod --shardsvr --dbpath ~/db/shard3 --port 27023
- Add shards to the cluster:
use admin db.runCommand( { addShard: "localhost:27022", name: "shard2" } ) db.runCommand( { addShard: "localhost:27023", name: "shard3" } ) sh.status()
- Associate tags to shards:
sh.addShardTag("shard1", "CA") sh.addShardTag("shard2", "NY") sh.addShardTag("shard3", "Others")
- Create, shard and populate a new collection named cities3:
db.createCollection("cities3") sh.shardCollection("mydb.cities3", { "state": 1} ) use mydb db.cities.find().forEach( function(d) { db.cities3.insert(d); } )
- Associate shard key ranges to tagged shards:
sh.addTagRange("mydb.cities3", { state: MinKey }, { state: "CA" }, "Others") sh.addTagRange("mydb.cities3", { state: "CA" }, { state: "CA_" }, "CA") sh.addTagRange("mydb.cities3", { state: "CA_" }, { state: "NY" }, "Others") sh.addTagRange("mydb.cities3", { state: "NY" }, { state: "NY_" }, "NY") sh.addTagRange("mydb.cities3", { state: "NY_" }, { state: MaxKey }, "Others")
- Verify the cluster state:
sh.status()