MySQL is a widely used open-source relational database management system. Widely used as it is, though, it is frequently misunderstood in certain aspects. One of those aspects is big data – should you run big data applications on MySQL? Should you use MongoDB instead? Maybe you should take a look at Neo4j or even AWS Redshift? We will try to explore that in this post.
Before we begin diving deeper into MySQL and big data, we should first figure out what MySQL is and how it works. Here's the crux of it: MySQL is a client-server relational DBMS – data is stored in tables made up of rows and columns, and applications read and modify that data using SQL queries.
Big data, on the other hand, is a term that usually emphasizes a very large volume of data – volumes large enough that they may not be suitable for traditional storage and processing methods.
When using MySQL, there are a number of storage engines to choose from – InnoDB, MyISAM, MEMORY, CSV, ARCHIVE and BLACKHOLE, among others.
Though there are a lot of options, the two most commonly used storage engines in MySQL are InnoDB and MyISAM. The main differences between the two are as follows: InnoDB supports transactions, row-level locking and foreign keys, while MyISAM supports none of those – it locks entire tables instead and tends to be used for simpler, read-heavy workloads.
InnoDB is also ACID compliant.
With regard to database systems, ACID is a set of properties that aims to guarantee data validity in the event of failures that might occur while processing a database transaction. A database management system that adheres to these principles is referred to as an ACID-compliant DBMS.
ACID is an acronym that breaks down as follows:
Atomicity – a transaction either completes in full or not at all;
Consistency – a transaction can only move the database from one valid state to another;
Isolation – concurrent transactions leave the database in the same state as if they had been executed one after another;
Durability – once a transaction is committed, it stays committed even in the event of a crash or power failure.
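To illustrate these properties, here is a minimal sketch of a money transfer wrapped in a transaction – the accounts table and the amounts are hypothetical:

-- atomicity: both updates apply together or not at all
START TRANSACTION;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
COMMIT;  -- durability: once this returns, the transfer survives a crash
-- had an error occurred in between, ROLLBACK; would undo both updates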
Choosing the correct database engine is crucial to any kind of project – especially projects that deal with large volumes of data on a regular basis – so if we're dealing with big data on MySQL, we should probably choose InnoDB.
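As a quick sketch, you can check which engines your server supports and pin a table to InnoDB explicitly – the table below is hypothetical:

-- list the storage engines this server supports and which one is the default
SHOW ENGINES;

-- create a table explicitly on InnoDB (the default engine since MySQL 5.5)
CREATE TABLE big_data_demo (
    id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    payload TEXT
) ENGINE = InnoDB;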
Now we know what a DBMS is, what the choices of database engines are when dealing with MySQL, what ACID compliance is, and why you should choose InnoDB when dealing with big data. But we have just scratched the surface – in order to better understand how InnoDB works, we must take a closer look at the engine. Here's how InnoDB looks under the hood:
The above diagram depicts the complexity of InnoDB. Let's break it down starting from the memory part: in memory, InnoDB maintains the buffer pool (which caches table and index data), the change buffer (which caches changes to secondary index pages), the adaptive hash index and the log buffer (which holds redo log records before they are written to disk).
On the disk side of InnoDB we see its system tablespace – the ibdata1 file is arguably the busiest file in the InnoDB infrastructure, as by default it contains all of the tables and indexes in a database. It contains the data dictionary, the doublewrite buffer, the change buffer and the undo logs, alongside any table and index data that is not stored in file-per-table tablespaces.
The major problem with this file is that it can only grow – even if you drop a table associated with the InnoDB engine, ibdata1 will not get smaller, and if you delete the ibdata1 file, you delete all of the data associated with InnoDB tables. This can become a huge and frustrating problem when dealing with big data on MySQL – if you run MySQL without keeping an eye on this, you risk losing a lot of disk space for nothing. The only way to shrink this file once and for all and clean up the InnoDB infrastructure is to take a full dump of your data, stop MySQL, delete ibdata1 together with the ib_logfile files, add settings along these lines to my.cnf, then restart MySQL and re-import the dump:
[mysqld]
innodb_file_per_table = 1
innodb_flush_method = O_DIRECT
innodb_buffer_pool_size = 10G
innodb_log_file_size = 2560M   # my.cnf does not accept fractional values such as 2.5G
Starting MySQL at this point will recreate ibdata1 (at its small default size) together with ib_logfile0 and ib_logfile1 at 2.5GB each.
Once you accomplish this, ibdata1 will still grow, but it will only contain table metadata – the InnoDB table data and indexes will live outside of it, with each table represented by its own .frm and .ibd files.
Each .frm file contains the table definition and each .ibd file contains the table's data and indexes. This way, if you want to remove a table, dropping it deletes the associated .ibd and .frm files and actually returns the disk space to the operating system.
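As a sketch, a table created before innodb_file_per_table was enabled can be moved out of ibdata1 by rebuilding it – old_table is a hypothetical name:

-- a null ALTER rebuilds the table, moving it into its own old_table.ibd file
ALTER TABLE old_table ENGINE = InnoDB;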
At this point, you have probably noticed that we have touched a few InnoDB parameters without fully explaining what they are or what they do. The major parameters we need to take a look at are as follows:
innodb_buffer_pool_size – the size of the in-memory area where InnoDB caches table and index data; the bigger it is, the less disk I/O is needed to access the same data repeatedly;
innodb_log_file_size – the size of each redo log file; larger log files mean less frequent checkpoint flushing, at the cost of longer crash recovery;
innodb_file_per_table – when enabled, each table is stored in its own .ibd tablespace file instead of inside the ibdata1 system tablespace;
innodb_flush_method – defines the method InnoDB uses to flush data and log files to disk.
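To check what your server is currently running with, you can inspect these parameters directly:

-- show the current values of the parameters discussed above
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
SHOW VARIABLES LIKE 'innodb_log_file_size';
SHOW VARIABLES LIKE 'innodb_file_per_table';
SHOW VARIABLES LIKE 'innodb_flush_method';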
When dealing with big data on an InnoDB-based infrastructure, we need to change these parameters quite a bit. Obviously, if we run big data applications in any kind of environment, we need appropriate servers with a lot of disk space and plenty of RAM. We should also consider which operating system the server runs, since some of these settings are platform-specific.
We will take a look at what the my.cnf configuration could be on a server environment that has one 1TB disk and 8GB of RAM available.
As this is a machine with 8GB of RAM available, we would be able to set innodb_buffer_pool_size to a fairly high value – approximately 6 to 6.5GB (no more than about 80% of RAM) – and adjust innodb_log_file_size accordingly.
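Here is a sketch of what such a my.cnf could look like with the values discussed above – the log buffer and flush-at-commit settings are illustrative additions:

[mysqld]
innodb_file_per_table = 1
innodb_flush_method = O_DIRECT
innodb_buffer_pool_size = 6G            # roughly 75% of the 8GB of RAM
innodb_log_file_size = 1536M            # about a quarter of the buffer pool
innodb_log_buffer_size = 64M            # illustrative value
innodb_flush_log_at_trx_commit = 1      # flush the redo log at every commit for full durability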
In the configuration sketched above, innodb_file_per_table is enabled and innodb_flush_method is set to O_DIRECT. O_DIRECT is not available on Windows machines because Windows uses async_unbuffered, which causes this variable to have no effect. If we're dealing with Windows, the only recognized innodb_flush_method values are "normal", "unbuffered" and "async_unbuffered". On Unix machines, on the other hand, we can use "fsync", "O_DSYNC", "littlesync", "nosync", "O_DIRECT" and "O_DIRECT_NO_FSYNC". Here's what the flush method values do:
fsync – the default on Unix; InnoDB uses the fsync() system call to flush both the data and the log files;
O_DSYNC – InnoDB opens and flushes the log files with O_SYNC and flushes the data files with fsync();
O_DIRECT – InnoDB opens the data files with O_DIRECT, bypassing the operating system's cache, and flushes them with fsync();
O_DIRECT_NO_FSYNC – like O_DIRECT, but the fsync() call is skipped after each write;
littlesync and nosync – intended for internal performance testing only and unsupported for production use.
The fsync() system call is closely related to a standard system call in the Unix architecture – sync(), which commits all data scheduled for writing via low-level I/O system calls. fsync() commits just the buffered data relating to a specified file descriptor (a handle used to access a file). In other words, the call only returns after the file's data and metadata have been transferred to the actual physical storage.
At this point, MySQL is almost ready to deal with big data! There are a few more things we need to cover though – loading big data sets into tables, query performance and replication.
It usually isn't a very big problem to load data into MySQL – you just upload a dump into the database and that's it. Problems start when the data sets get bigger: loading such data into MySQL starts timing out, and even when it doesn't time out, the upload takes an incredibly long time.
The reason this happens is that an INSERT statement carries a significant amount of overhead, including locking, transaction demarcation, integrity checks, allocation of resources and I/O on a per-statement basis. We can avoid that by using bulk loading – bulk loading can be hundreds of times faster than individual INSERT statements because it is specifically designed for loading massive amounts of data.
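In MySQL, bulk loading typically means LOAD DATA INFILE. A minimal sketch, assuming a hypothetical CSV file and table (the server's secure_file_priv setting controls which directories the file may live in):

-- load an entire CSV file in one statement instead of millions of INSERTs
LOAD DATA INFILE '/var/lib/mysql-files/big_dataset.csv'
INTO TABLE big_data_demo
FIELDS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES;   -- skip the CSV header row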
Although bulk loading is designed for massive amounts of data, if the data you're loading into the system has millions of rows, bulk loading itself can slow down significantly. This can happen when the bulk insert buffer is too small for the data being loaded. To fix this, increase the bulk_insert_buffer_size setting (bear in mind that this variable only affects bulk inserts into MyISAM tables).
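A minimal sketch of raising it – the 256MB value is an arbitrary example:

-- give bulk inserts a 256MB cache instead of the 8MB default
SET SESSION bulk_insert_buffer_size = 256 * 1024 * 1024;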
Even if all of your database settings are properly optimized, queries on big data can still be really slow. To improve query speed, index the columns your queries filter on, select only the columns you actually need instead of using SELECT *, and consider partitioning very large tables. Bear in mind, however, that adding an index to an existing table means running ALTER TABLE, and a very large table will take a very long time to ALTER because MySQL has to read-lock the table, create a temporary table with the desired structure, copy all rows to that temporary table, unlock the table for reads and delete the old table.
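As a sketch, assuming queries frequently filter the hypothetical big_data_demo table by an email column:

-- lets queries filtering on email use the index instead of a full table scan
ALTER TABLE big_data_demo ADD INDEX idx_email (email);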
Replication enables data from one MySQL database server to be copied to one or more other database servers. The main database server is referred to as the master; the others are referred to as the slaves. This can be useful when dealing with any amount of data, and especially if developers are using MySQL to host massive data sets. Replication can be used for a number of reasons, including improving performance by distributing load across multiple server nodes or backing up data. There are multiple types of replication, chief among them master-slave replication, master-master replication and MySQL Cluster.
Each replication type, of course, has its own pros and cons, meaning that each could be suited to a different type of usage.
Master-slave replication can be useful if developers need to read from the slaves without impacting the master, or if they don't want backups running against the database to impact the master. Master-slave replication is also useful if slaves need to be taken offline and synchronized with the master without downtime.
However, the main con of master-slave replication is that in the event of a master database failure, a slave server has to be promoted to master manually, meaning that you risk some downtime.
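As a sketch, pointing a slave at its master looks roughly like this – the host, credentials and binary log coordinates below are placeholders, and the master must have binary logging enabled:

-- run on the slave once the master's binary log coordinates are known
CHANGE MASTER TO
    MASTER_HOST = '192.0.2.10',
    MASTER_USER = 'repl',
    MASTER_PASSWORD = 'replica_password',
    MASTER_LOG_FILE = 'mysql-bin.000001',
    MASTER_LOG_POS = 4;
START SLAVE;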
Master-master replication can be useful if developers need their application to distribute load across, or read from, both master servers. It also makes failover easier, since either master can keep accepting writes if the other fails.
However, the configuration of a master-master architecture is more difficult than the configuration of a master-slave architecture.
Using MySQL Cluster can be useful if an application requires high availability, uptime and throughput, or distributed writes; however, such an architecture comes with quite a few limitations of its own.
At this point, you may have noticed that we haven't touched one of the MySQL engines – MyISAM. We did not do that because getting both the InnoDB and MyISAM engines to work together is a tedious process – you need to consider allocating memory to each of the engines separately, since each of them has its own caching.
To summarize, MySQL is perfectly capable of working with big data sets. However, the performance you get will heavily depend on the servers you use and whether you use them to their full potential (you might not need a terabyte of RAM – maybe 32GB or even 16GB would suffice), on the extent to which you optimize MySQL, on the database engine you choose and on how good of a MySQL DBA you are.