When dealing with MySQL tables containing over a billion records, standard indexing approaches often fail to perform adequately. The case of a 1.4 billion record text_page table demonstrates this well:
CREATE TABLE text_page (
    text VARCHAR(255),
    page_id INT UNSIGNED
) ENGINE=MyISAM DEFAULT CHARSET=ascii;
Attempting to create a simple index on the text column with:
ALTER TABLE text_page ADD KEY ix_text (text);
resulted in an unacceptably long execution time of over 10 hours before being abandoned.
The fundamental issue lies in MySQL's single-threaded index creation process for MyISAM tables. When dealing with 34GB of data:
- The entire table must be read sequentially
- A temporary sort file is created (often 2-3x the table size)
- Disk I/O becomes the primary limiting factor
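Before starting a build of this size, it is worth checking the server variables that govern MyISAM's sort-based index build. Both names below are standard MySQL system variables; the values you see depend on your configuration:

```sql
-- Buffer used when MyISAM rebuilds indexes via sort (repair by sort)
SHOW VARIABLES LIKE 'myisam_sort_buffer_size';

-- If the temporary sort file would exceed this size, MySQL falls back
-- to the much slower key-cache build method
SHOW VARIABLES LIKE 'myisam_max_sort_file_size';
```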
Partitioning the table before indexing proved to be the breakthrough. By splitting the 1.4 billion records into 40 partitions:
ALTER TABLE text_page
PARTITION BY KEY(text)
PARTITIONS 40;
The indexing operation completed in approximately 1 hour. Here's why this works:
- Each partition represents a smaller, more manageable subset of data
- MySQL can process partitions with less memory overhead
- Disk I/O operations are distributed across multiple files
While partitioning worked in this case, other strategies might be preferable depending on your specific constraints:
Using InnoDB Instead of MyISAM
-- Note: the engine conversion is itself a full table rebuild
ALTER TABLE text_page ENGINE=InnoDB;
CREATE INDEX ix_text ON text_page(text);
InnoDB offers:
- Online DDL operations (in MySQL 5.6+)
- Better crash recovery
- More efficient index creation algorithms
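On MySQL 5.6+, the InnoDB index build can be requested explicitly as an online operation. A sketch of the same index creation using the standard online DDL clauses:

```sql
-- Build the index in place while still allowing reads and writes;
-- MySQL raises an error instead of silently locking the table if the
-- request cannot be honored
ALTER TABLE text_page
  ADD INDEX ix_text (text),
  ALGORITHM=INPLACE, LOCK=NONE;
```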
Batch Processing with Temporary Tables
For systems that can't be taken offline:
CREATE TABLE text_page_new LIKE text_page;
ALTER TABLE text_page_new ADD KEY ix_text (text);
INSERT INTO text_page_new SELECT * FROM text_page ORDER BY text;
RENAME TABLE text_page TO text_page_old, text_page_new TO text_page;
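RENAME TABLE is atomic, but rows written to the original table after the INSERT ... SELECT started will not be in the copy; a tool such as pt-online-schema-change handles ongoing writes continuously. As a minimal sanity check after the swap (assuming writes were paused during the copy):

```sql
-- Both counts should match; COUNT(*) is a cheap metadata read on MyISAM
SELECT
  (SELECT COUNT(*) FROM text_page_old) AS original_rows,
  (SELECT COUNT(*) FROM text_page)     AS migrated_rows;
```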
Specialized Storage Engines
For text-heavy workloads, consider tuning the index key block size:
ALTER TABLE text_page ENGINE=MyISAM KEY_BLOCK_SIZE=8;
When implementing any of these solutions:
- Monitor disk space (temporary sort files can consume 2-3x the table size)
- Consider increasing myisam_sort_buffer_size (used for MyISAM index builds) and read_buffer_size
- Use pt-online-schema-change for production systems
- For VARCHAR(255) columns, consider prefix indexes if appropriate
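The buffer and prefix suggestions above can be made concrete. Note that for MyISAM index builds specifically, the relevant buffer is myisam_sort_buffer_size; the buffer value and the 32-character prefix length below are illustrative assumptions to validate against your own data:

```sql
-- Raise the session buffer used for MyISAM index builds (value is illustrative)
SET SESSION myisam_sort_buffer_size = 256 * 1024 * 1024;

-- Check how selective a 32-character prefix would be before committing to it;
-- a ratio close to 1.0 means the prefix distinguishes rows almost as well
-- as the full column
SELECT COUNT(DISTINCT LEFT(text, 32)) / COUNT(*) AS prefix_selectivity
FROM text_page;

-- A prefix index stores far fewer bytes per entry than a full VARCHAR(255) key
ALTER TABLE text_page ADD KEY ix_text_prefix (text(32));
```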
Here's the complete implementation that worked for this case:
-- Step 1: Create partitioned table structure
CREATE TABLE text_page_partitioned (
    text VARCHAR(255),
    page_id INT UNSIGNED
) ENGINE=MyISAM DEFAULT CHARSET=ascii
PARTITION BY KEY(text)
PARTITIONS 40;
-- Step 2: Migrate data in batches
-- Caveat: ORDER BY ... LIMIT ... OFFSET re-sorts the unindexed source
-- table on every batch; if batching is not strictly required, a single
-- INSERT ... SELECT avoids the repeated sorts.
INSERT INTO text_page_partitioned
SELECT * FROM text_page
ORDER BY text
LIMIT 1000000 OFFSET 0;
-- Repeat with increasing OFFSET values...
-- Step 3: Create index after data load
ALTER TABLE text_page_partitioned ADD KEY ix_text (text);
-- Step 4 (optional): Replace original table
RENAME TABLE text_page TO text_page_old,
text_page_partitioned TO text_page;
Summary
To recap: on MySQL tables containing over a billion records, standard indexing operations can become painfully slow. In this case, the text_page table (1.4B records, 34GB) saw a simple ALTER TABLE ... ADD KEY command run for 10+ hours without completion.
MySQL creates indexes by:
- Creating a temporary table copy
- Building the index structure
- Swapping tables
For large tables, this process incurs:
- Heavy disk I/O
- Temporary storage of up to 2x the table size
- Lock contention for the duration of the rebuild
The successful approach involved horizontal partitioning:
ALTER TABLE text_page
PARTITION BY KEY(text)
PARTITIONS 40;
Then creating the index:
ALTER TABLE text_page ADD KEY ix_text (text);
Approach | Time | Outcome
---|---|---
Standard indexing | >10 hours | Abandoned
Partitioned (40 partitions) | ~1 hour | Completed
Other viable options include:
- Online Schema Change Tools: pt-online-schema-change, gh-ost
- Batch Processing: Create new table with index, then migrate in batches
- Storage Engine Switch: Consider InnoDB for better large-scale performance
When implementing this solution:
-- Verify partition distribution (filter by schema so same-named tables
-- in other databases don't pollute the result)
SELECT PARTITION_NAME, TABLE_ROWS
FROM INFORMATION_SCHEMA.PARTITIONS
WHERE TABLE_NAME = 'text_page'
  AND TABLE_SCHEMA = DATABASE();
-- Monitor progress during index creation
SHOW PROCESSLIST;