Optimizing MySQL Index Creation on 1.4 Billion Records: Partitioning Strategy for Large-Scale Text Data


When dealing with MySQL tables containing over a billion records, standard indexing approaches often fail to perform adequately. The case of a 1.4 billion record text_page table demonstrates this perfectly:

CREATE TABLE text_page (
    text VARCHAR(255),
    page_id INT UNSIGNED
) ENGINE=MYISAM DEFAULT CHARSET=ascii;

A simple index on the text column was attempted:

ALTER TABLE text_page ADD KEY ix_text (text);

The statement ran for over 10 hours before it was abandoned.

The fundamental issue lies in MySQL's single-threaded index creation process for MyISAM tables. When dealing with 34GB of data:

  • The entire table must be read sequentially
  • A temporary sort file is created (often 2-3x the table size)
  • Disk I/O becomes the primary limiting factor (the buffer tuning sketched after this list only softens this)
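
Before reaching for partitioning, the buffers that govern this single-threaded sort can at least be raised. A minimal sketch, assuming a MyISAM build; the values are illustrative, not tuned recommendations:

SET SESSION myisam_sort_buffer_size = 1024 * 1024 * 1024;  -- 1GB for the repair-by-sort phase
SET GLOBAL myisam_max_sort_file_size = 100 * 1024 * 1024 * 1024;  -- allow a ~100GB temp sort file (global-only, needs SUPER)
ALTER TABLE text_page ADD KEY ix_text (text);

Even with generous buffers, a 34GB single pass tends to stay I/O bound, which is what makes the partitioning route below attractive.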

Partitioning the table before indexing proved to be the breakthrough. By splitting the 1.4 billion records into 40 partitions:

ALTER TABLE text_page 
PARTITION BY KEY(text) 
PARTITIONS 40;

The indexing operation completed in approximately 1 hour. Here's why this works:

  1. Each partition represents a smaller, more manageable subset of data (equality lookups also prune to a single partition, as checked after this list)
  2. MySQL can process partitions with less memory overhead
  3. Disk I/O operations are distributed across multiple files
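
KEY partitioning also pays off at query time: an equality lookup on text touches only one partition. On MySQL 5.6 and earlier this can be confirmed with EXPLAIN PARTITIONS (from 5.7 on, plain EXPLAIN includes a partitions column); the search term below is just an example:

EXPLAIN PARTITIONS
SELECT page_id FROM text_page WHERE text = 'example-term';

The partitions column of the output should list a single partition rather than all 40.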

While partitioning worked in this case, other strategies might be preferable depending on your specific constraints:

Using InnoDB Instead of MyISAM

ALTER TABLE text_page ENGINE=InnoDB;
CREATE INDEX ix_text ON text_page(text);

InnoDB offers:

  • Online DDL operations (in MySQL 5.6+; see the sketch after this list)
  • Better crash recovery
  • More efficient index creation algorithms
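
On 5.6+, the index build can be requested explicitly as an in-place, non-locking operation, so MySQL raises an error rather than silently falling back to a blocking table copy. A sketch; note that the engine conversion itself still rewrites all 34GB before this step:

ALTER TABLE text_page
    ADD INDEX ix_text (text),
    ALGORITHM=INPLACE, LOCK=NONE;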

Batch Processing with Temporary Tables

For systems that can't be taken offline (writes landing on text_page after the copy begins must be captured and replayed before the rename):

CREATE TABLE text_page_new LIKE text_page;
ALTER TABLE text_page_new ADD KEY ix_text (text);
-- Loading in index order builds the key file efficiently; for MyISAM,
-- wrapping the load in ALTER TABLE ... DISABLE KEYS / ENABLE KEYS is usually faster still.
INSERT INTO text_page_new SELECT * FROM text_page ORDER BY text;
-- RENAME TABLE swaps both names in a single atomic operation.
RENAME TABLE text_page TO text_page_old, text_page_new TO text_page;

Storage Engine Tuning

For text-heavy workloads, consider a larger key block size (the option's units and effect are engine-dependent, so benchmark before committing):

ALTER TABLE text_page ENGINE=MyISAM KEY_BLOCK_SIZE=8;
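
To gauge whether such a change helps, compare Index_length before and after the rebuild:

SHOW TABLE STATUS LIKE 'text_page';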

When implementing any of these solutions:

  • Monitor disk space (temporary sort files can consume 2-3x the table size)
  • Consider increasing myisam_sort_buffer_size for MyISAM rebuilds, along with sort_buffer_size and read_buffer_size
  • Use pt-online-schema-change or gh-ost for production systems that must stay online
  • For VARCHAR(255) columns, consider prefix indexes if the leading characters are selective (see the sketch after this list)
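
A minimal prefix-index sketch; the 64-character prefix is an assumption to validate against real selectivity (the check is expensive at this scale, so run it on a sample):

-- How much selectivity does a 64-character prefix retain?
SELECT COUNT(DISTINCT LEFT(text, 64)) / COUNT(DISTINCT text) AS prefix_selectivity
FROM text_page;

-- If the ratio is close to 1.0, the shorter key loses little:
ALTER TABLE text_page ADD KEY ix_text_prefix (text(64));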

Here's the complete implementation that worked for this case:

-- Step 1: Create partitioned table structure
CREATE TABLE text_page_partitioned (
    text VARCHAR(255),
    page_id INT UNSIGNED
) ENGINE=MyISAM DEFAULT CHARSET=ascii
PARTITION BY KEY(text) 
PARTITIONS 40;

-- Step 2: Migrate data in batches.
-- The ORDER BY must be deterministic, or OFFSET paging can skip or duplicate
-- rows when text values repeat, so page_id is added as a tiebreaker. Each
-- batch re-sorts the source table, so fewer, larger batches are cheaper.
INSERT INTO text_page_partitioned
SELECT * FROM text_page
ORDER BY text, page_id
LIMIT 1000000 OFFSET 0;

-- Repeat with increasing OFFSET values...

-- Step 3: Create index after data load
ALTER TABLE text_page_partitioned ADD KEY ix_text (text);

-- Step 4 (optional): Replace original table
RENAME TABLE text_page TO text_page_old, 
text_page_partitioned TO text_page;
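
Before the swap, a quick sanity check that the migration is complete (with MyISAM, COUNT(*) without a WHERE clause is read from table metadata, so this is cheap):

-- Row counts must match before renaming.
SELECT
    (SELECT COUNT(*) FROM text_page) AS original_rows,
    (SELECT COUNT(*) FROM text_page_partitioned) AS migrated_rows;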

The outcome, side by side:

  Approach             Time         Result
  Standard indexing    >10 hours    Abandoned
  Partitioned (40)     ~1 hour      Successful

Finally, verify the partition distribution and monitor long-running operations:

-- Verify that rows are spread evenly across partitions
SELECT PARTITION_NAME, TABLE_ROWS
FROM INFORMATION_SCHEMA.PARTITIONS
WHERE TABLE_NAME = 'text_page'
  AND TABLE_SCHEMA = DATABASE();

-- Monitor progress during index creation
SHOW PROCESSLIST;