Optimization When Working with Large Datasets (1M+ Rows)
Outline
- Introduction
- Obvious Queries Are Not Always the Best Queries
- Indexing Strategies
- Basic Indexing
- Composite Indexing
- Optimizing Query Structure
- Replace Subqueries with Joins
- Inverting Row Scans
- Reducing Data Load
- Archiving & Pruning Old Data
- Using Virtual Columns for Computation
- Closing Notes: Why You Should Optimize Before It’s Too Late
Introduction
When working with databases, one thing quickly becomes clear: what works on a small dataset won’t necessarily work on a large one. Queries that run in milliseconds with 1,000 rows can take minutes (or even hours) on a table with millions of records.
I learned this the hard way while optimizing a Laravel app handling fleet tracking and delivery logistics. Queries that seemed reasonable in local development would bring production servers to a crawl. Here’s what I learned along the way.
Obvious Queries Are Not Always the Best Queries
If your database queries are slow, the first instinct is often to add an index. While this helps, indexing alone won’t save you from inefficient query logic. Sometimes, rethinking how you query data is more effective than adding indexes.
Indexing Strategies
Basic Indexing
Indexes speed up queries by allowing MySQL to look up rows more efficiently. At a minimum, ensure that columns used in WHERE, JOIN, and ORDER BY clauses are indexed.
Laravel Example
Schema::table('orders', function (Blueprint $table) {
    $table->index('user_id'); // Speeds up queries filtering by user
});
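It’s worth verifying that MySQL actually uses a new index. EXPLAIN shows the chosen plan; the index name in the comment below is just Laravel’s default naming convention:
EXPLAIN SELECT * FROM orders WHERE user_id = 42;
-- In the output, `key` should show orders_user_id_index rather than NULL,
-- and `type` should be `ref` instead of a full table scan (`ALL`).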
Composite Indexing
If your queries filter on multiple columns, a composite index (indexing multiple columns together) can be far more efficient than separate indexes.
Example: Filtering by User and Status
Schema::table('orders', function (Blueprint $table) {
    $table->index(['user_id', 'status']); // Multi-column index
});
Why? When a query filters on both user_id and status, MySQL can satisfy it with a single lookup in this one index, instead of using one single-column index and then filtering the results, or attempting to merge two separate indexes.
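Column order matters: MySQL can use any leftmost prefix of a composite index, so (user_id, status) also serves queries on user_id alone, but not queries on status alone. For example:
-- Can use the (user_id, status) index:
SELECT * FROM orders WHERE user_id = 42 AND status = 'pending';
SELECT * FROM orders WHERE user_id = 42;

-- Cannot use it (leftmost column missing); status needs its own index:
SELECT * FROM orders WHERE status = 'pending';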
Optimizing Query Structure
Replace Subqueries with Joins
Subqueries might look clean, but they often perform terribly at scale. Instead, use JOINs where possible.
Before (Subquery)
SELECT * FROM users WHERE id IN (SELECT user_id FROM orders WHERE status = 'pending');
After (JOIN)
SELECT DISTINCT users.* FROM users
JOIN orders ON users.id = orders.user_id
WHERE orders.status = 'pending';
Note the DISTINCT: a user with several pending orders would otherwise appear once per matching order, while the IN version returns each user only once.
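The same JOIN can be expressed with Laravel’s query builder. A sketch, assuming the users and orders tables from the SQL above:
// Sketch: join instead of a whereIn subquery.
$users = User::query()
    ->join('orders', 'users.id', '=', 'orders.user_id')
    ->where('orders.status', 'pending')
    ->select('users.*')
    ->distinct()
    ->get();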
Inverting Row Scans
In large datasets, sometimes it’s better to scan fewer rows from a smaller table than to filter a large table directly.
Example: Users Table (1M rows) & Posts Table (1K rows)
Instead of asking the huge users table which users have posts:
User::whereHas('posts')->get(); // runs an existence subquery against 1M users
Start from the small posts table and eager-load the users:
Post::with('user')->get(); // scans 1K posts, then fetches their authors
This can cut query time dramatically, because MySQL drives the query from the 1K-row table instead of the 1M-row one.
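If you need a plain, duplicate-free collection of users rather than posts with their authors, one sketch (assuming posts.user_id references users.id) is to feed the small table’s author IDs into a whereIn subquery; it is still a subquery, but one driven by the 1K-row table:
// Sketch: SELECT * FROM users WHERE id IN (SELECT DISTINCT user_id FROM posts)
$users = User::whereIn('id', Post::select('user_id')->distinct())->get();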
Reducing Data Load
Archiving & Pruning Old Data
Keeping too much data in your active database slows everything down. Instead:
- Archive data that is more than 1-5 years old (the exact window depends on business needs).
- Use Laravel’s Prunable trait to clean up old records automatically, as in the example below.
Laravel Pruning Example
use Illuminate\Database\Eloquent\Builder;
use Illuminate\Database\Eloquent\Model;
use Illuminate\Database\Eloquent\Prunable;

class Order extends Model
{
    use Prunable;

    // Rows matched by this query are deleted whenever model:prune runs.
    public function prunable(): Builder
    {
        return static::where('created_at', '<', now()->subYears(2));
    }
}
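Pruning only happens when the model:prune Artisan command runs, so it needs to be scheduled. A minimal sketch for app/Console/Kernel.php (Laravel 10 and earlier; newer versions can schedule it in routes/console.php instead):
use Illuminate\Console\Scheduling\Schedule;

protected function schedule(Schedule $schedule): void
{
    // Run pruning daily; Laravel discovers models using the Prunable trait.
    $schedule->command('model:prune')->daily();
}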
Using Virtual Columns for Computation
If you perform the same calculation repeatedly, consider generated columns: a STORED generated column is computed once by MySQL when the row is written, instead of on every query.
ALTER TABLE orders ADD COLUMN total_price DECIMAL(10,2) GENERATED ALWAYS AS (quantity * unit_price) STORED;
This offloads calculations from the app layer to the database engine.
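The same column can be declared in a Laravel migration with storedAs(), and because the value is materialized on disk it can be indexed like any other column. A sketch, assuming the quantity and unit_price columns from the SQL above:
Schema::table('orders', function (Blueprint $table) {
    // MySQL computes this on write and stores the result.
    $table->decimal('total_price', 10, 2)->storedAs('quantity * unit_price');
    // Filters and sorts on total_price can now use an index.
    $table->index('total_price');
});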
Closing Notes: Why You Should Optimize Before It’s Too Late
If you wait until queries take minutes or hours to optimize them, you’re already in trouble. Here’s what can happen:
❌ A COUNT(*) over 89,334 matching rows takes 42 minutes
❌ A DELETE affecting 200,000 rows runs for 1.5 hours
❌ Queries so slow you can’t even troubleshoot them
Final Advice
- Don’t wait until performance issues hit production.
- Benchmark queries regularly (EXPLAIN ANALYZE).
- Think beyond indexes: optimize queries, reduce data load, and structure your database efficiently.
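For the benchmarking step, MySQL 8.0.18+ includes EXPLAIN ANALYZE, which actually executes the query and reports measured timings per step. For example, against the JOIN query from earlier:
EXPLAIN ANALYZE
SELECT DISTINCT users.* FROM users
JOIN orders ON users.id = orders.user_id
WHERE orders.status = 'pending';
-- The output shows actual row counts and time per operation,
-- making it clear which indexes are used and where the time goes.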