Elasticsearch

Elasticsearch enables sub-second search. A cluster is a distributed deployment that scales out easily, which lets it handle petabyte-scale data. Most importantly, Elasticsearch can rank search results by relevance score, returning the most relevant results first.

1. Overview

Features

  1. Easy to install: no external dependencies; download, unpack, and start. A cluster can be set up by changing just a few parameters.
  2. JSON: input and output are JSON, so no schema needs to be defined up front; quick and convenient.
  3. RESTful: almost all operations (indexing, querying, even configuration) can be performed via HTTP APIs (see the example after this list).
  4. Distributed: Nodes are equal to each other (any node can be used as an entry point), and load balancing is automatic when adding nodes.
  5. Multi-tenancy: Different use cases can be isolated by creating different indexes, and multiple indexes can be queried simultaneously.
  6. Support for massive data: Can scale to PB-level structured and unstructured data, with near real-time processing of large volumes.
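
As a quick illustration of points 2 and 3 above, here is a minimal sketch of indexing and querying a document over the REST API (the user index and its fields are made up for illustration):

PUT /user/_doc/1
{
  "name": "Alice",
  "title": "engineer"
}

GET /user/_doc/1

GET /user/_search?q=title:engineer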

Functions

  • Distributed search engine
    Automatically distributes large datasets across multiple servers for storage and retrieval.

  • Full-text search
    Provides fuzzy search and other flexible query options, with relevance ranking, highlighting, etc.

  • Data analytics engine (aggregation)
    For example, user logins over the past week or monthly feature usage statistics.

  • Near real-time processing of massive data
    Thanks to its distributed architecture, Elasticsearch can leverage multiple servers for storage and queries.

Scenarios

  • Search-related use cases
    e.g., employee search, device search, in-app search, order search.

  • Log analysis
    Classic ELK stack (Elasticsearch/Logstash/Kibana) for log collection, storage, and analysis.

  • Alerting platforms & analytics
    e.g., in community group buying, send a notification when a product's discounted price drops below a set threshold.

  • Business Intelligence (BI)
    Analyzing user spending in a specific area and product categories, generating reports, predicting bestsellers, and creating targeted recommendations. Elasticsearch handles analysis/mining; Kibana provides visualization.

Competitor Analysis

Lucene
A Java search library distributed as a jar. It is only a toolkit, not a complete search service, and using it well requires considerable expertise.

Solr
Built on Lucene, exposes an HTTP interface. A search engine system with many Lucene details abstracted away.

Elasticsearch
A distributed, near-real-time search engine built on Lucene. By default every field is indexed and therefore searchable.

Comparison

  1. Solr uses Zookeeper for distributed management, while Elasticsearch has its own coordination features.
  2. Solr offers more comprehensive functionality, whereas Elasticsearch focuses on core features; advanced features are often provided by third-party plugins.
  3. Solr is stronger in traditional search applications, while Elasticsearch excels at real-time search.

At the time of writing, the mainstream version is still Elasticsearch 7.x, with 7.8 being the latest release.

  • Notable 7.x improvements: a bundled JDK, an upgrade to Lucene 8 with major top-k query performance gains, and improved circuit breakers to help prevent OOM errors.

2. Basic Concepts

IK Analyzer

IK Analyzer is an open-source, lightweight Chinese text segmentation toolkit written in Java. Since version 3.0 it has been a standalone Java library, with a default implementation optimized for Lucene.

Key features:

  1. Forward-iterating, finest-granularity segmentation algorithm, with a claimed throughput of about 600,000 characters per second.
  2. Multi-sub-processor mode: supports English letters (IP, Email, URL), numbers (dates, quantities, Roman numerals, scientific notation), Chinese terms (names, places), etc.
  3. Optimized dictionary storage for smaller memory footprint.
  4. Query parser (IKQueryParser) optimized for Lucene full-text search.
  5. Combinatorial expansion for higher recall.
  • Custom dictionary: ext_dict
  • Stopword dictionary: stop_dict
  • Synonym dictionary: same_dict
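
Assuming the elasticsearch-analysis-ik plugin is installed, its two analyzers (ik_smart for coarse-grained and ik_max_word for fine-grained segmentation) can be tried via the _analyze API. A minimal sketch:

GET /_analyze
{
  "analyzer": "ik_max_word",
  "text": "中华人民共和国国歌"
}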

Index (like a database)

  • settings: defines shard/replica count.

Mapping (like table schema)

  • Field data types.
  • Analyzer type.
  • Whether stored/indexed.

Document (data)

  • Full document replacement via PUT /{index}/_doc/{id}.
  • Partial update via POST /{index}/_update/{id}.
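
Putting the three concepts together, a minimal sketch that creates an index with settings and a mapping, then writes and partially updates a document (the book index and its fields are assumptions):

PUT /book
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1
  },
  "mappings": {
    "properties": {
      "title": { "type": "text", "analyzer": "standard" },
      "price": { "type": "double" }
    }
  }
}

PUT /book/_doc/1
{
  "title": "Elasticsearch in Action",
  "price": 39.99
}

POST /book/_update/1
{
  "doc": { "price": 29.99 }
}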

3. Advanced Features

Advanced Mapping

  • Geo-point type
    Stores latitude/longitude, enabling distance calculations and checks for whether a point falls within an area. The field type must be set to geo_point explicitly (see the sketch after this list).

  • Dynamic mapping
    Automatically infers field types and adds them to the mapping.
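
A minimal sketch of an explicit geo_point mapping plus a distance filter (the shop index, location field, and coordinates are assumptions):

PUT /shop
{
  "mappings": {
    "properties": {
      "location": { "type": "geo_point" }
    }
  }
}

GET /shop/_search
{
  "query": {
    "bool": {
      "filter": {
        "geo_distance": {
          "distance": "5km",
          "location": { "lat": 40.0, "lon": 116.3 }
        }
      }
    }
  }
}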

Advanced DSL

  • Match all query
  • Full-text queries
    • match
    • match_phrase
    • query_string
    • multi_match
  • Term-level queries
    • term
    • ids
    • range
    • prefix
    • wildcard
    • regexp
    • fuzzy

Also includes: compound queries, sorting, pagination, highlighting, bulk operations.
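
A few representative queries from the list above, run against a hypothetical book index (field names are assumptions):

GET /book/_search
{
  "query": { "match": { "title": "elasticsearch guide" } }
}

GET /book/_search
{
  "query": { "term": { "category.keyword": "tech" } }
}

GET /book/_search
{
  "query": { "range": { "price": { "gte": 10, "lte": 50 } } }
}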

Aggregations

  • Metric aggregations: max, min, sum, avg, etc.
  • Bucket aggregations: group by-like bucketing, with metrics applied within buckets.
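
For example, a bucket aggregation by category with an average-price metric inside each bucket (field names are assumptions; "size": 0 skips returning the hits themselves):

GET /book/_search
{
  "size": 0,
  "aggs": {
    "by_category": {
      "terms": { "field": "category.keyword" },
      "aggs": {
        "avg_price": { "avg": { "field": "price" } }
      }
    }
  }
}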

Smart Suggestions

  • Term Suggester
  • Phrase Suggester
  • Completion Suggester
  • Context Suggester

Precision order: Completion > Phrase > Term
Recall order: Term > Phrase > Completion

Performance: Completion is fastest. Use Phrase/Term carefully to avoid high resource usage.
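
A minimal Completion Suggester sketch: the field must be mapped with type completion up front (the book_suggest index and field names are assumptions):

PUT /book_suggest
{
  "mappings": {
    "properties": {
      "title_suggest": { "type": "completion" }
    }
  }
}

POST /book_suggest/_search
{
  "suggest": {
    "title_suggestion": {
      "prefix": "elas",
      "completion": { "field": "title_suggest" }
    }
  }
}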


4. Practical Usage

Write Optimization

  • Set replicas = 0 during bulk initialization; restore later.
  • Use auto-generated IDs to avoid existence checks.
  • Choose field types and analyzers wisely (e.g., keyword or binary vs. analyzed text).
  • Disable scoring and extend refresh intervals.
  • Use batch operations (bulk).
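
A sketch of the index-settings side of these tips: drop replicas and pause refresh before a bulk load, then restore afterwards (the book index and the values are illustrative):

PUT /book/_settings
{
  "index": { "number_of_replicas": 0, "refresh_interval": "-1" }
}

POST /_bulk
{ "index": { "_index": "book" } }
{ "title": "Doc 1", "price": 10 }
{ "index": { "_index": "book" } }
{ "title": "Doc 2", "price": 20 }

PUT /book/_settings
{
  "index": { "number_of_replicas": 1, "refresh_interval": "1s" }
}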

Read Optimization

  • Use filters instead of queries when scoring is unnecessary.
  • Group data by day/month/year to limit query scope to smaller indexes.
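
When scoring is unnecessary, wrap conditions in a bool filter so no relevance score is computed and filter results can be cached. A minimal sketch (field names are assumptions):

GET /book/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "category.keyword": "tech" } },
        { "range": { "price": { "lte": 50 } } }
      ]
    }
  }
}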

Zero-Downtime Reindexing

  1. External data import via MQ
    Trigger ES reindex through MQ message consumption.

  2. Scroll + Bulk + Aliases
    Create a new index, scroll source data in batches, bulk insert into new index, then atomically switch alias.

  3. Reindex API
    Provided natively by Elasticsearch as the _reindex endpoint; it wraps scroll + bulk internally.

Manual effort & flexibility: custom MQ-driven > scroll + bulk > reindex
Stability & reliability: custom MQ-driven < scroll + bulk < reindex
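
A sketch of the Reindex API combined with an atomic alias switch, which also applies to the scroll + bulk approach (the index names book_v1/book_v2 and the alias book are assumptions):

POST /_reindex
{
  "source": { "index": "book_v1" },
  "dest": { "index": "book_v2" }
}

POST /_aliases
{
  "actions": [
    { "remove": { "index": "book_v1", "alias": "book" } },
    { "add": { "index": "book_v2", "alias": "book" } }
  ]
}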


Deep Paging Performance Solutions

By default, Elasticsearch pagination uses the from + size approach, similar to MySQL’s LIMIT.
When from is very large (deep pagination), Elasticsearch must collect and sort all results up to that offset across all shards, which causes high CPU, memory, I/O, and even network overhead.

👉 Example: a search across 10 shards with from=990 and size=10 means each shard returns its top 1,000 hits (from + size) to the coordinating node, so 10,000 results are transferred for a single query. With 100 concurrent queries, that becomes 1,000,000 results in flight.

Elasticsearch limits this with the setting index.max_result_window (default: 10,000). Beyond this, ES rejects the query.
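
If a use case truly needs a deeper window and you accept the cost, the limit can be raised per index, but this only moves the problem, so treat it as a last resort (a sketch; the value is illustrative):

PUT /book/_settings
{
  "index": { "max_result_window": 20000 }
}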


Solutions

1. scroll API (Snapshot-based Traversal)
  • Creates a snapshot of matching results at query time.
  • Subsequent requests use a scroll_id to fetch batches of results.
  • Data changes (insert/update/delete) after initialization are not reflected in results.
  • Best for batch processing/export tasks (e.g., sending notifications to millions of users).

Example:

POST /book/_search?scroll=1m&size=2
{
  "query": { "match_all": {} }
}
GET /_search/scroll
{
  "scroll": "1m",
  "scroll_id": "returned_scroll_id"
}
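
Scroll contexts hold resources on the cluster until they expire; it is good practice to clear them explicitly once iteration is finished (the scroll_id is the one returned above):

DELETE /_search/scroll
{
  "scroll_id": "returned_scroll_id"
}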

2. search_after (Real-time Cursor Pagination)
  • Uses the last document’s sort key from the previous page to fetch the next page.
  • Reflects real-time changes (insert/update/delete).
  • Requires a globally unique sort field (commonly _id).
  • Cannot skip pages (must iterate sequentially).
  • Best for real-time user-facing pagination.

Example:

GET /book/_search
{
  "query": { "match_all": {} },
  "size": 2,
  "sort": [{ "_id": "desc" }]
}
GET /book/_search
{
  "query": { "match_all": {} },
  "size": 2,
  "search_after": ["last_doc_id"],
  "sort": [{ "_id": "desc" }]
}

3. from + size
  • Default and simplest approach.
  • Flexible but not scalable for deep pagination.
  • Only suitable for small datasets or shallow pages (e.g., first 100 pages).
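
For completeness, the from + size equivalent of the earlier examples: simple, but every deeper page gets more expensive (page depth is illustrative):

GET /book/_search
{
  "query": { "match_all": {} },
  "from": 4,
  "size": 2
}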

Comparison

  • from + size: Performance low. Advantages: simple, flexible. Limitations: deep-pagination issues, high resource cost. Best for: small datasets, shallow pagination (<10k docs).
  • scroll: Performance medium. Advantages: handles deep pagination, simple API. Limitations: not real-time (snapshot), scroll_id must be maintained. Best for: bulk export, batch processing of large datasets.
  • search_after: Performance high. Advantages: real-time, best performance. Limitations: must paginate sequentially, needs a unique sort key. Best for: real-time user-facing infinite scroll/pagination.

Best Practices

  • ✅ For user-facing pagination → use search_after with proper sort fields.
  • ✅ For backend jobs / batch export → use scroll API.
  • ✅ For small datasets / shallow pagination → from + size is acceptable.
  • 🚫 Avoid allowing arbitrary deep pagination in user interfaces — often limit to first N pages.