Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow LuceneSourceOperator to early terminate #108820

Merged
merged 4 commits into from
May 20, 2024

Conversation

iverase
Copy link
Contributor

@iverase iverase commented May 20, 2024

In the geo benchmarks, this is clearly showing as an issue as we are iterating all the matched documents to get the first 10:

image

We can make the LuceneSourceOperator collector to throw a CollectionTerminatedException whenever it has collected enough documents.

@iverase iverase marked this pull request as draft May 20, 2024 11:30
@elasticsearchmachine
Copy link
Collaborator

Hi @iverase, I've created a changelog YAML for you.

@iverase iverase added :Analytics/EQL EQL querying and removed :Analytics/Compute Engine Analytics in ES|QL labels May 20, 2024
@iverase iverase marked this pull request as ready for review May 20, 2024 12:43
@iverase iverase requested review from nik9000 and dnhatn May 20, 2024 12:44
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-analytical-engine (Team:Analytics)

@elasticsearchmachine elasticsearchmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label May 20, 2024
@iverase iverase added :Analytics/ES|QL AKA ESQL and removed :Analytics/EQL EQL querying labels May 20, 2024
// Note: if (maxPageSize - currentPagePos) is a small "remaining" interval, this could lead to slow collection with a
// highly selective filter. Having a large "enough" difference between max- and minPageSize (and thus currentPagePos)
// alleviates this issue.
maxPageSize - currentPagePos
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know the scorer logic in detail, but I'm curious to understand what would be the implications of limiting the numDocs here instead, eg. using Math.min(maxPageSize - currentPagePos, remainingDocs) here rather than throwing and catching an exception.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For what I see in the code, this value does not take into account deleted docs

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It makes a lot of sense, thanks!
FWIW, it looks good to me, but please wait for Nhat or Nik for a check

Copy link
Member

@dnhatn dnhatn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks @iverase.

@iverase iverase merged commit 60777cf into elastic:main May 20, 2024
15 checks passed
@iverase iverase deleted the LuceneSourceOperatorTerminate branch May 20, 2024 17:53
jedrazb pushed a commit to jedrazb/elasticsearch that referenced this pull request May 21, 2024
We can make the LuceneSourceOperator collector to throw a CollectionTerminatedException whenever it has collected enough documents.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Analytics/ES|QL AKA ESQL >enhancement Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) v8.15.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants