This repository was archived by the owner on May 25, 2021. It is now read-only.

Make Query Limit Results Configurable#56

Open
tonysun83 wants to merge 4 commits into apache:master from cloudant:67462-add-configurable-limit

Conversation

@tonysun83

Currently, _all_docs and view results are practically unlimited
unless a user passes in a defined limit. This change allows an
administrator/operator to set a limit which takes priority over the
user-defined limit. If this property is not set, then the default
value 16#10000000 is used.

BugzId: 67462
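
For reference, this is the shape the setting would take in local.ini. The fragment is hypothetical: the section and key names below are the ones used in the patch as posted, and both the section and the name are debated later in this thread:

```ini
; Hypothetical local.ini fragment; section/key names follow the patch
; under review. A non-positive or absent value means no cap is applied.
[couch_db]
default_query_limit = 5000
```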
@kxepal (Member) commented Sep 5, 2016

+1

@rnewson (Member) commented Sep 5, 2016

-1, the default must be the current behaviour. A partial result from _all_docs is not the same as a full one. What's the use case for forcibly clipping a response?

@rnewson (Member) commented Sep 5, 2016

and there's no JIRA ticket associated with this change either.

@tonysun83 (Author)

@rnewson: I'm using the original value here: https://github.com/cloudant/couchdb-couch-mrview/blob/92575f448bbba34466e64ab595aa3ec8af3c96b6/include/couch_mrview.hrl#L76. Isn't that the current default behavior?

@rnewson (Member) commented Sep 5, 2016

hrm, very bleh. Ok, please make that a -define and update in both places; we'll have to come back to this. A default finite limit for _all_docs seems like a contract-breaking detail to me (even if the actual value is ~268 million). (cc @kocolosk @davisp).

Imo the default behaviour for _all_docs and views is that, unless constrained by startkey/endkey, you could receive any number of rows. Since we stream them, there's no reason to put an artificial cap here, no matter how generous.

@tonysun83 (Author)

Updated with macros and created the corresponding JIRA ticket.

@kxepal (Member) commented Sep 5, 2016

Good points, @rnewson. Changing mine to -1. I think this needs a wider discussion.

@kxepal (Member) commented Sep 5, 2016

If this is still planned to be accepted, then for the least surprise there should be:

  1. A way to restore the good old behaviour.
  2. A warning in the logs that we now limit results by default, so admins won't lose time figuring out the problem by diving into Erlang.

Member (review comment):

DEFAULT_MAX_QUERY_LIMIT, I'm sure a user could specify a larger value, right?

Member (review comment):

or just DEFAULT_LIMIT, this ain't Java. :D

@rnewson (Member) commented Sep 5, 2016

hrm, this still isn't quite right.

current behaviour, limit defaults to 16#10000000 (268,435,456 in decimal), but a user could specify a higher value and it would work (@tonysun83, right?)

the new behaviour should preserve the same default 16#10000000 and allow a separate and optional upper limit.

What it does right now is always impose an upper limit which defaults to 16#10000000.

@tonysun83 (Author)

@rnewson: thx for pointing that out. I think my new change should adhere to the behavior we want:

  1. If there is no config value, or if it's a non-positive integer, then we just use whatever is provided by the user OR the original default. In this case, a user can give a limit higher than 16#10000000 and it will be accepted.

  2. If a positive config value is there, then it takes precedence over the user defined limit and we use that value instead.
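
The two cases above amount to a small clamping rule. Here is an illustrative sketch in Python (the actual patch is Erlang inside couch_mrview; all names here are hypothetical, and the cap is applied as a min(), matching the min(Limit, ConfigDefault1) call discussed later in the review):

```python
# Illustrative sketch only -- the real implementation is Erlang.
DEFAULT_LIMIT = 0x10000000  # 16#10000000 in Erlang, i.e. 268,435,456

def effective_limit(user_limit=None, config_limit=-1):
    """Resolve the limit actually applied to a query.

    user_limit:   value of ?limit=X, or None if absent
    config_limit: operator setting; non-positive means "not set"
    """
    requested = user_limit if user_limit is not None else DEFAULT_LIMIT
    if config_limit <= 0:
        # Case 1: no valid config -- honour the user's value,
        # even if it exceeds the default.
        return requested
    # Case 2: a positive config value caps whatever was requested.
    return min(requested, config_limit)
```

Note that using min() means a user limit below the cap still wins; only requests above the configured value get clamped.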


end, Args1, Props),
Limit = Args2#mrargs.limit,
case config:get_integer("couch_db", "default_query_limit", -1) of
Member (review comment):

this should be max_query_limit, right?

Member (review comment):

max_query_limit is confusing. Is it the max value I can put into the limit query parameter?

Member (review comment):

But the section name couch_db is strange. We don't have any such section.

Member (review comment):

I just mean it needs to match the setting in the test, but yes, that's the point, right? An optional forced upper value of ?limit=X.

Good catch on the section name; it should be couch_mrview imo.

Member (review comment):

I see. My point was about the thin difference between default_query_limit and max_query_limit. The first name suggests you can specify any value to override the default, even a greater one. The second implies (at least to me) that I cannot beat that value from the config file, so ?limit=100 with max_query_limit=10 will still return only 10 records.

Member (review comment):

Hm.. I should probably read the code, lol. The min(Limit, ConfigDefault1) is quite specific about the intention. So everything is correct here, you're right.

@rnewson (Member) commented Sep 6, 2016

still some more issues but we're getting closer. Note that this cannot merge until 2.0 is cut, we're in feature freeze.

@tonysun83 (Author)

@kxepal @rnewson @davisp: now that CouchDB 2.0 is released, I wanted to bring this back up for discussion. I'm open to suggestions for various configuration names, but I still think we should make all our index limits configurable, starting with views.

@kxepal (Member) commented Oct 3, 2016

@tonysun83

I still think we should make all our index limits configurable, starting with views.

That's a good position as long as requesting indexes is not cheap. Say, there are some O(N) operations which could cause a DoS just by requesting a bit of big data. Otherwise, or if we can stream results chunk by chunk à la the changes feed without any harm to the server, there is no need for such limitations, even configurable ones, as they only complicate things, while it's the client who causes the DoS on themselves by requesting more than they can process.

I would like to look at this feature from that angle. Would the proposed limit save the server from some kind of resource drain, or is it about protecting the clients? Can we make all our indexes stream-like and encourage people to use them instead? Would that help everyone solve the problem of requesting unlimited data?

@rnewson (Member) commented Oct 3, 2016

I still think the default behaviour of _all_docs, _changes and _view should be to return everything that the parameters dictate (that is, we should fix the default, the longstanding large-but-finite default is also a bug).

If we all agree to change that, then it needs to be abundantly clear to the user that they did not get all results. One suggestion is, instead of changing the default, we introduce an optional maximum value for limit and then, if set, reject all requests without an explicit limit parameter. So that's a 400 Bad Request if limit is too high or if it's missing.
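
The suggestion above amounts to request validation rather than silent clamping. A sketch in Python (nothing like this exists in the patch; all names are hypothetical):

```python
# Sketch of the proposed behaviour: if an operator configures a maximum,
# every request must carry an explicit limit, and it must not exceed the
# maximum -- otherwise the request fails with 400 Bad Request.
class BadRequest(Exception):
    """Maps to HTTP 400 Bad Request."""

def validate_limit(params, max_limit=None):
    """Return the limit to use, or raise BadRequest."""
    if max_limit is None:
        # No operator cap configured: current behaviour, limit optional.
        return params.get("limit")
    if "limit" not in params:
        raise BadRequest("limit parameter is required")
    if params["limit"] > max_limit:
        raise BadRequest("limit exceeds configured maximum")
    return params["limit"]
```

The point of raising instead of clamping is that the client can never silently receive a partial result set; it either asked for a permitted amount or got an error.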

@tonysun83 (Author) commented Oct 4, 2016

@kxepal: I'm inclined to think this change helps protect servers from clients. However, at the same time, we don't want a configurable change that silently gives users a different result set without forcing any response from them; they might not notice the result set is different. I like @rnewson's idea of forcing a required limit parameter on them and fixing the default value to actually become infinity. This way, the old behaviour stands, but if something changes, the user is made aware.

Now we need to consider how this "maximum" field integrates with mango. The original idea for mango was to make limits for both json and text indexes the same across the board (similar to rnewson's PR for mango). But this change introduces another configurable value. What happens when this "maximum" limit is set to 10, but the mango limit is still the default of 25? Do we simply ignore this "maximum" and use the mango limit of 25, or do we enforce the 10? Also, the maximum limit for text indexes is 200, and they have bookmarks. So all these factors must be considered. Personally, I'm leaning towards making everything configurable, but as its own separate entity.

@kxepal (Member) commented Oct 5, 2016

@tonysun83
If we force a limit query parameter, we should provide a clean and simple way to do pagination with these requests. Otherwise people will just send limit=100500 because they don't care and want to get their data without all these complications. And since pagination is not a simple thing, I fear the latter case will be very popular.

On the other hand, limits don't help protect servers, because it all depends on the size of the documents you include in the response. So limit=1000 may be fine for 1KiB docs, but not for 100MiB ones, while include_docs=true is so easy and helpful. Anyway, that's quite a strange protection to put in the user API, because it doesn't solve any user problem. Why should they care about server issues?

I think it's better instead to try to stream views / mango without any limits. You can stop at any time, remember the offset / seq, and start over again. And no one will consume more resources than a single view result costs, unless someone does buffering.

So these are my thoughts. I'm drifting a bit away from the initial idea of just setting some limit, towards a new feature that can solve the same problem in a better fashion. Wouldn't alternative solutions be better in the mid/long term? We have had unlimited view results throughout our history, and it seems there is no urgency to make a rushed decision here.
