-
Notifications
You must be signed in to change notification settings - Fork 14
ZJIT: Fold LoadField on frozen objects to constants #911
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Closed
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
The "EXIVAR" terminology has been replaced by "gen fields" AKA "generic fields". Exivar implies variable, but generic fields include more than just variables, e.g. `object_id`.
The NEWOBJ tracepoint can generate an object_id, that's alright, what we don't want is actual instance variables.
While profiling `Monitor#synchronize` and `Mutex#synchronize`
I noticed a fairly significant amount of time spent in
`rb_check_typeddata`.
By implementing a fast path that assumes the object is valid
and that can be inlined, it does make a significant difference:
Before:
```
Mutex 13.548M (± 3.6%) i/s (73.81 ns/i) - 68.566M in 5.067444
Monitor 10.497M (± 6.5%) i/s (95.27 ns/i) - 52.529M in 5.032698s
```
After:
```
Mutex 20.887M (± 0.3%) i/s (47.88 ns/i) - 106.021M in 5.075989s
Monitor 16.245M (±13.3%) i/s (61.56 ns/i) - 80.705M in 5.099680s
```
```ruby
require 'bundler/inline'
gemfile do
gem "benchmark-ips"
end
mutex = Mutex.new
require "monitor"
monitor = Monitor.new
Benchmark.ips do |x|
x.report("Mutex") { mutex.synchronize { } }
x.report("Monitor") { monitor.synchronize { } }
end
```
… by heredocs See https://bugs.ruby-lang.org/issues/21756. Ripper fails to parse this, but prism actually also doesn't handle it correctly. When heredocs are used, even in lowercase percent arays there can be multiple `STRING_CONTENT` tokens. We need to concat them. Luckily we don't need to handle as many cases as in uppercase arrays where interpolation is allowed. ruby/prism@211677000e
Not so sure how to trigger it but this is definitly more correct. ruby/prism@1bc8ec5e5d
Attempt to fix the following SEGV: ``` ruby(gc_mark) ../src/gc/default/default.c:4429 ruby(gc_mark_children+0x45) [0x560b380bf8b5] ../src/gc/default/default.c:4625 ruby(gc_mark_stacked_objects) ../src/gc/default/default.c:4647 ruby(gc_mark_stacked_objects_all) ../src/gc/default/default.c:4685 ruby(gc_marks_rest) ../src/gc/default/default.c:5707 ruby(gc_marks+0x4e7) [0x560b380c41c1] ../src/gc/default/default.c:5821 ruby(gc_start) ../src/gc/default/default.c:6502 ruby(heap_prepare+0xa4) [0x560b380c4efc] ../src/gc/default/default.c:2074 ruby(heap_next_free_page) ../src/gc/default/default.c:2289 ruby(newobj_cache_miss) ../src/gc/default/default.c:2396 ruby(RB_SPECIAL_CONST_P+0x0) [0x560b380c5df4] ../src/gc/default/default.c:2420 ruby(RB_BUILTIN_TYPE) ../src/include/ruby/internal/value_type.h:184 ruby(newobj_init) ../src/gc/default/default.c:2136 ruby(rb_gc_impl_new_obj) ../src/gc/default/default.c:2500 ruby(newobj_of) ../src/gc.c:996 ruby(rb_imemo_new+0x37) [0x560b380d8bed] ../src/imemo.c:46 ruby(imemo_fields_new) ../src/imemo.c:105 ruby(rb_imemo_fields_new) ../src/imemo.c:120 ``` I have no reproduction, but my understanding based on the backtrace and error is that GC is triggered inside `newobj_init` causing the new object to be marked while in a incomplete state. I believe the fix is to pass the `shape_id` down to `newobj_init` so it can be set before the GC has a chance to trigger.
If GC trigger in the middle of `struct_alloc`, and the struct has more than 3 elements, then `fields_obj` reference is garbage. We must first check the shape to know if it was actually initialized.
This commit adds a specialized instruction iterator to the assembler with a custom "peek" method. The reason is that we want to add basic blocks to LIR. When we split instructions, we need to add any new instructions to the correct basic block. The custom iterator will maintain the correct basic block inside the assembler, that way when we push any new instructions they will be appended to the correct place.
This commit uses the custom instruction iterator in arm64 / x86_64 instruction splitting. Once we introduce basic blocks to LIR, the custom iterator will ensure that instructions are added to the correct place.
This should never be true. I added an `rb_bug` in case it was and it wasn't true in any of btest or test-all. Co-authored-by: Aaron Patterson <tenderlove@ruby-lang.org>
Not every caller (for example, YJIT) actually needs to pass the object. YJIT (and, in the future, ZJIT) only need to pass the class.
In the past parse.y and ripper had different `value_expr` definition so that `value_expr` does nothing for ripper. ```c // parse.y #define value_expr(node) value_expr_gen(p, (node)) // ripper #define value_expr(node) ((void)(node)) ``` However Rearchitect Ripper (89cfc15) removed `value_expr` definition for ripper then this commit removes needless parse.y macro and uses `value_expr_gen` directly.
In the past parse.y and ripper had different `new_nil` definition so that `new_nil` returns `nil` for ripper. ```c // parse.y #define new_nil(loc) NEW_NIL(loc) // ripper #define new_nil(loc) Qnil ``` However Rearchitect Ripper (89cfc15) removed `new_nil` definition for ripper then this commit removes needless parse.y macro and uses `NEW_NIL` directly.
Fixes: ruby#685 This feature can easily break how you use other gems like factory_bot or prawn. ruby/psych#747 (comment) > But I kind of think we should leave `psych/y` around. If people really want to use it they could require the file. If you miss the function in Kernel, you can require it interactively or add it to `.irbrc`: ```ruby require 'psych/y' ``` ruby/psych@f1610b3f05
It's used as an alternative to find-and-replace, so we should have nothing to replace.
We generally know the receiver's class from profile info. I see 600k of these when running lobsters.
Since we do a decent job of pre-sizing objects, don't handle the case where we would need to re-size an object. Also don't handle too-complex shapes.
lobsters stats before:
```
Top-20 calls to C functions from JIT code (79.4% of total 90,051,140):
rb_vm_opt_send_without_block: 19,762,433 (21.9%)
rb_vm_setinstancevariable: 7,698,314 ( 8.5%)
rb_hash_aref: 6,767,461 ( 7.5%)
rb_vm_env_write: 5,373,080 ( 6.0%)
rb_vm_send: 5,049,229 ( 5.6%)
rb_vm_getinstancevariable: 4,535,259 ( 5.0%)
rb_obj_is_kind_of: 3,746,306 ( 4.2%)
rb_ivar_get_at_no_ractor_check: 3,745,237 ( 4.2%)
rb_vm_invokesuper: 3,037,467 ( 3.4%)
rb_ary_entry: 2,351,983 ( 2.6%)
rb_vm_opt_getconstant_path: 1,344,740 ( 1.5%)
rb_vm_invokeblock: 1,184,474 ( 1.3%)
Hash#[]=: 1,064,288 ( 1.2%)
rb_gc_writebarrier: 1,006,972 ( 1.1%)
rb_ec_ary_new_from_values: 902,687 ( 1.0%)
fetch: 898,667 ( 1.0%)
rb_str_buf_append: 833,787 ( 0.9%)
rb_class_allocate_instance: 822,024 ( 0.9%)
Hash#fetch: 699,580 ( 0.8%)
_bi20: 682,068 ( 0.8%)
Top-4 setivar fallback reasons (100.0% of total 7,732,326):
shape_transition: 6,032,109 (78.0%)
not_monomorphic: 1,469,300 (19.0%)
not_t_object: 172,636 ( 2.2%)
too_complex: 58,281 ( 0.8%)
```
lobsters stats after:
```
Top-20 calls to C functions from JIT code (79.0% of total 88,322,656):
rb_vm_opt_send_without_block: 19,777,880 (22.4%)
rb_hash_aref: 6,771,589 ( 7.7%)
rb_vm_env_write: 5,372,789 ( 6.1%)
rb_gc_writebarrier: 5,195,527 ( 5.9%)
rb_vm_send: 5,049,145 ( 5.7%)
rb_vm_getinstancevariable: 4,538,485 ( 5.1%)
rb_obj_is_kind_of: 3,746,241 ( 4.2%)
rb_ivar_get_at_no_ractor_check: 3,745,172 ( 4.2%)
rb_vm_invokesuper: 3,037,157 ( 3.4%)
rb_ary_entry: 2,351,968 ( 2.7%)
rb_vm_setinstancevariable: 1,703,337 ( 1.9%)
rb_vm_opt_getconstant_path: 1,344,730 ( 1.5%)
rb_vm_invokeblock: 1,184,290 ( 1.3%)
Hash#[]=: 1,061,868 ( 1.2%)
rb_ec_ary_new_from_values: 902,666 ( 1.0%)
fetch: 898,666 ( 1.0%)
rb_str_buf_append: 833,784 ( 0.9%)
rb_class_allocate_instance: 821,778 ( 0.9%)
Hash#fetch: 755,913 ( 0.9%)
Top-4 setivar fallback reasons (100.0% of total 1,703,337):
not_monomorphic: 1,472,405 (86.4%)
not_t_object: 172,629 (10.1%)
too_complex: 58,281 ( 3.4%)
new_shape_needs_extension: 22 ( 0.0%)
```
I also noticed that primitive printing in HIR was broken so I fixed that.
Co-authored-by: Aaron Patterson <tenderlove@ruby-lang.org>
… increase: - ### TL;DR Bundler is heavily limited by the connection pool which manages a single connection. By increasing the number of connection, we can drastiscally speed up the installation process when many gems need to be downloaded and installed. ### Benchmark There are various factors that are hard to control such as compilation time and network speed but after dozens of tests I can consistently get aroud 70% speed increase when downloading and installing 472 gems, most having no native extensions (on purpose). ``` # Before bundle install 28.60s user 12.70s system 179% cpu 23.014 total # After bundle install 30.09s user 15.90s system 281% cpu 16.317 total ``` You can find on this gist how this was benchmarked and the Gemfile used https://gist.github.com/Edouard-chin/c8e39148c0cdf324dae827716fbe24a0 ### Context A while ago in #869, Aaron introduced a connection pool which greatly improved Bundler speed. It was noted in the PR description that managing one connection was already good enough and it wasn't clear whether we needed more connections. Aaron also had the intuition that we may need to increase the pool for downloading gems and he was right. > We need to study how RubyGems uses connections and make a decision > based on request usage (e.g. only use one connection for many small > requests like bundler API, and maybe many connections for > downloading gems) When bundler downloads and installs gem in parallel https://github.com/ruby/rubygems/blob/4f85e02fdd89ee28852722dfed42a13c9f5c9193/bundler/lib/bundler/installer/parallel_installer.rb#L128 most threads have to wait for the only connection in the pool to be available which is not efficient. ### Solution This commit modifies the pool size for the fetcher that Bundler uses. RubyGems fetcher will continue to use a single connection. The bundler fetcher is used in 2 places. 1. When downloading gems https://github.com/ruby/rubygems/blob/4f85e02fdd89ee28852722dfed42a13c9f5c9193/bundler/lib/bundler/source/rubygems.rb#L481-L484 2. When grabing the index (not the compact index) using the `bundle install --full-index` flag. https://github.com/ruby/rubygems/blob/4f85e02fdd89ee28852722dfed42a13c9f5c9193/bundler/lib/bundler/fetcher/index.rb#L9 Having more connections in 2) is not any useful but tweaking the size based on where the fetcher is used is a bit tricky so I opted to modify it at the class level. I fiddle with the pool size and found that 5 seems to be the sweet spot at least for my environment. ruby/rubygems@6063fd9963
(ruby/stringio#165) Adds to "Position": pos inside a character. Makes a couple of minor corrections. --------- ruby/stringio@ff332abafa Co-authored-by: Sutou Kouhei <kou@cozmixng.org>
This refactors the concurrent set to examine and reserve a slot via CAS with the hash, before then doing the same with the key. This allows us to use an extra bit from the hash as a "continuation bit" which marks whether we have ever probed past this key while inserting. When that bit isn't set on deletion we can clear the field instead of placing a tombstone.
This removes all allocations from the find_or_insert loop, which requires us to start the search over after calling the provided create function. In exchange that allows us to assume that all concurrent threads insert will get the same view of the GC state, and so should all be attempting to clear and reuse a slot containing a garbage object.
Fix: ruby/forwardable#35 [Bug #21708] Trying to compile code to check if a method can use the delegation fastpath is a bit wasteful and cause `RUPYOPT=-d` to be full of misleading errors. It's simpler and faster to use a simple regexp to do the same check. ruby/forwardable@de1fbd182e
That call is surprisingly expensive, so trying doing it once in `#synchronize` and then passing the fiber to enter and exit saves quite a few cycles.
Make it embedded and compaction aware.
It's the most likely control character so it's worth giving a better error message for it. ruby/json@1da3fd9233
…y#15478) This fixes a crash when the new shape after a transition is too complex; we need to check that it's not complex before trying to read by index.
Encodings are RTypedData, not the deprecated RData. Although the structures are compatible we should use the correct API.
When accessing instance variables from frozen objects via attr_reader/ attr_accessor, fold the LoadField instruction to a constant at compile time. This enables further optimizations like constant propagation. - Add fold_getinstancevariable_frozen optimization in Function::optimize - Check if receiver type has a known ruby_object() that is frozen - Read the field value at compile time and replace with Const instruction - Add 10 unit tests covering various value types (fixnum, string, symbol, nil, true/false) and negative cases (unfrozen, dynamic receiver)
|
Closing in favor of ruby#15483 |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Summary
This PR adds a compile-time optimization that folds
LoadFieldinstructions on frozen objects into constants. When reading instance variables from frozen constant objects (viaattr_reader/attr_accessor), the JIT can now resolve the value at compile time rather than at runtime.Before
After
This enables further optimizations like constant propagation and arithmetic folding. For example, accessing two fields from a frozen object and adding them can now be fully folded to a single constant.
Test plan
LoadField--zjit-dump-hir