Fix null byte #21

Open
lapp0 wants to merge 9 commits into main from fix-null-byte

@lapp0
Owner

@lapp0 lapp0 commented May 26, 2024

@github-actions

github-actions Bot commented May 26, 2024

Benchmark Suite Results:

| Before [592c884] | After [78f0974] | Ratio | Benchmark (Parameter) |
|---|---|---|---|
| 4.89±0.04s | 4.38±0.03s | ~0.90 | bench_numba_compile.NumbaCompileBenchmark.time_compile_numba |
| 292±3ms | 256±2ms | ~0.87 | bench_regex_guide.RegexGuideBenchmark.time_regex_to_guide('email') |
| 252±7ms | 212±1ms | ~0.84 | bench_regex_guide.RegexGuideBenchmark.time_regex_to_guide('ip') |
| 423±7ms | 349±1ms | ~0.83 | bench_regex_guide.RegexGuideBenchmark.time_regex_to_guide('url') |
| 361±10ms | 289±1ms | ~0.80 | bench_regex_guide.RegexGuideBenchmark.time_regex_to_guide('date') |
| 641±10ms | 449±2ms | ~0.70 | bench_regex_guide.RegexGuideBenchmark.time_regex_to_guide('complex_phone') |
| 4.24±0.06s | 2.91±0.01s | ~0.69 | bench_json_schema.JsonSchemaBenchmark.time_json_schema_to_fsm('complex_schema') |
| 5.79±0.1s | 3.98±0.01s | ~0.69 | bench_regex_guide.RegexGuideBenchmark.time_regex_to_guide('complex_span_constrained_relation_extraction') |
| 2.08±0.01s | 1.36±0.02s | ~0.65 | bench_json_schema.JsonSchemaBenchmark.time_json_schema_to_fsm('simple_schema') |
| 127±3ms | 134±0.7ms | 1.06 | bench_regex_guide.RegexGuideBenchmark.time_regex_to_guide('time') |
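
For readers checking the numbers: the Ratio column appears to be the After time divided by the Before time, so values below 1 mean the branch is faster. The snippet below is a minimal sketch that recomputes three of the ratios from the reported means. The `ratio` helper and the `rows` list are illustrative only and are not part of the benchmark suite.

```python
# Recompute a few Ratio values from the reported mean times (in seconds).
# Assumption: ratio = after / before, rounded to two decimals.

def ratio(before: float, after: float) -> float:
    """Return after/before rounded to two decimal places."""
    return round(after / before, 2)


# (benchmark name, before mean, after mean), copied from the results above
rows = [
    ("time_compile_numba", 4.89, 4.38),
    ("time_json_schema_to_fsm('simple_schema')", 2.08, 1.36),
    ("time_regex_to_guide('time')", 0.127, 0.134),
]

ratios = {name: ratio(before, after) for name, before, after in rows}

for name, value in ratios.items():
    print(f"{name}: {value:.2f}")
```

Only the `'time'` case comes out above 1, which matches the single regression reported by the bot.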

@lapp0
Owner Author

lapp0 commented May 26, 2024

Profile this branch

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.009    0.009   22.921   22.921 /home/andrew/p/outlines/profile_null_byte_fix.py:19(profile_email_guide)
        1    0.000    0.000   22.912   22.912 /home/andrew/p/outlines/outlines/fsm/guide.py:140(__init__)
        1    0.001    0.001   22.912   22.912 /home/andrew/p/outlines/outlines/caching.py:113(wrapper)
        1    0.000    0.000   22.829   22.829 /home/andrew/p/outlines/outlines/fsm/guide.py:108(create_states_mapping)
        1    0.015    0.015   21.851   21.851 /home/andrew/p/outlines/outlines/fsm/regex.py:853(create_fsm_index_tokenizer)
        1    0.411    0.411   21.762   21.762 /home/andrew/p/outlines/outlines/fsm/regex.py:709(create_fsm_index_end_to_end)
      389   21.128    0.054   21.130    0.054 /home/andrew/p/outlines/outlines/fsm/regex.py:672(state_scan_tokens)
      523    0.228    0.000    0.846    0.002 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/fsm.py:1023(crawl)
     10/1    0.001    0.000    0.838    0.838 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/patterns.py:447(to_fsm)
    128/2    0.001    0.000    0.745    0.372 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/patterns.py:453(<genexpr>)
    118/1    0.003    0.000    0.745    0.745 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/patterns.py:370(to_fsm)
     13/1    0.000    0.000    0.681    0.681 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/patterns.py:280(to_fsm)
      131    0.002    0.000    0.355    0.003 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/fsm.py:364(concatenate)
       10    0.000    0.000    0.288    0.029 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/fsm.py:505(union)
       10    0.000    0.000    0.288    0.029 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/fsm.py:967(parallel)
   114135    0.190    0.000    0.190    0.000 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/fsm.py:979(follow)
   198170    0.159    0.000    0.176    0.000 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/fsm.py:401(follow)
       13    0.000    0.000    0.123    0.009 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/fsm.py:418(__add__)
        1    0.000    0.000    0.116    0.116 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/fsm.py:249(reduce)
        1    0.000    0.000    0.116    0.116 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/fsm.py:253(reduce_brzozowski)
        2    0.004    0.002    0.115    0.058 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/fsm.py:612(reversed)
  1153581    0.104    0.000    0.104    0.000 {method 'add' of 'set' objects}
      125    0.000    0.000    0.102    0.001 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/fsm.py:428(star)

Profile main

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.007    0.007   10.290   10.290 /home/andrew/p/outlines/profile_null_byte_fix.py:19(profile_email_guide)
        1    0.000    0.000   10.283   10.283 /home/andrew/p/outlines/outlines/fsm/guide.py:140(__init__)
        1    0.001    0.001   10.283   10.283 /home/andrew/p/outlines/outlines/caching.py:113(wrapper)
        1    0.000    0.000   10.213   10.213 /home/andrew/p/outlines/outlines/fsm/guide.py:108(create_states_mapping)
        1    0.014    0.014    9.228    9.228 /home/andrew/p/outlines/outlines/fsm/regex.py:829(create_fsm_index_tokenizer)
        1    0.355    0.355    9.160    9.160 /home/andrew/p/outlines/outlines/fsm/regex.py:684(create_fsm_index_end_to_end)
      389    8.644    0.022    8.646    0.022 /home/andrew/p/outlines/outlines/fsm/regex.py:651(state_scan_tokens)
     10/1    0.001    0.000    0.865    0.865 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/patterns.py:447(to_fsm)
      523    0.197    0.000    0.733    0.001 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/fsm.py:1023(crawl)
    128/2    0.001    0.000    0.642    0.321 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/patterns.py:453(<genexpr>)
    118/1    0.002    0.000    0.641    0.641 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/patterns.py:370(to_fsm)
     13/1    0.000    0.000    0.590    0.590 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/patterns.py:280(to_fsm)
      131    0.002    0.000    0.301    0.002 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/fsm.py:364(concatenate)
       10    0.000    0.000    0.258    0.026 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/fsm.py:505(union)
       10    0.000    0.000    0.258    0.026 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/fsm.py:967(parallel)
      269    0.010    0.000    0.197    0.001 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/fsm.py:112(union)
    801/1    0.001    0.000    0.169    0.169 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/patterns.py:69(get_alphabet)
     10/1    0.000    0.000    0.169    0.169 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/patterns.py:423(_get_alphabet)
    128/2    0.000    0.000    0.169    0.085 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/patterns.py:425(<genexpr>)
    118/1    0.000    0.000    0.169    0.169 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/patterns.py:330(_get_alphabet)
    787/2    0.000    0.000    0.169    0.085 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/patterns.py:331(<genexpr>)
     13/1    0.000    0.000    0.169    0.169 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/patterns.py:270(_get_alphabet)
   114135    0.169    0.000    0.169    0.000 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/fsm.py:979(follow)
   198170    0.135    0.000    0.148    0.000 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/fsm.py:401(follow)
      269    0.143    0.001    0.143    0.001 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/fsm.py:118(<dictcomp>)
       13    0.000    0.000    0.104    0.008 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/fsm.py:418(__add__)
        1    0.000    0.000    0.100    0.100 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/fsm.py:249(reduce)
        1    0.000    0.000    0.100    0.100 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/fsm.py:253(reduce_brzozowski)
        2    0.003    0.002    0.100    0.050 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/fsm.py:612(reversed)

Profile script

import cProfile
import pstats
from io import StringIO

from outlines.models.transformers import TransformerTokenizer
from transformers import AutoTokenizer
from outlines.fsm.guide import RegexGuide


tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer = TransformerTokenizer(tokenizer)


# ensure numba compiled
RegexGuide("a", tokenizer)

pattern = "(['\"\\ ,]?((?:of|resulting|case|which|cultures|a|core|extreme|selflessness|spiritual|various|However|both|vary|in|other|secular|the|religious|among|moral|and|It|object|worldviews|altruism|traditional|material|aspect|or|life|beings|virtue|is|however|opposite|concern|an|practice|it|for|s|quality|religions|In|Altruism|animals|happiness|many|become|principle|human|selfishness|may|synonym)['\"\\ ,]?)+['\"\\ ,]?\\s\\|\\s([^|\\(\\)\n]{1,})\\s\\|\\s['\"\\ ,]?((?:of|resulting|case|which|cultures|a|core|extreme|selflessness|spiritual|various|However|both|vary|in|other|secular|the|religious|among|moral|and|It|object|worldviews|altruism|traditional|material|aspect|or|life|beings|virtue|is|however|opposite|concern|an|practice|it|for|s|quality|religions|In|Altruism|animals|happiness|many|become|principle|human|selfishness|may|synonym)['\"\\ ,]?)+['\"\\ ,]?(\\s\\|\\s\\(([^|\\(\\)\n]{1,})\\s\\|\\s([^|\\(\\)\n]{1,})\\))*\\n)*"

def profile_email_guide():
    RegexGuide(pattern, tokenizer)


# Create a profiler
profiler = cProfile.Profile()

# Run the code with the profiler
profiler.enable()
profile_email_guide()
profiler.disable()

# Create a stream to hold profiling statistics
s = StringIO()
sortby = 'cumulative'
ps = pstats.Stats(profiler, stream=s).sort_stats(sortby)

# Print the profiling statistics
ps.print_stats()

# Display the profiling statistics
print(s.getvalue())

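The culprit is my modified implementation of _walk_fsm, which is called from state_scan_tokens: in the profiles above, state_scan_tokens accounts for 21.128s of total time on this branch versus 8.644s on main. I will investigate how to optimize it.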
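
To isolate a single hot function when comparing two profiles, `pstats.Stats.print_stats` accepts a string restriction that is treated as a regular expression and matched against the `filename:lineno(function)` column. The sketch below shows this on a toy workload; `hot_loop`, `cheap_setup`, and `workload` are stand-in functions invented for illustration and are unrelated to the outlines code.

```python
import cProfile
import io
import pstats


def hot_loop(n: int) -> int:
    """Stand-in for the expensive function under investigation."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def cheap_setup() -> list:
    """Stand-in for inexpensive surrounding work."""
    return list(range(10))


def workload() -> int:
    cheap_setup()
    return hot_loop(50_000)


profiler = cProfile.Profile()
profiler.enable()
workload()
profiler.disable()

# Sort by time spent inside each function itself, then print only the
# rows whose "filename:lineno(function)" string matches the restriction.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream).sort_stats("tottime")
stats.print_stats("hot_loop")
report = stream.getvalue()
print(report)
```

In the profile script above, the equivalent call would be `ps.print_stats("state_scan_tokens")`, which reduces each report to the one row that differs between the branch and main.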

@lapp0
Owner Author

lapp0 commented May 27, 2024

I incorporated an index that converts each token into a sequence of transition keys, and now it's slightly faster than main!
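
The idea can be sketched as follows. This is a simplified illustration and not the code in this PR: `build_token_trans_keys`, `walk`, and the toy FSM are hypothetical names and data, and the assumption is that each symbol of a token maps to an integer transition key, with a fallback key for symbols outside the FSM alphabet. Converting every token once up front means the per-state scan only follows integer keys instead of repeating the symbol lookup for every (state, token) pair.

```python
from typing import Dict, List, Tuple


def build_token_trans_keys(
    vocabulary: List[str],
    symbol_to_key: Dict[str, int],
    anything_key: int,
) -> List[Tuple[int, ...]]:
    """Convert every token into its transition-key sequence, once."""
    return [
        tuple(symbol_to_key.get(symbol, anything_key) for symbol in token)
        for token in vocabulary
    ]


def walk(
    transitions: Dict[Tuple[int, int], int],
    start_state: int,
    trans_keys: Tuple[int, ...],
) -> List[int]:
    """Follow the keys from start_state; return [] if a transition is missing."""
    state = start_state
    visited = []
    for key in trans_keys:
        next_state = transitions.get((state, key))
        if next_state is None:
            return []
        state = next_state
        visited.append(state)
    return visited


# Toy FSM accepting "ab": state 0 --a--> 1 --b--> 2
symbol_to_key = {"a": 0, "b": 1}
anything_key = 2  # hypothetical key for symbols outside the alphabet
transitions = {(0, 0): 1, (1, 1): 2}

vocabulary = ["ab", "ba", "a?"]
token_keys = build_token_trans_keys(vocabulary, symbol_to_key, anything_key)

# The index is built once and reused for every start state.
paths = [walk(transitions, 0, keys) for keys in token_keys]
print(token_keys)
print(paths)
```

The up-front conversion is a one-time cost per vocabulary; in the new profile it shows up as `get_tokens_trans_keys` at 0.101s.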

New profile:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.015    0.015   10.837   10.837 /home/andrew/p/outlines/profile_null_byte_fix.py:24(profile_email_guide)
        1    0.000    0.000   10.822   10.822 /home/andrew/p/outlines/outlines/fsm/guide.py:140(__init__)
        1    0.001    0.001   10.822   10.822 /home/andrew/p/outlines/outlines/caching.py:113(wrapper)
        1    0.001    0.001   10.821   10.821 /home/andrew/p/outlines/outlines/fsm/guide.py:108(create_states_mapping)
        1    0.023    0.023    9.681    9.681 /home/andrew/p/outlines/outlines/fsm/regex.py:877(create_fsm_index_tokenizer)
        1    0.488    0.488    9.556    9.556 /home/andrew/p/outlines/outlines/fsm/regex.py:726(create_fsm_index_end_to_end)
      389    8.745    0.022    8.747    0.022 /home/andrew/p/outlines/outlines/fsm/regex.py:659(state_scan_tokens)
      523    0.259    0.000    0.976    0.002 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/fsm.py:1023(crawl)
     10/1    0.002    0.000    0.972    0.972 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/patterns.py:447(to_fsm)
    128/2    0.001    0.000    0.859    0.429 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/patterns.py:453(<genexpr>)
    118/1    0.003    0.000    0.859    0.859 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/patterns.py:370(to_fsm)
     13/1    0.000    0.000    0.791    0.791 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/patterns.py:280(to_fsm)
      131    0.002    0.000    0.400    0.003 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/fsm.py:364(concatenate)
       10    0.000    0.000    0.352    0.035 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/fsm.py:505(union)
       10    0.001    0.000    0.352    0.035 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/fsm.py:967(parallel)
   114135    0.234    0.000    0.234    0.000 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/fsm.py:979(follow)
   198170    0.178    0.000    0.197    0.000 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/fsm.py:401(follow)
        1    0.000    0.000    0.138    0.138 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/fsm.py:249(reduce)
        1    0.001    0.001    0.138    0.138 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/fsm.py:253(reduce_brzozowski)
       13    0.000    0.000    0.137    0.011 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/fsm.py:418(__add__)
        2    0.005    0.003    0.137    0.069 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/fsm.py:612(reversed)
  1153581    0.122    0.000    0.122    0.000 {method 'add' of 'set' objects}
    24185    0.082    0.000    0.114    0.000 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/fsm.py:634(follow)
      125    0.000    0.000    0.109    0.001 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/fsm.py:428(star)
        1    0.101    0.101    0.101    0.101 /home/andrew/p/outlines/outlines/fsm/regex.py:695(get_tokens_trans_keys)
      269    0.016    0.000    0.081    0.000 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/fsm.py:112(union)
    67779    0.051    0.000    0.080    0.000 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/fsm.py:1031(get_hash)
        1    0.060    0.060    0.060    0.060 /home/andrew/p/outlines/outlines/fsm/regex.py:903(<dictcomp>)
   743602    0.055    0.000    0.055    0.000 {method 'setdefault' of 'dict' objects}
      269    0.011    0.000    0.049    0.000 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/fsm.py:114(<dictcomp>)
    65660    0.047    0.000    0.048    0.000 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/fsm.py:438(follow)
    801/1    0.002    0.000    0.042    0.042 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/patterns.py:69(get_alphabet)
     10/1    0.000    0.000    0.042    0.042 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/patterns.py:423(_get_alphabet)
    128/2    0.000    0.000    0.042    0.021 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/patterns.py:425(<genexpr>)
    118/1    0.001    0.000    0.042    0.042 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/patterns.py:330(_get_alphabet)
    787/2    0.000    0.000    0.042    0.021 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/patterns.py:331(<genexpr>)
     13/1    0.000    0.000    0.042    0.042 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/patterns.py:270(_get_alphabet)
      255    0.001    0.000    0.039    0.000 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/fsm.py:463(times)
    57272    0.016    0.000    0.039    0.000 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/interegular/fsm.py:114(<genexpr>)
      359    0.004    0.000    0.035    0.000 /nix/store/qd7h3vn2bff6jjigdvq0xh91q49sm1ng-python3.11-tqdm-4.66.4/lib/python3.11/site-packages/tqdm/std.py:1198(update)
       69    0.001    0.000    0.030    0.000 /nix/store/qd7h3vn2bff6jjigdvq0xh91q49sm1ng-python3.11-tqdm-4.66.4/lib/python3.11/site-packages/tqdm/std.py:1325(refresh)
        1    0.000    0.000    0.030    0.030 /home/andrew/p/outlines/outlines/models/transformers.py:113(__hash__)
        1    0.000    0.000    0.030    0.030 /home/andrew/p/outlines/.myenv/lib/python3.11/site-packages/datasets/fingerprint.py:226(hash)

@lapp0 lapp0 force-pushed the fix-null-byte branch 2 times, most recently from 53d8a8d to 22cbed6 on May 27, 2024 06:57
Comment thread outlines/fsm/regex.py
@@ -419,17 +416,17 @@ def _walk_fsm(
alphabet_anything_value: int,
Owner Author

alphabet_anything_value doesn't need to be passed.


2 participants