The Earley Parser in NLTK

Recently I timed the earley parser included in NTLK. The results are quite surprising. This might have to do with the various ways to measure time: timeit, cProfile and datetime are solutions that are used all over the web.

This article is about parsing and measuring time. In the end I used datetime without timedelta, I only substracted two datetime objects.

I might however write a post about the other timing-tools as well.

Data

The data I generated is very simple: I used the sentence the man saw the girl. Then I appended with the telescope. All the later executions append with the telescope, so the length of the sentence grows by 3 tokens. In addition I used a simple CFG, as shown in the code.

Timedelta

timedelta is described as “A duration expressing the difference between two date, time, or datetime instances to microsecond resolution.”

That sounds nice and I created a script which generates the following output:

Run Microseconds SentenceLength NormalizedMicroseconds
1 1567 5 313.4
2 2507 8 313.375
3 3781 11 343.727272727
4 5553 14 396.642857143
5 8172 17 480.705882353
6 10476 20 523.8
7 15129 23 657.782608696
8 23250 26 894.230769231
9 43597 29 1503.34482759
10 141894 32 4434.1875
11 434989 35 12428.2571429
12 669095 38 17607.7631579
13 428897 41 10460.902439
14 940391 44 21372.5227273

Where NormalizedMicroseconds might be visualized like this:

Visualization Output

Discussion

  • find a polynom that fits the data
  • analyze algorithm
  • strace, cprofile, and friends

Linkdump Performance Measuring