Geometric distribution , P(n) = r^n
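As a rough sanity check on the H values quoted in the runs below, here is a minimal sketch of the entropy of this source. It assumes the distribution is truncated to 256 byte symbols and renormalized; the function name is hypothetical. The H numbers in the logs look like they were measured on the generated files (sym_count drops below 256 at small r), so they need not match this model exactly.

```python
import math

def truncated_geometric_entropy(r, num_symbols=256):
    """Entropy in bits/symbol of P(n) proportional to r^n for n = 0..num_symbols-1.

    Assumption: the source is truncated to num_symbols and renormalized.
    """
    weights = [r ** n for n in range(num_symbols)]
    total = sum(weights)
    probs = [w / total for w in weights]
    return -sum(p * math.log2(p) for p in probs if p > 0.0)

if __name__ == "__main__":
    for r in (0.990, 0.950, 0.842, 0.560):
        h = truncated_geometric_entropy(r)
        print(f"r={r:.3f}  model H = {h:.3f} bits/symbol")
```

For small r the truncation is irrelevant and the model approaches the closed form for an infinite geometric source, h(r) / (1 - r), where h is the binary entropy function.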

I am comparing "Marlin" (plural Tunstall with the P_state word probability model) against naive plural Tunstall (P_word = P_naive). Both cases use 8-byte output words and 12-bit codes.

Marlin :

filelen = 1000000 H = 7.658248 sym_count = 256 r=0.990 : 1,000,000 -> 1,231,686 = 9.853 bpb = 0.812 to 1
decode_time2 : seconds:0.0018 ticks per: 3.064 b/kc : 326.42 MB/s : 564.38

filelen = 1000000 H = 7.345420 sym_count = 256 r=0.985 : 1,000,000 -> 1,126,068 = 9.009 bpb = 0.888 to 1
decode_time2 : seconds:0.0016 ticks per: 2.840 b/kc : 352.15 MB/s : 608.87

filelen = 1000000 H = 6.878983 sym_count = 256 r=0.978 : 1,000,000 -> 990,336 = 7.923 bpb = 1.010 to 1
decode_time2 : seconds:0.0014 ticks per: 2.497 b/kc : 400.54 MB/s : 692.53

filelen = 1000000 H = 6.323152 sym_count = 256 r=0.967 : 1,000,000 -> 862,968 = 6.904 bpb = 1.159 to 1
decode_time2 : seconds:0.0013 ticks per: 2.227 b/kc : 449.08 MB/s : 776.45

filelen = 1000000 H = 5.741045 sym_count = 226 r=0.950 : 1,000,000 -> 779,445 = 6.236 bpb = 1.283 to 1
decode_time2 : seconds:0.0012 ticks per: 2.021 b/kc : 494.83 MB/s : 855.57

filelen = 1000000 H = 5.155050 sym_count = 150 r=0.927 : 1,000,000 -> 701,049 = 5.608 bpb = 1.426 to 1
decode_time2 : seconds:0.0011 ticks per: 1.821 b/kc : 549.09 MB/s : 949.37

filelen = 1000000 H = 4.572028 sym_count = 109 r=0.892 : 1,000,000 -> 611,238 = 4.890 bpb = 1.636 to 1
decode_time2 : seconds:0.0009 ticks per: 1.577 b/kc : 633.93 MB/s : 1096.07

filelen = 1000000 H = 3.986386 sym_count = 78 r=0.842 : 1,000,000 -> 529,743 = 4.238 bpb = 1.888 to 1
decode_time2 : seconds:0.0008 ticks per: 1.407 b/kc : 710.53 MB/s : 1228.51

filelen = 1000000 H = 3.405910 sym_count = 47 r=0.773 : 1,000,000 -> 450,585 = 3.605 bpb = 2.219 to 1
decode_time2 : seconds:0.0007 ticks per: 1.237 b/kc : 808.48 MB/s : 1397.86

filelen = 1000000 H = 2.823256 sym_count = 36 r=0.680 : 1,000,000 -> 373,197 = 2.986 bpb = 2.680 to 1
decode_time2 : seconds:0.0006 ticks per: 1.053 b/kc : 950.07 MB/s : 1642.67

filelen = 1000000 H = 2.250632 sym_count = 23 r=0.560 : 1,000,000 -> 298,908 = 2.391 bpb = 3.346 to 1
decode_time2 : seconds:0.0005 ticks per: 0.891 b/kc : 1122.53 MB/s : 1940.85

vs. plural Tunstall :

filelen = 1000000 H = 7.658248 sym_count = 256 r=0.99000 : 1,000,000 -> 1,239,435 = 9.915 bpb = 0.807 to 1
decode_time2 : seconds:0.0017 ticks per: 2.929 b/kc : 341.46 MB/s : 590.39

filelen = 1000000 H = 7.345420 sym_count = 256 r=0.98504 : 1,000,000 -> 1,130,025 = 9.040 bpb = 0.885 to 1
decode_time2 : seconds:0.0016 ticks per: 2.814 b/kc : 355.36 MB/s : 614.41

filelen = 1000000 H = 6.878983 sym_count = 256 r=0.97764 : 1,000,000 -> 990,855 = 7.927 bpb = 1.009 to 1
decode_time2 : seconds:0.0014 ticks per: 2.416 b/kc : 413.96 MB/s : 715.73

filelen = 1000000 H = 6.323152 sym_count = 256 r=0.96665 : 1,000,000 -> 861,900 = 6.895 bpb = 1.160 to 1
decode_time2 : seconds:0.0012 ticks per: 2.096 b/kc : 477.19 MB/s : 825.07

filelen = 1000000 H = 5.741045 sym_count = 226 r=0.95039 : 1,000,000 -> 782,118 = 6.257 bpb = 1.279 to 1
decode_time2 : seconds:0.0011 ticks per: 1.898 b/kc : 526.96 MB/s : 911.12

filelen = 1000000 H = 5.155050 sym_count = 150 r=0.92652 : 1,000,000 -> 704,241 = 5.634 bpb = 1.420 to 1
decode_time2 : seconds:0.0010 ticks per: 1.681 b/kc : 594.73 MB/s : 1028.29

filelen = 1000000 H = 4.572028 sym_count = 109 r=0.89183 : 1,000,000 -> 614,061 = 4.912 bpb = 1.629 to 1
decode_time2 : seconds:0.0008 ticks per: 1.457 b/kc : 686.27 MB/s : 1186.57

filelen = 1000000 H = 3.986386 sym_count = 78 r=0.84222 : 1,000,000 -> 534,300 = 4.274 bpb = 1.872 to 1
decode_time2 : seconds:0.0007 ticks per: 1.254 b/kc : 797.33 MB/s : 1378.58

filelen = 1000000 H = 3.405910 sym_count = 47 r=0.77292 : 1,000,000 -> 454,059 = 3.632 bpb = 2.202 to 1
decode_time2 : seconds:0.0006 ticks per: 1.078 b/kc : 928.04 MB/s : 1604.58

filelen = 1000000 H = 2.823256 sym_count = 36 r=0.67952 : 1,000,000 -> 377,775 = 3.022 bpb = 2.647 to 1
decode_time2 : seconds:0.0005 ticks per: 0.935 b/kc : 1069.85 MB/s : 1849.77

filelen = 1000000 H = 2.250632 sym_count = 23 r=0.56015 : 1,000,000 -> 304,887 = 2.439 bpb = 3.280 to 1
decode_time2 : seconds:0.0004 ticks per: 0.724 b/kc : 1381.21 MB/s : 2388.11

The difference is very small. eg :

plural Tunstall :
H = 3.986386 sym_count = 78 r=0.84222 : 1,000,000 -> 534,300 = 4.274 bpb = 1.872 to 1

Marlin :
H = 3.986386 sym_count = 78 r=0.842 : 1,000,000 -> 529,743 = 4.238 bpb = 1.888 to 1
decode_time2 : seconds:0.0008 ticks per: 1.407 b/kc : 710.53 MB/s : 1228.51

Yes the Marlin word probability estimator helps a little bit, but it's not massive.
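To put a number on "a little bit", here is a small sketch that turns the compressed sizes from the two sets of runs into a per-row difference in bits per byte. The sizes are copied from the logs above; the helper name and table layout are just for illustration.

```python
ORIGINAL_BYTES = 1_000_000

# (r, Marlin compressed bytes, naive plural Tunstall compressed bytes),
# copied from the runs listed above.
RUNS = [
    (0.990, 1_231_686, 1_239_435),
    (0.985, 1_126_068, 1_130_025),
    (0.978,   990_336,   990_855),
    (0.967,   862_968,   861_900),
    (0.950,   779_445,   782_118),
    (0.927,   701_049,   704_241),
    (0.892,   611_238,   614_061),
    (0.842,   529_743,   534_300),
    (0.773,   450_585,   454_059),
    (0.680,   373_197,   377_775),
    (0.560,   298_908,   304_887),
]

def delta_bpb(marlin_bytes, naive_bytes, original_bytes=ORIGINAL_BYTES):
    """Bits per input byte saved by Marlin relative to naive plural Tunstall.

    Positive means Marlin is smaller; negative means the naive code won.
    """
    return 8.0 * (naive_bytes - marlin_bytes) / original_bytes

if __name__ == "__main__":
    for r, marlin_bytes, naive_bytes in RUNS:
        saved = delta_bpb(marlin_bytes, naive_bytes)
        percent = 100.0 * (naive_bytes - marlin_bytes) / naive_bytes
        print(f"r={r:.3f}  saved {saved:+.4f} bpb  ({percent:+.2f}% of naive size)")
```

On these numbers the saving stays under 0.07 bpb in every row, and in the r=0.967 row the naive code comes out slightly smaller.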

I'm not surprised but a bit sad to say that once again the Marlin paper compares to ridiculous straw men and doesn't compare to the most obvious, naive, and well known (see Savari for example, or Yamamoto & Yokoo) similar alternative - just doing plural Tunstall/VTF without the Marlin word probability model.

Entropy above 4 or so is terrible for 12-bit VTF codes.

The Marlin paper uses a "percent efficiency" scale which I find rather misleading. For example, this :

H = 3.986386 sym_count = 78 r=0.842 : 1,000,000 -> 529,743 = 4.238 bpb = 1.888 to 1

is what I would consider pretty poor entropy coding. Entropy of 3.98 -> 4.23 bpb is way off. But as a "percent efficiency" it's 94% , which is really high on their graphs.

The more standard and IMO useful way to show this is a delta of your output bits minus the entropy, eg.

excess = 4.238 - 3.986 = 0.252

That's about a quarter of a bit per byte wasted. A true arithmetic coder has an excess around 0.001 bpb typically. The worst you can ever do is an excess of 1.0 , which occurs in any integer-bit entropy coder as the probability of the MPS goes towards 1.0
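The two ways of scoring the same run can be compared side by side. This is a minimal sketch using a few Marlin rows from above; it assumes "percent efficiency" means entropy divided by coded bits per byte, which matches the 94% figure quoted here but is not taken from the paper's text. Function names are hypothetical.

```python
ORIGINAL_BYTES = 1_000_000

# (H in bits/byte, Marlin compressed bytes), copied from the Marlin runs above.
MARLIN_RUNS = [
    (7.658248, 1_231_686),
    (5.741045,   779_445),
    (3.986386,   529_743),
    (2.250632,   298_908),
]

def coded_bpb(compressed_bytes, original_bytes=ORIGINAL_BYTES):
    """Output bits per input byte."""
    return 8.0 * compressed_bytes / original_bytes

def excess_bpb(entropy, bpb):
    """Bits per byte spent above the entropy."""
    return bpb - entropy

def percent_efficiency(entropy, bpb):
    """Assumed definition: entropy as a percentage of the coded size."""
    return 100.0 * entropy / bpb

if __name__ == "__main__":
    for entropy, compressed_bytes in MARLIN_RUNS:
        bpb = coded_bpb(compressed_bytes)
        print(f"H={entropy:.3f}  bpb={bpb:.3f}  "
              f"excess={excess_bpb(entropy, bpb):.3f}  "
              f"efficiency={percent_efficiency(entropy, bpb):.1f}%")
```

The percentage compresses the scale: an excess of a quarter bit per byte still reads as roughly 94% efficient at H near 4.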

Part of my hope / curiosity in investigating this was wondering whether the Marlin procedure would help at all with the way Tunstall VTF codes really collapse in the H > 4 range , and the answer is - no , it doesn't help with that problem at all.

Anyway, on to more results.
