Two Bites of Data Science in K

Here are two digestible examples of data analysis using K. Both cases are afternoon-scale projects I’ve done in the last few weeks.

The Most Common Consonants Following r and l in English

For six months or so I’ve been developing a written shorthand, Smith Shorthand. Smith is written more-or-less on the line; that is, its signs are generally written upright, with their bases resting on the baseline, just as in normal handwriting. Raising a sign from the baseline is therefore meaningful: a raised sign denotes its sound followed by a t. For instance, the k sign, when raised, reads kt (as in exact).

There are two signs in particular that are only half as high as a normal sign: r and l. Thus, these signs can be raised singly, to denote rt and lt—but owing to their height they (and only they) can be raised doubly too.

Thus we ask ourselves: if doubly raising a sign were to denote the presence of some other sound (and t is already accounted for), what would be the most useful sound to indicate?

The code

The file we’re reading from is a copy of the CMU Pronouncing Dictionary. It consists of a series of lines shaped like this:

BRIGADEER  B R IH2 G AH0 D IH1 R

That is: a headword, two spaces, and then a sequence of space-separated consonant and vowel sounds (the numbers here indicate syllable stress; we can ignore them since we’re only interested in consonants).

Our K script:

dict: 0: "cmudict.nopreamble.txt"
sounds: 2_'" "\'dict

isRL: ~^(,"R";,"L")?
followsRL: -1=-': isRL@

followers: (~#:')_{x@&followsRL[x]}'sounds
counts: #'=,/followers
>counts

The output:

("AH0"
 "IY0"
 "IH0"
 "EH1"
 "AE1"
 "IH1"
 ,"D"
 ,"T"
 "IY1"
 ,"Z"
 ...

Which is a list of sounds, ordered by the frequency with which they occur after an R or L. We see that the first consonant sound in the list is D; thus, doubly-raising signs should indicate the presence of a consonant cluster with d.

Let us break down the script:

Preparing the data

dict: 0: "cmudict.nopreamble.txt"
sounds: 2_'" "\'dict

In the first line we simply read the text file into dict, breaking on newlines (the nopreamble here indicates that I had manually gone into the file and removed some header materials). We then split each line on " ", and finally we drop the first two tokens of each line: the headword and an empty string. sounds is thus an array of arrays of sound tokens, one for each entry in the dictionary.
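
To see the shape of the data, here is a minimal sketch that runs the same split-and-drop on the single example line from above rather than on the whole file. The name line is introduced only for illustration; the comments show the value of each expression.

```k
line: "BRIGADEER  B R IH2 G AH0 D IH1 R"   / one dictionary entry (hypothetical variable)
" "\line     / ("BRIGADEER";"";,"B";,"R";"IH2";,"G";"AH0";,"D";"IH1";,"R")
2_" "\line   / (,"B";,"R";"IH2";,"G";"AH0";,"D";"IH1";,"R")
```

Note that one-letter sounds come out as one-character strings such as ,"B", which matters for the comparison in the next step.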

Selecting characters after an R or L

isRL: ~^(,"R";,"L")?
followsRL: -1=-': isRL@

isRL is a simple function which returns whether its argument is the string "R" or "L" (the odd ,s there are due to the fact that ngn/k represents the character R as "R", and the string R as ,"R"). followsRL applies that function to a sequence, then applies subtraction to the resulting booleans with the endlessly-useful eachprior. That is, it subtracts the previous value from each value. Any time a non-RL (a 0) follows an RL (a 1), the result is 0-1, or -1; comparing against -1 thus highlights all the sounds we’re interested in.
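
As a rough illustration, here is the same arithmetic on the BRIGADEER entry. The mask below is written out by hand as the value I would expect isRL to return for B R IH2 G AH0 D IH1 R, so treat it as an assumption rather than captured output; s and mask are names made up for this sketch.

```k
s: (,"B";,"R";"IH2";,"G";"AH0";,"D";"IH1";,"R")
mask: 0 1 0 0 0 0 0 1   / assumed isRL result: 1 wherever the sound is R or L
-':mask                 / 0 1 -1 0 0 0 0 1  (each value minus the one before it)
-1=-':mask              / 0 0 1 0 0 0 0 0   (1 only where a non-RL follows an RL)
s@&-1=-':mask           / ,"IH2"            (the sounds at those positions)
```

The final R has nothing after it, so it contributes no follower.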

Applying our selector and gathering the results

followers: (~#:')_{x@&followsRL[x]}'sounds
counts: #'=,/followers
>counts

Applying {x@&followsRL[x]} to each entry in sounds pulls out the sounds following R and L in each entry; we then filter out any empty sequences.

To count our occurrences we simply join the results up into a single sequence, group them and find the size of each group. >, conveniently, returns the dictionary’s keys sorted by descending value.
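
A small sketch of the counting step, using a made-up flat list of followers (as if ,/ had already joined them); f and c are hypothetical names.

```k
f: ("IH2";,"D";"IH2";,"T";,"D";"IH2")   / invented followers, already flattened
c: #'=f    / group, then count each group: ("IH2";,"D";,"T")!3 2 1
>c         / keys by descending count: ("IH2";,"D";,"T")
```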

“no bowler with as many wickets has a better average”.

I recently saw this post on r/cricket, which in turn linked to this tweet:

Bumrah now has 190 Test wickets.

What is special about this number, you may ask.

190 allows us to use “no bowler with as many wickets has a better average”.

You see, Barnes had 189 wickets.

190 is the 6997 for bowlers.

This made me want to do a bit of cricket data science. The linked post tried to answer the question “how many bowlers you can say this about”, but it seemed to rather overcomplicate the question. It seems to me that the question is precisely this:

For all n, what bowler with a lifetime wickets haul of n has the best average?

It bears noting that to be such a bowler is statistically interesting but not necessarily impressive. There’s a stronger formulation of the question that’s more impressive.

\l cricket.k

filtered: {x[`Wkts]>0}#bowling.t
sorted: filtered @> {(x[`Wkts];-x[`Ave])} @' filtered
best: sorted @ &~~': allWkts:({x[`Wkts]} @' sorted)

bestInClass:best @ &{(*|x)=&/x} @' ,\ best @ `Ave

mostCompetitive: &(#'=allWkts)>15
mostCompetitiveBowlers: {~^mostCompetitive?x[`Wkts]}#best

gap:-2#best @ &-1>-': ?allWkts

The module cricket.k defines a few utilities that I think we can pass over.

In particular it defines bowling.t, a table containing this data set. That is, “ICC Test Cricket Bowling Figures”: the lifetime test bowling figures for 3003 cricketers. It’s accurate up until Jan 2020, so we will not be able to reproduce the finding in the tweet. Anyone who can backfill the data for the last five years will be greatly appreciated!

The table looks like this:

+![ `Player                   `Span       `Mat `Inns `Balls `Runs `Wkts `BBM     `..
 +(("M Muralitharan (ICC/SL)";"1992-2010"; 133;  230; 44039;18180;  800;"16/220";..)
   ("SK Warne (AUS)"         ;"1992-2007"; 145;  273; 40705;17995;  708;"12/128";..)
   ("A Kumble (INDIA)"       ;"1990-2008"; 132;  236; 40850;18355;  619;"14/149";..)
   ("JM Anderson (ENG)"      ;"2003-2020"; 151;  282; 32779;15670;  584;"Nov-71";..)
   ("GD McGrath (AUS)"       ;"1993-2007"; 124;  243; 29248;12186;  563;"27-Oct";..)
   ("CA Walsh (WI)"          ;"1984-2001"; 132;  242; 30019;12688;  519;"13/55" ;..)
   ("SCJ Broad (ENG)"        ;"2007-2020"; 136;  250; 27793;13730;  479;"11/121";..)
   ("DW Steyn (SA)"          ;"2004-2019";  93;  171; 18608;10077;  439;"Nov-60";..)
   ("N Kapil Dev (INDIA)"    ;"1978-1994"; 131;  227; 27740;12867;  434;"11/146";..)
   ("HMRKB Herath (SL)"      ;"1999-2018";  93;  170; 25993;12157;  433;"14/184";..)
   ("Sir RJ Hadlee (NZ)"     ;"1973-1990";  86;  150; 21918; 9611;  431;"15/123";..)
   ("SM Pollock (SA)"        ;"1995-2008"; 108;  202; 24353; 9733;  421;"10/147";..)
   ("Harbhajan Singh (INDIA)";"1998-2015"; 103;  190; 28580;13537;  417;"15/217";..)
   ("Wasim Akram (PAK)"      ;"1985-2002"; 104;  181; 22627; 9779;  414;"11/110";..)
   ("CEL Ambrose (WI)"       ;"1988-2000";  98;  179; 22103; 8501;  405;"Nov-84";..)
   ("NM Lyon (AUS)"          ;"2011-2020";  96;  184; 24568;12320;  390;"13/154";..)
   ("M Ntini (SA)"           ;"1998-2009"; 101;  190; 20834;11242;  390;"13/132";..)
   ("IT Botham (ENG)"        ;"1977-1992"; 102;  168; 21815;10878;  383;"13/106";..)

We’ll be interested in the Wkts and Ave columns.

“no bowler with as many wickets has a better average”.

Let’s first answer the question as written.

filtered: {x[`Wkts]>0}#bowling.t
sorted: filtered @> {(x[`Wkts];-x[`Ave])} @' filtered
best: sorted @ &~~': allWkts:({x[`Wkts]} @' sorted)

bowling.t is a table, so we’ll be using ngn/k’s syntax for slicing and dicing tabular data. First we’ll do some basic cleanup by filtering for bowlers who have taken at least one wicket in their career. That leaves 1784 bowlers.

We then sort the table by lifetime wickets descending, secondarily by average ascending.

Once the table has been correctly sorted we index into it in the right places. ~~': is another eachprior application, this time of ~ (match), with the result negated: it marks each place where the number of wickets changes, and & turns those marks into indices. Because of the sort order, each such index is that of the bowler with the best lifetime average for that number of wickets. best thus holds, for each wicket count, the bowler with the best average among all bowlers with that count.

no bowler with as many wickets or more has a better average

I would like to answer the more interesting and impressive question: what bowlers have a better lifetime average than any other bowler who has taken as many or more wickets? In other words, what bowlers define the outer edge of the graph between average and wickets; what bowlers could you credibly claim to be “best in class”, surpassed in average only by worse wicket-takers, or in lifetime wickets by bowlers with worse averages?

bestInClass:best @ &{(*|x)=&/x} @' ,\ best @ `Ave

Every such bowler must at least have the best average among bowlers with the same number of wickets, so we can start with the result of the weaker formulation.

We’ll then take the lifetime averages of each bowler and take a series of prefixes with ,\: this produces, for each bowler, a sequence holding the average of every bowler with more lifetime wickets, followed by that bowler’s own average.

We then test each sequence for the predicate {(*|x)=&/x}, or: is the last value in the sequence (the average of the bowler under question) also the minimum value?
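
To make the prefix test concrete, here is a sketch on five invented averages, listed in descending order of wickets as in best. The name a is hypothetical.

```k
a: 22.7 21.6 23.5 20.9 24.1   / invented averages, most wickets first
,\a                           / (22.7;22.7 21.6;22.7 21.6 23.5;22.7 21.6 23.5 20.9;22.7 21.6 23.5 20.9 24.1)
{(*|x)=&/x}',\a               / 1 1 0 1 0
&{(*|x)=&/x}',\a              / 0 1 3
```

The third and fifth bowlers drop out because someone with more wickets already has a lower average.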

We are left with a more rarefied and interesting bunch, about whom we can say that no bowler had a better average, save those who took fewer wickets:

+![ `Player                   `Span       `Mat `Inns `Balls `Runs `Wkts `BBM     `..
 +(("M Muralitharan (ICC/SL)";"1992-2010"; 133;  230; 44039;18180;  800;"16/220";..)
   ("GD McGrath (AUS)"       ;"1993-2007"; 124;  243; 29248;12186;  563;"27-Oct";..)
   ("CEL Ambrose (WI)"       ;"1988-2000";  98;  179; 22103; 8501;  405;"Nov-84";..)
   ("MD Marshall (WI)"       ;"1978-1991";  81;  151; 17584; 7876;  376;"Nov-89";..)
   ("SF Barnes (ENG)"        ;"1901-1914";  27;   50;  7873; 3106;  189;"17/159";..)
   ("GA Lohmann (ENG)"       ;"1886-1896";  18;   36;  3830; 1205;  112;"15/45" ;..)
   ("F Martin (ENG)"         ;"1890-1892";   2;    3;   410;  141;   14;"12/102";..)
   ("CS Marriott (ENG)"      ;"1933-1933";   1;    2;   247;   96;   11;"Nov-96";..)
   ("CA Smith (ENG)"         ;"1889-1889";   1;    2;   154;   61;    7;"Jul-61";..)
   ("AJL Hill (ENG)"         ;"1896-1896";   3;    1;    40;    8;    4;"8-Apr" ;..)
   ("W Barber (ENG)"         ;"1935-1935";   2;    1;     2;    0;    1;"Jan-00";..))]

The most “competitive” lifetime wicket counts

The original formulation of the question prompts some more trivial questions. The first: if we are interested in the best lifetime average for each n, what are the most “competitive” ns—what lifetime wicket counts are held by the most bowlers?

mostCompetitive: &(#'=allWkts)>15
mostCompetitiveBowlers: {~^mostCompetitive?x[`Wkts]}#best

Here we return to allWkts, which is simply the lifetime wicket count of every bowler in the table. We group equal values with = and then find the wicket counts that are held by more than (arbitrarily) 15 bowlers.
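
A toy version of the same grouping, with a made-up list of wicket counts and a threshold of 2 instead of 15; w is a hypothetical name.

```k
w: 5 5 3 3 3 1   / invented wicket counts
#'=w             / 5 3 1!2 3 1   (how many bowlers hold each count)
(#'=w)>2         / 5 3 1!0 1 0
&(#'=w)>2        / ,3            (only the count 3 is held by more than 2 bowlers)
```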

We can then select those members of best whose wicket counts are in the group, identifying the bowlers who had to beat the most other players to win their particular bracket.

+![ `Player                 `Span       `Mat `Inns `Balls `Runs `Wkts `BBM     `Av..
 +(("GF Bissett (SA)"      ;"1927-1928";   4;    8;   989;  469;   25;"Sep-90";18..)
   ("J Middleton (SA)"     ;"1896-1902";   6;   11;  1064;  442;   24;"9/130" ;18..)
   ("JB Iverson (AUS)"     ;"1950-1951";   5;    8;  1108;  320;   21;"Jun-52";15..)
   ("Zulfiqar Ahmed (PAK)" ;"1952-1956";   9;   10;  1285;  366;   20;"Nov-79";18..)
   ("J Trim (WI)"          ;"1948-1952";   4;    8;   794;  291;   18;"Jul-76";16..)
   ("TS Roland-Jones (ENG)";"2017-2017";   4;    8;   536;  334;   17;"8/129" ;19..)
...

We also see, incidentally, that 25 wickets is the highest total with more than 15 bowlers holding it.

The lowest gap in lifetime wickets

Having sorted our bowlers by lifetime wickets, another question presents itself: what is the smallest n that is not held (as of January 2020) as a lifetime total by any bowler?

gap:-2#best @ &-1>-': ?allWkts

Once again, we use eachprior: this time, we take the unique values for lifetime wickets and find the difference between each adjacent value. Since the values are in descending order, consecutive totals differ by exactly -1; wherever the difference is less than -1, there’s a gap between the two values. We then take the indices of the last two and index into best (which is similarly deduped by number of lifetime wickets).

(("BKV Prasad (INDIA)";96;35.0;"http://stats.espncricinfo.com/ci/content/player/32345.html")
 ("FR Spofforth (AUS)";94;18.41;"http://stats.espncricinfo.com/ci/content/player/7663.html"))

We see that the lowest value not held is 95.
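
The same eachprior test can be tried on a short descending list of invented totals; u is a hypothetical name.

```k
u: 10 9 7 6 4 3 2 1   / invented distinct totals, descending
-':u                  / 10 -1 -2 -1 -2 -1 -1 -1
&-1>-':u              / 2 4   (positions where the drop is more than one)
u@&-1>-':u            / 7 4   (the totals sitting just below each gap)
```

The last entry, 4, sits just below the lowest gap: the next total up is 6, so 5 is the smallest missing value in this toy list.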

Bonus: the lowest gap in lifetime wickets, redux

Another way of answering the question “what is the smallest n that is not held as a lifetime total?” is to do some simple set arithmetic. We can also write

&/(1+!|/allWkts)^allWkts

Which will remove all extant values from the set of all possible values (from 1 to the maximum lifetime wickets value), and find the lowest value in the remaining set.
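
On a short list of invented totals the steps look like this; u is a hypothetical name.

```k
u: 10 9 7 6 4 3 2 1   / invented totals
1+!|/u                / 1 2 3 4 5 6 7 8 9 10  (every candidate from 1 to the maximum)
(1+!|/u)^u            / 5 8                   (candidates nobody holds)
&/(1+!|/u)^u          / 5
```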

