2025-01-23
Here are two digestible examples of data analysis using K. Both cases are afternoon-scale projects I’ve done in the last few weeks.
For six months or so I’ve been developing a written shorthand, Smith Shorthand. Smith is written more-or-less on the line, that is: its signs are generally written upright, with their bases resting on the baseline, just as in normal handwriting. Thus raising signs from the baseline is meaningful. Raising a sign above the baseline denotes that sign’s sound followed by a t; for instance, the k sign, when raised, reads kt (as in exact).
There are two signs in particular that are only half as high as a normal sign: r and l. Thus, these signs can be raised singly, to denote rt and lt—but owing to their height they (and only they) can be raised doubly too.
Thus we ask ourselves: if doubly raising a sign were to denote the presence of some other sound (and t is already accounted for), what would be the most useful sound to indicate?
The file we’re reading from is a copy of the CMU Pronouncing Dictionary. It consists of a series of lines shaped like this:
BRIGADEER B R IH2 G AH0 D IH1 R
That is: a headword, two spaces, and then a sequence of space-separated consonant and vowel sounds (the numbers here indicate syllable stress; we can ignore them since we’re only interested in consonants).
Our K script:
dict: 0: "cmudict.nopreamble.txt"
sounds: 2_'" "\'dict
isRL: ~^(,"R";,"L")?
followsRL: -1=-': isRL@
followers: (~#:')_{x@&followsRL[x]}'sounds
counts: #'=,/followers
>counts
The output:
("AH0"
"IY0"
"IH0"
"EH1"
"AE1"
"IH1"
,"D"
,"T"
"IY1"
,"Z"
...
Which is a list of sounds, ordered by the frequency with which they occur after an R or L. We see that the first consonant sound in the list is D; thus, doubly-raising signs should indicate the presence of a consonant cluster with d.
Let us break down the script:
dict: 0: "cmudict.nopreamble.txt"
sounds: 2_'" "\'dict
In the first line we simply read the text file into dict, breaking on newlines (the nopreamble here indicates that I had manually gone into the file and removed some header materials). We then split each line on " ", and finally we drop the first two tokens of each line: the headword and an empty string. sounds is thus an array of arrays of sound tokens, one for each entry in the dictionary.
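For readers less fluent in K, here is a rough Python paraphrase of the split-and-drop step. It is only a sketch: `parse_sounds` is a hypothetical helper name, and the single sample line is the one quoted earlier rather than the whole dictionary file.

```python
# The sample line quoted in the article: headword, two spaces, then sounds.
sample_line = "BRIGADEER  B R IH2 G AH0 D IH1 R"

def parse_sounds(line):
    """Split on single spaces, then drop the first two tokens.

    Because of the double space, the split yields the headword followed by
    an empty string; both are dropped, mirroring the 2_ drop in the K script.
    """
    return line.split(" ")[2:]

sounds = [parse_sounds(sample_line)]
print(sounds[0])
```

Splitting on a single space (rather than on runs of whitespace) is what produces the empty token that the K script has to drop.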
isRL: ~^(,"R";,"L")?
followsRL: -1=-': isRL@
isRL is a simple function which returns whether its arguments are the strings "R" or "L" (the odd ,s there are due to the fact that ngn/k represents the character R as "R", and the string R as ,"R"). followsRL applies that function to a sequence, then applies subtraction to the resulting booleans with the endlessly-useful eachprior. That is, it subtracts the previous value from each value. Any time a non-RL (a 0) follows an RL (a 1), we should expect the result to be -1; thus we highlight all the sounds we’re interested in.
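The pairwise-difference trick can be sketched in Python as follows. This is a paraphrase, not a translation of ngn/k semantics: `is_rl` and `follows_rl` are hypothetical names, and treating the first element as if it were preceded by a 0 is an assumption made here for convenience (it can never produce a -1 either way).

```python
def is_rl(sound):
    """True when the sound token is exactly "R" or "L"."""
    return sound in ("R", "L")

def follows_rl(sounds):
    """Flag each position where a non-R/L sound directly follows an R or L."""
    flags = [int(is_rl(s)) for s in sounds]
    # Current flag minus previous flag; the first item is paired with 0
    # (an assumption of this sketch).
    diffs = [cur - prev for prev, cur in zip([0] + flags[:-1], flags)]
    # 0 - 1 == -1 exactly when a non-R/L follows an R/L.
    return [d == -1 for d in diffs]

entry = ["B", "R", "IH2", "G", "AH0", "D", "IH1", "R"]
print([s for s, hit in zip(entry, follows_rl(entry)) if hit])
```

For the BRIGADEER entry the only flagged sound is the IH2 after the first R; the final R is not followed by anything.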
followers: (~#:')_{x@&followsRL[x]}'sounds
counts: #'=,/followers
>counts
Applying {x@&followsRL[x]} to each entry in sounds pulls out the sounds following R and L in each entry; we then filter out any empty sequences. To count our occurrences we simply join the results up into a single sequence, group them and find the size of each group. >, conveniently, sorts the dictionary by its values in descending order.
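As a cross-check on the join, group and count steps, here is a small Python sketch using `collections.Counter`. The `followers` helper is a hypothetical name, and every entry except the first is made up for illustration, so the counts say nothing about the real dictionary.

```python
from collections import Counter

def followers(sounds):
    """Sounds that directly follow an R or L and are not themselves R or L."""
    return [cur for prev, cur in zip(sounds, sounds[1:])
            if prev in ("R", "L") and cur not in ("R", "L")]

entries = [
    ["B", "R", "IH2", "G", "AH0", "D", "IH1", "R"],  # BRIGADEER, from the article
    ["HH", "AA1", "R", "D"],                         # illustrative entry
    ["OW1", "L", "D"],                               # illustrative entry
    ["K", "AA1", "R", "T"],                          # illustrative entry
]

# Join every entry's followers into one stream and count each sound.
counts = Counter(s for entry in entries for s in followers(entry))

# most_common() orders by descending count, much as > does in the K script.
ranked = [sound for sound, _ in counts.most_common()]
print(ranked)
```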
I recently saw this post on r/cricket, which in turn linked to this tweet:
Bumrah now has 190 Test wickets.
What is special about this number, you may ask.
190 allows us to use “no bowler with as many wickets has a better average”.
You see, Barnes had 189 wickets.
190 is the 6997 for bowlers.
This made me want to do a bit of cricket data science. The linked post tried to answer the question “how many bowlers can you say this about?”, but it seemed to rather overcomplicate the question. It seems to me that the question is precisely this:
For all n, what bowler with a lifetime wickets haul of n has the best average?
It bears noting that to be such a bowler is statistically interesting but not necessarily impressive. There’s a stronger formulation of the question that’s more impressive.
\l cricket.k
filtered: {x[`Wkts]>0}#bowling.t
sorted: filtered @> {(x[`Wkts];-x[`Ave])} @' filtered
best: sorted @ &~~': allWkts:({x[`Wkts]} @' sorted)
bestInClass:best @ &{(*|x)=&/x} @' ,\ best @ `Ave
mostCompetitive: &(#'=allWkts)>15
mostCompetitiveBowlers: {~^mostCompetitive?x[`Wkts]}#best
gap:-2#best @ &-1>-': ?allWkts
The module cricket.k defines a few utilities that I think we can pass over. In particular it defines bowling.t, a table containing this data set. That is, “ICC Test Cricket Bowling Figures”: the lifetime test bowling figures for 3003 cricketers. It’s accurate up until Jan 2020, so we will not be able to reproduce the finding in the tweet. Anyone who can backfill the data for the last five years will be greatly appreciated!
The table looks like this:
+![ `Player `Span `Mat `Inns `Balls `Runs `Wkts `BBM `..
+(("M Muralitharan (ICC/SL)";"1992-2010"; 133; 230; 44039;18180; 800;"16/220";..)
("SK Warne (AUS)" ;"1992-2007"; 145; 273; 40705;17995; 708;"12/128";..)
("A Kumble (INDIA)" ;"1990-2008"; 132; 236; 40850;18355; 619;"14/149";..)
("JM Anderson (ENG)" ;"2003-2020"; 151; 282; 32779;15670; 584;"Nov-71";..)
("GD McGrath (AUS)" ;"1993-2007"; 124; 243; 29248;12186; 563;"27-Oct";..)
("CA Walsh (WI)" ;"1984-2001"; 132; 242; 30019;12688; 519;"13/55" ;..)
("SCJ Broad (ENG)" ;"2007-2020"; 136; 250; 27793;13730; 479;"11/121";..)
("DW Steyn (SA)" ;"2004-2019"; 93; 171; 18608;10077; 439;"Nov-60";..)
("N Kapil Dev (INDIA)" ;"1978-1994"; 131; 227; 27740;12867; 434;"11/146";..)
("HMRKB Herath (SL)" ;"1999-2018"; 93; 170; 25993;12157; 433;"14/184";..)
("Sir RJ Hadlee (NZ)" ;"1973-1990"; 86; 150; 21918; 9611; 431;"15/123";..)
("SM Pollock (SA)" ;"1995-2008"; 108; 202; 24353; 9733; 421;"10/147";..)
("Harbhajan Singh (INDIA)";"1998-2015"; 103; 190; 28580;13537; 417;"15/217";..)
("Wasim Akram (PAK)" ;"1985-2002"; 104; 181; 22627; 9779; 414;"11/110";..)
("CEL Ambrose (WI)" ;"1988-2000"; 98; 179; 22103; 8501; 405;"Nov-84";..)
("NM Lyon (AUS)" ;"2011-2020"; 96; 184; 24568;12320; 390;"13/154";..)
("M Ntini (SA)" ;"1998-2009"; 101; 190; 20834;11242; 390;"13/132";..)
("IT Botham (ENG)" ;"1977-1992"; 102; 168; 21815;10878; 383;"13/106";..)
We’ll be interested in the Wkts and Ave columns.
Let’s first answer the question as written.
filtered: {x[`Wkts]>0}#bowling.t
sorted: filtered @> {(x[`Wkts];-x[`Ave])} @' filtered
best: sorted @ &~~': allWkts:({x[`Wkts]} @' sorted)
bowling.t is a table, so we’ll be using ngn/k’s syntax for slicing and dicing tabular data. First we’ll do some basic cleanup by filtering for bowlers who have taken at least one wicket in their career. That leaves 1784 bowlers.
We then sort the table by lifetime wickets descending, secondarily by average ascending.
Once the table has been correctly sorted we index into it in the right places.
~~': is another eachprior application, this time of ~~; combined with &, it gives us the indices of each place that the number of wickets changes. Having sorted, each such index is that of the bowler with the best lifetime average for that number of wickets. Here are the bowlers with the best average of all bowlers with the same number of lifetime wickets.
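The filter, sort and pick-at-each-change steps can be paraphrased in Python like this. It is a sketch under stated assumptions: rows are plain dicts, the four bowlers are invented placeholders rather than rows from the data set, and `ordered` stands in for the article's sorted table.

```python
# Invented placeholder rows; only the Wkts and Ave columns matter here.
bowlers = [
    {"Player": "A", "Wkts": 10, "Ave": 25.0},
    {"Player": "B", "Wkts": 10, "Ave": 20.0},
    {"Player": "C", "Wkts": 7, "Ave": 30.0},
    {"Player": "D", "Wkts": 0, "Ave": None},
]

# Keep only bowlers with at least one wicket.
filtered = [b for b in bowlers if b["Wkts"] > 0]

# Wickets descending, then average ascending.
ordered = sorted(filtered, key=lambda b: (-b["Wkts"], b["Ave"]))

# Keep the first row of each run of equal wicket counts: after sorting,
# that row has the lowest average for its wicket count.
best = [b for i, b in enumerate(ordered)
        if i == 0 or b["Wkts"] != ordered[i - 1]["Wkts"]]

print([b["Player"] for b in best])
```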
I would like to answer the more interesting and impressive question: what bowlers have a better lifetime average than any other bowler who has taken as many or more wickets? In other words, what bowlers define the outer edge of the graph between average and wickets; what bowlers could you credibly claim to be “best in class”, surpassed in average only by worse wicket-takers, or in lifetime wickets by bowlers with worse averages?
bestInClass:best @ &{(*|x)=&/x} @' ,\ best @ `Ave
We can reason that every such bowler must have a better average at least than those bowlers with the same number of wickets, so we can start with the result of the weaker formulation.
We’ll then take the lifetime averages of each bowler and take a series of prefixes with ,\: this produces, for each bowler, a sequence ending in that bowler’s average, preceded by the average of every bowler with more lifetime wickets than him.
We then test each sequence for the predicate {(*|x)=&/x}, or: is the last value in the sequence (the average of the bowler under question) also the minimum value?
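The prefix test is equivalent to comparing each average against a running minimum, which the following Python sketch does. It is a swapped-in formulation, not the author's method: `best_in_class` is a hypothetical name and the rows are invented, assumed to be sorted by wickets descending with one row per wicket count.

```python
def best_in_class(best):
    """Rows whose average is the minimum among all rows seen so far.

    This matches "last value of the prefix equals the prefix minimum",
    but tracks a running minimum instead of building every prefix.
    """
    result = []
    running_min = float("inf")
    for row in best:
        running_min = min(running_min, row["Ave"])
        if row["Ave"] == running_min:
            result.append(row)
    return result

# Invented rows, wickets descending.
best = [
    {"Player": "P1", "Wkts": 800, "Ave": 23.0},
    {"Player": "P2", "Wkts": 700, "Ave": 25.0},
    {"Player": "P3", "Wkts": 560, "Ave": 21.5},
    {"Player": "P4", "Wkts": 400, "Ave": 21.0},
    {"Player": "P5", "Wkts": 380, "Ave": 22.0},
]

print([row["Player"] for row in best_in_class(best)])
```

The first row always qualifies, since its prefix contains only its own average.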
We are left with a more rarefied and interesting bunch, about whom we can say that no bowler had a better average, save those who took fewer wickets:
+![ `Player `Span `Mat `Inns `Balls `Runs `Wkts `BBM `..
+(("M Muralitharan (ICC/SL)";"1992-2010"; 133; 230; 44039;18180; 800;"16/220";..)
("GD McGrath (AUS)" ;"1993-2007"; 124; 243; 29248;12186; 563;"27-Oct";..)
("CEL Ambrose (WI)" ;"1988-2000"; 98; 179; 22103; 8501; 405;"Nov-84";..)
("MD Marshall (WI)" ;"1978-1991"; 81; 151; 17584; 7876; 376;"Nov-89";..)
("SF Barnes (ENG)" ;"1901-1914"; 27; 50; 7873; 3106; 189;"17/159";..)
("GA Lohmann (ENG)" ;"1886-1896"; 18; 36; 3830; 1205; 112;"15/45" ;..)
("F Martin (ENG)" ;"1890-1892"; 2; 3; 410; 141; 14;"12/102";..)
("CS Marriott (ENG)" ;"1933-1933"; 1; 2; 247; 96; 11;"Nov-96";..)
("CA Smith (ENG)" ;"1889-1889"; 1; 2; 154; 61; 7;"Jul-61";..)
("AJL Hill (ENG)" ;"1896-1896"; 3; 1; 40; 8; 4;"8-Apr" ;..)
("W Barber (ENG)" ;"1935-1935"; 2; 1; 2; 0; 1;"Jan-00";..))]
The original formulation of the question prompts some more trivial questions. The first: if we are interested in the best lifetime average for each n, what are the most “competitive” ns—what lifetime wicket counts are held by the most bowlers?
mostCompetitive: &(#'=allWkts)>15
mostCompetitiveBowlers: {~^mostCompetitive?x[`Wkts]}#best
Here we return to allWkts, which is simply the lifetime wicket count of every bowler in the table. We group equal values with = and then find the wicket counts that are held by more than (arbitrarily) 15 bowlers.
We can then select those members of best whose wicket counts are in the group, identifying the bowlers who had to beat the most other players to win their particular bracket.
+![ `Player `Span `Mat `Inns `Balls `Runs `Wkts `BBM `Av..
+(("GF Bissett (SA)" ;"1927-1928"; 4; 8; 989; 469; 25;"Sep-90";18..)
("J Middleton (SA)" ;"1896-1902"; 6; 11; 1064; 442; 24;"9/130" ;18..)
("JB Iverson (AUS)" ;"1950-1951"; 5; 8; 1108; 320; 21;"Jun-52";15..)
("Zulfiqar Ahmed (PAK)" ;"1952-1956"; 9; 10; 1285; 366; 20;"Nov-79";18..)
("J Trim (WI)" ;"1948-1952"; 4; 8; 794; 291; 18;"Jul-76";16..)
("TS Roland-Jones (ENG)";"2017-2017"; 4; 8; 536; 334; 17;"8/129" ;19..)
...
We also see, incidentally, that 25 wickets is the highest total with more than 15 bowlers holding it.
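The group-and-threshold step can be sketched in Python with a `Counter`. The wicket counts and players below are invented, and `threshold` is set to 2 only so that the tiny sample has a result; the article uses 15.

```python
from collections import Counter

# Invented wicket counts, one per bowler.
all_wkts = [3, 3, 3, 2, 2, 1]
threshold = 2  # the article uses 15; lowered for this tiny sample

# How many bowlers hold each lifetime total.
holders = Counter(all_wkts)

# Totals held by more than `threshold` bowlers.
most_competitive = {w for w, n in holders.items() if n > threshold}

# Invented stand-in for the per-total winners.
best = [
    {"Player": "X", "Wkts": 3},
    {"Player": "Y", "Wkts": 2},
    {"Player": "Z", "Wkts": 1},
]
winners = [b for b in best if b["Wkts"] in most_competitive]
print(winners)
```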
Having sorted our bowlers by lifetime wickets, another question presents itself: what is the smallest n that is not held (as of January 2020) as a lifetime total by any bowler?
gap:-2#best @ &-1>-': ?allWkts
Once again, we use eachprior: this time, we take the unique values for lifetime wickets and find the difference between each adjacent value. The values are sorted descending, so the differences are negative: wherever the difference is less than -1, there’s a gap between the two values. We then index into best at those positions (best is similarly deduped by number of lifetime wickets) and keep the last two rows.
(("BKV Prasad (INDIA)";96;35.0;"http://stats.espncricinfo.com/ci/content/player/32345.html")
("FR Spofforth (AUS)";94;18.41;"http://stats.espncricinfo.com/ci/content/player/7663.html"))
We see that the lowest value not held is 95.
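A Python paraphrase of the adjacent-difference search might look like the sketch below. `lowest_gap` is a hypothetical name and the wicket counts are invented; it also assumes, as the argument in the text does, that the totals below the lowest gap run unbroken down to 1.

```python
def lowest_gap(all_wkts):
    """Return the (higher, lower) distinct totals on either side of the lowest gap."""
    distinct = sorted(set(all_wkts), reverse=True)
    # Descending order makes differences negative; below -1 means a gap.
    gaps = [(prev, cur) for prev, cur in zip(distinct, distinct[1:])
            if cur - prev < -1]
    # The last gap found is the lowest one, since we scan downwards.
    return gaps[-1] if gaps else None

# Invented totals with gaps at 6 and at 8-9.
all_wkts = [10, 7, 7, 5, 4, 3, 2, 1]
above, below = lowest_gap(all_wkts)
print(above, below)
```

The smallest unheld total is then one more than `below`.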
Another way of answering the question “what is the smallest n that is not held as a lifetime total?” is to do some simple set arithmetic. We can also write
&/(1+!|/allWkts)^allWkts
Which will remove all extant values from the set of all possible values (from 1 to the maximum lifetime wickets value), and find the lowest value in the remaining set.
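The same set arithmetic reads almost word for word in Python. This is a sketch on invented wicket counts, not on the real table.

```python
# Invented totals; 6, 8 and 9 are not held by anyone.
all_wkts = [10, 7, 7, 5, 4, 3, 2, 1]

# Every possible total from 1 up to the maximum held.
candidates = set(range(1, max(all_wkts) + 1))

# Remove the totals that are actually held.
missing = candidates - set(all_wkts)

# The smallest remaining value, or None when there are no gaps.
smallest_missing = min(missing) if missing else None
print(smallest_missing)
```

Unlike the adjacent-difference version, this one does not depend on the values being sorted or deduplicated first.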