Semigroup and Aggregate

Let the compiler do the talking

Whats on the menu

Parse data from a CSV file

Group and aggregate

Minimize mechanical and algorithmic code

Leverage type classes and GHC generics

Avoid writing code

Writing code is buggy

Type Classes give ad hoc polymorphism

Generics give you data-type generic programming

Generics + Type Classes + library authors == code for free

Type Classes + laws == correctness for free

GHC can derive many type classes for you

Monoid

It is a type with an associative binary operation that has an identity,

The binary operator is called <> and

The identity is called mempty and

The following laws hold
   x <> mempty = x right identity
   mempty <> x = x left identity
   x <> (y <> z) = (x <> y) <> z associativity
Its + with 0, * with 1, ++ with []

Semigroup

Its is Monoid without mempty

Useful to declaratively group and aggregate columnar data

Example

In this example we will take airplane crashes since 1908 from https://dev.socrata.com/foundry/opendata.socrata.com/q2te-8cvq and aggregate some statistics by different columns.

Here are links to everything needed to run the examples
  main.hs
  semigroup-and-aggregate.cabal
  Airplane_Crashes_and_Fatalities_Since_1908.csv

The language extensions we will be using

{-# LANGUAGE DeriveGeneric #-}
{-# LANGUAGE GeneralizedNewtypeDeriving #-}
{-# LANGUAGE ScopedTypeVariables #-}
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE NamedFieldPuns #-}
{-# LANGUAGE OverloadedLists #-}
{-# LANGUAGE LambdaCase #-}
{-# LANGUAGE TupleSections #-}

The libraries we will be using

-- base
import qualified Data.Monoid as Monoid
import Data.Ord
import qualified Data.Semigroup as Semigroup

-- cassava
import Data.Csv

-- generic-deriving
import Generics.Deriving.Semigroup

-- hable
import qualified Hable

-- rio
import RIO
import qualified RIO.ByteString.Lazy as BL
import qualified RIO.Char as Char
import qualified RIO.List as List
import qualified RIO.Map as Map
import qualified RIO.Text as Text
import RIO.Time

-- vector
import qualified Data.Vector as Vector

Most of our columns are wrappers around simple types

-- the location column wraps "Text" and parses using its instance
newtype Location = Location {getLocation :: Text}
  deriving (FromField, Show, Read, Eq, Ord)

-- the operator column wraps "Text" and parses using its instance
newtype Operator = Operator {getOperator :: Text}
  deriving (FromField, Show, Read, Eq, Ord)

-- the flight number column wraps "Text" and parses using its instance
newtype FlightNumber = FlightNumber {getFlightNumber :: Text}
  deriving (FromField, Show, Read, Eq, Ord)

-- the A/C type column wraps "Text" and parses using its instance
newtype AcType = AcType {getAcType :: Text}
  deriving (FromField, Show, Read, Eq, Ord)

-- the fatalities column wraps "Int" and parses using its instance
newtype Fatalities = Fatalities {getFatalities :: Int}
  deriving (FromField, Show, Read, Eq, Ord, Num)
-- Our date column wraps "Day" but parses as Y/M/D from the CSV
newtype Date = Date {getDay :: Day}
  deriving (Eq, Ord, Show, Read)

-- the date format definition used by the pretty printer and parser
mkDateFmt :: (TimeLocale -> String -> b) -> b
mkDateFmt f = f defaultTimeLocale "%m/%d/%Y"

-- pretty print the date in the same way that its read from the CSV
prettyDate :: Date -> String
prettyDate = mkDateFmt formatTime . getDay

-- customize the CSV parser for the date
instance FromField Date where
    --  types
    --   parseField :: FromField a => Field -> Parser a
    --   parseTimeM
    --    :: (Monad m, ParseTime t)
    --    => Bool -> TimeLocale -> String -> String -> m t
    --   >=> :: Monad m => (a -> m b) -> (b -> m c) -> a -> m c
  parseField =
    parseField >=>
      ( (Date <$>) .  mkDateFmt (parseTimeM True) )

Next we define our row

-- the data type that represents a row in the CSV that we will parse
data CsvRow = CsvRow
  { date :: Date
  , location :: Location
  , operator :: Operator
  , flightNumber :: Maybe FlightNumber
  , acType :: AcType
  , fatalities :: Maybe Fatalities
  } deriving (Generic)

-- We define how to parse a CSV file making use of "genericParseNamedRecord"
-- to do this automatically for us because "CsvRow" is an instance of "Generic"
-- we only customized the definition of the labels
instance FromNamedRecord CsvRow where
  parseNamedRecord =
    -- types
    --   genericParseNamedRecord
    --    :: (Generic a, GFromNamedRecord (Rep a))
    --    => Options -> NamedRecord -> Parser a
    --   fieldLabelModifier :: String -> String
    genericParseNamedRecord
      defaultOptions
        { fieldLabelModifier =
            \case
              "flightNumber" -> "Flight #"
              "acType" -> "Type"
              h:t -> Char.toUpper h : t
              [] -> []
        }

Then we define a type for the statistics we wish to collect and a way to project the statistics from the row

-- a type to represent the number of crashes reported
newtype Crashes = Crashes {getCrashes :: Int}
  deriving (Num, Show, Read, Ord, Eq)

-- a data type for the statistic we want to record per row type
data Stats = Stats
  { statFirst :: Semigroup.Min Date          -- lower bound Date
  , statLast :: Semigroup.Max Date           -- upper bound Date
  , statFatalities :: Monoid.Sum Fatalities  -- sum of Fatalities
  , statCrashes :: Monoid.Sum Crashes        -- sum of Crashes
  } deriving (Generic, Show)

-- Generically derive a "Semigroup" instance for "Stats" using "Generic" and
-- the fact that each field has a "Semigroup" instance
instance Semigroup Stats where
  (<>)  = gsappenddefault


-- convert a CSV row into a statistic we can aggregate
rowToStats ::  CsvRow -> Stats
rowToStats CsvRow {  date , fatalities} =
   Stats
      { statFirst      = Semigroup.Min date
      , statLast       = Semigroup.Max date
      , statFatalities = Semigroup.Sum (fromMaybe 0 fatalities)
      , statCrashes    = Semigroup.Sum 1
      }

We are going to display our statistics in tabular form so we add a function that turns a Stats into a row of columns of strings and also a constant for the headers of our rows

-- display a "Stats" as a row of columns of "String"
displayStatsColumns :: Stats -> [String]
displayStatsColumns  =
  -- types:
  --   sequence :: (Traversable t, Monad m) => t (m a) -> m (t a)
  -- inferred concrete types:
  --   sequence :: [Stats -> String] -> Stats -> [String]
  sequence
    [ prettyDate . Semigroup.getMin . statFirst
    , prettyDate . Semigroup.getMax . statLast
    , show . getFatalities . Semigroup.getSum . statFatalities
    , show . getCrashes . Semigroup.getSum . statCrashes
    ]

-- the column headers associated with rows of columns of strings of "Stats"
statsHeaders :: [String]
statsHeaders = ["first", "last", "fatalities", "crashes"]

We define how we group statistics

-- group rows on some value aggregating statistics using some "Semigroup"
aggregateStatisticsBy
  :: (Ord groupBy, Semigroup statistic)
  => (row -> Maybe (groupBy, statistic))
  -> [row]
  -> [(groupBy, statistic)]
aggregateStatisticsBy grouper =
  -- types:
  --   toList :: Map k a -> [(k, a)]
  --   <> :: Semigroup a => a -> a -> a
  --   fromListWith :: Ord k => (a -> a -> a) -> [(k, a)] -> Map k a
  --   mapMaybe :: (a -> Maybe b) -> [a] -> [b]
    -- :: Map a Stats -> [(a, Stats)]
    Map.toList .
    -- :: [(a, Stats)] -> Map a Stats
    Map.fromListWith (<>) .
    -- :: [CsvRow] -> [(a, Stats)]
    mapMaybe grouper

We define a function that will do the grouping and displaying

-- group the CSV rows aggregating the statistics and displaying the "n" best ranked stats
groupAndDisplayStats
  :: (Ord o, Ord a)
  => (CsvRow -> Maybe (a, Stats)) -- grouping function
  -> String                       -- grouping column header
  -> (a -> String)                -- display grouping column
  -> (Stats -> o)                 -- ordering function
  -> Int                          -- number of rows to display
  -> [CsvRow]                     -- the CSV rows
  -> String
groupAndDisplayStats grouper groupName displayGroup ordering cnt rows =
  -- types:
  --   hable :: Config style -> [[String]] -> String
  --   take :: Int -> [a] -> [a]
  --   map :: (a -> b) -> [a] -> [b]
  --   snd :: (a, b) -> b
  --   sortOn :: Ord b => (a -> b) -> [a] -> [a]
    Hable.hable Hable.defaultConfig $
    (groupName : statsHeaders) :
    ( take cnt .
      --  :: [(a, Stats)] -> [[String]]
      map
        (\(g, stats) ->
          displayGroup g : displayStatsColumns stats
        ) .
      -- :: [(a, Stats)] -> [(a, Stats)]
      List.sortOn (ordering . snd) .
      -- :: [CsvRow] -> [(a, Stats)]
      aggregateStatisticsBy grouper $
      -- :: [CsvRow]
      rows
    )

We define grouping functions with each function being very clear on which column they group.

statsByOperator :: CsvRow -> Maybe (Operator, Stats)
statsByOperator row = Just ( operator row , rowToStats row)

statsByFlightNumber :: CsvRow -> Maybe (FlightNumber,  Stats)
statsByFlightNumber row = (,rowToStats row) <$> flightNumber row

statsByLocation :: CsvRow -> Maybe (Location, Stats)
statsByLocation row = Just ( location row , rowToStats row)

statsAcType :: CsvRow -> Maybe (AcType, Stats)
statsAcType row = Just ( acType row , rowToStats row)

We define ranking functions with each function being very clear which statistics are used for the ranking.

byMostFatalities :: Stats -> Down Fatalities
byMostFatalities = Down . Semigroup.getSum . statFatalities

byLeastFatalities :: Stats -> Fatalities
byLeastFatalities = Semigroup.getSum . statFatalities

bySafestToCrashWith :: Stats -> (Fatalities, Down Crashes)
bySafestToCrashWith Stats{statFatalities, statCrashes} =
  (Semigroup.getSum statFatalities, Down (Semigroup.getSum statCrashes))

We end off by wiring everything together with our main

main :: IO ()
main = do
  csvData <- BL.readFile "./Airplane_Crashes_and_Fatalities_Since_1908.csv"
  rows <-
    either fail (pure . Vector.toList . snd) $ decodeByName csvData

  putStrLn "5 deadliest operators"
  putStr $
    groupAndDisplayStats
      statsByOperator "operator" (Text.unpack . getOperator) byMostFatalities 5 rows

  putStrLn "5 safest operators"
  putStr $
    groupAndDisplayStats
      statsByOperator "operator" (Text.unpack . getOperator) byLeastFatalities 5 rows


  putStrLn "6 deadliest flight numbers"
  putStr $
    groupAndDisplayStats
      statsByFlightNumber "flight #" (Text.unpack . getFlightNumber) byMostFatalities 6 rows

  putStrLn "4 deadliest locations"
  putStr $
    groupAndDisplayStats
      statsByLocation "location" (Text.unpack . getLocation) byMostFatalities 4 rows

  putStrLn "5 deadliest A/C Types"
  putStr $
    groupAndDisplayStats
      statsAcType "A/C Type" (Text.unpack . getAcType) byMostFatalities 5 rows


  putStrLn "5 best A/C Types to crash with"
  putStr $
    groupAndDisplayStats
      statsAcType "A/C Type" (Text.unpack . getAcType) bySafestToCrashWith 5 rows

Giving us this output

5 deadliest operators
╔════════════════════════════╤════════════╤════════════╤════════════╤═════════╗
║ operator                   │ first      │ last       │ fatalities │ crashes ║
╟────────────────────────────┼────────────┼────────────┼────────────┼─────────╢
║ Aeroflot                   │ 12/04/1946 │ 09/14/2008 │ 7156       │ 179     ║
╟────────────────────────────┼────────────┼────────────┼────────────┼─────────╢
║ Military - U.S. Air Force  │ 08/07/1943 │ 03/31/2005 │ 3717       │ 176     ║
╟────────────────────────────┼────────────┼────────────┼────────────┼─────────╢
║ Air France                 │ 10/31/1933 │ 06/01/2009 │ 1734       │ 70      ║
╟────────────────────────────┼────────────┼────────────┼────────────┼─────────╢
║ American Airlines          │ 12/22/1934 │ 11/12/2001 │ 1421       │ 36      ║
╟────────────────────────────┼────────────┼────────────┼────────────┼─────────╢
║ Pan American World Airways │ 01/11/1938 │ 12/21/1988 │ 1302       │ 41      ║
╚════════════════════════════╧════════════╧════════════╧════════════╧═════════╝
5 safest operators
╔════════════════════════════════╤════════════╤════════════╤════════════╤═════════╗
║ operator                       │ first      │ last       │ fatalities │ crashes ║
╟────────────────────────────────┼────────────┼────────────┼────────────┼─────────╢
║ American International Airways │ 08/18/1993 │ 08/18/1993 │ 0          │ 1       ║
╟────────────────────────────────┼────────────┼────────────┼────────────┼─────────╢
║ Ararat Avia                    │ 10/21/1998 │ 10/21/1998 │ 0          │ 1       ║
╟────────────────────────────────┼────────────┼────────────┼────────────┼─────────╢
║ Canadian Air Transat           │ 08/24/2001 │ 08/24/2001 │ 0          │ 1       ║
╟────────────────────────────────┼────────────┼────────────┼────────────┼─────────╢
║ Centurian Air Cargo            │ 07/07/2008 │ 07/07/2008 │ 0          │ 1       ║
╟────────────────────────────────┼────────────┼────────────┼────────────┼─────────╢
║ Hewa Bora Airways              │ 04/15/2008 │ 04/15/2008 │ 0          │ 1       ║
╚════════════════════════════════╧════════════╧════════════╧════════════╧═════════╝
6 deadliest flight numbers
╔═══════════╤════════════╤════════════╤════════════╤═════════╗
║ flight #  │ first      │ last       │ fatalities │ crashes ║
╟───────────┼────────────┼────────────┼────────────┼─────────╢
║ -         │ 08/06/1913 │ 02/10/2005 │ 628        │ 67      ║
╟───────────┼────────────┼────────────┼────────────┼─────────╢
║ 1736/4805 │ 03/27/1977 │ 03/27/1977 │ 583        │ 1       ║
╟───────────┼────────────┼────────────┼────────────┼─────────╢
║ 123       │ 08/12/1985 │ 08/12/1985 │ 520        │ 1       ║
╟───────────┼────────────┼────────────┼────────────┼─────────╢
║ 182       │ 09/25/1978 │ 06/23/1985 │ 466        │ 2       ║
╟───────────┼────────────┼────────────┼────────────┼─────────╢
║ 191       │ 06/24/1972 │ 08/02/1985 │ 410        │ 3       ║
╟───────────┼────────────┼────────────┼────────────┼─────────╢
║ 901       │ 03/19/1960 │ 08/09/1995 │ 409        │ 6       ║
╚═══════════╧════════════╧════════════╧════════════╧═════════╝
4 deadliest locations
╔═══════════════════════════════════════╤════════════╤════════════╤════════════╤═════════╗
║ location                              │ first      │ last       │ fatalities │ crashes ║
╟───────────────────────────────────────┼────────────┼────────────┼────────────┼─────────╢
║ Tenerife, Canary Islands              │ 12/07/1965 │ 04/25/1980 │ 761        │ 3       ║
╟───────────────────────────────────────┼────────────┼────────────┼────────────┼─────────╢
║ Mt. Osutaka, near Ueno Village, Japan │ 08/12/1985 │ 08/12/1985 │ 520        │ 1       ║
╟───────────────────────────────────────┼────────────┼────────────┼────────────┼─────────╢
║ Moscow, Russia                        │ 03/26/1952 │ 07/29/2007 │ 432        │ 15      ║
╟───────────────────────────────────────┼────────────┼────────────┼────────────┼─────────╢
║ Near Moscow, Russia                   │ 05/18/1935 │ 07/14/2001 │ 364        │ 9       ║
╚═══════════════════════════════════════╧════════════╧════════════╧════════════╧═════════╝
5 deadliest A/C Types
╔═══════════════════════════╤════════════╤════════════╤════════════╤═════════╗
║ A/C Type                  │ first      │ last       │ fatalities │ crashes ║
╟───────────────────────────┼────────────┼────────────┼────────────┼─────────╢
║ Douglas DC-3              │ 10/06/1937 │ 12/15/1994 │ 4793       │ 334     ║
╟───────────────────────────┼────────────┼────────────┼────────────┼─────────╢
║ Antonov AN-26             │ 07/14/1977 │ 10/04/2007 │ 1068       │ 36      ║
╟───────────────────────────┼────────────┼────────────┼────────────┼─────────╢
║ Douglas DC-6B             │ 06/30/1951 │ 07/24/1985 │ 1055       │ 27      ║
╟───────────────────────────┼────────────┼────────────┼────────────┼─────────╢
║ Douglas C-47              │ 10/01/1942 │ 09/02/2000 │ 1046       │ 62      ║
╟───────────────────────────┼────────────┼────────────┼────────────┼─────────╢
║ McDonnell Douglas DC-9-32 │ 03/16/1969 │ 02/02/1998 │ 951        │ 19      ║
╚═══════════════════════════╧════════════╧════════════╧════════════╧═════════╝
5 best A/C Types to crash with
╔══════════════════╤════════════╤════════════╤════════════╤═════════╗
║ A/C Type         │ first      │ last       │ fatalities │ crashes ║
╟──────────────────┼────────────┼────────────┼────────────┼─────────╢
║ Lockheed L-1011  │ 05/05/1983 │ 07/30/1992 │ 0          │ 2       ║
╟──────────────────┼────────────┼────────────┼────────────┼─────────╢
║ Sikorsky S-38B   │ 08/10/1934 │ 08/13/1935 │ 0          │ 2       ║
╟──────────────────┼────────────┼────────────┼────────────┼─────────╢
║ Airbus A-330-243 │ 08/24/2001 │ 08/24/2001 │ 0          │ 1       ║
╟──────────────────┼────────────┼────────────┼────────────┼─────────╢
║ Airbus A-340     │ 08/02/2005 │ 08/02/2005 │ 0          │ 1       ║
╟──────────────────┼────────────┼────────────┼────────────┼─────────╢
║ Airbus A.320-214 │ 03/22/1998 │ 03/22/1998 │ 0          │ 1       ║
╚══════════════════╧════════════╧════════════╧════════════╧═════════╝