Semigroup and Aggregate

What's on the menu

Parse data from a CSV file

Group and aggregate

Minimize mechanical and algorithmic code

Leverage type classes and GHC generics

Dataset

The dataset we are going to work with covers airplane crashes since 1908.

Why avoid writing code

Every line we write is a potential bug

Every line we write has to be maintained

Repetitive and boilerplate code obfuscates intent

How can we avoid writing code

Type Classes give ad hoc polymorphism

Generics give you data-type generic programming

Generics + Type Classes + library authors == code for free

Type Classes + laws == correctness for free

GHC can derive many type classes for you
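For example (Color is a throwaway type, not part of our dataset), a single deriving clause gets us five instances for free:

```haskell
-- A hypothetical example type; GHC derives all five instances for us.
data Color = Red | Green | Blue
  deriving (Eq, Ord, Show, Enum, Bounded)

main :: IO ()
main = print [minBound .. maxBound :: Color]  -- prints [Red,Green,Blue]
```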

How to parse the CSV

Luckily for us we can use the cassava library to parse CSV data.

CSV is essentially a matrix of strings,
  and cassava does support handling it opaquely,
  but it also supports and encourages parsing it into custom data types.

Parsing with cassava

To parse a column we need an instance for

class FromField a where
  parseField :: Field -> Parser a

type Field = ByteString

But luckily many types already have instances.

Parsing with cassava

To parse a row we need an instance for

class FromNamedRecord a where
  default parseNamedRecord :: (Generic a, GFromNamedRecord (Rep a)) => NamedRecord -> Parser a
  parseNamedRecord :: NamedRecord -> Parser a

type NamedRecord = HashMap ByteString ByteString

We could implement it manually, e.g.

data Person = Person { name :: !Text, age :: !Int }

instance FromNamedRecord Person where
    parseNamedRecord m = Person <$>
                         m .: "name" <*>
                         m .: "age"

But luckily it has a default generic implementation, which means we can get it for free, e.g.

data Person = Person { name :: !Text , salary :: !Int }
    deriving (Generic, Show)

instance FromNamedRecord Person

Parsing our data

If we look at the CSV then we are interested in parsing the following columns:

Date
The date of the crash e.g. 09/17/1908
Location
Where the crash took place e.g. Fort Myer, Virginia
Operator
Who the operator of the plane was e.g. Military - U.S. Army
Flight #
The flight number associated with the aircraft e.g. 2L272
Type
The type of the aircraft e.g. Wright Flyer III
Aboard
The number of passengers on board the aircraft.
Fatalities
The total number of fatalities due to the crash.
Ground
The number of fatalities among non-passengers, i.e. people who were on the ground.

Parsing our data

We will use newtypes to tag the column data so that it's clear which columns we are working with.

They are zero cost at compile time.

They are low cost at developer time since we can access the underlying type's instances via GeneralizedNewtypeDeriving.
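A minimal, self-contained sketch of the idea (Meters is a made-up newtype, not one of our columns): with GeneralizedNewtypeDeriving the wrapper reuses the Num instance of the wrapped Int.

```haskell
{-# LANGUAGE GeneralizedNewtypeDeriving #-}

-- A made-up newtype: "deriving Num" reuses Int's arithmetic at no runtime cost.
newtype Meters = Meters Int
  deriving (Show, Eq, Num)

main :: IO ()
main = print (Meters 2 + Meters 3)  -- prints Meters 5
```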

Parsing our data

-- the location column wraps "Text" and parses using its instance
newtype Location = Location {getLocation :: Text}
  deriving (FromField, Show, Read, Eq, Ord)

-- the operator column wraps "Text" and parses using its instance
newtype Operator = Operator {getOperator :: Text}
  deriving (FromField, Show, Read, Eq, Ord)

-- the flight number column wraps "Text" and parses using its instance
newtype FlightNumber = FlightNumber {getFlightNumber :: Text}
  deriving (FromField, Show, Read, Eq, Ord)

-- the A/C type column wraps "Text" and parses using its instance
newtype AcType = AcType {getAcType :: Text}
  deriving (FromField, Show, Read, Eq, Ord)

-- the passengers column wraps "Int" and parses using its instance
newtype Passengers = Passengers {getPassengers :: Int}
  deriving (FromField, Show, Read, Eq, Ord, Num)

-- the fatalities column wraps "Int" and parses using its instance
newtype Fatalities = Fatalities {getFatalities :: Int}
  deriving (FromField, Show, Read, Eq, Ord, Num)

-- the ground fatalities column wraps "Int" and parses using its instance
newtype GroundFatalities = GroundFatalities {getGroundFatalities :: Int}
  deriving (FromField, Show, Read, Eq, Ord, Num)

-- Our date column wraps "Day" but parses as M/D/Y from the CSV
newtype Date = Date {getDay :: Day}
  deriving (Eq, Ord, Show, Read)

-- customize the CSV parser for the date
instance FromField Date where
  parseField =
    parseField
      >=> ((Date <$>) . parseTimeM True defaultTimeLocale "%m/%d/%Y")
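The (>=>) above is Kleisli composition: it sequences two monadic steps, here "parse the raw field" and then "refine the result". A standalone sketch with Maybe standing in for cassava's Parser (readInt and positive are made-up helpers):

```haskell
import Control.Monad ((>=>))

-- Made-up helpers: first parse the String, then validate the result.
readInt :: String -> Maybe Int
readInt s = case reads s of [(n, "")] -> Just n; _ -> Nothing

positive :: Int -> Maybe Int
positive n = if n > 0 then Just n else Nothing

main :: IO ()
main = print (map (readInt >=> positive) ["5", "-3", "x"])
-- prints [Just 5,Nothing,Nothing]
```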

Parsing our data

-- the data type that represents a row in the CSV that we will parse
data CsvRow = CsvRow
  { date :: Date
  , location :: Location
  , operator :: Operator
  , flightNumber :: Maybe FlightNumber
  , acType :: AcType
  , aboard :: Maybe Passengers
  , fatalities :: Maybe Fatalities
  , ground :: Maybe GroundFatalities
  }
  deriving (Generic)
-- We define how to parse a CSV row using "genericParseNamedRecord",
-- which works automatically because "CsvRow" is an instance of "Generic";
-- we only customize the mapping from field names to column labels
instance FromNamedRecord CsvRow where
  parseNamedRecord =
    genericParseNamedRecord
      defaultOptions
        { fieldLabelModifier =
            \case
              "flightNumber" -> "Flight #"
              "acType" -> "Type"
              h : t -> Char.toUpper h : t
              [] -> []
        }

Crashes per aircraft type

As a first statistic let's calculate the crashes per aircraft type.

We can use a newtype again to represent crashes.

newtype Crashes = Crashes {getCrashes :: Int}
  deriving (Num, Show, Read, Ord, Eq)

As a first attempt we might come up with something like this.

crashesPerAcType1 :: [CsvRow] -> [(AcType, Crashes)]
crashesPerAcType1 = Map.toList . Map.fromListWith (+) . map toTuple
  where
    toTuple a = (acType a, 1)

But it's not really clear from our type that there can't be any duplicate aircraft types, and we are discarding the Map, which seems wasteful.

So then we come up with the following.

crashesPerAcType2 :: [CsvRow] -> Map AcType Crashes
crashesPerAcType2 = Map.fromListWith (+) . map toTuple
  where
    toTuple a = (acType a, 1)

Crashes per aircraft type

But the result of a successful parse from cassava is not a [CsvRow] but a Vector CsvRow.

We could fix it by doing the following.

crashesPerAcType3 :: Vector CsvRow -> Map AcType Crashes
crashesPerAcType3 = Map.fromListWith (+) . map toTuple . Vector.toList
  where
    toTuple a = (acType a, 1)

Or better yet we could make it work generically with any Foldable.

crashesPerAcType4 :: Foldable t => t CsvRow -> Map AcType Crashes
crashesPerAcType4 = Map.fromListWith (+) . map toTuple . toList
  where
    toTuple a = (acType a, 1)

Crashes per operator

Next let's calculate crashes per operator.

crashesPerOperator :: Foldable t => t CsvRow -> Map Operator Crashes
crashesPerOperator = Map.fromListWith (+) . map toTuple . toList
  where
    toTuple a = (operator a, 1)

But this looks almost exactly the same as crashes per aircraft type.

We fix it by extracting the common bit to a helper.

crashesPer :: (Foldable t, Ord key) => (CsvRow -> key) -> t CsvRow -> Map key Crashes
crashesPer toKey = Map.fromListWith (+) . map toTuple . toList
  where
    toTuple a = (toKey a, 1)


crashesPerAcType5 :: Foldable t => t CsvRow -> Map AcType Crashes
crashesPerAcType5 = crashesPer acType


crashesPerOperator2 :: Foldable t => t CsvRow -> Map Operator Crashes
crashesPerOperator2 = crashesPer operator

Number of crashes and date of last crash

crashesLastCrashPerOperator :: Foldable t => t CsvRow -> Map Operator (Crashes, Date)

Let's modify our helper to be more general.

crashesPer2 ::
  (Foldable t, Ord key) =>
  (CsvRow -> key) ->
  (CsvRow -> value) ->
  (value -> value -> value) ->
  t CsvRow ->
  Map key value
crashesPer2 toKey toValue combineValues = Map.fromListWith combineValues . map toTuple . toList
  where
    toTuple a = (toKey a, toValue a)
crashesLastCrashPerOperator :: Foldable t => t CsvRow -> Map Operator (Crashes, Date)
crashesLastCrashPerOperator = crashesPer2 operator toValue combineValues
  where
    toValue a = (1, date a)
    combineValues (l1, l2) (r1, r2) = (l1 + r1, if l2 < r2 then r2 else l2)

It's a shame that we have to destructure the tuple, combine the elements, and then recreate it.

Surely we can do better.

Monoid & Semigroup

A Monoid is a type with an associative binary operation that has an identity.

The binary operator is called <>.

The identity is called mempty.

The following laws hold:
   x <> mempty = x                  -- right identity
   mempty <> x = x                  -- left identity
   x <> (y <> z) = (x <> y) <> z    -- associativity

It's + with 0, * with 1, ++ with [].

A Semigroup is a Monoid without mempty.

Any product type, e.g. tuple, composed of Semigroup/Monoids is itself a Semigroup/Monoid
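A quick base-only demonstration: pairs of Semigroups combine component-wise.

```haskell
import Data.Monoid (Sum (..))

-- (Sum Int, [String]) is a Semigroup because both components are.
main :: IO ()
main = print ((Sum 1, ["a"]) <> (Sum 2, ["b"]) :: (Sum Int, [String]))
-- prints (Sum {getSum = 3},["a","b"])
```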

Number of crashes and date of last crash

crashesPer3 ::
  (Foldable t, Ord key, Semigroup value) =>
  (CsvRow -> key) ->
  (CsvRow -> value) ->
  t CsvRow ->
  Map key value
crashesPer3 toKey toValue = Map.fromListWith (<>) . map toTuple . toList
  where
    toTuple a = (toKey a, toValue a)
crashesLastCrashPerOperator2 ::
  Foldable t =>
  t CsvRow ->
  Map Operator (Semigroup.Sum Crashes, Semigroup.Last Date)
crashesLastCrashPerOperator2 = crashesPer3 operator toValue
  where
    toValue a = (Semigroup.Sum 1, Semigroup.Last (date a))

But isn't a Map a Monoid?

Map is a monoid.

 Ord k => Monoid (Map k v)

And there is this handy function

foldMap :: (Foldable t, Monoid m) => (a -> m) -> t a -> m
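foldMap maps each element to a Monoid and combines the results, e.g. with Sum from base:

```haskell
import Data.Monoid (Sum (..))

-- Wrap each element in Sum, then combine with the Sum Monoid (addition).
main :: IO ()
main = print (getSum (foldMap Sum [1 .. 10 :: Int]))  -- prints 55
```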

So can't we then do this:

crashesPer4Wrong ::
  (Foldable t, Ord key, Semigroup value) =>
  (CsvRow -> key) ->
  (CsvRow -> value) ->
  t CsvRow ->
  Map key value
crashesPer4Wrong toKey toValue = foldMap toMap
  where
    toMap a = Map.singleton (toKey a) (toValue a)

No we can't, because Map's Monoid instance has no Semigroup / Monoid constraint on the values.

It overwrites the values and does not aggregate them.
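We can see the overwriting directly: Map's (<>) is a left-biased union, so a value under a duplicate key is simply dropped, never combined.

```haskell
import qualified Data.Map as Map

-- Left-biased union: the value 2 under the duplicate key "k" is discarded.
main :: IO ()
main = print (Map.singleton "k" (1 :: Int) <> Map.singleton "k" 2)
-- prints fromList [("k",1)]
```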

MonoidalMap to the rescue

Luckily MonoidalMap exists, and it does have the correct constraint on its Monoid instance.

 (Ord k, Semigroup a) => Monoid (MonoidalMap k a)

So we can do

crashesPer4 ::
  (Foldable t, Ord key, Semigroup value) =>
  (CsvRow -> key) ->
  (CsvRow -> value) ->
  t CsvRow ->
  MonoidalMap key value
crashesPer4 toKey toValue = foldMap toMap
  where
    toMap a = MonoidalMap.singleton (toKey a) (toValue a)

Or better yet, make our helper even more generic.

groupAggregate ::
  (Foldable t, Ord key, Semigroup value) => 
  (a -> key) -> (a -> value) -> t a -> MonoidalMap key value
groupAggregate toKey toValue = foldMap toMap
  where
    toMap a = MonoidalMap.singleton (toKey a) (toValue a)
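A stdlib-only sketch of the same grouping idea (plain Data.Map with insertWith standing in for MonoidalMap, which lives in the monoidal-containers package; groupAggregate' is a hypothetical name):

```haskell
import Data.Foldable (foldl')
import Data.Map (Map)
import qualified Data.Map as Map
import Data.Monoid (Sum (..))

-- Same idea as groupAggregate, but combining duplicate keys explicitly with (<>).
groupAggregate' ::
  (Foldable t, Ord key, Semigroup value) =>
  (a -> key) -> (a -> value) -> t a -> Map key value
groupAggregate' toKey toValue =
  foldl' (\m a -> Map.insertWith (<>) (toKey a) (toValue a) m) Map.empty

main :: IO ()
main = print (groupAggregate' fst snd [(1 :: Int, Sum (10 :: Int)), (1, Sum 5), (2, Sum 7)])
-- prints fromList [(1,Sum {getSum = 15}),(2,Sum {getSum = 7})]
```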

More stats less tuples

Next we want to collect several statistics.

We could use an N-tuple.

It would be much nicer if we could use a record.

Does this mean we have to manually define a Semigroup instance for our record?

Luckily, no.
We can use the generic-data library and DerivingVia to automatically
derive a Semigroup instance for our record.

More stats less tuples


-- a data type for the statistics we want to aggregate per key
data Stats = Stats
  { statFirst :: !(Semigroup.Min Date) -- lower bound Date
  , statLast :: !(Semigroup.Max Date) -- upper bound Date
  , statFatalities :: !(Monoid.Sum Fatalities) -- sum of Fatalities
  , statCrashes :: !(Semigroup.Sum Crashes) -- sum of Crashes
  }
  deriving (Generic, Show)
  deriving (Semigroup) via (Generically Stats)


toStats :: CsvRow -> Stats
toStats a =
  Stats
    { statFirst = Semigroup.Min (date a)
    , statLast = Semigroup.Max (date a)
    , statFatalities = maybe mempty Monoid.Sum (fatalities a)
    , statCrashes = Semigroup.Sum 1
    }

Taking it for a spin

Now we can ask all kinds of interesting questions.

main :: IO ()
main = do
  csvData <- ByteString.Lazy.readFile "./Airplane_Crashes_and_Fatalities_Since_1908.csv"
  rows <-
    either fail (pure . snd) $ decodeByName csvData

Taking it for a spin

  putStrLn "4 operators most likely to crash"
  pPrint $
    take 4 $
      sortOn (Down . statCrashes . snd) $
        MonoidalMap.toList $
          groupAggregate operator toStats rows
4 operators most likely to crash
[ ( Operator { getOperator = "Aeroflot" }
  , Stats
      { statFirst = Min { getMin = Date { getDay = 1946-12-04 } }
      , statLast = Max { getMax = Date { getDay = 2008-09-14 } }
      , statFatalities =
          Sum { getSum = Fatalities { getFatalities = 7156 } }
      , statCrashes = Sum { getSum = Crashes { getCrashes = 179 } }
      }
  )
, ( Operator { getOperator = "Military - U.S. Air Force" }
  , Stats
      { statFirst = Min { getMin = Date { getDay = 1943-08-07 } }
      , statLast = Max { getMax = Date { getDay = 2005-03-31 } }
      , statFatalities =
          Sum { getSum = Fatalities { getFatalities = 3717 } }
      , statCrashes = Sum { getSum = Crashes { getCrashes = 176 } }
      }
  )
, ( Operator { getOperator = "Air France" }
  , Stats
      { statFirst = Min { getMin = Date { getDay = 1933-10-31 } }
      , statLast = Max { getMax = Date { getDay = 2009-06-01 } }
      , statFatalities =
          Sum { getSum = Fatalities { getFatalities = 1734 } }
      , statCrashes = Sum { getSum = Crashes { getCrashes = 70 } }
      }
  )
, ( Operator { getOperator = "Deutsche Lufthansa" }
  , Stats
      { statFirst = Min { getMin = Date { getDay = 1926-07-24 } }
      , statLast = Max { getMax = Date { getDay = 1945-04-21 } }
      , statFatalities =
          Sum { getSum = Fatalities { getFatalities = 396 } }
      , statCrashes = Sum { getSum = Crashes { getCrashes = 65 } }
      }
  )
]

Taking it for a spin

  putStrLn "4 aircraft types most likely to crash"
  pPrint $
    take 4 $
      sortOn (Down . statCrashes . snd) $
        MonoidalMap.toList $ groupAggregate acType toStats rows
4 aircraft types most likely to crash
[ ( AcType { getAcType = "Douglas DC-3" }
  , Stats
      { statFirst = Min { getMin = Date { getDay = 1937-10-06 } }
      , statLast = Max { getMax = Date { getDay = 1994-12-15 } }
      , statFatalities =
          Sum { getSum = Fatalities { getFatalities = 4793 } }
      , statCrashes = Sum { getSum = Crashes { getCrashes = 334 } }
      }
  )
, ( AcType
      { getAcType = "de Havilland Canada DHC-6 Twin Otter 300" }
  , Stats
      { statFirst = Min { getMin = Date { getDay = 1972-12-21 } }
      , statLast = Max { getMax = Date { getDay = 2008-10-08 } }
      , statFatalities =
          Sum { getSum = Fatalities { getFatalities = 796 } }
      , statCrashes = Sum { getSum = Crashes { getCrashes = 81 } }
      }
  )
, ( AcType { getAcType = "Douglas C-47A" }
  , Stats
      { statFirst = Min { getMin = Date { getDay = 1944-06-06 } }
      , statLast = Max { getMax = Date { getDay = 1996-12-09 } }
      , statFatalities =
          Sum { getSum = Fatalities { getFatalities = 609 } }
      , statCrashes = Sum { getSum = Crashes { getCrashes = 74 } }
      }
  )
, ( AcType { getAcType = "Douglas C-47" }
  , Stats
      { statFirst = Min { getMin = Date { getDay = 1942-10-01 } }
      , statLast = Max { getMax = Date { getDay = 2000-09-02 } }
      , statFatalities =
          Sum { getSum = Fatalities { getFatalities = 1046 } }
      , statCrashes = Sum { getSum = Crashes { getCrashes = 62 } }
      }
  )
]

Taking it for a spin

  putStrLn "4 deadliest locations"
  pPrint $
    take 4 $
      sortOn (Down . statFatalities . snd) $
        MonoidalMap.toList $
          groupAggregate location toStats rows
4 deadliest locations
[ ( Location { getLocation = "Tenerife, Canary Islands" }
  , Stats
      { statFirst = Min { getMin = Date { getDay = 1965-12-07 } }
      , statLast = Max { getMax = Date { getDay = 1980-04-25 } }
      , statFatalities =
          Sum { getSum = Fatalities { getFatalities = 761 } }
      , statCrashes = Sum { getSum = Crashes { getCrashes = 3 } }
      }
  )
, ( Location
      { getLocation = "Mt. Osutaka, near Ueno Village, Japan" }
  , Stats
      { statFirst = Min { getMin = Date { getDay = 1985-08-12 } }
      , statLast = Max { getMax = Date { getDay = 1985-08-12 } }
      , statFatalities =
          Sum { getSum = Fatalities { getFatalities = 520 } }
      , statCrashes = Sum { getSum = Crashes { getCrashes = 1 } }
      }
  )
, ( Location { getLocation = "Moscow, Russia" }
  , Stats
      { statFirst = Min { getMin = Date { getDay = 1952-03-26 } }
      , statLast = Max { getMax = Date { getDay = 2007-07-29 } }
      , statFatalities =
          Sum { getSum = Fatalities { getFatalities = 432 } }
      , statCrashes = Sum { getSum = Crashes { getCrashes = 15 } }
      }
  )
, ( Location { getLocation = "Near Moscow, Russia" }
  , Stats
      { statFirst = Min { getMin = Date { getDay = 1935-05-18 } }
      , statLast = Max { getMax = Date { getDay = 2001-07-14 } }
      , statFatalities =
          Sum { getSum = Fatalities { getFatalities = 364 } }
      , statCrashes = Sum { getSum = Crashes { getCrashes = 9 } }
      }
  )
]

Let's parse the data in constant space

Because our helper was so general, we can use cassava's streaming decoding support to decode our CSV file in constant space.

statsFromFile :: Ord key => FilePath -> (CsvRow -> key) -> IO (MonoidalMap key Stats)
statsFromFile pathToFile toKey = do
  csvData <- ByteString.Lazy.readFile pathToFile
  either fail (pure . groupAggregate toKey toStats . snd) $ Data.Csv.Streaming.decodeByName csvData


main :: IO ()
main = do
  let statsFromFile' = statsFromFile "./Airplane_Crashes_and_Fatalities_Since_1908.csv"

  putStrLn "4 operators most likely to crash"
  pPrint
    . take 4
    . sortOn (Down . statCrashes . snd)
    . MonoidalMap.toList
    =<< statsFromFile' operator

  putStrLn "4 aircraft types most likely to crash"
  pPrint
    . take 4
    . sortOn (Down . statCrashes . snd)
    . MonoidalMap.toList
    =<< statsFromFile' acType

  putStrLn "4 deadliest locations"
  pPrint
    . take 4
    . sortOn (Down . statFatalities . snd)
    . MonoidalMap.toList
    =<< statsFromFile' location

Let's parse the data in constant space and in parallel

If our data were distributed over multiple large files, we could easily process them in parallel, because a Semigroup is essentially what you need for a map-reduce.

statsFromFiles :: Ord key => [FilePath] -> (CsvRow -> key) -> IO (MonoidalMap key Stats)
statsFromFiles pathToFiles toKey = fmap mconcat $ runConcurrently $ traverse from1File pathToFiles
  where
    from1File pathToFile = Concurrently do
      csvData <- ByteString.Lazy.readFile pathToFile
      either fail (pure . groupAggregate toKey toStats . snd) $ Data.Csv.Streaming.decodeByName csvData
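Associativity is what makes the split-and-combine safe; a tiny base-only check:

```haskell
import Data.Monoid (Sum (..))

main :: IO ()
main = do
  let xs = map Sum [1 .. 100 :: Int]
      (l, r) = splitAt 50 xs
  -- folding the halves separately and combining equals one big fold
  print (mconcat l <> mconcat r == mconcat xs)  -- prints True
```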