Three data formats are used very frequently in data science: CSV (Comma-Separated Values), JSON (JavaScript Object Notation) and XML (Extensible Markup Language). The F# Data library (FSharp.Data) implements almost everything you need to access data stored in these formats. Moreover, FSharp.Data implements F# type providers that infer the record structure from a sample document and thus allow the record structure to be checked at compile time.
CSV Files
For reading and writing CSV files, the FSharp.Data package implements CsvProvider. This provider can be initialized either by passing the Sample parameter, whose value is a string or a file that contains the CSV sample or a list of samples:
type Teams = CsvProvider<"baseball.csv">
or by passing the header and the schema parameters:
type Teams = CsvProvider<"""Team,League,Year,RS,RA,W,OBP,SLG,BA,Playoffs,RankSeason,RankPlayoffs,G,OOBP,OSLG""",
                         Schema="""string,string,int,int,int,float,float,float,float,int,int?,int?,float,float,float""">
Initialization by sample data is perfect for fast prototyping, but when distributing the resulting library or executable it is better to specify the header and schema parameters, or to use an embedded resource:
EmbeddedResource="baseball.csv"
Teams is a type generated by the generic CsvProvider from the sample data.
To read a CSV file, call the static method Load. The result of Load is an instance of the Teams provider type:
let teams = Teams.Load(Environment.CurrentDirectory + "/baseball.csv")
The teams instance has a Rows property that returns a sequence of CSV file records. Records have the type Teams.Row, which exposes the CSV file fields; their names and types are checked at compile time. For example, the following fragment, which selects the teams that made the playoffs, generates the compile-time error “This expression was expected to have type int, but here has type string”:
let playoffs = teams.Rows |> Seq.filter (fun r -> r.Playoffs = "1")
for r in playoffs do
    printfn "%s" r.Team
The teams instance has Filter and Map methods that allow some basic transformations of CSV records and return a modified CsvProvider instance. The Rows property, however, has the type seq<Teams.Row>, an ordinary F# sequence that can be transformed and analysed in a multitude of ways using the Seq module functions:
let yearOf (r : Teams.Row) = r.Year
// Number of distinct seasons in the file
printfn "%d" (teams.Rows |> Seq.distinctBy yearOf |> Seq.length)
// Number of playoff teams per season
for (year, rs) in (playoffs |> Seq.groupBy yearOf) do
    printfn "%d %d" year (rs |> Seq.length)
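The Filter method mentioned above, together with writing CSV back out, can be sketched as follows. This is a minimal illustration rather than part of the original analysis: it assumes an F# script that pulls FSharp.Data from NuGet, and the three-column header, the inline rows and the 90-win threshold are all hypothetical.

```fsharp
#r "nuget: FSharp.Data"
open FSharp.Data

// Hypothetical, shortened header and schema (not the full baseball.csv layout).
type Teams = CsvProvider<"Team,Year,W", Schema="string,int,int">

// Parse inline text instead of loading a file, so the sketch is self-contained.
let teams = Teams.Parse("Team,Year,W\nARI,2012,81\nATL,2012,94")

// Filter returns a new provider instance that holds only the matching rows.
let winners = teams.Filter(fun r -> r.W > 90)

// SaveToString serializes the remaining rows, header included, back to CSV text.
printfn "%s" (winners.SaveToString())
```

Because Filter returns a provider instance rather than a plain sequence, the result can be saved directly, whereas a sequence produced by Seq.filter cannot.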
JSON Files
For reading JSON files, the FSharp.Data package implements JsonProvider. This provider is initialized by passing the Sample parameter, whose value is a string or a file that contains the JSON sample or a list of samples:
type Businesses = JsonProvider<"yelp_business.json", SampleIsList=true>
JsonProvider does not support JSON schema definitions, but samples can be embedded as resources in the resulting library or executable for distribution:
EmbeddedResource="yelp_business.json"
Unfortunately, JsonProvider does not support the newline-delimited JSON file format (JSON Lines), but it is easy to convert such a file, line by line, into a sequence of JSON objects using Seq.map:
open System.IO
...
let businesses = File.ReadLines("yelp_training_set/yelp_training_set_business.json")
                 |> Seq.map Businesses.Parse
Now, analysis of JSON data becomes easy and type-safe:
for (state, bs) in (businesses |> Seq.groupBy (fun b -> b.State)) do
    printfn "%s %d" state (bs |> Seq.length)
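To illustrate what a list of samples buys you, here is a small sketch, again assuming an F# script with FSharp.Data referenced from NuGet. The two inline samples and the field names (name, stars) are hypothetical and are not taken from the Yelp data set. Because the second sample lacks the stars field, the provider is expected to infer it as an optional value.

```fsharp
#r "nuget: FSharp.Data"
open FSharp.Data

// Two hypothetical samples; "stars" is missing from the second one,
// so the generated Stars property should be an option.
type Business =
    JsonProvider<"""[ { "name": "Sample Cafe", "stars": 4.5 },
                      { "name": "Sample Diner" } ]""", SampleIsList=true>

// With SampleIsList=true, Parse expects a single object, not a list.
let b = Business.Parse("""{ "name": "Another Cafe" }""")

// The compiler forces both cases to be handled.
match b.Stars with
| Some stars -> printfn "%s: %A" b.Name stars
| None -> printfn "%s: no rating" b.Name
```

The more varied the samples, the closer the inferred types are likely to be to the real data, which matters when some records omit fields.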
XML Files
For reading XML files, the FSharp.Data package implements XmlProvider. This provider is initialized by passing the Sample parameter, whose value is a string or a file that contains the XML sample or a list of samples:
type News = XmlProvider<"""<?xml version="1.0" encoding="iso-8859-1"?>
<newsitem itemid="2286" id="root" date="1996-08-20" xml:lang="en">
<title>Sample Title 1</title>
<headline>Sample Headline 1</headline>
<byline>Sample Author 1</byline>
<dateline>Sample Date Line 1</dateline>
<text>
<p>Sample Text 1</p>
<p>Sample Text 2</p>
</text>
</newsitem>""">
XmlProvider is designed to load or parse only one XML document at a time, but with sequence expressions it is easy to build a sequence of lazily parsed XML trees:
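open System.IO
open System.IO.Compression

// Lazily parse every entry of every ZIP archive in the current directory.
let news = seq {
    for file in Directory.EnumerateFiles(".", "*.zip") do
        use zip = ZipFile.OpenRead(file)
        for entry in zip.Entries do
            use stream = new StreamReader(entry.Open())
            yield News.Parse(stream.ReadToEnd()) }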
Unlike DOM trees, the typed XML trees generated by XmlProvider are easy to process:
for n in news do
    printfn "%s" n.Headline
Conclusions
Unlike most statically typed languages, F# allows a quick start to data analysis without the need to create domain objects or to do tedious work with DOM-like structures. At the same time, type providers generate strongly typed structures, which help to minimize the number of silly mistakes that are usually seen in dynamically typed languages.