DataFrames in C#

Dataframes are widely used to manipulate data. The ML.NET API implements a kind of streaming dataframe, which makes it possible to go through a huge volume of data with a map/reduce-style API. Most of the time, however, the data fits in memory, and it is then convenient to manipulate it with SQL-like methods. That’s what the class DataFrame implements. Many examples can be found in the unit test file TestDataManipulation.cs.

StreamingDataFrame

The class StreamingDataFrame is a wrapper around the IDataView interface. It adds easy conversion to DataFrame and makes it easy to parse a single file or multiple files.

var sdf = StreamingDataFrame.ReadCsv("iris.txt", sep: '\t');
var sdf2 = StreamingDataFrame.ReadCsv(new [] {"part1.txt", "part2.txt"}, sep: '\t');
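
The streaming idea can be illustrated independently of ML.NET. The sketch below computes the mean of one column in a single pass over several tab-separated files, keeping only a running sum and a count in memory. It relies on the .NET base library alone; the names StreamingSketch, ReadColumn and ColumnMean, as well as the file layout (one header line followed by numeric values), are assumptions made for illustration and are not part of StreamingDataFrame.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;
using System.Linq;

public static class StreamingSketch
{
    // Lazily yields the values of one column, file after file,
    // skipping the header line of each file.
    public static IEnumerable<double> ReadColumn(IEnumerable<string> paths, int column, char sep)
    {
        foreach (var path in paths)
            foreach (var line in File.ReadLines(path).Skip(1))
                if (line.Length > 0)
                    yield return double.Parse(line.Split(sep)[column], CultureInfo.InvariantCulture);
    }

    // Single pass over the data: only a running sum and a count are kept in memory.
    public static double ColumnMean(IEnumerable<string> paths, int column, char sep)
    {
        double sum = 0;
        long count = 0;
        foreach (var value in ReadColumn(paths, column, sep))
        {
            sum += value;
            count++;
        }
        return count == 0 ? double.NaN : sum / count;
    }
}

public static class Program
{
    public static void Main()
    {
        // Two small temporary files stand in for "part1.txt" and "part2.txt".
        var part1 = Path.GetTempFileName();
        var part2 = Path.GetTempFileName();
        File.WriteAllText(part1, "x\ty\n1.0\t10\n2.0\t20\n");
        File.WriteAllText(part2, "x\ty\n3.0\t30\n");

        // Mean of the first column over both files (values 1, 2 and 3).
        Console.WriteLine(StreamingSketch.ColumnMean(new[] { part1, part2 }, 0, '\t'));

        File.Delete(part1);
        File.Delete(part2);
    }
}
```

Because File.ReadLines enumerates lines lazily, memory use does not grow with the size of the files, which is the property a streaming dataframe is meant to preserve.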

DataFrame

The class DataFrame replicates some of the functionality data scientists are used to in other languages such as Python or R. Basic operations on columns are possible:

var text = "AA,BB,CC\n0,1,text\n1,1.1,text2";
var df = DataFrameIO.ReadStr(text);
df["AA+BB"] = df["AA"] + df["BB"];
Console.WriteLine(df.ToString());
AA,BB,CC,AA+BB
0,1,text,1
1,1.1,text2,2.1

Or:

df["AA2"] = df["AA"] + 10;
Console.WriteLine(df.ToString());
AA,BB,CC,AA+BB,AA2
0,1,text,1,10
1,1.1,text2,2.1,11
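
The two operations above are element-wise: adding two columns sums them row by row, and adding a scalar applies it to every row. The following sketch reproduces that behavior on plain arrays with LINQ. The class ColumnOps and its Add overloads are hypothetical helpers, and every column is treated as double for simplicity; this is not how DataFrame is implemented.

```csharp
using System;
using System.Linq;

public static class ColumnOps
{
    // Column + column: values are added row by row;
    // both columns must have the same number of rows.
    public static double[] Add(double[] left, double[] right)
    {
        if (left.Length != right.Length)
            throw new ArgumentException("Columns must have the same number of rows.");
        return left.Zip(right, (a, b) => a + b).ToArray();
    }

    // Column + scalar: the scalar is added to every row.
    public static double[] Add(double[] column, double scalar)
    {
        return column.Select(v => v + scalar).ToArray();
    }
}

public static class Program
{
    public static void Main()
    {
        var aa = new[] { 0.0, 1.0 };
        var bb = new[] { 1.0, 1.1 };

        // Row-wise sums of AA and BB.
        Console.WriteLine(string.Join(",", ColumnOps.Add(aa, bb)));
        // AA shifted by 10.
        Console.WriteLine(string.Join(",", ColumnOps.Add(aa, 10.0)));
    }
}
```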

The next instructions change one value based on a condition.

df.loc[df["AA"].Filter<DvInt4>(c => (int)c == 1), "CC"] = "changed";
Console.WriteLine(df.ToString());
AA,BB,CC,AA+BB,AA2
0,1,text,1,10
1,1.1,changed,2.1,11
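
Conceptually, a conditional assignment of this kind works in two steps: a predicate is evaluated on one column to build a boolean mask with one entry per row, then the target column is overwritten wherever the mask is true. The sketch below shows these two steps on plain arrays. MaskSketch, Filter and Assign are hypothetical names chosen for illustration; they only mirror the semantics of the example above, not the library's internals.

```csharp
using System;
using System.Linq;

public static class MaskSketch
{
    // Builds a boolean mask: one entry per row, true where the predicate holds.
    public static bool[] Filter<T>(T[] column, Func<T, bool> predicate)
    {
        return column.Select(predicate).ToArray();
    }

    // Overwrites the target column wherever the mask is true;
    // the other rows are left untouched.
    public static void Assign<T>(T[] target, bool[] mask, T value)
    {
        if (target.Length != mask.Length)
            throw new ArgumentException("Mask and column must have the same number of rows.");
        for (int i = 0; i < target.Length; i++)
            if (mask[i])
                target[i] = value;
    }
}

public static class Program
{
    public static void Main()
    {
        var aa = new[] { 0, 1 };
        var cc = new[] { "text", "text2" };

        var mask = MaskSketch.Filter(aa, v => v == 1); // { false, true }
        MaskSketch.Assign(cc, mask, "changed");

        Console.WriteLine(string.Join(",", cc)); // text,changed
    }
}
```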

A specific set of columns or rows can be extracted:

var view = df[df.ALL, new [] {"AA", "CC"}];
Console.WriteLine(view.ToString());
AA,CC
0,text
1,changed

The dataframe also allows basic filtering:

var filtered = df[df["AA"] == 0];
Console.WriteLine(filtered.ToString());
AA,BB,CC,AA+BB,AA2
0,1,text,1,10