The CsvDataSet provider stores data in a delimited text file as a flat table. Specifically, the provider supports and extends the comma-separated value (CSV) format. A classic CSV file is a text file that represents data as a table, where each line corresponds to a row in the table. Columns within a line are separated by commas. CSV files are convenient because they are human-readable and supported by many applications, including Microsoft Excel.

The CsvDataSet provider extends the described file format. First, it supports different delimiters, including comma, semicolon, tab and space. Second, it can append additional tables to the output file, separated from the main data table by empty lines. These extra tables contain information about the data structure and metadata. For more information about the delimiters and appended data, see the documentation for the URI parameters "separator" and "appendMetadata", respectively.

By default, the first line of a CSV file contains the display names of the variables. Use the "noHeader" or "saveHeader" parameters of a URI to change this behavior. If the file does not contain a metadata table and thus does not explicitly define the names of the variables, the CsvDataSet provider uses the display names as the names of the variables.

The provider supports variables of any non-negative rank. For two-dimensional variables, the first index is a row and the second index is a column. Such variables are stored as several consecutive columns within a file. Only the first of these columns has a header; the headers of the remaining columns of the variable are empty. The provider uses this rule to recognize a multidimensional variable in an input file. However, if the first column in a file is empty, the corresponding variable is always considered one-dimensional.
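The column layout for a two-dimensional variable can be illustrated with a minimal sketch. It uses only the .NET base class library, not the CsvDataSet provider itself; the class name TwoDimensionalLayoutSketch, the method Render and the variable names are hypothetical, and the exact text that the provider writes (for example, number formatting) may differ.

```csharp
using System.Collections.Generic;
using System.Globalization;

static class TwoDimensionalLayoutSketch
{
    // Renders a one-dimensional variable followed by a two-dimensional one
    // as comma-separated lines. Assumes both have the same number of rows.
    public static List<string> Render(
        string name1, double[] column, string name2, double[,] table)
    {
        int rows = table.GetLength(0);
        int cols = table.GetLength(1);
        var lines = new List<string>();

        // Header line: the 2D variable names only its first column;
        // its remaining columns get empty headers.
        var header = new List<string> { name1, name2 };
        for (int c = 1; c < cols; c++)
            header.Add(string.Empty);
        lines.Add(string.Join(",", header));

        // Data lines: the first index selects the line, the second the column.
        for (int r = 0; r < rows; r++)
        {
            var cells = new List<string>
            {
                column[r].ToString(CultureInfo.InvariantCulture)
            };
            for (int c = 0; c < cols; c++)
                cells.Add(table[r, c].ToString(CultureInfo.InvariantCulture));
            lines.Add(string.Join(",", cells));
        }
        return lines;
    }
}
```

For a variable "time" with values 0, 1, 2 and a 3-by-2 variable "temp", Render produces the header line "time,temp," followed by "0,10,11", "1,20,21" and "2,30,31" when temp holds those values. The trailing empty header marks the second column as belonging to "temp".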

Variables of three or more dimensions are also stored as several consecutive columns. The rule is that the last index corresponds to columns, the preceding index corresponds to rows, and the other indices are placed consecutively in the following rows. That is, the last indices vary fastest.
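One reading of this rule is a row-major flattening of all indices except the last one. The sketch below is an interpretation of the description above, not code taken from the provider; the class HigherRankLayoutSketch and the method Locate are hypothetical names. It maps an element index to a zero-based row and a column offset within the block of columns that belongs to the variable.

```csharp
using System;

static class HigherRankLayoutSketch
{
    // Maps an element index of an array with the given shape (rank >= 2)
    // to a zero-based row and a column offset within the variable's columns.
    public static void Locate(int[] index, int[] shape, out int row, out int column)
    {
        if (shape.Length < 2 || index.Length != shape.Length)
            throw new ArgumentException("Expected rank >= 2 and matching lengths.");

        int rank = shape.Length;

        // The last index selects the column.
        column = index[rank - 1];

        // All preceding indices are flattened into the row number,
        // with later indices varying faster.
        row = 0;
        for (int d = 0; d < rank - 1; d++)
            row = row * shape[d] + index[d];
    }
}
```

For a variable of shape [2, 3, 4], element [1, 2, 3] lands in row 1 * 3 + 2 = 5 and column offset 3. For a two-dimensional variable the same function reduces to row = first index, column = second index.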

The provider is associated with the provider name "csv" and the extensions ".csv" and ".tsv". The assembly name is Microsoft.Research.Science.Data.Csv.

For details about input and output file encodings, see the description of the URI parameter "utf8bom" in the list below.

CsvDataSet accepts the following parameters in a URI (see also DataSet Uri):
  • file Specifies the file to open or create. This can be either a local file path or an HTTP URL. Examples:
If an HTTP URL is specified, the file must exist and the constructed DataSet is always read-only. If you need to modify the DataSet, consider cloning it into a MemoryDataSet. See DataSet.Clone(String) in the library documentation.
  • openMode=createNew|create|open|openOrCreate|readOnly The flag "openMode" specifies how the DataSet should open a file. Possible values for the flag are:
    • createNew Specifies that the DataSet should create a new resource. If the resource already exists, an IOException is thrown.
    • open Specifies that the DataSet should open an existing resource. If the resource does not exist, a ResourceNotFoundException is thrown.
    • readOnly Specifies that the DataSet should open an existing resource for reading only. If the resource does not exist, a ResourceNotFoundException is thrown.
    • create Specifies that the DataSet should create a new resource. If the resource already exists, it is re-created.
    • openOrCreate Specifies that the DataSet should open a resource if it exists; otherwise, a new resource is created.
If the file is read-only or is an HTTP URL, the constructed DataSet is always read-only. To modify it, consider cloning it into a MemoryDataSet. See Clone(String).
  • appendMetadata=true|false Indicates whether to append metadata tables to the end of the file. These tables describe the structure of the DataSet. They can cause incompatibility with some CSV-oriented programs. If this occurs, set the parameter value to false. If the parameter is not specified, metadata is appended only if the input file already contains appended metadata.
  • noHeader=true|false Indicates whether the first line of the input file contains data or is a header that contains the display names of the variables. If this value is true, the first line contains data. This parameter does not affect an output file. See also parameter "saveHeader". The default is false; that is, by default an input file must have a header line.
  • saveHeader=true|false Specifies whether the provider must save a header line into an output file. If you omit this parameter, its value depends on the "noHeader" parameter. Specifically, if the input file has no header, the provider does not save a header in the output file, and vice versa.
  • fillUpMissingValues=true|false Indicates whether the CSV provider should interpret empty values in the input file—that is, empty strings between separators—as missing values. The default is false.
  • inferInt=true|false Indicates which types the CSV provider infers on loading. If the value is true, the provider infers int, double, DateTime and string. Otherwise, it infers only double, DateTime and string. The default is false.
  • culture Specifies the locale name that the provider uses to parse the input file. The default is the invariant locale. Note that this parameter does not affect output files, because the provider always writes output files using the invariant locale. Example:
  • inferDims=true|false Indicates whether the provider should infer dimensions. If the value is true, when the provider parses an input file, it assumes that all variables that have the same length, and for which no metadata is defined, share the same dimensions. The default is false.

In the following example, if inferDims is true, variables a and c share the same dimension, because they both have a length of three. Variables b and d also share a dimension, because they both have two elements. If inferDims is false, none of the variables share any dimensions.
a    b    c    d
1    10    100    string1
2    20    200    string2
3          300
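The grouping that this example describes can be sketched as follows. This is only an illustration of the stated rule (variables of equal length share a dimension), not the provider's implementation; the class InferDimsSketch and the method GroupByLength are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class InferDimsSketch
{
    // Groups variable names by length. Variables in the same group are the
    // ones that would share a dimension when inferDims is true.
    public static Dictionary<int, List<string>> GroupByLength(
        IEnumerable<KeyValuePair<string, int>> lengths)
    {
        return lengths
            .GroupBy(p => p.Value)
            .ToDictionary(
                g => g.Key,
                g => g.Select(p => p.Key)
                      .OrderBy(n => n, StringComparer.Ordinal)
                      .ToList());
    }
}
```

Passing the lengths from the example (a: 3, b: 2, c: 3, d: 2) yields one group for length 3 containing a and c, and one group for length 2 containing b and d.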
  • utf8bom=true|false Specifies whether the output file is written in UTF8 encoding with or without a signature (byte order mark). If the utf8bom parameter is specified and its value is
    • true, the output encoding is UTF8 with signature
    • false, the output encoding is UTF8 without signature
If the utf8bom parameter is not specified and the file already exists, the provider keeps the current encoding of the file for output. Otherwise, if the file is newly created, the provider uses UTF8 with signature.
  • separator=comma|semicolon|space|tab Specifies the character that separates values in the file. The default is "comma".
  • include Includes variables as references from another dataset into this dataset. You must supply the URI for the other dataset. Only variable names that conform to C# or Python 3 naming rules can be specified in the include parameter. Example:
This is the escaped version of "msds:memory?include=escape[msds:csv?file=example.csv#lat,lon]". It includes the variables lat and lon from msds:csv?file=example.csv. If no variable names are specified, all variables from the dataset are included.
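The escaping of a nested URI can be sketched with the standard .NET method Uri.EscapeDataString. Whether the library's "escape[...]" notation corresponds exactly to this percent-encoding is an assumption here, and the class IncludeUriSketch and the method Build are hypothetical names used only for illustration.

```csharp
using System;

static class IncludeUriSketch
{
    // Percent-encodes the inner dataset URI so that its ':', '?', '=', '#'
    // and ',' characters are not read as part of the outer URI.
    // Assumption: the provider expects standard percent-encoding here.
    public static string Build(string outerUri, string innerUri)
    {
        return outerUri + "?include=" + Uri.EscapeDataString(innerUri);
    }
}
```

Calling IncludeUriSketch.Build("msds:memory", "msds:csv?file=example.csv#lat,lon") returns "msds:memory?include=msds%3Acsv%3Ffile%3Dexample.csv%23lat%2Clon".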
The CsvDataSet provider supports URIs that contain a path and additional parameters that are appended after a question mark. For example:
The following example creates a CsvDataSet from the existing file example.csv and requests that all variables with the same number of elements share a dimension.
using (DataSet ds = DataSet.Create("msds:csv?file=example.csv&openMode=open&inferDims=true"))
{
    // Work with ds here; it is disposed when the block exits.
}
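When several parameters are combined, it can help to assemble the URI string from name/value pairs. The helper below is a hypothetical convenience, not part of the library; it only concatenates strings in the "msds:csv?name=value&name=value" form used in the example above and does not validate or escape parameter names or values.

```csharp
using System.Collections.Generic;
using System.Linq;

static class CsvUriSketch
{
    // Builds a CsvDataSet URI string from a file name and additional
    // parameters, joined with '&' after the question mark.
    public static string Build(string file, params KeyValuePair<string, string>[] parameters)
    {
        var parts = new List<string> { "file=" + file };
        parts.AddRange(parameters.Select(p => p.Key + "=" + p.Value));
        return "msds:csv?" + string.Join("&", parts);
    }
}
```

For example, Build("example.csv", new KeyValuePair<string, string>("openMode", "open"), new KeyValuePair<string, string>("inferDims", "true")) returns "msds:csv?file=example.csv&openMode=open&inferDims=true", which is the URI used in the example above.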
CsvDataSet provides a special metadata attribute with the key "csv_column". Its value is a zero-based integer that indicates the column within the file that corresponds to a variable. For multidimensional variables that are spread across several columns, the value is the index of the first column. A "csv_column" attribute is generated automatically when the dataset is loaded. If a file already contains a "csv_column" attribute for some variables, these attributes are ignored and a warning is traced.
Although the "csv_column" attribute is writable, in the current release its modification does not affect the output file. It will be taken into account in future releases.

The CsvDataSet saves DateTime variables using the following format string: "yyyy-MM-dd hh:mm:ss" (e.g. "1999-04-20 14:10:00"), which Microsoft Excel accepts. This DateTime format is precise only to the second. If greater precision is required, you can store Ticks by using a variable of type Int64.
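The precision loss and the Ticks workaround can be illustrated with standard .NET calls only. The class DateTimePrecisionSketch and its methods are hypothetical names. The sketch uses the 24-hour "HH" specifier, an assumption made so that the output matches the example value above; in .NET custom format strings, "hh" denotes a 12-hour clock.

```csharp
using System;
using System.Globalization;

static class DateTimePrecisionSketch
{
    // Assumption: 24-hour "HH", so that 14:10 is written as "14:10:00".
    const string Format = "yyyy-MM-dd HH:mm:ss";

    // Text form with one-second precision; any sub-second part is dropped.
    public static string ToCsvText(DateTime value)
    {
        return value.ToString(Format, CultureInfo.InvariantCulture);
    }

    // Ticks (100-nanosecond units) keep the full precision and fit in Int64.
    public static long ToTicks(DateTime value)
    {
        return value.Ticks;
    }

    public static DateTime FromTicks(long ticks)
    {
        return new DateTime(ticks);
    }
}
```

For the value new DateTime(1999, 4, 20, 14, 10, 0).AddMilliseconds(250), ToCsvText returns "1999-04-20 14:10:00" and the 250 milliseconds are lost, whereas FromTicks(ToTicks(value)) returns a value equal to the original.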

Last edited Apr 21, 2010 at 8:52 PM by pennyo, version 7

