---
layout: default
permalink: datasets
description: from csv to elasticsearch
title: Datasets
hero:
width: is-10
---
Supported connectors are:
- filesystem
- Elasticsearch
- PostgreSQL
SQLAlchemy is used for PostgreSQL, so any SQL connection should work, but only PostgreSQL has been tested so far. See the connectors roadmap for support of other databases in the future.
Default connectors are included in the initial configuration (and can be modified or removed):
- `upload`: filesystem `upload/`, connected to the upload API
- `referential_data`: filesystem `referential_data/`, used for referential data like French city codes, countries, ...
- `models`: filesystem `models/`, used for storing machine learning kernels
- `elasticsearch`: connector to the Elasticsearch provided with docker
- `postgres`: connector to the PostgreSQL provided with docker (you have to run `make postgres` from the command line to start the docker database)
To create other connectors, we haven't provided an online editor for now; you have to add them manually in `conf/connectors/connectors.yml`:
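As a purely illustrative sketch (the key names below are hypothetical guesses based on the default connectors listed above — check the shipped `conf/connectors/connectors.yml` for the authoritative schema), a filesystem connector entry could look like:

```yaml
# hypothetical sketch, not the authoritative schema:
# key names are inferred from the default connectors above
connectors:
  my_files:
    type: filesystem
    path: my_files/
```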
Note that we'll probably never make the effort of dealing with Excel (xls, xlsx) and OpenDocument (odt) formats, as they are too rich to be processed efficiently and reliably. Even if formatting is cute, it is meaningless in an automation world, where formatting has to be carried by metadata, which has to be data itself.
If you want to be serious about reproducing data transformations, you have to forget about manually editing data without making it auditable; self-certification is a wrong security pattern in a globally connected world.
So you'll have to meet us halfway: export your Excel files to csv, and know your business process so you can version your data.
Comma-separated files have been the classical I/O format of open data for many years. Even if it's quite a mess for data types, it is a go-between: between strongly structured formats like XML for IT specialists, and the mainstream Excel-style office world.
If you're used to dealing with many sources, you know CSV is not a real standard and that most guessers can't guess everything: separator, encoding, escaping, etc. We didn't have time to build a guessing layer on top of pandas, and preferred to let you set manual options to handle all the cases.
For robust and stable processing, we deactivated all type guessing, so every cell is a string (unicode) on input: casting data types will be possible later, within recipes. For the same reason, we don't deal with `na_values` here, and `keep_default_na` is forced to `False`.
| option | default | other | objective |
|---|---|---|---|
| sep | ; | any regex | specify columns separator |
| header | infer | false | use the first row as column names |
| encoding | utf8 | latin1 ... | specify the encoding if not ascii |
| names | | [col, names] | replace header (column) names |
| compression | infer | None, gzip ... | specify if compressed |
| skiprows | 0 | any number | skip n rows before processing |
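These options map onto pandas `read_csv` keyword arguments; the following sketch (inline sample data standing in for a real file) shows the forced string typing and `keep_default_na=False` behaviour described above:

```python
import io

import pandas as pd

# inline sample standing in for a real csv file, using the
# default ';' separator from the options above
raw = io.StringIO("id;city\n01;Paris\nNA;Lyon\n")

# dtype=str keeps every cell as a string (no type guessing),
# and keep_default_na=False keeps "NA" as a literal string
# instead of turning it into NaN
df = pd.read_csv(raw, sep=";", dtype=str, keep_default_na=False)
```

Note how `"01"` keeps its leading zero and `"NA"` survives as data rather than becoming a missing value.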
This is the older brother of CSV: fixed-width tabular files are a variant, and in some cases it is more stable to parse your tabular files as fixed width. Oracle or PostgreSQL exports often have to be parsed this way.
As with csv, for robust and stable processing, we deactivated all type guessing, so every cell is a string (unicode) on input: casting data types will be possible later, within recipes. For the same reason, we don't deal with `na_values` here, and `keep_default_na` is forced to `False`.
| option | default | other | objective |
|---|---|---|---|
| encoding | utf8 | latin1 ... | specify the encoding if not ascii |
| names | | [col, names] | replace header (column) names |
| width | [1000] | [2, 5, 1, ...] | column widths for the fixed-width format |
| compression | infer | None, gzip ... | specify if compressed |
| skiprows | 0 | any number | skip n rows before processing |
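The fixed-width options map onto pandas `read_fwf`; a minimal sketch (inline sample data standing in for a real export):

```python
import io

import pandas as pd

# inline sample standing in for a real fixed-width export:
# 3 characters for the id column, 6 for the city column
raw = io.StringIO("001Paris \n002Lyon  \n")

# widths plays the role of the 'width' option above; names
# replaces the (absent) header; dtype=str and
# keep_default_na=False keep every cell as a raw string
df = pd.read_fwf(raw, widths=[3, 6], names=["id", "city"],
                 dtype=str, keep_default_na=False)
```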
You may not be familiar with this format. It is a simple and quite robust format for backing up data when the data is typed. We could have chosen HDF5 (strong, but tedious to integrate, and ageing), json/bson (quite slow), pickle (too unstable between versions), or parquet (seducing, but the pandas library doesn't deal well with chunks).
Check the roadmap about future support for json, xml or any other type.
There are no specific options when using a (Postgre)SQL dataset at the moment.
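Since SQLAlchemy does the work, reading boils down to a plain SQL query turned into a dataframe; the following sketch uses an in-memory SQLite database purely for illustration (the tool itself targets PostgreSQL):

```python
import sqlite3

import pandas as pd

# an in-memory SQLite database stands in for PostgreSQL here;
# in the tool itself, SQLAlchemy provides the connection
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cities (id TEXT, name TEXT)")
conn.execute("INSERT INTO cities VALUES ('01', 'Paris')")
conn.commit()

# pandas turns the query result straight into a dataframe
df = pd.read_sql("SELECT * FROM cities", conn)
```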
| option | default | other | objective |
|---|---|---|---|
| random_view | True | False | display random sample by default |
| select | {"query": {"match_all": {}} } | any es query | query for filtering the index |
| doc_type | table name | any string | change document type name |
| body | {} | cf infra | settings and mappings |
| max_tries | 3 | any integer | number of retries (with exponential backoff) |
| thread_count | connector value | any integer | number of threads for inserting |
| timeout | 10 | time in seconds | timeout for bulk indexing & reading |
| safe | True | False | if set to False, doesn't use the _id field to index, which can lead to duplicates when retrying |
| chunk_search | connector value | any number | number of rows for search queries when using fuzzy join |
The big challenge with elasticsearch datasets at scale is the range of tuning possibilities. The `body` value is the equivalent of the body you would send when creating the index with curl, except it is written in yaml instead of json, which makes it easier to read.
Here is an example configuration:
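As a hypothetical sketch (the dataset keys around `body` are illustrative; only the `settings`/`mappings` structure follows the standard elasticsearch create-index API):

```yaml
# hypothetical sketch of an elasticsearch dataset with a tuned body;
# the body mirrors what you would PUT with curl when creating the
# index, but written in yaml
datasets:
  my_index:
    connector: elasticsearch
    table: my_index
    body:
      settings:
        number_of_shards: 1
        number_of_replicas: 0
      mappings:
        properties:
          city:
            type: keyword
```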
Elasticsearch can be used to validate matches (as seen in the [tutorial](/tutorial#step-3-validate-matches-and-train-rescoring-with-machine-learning)).
The validation mode is activated by adding the `validation: true` option:
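For instance (the dataset name and surrounding keys are illustrative, not a prescribed schema):

```yaml
# hypothetical sketch: a dataset with validation enabled
datasets:
  my_index:
    connector: elasticsearch
    table: my_index
    validation: true
```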