Project
Basavaraj K N edited this page Aug 1, 2016
- Generates time-series data that is close to real-world data
- Generates data at scale: large enough and fast enough (1 million records per minute?)
- Model
  - A model represents the data format or schema, with support for complex structured formats
  - Manage models as domains or classes of data
  - Multiple dimensions, with relations among them (how one dimension affects the data pattern of another)
  - Anomalies (for each dimension, specify anomalies and sample anomaly data)
- Generated data is put out as streams
  - A plugin-based system can then tap the stream and provide output in the desired feed mode or to a transport
  - Multiple types of feeds & transports will be developed and dropped in as plug-ins or connectors
- Simulates data growth: for example, if a scenario requires a certain level of historic data to exist, it can generate that too as one of the dimensions
- Uses an external/internal feed as a salt, to generate data for scenarios where only a limited feed is available and data has to be improvised from it
- Streams generated data using WebSockets wherever relevant and performance is needed
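The model-and-anomaly idea above could be sketched as follows. This is a minimal illustration only: the schema keys, the weather domain, and the `generate_record` helper are all assumptions, not part of the project.

```python
import json
import random

# Hypothetical model definition: schema, dimensions with a relation,
# and per-dimension anomaly specs (all names are illustrative).
model = {
    "domain": "weather",
    "dimensions": {
        "temperature": {"format": "number", "range": [10, 40]},
        "humidity": {
            "format": "number", "range": [20, 90],
            # relation: how one dimension affects another
            "relation": {"affects": "temperature", "inverse": True},
        },
    },
    "anomalies": {
        "temperature": {"sample": [-5, 120], "frequency": 0.01},
    },
}

def generate_record(model):
    """Emit one record as a stream element (here: one JSON line)."""
    record = {}
    for name, dim in model["dimensions"].items():
        lo, hi = dim["range"]
        anomaly = model["anomalies"].get(name)
        if anomaly and random.random() < anomaly["frequency"]:
            # inject an anomaly at the configured frequency
            record[name] = random.choice(anomaly["sample"])
        else:
            record[name] = round(random.uniform(lo, hi), 2)
    return json.dumps(record)

print(generate_record(model))
```

A real implementation would push these records onto a stream that transport plug-ins tap, rather than printing them.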
UI Components
- Data model builder
  - Model schema definition
  - Dimensions definition builder
  - Pattern builder: defines the patterns, which are multi-dimensional
  - Pattern editor: will have both a visual editor and a text-based editor (like editing the code behind a UI in .NET visual designers)
  - Attach the model to a transport or feed
- Attach generators
  - As the user builds the model, generators are needed to generate the data
  - The system can recommend a generator expression built in parallel with the data model definition
  - The user can accept it or assemble their own generator expression, which is then executed to generate the data
- Equalizer: configure, fine-tune & control the data generation pattern, with settings for
  - Percentage of normal data
  - Percentage of anomalies
  - Frequency of generation
  - Frequency of injecting anomalies
  - Frequency of spiking the data & the volume to spike by
- Dashboard
  - Lists all models available for generation
  - Displays a quick summary, details, current generation status, and analytics of the generation
- Control panel
  - Lists running generators
  - Lists available generators
  - Helps build new generators
  - Lists/configures external transports and feeds
  - Data model builder
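As a rough illustration of the Equalizer settings listed above, a configuration might look like the sketch below. Every field name and the `validate` helper are assumptions for illustration only, not a defined schema.

```python
# Hypothetical Equalizer settings block (field names are illustrative).
equalizer = {
    "normal_pct": 97.0,        # percentage of normal data
    "anomaly_pct": 3.0,        # percentage of anomalies
    "generation_hz": 1000,     # records generated per second
    "anomaly_every_n": 500,    # inject an anomaly every N records
    "spike_every_n": 10_000,   # spike the data every N records...
    "spike_volume_factor": 5,  # ...multiplying volume by this factor

}

def validate(settings):
    # normal and anomalous data must account for all generated output
    assert settings["normal_pct"] + settings["anomaly_pct"] == 100.0
    assert settings["generation_hz"] > 0
    return settings

validate(equalizer)
```

The UI would surface these knobs as sliders, hence the "equalizer" metaphor.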
Back-end components
- Generators: extensible components, similar to plugins
  - Generators can be primitive or user-defined
  - Users can combine one or more primitives to build a complex generator
  - Primitive generators attach a generator attribute to a domain and implement basic extraction & generation of data from the corresponding domain, with variations such as 'must be unique', 'must be similar', 'format should be ISO', 'must be close to 10 miles', 'must be more than 100 miles', etc., which are properties of the data
    - Noun generators (person names, country names, company names)
    - Location coordinate generators
    - Random number generator
    - Date & time generators
    - Text generators
    - Unique number generator
    - Unique name generator
    - Unique timestamp generator
  - User-defined generators
- Data generator expression executor
- How can we share a generator's output across multiple models?
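A minimal sketch of the primitive-vs-composed generator idea, assuming a generator is simply a callable that yields the next value; all function names here are hypothetical.

```python
import itertools
import random

# Primitive generators: each factory returns a zero-argument callable
# that produces the next value for one attribute.
def random_number(lo, hi):
    def gen():
        return random.randint(lo, hi)
    return gen

def unique_number(start=0):
    counter = itertools.count(start)  # the 'must be unique' property
    def gen():
        return next(counter)
    return gen

def noun(names):
    def gen():
        return random.choice(names)
    return gen

def combine(**fields):
    """Assemble primitives into a complex record generator."""
    def gen():
        return {name: g() for name, g in fields.items()}
    return gen

person_event = combine(
    id=unique_number(1),
    name=noun(["Asha", "Ravi", "Meera"]),
    distance_miles=random_number(90, 110),  # 'must be close to 100 miles'
)
print(person_event())
```

A generator expression in the UI could compile down to a `combine(...)` call like this one.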
Data attributes
These are the fundamental characteristics of the data attributes defined for a model; all other complex models are built using these
- Format (number, date, string)
- Domain (name, money, temperature, distance, pincode, state, random)
- Range
  - Normal
  - Abnormal
  - Random
    - For a given period of time
    - At a given frequency
  - Exceptions (e.g. sometimes the value can be 0; otherwise it will be in the range XX to YY)
- Relation (binding, uniqueness, mutual exclusion)
- Occurrences (e.g. unique per 100 entities or model instances)
- Flow
  - Continuous
  - Frequency
  - Sporadic
- Number of samples
  - Number of packets
    - Exact
    - Range
  - Stop after generating (generate till)
    - A number of packets
    - A given amount of time
- Time series (?? can be handled as part of the data attribute; may not be needed here)
  - Historic
  - Current
  - Future
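The attribute characteristics above might be captured declaratively like this; the keys and the `in_normal_range` helper are illustrative assumptions, not a fixed schema.

```python
# Hypothetical declarative spec for one data attribute, covering the
# characteristics above (format, domain, range, relation, occurrences, flow).
temperature_attr = {
    "format": "number",
    "domain": "temperature",
    "range": {
        "normal": (18, 32),
        "abnormal": (45, 60),
        "exceptions": [0],      # sometimes the value can be 0
    },
    "relation": {"unique": False, "mutually_exclusive_with": []},
    "occurrences": {"unique_per": 100},   # unique per 100 model instances
    "flow": {
        "mode": "continuous",             # continuous | frequency | sporadic
        "samples": {"exact": None, "range": (100, 1000)},
        "stop_after": {"packets": 10_000, "seconds": None},
    },
}

def in_normal_range(attr, value):
    """True if the value is normal for this attribute or a listed exception."""
    lo, hi = attr["range"]["normal"]
    return value in attr["range"]["exceptions"] or lo <= value <= hi

print(in_normal_range(temperature_attr, 25))
```

Complex models would then be compositions of such attribute specs, one per dimension.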
Scalability is needed to ensure data can be generated as quickly as possible, and also to schedule data generation to simulate the needed scenario
- Horizontally scalable
  - Each pattern of each dimension of each model runs simultaneously and distributed
  - As in a map-reduce model, the generated data is put together before streaming out
- Do we need a record-and-playback data feature?
  - Learn the pattern while recording, then generate patterns from the learning with a custom simulation pattern?
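The horizontally scalable, map-reduce-style flow described above can be sketched as follows, with illustrative partitions and a thread pool standing in for a distributed runtime.

```python
from concurrent.futures import ThreadPoolExecutor

# Each (model, dimension) partition generates its slice in parallel (map),
# and the slices are merged before streaming out (reduce). Partitions here
# are illustrative placeholders for real model/dimension/pattern triples.
partitions = [("weather", "temperature"), ("weather", "humidity")]

def generate_partition(part, n=3):
    model, dimension = part
    # map step: generate n records for one dimension of one model
    return [{"model": model, "dimension": dimension, "seq": i} for i in range(n)]

def merge(batches):
    # reduce step: put partial outputs together before streaming out
    merged = [rec for batch in batches for rec in batch]
    return sorted(merged, key=lambda r: (r["seq"], r["dimension"]))

with ThreadPoolExecutor() as pool:
    stream = merge(pool.map(generate_partition, partitions))
print(len(stream))  # → 6
```

In a distributed deployment, the map step would run on separate workers and the reduce step would feed the outgoing stream.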
- Systems expected to frequently process zettabytes of data cannot be tested with real or recorded data, and one cannot always have real devices or systems that generate data at the required pattern, level, and frequency
- Your architecture needs end-to-end testing again, again, and again
- Some data cannot be made available until a real situation is encountered (imagine testing a tsunami-response and data-analysis system)
- Your system has to be tested for how it reacts to data from upstream, or how downstream systems react to data generated by your system
- You are building a predictive system, which needs to be tested with multiple permutations & combinations of data patterns
- Data is needed to train machine-learning systems, and IoT sensor data is needed for sensor-data analysis
- A Boeing 787 aircraft could generate 40 TB per hour of flight; how do you test a system that reacts to such airline data, and generate that much data that quickly?
- Mining operations can generate up to 2.4 TB of data every minute; safety systems in the mining industry are thus expected to work with such large data volumes
- How much data would Uber have?
- Testing wearables and health systems
- Testing smart electric grids