Schema on Read
What is Schema on Read?
Schema-on-read is a data handling approach in which data is ingested into the repository without a predefined schema, and structure is applied only when the data is read. In contrast, schema-on-write validates and transforms data against a fixed schema or set of rules as it is ingested into the repository.
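To make the contrast concrete, here is a minimal sketch using only the Python standard library. The table layout, field names, and sample records are assumptions chosen for illustration, and an in-memory SQLite table stands in for any repository that enforces a fixed schema at ingestion.

```python
import json
import sqlite3

# Hypothetical incoming records; their shapes differ on purpose.
records = [
    {"user_id": 1, "event": "click"},
    {"user_id": 2, "event": "view", "device": "mobile"},  # extra field
    {"event": "purchase"},                                 # missing user_id
]

# Schema-on-write: the table definition is fixed before any data arrives.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER NOT NULL, event TEXT NOT NULL)")

accepted, rejected = 0, 0
for rec in records:
    try:
        # Only the columns named in the schema are written; "device" is dropped.
        conn.execute(
            "INSERT INTO events (user_id, event) VALUES (?, ?)",
            (rec.get("user_id"), rec.get("event")),
        )
        accepted += 1
    except sqlite3.IntegrityError:
        # The record violates the schema and is refused at write time.
        rejected += 1

# Schema-on-read: every record is stored verbatim; no structure is enforced yet.
raw_store = [json.dumps(rec) for rec in records]

stored_rows = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
```

In this sketch the fixed-schema table ends up with fewer rows than the raw store, and the extra `device` field survives only in the raw copy.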
How does Schema-on-read work?
The data is ingested into the system without a predefined schema, which allows for flexibility in handling structured, semi-structured, and unstructured data. When the data is queried, the system reads it and applies a schema at that point, either inferred from the data's characteristics or supplied by the query or downstream consumer. This schema is then used to interpret and process the data for the specific need or use case.
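The read step described above can be sketched as follows. This is an illustration under stated assumptions, not how any particular platform implements it: the raw lines, the `PURCHASE_SCHEMA` mapping, and the `read_with_schema` helper are all hypothetical names.

```python
import json

# Raw events stored exactly as they arrived; the fields differ per record.
RAW_LINES = [
    '{"user_id": "42", "event": "click", "ts": "2024-01-05T10:00:00"}',
    '{"user_id": 7, "event": "view", "device": "mobile"}',
    '{"user_id": "13", "amount": "19.99", "event": "purchase"}',
]

# Hypothetical read-time schema: field name -> (converter, default value).
PURCHASE_SCHEMA = {
    "user_id": (int, None),
    "event": (str, ""),
    "amount": (float, 0.0),
}

def read_with_schema(lines, schema):
    """Parse each raw line and project it onto the schema the reader supplies."""
    rows = []
    for line in lines:
        record = json.loads(line)
        row = {}
        for field, (convert, default) in schema.items():
            value = record.get(field)
            # Convert present values; fall back to the default when absent.
            row[field] = convert(value) if value is not None else default
        rows.append(row)
    return rows

rows = read_with_schema(RAW_LINES, PURCHASE_SCHEMA)
```

A different consumer could pass a different schema over the same `RAW_LINES` without the stored data changing.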
Advantages of Schema on Read
- Flexibility: Schema-on-read allows your data platform to ingest all data types in their raw, “as is” format. This flexibility in storage supports exploratory analysis, machine learning, and AI projects, all of which rely heavily on unstructured and semi-structured data.
- Real-time analytics: By removing data transformation and schema enforcement from the ingestion path, schema-on-read lets your data platform ingest data as it arrives, which in turn supports real-time and near-real-time analysis.
- Scalability and cost savings: Schema-on-read systems can scale quickly because the repository accepts large volumes of data that do not have to conform to a fixed schema upon ingestion. Avoiding up-front transformation can reduce storage and data processing costs and allows need-based scaling, which can translate to a lower total cost of ownership (TCO).
Key considerations while using schema on read
- Data quality: Because nothing is validated at ingestion, inconsistent or poorly formatted data is stored as is and surfaces only at read time. Data cleaning and validation steps become even more crucial.
- Performance: Because the schema is resolved during read operations, each query repeats parsing and type-conversion work that a predefined schema would have done once at ingestion, which can slow queries.
- Data lineage and governance: Tracing the origin and transformations of data can be more challenging in schema on read systems. Implementing robust data governance practices is essential.
- Skillset requirements: Analysts working with schema on read systems may require additional skills in data wrangling and data interpretation compared to traditional structured data analysis.
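The data quality consideration above can be illustrated with a small read-time validation pass. This is a sketch only: the sensor records, the expected fields, and the `read_validated` helper are assumptions made for the example.

```python
import json

# Hypothetical raw lines, stored without any checks at ingestion.
RAW_LINES = [
    '{"sensor": "a1", "temp_c": "21.5"}',
    '{"sensor": "a2", "temp_c": "n/a"}',  # unparseable reading
    '{"sensor": "a3"}',                    # missing field
    'not json at all',                     # corrupt line
]

def read_validated(lines):
    """Return (valid_rows, rejects); each reject keeps the raw line and a reason."""
    valid, rejects = [], []
    for line in lines:
        try:
            record = json.loads(line)
            row = {
                "sensor": str(record["sensor"]),
                "temp_c": float(record["temp_c"]),
            }
        except json.JSONDecodeError:
            rejects.append((line, "invalid JSON"))
        except KeyError as exc:
            rejects.append((line, f"missing field {exc.args[0]}"))
        except (TypeError, ValueError):
            rejects.append((line, "bad value"))
        else:
            valid.append(row)
    return valid, rejects

valid, rejects = read_validated(RAW_LINES)
```

Keeping the rejected lines alongside a reason, rather than silently dropping them, also helps with lineage, since each exclusion can be traced back to the raw record. The cost is that this checking runs on every read, which is the performance trade-off noted in the list.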
Why Machine Learning Models Need Schema on Read
Machine learning models are particularly reliant on schema on read for several reasons:
- Raw Data Advantage: Machine learning algorithms often require access to the raw, unprocessed data to identify patterns and relationships. Schema on read allows the data to be stored in its native format, preserving its richness for the model.
- Flexibility for Feature Engineering: Feature engineering, the process of creating new features from existing data, is crucial for machine learning. Schema on read provides the flexibility to explore and define new features during the training process without needing to restructure the entire dataset.
- Single Source for Multiple Models: Schema on read enables a single data repository to be used for training various machine learning models. Each model can define the schema it needs during read time, maximizing data utilization and reducing storage requirements.
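As a rough illustration of the points above, the sketch below reads one raw store through two different feature views, one per model. The record fields, the view functions, and the model names are hypothetical and chosen only for the example.

```python
import json

# One raw repository shared by every model; nothing is reshaped at ingestion.
RAW_STORE = [
    '{"user": "u1", "age": 34, "purchases": [12.0, 8.5], "plan": "pro"}',
    '{"user": "u2", "age": 51, "purchases": [], "plan": "free"}',
]

def churn_features(record):
    """Hypothetical feature view for a churn model."""
    return {"age": record["age"], "is_paid": int(record["plan"] != "free")}

def spend_features(record):
    """Hypothetical feature view for a spend model; derives features on read."""
    purchases = record["purchases"]
    return {"n_purchases": len(purchases), "total_spend": sum(purchases)}

def read_as(store, feature_view):
    """Apply a model-specific view to every raw record at read time."""
    return [feature_view(json.loads(line)) for line in store]

churn_rows = read_as(RAW_STORE, churn_features)
spend_rows = read_as(RAW_STORE, spend_features)
```

Adding a new engineered feature here means editing a view function, not rewriting or re-ingesting the stored data.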
In conclusion, schema on read is a valuable approach for handling diverse and large-scale data sets. It offers flexibility and agility, especially beneficial for machine learning models that rely on raw data access, feature engineering, and efficient data utilization.
FAQ
Is schema on read always better than schema on write?
No. Schema on write offers advantages in data consistency and query performance, while schema on read is better suited to flexible data ingestion and exploration.
How does schema on read impact data quality?
Because data is not validated at ingestion, inconsistencies in the data surface only at read time. Data cleaning and validation become even more crucial.
What skills are needed to work with schema on read data?
Analysts may require additional skills in data wrangling and data interpretation compared to traditional structured data analysis.
Can schema on read be used with real-time data?
Yes, schema on read can be used with real-time data streams, but applying a schema to each record as it arrives may require additional processing to keep the analysis timely.
Are there any tools that specifically support schema on read?
Many big data processing frameworks, such as Apache Hadoop and Apache Spark, support working with schema-on-read data.