What is Data Architecture?
Data Architecture describes how data is created, collected, transformed, stored, distributed, and consumed within the context of an application, solution, product, service, business scope, or the enterprise as a whole. It involves the practices of planning, designing, and overseeing development. It is driven by the business requirements and desired business capabilities of the context it is applied to, and by the technical requirements and technical capabilities of that same context.
Key deliverables of the Data Architecture practice
Data Models: Data models are diagrams that illustrate business entities or datasets, and their relationships to each other within a defined context or scope. These diagrams may also further describe these entities/datasets by illustrating attributes and unique identifiers for each of them.
- Conceptual: These are high-level models used to describe data domains or classes of entities and their general relationships to one another. A conceptual model is based purely on business requirements; solution implementation has no bearing on its design, and attributes are typically not described. Conceptual models are typically used to define the data scope for a given business context, program, project, or the enterprise as a whole.
- Logical: A logical model usually involves some normalization of data entities (typically to third normal form) in order to accurately define the entities, their granularity, and their specific relationships to one another. All business attributes are described, as are unique identifiers and alternate keys. Logical models typically apply at an application or solution level and serve as the building blocks and starting point for physical data models.
- Physical: Physical data models organize datasets/tables in a way that optimizes data operations (Create, Read, Update, Delete) against those datasets/tables, based on the characteristics and requirements of the application/solution being designed, while maintaining the required level of data integrity and the business rules defined by the logical model. Physical models balance the need to maintain the business integrity of the data against the performance and operability requirements of the system, and can typically be forward-engineered into a database implementation (a brief sketch of carrying a logical design into a physical one follows this list). Important considerations for physical data models are:
- Performance, which depends on the type of system being designed. Analytical systems (data warehouses/data marts) typically require optimization of query performance as well as of data transformation and loading, while operational systems typically require optimization of data creation and updates.
- Data quality, which, if not enforced by the physical model, should be enforced by transformation or supplemental processes
- Scalability, at the database or dataset/table level, which depends on data volumes and growth rates as well as retention versus archiving requirements
- Extensibility, which addresses the need for the data model to grow in scope and breadth of data without breaking the existing design
- Maintainability, which is heavily dependent on the behavior of the data sources. For example, the shape and form of datasets in a data lake may be influenced by how often source data needs to be corrected, since corrections result in dataset reloads and downstream impacts to consumers. Other maintainability considerations involve implementing standard system attributes that capture data useful for audit trails, reprocessing, etc.
- Datascape diagrams illustrate systems/data stores together with their data interfaces and inter-system data flows, and may indicate the domains of data processed in each system
- Application-level blueprints illustrate data flows within the application between user interfaces and data stores, as well as external interfaces to and from the application
- System-level blueprints illustrate the complete end-to-end set of data flows throughout a system, along with interfaces to other systems, applications, or user interfaces. They illustrate the various data stores and states of data within the system and describe the data processes involved.
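To make the distinction between logical and physical modeling concrete, here is a minimal sketch in Python using SQLAlchemy (an assumption; any modeling or DDL tool would do). The Customer/Order entities, the surrogate keys, and the index are hypothetical illustrations, not deliverables prescribed above: the unique business key and the relationship come from the logical model, while the surrogate keys and the index are physical-design decisions made for performance.

```python
# A minimal sketch, assuming SQLAlchemy, of a logical design carried into a
# physical table design. Entity and column names are illustrative only.
from sqlalchemy import (Column, Integer, String, Numeric, DateTime,
                        ForeignKey, UniqueConstraint, Index)
from sqlalchemy.orm import declarative_base, relationship

Base = declarative_base()

class Customer(Base):
    __tablename__ = "customer"
    customer_id = Column(Integer, primary_key=True)           # surrogate key (physical choice)
    customer_number = Column(String(20), nullable=False)      # business/alternate key from the logical model
    name = Column(String(100), nullable=False)
    __table_args__ = (UniqueConstraint("customer_number"),)   # preserves logical uniqueness

class Order(Base):
    __tablename__ = "order_header"
    order_id = Column(Integer, primary_key=True)              # surrogate key (physical choice)
    customer_id = Column(Integer, ForeignKey("customer.customer_id"), nullable=False)
    order_date = Column(DateTime, nullable=False)
    total_amount = Column(Numeric(12, 2), nullable=False)
    customer = relationship("Customer")                       # 1:N relationship from the logical model
    # Physical-only decision: an index to optimize a common read pattern.
    __table_args__ = (Index("ix_order_customer_date", "customer_id", "order_date"),)
```

The point of the sketch is that the logical constraints (keys, relationships, required attributes) survive into the physical design even as physical-only structures such as surrogate keys and indexes are added for performance.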
Functional data transformation diagram: These are logical illustrations and descriptions of data transformations that show the sequence of transformation processes and how multiple case scenarios are to be handled. They are usually higher level than a data map, but they guide the logic implementation of data maps and can be used to design modular, reusable transformation frameworks for similar processes.
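As a rough illustration of how such a diagram can drive a modular, reusable transformation framework, the sketch below (in Python, assuming pandas is available) expresses each transformation as a step function and the diagram's sequence as a pipeline. The step names and the correction/original branching rule are hypothetical examples.

```python
# A minimal sketch of a modular transformation framework of the kind a
# functional data transformation diagram might guide. Assumes pandas.
from typing import Callable, List
import pandas as pd

TransformStep = Callable[[pd.DataFrame], pd.DataFrame]

def run_pipeline(df: pd.DataFrame, steps: List[TransformStep]) -> pd.DataFrame:
    """Apply each transformation step in sequence, mirroring the diagram's flow."""
    for step in steps:
        df = step(df)
    return df

def standardize_columns(df: pd.DataFrame) -> pd.DataFrame:
    # Example step: normalize column names before downstream processing.
    return df.rename(columns=str.lower)

def route_by_record_type(df: pd.DataFrame) -> pd.DataFrame:
    # One way to express a "multiple case scenario": branch on a discriminator
    # column and apply a different rule per case, then recombine.
    corrections = df[df["record_type"] == "correction"].assign(is_restated=True)
    originals = df[df["record_type"] != "correction"].assign(is_restated=False)
    return pd.concat([originals, corrections], ignore_index=True)

# The same framework can be reused for similar processes by swapping in steps:
# result = run_pipeline(raw_df, [standardize_columns, route_by_record_type])
```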
Data Flow diagram: This type of diagram (DFD) shows the movement of data between processes and is usually a "logical" exercise. The boxes in the diagram are processes, and the arrows are typically annotated with the data domain exchanged between the processes. Some DFDs also identify Inputs, Controls, Outputs, and Mechanisms (ICOMs) for each process. These ICOMs can then be used as inputs into a logical data model, as the entities supporting the processes covered by the DFDs. This approach pairs IDEF0-style process modeling with the IDEF1X data modeling methodology.
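A lightweight way to capture ICOMs so they can seed a logical model is to record them as plain data structures. The sketch below is one possible representation in Python; the process name and ICOM values are hypothetical examples, not part of any standard notation.

```python
# A minimal sketch of recording DFD processes and their ICOMs so that the
# inputs/outputs can be harvested as candidate entities for a logical model.
from dataclasses import dataclass, field
from typing import List, Set

@dataclass
class Process:
    name: str
    inputs: List[str] = field(default_factory=list)       # data consumed by the process
    controls: List[str] = field(default_factory=list)     # rules/constraints governing it
    outputs: List[str] = field(default_factory=list)      # data produced by the process
    mechanisms: List[str] = field(default_factory=list)   # systems/actors performing it

def candidate_entities(processes: List[Process]) -> Set[str]:
    """Collect the data objects named as inputs/outputs; these become candidate
    entities for the logical data model supporting the processes."""
    entities: Set[str] = set()
    for p in processes:
        entities.update(p.inputs)
        entities.update(p.outputs)
    return entities

take_order = Process(
    name="Take Order",
    inputs=["Customer", "Product Catalog"],
    controls=["Credit Policy"],
    outputs=["Order"],
    mechanisms=["Order Entry System"],
)
# candidate_entities([take_order]) -> {"Customer", "Product Catalog", "Order"}
```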
Metadata model: This is a data model of the metadata itself. It should not be limited to business metadata (definitions, rules, formulas, etc.) but should also include technical metadata: data about the state, quality, and processing of data. This model will be the backbone of any data catalog that supports both business and technical users.
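As a sketch of what such a metadata model might hold, the example below pairs business metadata with technical metadata for a single catalog entry. The field names are illustrative assumptions, not a prescribed standard.

```python
# A minimal sketch of a metadata model combining business and technical
# metadata for a dataset, of the kind that could back a data catalog.
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class BusinessMetadata:
    definition: str                      # what the dataset means to the business
    calculation_rule: Optional[str]      # formula or derivation rule, if any
    steward: str                         # accountable business owner

@dataclass
class TechnicalMetadata:
    source_system: str
    last_loaded_at: Optional[datetime]   # processing state
    row_count: Optional[int]
    quality_score: Optional[float]       # e.g., share of rows passing quality checks

@dataclass
class CatalogEntry:
    dataset_name: str
    business: BusinessMetadata
    technical: TechnicalMetadata
```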
Data Design Standards & Principles: These are a set of documents used to govern the design of data objects, datasets, data models, databases, data stores, and most other data assets a company builds. They are more essential than ever to support the development of modern data assets that are self-service, automated, and built and served by distributed, disparate teams. They include, but are not limited to:
- Data naming standards determine how to name schemas, databases, datasets/files, tables, messages, and all of the attributes/fields/columns within them. They include standard abbreviations and conventions such as snake case, camel case, mixed case, or Pascal case (a sketch of an automated naming check follows this list).
- Table/Dataset design principles address levels of normalization and denormalization, choices such as star versus snowflake schema, and when to use each. They also determine how to partition datasets/tables. These decisions are usually driven by classifying tables/datasets so that design can be standardized for each type: fact and dimension tables may have different design principles in a data warehouse, master data and transactional data may have different design principles in a data lake, and time-seriated data may be treated differently depending on whether corrections to previous time slices are common. These design principles also depend on the state of a dataset and its expected consumption patterns.
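As a sketch of how a naming standard can be enforced automatically, the example below checks proposed column names against a snake_case rule and a list of approved abbreviations. Both the regular expression and the abbreviation list are assumptions for illustration; a real standard would come from the governing documents described above.

```python
# A minimal sketch of a naming-standard check: snake_case names plus an
# approved-abbreviation list. The pattern and list are hypothetical.
import re
from typing import List

SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")
APPROVED_ABBREVIATIONS = {"amt", "qty", "dt", "id", "nbr"}   # hypothetical list

def check_column_name(name: str) -> List[str]:
    """Return a list of standard violations for a proposed column name."""
    problems = []
    if not SNAKE_CASE.match(name):
        problems.append(f"'{name}' is not snake_case")
    for token in name.split("_"):
        # Crude heuristic: treat short tokens as abbreviations needing approval.
        if len(token) <= 3 and token not in APPROVED_ABBREVIATIONS and not token.isdigit():
            problems.append(f"'{token}' is not an approved abbreviation")
    return problems

# check_column_name("OrderDate") -> ["'OrderDate' is not snake_case"]
# check_column_name("order_dt")  -> []
```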
The list above is not exhaustive, nor do all projects require every one of these deliverables. But as you can see, the decisions involved in completing them are foundational to the development and support of all data solutions, and getting data architecture right for your solution is critical to the success of your project or product.
Copyright 2023 4te, Inc. All rights reserved.