Data Literacy Training

The Data Literacy curriculum is designed to help Government and Industry partners develop a data culture where decisions are data-driven, where the workforce is data-savvy, and where the value of your data is maximized across your organizations to achieve your objectives.

This approach has two tracks: (1) Governance and (2) Data Literacy. Each track is made up of 3 to 4 Modules, and each Module is made up of 3 to 5 Topics. (Additional training in AI Literacy is available on demand.) Regardless of Track, students must begin with Module 1: Foundations in Data and AI Literacy, a general introduction to the program. Once completed, Module 1 need not be repeated. Successfully completing any module earns the student a certificate of completion. Successfully completing every module in a track, along with Module 1, earns the student a Badge. It is highly recommended that students register for modules sequentially by track, but this is not required. Immediately below is a summary chart of Tracks, Modules, and Topics, which are elaborated in the following pages.

The Foundations Module introduces students to the art and science of data-driven decisionmaking, rooted in probability theory. Because we live in a probabilistic universe, understanding past performance, trends, and the likelihood of events is a best practice for helping decision-makers choose where, when, how many, and what type of resources should be deployed. This first topic also includes a very high-level introduction to Data Analytics and AI/ML as decision-support tools for probabilistic decisionmaking.

The second topic, Introduction to Data Governance, exposes students to the importance of data catalogs, security, compliance, and other critical factors that help prevent data silos, duplication of data, and data incompatibility, among other issues. The topic is rooted in the DOD Data Governance Framework (DGF) and includes a summary of the roles and responsibilities found in the DOD DGF as well as those commonly found in industry.

The third topic of the Foundations module is ethics. Ethics must be at the forefront of every stage of data governance and analytics, from collection through analysis and visualization to the sharing of findings. Ethics involves ownership rights and responsibilities, the potential gap between intent and results, and an understanding of the various models of ethics.

The Governance Track is designed for Data Managers, Owners, and Stewards. The primary objective of this track is the implementation of a data governance strategy. It begins with an introduction to the DOD DGF, followed by a deep dive into the structures of the DGF. The objective of these two sessions is to ensure that students understand, and can define and discuss, every element of the DOD DGF. The other three topics of the DGF Module focus on implementation. They cover the Data Governance Maturity Model, an assessment tool for determining where an organization is on its data journey and what steps need to be taken to ensure success; other implementation models that apply; and ways of assessing successful implementation.

This track also includes a module on Data Ethics. Much like the introductory version, this module explores ethics through different philosophical perspectives and applies them to data governance issues. It also includes practical exercises and use cases.

Data Ethics is important for every industry, but it is especially important for the DOD: every member of the force who engages with data must be aware of its importance. Data managers, owners, and stewards must go further and know how to implement systems that ensure the ethical treatment of data.

The final module in the Governance Track is People and Organization. Its focus is Data Stewardship, which refers to how data is managed throughout its lifecycle, from collection to storage to decommissioning. This module centers on the organizational structure and the roles and responsibilities given in the DOD DGF and practiced in industry. When students complete this module, they will know what CDOs, Data Councils, Data and Ethics Managers, Data Owners, and Data Stewards do and how their roles interact in a well-organized, data-centric organization.

The Governance Track also introduces students to industry-standard tools such as Immuta, for access control, cataloguing, policy enforcement, and other important features, and Databricks, for database management and querying, advanced analytics, and AI/ML operations. These tools, and others like them, are in use throughout the DOD, and it is important for students to get initial exposure to them in a literacy program. Students will also be given POCs and links for additional training.

The Data Literacy Track also has three modules, plus the final module of the optional AI Literacy Track. They are: Foundations in Probability Theory, Foundations in Data Engineering, Foundations in Data Science, and MLOps.

The Foundations in Probability Theory module has four topics. The first is Introduction to Probability, which covers the relationship between living in a probabilistic world and having to make decisions, along with a general introduction to the underlying mathematical structure of probabilities. Through simple everyday examples such as throwing dice, playing cards, or running a small business, students become familiar with the language, concepts, and functionality of probability theory. The second topic is Calculating Probabilities, which unpacks the mathematical structure of probability theory and gives the student hands-on experience applying its concepts and techniques using a readily available desktop tool such as Excel. In this topic, students are exposed to Tree Diagrams, Venn Diagrams, and Probability Tables. The third and fourth topics are an introduction to, and then an application of, Bayes’ Rule. Bayes’ Rule is considered by many to be the most important rule of probability and data science and is used in complex Machine Learning algorithms. Here, students get an introduction to the concept and apply it to simpler but practical examples.
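As a minimal illustration of the kind of exercise the Bayes’ Rule topics describe, the sketch below applies Bayes’ Rule, P(A|B) = P(B|A) x P(A) / P(B), in basic Python. The inspection scenario, the numbers, and the function name `posterior` are assumptions made for illustration only; they are not taken from the course materials.

```python
def posterior(prior: float, likelihood: float, false_positive_rate: float) -> float:
    """Return P(A | B) using Bayes' Rule.

    prior               -- P(A), the probability of A before seeing evidence B
    likelihood          -- P(B | A), the probability of B when A is true
    false_positive_rate -- P(B | not A), the probability of B when A is false
    """
    # Law of total probability: P(B) = P(B|A)P(A) + P(B|not A)P(not A)
    evidence = likelihood * prior + false_positive_rate * (1.0 - prior)
    if evidence == 0:
        raise ValueError("P(B) is zero, so the posterior is undefined")
    return likelihood * prior / evidence


# Hypothetical numbers: 2% of parts are defective; an inspection flags
# 95% of defective parts and 10% of good parts.
p = posterior(prior=0.02, likelihood=0.95, false_positive_rate=0.10)
print(f"P(defective | flagged) = {p:.3f}")  # prints 0.162
```

With these assumed numbers, a flagged part is defective only about 16% of the time, because defects are rare to begin with. This gap between intuition and the calculated result is the sort of insight a simple but practical Bayes’ Rule example can surface.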

The Foundations in Data Engineering module has three topics. They are: Collecting and Cataloguing Data; Data Security; and Storage, Security, and Access. These topics begin our examination of the data pipeline concept, a discussion that continues into and concludes in our next module, Foundations in Data Science. The data pipeline concept is a way of thinking about the flow of data from collection through analysis all the way to visualization and sharing. For example:

To unpack this pipeline, the Foundations in Data Engineering module focuses on the infrastructure side of the pipeline, whereas the Foundations in Data Science module focuses on the analytic side.

The first topic in the Foundations in Data Engineering module focuses on the collection (ingestion) and cataloguing of data. Note in the diagram above that we will briefly touch on ethics again in this topic. How you collect data, on whom you collect data, and other related questions all have ethical considerations that must be learned, reinforced, and developed into the data culture an organization creates. There are also practical considerations regarding collection, such as the reliability of data sources, the long-term value of the data being collected, and how that data will be catalogued so that it can be of value to scientists, preferably throughout the organization. The second topic covers security in some depth. Data security is organizational security! This topic covers security in terms of information assurance, the Risk Management Framework, and Zero Trust concepts. The third and final topic covers Storage, Security, and Access control. This topic is critical because data that cannot be accessed is of no use to your organization, yet access must be controlled to ensure the data's veracity and reliability. With these features in place, an organization can trust that its data is well organized, accurate, catalogued according to policy, and ready to be analyzed by those given access.

The Foundations in Data Science module also has three topics. They are: Data Analysis, Visualizing and Sharing Data, and Data-Centric Decisionmaking. The Data Analysis topic expands on the common analysis methods covered earlier in the Probability module, but applies them to larger and more complex data sets with out-of-the-box data analysis solutions in Python and native cloud tools. The same is true for the second topic, Visualizing and Sharing Data. Here we will learn some basic Python to visualize data and share those visualizations. We will also learn how to do the same using native cloud tools. In the final topic, Data-Centric Decisionmaking, we bring the entire track to its logical conclusion by learning how to interpret data findings in order to support decisionmaking.
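The data pipeline described across the Data Engineering and Data Science modules can be sketched in a few lines of basic Python. This is only an illustrative sketch: the stage functions (`collect`, `catalogue`, `analyze`, `visualize`), the record fields, and the sample values are all hypothetical and do not come from the course materials or from any particular tool.

```python
from statistics import mean


def collect():
    """Collection (ingestion): hypothetical raw maintenance records."""
    return [
        {"unit": "A", "hours": 12.0},
        {"unit": "B", "hours": 7.5},
        {"unit": "A", "hours": 9.0},
    ]


def catalogue(records, source):
    """Cataloguing: attach simple metadata so others can find and trust the data."""
    return {
        "source": source,
        "fields": sorted(records[0].keys()),
        "row_count": len(records),
        "records": records,
    }


def analyze(dataset):
    """Analysis: average hours per unit."""
    by_unit = {}
    for row in dataset["records"]:
        by_unit.setdefault(row["unit"], []).append(row["hours"])
    return {unit: mean(values) for unit, values in by_unit.items()}


def visualize(summary):
    """Visualization and sharing: a plain-text bar chart, one '#' per whole hour."""
    return "\n".join(
        f"{unit}: {'#' * int(avg)} {avg:.1f}" for unit, avg in sorted(summary.items())
    )


dataset = catalogue(collect(), source="maintenance log (hypothetical)")
summary = analyze(dataset)
print(visualize(summary))
```

Each function stands in for one stage of the pipeline. In practice, storage, security, and access control would sit between cataloguing and analysis; they are omitted here to keep the sketch short.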

The final module in the AI Literacy Track, Operationalizing AI and ML, follows the same format as the previous module, where the first topic is foundational and the second is hands-on practice.

In the first two topics, students are introduced to AI for Business and are exposed to purpose-driven AI models designed to solve a specific business problem. The final two topics in the operationalization module introduce students to the emerging and game-changing world of MLOps. MLOps, although still facing challenges, is an attempt by industry to replicate the success of DevOps in the data world. The idea is that a diverse and highly functional team of experts, from developers to data scientists to mission owners, collaborates to create, operationalize, and maintain ML models. As you can see in the diagram below, this approach has organization-wide analytic code, metadata, and feature stores, which enable other units in your organization to quickly ramp up models to address their particular but related problem sets. Also note how this approach enforces Data Governance features such as organization-wide cataloguing in the Metadata Store and Code Repositories, the breaking down of silos, and the sharing of practices, results, and features. Further, this approach uses probability, statistical methods, and AI/ML models and features, and is therefore a combination and culmination of the various components students have learned. By the completion of these three tracks, students will be literate in governance, data, and AI and MLOps.

At the conclusion of the Data Literacy Track, students will be very familiar with the basic terminology, concepts, and techniques of data collection, analysis, visualization, and interpretation. To be clear, they will not be Data Scientists, but they will know enough to perform basic data functions and will have a solid foundation from which to continue specializing and building their data expertise.

There is very little effective Data Literacy training available to Government and Industry, and we look forward to partnering with you to deliver this essential service. For more information: