Data ingestion: Definition, tools, and pipeline

Any worthwhile data strategy starts at the source: with good, clean data. Learn what data ingestion is, how it works, and the issues to consider so you can lay the foundation for a successful data strategy.

What is data ingestion?

Data ingestion is the process of accessing and importing data from multiple sources and transferring it to a single location where it can be stored and analyzed as needed.

Business stakeholders across the organization need access to data for a wide range of purposes; this need is at the heart of data operations. Whether it’s supply chain executives seeking to make data-driven purchasing decisions, operations managers looking to prioritize manufacturing processes for optimization, or marketing professionals interested in better forecasting for an upcoming buying season, visibility into diverse sources of data has become critical for modern business.

The challenge is getting data from many different sources and formats into a single database. Resolving this challenge is at the heart of data ingestion.

Advantages of data ingestion

  • Data flexibility and centralization
  • Improved data availability
  • Better data for better decisions
  • Improved productivity

What are the main types of data ingestion?

Batch processing

With this approach, the data ingestion layer incrementally collects data from sources and sends it in batches to the system where the data will be stored. Batches can be sent at intervals measured in minutes, hours, or even weeks, triggered either by a schedule or by certain criteria, such as a threshold being reached or a specific condition being met.

This is the most common type of data ingestion. It is relatively simple and inexpensive, and it is well suited to collecting specific data points for periodic deep-dive analysis. However, it is inadequate for scenarios requiring real-time data updates.
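
To make the shape of this concrete, below is a minimal sketch of a batch ingestion loop in Python. The `fetch_new_records` source poll, the `events` table, and the interval and threshold values are illustrative assumptions rather than references to any particular tool.

```python
import sqlite3
import time

BATCH_INTERVAL_SECONDS = 3600   # ship a batch every hour (illustrative)
BATCH_SIZE_THRESHOLD = 10_000   # ...or sooner, once this many records accumulate

def fetch_new_records(buffer):
    """Hypothetical source poll: append newly created records to the buffer.
    In practice this might read from an API, a file drop, or a change log."""

def load_batch(records, conn):
    """Write one accumulated batch to the destination store."""
    conn.executemany(
        "INSERT INTO events (id, payload, created_at) VALUES (?, ?, ?)", records
    )
    conn.commit()

def run_batch_ingestion(conn):
    buffer, last_send = [], time.monotonic()
    while True:
        fetch_new_records(buffer)
        interval_elapsed = time.monotonic() - last_send >= BATCH_INTERVAL_SECONDS
        threshold_met = len(buffer) >= BATCH_SIZE_THRESHOLD
        if buffer and (interval_elapsed or threshold_met):
            load_batch(buffer, conn)
            buffer, last_send = [], time.monotonic()
        time.sleep(60)  # check the source once a minute between sends

if __name__ == "__main__":
    conn = sqlite3.connect("warehouse.db")  # stand-in for the central store
    conn.execute("CREATE TABLE IF NOT EXISTS events (id, payload, created_at)")
    run_batch_ingestion(conn)
```

Note the tradeoff visible in the loop: nothing reaches the destination until a timer or threshold fires, which is exactly why batch ingestion cannot serve real-time needs.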

Real-time processing

With real-time or streaming ingestion, data is processed into the central system as soon as it is created at the source. This approach is more expensive, as the data ingestion solution must continually monitor sources for new data. However, it is highly useful for scenarios where time-sensitive access is required.
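
For contrast with the batch sketch above, here is the streaming shape, where each record is written the moment it arrives. The `queue.Queue` stands in for a real message stream (for example, a broker subscription), and the `events` table is the same illustrative destination as before.

```python
import queue

def run_streaming_ingestion(stream: queue.Queue, conn) -> None:
    """Consume records as soon as the source emits them; write each one immediately."""
    while True:
        record = stream.get()  # blocks until a new record arrives
        conn.execute(
            "INSERT INTO events (id, payload, created_at) VALUES (?, ?, ?)", record
        )
        conn.commit()          # the record is visible downstream right away
```

The cost mentioned above is visible here too: the consumer never sleeps, so sources must be monitored continuously.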

Hybrid processing

This approach combines elements of real-time and batch processing. Depending on the scenario, a hybrid approach called micro-batching might be applied: batch processing run at a much faster rate than usual, with intervals as short as milliseconds.
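
A minimal micro-batching sketch, reusing the illustrative `events` destination from the earlier examples; the 100 ms window is an arbitrary assumption:

```python
import queue
import time

MICRO_BATCH_INTERVAL = 0.1  # flush every 100 ms (illustrative)

def run_micro_batch_ingestion(stream: queue.Queue, conn) -> None:
    """Same loop shape as batch ingestion, but at millisecond scale."""
    buffer, deadline = [], time.monotonic() + MICRO_BATCH_INTERVAL
    while True:
        try:
            # wait for the next record, but only until the current window closes
            timeout = max(0.0, deadline - time.monotonic())
            buffer.append(stream.get(timeout=timeout))
        except queue.Empty:
            pass
        if time.monotonic() >= deadline:
            if buffer:
                conn.executemany(
                    "INSERT INTO events (id, payload, created_at) VALUES (?, ?, ?)",
                    buffer,
                )
                conn.commit()
            buffer, deadline = [], time.monotonic() + MICRO_BATCH_INTERVAL
```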

Another hybrid method is Lambda-architecture-based data ingestion. This approach involves three different layers:

  • The batch layer is processed in the classic batch mode and provides a complete view into the full body of data.
  • The speed layer gives real-time visibility into specific data that needs to be processed and analyzed immediately.
  • The serving layer combines results from the batch and speed layers, providing a unified view of both time-sensitive information and the more complete data picture.
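
To show how the three layers fit together, here is a minimal sketch using a running event count per user as the metric. It assumes the illustrative `events` table has a `user_id` column and that the speed layer holds recent, not-yet-batched events in memory; both are assumptions made for the example.

```python
def batch_view(conn):
    """Batch layer: complete but periodically recomputed counts over all stored data."""
    rows = conn.execute("SELECT user_id, COUNT(*) FROM events GROUP BY user_id")
    return dict(rows)

def speed_view(recent_events):
    """Speed layer: counts over records that arrived since the last batch run."""
    counts = {}
    for event in recent_events:
        counts[event["user_id"]] = counts.get(event["user_id"], 0) + 1
    return counts

def serving_layer(conn, recent_events):
    """Serving layer: merge the complete batch view with the real-time delta."""
    merged = batch_view(conn)
    for user_id, delta in speed_view(recent_events).items():
        merged[user_id] = merged.get(user_id, 0) + delta
    return merged
```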

Compare and contrast

  • What is data ingestion vs. ETL?
  • What is the difference between data ingestion and data acquisition?
  • What is the difference between data ingestion and data integration?

Data ingestion matters to consider

  • Data sources and connectors
  • Process and integrity
  • Scalability and performance
  • Security
  • Metadata management

Data ingestion challenges

  • Latency
  • Multiple data sources, destinations, and users across the enterprise
  • Maintaining data quality
  • Time efficiency
  • Schema changes and rising data complexity
  • Compliance requirements

Data ingestion best practices

Create data service level agreements (SLAs)

The best place to start—especially to determine your optimal ingestion approach—is to gather use case requirements from your data consumers and work backwards to develop a data SLA to address matters such as:

  • What is the business need?
  • What are the expectations for the data, and when does the data need to meet them?
  • How will we know when the SLA is met, and what will the response be if the SLA is not met?

As part of this, seek to outline the challenges posed by the use cases developed and plan for them accordingly. Identify the specific source systems at your disposal and make sure you know how to extract data from them.
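
One way to make such an SLA actionable is to encode its expectations as values a pipeline can check. A minimal sketch, with hypothetical freshness and volume expectations for an orders feed:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical SLA for one consumer use case; the fields and values are assumptions.
ORDERS_SLA = {
    "business_need": "daily purchasing decisions for the supply chain team",
    "max_staleness": timedelta(hours=24),  # data must be no more than a day old
    "min_row_count": 1_000,                # expected floor for daily volume
}

def sla_is_met(latest_load_time: datetime, row_count: int, sla: dict) -> bool:
    """Evaluate the agreed expectations so a breach can trigger the agreed response."""
    fresh = datetime.now(timezone.utc) - latest_load_time <= sla["max_staleness"]
    complete = row_count >= sla["min_row_count"]
    return fresh and complete
```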

Automate data ingestion

As data expands in volume and complexity, relying on manual ingestion to curate massive amounts of largely unstructured data is no longer viable. Automated data ingestion solutions save time, boost productivity, and reduce the manual steps in the ingestion process.

Automation also brings architectural consistency, consolidated management, safety, and better error handling, all of which help decrease data processing time.

Execute data quality checks at time of ingest—but do so carefully

The best time to determine if you have a quality control problem is at the time of ingestion. While there’s no scalable way to create tests for every possible instance of data corruption across the pipeline, some organizations implement data circuit breakers that will stop the data ingestion process if data doesn’t pass specific quality checks. However, there are inherent tradeoffs here. Set your data quality thresholds too high and you may unnecessarily impede data access; set them too low and your overall data warehouse may be compromised.

Do your best to strike a balance in your circuit breaker deployment, and leverage data visualization and observability to detect data quality issues early in the process, so you can resolve them before they become widespread.
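
To illustrate, here is a minimal sketch of a data circuit breaker. The null-key check and its threshold are placeholder assumptions; in a real deployment both would come from your data SLA.

```python
class DataCircuitBreakerError(Exception):
    """Raised to halt ingestion when a batch fails its quality checks."""

# Illustrative threshold: too strict impedes access, too lax lets bad data through.
MAX_NULL_KEY_RATE = 0.01

def check_batch_quality(records: list[dict]) -> None:
    """Run lightweight checks at ingest time, before anything is written."""
    if not records:
        raise DataCircuitBreakerError("empty batch where data was expected")
    null_keys = sum(1 for r in records if r.get("id") is None)
    if null_keys / len(records) > MAX_NULL_KEY_RATE:
        raise DataCircuitBreakerError(
            f"{null_keys}/{len(records)} records are missing a primary key"
        )

def ingest_with_breaker(records, load) -> None:
    """Trip the breaker (raise and stop the load) instead of writing suspect data."""
    check_batch_quality(records)  # raises before the load runs
    load(records)
```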