Running a Data Pipeline with Apache Beam Golang SDK on Docker
This article describes how to run a data pipeline using the Apache Beam Golang SDK with Docker. It is intended for readers who have not used the SDK with Docker, whether or not they have used Apache Beam with Golang before. The following terms are used throughout:
- A data pipeline is a set of data processing operations, or elements, that move data from a source, or multiple sources, to a sink or destination. Along a data pipeline, data can be processed, transformed, and optimized.
- Apache Beam is a unified, open-source programming model for batch and streaming data processing. A Beam pipeline reads input data from one or more sources, applies business logic, and writes the result to a destination.
- Golang, or simply Go, is an open-source, compiled programming language developed by Google and designed for multi-core, networked systems.
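The source, transform, and sink shape described in the list above can be sketched in plain Go, without any Beam dependency. This is only an illustration of the idea: the names Stage, runPipeline, splitWords, and dropShort are hypothetical and are not part of the Apache Beam API, which adds parallel execution and runners on top of this basic model.

```go
package main

import (
	"fmt"
	"strings"
)

// Stage is a hypothetical name for one processing step: it takes the
// elements produced by the previous step and returns new elements.
type Stage func(in []string) []string

// runPipeline feeds the source elements through each stage in order and
// returns the elements that would be handed to the sink.
func runPipeline(source []string, stages ...Stage) []string {
	data := source
	for _, stage := range stages {
		data = stage(data)
	}
	return data
}

// splitWords turns lines into individual lowercase words.
func splitWords(lines []string) []string {
	var words []string
	for _, line := range lines {
		words = append(words, strings.Fields(strings.ToLower(line))...)
	}
	return words
}

// dropShort removes words shorter than three bytes.
func dropShort(words []string) []string {
	var kept []string
	for _, w := range words {
		if len(w) >= 3 {
			kept = append(kept, w)
		}
	}
	return kept
}

func main() {
	source := []string{"Hello Apache Beam", "data pipeline with Go SDK"}
	// The sink in this sketch is simply standard output.
	for _, w := range runPipeline(source, splitWords, dropShort) {
		fmt.Println(w)
	}
}
```

Each stage only sees the output of the previous one, which is what lets a framework such as Beam distribute the stages across workers.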
Before You Begin
To run a data pipeline using the Apache Beam Go SDK with Docker, you must perform the following prerequisite setup:
- Install Docker. The commands to install Docker vary with the operating system used.
Install Golang in Docker
Once Docker is installed, get Go by pulling the official golang Docker image, which ships with the Go toolchain preinstalled. Pull the image with the docker pull command as follows:
sudo docker pull golang
Next, create a Docker container from the Docker image for golang with the docker run command, which creates a container and displays an interactive shell to run Go commands:
~$ sudo docker run -it golang bash
root@cc1e28e100ab:/go#
Create a Go Module
Next, create a Go module, which is needed to run Go code. Go packages are contained within modules, and Go tracks dependencies through modules. Create a directory called apachebeam and make it the current directory:
root@cc1e28e100ab:/go# mkdir apachebeam
root@cc1e28e100ab:/go# cd apachebeam
Create a module for the sample code with the go mod init command:
root@cc1e28e100ab:/go/apachebeam# go mod init example/apachebeam
go: creating new go.mod: module example/apachebeam
The command creates a module file, go.mod, in the current directory.
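As an optional sanity check that the module is usable, a small program along the following lines could be saved as main.go in the module directory and run with go run . from that directory. It is a sketch rather than a required step; the helper name describe is hypothetical, and the program relies only on the standard library's runtime/debug package.

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// describe turns build information into a one-line summary.
// It is a hypothetical helper written for this sketch.
func describe(info *debug.BuildInfo, ok bool) string {
	if !ok || info == nil {
		return "no build info available"
	}
	return "main module: " + info.Main.Path
}

func main() {
	// ReadBuildInfo returns the module data embedded in binaries that were
	// built in module mode.
	fmt.Println(describe(debug.ReadBuildInfo()))
}
```

When built inside the module, the reported path should correspond to the name passed to go mod init.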
Install the Apache Beam Go SDK in Docker
Next, install the Apache Beam Go SDK, which is available as a Go package, in the Docker container. First, verify the Go version is at least 1.16, which is needed to install the Apache Beam Go SDK:
root@4c490c5b9514:/go# go version
go version go1.17.6 linux/amd64
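The same check can be made programmatically. The following is a minimal sketch that compares the toolchain's reported version with the 1.16 minimum mentioned above; the helper names atLeast and leadingInt are hypothetical, and the parsing assumes version strings of the form go1.17.6, so development builds with other formats are reported as not meeting the minimum.

```go
package main

import (
	"fmt"
	"runtime"
	"strconv"
	"strings"
)

// leadingInt parses the run of decimal digits at the start of s.
func leadingInt(s string) (int, bool) {
	end := 0
	for end < len(s) && s[end] >= '0' && s[end] <= '9' {
		end++
	}
	n, err := strconv.Atoi(s[:end])
	return n, err == nil
}

// atLeast reports whether a version string such as "go1.17.6" is at least
// the given major.minor release. Strings it cannot parse yield false.
func atLeast(version string, major, minor int) bool {
	parts := strings.Split(strings.TrimPrefix(version, "go"), ".")
	if len(parts) < 2 {
		return false
	}
	gotMajor, ok := leadingInt(parts[0])
	if !ok {
		return false
	}
	gotMinor, ok := leadingInt(parts[1])
	if !ok {
		return false
	}
	if gotMajor != major {
		return gotMajor > major
	}
	return gotMinor >= minor
}

func main() {
	v := runtime.Version()
	fmt.Printf("%s meets the 1.16 minimum: %t\n", v, atLeast(v, 1, 16))
}
```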
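Get the Go package for the Apache Beam Go SDK with the go get command. Run it from the module directory so that the dependency is recorded in go.mod; newer Go releases no longer support go get outside a module:
root@4c490c5b9514:/go/apachebeam# go get -u github.com/apache/beam/sdks/v2/go/pkg/beam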
Run the Wordcount Example
Apache Beam Go SDK provides some pre-developed example applications on GitHub at github.com/apache/beam/sdks/v2/go/examples. First, download the wordcount example code with the go get command:
root@4c490c5b9514:/go/apachebeam# go get github.com/apache/beam/sdks/v2/go/examples/wordcount
Next, install the wordcount example with the go install command:
root@4c490c5b9514:/go/apachebeam# go install github.com/apache/beam/sdks/v2/go/examples/wordcount
The wordcount example needs an input file, and the golang image does not include a text editor to create one. Install the vim text editor:
root@4c490c5b9514:/go/apachebeam# apt-get update && apt-get install -y vim
Create an input file called input.txt:
root@4c490c5b9514:/go/apachebeam# vim input.txt
Paste the following sample text into the input.txt file, then save and close the file with the :wq command.
hello apache beam data pipeline with go sdk
Run the wordcount example providing the input.txt as the input file, and a “counts” file as the output:
root@4c490c5b9514:/go/apachebeam# wordcount --input input.txt --output counts
2022/02/11 01:00:20 Executing pipeline with the direct runner.
…
2022/02/11 01:00:20 Reading from input.txt
2022/02/11 01:00:20 Writing to counts
The Apache Beam data pipeline counts the occurrences of each word in the input.txt file and writes the result to the counts file. Open the counts file to view the result:
root@4c490c5b9514:/go/apachebeam# vim counts
The result from the wordcount example is as follows; the order of the lines is not guaranteed and may differ between runs:
apache: 1
beam: 1
data: 1
pipeline: 1
with: 1
go: 1
sdk: 1
hello: 1
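The counting logic behind this result can be sketched in plain Go. This is not the Beam wordcount source: the names countWords and formatCounts are hypothetical, and the letters-only tokenizer is an assumption made for this sketch. The lines are sorted here only so that the output is reproducible.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
)

// wordRE is an assumed tokenizer for this sketch: runs of ASCII letters.
var wordRE = regexp.MustCompile(`[a-zA-Z]+`)

// countWords returns how many times each word occurs in text.
func countWords(text string) map[string]int {
	counts := make(map[string]int)
	for _, w := range wordRE.FindAllString(text, -1) {
		counts[w]++
	}
	return counts
}

// formatCounts renders each entry as "word: count" and sorts the lines
// lexicographically so that repeated runs produce the same output.
func formatCounts(counts map[string]int) []string {
	lines := make([]string, 0, len(counts))
	for w, c := range counts {
		lines = append(lines, fmt.Sprintf("%s: %d", w, c))
	}
	sort.Strings(lines)
	return lines
}

func main() {
	text := "hello apache beam data pipeline with go sdk"
	for _, line := range formatCounts(countWords(text)) {
		fmt.Println(line)
	}
}
```

For the sample input, every word occurs once, so each of the eight lines ends in a count of 1, matching the counts file above apart from ordering.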