Where do I need tech XYZ?

Jun 06, 2020

A listing of big data & distributed tech with use cases

Heard of a lot of technologies, but not sure where they’re used or what you can do with them? Here’s a listing of popular technologies, where you can use them, and where they are most effective.

Redis

Redis is an in-memory key:value store. It’s used as an in-memory NoSQL database, or as a cache for short-lived data, with optional persistence to disk.
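To make the cache use case concrete, here is a minimal sketch of the usual "check the cache first, fall back to the slow source" pattern. It is plain Python and does not talk to Redis: `TTLCache`, `get_user` and `load_from_db` are hypothetical names invented for this illustration, standing in for a key:value store with expiring keys.

```python
import time


class TTLCache:
    """In-process stand-in for a key:value cache with expiring keys (not Redis itself)."""

    def __init__(self, clock=time.monotonic):
        self._data = {}
        self._clock = clock  # injectable so expiry can be simulated

    def set(self, key, value, ttl_seconds):
        # Store the value together with the moment it stops being valid.
        self._data[key] = (value, self._clock() + ttl_seconds)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            # Expired entries are dropped lazily, on read.
            del self._data[key]
            return None
        return value


def get_user(cache, user_id, load_from_db, ttl_seconds=60):
    """Return the user from the cache, loading and caching it on a miss."""
    key = f"user:{user_id}"
    cached = cache.get(key)
    if cached is not None:
        return cached
    value = load_from_db(user_id)  # the slow path, e.g. a database query
    cache.set(key, value, ttl_seconds)
    return value
```

The second call for the same user within the TTL never reaches `load_from_db`; once the key expires, the next call reloads it.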

MongoDB

Mongo is a document-oriented NoSQL database. It’s commonly chosen for its horizontal scalability and in cases where strict relational ACID guarantees are not compulsory. Schemas are also more flexible, so prototyping becomes quicker. It’s often used in place of a traditional RDBMS.

PostgreSQL

Postgres is a classical RDBMS beast, feature-rich & standards-compliant. It’s commonly chosen over MySQL because it’s an object-relational database, which means it supports table inheritance and function overloading. It is also often credited with handling concurrency better.

Docker

Docker is a tool to run applications in containers. These containers allow each application to bring its own libraries and dependencies. Docker is used to standardize environments, giving a reproducible environment largely irrespective of how the host machine is set up.

Kubernetes

Kubernetes is a cluster deployment & operations tool for containers, such as those built with Docker. It allows you to configure and run many container instances across multiple machines. It’s commonly used over vendor-specific tools since it is platform agnostic, allowing deployment across vendors (Amazon/Microsoft/Google).

Helm

Helm is a package manager for Kubernetes. It bundles the Kubernetes resource definitions for an application into a versioned package called a chart, so the whole setup can be installed, upgraded, or rolled back as a unit. It’s commonly used to run/update a number of microservices, each with their own image. With Helm, running or updating all the microservices becomes simple.

Hadoop

Hadoop is a Map/Reduce-based distributed ‘big data’ framework. It uses its own Hadoop Distributed File System (HDFS), which splits files into blocks and spreads them, with replication, across a cluster. This means computations run on the nodes where the data lives. It’s used to run big data computations like analytics. It’s chosen over Apache Spark for disk-heavy batch operations that fit the MapReduce paradigm.
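The MapReduce paradigm itself is easy to sketch on a single machine. The word count below is the classic example; it is plain Python, not Hadoop’s API, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for this illustration. In a real cluster the map and reduce steps would run in parallel on different nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    mapped = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())
```

Because each line is mapped independently and each key is reduced independently, both phases can be split across machines, which is what makes the model suit large files stored in blocks.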

Spark

Spark is a distributed ‘big data’ framework that does most of its processing in memory. Its core abstraction is the RDD (Resilient Distributed Dataset). It’s used over Hadoop because it has been reported to be up to 100 times faster in memory and 10 times faster on disk for some workloads. It’s also been found to be faster for machine learning. The main advantages seem to arise from DAGs (directed acyclic graphs of operations), which enable optimizations between steps.
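The idea of recording steps first and optimizing across them later can be shown with a toy example. This is plain Python, not Spark: `LazyDataset` is a hypothetical class written for this post, and it only handles a straight chain of steps rather than a general DAG. The method names `map`, `filter` and `collect` are borrowed from Spark’s RDD API to keep it familiar.

```python
class LazyDataset:
    """Records transformations and only runs them when collect() is called."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)

    def map(self, fn):
        # Nothing is computed here; the step is only added to the plan.
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, predicate):
        return LazyDataset(self._source, self._steps + (("filter", predicate),))

    def collect(self):
        # All recorded steps are fused into a single pass over the data,
        # so no intermediate list is built between steps.
        results = []
        for item in self._source:
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                results.append(item)
        return results
```

Chaining `.map(...)` and `.filter(...)` costs nothing until `.collect()` runs, at which point the whole plan is known and can be executed in one go. A MapReduce job, by contrast, typically writes results out between stages.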

Neo4j

Neo4j is a NoSQL graph database designed to ‘remember’ the relationships between data. It stores relationships between nodes directly and allows you to query them. It’s used in analytics and for highly connected data, like that found in most social networks.
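As a rough picture of what querying relationships means, here is a friends-of-friends lookup over a tiny in-memory graph. It is plain Python rather than Neo4j or its query language, and `TinyGraph`, `relate`, `neighbours` and `friends_of_friends` are names invented for this sketch.

```python
from collections import defaultdict


class TinyGraph:
    """Stores directed, typed relationships between nodes (a toy, not Neo4j)."""

    def __init__(self):
        self._edges = defaultdict(set)

    def relate(self, source, relationship, target):
        # The relationship is stored as data in its own right.
        self._edges[(source, relationship)].add(target)

    def neighbours(self, node, relationship):
        return set(self._edges.get((node, relationship), set()))


def friends_of_friends(graph, person):
    """People two FRIEND hops away who are not already direct friends."""
    direct = graph.neighbours(person, "FRIEND")
    result = set()
    for friend in direct:
        result |= graph.neighbours(friend, "FRIEND")
    return result - direct - {person}
```

In a relational database the same question usually needs a self-join per hop; a graph store follows the stored relationships instead, which is why this kind of data suits it.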