Time series database

By Ketan Duvedi, Jinhua Li, Dhruv Garg, Philip Fisher-Ogden

Introduction

The growth of internet-connected devices has led to a vast amount of easily accessible time series data. Increasingly, companies are interested in mining this data to derive useful insights and make data-informed decisions. Recent technology advancements have improved the efficiency of collecting, storing and analyzing time series data, spurring an increased appetite to consume this data. However, this explosion of time series data can overwhelm most initial time series data architectures. Netflix, being a data-informed company, is no stranger to these challenges and over the years has enhanced its solutions to manage the growth. In this 2-part blog post series, we will share how Netflix has evolved a time series data storage architecture through multiple increases in scale.

Time Series Data - Member Viewing History

Netflix members watch over 140 million hours of content per day. Each member provides several data points while viewing a title, and they are stored as viewing records. Netflix analyzes the viewing data and provides real-time accurate bookmarks and personalized recommendations, as described in these posts:

  • Helping you find shows to continue watching on Netflix

Viewing history data increases along the following 3 dimensions:

  • As time progresses, more viewing data is stored for each member.
  • As member count grows, viewing data is stored for more members.
  • As member monthly viewing hours increase, more viewing data is stored for each member.

As Netflix streaming has grown to 100M+ global members in its first 10 years, there has been a massive increase in viewing history data. In this blog post we will focus on how we approached the big challenge of scaling storage of viewing history data.

The first cloud-native version of the viewing history storage architecture used Cassandra for the following reasons:

  • Cassandra has good support for modelling time series data, wherein each row can have a dynamic number of columns.
  • The viewing history data write-to-read ratio is about 9:1.
  • Since Cassandra is highly efficient with writes, this write-heavy workload is a good fit for Cassandra.
  • Considering the CAP theorem, the team favors eventual consistency over loss of availability. Cassandra supports this tradeoff via tunable consistency, sketched below.
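To make that last point concrete, here is a minimal sketch of per-statement tunable consistency using the open-source DataStax Python driver. The keyspace, table and values are illustrative assumptions, not Netflix's actual schema or code:

    # Sketch: tunable consistency in Cassandra via the DataStax Python driver.
    # The "examples.viewing_history" table is an illustrative assumption.
    from datetime import datetime, timezone

    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    session = Cluster(["127.0.0.1"]).connect()

    # Lean toward availability: acknowledge the write after one local replica.
    fast_write = SimpleStatement(
        "INSERT INTO examples.viewing_history (customer_id, view_time, record) "
        "VALUES (%s, %s, %s)",
        consistency_level=ConsistencyLevel.LOCAL_ONE,
    )
    session.execute(
        fast_write, ("member-123", datetime.now(timezone.utc), "bookmark=0s")
    )

    # Dial consistency up where it matters: require a quorum of local replicas.
    quorum_read = SimpleStatement(
        "SELECT * FROM examples.viewing_history WHERE customer_id = %s",
        consistency_level=ConsistencyLevel.LOCAL_QUORUM,
    )
    rows = session.execute(quorum_read, ("member-123",))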

In the initial approach, each member's viewing history was stored in Cassandra in a single row, with CustomerId as the row key. This horizontal partitioning enabled effective scaling with member growth and made the common use case of reading a member's entire viewing history very simple and efficient. However, as member count increased and, more importantly, each member streamed more and more titles, the row sizes as well as the overall data size increased. Over time, this resulted in high storage and operation costs as well as slower performance for members with large viewing histories.
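A rough sketch of what such a single-table, wide-row layout could look like, with the CQL issued through the same Python driver (all names here are illustrative assumptions):

    # Sketch: one partition (wide row) per member, keyed by CustomerId, with
    # one clustering entry ("dynamic column") per viewing record.
    from cassandra.cluster import Cluster

    session = Cluster(["127.0.0.1"]).connect()
    session.execute(
        "CREATE KEYSPACE IF NOT EXISTS examples WITH replication = "
        "{'class': 'SimpleStrategy', 'replication_factor': 1}"
    )
    session.execute("""
        CREATE TABLE IF NOT EXISTS examples.viewing_history (
            customer_id text,       -- partition key: one row per member
            view_time   timestamp,  -- clustering key: one entry per viewing record
            title_id    text,
            record      text,       -- viewing record payload (bookmark etc.)
            PRIMARY KEY (customer_id, view_time)
        ) WITH CLUSTERING ORDER BY (view_time DESC)
    """)

Because all of a member's records share one partition key, reading the entire history is a single-partition query, which is what made the common case simple and efficient.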

The following figure illustrates the read and write flows of the initial data model:

Figure 1: Single Table Data Model

Write Flow

One viewing record was inserted as a new column when a member started playing a title. That viewing record was updated after the member paused or stopped the title. This single column write was fast and efficient.

Read Flows

  • Whole row read to retrieve all viewing records for one member: The read was efficient when the number of records per member was small. As a member watched more titles, the number of viewing records increased. Reading rows with a large number of columns put additional stress on Cassandra and negatively impacted read latencies.
  • Time range query to read a time slice of a member's data: This resulted in the same inconsistent performance as above, depending on the number of viewing records within the specified time range.
  • Whole row read via pagination for large viewing history: This was better for Cassandra, as it wasn't waiting for all the data to be ready before sending it back. However, it increased the overall latency to read the whole row as the number of viewing records increased. The write flow and all three read patterns are sketched below.
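Under the illustrative schema above, these flows could look roughly like this (a sketch under assumed names, not Netflix's implementation):

    # Sketch: the write flow and the three read patterns described above.
    from datetime import datetime, timedelta, timezone

    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    session = Cluster(["127.0.0.1"]).connect()
    member = "member-123"
    started_at = datetime.now(timezone.utc)

    # Write flow: insert one new entry when playback starts...
    session.execute(
        "INSERT INTO examples.viewing_history "
        "(customer_id, view_time, title_id, record) VALUES (%s, %s, %s, %s)",
        (member, started_at, "title-42", "bookmark=0s"),
    )
    # ...then update that same entry when the member pauses or stops.
    session.execute(
        "UPDATE examples.viewing_history SET record = %s "
        "WHERE customer_id = %s AND view_time = %s",
        ("bookmark=1350s", member, started_at),
    )

    # Read 1: whole row -- every viewing record for one member.
    all_records = session.execute(
        "SELECT * FROM examples.viewing_history WHERE customer_id = %s",
        (member,),
    )

    # Read 2: time range query -- a slice of the member's history.
    recent = session.execute(
        "SELECT * FROM examples.viewing_history "
        "WHERE customer_id = %s AND view_time >= %s",
        (member, started_at - timedelta(days=7)),
    )

    # Read 3: whole row via pagination -- the driver streams fetch_size
    # entries at a time instead of waiting for the entire row.
    paged = SimpleStatement(
        "SELECT * FROM examples.viewing_history WHERE customer_id = %s",
        fetch_size=100,
    )
    for row in session.execute(paged, (member,)):
        pass  # handle each viewing record as it arrives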

Let's look at some of the Cassandra internals to understand why our initial simple design slowed down. As the data grew, the number of SSTables increased accordingly. Since only recent data was in memory, in many cases both the memtables and the SSTables had to be read to retrieve viewing history. This had a negative impact on read latency. Similarly, compaction took more IOs and time as the data size increased. Read repair and full column repair also became slower as rows got wider.

Caching Layer

Cassandra performed very well writing viewing history data, but there was a need to improve the read latencies. To optimize read latencies, at the expense of increased work during the write path, we added an in-memory sharded caching layer (EVCache) in front of Cassandra storage.
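The pattern is, in spirit, a cache kept fresh by extra work on the write path so that most reads never touch Cassandra. Below is a toy sketch using pymemcache as a local stand-in for EVCache (which speaks the memcached protocol); the names and the cache-invalidation choice are assumptions, not Netflix's design:

    # Toy sketch: pay on the write path (an extra cache operation) so that
    # reads can usually be served from memory instead of Cassandra.
    import json

    from cassandra.cluster import Cluster
    from pymemcache.client.base import Client

    session = Cluster(["127.0.0.1"]).connect()
    cache = Client(("localhost", 11211))  # stand-in for a sharded EVCache cluster

    def write_viewing_record(customer_id, view_time, record):
        session.execute(
            "INSERT INTO examples.viewing_history (customer_id, view_time, record) "
            "VALUES (%s, %s, %s)",
            (customer_id, view_time, record),
        )
        # Extra write-path work: drop the now-stale cached history so the
        # next read repopulates it (updating it in place is the costlier option).
        cache.delete(customer_id)

    def read_viewing_history(customer_id):
        cached = cache.get(customer_id)
        if cached is not None:
            return json.loads(cached)  # cache hit: no Cassandra read
        rows = session.execute(
            "SELECT view_time, record FROM examples.viewing_history "
            "WHERE customer_id = %s",
            (customer_id,),
        )
        history = [[str(r.view_time), r.record] for r in rows]
        cache.set(customer_id, json.dumps(history).encode())  # warm for next read
        return history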








