Curated Data in Apache Parquet Format – For Blazing Fast Big Data Analytics on Hadoop & Spark

31.07.2016

David Talby

Chief technology officer at John Snow Labs

Here at John Snow Labs, we are delighted to announce that all datasets are now also available in the new highly optimized Apache Parquet format, which delivers an order of magnitude faster query speeds, as well as substantial storage savings, according to multiple industry benchmarks.

The new format drastically accelerates queries on common benchmarks. It also reduces disk space, bandwidth and CPU usage. It is available alongside the existing CSV and JSON data formats on all subscriptions.

Apache Parquet is an efficient, general-purpose columnar file format. It is self-describing and language-independent, and it supports multiple compression algorithms, partitioning for big data sets, and nested data structures. John Snow Labs is the first to deliver a data repository in Parquet format in the healthcare space, which is seeing fast-growing adoption of big data analytics technologies.
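To make the idea of a columnar layout concrete, here is a minimal toy sketch using only the Python standard library. It is not Parquet's actual on-disk format and does not use any Parquet library. The sample records, the function names to_columnar and read_column, and the choice of JSON plus zlib are all assumptions made for illustration. The sketch stores each column as its own compressed blob, so a query that needs one column only has to decompress that column.

```python
import json
import zlib

# Hypothetical sample records; not taken from any John Snow Labs dataset.
ROWS = [
    {"patient_id": i, "state": "CA" if i % 2 else "NY", "charge": 100.0 + i}
    for i in range(1000)
]


def to_columnar(rows):
    """Pivot row dicts into one independently compressed blob per column."""
    columns = {name: [row[name] for row in rows] for name in rows[0]}
    return {
        name: zlib.compress(json.dumps(values).encode("utf-8"))
        for name, values in columns.items()
    }


def read_column(store, name):
    """Decompress and decode a single column, leaving the others untouched."""
    return json.loads(zlib.decompress(store[name]).decode("utf-8"))


store = to_columnar(ROWS)

# An aggregate over one column reads only that column's bytes.
charges = read_column(store, "charge")
total_charge = sum(charges)
```

A real Parquet file adds much more on top of this idea, including an embedded schema, typed binary encodings and row groups, which is why readers such as Spark SQL can skip the columns a query does not reference.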

Parquet was designed for Apache Hadoop and has been adopted by Apache Spark, Cloudera Impala, Hive, Presto and Apache Drill. The majority of big data analytics platforms now recommend it as the most efficient, highest-performing data format. Here are recent publicly available benchmarks:

IBM evaluated multiple data formats for Spark SQL and reported the following for Parquet:

Queries 11 times faster than on text files
75% less data storage, thanks to built-in compression
The only format able to query large files (1 TB in size) with no errors
Higher scan throughput on Spark 1.6

Cloudera examined different queries and discovered that Parquet was:

2 to 15 times faster than Avro, and far faster than CSV
72% smaller on a wide table and 25% smaller on a narrow table

United Airlines also reported that Parquet was:

10 times faster than CSV on Presto and 3 times faster than CSV on Hive

According to the founding team, “Our customers expect us to optimize and test the data we provide for whichever analytics platform they use – often for multiple ones. For big data platforms, Apache Parquet is emerging as the gold standard, and we are thrilled to be the first to support it across our entire data catalog. Our customers benefit in two ways. They get turnkey data in an optimized format and do not need to spend time and effort on reformatting, plus they get the day-to-day productivity boost from screaming fast query performance.”

We provide turnkey data for scientists across 15 areas of healthcare. Our service supports the analysis of healthcare data and specializes in data engineering that optimizes storage, bandwidth and data access performance. We also invest in optimizing and testing clean, current and enriched healthcare data sets on the latest big data platforms. Our current partners include Cloudera and Hortonworks in big data, Atigeo and Turi in data science, and the open-source projects Spark, Presto and ElasticSearch.

Here at John Snow Labs, we believe that data science will be a major driver of progress for 21st-century medicine. We contribute by providing quality DataOps: finding, cleaning, formatting, updating and publishing turnkey data for technology companies, healthcare providers, research, government and non-profit organizations.

Additionally, by utilizing Generative AI in Healthcare and a Healthcare Chatbot, organizations can harness curated data formats like Apache Parquet to accelerate analytics processes, ultimately improving patient outcomes and fostering innovation in healthcare solutions.

Our additional expert:

David Talby is a chief technology officer at John Snow Labs, helping healthcare & life science companies put AI to good use. David is the creator of Spark NLP – the world’s most widely used natural language processing library in the enterprise. He has extensive experience building and running web-scale software platforms and teams – in startups, for Microsoft’s Bing in the US and Europe, and to scale Amazon’s financial systems in Seattle and the UK. David holds a PhD in computer science and master’s degrees in both computer science and business administration.
