When we launch new services, we can quickly measure success, and when we see anomalies in the data, we can quickly look for root causes. Charged with serving this data for everyday operational analysis, our Data Warehouse team maintains a massively parallel database running Vertica, a popular interactive data analytics platform. Every day, our system handles millions of queries, with 95 percent of them taking less than 15 seconds to return a response.
Growing storage requirements for this system made our initial strategy of adding fully duplicated Vertica clusters to increase query volume cost-prohibitive. A solution arose through the combined forces of our Data Warehouse and Data Science teams. Looking at the problem through cost analysis, our data scientists helped our Data Warehouse engineers come up with a means of partially replicating Vertica clusters to better scale our data volume.
Optimizing our compute resources in this manner meant that we could scale to our current pace of serving over a billion trips on our platform, leading to improved user experiences worldwide.
These clusters were completely isolated mirror images of each other, providing two key advantages. First, they offered tolerance to cluster failures: if one cluster fails, the business can run as usual, since the backup cluster holds a copy of all required data. Second, we could distribute incoming queries to different clusters, as depicted in Figure 1, below, thereby helping increase the volume of queries that can be processed simultaneously. With data stored in multiple isolated clusters, we investigated strategies to balance the query load.
One common strategy we found was relying on multiple, fully isolated clusters with a routing layer to enforce user segmentation at the cluster level. This came with the challenge of managing the database clusters themselves, along with the storage inefficiency of replicating each piece of data across every cluster. For example, if a dataset is replicated across six clusters, the total storage requirement is six times the size of the dataset.
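To make that overhead concrete, here is the arithmetic with hypothetical numbers (illustrative only, not Uber's actual figures):

```python
# Full replication: every cluster holds a complete copy of the data, so
# total storage grows linearly with the cluster count. Numbers are
# hypothetical, for illustration only.
dataset_pb = 2.0   # hypothetical dataset size, in petabytes
clusters = 6

total_storage_pb = dataset_pb * clusters
overhead_pb = total_storage_pb - dataset_pb  # storage bought purely for copies

print(total_storage_pb)  # 12.0
print(overhead_pb)       # 10.0
```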
Other challenges of replication also became apparent, such as the compute cost of writing data and of creating the projections and indexes associated with incremental data updates.
These challenges were further compounded by our rapid global growth and foray into new ventures, such as food delivery, freight, and bike share. Essentially, we would be paying hardware costs for increased storage without any gain in query volume.
If we chose to add more clusters, the resource wastage implicit in the replication process would mean that the actual query volume did not grow linearly. The sheer lack of efficiency in terms of capital allocation, as well as performance, meant that we needed to think outside of the box to find a solution that scales. Working closely with the Data Science team, we set out to increase query and data volume scalability for our fast analytic engines.
A natural strategy to overcome the storage challenge was to move from fully replicated databases to partially replicated databases. As shown in Figure 2, below, in comparison to a fully replicated database system, where all the data is copied to all isolated database clusters, a partially replicated database system segments data into different overlapping sets of data elements, with as many sets as there are clusters. Due to the large scale of the problem, involving thousands of queries and hundreds of tables, constructing these different overlapping sets of data elements is non-trivial.
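One way to see the shape of the problem: even a simple greedy heuristic has to balance per-cluster storage while honoring a replica count per table. The sketch below (hypothetical table names and sizes; not the algorithm Uber actually used) places a fixed number of replicas of each table on the currently least-loaded clusters:

```python
# Greedy sketch of building overlapping data segments: place `replicas`
# copies of each table, largest tables first, on the least-loaded
# clusters. Illustrative only, not Uber's optimization approach.
import heapq

def place_tables(table_sizes, num_clusters, replicas):
    # Min-heap of (current storage load, cluster id).
    loads = [(0.0, c) for c in range(num_clusters)]
    heapq.heapify(loads)
    assignment = {}
    for table, size in sorted(table_sizes.items(), key=lambda kv: -kv[1]):
        chosen = [heapq.heappop(loads) for _ in range(replicas)]
        assignment[table] = [c for _, c in chosen]
        for load, c in chosen:
            heapq.heappush(loads, (load + size, c))
    return assignment

tables = {"trips": 10.0, "riders": 4.0, "payments": 6.0}  # sizes in TB, hypothetical
print(place_tables(tables, num_clusters=3, replicas=2))
# {'trips': [0, 1], 'payments': [2, 0], 'riders': [2, 1]}
```

Each cluster ends up holding an overlapping subset of the tables rather than a full copy; the hard part in practice is doing this while also co-locating the tables each query needs.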
Further, partial replication strategies are often short-lived, as data elements grow at different rates and change as the business evolves. Apart from database availability and compute and storage scalability, we also had to consider the migration costs of partially replicating our databases. With this data infrastructure challenge in mind, our Data Warehouse and Data Science teams came up with three basic requirements for our optimal solution.
Our data science team formalized these requirements into a cost function.

The new framework simplifies distributed and scalable training for reinforcement learning agents. Computational costs are one of the main challenges in the adoption of machine learning models.
Some recent breakthrough models in areas such as deep reinforcement learning (DRL) have computational requirements that are prohibitive for most organizations, which has caused DRL to remain constrained to experiments in big AI research labs.
For DRL to achieve mainstream adoption, it has to be accompanied by efficient distributed computation methods that effectively address complex computation requirements. Distributed computing methods are required across many areas of the machine learning lifecycle, from training to simulations.
In supervised learning, we have already seen progress with distributed training frameworks like Horovod. However, DRL scenarios introduce their own set of challenges when it comes to distributed computing infrastructure. Intuitively, we tend to think that a framework for distributed training of supervised learning models should work for DRL methods.
However, the reality is a bit different. Given that DRL methods are often trained using a large variety of simulations, we need a distributed computing framework that adapts to those unique environments.
From that perspective, a distributed training method should be able to concurrently use a large amount of resources based on the specific requirements.
Additionally, DRL methods typically require different resources throughout their training lifecycle. These factors make the scaling of DRL training a unique challenge, not well suited to distributed training frameworks designed for supervised models. These are some of the challenges that Uber set out to address with its new open source framework, Fiber.
The framework provides users the ability to write applications for a large compute cluster with a standard and familiar library interface. From a design perspective, Fiber encapsulates some key capabilities that facilitate the distributed training of DRL models. To achieve these goals, Fiber provides an architecture that is divided into three layers: API, backend, and cluster.
The API layer provides basic building blocks for Fiber like processes, queues, pools, and managers. They have the same semantics as in multiprocessing, but are extended to work in distributed environments. The backend layer handles tasks like creating or terminating jobs on different cluster managers. Finally, the cluster layer consists of different cluster managers. Although they are not a part of Fiber itself, they help Fiber manage resources and keep track of different jobs, thereby reducing the number of items that Fiber needs to track.
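Because the API layer keeps multiprocessing semantics, code written against the standard library carries over almost unchanged. The sketch below uses Python's stdlib multiprocessing itself to show those shared semantics; `simulate_episode` is a hypothetical stand-in for a DRL rollout, and with Fiber the pool workers could run as jobs on a cluster manager rather than local processes:

```python
# Fiber's API mirrors Python's multiprocessing: processes, queues, pools,
# and managers keep the same semantics but can span a cluster. This sketch
# uses the stdlib module itself to show the shared pool semantics.
from multiprocessing import Pool

def simulate_episode(seed):
    # Toy deterministic "reward" so the example is self-contained.
    return (seed * 31) % 7

if __name__ == "__main__":
    with Pool(processes=4) as pool:
        # With Fiber, these workers could instead be scheduled on a cluster
        # manager such as Kubernetes, with little change beyond the import.
        rewards = pool.map(simulate_episode, range(8))
    print(rewards)  # [0, 3, 6, 2, 5, 1, 4, 0]
```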
Uber takes data-driven to the next level with the complexity of its systems and breadth of data, processing trillions of Kafka messages per day, storing hundreds of petabytes of data in HDFS across multiple data centers, and supporting millions of weekly analytical queries. Now, we complete over 15 million trips a day, with over 75 million monthly active riders.
In the last eight years, the company has grown from a small startup to 18,000 employees across the globe. With this growth comes an increased complexity of data systems and engineering architecture. For instance, tens of thousands of tables exist across the multiple analytics engines we use, including Hive, Presto, and Vertica.
This dispersion makes it imperative to have full visibility into what information is available, especially as we continue to add new line-of-business data and employees. Initially, Uber cataloged its tables with a set of static HTML files that were manually maintained. As the company grew, so did the number of tables and the amount of relevant metadata that we needed to update.
At this scale and pace of growth, a robust system for discovering all datasets and their relevant metadata is not just nice to have: it is absolutely integral to making data useful at Uber.
To make dataset discovery and exploration easier, we created Databook. Databook ensures that context about data (what it means, its quality, and more) is not lost among the thousands of people trying to analyze it. With Databook, we went from making manual updates to leveraging an advanced, automated metadata store that collects a wide variety of frequently refreshed metadata. Databook provides metadata derived from Hive, Vertica, MySQL, Postgres, Cassandra, and several other internal storage systems.
WhereHows also lacked the cross-data center read and write support that was critical to our performance needs. As such, we set out to build our own, in-house solution, written in Java to leverage its built-in functionality and mature ecosystem. When first designing Databook, one major decision we had to make was whether we would store the metadata we collect or fetch it as requested.
Our service needed to support high-throughput, low-latency reads; if we delegated this responsibility to the metadata sources, every source would have to support high-throughput, low-latency reads as well, which would introduce complexity and risk. For example, a Vertica query that fetches a table schema typically takes a few seconds to process, making it ill-suited for visualizations. Since Databook supports so many different metadata sources, we decided to store the metadata in the Databook architecture itself.
In addition, while most use cases require fresh metadata, they do not need to see metadata changes in real time, making periodic crawling possible. We also separated the request serving layer from the data collection layer so that each runs in a separate process, as depicted in Figure 3, below.
This isolates both layers, thereby reducing collateral impact. For example, data collection crawling jobs may use significant system resources, which could impact the SLA of APIs on the request serving layer.
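The store-then-serve split can be sketched in a few lines. All class and function names below are hypothetical illustrations, not Databook's actual code: a crawler periodically copies metadata from slow sources into a local store, and the serving layer reads only from that store:

```python
# Sketch of the store-then-serve pattern: a crawler periodically copies
# metadata out of slow sources into a local store, and the serving layer
# reads only from that store, so read latency never depends on source
# speed. All names are hypothetical.
import time

class MetadataStore:
    def __init__(self):
        self._tables = {}

    def upsert(self, table, schema):
        self._tables[table] = {"schema": schema, "crawled_at": time.time()}

    def get(self, table):
        # Serving reads hit only this in-process store, never the source.
        return self._tables.get(table)

def crawl(store, fetch_metadata):
    """One crawl cycle: pull fresh metadata from a (possibly slow) source."""
    for table, schema in fetch_metadata():
        store.upsert(table, schema)

def fake_vertica_source():
    # Stand-in for a schema query against a real engine, which might take
    # seconds per table -- far too slow to serve interactive requests.
    return [("trips", ["trip_id", "city_id"]), ("riders", ["rider_id"])]

store = MetadataStore()
crawl(store, fake_vertica_source)    # run on a schedule in practice
print(store.get("trips")["schema"])  # ['trip_id', 'city_id']
```

Running the crawler and the serving reads in separate processes is what isolates a heavy crawl from API latency.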
Our next challenge was to determine how we could most effectively and performantly collect metadata from several different, disparate data sources.

Top data science teams around the world are doing incredible work on some of the most interesting datasets in the world. Google has more data on human interests than any 20th-century researcher, while Uber seamlessly coordinates the itinerary and pricing of more than 1 million trips every day.
With machine learning and artificial intelligence, top data science teams are changing the way we ingest and process data, and they are coming up with actionable insights that impact the lives of millions.
What if there were common patterns between the interviews top data science teams were giving that would let you master the data science interview process?
What if the specific differences between various teams and their interview practices could be enumerated, so that interviewing with a top data science team were more akin to a science than an art? At Springboard, we teach data science skills, and many of our students take our course because they are looking to start a data science career. This has led us to write a guide to data science jobs and a guide to data science interviews in order to help our students take the next step toward an ideal job in the field.
We sought to change that. We took it upon ourselves to source Glassdoor testimonials about the data science interview questions asked at a selection of companies whose data science teams are considered world-class.
We started this analysis because we wanted to understand how top data science teams interview and how you should prepare for that process. Above all else, we learned that the data science interview process is a complex beast that must be tackled with precise and practiced action. In the real data science questions that are offered by Glassdoor respondents, we found a treasure trove of data on what skills data science teams were testing.
Among the largest categories of questions we spotted were the following. Statistics and probability are often the meat of data science work. These questions are designed to test your thinking and how you reason with uncertainty, an essential skill for any data scientist to master.
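As a flavor of the reasoning involved, consider the classic birthday problem, a common probability warm-up (used here for illustration, not attributed to any specific company above): the chance that at least two of n people share a birthday can be computed exactly:

```python
# Birthday problem: P(at least one shared birthday among n people),
# assuming 365 equally likely birthdays and ignoring leap years.
from math import prod

def p_shared_birthday(n):
    # Complement: probability that all n birthdays are distinct.
    p_all_distinct = prod((365 - k) / 365 for k in range(n))
    return 1 - p_all_distinct

print(round(p_shared_birthday(23), 3))  # 0.507 -- the counterintuitive result
```

Interviewers tend to care less about the final number than about whether you reach for the complement rather than enumerating cases directly.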
If statistics and probability are the meat of data science work, you can consider programming questions the potatoes that must come with the main meal.
Data science requires dealing with data at scale, something that will require programming to automate the vast amount of work required. The third plank of data science is explaining your findings in a way that drives business action and outcomes.
These questions test your thinking about what might be causing the behaviors you observe. The fourth category of question asks about your fit with the role and the culture of the hiring organization.
Treat this like a behavioral interview, and be honest about your expectations. These were large companies that could afford to spend on top data science talent and had a large collection of data science interview reviews, which allowed us to explore their interview processes in depth. Of the selected processes, Google had the most difficult data science interview process on average, while JPMorgan had the least. At the other end, Yelp and JPMorgan had zero positive reviews, though it should be noted that this was over a limited sample of nine respondents between the two of them.
Most candidates were referred in by current employees or a recruiter. The interview process is rated as slightly above average at a difficulty rating of 3. The standard process was one phone screen, one take-home data challenge, one shared-screen SQL challenge, and then an on-site phase with multiple interviews with everybody on the team.
While the beginning phases of the interview process focused mostly on SQL, later parts focused heavily on machine learning and building an ads model (an obvious focus of Facebook). The interview process is rated as average in difficulty with a 3.
This was a standard process: a phone screen, a homework assignment timed to be done in two hours (split into SQL analysis and an open-ended problem with a sample dataset), and then an on-site interview series with a mix of technical and behavioral questions.
LinkedIn interviews are reviewed largely positively, with twice as many positive responses as negative ones.
Most candidates came in through online applications, so try your luck there! The interview process is rated as slightly below average in difficulty at a ranking of 2.
A LinkedIn recruiter described the process as being a phone screen with a recruiter, a second phone screen with a team lead, then a fly-in interview. A lot of candidates received a take-home data science assignment that took anywhere between three and four hours.
LinkedIn data science interview questions revolved around areas of interest for LinkedIn, such as predicting employee salaries or working on features that have already been built (ex: People You May Know). Knowing Python and machine learning is something the LinkedIn team values strongly, though that will be tested more at later stages.

Machine learning (ML) is widely used across the Uber platform to support intelligent decision making and forecasting for features such as ETA prediction and fraud detection.
For optimal results, we invest a lot of resources in developing accurate predictive ML models. Traditionally, when data scientists develop models, they evaluate each model candidate using summary scores like log loss, area under curve (AUC), and mean absolute error (MAE). Although these metrics offer insights into how a model is performing, they do not convey much information regarding why a model is not performing well, and from there, how to improve its performance.
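For reference, two of these summary scores are simple to compute from scratch; the toy labels and predictions below are illustrative only:

```python
# Log loss for binary labels with predicted probabilities, and MAE for
# regression predictions, computed from first principles on toy data.
from math import log

def log_loss(y_true, y_prob):
    return -sum(y * log(p) + (1 - y) * log(1 - p)
                for y, p in zip(y_true, y_prob)) / len(y_true)

def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

y = [1, 0, 1, 1]
p = [0.9, 0.2, 0.7, 0.6]
print(round(log_loss(y, p), 3))      # 0.299
print(mae([3.0, 5.0], [2.5, 5.5]))   # 0.5
```

Both are single numbers averaged over the whole dataset, which is exactly why they can hide where a model is failing.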
As such, model builders tend to rely on trial and error when determining how to improve their models. Taking advantage of visual analytics techniques, our Manifold tool allows ML practitioners to look beyond overall summary metrics to detect which subsets of data a model is inaccurately predicting. Manifold also explains the potential cause of poor model performance by surfacing the feature distribution difference between better- and worse-performing subsets of data.
Moreover, it can display how several candidate models have different prediction accuracies for each subset of data, providing justification for advanced treatments such as model ensembling. Given their complexity, ML models are intrinsically opaque.
ML visualization, an emerging domain, solves this problem.
Previous approaches to ML visualization generally included directly visualizing the internal structure or model parameters, a design constrained by the underlying algorithms and therefore not scalable to handle company-wide generic use cases.
To tackle this challenge at Uber, we built Manifold to serve the majority of ML models, starting with classification and regression models. Instead of inspecting models, we inspect individual data points, by: 1 identifying the data segments that make a model perform well or poorly, and how this data affects performance between models, and 2 assessing the aggregate feature characteristics of these data segments to identify the reasons for certain model behaviors.
This approach facilitates model-agnosticism, a particularly useful feature when it comes to identifying opportunities for model ensembling.
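The segment-level comparison that motivates ensembling can be sketched as follows: score two candidate models per data segment and look at where their errors diverge. Segment names, data, and models below are toy illustrations, not Manifold's implementation:

```python
# If model A wins on some segments and model B on others, an ensemble
# that routes by segment can beat either model alone.
def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# segment -> (ground truth, model A predictions, model B predictions)
segments = {
    "short_trips": ([4, 5, 6], [4.2, 5.1, 5.7], [6.0, 7.0, 8.0]),
    "long_trips":  ([30, 40], [25.0, 33.0], [29.5, 39.0]),
}

for name, (truth, pred_a, pred_b) in segments.items():
    delta = mae(truth, pred_a) - mae(truth, pred_b)
    winner = "A" if delta < 0 else "B"
    print(f"{name}: model {winner} is better (MAE delta {round(delta, 2)})")
```

On this toy data, model A wins on short trips while model B wins on long trips, which is exactly the situation where per-segment ensembling pays off.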
Building on the research prototype, we focused on surfacing important signals and patterns amid massive, high-dimensional ML datasets. The interface of Manifold is composed of two coordinated visualizations. Manifold helps users uncover areas for model improvement through three steps. Our goal with Manifold is to compare how different models perform on various data points (in other words, feature values). As a design alternative, a straightforward implementation of this visualization is depicted in Figure 4, below.
In Figure 4, each point in the plot represents the performance of model x on data point y.

This is a solid path for those of you who want to complete a Data Science course on your own time, for free, with courses from the best universities in the world. In our curriculum, we give preference to MOOC (Massive Open Online Course) style courses because these courses were created with our style of learning in mind. To officially register for this course you must create a profile in our web app.
Thanks for your understanding. The intention of this app is to offer our students a way to track their progress, and also the ability to show their progress through a public page for friends, family, employers, etc. Here are two interesting links that can make all the difference in your journey. The second link is a MOOC that will teach you learning techniques used by experts in art, music, literature, math, science, sports, and many other disciplines.
These are fundamental abilities for succeeding in our journey. After finishing the courses above, start your specializations on the topics that interest you most. You can view a list of available specializations here. This guide was developed to be consumed in a linear approach. What does this mean? That you should complete one course at a time. The courses are already in the order that you should complete them.
Just start in the Linear Algebra section, and after finishing the first course, start the next one. You must focus on your habit, and forget about goals. Here at OSS University, you do not need to take exams, because we are focused on real projects! In order to show everyone that you successfully finished a course, you should create a real project. After finishing a course, you should think about a real-world problem that you can solve using the knowledge acquired in the course. The projects of all students will be listed in this file.
You need to keep in mind that what you are able to create with the concepts you learned will be your certificate, and that is what really matters! In order to show that you really learned those things, you need to be creative!
We love cooperative work! Use our channels to communicate with other fellows, to combine efforts and create new projects! The important thing for each course is to internalize the core concepts and to be able to use them with whatever tool (programming language) you wish.
You must share only files that you are allowed to! Do NOT disrespect the code of conduct that you signed at the beginning of some courses. Watch this repository for future improvements and general information.
The only things that you need to know are how to use Git and GitHub. Here are some resources to learn about them. Note: Just pick one of the courses below to learn the basics.
You will learn a lot more once you get started! You can open an issue and give us your suggestions as to how we can improve this guide, or what we can do to improve the learning experience. You can also fork this project and send a pull request to fix any mistakes that you have found. TODO: If you want to suggest a new resource, send a pull request adding it to the extras section. The extras section is a place where all of us will be able to submit interesting additional articles, books, courses, and specializations, keeping our curriculum as immutable and concise as possible.
You can also interact through GitHub issues. We also have a chat room!