Performance @Scale is an invite-only conference for engineers working on the technical and organizational challenges of high-performance applications and services.
For this all-day event, we've assembled a lineup of performance experts from around the industry (Netflix, Google, Microsoft, LinkedIn, Facebook) to talk about the biggest performance challenges they've faced and how they overcame them. Speakers will cover performance topics spanning web, server, networking, and mobile. Major themes will include effective tooling, regression prevention, automated triaging, and production telemetry.
If you have ever wanted to learn best practices from the pros on how to make your web server run faster, improve network efficiency, or speed up your mobile apps, then Performance @Scale is the place to be on Wednesday, February 24th! If you have friends or colleagues who may also be interested in attending, feel free to forward them this invitation.
Performance @Scale will be held on the Facebook campus in Menlo Park. Talks begin at 10 AM; stick around afterward for Happy Hour.
SESSION LIST
Google
Evolution of High Performance Networking in Chromium
Jim Roskind, Member of Technical Staff
Chromium has used extensive client-side instrumentation to drive design decisions, which have led to significant advances in both browser-wide technology and Internet protocol technology. This talk will describe the design and evolution of the statistics gathering in Chromium (since adopted by Firefox), and then give numerous examples of how the resulting data has been used to benefit network stack functionality and eventually to support a new protocol design (QUIC). The deep dive into the statistics gathering will emphasize the ease of use of the API, which simplified requests for information, combined with extremely efficient race-tolerant coding, used to gather data without impacting performance. The network advances that were facilitated include numerous speculative activities atop protocols new and old.
Microsoft
The Keys to Actionable Perf Investigations
Vance Morrison, Performance Architect
Almost all performance investigations are conceptually simple: you care about either making things fast or using less memory. All other metrics are just in support of these simple goals. In this talk, Vance Morrison will distill his most important lessons from over a decade of performance investigations and improvements while working on the .NET Runtime, and explain why the simple goals of making things fast and efficient can be so difficult to achieve. He will discuss which events can be collected and used to attack performance problems effectively. He will cover the concepts of thread time and causality tracking, an approach that allows scenarios with significant blocked time, concurrency, and asynchrony to be diagnosed. Finally, he will show how relatively novel ways of grouping data can be very helpful in finding performance issues.
Facebook
Automatic Regression Triaging at Facebook
Guilin Chen, Software Engineer
AutoTriage is a tool we have built at Facebook that automates root cause analysis in regression triaging. We use AutoTriage to understand performance regressions on Facebook's web tier. The web tier is where we house the business logic for privacy/permission checking and for rendering HTML or JSON to various clients. It is home to over 25 products, has more than 1000 developers contributing to it, is pushed 3 times a day, and has thousands of configuration changes dynamically altering it per day. AutoTriage allows the Site Efficiency team to scale its regression analysis to this fast-paced development culture.
Google
Increasing Ad Revenue by Improving Performance
Daniel Greenia, Senior Analyst
The same technologies that make Google’s Ad Network scalable for new customers also create opportunities for abuse in the system. Thousands of gray-area accounts are flagged every day for human review, leading to a classic resource constraint problem: Given more account reviews than available human reviewers, how do we prioritize targets for our limited resources? In this presentation, Daniel will demonstrate a method for identifying high-value work items through a combination of value forecasting and cost estimation. Applying this method has increased the efficiency of the ad review system and led to a threefold performance improvement in the group’s primary metric.
Facebook
Real-World Performance Data for Mobile
Michelle Filiba, Software Engineer
Facebook mobile performance impacts people who rely on the application every day to connect and share. With this in mind, the Mobile Speed team focuses on making important experiences in the app fast. We have many tools that simulate the speed of an experience and look for performance improvements when testing on local phones. However, there are only so many phones and environments we can simulate. This pushed us to start thinking of a way to collect performance data from real people's phones as they use our application. The system we built, Loom, dynamically collects trace data from their devices. Loom allows us to take a deep dive into the data and learn more about the root cause of performance issues.
LinkedIn
Visualizing and Optimizing Real User Performance on Mobile
Anant Rao, Engineering Manager
Browsers on the desktop offer several standard tools for visualizing page load performance. As more and more of our member traffic shifts to our native mobile apps, the lack of maturity in standard tools on mobile makes performance hotspot detection challenging. In our quest to solve this, we've taken inspiration from tools offered by browsers and built out their equivalents on mobile. This talk will focus on how we measure and visualize performance data and how this has helped us root-cause subpar performance and find optimization opportunities that directly impact our member experience.
Netflix
Linux 4.x Performance: Using BPF Superpowers
Brendan Gregg, Senior Performance Architect
Linux performance analysis has been the domain of ancient tools and metrics, but that's now changing in the Linux 4.x series. A new tracer is available in the mainline kernel, built from dynamic tracing (kprobes, uprobes) and enhanced BPF (Berkeley Packet Filter). It allows us to measure latency distributions for file system I/O and run queue latency, print details of storage device I/O and TCP retransmits, investigate blocked stack traces and memory leaks, and a whole lot more. These lead to performance wins large and small, especially when instrumenting areas that previously had zero visibility. This talk will summarize this new technology and some long-standing issues that it can solve, and how we intend to use it at Netflix.
Michelle Filiba is a software engineer on the Mobile Speed team at Facebook. She joined Facebook in 2011 and worked on infrastructure for Facebook Advertising. She then moved to working on systems to catch fraud in the Facebook Payments system. Now, she focuses on building infrastructure to track and diagnose performance in Facebook applications on real people's phones.
Daniel Greenia has over 7 years’ experience applying statistical thinking to solve operational quandaries in biotech, the IT industry, and most recently, in Google’s fraud prevention group. He earned a B.S. in Industrial Engineering/Operations Research at U.C. Berkeley and a Ph.D. from Stanford University’s Management Science and Engineering department.
Brendan Gregg is a senior performance architect at Netflix, where he does large-scale computer performance design, analysis, and tuning. He is the author of Systems Performance, published by Prentice Hall, and received the USENIX LISA Award for Outstanding Achievement in System Administration. He has previously worked as a performance and kernel engineer, and has created performance analysis tools included in multiple operating systems, as well as visualizations and methodologies.
Ben Maurer is the tech lead of the Web Foundation team at Facebook, which is responsible for the overall performance and reliability of Facebook's user-facing products. Ben joined Facebook in 2010 as a member of the infrastructure team. Before Facebook, he co-founded reCAPTCHA with Luis von Ahn. Recently, Ben worked with the U.S. Digital Service to improve the use of technology within the federal government.
Vance Morrison is the Performance Architect for the .NET Runtime at Microsoft. Since 2005, he has been called in to help diagnose the hardest performance investigations involving .NET code. Vance has been involved in designs of components of the .NET runtime since its inception. Early on he drove the design of the .NET Intermediate Language (IL). Later, for several years he was the development lead for the just-in-time compiler for the runtime. He continues to be involved in the latest evolution of the runtime, including a version for mobile devices, .NET Native, and the runtime for the ASP.NET 5 product. He has also developed a freely available tool called PerfView that allows a broad variety of detailed performance investigations to be done on Windows platforms.
Jim Roskind has been programming professionally for over 40 years. He is well known for his widely used open source Python Profiler and YACCable C++ Grammar, and for his work on the ANSI C++ committee. He co-founded InfoSeek, one of the first Internet search services. In 1995 he joined Netscape, where he designed signed Java, contributed to the SSL/TLS specification, and worked to "Free the source" of Mozilla as Netscape's VP Chief Scientist. He joined Google in 2008, where he worked extensively to optimize latency and performance of the network stack in Chrome. His innovations in Chromium eventually led to the architecture and design for QUIC, a new Internet protocol with the potential to replace TCP/TLS/SPDY. He holds four degrees from M.I.T.: SBEE, SBCS, SMEECS, and PhD EECS.
