male-1: Welcome back to Byte-Sized Breakthroughs, where we dive into the cutting edge of technology, breaking down complex research into digestible insights. Today, we're exploring the exciting world of personalized advertising, and how Meta is pushing the boundaries of user understanding. male-1: Our guest today is Dr. Paige Turner, a leading researcher in the field of user modeling and personalization. Paige, thanks for joining us. female-1: It's a pleasure to be here, Alex. Let's dive in. male-1: Paige, your paper focuses on 'Scaling User Modeling' for personalized advertising at Meta. Can you give us some background on the problem you're tackling? female-1: Sure. Imagine Meta's advertising system, which manages hundreds of models processing billions of user requests daily. We aim to personalize ads for each user, which requires understanding their preferences and interests. Traditional methods relied on manual feature engineering and simple architectures, but deep learning has revolutionized this. Now, we use sophisticated neural networks to learn intricate user representations, but there are challenges. male-1: That makes sense. What are these challenges? female-1: Well, in large-scale systems like Meta's, we face constraints on training throughput, serving latency, and memory. These limitations restrict the complexity of models and the amount of user data we can leverage. We also see sub-optimal representations when models learn independently, redundancy in user features across models, data scarcity for specialized models, and the need for extensive tailoring, which is not scalable. male-1: So, how does your research address these challenges? What's the core innovation here? female-1: We introduce Scaling User Modeling, or SUM. It's a framework designed to efficiently share online user representations across hundreds of ads models. SUM leverages a few dedicated upstream user models to synthesize user embeddings from massive amounts of user data, which are then used by downstream models. male-1: That's fascinating. Can you clarify what you mean by 'upstream' and 'downstream' models? female-1: Think of it like a pipeline. Upstream models are responsible for learning and generating user representations. These are sophisticated models trained on massive datasets and diverse supervisions, like clicks and conversions. They process a vast amount of user features to create compact user embeddings. These embeddings, which are essentially compressed summaries of user information, are then passed to the downstream models. male-1: So, the downstream models are the ones actually making the ad recommendations, and they utilize the embeddings from the upstream models to tailor ads to specific users? female-1: Exactly. This approach promotes efficient representation sharing and improves the overall quality of user representations across numerous models. It's like sharing the expertise of a few highly trained specialists across a vast workforce. male-1: That's an excellent analogy. Now, how does SUM actually work in practice? Can you walk us through the process? female-1: The core of SUM is its upstream user model, which uses the DLRM architecture. This model is divided into two main parts: the user tower and the mix tower. The user tower processes a wide range of user features, converting them into a refined set of SUM user embeddings. These embeddings are then passed to the mix tower, where they interact with other types of features, like those related to ads. male-1: Can you elaborate on how the user tower handles user features? female-1: The user tower is a complex beast. It has a pyramid-like structure with successive Interaction Modules, each designed to capture different aspects of user interactions with features. These modules employ various feature extractors like MLPs, dot compression with attention, Deep Cross Nets, and MLP-Mixers. They process dense and sparse features, compressing the information into a more compact form. male-1: That's a lot of different components! Can you clarify what 'sparse' and 'dense' features are for those listening at home? female-1: Sure. Dense features are continuous variables like a user's click frequency on a particular page. Sparse features are categorical variables, often with high cardinality, like user IDs and page IDs. Sparse features require large embedding lookup tables, which can be computationally intensive. SUM handles a significantly larger number of both dense and sparse features compared to standard models. male-1: So, the user tower effectively takes all this diverse information about a user and condenses it into a set of user embeddings. And then these embeddings are used by the mix tower. Now, you mentioned the mix tower doesn't contain any user-side features as input, other than these embeddings. Why is that? female-1: That's right. The mix tower focuses on refining user representations based on the embeddings provided by the user tower. We found that a more complex mix tower didn't significantly improve the quality of the embeddings or benefit downstream models. It was more important to invest in the complexity of the user tower. male-1: That's a very interesting insight. So, the focus is on creating a robust upstream model that generates high-quality user embeddings, which then inform the downstream models for personalized ad recommendations. Now, how does SUM handle the dynamic nature of user features? How do you ensure embedding freshness? female-1: This is where SUM Online Asynchronous Platform, or SOAP, comes into play. We designed SOAP to ensure latency-free, asynchronous serving of user embeddings. It's a crucial component for maintaining embedding freshness in a dynamic environment where user features are constantly changing. The system decouples the expensive write path from the read path. When a downstream model receives a user request, it immediately gets the previous embeddings from a feature store. At the same time, the SOAP compute center computes the current embedding using the latest model snapshot, and writes it to the feature store. This way, the downstream model receives the average pooled embedding without waiting for the compute center's inference, regardless of the model's complexity. This enables frequent model updates and online embedding inference upon each user request, significantly outperforming traditional offline-based solutions. male-1: That's very impressive. So, SOAP effectively removes the latency bottleneck, allowing for more frequent updates and more complex user models, while still ensuring embedding freshness. How does SUM address the issue of embedding distribution shift due to these frequent updates? female-1: Excellent question. The model keeps generating new snapshots, leading to shifts in the embedding distributions. For example, even for the same input features, the computed user embeddings might differ slightly between snapshots. To mitigate this, we use a technique called average pooling. We average the two most recent cached embeddings with the current computed embedding to get the final current embedding. This has been proven to be a cost-effective solution with good performance and increased feature coverage. It's also worth noting that while average pooling is effective, we are exploring other techniques like regularization and distillation for further improvement. male-1: That's fascinating. So, average pooling is a way to stabilize embeddings and reduce the impact of distribution shifts. Now, let's talk about your experimental results. You used industrial datasets from Meta's advertising system. What were the key metrics you used to evaluate SUM's performance? female-1: We used two main metrics. One is Normalized Entropy (NE), which measures how accurately a model predicts when users will click on ads. Lower NE is better. The other is Feature Importance Ranking (FI), which tells us how important a particular feature is to a model. We used these metrics to evaluate the performance of SUM embeddings in different downstream models. male-1: What were your key findings? female-1: SUM achieved significant online metric gains across various platforms like Instagram, Facebook Messenger, and Shopping Ads. In total, we observed a 2.67% online ads metric gain. While achieving these gains, SUM successfully avoided a 15.3% serving capacity increase compared to directly introducing the same complexity into each downstream model. Additionally, we saw statistically significant NE gains across diverse downstream tasks, including CTR prediction, conversion prediction, and offsite conversion prediction, showcasing the strong representative power and generalizability of SUM. male-1: Those are impressive results. It seems like SUM delivers significant benefits both in terms of performance and efficiency. But did you encounter any limitations or challenges? female-1: Yes, there are some limitations. One challenge we observed was the domain discrepancy between different platforms. For instance, SUM embeddings trained on Facebook mobile feed data didn't perform as well on Instagram models. This highlights the need for specialized SUM models tailored to specific platforms or domains to address this gap. Another limitation is that our study focused primarily on CTR prediction, and we're exploring how SUM can be applied to other tasks like conversion prediction and user engagement. male-1: That's an important point. Let's bring in Prof. Wyd Spectrum, an expert in the broader field of personalized advertising. Prof. Spectrum, what are your thoughts on the potential impact of SUM's innovations? female-2: This is a very exciting development in the field of personalized advertising. SUM addresses a crucial challenge in large-scale systems, where we often face trade-offs between performance, efficiency, and scalability. The framework's ability to leverage upstream models for user representation learning and share those representations with downstream models is a significant advancement. It's a clever solution to address these trade-offs and unlock the potential of more complex user models. The online asynchronous serving approach is also a valuable contribution, ensuring embedding freshness and reducing latency. However, I agree with Dr. Turner that exploring specialized models for different platforms or domains is crucial for further improvement. This is particularly important in the ever-evolving landscape of personalized advertising, where user behavior and data are constantly changing. male-1: Thank you, Prof. Spectrum. Those are insightful points. Paige, what are your thoughts on the future directions of your work? female-1: We're definitely looking at developing specialized SUM models for different platforms and domains to address the domain discrepancy issue. We're also exploring other techniques for mitigating embedding distribution shifts, like regularization and distillation. Additionally, we want to incorporate additional user behavior signals, such as temporal patterns, social interactions, and contextual information, into the SUM model. This could lead to more nuanced user representations and enhanced personalization. male-1: That's a great roadmap for the future. This research has the potential to revolutionize personalized advertising and have broader implications across various fields. For instance, it could be applied to personalize healthcare treatments, educational experiences, or even product recommendations in e-commerce. female-2: Absolutely. This is a powerful example of how advancements in machine learning and data science can be applied to solve real-world problems and improve our experiences. By better understanding our users, we can deliver more relevant and personalized information, leading to a more enriching and efficient digital landscape. male-1: Thanks to both of you for this insightful discussion. We've learned a lot about SUM, its innovations, and its potential impact. Listeners, be sure to check out the paper on the show notes for more details. And join us next time for another byte-sized breakthrough in the world of technology!