I’d like to start series of posts about problems we’re facing in e-commerce area and what algorithms we used to solve them. There are a lot of e-commerce sites on different levels of complexity in terms of feature richness, data amount, throughput requirements, consistency and realtime updates. Let me first introduce the situation we’re trying to handle and then move to the solution space.
Let’s look at example of some e-commerce site product listing:
I guess the e-commerce layout is pretty familiar to you - but to briefly summarize:
- every bigger e-commerce site has multi-level category tree (more about categories here)
- products are described with various parameters that allow filtering (more about parameters here)
- products have various tags - which is special form of parameter that is also visible on the product tile
- product listing has usually some form of simple sorting (more about sorting here)
- and finally there are product tiles representing filtered products - usually in the form of paginated list
On the first look implementation of such site may not seem so difficult, does it? But the devil is in the detail. Let’s go through some requirements that are usually expected by our clients.
Category tree represents nested structure and is usually bigger than what you see as a site visitor. Some parts of the category tree are usually hidden. Clients want to temporarily hide some categories until the proper moment arises, or there may be categories that are empty, or contain only products that you as a site visitor wouldn’t see. Either because products are marked entirely invisible, there is no valid price for you or the product hasn’t got a localized version.
Categories lists all products in the category and all of its sub-categories. This may get a little bit tricky because the category can have no products linked to it but its sub-categories can. This means that the category must be part of the visible tree and display products of its inner categories.
If owner of the site marks a category as “not visible to the user” - the category, all of its subcategories and all products in them must not be visible in top categories. There is also exception - single product may be linked to several categories of the tree - so if the product is linked both to the hidden part of the tree and another sub-category that is part of the visible category tree - the product should be displayed to the user of course.
The most operations executed in category tree are:
- give me all sub-categories of current category (except hidden)
- and give me all parents of current category.
The query is always linked with products in the category and subcategories that match additional criteria. You should never see an empty category or category with products you cannot buy as a visitor of the site.
As a developer you need to properly evaluate how should the category tree appear to you as unique customer. There are multiple approaches to map tree structure to record based database no matter whether we talk about relational or document based databases (graph databases are exception here). We chose modified MPTT approach that minimizes write operations and allows fast read lookup.
We used to precompute invisible parts of the tree by batch job, but this led to unnecessary system load and the computed data wasn’t taking user related constraints into the account. We switched to real-time computing now and things got a lot easier.
Parameters and facets
According some studies - properly implemented parametrized search (sometimes called as faceted search) can increase conversions of the e-commerce sites by 20%. Each product can have multiple parameters in the form of key-value map. Values might represent:
- discrete constants (for example color:black, size:XXL, OS:Android) visualized as checkboxes or selects
- or numeric values that are spread out in some range and can be visualized as a slider with low and high limit
Some parameters are visible only on product detail page, others should be used in filters on product listing pages. Parameter are organized and grouped to parameter groups by their similarity (color, size, operation system). When user enters the category he/she should see only parameters (parameter groups) that have sense in that category. In other words only parameter should be part of the filter only if there is a product in the category, that uses such parameter.
Better e-commerce sites shows the number of the products that have the parameter next to it - see example:
Yet better sites reflect currently selected filter in the filter itself. Additional parameters that would return no result if used are displayed as read-only. Parameter that would further expand count of the matching product if used display difference count with plus sign or the updated overall count next to them.
Parameters in the same group are usually linked with OR relation, parameters in different groups are linked by AND relation. Some parameters might have negative meaning - so that if user marks them, he expects that listing will contain only products that don’t have such parameter (as an example consider parameter allergen:gluten which will cause that all products containing gluten will be removed from the listing).
Interval based parameters are usually rendered as slider:
Slider has its min and max limits and the possible user selection. Limits should be recalculated when filter changes to represent only those parameter values that are really present in current product listing. Limits must be of course calculated for current query WITHOUT applying user selection on current parameter group (for example height or width).
The holy grail of parametrized/faceted filtering is to avoid situation when user creates a filter that produces no results. Exact numbers also allow user to better define his/her search and have result listing rich enough to be able to choose from and limited enough so that he/she is not forced to go through dozens of pages to find what he looks for.
In the naive approach the filter counts can be somehow precomputed (although this is not a feasible approach in my opinion), but when user starts narrowing the filter everything has to be computed in realtime. To compute correct numbers you have to apply current user filter to display the product result and also execute modified query for a possible result of each unused parameter. And we’re talking about dozens of such combinations that needs to be reevaluated each time.
Using cache is very problematic, because caching key would consist of category and exact parameter combination which would lead to permutation explosion and may soon run out of memory. Moreover cache would probably expire with next product added/removed from the category (or its parameter assignment is changed) - and that is pretty common.
I’m going to dedicate separate post for this problem area.
E-commerce sites handle different types of products. Beside standard ones there are usually products with variants - ie. multiple products that differs one from another only in certain parameters but generally represents the same item. Best example are T-shirts that is usually shipped in different colors, sizes or male/female variant. We call this product bundle as MASTER PRODUCT with VARIANT PRODUCTS. Master product cannot be bought - it just represents its variants. In product listing visitor see only master product with a price range representing prices of its variants or single price if all variants share the same price. In some implementations user can select proper variant even on product listing page, but usually this is done in the detail page:
Tricky part is that price / parameter filter should reflect data of each product variant but reflect only single master product in the counts. For example if the product has 3 size variants (S, M, XXL) but all in single color (blue), the product should be counted only once in parameter matching product counts. The master product should be part of the product listing if either of its variants matches the user price / parameter filter.
Different but somehow similar are “product sets”. Ie. bulk products that represents single item in the shopping basket, but consist of multiple parts. Example of such product set may be a cabinet that consist of frame, doors and handrails. User buys the product as a set of one frame, two doors and two handrails. The set may be fixed or variable where user can choose some parts from compatible selection. These types of products also requires different handling in product listings and filtering.
All product types might mix and match in single category.
Price policy varies - in case of business to customer e-commerce sites it is usually simple. Product has generally single selling price and so called “reference” or “common” price that is rendered as cross-linked price and refers to the price usually seen in other sites or stone shops. Sometimes there are time-limited action prices that should override the standard selling price.
In business to business e-commerce sites the situation is completely different. In this relationship the seller often offers benefits to clients with higher sales in the form of special discounts. Products have different price levels, temporary discounts on certain depots, product groups and so on. The price policy is usually very different for each e-commerce implementation and is highly tied to capabilities of ERP system of the company and the imagination of the businessmen. Price policy is usually something you have to adapt to and cannot change it when implementing e-commerce site for the company.
Single product can have multiple prices (dozens of them). Many of those may apply to you as a customer so the system needs to pick the correct price from all of them based on some strategy. Strategy of least price is used sometimes, sometimes prices of the most prioritized price list is used, sometimes the latest price wins, sometimes it might be some weird combination of those.
Prices might be valid for certain time interval and can be in different currencies.
We’ve implemented e-commerce site for the company that has a flowchart for determining the target customer price. So it could bring really hard times to implementation team.
And there is the price filter on the product listing page - do you remember?
You need to calculate proper price for all thousands of items in the category and filter according to user specified interval. This can be a serious problem for many engines - it involves grouping function which is rather costly.
Commonly used workaround for this problem is omitting the price filter and calculating prices only for the products visible on the single product listing page. But this lowers the comfort for the user and we want to avoid such compromises.
The icing on the cake is that you have to use different prices for different customers - B2B needs to see price without VAT, B2C customers price with VAT. Even some B2B customers with low turnover might purchase goods with VAT and you should be prepared to handle it properly in filters and shopping cart.
Your e-commerce site needs to reflect multi-language, multi-region requirements. Not only you need to be able to provide translation for labels and messages used by the application. You also need to provide a way to translate the data connected to the products, categories, brands and other entities used on e-commerce site. Some parameters/tags may apply only in certain regions and in other should be entirely invisible. Some products may be sold in certain regions - others not.
Administrators of the e-commerce site rarely wants to maintain the same product in multiple copies for different regions. The product should stay the same and only relevant parts of it get translated. This leads to additional requirements on data structures used for filtering and product lookup.
We live in the SEO world. Everything in e-commerce has to have simple, short and user / search-engine comprehensible URL. Single product/category/brand should have it’s own localized form for different languages / regions. Once URL is assigned it should never vanish - if product or category is renamed, new url reflecting current name should be generated, but the old one should remain active and redirect with HTTP 30X code to the current url of the item.
Each HTTP 404 response from the e-commerce site might represent lost opportunity and we want to avoid that at all costs.
The product listing is always sorted by certain property of the product. E-commerce sites often offer sort by max/min price, recommendation (or seller defined priority), most sold products, best rating or some specific other filtering. Sorting is usually done by single property which makes situation little bit simpler but not much.
Relational databases can take advantage of the index and need not to go through all records to return first page of the sorted records. But the query needs to match exactly the index. Unfortunately the query on e-commerce site is too complex to take advantage of this. Sorting usually means that all products needs to be collected and then sorted according the requested property.
This affects the performance of the site.
Data size varies from company to company. We target bigger companies with specific demands and the standard setup we’re optimizing and testing for is as follows:
- product count: 50 thousands
- category count: 1 thousand
- parameter count: 20 thousands
- product to parameter assignment count: 0.5 million
- product prices: 2 millions
- depots: 50
- product stock records: 2.5 millions
Consider that all of the data are part of the search query. If you try to join them in SQL relational query you quickly run to extreme number of possible combinations.
Updates and consistency
Administrators request that the modifications on the products, categories etc. happen instantly. They can accept slight delay but they won’t certainly wait for hours until the changes arrive on the production site. They also expect that if the change appears on one page (for example product detail) it will be also present in product listing and other places.
We - developers are used to compromise on consistency (many NoSQL databases are based on eventual consistency) but our customers are not and it’s usually hard to explain to them and vindicate such behaviour of the system. It can be justifiable on cluster solutions, but not in the case when e-commerce is operated by single node. Relying heavily on caches can easily lead to “broken consistency” state from the user’s perspective (remember that cache invalidation is one of the hard things in programming).
What we really aim for is to be able to compute everything in runtime on live data set avoiding the cache as far as possible.
The performance of the e-commerce site is one of the most important things. There are many articles analyzing how bad/good performance affects the conversion rates. Usually quoted sentence is that at Amazon 100 milliseconds of added latency had an effect of 1% reduction in sales.
So all of the requirements summarized in previous chapters should be taken into consideration but still we need to make really fast. Website performance is complex thing with multiple aspects - e-commerce sites needs to be fast on mobile phones which involves keeping size of the page low, keeping CSS and JS simple, but that’s not what I write about.
The response time of the server generating filtered and sorted results needs to be as low as possible.
Every customer works with ROI. E-commerce sector is characterized by high competition. Lot of sellers operate with really low margin and this is reflected on the supplier of the e-commerce solution.
Sometimes high-availability cluster solution is no-go for the customer. Sometimes you have to run on VPS server which CPU performance corresponds with 900Mhz processor and not 2-3Ghz speed you’re used to on your development computer.
I tried to describe some problems of the e-commerce domain that are not visible from the outside. Three years ago my knowledge about e-commerce was really limited and wasn’t able to imagine what does it mean to develop and run e-commerce sites.
Maybe you’re in my position few years ago and this article may open your eyes somehow. Maybe you’re developer implementing such e-commerce site and some of my information may bring you some ideas. Maybe you’re operating your own e-commerce shop and looking for help. In any case feel free to contact me for more information at [novotnaci(at)gmail.com].