One of roughly a dozen principal engineers at Amazon at that time
Customer Master Service: Architect and Project Lead for Amazon's first distributed service. Customer Master managed all of Amazon's customer information, including email and street addresses, credit card tokens, purchasing preferences, and login and authentication data.
Customer Master was a mission-critical service, highly available, fault-tolerant, and redundant, without which no customer could place an order in the United States, France, the United Kingdom, or Japan.
An object-oriented API presented customer objects to business-logic clients such as the shopping cart and ordering modules. Clients could perform complex interactions and even create new objects locally, then synchronize everything back to the Customer Master Service in a single, atomic save that used optimistic locking to preserve data integrity and high availability (sketched below).
The first use of globally unique identifiers at Amazon for client-side object creation.
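A minimal sketch of the pattern in Python, used purely for illustration (the class and function names here are hypothetical, not the original implementation). The client mints its own GUIDs for new objects, and a save succeeds only if the stored version still matches the version the client originally read:

    import uuid

    class StaleObjectError(Exception):
        """Another writer saved this object after we read it."""

    class CustomerObject:
        def __init__(self, data, guid=None, version=0):
            # A client-generated GUID lets new objects be created on the
            # client and synchronized later, with no server round trip.
            self.guid = guid or str(uuid.uuid4())
            self.version = version
            self.data = data

    def atomic_save(store, obj):
        # Optimistic locking: no locks are held while the client works;
        # the conflict check happens once, at save time. A real service
        # would run this compare-and-increment in a single transaction.
        current = store.get(obj.guid)
        if current is not None and current.version != obj.version:
            raise StaleObjectError(obj.guid)
        obj.version += 1
        store[obj.guid] = obj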
Session Directory: Designer and Project Lead for Amazon's first sharded database solution. Session Directory managed all of Amazon's shopping cart information, including grouping catalog items into orders.
This two-tier solution predated the Customer Master Service. It partitioned shopping carts and orders into multiple databases using a weighted random allocation (sketched below) that allowed our DBAs to add new partitions and redistribute the load for maintenance. Each horizontal partition (or shard) consisted of a replicated pair of identical databases, so a hot standby was always available.
Managed this priority-zero project (above all others), without which Amazon's infrastructure could not have scaled for Christmas of 2001. Went from concept to production in 90 days, and launched with no downtime by adding the new partitions with an initial weight of zero.
Everything worked. No existing shopping carts were lost, and the shopping cart databases scaled through the Christmas season.
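The allocation scheme, as a Python sketch (partition names and weights are invented for illustration). New carts are assigned by weighted random choice, so a partition added at weight zero receives no traffic until the DBAs raise its weight:

    import random

    # Hypothetical partition table: (partition_id, weight). Each entry
    # stands for a replicated pair of identical databases.
    PARTITIONS = [("cart-db-1", 40), ("cart-db-2", 40),
                  ("cart-db-3", 20), ("cart-db-4", 0)]  # new, weight zero

    def allocate_partition(partitions=PARTITIONS):
        """Choose a partition for a brand-new shopping cart. Existing
        carts keep the partition recorded for them, so raising a weight
        shifts only new load, which is what allowed partitions to be
        added with no downtime."""
        ids, weights = zip(*partitions)
        return random.choices(ids, weights=weights, k=1)[0]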
Customer Master Database: Project Lead for Amazon's first successful large-scale database refactoring project, taken from concept to production in four months.
Prior to this, customer data was scattered across six huge Oracle databases, intermingled and joined with shopping cart and order information, on the most powerful Unix hardware available at the time. These databases were already straining under the load of six-way multi-master replication, because every update resulted in six writes, and replication conflicts were common.
Goal: Untangle and relocate all customer data into a single database with Coordinated Universal Time (UTC) dates, and refactor over a million lines of code to access the new location.
Without this refactoring, Amazon's existing databases could not have scaled for Christmas of 2001.
Since the project had already been attempted and had failed once before I came to Amazon, resulting in a lengthy outage and negative publicity, Jeff Bezos gave two demands: don't make the cover of the Wall Street Journal, and take the site down for no more than one hour.
To prepare for this project, I contacted members of the failed project's team to assess what went wrong and to build fail-safes into our process so we would not repeat history. On the surface the objectives seemed impossible, because customer data was involved in almost every aspect of Amazon's systems, from shopping to fulfillment, but we achieved them by following a phased approach:
Phase 1 -- Refactor All SQL Statements. We rewrote every SQL statement to avoid joins between orders, shopping carts, and customers; those joins would no longer work once the tables were not locally available on the same database (see the sketch below).
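For illustration, here is the shape of that refactoring in Python with a generic DB-API connection (all table and column names are hypothetical): a cross-entity join becomes two single-database queries stitched together in application code.

    def order_with_customer(orders_db, customers_db, order_id):
        # Before: a single join, which only works while both tables
        # live in the same database:
        #   SELECT o.order_id, o.total, c.email
        #     FROM orders o JOIN customers c
        #       ON c.customer_id = o.customer_id
        #    WHERE o.order_id = :oid
        # After: one query per database, joined in the application.
        cur = orders_db.cursor()
        cur.execute("SELECT customer_id, total FROM orders"
                    " WHERE order_id = :oid", {"oid": order_id})
        customer_id, total = cur.fetchone()

        cur = customers_db.cursor()
        cur.execute("SELECT email FROM customers"
                    " WHERE customer_id = :cid", {"cid": customer_id})
        (email,) = cur.fetchone()
        return {"order_id": order_id, "total": total, "email": email}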
Phase 2 -- Virtual Customer Master Database. We created a virtual customer master database consisting of nothing but DbLinks to the real database tables. This let us test our new SQL and begin refactoring the source code against the final layout (sketched below).
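Conceptually, the virtual database was nothing more than Oracle database links with local names layered over them. A hypothetical sketch (link names, credentials, and tables are all invented), written as DDL issued from Python:

    # Hypothetical DDL run against the empty "virtual" customer
    # database. Each customer table appears locally as a synonym over a
    # database link back to the real table, so the new SQL can be
    # exercised before any data actually moves.
    VIRTUAL_MASTER_DDL = [
        "CREATE DATABASE LINK legacy1 CONNECT TO app"
        " IDENTIFIED BY app_password USING 'legacy1'",
        "CREATE SYNONYM customers FOR customers@legacy1",
        "CREATE SYNONYM addresses FOR addresses@legacy1",
    ]

    def build_virtual_master(conn):
        cur = conn.cursor()
        for stmt in VIRTUAL_MASTER_DDL:
            cur.execute(stmt)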
Phase 3 -- Refactor All Source Code. We used a call-graph analysis tool to find all paths to statements affecting customer data, and refactored the code to access customer data at its new virtual location.
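The analysis amounts to reverse reachability over the call graph. A small Python sketch under assumed data structures (the account above does not name the actual tool):

    from collections import deque

    def functions_to_refactor(reverse_calls, touches_customer_data):
        """reverse_calls maps each function to the set of its callers.
        Starting from every function whose SQL touches customer data,
        walk the callers transitively; everything reached is a code
        path that may need to use the new virtual location."""
        affected = set(touches_customer_data)
        queue = deque(touches_customer_data)
        while queue:
            fn = queue.popleft()
            for caller in reverse_calls.get(fn, ()):
                if caller not in affected:
                    affected.add(caller)
                    queue.append(caller)
        return affected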
Phase 4 -- Dual-Mode Access. In the staging environment we cut over completely to the new test database, while in production the new code paths toggled at run time to use the old locations. Both the source code and the databases were instrumented to report every incorrect access to the old locations.
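A sketch of the dual-mode idea in Python (the toggle mechanism and every name here are assumptions): one run-time switch selects the data location, and the legacy path reports itself whenever it is reached after the switch has been flipped.

    import logging

    log = logging.getLogger("customer-data-audit")
    USE_CUSTOMER_MASTER = False  # flipped at run time on launch night

    def customer_data_source(old_pool, new_pool):
        return new_pool if USE_CUSTOMER_MASTER else old_pool

    def report_legacy_access(table):
        # Instrumentation on the old code paths and databases: once the
        # switch points at Customer Master, any hit here is a code path
        # the refactoring missed.
        if USE_CUSTOMER_MASTER:
            log.warning("stale access to legacy customer table %s", table)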
Phase 5 -- Live Launch. On launch night at midnight, we took Amazon offline, let the six-way replication logs play out, initialized the new Customer Master Database, toggled our run-time switch to the new location, and brought the system back online within the hour. Every system worked.
Phase 6 -- Post Launch. One secret to our success was that we left DbLinks behind in the six old databases, pointing back to the new Customer Master Database. This fail-safe allowed any tools and utilities that might have been overlooked to keep working; we contacted their owners and gave them a short window to correct their code. Thirty days later, the DbLinks were removed and the new Customer Master Database project was complete.
We hit all our milestones and kept all our promises. The new customer database scaled through the Christmas season, and replication conflicts were completely eliminated.