Speed up source file set operations.

Profiling the Fuchsia Zircon build's "gn gen" operation [1] shows
that a lot of time is spend copying source file sets during
a very large number of scope merges.

This CL provides a 30% speedup for the Zircon "gn gen", without
degrading the Chromium one (see measurements below). This is done
by introducing SourceFileSet, a dedicated data type with avoids
std::string allocations, and has much faster lookup and insert /
copy operations, compared with std::set<SourceFileSet>

More precisely, by:

  - Introducing a StringAtom data type, modelling a pointer to
    a globally unique constant string, with a heavily optimized
    lookup implementation. See comments in src/gn/string_atom.h
    and src/gn/string_atom.cc for usage and full details.

  - Modifying SourceFile to use a StringAtom, instead of an
    std::string to store the file path. This reduces the size
    and copy cost of each instance.

  - Replace std::set<SourceFile> with the new SourceFileSet data
    type (defined as a base::flat_set<SourceFile>, with fast
    pointer-based comparisons, which turns out to be significantly
    faster than other alternatives).

Note that this works well because source file paths are immutable
and shared among many many scopes / SourceFileSet instances.

Note that despite the fact that SourceFileSet is unordered for
performance reasons, this CL does not modify the outputs of
"gn gen", at least when used for a Chromium Linux build. See [3]
for the methodology used to verify that.

Finally note that the patch reduces memory usage by nearly 1 GiB (!!)
for the Zircon build [4] (but only 6 MiB for the Chromium one).

[1] Performed from a Fuchsia source workspace with:
    $ gn gen --root=zircon out/default.zircon

[2] Results comparing the performance of 'gn' before and
    after the patch. Both binaries were built with a recent
    clang toolchain with build/gen.py -use-icf -use-lto:

--------------------------------------------------------
// ZIRCON BUILD

[fuchsia] $ gn gen --root=zircon out/default.zircon

//// BEFORE (* = best of 5 runs)

Done. Made 51431 targets from 944 files in 12825ms *
Done. Made 51431 targets from 944 files in 13932ms
Done. Made 51431 targets from 944 files in 14886ms
Done. Made 51431 targets from 944 files in 14642ms
Done. Made 51431 targets from 944 files in 13548ms

//// AFTER (* = best of 5 runs)

Done. Made 51431 targets from 944 files in 10623ms
Done. Made 51431 targets from 944 files in 10185ms
Done. Made 51431 targets from 944 files in 10623ms
Done. Made 51431 targets from 944 files in 9759ms  *
Done. Made 51431 targets from 944 files in 10656ms

Overall improvement: 12825 - 9759 = 3066 ms (31% faster)

--------------------------------------------------------
// CHROMIUM BUILD

[chromium/src] $ gn gen out/default

//// BEFORE
Done. Made 12150 targets from 2121 files in 4319ms
Done. Made 12150 targets from 2121 files in 4545ms
Done. Made 12150 targets from 2121 files in 4346ms
Done. Made 12150 targets from 2121 files in 4298ms *
Done. Made 12150 targets from 2121 files in 4390ms

//// AFTER
Done. Made 12150 targets from 2121 files in 4274ms
Done. Made 12150 targets from 2121 files in 4223ms
Done. Made 12150 targets from 2121 files in 4098ms
Done. Made 12150 targets from 2121 files in 4419ms
Done. Made 12150 targets from 2121 files in 4096ms *

Overall improvement: 4298 - 4096 = 202 ms  (4.9% faster)

[3] To verify that "gn gen" outputs were not affected
by this CL, the following experiment was setup on a
Linux workstation.

  1) Create a fresh chromium source checkout on
     a btrfs subvolume directory (e.g. /work/chromium,
     where /work is a btrfs mount point).

  2) cd /work/chromium/src && touch out/args.gn
     # Create an empty args.gn there.

  3) cd /work/chromium/src && <original-gn> gen out
     # Generate out/ files using original gn program.

  4) cd /work &&
    sudo btrfs subvolume snapshot chromium chromium-original
    # Snapshot the generated state to /work/chromium-original

  4) cd /work/chromium/src && <new-gn> gen out
    # Regenerate out/ files using patched gn program.

  5) diff -burN /work/chromium-original/src/out /work/chromium/src/out
     # Compare the files under out/

  --- /work/chromium0/src/out/default/build.ninja	2020-01-31 13:24:32.643343218 +0100
  +++ /work/cr-xx0/src/out/default/build.ninja	2020-01-31 13:21:42.105117112 +0100
  @@ -1,7 +1,7 @@
   ninja_required_version = 1.7.2

   rule gn
  -  command = ../../../../../tmp/gn-sourcefileset --root=../.. -q gen .
  +  command = ../../../../../tmp/gn-master --root=../.. -q gen .
     description = Regenerating ninja files

   build build.ninja: gn

The result shows that the only difference is the name of "gn"
executable name captured in build.ninja, all other files being
identical.

[4] Peak resident set size (RSS) usage as reported with /usr/bin/time -v

// ZIRCON BUILD

BEFORE: Maximum resident set size (kbytes): 2793824
AFTER:  Maximum resident set size (kbytes): 1817880
DIFF:   2793824 - 1817880 = 975944 kiB = 953 MiB (!!)

// CHROMIUM BUILD

BEFORE: Maximum resident set size (kbytes): 544296
AFTER:  Maximum resident set size (kbytes): 538068
DIFF:   544296 - 538068 = 6228 kiB = 6.1 MiB

Change-Id: I4f11fcddeb7b84a6126b748d2b4d66afd1642545
Reviewed-on: https://gn-review.googlesource.com/c/gn/+/7280
Commit-Queue: Brett Wilson <brettw@chromium.org>
Reviewed-by: Brett Wilson <brettw@chromium.org>
Reviewed-by: Scott Graham <scottmg@chromium.org>
diff --git a/build/gen.py b/build/gen.py
index 592e65f..534cf5d 100755
--- a/build/gen.py
+++ b/build/gen.py
@@ -543,6 +543,7 @@
         'src/gn/source_dir.cc',
         'src/gn/source_file.cc',
         'src/gn/standard_out.cc',
+        'src/gn/string_atom.cc',
         'src/gn/string_utils.cc',
         'src/gn/substitution_list.cc',
         'src/gn/substitution_pattern.cc',
@@ -639,6 +640,7 @@
         'src/gn/setup_unittest.cc',
         'src/gn/source_dir_unittest.cc',
         'src/gn/source_file_unittest.cc',
+        'src/gn/string_atom_unittest.cc',
         'src/gn/string_utils_unittest.cc',
         'src/gn/substitution_pattern_unittest.cc',
         'src/gn/substitution_writer_unittest.cc',
diff --git a/src/gn/analyzer.cc b/src/gn/analyzer.cc
index a110771..18563b8 100644
--- a/src/gn/analyzer.cc
+++ b/src/gn/analyzer.cc
@@ -219,7 +219,7 @@
 Analyzer::Analyzer(const Builder& builder,
                    const SourceFile& build_config_file,
                    const SourceFile& dot_file,
-                   const std::set<SourceFile>& build_args_dependency_files)
+                   const SourceFileSet& build_args_dependency_files)
     : all_items_(builder.GetAllResolvedItems()),
       default_toolchain_(builder.loader()->GetDefaultToolchain()),
       build_config_file_(build_config_file),
diff --git a/src/gn/analyzer.h b/src/gn/analyzer.h
index cf9f409..2ffdb52 100644
--- a/src/gn/analyzer.h
+++ b/src/gn/analyzer.h
@@ -23,7 +23,7 @@
   Analyzer(const Builder& builder,
            const SourceFile& build_config_file,
            const SourceFile& dot_file,
-           const std::set<SourceFile>& build_args_dependency_files);
+           const SourceFileSet& build_args_dependency_files);
   ~Analyzer();
 
   // Figures out from a Buider and a JSON-formatted string containing lists
@@ -98,7 +98,7 @@
 
   const SourceFile build_config_file_;
   const SourceFile dot_file_;
-  const std::set<SourceFile> build_args_dependency_files_;
+  const SourceFileSet build_args_dependency_files_;
 };
 
 #endif  // TOOLS_GN_ANALYZER_H_
diff --git a/src/gn/args.h b/src/gn/args.h
index 5ba924b..992d48f 100644
--- a/src/gn/args.h
+++ b/src/gn/args.h
@@ -86,12 +86,12 @@
 
   // Returns the set of build files that may affect the build arguments, please
   // refer to Scope for how this is determined.
-  const std::set<SourceFile>& build_args_dependency_files() const {
+  const SourceFileSet& build_args_dependency_files() const {
     return build_args_dependency_files_;
   }
 
   void set_build_args_dependency_files(
-      const std::set<SourceFile>& build_args_dependency_files) {
+      const SourceFileSet& build_args_dependency_files) {
     build_args_dependency_files_ = build_args_dependency_files;
   }
 
@@ -139,7 +139,7 @@
   // we see an argument declaration.
   mutable ArgumentsPerToolchain toolchain_overrides_;
 
-  std::set<SourceFile> build_args_dependency_files_;
+  SourceFileSet build_args_dependency_files_;
 
   DISALLOW_ASSIGN(Args);
 };
diff --git a/src/gn/build_settings.h b/src/gn/build_settings.h
index 9708df3..0bb55e9 100644
--- a/src/gn/build_settings.h
+++ b/src/gn/build_settings.h
@@ -111,10 +111,10 @@
 
   // A list of files that can call exec_script(). If the returned pointer is
   // null, exec_script may be called from anywhere.
-  const std::set<SourceFile>* exec_script_whitelist() const {
+  const SourceFileSet* exec_script_whitelist() const {
     return exec_script_whitelist_.get();
   }
-  void set_exec_script_whitelist(std::unique_ptr<std::set<SourceFile>> list) {
+  void set_exec_script_whitelist(std::unique_ptr<SourceFileSet> list) {
     exec_script_whitelist_ = std::move(list);
   }
 
@@ -134,7 +134,7 @@
   ItemDefinedCallback item_defined_callback_;
   PrintCallback print_callback_;
 
-  std::unique_ptr<std::set<SourceFile>> exec_script_whitelist_;
+  std::unique_ptr<SourceFileSet> exec_script_whitelist_;
 
   DISALLOW_ASSIGN(BuildSettings);
 };
diff --git a/src/gn/config.cc b/src/gn/config.cc
index 2793de7..c0a5166 100644
--- a/src/gn/config.cc
+++ b/src/gn/config.cc
@@ -10,7 +10,7 @@
 
 Config::Config(const Settings* settings,
                const Label& label,
-               const std::set<SourceFile>& build_dependency_files)
+               const SourceFileSet& build_dependency_files)
     : Item(settings, label, build_dependency_files) {}
 
 Config::~Config() = default;
diff --git a/src/gn/config.h b/src/gn/config.h
index 8dcedef..cb208d1 100644
--- a/src/gn/config.h
+++ b/src/gn/config.h
@@ -5,8 +5,6 @@
 #ifndef TOOLS_GN_CONFIG_H_
 #define TOOLS_GN_CONFIG_H_
 
-#include <set>
-
 #include "base/logging.h"
 #include "base/macros.h"
 #include "gn/config_values.h"
@@ -27,7 +25,7 @@
   // to Scope for how this is determined.
   Config(const Settings* settings,
          const Label& label,
-         const std::set<SourceFile>& build_dependency_files = {});
+         const SourceFileSet& build_dependency_files = {});
   ~Config() override;
 
   // Item implementation.
diff --git a/src/gn/function_exec_script.cc b/src/gn/function_exec_script.cc
index aec8409..20bd815 100644
--- a/src/gn/function_exec_script.cc
+++ b/src/gn/function_exec_script.cc
@@ -27,8 +27,7 @@
 bool CheckExecScriptPermissions(const BuildSettings* build_settings,
                                 const FunctionCallNode* function,
                                 Err* err) {
-  const std::set<SourceFile>* whitelist =
-      build_settings->exec_script_whitelist();
+  const SourceFileSet* whitelist = build_settings->exec_script_whitelist();
   if (!whitelist)
     return true;  // No whitelist specified, don't check.
 
diff --git a/src/gn/item.cc b/src/gn/item.cc
index 308d1e2..3a3a166 100644
--- a/src/gn/item.cc
+++ b/src/gn/item.cc
@@ -9,7 +9,7 @@
 
 Item::Item(const Settings* settings,
            const Label& label,
-           const std::set<SourceFile>& build_dependency_files)
+           const SourceFileSet& build_dependency_files)
     : settings_(settings),
       label_(label),
       build_dependency_files_(build_dependency_files),
diff --git a/src/gn/item.h b/src/gn/item.h
index d53d9ed..2942a66 100644
--- a/src/gn/item.h
+++ b/src/gn/item.h
@@ -26,7 +26,7 @@
  public:
   Item(const Settings* settings,
        const Label& label,
-       const std::set<SourceFile>& build_dependency_files = {});
+       const SourceFileSet& build_dependency_files = {});
   virtual ~Item();
 
   const Settings* settings() const { return settings_; }
@@ -57,13 +57,11 @@
 
   // Returns the set of build files that may affect this item, please refer to
   // Scope for how this is determined.
-  const std::set<SourceFile>& build_dependency_files() const {
+  const SourceFileSet& build_dependency_files() const {
     return build_dependency_files_;
   }
 
-  std::set<SourceFile>& build_dependency_files() {
-    return build_dependency_files_;
-  }
+  SourceFileSet& build_dependency_files() { return build_dependency_files_; }
 
   // Called when this item is resolved, meaning it and all of its dependents
   // have no unresolved deps. Returns true on success. Sets the error and
@@ -73,7 +71,7 @@
  private:
   const Settings* settings_;
   Label label_;
-  std::set<SourceFile> build_dependency_files_;
+  SourceFileSet build_dependency_files_;
   const ParseNode* defined_from_;
 
   Visibility visibility_;
diff --git a/src/gn/ninja_create_bundle_target_writer.cc b/src/gn/ninja_create_bundle_target_writer.cc
index 6220790..efb896a 100644
--- a/src/gn/ninja_create_bundle_target_writer.cc
+++ b/src/gn/ninja_create_bundle_target_writer.cc
@@ -236,7 +236,7 @@
   out_ << ": " << GetNinjaRulePrefixForToolchain(settings_)
        << GeneralTool::kGeneralToolCompileXCAssets;
 
-  std::set<SourceFile> asset_catalog_bundles;
+  SourceFileSet asset_catalog_bundles;
   for (const auto& source : target_->bundle_data().assets_catalog_sources()) {
     out_ << " ";
     path_output_.WriteFile(out_, source);
diff --git a/src/gn/scope.h b/src/gn/scope.h
index edfc87f..c5dcd38 100644
--- a/src/gn/scope.h
+++ b/src/gn/scope.h
@@ -17,12 +17,12 @@
 #include "gn/err.h"
 #include "gn/pattern.h"
 #include "gn/source_dir.h"
+#include "gn/source_file.h"
 #include "gn/value.h"
 
 class Item;
 class ParseNode;
 class Settings;
-class SourceFile;
 class Template;
 
 // Scope for the script execution.
@@ -289,7 +289,7 @@
   // set is constructed conservatively, meanining that every file that can
   // potentially affect this scope is included, but not necessarily every change
   // to these files will affect this scope.
-  const std::set<SourceFile>& build_dependency_files() const {
+  const SourceFileSet& build_dependency_files() const {
     return build_dependency_files_;
   }
   void AddBuildDependencyFile(const SourceFile& build_dependency_file);
@@ -388,7 +388,7 @@
 
   SourceDir source_dir_;
 
-  std::set<SourceFile> build_dependency_files_;
+  SourceFileSet build_dependency_files_;
 
   DISALLOW_COPY_AND_ASSIGN(Scope);
 };
diff --git a/src/gn/setup.cc b/src/gn/setup.cc
index 57ef5ee..5d7bc8a 100644
--- a/src/gn/setup.cc
+++ b/src/gn/setup.cc
@@ -846,8 +846,8 @@
       err.PrintToStdout();
       return false;
     }
-    std::unique_ptr<std::set<SourceFile>> whitelist =
-        std::make_unique<std::set<SourceFile>>();
+    std::unique_ptr<SourceFileSet> whitelist =
+        std::make_unique<SourceFileSet>();
     for (const auto& item : exec_script_whitelist_value->list_value()) {
       if (!item.VerifyTypeIs(Value::STRING, &err)) {
         err.PrintToStdout();
diff --git a/src/gn/source_file.cc b/src/gn/source_file.cc
index acb1c48..1dacbfa 100644
--- a/src/gn/source_file.cc
+++ b/src/gn/source_file.cc
@@ -51,47 +51,52 @@
   return SourceFile::SOURCE_UNKNOWN;
 }
 
-}  // namespace
-
-SourceFile::SourceFile(const std::string& value) : value_(value) {
-  DCHECK(!value_.empty());
-  AssertValueSourceFileString(value_);
-  NormalizePath(&value_);
-  type_ = GetSourceFileType(value_);
+std::string Normalized(std::string value) {
+  DCHECK(!value.empty());
+  AssertValueSourceFileString(value);
+  NormalizePath(&value);
+  return value;
 }
 
-SourceFile::SourceFile(std::string&& value) : value_(std::move(value)) {
-  DCHECK(!value_.empty());
-  AssertValueSourceFileString(value_);
-  NormalizePath(&value_);
-  type_ = GetSourceFileType(value_);
+}  // namespace
+
+SourceFile::SourceFile(const std::string& value)
+    : SourceFile(StringAtom(Normalized(value))) {}
+
+SourceFile::SourceFile(std::string&& value)
+    : SourceFile(StringAtom(Normalized(std::move(value)))) {}
+
+SourceFile::SourceFile(StringAtom value) : value_(value) {
+  type_ = GetSourceFileType(value_.str());
 }
 
 std::string SourceFile::GetName() const {
   if (is_null())
     return std::string();
 
-  DCHECK(value_.find('/') != std::string::npos);
-  size_t last_slash = value_.rfind('/');
-  return std::string(&value_[last_slash + 1], value_.size() - last_slash - 1);
+  const std::string& value = value_.str();
+  DCHECK(value.find('/') != std::string::npos);
+  size_t last_slash = value.rfind('/');
+  return std::string(&value[last_slash + 1], value.size() - last_slash - 1);
 }
 
 SourceDir SourceFile::GetDir() const {
   if (is_null())
     return SourceDir();
 
-  DCHECK(value_.find('/') != std::string::npos);
-  size_t last_slash = value_.rfind('/');
-  return SourceDir(value_.substr(0, last_slash + 1));
+  const std::string& value = value_.str();
+  DCHECK(value.find('/') != std::string::npos);
+  size_t last_slash = value.rfind('/');
+  return SourceDir(value.substr(0, last_slash + 1));
 }
 
 base::FilePath SourceFile::Resolve(const base::FilePath& source_root) const {
-  return ResolvePath(value_, true, source_root);
+  return ResolvePath(value_.str(), true, source_root);
 }
 
 void SourceFile::SetValue(const std::string& value) {
-  value_ = value;
-  type_ = GetSourceFileType(value_);
+  value_ = StringAtom(value);
+  type_ = GetSourceFileType(value);
 }
 
 SourceFileTypeSet::SourceFileTypeSet() : empty_(true) {
diff --git a/src/gn/source_file.h b/src/gn/source_file.h
index 392cdd4..d7ed06e 100644
--- a/src/gn/source_file.h
+++ b/src/gn/source_file.h
@@ -11,9 +11,12 @@
 #include <string>
 #include <string_view>
 
+#include "base/containers/flat_set.h"
 #include "base/files/file_path.h"
 #include "base/logging.h"
 
+#include "gn/string_atom.h"
+
 class SourceDir;
 
 // Represents a file within the source tree. Always begins in a slash, never
@@ -47,11 +50,12 @@
   // Takes a known absolute source file. Always begins in a slash.
   explicit SourceFile(const std::string& value);
   explicit SourceFile(std::string&& value);
+  explicit SourceFile(StringAtom value);
 
   ~SourceFile() = default;
 
   bool is_null() const { return value_.empty(); }
-  const std::string& value() const { return value_; }
+  const std::string& value() const { return value_.str(); }
   Type type() const { return type_; }
 
   // Returns everything after the last slash.
@@ -65,7 +69,7 @@
   // Returns true if this file starts with a "//" which indicates a path
   // from the source root.
   bool is_source_absolute() const {
-    return value_.size() >= 2 && value_[0] == '/' && value_[1] == '/';
+    return value().size() >= 2 && value()[0] == '/' && value()[1] == '/';
   }
 
   // Returns true if this file starts with a single slash which indicates a
@@ -80,7 +84,7 @@
   // return value points into our buffer.
   std::string_view SourceAbsoluteWithOneSlash() const {
     CHECK(is_source_absolute());
-    return std::string_view(&value_[1], value_.size() - 1);
+    return std::string_view(&value()[1], value().size() - 1);
   }
 
   bool operator==(const SourceFile& other) const {
@@ -91,12 +95,29 @@
     return value_ < other.value_;
   }
 
+  struct PtrCompare {
+    bool operator()(const SourceFile& a, const SourceFile& b) const noexcept {
+      return StringAtom::PtrCompare()(a.value_, b.value_);
+    }
+  };
+  struct PtrHash {
+    size_t operator()(const SourceFile& s) const noexcept {
+      return StringAtom::PtrHash()(s.value_);
+    }
+  };
+
+  struct PtrEqual {
+    bool operator()(const SourceFile& a, const SourceFile& b) const noexcept {
+      return StringAtom::PtrEqual()(a.value_, b.value_);
+    }
+  };
+
  private:
   friend class SourceDir;
 
   void SetValue(const std::string& value);
 
-  std::string value_;
+  StringAtom value_;
   Type type_ = SOURCE_UNKNOWN;
 };
 
@@ -112,6 +133,12 @@
 
 }  // namespace std
 
+// Represents a set of source files.
+// NOTE: In practice, this is much faster than using an std::set<> or
+// std::unordered_set<> container. E.g. for the Fuchsia Zircon build, the
+// overall difference in "gn gen" time is about 10%.
+using SourceFileSet = base::flat_set<SourceFile, SourceFile::PtrCompare>;
+
 // Represents a set of tool types.
 class SourceFileTypeSet {
  public:
diff --git a/src/gn/string_atom.cc b/src/gn/string_atom.cc
new file mode 100644
index 0000000..c310ffd
--- /dev/null
+++ b/src/gn/string_atom.cc
@@ -0,0 +1,237 @@
+// Copyright (c) 2020 The Chromium Authors. All rights reserved.
+// Use of this source code is governed by a BSD-style license that can be
+// found in the LICENSE file.
+
+#include "gn/string_atom.h"
+
+#include <array>
+#include <memory>
+#include <mutex>
+#include <set>
+#include <string>
+#include <unordered_set>
+#include <vector>
+
+#include "base/containers/flat_set.h"
+
+namespace {
+
+// Implementation note:
+//
+// StringAtomSet implements the global shared state, which is:
+//
+//    - a group of std::string instances with a persistent address, allocated
+//      through a fast slab allocator.
+//
+//    - a set of string pointers, corresponding to the known strings in the
+//      group.
+//
+//    - a mutex to ensure correct thread-safety.
+//
+//    - a find() method that takes an std::string_view argument, and uses it
+//      to find a matching entry in the string tree. If none is available,
+//      a new std::string is allocated and its address inserted into the tree
+//      before being returned.
+//
+// Because the mutex is a large bottleneck, each thread implements
+// its own local string pointer cache, and will only call StringAtomSet::find()
+// in case of a lookup miss. This is critical for good performance.
+//
+
+static const std::string kEmptyString;
+
+using KeyType = const std::string*;
+
+// This is a trivial hash table of string pointers, using open addressing.
+// It is faster in practice than using a standard container or even a
+// base::flat_set<>.
+//
+// Usage is the following:
+//
+//     1) Compute string hash value.
+//
+//     2) Call Lookup() with the hash value and the string_view key,
+//        this always returns a mutable Node* pointer, say |node|.
+//
+//     3) If |node->key| is not nullptr, this is the key to use.
+//        Otherwise, allocate a new string with persistent address,
+//        and call Insert(), passing the |node|, |hash| and new string
+//        address as arguments.
+//
+struct KeySet {
+  struct Node {
+    size_t hash = 0;
+    KeyType key = nullptr;
+  };
+
+  // Compute hash for |str|. Replace with faster hash function if available.
+  static size_t Hash(std::string_view str) {
+    return std::hash<std::string_view>()(str);
+  }
+
+  // Lookup for |str| with specific |hash| value.
+  // Return a Node pointer. If the key was found, |node.key| is its value.
+  // Otherwise, the caller should create a new key value, then call Insert()
+  // below.
+  //
+  // NOTE: Even though this method is const, because it doesn't modify the
+  //       state of the KeySet, it returns a *mutable* node pointer, to be
+  //       passed to Insert() in case of a miss.
+  //
+  Node* Lookup(size_t hash, std::string_view str) const {
+    size_t index = hash & (size_ - 1);
+    const Node* nodes = &buckets_[0];
+    const Node* nodes_limit = nodes + size_;
+    const Node* node = nodes + index;
+    for (;;) {
+      if (!node->key || (node->hash == hash && *node->key == str))
+        return const_cast<Node*>(node);
+      if (++node == nodes_limit)
+        node = nodes;
+    }
+  }
+
+  // Insert a new key in this set. |node| must be a value returned by
+  // a previous Lookup() call. |hash| is the hash value for |key|.
+  void Insert(Node* node, size_t hash, KeyType key) {
+    node->hash = hash;
+    node->key = key;
+    count_ += 1;
+    if (count_ * 4 >= size_ * 3)  // 75% max load factor
+      GrowBuckets();
+  }
+
+  void GrowBuckets() {
+    size_t size = buckets_.size();
+    size_t new_size = size * 2;
+    size_t new_mask = new_size - 1;
+
+    std::vector<Node> new_buckets;
+    new_buckets.resize(new_size);
+    for (const Node& node : buckets_) {
+      size_t index = node.hash & new_mask;
+      for (;;) {
+        Node& node2 = new_buckets[index];
+        if (!node2.key) {
+          node2 = node;
+          break;
+        }
+        index = (index + 1) & new_mask;
+      }
+    }
+    buckets_ = std::move(new_buckets);
+    size_ = new_size;
+  }
+
+  size_t size_ = 2;
+  size_t count_ = 0;
+  std::vector<Node> buckets_ = {Node{}, Node{}};
+};
+
+class StringAtomSet {
+ public:
+  StringAtomSet() {
+    // Ensure kEmptyString is in our set while not being allocated
+    // from a slab. The end result is that find("") should always
+    // return this address.
+    //
+    // This allows the StringAtom() default initializer to use the same
+    // address directly, avoiding a table lookup.
+    //
+    size_t hash = set_.Hash("");
+    auto* node = set_.Lookup(hash, "");
+    set_.Insert(node, hash, &kEmptyString);
+  }
+
+  // Find the unique constant string pointer for |key|.
+  const std::string* find(std::string_view key) {
+    std::lock_guard<std::mutex> lock(mutex_);
+    size_t hash = set_.Hash(key);
+    auto* node = set_.Lookup(hash, key);
+    if (node->key)
+      return node->key;
+
+    // Allocate new string, insert its address in the set.
+    if (slab_index_ >= kStringsPerSlab) {
+      slabs_.push_back(new Slab());
+      slab_index_ = 0;
+    }
+    std::string* result = slabs_.back()->init(slab_index_++, key);
+    set_.Insert(node, hash, result);
+    return result;
+  }
+
+ private:
+  static constexpr unsigned int kStringsPerSlab = 128;
+
+  // Each slab is allocated independently, has a fixed address and stores
+  // kStringsPerSlab items of type StringStorage. The latter has the same
+  // size and alignment as std::string, but doesn't need default-initialization.
+  // This is used to slightly speed up Slab allocation and string
+  // initialization. The drawback is that on process exit, allocated strings
+  // are leaked (but GN already leaks several hundred MiBs of memory anyway).
+
+  // A C++ union that can store an std::string but without default
+  // initialization and destruction.
+  union StringStorage {
+    StringStorage() {}
+    ~StringStorage() {}
+    char dummy;
+    std::string str;
+  };
+
+  // A fixed array of StringStorage items. Can be allocated cheaply.
+  class Slab {
+   public:
+    // Init the n-th string in the slab with |str|.
+    // Return its location as well.
+    std::string* init(size_t index, const std::string_view& str) {
+      std::string* result = &items_[index].str;
+      new (result) std::string(str);
+      return result;
+    }
+
+   private:
+    StringStorage items_[kStringsPerSlab];
+  };
+
+  std::mutex mutex_;
+  KeySet set_;
+  std::vector<Slab*> slabs_;
+  unsigned int slab_index_ = kStringsPerSlab;
+};
+
+StringAtomSet& GetStringAtomSet() {
+  static StringAtomSet s_string_atom_set;
+  return s_string_atom_set;
+}
+
+// Each thread maintains its own ThreadLocalCache to perform fast lookups
+// without taking any mutex in most cases.
+class ThreadLocalCache {
+ public:
+  // Find the unique constant string pointer for |key| in this cache,
+  // and fallback to the global one in case of a miss.
+  KeyType find(std::string_view key) {
+    size_t hash = local_set_.Hash(key);
+    auto* node = local_set_.Lookup(hash, key);
+    if (node->key)
+      return node->key;
+
+    KeyType result = GetStringAtomSet().find(key);
+    local_set_.Insert(node, hash, result);
+    return result;
+  }
+
+ private:
+  KeySet local_set_;
+};
+
+thread_local ThreadLocalCache s_local_cache;
+
+}  // namespace
+
+StringAtom::StringAtom() : value_(kEmptyString) {}
+
+StringAtom::StringAtom(std::string_view str) noexcept
+    : value_(*s_local_cache.find(str)) {}
diff --git a/src/gn/string_atom.h b/src/gn/string_atom.h
new file mode 100644
index 0000000..fa92ac2
--- /dev/null
+++ b/src/gn/string_atom.h
@@ -0,0 +1,175 @@
+// Copyright (c) 2020 The Chromium Authors. All rights reserved.
+// Use of this source code is governed by a BSD-style license that can be
+// found in the LICENSE file.
+
+#ifndef TOOLS_GN_STRING_ATOM_H_
+#define TOOLS_GN_STRING_ATOM_H_
+
+#include <functional>
+#include <string>
+
+// A StringAtom models a pointer to a globally unique constant string.
+//
+// They are useful as key types for sets and map container types, especially
+// when a program uses multiple instances that tend to use the same strings
+// (as happen very frequently in GN).
+//
+// Note that default equality and comparison functions will compare the
+// string content, not the pointers, ensuring that the behaviour of
+// standard containers using StringAtom key types is the same as if
+// std::string was used.
+//
+// In addition, _ordered_ containers support heterogeneous lookups (i.e.
+// using an std::string_view, and by automatic conversion, a const char*
+// of const char[] literal) as a key type.
+//
+// Additionally, it is also possible to implement very fast _unordered_
+// containers by using the StringAtom::Fast{Hash,Equal,Compare} structs,
+// which will force containers to hash/compare pointer values instead,
+// for example:
+//
+//     // A fast unordered set of unique strings.
+//     //
+//     // Implementation uses a hash table so performance will be bounded
+//     // by the string hash function. Does not support heterogeneous lookups.
+//     //
+//     using FastStringSet = std::unordered_set<StringAtom,
+//                                              StringAtom::PtrHash,
+//                                              StringAtom::PtrEqual>;
+//
+//     // A fast unordered set of unique strings.
+//     //
+//     // Implementation uses a balanced binary tree so performance will
+//     // be bounded by string comparisons. Does support heterogeneous lookups,
+//     // but not that this does not extend to the [] operator, only to the
+//     // find() method.
+//     //
+//     using FastStringSet = std::set<StringAtom, StringAtom::PtrCompare>
+//
+//     // A fast unordered { string -> VALUE } map.
+//     //
+//     // Implementation uses a balanced binary tree. Supports heterogeneous
+//     // lookups.
+//     template <typename VALUE>
+//     using FastStringMap = std::map<StringAtom, VALUE, StringAtom::PtrCompare>
+//
+class StringAtom {
+ public:
+  // Default constructor. Value points to a globally unique empty string.
+  StringAtom();
+
+  // Destructor should do nothing at all.
+  ~StringAtom() = default;
+
+  // Non-explicit constructors.
+  StringAtom(std::string_view str) noexcept;
+
+  // Copy and move operations.
+  StringAtom(const StringAtom& other) noexcept : value_(other.value_) {}
+  StringAtom& operator=(const StringAtom& other) noexcept {
+    if (this != &other) {
+      this->~StringAtom();  // really a no-op
+      new (this) StringAtom(other);
+    }
+    return *this;
+  }
+
+  StringAtom(StringAtom&& other) noexcept : value_(other.value_) {}
+  StringAtom& operator=(const StringAtom&& other) noexcept {
+    if (this != &other) {
+      this->~StringAtom();  // really a no-op
+      new (this) StringAtom(std::move(other));
+    }
+    return *this;
+  }
+
+  bool empty() const { return value_.empty(); }
+
+  // Explicit conversions.
+  const std::string& str() const { return value_; }
+
+  // Implicit conversions.
+  operator std::string_view() const { return {value_}; }
+
+  // Returns true iff this is the same key.
+  // Note that the default comparison functions compare the value instead
+  // in order to use them in standard containers without surprises by
+  // default.
+  bool SameAs(const StringAtom& other) const {
+    return &value_ == &other.value_;
+  }
+
+  // Default comparison functions.
+  bool operator==(const StringAtom& other) const {
+    return value_ == other.value_;
+  }
+
+  bool operator!=(const StringAtom& other) const {
+    return value_ != other.value_;
+  }
+
+  bool operator<(const StringAtom& other) const {
+    return value_ < other.value_;
+  }
+
+  size_t hash() const { return std::hash<std::string>()(value_); }
+
+  // Use the following structs to implement containers that use StringAtom
+  // values as keys, but only compare/hash the pointer values for speed.
+  // E.g.:
+  //    using FastSet = std::unordered_set<StringAtom, PtrHash, PtrEqual>;
+  //    using FastMap = std::map<StringAtom, Value, PtrCompare>;
+  //
+  // IMPORTANT: Note that FastMap above is ordered based in the StringAtom
+  //            pointer value, not the string content.
+  //
+  struct PtrHash {
+    size_t operator()(const StringAtom& key) const noexcept {
+      return std::hash<const std::string*>()(&key.value_);
+    }
+  };
+
+  struct PtrEqual {
+    bool operator()(const StringAtom& a, const StringAtom& b) const noexcept {
+      return &a.value_ == &b.value_;
+    }
+  };
+
+  struct PtrCompare {
+    bool operator()(const StringAtom& a, const StringAtom& b) const noexcept {
+      return &a.value_ < &b.value_;
+    }
+  };
+
+ protected:
+  const std::string& value_;
+};
+
+namespace std {
+
+// Ensure default heterogeneous lookups with other types like std::string_view.
+template <>
+struct less<StringAtom> {
+  using is_transparent = int;
+
+  bool operator()(const StringAtom& a, const StringAtom& b) const noexcept {
+    return a.str() < b.str();
+  }
+  template <typename U>
+  bool operator()(const StringAtom& a, const U& b) const noexcept {
+    return a.str() < b;
+  }
+  template <typename U>
+  bool operator()(const U& a, const StringAtom& b) const noexcept {
+    return a < b.str();
+  };
+};
+
+template <>
+struct hash<StringAtom> {
+  size_t operator()(const StringAtom& key) const noexcept { return key.hash(); }
+};
+
+}  // namespace std
+
+#endif  // TOOLS_GN_STRING_ATOM_H_
diff --git a/src/gn/string_atom_unittest.cc b/src/gn/string_atom_unittest.cc
new file mode 100644
index 0000000..2dcc993
--- /dev/null
+++ b/src/gn/string_atom_unittest.cc
@@ -0,0 +1,124 @@
+// Copyright (c) 2020 The Chromium Authors. All rights reserved.
+// Use of this source code is governed by a BSD-style license that can be
+// found in the LICENSE file.
+
+#include "gn/string_atom.h"
+
+#include "util/test/test.h"
+
+#include <set>
+#include <string>
+#include <vector>
+
+TEST(StringAtomTest, EmptyString) {
+  StringAtom key1;
+  StringAtom key2("");
+
+  ASSERT_STREQ(key1.str().c_str(), "");
+  ASSERT_STREQ(key2.str().c_str(), "");
+  ASSERT_EQ(&key1.str(), &key2.str());
+}
+
+TEST(StringAtomTest, Find) {
+  StringAtom empty;
+  EXPECT_EQ(empty.str(), std::string());
+
+  StringAtom foo("foo");
+  EXPECT_EQ(foo.str(), std::string("foo"));
+
+  StringAtom foo2("foo");
+  EXPECT_EQ(&foo.str(), &foo2.str());
+}
+
+// Default compare should always be ordered.
+TEST(StringAtomTest, DefaultCompare) {
+  auto foo = StringAtom("foo");
+  auto bar = StringAtom("bar");
+  auto zoo = StringAtom("zoo");
+
+  EXPECT_TRUE(bar < foo);
+  EXPECT_TRUE(foo < zoo);
+  EXPECT_TRUE(bar < zoo);
+}
+
+TEST(StringAtomTest, NormalSet) {
+  std::set<StringAtom> set;
+  auto foo_ret = set.insert(std::string_view("foo"));
+  auto bar_ret = set.insert(std::string_view("bar"));
+  auto zoo_ret = set.insert(std::string_view("zoo"));
+
+  StringAtom foo_key("foo");
+  EXPECT_EQ(*foo_ret.first, foo_key);
+
+  auto foo_it = set.find(foo_key);
+  EXPECT_NE(foo_it, set.end());
+  EXPECT_EQ(*foo_it, foo_key);
+
+  EXPECT_EQ(set.find(std::string_view("bar")), bar_ret.first);
+  EXPECT_EQ(set.find(std::string_view("zoo")), zoo_ret.first);
+
+  // Normal sets are always ordered according to the key value.
+  auto it = set.begin();
+  EXPECT_EQ(it, bar_ret.first);
+  ++it;
+
+  EXPECT_EQ(it, foo_ret.first);
+  ++it;
+
+  EXPECT_EQ(it, zoo_ret.first);
+  ++it;
+
+  EXPECT_EQ(it, set.end());
+}
+
+TEST(StringAtomTest, FastSet) {
+  std::set<StringAtom, StringAtom::PtrCompare> set;
+
+  auto foo_ret = set.insert(std::string_view("foo"));
+  auto bar_ret = set.insert(std::string_view("bar"));
+  auto zoo_ret = set.insert(std::string_view("zoo"));
+
+  StringAtom foo_key("foo");
+  EXPECT_EQ(*foo_ret.first, foo_key);
+
+  auto foo_it = set.find(foo_key);
+  EXPECT_NE(foo_it, set.end());
+  EXPECT_EQ(*foo_it, foo_key);
+
+  EXPECT_EQ(set.find(std::string_view("bar")), bar_ret.first);
+  EXPECT_EQ(set.find(std::string_view("zoo")), zoo_ret.first);
+
+  // Fast sets are ordered according to the key pointer.
+  // Because of the underlying bump allocator, addresses
+  // for the first three inserts are in increasing order.
+  auto it = set.begin();
+  EXPECT_EQ(it, foo_ret.first);
+  ++it;
+
+  EXPECT_EQ(it, bar_ret.first);
+  ++it;
+
+  EXPECT_EQ(it, zoo_ret.first);
+  ++it;
+
+  EXPECT_EQ(it, set.end());
+}
+
+TEST(StringAtom, AllocMoreThanASingleSlabOfKeys) {
+  // Verify that allocating more than 128 string keys works properly.
+  const size_t kMaxCount = 16384;
+  std::vector<StringAtom> keys;
+
+  // Small lambda to create a string for the n-th key.
+  auto string_for = [](size_t index) -> std::string {
+    return std::to_string(index) + "_key";
+  };
+
+  for (size_t nn = 0; nn < kMaxCount; ++nn) {
+    keys.push_back(StringAtom(string_for(nn)));
+  }
+
+  for (size_t nn = 0; nn < kMaxCount; ++nn) {
+    ASSERT_EQ(keys[nn].str(), string_for(nn));
+  }
+}
diff --git a/src/gn/target.cc b/src/gn/target.cc
index 0b5402e..8721f1a 100644
--- a/src/gn/target.cc
+++ b/src/gn/target.cc
@@ -277,7 +277,7 @@
 
 Target::Target(const Settings* settings,
                const Label& label,
-               const std::set<SourceFile>& build_dependency_files)
+               const SourceFileSet& build_dependency_files)
     : Item(settings, label, build_dependency_files) {}
 
 Target::~Target() = default;
diff --git a/src/gn/target.h b/src/gn/target.h
index fbe0f71..3c7ebf9 100644
--- a/src/gn/target.h
+++ b/src/gn/target.h
@@ -64,7 +64,7 @@
   // to Scope for how this is determined.
   Target(const Settings* settings,
          const Label& label,
-         const std::set<SourceFile>& build_dependency_files = {});
+         const SourceFileSet& build_dependency_files = {});
   ~Target() override;
 
   // Returns a string naming the output type.
diff --git a/src/gn/toolchain.cc b/src/gn/toolchain.cc
index d66b489..bfad81d 100644
--- a/src/gn/toolchain.cc
+++ b/src/gn/toolchain.cc
@@ -14,7 +14,7 @@
 
 Toolchain::Toolchain(const Settings* settings,
                      const Label& label,
-                     const std::set<SourceFile>& build_dependency_files)
+                     const SourceFileSet& build_dependency_files)
     : Item(settings, label, build_dependency_files) {}
 
 Toolchain::~Toolchain() = default;
diff --git a/src/gn/toolchain.h b/src/gn/toolchain.h
index 6e450a1..eb5a60c 100644
--- a/src/gn/toolchain.h
+++ b/src/gn/toolchain.h
@@ -44,7 +44,7 @@
   // refer to Scope for how this is determined.
   Toolchain(const Settings* settings,
             const Label& label,
-            const std::set<SourceFile>& build_dependency_files = {});
+            const SourceFileSet& build_dependency_files = {});
   ~Toolchain() override;
 
   // Item overrides.