SQL in the API??? #
I know what you are thinking: Exposing an API that accepts SQL is crazy. It's a terrible idea.
Especially if the API is exposed on the internet. Doing that is insecure and will lead to SQL injection attacks, it is a nightmare to maintain and it will lock the backend implementation into a specific technology (some ANSI SQL database).
But is that really true? Time to re-evaluate!
There is always an exception #
Engineering is about making trade-offs. There is no one fits it all solution. Therefore, here I'll present the kind of scenario in which SQL is in fact the best choice to make a backend API work out.
But let's take a step back first. Many APIs can be simple. Often, a CRUD API that allows for fetching, creating, updating and deleting entities by some ID is all that is needed to get the job done.
No arguing with that. And there is no reason to over-engineer things if this is really the case.
On the other hand, there is a considerable number of applications where the business will want to expand its features in a way that forces the API to become more and more complex over time.
Even just a basic web-shop can be a sufficient example for that kind of scenario. Let's say there is this web-shop which sells bicycles. Having a classical CRUD API already gets us very far in the sense that we can add, delete and update items (bicycles). We can also list them and show them to a user. But what about allowing the user to filter the various bicycles by their properties? Like color, size, price, etc.
Sounds simple at first. Just add some filters-field to the request. In the HTTP / REST world it might like look this:
GET /bicycles?color=red&size=large&price=1000-2000
Problem solved.
But faster than you can imagine, the next requirement comes in: OR
filters!
Maybe they should be combined with AND
filters so that we can express constraints like bicycles that are (made of steel AND weigh from 10 to 20kg) OR (made of carbon AND weigh from 5 to 10kg).
Dealing with unknown values (typically represented by null
in databases) and other subtleties will quickly creep into the codebase as well.
And before long there is some complex, custom-made spaghetti-ball of filter logic, even for a simple shop website. Sounds familiar? If not, lucky you. :-)
Usually, at some point, developers start to structure those filters and start to use e.g. json for more complicated requests. It might look like this:
{
"filters": {
"or": [
{
"and": [
{
"property": "material",
"operator": "equals",
"value": "steel"
},
{
"property": "weight",
"operator": "between",
"value": [10, 20]
}
]
},
{
"and": [
{
"property": "material",
"operator": "equals",
"value": "carbon"
},
{
"property": "weight",
"operator": "between",
"value": [5, 10]
}
]
}
]
}
}
At the latest at that point your nose should notify you that this is a design smell.
Why? Because we are essentially inventing our own query language here. And inventing a custom language should be pretty much the last resort, I hope we agree on that.
SQL to the rescue? #
Since this article is about SQL in APIs, how does SQL help us here? We surely don't want to have the webshop-frontend send database queries to the backend to get the results.
Well, to be honest, that already has become a thing nowadays. Hasura and alternatives are doing pretty much that - they generate a fully fledged graphql API based on the SQL schema of the database.
But besides being pricey and requiring specific database technology and locking you into that technology and having limited customizability, there is also just a much more lightweight alternative: Exposing a subset of SQL in your API.
The beauty of doing so is threefold:
- SQL is well-known language with plenty of documentation to learn it as well as support for it in almost every programming language. No need to write your own SQL parser or anything, whereas with your "own" language you need to write one. (Actually: it doesn't even have to be SQL, as long as it is a well understood and standardized query language. But SQL is probably the best choice in most cases)
- We can expose subsets of SQL and add more and more functionality over time and when needed, to fulfill the business requirements - without breaking existing queries and tooling.
- We remove a whole layer of confusion because of conversions from and to SQL (if the backend uses an SQL database anyways). Also, SQL is compact and easy to read for everyone.
But isn't that unsafe? Don't we execute SQL provided by the frontend directly against the database? That sounds like a security nightmare. And indeed, people have tried that and it has gone horribly wrong. So don't do that!
Instead, for such a system to work properly, we have to do the same that we do for any other API:
validate the input and sanitize it before executing it. And by that I'm talking about a white-list approach.
We only accept SQL from the request that we specifically allow. Everything else is rejected.
Mind that we don't do that by validating the SQL string! We will first parse the SQL string into a data structure, then validate this structure and finally convert the structure back into SQL.
In other words: the SQL executed against the database might look a little different than the one we received from the API.
And this is good. It allows us to do modifications to the SQL structure if necessary (such as adding constraints and transforming identifiers like column names) and make our API even work with non-SQL databases!
Yes, you heard it right: using SQL in the API doesn't mean we also need to actually run it against an (ANSI-) SQL database. We could choose to generate an elasticsearch query instead, for example.
Okay, enough talk about concepts. How does this look in practice in terms of actual code?
I hope for you, dear reader, that you are having the luxury of working with a programming language that allows to model ADTs in a nice way and offers a builtin technique such as pattern matching for decomposition. If you don't... well, I suggest to learn a few new languages that do support those features, it's 2023 after all. (sorry to all Go developers out there)
I'm going to use Scala for the following code examples but you should be able to understand the code without knowing Scala.
Defining the data structure #
Let's stick to our bicycle example above. To keep it simple, we will only want to use SQL for the filtering part.
How would such an SQL filter look like? As a full SQL query it would look like that:
SELECT *
FROM bicycles
WHERE (material = 'steel' AND weight BETWEEN 10 AND 20)
OR (material = 'carbon' AND weight BETWEEN 5 AND 10)
What we care about is the where-expression only:
(material = 'steel' AND weight BETWEEN 10 AND 20) OR (material = 'carbon' AND weight BETWEEN 5 AND 10)
In our code we typically model such a structure with a recursive tree-structure:
sealed trait Expr
case class Column(name: String) extends Expr
case class And(left: Expr, right: Expr) extends Expr
case class Or(left: Expr, right: Expr) extends Expr
case class Between(expr: Expr, lower: IntegerValue, upper: IntegerValue) extends Expr
case class Parenthesis(expr: Expr) extends Expr
case class Equals(column: Column, value: Value) extends Expr
case class ValueExpr(value: Value) extends Expr
// ... more expressions here for other SQL features
sealed trait Value
case class StringValue(value: String) extends Value
case class IntegerValue(value: Int) extends Value
// ... more value types here for other types of values in SQL
That is all we need to describe the filter above - and other even more complicated filters as well.
Usually, we will not define the above data structure by ourselves. We should use a library that already gives us those definitions.
Processing the SQL #
Such a library will also help us to parse the SQL, so all we really have to do is this:
val testSqlStringFromAPI: String =
"(material = 'steel' AND weight BETWEEN 10 AND 20) OR (material = 'carbon' AND weight BETWEEN 5 AND 10)"
val filterExpression: Either[Error, Expr] =
sql_library.parse(testSqlStringFromAPI)
filterExpression match {
case Left(error) =>
println(s"The SQL was invalid!") // return error 400 to the frontend here
case Right(sqlExpression: Expr) =>
??? // This is where it gets interesting! Process the SQL here
}
The last thing we have to do now is to decide, which features of SQL we want to allow in our API and also what we want to generate based on the given SQL. In the easiest case, we actually are working with an SQL database. Then we can just generate SQL again. If we were to work against an elasticsearch instance, we would generate an ES query instead.
Let's assume for now, that we are not allowing or-clauses in our API yet.
And to keep the example small, we will just recreate the SQL as a string. In reality, we would of course build the SQL using the same library again.
In the following code we traverse the SQL structure and validate/transform it in the way we need:
val columns = List("material", "weight", "color") // Some columns we use/allow in the API
// Recursive function to traverse the structure
def processSqlExpr(expr: Expr): String = expr match {
case Column(name) =>
if(columns.contains(name))
name
else
throw new Exception(s"Column $name is unknown and not supported!")
case And(left, right) =>
s"(${processSqlExpr(left)} and ${processSqlExpr(right)})"
case Or(left, right) =>
throw new Exception("Or-clauses are not supported yet!")
case Between(expr, lower, upper) =>
s"${processSqlExpr(expr)} between ${processSqlValue(lower)} and ${processSqlValue(upper)}"
// the following one removes double parenthesis! :-)
case Parenthesis(Parenthesis(expr)) =>
s"(${processSqlExpr(expr)})"
case Parenthesis(expr) =>
s"(${processSqlExpr(expr)})"
case Equals(column, value) =>
s"${column.name} = ${processValue(value)}"
case ValueExpr(value) =>
processSqlValue(value)
}
def processSqlValue(value: Value) = value match {
case StringValue(value) =>
s"'$value'"
case IntegerValue(value) =>
value.toString
}
That is all we need to handle arbitrary complex filters. Let's try it out:
processSqlExpr("(material = 'steel' AND weight BETWEEN 10 AND 20) OR (material = 'carbon' AND weight BETWEEN 5 AND 10)")
// Error: Or-clauses are not supported yet!
processSqlExpr("((material = 'steel' AND weight BETWEEN 10 AND 20))")
// Output: (material = 'steel' and weight between 10 and 20)
// ^^^ Note how the spelling of the SQL is slightly different from the input and that we removed the double parenthesis.
If we later want to add support for OR-clauses, it's as simple as changing
throw new Exception("Or-clauses are not supported yet!")
into
s"(${processSqlExpr(left)} or ${processSqlExpr(right)})"
In fact, we now have the ability to expose pretty much every SQL feature to our API users with almost no effort.
Wrapping up #
Since SQL has more features and syntax than shown in the example, processSqlExpr()
will be a bit longer, since it has to handle more cases (or alternatively have a catch-all case that throws an error if an unsupported feature is used).
As you can hopefully see here, this approach is very different from any kind of ancient string-manipulation of SQL, which is despised for good reasons.
Since in reality we would not really create SQL strings but build the SQL using a library, there is also no way that we would accidentally create invalid SQL or SQL that is insecure and allowing for SQL injections (e.g. by forgetting to escape strings or removing comments from the SQL).
If we later want to enable more features of SQLs power, we can easily do that: and existing queries will keep working. And that also includes more complicated features like different kinds of pagination, limits, ordering, grouping, etc.
Also note that we can still change the database-schema without problems, as long as we keep the "column-names" that we expose in the API the same. We will just have to map them to the new schema during our processing.
Hold on! #
By now you should have a good mental model about how that approach can play out in technical terms. But it is important to understand that it is still a niche solution. It comes with some drawbacks that better be well understood!
Documentation and error-handling #
The first one is documentation. While SQL is widely understood in general, we commonly only need a subset of SQL in the API. Which subset exactly and the restrictions that apply need to be documented well, otherwise it won't be clear from the API endpoints how the API can be used.
This will be less of a problem if there is only one or a few selected consumers of the API, but it's different if it is a public API or even a free public API. For the latter types of API, this approach is probably not a great solution.
On top of that, if wrong/invalid SQL is provided, the error messages provided by the backend should be helpful.
Since the request will usually contain the SQL as a string, there is no additional structure, which means that there is little tooling support for the one who makes the request.
On the good side, reading and parsing SQL is usually easier than reading and parsing big json structures. And SQL can easily be copied into an sql editor to get formatting and syntax support if needed.
Performance considerations #
Parsing the SQL can be simple and if done right will probably have only a negligible impact on the performance of the API. But for cases where latency is critical or the mass of requests is so high that every last drop of performance matters, SQL should probably not be used.
Also, just like it is a problem for graphql APIs, the possibility of DDoS attacks needs to be considered. Since SQL (in the way implemented above) allows arbitrary nesting by default, without any countermeasures (such as limiting nesting levels or the number of conditions) it can easily be abused to cause high load on the server with just a small number of requests.
Vendor lock-in #
Even though SQL is a standard, if database specific features or functions can are exposed in the API, it will be harder to switch to a different database technology on the backend later.
However, this also applies to a non-SQL kind of API in the same way, so it's not strictly a drawback of SQL in the API. The difference is though that it might be tempting to allow more SQL features if the API already uses SQL and it's easy to add them.
Last words #
Usually, I would expect that most APIs don't require the flexibility that SQL offers. And using SQL has been despised by people for the longest time, probably for good reasons at the time.
On the other hand, not using a solution that is as powerful and flexible as SQL is a recipe to reinvent the wheel and create a lot of accidental complexity in the code base if the flexibility is required by the API users.
And with the techniques shown, it is safe and convenient to make use of its power.
I hope this article has convinced you not to despise such a solution by default, but weight he pros and cons carefully when you have a situation at hand where the API is considerable complex.